Slashdot Mirror


A Simple Grid Computing Synchronization Solution

atari_kid writes "NewScientist.com is running a article about a simple solution to the synchronization problems involved in distributed computing. Gyorgy Korniss and his colleagues at the Rensselaer Polytechnic Institute proposed that each computer in a grid synchronize by occasionally checking with a randomly chosen computer in the network instead of centralizing the grid by having a global supervisor."

5 of 55 comments (clear)

  1. Not as I read it... by Some+Bitch · · Score: 2, Interesting

    From the original New Scientist article...
    Each individual computer makes occasional checks with randomly-chosen others, to ensure it is properly synchronised.
    "This means each individual processor only has to communicate with a finite number of others," says Korniss.


    To me this would imply processors being 'grouped' with different groups checking with one another randomly. e.g. if you have 2 groups of 2 processors (to keep the example simple) group A gets it's timings by checking with a random processor in group B and vice versa. This way a whole group cannot go out of sync because it's timings are determined by a different group.

    Of course if I'm talking crap I'm absolutely certain someone will tell me.

  2. Re:But... by rusty0101 · · Score: 2, Interesting

    It would depend upon the inaccuracy of the randomness algorithm. If the algorithm is efficient, randomly selecting each of the computers before re-selecting any that have been hit before, then inaccuracies will go away.

    If there is a subset of computers that only consult each other, and never any of the other computers, and none of the other computers consult these, then there is a much greater probability of drift for that set of computers.

    Just my interpretation, I could be wrong.

    -Rusty

    --
    You never know...
  3. the details are scant, but this sounds like.... by smd4985 · · Score: 3, Interesting

    a greedy algorithm. at every iteration, do the best that you can and hope for the best. even if the solution/end-state is suboptimal, the huge resources needed for central coordination aren't needed.

    "Greedy algorithms work in phases. In each phase, a decision is made that appears to be good, without regard for future consequences. Generally, this means that some local optimum is chosen. This 'take what you can get now' strategy is the source of the name for this class of algorithms. When the algorithm terminates, we hope that the local optimum is equal to the global optimum. If this is the case, then the algorithm is correct; otherwise, the algorithm has produced a suboptimal solution. If the best answer is not required, then simple greedy algorithms are sometimes used to generate approximate answers, rather than using the more complicated algorithms generally required to generate an exact answer."
    http://www.cs.man.ac.uk/~graham/cs2022/g reedy/

    --
    smd4985
  4. Mitigating factors by CunningPike · · Score: 3, Interesting
    Its always dangerous to comment about something without the full information available. The NewScientist article is quite vague and the Science paper that the article is based on is currently unavailable on-line, but I'll risk it ;)

    The extent to which communication is a bottleneck in parallel processing depends strongly on the problem at hand and the algorithm used to tackle it. Some problems are amenable to batch processing (e.g. Seti@home), others require some level of boundary-synchonisation (simple fluid codes), others require synchronisation across all nodes (e.g. more complex plasma simulations)

    For batch processing tasks, there isn't an issue. For the other's the loose synchronisation may be acceptable depending on the knock-on effect. Loosening the synchronisation obviously decreases the network and infrastructural burden on the job allowing the algorithm to scale better, but the effect of this has to be carefully studied.

    This is important to the application developer, but is not particularly relevent to grids per-say. Grid activity, at the moment, is mainly towards developing code at a slightly lower level than application-dependant communication. It is already building up an infrastructure in which jobs can run which tries to remove any dependancy on a central machine. This is because having a central server is a design that doesn't scale well (and also introduces a single point-of-failure). The Globus toolkit provides a basic distributed environment for batch parallel processing, including a PKI-based Grid security system: GSI.

    On top of this, several projects are developing extra functionality. For example, the DataGrid project is adding may component, such as automatic target selection, fabrication management (site management, fault tolerance, ...), data management (replica selection, management and optimisation, grid-based RDBMS), network monitoring infrastructure and so on.

    The basic model is currently batch-processing, but this will be extended soon to include sub-jobs (both in parallel and with a dependency tree) and an abstract information communication system which could be used for intra-job communication (R-GMA).

    The applications will need to be coded carefully to fully exploit the grid, and reducing network overhead is an important part of this, but The Grid isn't quite at that stage, yet. But we're close to having the software needed for people to just submit jobs to the grid, without caring who provides the computing resource, or the geographical location they'll run.

    --
    | What, you were expecting
    -O_O- +---- something witty?
  5. Re:However by PetWolverine · · Score: 2, Interesting

    The other computer doesn't have to be correct--that's the beauty of it. Each computer isn't checking some particular other computer at regular intervals, they're choosing a different computer to check with each time.

    So let's say a computer named Louise is trying to stay in sync with the group as a whole. At some point it checks with another computer, Virginia, to see how far ahead it is (it's a very fast computer, it knows it'll be ahead). It finds that it's not very far ahead at all, so it corrects just a small amount. Next time it wants to check itself, it checks with Algernon. Algernon is a 7 year old Macintosh Performa. Louise finds itself to be way, way ahead, and holds itself back a lot.

    The point is that the average amount by which Louise finds itself to be ahead will depend directly on the average amount by which it's ahead, so while it'll always be a bit out of sync, it'll keep itself from ever getting too far off. It's a matter of statistics and keeping errors within acceptable ranges, rather than achieving perfection.

    --
    I found the meaning of life the other day, but I had write-only access.