Slashdot Mirror


A Simple Grid Computing Synchronization Solution

atari_kid writes "NewScientist.com is running a article about a simple solution to the synchronization problems involved in distributed computing. Gyorgy Korniss and his colleagues at the Rensselaer Polytechnic Institute proposed that each computer in a grid synchronize by occasionally checking with a randomly chosen computer in the network instead of centralizing the grid by having a global supervisor."

55 comments

  1. SImilar ideas to P2P by Anonymous Coward · · Score: 4, Insightful

    NewScientist also carried an article how randomly moving search agents can speed up P2P technologies, the current idea of : "Each individual computer makes occasional checks with randomly-chosen others, to ensure it is properly synchronised." is again very similar

    The gist is, use a mathematical ploy to ensure that the ammount by which the system can degrade over time is compoensated by the simplest system possible.

    This idea could perhaps be taken further...

  2. But... by mrtorrent · · Score: 3, Informative

    If I understand this correctly, wouldn't it contain the potential for the computers to become very desynchronized. What I mean is that, since each computer may become slightly off from all the others on its own, if each computer synchronizes to another random computer in the group, couldn't some of the computers become massively off?

    1. Re:But... by NortWind · · Score: 4, Insightful

      It is more like the way that an entire auditorium full of people can clap in unison without a leader.

      Each node just queries some other random node, and if it is behind that node, it advances a little, (say 10% of the difference,) and if it is ahead of the other node, it backs up a little. This way, by repeatedly seeing how the others are doing, each node tracks onto the average of the group. The goal isn't to be right, it is just to agree.

    2. Re:But... by rusty0101 · · Score: 2, Interesting

      It would depend upon the inaccuracy of the randomness algorithm. If the algorithm is efficient, randomly selecting each of the computers before re-selecting any that have been hit before, then inaccuracies will go away.

      If there is a subset of computers that only consult each other, and never any of the other computers, and none of the other computers consult these, then there is a much greater probability of drift for that set of computers.

      Just my interpretation, I could be wrong.

      -Rusty

      --
      You never know...
    3. Re:But... by GooberToo · · Score: 1

      As long as the algorithm truly allows for a random distribution, drift should be kept absolutely to a minimum.

      Law of averages is such a wonderful thing. ;)

    4. Re:But... by hatrisc · · Score: 1

      but, the primary node is still going to what everyone is syncronizing off of. so if node b say gets unsyncronized, and node d randomly syncs with node b, then node d is out of sync as well as node b. this brings a whole mess of out of sync nodes. then supposed node c comes in and syncs off the primary node. node c is in sync. now b randomly selects c, so it gets synced, but c stays synced, b is now synced, and eventually d will become synced since it won't randomly select itself. therefore, eventually being synced off of b/c/primary. hence synced

      --
      I write code.
    5. Re:But... by GlassHeart · · Score: 1
      It is more like the way that an entire auditorium full of people can clap in unison without a leader.

      Uh, no. People clapping in an auditorium can hear the combined audio output of everybody else clapping. I'm not just listening to one random person in the audience.

      Without thinking too hard about it, it seems that what's needed is not just a random (as in unpredictable) number, but a well distributed random number, so that you avoid the formation of subgroups that are just polling each other.

    6. Re:But... by Webmonger · · Score: 1

      That is a bad example. A large group of people never claps in unison. Some people are half a beat off or worse.

    7. Re:But... by ipjohnson · · Score: 1

      problem is you're betting on statistics so ... every once in a while it will break down and you'll get a lemming effect everyone jumps ahead. This is why everyonce in a while you need to synch up with an "offical time" this could be distibuted and done much less frequently. With proper safe gaurds you could provide a solution with no deffects.

    8. Re:But... by Xrikcus · · Score: 1

      In my experience people only claip in unison if there is a leader. The difference between this idea and clapping is that with clapping you always follow the same person, or few people, ie those near you. With this you randomly choose other people anywhere in the audience, which spreads out the comparisons properly.

  3. Re:News Flash! by Anonymous Coward · · Score: 0

    how is that modded up, yet a previous post identifying that this research of randomising and disassociating nodes of a P2P system is still a 0 modded post? /. who do you want to mod today?

  4. BUT??? BUT? by Anonymous Coward · · Score: 0, Troll

    1: You dont understand it correctly
    2: 'massively off' ?
    3: Oh you know everything don't you

    Every random synch would be a degree of hops from the master node, THE KEY WORD is SIMPLE, it isn't bloody rocket science, and you don't even get it.

    By randomly updating, the net effect is a smooth, disassociated update which uses a temporal variable to maintain a degree of acceptable degredation within the system to ensure that all nodes have enough updates.

    TADA BLUDDY DA.

    Fuck /.

    1. Re:BUT??? BUT? by Anonymous Coward · · Score: 0

      As we learnt today... rocket science is kinda hard :-(

  5. The computer equavalent of... by rmarll · · Score: 1, Funny

    Hey buddy, you got any change?

  6. Procrastinating wally by Anonymous Coward · · Score: 0

    I bet you can slice time really thin with the ammount you procrastinate.

    Or I bet your l33t 'knowledge' of noded based systems and dist. computing stems from your l33t pr0n finding abilities on K4z4a

  7. However by unterderbrucke · · Score: 1

    Who knows if the other computer is correct?

    The real answer is a smaller scale super computer controllig the distributed computing.

    1. Re:However by umofomia · · Score: 1
      The real answer is a smaller scale super computer controllig the distributed computing.
      The whole point of distributed computing is to get away from the reliance on central controllers. If the controller crashes, then the whole system dies, whereas in a distributed mechanism, several nodes may die and the entire system can keep running.

      In addition, there are massive problems with scaling if you have a central controller. You simply cannot add more nodes to the system past a certain point without heavily loading the controller.

    2. Re:However by PetWolverine · · Score: 2, Interesting

      The other computer doesn't have to be correct--that's the beauty of it. Each computer isn't checking some particular other computer at regular intervals, they're choosing a different computer to check with each time.

      So let's say a computer named Louise is trying to stay in sync with the group as a whole. At some point it checks with another computer, Virginia, to see how far ahead it is (it's a very fast computer, it knows it'll be ahead). It finds that it's not very far ahead at all, so it corrects just a small amount. Next time it wants to check itself, it checks with Algernon. Algernon is a 7 year old Macintosh Performa. Louise finds itself to be way, way ahead, and holds itself back a lot.

      The point is that the average amount by which Louise finds itself to be ahead will depend directly on the average amount by which it's ahead, so while it'll always be a bit out of sync, it'll keep itself from ever getting too far off. It's a matter of statistics and keeping errors within acceptable ranges, rather than achieving perfection.

      --
      I found the meaning of life the other day, but I had write-only access.
  8. Not as I read it... by Some+Bitch · · Score: 2, Interesting

    From the original New Scientist article...
    Each individual computer makes occasional checks with randomly-chosen others, to ensure it is properly synchronised.
    "This means each individual processor only has to communicate with a finite number of others," says Korniss.


    To me this would imply processors being 'grouped' with different groups checking with one another randomly. e.g. if you have 2 groups of 2 processors (to keep the example simple) group A gets it's timings by checking with a random processor in group B and vice versa. This way a whole group cannot go out of sync because it's timings are determined by a different group.

    Of course if I'm talking crap I'm absolutely certain someone will tell me.

  9. Who knows if by Anonymous Coward · · Score: 0

    Who knows if the 'small scale super computer' is correct?

    Hmmm? hmmmm? And what would it do actually? hmm?

    Oh perhaps the guys who modded you up understodd it, adn could tell me, since they thought it was an honourable mention.

    Procrastination, and modification, who do you want to /. today

    The whole fucking point of the article is how a simple fucking system statistically floats every node about a degradation factor, thus giving the system enough stability.

    But if your system is so good, please post a fucking URL to it bitch.

  10. Not withstanding by Anonymous Coward · · Score: 0

    As I read it, we cannot reply on such system that may break down under a DDoS attack. For global itterative distributed systems, the key fundemantal chaining laws indicate that unless the synchronisation is secure, then the whole system state is unreadable.

    Read Thomas Methodsons paper on distributed backends he developed similar to SETI systems, which run transient security functions on all nodes. These systems require extremely high system stability to function, and such systems would not cater for them convincingly.

    Definately a centralised super computer setup, which could predictively monitor the system remotely. I have seen such solutions in papers, nothing you can look into however.

  11. Re:News Flash! by hey · · Score: 1

    I suppose on difference between random-gridding and P2P is that all machines in the grid would know about all others - they just select random peers to sync with. In P2P a major problem is simply *finding* the other hosts.

    Also in a grid the machines are presumably more reliable - they'll stay up for a much longer time than typical P2P hosts.

    So I think this is a pretty cool - and novel - idea.
    Since it might help grids scale better and many grids are Linux is another step in the world domination campaign :-)

  12. the details are scant, but this sounds like.... by smd4985 · · Score: 3, Interesting

    a greedy algorithm. at every iteration, do the best that you can and hope for the best. even if the solution/end-state is suboptimal, the huge resources needed for central coordination aren't needed.

    "Greedy algorithms work in phases. In each phase, a decision is made that appears to be good, without regard for future consequences. Generally, this means that some local optimum is chosen. This 'take what you can get now' strategy is the source of the name for this class of algorithms. When the algorithm terminates, we hope that the local optimum is equal to the global optimum. If this is the case, then the algorithm is correct; otherwise, the algorithm has produced a suboptimal solution. If the best answer is not required, then simple greedy algorithms are sometimes used to generate approximate answers, rather than using the more complicated algorithms generally required to generate an exact answer."
    http://www.cs.man.ac.uk/~graham/cs2022/g reedy/

    --
    smd4985
    1. Re:the details are scant, but this sounds like.... by Sgs-Cruz · · Score: 2, Funny
      My mother always said, "be careful, son, it's an algorithm-eat-algorithm world out there... Each node will always try to take whatever it can get... you must be sure to always maintain a fair share for yourself of the grid computing resources."

      Yup, good old mom. How right she was. Nowadays, we've got travesties like the Selfish Gene kicking around the gay gene (the second one down.)

      :) -- cruz

      --

      Karma: pi (Mostly due to circular reasoning in posts).

  13. Gridshare? by cmburns69 · · Score: 1

    Gridster, Gridtella and Gridzaa...

    Starcraft RPG? only at

    --
    Online Starcraft RPG? At
    Dietary fiber is like asynchronous IO-- Non-blocking!
  14. Ah, and it all comes down to.. by fateswarm · · Score: 1

    And it all comes down to, the great idea of the international mesh of networked computers. The same philosophy that says tree stractures are a bottleneck to a stable network. And indeed are. A network that feeds from its members, not from the "great master-servers" can serve well on stability and performance. In fact, its stability is guaranteed in a way that is impossible to cut off the net unless the whole neibhorhood gets cut off.

    Lets keep this ideas since they give something new not only to technology but the whole pfilosophy the western civilization is based on.

    We are used to masters and slaves, what about all equal?

    All responsible.

    All powerfull.

  15. Re:TO2 by Jynxeh · · Score: 0, Offtopic

    I really hope not, because I live in Troy, and my dad works at RPI.

    I'd actually heard they were going to be attending Rocehster Institute of Technology, anyway, which is where my boyfriend will most likely be going next year.

    But hey, this is the internet, and random rumours abound. They could be not going at all, or going somewhere entirely different, and chances are, there'd be some rumour that they were going to just about any large university you care to think of.

  16. And this is somehow news? by 6hill · · Score: 2, Informative

    Why is this news?

    Distributed systems that do not rely on a centralised authority, be it for synchronising or resource distribution, are by far not a new thing. To name a random example (and you can find a dozen others with five minutes of Googling), the Prospero Resource Manager was a USC project started in the early 90s that relied on distributed authorities with no centralised command centre.

    Furthermore, if the computers are self-controlling and not guarded by anything besides their internal mechanisms that rely on the checks on other computers, the potential danger lies in a computer in the grid having a seriously fscked-up internal state. In other words, can a malfunctioning computer be trusted to monitor itself correctly? I think not.

  17. Re:TO2 by Anonymous Coward · · Score: 0

    ok, look at your sources next time...

    the article to which you refer "Olsen twins to attend Rensselaer" was a fake CNN site created by a script.

    this script was hosted by spo0fed.com

    it is now down (due to legal reasons) so i cannot show you, but it had a page to inptu all of your fake details and it would slap it into a copy of the CNN.com news page.

    among other things, it said that they got 1600s on SATs and were interested in nanotech

    if you actually looked at teh URL of the post it wasnt from CNN.com, but from someone's personal server...

  18. The Original Research Article by Anonymous Coward · · Score: 0

    See that reference at the bottom of the New Scientist article: "Journal reference: Science (vol 299, p 677)"?
    Science Magazine is where to go for the actual results as explained by the people who did the research (as opposed to results as filtered through the mind of a journalist on a deadline, without the time to really understand the research, and with a desire to make the research more "interesting" and more "accessible" to his readers, regardless of the impact that may have on accuracy).

    For those who have access to Science (your school may have a site subscription?) go here. For everyone else, the abstract is here courtesy of the National Institutes of Health.

    I'll come back and post the full text of the article later.

    -A.C.

  19. Internet Connections by EverStoned · · Score: 0

    Wouldn't this mean that all computers would have to be connected to the internet at all times to be part of the group? This might not be a problem for those with broad-band, but what about those that still use dial-up?

    1. Re:Internet Connections by Xrikcus · · Score: 1

      If they're not connected to anything then how are they going to synchronise however you approach that syncronisation?

      Clearly this idea will only work if it was possible to do it in some way originally, this being just a different way of doing it.

  20. Mitigating factors by CunningPike · · Score: 3, Interesting
    Its always dangerous to comment about something without the full information available. The NewScientist article is quite vague and the Science paper that the article is based on is currently unavailable on-line, but I'll risk it ;)

    The extent to which communication is a bottleneck in parallel processing depends strongly on the problem at hand and the algorithm used to tackle it. Some problems are amenable to batch processing (e.g. Seti@home), others require some level of boundary-synchonisation (simple fluid codes), others require synchronisation across all nodes (e.g. more complex plasma simulations)

    For batch processing tasks, there isn't an issue. For the other's the loose synchronisation may be acceptable depending on the knock-on effect. Loosening the synchronisation obviously decreases the network and infrastructural burden on the job allowing the algorithm to scale better, but the effect of this has to be carefully studied.

    This is important to the application developer, but is not particularly relevent to grids per-say. Grid activity, at the moment, is mainly towards developing code at a slightly lower level than application-dependant communication. It is already building up an infrastructure in which jobs can run which tries to remove any dependancy on a central machine. This is because having a central server is a design that doesn't scale well (and also introduces a single point-of-failure). The Globus toolkit provides a basic distributed environment for batch parallel processing, including a PKI-based Grid security system: GSI.

    On top of this, several projects are developing extra functionality. For example, the DataGrid project is adding may component, such as automatic target selection, fabrication management (site management, fault tolerance, ...), data management (replica selection, management and optimisation, grid-based RDBMS), network monitoring infrastructure and so on.

    The basic model is currently batch-processing, but this will be extended soon to include sub-jobs (both in parallel and with a dependency tree) and an abstract information communication system which could be used for intra-job communication (R-GMA).

    The applications will need to be coded carefully to fully exploit the grid, and reducing network overhead is an important part of this, but The Grid isn't quite at that stage, yet. But we're close to having the software needed for people to just submit jobs to the grid, without caring who provides the computing resource, or the geographical location they'll run.

    --
    | What, you were expecting
    -O_O- +---- something witty?
  21. NTP, anyone? by cperciva · · Score: 1

    Based on the (limited) details available, I'd say it sounds like they've just reinvented NTP -- except they've done it poorly, and without any security.

    1. Re:NTP, anyone? by Anonymous Coward · · Score: 0

      NTP is a hierarchy - tier 1, tier 2, etc.
      This don't sound like one.

    2. Re:NTP, anyone? by cperciva · · Score: 1

      (quoting an AC):
      NTP is a hierarchy - tier 1, tier 2, etc.
      This don't sound like one.


      NTP is organized as a hierarchy because there is a "true" time which it attempts to track. Take away the atomic clocks, and NTP would collapse into a hierarchy-less system which nevertheless converges to agreement about some sort of "time".

  22. The 80s retro work by Anonymous Coward · · Score: 1, Insightful

    In the early 80's I heard a talk at IBM's Almaden
    Research facility by a couple of the people involved in the ethernet development. They we synchronizing Xerox's phone/address list throughout the world by random contact and update. While they are certainly people with a hammer (random control) hitting anything looking vaugely like a nail, the experiment was a great success. They had a strong mathematical analysis developed in the medical community: communicable disease propogation. The system was far more reliable and lower cost (in communications) than any attempt to track the connections and run data propogation that "knows" an even slightly out-of-date view of available network connections.

    If you think random cannot work in practice, don't use ethernet. For that matter, don't use semiconductor technology at all.

  23. OpenMosix does this for load leveling already. by Anonymous Coward · · Score: 0

    It is the same general idea, using decentralized decision making to improve efficiency in a distributed system.

    Google highlighted

    http://openmosix.sourceforge.net

    Of course, the idea is implemented quite differently in the two systems.

    -Dave

  24. But... by thebigmacd · · Score: 1

    ...there is a leader. People clapping in unison only do so in the presence of a rythmic audio source, namely music.

    This is the equivalent of a central timing server. When people start to go by their neighbours (ie people looking around with excited looks on their faces ["aren't we cool, clapping to this music!"]) and stop listening to the source of music, one half of the auditorium gets waaay off beat.

    My apologies if you were referring to people beginning to clap after a performance, because you have a point there.

    My $0.02 CDN
  25. fireflys by Anonymous Coward · · Score: 0

    This is similar to the classic example used in physics classes of fireflys. Each makes a small movement to synchronise with some that are close. (in terms of flashing rythm). The result is a swarm that flashes in synchrony. It is normally used to illustrate concepts around forced oscillators - the key is to give the proper weighting to the change made to bring each unit into time with the local area.

  26. Read the actual article by myd · · Score: 2, Informative

    The New Scientist summary is lame. Pick up a copy of Science and read the actual article if you can. It says, "Here, we show a way to construct fully scalable parallel simulations for systems with asynchronous dynamics and short-range interactions." This method, while interesting, does not generalize to a wide range of applications. For example, you could not apply this approach to molecular dynamics simulations, which involve primarily long-range interactions between atoms. Still, the authors of this article are clearly pretty clever.

  27. flooding protocol by Anonymous Coward · · Score: 0

    I remember reading an article in DDJ about this, back in 1990 or so.

    They called it "flooding protocol" and it's old technology. Call your neighbors and simply issue
    IHAVE and IWANT commands back & forth.

    They were doing stuff like that back in BBS days over modem when long distance calls were expensive.

    It is good to see technology come full circle.

  28. More Details by Anonymous Coward · · Score: 0
    The article that is being referred to is about maintaining the global virtual time in Parallel Discrete Event Simulations (PDES). A pdf version of the article is publicly available on Gyorgy Korniss's Web Page, the full reference is:

    ``Suppressing Roughness of Virtual Times in Parallel Discrete-Event Simulations'' G. Korniss, M.A. Novotny, H. Guclu, Z. Toroczkai, and P.A. Rikvold, Science 299, 677 (2003).

    Now, let's ask why that is such a good contribution. In discrete event simulation, the system consists of a set of entities, each entity has a discrete state, but can undergo state changes at an arbitrary time (so states update asynchronously, unlike in time step driven models). In PDES, the entities are mapped to logical processors (LPs), and communicate via message passing using time stamped messages. It can be shown that the fidelity of the simulation is preserved if each LP processes its messages in ascending time stamp order, this is called the local causality constraint. There are two primary kinds of protocols:

    1. Conservative which always check that processing a message does not violate the local causality constraint, and
    2. Optimistic which employ speculative execution, and use reverse computation or checkpointing and rollback to do a fixup if they process events out of order.
      1. Both sets of protocols need for each LP to get an estimate of the simulation time (a hard problem). This appears to be Korniss et al's contribution.
  29. Last Post! by alpg · · Score: 0

    Dear Emily:
    I saw a long article that I wish to rebut carefully, what should
    I do?
    -- Angry

    Dear Angry:
    Include the entire text with your article, and include your comments
    between the lines. Be sure to post, and not mail, even though your article
    looks like a reply to the original. Everybody *loves* to read those long
    point-by-point debates, especially when they evolve into name-calling and
    lots of "Is too!" -- "Is not!" -- "Is too, twizot!" exchanges.
    -- Emily Postnews Answers Your Questions on Netiquette

    - this post brought to you by the Automated Last Post Generator...