Domain: globus.org
Stories and comments across the archive that link to globus.org.
Comments · 90
-
Mitigating factors
It's always dangerous to comment about something without the full information available. The NewScientist article is quite vague and the Science paper that the article is based on is currently unavailable on-line, but I'll risk it ;)
The extent to which communication is a bottleneck in parallel processing depends strongly on the problem at hand and the algorithm used to tackle it. Some problems are amenable to batch processing (e.g. Seti@home), others require some level of boundary synchronisation (simple fluid codes), and others require synchronisation across all nodes (e.g. more complex plasma simulations).
For batch processing tasks, there isn't an issue. For the others, looser synchronisation may be acceptable, depending on the knock-on effect. Loosening the synchronisation obviously decreases the network and infrastructure burden of the job, allowing the algorithm to scale better, but the effect of this has to be carefully studied.
This is important to the application developer, but is not particularly relevant to grids per se. Grid activity, at the moment, is mainly directed towards developing code at a slightly lower level than application-dependent communication. It is already building up an infrastructure in which jobs can run, one that tries to remove any dependency on a central machine. This is because having a central server is a design that doesn't scale well (and also introduces a single point of failure). The Globus toolkit provides a basic distributed environment for batch parallel processing, including a PKI-based Grid security system: GSI.
On top of this, several projects are developing extra functionality. For example, the DataGrid project is adding many components, such as automatic target selection, fabric management (site management, fault tolerance, ...), data management (replica selection, management and optimisation, grid-based RDBMS), network monitoring infrastructure and so on.
The basic model is currently batch processing, but this will be extended soon to include sub-jobs (both in parallel and with a dependency tree) and an abstract information communication system which could be used for intra-job communication (R-GMA).
The applications will need to be coded carefully to fully exploit the grid, and reducing network overhead is an important part of this, but The Grid isn't quite at that stage yet. But we're close to having the software needed for people to just submit jobs to the grid without caring who provides the computing resource or where, geographically, the jobs will run.
-
Re:What the hell?
Tell ya what... go install Globus and come back and say what you think of that.
People can do dumb stuff in any language, but sorry, it's not a Java problem.
-
Re:Opensource Grid Computing
Well, they have their own license.
It's also a complete bear to install.
-
Re:Physicists thinking about the Grid
Physicists are more than thinking about the Grid. I should know, as they're funding my PhD in Data Grid Computing 8*).
The main reason for this is the Large Hadron Collider, which is due to go into production at CERN in about 2007. For the younger members of the audience, CERN was where Tim Berners-Lee developed the World Wide Web in the early 1990s.
When it goes online it will have 4 major experiments, each of which stores data at 100-400 MB/sec, and I stress stores data at 100+ MB/sec; the first level is processing 40 terabytes a second. This equals a few petabytes a year (1 PB = 1000 TB = 1,000,000 GB), which then has to be shipped to sites around Europe and the US.
All this is going to have data, processing and network requirements which make most techies gasp: Google only has a 20 TB database, while current physics ones are at 650 TB+. At this level 14 TFlops is kinda a cute little toy.
And yes, most of it's open source and based on the Globus Toolkit.
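For anyone who wants to check the "few petabytes a year" figure, here is a rough back-of-the-envelope sketch in Python. The number of seconds of data taking per year is my assumption (about 1e7, since an accelerator does not run around the clock), not something stated in the comment; the function name is made up for illustration.

```python
# Back-of-the-envelope check of the storage figures quoted above.
# Assumption (not from the comment): data is recorded for roughly
# 1e7 seconds per year rather than all 3.15e7 seconds of the year.
MB_PER_PB = 1000 * 1000 * 1000  # 1 PB = 1000 TB = 1,000,000 GB = 1e9 MB


def petabytes_per_year(rate_mb_per_s, live_seconds=1e7):
    """Convert a sustained storage rate in MB/s into PB per year."""
    return rate_mb_per_s * live_seconds / MB_PER_PB


for rate in (100, 400):
    print(f"{rate} MB/s -> {petabytes_per_year(rate):.1f} PB/year")
```

Under that assumption each experiment lands somewhere between about 1 and 4 PB a year, which fits "a few petabytes". Change `live_seconds` if you prefer a different duty cycle.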
-
Re:Opensource Grid Computing
It's called The Globus Toolkit.
-
A few options
I have been looking into this lately, and here are the options I have found:
- Condor - seems to be the best free as in beer scheduler, but it's not free as in speech.
- OpenPBS - This one is sort of Free, but it is being developed by a company that doesn't seem so sure it likes it that way. The code goes BSD after a couple of years, and they've been doing that for several years, yet they don't make the old (now BSD) versions available, and they make you register just to download.
- Sun GridEngine - Free, and it looks pretty sweet. I couldn't get it to work on Debian, but people on the mailing list said they were using it with Debian.
- Globus Toolkit - Not so sure about this one.
- Maui - Scheduler system for supercomputers
- OSCAR - Sweet project from IBM to put together all the best Free tools for clustering! They are using the Maui scheduler in their system.
What I would really like to see is a HOWTO that gives a good overview of scheduling and clustering. Everything I have found so far is not so good.
-
Capitalize on the hype
It seems like this attempt to market something as "gridMathematica" is really a little deceiving. In reality it is more distributed Mathematica. Grids involve virtual organizations, authentication, etc. For more information see Ian Foster, Carl Kesselman, and Steve Tuecke's paper The Anatomy of the Grid.
There are other packages which do very similar things and have for a long time, such as NetSolve and Ninf, which allow you to do cool stuff with almost any application that needs computational power.
There is also a Commodity Grid Kit (a standard interface to Globus services) for Matlab that should be out soon; more info can be found here.
So for now, I'll just consider this more someone wanting to capitalize on the hype around Grids at SC2002 than anything else. Unless I'm missing something obvious.
-
Re:Their in fault, not you
Why are people coming up with all these irrelevant analogies?
This is from a guy whose .sig is "To ruin the net to save Disney is the equivalent of burning down the library of Alexandria to save monastic scribes"? :-)
The other companies in question chose to allow company X to integrate its code with theirs.
Quite likely. However, you have no way of knowing what this agreement included. It could be an informal verbal agreement, where he gets no particular legal rights. It could be a written one prohibiting redistribution of their own source as part of the agreement, which means that if he tries to redistribute the source, the license they grant him becomes invalid. This is quite common.
Its code happened to be integrated with GPL'ed code, which means that it MUST be released under the GPL, as MUST anything else it integrates with (ref. to GPL). The other companies had the obligation to make sure that company X didn't have any GPL'ed code. They didn't do that. THEIR FAULT. Now they pay the price. They are OBLIGATED to distribute their code under the GPL because they "knew or SHOULD HAVE KNOWN that the code they were merging with had GPL'ed code in it". In other words, the burden is on them.
This is simply stupid. If I produce a license that says something like this, the only person who gets screwed over is the person licensing it, if he intends to merge with GPLed code. There's no implied license, as you seem to feel is the case. The merging person simply does not have the ability to produce a derivative work and apply both of the licenses he was granted at the same time, because doing so would violate clauses of one or the other license.
-
Grid Computing != Timeshare (although related)
No matter how many articles I've read, it always amazes me how few Slashdotters read the article before they feel compelled to post their (usually misguided) opinion. I'm sure plenty do, but there sure are a lot who don't.
IBM is working on the commercialization of Distributed Computing (henceforth, DC). This effort has been around for a while (in a related area, called Grid Computing, which some people use interchangeably with DC) in the form of the Globus project, amongst others.
The concept behind DC is essentially a next-gen timeshare-- a distributed timeshare with an abstraction layer, if you will. Unlike traditional timeshare, you don't specify where your processing will occur. Unlike existing projects (like folding@home, distributed.net), DC doesn't require that you have a parallel, segmentable computing problem.
Let's say (in your best Police Squad voice) I'm a mechanical engineer who's designing a car engine with a few thousand parts. I want to run some simulations on my model to inspect heat flows, vibration, whatever. Car companies (or the little guy with a copy of CATIA and a great idea) don't necessarily have dedicated computing resources to run my simulation. So, until now, I had to band together with a bunch of other mechanical engineers with jobs similar to mine and try to justify a giant simulation node. Or, I might convince management to outsource the computation, requiring a bunch of red tape, NDAs, contracts, negotiation, etc.
Now consider IBM, one of the largest commercial web hosts. IBM maintains giant server farms to support these services. Consider the amount of excess processing capacity sitting in these server farms because (a) a lot of servers are spitting out static pages and (b) extra capacity is necessary to cover peak loading for special events.
Expand this idea to include thousands of people who need computation power for discrete, isolated projects and thousands of companies with excess computational capacity. The consumers don't care precisely where or when their computations get completed, they only care that they get done in a "reasonable" amount of time. An intermediary, which it looks like IBM wants to be, can accept jobs from them, break them into as many pieces as they can, farm them out to whichever of their suppliers has excess capacity at any particular moment, combine the results, and return them to the customer.
Even more, IBM can charge more if you want a high priority on your computation or if your job is not symmetric and must be run on fewer nodes.
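To make the intermediary idea concrete, here is a toy sketch in Python of the accept / split / farm out / combine loop described above. Everything in it is hypothetical: the `Supplier` class, the "free slots" notion of spare capacity, and the supplier names are invented for illustration, and the `work` function is called locally where a real broker would ship the piece to a remote machine.

```python
from dataclasses import dataclass


@dataclass
class Supplier:
    name: str
    free_slots: int  # spare capacity right now (hypothetical unit: one piece of work)


def split_job(items, pieces):
    """Break a job (a list of independent work items) into roughly equal pieces."""
    pieces = max(1, min(pieces, len(items)))
    size, extra = divmod(len(items), pieces)
    chunks, start = [], 0
    for i in range(pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def broker(items, suppliers, work):
    """Farm pieces out to suppliers with spare slots, then combine the results."""
    slots = [s.name for s in suppliers for _ in range(s.free_slots)]
    if not slots:
        raise RuntimeError("no supplier has spare capacity")
    placement, results = [], []
    for chunk, supplier_name in zip(split_job(items, len(slots)), slots):
        placement.append((supplier_name, len(chunk)))
        # Local call standing in for execution on the supplier's machine.
        results.extend(work(x) for x in chunk)
    return results, placement


results, placement = broker(
    list(range(10)),
    [Supplier("farm-a", 2), Supplier("farm-b", 1)],
    lambda x: x * x,
)
print(placement)
print(results)
```

Pricing, priorities and jobs that cannot be split evenly are left out; a job that "must be run on fewer nodes" would simply get a smaller `pieces` value.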
Actually, if you think about it, IBM is hurting their server sales by advancing this project. Right now, they sell a lot of excess capacity to companies to cover their peak loading. If companies can dynamically purchase exactly the amount of processing they need, that's money IBM's leaving on the table. Now, companies with high-availability requirements will still purchase their own systems with enough extra capacity to cover their own needs. But, when they're not using that capacity, they'll sell it.
I think IBM saw that the train was leaving the station. They know this technology is coming. And they see that the chance to be the intermediary in this market is worth more than the money they'll lose in hardware sales. And, they know if they don't, someone else will.
-
Re:Sun is Right
Yeah. There's actually quite a lot of research going into this currently. It's called the Grid (think "power grid", ubiquitous, simple to use), and I predict it will be the next big buzzword.
See Global Grid Forum, Grid Today and the Globus project for starters.
The problem of buying and selling computation power on some sort of broker basis is a quite interesting problem in itself. Exactly what are you selling? Hardly CPU hours, since the value of those depends on the hardware.
-
Clusters and Grid
Grid computing and clustering technologies are on opposite ends of the parallel computing scale.
Actually the clustering technologies are in the middle of the scale. Symmetric multiprocessing with shared memory is the most tightly coupled end of the scale, then come the clusters, then the Grid technologies at the other end.
Each calls for a different style of application development too. In systems where IPC is really expensive, you want to minimise it as much as possible. Not all apps that are written to run on a Beowulf cluster will necessarily port straight over to a grid framework. However, for apps that can be made to run well on a grid, the potential computing power available is far, far greater.
Yes, the development strategies are certainly different. However, often the Grid technologies can be used to provide a way to access the clusters instead of distributing the whole software on several machines. In that case you usually need only relatively small changes to existing software.
The benefit in this kind of approach can be that the authentication, authorization and encryption services for the connection and data transfer are provided by the Grid framework. For instance you can use the Globus Java CoG kit to authenticate in "Globus style" if you prefer that to the options Java natively offers. (Mobile Analyzer developed in our group at Helsinki Institute of Physics does that.)
Currently it is often still a bit inconvenient (mappings between Grid credentials and local user accounts etc.), but as these services develop, users will probably have access to many more machines than they have now, because they won't need an account on each box. Then they can run their job (which is not necessarily parallel at all) where they like, or run the job on their desktop but access data in an external database using their Grid account.
The computer or cluster for the job can also be selected automatically. The NorduGRID group has implemented this kind of system, which connects several clusters in the Nordic countries; they have a status monitor on their website.
AJT
-
Hmm
The future of supercomputing lies in grid technology and creations such as the globus toolkit.
-
Notes and comments
First of all, be sure to check out the links at the end of the article to some of the projects that are going on right now. Some of the ones that I find more interesting are the Particle Physics Data Grid and the Access Grid (no link in article).
One of the great benefits of Grid computing over distributed computing is the access to resources, such as storage. This is what PPDG seeks to do: provide physicists with access, in near real time, to the results of experiments. The problem is that the experiments may be performed at CERN and the researcher may be at CalTech. While this normally isn't a problem for telnet or whatnot, it is a problem when an experiment can produce petabytes of data. For more information on that see http://www.ppdg.org. There is another project called NEESGrid that will provide remote access to earthquake simulation equipment. Truly cool.
I also encourage you to check out Globus. Using a system like the Globus Toolkit along with MDS, I can locate a machine and execute my program on it transparently. This transparency is taken care of through a network of resource managers, proxies and gatekeepers. It's pretty cool and is pretty easy to install on your favorite Linux box.
Programming Grid-enabled applications is pretty easy. There are software libraries called CoG Kits that provide simple APIs for Java, Python and a few other languages. In just a few lines of code you can have a program that looks up a server to run your executable on, connects, executes and returns the data to you.
The current push right now is towards OGSA, which is the Open Grid Services Architecture. This will form the basis for Globus 3.0. OGSA will take ideas from web services, like WSDL, service advertisement, etc., and implement them to create Grid services. This will be the next thing, with services easily able to advertise themselves and clients easily able to find services.
-
Re:Hmmm, This and the PS3
Globus (grid software being developed mainly at Argonne) has a security model built into it. See www.globus.org/security for some details.
Also, anything the DOE would do on this network would be unclassified, and completely non-export controlled. Classified work is done on internal networks separated from the internet by an air gap.
-
Re:Hmmm, This and the PS3
PKI and Kerberos - DOD & DOE have mandates for Kerberos in the short to medium term (5-10 years). Europe currently favours PKI with authenticated certificates (like PGP certs) but only signed by one government agency.
However, the Globus toolkit was built on the GSSAPI, which would allow it to run on anything you want to write an interface to.
-
A little more information
It's a little surprising that this got posted, because it's not all that earth-shattering news, but I'll provide some additional information about grids in general.
There are a wide variety of systems like this that are either currently available or are being developed. Among them are Particle Physics Data Grid, NEESGrid and various European and Asian counterparts.
The basic premise is to allow access to various resources you don't have at your desktop. This is not to be confused with putting all these computers together and forking a process a billion times and having it run all over the globe. It's more like saying: I have a process that requires 128 processors and 4 GB of RAM, go find them and run it for me.
Most of the systems use Globus, which is pretty much the de facto standard. There are other systems out there such as Legion and Condor which serve slightly different purposes.
I've also seen some issues about security raised, so I'll mention them quickly. Globus is built upon an API called GSS (Generic Security Service); I believe it will soon (if not already) have an RFC published. This is a layer on top of various other security systems that may be local to the server running it. It can use Kerberos or PKI to do encryption across the network (don't flame me if it's wrong, I'm no security expert).
When I wish to start using the grid, I start up my proxy, which takes care of all authentication for me. Then my proxy connects to the gatekeeper on the remote machine, which authenticates me based on my private key and then authorizes me via a mapping (usually just a text file). The task is then executed by the gatekeeper via the mapping on the remote machine. Input and output can be redirected over a secure layer if you so desire.
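The authorization step (the text-file mapping) is simple enough to sketch. The Python below is only an illustration of the idea, assuming a mapping file where each line holds a quoted certificate subject followed by a local account name; the file layout, the subjects and the account names are all made up here, and the real authentication (checking the certificate and proving possession of the key) is not shown at all.

```python
import shlex

# Hypothetical mapping entries: quoted certificate subject, then local account.
GRID_MAPFILE = '''
# subject (distinguished name)             local account
"/O=Grid/OU=example.org/CN=Jane Doe" jdoe
"/O=Grid/OU=example.org/CN=Ravi Patel" rpatel
'''


def load_mapping(text):
    """Parse the mapping text into {subject: local_account}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, account = shlex.split(line)  # shlex honours the quotes
        mapping[subject] = account
    return mapping


def authorize(subject, mapping):
    """Return the local account for an already-authenticated subject, or None."""
    return mapping.get(subject)


mapping = load_mapping(GRID_MAPFILE)
print(authorize("/O=Grid/OU=example.org/CN=Jane Doe", mapping))
print(authorize("/O=Grid/OU=example.org/CN=Unknown", mapping))
```

The point is that the remote site keeps control: an authenticated user with no entry in the mapping gets no local account and so cannot run anything.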
My certificate is issued by an authority. In this case the Globus CA. The nice thing is that if you want to set up a grid of your own computers, you can get a cert from them too. Install Globus and it will tell you how.
Certificates also allow you to get access to data. This allows me as user A to run program B at site C, providing results to user D at site E for a period of time F.
It's all terribly neat and remarkably easy to install on your favorite Linux or Solaris box. It's also fairly easy to write programs to utilize the Grid thanks to the various CoG Kits for Python, Java and Perl.
-
Re:Colleges
What you are describing sounds strikingly similar to the Globus project.
-
Globus is a Linux project, sponsored by Microsoft
Tell the Microsoft sales rep that you are using Linux because that's where many of the advances in clustering technology are being developed. In fact, they recently switched from using Windows as the basis of their development to using Linux, and one of their primary sponsors is Microsoft. Since Linux is clearly Microsoft's first choice for a clustering platform, yours should be too. After all, no one ever got fired for doing what Microsoft told them to!
-
Re:distributed computing
let's see. 1 GB in 10 ms works out to 100 GB per second. how recently did GB ethernet come about? and what would the average bandwidth of users be? i would guess much less, but let us assume 100KB per second.
Well, 100 GB per second is the raw data rate, as read out (heavily parallel) from the detector, i.e. the data rate the DAQ (Data AcQuisition) system has to keep up with. That's pretty difficult really, but done completely in hardware: the readout chips have relatively large on-chip buffers for each read-out channel. MOST OF THIS DATA IS DISCARDED RIGHT AWAY by the so-called Level 1 Trigger, whose purpose is to throw away the most obviously uninteresting collisions.
Since the data rate after L1 is still WAY too large for all of it to be stored, another trigger, unimaginatively called Level 2 Trigger, sorts out even more crap. Since the data rate is lower than for L1, L2 can use more sophisticated algorithms to figure out which event is crap and which is an ever-famous Higgs decay :-)
One more trigger, Level 3 (you guessed it), is used to reduce the amount of data even further, again with more sophisticated means.
Still, the required bandwidth is quite impressive. At CDF II, the data rate after Level 3 will be about 75 events per second, at half a meg each, summing up to 30-40 MB per second (a sizeable share of what Gbit ethernet can nominally carry), which are all reconstructed right away.
Note that for the LHC experiments (CMS, ATLAS) the amount of data is more than an order of magnitude larger than for CDF and D0 (at Fermilab).
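The rates quoted in this thread are easy to reproduce. A small Python sketch of the arithmetic follows; the only input not taken from the posts is the nominal Gbit ethernet line rate of 1000 Mbit/s, and the variable names are mine.

```python
# Raw readout: 1 GB every 10 ms, i.e. 100 readouts per second.
READOUTS_PER_SECOND = 1000 // 10          # 1000 ms divided by 10 ms
raw_gb_per_s = 1 * READOUTS_PER_SECOND    # GB per second off the detector

# After the Level 3 trigger: 75 events per second at half a megabyte each.
after_l3_mb_per_s = 75 * 0.5

# Compare with nominal Gbit ethernet (assumed 1000 Mbit/s line rate).
after_l3_mbit_per_s = after_l3_mb_per_s * 8
gbit_fraction = after_l3_mbit_per_s / 1000

print(f"raw readout:   {raw_gb_per_s} GB/s")
print(f"after Level 3: {after_l3_mb_per_s} MB/s = {after_l3_mbit_per_s} Mbit/s")
print(f"share of nominal Gbit ethernet: {gbit_fraction:.0%}")
```

So the three trigger levels together cut the stream by a factor of a few thousand (100 GB/s down to about 37.5 MB/s), and what is left is around 300 Mbit/s on these numbers.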
The LHC data will be spread all over the world, using a multi-tier architecture with CERN being Tier 0, national computing centers as Tier 1 centers, universities being Tier 2, etc. No national computing center will be able to store ALL data, so the idea is that e.g. your Higgs search will be conducted on the U.S. Tier 1 center, B physics on the German Tier 1 center and so on. Obviously not only US scientists will search for the Higgs, so others will also submit analysis jobs on the US Tier 1 and vice versa. The GRID is being designed to get this working. A current implementation is GLOBUS.
Having said this, it is important to note that right now, the GRID is nowhere near this goal. To submit jobs in this "fire and forget" way is not possible yet. There is a shitload of problems yet to solve, the most important ones being trust and horsepower.
Trust: you must allow complete strangers to utilize your multi-million dollar cluster, and they haven't even signed a term-of-use form.
Horsepower: everybody expects to get more CPU cycles out of the GRID than he/she contributes. Obviously, this will not work. (Albeit the load levelling might improve the overall performance.)
-
Grid computing pushing this issue
I work in grid computing and we have some needs that push this idea forward. Over at Argonne labs the Globus team has put forward this draft of extensions for some of what you talk about (i.e. it's secure and multi-path). Code exists under yet another open source license, the "Globus Toolkit Public License".
-
Re:From the wired article
Yes - using Linux is all very fine and well but it has some nasty surprises. For example on RedHat 6 upgrading to the next version of Sun's JDK (in this case 1.3) requires an upgrade to a new version of certain libraries and the recompiling of most of the software on the system.
While this is fine on a home hobbyist machine it is not very good if you have multiple users and especially not if you are selling computer time to companies. And why do you need Java 1.3 you ask? You need it because the Globus CoG toolkit needs it.
-
OS/software
Check this out: The software They're running
-
Re:Sun is already there!
Sun's Grid Engine doesn't seem nearly as powerful as the Globus toolkit used by the Grid.
-
More Info About Grid Computing...
Is this Slashdot's first DataGrid related posting?
More info about the DataGrid...
- Computing Power on tap - Economist - June 21, 2001.
- The Globus Toolkit
- The Information Power Grid at NASA.
- The EU DataGrid
-
Vaporous, but still gives it exposure...
I agree it's a vanilla corporate release, but it's good news. A lot of people don't even know what grid computing is. This can help spread the word of yet another excellent OSS project.
I had heard of grid computing before, but hadn't read much about it. Google turned up lots of resources this morning - worth the read. The article was right - the software to manage a grid will be super complex and the security implications are daunting.
-
For some real info
http://www.globus.org/
http://grid.web.cern.ch/grid/
And yes, it'll run on Linux (at CERN anyway, they're quickly getting rid of all the "legacy RISC" platforms here)
It's not really about having fast pipes all over Europe, it's more like having software you can run to have your applications running on thousands of nodes around the world and also managing all of it.
-
Shhh! Don't tell anyone, but ...Well I will assume you have a reason for being distributed, such as parallel or large process sharing. Large process sharing is not a reality yet unless it is predivided into pieces, but Mosix may change that. I will bring everyone up on parallel systems and Globus, a kind of globally distributed system. Redundacy is another issue and does not require any kind of distribution other than redundant device connections such as with SCSI.
Anyway, a little teeny tiny effort and of course comprehension is necessary to find out about Condor and Globus. Condor utilizes idle workstation resources for parallel applications. Kind of like PVM or MPI, but designed for clusters of workstations. It provides a mechanism to link several computers together. Globus, built on Nexus, it is a GPL system that runs on just about any grid (such as what Condor/MPI/PVM can be). It provides a consistent API and is useful for much more than standard parallel work. It is still being developed, but you can get the tools and server stuff.
Most people around here including the asker of the question obviously don't know a parallel app from Microsoft Word. I see this alot. "I will run SETI in parallel!" Huh? Not exactly.
I'll explain... I am all for running process-independent stuff by rsh script, but it is not a true use of distribution. That is the whole point. SETI allocates chunks of non-dependent data out to systems. You then send back results, with no messaging. It is simple, and not a multipurpose distributed system. It is pre-distributed and requires only a yes or no result. This is fine and great for embarrassingly parallel applications such as number searches. You're adding more monkeys on typewriters, but they're only monkeys.
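To make the "pre-distributed, yes or no result" model concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the work unit (does a chunk of integers contain a multiple of 15000?) is made up, and a thread pool stands in for separate machines. The point is only that each chunk is processed with no communication between workers.

```python
from concurrent.futures import ThreadPoolExecutor

TARGET = 15000  # hypothetical "signal" we are searching for


def make_chunks(total, size):
    """Split the integers 1..total into independent (start, stop) work units."""
    return [(lo, min(lo + size, total + 1)) for lo in range(1, total + 1, size)]


def search_chunk(chunk):
    """Process one work unit and return a plain yes/no answer."""
    start, stop = chunk
    return any(n % TARGET == 0 for n in range(start, stop))


def run_all(chunks, workers=4):
    """Hand chunks to workers; workers never exchange messages with each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(search_chunk, chunks))


if __name__ == "__main__":
    print(run_all(make_chunks(50_000, 10_000)))
```

Adding more workers only means more chunks are processed at once; nothing about the program's structure changes, which is exactly why this is the easy case.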
Real message passing, like that of a Beowulf class server, is needed when there is boundary data required between processes. When one computer changes a row of a matrix and that row is a boundary, it must send the updated row to the process it shares the boundary with. This is usually the real crap, what people buy 100s of nodes to do. See MPI and PVM. These programs must be explicitly written in parallel to be efficient and utilize parallel code structures. They are built on top of message passing libraries (MPI/PVM) that are pre-ported to systems.
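A rough sketch of that boundary exchange, in plain Python rather than real MPI or PVM calls. The layout is an assumption made for illustration: two blocks of rows, where the top block keeps a ghost copy of the bottom block's first real row and vice versa. In a real program the two assignments in `exchange_boundaries` would be a send/receive pair between two processes.

```python
def exchange_boundaries(top, bottom):
    """Refresh each block's ghost row from its neighbour's edge row."""
    top[-1] = list(bottom[1])   # top's ghost row <- bottom's first real row
    bottom[0] = list(top[-2])   # bottom's ghost row <- top's last real row


def relax(block):
    """One averaging step on interior rows, using the rows above and below."""
    new = [list(row) for row in block]
    for i in range(1, len(block) - 1):
        for j in range(len(block[i])):
            new[i][j] = (block[i - 1][j] + block[i + 1][j]) / 2.0
    return new


def step(top, bottom):
    """Exchange boundaries first, then let each block compute independently."""
    exchange_boundaries(top, bottom)
    return relax(top), relax(bottom)


if __name__ == "__main__":
    # Row 0 of `top` is a fixed hot edge; the last row of `bottom` is a fixed cold edge.
    top = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
    bottom = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    for _ in range(2):
        top, bottom = step(top, bottom)
    print(top[1], bottom[1])
```

The exchange has to happen on every step, so each step costs at least one network round trip. That is the part that does not spread well over slow links.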
It is important to note PVM/MPI can be used to reclaim idle workstation time; it is just inefficient at it and will piss people off. However, a proper queueing system set to run at night could be utilized.
Systems like Mosix are OK and they exist now. They give you use of a network of Linux workstations with process migration. However, it is very low level and will remain so since it works on x86 processes explicitly. It also must have non-I/O-bound processes to export or it will be limited in utilization. A great project they are working on is the utilization of the network's memory space for large processes. If you ran a 2000x2000 matrix you could solve it using just plain Matlab and 4 256MB systems. It distributes the process state to where the data is. Mosix also is quite useful in dynamic scheduling. PVM and MPI both have very limited use of dynamic scheduling, but thanks to Mosix's peer-to-peer load balancing it can be utilized as a dynamic scheduler. PVM and MPI issue static process allocation to the nodes; as usage fluctuates (finished or waiting process nodes), Mosix can move loads to increase efficiency.
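The difference between static allocation and dynamic scheduling can be sketched with a toy simulation. This is not how Mosix works internally (it migrates running processes); the "dynamic" case below is a simpler stand-in that hands each job to whichever node frees up first, and the job durations are invented for the example.

```python
import heapq


def static_makespan(durations, nodes):
    """Assign jobs round-robin up front; return when the busiest node finishes."""
    loads = [0.0] * nodes
    for i, d in enumerate(durations):
        loads[i % nodes] += d
    return max(loads)


def dynamic_makespan(durations, nodes):
    """Give each job, in order, to the node that becomes free earliest."""
    free_at = [0.0] * nodes
    heapq.heapify(free_at)
    for d in durations:
        t = heapq.heappop(free_at)       # earliest available node
        heapq.heappush(free_at, t + d)   # that node is busy until t + d
    return max(free_at)


if __name__ == "__main__":
    jobs = [8, 1, 8, 1, 1, 1]  # hypothetical job lengths, arbitrary time units
    print(static_makespan(jobs, 2), dynamic_makespan(jobs, 2))
```

With these made-up numbers the static split leaves one node idle most of the time while the other grinds through both long jobs; the dynamic version keeps both busy.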
Condor is used on groups of workstations and is heterogeneous (NT is getting a port). You can build parallel apps for it just like MPI or PVM; it uses other technologies than them, however.
Now Globus. Globus is a huge project utilizing a message passing/thread library called Nexus, and it can run on any grid. That grid then will connect to other distributed grid resources across the net. The user is presented with a web interface and a secure login. They upload the program and request an allocation of resources. They get the results back when done. It uses whatever servers are available and can use explicit parallelism through the thread library to make it faster. It is for all purposes a worldwide supercomputer. It goes beyond this to also share all data resources available to the system by database through its directory system. This system allows anyone to join, but you have to be allowed to use other people's resources.
So if you seriously are thinking about playing with this stuff, figure out a real use, and figure out how much power you will be using (NODESx250W 24x7 can be quite a power bill). Then decide what parallel system (PVM/MPI/Mosix/Condor) you want to use. If you have a whole department of computers, Condor might be good; if you have a specific parallel app and a few non-workstation nodes, use PVM/MPI. If you run lots of processes or have lots of people logging on, Mosix could be useful. Also, Mosix with MPI/PVM on top would probably give an efficient cluster. Then you could submit it to Globus so others could utilize it. However, it sounds rather elitist and probably won't use two P133s when they got Cray T3Ds. Also, don't think about actually using Globus yourself. Hey, I guess it would be just about like SETI and others. You could be helping science or at least some grad student piddle around.
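The power bill estimate is simple arithmetic, sketched here in Python. The 250 W per node figure comes from the comment above; the 30-day month, the 16-node cluster and the price per kWh are assumptions picked for illustration, so plug in your own.

```python
def monthly_kwh(nodes, watts_per_node=250, hours=24 * 30):
    """Energy used by a cluster running 24x7 for a 30-day month, in kWh."""
    return nodes * watts_per_node * hours / 1000.0


def monthly_cost(nodes, price_per_kwh, watts_per_node=250):
    """Cost of that energy at a given (assumed) price per kWh."""
    return monthly_kwh(nodes, watts_per_node) * price_per_kwh


if __name__ == "__main__":
    nodes = 16            # hypothetical cluster size
    price = 0.10          # hypothetical price per kWh
    print(monthly_kwh(nodes), "kWh per month")
    print(round(monthly_cost(nodes, price), 2), "per month")
```

This ignores cooling, monitors and switches, so treat it as a floor rather than an estimate of the whole bill.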
-
Some Pointers
I do research into how to use statistical and AI techniques to predict resource demand and availability in distributed systems. You might be interested in looking at some of my thesis-related papers, systems, and databases.
Another project that is interested in performance prediction is the Network Weather Service (NWS). An important issue in systems such as RPS (my system) and NWS is accurate and scalable measurement of hosts and networks. Remos is able to do this.
A lot of work in this area is taking place in the context of Computational Grids. The Grid Forum is an IETF-like body that is trying to standardize Grid middleware systems. Globus and Legion are examples of Grid middleware systems.
-
NOW, what kind of NOW is it?
Hi, I've been doing some small work for the Rock Linux project for Beowulf/MOSIX clustering support. I have read through Progeny's press release and would like more info.
Is NOW based on current work such as MOSIX, or will it be a new system entirely?
If it is new, will it be heterogeneous?
- If so, how would it handle process sharing?
- Are you going to lay dynamic process scheduling over some sort of heterogeneous message passing system?
- Or similarly, but more simply, are you building a preemptive process distribution without dynamic scheduling? (Gathering data from the network when finished.)
Is it going to have a web interface such as Globus and be net distributed?
If it is none of the above, then will it require recompilation to be utilized?
Will it be fully optimized to use resources without hand tuning and using PVM/MPI?
Thanks,
CH
-
The Grid
People are working hard, and spending plenty of money, solving these problems - check out the Alliance - particularly Globus and Condor. We're doing real-world science now. The other day we solved QAP30, which was a big problem in the optimization field. We've got people doing particle physics simulations, protein conformation, computer architecture simulation - the list goes on and on.
People need to stop looking at the d.net/Seti@home problems as the only model for Internet computing. They're not very hard problems. What makes them neat is that they've got lots of CPUs. (SETI is cool because it's space and aliens and everything, but RC5-64 is just plain stupid - they're proving that 64-bit RC5 is 256 times harder to crack than 56-bit RC5. Yawn.)
Numerical accuracy is a concern. Latency is a concern - but not for a huge set of problems. You don't need a T3E for Monte Carlo simulations, and you shouldn't try to put your finite-element simulations all around the world. Networks are getting faster and faster, so code size is really not an issue today for anyone on a real network (i.e. vBNS). Data size can be a problem, but again, networks are getting faster, and you can prestage a lot of the data. If your code is too sensitive to risk distributing, then no amount of technological progress is going to change it. User security is not that difficult a problem - it's not too hard to sandbox an application on a decent OS. And as for FORTRAN, I don't see what the problem is. Processors don't run C or FORTRAN or Pascal, and the FORTRAN compilers still produce some pretty tight code.
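For anyone wondering where the 256 comes from: brute force cost scales with the number of possible keys, and each extra key bit doubles that. A tiny Python check, assuming a plain exhaustive search where every key costs the same to test:

```python
def keyspace(bits):
    """Number of possible keys for a key of the given length in bits."""
    return 2 ** bits


def brute_force_ratio(bits_a, bits_b):
    """How many times larger the search space for bits_a is than for bits_b."""
    return keyspace(bits_a) // keyspace(bits_b)


if __name__ == "__main__":
    print(brute_force_ratio(64, 56))  # 2 ** (64 - 56)
```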
The Internet makes great sense for high-performance computing, for the right problems.
-
Sounds like GRID.. which is going to be _the_ thing for distributed computing. See this link, for example.
Who needs Microsoft anyway?
-
This is sort of what the grid projects are doing..
There are a couple projects out there in the HPC community that are aimed at something like this. The main ones are the Globus project (mainly distributed computing services) and the Cactus project (an application framework). I saw presentations on both last week at the HPDC conference, and while they still have a lot of work to do both are rather impressive.
--Troy
-
HPC software original sources...
Some pieces of Sun's HPC software are derivations of freely available code. Their MPI implementation is (or rather was, the last time I looked) based on mpich from ANL. The linear algebra packages are based on ScaLAPACK and crew. Sun may be giving out some tuning implementation, but nothing that can't be found automatically (see the PHiPAC and ATLAS projects). PETSc and PVM are straight builds of older code, bugs and all.
Some of the more interesting pieces, like LSF, are only licensed by Sun, thus will not be included in this `deal.' (For a free improvement over LSF, check out GNU Queue. If it doesn't do something you want, you can support the community and extend it.) If you read the announcement carefully, you'll see that the only new codes to which it applies are the parallel file system (the Sun CTO thinks distributed file systems are dead, anyways), the Prism debugger, and the parallel run-time environment.
Of those, the only one with no available substitute is the debugger. The ROMIO library is a good place to start for the MPI file I/O stuff (a good database would be a better place, imho). I already mentioned queue management software. The Ptools Consortium and the Globus Project have links to other HPC cluster tools.
Many of the pieces for debugging are available (combine ddd and gnuplot), but some notable ones are missing. The ability to control multiple GDBs easily from one process and the visualization of parallel execution are needed, and quite difficult to implement. There seems to be interest in making GDB easier to use from other processes, which is a good start towards solving the larger problem of general, distributed debugging. And both the mpich and LAM MPI implementations have some profiling information, but few tools to dig through it.
To be fair, Sun has contributed (and supported contributions) to the original packages. Why they are releasing the rest under their Exploit the Community license is beyond me.
Jason, ejr@cs.berkeley.edu
-
Odds n Ends
-
Perfect!
More bandwidth wouldn't make a world-wide cluster possible - the killer with a big cluster is the latency, which increasing the bandwidth doesn't help much with.
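A back-of-the-envelope model shows why. Assume each message costs a fixed latency plus its size divided by the bandwidth, and that the job needs many small messages sent one after another. All the numbers below (message count, sizes, latencies, link speeds) are invented for illustration.

```python
def exchange_time(n_messages, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Total time for n sequential messages: per-message latency plus transfer time."""
    per_message = latency_s + message_bytes / bandwidth_bytes_per_s
    return n_messages * per_message


if __name__ == "__main__":
    n, size = 10_000, 1_000  # hypothetical: 10,000 messages of 1,000 bytes each

    wan = exchange_time(n, size, latency_s=0.1, bandwidth_bytes_per_s=1e6)
    wan_fat = exchange_time(n, size, latency_s=0.1, bandwidth_bytes_per_s=1e7)
    lan = exchange_time(n, size, latency_s=0.0001, bandwidth_bytes_per_s=1e6)

    print(f"WAN, 1 MB/s:  {wan:.0f} s")
    print(f"WAN, 10 MB/s: {wan_fat:.0f} s")
    print(f"LAN, 1 MB/s:  {lan:.0f} s")
```

Under these assumptions, ten times the bandwidth shaves off under one percent of the total, while cutting the latency to LAN levels changes the answer by roughly two orders of magnitude.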
Check out the Globus project, who are actually trying to build something like this:
www.globus.org
-Erik