A Distributed Front-end for GCC

GCC version by csnydermvpsoft · 2002-10-12 03:38 · Score: 5, Insightful

The machines don't have to be identical or be running the exact same GCC version

Well, to some extent they probably do. If you're running GCC 3.2 on one, you wouldn't be able to run 3.0 on another because of binary incompatibility.

hmmm by Anonymous Coward · 2002-10-12 03:39 · Score: 5, Funny

Yay! My 133 doesn't have to take 25 billion years to compile anymore! Uhm, wait, I don't have any other computers... Shoot.

Re:hmmm by stevey · 2002-10-12 03:53 · Score: 5, Informative

In that case you might like to look at ccache which is a compiler cache for a single machine.

It will cache the compiler options for each source file and the resultant object file generated. I use it a lot when I'm building software packages, which require multiple compilations. It works very nicely - I'd love to see how well it would integrate with distcc...

Re:hmmm by Blkdeath · 2002-10-12 04:15 · Score: 3, Insightful

In that case you might like to look at ccache
Isn't the default cache size somewhere to the tune of 2-4GB?
I recall that all of my lower powered machines were lucky to see a 6GB drive, let alone have 2-4GB to spare.

--
BD Phone Home!
Shameless plug. Like you weren't expecting it.

Interesting approach by PineGreen · 2002-10-12 03:39 · Score: 5, Interesting

The Sun compiler suite comes with dmake, which does the same at the level of make rather than cc, but is essentially the same idea.
Definitely would make beowulf clusters interesting for compilation as well as hard core numerics (no joke intended).

Re:Interesting approach by csnydermvpsoft · 2002-10-12 03:46 · Score: 3, Insightful

Definitely would make beowulf clusters interesting for compilation as well as hard core numerics (no joke intended).

Actually, you wouldn't need a Beowulf cluster at all - just a bunch of networked machines.

Not N by Nashirak · 2002-10-12 03:44 · Score: 5, Informative

You can almost never achieve a speedup of N. You can achieve S(N) = T(1)/(T(1)*alpha+((1-alpha)*T(1))/N+T0), where T(1) is the time it takes to run the task with 1 computer, alpha is the part of the task that cannot be parallelised (as in startup registers etc.) and T0 is the communications overhead of the task.

Just to clarify. :)
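
A rough sketch of that formula in Python, in case anyone wants to play with it. The function name and the sample values for T(1), alpha and T0 are made up for illustration, not measurements of anything:

```python
def speedup(n, t1, alpha, t0):
    """S(N) = T(1) / (T(1)*alpha + (1 - alpha)*T(1)/N + T0).

    n     -- number of machines
    t1    -- time for the whole task on one machine
    alpha -- fraction of the task that cannot be parallelised
    t0    -- communications overhead
    """
    serial_part = t1 * alpha
    parallel_part = (1 - alpha) * t1 / n
    return t1 / (serial_part + parallel_part + t0)


if __name__ == "__main__":
    # Hypothetical numbers: a 100 s build, 5% serial, 0.5 s of overhead.
    for n in (1, 2, 5, 10, 100):
        print(n, round(speedup(n, t1=100.0, alpha=0.05, t0=0.5), 2))
```

With alpha = 0 and T0 = 0 this gives exactly N; any serial part or overhead pulls it below N.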

Great for OpenOffice by IronTek · 2002-10-12 03:50 · Score: 5, Funny

This could really spur the development of OpenOffice.

With 50, 100 machines or so hooked up, OpenOffice's compile time could be reduced to as little as 1 or 2 days!!!

YES! N! Re:Not N by angel'o'sphere · 2002-10-12 03:55 · Score: 5, Informative

You can almost never achieve a speedup of N. You can achieve S(N) = T(1)/(T(1)*alpha+((1-alpha)*T(1))/N+T0), where T(1) is the time it takes to run the task with 1 computer, alpha is the part of the task that cannot be parallelised (as in startup registers etc.) and T0 is the communications overhead of the task.

This is the textbook answer: Amdahl's law, IIRC.

In reality, and also in most textbooks, there are exceptions where the solution scales with the number of processes.

And it should be easy enough to see: 5 machines compiling one source file each are 5 times as fast as one machine compiling 5 source files.

As long as you start gcc 5 times in a row you have the same initialization overhead for EACH instance of gcc, one after the other.

If you manage to start gcc with a couple of source files as arguments you save at least the loading time of the binary. That would correspond roughly to the alpha value.

Amdahl's law is useful for a single program/problem: try to parallelize gcc itself and you will find that compiling one source file can't be sped up very much. So 5 processors running several threads of one gcc instance do not scale by 5.

However it says nothing about just solving the same problem multiple times in parallel.

Regards,
angel'o'sphere

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.

Re:YES! N! Re:Not N by joib · 2002-10-12 04:04 · Score: 4, Informative

That assumes you can divide the work equally. Consider that the number of source files probably isn't an integer multiple of N, and that different source files take varying times to compile. Of course, as the number of source files approaches infinity, and if you have some load balancing scheme, this becomes a non-issue. Then again, in Real Life (TM) most projects don't have an infinite number of source files.
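
To put numbers on it, here is a toy scheduler in Python. It is only a sketch: the "give the next file to whichever machine is free first" policy and the compile times are my assumptions, not how distcc actually schedules jobs.

```python
import heapq


def makespan(compile_times, n_machines):
    """Wall-clock time if each file goes to the machine that frees up first."""
    free_at = [0.0] * n_machines      # time at which each machine becomes idle
    heapq.heapify(free_at)
    for t in compile_times:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + t)
    return max(free_at)


def effective_speedup(compile_times, n_machines):
    """Time on one machine divided by time on n_machines."""
    return sum(compile_times) / makespan(compile_times, n_machines)


if __name__ == "__main__":
    # 11 equal files on 10 machines: one machine has to do two of them.
    print(effective_speedup([1.0] * 11, 10))
    # One slow file dominates no matter how many machines there are.
    print(effective_speedup([5.0, 1.0, 1.0, 1.0], 4))
```

Eleven equal files on ten machines come out at a speedup of 5.5 rather than 10, and a single slow file caps the second case at 1.6.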

Apple Projectbuilder by selderrr · 2002-10-12 03:58 · Score: 3, Insightful

I sincerely hope Apple builds this feature into Project Builder, which compiles insanely slowly compared to CodeWarrior. If it weren't for the superior interface and the integration with Interface Builder, I'd switch back to CodeWarrior right away.

Does anyone here know how good the speed increase is when compiling on dual G4s versus a single proc?

--
When will I end this grieving ? When will my future begin ?

So, is it better? by FreeLinux · 2002-10-12 03:58 · Score: 5, Interesting

Is this better than say, Group Compiler?

Difference between distributed/parallel make? by angel'o'sphere · 2002-10-12 03:59 · Score: 4, Interesting

Could someone please point out the difference between this and a parallel and/or distributed make, like pmake?

It doesn't sound really reasonable to put the coding work into gcc when you would also like to have yacc/bison, a bunch of Perl scripts and whatever else is in your makefile sped up.

Regards,
angel'o'sphere

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.

Re:Don't need the same version? by Angry White Guy · 2002-10-12 04:04 · Score: 5, Informative

From the FAQ:

distcc doesn't care. However, in some circumstances, particularly for C++, gcc object files compiled with one version of gcc are not compatible with those compiled by another. This is true even if they are built on the same machine.
It is usually best to make sure that every compiler name maps to a reasonably similar version on every machine. You can either make sure that gcc is the same everywhere, or use a version-qualified compiler name, such as gcc-3.2 or gcc-3.2-x86-linux.

So in other words, keep them close, especially for gcc versions that break backwards compatibility.
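
If you want to check that before a build, something like the following Python sketch would do. Everything in it is hypothetical: the host names, the version strings, and the rule that "reasonably similar" means the same major.minor release are my assumptions, not anything distcc itself enforces.

```python
from collections import defaultdict


def major_minor(version):
    """'3.2.1' -> (3, 2). Assumes a dotted numeric version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def hosts_by_release(reported):
    """Group host names by the major.minor release they report."""
    groups = defaultdict(list)
    for host, version in sorted(reported.items()):
        groups[major_minor(version)].append(host)
    return dict(groups)


if __name__ == "__main__":
    # Made-up data; in practice you would collect this from each machine.
    reported = {"box1": "3.2", "box2": "3.2.1", "box3": "3.0.4"}
    groups = hosts_by_release(reported)
    if len(groups) > 1:
        print("mixed compiler releases:", groups)
```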

--
You think that I'm crazy, you should see this guy!

Big benefit by LoudMusic · 2002-10-12 04:07 · Score: 5, Interesting

I think the biggest plus is that you can have one hella fast machine on your network running distcc that basically does all your compiling for all your other machines. I can see this being a big bonus for server farms like rackspace.com. The customers would be getting compile speeds from a big ass server, rather than just their little dinky Duron.

~LoudMusic

--
No sig for you. YOU GET NO SIG!

Perfect for Gentoo by waffle zero · 2002-10-12 04:11 · Score: 3, Insightful

Whether you're looking to install Gentoo on an old Pentium to use as a router or sacrificing your firstborn to compile KDE, it should make things go quite a bit faster.

Well, unless every computer you own runs Gentoo and you want to emerge world.

Only true for C++ by FooBarWidget · 2002-10-12 04:11 · Score: 4, Informative

The C ABI is compatible between *all* GCC versions (and probably other compilers too). You can compile libgnome using GCC 2.95.2 and Nautilus using GCC 3.2 and not have any problems at all.

good software design by ucblockhead · 2002-10-12 04:19 · Score: 4, Insightful

If your system is well designed, compiling the entire thing should be a rare event. In a well designed project, most changes occur in c files or in headers only included in a few c files, so most changes only require compiling a very few files.

Compiling the whole source tree should be the sort of thing you do fairly rarely (for a big project), perhaps once a night, perhaps automated so no one has to watch it.

If compile time is something that is a significant problem for you, you really need to look at your code design.

--
The cake is a pie

Check out also ccache.samba.org by GGarand · 2002-10-12 04:21 · Score: 5, Informative

From the ccache homepage, which is also a Samba-hosted project:

ccache is a compiler cache.
It acts as a caching pre-processor to C/C++ compilers, using the -E compiler switch and a hash to detect when a compilation can be satisfied from cache.
This often results in a 5 to 10 times speedup in common compilations.
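
The idea can be sketched in a few lines of Python. This is only an illustration of "hash the preprocessed source plus the options, reuse the object file on a match"; the class, the SHA-256 hash and the fake compiler are my own stand-ins and not ccache's actual implementation.

```python
import hashlib


class CompileCache:
    """Toy compiler cache keyed on preprocessed source and compiler options."""

    def __init__(self, compile_fn):
        self._compile = compile_fn    # the (hypothetical) real compiler call
        self._store = {}              # hash -> object file contents
        self.hits = 0
        self.misses = 0

    def get_object(self, preprocessed_source, options):
        h = hashlib.sha256()
        h.update("\0".join(options).encode())
        h.update(b"\0\0")             # separator between options and source
        h.update(preprocessed_source.encode())
        key = h.hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compile(preprocessed_source, options)
        return self._store[key]


if __name__ == "__main__":
    # Stand-in for running the compiler on already preprocessed (-E) output.
    fake_compiler = lambda src, opts: ("obj:" + src).encode()
    cache = CompileCache(fake_compiler)
    cache.get_object("int main(void){return 0;}", ["-O2"])
    cache.get_object("int main(void){return 0;}", ["-O2"])   # same key: hit
    cache.get_object("int main(void){return 0;}", ["-O0"])   # new options: miss
    print(cache.hits, cache.misses)
```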

No, NOT N by HisMother · 2002-10-12 04:21 · Score: 3, Interesting

Ahem. Amdahl's law still operates, and you even say so yourself. There's a constant part that cannot be removed. Let's say it takes 50 msec to initialize gcc and 500 msec to compile the average source file. Then it takes 5.05 sec to compile ten files with one copy of gcc. Ignoring communications, it takes 0.550 seconds to compile them on ten machines. Is 5.05/0.550 == 10? No, it's about 9.2. Therefore, the speedup is LESS THAN N. Note that the faster the actual compile time, the lower the speedup would be!
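
The same arithmetic as a few lines of Python, for anyone who wants to change the numbers. The 50 ms and 500 ms figures are the ones assumed above; the function name is just for illustration.

```python
INIT_MS = 50       # assumed time to initialize gcc
COMPILE_MS = 500   # assumed time to compile an average source file


def one_invocation_ms(n_files):
    """One copy of gcc pays the startup cost once, then compiles n_files."""
    return INIT_MS + n_files * COMPILE_MS


one_machine = one_invocation_ms(10)    # 5050 ms
ten_machines = one_invocation_ms(1)    # 550 ms, one file per machine
print(round(one_machine / ten_machines, 2))   # 9.18
```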

--
Cantankerous old coot since 1957.

Re:No, NOT N by angel'o'sphere · 2002-10-12 08:10 · Score: 4, Informative

Let's say it takes 50 msec to initialize gcc and 500 msec to compile the average source file. Then it takes 5.05 sec to compile ten files with one copy of gcc

Your calculation is wrong:

I explicitly said: you start gcc N times for N files.

So a call like gcc 1.c 2.c 3.c ... 10.c is not allowed.

Because that call falls under Amdahl's law (in so far as a common initialization time is needed, which is divided among the ten compile tasks).

However 10 calls:
gcc 1.c
gcc 2.c
gcc 3.c ...
gcc 10.c

Those ten calls scale with 10! Running those ten calls one after the other on 1 machine takes exactly ten times as long as running each of them on its own machine.

I repeat: Amdahl's law is about parallelizing one algorithm. It is not about starting the same algorithm on different problem sets (different .c files) in parallel.

Whereas the first does not scale infinitely and does not scale with N, the second does (of course with some limitations in RL, e.g. if all compilers use the same file server via NFS).

The interesting difference is this: under Amdahl's law there is a maximum number of processors up to which the solution scales. Adding more processors does not make the problem solving faster; very often it actually makes it slower because of communication overhead. OTOH, by just duplicating the hardware and distributing the problem "identically" rather than "divided and parallelized", you do get nearly unlimited scale-up. You scale up to the point where the distributing and the gathering get too expensive (distributing C sources from a CVS repository to the compile-farm machines and gathering the *.o files, or better the *.so files, back for linking).
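
As a quick check in Python, using the 50 msec / 500 msec figures quoted at the top and ignoring communication entirely (both are assumptions, and the function name is only for illustration):

```python
INIT_MS = 50       # startup cost, paid by EVERY separate gcc call
COMPILE_MS = 500   # compile time for one average source file


def separate_calls_ms(n_files):
    """n_files separate 'gcc x.c' calls, run one after the other."""
    return n_files * (INIT_MS + COMPILE_MS)


one_machine = separate_calls_ms(10)    # ten calls in a row: 5500 ms
ten_machines = separate_calls_ms(1)    # one call per machine: 550 ms
print(one_machine / ten_machines)      # 10.0
```

The factor is exactly N only because each call carries its own startup cost in both cases; once distribution and gathering costs are added it drops below that.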

angel'o'sphere

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.

Source code by ucblockhead · 2002-10-12 04:25 · Score: 4, Insightful

Running different versions could cause really nasty problems if different versions of gcc support different levels of C (like C99 or older C) or if one version has a compiler bug that another doesn't.

Can you imagine code compiling or failing to compile randomly depending on which machine happens to compile it? Yikes! Debugging nightmare...

--
The cake is a pie

Re:Source code by wik · 2002-10-12 06:15 · Score: 5, Insightful

I've had bad experiences with this in a Condor cluster of linux machines which had different versions of glibc. Seemingly randomly, my jobs would blow up into the netherworld without running and without an error message. Until the administrators matched all of the glibc's (but not the linux distributions, for some reason), I had to compile everything with -static on one machine and pray.

I wonder how much of a problem network bandwidth is in this system. With Condor, moving large datasets between machines is a problem. Object files can be pretty big and if you have a lot of them, you might risk pushing the compile bottleneck to the network. Even worse might be the link step, where all of the objects have to be visible to one machine (gcc doesn't have incremental linking yet, does it?).

--
/ \
\ / ASCII ribbon campaign for peace
x
/ \

security? by gooofy · 2002-10-12 04:48 · Score: 5, Informative

looks like this one is not necessarily a good idea to run on a university workstation cluster...

1.4 Security Considerations

distcc should only be used on networks where all machines and all users are trusted.

The distcc daemon, distccd, allows other machines on the network to run arbitrary commands on the volunteer machine. Anyone that can make a connection to the volunteer machine can run essentially any command as the user running distccd.

distcc is suitable for use on a small to medium network of friendly developers. It's certainly not suitable for use on a machine connected to the Internet or a large (e.g. university campus) network without firewalling in place.

inetd or tcpwrappers can be used to impose access control rules, but this should be done with an eye to the possibility of address spoofing.

In summary, the security level is similar to that of old-style network protocols like X11-over-TCP, NFS or RSH.

--
time is a funny concept

Not a new idea, but a noble one anyway by GeckoFood · 2002-10-12 04:48 · Score: 3, Informative

Once upon a time, Symantec had a C++ compiler, and with version 7.5 (1996), the build process could be spread all over a network. This did speed up compilation, as machines that were running the build service and were more or less idle would be sent files to compile, passing back the objects and binaries as appropriate.

Oh, by the way, that compiler is now called Digital Mars C++.

That said, all the machines on the network had to be running Windows (and at that time, I think Windows 95 and NT were the only choices available for that compiler). Further, all had to have the same version of the compiler.

For those of us that are running Linux boxes on a primarily Windows network, this system, whether GCC or something else, would be rather hard to implement without a cross-compiler. Additionally, even if all were Linux workstations (or BSD, or Solaris, etc etc etc) wouldn't binary compatibility be driven by not just the version of the compiler but the target OS as well?

It's a noble undertaking. I hope that the developers are putting thought into all the little things like this that will make it tough to pull off.

--
Be excellent to each other. And... PARTY ON, DUDES!

Re:So, is it better? - Quick answer from the site. by panoplos · 2002-10-12 05:09 · Score: 3, Informative

From the Group Compiler [sic] (gecc) site:

gecc is a proof of concept. It is heavily inspired by ccache and distcc. You could chain these tools to achieve the same goals gecc tries to reach. Both tools are much more mature and work in production environments. gecc just started with a little different concept. gecc has a central component (distcc has not).

My idea is that gecc could better handle a varying set of compile nodes: if you have some machines that only from time to time could help in distributed compiling than this is nice.

With a central component it might be easier to monitor the compilation and distribute the compile jobs.

Right now gecc is only useful if you read the source.

Emphasis is mine.

I guess it all depends on whether or not you want to work with production-quality code.

Live boot CD? by no_such_user · 2002-10-12 05:22 · Score: 5, Insightful

I'd love a speedup, but the time I'd save compiling would be wasted on having to fully install another linux box. Being able to boot a CD with a live linux distro and this software, and then connect to these slave machines to help compile would be immensely helpful. My linux box is a Cyrix 200MHz PC. Being able to stick a CD into my Athlon 1800 to help the compile would be fabulous.

gecc by j.beyer · 2002-10-12 05:31 · Score: 3, Interesting

There is an alternative (http://gecc.sf.net). gecc takes a slightly different approach: it has a central component that distributes the compilation to a number of compile nodes. The set of compile nodes may change over time; that is, compile nodes may come and go.

gecc is a work in progress and distcc is much more mature, but maybe you'd like to take a look at gecc as well.

(yes, gecc is my baby)

oops... take 2 by yerricde · 2002-10-12 05:53 · Score: 3, Interesting

Amdahl's law still operates, and you even say so yourself. There's a constant part that cannot be removed. Let's say it takes 50 msec to initialize gcc and 500 msec to compile the average source file. Then it takes 5.05 sec to compile ten files with one copy of gcc.

Then you go on to tell how using ten machines provides only a 9.2-fold speedup. But what about a project with 100 files? It would take 50.05 seconds to build everything on one machine, and it takes 5.050 seconds to build ten files on each machine. Now we have a 50.05/5.050 == 9.91 fold speedup. In practice, can you notice the difference between 9-fold and 10-fold speedup?

Does the speedup factor not approach the number of machines asymptotically?
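
A small Python sketch of that question, under the same assumptions as above (50 msec startup, 500 msec per file, one gcc invocation per machine, no communication cost). The function name is made up for illustration.

```python
INIT_MS = 50
COMPILE_MS = 500


def speedup(n_files, n_machines):
    """One-machine build time divided by the time of the busiest machine."""
    files_per_machine = -(-n_files // n_machines)   # integer ceiling
    one_machine = INIT_MS + n_files * COMPILE_MS
    busiest = INIT_MS + files_per_machine * COMPILE_MS
    return one_machine / busiest


if __name__ == "__main__":
    for n_files in (10, 100, 1000, 10000):
        print(n_files, round(speedup(n_files, 10), 3))
```

In this model the factor creeps toward 10 as the file count grows but never reaches it, since the busiest machine always pays the startup cost too.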

(How can I "Use the Preview Button!" when an accidental Enter keypress in the Subject invokes the Submit button? Scoop gets it right by setting Preview as the default button.)

--
Will I retire or break 10K?
