A Distributed Front-end for GCC

Posted by CowboyNeal on Saturday October 12, 2002 @03:35AM from the when-one-compiler-is-not-enough dept.

format writes "distcc is a distributed front-end for GCC, meaning you can compile that big project across n number of machines and get it done almost n times as fast. The machines don't have to be identical or be running the exact same GCC version, but having the same OS is helpful." With the advent of faster hardware, I can't complain about kernel compile times anymore, but larger source trees could definitely benefit from this.

9 of 195 comments

Not N by Nashirak · 2002-10-12 03:44 · Score: 5, Informative

You can almost never achieve a speedup of N. You can achieve S(N) = T(1) / (T(1)*alpha + ((1-alpha)*T(1))/N + T0), where T(1) is the time it takes to run the task with 1 computer, alpha is the part of the task that cannot be parallelised (as in startup registers etc.) and T0 is the communications overhead of the task.

Just to clarify. :)
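
The formula above can be sketched in Python. This is only an illustration: the function name `speedup` and the sample numbers are assumptions, not measurements of distcc, and T(1), alpha and T0 are taken to be in the same time unit.

```python
def speedup(n, t1, alpha, t0=0.0):
    """S(N) = T(1) / (T(1)*alpha + (1 - alpha)*T(1)/N + T0)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    serial = t1 * alpha                 # part that cannot be parallelised
    parallel = (1.0 - alpha) * t1 / n   # part that is split across n machines
    return t1 / (serial + parallel + t0)


if __name__ == "__main__":
    # Hypothetical build: 100 s on one machine, 10% serial, 1 s of overhead.
    for n in (1, 2, 4, 8, 16):
        print(n, round(speedup(n, t1=100.0, alpha=0.1, t0=1.0), 2))
```

With alpha = 0 and T0 = 0 the function returns exactly N; any non-zero serial part or overhead keeps the result below N.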
Re:hmmm by stevey · 2002-10-12 03:53 · Score: 5, Informative

In that case you might like to look at ccache which is a compiler cache for a single machine.

It will cache the compiler options for each source file and the resultant object file generated. I use it a lot when I'm building packages for software, which requires multiple compilations. It works very nicely - I'd love to see how well it would integrate with distcc....
YES! N! Re:Not N by angel'o'sphere · 2002-10-12 03:55 · Score: 5, Informative

You can almost never achieve a speedup of N. You can achieve S(N) = T(1) / (T(1)*alpha + ((1-alpha)*T(1))/N + T0), where T(1) is the time it takes to run the task with 1 computer, alpha is the part of the task that cannot be parallelised (as in startup registers etc.) and T0 is the communications overhead of the task.

This is the textbook answer: Amdahl's law, IIRC.

In reality, and also in most textbooks, there are exceptions where the solution scales with the number of processes.

And it should be easy enough to see: 5 machines compiling one source file each are 5 times as fast as one machine compiling 5 source files.

As long as you start gcc 5 times in a row, you pay the same initialization overhead for EACH instance of gcc, one after the other.

If you manage to start gcc with a couple of source files as arguments, you at least save the loading time of the binary. That would correspond roughly to the alpha value.

Amdahl's law is useful for a single program/problem: try to parallelize gcc itself and you will find that compiling one source file can't be sped up very much. So 5 processors running several threads of one gcc instance do not scale by 5.

However it says nothing about just solving the same problem multiple times in parallel.

Regards,
angel'o'sphere

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
1. Re:YES! N! Re:Not N by joib · 2002-10-12 04:04 · Score: 4, Informative
  
  That assumes you can divide the work equally. Consider that the number of source files probably isn't an integer multiple of N, and that different source files take varying times to compile. Of course, as the number of source files approaches infinity, and if you have some load balancing scheme, this becomes a non-issue. But in Real Life (TM) most projects don't have an infinite number of source files.
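
  A rough sketch of the uneven-split point, in Python. It assumes a simple greedy scheduler (longest file first, always onto the least-loaded machine), which is an illustration and not how distcc actually hands out jobs; `makespan`, `build_speedup` and the per-file times are made up for the example.

```python
import heapq


def makespan(compile_times, n_machines):
    """Wall-clock time if each file goes to the currently least-loaded machine."""
    if n_machines < 1:
        raise ValueError("need at least one machine")
    loads = [0.0] * n_machines
    heapq.heapify(loads)
    # Longest files first, a common greedy heuristic for this kind of packing.
    for t in sorted(compile_times, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + t)
    return max(loads)


def build_speedup(compile_times, n_machines):
    """Single-machine time divided by the distributed wall-clock time."""
    return sum(compile_times) / makespan(compile_times, n_machines)


if __name__ == "__main__":
    # Hypothetical per-file compile times in seconds.
    print(build_speedup([1.0] * 6, 3))          # evenly divisible work
    print(build_speedup([1.0] * 7, 3))          # 7 files on 3 machines
    print(build_speedup([5.0] + [1.0] * 5, 3))  # one slow file dominates
```

  Only the evenly divisible case reaches a speedup of N; in the other two, one machine finishes last and the rest sit idle.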
Re:Don't need the same version? by Angry+White+Guy · 2002-10-12 04:04 · Score: 5, Informative

From the FAQ:

distcc doesn't care. However, in some circumstances, particularly for C++, gcc object files compiled with one version of gcc are not compatible with those compiled by another. This is true even if they are built on the same machine.
It is usually best to make sure that every compiler name maps to a reasonably similar version on every machine. You can either make sure that gcc is the same everywhere, or use a version-qualified compiler name, such as gcc-3.2 or gcc-3.2-x86-linux.

So in other words, keep them close, especially for gcc versions that break backwards compatibility.

--
You think that I'm crazy, you should see this guy!
Only true for C++ by FooBarWidget · 2002-10-12 04:11 · Score: 4, Informative

The C ABI is compatible between *all* GCC versions (and probably other compilers too). You can compile libgnome using GCC 2.95.2 and Nautilus using GCC 3.2 and not have any problems at all.
Check out also ccache.samba.org by GGarand · 2002-10-12 04:21 · Score: 5, Informative

From the ccache homepage (ccache is also a Samba-hosted project):

ccache is a compiler cache.
It acts as a caching pre-processor to C/C++ compilers, using the -E compiler switch and a hash to detect when a compilation can be satisfied from cache.
This often results in a 5 to 10 times speedup in common compilations.
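
The idea described above (hash the preprocessed source, reuse the object file on a match) can be sketched in Python. This is not ccache's real implementation: `CompileCache`, `preprocess` and `compile_fn` are hypothetical stand-ins for running the compiler with -E and for the real compile step, and SHA-256 is chosen only for the example.

```python
import hashlib


class CompileCache:
    """Toy compiler cache keyed on preprocessed source plus compiler arguments."""

    def __init__(self, preprocess, compile_fn):
        self._preprocess = preprocess  # hypothetical stand-in for `cc -E`
        self._compile = compile_fn     # hypothetical stand-in for the real compile
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(preprocessed, args):
        h = hashlib.sha256()
        h.update(preprocessed.encode("utf-8"))
        h.update(b"\0")  # separator so source and arguments cannot run together
        h.update("\0".join(args).encode("utf-8"))
        return h.hexdigest()

    def build(self, source, args=()):
        preprocessed = self._preprocess(source)
        key = self._key(preprocessed, args)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        obj = self._compile(preprocessed, args)
        self._store[key] = obj
        return obj
```

Because the key is computed after preprocessing, an edit that only touches a comment still hits the cache, while a change to the arguments (say, a different optimisation level) misses.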
security? by gooofy · 2002-10-12 04:48 · Score: 5, Informative

Looks like this one is not necessarily a good idea to run on a university workstation cluster...

1.4 Security Considerations

distcc should only be used on networks where all machines and all users are trusted.
The distcc daemon, distccd, allows other machines on the network to run arbitrary commands on the volunteer machine. Anyone that can make a connection to the volunteer machine can run essentially any command as the user running distccd.
distcc is suitable for use on a small to medium network of friendly developers. It's certainly not suitable for use on a machine connected to the Internet or a large (e.g. university campus) network without firewalling in place.
inetd or tcpwrappers can be used to impose access control rules, but this should be done with an eye to the possibility of address spoofing.
In summary, the security level is similar to that of old-style network protocols like X11-over-TCP, NFS or RSH.

--
time is a funny concept
Re:No, NOT N by angel'o'sphere · 2002-10-12 08:10 · Score: 4, Informative

Let's say it takes 50 msec to initialize gcc and 500 msec to compile the average source file. Then it takes 5.05 sec to compile ten files with one copy of gcc

Your calculation is wrong:

I explicitly said: you start gcc N times for N files.

So a call like gcc 1.c 2.c 3.c ... 10.c is not allowed.

Because that call falls under Amdahl's law (insofar as a common initialization time is needed, which is divided among the ten compile tasks).

However 10 calls:
gcc 1.c
gcc 2.c
gcc 3.c ...
gcc 10.c

Those ten calls scale with 10! Running those ten calls one after the other on 1 machine takes exactly ten times as long as running each of them on its own machine.
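
The arithmetic behind both calculations, using the 50 msec and 500 msec figures from the quoted post, can be written out in Python. The function names are illustrative, the shared-initialization model for a single call is the quoted post's assumption, and the cost of distributing sources and collecting object files is ignored.

```python
def single_invocation_ms(files, init_ms, compile_ms):
    """One gcc call with all files: initialization is paid once."""
    return init_ms + files * compile_ms


def separate_invocations_ms(files, init_ms, compile_ms):
    """One gcc call per file, run one after the other on one machine."""
    return files * (init_ms + compile_ms)


def one_call_per_machine_ms(init_ms, compile_ms):
    """One gcc call per file, each on its own machine (distribution cost ignored)."""
    return init_ms + compile_ms


if __name__ == "__main__":
    files, init_ms, compile_ms = 10, 50, 500
    shared = single_invocation_ms(files, init_ms, compile_ms)
    serial = separate_invocations_ms(files, init_ms, compile_ms)
    farm = one_call_per_machine_ms(init_ms, compile_ms)
    print(shared, serial, farm)          # milliseconds
    print(serial / farm, shared / farm)  # speedup against each baseline
```

Measured against ten separate calls, the ratio is exactly 10; measured against the single shared call, it is about 9.2, which is where the two posters disagree.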

I repeat: Amdahl's law is about parallelizing one algorithm. It is not about starting the same algorithm on different problem sets (different C files) in parallel.

Whereas the first one does not scale infinitely and does not scale with N, the second one does (of course with some limitations in RL, e.g. if all compilers use the same file server via NFS).

The interesting difference is this: under Amdahl's law there is a maximum number of processors up to which the solution scales. Adding more processors does not solve the problem any faster. Very often it actually makes it slower because of communication overhead. OTOH, by just duplicating the hardware and distributing the problem "identically" rather than "divided and parallelized", you do get nearly infinite scale-ups. You scale up to the point where the distributing and the gathering get too expensive (distributing C sources from a CVS repository to the compile farm machines, and gathering the *.o files, or better the *.so files, back for linking).

angel'o'sphere

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.