A Distributed Front-end for GCC
format writes "distcc is a distributed front-end for GCC, meaning you can compile that big project across n number of machines and get it done almost n times as fast.
The machines don't have to be identical or be running the exact same GCC version, but having the same OS is helpful." With the advent of faster hardware, I can't complain about kernel compile times anymore, but larger source trees could definitely benefit from this.
The machines don't have to be identical or be running the exact same GCC version
Well, to some extent they probably do. If you're running GCC 3.2 on one, you wouldn't be able to run 3.0 on another because of binary incompatibility.
Yay! My 133 doesn't have to take 25 billion years to compile anymore! Uhm, wait, I don't have any other computers... Shoot.
The Sun compiler suite comes with dmake, which does the same thing at the level of make rather than cc, but is essentially the same idea.
Definitely would make Beowulf clusters interesting for compilation as well as hard-core numerics (no joke intended).
You can almost never achieve a speedup of N. You can achieve S(N) = T(1) / (T(1)*alpha + ((1-alpha)*T(1))/N + T0), where T(1) is the time it takes to run the task with 1 computer, alpha is the part of the task that cannot be parallelised (as in startup, registers etc.) and T0 is the communications overhead of the task.
:)
Just to clarify.
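To put rough numbers on that formula, here is a minimal Python sketch. The function name `speedup` and the sample values (T(1) = 100 s, alpha = 0.25, T0 = 0.5 s) are made up for illustration; they are not measurements of distcc or of any real build.

```python
def speedup(n, t1, alpha, t0):
    """S(N) = T(1) / (alpha*T(1) + (1 - alpha)*T(1)/N + T0).

    n     -- number of machines
    t1    -- time to run the whole task on one machine
    alpha -- fraction of the task that cannot be parallelised
    t0    -- communications overhead (assumed constant here)
    """
    time_on_n = alpha * t1 + (1 - alpha) * t1 / n + t0
    return t1 / time_on_n


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the curve.
    for n in (1, 2, 4, 8, 16, 64):
        print(n, round(speedup(n, t1=100.0, alpha=0.25, t0=0.5), 2))
```

With T0 = 0 the result can never exceed 1/alpha, no matter how many machines you add, and any nonzero T0 pushes it lower still.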
This could really spur the development of OpenOffice.
With 50, 100 machines or so hooked up, OpenOffice's compile time could be reduced to as little as 1 or 2 days!!!
You can almost never achieve a speedup of N. You can achieve S(N) = T(1) / (T(1)*alpha + ((1-alpha)*T(1))/N + T0), where
T(1) is the time it takes to run the task with 1 computer, alpha is the part of the task that cannot be parallelised (as in startup, registers etc.) and T0 is the communications overhead of the task.
This is the textbook case: Amdahl's law, IIRC.
In reality, and also in most textbooks, there are exceptions where the solution scales with the number of processes.
And it should be easy enough to see: 5 machines compiling one source file each are 5 times as fast as one machine compiling 5 source files.
As long as you start gcc 5 times in a row, you pay the same initialization overhead for EACH instance of gcc, one after the other.
If you manage to start gcc with a couple of source files as arguments, you at least save the loading time of the binary. That would correspond roughly to the alpha value.
Amdahl's law is useful for a single program/problem: try to parallelize gcc itself and you will find that compiling a single source file can't be sped up very much. So 5 processors running several threads of one gcc instance do not scale by 5.
However it says nothing about just solving the same problem multiple times in parallel.
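A back-of-the-envelope model of that argument, as a Python sketch. The function names and the timings (0.5 s to start gcc, 2.0 s per file) are invented for illustration, and the model deliberately ignores the communications overhead T0, so it shows the best case only.

```python
import math


def one_gcc_per_file(files, machines, startup, compile_time):
    """Wall-clock time when every file gets its own gcc invocation.

    Files are spread evenly over the machines; each invocation pays
    the startup cost again. Network overhead is ignored (assumption).
    """
    files_per_machine = math.ceil(files / machines)
    return files_per_machine * (startup + compile_time)


def one_gcc_for_all(files, startup, compile_time):
    """Wall-clock time on one machine when a single gcc invocation
    is handed all the files, so the startup cost is paid once."""
    return startup + files * compile_time


if __name__ == "__main__":
    serial = one_gcc_per_file(5, 1, startup=0.5, compile_time=2.0)
    spread = one_gcc_per_file(5, 5, startup=0.5, compile_time=2.0)
    batched = one_gcc_for_all(5, startup=0.5, compile_time=2.0)
    print("speedup across 5 machines:", serial / spread)
    print("saved by batching on one machine:", serial - batched)
```

In this model the five independent compiles scale by exactly 5, and the only thing batching saves is the repeated startup cost.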
Regards,
angel'o'sphere
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Is this better than say, Group Compiler?
From the FAQ:
distcc doesn't care. However, in some circumstances, particularly for C++, gcc object files compiled with one version of gcc are not compatible with those compiled by another. This is true even if they are built on the same machine.
It is usually best to make sure that every compiler name maps to a reasonably similar version on every machine. You can either make sure that gcc is the same everywhere, or use a version-qualified compiler name, such as gcc-3.2 or gcc-3.2-x86-linux.
So in other words, keep them close, especially for gcc versions that break backwards compatibility.
You think that I'm crazy, you should see this guy!
I think the biggest plus is that you can have one hella fast machine on your network running distcc that basically does all your compiling for all your other machines. I can see this being a big bonus for server farms like rackspace.com. The customers would be getting compile speeds from a big ass server, rather than just their little dinky Duron.
~LoudMusic
No sig for you. YOU GET NO SIG!
From the ccache homepage, which is also a Samba-hosted project:
ccache is a compiler cache.
It acts as a caching pre-processor to C/C++ compilers, using the -E compiler switch and a hash to detect when a compilation can be satisfied from cache.
This often results in a 5 to 10 times speedup in common compilations.
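The general idea (hash the preprocessed source, reuse the stored result on a match) can be sketched in a few lines of Python. This is only an illustration of the technique, not ccache's actual code: the class `CompileCache`, the stand-in `fake_compile` function, the in-memory dict, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class CompileCache:
    """Toy compiler cache keyed on a hash of the compiler arguments
    plus the preprocessed source (what `-E` would produce)."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # hypothetical "real compiler" callback
        self._store = {}             # hash -> object code (in memory only)
        self.hits = 0
        self.misses = 0

    def compile(self, preprocessed_source, args):
        digest = hashlib.sha256()
        digest.update("\0".join(args).encode())
        digest.update(b"\0")
        digest.update(preprocessed_source.encode())
        key = digest.hexdigest()

        if key in self._store:
            self.hits += 1           # same input seen before: skip compiling
            return self._store[key]

        self.misses += 1
        result = self._compile(preprocessed_source, args)
        self._store[key] = result
        return result


def fake_compile(source, args):
    # Stand-in for an actual compiler run; returns fake "object code".
    return ("OBJ:" + source).encode()


cache = CompileCache(fake_compile)
cache.compile("int main(){return 0;}", ["-O2"])   # miss
cache.compile("int main(){return 0;}", ["-O2"])   # hit, identical input
cache.compile("int main(){return 0;}", ["-O0"])   # miss, different flags
print(cache.hits, cache.misses)
```

Hashing the preprocessed text rather than the raw file means a change in any included header changes the key, so stale results are not reused.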
Looks like this one is not necessarily a good idea to run on a university workstation cluster...
1.4 Security Considerations
distcc should only be used on networks where all machines and all users are trusted.
The distcc daemon, distccd, allows other machines on the network to run arbitrary commands on the volunteer machine. Anyone that can make a connection to the volunteer machine can run essentially any command as the user running distccd.
distcc is suitable for use on a small to medium network of friendly developers. It's certainly not suitable for use on a machine connected to the Internet or a large (e.g. university campus) network without firewalling in place.
inetd or tcpwrappers can be used to impose access control rules, but this should be done with an eye to the possibility of address spoofing.
In summary, the security level is similar to that of old-style network protocols like X11-over-TCP, NFS or RSH.
time is a funny concept
I'd love a speedup, but the time I'd save compiling would be wasted on having to fully install another Linux box. Being able to boot a CD with a live Linux distro and this software, and then connect to these slave machines to help compile would be immensely helpful. My Linux box is a Cyrix 200MHz PC. Being able to stick a CD into my Athlon 1800 to help the compile would be fabulous.
I've had bad experiences with this in a Condor cluster of Linux machines which had different versions of glibc. Seemingly at random, my jobs would blow up into the netherworld without running and without an error message. Until the administrators matched all of the glibc versions (but not the Linux distributions, for some reason), I had to compile everything with -static on one machine and pray.
I wonder how much of a problem network bandwidth is in this system. With Condor, moving large datasets between machines is a problem. Object files can be pretty big and if you have a lot of them, you might risk pushing the compile bottleneck to the network. Even worse might be the link step, where all of the objects have to be visible to one machine (gcc doesn't have incremental linking yet, does it?).
/ \
\ / ASCII ribbon campaign for peace
x
/ \