Slashdot Mirror


Open Source Distributed Shell Tools?

ColonelForbin74 asks: "While some may assume that most larger server clusters run advanced or custom software (e.g. Beowulf, cfengine, OSCAR), many of those stuck in the not-research-this-site-runs-production world know this simply isn't the case. Many people like myself are working with medium-to-large scale clusters with little help other than shell for() loops and some SSH trusted keys. What application-level tools are out there that might help SysAdmin / AppSupport types like myself run commands across a given cluster, push files out, etc.? In my desperation to have some sort of tool in my toolbox, I've actually created one. However, I have a hard time believing this is the best thing out there, and would appreciate all the ideas and links I can get!"

1 of 31 comments

  1. We created a "load balancing" parallel SSH tool. by kcurrie · · Score: 4, Interesting

    Where I work (a LARGE networking company that makes all kinds of networking hardware) a co-worker and I created multiple parallel SSH tools which enable you to run hundreds to thousands of concurrent outgoing sessions, depending on hardware. We have not yet had the cycles to look into open sourcing it, but hope to.

    I can share the basics of it here though, which should enable somebody else to easily build their own. On a day-to-day basis we needed to be able to run commands on 10,000+ Solaris and Linux boxes, and wanted to use SSH key authentication, but not keys with a null passphrase (if the private key were stolen, major security implications would present themselves :-) ). The only way to do this (other than having some expect-style program type in the passphrase for you) is to use the ssh-agent. The problem with the ssh-agent is that it simply does not have the ability to authenticate more than roughly 20 ssh sessions at once (depending on machine load, etc.). What happens when too many ssh sessions attempt to authenticate against the ssh-agent is that you get many authentication failures due to timeouts. There are some hacks you can do to the ssh source code that will increase the number of times ssh will attempt to contact the agent, as well as the delay between attempts. We've done these hacks, but they still were nowhere near enough.
    The solution instead is to use MULTIPLE ssh-agents, and load balance between them. We wrote a tool that will prompt for our key passphrase and then load say 100 ssh-agents with that key loaded. When it starts the agents it records the variables SSH_AUTH_SOCK and SSH_AGENT_PID for each agent in a single file. We then have shell scripts wrapped around ssh commands that just randomly pick an agent to connect to, effectively load balancing.
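    A minimal sketch of the multi-agent idea described above, written in Python rather than taken from the poster's unpublished tool. It starts several agents, records each agent's SSH_AUTH_SOCK and SSH_AGENT_PID in one file, and picks an agent at random for each ssh call. The function names, the tab-separated registry format, and the once-per-agent ssh-add call are assumptions for illustration; the post does not say how the original tool loaded the key into every agent after a single prompt.

```python
import os
import random
import re
import subprocess

# Matches the Bourne-style lines printed by `ssh-agent -s`, for example:
#   SSH_AUTH_SOCK=/tmp/ssh-XXXX/agent.123; export SSH_AUTH_SOCK;
_VAR_RE = re.compile(r"^(SSH_AUTH_SOCK|SSH_AGENT_PID)=([^;]+);", re.MULTILINE)


def parse_agent_output(text):
    """Extract SSH_AUTH_SOCK and SSH_AGENT_PID from `ssh-agent -s` output."""
    found = dict(_VAR_RE.findall(text))
    if set(found) != {"SSH_AUTH_SOCK", "SSH_AGENT_PID"}:
        raise ValueError("unexpected ssh-agent output")
    return found


def start_agents(count, key_path, registry_path):
    """Start `count` agents, load the key into each, record them in one file."""
    with open(registry_path, "w") as registry:
        for _ in range(count):
            out = subprocess.run(
                ["ssh-agent", "-s"], check=True, capture_output=True, text=True
            ).stdout
            agent = parse_agent_output(out)
            # Assumption: ssh-add asks for the passphrase once per agent here.
            # The tool in the post prompted only once; how is not described.
            subprocess.run(
                ["ssh-add", key_path], check=True, env={**os.environ, **agent}
            )
            registry.write(f"{agent['SSH_AUTH_SOCK']}\t{agent['SSH_AGENT_PID']}\n")


def load_agents(registry_path):
    """Read the registry file back into a list of environment dicts."""
    agents = []
    with open(registry_path) as registry:
        for line in registry:
            line = line.rstrip("\n")
            if line:
                sock, pid = line.split("\t")
                agents.append({"SSH_AUTH_SOCK": sock, "SSH_AGENT_PID": pid})
    return agents


def pick_agent_env(agents, base_env=None, rng=random):
    """Return an environment that points ssh at one randomly chosen agent."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(rng.choice(agents))
    return env
```

    A wrapper could then call something like subprocess.run(["ssh", host, "uptime"], env=pick_agent_env(load_agents(path))), so that each session authenticates against a randomly chosen agent.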
    We run this whole thing on an OpenMosix cluster, which allows the ssh-agents and ssh processes to migrate across the machines once they start to use too much CPU time on their current node. We've found that Linux boxes seem to be much faster for SSH operations than Solaris (sparc) boxes, BTW.
    We have also written a parallel ssh tool that works similarly to others discussed here (and others NOT discussed here, like Ed Hill's clsh, which in a previous life I used extensively), except our tool has a couple of other major features which (IMHO) are required in an enterprise environment. The biggest thing that we've found is that when working on boxes in the far reaches of the world, we cannot assume that any common group of NFS mounts will exist, or work properly when we need them to. If you cannot be sure what remote mounts are available, how can you run scripts on the remote box? This prompted us to make our program able both to run Perl code fed directly to it and to (basically) deliver scripts remotely for running and delete them afterwards. So if we've written an administrative script called foo.sh, our tool will basically pipe the script across an SSH session to the remote end and run it, usually never having to touch the remote disk at all. This is VERY useful because when talking about 10k+ boxes, many of which are desktops, you can never be sure which partitions will be full.
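    The "pipe the script across the session" approach can be sketched as follows. This is an illustration under assumptions, not the poster's tool: the helper names are hypothetical, and it relies on the remote interpreter reading its program from standard input (sh -s for shell, perl - for Perl), so nothing is written to the remote disk.

```python
import subprocess


def remote_script_command(host, interpreter_argv=("sh", "-s")):
    """Build an ssh argv whose remote command reads its program from stdin.

    `sh -s` reads shell commands from stdin; ("perl", "-") does the same
    for Perl code. BatchMode=yes makes ssh fail rather than prompt.
    """
    return ["ssh", "-o", "BatchMode=yes", host, *interpreter_argv]


def run_piped(argv, script_bytes, timeout=60):
    """Feed the script to the command's stdin and capture its output."""
    return subprocess.run(
        argv, input=script_bytes, capture_output=True, timeout=timeout
    )


def run_local_script_remotely(host, script_path, interpreter_argv=("sh", "-s")):
    """Read a local script such as foo.sh and execute it on `host` over ssh."""
    with open(script_path, "rb") as script:
        body = script.read()
    return run_piped(remote_script_command(host, interpreter_argv), body)
```

    For example, run_local_script_remotely("somehost", "foo.sh") would be roughly equivalent to running ssh somehost sh -s < foo.sh from a shell; "somehost" is a placeholder.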

    Using our parallel ssh tool, along with the ssh-agent load balancing and a 3 node OpenMosix cluster we've been able to run 1000 outgoing ssh sessions without issues. This means if you want to change root passwords on 10k boxes it only takes slightly longer than changing passwords on 10 boxes. A real time saver, to say the least :-)
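    For readers who want to experiment with the fan-out itself, here is a small sketch of running one command on many hosts with a cap on concurrent sessions. It uses a Python thread pool on a single machine, which is an assumption for illustration and not how the tool above (spread across an OpenMosix cluster) was built; run_on_hosts, ssh_run, and max_sessions are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def ssh_run(host, command, timeout=30):
    """Run one command on one host; return (exit_code, stdout, stderr)."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return (None, "", "timed out")
    return (proc.returncode, proc.stdout, proc.stderr)


def run_on_hosts(hosts, command, max_sessions=100, run_one=ssh_run):
    """Run `command` on every host, with at most `max_sessions` in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        futures = {pool.submit(run_one, host, command): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:
                # One unreachable or misbehaving box must not abort the run.
                results[host] = (None, "", str(exc))
    return results
```

    With a cap of 1000 sessions, 10k hosts are worked through in roughly ten waves, which is where the "only slightly longer than 10 boxes" effect comes from.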

    Comments anyone?

    BTW, is anybody using any hacks of OpenSSH to work similarly to sudo for giving out root access?

    --
    -- I speak only for myself.