Ask Slashdot: Scientific Computing Workflow For the Cloud?
diab0lic writes "I have recently come into the situation where I need to run cloud computing on demand for my research. Amazon's EC2 Spot Instances are an ideal platform for this as I can requisition an appropriate instance for the given experiment {high cpu, high memory, GPU instance} depending on its needs. However, I currently spin up the instance manually, set it up, run the experiment, and then terminate it manually. Monitoring experiments for completion gets tedious, and I incur unnecessary costs if a job finishes while I'm sleeping, for example. The whole thing really should be automated.
I'm looking for a workflow somewhat similar to this:
- Manually create Amazon machine image (AMI) for experiment.
- Issue command to start AMI on specified spot instance type.
- Automatically connect EBS to instance for result storage.
- Automatically run specified experiment, bonus if this can be parameterized.
- Have AMI automatically terminate itself upon experiment completion.
Something like docker that spun up on-demand spot instances of a specified type for each run and terminated said instance at run completion would be absolutely perfect. I also know HTCondor can back onto EC2 spot instances but I haven't really been able to find any concise information on how to set up a personal cloud — I also think this is slight overkill. Do any other Slashdot users have similar problems? How did you solve it? What is your workflow? Thanks!"
EC2 is inherently scriptable. There's nothing stopping you from using the command-line tools to fire up an instance, and let it run, and store its results to S3, and then decommission the instance. You can even set the instances to terminate on shutdown, which deletes the instance EBS stores (if you're using EBS) and deletes the instance. Sounds like you just need to spend 30 minutes reading the docs.
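A minimal sketch of that kind of scripted launch in Python, assuming boto3 (the current AWS SDK for Python, which is my substitution and not something the comment names). The helper names, the AMI ID, and the instance type are hypothetical placeholders. The parameters ask for an instance that terminates, rather than stops, when the OS shuts down; the spot option is included as an assumption about what the original poster wants.

```python
def build_launch_params(ami_id, instance_type, user_data=None, spot=False):
    """Build keyword arguments for boto3's EC2 client run_instances call."""
    params = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # A shutdown issued from inside the instance terminates it.
        "InstanceInitiatedShutdownBehavior": "terminate",
    }
    if user_data is not None:
        # Startup script handed to the instance at launch.
        params["UserData"] = user_data
    if spot:
        # Assumption: a one-time spot request; check the AWS docs for
        # how spot options interact with the other parameters.
        params["InstanceMarketOptions"] = {
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        }
    return params


def launch(params, region="us-east-1"):
    """Launch one instance; needs boto3 installed and AWS credentials set up."""
    import boto3  # imported lazily so build_launch_params works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]
```

Calling `launch(build_launch_params("ami-...", "c3.large", spot=True))` would then return the new instance ID; the region default is also just a placeholder.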
You mean a computer? A server farm? A beowulf cluster?
To me, 'personal cloud' is a totally meaningless term and doesn't correspond to what the cloud is. If it's a couple of servers you own and control, to me that doesn't sound like 'cloud computing' -- it sounds like a marketing term.
A bunch of papers on this were presented at SC13 this year. I suggest you look them up.
http://sc13.supercomputing.org/content/papers
Does exactly what you need and is designed explicitly for integration with third party tools. Spins up everything from disks to automating webforms and jobs and imports and exports of jobs. There really isn't anything else out there that comes close to what Workflow will do. Used to be called Altiris Workflow. Works with everything from CMDB, change management, service desk to multiple languages.
http://www.symantec.com/connect/articles/learn-about-symantec-workflow
Because your workflow is likely to be customized to your tasks, it should be straightforward to write these kinds of tools yourself, with any number of available toolkits, based on what language you're most comfortable using.
There's the straight CLI: http://aws.amazon.com/cli/
And lots of sample code for the various SDKs: http://aws.amazon.com/code
Best to just dive in. If you have any development experience at all, even just scripting, you should be able to figure it out pretty quickly.
Since my scientific workflow always includes Python it is natural for me to use boto.
https://github.com/boto/boto
http://boto.readthedocs.org/en/latest/
http://aws.amazon.com/sdkforpython/
You could use GlideinWMS, which was made to manage a pool of dynamic grid resources for scientific computing, such as the Open Science Grid. It can also manage personal Condor pools. I believe it can connect to Amazon EC2 as well, but I don't see a lot of information on their web page about that. You may have to contact them for more information, but I know that the team is very responsive and interested in finding more scientific users. You can find more information here: http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html
To the OP: Please refer to the provided documentation or use a search engine to find tutorials, if you dare. There is an official API for this. We won't recite manuals here.
To the /. community: Why is a question that can be answered with an "RTFM" landing on the front page?
Jenkins would probably be useful in this case, with this plugin:
https://wiki.jenkins-ci.org/display/JENKINS/Amazon+EC2+Plugin
You can create your own personal cloud, call it a private cloud, and then automate all your tasks. I have been doing the same: I used fabric (for automation) and boto (euca2ools) for controlling the cloud (creating instances, volumes, etc.). Eucalyptus helps you create your own private cloud, so you get your own IaaS implementation easily. OpenStack has a growing following, and you may prefer to adopt it rather than Eucalyptus. There are lots of other tools available, however.
I agree. This problem is easily scriptable using Python, so I'm honestly surprised a legitimate researcher is asking Slashdot instead of jumping into writing a Python script.
Amazon's CloudFormation (http://aws.amazon.com/cloudformation/) can get you 95% of the way there (add a few small scripts via Boto, or some integration with custom resources: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cfn-customresource.html).
A little elbow grease will get you the rest of the way without additional costs.
Every time I've looked into scripting my manual tasks with AWS, I've found their documentation to be overwhelming and not concise or clear.
Have you tried Google or the AWS documentation? What you are asking for is the bare-bones, most basic use case. They even have services set up to make this kind of thing easier, like the Simple Workflow Service, Messaging Service and Simple Queue Service.
high-level introduction to workflow service:
http://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html
recipes using workflow service:
http://aws.amazon.com/code/2535278400103493
Here's how I ran my PhD simulations on EC2:
- The AMI downloads a manifest file at startup.
- The manifest has one record per line, two fields per record: the s3 URL of a .tar.gz to download, and the path to download it
- The AMI then runs a shell script (/etc/run.sh) that's been put there by a manifest entry
Shell scripts upload new files to s3 (e.g., /etc/run.sh) and have ec2 run new VMs. When the VMs are loaded, they're running everything I need, ready to go.
Other shell scripts stopped/started experiments on these VMs.
Other shell scripts shut down the VMs when I'm done.
The scripts did little more than scan the appropriate machine list from the ec2 tools and ssh into them with a specific command.
At the end, I had some of the experiment-specific scripts use git clone/pull to fetch files I was changing frequently per experiment.
All of it worked really well for me. Nothing fancier than the ec2 command-line tools, bash, ssh, & git necessary.
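The manifest step described above could look roughly like this in Python (the commenter used bash, so this is only an illustration). The whitespace-separated record format, the skipping of `#` comment lines, the `aws s3 cp` and `tar` commands, and the function names are all assumptions of mine; only the "S3 URL plus destination path per line, then run /etc/run.sh" idea comes from the comment.

```python
import posixpath


def parse_manifest(text):
    """Parse records of the form '<s3-url> <destination-path>', one per line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are ignored
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(fields)}"
            )
        records.append((fields[0], fields[1]))
    return records


def plan_commands(records, scratch_dir="/tmp"):
    """Return the argv lists a boot script would run, in order."""
    commands = []
    for url, destination in records:
        archive = posixpath.join(scratch_dir, posixpath.basename(url))
        # Fetch the tarball, then unpack it at its destination path.
        commands.append(["aws", "s3", "cp", url, archive])
        commands.append(["tar", "-xzf", archive, "-C", destination])
    # Finally hand over to the script that a manifest entry put in place.
    commands.append(["/bin/sh", "/etc/run.sh"])
    return commands
```

Each planned command can be passed to `subprocess.run(cmd, check=True)` on the instance; keeping the planning separate from the execution makes the boot logic easy to test on a laptop.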
I have used MIT's starcluster in the past for something very similar to this workflow. It provides a very user-friendly interface to EC2 spot instances for almost the exact workflow you're looking for. They provide AMIs you can customize and a relatively well-documented set of commands to easily launch spot instances.
Docker looks promising, but there are other existing services stacked on EC2 that address the needs of science workloads. PiCloud does exactly the things you're asking for: http://www.picloud.com/platform/ . And the folks at Cycle Computing use Condor to manage the largest jobs ever run on EC2: http://www.cyclecomputing.com/ . I'm still working on my own stuff based on Groovy and Condor which I call Gondor, but it isn't at all ready for others to use. One thing I have found to be great is that there is a MacPorts portfile for Condor which works dandy. Just "sudo port install htcondor && sudo port load htcondor". http://research.cs.wisc.edu/htcondor/HTCondorWeek2013/presentations/SingerL_MacPorts.pdf . I don't yet see a nice single workflow that gets us to an integrated reproducible published result at the other end like Elsevier's Executable Paper http://www.elsevier.com/physical-sciences/computer-science/executable-papers, but I think we'll be there soon.
We have a product in development that does just this - it can spin up spot nodes with the best price/performance ratio, dispatch tasks and restart them if a spot node fails. With lots of other goodies.
Drop me a note if you're interested: alex.besogonov@gmail.com
Putting aside my slashvert suspicions of the post (hard to see how you could have chosen AWS at all and be so clueless),
I've done this kind of thing a lot. Here's my approach
1. Fire up an EBS backed AMI from an existing stock version of your favorite OS (ubuntu 12.04 for me just cos i use it on desktop and can't be bothered with differences)
2. customize it with your own shit
3. include in the /etc/rc.local a script to customize things further.. and because you don't want to faff about changing the AMI every time you change shit, have the startup script pull the latest stuff you need straight out of your source code repository and then run further initialisation stuff
4. make an image from that instance (easily done from AWS control panel)
5. learn how to use boto (python AWS api) to fire up instances, attach storage, shutdown instances etc. Using the command line tools is fine for the simplest stuff but as soon as stuff gets a little harder you really want to use a programming language (so unless you're extremely fond of java, python is the best fit for this)
The boto documentation is kinda shit, so every time you need to do something just google for an example doing something similar .. the official api documentation is last resort reference only.
http://rareformnewmedia.com/
Virtualbox will not in any way help me. I don't own, and don't want to purchase or manage the hardware myself -- time tends to be short for researchers and an automated, easy, pay per use solution is very ideal.
This is more or less exactly the problem, their spot instances for science page is a friggin joke.[0] Their API seems reasonable for spinning up instances and I am now looking at writing some scripts to do this, however their docs avoid ever telling you that you can run scripts in the "user data" field when starting an instance... kind of a major hurdle that the command line tools don't make clear. I've actually got something going now with the CLI tools + docker that makes getting an environment running pretty simple. I'm going to formalize it and post it online in the near future. [0] http://aws.amazon.com/ec2/spot-and-science/
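For anyone else who missed the "user data" trick: on AMIs that run cloud-init (Amazon Linux and Ubuntu images, for example), user data that starts with `#!` is executed as a script on first boot. Below is a rough Python sketch of generating such a script for a parameterized experiment. The function name and the example command are hypothetical, and the `trap ... EXIT` line is my own choice for making the machine power off whether the experiment succeeds or fails.

```python
import shlex


def build_user_data(argv):
    """Return a first-boot shell script that runs argv, then powers off."""
    command = shlex.join(argv)  # quotes arguments safely (Python 3.8+)
    lines = [
        "#!/bin/bash",
        # Shut down when the script exits, even if the experiment fails.
        "trap 'shutdown -h now' EXIT",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical experiment command with one parameter.
    print(build_user_data(["python", "run.py", "--seed", "7"]))
```

If the instance was launched with its shutdown behavior set to terminate, the final `shutdown` is what stops the billing; otherwise the instance merely stops and any attached EBS volumes keep costing money.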
Thanks for the links, aminator looks to be perfect for easily crafting job specific environments -- I'll probably include this in whatever solution I come up with. Asgard on the other hand, and correct me if I'm wrong, looks to be much more oriented to those who have a lot of things running for an indefinite time frame in the cloud. Thanks!
http://star.mit.edu/cluster/
The rest of it is easily scriptable. I have some EBS-based AMIs that, on bootup, connect to a central server and register themselves (tick up a text file, and add themselves to /etc/hosts).
If you combine starcluster for generic cluster management with the existing Amazon-provided tools (http://blog.roozbehk.com/post/35277172460/installing-amazon-ec2-tools), this is really only a day's worth of scripting and testing.
There are also several public AMIs on EC2 that are oriented towards scientific computing.
http://www.google.com/search?q=ec2%20ami%20scientific
This is my day job stuff.
Check out Cycle Computing's CycleCloud product: http://www.cyclecomputing.com/wiki/index.php?title=CycleCloud They offer meta-scheduling products specifically for managing HTCondor pools in AWS. The Cycle team works closely with the HTCondor team and supports loads of scientific projects. Their products have historically been free for academic use.
As others have pointed out, deploying EC2 instances automatically is fairly easy using the well-documented EC2 APIs.
The difficult part about distributed computing is synchronizing the work between available instances. For this, you might want to look at RabbitMQ or other queueing servers. One way to do this would be to have one thread (on your computer) generating problem instances, while you spawn spot instances on EC2 as desired, which consume the work and report the results. I suspect you could accomplish something similar using Hadoop/MapReduce.
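To make the producer/consumer idea concrete, here is a small local sketch in Python. It uses the standard library's `queue.Queue` and threads purely as a stand-in for RabbitMQ and the spot instances, so the names and structure are illustrative assumptions, not a RabbitMQ client.

```python
import queue
import threading


def run_workers(tasks, work_fn, n_workers=4):
    """Feed tasks through a shared queue to n_workers consumer threads."""
    task_queue = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more work for this worker
                break
            outcome = work_fn(task)
            with results_lock:
                results[task] = outcome  # tasks must be hashable

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:  # the "producer" side
        task_queue.put(task)
    for _ in threads:  # one sentinel per worker
        task_queue.put(None)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_workers(range(5), lambda x: x * x, n_workers=3))
```

With a real message broker the workers would be separate machines pulling from a network queue, and a worker that disappears mid-task (as spot instances can) would need its task re-queued, which this sketch does not handle.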
If you're willing to look beyond AWS, there's something called Manta out there (http://www.joyent.com/products/manta). The data rests on some servers, and you submit UNIX map/reduce jobs. The jobs are run on the nodes where the data is resting, you get a full UNIX environment, and you only get charged as you'd expect (compute time, combined with the cheaper at-rest time). It might be a better fit for what you're doing than your proposal, plus it'll likely be faster too due to reduced data movement.
Look up cloudify on cloudifysource.org.
It enables spinning up machines on the cloud of your choice (including EC2). Then it installs and configures your software on those VMs. Finally it monitors all processes that you request it to monitor, including listening to exposed custom metrics, e.g. over a jmx port.
In your case, when your experiment ends, if your software exposes some api or metric that can indicate that, cloudify can take that as a trigger for shutting down or spinning up the next experiment.
A nice bonus is that it can elastically scale your VMs in and out to handle varying loads and automatically restart problematic VMs or processes.
I strongly recommend this command line tool. With this, you can do all those operations and more, and in a sensible and uncluttered fashion:
http://www.timkay.com/aws/