Ask Slashdot: Best Language To Learn For Scientific Computing?
New submitter longhunt writes "I just started my second year of grad school and I am working on a project that involves a computationally intensive data mining problem. I initially coded all of my routines in VBA because it 'was there'. They work, but run way too slow. I need to port to a faster language. I have acquired an older Xeon-based server and would like to be able to make use of all four CPU cores. I can load it with either Windows (XP) or Linux and am relatively comfortable with both. I did a fair amount of C and Octave programming as an undergrad. I also messed around with Fortran77 and several flavors of BASIC. Unfortunately, I haven't done ANY programming in about 12 years, so it would almost be like starting from scratch. I need a language I can pick up in a few weeks so I can get back to my research. I am not a CS major, so I care more about the answer than the code itself. What language suggestions or tips can you give me?"
I have a friend who works for a company that does gene sequencing and other genetic research and, from what he's told me, the whole industry uses mostly python. You probably don't have the hardware resources that they do, but I'd bet you also don't have data sets that are nearly as large as theirs are.
You might also get better results from something less general purpose like Julia, which is designed for number crunching.
"Don't blame me, I voted for Kodos!"
Obviously.
Depending on your needs, R may be your best bet if it is statistical processing you are interested in.
Some people die at 25 and aren't buried until 75. -Benjamin Franklin
What do you mean by scientific computing?
Modelling: Hard core finite element simulations or the like. Then C or Fortran and you will be linking with the math libraries.
Log Processing: A lot of other work is parsing data logs and doing statistics. So Perl or Python, then Octave.
Data Mining: Python or other SQL front end.
Install these 2 and you'll be good to go
http://ipython.org/notebook.html
http://pandas.pydata.org/
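For a rough idea of what pandas gives you, here is a minimal sketch. The column names and the tiny inline data set are made up for illustration; real work would point pandas.read_csv at your own file instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical data: a stand-in for a CSV file exported from a spreadsheet.
csv_text = """sample,group,value
a,control,1.0
b,control,3.0
c,treated,4.0
d,treated,8.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Mean of "value" within each group, computed without an explicit loop.
group_means = df.groupby("group")["value"].mean()
print(group_means)
```

Typed into an IPython notebook cell, the same lines let you inspect df and group_means interactively as you go.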
You should all be sharing your code to avoid rewriting it and to perfect it.
And if you are not a member of a team then I seriously question the quality of your graduate program.
What language suggestions or tips can you give me?"
Timothy, shame on you. You should know better than to start a holy war.
If you can find anything that resembles a math library with the correct tools then go with Python. NumPy is everyone's friend here.
If you have to do the whole thing from scratch then Fortran is the fastest platform. I can't say I've met anyone who enjoyed Fortran but it's wicked fast.
TCAP-Abort
For numeric-intensive work, I can get within 20% of the speed of C++ using the usual techniques -- minimize garbage collection by allocating variables once, use the "server" VM, perform "warmup" iterations in benchmark code to stabilize the JIT. I use the Eclipse IDE, copy and paste numeric results from the Console View into a spreadsheet program, and voila, instant journal article tables.
Most of the cutting edge data mining I've seen is done using R (which acts as a scripting wrapper for the C or Fortran code that the fast analysis libraries are coded in), or alternatively in Python. Some people swear by MATLAB if they have trained in it (so your Octave would come in handy there). Have a look at some discussions at places like kaggle.com to see what the competitive machine learning community uses (if that is what you mean by data mining).
Korma: Good
A lot of people will propose a language because it is their favorite. Others because they believe it is very easy to learn. I will give you a third line of thought.
I would not look for a language in this case, I would look for a library, then teach myself whatever language is easiest/quickest to access it. I would try to profile what you are building, figure out where the bottlenecks are likely to be (profiling your existing mockup can help here, but don't trust it entirely) and try to find the best stable, well-designed, high-performance library for that particular type of code.
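To make the profiling step concrete, here is a minimal sketch using Python's built-in cProfile and pstats modules. The three functions are hypothetical stand-ins for a real analysis, and the workload is deliberately naive so that one function dominates the report.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately naive hot spot (hypothetical workload).
    total = 0
    for i in range(n):
        total += i * i
    return total


def load_data():
    # Cheap step, included so the profile has something to compare against.
    return list(range(1000))


def run_analysis():
    data = load_data()
    return slow_sum_of_squares(200_000) + len(data)


profiler = cProfile.Profile()
profiler.enable()
result = run_analysis()
profiler.disable()

# Sort by cumulative time and keep only the top few entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Whichever function sits at the top of that report is the one worth handing to a fast library; the rest can usually stay as plain, readable code.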
Friends don't let friends enable ecmascript.
Clearly you are not involved in serious science.
And if you think FORTRAN is some ancient esoteric language, you're ignorant as well. The most recent standard, ISO/IEC 1539-1:2010, informally known as Fortran 2008, was approved in September 2010.
Fortran is, for better or worse, the only major language out there specifically designed for scientific numerical computing. Its array handling is nice, with succinct array operations on both whole arrays and on slices, comparable with MATLAB or NumPy but super fast. The language is carefully designed to make it very difficult to accidentally write slow code -- pointers, the standard example, are restricted in such a way that it's immediately obvious if there might be aliasing -- and so the optimizer can go to town on your code. Current incarnations have things like Coarray Fortran, and DO CONCURRENT and FORALL built into the language, allowing distributed memory and shared memory parallelism, and vectorization.
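For anyone who hasn't seen whole-array and slice operations, here is a minimal sketch of the style in NumPy, the comparison point mentioned above. The array and the formula are made up for illustration; in Fortran the slice expression would look like y(3:n) - y(1:n-2).

```python
import numpy as np

x = np.arange(10, dtype=float)   # 0.0, 1.0, ..., 9.0

# Whole-array operation: applied element by element, no explicit loop.
y = 2.0 * x + 1.0

# Slice operation: centred difference on the interior points,
# dy[i] = (y[i + 2] - y[i]) / 2 for each valid i.
dy = (y[2:] - y[:-2]) / 2.0

print(y)
print(dy)
```

Since y is a straight line with slope 2 and unit spacing, every entry of dy comes out as 2.0, which makes the slice arithmetic easy to check by eye.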
The downsides of Fortran are mainly the flip side of one of the upsides mentioned; Fortran has a huge long history. Upside: tonnes of great libraries. Downsides: tonnes of historical baggage.
If you have to do a lot of number crunching, Fortran remains one of the top choices, which is why many of the most sophisticated simulation codes run at supercomputing centres around the world are written in it. But of course it would be a terrible, terrible, language to write a web browser in. To each task its tool.
If you want news from today, you have to come back tomorrow.
It takes all the work out of the computations.
Have you fscked your local propeller head today?
Better yet, Fortran + Python.
http://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html#f2py
I used it to wrap some crazy magnetometer processing code written in Fortran into a nice Python program. I ripped out all the I/O from the Fortran code and moved it into the Python layer. It worked great. Fortran is AWESOME at number crunching but SUCKS ASS at IO or well pretty much anything else, hence Python.
-73, de n1ywb
www.n1ywb.com
Since you mention VBA, I suspect that your data is in Excel spreadsheets? If you want to try to speed this up with minimum effort, then consider using Python with Pyvot to access the data, and then numpy/scipy/pandas to do whatever processing you need. This should give you a significant perf boost without the need to significantly rearchitect everything or change your workflow much.
In addition, using Python this way gives you the ability to use IPython to work with your data in interactive mode - it's kinda like a scientific Python REPL, with graphing etc.
If you want an IDE that can connect all these together, try Python Tools for Visual Studio. This will give you a good general IDE experience (editing with code completion, debugging, profiling etc), and also comes with an integrated IPython console. This way you can write your code in the full-fledged code editor, and then quickly send select pieces of it to the REPL for evaluation, to test it as you write it.
(Full disclosure: I am a developer on the PTVS team)
FORTRAN used to be it back in the day, but nowadays MATLAB is what many engineers use for scientific computing. Many of the math libraries are very good in MATLAB and don't require you to be a computer scientist to make them run fast. I used to work with scientists in my old lab to port their MATLAB code to run on HPC clusters, porting it to FORTRAN or C. Often the MATLAB libraries smoked the BLAS/ATLAS packages that you find on Linux/UNIX machines, for instance. The same would hold true for Octave, since it just builds on the standard GNU math packages like BLAS.
I'm a MSEE and I've been working in the digital signal processing realm for the last 10 years since graduating. I should mention that I haven't done a lot of low level hardware work, I haven't programmed actual DSP cards or played with CUDA. I have written software that did real-time signal processing just on a GPU. Everyone in my industry at this point uses C or C++. There is some legacy FORTRAN, and I shudder when I have to read it. Some old types swear by it, but it's fallen out of favor mostly just because it's antiquated and most people know C/C++ and libraries are available for it.
For non-real-time prototypes I'd recommend learning python (scipy, numpy, matplotlib). Perhaps octave and/or Matlab would be useful as well.
At some point you have to decide what your strength will be. I love learning about CS and try to improve my coding skills, but it's just not my strength. I'm hired because of my DSP knowledge, and I need to be able to program well enough to translate algorithms to programs. If you really want to squeeze out performance then you'll probably want to learn CUDA, assembly, AVX/SSE, and DSP specific C programming. But I haven't delved to that level because, honestly, we have a somewhat different set of people at the company that are really good in those realms.
Of course, it would be great if I could know everything. But at the moment it's been good enough to know C/C++ for most of our real time signal processing. If something is taking a really long time, we might look at implementing a vectorized version. I would like to learn CUDA for when I get a platform that has GPUs but part of me wonders if it's worth it. The reason C/C++ has been enough so far is that compilers are getting so good that you really have to know what you're doing in assembly to beat them. Casual assembly knowledge probably won't help. I might be wrong, but I envision that being the case in the not too distant future with GPUs and parallel programming.
Upside: tonnes of great libraries.
Those great libraries are spread across several different "FORTRAN"s. gfortran. gfortran44. Intel's fortran. f77. f90. PGI pgif90. etc. etc etc.
Gfortran is woooonderful. It allows complete programming idiots to write functional code, since the libraries all do wonderful input error checking. Want to extract a substring from the 1 to -1 character location? gfortran will let you do it. Quite happily. Not a whimper.
PGI pgif90 will not. PGI writes compilers that are intended to do things fast. Input error checking takes time. If you want the 1 to -1 substring, your program crashes. PGI assumes you know not to do something that stupid, and it forces you to write code that doesn't take shortcuts.
So, if you get a program from someone else that runs perfectly for them, and you want to use it for serious work and get it done in a reasonable amount of time so you compile it with pgif90, you may find it crashes for no obvious reason. And then you have to debug seriously stupidly written code wondering how it could ever have worked correctly, until you find that it really shouldn't have worked at all. They want to extract every character in an input line up to the '=', and they never check to see if there wasn't an '=' to start with. 'index' returns zero, and they happily try to extract from 1 to index-1. Memcpy loves that.
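The same class of bug exists outside Fortran. As a sketch of a Python analogue: str.find returns -1 when the delimiter is missing, and slicing up to -1 silently drops the last character instead of failing. The helper name key_before_equals is made up for illustration.

```python
def key_before_equals(line):
    """Return the text before the first '=', or None if there is no '='."""
    pos = line.find("=")   # -1 when '=' is absent
    if pos == -1:
        return None        # handle the missing delimiter explicitly
    return line[:pos]


# Unchecked version: quietly gives a wrong answer for a line without '='.
unchecked = "novalue"[: "novalue".find("=")]

print(key_before_equals("alpha=1"))
print(key_before_equals("novalue"))
print(unchecked)
```

The unchecked slice runs without complaint and hands back a truncated string, which is arguably worse than a crash because nothing tells you the input was malformed.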
The other issue is what is an intrinsic function and what isn't. I've been bitten by THAT one, too.
And someone I work with was wondering why code that used to run fine after being compiled with a certain compiler was now segfaulting when compiled with the same compiler, same data. Switching to the Intel compiler fixed it.
Sigh. But yes, FORTRAN is a de-facto standard language for modeling earth sciences, even if nobody can write it properly.
Perl Data Language
The power of Perl + the speed of C
You know C. C is simple, as fast as any alternative, it's straightforward to optimize (aside from pointer abuse), and you always know what the compiler/runtime is doing. And threading libraries like pthreads or CUDA are best served via C/C++. Why use anything else?
Another thought: scientific libraries. If you need external services/algorithms then your chosen language should support the libraries you need. C/C++ are well served by many fast machine learning libs such as FANN, LIBSVM, OpenCV, not to mention CBLAS, LINPACK, etc.