Slashdot Mirror


Python Gets a Big Data Boost From DARPA

itwbennett writes "DARPA (the U.S. Defense Advanced Research Projects Agency) has awarded $3 million to software provider Continuum Analytics to help fund the development of Python's data processing and visualization capabilities for big data jobs. The money will go toward developing new techniques for data analysis and for visually portraying large, multi-dimensional data sets. The work aims to extend beyond the capabilities offered by the NumPy and SciPy Python libraries, which are widely used by programmers for mathematical and scientific calculations, respectively. The work is part of DARPA's XData research program, a four-year, $100 million effort to give the Defense Department and other U.S. government agencies tools to work with large amounts of sensor data and other forms of big data."

30 of 180 comments

  1. I get the impression that by Chrisq · · Score: 5, Interesting

    I get the impression that in the Engineering and Scientific community Python is the new Fortran. I hope so, because it would be "Fortran done right".

    1. Re:I get the impression that by jma05 · · Score: 5, Interesting

      > I might get to learn Python one day but I'm afraid I'd become a so-so programmer in both languages.

      I empathize since I conversely only barely use Ruby. Once someone learns one of these languages, there is not that much that the other offers. But happily, one need not learn advanced Python to benefit from these projects.

      > it's a shame that so much effort is being divided between communities

      AFAIK, all scientific funding from the US and Europe is, and always was, directed to Python, not Ruby. So Python is firmly established as a research language and there is not much effort being divided with Ruby (which seems to have a much more spotty and amateur movement in this direction), at least as far as scientific stuff is concerned (Ruby is more popular on the web app side). For me the tension for scientific use is not between Python and Ruby, but between Python and R. The Python community is replicating a lot of R functionality these days, but R still has a clear lead in science libraries. Happily, it is quite easy to call R from Python.

    2. Re:I get the impression that by solidraven · · Score: 5, Informative

      You're dead wrong, nothing quite beats Fortran in speed when it comes to number crunching. If you need to go through hundreds of gigabytes of data and performance is important, there's only one realistic choice: Fortran. Python isn't fit to run on a large cluster to simulate things, too much overhead. And let's not forget what sort of efficiency you can get if you use a good compiler (Intel Composer). You won't find Fortran on the way out over here, it's here to stay!

    3. Re:I get the impression that by ctid · · Score: 2

      Why would Fortran be any faster than any other compiled language?

      --
      Reality is defined by the maddest person in the room
    4. Re:I get the impression that by ssam · · Score: 2

      FORTRAN does arrays in a way that's slightly easier for the compiler to optimise. But some modern techniques and data structures are much harder to do in FORTRAN compared to C++. It is also quite easy to call C, C++ or FORTRAN functions from Python.

      Writing a loop in Python is slow. If you express that loop as a NumPy array operation, you get a substantial way towards C speed. If you use numexpr, you can get something faster than a simple C version.
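To make the loop-versus-array point concrete, here is a minimal sketch, assuming NumPy is installed. The function names and the toy "scale and sum" task are made up for illustration, and the timings it prints will vary by machine, so none are quoted here.

```python
import timeit

import numpy as np


def scaled_sum_loop(values, scale):
    """Pure-Python loop: multiply each element by `scale` and add them up."""
    total = 0.0
    for v in values:
        total += v * scale
    return total


def scaled_sum_numpy(values, scale):
    """Same computation as one array expression; the loop runs in compiled code."""
    return float(np.sum(values * scale))


if __name__ == "__main__":
    data = np.arange(200_000, dtype=np.float64)

    # Both versions should agree up to floating-point rounding.
    assert np.isclose(scaled_sum_loop(data, 2.0), scaled_sum_numpy(data, 2.0))

    t_loop = timeit.timeit(lambda: scaled_sum_loop(data, 2.0), number=3)
    t_numpy = timeit.timeit(lambda: scaled_sum_numpy(data, 2.0), number=3)
    print(f"loop:  {t_loop:.4f} s")
    print(f"numpy: {t_numpy:.4f} s")
```

numexpr is left out of the sketch to keep it to one dependency; it evaluates a whole expression string in one pass, which avoids temporaries such as the intermediate `values * scale` array above.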

      Processing big data is as much about moving the data around, and minimising latency in this movement, as the raw processing speed. So a language that lets you express things efficiently will win in the end.

    5. Re:I get the impression that by Anonymous Coward · · Score: 5, Informative

      Short answer, Fortran has stricter aliasing rules so the compiler has more optimization opportunities. Long answer, see Stack Overflow.

    6. Re:I get the impression that by Anonymous Coward · · Score: 2, Informative

      I guess the problem is that people who speak about Fortran actually think about FORTRAN. The last FORTRAN standard was from 1977, and that shows. After that, there was no new standard and little new development until the Fortran 90 standard (note the different capitalization). Fortran 90 got rid of the old punch card based restrictions by giving it completely new, much more reasonable code parsing rules (it still accepts old form code for backwards compatibility, but you cannot mix both forms in one file because they are too different), gave it a full set of properly nesting flow control statements (actually that was one thing already commonly available as a non-standard extension to FORTRAN), and added very powerful array processing, operator overloading, and modules (and probably a few other things I don't remember right now). Later versions even added object orientation (and probably a whole set of other things; I haven't really followed Fortran development beyond Fortran 90).

    7. Re:I get the impression that by Kwyj1b0 · · Score: 4, Interesting

      Compared to plain old Python, yes. But Cython offers a lot of capabilities that improve speed dramatically - just using a type for your data in Cython gives programs a wonderful boost in speed.

      As someone who uses Matlab for most of my programming, I have come to detest languages that do not force specifying a variable type and/or declaring variables. Matlab offers neither, but it is a standard in some circles.

    8. Re:I get the impression that by LourensV · · Score: 5, Insightful

      You're probably right, but you're also missing the point. Most scientists are not programmers who specialise in numerical methods and software optimisation. Just getting something that does what they want is hard enough for them, which is why they use high-level languages like Matlab and R. If things are too slow, they learn to rewrite their computations in matrix form, so that they get deferred to the built-in linear algebra function libraries (which are written in C or Fortran), which usually gets them to within an order of magnitude of these low-level languages.
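As an illustration of "rewriting in matrix form", here is a small sketch in Python with NumPy (chosen only because it is the subject of the story; the same idea applies in Matlab or R). The function names are hypothetical. The loop version does the arithmetic element by element in the interpreter, while the matrix-form version is a single expression that NumPy passes to its compiled linear algebra backend.

```python
import numpy as np


def matvec_loops(a, x):
    """Matrix-vector product written element by element."""
    n_rows, n_cols = a.shape
    out = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(n_cols):
            out[i] += a[i, j] * x[j]
    return out


def matvec_matrix_form(a, x):
    """The same product in matrix form, evaluated by compiled library code."""
    return a @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((300, 200))
    x = rng.standard_normal(200)

    # The two versions should agree up to floating-point rounding.
    assert np.allclose(matvec_loops(a, x), matvec_matrix_form(a, x))
    print("loop and matrix form agree")
```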

      If that still isn't good enough, they can either 1) choose a smaller data set and limit the scope of their investigations until things fit, 2) buy or rent a (virtual) machine with more CPU and more memory, or 3) hire a programmer to re-implement everything in a low-level language and so that it can run in parallel on a cluster. The third option is rarely chosen, because it's expensive, good programmers are difficult to find, and in the course of research the software will have to be updated often as the research question and hypotheses evolve (scientific programming is like rapid prototyping, not like software engineering), which makes option 3) even more expensive and time-consuming.

      So yes, operational weather forecasts and big well-funded projects that can afford to use it will continue to use Fortran and benefit from faster software. But for run-of-the-mill science, in which the data sets are currently growing rapidly, having a freely available "proper" programming language that is capable of relatively efficiently processing gigabytes of data while being easy enough to learn for an ordinary computer user is a godsend. R and Matlab and clones aren't it, but Python is pretty close, and this new library would be a welcome addition for many people.

    9. Re:I get the impression that by Chrisq · · Score: 3, Informative

      > The entire point of Fortran is that it has difficult-to-deal-with aliasing rules that make the compiler more free to produce optimized code. That's why it is suitable for things that require every last bit of performance you can wring out of it. Today probably you can get the same thing with C or C++ provided you are prepared to use things like restrict, but it used to be you couldn't, so Fortran ruled certain topics.

      > Python is an easy-to-use system with abysmal performance - expect 10-100x slowdown for code that runs in pure Python over a similar C version. If you can get things set up so Python is only gluing other C components together and the data never has to touch native Python data structures or loops, then performance will be fine, but now you aren't really coding in Python any more.

      > The point is, the purpose of Fortran and the purpose of Python are entirely opposed. They are exactly the opposite of each other. So it boggles the mind how you can think that Python can be Fortran "done right". So much so that now I suspect I got trolled. Well done, sir.

      Yes I understand, and many people made the same point. However, Fortran was for a lot of scientists and engineers the hammer to crack any nut. It was used for simple "try outs" where performance wasn't needed, simply because it was the language that engineers knew. I think the same thing is happening with Python now: it is the first and sometimes only language that many engineers know. As for the performance issue, it will not give the best performance, but packages like SciPy and NumPy do give very good performance (arguably by using these libraries you are just using Python to string C functions together, but it is properly integrated). Tests show that you are getting about a third of the performance of Fortran (with the exception of the Fortran DGEMM matrix multiply, which greatly outperforms Python and other Fortran variants). The typical engineering reaction to performance needs is to throw hardware at the problem, then optimise your algorithm, and only change language if absolutely necessary!

    10. Re:I get the impression that by nadaou · · Score: 4, Insightful

      > You're probably right, but you're also missing the point. Most scientists are not programmers who specialise in numerical methods and software optimisation.

      Which is exactly why FORTRAN is an excellent choice for them instead of something else fast (close to assembler) like C/C++, and why so many of the top fluid dynamics models continue to use it. It is simple (perhaps a function of its age) and because of that it is simple to do things like break up the calculation for MPI or tell the compiler to "vectorize this" or "automatically make it multi-threaded" in a way which is still a long way from maturity for other languages.

      Can you guess which language MATLAB was originally written in? You know that funny row,column order on indexes? Any ideas on the history of that?

      R is great and all, and is brilliant in its niche, but how's that RAM limitation thing going? It's not a solution for everything.

      MATLAB is pretty good too, as are Octave and Scilab, and it has gotten a whole lot faster recently, but ever try much disk I/O or array resizing for something which couldn't be vectorized? It becomes slow as molasses.

      > If that still isn't good enough, they can either 1) choose a smaller data set and limit the scope of their investigations until things fit,

      heh. I don't think you know these people.

      > 2) buy or rent a (virtual) machine with more CPU and more memory,

      Many problems are I/O limited and require real machines with high speed low latency network traffic. VMs just don't cut it for many parallelized tasks which need to pass messages quickly.

      Forgive me if I'm wrong, but your post sounds a bit like you think you're pretty good on the old computers, but don't know the first thing about FORTRAN and are feeling a bit defensive about that, and attacking something out of ignorance.

      --
      ~.~
      I'm a peripheral visionary.
    11. Re:I get the impression that by lattyware · · Score: 3, Interesting

      The GIL is an overblown issue. Threading is designed to get around issues with accessing slow resources, not for serious parallel computing. Just use multiprocessing if you want to do lots of computing in parallel, problem solved.
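A minimal sketch of the "just use multiprocessing" suggestion, using only the standard library. The prime-counting task and the function names are invented for illustration. Each worker is a separate process with its own interpreter, so CPU-bound work in one does not wait on the GIL of another.

```python
from multiprocessing import Pool


def count_primes(limit):
    """CPU-bound toy task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=4):
    """Run count_primes once per limit, spread over `workers` processes."""
    with Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    limits = [20_000, 30_000, 40_000, 50_000]
    print(count_primes_parallel(limits))
```

The trade-off is that arguments and results are pickled and copied between processes, so this pays off when the computation per task is large compared with the data shipped back and forth.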

      --
      -- Lattyware (www.lattyware.co.uk)
    12. Re:I get the impression that by lattyware · · Score: 2

      Oh, and Python without a GIL exists, it's called Jython.

      --
      -- Lattyware (www.lattyware.co.uk)
    13. Re:I get the impression that by pthisis · · Score: 2

      The core processing in SciPy/NumPy is done in compiled C or Fortran libraries (LAPACK is used extensively where available), not in Python.

      I'm unaware of a widely-used interpreted version of Python. Whether Python is byte-compiled (CPython), JIT'd (psyco, pypy, IronPython, many Jython stacks), or compiled ahead of time to machine code (Jython+gcj, ShedSkin) depends on which Python implementation you're talking about.

      --
      rage, rage against the dying of the light
  2. Python 2 or 3? by toQDuj · · Score: 3, Interesting

    So is this going to focus on Python 2 or 3? Might be a reason to upgrade..

    --
    Every experiment which ends in a big bang is a good experiment.
    1. Re:Python 2 or 3? by SQL+Error · · Score: 4, Informative

      Both. The prebuilt "Anaconda" distro defaults to Python 2.7, but it also works with 3.3 and 2.6.

  3. Wrong language by Dishwasha · · Score: 4, Funny

    They put the money in the wrong place. They should have put it into R, which very popularly interfaces with Python.

    1. Re:Wrong language by SQL+Error · · Score: 3, Informative

      DARPA runs a lot of these research seed programs, putting a couple of million dollars into a bunch of different but related research projects. In this case the program budget is $100 million in total, and Continuum got $3 million for their Python work (Numba, Blaze, etc). Some of the program money may have gone to R as well; there's a couple of dozen research groups, but I don't have a full list.

    2. Re:Wrong language by hyfe · · Score: 2
      http://en.wikipedia.org/wiki/R_(programming_language)

      R is a statistical programming language. It has lots of neat methods and functions implemented, and it rules the world of statistical analysis.. which is kinda cool, since it's also open source.

      It sits pretty much halfway between Matlab and Python.. It's pretty usable and convenient because of the huge library, but as a programming language it just, well, sucks. Building up the objects some of the methods there need, if you get data from an unexpected source, is just an utter pain in the bottomhole.

      --
      "" How about taking the safety labels off everything, and let the stupidity-problem solve itself? """
  4. Good news for the Python community by kauaidiver · · Score: 3, Funny

    As a full time Python developer for going on 6 years this is good to hear! Now if we can get a Python-lite to replace Javascript in the browser.

  5. Re:Great. Just Great by Kwyj1b0 · · Score: 5, Insightful

    > Yeah the govt needs better systems to manage the huge databases and dossiers they are building on everybody with their warrantless wiretaps and reading everybody's emails. Anybody who helps with this project is pretty damn naive if they don't think it will also be used for this.

    > For that matter anybody who trusts the govt and thinks the govt is your friend is pretty damn naive. Yeah I would like to believe that too. No I won't ignore the mountains of evidence to the contrary. I won't treat all the counterexamples as isolated cases. I see them for what they are: an amazingly consistent pattern. The rule, not the exception. Govt positions are really attractive to sociopath types who just love power and control and a feeling that they are important and they get that feeling by imposing their will on us.

    So what you are saying is that DARPA funds will be used in a way to further the goals of DARPA/The government? Shocking. I haven't read anything that says which agencies will/won't have access to these tools - so I'd hazard a guess that any department that wants it can have it (including the famous three letter agencies).

    FYI, Continuum Analytics is a company that is based on providing high-performance python-based computing to clients. Any packages they might release will either be open source (and can be checked), or closed source (in which case you don't have to use it). They aren't hijacking the Numpy/Scipy libraries. They are developing libraries/tools for a client (who happens to be DARPA). (Frankly, I'd hope that Continuum Analytics open sources their development because it might be useful to the larger community). You do know that DARPA funds also go to improve robotics, they supported ARPANET, and a lot of their space programs later got transferred to NASA?

    Basically, I have no idea what you are ranting about. One government organization funded a project - it happens all the time. Do you rant about NSF/NIH/NASA money as well? If so, you'd better live in a cave - a lot of government sponsored research has gone into almost every modern convenience that we take for granted.

  6. So... by CAIMLAS · · Score: 2

    So, they're porting R and Perl PDL to Python, then?

    --
    ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  7. Re:Great. Just Great by Anonymous Coward · · Score: 5, Funny

    What is this ARPANET thing? It sounds like some useless crap-loaded acronym to me.

  8. Re:Matlab by sophanes · · Score: 2

    matplotlib already does this in conjunction with NumPy and SciPy - its plotting quality and flexibility compare favourably to Matlab.

    Its biggest drawback is that it is pretty glacial even by Matlab's standards when rendering large datasets (think millions of points). I'm not sure whether matplotlib or the interactive backend is at fault, but anything DARPA can do to improve the situation would be welcome.

  9. There's more to XDATA by seekthirst · · Score: 2

    It's strange that this article focused on Python and Continuum when there is a much bigger story to be had. The XDATA program is being run in a very open source manner, and there will be a multitude of open source tools created and delivered by the end of the contract. The program is focusing on two major tasks: the analytics/algorithmic tools to process big data; and the visualization/interaction tools that go along with them.

  10. Python? by Murdoch5 · · Score: 2

    Have they heard of Matlab?

    1. Re:Python? by Anonymous Coward · · Score: 2, Insightful

      Okay, look. I used Octave for a long time on Linux and on Windows. On Linux (Ubuntu) it generally worked rather well and I used it for classwork where possible. On Windows, it works well as long as you don't need to plot anything. I can't tell you the number of times I installed/uninstalled various versions of Octave on Windows to find out that the plotting was broken in some way. MATLAB is great until you run into licensing issues.

      Then I found out about the combination of IPython/Numpy/Scipy/Matplotlib, which now all seems to fall under the name of "Scipy". It runs circles around Octave in just about every way, except that the syntax doesn't try to be MATLAB compatible. The plotting isn't as good as MATLAB's plotting for large data sets, but for 99% of use cases it works quite well, and for that other 1% I've been able to reduce my data set or view the data differently. Where "Scipy" destroys Octave and MATLAB is that in the same language as I do scientific computing, I have access to database libraries, asynchronous networking, good HDF5 support, GUI toolkits, multithreading, multiprocessing, etc. This is because Python is a computer language that makes it easy to integrate or "glue" things together. To the point that people created and glued together some really good numerical processing and plotting libraries. Saying "Fine then use Octave" is ridiculous because it ignores how much better "Scipy" is than Octave. Also, with Anaconda CE, you get a bunch of useful packages installed by default, available as 64-bit on every major OS. I understand that Octave is maintained by volunteers and that Numpy/Scipy have some degree of financial backing, but they're both open source, and I'm going to use the open source option that is more polished. If you don't explicitly care about trying to adhere to MATLAB syntax (which MathWorks continually tries to break, anyways), then I don't know why someone would choose Octave over Scipy.

  11. Re:Great. Just Great by sdaug · · Score: 5, Informative

    > Frankly, I'd hope that Continuum Analytics open sources their development because it might be useful to the larger community

    Open sourcing is a requirement of the XDATA program.

  12. YAY by sproketboy · · Score: 2

    Now China can win!

  13. Big Data != Analytics by michaelmalak · · Score: 2

    The summary and article seem to conflate Big Data with Analytics. These days the two often go together, but it's quite possible to have either one without the other. Big Data is "more data than can fit on one machine", and analytics means "applying statistics to data". E.g. many Big Data projects start out as "capture now, analyze a year or two from now," and maybe just do simple counts in the interim, which is not "analytics". And of course, many useful analytics take place in the sub-terabyte range.

    The irony with this story is that Python is useful for in-memory processing, and not "Big Data" per se. To process "Big Data" typically requires (today, based on available tools, not inherent language advantages) JVM-based tools, namely Hadoop or GridGain, and distributed data processing tasks on those platforms require Java or Scala. Both of those platforms leverage the uniformity of the JVM to launch distributed processes across a heterogeneous set of computers.

    The real use case here is one first reduces Big Data using the JVM platform, and only then once it can fit into the RAM of a single workstation, use Python, R, etc. to analyze the reduced data. So typically, yes, these Python libraries will be used in Big Data scenarios, but pedantically, analytics doesn't require Big Data and Python isn't even capable (generally, based on today's tools) of processing raw Big Data.
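The two-stage workflow described above can be sketched in a few lines. This is only a toy: the stage 1 function stands in for the distributed Hadoop or GridGain job (it is not one), and the sensor records and all names are invented for illustration. What it shows is the shape of the pipeline: the raw records are streamed and folded into a small summary, and only the summary is analysed in memory.

```python
from collections import Counter
from statistics import mean, median


def reduce_stream(records):
    """Stage 1 (stand-in for the distributed job): fold a stream of
    (key, value) records into per-key counts without keeping the raw
    records in memory."""
    counts = Counter()
    for key, _value in records:
        counts[key] += 1
    return counts


def analyze(counts):
    """Stage 2: ordinary in-memory statistics on the reduced data."""
    values = list(counts.values())
    return {"keys": len(values), "mean": mean(values), "median": median(values)}


if __name__ == "__main__":
    # Hypothetical raw data: a generator, so records are produced one at a time.
    raw = ((f"sensor{i % 3}", i * 0.5) for i in range(10))
    summary = reduce_stream(raw)
    print(summary)
    print(analyze(summary))
```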