Python Gets a Big Data Boost From DARPA
itwbennett writes "DARPA (the U.S. Defense Advanced Research Projects Agency) has awarded $3 million to software provider Continuum Analytics to help fund the development of Python's data processing and visualization capabilities for big data jobs. The money will go toward developing new techniques for data analysis and for visually portraying large, multi-dimensional data sets. The work aims to extend beyond the capabilities offered by the NumPy and SciPy Python libraries, which are widely used by programmers for mathematical and scientific calculations, respectively. The work is part of DARPA's XData research program, a four-year, $100 million effort to give the Defense Department and other U.S. government agencies tools to work with large amounts of sensor data and other forms of big data."
The work is part of DARPA's XData research program, a four-year, $100 million effort to give the Defense Department and other U.S. government agencies tools to work with large amounts of sensor data and other forms of big data.
Yeah the govt needs better systems to manage the huge databases and dossiers they are building on everybody with their warrantless wiretaps and reading everybody's emails. Anybody who helps with this project is pretty damn naive if they don't think it will also be used for this.
For that matter anybody who trusts the govt and thinks the govt is your friend is pretty damn naive. Yeah I would like to believe that too. No I won't ignore the mountains of evidence to the contrary. I won't treat all the counterexamples as isolated cases. I see them for what they are: an amazingly consistent pattern. The rule, not the exception. Govt positions are really attractive to sociopath types who just love power and control and a feeling that they are important and they get that feeling by imposing their will on us.
I get the impression that in the Engineering and Scientific community Python is the new Fortran. I hope so, because it would be "Fortran done right".
Bye-bye Matlab. I liked your plotting capabilities, but that was about it.
Seriously- the Continuum folks do great work, and after hanging out with them a bit at the last PyCon I was really impressed with where they seemed to be headed. Hope they make it there.
So is this going to focus on Python 2 or 3? Might be a reason to upgrade..
Every experiment which ends in a big bang is a good experiment.
They put the money in the wrong place. They should have put it into R, which very popularly interfaces with Python.
As a full time Python developer for going on 6 years this is good to hear! Now if we can get a Python-lite to replace Javascript in the browser.
I wonder how this effort compares to the work being done by Enthought Python. Hopefully it is more open and freely available to all, or better yet, incorporated into the mainline python distro.
Yeah the govt needs better systems to manage the huge databases and dossiers they are building on everybody with their warrantless wiretaps and reading everybody's emails. Anybody who helps with this project is pretty damn naive if they don't think it will also be used for this.
For that matter anybody who trusts the govt and thinks the govt is your friend is pretty damn naive. Yeah I would like to believe that too. No I won't ignore the mountains of evidence to the contrary. I won't treat all the counterexamples as isolated cases. I see them for what they are: an amazingly consistent pattern. The rule, not the exception. Govt positions are really attractive to sociopath types who just love power and control and a feeling that they are important and they get that feeling by imposing their will on us.
So what you are saying is that DARPA funds will be used in a way to further the goals of DARPA/The government? Shocking. I haven't read anything that says which agencies will/won't have access to these tools - so I'd hazard a guess that any department that wants it can have it (including the famous three letter agencies).
FYI, Continuum Analytics is a company that is based on providing high-performance python-based computing to clients. Any packages they might release will either be open source (and can be checked), or closed source (in which case you don't have to use it). They aren't hijacking the Numpy/Scipy libraries. They are developing libraries/tools for a client (who happens to be DARPA). (Frankly, I'd hope that Continuum Analytics open sources their development because it might be useful to the larger community). You do know that DARPA funds also go to improve robotics, they supported ARPANET, and a lot of their space programs later got transferred to NASA?
Basically, I have no idea what you are ranting about. One government organization funded a project - it happens all the time. Do you rant about NSF/NIH/NASA money as well? If so, you'd better live in a cave - a lot of government sponsored research has gone into almost every modern convenience that we take for granted.
... to Python operated railguns. That would be awesome :D
So, they're porting R and Perl PDL to Python, then?
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Pandas http://pandas.pydata.org/ is another great tool for data analysis. It uses numpy and is highly optimized, with critical code paths written in C.
This DARPA work sounds like it's in the same space as the Pandas library. I hope they can work together.
What is this APRANET thing? It sounds like some useless crap loaded acronym to me.
Only half a troll, seriously the sphinx/numpy documentation themes are terrible compared to javadoc standard.
Finding epydoc has dropped my swearing to lines of code ratio by heaps.
http://en.wikipedia.org/wiki/ARPANET
you're using it now. or a derivative work, anyway.
python linking to llvm is the way to really speed it up and a few groups are seriously working on it
It's strange that this article focused on Python and Continuum when there is a much bigger story to be had. The XDATA program is being run in a very open source manner, and there will be a multitude of open source tools created and delivered by the end of the contract. The program is focusing on two major tasks: the analytics/algorithmic tools to process big data; and the visualization/interaction tools that go along with them.
Seriously? Sphinx makes beautiful documentation that is easy to find your way around. Compared to the ugly-ass JavaDocs that are painful to browse through, I wouldn't even give it a second thought.
-- Lattyware (www.lattyware.co.uk)
Have they heard of Matlab?
Frankly, I'd hope that Continuum Analytics open sources their development because it might be useful to the larger community
Open sourcing is a requirement of the XDATA program.
Now China can win!
You have no idea what he's talking about? It was pretty clear: factions within the US government want these tools to datamine all the ISP data they have been snarfing up so they can spy on everyone in the world. Saying that you believe otherwise is a pretty extreme view and, as such, requires a very high standard of proof. Do you have that proof? No, then STFU while us adults try to figure out how to stop this obvious slide into tyranny.
The summary and article seem to conflate Big Data with Analytics. These days the two often go together, but it's quite possible to have either one without the other. Big Data is "more data than can fit on one machine", and analytics means "applying statistics to data". E.g. many Big Data projects start out as "capture now, analyze a year or two from now," and maybe just do simple counts in the interim, which is not "analytics". And of course, many useful analytics take place in the sub-terabyte range.
The irony with this story is that Python is useful for in-memory processing, and not "Big Data" per se. To process "Big Data" typically requires (today, based on available tools, not inherent language advantages) JVM-based tools, namely Hadoop or GridGain, and distributed data processing tasks on those platforms require Java or Scala. Both of those platforms leverage the uniformity of the JVM to launch distributed processes across a heterogeneous set of computers.
The real use case here is one first reduces Big Data using the JVM platform, and only then once it can fit into the RAM of a single workstation, use Python, R, etc. to analyze the reduced data. So typically, yes, these Python libraries will be used in Big Data scenarios, but pedantically, analytics doesn't require Big Data and Python isn't even capable (generally, based on today's tools) of processing raw Big Data.
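A rough sketch of the "reduce first, analyze in memory afterwards" pattern described above, in plain Python. The `sensor_readings` generator and its sensor names are made up for illustration. It only stands in for a source too big to load at once, and a real job would do this pass on a cluster, not in one process.

```python
import random
import statistics
from collections import defaultdict

def sensor_readings(n, seed=0):
    """Hypothetical stand-in for a huge data source: yields one record at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.choice("ABCD"), rng.gauss(20.0, 5.0)

def reduce_stream(records):
    """Single pass over the stream, keeping only a count and a running sum per sensor."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for sensor, value in records:
        counts[sensor] += 1
        totals[sensor] += value
    # The reduced result is tiny: one mean per sensor.
    return {sensor: totals[sensor] / counts[sensor] for sensor in counts}

# Step 1: reduce the stream without ever materializing it as a list.
means = reduce_stream(sensor_readings(100_000))

# Step 2: the reduced data fits in memory, so ordinary analysis tools apply.
print(sorted(means))
print(statistics.mean(means.values()))
```

The memory footprint depends on the number of distinct sensors, not on the number of records, which is the whole point of reducing before analyzing.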
Python isn't fit to run on a large cluster to simulate things, too much overhead.
Have you heard of Stackless Python? Your presumption that Python isn't fit for large clusters to simulate things may be news to the largest single-instance human participatory simulation ever done: New Eden.
cash and put it to advancing applied sciences to better the nation. We piss billions down the drain marketing to morons and yet whine about spending billions on DARPA, DoE and whatnot. This country is truly too stupid for its own well-being.
What is this APRANET thing? It sounds like some useless crap loaded acronym to me.
You gotta be fucking kidding me. Either you are trolling or you are completely clueless about technology. In the case of the latter, it begs the question of what you are doing on /. If you don't know what ARPANET is, you should be posting on MySpace instead of posting on a nerd/tech news site. It'd be like me posting opinions on a medicine-related site without knowing the meaning of the word 'penicillin'.
I think you're right. I love Ruby, it's a very fun and effective language, I could write it in my sleep but there are so many cool projects that are written in Python. Those languages are *very* similar, and it's a shame that so much effort is being divided between communities. I might get to learn Python one day but I'm afraid I'd become a so-so programmer in both languages.
Both languages suffer from the global interpreter lock defect and will require a rewrite in the next 5-10 years if the languages have any chance of surviving in the servers.
Gee, because there are no distributed enterprise solutions written on Python or Ruby <rolls eyes/>
It will take some very serious, dedicated, low level work and I just don't see it happening.
It already has happened. The solutions aren't just in the mainstream versions, though. Take Jython. On a typical JVM, it is the fastest Python in-the-trenches implementation available. Throw that over specialized Java-focused hardware (like the Azul Vega 3), and you are on fire.
Furthermore, a solution to the GIL problem is not necessary in the general case. In any modern system, the cost of communicating between processes rather than threads is no longer as much of an issue as it was a decade ago. Depending on the nature of the computation, context switching between processes can be as cheap as switching between threads, and processes are typically somewhat (but not completely) free of the locking issues that are experienced with threading paradigms as seen in, say, Java/JEE solutions.
In the back-end server arena, where the greatest bottlenecks are those between http servers, app servers and database servers, there are so many tried-and-true solutions to the so-called GIL problem that it is typically rendered a non-issue. More processes per box, more RAM and SSDs, more boxes collocated on the same subnets running more processes, all communicating with some type of messaging queue. For these typical solutions, the issue of the GIL gets blurred into non-existence.
It's only for those applications where you have to squeeze every last drop out of your cores that the GIL becomes an issue, and where Java/JEE shines. But for the typical business application, a platform with a GIL issue does just fine by simply scaling horizontally.
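To make the "more processes instead of threads" point concrete, here is a minimal sketch using the standard `multiprocessing` module. The prime-counting workload and the helper names (`count_primes`, `split_range`, `parallel_count`) are illustrative assumptions, not anything from a real deployment. Each worker is a separate interpreter with its own GIL, so CPU-bound chunks can run on different cores.

```python
import multiprocessing as mp

def count_primes(bounds):
    """CPU-bound toy task: count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def split_range(lo, hi, parts):
    """Cut [lo, hi) into at most `parts` contiguous chunks."""
    step = max(1, (hi - lo + parts - 1) // parts)
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]

def parallel_count(lo, hi, workers=4):
    """Fan the chunks out to worker processes and add up the partial counts."""
    with mp.Pool(processes=workers) as pool:
        return sum(pool.map(count_primes, split_range(lo, hi, workers)))

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module.
    print(parallel_count(0, 50_000))
```

The same shape extends across boxes by swapping the pool for a message queue, which is the horizontal scaling described above.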
I have this fantasy where Guido and Matsumoto will sit down and write the common code together for a super-interpreter that will handle different syntax in a modular way. I know it's technically possible since GCC is doing something very similar but, again, I just can't see this happening.
In the meantime, Go is looking mighty good...
Google Go looks mighty good... for systems-level programming. That's what Google intended it to be. For app development, sorry, you need more than a language. You need a tried and true app stack. Until that happens (and it will take some time for that to happen), Java, Python, Ruby and even .NET do more than fine.
You need more than the language (however greatly designed it might be) to make potentially complex domain-specific shit happen.
So how much of that $3M will go towards development of the NumPy port to Pypy? I'm guessing 0%, which is unfortunate, since that is one of the best places to push the state of the art in speed for numerical processing with python. The Pypy community has the modest goal of raising $60k for that work (just 2% of the grant to this company), and they are still only 3/4 of the way to achieving those funds after a year with their shingle out.
http://pypy.org/numpydonate.html
Speaking as someone who's been employed writing Python for nearly a decade, and prior to that was involved in porting scientific Fortran to C & Java (Dear Fortran guys -- I'm so sorry, it was the job the idiots paid for because management thought Fortran was dying).
It won't work.
It's not that Python can't do it. It's that without a real programmer, python is slow. Even with a real programmer, it's slower -- but that's /often/ recoverable in many ways, particularly in development time.
Scientists that don't code have an easier time learning python. Scientists that do code (well) can learn python, but are often going to want to move into other languages because they *always* want more data and more refined models. I've seen them learn java and c -- but that's a total nightmare, worse than even python.
Contrast to a friend who thinks they can program but can't even fizzbuzz -- they have a dataset that they think is too slow for python. It is too slow the way they do it, but they've copy-pasted an O(log log N) algorithm so badly it's at least O(N log N). Going out of asymptotics, there really is a constant of about "5" before that for all the extra iterations and wholly unnecessary subdivisions they do, plus the output is total shit because they don't understand what it means to work in floating point. So a process that I can finish on my desktop in a few hours as long as I have enough RAM takes them three weeks to run on a server.
The thing runs -- except for the 10% of the data they drop, but it's a wholly unreadable mess.
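For anyone wondering what "working in floating point" means in practice, here is a tiny generic illustration. It is not the friend's code, and `naive_total` is a made-up helper: adding 0.1 ten times in a plain loop does not give exactly 1.0, while `math.fsum` tracks the lost low-order bits.

```python
import math

def naive_total(values):
    """Plain left-to-right accumulation; rounding error builds up at each step."""
    total = 0.0
    for v in values:
        total += v
    return total

readings = [0.1] * 10

naive = naive_total(readings)
careful = math.fsum(readings)

print(naive == 1.0)              # False: the accumulated sum is slightly off
print(careful == 1.0)            # True: fsum compensates for the rounding
print(math.isclose(naive, 1.0))  # True: compare with a tolerance, not ==
```

An exact `==` test on accumulated floats is one easy way to end up silently dropping data that was "supposed" to match.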
Some of the people that want to do this are "real programmers" -- but many are scientists that just want a visualization and don't give a damn what tool does it as long as the output looks like what they think it should.
They're the same researchers that cut and paste from stackoverflow or expert sexchange, and who just drag and drop code around in notepad trying to get rid of errors.
They'll get an example given a CSV to make a beautiful clustergraph from examples or a friend that knows it, but they'll still develop deeply flawed research and modeling code and never know why or catch it.
Doing this in python may make some of the analysis more accessible as a whole, but it won't fix the 'problem' that most scientists can't actually program.
Maybe they shouldn't have to -- but somebody does.
The problem is really best summed up when describing a bug to a new programmer that wasn't great at math, and was clearly used to having a single error mess them up. They figured they could change one thing and fix a totally flawed algorithm...
"Just tell me what line the bug is on"
The answer was : "All but these two".
To the new-non-programmer...this answer was inconceivable. They 'knew' what they told the computer to do, and it was being unreasonable in interpreting their source according to the rules of the language. There had to be one line to fix it -- the notion that the fundamental structure of their logic was wrong was so counterintuitive they didn't believe it even when pointed out.
Poe's Law.
Plenty of reasons.
The home of the only four-character TLD suffix, .arpa
By the way, did you know Bill Joy wrote the BSD IP-stack in one weekend? :-)
Well.. there's C, of course...
I work with C and C++ on a daily basis, and I have to ask/answer: For parallelized scientific computation or data crunching? No thank you. You don't use a Phillips screwdriver to unscrew a hexagonal bolt, do you? Know your tools, their strengths and limitations.
Yeah, the issue is that Python is pretty hard to sandbox, being the hugely dynamic language it is.
Forgive me but JavaScript is also hugely dynamic. How does this prevent effective sand boxing in the general sense?
I imagine it would take a lot to get the browsers to stop working on their JavaScript implementations that they have sunk insane amounts of time and effort into, and start something brand new.
Another solution is to program in a subset of Python that gets verified at compile time with additional restrictions, and then compiled into JavaScript (the way CoffeeScript does.) That way we re-capture the investment already made in browser-side JavaScript technology.
Trust me, I'd love to see it happen, but I don't think it will.
That sounds more like a solution looking for a problem. No need to reinvent the browser vm wheel. Reuse what's there to the greatest extent possible and get the best ROI.
It might not sound as cool as re-inventing browser script vm technology, but it is certainly a more pragmatic solution for which working precedents already exist. Plus, it's not as if it were trivial. Language-to-language compilers are fertile ground for very cool experimentation.
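As a hedged sketch of the "verified subset of Python" idea, the standard `ast` module can reject anything outside a whitelist before a compiler ever sees it. The whitelist, the banned names and `check_subset` are all hypothetical choices for illustration. This covers only the verification half and does not emit any JavaScript.

```python
import ast

# Hypothetical whitelist: syntax node types the restricted subset permits.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Compare, ast.Lt, ast.Gt, ast.Eq, ast.If, ast.Call,
    ast.FunctionDef, ast.arguments, ast.arg, ast.Return,
)

# Hypothetical blacklist: names that would reopen the dynamic escape hatches.
BANNED_NAMES = {"eval", "exec", "compile", "__import__", "getattr", "setattr"}

def check_subset(source):
    """Return a list of violations; an empty list means the source fits the subset."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ALLOWED_NODES):
            line = getattr(node, "lineno", "?")
            problems.append(f"line {line}: {type(node).__name__} is not allowed")
        elif isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            problems.append(f"line {node.lineno}: name {node.id!r} is banned")
    return problems

print(check_subset("def double(a):\n    return a * 2"))
print(check_subset("import os"))
print(check_subset("eval('1 + 1')"))
```

Because imports, attribute access and the listed builtins are all refused up front, whatever passes the check is a much smaller language, and a translator to JavaScript only has to handle the whitelisted node types.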
Poe's Law.
In /., you never know.
...this to Julia as it is made for number science anyway.
Yeah the govt needs better systems to manage the huge databases and dossiers they are building on everybody with their warrantless wiretaps and reading everybody's emails. Anybody who helps with this project is pretty damn naive if they don't think it will also be used for this.
Isn't this true of all useful open source projects?
You have no idea what he's talking about? It was pretty clear: factions within the US government want these tools to datamine all the ISP data they have been snarfing up so they can spy on everyone in the world. Saying that you believe otherwise is a pretty extreme view
He has no idea why there is ranting about open source code that everyone in the world can use for any purposes. Did you rant about git being open source? I'm betting the gov't can use that to manage code related to data mining. Do you rant about postgres or any of the databases used by the US gov't? Would postgres suddenly become evil because the gov't threw some money their way?
So what you are saying is that DARPA funds will be used in a way to further the goals of DARPA/The government? Shocking.
First rule of Slashdot: never, ever, EVER miss the opportunity to be a condescending ass. It's much more important than the point being made, after all.
Which, since you seem confused, the point was they want OUR HELP. No, says I, they can work on their rights-infringing projects without my personal assistance. You see, actively assisting tyranny would be the opposite of furthering MY goals.
See how simple that is? You aren't that stupid. You are, however, too insecure and thus too eager to portray another as the fool, and this stops you from realizing what is being said. After all, the other guy is a fool so you MUST interpret what he said in the dumbest way possible. Then you can cry about how dumb he was.
It's sort of like making an idol with your own two hands, and then bowing down and worshipping the idol you have made. You have to forget that you did in fact make it. That's how it is when you assume you're so smart and everyone else is so stupid. You have to forget that you took it upon yourself to assume that. Then you can believe in it.
People like you are why mature, rational adult conversations are so hard to find these days. The sense of worth you're looking for is found within yourself by cultivating a healthy mind and spirit and an attitude of joy and appreciation. You will never have satisfaction, fulfillment, or security by playing the condescending ass. You will only miss the point being made.
See Travis Oliphant's announcement about this on the numpy-discussion list: http://comments.gmane.org/gmane.comp.python.numeric.general/52397
No, they don't. The CPython and MRI/YARV implementations of Python and Ruby, respectively, have global interpreter locks, but those are implementation quirks not language features. On the Python side, IronPython and Jython don't have a GIL, on the Ruby side neither JRuby, MacRuby, IronRuby nor Rubinius (the latter being particularly important, because it has been widely suggested as the next mainline Ruby platform, in the same way that the YARV-based Ruby, which replaced the old mainline interpreter from 1.9, was prior to 1.9) have a GIL.
Further, I'm not sure the GIL is that big of an issue going forward: threadsafe native code in the runtime or extensions can release the GIL, and directly using native system threads with shared mutable state at the application level rather than using isolated task abstractions at the application level with threads managed in the runtime doesn't seem to be all that great a way to build scalable application code (there is a reason why languages designed specifically for scalable concurrency often don't directly expose threading at the language level: this is true both of newer languages like Go and Rust, and older and more widely used concurrency-focussed languages like Erlang.)
Finally, insofar as the GIL is important, it's not like a ground-up rewrite that starts from square one would need to be done to get rid of it in the next 5-10 years: the mainline interpreters have been working on improving the thread-safety of the underlying code for years with the intent of removing the GIL in both CPython and MRI, and as noted previously, alternative implementations of the languages have already been built that don't have a GIL -- so the work of the "rewrite" has already been done and is available (multiple times, for each language.)
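A small sketch of both points above, under the assumption that you are on CPython: `hashlib` is native code that releases the GIL while hashing large buffers, and `concurrent.futures` lets the application hand out isolated tasks instead of managing threads and shared state itself. The function names `digest` and `digest_all` are illustrative.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(blob):
    """Hash one buffer; in CPython the heavy lifting happens in native code."""
    return hashlib.sha256(blob).hexdigest()

def digest_all(blobs, workers=4):
    """Submit each buffer as an independent task; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, blobs))

# Four 1 MB buffers, each filled with a different byte value.
blobs = [bytes([i]) * 1_000_000 for i in range(4)]
for hexdigest in digest_all(blobs):
    print(hexdigest[:16])
```

No task touches another task's data, so there is nothing to lock at the application level, and whether the threads actually overlap is left to the runtime and the extension code.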
The home of the only four-character TLD suffix, .arpa
Other than, you know, these:
Reference: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains