Slashdot Mirror


Is Python a Legitimate Data Analysis Tool?

Back in May we discussed using Python, R, and Octave as data analysis tools, and compared the relative strengths of each. One point of contention was whether Python could be considered a legitimate tool for such work. Now, Bei Lu writes that while Python on its own may be lacking, Python with packages is very much up to the task: "My passion with Python started with its natural language processing capability when paired with the Natural Language Toolkit (NLTK). Considering the growing need for text mining to extract content themes and reader sentiments (just to name a few functions), I believe Python+packages will serve as more mainstream analytical tools beyond the academic arena." She also discusses an emerging set of solutions for R which let it better handle big data.

44 of 67 comments

  1. really? by Anonymous Coward · · Score: 2, Interesting

    Any Turing-complete language is a legitimate data analysis tool.

    1. Re:really? by Meshach · · Score: 2

      Any Turing-complete language is a legitimate data analysis tool.

      The question is not whether or not it is possible but whether or not it is realistic and practical.

      --
      "Maybe this world is another planet's hell"
      Aldous Huxley
    2. Re:really? by Billly+Gates · · Score: 3, Funny

      No the question is whether it is legitimate.

      In that case, Excel, because you can email it and share it with colleagues and it is PHB approved.

    3. Re:really? by KiloByte · · Score: 1

      With the right libraries, it ALWAYS is both realistic and practical.

      Of course, you'd need really good libraries to overcome malbolge or brainfuck, but hey, no one says the underlying language has to be visible from behind them...

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    4. Re:really? by Phibz · · Score: 1

      You joke, but you'd be surprised: in the marketing industry Excel is quite popular for data analysis. For small data sets it does a fine job.

    5. Re:really? by jonadab · · Score: 1

      > The question is not whether or not it is possible but whether or not it is realistic and practical.

      Using Python for data analysis is realistic, assuming you know Python (or have enough background in computer science to pick it up quickly -- it's not a particularly difficult language, as languages go: I've seen accounting software packages that would be much harder to learn).

      Python is perhaps not quite as practical as some other choices. In particular, object-oriented programming is not an especially good fit for many data analysis tasks; a multiparadigmatic language would often be better, because it lets you use functional techniques, which is often very handy for working with data sets. (It's no coincidence that SQL bears a striking resemblance to the data-filtering portions of a typical impure functional language.) OTOH, it's good that not everyone uses exactly the same thing. The right tool for the specific job you're doing and all that -- all data analysis is not created identical.

      Personally, I use Perl.

      --
      Cut that out, or I will ship you to Norilsk in a box.
    6. Re:really? by Anonymous Coward · · Score: 1

      I've been seeing this way too often. People don't bother considering what the intent of the person asking was. They fixate on one word with a relatively vague meaning and choose one particular interpretation of it, and then go haring off into oblivion. The discussion turns into a verbal fight over the precise definition of a word that's vague in the first place.

      It's really rather annoying.

      So: can you replace R with Python (let's say, for a new project), assuming that you know both languages and all the relevant libraries, without a significant hit to productivity? Without reinventing stuff that R has by default? Without pining for CRAN every ten minutes?

    7. Re:really? by luis_a_espinal · · Score: 1

      Any Turing-complete language is a legitimate data analysis tool.

      Legitimate =/= feasible without regard to cost.

      Otherwise, let's use assembly to write our own analytics package.</rollseyes>

    8. Re:really? by Billly+Gates · · Score: 1

      It does a great job.

      The problem is when you need to share the data. Then a database is the answer, and it can do data mining as well if it is a good non-free one ... well, except if it's Oracle. UGH.

      But to me that is common sense rather than feeding it into a script in some programming language.

    9. Re:really? by NadMutter · · Score: 1

      Replying to undo accidental moderation (sticky trackpad)

      Sadly I see way too many corporate 'documents' that subscribe to the putative logic 'if it has numbers, it must be a spreadsheet; if it has pictures, it goes into powerpoint'. Where's the 'Ironic' moderation option

    10. Re:really? by cynyr · · Score: 1

      you should see small business engineering... if it is, it is in excel or autocad. These are then printed to PDF for publication.

      --
      All of the above was encrypted with a Quad ROT-13 method. Unauthorized decryption is in violation of the DMCA.
    11. Re:really? by cynyr · · Score: 2
      --
      All of the above was encrypted with a Quad ROT-13 method. Unauthorized decryption is in violation of the DMCA.
  2. It Works by mrsquid0 · · Score: 4, Insightful

    Python may not be a legitimate data analysis tool, but it is widely used for data analysis, and it gives the right results. For the most part that is what really matters.

    --
    Just because you are paranoid does not mean that no-one is out to get you.
    1. Re:It Works by mcgrew · · Score: 5, Insightful

      Python is a language. It's a tool to build other tools with, including data analysis tools.

    2. Re:It Works by ceoyoyo · · Score: 4, Insightful

      What does "legitimate data analysis tool" mean? MatLab was included in the comparison, and MatLab is more of an engineering tool. The built in (excuse me, optional paid for) stats library is pretty limited.

      R is great for doing statistical analysis, but it's not great for doing things like image analysis. Without additional libraries R isn't nearly as good as it is with libraries either.

    3. Re:It Works by Instine · · Score: 2

      or use other libraries easily and quickly. PyCUDA gives genuinely huge number crunching power to the language. It also allows metaprogramming, which suits scripting languages and machine learning very well. http://mathema.tician.de/dl/pub/nvidia-gtc-2009.mp4

      The readability and flexibility and speed of development are what it brings, the raw power comes from the libraries it can talk to.

      --
      Because you can - or because you should?
    4. Re:It Works by roman_mir · · Score: 3, Funny

      What does "legitimate data analysis tool" mean?

      - obviously it means to ask whether Python is legitimate or is a bastard, what do you think it means? It is not asking whether Python is a 'data analysis tool', it is asking whether Python is a legitimate something or other.

      So to answer the question you have to look at Python's ancestry. You'll quickly discover that Python was actually conceived in a huge orgy of different programming paradigms, styles and languages; it's even named after a circus!

      I believe the answer is that Python is a bastard of data analysis tools, but so what, bastards are people too.

  3. Re:Call me old fashioned by ceoyoyo · · Score: 5, Interesting

    It depends how complicated the math is.

    I wrote a general linear model in Python because I was unhappy with the existing ones and I wanted an intimate knowledge of how it worked. I wrote most of a general linear mixed model, but then decided it wasn't worth the time and just used the one in R via RPy2. Then it turned out the one built into R was too slow, so I upgraded to the one in the lme R package. That exists because a lot of smart people use R.

    But sure, if your "data analysis" involves multiplication and maybe a t-test or two, it doesn't really matter what you use.

  4. these articles are not informative by Anonymous Coward · · Score: 1

    Someone who knows so little about tools like R, python, etc. should spend their time learning about what is available rather than writing articles on the topic using their own cursory knowledge.

    1. Re:these articles are not informative by Eponymous+Hero · · Score: 1

      seems legit

      --
      insensitive clod overlords obligatory xkcd car analogy russian reversals whoosh pedant fanbois ftfy in 3...2...1..PROFIT
  5. Use what works by hawguy · · Score: 5, Insightful

    Since people do use python for data analysis (hence the data analysis related packages that are available), of course it's legitimate.

    Just like how when you're standing on the roof and you need to pound in a couple of nails, that heavy pair of pliers in your pocket is a legitimate tool. It may not be the best tool for the job, the best tool might be a pneumatic nail gun, but if all you have with you and what you know how to use is pliers, then that's the right tool. Why spend time and money learning some other "more appropriate" language (or buying an air compressor and nail gun) when you already have a tool at your fingertips that will do what you need?

    As your needs grow you might need to find another more appropriate tool, but if you can get the job done with Python, why bother searching for the "perfect" tool?

    Depending on your needs, sh, awk, sed, sort, and uniq may be all the tools you need - many log parsing, analysis and reporting programs have been written with those tools, often ingesting more rows of data per day than many small business BI systems.
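
For comparison, the classic `sort | uniq -c | sort -rn` log summary can be sketched in Python using only the standard library. This is a minimal illustration rather than anything taken from the comment above: the sample log lines, the assumption that the status code is the last whitespace-separated field, and the helper name `count_by_field` are all invented for the example.

```python
from collections import Counter

def count_by_field(lines, field=-1):
    """Count one whitespace-separated field, roughly `awk '{print $NF}' | sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[parts[field]] += 1
    return counts

# Hypothetical sample data; a real script would iterate over an open log file instead.
sample_log = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "",
    "GET /index.html 200",
]

status_counts = count_by_field(sample_log)
for status, n in status_counts.most_common():  # most frequent first, like `sort -rn`
    print(n, status)
```

Because the function accepts any iterable of lines, passing an open file object would process the log one line at a time instead of loading it whole.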

    1. Re:Use what works by betterunixthanunix · · Score: 1

      Why spend time and money learning some other "more appropriate" language (or buying an air compressor and nail gun) when you already have a tool at your fingertips that will do what you need?

      Indeed, although sometimes you save yourself a lot of headaches by getting a tool that was built for your task. I have, in a pinch, used a screwdriver to hammer nails, but a screwdriver is no replacement for a hammer.

      That being said, Python+SciPy+NumPy is fine for data analysis; people use it all the time, and it works as well as R or MatLab. It is not as though we are talking about QuickBasic for data analysis.

      --
      Palm trees and 8
  6. Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o by jdgeorge · · Score: 3, Funny
  7. Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o by godrik · · Score: 2

    Tomorrow on slashdot:

    "Can all questions in headlines be answered by 'no' ?"

  8. Python can do anything by Anonymous Coward · · Score: 5, Funny

    http://xkcd.com/353/

  9. Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o by mooingyak · · Score: 1

    Tomorrow on slashdot:

    "Can all questions in headlines be answered by 'no' ?"

    Most, but not all

    --
    William of Ockham had no beard. The most likely explanation is that it was chewed off by squirrels every morning.
  10. Better than R by Anonymous Coward · · Score: 1

    I looked at R and it's one of the most deranged languages I've ever seen in terms of syntax (up there with Erlang). At least Python is readable to the average programmer who knows C or Java.

    1. Re:Better than R by ceoyoyo · · Score: 3, Informative

      R is MUCH nicer when you use it through a bridge from Python.

    2. Re:Better than R by Anonymous Coward · · Score: 1

      I think he means RPy2

  11. Is it reproducible? by Anonymous Coward · · Score: 1

    I work in the biosciences and we occasionally have a similar discussion.

    In our context, it isn't about how one analyzes the data; it is a question of how anyone else can recreate your experiment: that is, set up the experimental system, acquire the data, and analyze it in a way that yields approximately the same results. It is in our best interest [and mandated by our funding agency and the journals] to publish papers that clearly define how we made our observations and how we analyzed the data.

    My group concludes that any tool is fine, but it must be part of a well-described logical framework in which we generate a hypothesis, test it, and make a conclusion.

  12. Legitimate? by jdavidb · · Score: 1

    "legitimate" is such a disrespectful value judgment. Are you saying that people who do data analysis with Python are illegitimate? Are you calling them bastards?

    No, seriously, you can have a profitable conversation all about the reasons why you think there are serious drawbacks to using Python as your data analysis tool. Lots of people might benefit from that. But when you start saying things like "That's not a legitimate data analysis tool" or "That's not a real programming language" or whatever, then you are getting down into contentless arguing, passing off disrespect as if it were legitimate discourse.

    If you really think use of Python as a data analysis tool is that bad, go all the way: don't try to have a serious subject on the discussion, turn it into a humorous essay on people who are so stupid and unenlightened that they can't see what is blindingly obvious to you.

    A long time ago in my academic life, I took a neural networks class that did a lot of data analysis with matlab. I poked around with octave, but I finally wound up writing my projects in Perl with PDL. I'm sure not many people would do that, but I just wanted the learning experience. It was legitimate for my purposes, which was learning and the joy of being able to say I did it. But you might want to mock me for it. :)

    1. Re:Legitimate? by Sir_Sri · · Score: 1

      "legitimate" is such a disrespectful value judgment. Are you saying that people who do data analysis with Python are illegitimate? Are you calling them bastards?

      I'm not sure that's how it's meant, but I agree, it's an odd choice of phrase. If I were to look at it another way, what would make a language 'illegitimate' for data analysis? In that case you look at things like Excel and Access for financial transactions, or some of the early versions of CUDA that didn't support proper IEEE floating point maths (or at least, not fast IEEE floating point maths). In those cases you can use the language, and it will spit out results, but they might not be right, and there's no obvious way to know. Let's say by default some language doesn't convert in any sort of reasonably obvious way between number types (ints to floats, floats to ints, floats to double precision floats, that sort of thing), or the math has some bizarre errors in it. A classic example would be division: that's slow to do properly, so if your language takes some shortcuts by default that are faster, but wrong, well, that'd be bad.

      These of course are all things that can be changed or fixed with appropriate libraries and so on, but you'd need to know those are problems. Which I guess is why you'd ask.

      So ya, overall, it's a strange way to phrase the problem. In a broadly theoretical sense there's no reason any decent language couldn't be used for data analysis, obviously, so from there it's a matter of whether or not it's up front about when it does things badly.
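
As a small illustration of the "results that might not be right, with no obvious way to know" concern, here is a sketch using only Python's standard library. It is an example added for this discussion, not a claim about any of the tools named above: it shows that naive floating point summation silently accumulates rounding error, and that number-type conversion in Python 3 division is at least explicit.

```python
import math

values = [0.1] * 10

naive_total = sum(values)        # plain left-to-right float addition
exact_total = math.fsum(values)  # tracks the low-order bits lost while summing

print(naive_total == 1.0)  # False: rounding error accumulated silently
print(exact_total == 1.0)  # True

# Comparing with a tolerance is the usual defence when exact equality is too strict.
print(math.isclose(naive_total, 1.0))  # True

# In Python 3, / always yields a float and // floors, so the conversion is visible.
print(7 / 2, 7 // 2)  # 3.5 3
```

Neither result raises an error or a warning, which is the point: you have to know the pitfall exists before you can pick the function that avoids it.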

  13. Already considered such.... by Anonymous Coward · · Score: 1

    Just ask the astronomy community. They've been moving away from IDL as an analysis environment and towards the use of python with scipy (with numpy and pyfits offering similar performance). You're asking this question several years after it's already been effectively declared as such.

    1. Re:Already considered such.... by goatbar · · Score: 1

      If you are talking astronomy, you've left out http://yt-project.org/

  14. Re:Call me old fashioned by highacnumber · · Score: 1

    If the math is more abstract, then Sage (python-based) is a better bet than R (and Sage includes R): www.sagemath.org.

  15. Re:Call me old fashioned by ceoyoyo · · Score: 1

    Sage is basically a batteries-included Python distribution. Lots of people like a big package to use like that. I prefer putting the pieces together myself. My other complaint about Sage, last time I looked at it, is that it's more difficult to install your own packages in the Sage environment than it is to do so with stock Python. One of the great things about Python is needing a particular algorithm, typing it into Google, and downloading and installing the handy package that someone else has already written and shared.

  16. Re:Call me old fashioned by hey! · · Score: 3, Insightful

    Alright, you're old-fashioned. And you're mixing up apples and oranges.

    I think what most people these days are talking about is not just having some kind of online analytics data resource, but having a system where having that resource is taken as a given and the task is to use mathematics and AI to classify records, discover patterns and relationships, locate unusual data (without necessarily specifying the nature of the anomaly in advance), and whatnot.

    A spreadsheet is fine for doing simple summaries of small, heterogeneous, tabular datasets (calculating averages and whatnot). But it's not going to help you find one record out of millions where your search criteria are too complex to be expressed in a SQL where clause.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  17. Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o by physburn · · Score: 1

    Can betteridge's Law cause paradox?

  18. "And it's critical to finding the Higgs Boson" by Narrowband · · Score: 1

    This story seems like an echo of the one a day or so ago about Linux being critical to the success of the LHC. Something with generic programmability supports something specific, then gets discussed as a tool for that specific task. Probably a lot of the comments there apply here.

  19. Perl and Python both by gizmo_mathboy · · Score: 2

    Python and Perl make great data analysis tools.

    They have a plethora of libraries to handle things: Numpy/Scipy for Python and PDL/GSL for Perl.

    They can access FORTRAN and C libraries as necessary for either performance or legacy needs.

    They are probably best because they are high level languages, very platform neutral, and cost significantly less than other "serious" data analysis tools/languages.

  20. Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o by VortexCortex · · Score: 2

    No.

    Working link for subject. In other news, How hot is vehicle theft is your area?

    "No." is the correct answer. That headline is just wrong.

  21. Pandas - Data Analysis for Python by Anonymous Coward · · Score: 1

    Yes, absolutely. It's being used to do all sorts of data analysis in the real world.

    Check out Pandas (http://pandas.pydata.org/) the Python data analysis library.

    Also there are lots of machine learning libraries: scikit-learn is probably the best known (http://scikit-learn.org/)
    Both of these are built on NumPy.

    You should also check out the videos from the 2012 PyData workshop: http://marakana.com/s/2012_pydata_workshop,1090/index.html

  22. CERN by Roger+W+Moore · · Score: 1

    The question is not whether or not it is possible but whether or not it is realistic and practical.

    Not only is it realistic and practical but it is already in use for data analysis! Everyone on the ATLAS experiment at CERN uses python to some degree in their analysis and my grad students and I use an analysis framework almost entirely in Python with ROOT for I/O.

  23. size of data by transonic_shock · · Score: 1

    I love R and Python. However, both of them choke on big data sets. What they need is a built-in mechanism to store data on disk rather than in memory. There are some really convoluted ways of doing this, but they don't always work with modeling packages that weren't written with your convoluted approach in mind. So, if the base language had the ability to store objects on disk, say with a simple flag, and it were transparent to the rest of the system, most downstream libraries/packages would still work.

    The ff package in R is a good approach; maybe that should be adopted as the memory model for R.

    I hate to say this but maybe R/Python can learn something from SAS here.
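
To make the trade-off concrete, here is a minimal sketch of the manual, out-of-core style that is available today. It is not the transparent disk-backed object model asked for above, and it is not how ff or SAS work internally: it simply streams a CSV file from disk and keeps one row in memory at a time, using Welford's one-pass method for the mean and sample variance. The file contents, the column name `value`, and the function name `streaming_mean_var` are assumptions made up for the example.

```python
import csv
import os
import tempfile

def streaming_mean_var(path, column):
    """One pass over a CSV file, holding a single row in memory (Welford's method)."""
    n, mean, m2 = 0, 0.0, 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            x = float(row[column])
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)  # running sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else float("nan")  # sample variance
    return n, mean, variance

# Hypothetical tiny file standing in for a data set too large for RAM.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["value"])
    for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        writer.writerow([v])
    demo_path = tmp.name

n, mean, variance = streaming_mean_var(demo_path, "value")
os.remove(demo_path)  # clean up the demo file
print(n, mean, variance)
```

This also illustrates the complaint: the statistic had to be rewritten as a streaming algorithm, so an existing modeling package that expects an in-memory array would not work on the file unchanged.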