Slashdot Mirror


Comparing R, Octave, and Python for Data Analysis

Here is a breakdown of R, Octave and Python, and how analysts can rely on open-source software and online learning resources to bring data-mining capabilities into their companies. The article breaks down which of the three is easiest to use, which do well with visualizations, which handle big data the best, etc. The lack of a budget shouldn't prevent you from experiencing all the benefits of a top-shelf data analysis package, and each of these options brings its own set of strengths while being much cheaper to implement than the typical proprietary solutions.

61 comments

  1. Get the Popcorn by eldavojohn · · Score: 4, Funny

    So, you're linking a SlashdotBI article to the Slashdot front page?

    Well then.

    --
    My work here is dung.
    1. Re:Get the Popcorn by Anonymous Coward · · Score: 0

      And with such a lightweight article, too? Please. There was only one code sample, and no example of how one task would be accomplished across all three.

    2. Re:Get the Popcorn by Anonymous Coward · · Score: 0

      Thanks for providing that feedback Anonymous!

      .. just channeling "startrekkie"

    3. Re:Get the Popcorn by ObsessiveMathsFreak · · Score: 1

      If people thought Idle was bad, the Business Intelligence section takes Slashdot an order of magnitude lower.

      How long until the BI editors demand outright access to the frontpage?

      --
      May the Maths Be with you!
    4. Re:Get the Popcorn by Anonymous Coward · · Score: 1

      It's even more moronic when you consider that the article's comments had more useful content than the actual article!

      It's no wonder that Taco left. /., it was a nice ride, but you've really fallen by the wayside in the last few years, as in nearly irrelevant, with late story postings and garbage like this one.

  2. Did I seriously miss something? by ACK!! · · Score: 4, Informative

    The whole article was not much more than a high-level review. The graphic naturally draws attention to the parameters the writer wanted to cover, but he did not back up his graphic with any sort of serious textual review of what he felt were the weaknesses or advantages of the different programming languages, at least not in any detail.

    --
    ACK /ak/ interj. 2. [from the comic strip "Bloom County"] An exclamation of surprised disgust, esp. i
    1. Re:Did I seriously miss something? by Ruie · · Score: 4, Interesting

      The whole article was not much more than a high-level review. The graphic naturally draws attention to the parameters the writer wanted to cover, but he did not back up his graphic with any sort of serious textual review of what he felt were the weaknesses or advantages of the different programming languages, at least not in any detail.

      And what he has is flawed as well. For example, he marked R as having issues with big data, which is quite wrong - I routinely analyze multi-GB datasets in memory, and my databases go into TB. Of the three languages, R is the only one to have a native format (data.frame) that interfaces easily to database queries. Both Octave (Matlab) and Python have to use compound types, which make addressing difficult.

      Also, I found R easier to master than either Octave or Python, but this is probably because I am familiar with Lisp.

    2. Re:Did I seriously miss something? by Anonymous Coward · · Score: 1

      he did not back up his graphic with any sort of serious textual review

      She is Geeknet's "Senior Director of Analytics".

    3. Re:Did I seriously miss something? by Anrego · · Score: 2

      Indeed. This is high-level "meeting for the suits" bullshit. I can picture this showing up in a PowerPoint presentation.

      Here are your three options.. this is the one that sucks, this is the one that sucks for a different reason, and this is the one I want you to go with. Oh, and here is a chart with some pretty checkmarks and stuff to help clarify! Let's do lunch!

    4. Re:Did I seriously miss something? by Anonymous Coward · · Score: 0

      I think the difference is when you use file formats that are flatter than databases and certain GUIs. In those cases, rather than taking the data as it needs it, it attempts to load all of it into memory and can max out the memory allowed to the process in 32 bit systems. But even then, there are ways around that through smart planning, variable use, and multiple data files for different variables so not all are in memory at once (of course databases implement all three at once internally).

    5. Re:Did I seriously miss something? by dondelelcaro · · Score: 2

      there are ways around that through smart planning, variable use, and multiple data files for different variables so not all are in memory at once

      There are also packages like ff and others which handle absolutely gigantic files by offloading parts of them to storage and only allocating memory for them (and storage) when required. R certainly has some problems with dealing with huge amounts of data, but they aren't insurmountable for datasets less than 1T.

      --
      http://www.donarmstrong.com
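
      The "offload to storage, only pull in what you touch" idea above can be sketched outside R as well. The following is a minimal illustration in Python using only the standard library's mmap module. It is not how ff is implemented; the file layout (raw 64-bit floats) and the helper names write_doubles and read_slice are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

ITEM_SIZE = struct.calcsize("=d")  # bytes per 64-bit float, no padding


def write_doubles(path, values):
    """Write values to disk as raw 64-bit floats (hypothetical file layout)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("=d", v))


def read_slice(path, start, count):
    """Memory-map the file and decode only `count` values from index `start`.

    The operating system pages in just the touched region, so the whole
    file never has to be loaded into the process at once.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"={count}d", mm, start * ITEM_SIZE))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "values.bin")
    write_doubles(path, (float(i) for i in range(100_000)))
    print(read_slice(path, 50_000, 3))
```

      The same pattern scales to files far larger than RAM, as long as each slice you ask for is small enough to hold in memory.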
    6. Re:Did I seriously miss something? by Ruie · · Score: 1

      I think the difference is when you use file formats that are flatter than databases and certain GUIs. In those cases, rather than taking the data as it needs it, it attempts to load all of it into memory and can max out the memory allowed to the process in 32 bit systems. But even then, there are ways around that through smart planning, variable use, and multiple data files for different variables so not all are in memory at once (of course databases implement all three at once internally).

      This only happens if you issue a call like read.table("mytable.txt") - you can read the file piece by piece if you want to. Granted, this requires some work (unlike SAS), but in return you can do loops ;)
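
      As a rough illustration of reading a table piece by piece instead of all at once, here is a sketch in Python rather than R, using only the standard library. The tab-separated layout, the chunk size, and the helper names iter_chunks and column_mean are assumptions made for the example, not anything from the article.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most `chunk_size` parsed rows from an iterable of lines."""
    reader = csv.reader(lines, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def column_mean(lines, column, chunk_size=1000):
    """Compute the mean of one numeric column while holding one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(lines, chunk_size):
        for row in chunk:
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")


if __name__ == "__main__":
    # A small in-memory stand-in for a large file such as "mytable.txt".
    table = io.StringIO("1\t10\n2\t20\n3\t30\n4\t40\n")
    print(column_mean(table, column=1, chunk_size=3))
```

      Passing an open file object instead of the StringIO stand-in works the same way, since both yield one line at a time.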

    7. Re:Did I seriously miss something? by Anonymous Coward · · Score: 1

      And what he has is flawed as well. For example, he marked R as having issues with big data, which is quite wrong - I routinely analyze multi-GB datasets in memory, and my databases go into TB.

      Dude. That's not what people mean when they say big data. HP and Dell will both quite happily sell you machines with 2TB of main memory, and SGI will go to 16TB, and anything which can fit in memory on a single machine without custom hardware isn't big data. It's only big data once you get up to a few hundred terabytes.

    8. Re:Did I seriously miss something? by martin-boundary · · Score: 1

      That's not the real problem. The real problem with R and Octave/Matlab etc is that when you want to use a specialized function to analyze your data, the function usually isn't implemented in an efficient way (i.e. it will create temporary tables/vectors and perform operations that don't scale, etc).

      So effectively your rich exploration environment is unusable unless you refrain from using all but the simplest operations, or you write your own versions of commands from scratch with efficiency in mind.

      This is _particularly_ noticeable with graphics. Try plotting a terabyte dataset _entirely_ on the screen in 3d and rotating it.

    9. Re:Did I seriously miss something? by plopez · · Score: 1

      If you know Lisp and OOP, R is easy. Unfortunately Lisp has become arcane and most programmers I have met did not understand OOP.

      --
      putting the 'B' in LGBTQ+
    10. Re:Did I seriously miss something? by plopez · · Score: 2

      32 bits? are you serious?

      --
      putting the 'B' in LGBTQ+
    11. Re:Did I seriously miss something? by ceoyoyo · · Score: 1

      It wasn't even that. It came down to one of the last paragraphs:

      "In my [limited and misleading] experience...."

      Python isn't good at visualization? I guess the author has never used VTK-Python or Matplotlib. R isn't good with big data? I suppose that comes from R not having great database interactivity... so just feed it data via Python using rpy2.

    12. Re:Did I seriously miss something? by Anonymous Coward · · Score: 0

      Right. I'm just saying that if you are insistent on using badly-written GUIs on the 32-bit version using flatter files that you can get around them. And the only reason I say that is that all the people at the top of my organization who are persuaded by charts like those in TFA would be the ones using it on their 32-bit computers with information they entered into excel or called from the databases into an excel spreadsheet that they then try to use in an R GUI.

      Thankfully, R has more than one way to skin a mouse, as other responses to my post have pointed out.

    13. Re:Did I seriously miss something? by Ruie · · Score: 1

      And what he has is flawed as well. For example, he marked R as having issues with big data, which is quite wrong - I routinely analyze multi-GB datasets in memory, and my databases go into TB.

      Dude. That's not what people mean when they say big data. HP and Dell will both quite happily sell you machines with 2TB of main memory, and SGI will go to 16TB, and anything which can fit in memory on a single machine without custom hardware isn't big data. It's only big data once you get up to a few hundred terabytes.

      Heh ! I am sure I can use R on such hardware, as long as I have access to it ;)

  3. I wish he had learning resources. by Anonymous Coward · · Score: 4, Insightful

    I wish there was also a column for the availability of learning resources like tutorials, free books, example code, etc.

  4. Never selected that way by vlm · · Score: 4, Insightful

    how analysts can rely on open-source software

    I've done that kind of stuff at work and those criteria are NEVER how a package is selected.

    If I need a commercial product I need all manner of signoffs requiring at least weeks of delay and massive IT involvement so they can insert it into windoze images automatically or whatever it is they do.

    If I'm doing FOSS it just ... gets done that day. No agony. And it just works, and instead of a call center script reader in India who can only tell me to reinstall the software over and over, with FOSS the "whole internet" is my support system, and they (as in the whole internet) know what they're doing.

    Nothing about this has changed in about 15 years, so I'm not sure how this is "news". This would have been a good "news" story in the early/mid nineties.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    1. Re:Never selected that way by Anonymous Coward · · Score: 1

      Even with proprietary software, the "whole internet" can support your system, it is bunk to say that only happens with FOSS.

      And to say it just works is bunk too, I see plenty of problems with FOSS where the "whole internet" has no f*cking clue other than, go to the source and figure it out yourself - not always a trivial exercise.

      But go ahead and keep believing your own bullshit.

    2. Re:Never selected that way by Sebastopol · · Score: 2

      This is a thinly veiled attempt to put Python on the same level as R. /shakes head/

      --
      https://www.accountkiller.com/removal-requested
    3. Re:Never selected that way by Anonymous Coward · · Score: 2, Insightful

      Besides, in research, using something opensource (or at the very least gratis) makes it that much easier for others to replicate what you did. Getting SAS scripts just isn't fun.

    4. Re:Never selected that way by seanzig · · Score: 1

      Absolutely - we all know that Python is much greater than R. ;-) Seriously though, I know where he's coming from, but it really should have had better explanations regarding his ratings for each language. For example, if one uses the Visualization Toolkit (VTK, www.vtk.org), it has Python bindings. I think the author simply doesn't know about that.

    5. Re:Never selected that way by Anonymous Coward · · Score: 0

      Where I work the whole Cisco fiasco put the fear of god into the high level suits. The fallout is a huge and cumbersome process for getting approval to use FOSS tools... even though we aren't modifying or distributing them. It's to the point where it's less headache to _buy_ something than go through the lengthy FOSS approval process.

    6. Re:Never selected that way by Anonymous Coward · · Score: 5, Insightful

      I'm an astronomer. At this point in my career, I move to a new research institution every couple of years. Each institution may have a site licence for some piece of commercial software like IDL or Matlab, but I use free software (Python, in my case) because I know that I can keep using it, rather than rewriting all my scripts for a new language every time I move.

    7. Re:Never selected that way by MikeBabcock · · Score: 1

      ... except the 'whole internet' often says "too bad, you'll have to wait for a fix" with proprietary software whereas "Oh, try this patch over here" often happens on FOSS instead.

      --
      - Michael T. Babcock (Yes, I blog)
    8. Re:Never selected that way by Anonymous Coward · · Score: 0

      In big companies there are often oppressive/cautious/conservative software restrictions that don't allow one to find a FOSS tool and use it, so we are stuck going through corporate IT for commercial software.

    9. Re:Never selected that way by hawguy · · Score: 1

      Where I work the whole Cisco fiasco put the fear of god into the high level suits. The fallout is a huge and cumbersome process for getting approval to use FOSS tools... even though we aren't modifying or distributing them. It's to the point where it's less headache to _buy_ something than go through the lengthy FOSS approval process.

      What was the Cisco fiasco? My company uses Opensource tools routinely and I've never even heard of the Cisco fiasco.

    10. Re:Never selected that way by Anonymous Coward · · Score: 0

      The FSF took on CISCO over improper use of GPLed code. I too have seen it put the "fear of god" into management. They don't actually understand why it happened or why it's non-applicable or the difference between using eclipse or svn on your workstation and including GPL code in your product .. the message that got through was using FOSS == getting sued.

  5. What an awful article. by Anonymous Coward · · Score: 1

    n/t

  6. Superficial and arbitrary by Anonymous Coward · · Score: 0

    As someone who regularly programs in all three of those languages I'd like to point out that the comparison is completely arbitrary. This is one of the most lazily writting articles I've seen Slashdot link to.

    1. Re:Superficial and arbitrary by MattBecker82 · · Score: 1

      This is one of the most lazily writting articles I've seen Slashdot link to.

      Mod +1 Ironic

  7. More crap from /. by NoMaster · · Score: 4, Insightful

    "Here is a breakdown of R, Octave and Python ..."

    No there isn't - there is not much more than a shitty 'feature' table, too high level to be anything other than facile, which is "Based on [the author's] own user experience and research".

    As a student user of all 3 I would have been interested in reading a good comparative review or explanation aimed at outsiders. This ain't it; it's just more slashvertising.

    --
    What part of "a well regulated militia" do you not understand?
    1. Re:More crap from /. by Anonymous Coward · · Score: 1

      Yes, but the advantage of the author's approach is that it'd be real easy to extend the review to include Scilab.

  8. Low Quality Article by Anonymous Coward · · Score: 0

    This is a really low quality article. Ironically, even though it's a /.-BI article, it's not up to /. quality.
    I had a colleague ask me recently about the strengths and weaknesses of R, Octave, and Python. When I saw the summary of the article, I was about to send the link to him. Then I read the article. Forget that.

  9. Or if you can't make up your mind by Anonymous Coward · · Score: 2, Interesting

    Sage math http://www.sagemath.org/

  10. Julia? by Chrisq · · Score: 3, Informative

    There was a previous article about Julia which looked cool. I wonder how this measures up

    1. Re:Julia? by Anonymous Coward · · Score: 0

      Julia is (AFAIK) a compiled language, intended to be high-performance in every aspect (like Fortran). R, Octave, and python are all interpreted (with some exceptions for e.g. pypy), and can only perform well through calling C/Fortran/Julia functions where the overhead is small compared to the computation done in lower-level languages. I don't think anyone uses Fortran for data analysis unless you get into piles of disks and heavy algorithms, and I believe that's the crowd Julia is targeted at. If you want to calculate a few statistical numbers out of a few GB of data, python/R/octave takes (at most) a few hours, but who cares anyway...
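
      The point about interpreted languages performing well only when the heavy lifting happens in compiled code can be seen with a small timing sketch. This is an illustration under stated assumptions: it compares a hand-written Python loop against the built-in sum(), which runs in C in the CPython interpreter. The helper name loop_sum and the data size are made up for the example, and the measured times will vary by machine.

```python
import timeit


def loop_sum(values):
    """Add up values one at a time in interpreted Python bytecode."""
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    data = list(range(1_000_000))

    # Same arithmetic, two execution paths: the interpreter loop above
    # versus the built-in, which iterates inside compiled C code.
    t_loop = timeit.timeit(lambda: loop_sum(data), number=3)
    t_builtin = timeit.timeit(lambda: sum(data), number=3)

    print(f"interpreted loop: {t_loop:.3f} s")
    print(f"built-in sum():   {t_builtin:.3f} s")
```

      Both paths return the same total; only the time spent per element differs, which is the overhead the comment is describing.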

  11. Re:I h8 Python! by MetalliQaZ · · Score: 1

    Spoken like a man who earned a C in freshman year intro to programming, but for some reason didn't switch to a humanities major.

    --
    "Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
  12. Oh.. by Anrego · · Score: 2, Insightful

    Now that's just desperation.

    Come on .. keep this shit in bi. Either it takes off or it doesn't.

  13. Re:I h8 Python! by Anonymous Coward · · Score: 0

    Spoken like a man who works the receiving end of a glory hole on a nightly basis.

  14. Both! by Kludge · · Score: 3, Insightful

    The best option is to use python and R, through rpy for example.
    R rocks for statistical libraries and good documentation.
    Python rocks for everything else.

  15. SlashBI is very disappointing by Chuck+Chunder · · Score: 1

    It's full of puff pieces and press releases.

    I think a lot of Slashdot readers (me included) would be interested to get an introduction to various practical aspects of analytics, especially with Open Source tools we can experiment with ourselves. SlashBI could be a good gateway for that. So far every article I have read there has seemed like a waste of time.

    --
    Boffoonery - downloadable Comedy Benefit for Bletchley Park
  16. cheaper to implement depends on salaries by Anonymous Coward · · Score: 0

    An abacus is cheaper to implement than most things on a computer as long as you don't count developer time; pull out the Dick Feynman method from LANL in the 1940's and you are good to go.

  17. Read this while listening to the Mensroom segment by tehlinux · · Score: 1

    My suggestion is to try all three, and see which offering’s toolbox solves your specific problems.

    Well no **** Sherlock!

    --
    Most linux users don't know this, but the man pages were named after Chuck Norris. Chuck Norris fsck'ing hates noobs!
  18. I don't understand by utkonos · · Score: 4, Informative

    This article compares three languages that have different purposes. R's purpose is statistical analysis and visualization. Octave is a general mathematical analysis and visualization language. Python is a generalist language that has its own focus on code readability, among other things.

    These languages also have a target audience. R is for statisticians and scientists. Octave is for mathematicians, and Python is for programmers.

    1. Re:I don't understand by Anonymous Coward · · Score: 0

      But from a data analyst's perspective, all three could serve Machine Learning purposes.

    2. Re:I don't understand by Anonymous Coward · · Score: 0

      From a data analyst's perspective you use data analysis software when doing any serious work of which there are many OSS alternatives.

    3. Re:I don't understand by Anonymous Coward · · Score: 0

      Generally, you're right, but there is an ever growing list of Python modules for scientific computing and data visualization and I'd argue that even for non-programmers, it's starting to surpass Octave and is more competitive in terms of useful library functions and performance with commercial MATLAB (provided MATLAB compatibility is not a requirement). Sure, Python is a general purpose programming language, but it's growing into a full featured interactive numerical environment, like Octave and MATLAB, too.

    4. Re:I don't understand by Anonymous Coward · · Score: 0

      Python is for programmers.

      What's interesting is that part of Python's current popularity is because there is a large number of users who aren't programmers. SciPy and NumPy are super powerful data analysis libraries for python. Couple this with python's approachability for non programmers and you end up seeing a lot of people from the scientific community using it.

    5. Re:I don't understand by utkonos · · Score: 1

      Fantastic! When Python's libraries surpass what is available in CRAN (think CPAN but for R) I'll switch, and I'm sure everyone else will as well; they're both just tools. Statisticians use R because it's designed for statisticians. And that was my original point. The original article is strange because it is comparing apples and oranges. Plus, it was absolutely flame-bait, because there aren't really any R or Octave zealots. People that use them think of them as tools. The author compared them to Python to get the Python zealots to come out of the woodwork and make a stir around the article.

  19. At least this one brought some juicy comments! by Anonymous Coward · · Score: 0

    Hey guys, if you are interested in more details on these three packages and others, some of the comments in /. BI are pretty good (at least from my perspective). For example, one anonymous reader posted: "Both Octave and R have specific places in the pantheon of analytics, usually adjacent to their respective work-alikes. Unfortunately, there is no current operational Octave or R compiler (as in optimizing compiler), so in both cases, you have something interpreted. This isn't a terrible thing ... it's great for interactive debugging ... but performance on non-natively compiled code is horrible. Just try a dense LU decomposition on a large matrix (say 4k x 4k) just to see how painful it is compared to well optimized Fortran/C." ... Just check out the rest!
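
    To make the quoted LU example concrete, here is a minimal sketch of a dense LU decomposition written as plain interpreted loops, in Python. It uses the Doolittle scheme without row swaps, which is an assumption made to keep the example short (a production routine would pivot), and the names lu_decompose and matmul are hypothetical helpers. The triple-nested loop does on the order of n^3 steps, which is why running it in an interpreter on a 4k x 4k matrix is so much slower than a compiled Fortran/C routine.

```python
def lu_decompose(a):
    """Factor a square matrix into unit-lower L and upper U with L * U == a.

    Doolittle elimination without pivoting: assumes every pivot is nonzero.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(x) for x in row] for row in a]
    for k in range(n):
        pivot = upper[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no row swaps")
        for i in range(k + 1, n):
            factor = upper[i][k] / pivot
            lower[i][k] = factor
            # Eliminate column k from row i, one element at a time.
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


def matmul(x, y):
    """Multiply two matrices given as lists of rows (used to check the result)."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


if __name__ == "__main__":
    a = [[4.0, 3.0], [6.0, 3.0]]
    lower, upper = lu_decompose(a)
    print(lower)
    print(upper)
    print(matmul(lower, upper) == a)
```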

    1. Re:At least this one brought some juicy comments! by Anonymous Coward · · Score: 0

      As far as I remember, most of Octave's magic IS done in Fortran.

    2. Re:At least this one brought some juicy comments! by plopez · · Score: 1

      Oh oh.... you mentioned Fortran. Here come the "Fortran is ugly and out of date" posts. To nip it in the bud I will link to http://en.wikipedia.org/wiki/Fortran#Fortran_2008

      Check out Fortran 2008, which is way cool! Everything you could want from a modern programming language.

      --
      putting the 'B' in LGBTQ+
  20. Python does have data.frame.. by csirac · · Score: 3, Informative

    Through pandas, for a start. The SciPy/NumPy stack is quite nifty, I'm especially interested in how to apply it for working with irregular time series data.

    Not to say anybody should ditch R, I still support our researchers most weeks at work in using it. But it's not as clear-cut as you seem to think it is, especially in terms of memory efficiency.

    1. Re:Python does have data.frame.. by Ruie · · Score: 1

      Didn't know about this one - thanks !

  21. apples vs oranges by plopez · · Score: 1

    I still don't get it. How can you compare specialized statistical and number crunching languages with a general purpose programming language?

    --
    putting the 'B' in LGBTQ+