Slashdot Mirror


Ask Slashdot: Switching From SAS To Python Or R For Data Analysis and Modeling?

An anonymous reader writes "I work for a huge company. We use SAS all the time for everything, which is great if you have a bunch of non-programmer employees and you want them to do data analysis and build models... but it ends up stifling any real innovation, and I worry we will get left behind. Python and R both seem to be emerging stars in the data science game, so I would like to steer us towards one of them. What compelling arguments can you give that would help an old company change its standard if that company is pretty set in its ways?"

25 of 143 comments

  1. R... by Rockoon · · Score: 2

    This is what R is for.

    Why Python and not C or ERLANG or COBOL? ..

    --
    "His name was James Damore."
    1. Re:R... by MightyYar · · Score: 4, Informative

      R is definitely still ahead for data modeling, but Python has some advantages too. With a bigger set of modules (libraries) to choose from and high popularity in the financial sector, there are big improvements all the time. For the purposes of this discussion, the most important Python modules are:
      IPython: powerful interactive shell
      numpy and scipy: numerical, matrix, and scientific functions (matlab-ish)
      pandas: R-like data structures and data analysis tools (analysis mostly limited to regression)
      statsmodels: statistical analysis, complements pandas
      scikit-learn: machine learning

      So can Python do everything that R can? No. Or, at least, not as easily. But it is improving in that direction quite quickly, and if Python's data analysis capability meets your needs, then you can likely do everything in one language instead of calling R routines from another.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    2. Re:R... by radtea · · Score: 4, Interesting

      So can Python do everything that R can?

      No, but Rpy can.

      I've used R, and it really has a lot of strong points, but I prefer to access it these days via Rpy, which gives me all the power of R along with everything else I get from Python (other libraries, better application development frameworks, etc.)

      Both R and Python are real programming languages that are going to be completely useless to non-programmers, so neither of them is a SAS replacement, but of the two, I'd choose Python+Rpy over R for flexibility, power and ease of use (the latter is of course a strongly personal preference... if you really think like a traditional stats geek R will likely seem nicer, as it is clearly created for and by such people.)

      --
      Blasphemy is a human right. Blasphemophobia kills.
  2. Pandas by MightyYar · · Score: 4, Interesting

    Python and R are sort-of converging via Pandas. I'm partial to Python, but Pandas really starts to blur the lines conceptually.

    --
    W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    1. Re:Pandas by joeblog · · Score: 2

      Using R vs Python with Pandas brought home the microlanguage vs libraries debate for me. I'm more experienced and comfortable programming in Python, so I generally prefer it. But writing a program to solve the same problem in R and in Python, I found the R version would be much faster. On the other hand, the Python version tended to give the correct answer, whereas the R version tended to have weird bugs I couldn't figure out.

      As an open source enthusiast, I'd say an unfortunate advantage Python has is its "benevolent dictator" rule for libraries. R (as with Perl, TeX...) has a bewildering number of contributed packages for any given problem, some of which once worked well with old versions, others were never developed properly... so users are left with the frustration of finding something that works.

      Python, on the other hand, comes with "batteries included" with a few external libraries like Pandas that are well supported. So unless speed is a big deal, I'd advise Python.
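      A minimal sketch of the "batteries included" point, assuming a small table of sales by region (the data and the `summarize` helper are made up for illustration). It uses only the standard library; with Pandas the same grouping would be shorter, but nothing here needs an external package.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical sample data; in practice this would come from a file on disk.
RAW = """region,sales
north,120
south,95
north,130
south,105
north,125
"""

def summarize(text):
    """Return {region: (count, mean, sample stdev)} for a small CSV table."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["region"]].append(float(row["sales"]))
    return {
        region: (len(values), statistics.mean(values), statistics.stdev(values))
        for region, values in groups.items()
    }

summary = summarize(RAW)
for region, (n, mean, sd) in sorted(summary.items()):
    print(f"{region}: n={n} mean={mean:.1f} sd={sd:.2f}")
```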

      --
      If it works, it's obsolete
  3. Innovation is more than tools by Abroun · · Score: 5, Insightful

    It's unlikely that SAS is the root cause of a lack of innovation, so it's unlikely that introducing a new tool by itself will make a difference. The fact that you work for a 'huge company' is more likely the problem. Does senior management agree that innovation is a priority? Are they willing to make the changes to encourage it (which usually means breaking down fiefdoms, giving up power, and lots of things that senior managers hate doing)? The choice of language is kinda irrelevant absent the right environment.

  4. Cost by TemporalBeing · · Score: 4, Informative

    The cost of training them to use R will be significantly lower than what you are spending on the SAS licenses, which (last I knew) were a yearly purchase for each user.

    And yes, while I have not used R myself, I would certainly recommend it over Python for this use case as it is very dedicated to doing the kinds of things that SAS is good at in a very efficient, friendly manner. I've seen a number of people use it to do some very neat statistical analysis, and their stuff was a lot simpler than the SAS scripts that I used to write years back.

    --
    Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
    1. Re:Cost by Sobrique · · Score: 2

      Slightly different beasts I think. R is a really impressive analysis tool. Python is a scripting language. The latter is quite a bit more versatile, but ... probably isn't the right tool to solve the problem outlined in the OP.

    2. Re:Cost by OnioOnio · · Score: 2

      This is absolutely true, especially if you work for a huge company. Big companies that are hooked get raked over the coals, and I guarantee that your SAS licenses are costing you millions every year - yes millions (unless you're academic...). The company I work for is stuck with this very dilemma, where the more programming-oriented love and use R, whereas those trained in house with little programming background get stuck in the SAS rut (I do think it's easier for beginners to use SAS, but I wholly agree with you that it's likely stifling innovation). R is still a good intermediary, and will be easier to get up and running for many employees than Python. In fact, check out RStudio if you haven't.

  5. What's the business case? by mwvdlee · · Score: 5, Insightful

    Is it your feeling that SAS is "stifling any real innovation" or do you have examples of projects that are impossible with SAS but possible with Python or R?
    Do those example projects actually help the bottom line of the company or are they just "cooler"?

    If you can think of examples that have clear financial benefits to the company, you have a solid business case already.
    If there are no such examples or other factors negate the benefits, then the company has nothing to gain by switching and should not switch.

    Short answer: if you're asking on Slashdot for reasons to switch from product X to product Y, you probably have no real reason to switch.

    --
    Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
  6. There are no compelling arguments... by Rob Riggs · · Score: 4, Insightful

    Emerging? They were emerging a decade ago. They have emerged. Look, if the company is, as you say, "set in its ways", that is a cultural problem. Unless you are an executive who gets to set goals and compensation, you have very little influence over it. If that is not you, either stay and live with what you have, or leave for greener pastures. The basic question you have to ask yourself is "how will staying here using these outdated tools affect my lifetime earnings potential?" Put another way: "are they paying me enough to put up with this shit?" That is my prime criterion for deciding whether to stay at any job. Your job is to make recommendations. I assume you have already done that and been shot down. Decision time: should I stay or should I go.

    --
    the growth in cynicism and rebellion has not been without cause
  7. Apples, meet Oranges... by Shoten · · Score: 4, Insightful

    SAS is not a language; it's a full multi-tiered solution for the aggregation, normalization, and analysis of data. There's a language as well, but that's just one part of the whole solution. Python and R, while absolutely fantastic languages, are not a full solution.

    So, first step...if you're going to offer an alternative, actually have an alternative. I don't know your SAS buildout nor do I know the data sources it consumes, so I can't really point to what else you need to add or how you need to construct it to produce a more flexible replacement to your existing and current SAS infrastructure.

    Second step...a roadmap for migration. It's one thing to sign a lease for a new apartment or to buy a new house, and another to shift your life from the old place to the new. If you don't have a plan, at least in broad strokes, then you're going to be doomed when you look for executive sponsorship. You need to make sure that you get all the stakeholders' input as well, lest you leave something out in your roadmap...and then end up with someone who sees you as a problem. That person will most likely be in a position to scuttle the whole thing, as well.

    Third step...figure out how to define the benefits in terms of the stakeholders' needs. You're going to replace a system they use; why should they want you to do so? And you have to define it from their perspective, with regard to things they care about. Beware of getting geeky on this...it's very likely that at least one of the people whose support you will need will not be a geek and will be concerned with the output more than the technical means used to produce it. Don't hard-sell, either...pushing too hard will get the door slammed in your face, and even potentially polarize people against you. (See above, under "in a position to scuttle the whole thing.")

    There will be steps after that, but those will be largely determined by how the first three steps go. It may involve bringing in outside vendors, doing requirements analysis...a lot of it depends on details of your company as well and how they normally do things. But above all else, remember this: don't buck the system too hard, and don't knock the company you work for. Trying to get a lot of people to support and cooperate with you while telling them that their way of doing things sucks is suicide.

    --

    For your security, this post has been encrypted with ROT-13, twice.
  8. I made the switch by TyFoN · · Score: 4, Informative

    Personally at least.
    I used to work in one of the largest banks in the world, and everything we did was SAS/MSSQL.
    I had some personal stuff in R, but most of the other analysts didn't seem interested beyond using what I made for them, except for one PhD in the German department. I never pushed it though, since there was so much legacy code, including code I had written myself.

    Now I have switched to a start-up bank, and I am the only analyst.
    I've used R/RStudio/Shiny with PostgreSQL in the back very successfully, with all code in git. Now I can produce good analysis much faster than I used to in SAS, and it can be viewed on any device, with the option of downloading the source data as Excel or CSV.

    The management loves this.

    If you show them a few good ones they will want more, but I wouldn't start to rewrite all the legacy code. SAS isn't bad when you have it set up properly.

    But another good thing about R is that you get access to innovation in the statistics fields faster, and you don't have to pay huge sums of money for extra features.

    RStudio and Shiny are a bit expensive for the pro versions, but nothing compared to SAS, and the open source versions are free.

    1. Re:I made the switch by nullchar · · Score: 2

      If you show them a few good ones they will want more, but I wouldn't start to rewrite all the legacy code.

      This. Submitter should build a few small projects that give a different end result than the current code base. If you're just swapping R for SAS but delivering the exact same output, no management will care. The sample projects need to report the data in different ways, or visualize the data, or even, as the parent suggested, simply provide a copy of the output as a spreadsheet.

      Innovation will come by thinking about the problem differently and exploring different ways to ask questions to gain insight into your business. If you're just crunching the same numbers, don't bother. For the submitter personally, it's great to learn R and Python, but don't expect an organization shift unless it provides something unique.

  9. You have to *demonstrate* that SAS is better by Nutria · · Score: 2

    Go do something in R or Python that is useful to the company but impossible or very difficult in SAS.

    Then show it to the hard-core SAS users. If they're interested, demonstrate it to your boss along with how it can save the company (and especially your cost center) money.

    --
    "I don't know, therefore Aliens" Wafflebox1
  10. R is better for non-programmers by DaBombDotCom · · Score: 3, Insightful

    In my experience, R is better for non-programmers precisely because it doesn't often behave like a typical programming language. It is *designed* for statistical analysis and so for someone just starting out it can be very intuitive.

  11. Re:Python FTW by DaBombDotCom · · Score: 2

    Sorry, but calling R from Python just doesn't cut it. Some of the best tools in R rely on complex data structures that are not compatible with Rpy. Plus Rpy support on Windows is abysmal. You are better off using Python for all non-stats scripting to get your data set up, then analyzing and plotting with R.
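    A rough sketch of the Python half of that split, assuming a small set of messy records (the field names and the `clean` and `to_csv` helpers are invented for illustration). Python tidies the rows and writes plain CSV; the stats and plotting would then happen in a separate R session, for example after loading the file with `read.csv`.

```python
import csv
import io

# Hypothetical messy input: stray whitespace, mixed case, missing values.
RAW_ROWS = [
    {"id": " 1", "group": "Control ", "score": "4.5"},
    {"id": "2", "group": "treated", "score": ""},
    {"id": "3 ", "group": "TREATED", "score": "6.1"},
]

def clean(rows):
    """Trim whitespace, normalize group labels, mark missing scores as NA."""
    cleaned = []
    for row in rows:
        score = row["score"].strip()
        cleaned.append({
            "id": int(row["id"].strip()),
            "group": row["group"].strip().lower(),
            # "NA" is the string R's read.csv treats as missing by default.
            "score": score if score else "NA",
        })
    return cleaned

def to_csv(rows):
    """Serialize cleaned rows to CSV text that a later R step could read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "group", "score"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(clean(RAW_ROWS))
print(csv_text)
```

    Writing to a string keeps the sketch self-contained; a real hand-off would write to a file path that both sides agree on.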

  12. Research and Recruitment by Alan Shutko · · Score: 3, Interesting

    I work for a large Fortune 25 company. We have an existing SAS presence and we do some good work in SAS. There are two main reasons that we are bringing R into our environment: research and recruitment/retention.

    R is extremely common across research right now. When a new paper comes out describing a new algorithm or modeling technique, the odds are extremely good that it comes with R source code. With R in-house, there is very little time or effort to try these things out to see if they can help our current work. With SAS, we would need to invest time recoding everything or worse, wait until it is baked into SAS itself. That is a huge barrier to adopting new approaches.

    Recruitment and retention are related to R's popularity in research. Let's face it, data scientists are a hot commodity right now. Lots of companies are looking to hire them and there aren't enough good people to go around. We're seeing that a lot of the new talent have been using R in their graduate work rather than SAS, and are interested in an environment where they can continue using R. Additionally, it's harder to retain people once you've hired them if they can't use what's become a lingua franca.

    SAS remains a great tool, and we're not going to get rid of it. Rather, we want to add R to the toolbox.

    (I don't mention Python here... We've got some folks working with Python, especially for NLP, but for the work we do there are a lot more folks using R across industry and academia.)

  13. As someone who moved from SAS 1 year ago... by Anonymous Coward · · Score: 3, Interesting

    I work in IT at a large company (>30k employees) who recently dropped SAS. Before we did, we tried out R but what we found out was that except for IT and some tech savvy engineers, nobody seemed to get anything done without help, even after training.
    We had decided to drop SAS due to the ludicrous license costs (at one point we were paying more on renewals than we did when we purchased it! WTF?) and due to some issues with their installation/upgrade process that they were not able to resolve within a reasonable timeframe. We ended up switching to StatSoft's STATISTICA, which has a much lower price point (~30% of what we paid for SAS), predictable renewal fees (20% of purchase price), vast feature set (in the Data Miner package we have), excellent Office integration and import/export compatibility with SAS data files. Oh, and it also features R integration so you can still use R from within it if you want. Users became proficient very quickly, after receiving some training.
    I recommend you consider their solutions... Open source is not always best, especially when it comes to borderline tech-illiterate business users.

  14. One vote for Python by werepants · · Score: 4, Informative

    Granted, I don't have much experience with R, but Python has some notable benefits - it is very well established and you can find tools to do just about anything. It is fast and easy to develop in, and very easy to learn thanks to the readability and plentiful resources online. I imagine you'll have an easy time finding people with Python experience, as well.

    I haven't used it for any "big data" tasks, but for a number of small, interactive data analysis utilities it has been really enjoyable to work with. One standout tool for me has been pyqtgraph, which is lightning fast and creates some really impressive interactive visualizations. It's also got some pretty incredible features out of the box - arbitrary user-definable ROIs, instantly changing any plot to log-log, or even doing a Fast Fourier transform with just a right click. If I sound like a fanboi, I kind of am - after trying to deal with the agony of 3D data manipulation in matplotlib (Python's MATLAB-style plotting package), it's a whole different world.

  15. Python is better overall but R is more like SAS by goombah99 · · Score: 4, Insightful

    R has more single-function, high-level commands devoted to stats; these are done right internally and are self-consistent with other functions for further processing. But it's not as general a programming language as Python. If you want something different from the canned functions in R, then you will need to write it yourself, at which point you might as well be using Python. However, if you like SAS, then chances are R will seem more like what you are hoping for.

    --
    Some drink at the fountain of knowledge. Others just gargle.
  16. Belief vs Experience by westlake · · Score: 2

    The cost of training them to use R will be significantly cheaper than what you are spending on the SAS licenses
    And yes, while I have not used R myself, I would certainly recommend it over Python for this use case

    So not having used R yourself, why do you believe it is the better and cheaper solution?

  17. R for Speed of Implementation, Python for Scale by manlygeek · · Score: 2

    There is a classic problem here. R is great for getting trained and productive VERY quickly. It has 4,600 packages that will do almost anything you need, and it handles some very sophisticated statistical methods right out of the box. What can't be done out of the box (or from the core download, since it's not really a boxed product) has likely been coded in a package, even very complex biostatistical and bioinformatics methods. R also has a lot of graphical data visualization functionality built in, extended by some awesome packages like ggplot2. Additionally, R does a great job with documentation, as it can inject data, visualizations and code into markdown documents, which makes publication a whole lot easier. R's functional/imperative/quasi-object-oriented approaches have their quirks (but then what language doesn't?).

    One thing to note, however, is that R is not in itself multithreaded, and it requires that all the data it is working on reside in memory. For very large, very complex data sets that could be a bit of a problem. So where R is great from a quick ramp-up perspective, Python will probably scale better to huge datasets in the tera- and petabyte range. It has come a long way, especially with scipy, numpy and the other packages listed above. So if you anticipate having to scale in this way, then Python may be a better long-term toolset.

    I like them both and use them both. I choose which one I am going to use for a project (and stick with the toolset for the whole project) based on dataset size, statistical/visualization complexity and documentation requirements. R tends to win out a bit more often for me.
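    On the in-memory point, one common way around the limit is to stream the data and keep only running totals. The sketch below is an illustration under assumptions, not a benchmark: `readings` is a made-up generator standing in for a source too large to load, and `running_stats` uses Welford's online algorithm to get the mean and sample variance in a single pass.

```python
def running_stats(values):
    """Welford's online algorithm: one pass, constant memory."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance

def readings(n):
    """Hypothetical stand-in for a data source too large to load at once."""
    for i in range(n):
        yield float(i % 10)

# Only one value is held in memory at a time, however long the stream is.
count, mean, variance = running_stats(readings(1_000_000))
print(count, round(mean, 3), round(variance, 3))
```

    The same streaming idea works for anything that can be updated incrementally (counts, sums, min/max); methods that need the whole dataset at once still need a different approach.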

    --
    Be More, Be Manly, The Manly Geek Ubergeek Extraordinaire Blogger: www.manlygeek.com/blog Podcaster: podcast.man
    1. Re:R for Speed of Implementation, Python for Scale by MightyYar · · Score: 2

      It isn't, but many of the modules are written in C or other thread-capable languages. For instance, if you are using scikit-learn to analyze a dataset with a machine-learning algorithm, your Python code will run on a single processor, but the calls into scikit-learn that do the heavy lifting can distribute across cores.
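      The same pattern can be seen with the standard library alone. This is a sketch, assuming four made-up 8 MB buffers as the workload: the Python code is ordinary threaded code, but `hashlib` is implemented in C and releases CPython's global interpreter lock while hashing large inputs, so the threads can overlap on separate cores. How much faster it runs depends on the machine, so the sketch only checks that both routes give the same answer.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload: four 8 MB buffers standing in for chunks of a dataset.
CHUNKS = [bytes([i]) * (8 * 1024 * 1024) for i in range(4)]

def digest(chunk):
    """Hash one chunk; the heavy lifting happens inside C code."""
    return hashlib.sha256(chunk).hexdigest()

# Sequential baseline, all on the calling thread.
sequential = [digest(chunk) for chunk in CHUNKS]

# Same work on a thread pool; threads can overlap while the C code runs.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, CHUNKS))

print(sequential == threaded)
```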

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  18. Suck it up and Program in SAS by ichabod801 · · Score: 3, Insightful

    I used to be you, almost exactly. Almost everything we do at work is in SAS, and I was pushing hard for R and Python and getting nowhere. I hated SAS because it was so clunky and out of date. So many SAS programs are bad because they're being done by statisticians with no programming background. Then I went to NESUG a few years ago and saw presentations by the likes of Whitlock, Dorfman, and others, and realized serious programming *was* being done in SAS. I resolved to just become the best SAS programmer I could. The first thing you need to do is stop programming Python in SAS. SAS is like Lisp in that it is a different paradigm, and not programming in that paradigm only makes things harder. Learn that paradigm. Learn the data step inside and out. Every time you have a %do loop, ask yourself if you can do it in a data step. Every time you wish you had OOP, ask yourself if you could represent the objects in a data set. Or learn the new ds2 data step that has OOP. Learn proc sql and know when it's better to use than a data step. That's what I did, and it took my SAS programming to a whole new level, and allowed me to innovate legacy code and transform the applications we were using. Because back when I was you, SAS wasn't the obstacle to innovation, I was.
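    For readers coming from Python, here is a loose analogy for the data step model, not SAS code. It assumes made-up account rows already sorted by key, and the `running_totals` helper is hypothetical. A data step loops over rows implicitly and can carry values from one row to the next; the sketch spells that loop out, resetting a total at the first row of each group and emitting one row at the last.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, already sorted by account (BY-group processing in SAS
# likewise expects sorted or indexed input).
ROWS = [
    {"account": "A", "amount": 10.0},
    {"account": "A", "amount": 5.0},
    {"account": "B", "amount": 7.5},
    {"account": "B", "amount": 2.5},
    {"account": "B", "amount": 1.0},
]

def running_totals(rows):
    """Emit one output row per account with a total carried across rows."""
    output = []
    for account, group in groupby(rows, key=itemgetter("account")):
        total = 0.0        # reset at the first row of each group
        n = 0
        for row in group:  # the row-by-row loop a data step runs implicitly
            total += row["amount"]
            n += 1
        # Write a single summary row once the group's last row is reached.
        output.append({"account": account, "n": n, "total": total})
    return output

totals = running_totals(ROWS)
for row in totals:
    print(row)
```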