Interviews: Ask Author and Programmer Andy Nicholls About R
Andy Nicholls has been an R programmer and consultant for Mango Solutions since 2011 (where he currently manages the R consultancy team), after a long stint as a statistician in the pharmaceutical industry. He has a serious background in mathematics, too, with a master's in math and another in Statistics with Applications in Medicine. Andy has taught more than 50 on-site R training courses and has been involved in the development of more than 30 R packages; he's also a regular contributor to events at LondonR, the largest R user group in the UK. But since not everyone can get to London for a user group meeting, you can get some of the insights he's gained as an R expert in Sams Teach Yourself R In 24 Hours (available in print or at Safari), of which he is the lead author. Today, though, you can ask Andy about the much-lauded statistics-oriented free software (GPL) language directly -- why to use it, how to get started, how to get things done, and where those intriguing release names come from. (The about page is helpful, too.) As usual, please ask as many questions as you'd like, but one question at a time, please.
Note: Slashdot is always looking for interesting interview guests. Who do you want to ask? Let us know!
Is that a pirates-only language?
How has the way you use R changed over time? For myself, I don't think I've gone through an entire R session in the past six months without loading dplyr. Combine that with the pipeline operator and I think if you'd shown the R code I wrote yesterday to the me of two years ago, I wouldn't have believed it was the same language.
What's your take on the future of R? It used to be that it was a tool for statisticians, and now it's been discovered by programmers. As a statistician who's not a programmer, but who hangs out sometimes on Slashdot and Stack Overflow, it sometimes feels like it's in danger of becoming just another language for programmers, instead of a tool for statisticians. Should I be worried? Can it be both? Is this mass inflow of programmers going to change it somehow? Or am I just having a "get off my lawn" moment?
More about me: I'm a PhD statistician at a major public research university, and use R every day for data manipulation, exploration, and analysis, and have for 10+ years. I've done a few packages and enough coding that I know most of R's quirks, but would not consider myself a programmer.
In your view, what are the key advantages of R over other scientific computing languages, most notably Matlab (which has to be considered with its plethora of toolboxes of course)?
Hoisting the AC for asking a good question.
To add on: R is gaining massive traction in graduate programs but so many professors teach it like it's SPSS, almost as a cargo cult coding language, and so much of the documentation is written for people who are already experienced coders. Is there any decent introduction to R for someone that doesn't already know it (or another programming language) fluently?
A bullet may have your name on it but splash damage is addressed "To whom it may concern."
There's an entire book, the R Inferno, dedicated to R's many "quirks" and problems. Is there ever a plan to dedicate some time to focusing on cleaning up the language and making it less painful to use?
In my experience (from searching for R advice online - I've never mailed the R discussion list myself) the R community is incredibly harsh and unforgiving of new users. Answers to beginners' questions are normally brusque - often extremely so. (I remember one exchange, where a user basically asked "I've read the documentation for par, and I don't understand ...", and the response was, in its entirety, "?par" -- which, for those unfamiliar with R, is the command to bring up the documentation for par.)
On the statistical end of things, too, the community seems less than helpful. My impression is that it's normally assumed that all R users have good (graduate student-level) backgrounds on the statistical aspects, and little to no consideration is given to those who might not be up to speed on the theoretical basis of some of the functions in R, or who haven't read the (pay-walled, mathematically dense) 1963 paper where the method was first described.
What are your thoughts on the helpfulness and "beginner friendliness" of the R community? Do you think there might be an issue with going from a very hand-holdy "Teach Yourself In 24 Hours" type work and being abruptly dumped into a much more brusque "why are you asking us? - figure it out yourself!" type environment?
I encountered R via Johns Hopkins University's data science series of Coursera courses which I highly recommend. The first one is at https://www.coursera.org/learn...
As a mainly Python programmer, but someone with an eclectic interest in programming languages (I enjoy Prolog, Lisp, ML...), I've found R very intriguing: it's a very "functional" programming language, but also object oriented (using dollar signs instead of the customary dots). I've also found R to be incredibly quick -- provided you know and use the right built-in functions. I once tried to solve an assignment with a for loop and killed the process after it hadn't finished within a day. Using "aggregate" did the job within an instant of pressing enter.
I've found R to have numerous strange quirks I haven't got the hang of, sometimes producing weird results which I can't debug. The Coursera course mentioned above teaches a style of R I'm not particularly fond of, one that relies on various libraries, which I'm ideologically opposed to in the same way I prefer battling with JavaScript directly rather than learning jQuery as an intermediary "dialect".
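For anyone who hasn't seen the loop-versus-aggregate contrast, here is a minimal sketch. The data frame and its column names (group, value) are made up for illustration and are not from the course assignment; on a toy example this small both versions finish instantly, so it only shows the shape of the two approaches, not the speed difference.

```r
# Hypothetical toy data: 'group' and 'value' are illustrative names only.
df <- data.frame(group = rep(c("a", "b", "c"), each = 4),
                 value = 1:12)

# Loop version: compute one group mean per iteration and
# store it in a named numeric vector.
loop_means <- numeric(0)
for (g in unique(df$group)) {
  loop_means[g] <- mean(df$value[df$group == g])
}

# Built-in version: aggregate() splits by group, applies mean(),
# and recombines the results into a data frame in one call.
agg_means <- aggregate(value ~ group, data = df, FUN = mean)

print(loop_means)
print(agg_means)
```

Both give the same per-group means; the second simply hands the grouping work to a base R function instead of spelling it out by hand.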
What are your pointers for the "right way" to program in R?
If it works, it's obsolete
What topic(s) in statistics do you think students can learn easier today using R than years ago when there was nothing like R widely available?
I feel that one of the weakest points of R is the error handling, reporting, and debugging available. Do you have advice on tools or techniques for people coding in R (aside from using RStudio)? Are there plans for improvements in this area? The current facilities are reminiscent, at least to me, of using gdb back in the 1990s.
I have in mind cases like the following, in which a confusion about list access using the [ operator (when the [[ should have been used) provides a cryptic error message with no traceback available.
> symlog_scaler <- list(linear_to=2.5, abscissa=2.0,
+ scaling_function=function(x,linear_to=2.5,abscissa=2.0){
+ y <- x; linear_to = abs(linear_to); big_ix = (linear_to<x)
+ y[big_ix] = linear_to + log(1+(x[big_ix] - linear_to), base=abscissa)
+ small_ix = (-linear_to>x)
+ y[small_ix] = -(linear_to + log(1+(-x[small_ix] - linear_to),base=abscissa))
+ y})
> symlog_scaler$scaling_function(-5:5)
[1] -4.307355 -3.821928 -3.084963 -2.000000 -1.000000 0.000000 1.000000 2.000000 3.084963
[10] 3.821928 4.307355
> symlog_scaler['scaling_function'](-5:5)
Error: attempt to apply non-function
> traceback()
No traceback available
>
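A stripped-down sketch of the same confusion, in case it helps frame the question. The list here (scaler) and its toy function are stand-ins, not the symlog code above. It shows that single brackets hand back a one-element list rather than the function, and that tryCatch() can at least capture the message; the empty traceback is presumably because the failure happens at top level, before any function frame exists.

```r
# 'scaler' is a hypothetical stand-in for the list in the transcript.
scaler <- list(linear_to = 2.5,
               scaling_function = function(x) x * 2)

# Single brackets return a sub-list of length one, not the function.
sub <- scaler["scaling_function"]
is.list(sub)      # TRUE
is.function(sub)  # FALSE

# Double brackets (or $) extract the element itself, which is callable.
f <- scaler[["scaling_function"]]
is.function(f)    # TRUE
doubled <- f(1:3)

# tryCatch() intercepts the error so its message can be inspected
# instead of aborting the session or script.
msg <- tryCatch(scaler["scaling_function"](1:3),
                error = function(e) conditionMessage(e))
print(msg)
```

Checking is.function() on whatever you are about to call is a cheap way to spot this before the cryptic error shows up.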