R 3.0.0 Released
DaBombDotCom writes "R, a popular software environment for statistical computing and graphics, version 3.0.0 codename "Masked Marvel" was released. From the announcement: 'Major R releases have not previously marked great landslides in terms of new features. Rather, they represent that the codebase has developed to a new level of maturity. This is not going to be an exception to the rule. Version 3.0.0, as of this writing, contains only [one] really major new feature: The inclusion of long vectors (containing more than 2^31-1 elements!). More changes are likely to make it into the final release, but the main reason for having it as a new major release is that R over the last 8.5 years has reached a new level: we now have 64 bit support on all platforms, support for parallel processing, the Matrix package, and much more.'"
For someone who can't afford the license fees of SAS or Matlab, this is the best alternative out there. And in some cases a better alternative.
It's not well known, but R's accessibility support is far better. Here is an example from a paper accepted in the R Journal:
Statistical Software from a Blind Person's Perspective
A. Jonathan R. Godfrey
http://journal.r-project.org/accepted/2012-14/Godfrey.pdf
pie(c(85,15),init.angle=25,col=c("yellow",1),labels=c("pacman","not pacman"))
Are you aware of better alternatives?
Buck lease && nun b sun
Julia: http://julialang.org/
As in: when they release it, you can trust it to work.
Hence they didn't mess around with major reconstruction of R's guts until they could release something that's finished (and well-tested !) and bumped the version number to 3.0.0 when they did in order to properly differentiate it from previous versions.
This is one of the differences between amateur OSS offerings (like for example KDE with its myriad half-baked Kxxx packages, sundry horrible OSS games, etc.) and genuine production-quality OSS (like R, Lapack, Octave, LibreOffice, PostgreSQL, MySQL, GRASS GIS, QGIS, MariaDB, GNU CC, the Linux kernel etc.)
This is very gratifying as R happens to see widespread use in academia, government and business when it comes to data analysis and statistics.
If R has a weakness, it is that it uses an in-memory approach to data processing, unlike e.g. SPSS, which keeps almost nothing in memory and simply makes passes through data files whenever it needs something. R is also a bit memory-hungry, so the need for genuine 64-bit implementations should be clear.
Apart from sporting about 4000 useful and ready-to-run statistical applications packages, R has convenient and efficient integration with C code and has what's probably a contender for the best support for data-graphics anywhere.
For those who didn't know, even packages like SPSS and SAS have incorporated R interfaces to tap into the wealth of application packages that R offers. Can't think of a more significant compliment right now.
I recently switched my scientific programming from R to Python with NumPy and Matplotlib, as I couldn't bear programming in such a misdesigned and underdocumented language any more. R is fine as a statistical analysis system, i.e. as a command line interface to the many ready-made packages available in CRAN, but for programming it's a perfect example of how not to design and implement a programming language. It's also unusably slow unless you vectorise your code or have a tiny amount of data. Unfortunately, vectorisation is not always possible (e.g. the algorithm may be inherently serial), and even when it is, it tends to yield utterly unreadable code. Then there is the dysfunctional memory management system, which leads you to run out of memory long before you should, and documentation, even of the core library, that leaves you no choice but to program by coincidence.
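To illustrate what "inherently serial" means here, a minimal sketch (in Python, and not taken from the poster's own code): simple exponential smoothing, where each output needs the previous output, so the loop cannot be rewritten as one independent element-wise operation. The function name and the sample numbers are made up for illustration.

```python
def exponential_smoothing(xs, alpha):
    """Return the smoothed series s, where s[0] = xs[0] and
    s[t] = alpha * xs[t] + (1 - alpha) * s[t - 1] for t > 0."""
    smoothed = []
    prev = None
    for x in xs:
        # Each step depends on the result of the previous step.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# Hypothetical input data.
print(exponential_smoothing([1.0, 3.0, 5.0], 0.5))  # [1.0, 2.0, 3.5]
```

In an interpreter with slow loops, this is exactly the kind of code that ends up being pushed into a compiled helper.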
As an example of a fundamental problem, here's an R add-on package that has as its goal to be "[..] a set of simple wrappers that make R's string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA's and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.". Needless to say, there is absolutely no excuse for having such problems in the first place; if you can't write consistent interfaces, you have no business designing the core API of any programming language, period.
Python has its issues as well, but it's overall much nicer to work with. It has sane containers including dictionaries (R's lists are interface-wise equivalent to Python's dictionaries, but the complexity of the various operations is...mysterious.) and with NumPy all the array computation features I need. Furthermore it has at least a rudimentary OOP system (speaking of Python 2 here, I understand they've overhauled it in 3, but I haven't looked into that) and much better performance than R. On the other hand, for statistics you'd probably be much better off with R than with Python. I haven't looked at available libraries much, but I don't think the Python world is anywhere near R in that respect.
Anyway, for doing statistics I don't really think there's anything more extensive out there than R, proprietary or not, although some proprietary packages have easier-to-learn GUIs. In that field, R is not going to go anywhere in the foreseeable future. For programming, almost anything is better than R, and I agree that those improvements you mention are not doing much to improve R's competitiveness in that area.
From Julia's web page:
"Currently, Julia is available only for 32-bit Windows."
Julia is available in 64 bits on other platforms, but posting it as a reply in a thread that was complaining how late R is to the 64 bit game is a bit rich. R has had 64 bit releases for all platforms for 3 years now. What's new in 3.0.0 is the removal of the remaining 32 bit limit on individual objects.
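To put the old per-object limit in numbers, here is a back-of-the-envelope sketch (Python used only for the arithmetic). The 2^31-1 figure is from the announcement; the 8 bytes per element is an assumption for double-precision values, and the helper name is made up.

```python
# Rough arithmetic for a vector capped at 2^31 - 1 elements.
MAX_ELEMENTS = 2**31 - 1   # limit quoted in the announcement
BYTES_PER_DOUBLE = 8       # assumption: IEEE 754 double precision


def max_vector_gib(bytes_per_element=BYTES_PER_DOUBLE):
    """Size in GiB of the largest vector that fits under the old limit."""
    return MAX_ELEMENTS * bytes_per_element / 2**30


print(MAX_ELEMENTS)                 # 2147483647
print(round(max_vector_gib(), 2))   # 16.0
```

So a single numeric vector topped out at roughly 16 GiB, which a 64-bit machine with enough RAM could otherwise hold without trouble.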
Even I have used R in the past, for my thesis. My statistician was using S-plus to do magical things that the hospital's SPSS definitely could not do.
However, S-plus was not available to us non-statisticians.
As a complete non-programmer and mediocre statistician, I was able to reproduce and build upon his examples in R.
But what I truly missed was a usable GUI. There were some, and I tried them all at the time, but none were able to do more than the basics. For someone using R daily, a GUI will be more trouble and too limited. But for someone like me, a well-developed GUI like S-Plus had at the time would have been more than welcome.
Seeing the headline R 3.0.0, the first thing I looked for was: did they include a GUI by default???
Why are other peoples sig's always more witty ???
Hard to know where to start, especially as you give no information on your target audience...Do they know stats already?
Also, if your target audience is used to GUIs rather than CL, then...
http://answers.oreilly.com/topic/954-introducing-the-r-graphical-user-interface/
Alternatively, you could use the web front-end here (disclaimer: I've not tried it):
http://www.squirelove.net/r-node/doku.php
Writing a tutorial from nothing is hard. You can do this to get some good ideas:
1. Download a free evaluation copy of 'Minitab'. :) Obviously, don't just rip off their stuff; not cool
(I'm not connected with Minitab, but I've used it a lot, and it's great 'basic' stats analysis software)
2. Install, and then open help
3. Consult 'tutorials' section
As a suggested flow, I've found that, as a start, you can introduce basic stats, then demonstrate how the software works.
Using the same data-set for the first few, (say ten), lessons is better. Minitab tutorials keep changing the data, which confuses students.
You'll only need 5 columns or so, and remember to include some discrete variables to enable stratification of your continuous variables.
Use a real-world example, such as household expenses for different families, whatever.
For tutorial flow, what works for me as a 'basic' intro to a stats package:
1. What is data? What are statistics?
2. Types of data, how they look as raw data, (in the database) and then once we start to analyse them with stats and graphs (to start, just 'common' stuff like continuous variables, normal & lognormal, and discrete, binomial & poisson).
3. Basic stats & graphical analysis for single variables. Normality tests. Include time series plots as well as histograms / dotplots / boxplots.
4. Multivariate analysis; x/y charts, matrix plots, interaction plots.
5. Hypothesis tests (for both continuous & discrete variables)
6. Regression, (simple, then multiple if you're feeling brave)
7. Control charts (for both continuous & discrete variables)
If you work out how to do this in 'R', by actually using it, your tutorial will pretty much write itself, (keep saving your screens - Irfanview is a great, free, tool I use for this. Install, open, hit 'C' for manual or automatic screen save options.)
RPy2. I never touch actual R code because I agree with you - the language itself isn't as bad as some, but it's not good either. RPy2 lets you have access to R without having to actually code in it.
I can somewhat relate to the documentation issue although I believe that it is more a question of organizing the documentation.
When you mention "a fundamental problem" you mention function implementations, thus library rather than language issues. R itself is an extremely expressive, functional (or rather multi-paradigm) language that can be programmed to run efficient code. Yet it is syntactically minimalistic without unneeded syntax (as opposed to all of the scripting languages perl/python/ruby). This makes it a truly postmodern language IMO. Efficiency can sometimes be a problem but the break-even point for implementing parts in say C/C++ is only slightly different than for other languages (say perl/python) and is enabled by an excellent interface (Rcpp package).
For myself the biggest change to make was to start thinking in functional concepts coming from a procedural background. Much of R criticism IMO stems from a failure to realize conceptual differences between functional and procedural programming. Another problem that might spoil the impression of R sometimes is the plethora of packages of highly varying quality.
Despite R's weaknesses as a programming language, R has such a large number of well-documented, well-tested, statistical functions with a wide array of arguments to vary that it is very difficult for another language to match. For example, maybe you want to build an arima time series model. OK, not too tough to find a library in Python or C++ that does that. Now what if you want to add an exogenous variable to the arima model? Maybe a seasonal component? Next maybe you want to automatically pick the best model according to AIC? Oops, make that BIC. Looking at it again maybe a Vector Autoregressive model is best. Or a VECM?
While I'm sure there are excellent implementations of all of these wrinkles in other languages, with R, I have great confidence that the functions that I want and need now and in the future are going to be there and are going to be implemented correctly, and kudos to the R team for giving us that kind of confidence.
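For anyone wondering why switching from AIC to BIC can change the pick, here is a minimal sketch of the two criteria using their textbook definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L). The candidate models, log-likelihoods and sample size are invented for illustration and are not the output of any real fit.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fits: (label, maximised log-likelihood, number of parameters)
candidates = [
    ("ARIMA(1,0,0)", -120.0, 2),
    ("ARIMA(2,0,1)", -117.5, 4),
    ("ARIMA(3,0,2)", -116.9, 6),
]
n_obs = 100  # hypothetical sample size

# Lower is better for both criteria.
best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_obs))

print(best_by_aic[0])  # ARIMA(2,0,1)
print(best_by_bic[0])  # ARIMA(1,0,0)
```

BIC penalises extra parameters more heavily once n is larger than about 7, so it tends to favour the smaller model, as it does in this made-up case.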
R does have a lot of problems, among the worst is loop performance. It really forces you to vectorize everything, which leads to less maintainable code, and is generally a coding technique that new hires coming from other languages will face a steep learning curve with. What I have found useful is to use R as a data exploration and model parameterization tool, but once the model is ready to be put into production, you can use the parameters calculated by R in an implementation in the language of your choice, e.g., C++.
I guess this is a long-winded way of saying that, as with so many questions of "which language is best," the real question is "which language is best for you and your application?" R is usually the best language only for people who regularly use such a wide variety of statistical analyses that they won't find a large part of what they need in the libraries of other languages. For me, I couldn't imagine working without it.
A new, easy to use, free, online R system is StatAce (www.statace.com). The GUI analysis is still in its infancy (only descriptives, correlation and OLS at this stage) but it supports any and all R code, many libraries, and has good data management (e.g. allows you to save results).
In the same category, MATLAB.
Otherwise you also have real programming languages, ranging from C to Python.
I can somewhat relate to the documentation issue although I believe that it is more a question of organizing the documentation.
One of the things that bothers me about the documentation is that there's often no distinction between interface and implementation. Instead of a description of what a function does, you get implementation details mixed up with what it approximately hopes to achieve, leaving you unable to see the forest for the trees.
When you mention "a fundamental problem" you mention function implementations, thus library rather than language issues. R itself is an extremely expressive, functional (or rather multi-paradigm) language that can be programmed to run efficient code. Yet it is syntactically minimalistic without unneeded syntax (as opposed to all of the scripting languages perl/python/ruby). This makes it a truly postmodern language IMO.
Well, there's only one implementation, so it's rather pointless that it could be implemented efficiently. The language specification isn't exactly good enough to create a competing, compatible implementation either. I agree that the syntax is minimalistic and that there's extremely little boilerplate, but I could really do with some way of defining data types (Python 2 is lacking there as well IMO), and namespaces...
Efficiency can sometimes be a problem but the break-even point for implementing parts in say C/C++ is only slightly different than for other languages (say perl/python) and is enabled by an excellent interface (Rcpp package).
Ah, the universal solution to problems with R: here's how to do it in some other language or software instead. Sorry for being sarcastic, but it's amazing how often effectively that advice showed up whenever I searched the web for a solution to some problem I encountered with R.
As an example of my experience, I use JAGS to fit models to data, and JAGS wants to have the model as a text file description. My model has a node for every combination of some 13000 sites and 11 years, and the text file gets to several tens of megabytes depending on model options. Creating it is basically a matter of running through all the combinations of sites and years, looking up some additional data, and spitting out a line of text describing them. My first implementation was very naive, nested for loops that essentially did a nested loop on the data. It generated output at several tens of kilobytes per second, getting slower and slower as it went on. I managed to speed it up by preallocating memory (R seems to not double the capacity of a vector when it runs out, as the C++ STL does, but add a constant extra amount, so that growing a vector made the loop run in quadratic time, except that when measured it actually seemed to be exponential, for who knows what reason.), pre-sorting data and changing to a merge join, and vectorising as much as possible. It now does about a megabyte per second, which is fast enough for my purposes. However, the code is now completely unreadable, and it's still not anywhere near what the hardware can do (PostgreSQL does the equivalent nested loop in less than a second). R turned what should have been a trivial programming task into a frustrating adventure, and the result is still not very good.
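For readers who haven't met the term, here is a minimal sketch of the "pre-sort, then merge join" idea described above, written in Python rather than R. The data, the output line format and all names are hypothetical; this is not the poster's code and not JAGS syntax. Lines are collected in a list and joined once at the end, which avoids repeatedly growing one big string.

```python
def merge_join(rows, lookup):
    """Join two key-sorted sequences in a single pass.

    rows:   (key, payload) pairs sorted by key, duplicate keys allowed
    lookup: (key, extra) pairs sorted by key, keys unique
    Yields (key, payload, extra) for every row whose key is in lookup.
    """
    j = 0
    for key, payload in rows:
        # Advance the lookup cursor; it never moves backwards.
        while j < len(lookup) and lookup[j][0] < key:
            j += 1
        if j < len(lookup) and lookup[j][0] == key:
            yield key, payload, lookup[j][1]


# Hypothetical (site, year) observations and per-site lookup data.
observations = sorted([(2, 2001), (1, 2001), (1, 2002), (3, 2001)])
site_info = sorted([(1, "forest"), (2, "meadow")])

lines = [
    f"site={site} year={year} habitat={habitat}"
    for site, year, habitat in merge_join(observations, site_info)
]
model_text = "\n".join(lines)  # build the output once at the end
print(model_text)
```

After sorting, the join itself is a single linear pass over both inputs instead of a scan of the lookup data for every row.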
For myself the biggest change to make was to start thinking in functional concepts coming from a procedural background. Much of R criticism IMO stems from a failure to realize conceptual differences between functional and procedural programming. Another problem that might spoil the impression of R sometimes is the plethora of packages of highly varying quality.
True, but this is really another instance of the don't-do-it-in-R solution, because those functional programming functions effectively just run your loop in C, rather than in R (if they don't forward the whole operation to a C scientific maths library), which makes the performance bearable. If R were really a multi-paradigm language, then you would be able to solve a problem procedurally as well if it happened to be the best way to do it.
If you just use R to run data through a package (which in my opinion is the quickest way to get a lot of value out of R) then the learning curve is tolerable. Less steep than for SAS (I think), but steeper than for SPSS.
On the other hand: R in and by itself is mostly a tool for statisticians and data analysts (or anyone else who doesn't flinch at having to write scripts, who's acquainted with the phenomenon of 'manual', and who's used to spending a few hours or so reading before they try to do anything). That in itself represents a barrier.
I've found the on-line R documentation mostly unhelpful for beginners (thorough but pedantic, often implicit, and tending to use jargon). The offline 'Introduction to R' is a lot better though, and there are some good user-contributed texts that can be freely downloaded. I agree that it's useless to buy a book on the actual language (be that S or R) because as a beginner you will only use R's ready-made functionality and script that. If you find yourself delving into the language you're probably doing something wrong (for a beginner). Your best bet is to buy one of the 'cookbooks' for R.
I tried to use it for an undergraduate statistics course in conjunction with Excel using the RExcel package and Rcmdr.
The RExcel package establishes a COM link between MS Excel and R and comes with an Excel plugin that creates Rcmdr menus in Excel. The net result is that people can load, view, and edit their data in MS Excel, open the menu, send the data to R, do menu-driven analyses in Rcmdr, and bring results back into Excel if required.
It was less than a success. Students stumbled over having to realise that you have to send the data to R before the menu options take effect, had difficulty keeping track of where their 'live' data actually was (Excel or R), and on top of that had difficulty remembering where to look for the menu options.
Yes, I know. Well ... they were business school students but still, eh?
I believe that R Commander can work for an introductory course, provided you match the content of the course exactly to the R Commander menu or vice versa. Your students will be a bit hemmed in afterwards: they'll be able to replicate the stuff you prepared for them, but as soon as they try anything else they will have to sit down, think, and spend time figuring out how to use the software.
Mod parent up, this is *by far* the best GUI for R I have seen. Is it open source? I would think statistical analysis would be an especially good target for paid open-source SaaS.
Is there anything better than clicking through Microsoft ads on Slashdot?
A couple of years ago I ran into SAS at a trade show. It really surprised me that they were still around; I'd previously seen their products on mainframes back in the late 70s, with punch cards. (I forget by now whether I'd used SAS or SPSS, which were the two competing commercial stats packages in that environment.)
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
The single best R resource I've ever used was The R Book, by Crawley. Before buying it I invested way too much time searching all over the web for solutions to simple and complicated things alike, almost always with poor or incomplete results. The O'Reilly R books are barely OK. Short circuit the BS and go straight to The R Book. It paid for itself in about 2 hours of coding (it's expensive and runs between $80 and $150, when it's available -- my time is way more valuable, though).
For applied R to problem solving, my suggestion would be to go with Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. It requires you to have gone through college level stats -- fantastic book.
Another good book is Data Mining with R, Learning with Case Studies -- especially the Introduction, which offers one of the best R programming overviews out there. It's about the only reference that explains the R object-oriented and functional features well.
Web resources are vital AFTER you've sunk 200 to 500 hours into R work. By then you'll have grasped the language and many of its quirks (it's a language by and for scientists, not programming professionals -- so, saying it's "quirky" is a hell of an understatement), and the web resources will be more helpful because, although most are incomplete, by then you'll have enough experience to "fill in the blanks".
You're welcome to swing by irc://irc.freenode.net/#R -- we welcome n00bz!
Cheers!
pr3d
http://eugeneciurana.com | http://ciurana.eu
I guess you missed the memo that the K&R string functions are deprecated in many projects such as OpenBSD which has their own recommended set of string functions.
Way back when, Iverson and his APL cronies put a great deal of effort into defining the APL arithmetic operator set to conform to the largest possible set of simple arithmetic identities. Has the definition of the modulus operator concerning negative arguments been consistent in all languages since? That they shrouded this deep elegance with inscrutable Greek letters matters exactly how? They wrote a paper detailing all the identities they had discovered concerning the APL operator set. I've never seen a single other language bother to do this. Perhaps because identities written out with floor() and modulus() and spzkrm() lose a lot in translation.
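As a concrete case of the modulus point, here is a small sketch in Python showing two conventions that are both in wide use: floored remainder (Python's own `%`) and truncated remainder (`math.fmod`, which follows the C convention). Both satisfy the identity a == b*q + r, but with different q and r once a negative argument is involved. The helper name is made up.

```python
import math


def remainders(a, b):
    """Return (floored remainder, truncated remainder) of a divided by b,
    checking the division identity a == b*q + r for each convention."""
    q_floor, r_floor = a // b, a % b   # quotient rounded towards -infinity
    q_trunc = math.trunc(a / b)        # quotient rounded towards zero
    r_trunc = int(math.fmod(a, b))     # remainder takes the sign of a
    assert a == b * q_floor + r_floor
    assert a == b * q_trunc + r_trunc
    return r_floor, r_trunc


for a, b in [(7, 3), (-7, 3), (7, -3), (-7, -3)]:
    print(a, b, remainders(a, b))
# 7 3 (1, 1)
# -7 3 (2, -1)
# 7 -3 (-2, 1)
# -7 -3 (-1, -1)
```

The two conventions agree only when both arguments have the same sign, which is exactly the kind of edge case a documented identity set would pin down.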
Language designers preoccupied with consistency are known as dreamers (or Hurd developers). The formula that seems to grow up to become a language people actually use is 75% utility and 25% elegance.
I guess you missed the memo that when elegance dies on the vine in infancy, it does no one any damn good.
Why are programmers reluctant to refactor code when an elegant API becomes available to replace a hastily conceived core API? Because you can rarely trust the equivalence all the way down to the last edge case, because few APIs document their identity sets, listing all the cases you'd like to be true (and a few you hadn't even considered yourself) as well as the cases you presumed should be true, without realizing that these cases fundamentally conflict with other identities that made the cut.
Programmers who don't declare their identity sets shouldn't be allowed to write APIs, because a reliable identity set is the only way the downstream programmer will dare to refactor your API out of his applications if it turns out your API sucks--as part of a mass exodus from superior documentation.
Are you beginning to see the problem here?