In-Database R Coming To SQL Server 2016
theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.
Check out http://alteryx.com/ which is already doing in-database R with Oracle and Hadoop (SparkR). It's great that Microsoft is joining the club, but they aren't exactly the 1st.
How about introducing schema-less tables in 2016? Are we going to have to store fuzzy data in a silly full-text search enabled field forever?
The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. An interpreted language like R, and even many compiled languages, expect memory accesses to be quick. However, if the data accesses are requiring SQL calls, then the R-SQL server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, then things will be very slow.
On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
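To make the "loops of sums" point concrete, here is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for any SQL engine. The table name `measurements`, the column `x`, and the helper `mean_and_std_via_sql` are made up for illustration. The database does the summing and only three numbers come back to the client.

```python
import math
import sqlite3


def mean_and_std_via_sql(conn, table, column):
    """Return (mean, sample standard deviation) from three SQL aggregates."""
    # Assumption: table and column names are trusted, since identifiers
    # cannot be bound as query parameters.
    query = (
        f"SELECT COUNT({column}), SUM({column}), SUM({column} * {column}) "
        f"FROM {table}"
    )
    n, total, total_sq = conn.execute(query).fetchone()
    mean = total / n
    # Sum-of-squares identity; simple, but it can lose precision when the
    # mean is large relative to the spread.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(max(variance, 0.0))


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (x REAL)")
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
conn.executemany("INSERT INTO measurements VALUES (?)", [(v,) for v in values])

mean, std = mean_and_std_via_sql(conn, "measurements", "x")
print(mean, std)
```

The same three aggregates work no matter how many rows the table has, which is why these statistics push down to the database so easily.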
On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
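A rough back-of-the-envelope sketch of why that happens, assuming a dense matrix of 8-byte doubles. The function name `covariance_matrix_bytes` is made up for illustration. Note that the size depends on the number of columns (variables), not the number of rows.

```python
def covariance_matrix_bytes(num_columns, bytes_per_value=8):
    """Memory for a dense num_columns x num_columns covariance matrix."""
    # Assumption: dense storage with no use of symmetry, 8-byte doubles.
    return num_columns * num_columns * bytes_per_value


for p in (1_000, 100_000, 1_000_000):
    gib = covariance_matrix_bytes(p) / 2**30
    print(f"{p:>9} columns -> {gib:,.1f} GiB")
```

Storing only the upper triangle would roughly halve these numbers, but the growth is still quadratic in the number of columns.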
R would be a better match against a distributed database (NoSQL, MongoDB), where the memory requirements of the vectors could be split across multiple computers. Although, that too might require some changes to R.
Wouldn't Microsoft need to release SQL Server under GPL by including R?
So if you aren't first with a feature, you shouldn't bother?
expect it to be in the enterprise version at $7000 a physical CPU core
I'm curious whether it will be exposed via OLAP - when I was doing some proteomics work with MS OLAP some years back, the retrieval speed was stellar, but the math libraries were pathetic, which seemed pretty sad for something allegedly aimed at analytics. (Yes, I know, most people assumed business analytics, but there's an awful lot of potential for scientific analysis, especially with large, messy datasets.)
Embrace, Extend, Extinguish
Microsoft, just like they did to Lotus 123, Wordperfect, just like they did to Java with their J++ before getting spanked, just like they tried to do with C++, just like they're trying to do with porting Android and IOS apps to their OS, they're doing it again -- creating a Roach Motel of software in which the developer or user can check into the Microsoft Roach Motel OS, but they sure cannot check out.
What is so egregiously evil about this? They're taking an Open Source product that is OS agnostic and putting their hand on the scale to favor using Windows with SQL Server and ONLY SQL Server. Way to be a good neighbor and a member of the community. I salute them for their "endeavor to persevere".
How about making SQL server respect ASCII nulls on unique constraints?
I would be more impressed with EBCDIC nulls.
In the KELVIN character set, NULL equals ABS(NULL)
When the copyright term is "forever minus a day", live every day like it's the last.
Being able to remotely transmit commands in a new general-purpose programming language to the server that stores your irreplaceable data? What could possibly go wrong?
Also, how do you say "Robert'); DROP TABLE Students;" in R?
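Not R, but the usual defense looks the same in any language: bind the value as a parameter instead of pasting it into the SQL text. A minimal sketch in Python with the standard-library sqlite3 module, where the `Students` table is set up in memory purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT)")

# The hostile "name" from the comment above, used as plain data.
hostile_name = "Robert'); DROP TABLE Students;"

# The ? placeholder binds the value as data, so the quote and the DROP
# statement are never parsed as SQL.
conn.execute("INSERT INTO Students (name) VALUES (?)", (hostile_name,))

rows = conn.execute("SELECT name FROM Students").fetchall()
print(rows)
```

The table survives and the odd name is stored verbatim. Whether in-database R code gets an equivalent safeguard depends on details Microsoft has not yet published.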
Why R? The R syntax is deranged. Python is at least more normal for programming. Why not have a .NET like set of language-neutral libraries to interface with this in-memory whatever-it-is feature and let hackers plug in their own languages? Why bake any one language into the database?
This. The language is horrible. What R has going for it is (1) some quite good graph plotting and (2) support for any statistical function you can think of, since every statistics researcher works in R and so the functions are available. No other statistics product comes close.
A python statistics library with some funky C linkage to the R library would take over in milliseconds when people find they can get all the stats functions while being able to program in a sane language.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
I use both R and Python. R itself is actually quite nice and more efficient for interactive use, once you get used to it. For interactive exploration with statistics, I actually prefer it to Python (and I have been using Python for ~15 years). Lots of helper functions. Everything uses the DataFrame datastructure. Good, concise and consistent documentation.
Unless you are an R library dev, it's best for most users to see R as a shell for statistics rather than a programming language. So its language horribleness does not matter much.
I use Python to process data and R to explore it. Once I settle on something, if I need to put it into a larger pipeline, I either find an implementation in Python or link R to Python via rpy2.
> A python statistics library with some funky C linkage to the R library would take over in milliseconds
That's what rpy2 is.
Just out of curiosity, when you say "Python" are you including iPython Notebook and Pandas and the rest of the SciPy/NumPy modules or are you comparing R strictly to "plain" Python scripts?
I mean the full Python stack (IPython notebook + Spyder with IPython, PyLab, Pandas, statsmodels).
For almost everything in stats, I prefer the RStudio experience. The flow feels much better, even though my Python is much better than my R. Machine Learning is one stats topic though, where I still prefer Python - I just like Scikit-learn.
If I was doing linear algebra directly, I would have preferred the Python stack with NumPy. PyLab stack is more for Matlab users than R users. On the stats side, Pandas and statsmodels are still not yet an R replacement for me. They are a great start though and seem to have gotten everything right so far.
Wake me up when SQL Server comes with an MP3 player built in.
> That's what rpy2 is.
Thank you. I didn't know rpy2 existed.