In-Database R Coming To SQL Server 2016
theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.
The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. An interpreted language like R, and even many compiled languages, expect memory accesses to be quick. However, if the data accesses are requiring SQL calls, then the R-SQL server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, then things will be very slow.
On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
R would be a better match against an distributed database (NoSQL, MongoDB), where the memory requirements of the vectors could be split across multiple computers. Although, that too might require some changes to R.
MonetDB has a nice comparison on different in and out of database performance: https://www.monetdb.org/conten...
Support Eachother, Copy Dutch Property!
for example those using SQL Server.
Though to be fair, that was a questionable decision to begin with. You just don't get any value for your subscription fees.
Databases are one area that open source is beating closed source.
"First they came for the slanderers and i said nothing."
No - MS will only need to release any changes they make to R.
This sort of thing comes up quite often and largely comes down to coupling. If Microsoft included R code in the binary of SQL Server then they would run into complications. However as long as they keep R on its own and arrange interprocess communication sensibly, they will not be affected by the GPL.
It's quite likely MS will modify R, e.g. writing low level routines for getting data out of SQL without needing to go via ODBC and those sort of changes will need to be released. It's also possible MS will want things like .RData readers for putting into SQL and similar - and they might choose to do a clean-room implementation of such bits rather than calling out to R for the loading code in order to avoid too tight coupling.
Incidentially, this has been done before. The PgR project gives Postgres (BSD) has tight coupling with R (GPL) without requiring Postgres to be relicenced. Tableau also released similar features, though they don't add much value at this stage.
Yeah exactly.
MS SQL has a lot of good things going for it - but what you're asking for is one area where Postgres just runs rings around it. You can achieve similar benefits in MS using a CLR but it will be faster and easier in Postgres. Unless you have some compelling reason to stay MS, I suggest you take the hit and learn a new platform.
Why R? The R syntax is deranged. Python is at least more normal for programming. Why not have a .NET like set of language-neutral libraries to interface with this in-memory whatever-it-is feature and let hackers plug in their own languages? Why bake any one language into the database?
This. The language is horrible. What R has going for it is (1) some quite good graph plotting and (2) Support any statistical function you can think of, since every statistics researcher works in R and so the functions a available. No other statistics product comes close.
A python statistics library with some funky C linkage to the R library would take over in milliseconds when people find they can get all the stats functions while being able to program in a sane language.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.