Slashdot Mirror


In-Database R Coming To SQL Server 2016

theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.

5 of 94 comments

  1. This might not be a good idea ... by Cassini2 · · Score: 5, Interesting

    The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. Interpreted languages like R, and even many compiled languages, expect memory accesses to be quick. However, if data accesses require SQL calls, then the R-SQL Server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, things will be very slow.

    On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
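
A minimal sketch of that reduction, in Python rather than R: the mean and sample standard deviation come out of three running totals (a count, a sum, and a sum of squares), which is the same shape as the SQL aggregates COUNT(*), SUM(x) and SUM(x*x). The function name is hypothetical, and the sum-of-squares formula is used only because it mirrors the "loops of sums" idea; it can lose precision when the mean is large relative to the spread.

```python
import math


def streaming_mean_std(values):
    """Return (mean, sample standard deviation) in a single pass.

    Only three running totals are kept, so the full column never has
    to sit in memory at once.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("need at least two values")
    mean = total / n
    # Sample variance from the accumulated sums; clamp tiny negative
    # results caused by floating-point rounding.
    variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
    return mean, math.sqrt(variance)
```

Because the input can be any iterable, the same loop works over a generator that yields rows from a database cursor.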

    On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
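
As a rough illustration of the size problem (one reading of the comment, not a measurement): a dense covariance matrix over p columns holds p x p entries regardless of the row count. The sketch below assumes 8-byte double-precision values and a hypothetical helper name.

```python
def covariance_matrix_bytes(num_columns, bytes_per_value=8):
    """Bytes needed for a dense num_columns x num_columns matrix.

    Assumes one double-precision (8-byte) value per entry and ignores
    any savings from storing only one triangle of the symmetric matrix.
    """
    return num_columns * num_columns * bytes_per_value


# 100,000 columns -> 10**10 entries -> 8 * 10**10 bytes (80 GB decimal).
size_in_gb = covariance_matrix_bytes(100_000) / 1e9
```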

    R would be a better match for a distributed database (NoSQL, MongoDB), where the memory requirements of the vectors could be split across multiple computers. That, too, might require some changes to R, though.
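
One way the split-across-machines idea can work for simple statistics, sketched in Python under the assumption that each node summarizes its own chunk and only the small summaries travel over the network. The names are hypothetical, and the merge step is the standard pairwise update for count, mean, and sum of squared deviations, not anything R or a particular database is known to do.

```python
from collections import namedtuple

# n = count, mean = chunk mean, m2 = sum of squared deviations from the mean
Summary = namedtuple("Summary", ["n", "mean", "m2"])


def summarize(chunk):
    """Summarize one node's chunk of values."""
    n = len(chunk)
    if n == 0:
        return Summary(0, 0.0, 0.0)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return Summary(n, mean, m2)


def merge(a, b):
    """Combine two summaries without revisiting the underlying data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def population_variance(summary):
    """Population variance recovered from a merged summary."""
    return summary.m2 / summary.n
```

Each node would call `summarize` locally, and a coordinator would fold the results together with `merge`.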

  2. Re:Alteryx by Skinkie · · Score: 5, Interesting

    MonetDB has a nice comparison of in-database and out-of-database performance: https://www.monetdb.org/conten...

    --
    Support Eachother, Copy Dutch Property!
  3. Re:Alteryx by phantomfive · · Score: 4, Insightful

    for example those using SQL Server.

    Though to be fair, that was a questionable decision to begin with. You just don't get any value for your subscription fees.

    Databases are one area where open source is beating closed source.

    --
    "First they came for the slanderers and i said nothing."
  4. Re:Isn't R GPL? by lakeland · · Score: 4, Informative

    No - MS will only need to release any changes they make to R.

    This sort of thing comes up quite often and largely comes down to coupling. If Microsoft included R code in the binary of SQL Server, then they would run into complications. However, as long as they keep R on its own and arrange interprocess communication sensibly, they will not be affected by the GPL.

    It's quite likely MS will modify R, e.g. by writing low-level routines for getting data out of SQL without needing to go via ODBC, and those sorts of changes will need to be released. It's also possible MS will want things like .RData readers for loading data into SQL and similar, and they might choose to do a clean-room implementation of such bits rather than calling out to R for the loading code, in order to avoid too tight a coupling.

    Incidentally, this has been done before. The PgR project gives Postgres (BSD) tight coupling with R (GPL) without requiring Postgres to be relicenced. Tableau also released similar features, though they don't add much value at this stage.

  5. Re: But, but? by lakeland · · Score: 4, Informative

    Yeah exactly.

    MS SQL has a lot of good things going for it, but what you're asking for is one area where Postgres just runs rings around it. You can achieve similar benefits in MS SQL using the CLR, but it will be faster and easier in Postgres. Unless you have some compelling reason to stay with MS, I suggest you take the hit and learn a new platform.