In-Database R Coming To SQL Server 2016
theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.
Check out http://alteryx.com/ which is already doing in-database R with Oracle and Hadoop (SparkR). It's great that Microsoft is joining the club, but they aren't exactly the first.
How about introducing schema-less tables in 2016? Are we going to have to store fuzzy data in a silly full-text search enabled field forever?
The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. An interpreted language like R, and even many compiled languages, expects memory accesses to be quick. However, if those data accesses require SQL calls, then the R-SQL Server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, things will be very slow.
On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
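As a rough illustration of that reduction, here is a minimal Python sketch that gets the mean and sample standard deviation from three SQL aggregates, so only three numbers leave the database. It uses an in-memory SQLite table rather than SQL Server, and the `readings` table, `x` column, and `mean_and_std` helper are made-up names for the example. The sum-of-squares formula is also known to lose precision when the variance is small relative to the mean, so treat this as a sketch of the idea, not a recommended implementation.

```python
import math
import sqlite3


def mean_and_std(conn, table, column):
    """Mean and sample standard deviation from SQL aggregates only."""
    # Identifiers cannot be bound as query parameters, so this sketch
    # assumes the table and column names come from trusted code.
    n, total, total_sq = conn.execute(
        f"SELECT COUNT({column}), SUM({column}), SUM({column} * {column}) "
        f"FROM {table}"
    ).fetchone()
    mean = total / n
    # Sample variance from the running sums: (sum(x^2) - n * mean^2) / (n - 1)
    variance = (total_sq - n * mean * mean) / (n - 1)
    # Clamp tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))


# Hypothetical demo data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (x REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?)",
    [(v,) for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)],
)

mean, std = mean_and_std(conn, "readings", "x")
print(mean, std)
```

The database does the looping here, and the client never holds the column as a vector.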
On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
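To put rough numbers on that, a dense covariance matrix over p variables holds p × p values, so its size grows with the number of columns rather than the number of rows. The Python sketch below assumes 8-byte doubles and no symmetric or sparse storage tricks; `covariance_matrix_bytes` is a hypothetical helper, not part of any library.

```python
def covariance_matrix_bytes(num_variables: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense num_variables x num_variables matrix."""
    return num_variables * num_variables * bytes_per_value


# Print the dense matrix size for a few illustrative variable counts.
for p in (1_000, 100_000, 1_000_000):
    gib = covariance_matrix_bytes(p) / 2**30
    print(f"{p:>9} variables -> {gib:,.1f} GiB")
```

Under those assumptions, 100,000 variables already need 80,000,000,000 bytes for the matrix alone, before any working copies are made.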
R would be a better match for a distributed database (NoSQL, e.g. MongoDB), where the memory requirements of the vectors could be split across multiple computers. That too might require some changes to R, though.
How about making SQL Server respect ANSI NULLs on unique constraints?
Glad you asked. See
Microsoft: always arriving late at the party.
Is it web scale?
Why R? The R syntax is deranged. Python is at least more normal for programming. Why not have a .NET-like set of language-neutral libraries to interface with this in-memory whatever-it-is feature and let hackers plug in their own languages? Why bake any one language into the database?
Wouldn't Microsoft need to release SQL Server under GPL by including R?
I think it may have been embedded since 2012 or so.
http://www.oracle.com/technetwork/topics/bigdata/r-offerings-1566363.html
expect it to be in the enterprise version at $7000 a physical CPU core
I'm curious whether it will be exposed via OLAP - when I was doing some proteomics work with MS OLAP some years back, the retrieval speed was stellar, but the math libraries were pathetic, which seemed pretty sad for something allegedly aimed at analytics. (Yes, I know, most people assumed business analytics, but there's an awful lot of potential for scientific analysis, especially with large, messy datasets.)
This has been available in PostgreSQL for a while now.
Embrace, Extend, Extinguish
Just like they did to Lotus 1-2-3 and WordPerfect, just like they did to Java with their J++ before getting spanked, just like they tried to do with C++, and just like they're trying to do by porting Android and iOS apps to their OS, Microsoft is doing it again: creating a Roach Motel of software in which the developer or user can check into the Microsoft Roach Motel OS, but they sure cannot check out.
What is so egregiously evil about this? They're taking an Open Source product that is OS agnostic and putting their hand on the scale to favor using Windows with SQL Server and ONLY SQL Server. Way to be a good neighbor and a member of the community. I salute them for their "endeavor to persevere".
I gather from the summary that it is Microsoft's SQL Server, but wouldn't it be much more appropriate for a product with such an extremely generic name to prepend the vendor name?
Not because I want Linux to take over the world or anything (*ugh* systemd *barf*) but because when all is said and done proprietary desktop emulators posing as operating systems just don't interest me. I'm not going to use that as a server, ever, so there's really no point in caring about server software that only works on that platform anyway. Then there's the proprietaryness, the cost, and the availability of FOSS multi-platform alternatives where I run much less risk of getting locked in.
Of course, there are plenty of people who do believe servers need 3D desktops and OpenGL screen savers, so for them this'll be peachy. Then again, there's a strong case to be made for never letting such people procure server software for others to use.
Being able to remotely transmit commands in a new general-purpose programming language to the server that stores your irreplaceable data? What could possibly go wrong?
Also, how do you say "Robert'); DROP TABLE Students;" in R?
Wake me up when SQL Server comes with an MP3 player built in.
HPCC using ECL is elegant in a '70s fashion and can run big data jobs super quick thanks to its behind-the-scenes MASSIVELY parallel processing. As long as the big data is reasonable in size, that is.