In-Database R Coming To SQL Server 2016
theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.
The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. An interpreted language like R, and even many compiled languages, expects memory accesses to be quick. If data accesses require SQL calls instead, the R-SQL Server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, things will be very slow.
On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
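To make the "loops of sums" point concrete, here is a minimal Python sketch (not anything from Microsoft or Revolution Analytics). It computes the mean and sample standard deviation from three running totals, so only one chunk of rows needs to be in memory at a time. The function name and the chunked input are made up for illustration; on the database side the same three totals would correspond to the standard aggregates COUNT(*), SUM(x) and SUM(x*x).

```python
import math

def streaming_mean_std(chunks):
    """Mean and sample standard deviation from running sums.

    `chunks` is any iterable of lists of numbers (a hypothetical stand-in
    for batches of rows fetched from a database).
    """
    n = 0           # running count
    total = 0.0     # running sum of x
    total_sq = 0.0  # running sum of x*x
    for chunk in chunks:
        for x in chunk:
            n += 1
            total += x
            total_sq += x * x
    mean = total / n
    # Sample variance rebuilt from the three sums; clamp tiny negative
    # values that can appear from floating-point rounding.
    variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
    return mean, math.sqrt(variance)

# Eight values split across three batches.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, std = streaming_mean_std(chunks)
```

One caveat: the sum-of-squares formula can lose precision when the mean is large relative to the spread, so a production version would probably want a more stable update such as Welford's method.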
On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
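A rough back-of-the-envelope sketch of that memory problem, in Python. It assumes a dense matrix of 8-byte doubles, and the column counts are made up for illustration. Note that the size of the covariance matrix depends on the number of columns, not the number of rows, so it is wide data sets that blow up.

```python
def covariance_matrix_bytes(num_columns, bytes_per_value=8):
    """Bytes needed for a dense num_columns x num_columns matrix.

    Assumes 8-byte double-precision values by default.
    """
    return num_columns * num_columns * bytes_per_value

# Hypothetical column counts, just to show how quickly the size grows.
for p in (1_000, 100_000):
    gib = covariance_matrix_bytes(p) / 2**30
    print(f"{p} columns: {gib:.2f} GiB")
```

With 1,000 columns the matrix is about 8 MB; with 100,000 columns it is 80 GB, which is already beyond the RAM of most single machines.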
R would be a better match for a distributed database (NoSQL, MongoDB), where the memory requirements of the vectors could be split across multiple computers, although that too might require some changes to R.
MonetDB has a nice comparison of in-database and out-of-database performance: https://www.monetdb.org/conten...
Support Eachother, Copy Dutch Property!
So if you aren't first with a feature, you shouldn't bother?
SQL Server 2016 will have a JSON column type, so it's most of the way there.
Expect it to be in the Enterprise edition at $7000 per physical CPU core.
for example those using SQL Server.
Though to be fair, that was a questionable decision to begin with. You just don't get any value for your subscription fees.
Databases are one area where open source is beating closed source.
"First they came for the slanderers and i said nothing."
No - MS will only need to release any changes they make to R.
This sort of thing comes up quite often and largely comes down to coupling. If Microsoft included R code in the binary of SQL Server, they would run into complications. However, as long as they keep R on its own and arrange interprocess communication sensibly, they will not be affected by the GPL.
It's quite likely MS will modify R, e.g. writing low-level routines for getting data out of SQL without needing to go via ODBC, and those sorts of changes will need to be released. It's also possible MS will want things like .RData readers for loading into SQL and similar, and they might choose to do a clean-room implementation of such bits rather than calling out to R for the loading code, in order to avoid too tight a coupling.
Incidentally, this has been done before. The PgR project gives Postgres (BSD) tight coupling with R (GPL) without requiring Postgres to be relicensed. Tableau also released similar features, though they don't add much value at this stage.
Yeah exactly.
MS SQL has a lot of good things going for it - but what you're asking for is one area where Postgres just runs rings around it. You can achieve similar benefits in MS using a CLR but it will be faster and easier in Postgres. Unless you have some compelling reason to stay MS, I suggest you take the hit and learn a new platform.
Postgres has done replication for over a decade. What's wrong with it?
Vertica says HELLO! Even though it's -absurdly- expensive, it runs circles around anything open source.
Vertica is a data warehouse
Open source dbs have companies doing that support, but few have the kind of manpower I'd want when things go very sour.
If you sincerely need help from Oracle/Microsoft/HP to deal with your database problems, then your technical expertise isn't very high.
The line is so thin between data warehouse and transactional dbs. Heck, in this case the only difference is how data is stored and which type of query is fast and which is slow.
No, that is actually the difference lol
You can already do that with the CLR.
Harrison's Postulate - "For every action there is an equal and opposite criticism"
Why R? The R syntax is deranged. Python is at least more normal for programming. Why not have a .NET like set of language-neutral libraries to interface with this in-memory whatever-it-is feature and let hackers plug in their own languages? Why bake any one language into the database?
This. The language is horrible. What R has going for it is (1) some quite good graph plotting and (2) support for any statistical function you can think of, since every statistics researcher works in R and so the functions are available. No other statistics product comes close.
A Python statistics library with some funky C linkage to the R library would take over in milliseconds once people find they can get all the stats functions while being able to program in a sane language.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Faster, yes. Easier? Maybe. I'm migrating a project from SQL Server to Postgres and I will say that SSMS is definitely better than pgAdmin any day. I'm almost tempted to write my own console due to the bugs I encounter.
Define sqrt(x) as something really evil like (x / rand()), and bury it deep. Watch your coworkers go nuts.
When I was doing research into databases and total cost of ownership, Postgres was pretty much the best until about $100k, then MS-SQL caught up and it was pretty much a tie. MySQL was pretty bad the entire way through. There were a few other databases, but they were both uncommon and never better.
With Postgres and MS-SQL being pretty much a tie on TCO, just choose whichever best fits your situation. Postgres does have a low barrier to entry and can do some pretty nifty things, but those things increase the base technical expertise required to program and administer it.