In-Database R Coming To SQL Server 2016

Posted by Soulskill on Saturday May 16, 2015 @06:49AM from the r,-me-hearties dept.

theodp writes: Wondering what kind of things Microsoft might do with its purchase of Revolution Analytics? Over at the Revolutions blog, David Smith announces that in-database R is coming to SQL Server 2016. "With this update," Smith writes, "data scientists will no longer need to extract data from SQL server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database." It'll no doubt intrigue Data Scientist types, but the devil's in the final details, which Microsoft was still cagey about when it talked-the-not-exactly-glitch-free-talk (starts @57:00) earlier this month at Ignite. So, brush up your R, kids, and you can see how Microsoft walks the in-database-walk when SQL Server 2016 public preview rolls out this summer.
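As a rough illustration of "taking the code to the data", the sketch below uses Python's standard-library sqlite3 module as a stand-in. It is only an analogy: the story gives no details of how SQL Server will expose R, and the table `readings`, the aggregate name `py_median`, and the helper functions are all invented for this example. The point is that the custom function is registered with the engine and invoked from SQL, so only the per-group results come back to the caller rather than every row.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate; SQLite calls step() per row, then finalize()."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Ignore NULLs, as the built-in SQL aggregates do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.median(self.values) if self.values else None


def build_demo_db():
    """Create a small in-memory table (hypothetical data) and register the aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1.0), ("a", 3.0), ("a", 10.0), ("b", 2.0), ("b", 6.0)],
    )
    conn.create_aggregate("py_median", 1, Median)
    return conn


def medians_in_database(conn):
    """Run the custom aggregate from SQL; only one row per sensor is returned."""
    rows = conn.execute(
        "SELECT sensor, py_median(value) FROM readings "
        "GROUP BY sensor ORDER BY sensor"
    ).fetchall()
    return dict(rows)
```

Calling `medians_in_database(build_demo_db())` yields one median per sensor. The contrast with the ODBC workflow Smith describes is that the raw rows never have to be exported before the analysis runs.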

Alteryx by Anonymous Coward · 2015-05-16 06:53 · Score: 1

Check out http://alteryx.com/ which is already doing in-database R with Oracle and Hadoop (Spark R). It's great that Microsoft is joining the club, but they aren't exactly the first.
1. Re:Alteryx by Anonymous Coward · 2015-05-16 07:00 · Score: 1
  
  PostgreSQL has had PL/R since 2003.
2. Re:Alteryx by Skinkie · 2015-05-16 07:26 · Score: 5, Interesting
  
  MonetDB has a nice comparison of in-database and out-of-database performance: https://www.monetdb.org/conten...
  
  --
  Support Eachother, Copy Dutch Property!
3. Re:Alteryx by kthreadd · 2015-05-16 08:49 · Score: 1
  
  PostgreSQL has had PL/R since 2003.
  Which is nice but doesn't really do anything for you if you're not using PostgreSQL, for example those using SQL Server.
4. Re:Alteryx by phantomfive · 2015-05-16 09:33 · Score: 4, Insightful
  
  for example those using SQL Server.
  Though to be fair, that was a questionable decision to begin with. You just don't get any value for your subscription fees.
  
  Databases are one area where open source is beating closed source.
  
  --
  "First they came for the slanderers and i said nothing."
5. Re:Alteryx by phantomfive · 2015-05-16 11:07 · Score: 2
  
  Postgres has done replication for over a decade. What's wrong with it?
  
  --
  "First they came for the slanderers and i said nothing."
6. Re:Alteryx by Shados · 2015-05-16 12:18 · Score: 1
  
  Vertica says HELLO! Even though it's -absurdly- expensive, it runs circles around anything open source.
  Though in general, large (really large) databases are an area where you actually want commercial support, because things can go wrong in the most fucked up ways.
  Open source dbs have companies doing that support, but few have the kind of manpower I'd want when things go very sour.
7. Re:Alteryx by phantomfive · 2015-05-16 12:25 · Score: 2
  
  Vertica says HELLO! Even though it's -absurdly- expensive, it runs circles around anything open source.
  Vertica is a data warehouse.
  
  Open source dbs have companies doing that support, but few have the kind of manpower I'd want when things go very sour.
  If you sincerely need help from Oracle/Microsoft/HP to deal with your database problems, then your technical expertise isn't very high.
  
  --
  "First they came for the slanderers and i said nothing."
8. Re:Alteryx by Shados · 2015-05-16 12:36 · Score: 1
  
  The line is so thin between data warehouse and transactional dbs. Heck, in this case the only difference is how data is stored and which type of query is fast and which is slow. You can insert, run SQL (we use Postgres as a mock to run persistence layer tests, because it's so close to Vertica), all in real time. Close enough.
  And even the biggest of big data giants sometimes end up with issues where you need help. When you need to write a patch for your RDBMS, it's nice to be able to have a vendor to do it, open source or not. Not many companies keep Postgres core developers in house (ironically, my wife has been a Postgres contributor in the past, but not everyone has them handy =P).
9. Re:Alteryx by phantomfive · 2015-05-16 14:14 · Score: 2
  
  The line is so thin between data warehouse and transactional dbs. Heck, in this case the only difference is how data is stored and which type of query is fast and which is slow.
  
  No, that is actually the difference lol
  
  --
  "First they came for the slanderers and i said nothing."
10. Re:Alteryx by phantomfive · 2015-05-16 14:15 · Score: 1
  
  And even the biggest of big data giants sometimes end up with issues where you need help. When you need to write a patch for your RDBMS, it's nice to be able to have a vendor to do it, open source or not. Not many companies keep Postgres core developers in house (
  I'm interested though, is this an issue you've run into?
  
  --
  "First they came for the slanderers and i said nothing."
11. Re:Alteryx by Aighearach · 2015-05-16 17:49 · Score: 1
  
  3rd Party Vendors. It is a scary world out there. If you don't like working with it, double your prices to drive them away. Oops, now you're the highest paid person, you're the expert in what you hate. It happens all the time.
  If you're willing to use proprietary COTS crapware inside a business, you'll probably get stuck with crap like SQL Server. This is a huge service to poor souls stuck working on these things and doing statistics. You can throw away a whole layer of crapware and move it into the database where you can control the functionality.
12. Re:Alteryx by Bengie · 2015-05-17 00:14 · Score: 2
  
  When I was doing research into databases and total cost of ownership, Postgres was pretty much the best until about $100k, then MS-SQL caught up and it was pretty much a tie. MySQL was pretty bad the entire way through. There were a few other databases, but they were both uncommon and not ever better.
  
  With Postgres and MS-SQL being pretty much a tie on TCO, just choose whichever best fits your situation. Postgres does have a low barrier of entry and can do some pretty nifty things, but those things increase the base technical expertise required to program and administrate.
13. Re:Alteryx by Bengie · 2015-05-17 00:17 · Score: 1
  
  Derp, the only difference between a transaction database and data warehouse is the datastructures and algorithms.. herpa derpa.
  
  And the only difference between a train and semi is the engine and body.
14. Re:Alteryx by phantomfive · 2015-05-17 03:40 · Score: 1
  
  Postgres was pretty much the best until about $100k, then MS-SQL caught up and it was pretty much a tie.
  How did MS-SQL catch up?
  
  --
  "First they came for the slanderers and i said nothing."
But, but? by cablepokerface · 2015-05-16 07:00 · Score: 1

How about introducing schema-less tables in 2016? Are we going to have to store fuzzy data in a silly full-text search enabled field forever?
1. Re:But, but? by cablepokerface · 2015-05-16 07:29 · Score: 1
  
  I want both. The best of both worlds. Without using multiple products.
2. Re:But, but? by Richard_at_work · 2015-05-16 07:48 · Score: 2
  
  SQL Server 2016 will have a JSON column type, so it's most of the way there.
3. Re: But, but? by lakeland · 2015-05-16 09:54 · Score: 4, Informative
  
  Yeah exactly.
  MS SQL has a lot of good things going for it - but what you're asking for is one area where Postgres just runs rings around it. You can achieve similar benefits in MS using a CLR but it will be faster and easier in Postgres. Unless you have some compelling reason to stay MS, I suggest you take the hit and learn a new platform.
4. Re: But, but? by cdwiegand · 2015-05-16 17:55 · Score: 2
  
  Faster, yes. Easier? Maybe. I'm migrating a project from SQL Server to Postgres and I will say that SSMS is definitely better than pgAdmin any day. I'm almost tempted to write my own console due to the bugs I encounter.
  
  --
  . Define sqrt(x) as something really evil like (x / rand()), and bury it deep. Watch your coworkers go nuts.
5. Re: But, but? by lakeland · 2015-05-16 18:18 · Score: 1
  
  Yup, SSMS is far, far better than pgAdmin. SSIS is years ahead of any Postgres ETL tool. There's a bunch of other awesome features in SQL Server too - from memory, MERGE doesn't work in Postgres, procedures/functions are harder to use and ...
  I wasn't trying to say Postgres is all-round better than SQL Server. But there are a few things including R integration and spatial queries where Postgres is so far ahead that you are probably better to put up with the weaknesses.
6. Re: But, but? by Bengie · 2015-05-17 02:11 · Score: 1
  
  As someone with a bad memory, command lines suck for everything except scripting. Give me multiple choice. 80% of my work can be done more quickly via a UI than commands.
7. Re: But, but? by phantomfive · 2015-05-17 03:36 · Score: 1
  
  As someone with a bad memory,
  Improve your memory. It can be done.
  
  --
  "First they came for the slanderers and i said nothing."
8. Re: But, but? by Anonymous Coward · 2015-05-17 04:15 · Score: 1
  
  No, the reality is the person would have to be a jack of all trades and master of none because of all of the job duties assigned to the position. Said person would constantly be changing between working switches, routers, firewall, PBX, Linux servers, Windows servers, Windows clients, multiple SAN manufacturers, 2 different hypervisors, 2 different relational databases, IIS, Apache, PHP, C#, various shell scripting languages, etc.
9. Re: But, but? by Hognoxious · 2015-05-17 05:42 · Score: 1
  
  Crazy idea I know, but how about making notes?
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
10. Re: But, but? by DescX · 2015-05-19 00:47 · Score: 1
  
  No. You're wrong. Use your eyes. Do it his way. ...ahh, that feels better ;).
  Memorizing CLIs is a waste of brain space for all but the most static of job descriptions. Why? Because everything changes in this field rapidly. In a world of disposable code, where my random OSS framework could suddenly become the next big thing tomorrow morning, memorizing APIs is about the least efficient thing a developer can spend their time on.
  But I will say that it does give an advantage in the workplace to be the square eyed wizard who cranks at people constantly because his working memory is at 110% use every minute of the day. Vomiting arcane words from a terminal app in a meeting will create strong alliances with other IT jerks set in their ways. This will strike fear in the hearts of your weaker enem-- coworkers*, enabling you to beat perceived dunces over the head until their capacity comes up to your level.
  bash just added getStringIteratorFunctorWithCustomStringType()! You didn't know about this?! Read the man page and study it out, luser
11. Re: But, but? by DescX · 2015-05-19 01:30 · Score: 1
  
  So you've had one of those crazy jobs too, eh? ;)
  I love that CLIs let people automate tool chains and crud stuff with ease. I can't stand the groupthink that it's somehow better to use CLI all the time. A billion context switches between languages and esoteric interfaces over the course of an hour is initially exciting and feels productive. Unfortunately, I've never met a program written by a polyglot CLI warrior that wasn't a nightmare of spaghetti fragments that could break with one bad keystroke. Building a GUI forces developers to consider how someone will use their software. I can't count the number of times that slapping a GUI over a tool made me realize usability flaws within my own code. Most of the time I just bind a hotkey to produce a floating menu with important commands in it. Not all GUIs need to be elaborate design demos with pretty widgets ;).
  Not a huge fan of overengineered stuff like SOAP or the old MS UIs either, but that's MS and has nothing to do with GUI design per se. I've never had issues using GUI programs to extract features, which is what CLI people bitch about chiefly. Linking utility DLLs, COM objects, xdotool, etc. Interfacing without a terminal is a small price to pay for delivering software that mere mortals can use without having to visualize the internals of a computer.
12. Re: But, but? by phantomfive · 2015-05-19 02:58 · Score: 1
  
  Then take notes, as the other guy suggested. If learning command-line commands is so hard for you, that you write whole posts justifying why you shouldn't do it, then you could use some skill improvement before afflicting the world with your incompetence.
  
  Using postgres from the command line is not hard. Really.
  
  --
  "First they came for the slanderers and i said nothing."
This might not be a good idea ... by Cassini2 · 2015-05-16 07:05 · Score: 5, Interesting

The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore. An interpreted language like R, and even many compiled languages, expect memory accesses to be quick. However, if the data accesses are requiring SQL calls, then the R-SQL server marriage will be very slow. I'm sure they will be able to do some small demonstrations that look quick, but once the database becomes large, then things will be very slow.
On the good news side, there are some operations like average and standard deviation that reduce into loops of sums. Those should map onto SQL queries relatively well.
On the bad news side, a popular operation is to build a covariance matrix. With a large data set, it is easy to create a covariance matrix that does not fit into RAM.
R would be a better match against a distributed database (NoSQL, MongoDB), where the memory requirements of the vectors could be split across multiple computers. Although, that too might require some changes to R.
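The "loops of sums" point can be sketched as follows, again with Python's sqlite3 module standing in for any SQL engine. The table `measurements`, its column `x`, and the function names are assumptions made for the example. The database computes three running sums in a single pass, and the client finishes the arithmetic on three numbers instead of pulling the whole column into memory.

```python
import math
import sqlite3


def make_demo_table(values):
    """Build a hypothetical one-column table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (x REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?)", [(v,) for v in values])
    return conn


def mean_and_stddev(conn):
    """Return (mean, sample standard deviation) using only SQL aggregates."""
    n, total, total_sq = conn.execute(
        "SELECT COUNT(x), SUM(x), SUM(x * x) FROM measurements"
    ).fetchone()
    if n < 2:
        raise ValueError("need at least two non-NULL values")
    mean = total / n
    # Sample variance from the sums: (sum(x^2) - n * mean^2) / (n - 1).
    variance = (total_sq - n * mean * mean) / (n - 1)
    # Rounding can push a tiny variance slightly below zero; clamp it.
    return mean, math.sqrt(max(variance, 0.0))
```

One caveat: the sum-of-squares formula is simple but can lose precision when the mean is large relative to the spread, so a production version would probably prefer a more numerically stable update.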
1. Re:This might not be a good idea ... by Anonymous Coward · 2015-05-16 07:19 · Score: 1
  
  The problem with R is that everything is a vector. When you hit something as big as a multi-terabyte database, the vector doesn't fit in memory anymore.
  library(bigmemory)
  
  Create, store, access, and manipulate massive matrices. Matrices are, by default, allocated to shared memory and may use memory-mapped files. Packages biganalytics, synchronicity, bigalgebra, and bigtabulate provide advanced functionality.
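The memory-mapped-file idea behind that description can be illustrated with Python's standard library alone. This is not bigmemory's API, just a minimal sketch of the same technique, with invented helper names and an assumed on-disk layout of raw little-endian doubles. The file is mapped rather than read, and the loop touches one window at a time, so the whole vector never has to be resident at once.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.Struct("<d")  # assumed layout: little-endian 8-byte floats


def write_doubles(path, values):
    """Write values to disk as raw doubles (hypothetical file format)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(DOUBLE.pack(v))


def mapped_sum(path, chunk_values=4096):
    """Sum a non-empty file of doubles through a read-only memory map."""
    window = DOUBLE.size * chunk_values
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), window):
                # Slicing copies just this window out of the mapping.
                chunk = mm[offset:offset + window]
                total += sum(v for (v,) in DOUBLE.iter_unpack(chunk))
    return total


def demo():
    """Round-trip 10,000 values through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vector.bin")
        write_doubles(path, [0.5] * 10000)
        return mapped_sum(path)
```

The sketch assumes the file is non-empty and its size is a multiple of eight bytes. A library like bigmemory adds matrix semantics and shared-memory access on top of this basic mechanism.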
2. Re:This might not be a good idea ... by jbolden · 2015-05-16 07:29 · Score: 1
  
  RDBMS engines are designed to take in-memory, row-by-row or group-by-group statistical operations and work out good (ideally optimal) disk and memory organizations for them. That's one of the things they are very, very good at.
3. Re:This might not be a good idea ... by mindwhip · 2015-05-16 21:11 · Score: 1
  
  DBAs won't like it and will disable it in most corporate environments. This in effect lets the users/developers "inside" their precious servers where they are the ultimate power in a way they can't fully control (and let's face it, Control Freak is a job requirement for a DBA). Add to that the potential to bring a server to its knees with a badly written fragment of code and the possibility of security holes in a new component and they will have all the ammo they need to convince their bosses that it is a Bad Thing.
  
  --
  [The Universe] has gone offline.
4. Re:This might not be a good idea ... by meta-monkey · 2015-05-18 05:04 · Score: 1
  
  I would imagine one would only be performing datamining/statistical analysis on the data warehouse server, not the transactional database server.
  
  --
  We don't have a state-run media we have a media-run state.
5. Re:This might not be a good idea ... by rp · 2015-05-19 22:41 · Score: 1
  
  The sqldf package helps me out with this.
Isn't R GPL? by dalleboy · 2015-05-16 07:44 · Score: 1

Wouldn't Microsoft need to release SQL Server under GPL by including R?
1. Re:Isn't R GPL? by Richard_at_work · 2015-05-16 07:51 · Score: 1
  
  An implementation of R is GPL, but that doesn't extend to all independent implementations, such as the one MS is writing to do this.
2. Re:Isn't R GPL? by lakeland · 2015-05-16 09:47 · Score: 4, Informative
  
  No - MS will only need to release any changes they make to R.
  This sort of thing comes up quite often and largely comes down to coupling. If Microsoft included R code in the binary of SQL Server then they would run into complications. However as long as they keep R on its own and arrange interprocess communication sensibly, they will not be affected by the GPL.
  It's quite likely MS will modify R, e.g. writing low level routines for getting data out of SQL without needing to go via ODBC, and those sorts of changes will need to be released. It's also possible MS will want things like .RData readers for putting into SQL and similar - and they might choose to do a clean-room implementation of such bits rather than calling out to R for the loading code in order to avoid too tight coupling.
  Incidentally, this has been done before. The PgR project gives Postgres (BSD) tight coupling with R (GPL) without requiring Postgres to be relicenced. Tableau also released similar features, though they don't add much value at this stage.
3. Re:Isn't R GPL? by RuffMasterD · 2015-05-16 20:27 · Score: 1
  
  Oracle already does this too, embedding R as part of Oracle Advanced Analytics, but only if your boss can afford to sell your kidney. Looks like MS is falling behind.
  
  --
  Human Rights, Article 12: Freedom from Interference with Privacy, Family, Home and Correspondence
Re:Big deal. PostgreSQL's doing it already. by E-Rock · 2015-05-16 07:47 · Score: 2

So if you aren't first with a feature, you shouldn't bother?
expect to pay $$ by alen · 2015-05-16 07:53 · Score: 2

expect it to be in the enterprise version at $7000 a physical CPU core
MS OLAP by tylikcat · 2015-05-16 07:59 · Score: 1

I'm curious whether it will be exposed via OLAP - when I was doing some proteomics work with MS OLAP some years back, the retrieval speed was stellar, but the math libraries were pathetic, which seemed pretty sad for something allegedly aimed at analytics. (Yes, I know, most people assumed business analytics, but there's an awful lot of potential for scientific analysis, especially with large, messy datasets.)
1. Re:MS OLAP by lakeland · 2015-05-16 09:50 · Score: 1
  
  I'm guessing they'll slowly phase out OLAP.
  OLAP got its stellar retrieval speed through lots of precomputation and that just isn't compatible with where the whole big data stuff is going. I'd guess instead they will bring in a NoSQL database as a per-table query engine and use that as the OLAP replacement.
And so it begins by Anonymous Coward · 2015-05-16 08:50 · Score: 1

Embrace, Extend, Extinguish
Microsoft, just like they did to Lotus 1-2-3, WordPerfect, just like they did to Java with their J++ before getting spanked, just like they tried to do with C++, just like they're trying to do with porting Android and iOS apps to their OS, they're doing it again -- creating a Roach Motel of software in which the developer or user can check into the Microsoft Roach Motel OS, but they sure cannot check out.
What is so egregiously evil about this? They're taking an Open Source product that is OS agnostic and putting their hand on the scale to favor using Windows with SQL Server and ONLY SQL Server. Way to be a good neighbor and a member of the community. I salute them for their "endeavor to persevere".
Re:Priorities by CanEHdian · 2015-05-16 09:44 · Score: 1

How about making SQL server respect ASCII nulls on unique constraints?
I would be more impressed with EBCDIC nulls.
In the KELVIN character set, NULL equals ABS(NULL)

--
When the copyright term is "forever minus a day", live every day like it's the last.
Are you sure this is a good idea? by goodmanj · 2015-05-16 12:25 · Score: 1

Being able to remotely transmit commands in a new general-purpose programming language to the server that stores your irreplaceable data? What could possibly go wrong?
Also, how do you say "Robert'); DROP TABLE Students;" in R?
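On the SQL side of that joke, the usual defence is a parameterized query, sketched here with Python's sqlite3 module. The `Students` table and the helper names are illustrative only, and nothing in this example says anything about how SQL Server will sandbox submitted R code. The placeholder binds the name as data, so the quote characters never get a chance to terminate the statement.

```python
import sqlite3


def make_students_db():
    """Create an empty, in-memory Students table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students (name TEXT)")
    return conn


def add_student(conn, name):
    """Insert a student; the ? placeholder keeps the name out of the SQL text."""
    conn.execute("INSERT INTO Students (name) VALUES (?)", (name,))


def student_names(conn):
    """Return all stored names in insertion order."""
    rows = conn.execute("SELECT name FROM Students ORDER BY rowid")
    return [row[0] for row in rows]
```

After `add_student(conn, "Robert'); DROP TABLE Students;")`, the table still exists and simply contains one oddly named student.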
1. Re:Are you sure this is a good idea? by Virtucon · 2015-05-16 14:14 · Score: 2
  
  You can already do that with the CLR.
  
  --
  Harrison's Postulate - "For every action there is an equal and opposite criticism"
Re:Why not Python? by TechyImmigrant · 2015-05-16 14:47 · Score: 3, Interesting

Why R? The R syntax is deranged. Python is at least more normal for programming. Why not have a .NET like set of language-neutral libraries to interface with this in-memory whatever-it-is feature and let hackers plug in their own languages? Why bake any one language into the database?
This. The language is horrible. What R has going for it is (1) some quite good graph plotting and (2) support for any statistical function you can think of, since every statistics researcher works in R and so the functions are available. No other statistics product comes close.
A python statistics library with some funky C linkage to the R library would take over in milliseconds when people find they can get all the stats functions while being able to program in a sane language.

--
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Re:Why not Python? by jma05 · 2015-05-16 18:43 · Score: 1

I use both R and Python. R itself is actually quite nice and more efficient for interactive use, once you get used to it. For interactive exploration with statistics, I actually prefer it to Python (and I have been using Python for ~15 years). Lots of helper functions. Everything uses the DataFrame datastructure. Good, concise and consistent documentation.
Unless you are an R library dev, it's best to see R as a shell for statistics, rather than a programming language. So its language horribleness does not matter much.
I use Python to process data and R to explore it. Once I settle on something, if I need to put it into a larger pipeline, I either find an implementation in Python or link R to Python via rpy2.
> A python statistics library with some funky C linkage to the R library would take over in milliseconds
That's what rpy2 is.
Re:Why not Python? by Paul+Carver · 2015-05-16 23:42 · Score: 1

Just out of curiosity, when you say "Python" are you including iPython Notebook and Pandas and the rest of the SciPy/NumPy modules or are you comparing R strictly to "plain" Python scripts?
Re:Why not Python? by jma05 · 2015-05-17 00:46 · Score: 1

I mean the full Python stack (IPython notebook + Spyder with IPython, PyLab, Pandas, statsmodels).
For almost everything in stats, I prefer the RStudio experience. The flow feels much better, even though my Python is much better than my R. Machine Learning is one stats topic though, where I still prefer Python - I just like Scikit-learn.
If I was doing linear algebra directly, I would have preferred the Python stack with NumPy. PyLab stack is more for Matlab users than R users. On the stats side, Pandas and statsmodels are still not yet an R replacement for me. They are a great start though and seem to have gotten everything right so far.
zzzzz by vilanye · 2015-05-17 07:52 · Score: 1

Wake me up when SQL Server comes with an MP3 player built in.
Re:Why not Python? by TechyImmigrant · 2015-05-17 14:50 · Score: 1

That's what rpy2 is.
Thank you. I didn't know rpy2 existed.

--
I should use this sig to advertise my book ISBN-13 : 978-1501515132.