Ask Slashdot: Choosing a Data Warehouse Server System?
New submitter puzzled_decoy writes The company I work has decided to get in on this "big data" thing. We are trying to find a good data warehouse system to host and run analytics on, you guessed it, a bunch of data. Right now we are looking into MSSQL, a company called Domo, and Oracle contacted us. Google BigQuery may be another option. At its core, we need to be able to query huge amounts of data in sometimes rather odd ways. We need a strong ETLlayer, and hopefully we can put some nice visual reporting service on top of wherever the data is stored. So, what is your experience with "big data" servers and services? What would you recommend, and what are the pitfalls you've encountered?
The first step is to ask Slashdot a really vague question to a highly technical and expensive undertaking.
Open-source so you don't have to cough up millions of dollars to see if you can get business.
Clusterable, scalable and standards-based so you're not locking down too far into one solution-space.
Trying to become famous by taking photos. Visit my homepage please.
Maybe. However I would also be interested in any answer (especially any answer involving FLOSS software). Interested not because it's my job or my company is looking to use such software, but because I'm curious and like to expand my knowledge.
In general I don't mind such questions on Slashdot, as they're usually interesting and informative to the rest of us. And if they're not, then I (we) don't read the article!
You never know what is enough unless you know what is more than enough. - Blake
Big data is an entire field of study, this is not "should I use vi or emacs or nano" and even that requires a shitload of context and the source of flame wars until the end of time.
Think about your budget, your audience, and the value that you can add by spending time and money on this.
MapReduce (hadoop) is awesome and open source, you can run it in house or in multiple cloud offerings and has a tremendous community. BUT it sucks at relationships (foreign keys) graph calculations and others.
Graph databases can make connections between things that are impossible in other systems, but are only good for graph relationships.
OLAP data stored in n-dimensional cubes allows reporting and analysis if familiar tools that many analysts (not programmers) think is the cat's pajamas.
Your best be is to slow down and talk to your users, while reading Seven Databases in Seven Weeks
https://pragprog.com/book/rwdata/seven-databases-in-seven-weeks
And then realize that you probably need to hire a consultant so you have somebody to fire when the whole thing goes south.
If the data fits in a database, it is not Big Data.
Sounds like you're very good in the buzzword-department but have no idea what you're doing at all.... What kind of data are we talking about? Lots of writes? Lots of reads? Is the data suitable for splitting up? What kind of queries will you need to run? Do you need uptime? Or consistency?
Also if you're looking at MSSQL or Oracle, you obviously DO NOT HAVE Big Data. Big Data is data that cannot be dealt with using regular RDBMSes. Do you really have or plan to have multiple terabytes of data? If not, you don't have big data.
Based on the information you've given us we cannot give you any advice at all apart from stopping what you're doing and hiring an expert.
0x or or snor perron?!
1. Hire some bonehead that is expendable and ask him to make the decision.
2. Fire him when the project fails.
3. Nobody will ever bring this up again.
Got Code?
Just from a technical and financial point of view, I wouldn't recommend Oracle either. Oracle Advanced Analytics just seems to be a very expensive way to get R.
Financially - R is open source and free (as in both free as a bird, and free beer), so you don't need to buy it from Oracle. No doubt Oracle will make you buy their DBMS as well to work with Advanced Analytics, and a big server to run it on, plus support to get it up and running.
Technically - Oracle make a good DMBS for sure, but you don't need all the advanced features their DBMS is good at, such record level locking, three phase commit, redo logs, conflict resolution etc. You need that sort of stuff to maintain data integrity on transaction processing systems, but not for analysis. For analysis you just need a giant de-normalised table, and maybe indexes if you want to pick out specific subsets of records without full table scans.
Personally I use SAS. It's not sexy, but I have never found a dataset too large to handle. It will thrash the harddrive all night if it has to to get a result, but it won't crash. SPSS will definitely crap itself with even moderate datasets. Stata does OK, but even that can't handle the larger datasets. I haven't pushed R hard enough to find it's limit.
Human Rights, Article 12: Freedom from Interference with Privacy, Family, Home and Correspondence
There was a crime, and Oracle was a willing accomplice.
Because that kind of setup works mostly for highly specialized requirements, such as processing ad clicks or log files. That's totally different from a data warehouse, where you store a lot of data with the idea that users can do a bit of exploration and analysis on their own using client tools like Excel, Tableau or MicroStrategy.
There's 3 kinds of setup for Big Data:
1) Massively parallel processing, such as AWS Redshift or Google Big Query (or IBM Netezza if you have money). Those are regular databases on steroids and they let you query data on your own. Redshift is basically a huge multi-tenants Postgres cluster.
2) MapReduce, such as AWS EMR. This is more or less a clunky kind of ETL where you need to code every single question to which you want an answer. It scales well on the volume side (because of Hadoop distributed file system) but it is extremely tedious to implement and offers zero self-service capabilities for data analysts beyond what is hard-coded in your setup. The ETL language from Apache, Pig, is very basic - for just about everything you need to fire up Eclipse and write Java code. There are a few SQL frameworks that can sit on top of Hadoop, but none are blazing fast or immensely reliable, and for the most part with those SQL solutions it ends up being a cheapskate alternative to a proper DW.
3) Machine learning, such as Spark or Mahout (also based on Hadoop file system). Those also require extensive programming and typically won't offer clear answers, they are mostly useful to find trends or patterns. It's all the rage right now with "data scientist", just like MR was all the rage 3 years ago and did not really stick because it's too clunky. Again this is a scenario where you know what you are looking for, because you have to "train" your system for specific tasks.
HortonWorks is an all-inclusive Hadoop setup that includes most of what is needed for #2 or #3, but since AWS and Azure offer for pennies a totally scalable Hadoop environment, in my experience HortonWorks is for companies who want nothing to do with the cloud or for total newbies who want to see what is that Hadoop thing. But it does not offer the benefits of letting you learn what are the moving pieces because it comes all configured.
So unless you have a very specific set of reports of indicators and a shitload of data, the only serious answer is to keep doing what BI people have been doing for decades: build data warehouses and use a decent front-end that includes a flexible reporting platform and self-service capabilities (such as OLAP). And only if you have tons of data should you even bother with Big Data products, as none of those are cheap. Redshift is in the $1000-$5000/TB/year range. For a large organization that's nothing, but for some guy trying to start a vague BI initiative that's expensive.
When it comes to non-Big Data BI (i.e. something to setup on a few servers at most), the options are the following:
1) SQL Server and its built-in BI suite, or Oracle and its built-in BI suite. A bit expensive but very flexible. Not ideal for self-service unless you have experienced DBAs.
2) Any RDBMS + IBM Cognos or + SAP BusinessObjects. Expensive but you can define data universe then let users build their own reports. Ideal for self-service and for situations where you don't have a full time DBA who can write queries or build OLAP cubes.
3) A patchwork of FOSS: MySQL, Mondrian, Jasper, Talend, etc. Free but not integrated so it requires a bit of work.
Big Data != BI. It just means that you have more data that you could process on a regular database cluster. Even with social networks, ads and blogs, I haven't seen that many situations where this is truly needed.
lucm, indeed.