Ask Slashdot: Choosing a Data Warehouse Server System?
New submitter puzzled_decoy writes The company I work has decided to get in on this "big data" thing. We are trying to find a good data warehouse system to host and run analytics on, you guessed it, a bunch of data. Right now we are looking into MSSQL, a company called Domo, and Oracle contacted us. Google BigQuery may be another option. At its core, we need to be able to query huge amounts of data in sometimes rather odd ways. We need a strong ETLlayer, and hopefully we can put some nice visual reporting service on top of wherever the data is stored. So, what is your experience with "big data" servers and services? What would you recommend, and what are the pitfalls you've encountered?
Oregon Resident here. After the recent issues with Oracle..... yup. Not gonna recommend 'em again. Not a big fan of my tax money being wasted.
The first step is to ask Slashdot a really vague question to a highly technical and expensive undertaking.
Help do my job for me.
Only the State obtains its revenue by coercion. - Murray Rothbard
The way you're going at it you're basically burning money. "We must have this big data thing too!" is every hardware vendor's eyes going "ka-ching" and you'll be overpaying whatever you do. Even if you think you're getting a good price.
The problem with big data as a thing (BDaaT) is that without a clear goal you'll be gathering too much data and storing it for too long. Thereby you "need" too much processing power to shoot through it, and the only way left is downhill. This creates myriads of problems, of which overpaying for too much hardware is but the least.
So, you think you're serious about this big data thing? Just bring sacks of money to your fave distie. That is all.
Open-source so you don't have to cough up millions of dollars to see if you can get business.
Clusterable, scalable and standards-based so you're not locking down too far into one solution-space.
Trying to become famous by taking photos. Visit my homepage please.
Pretty easy to try it out immediately... http://aws.amazon.com/redshift
Don't waste your time and money, just go with Hadoop.
Need ETL? Well for one there is PIG, but if you want to do stream processing Apache Storm / Kafka.
Take a look at this, http://hortonworks.com/hdp/
All completely Open Source.
MSSQL?
why would anyone in their right mind go with MICROSOFT for a company database ? specially a big data database ?
I will not claim any "big data" experience.
At least you have an opinion informed by no experience.
Whatever you do, don't go mssql as you will end up processing most of your data in the analytics tool.
I've seen it lock tables even on only reads causing other processes to be terminated.
The closest it has got to materialized views are clustered indexed views which suck and can barely do any processing.
Big data is an entire field of study, this is not "should I use vi or emacs or nano" and even that requires a shitload of context and the source of flame wars until the end of time.
Think about your budget, your audience, and the value that you can add by spending time and money on this.
MapReduce (hadoop) is awesome and open source, you can run it in house or in multiple cloud offerings and has a tremendous community. BUT it sucks at relationships (foreign keys) graph calculations and others.
Graph databases can make connections between things that are impossible in other systems, but are only good for graph relationships.
OLAP data stored in n-dimensional cubes allows reporting and analysis if familiar tools that many analysts (not programmers) think is the cat's pajamas.
Your best be is to slow down and talk to your users, while reading Seven Databases in Seven Weeks
https://pragprog.com/book/rwdata/seven-databases-in-seven-weeks
And then realize that you probably need to hire a consultant so you have somebody to fire when the whole thing goes south.
If the data fits in a database, it is not Big Data.
If I was tasked with coming up with ideas for a Data Warehouse Server System, and given that I know almost nothing about such systems, my first port of call would probably be Apache. What about Cassandra, Hadoop, Hive, Mahout or Pig (or combinations thereof)? All of these are downloadable and playable-with (and being Apache, FLOSS).
As a previous poster pointed out, there is also PostgreSQL, again FLOSS. Again downloadable and playable-with.
You never know what is enough unless you know what is more than enough. - Blake
Sounds like you're very good in the buzzword-department but have no idea what you're doing at all.... What kind of data are we talking about? Lots of writes? Lots of reads? Is the data suitable for splitting up? What kind of queries will you need to run? Do you need uptime? Or consistency?
Also if you're looking at MSSQL or Oracle, you obviously DO NOT HAVE Big Data. Big Data is data that cannot be dealt with using regular RDBMSes. Do you really have or plan to have multiple terabytes of data? If not, you don't have big data.
Based on the information you've given us we cannot give you any advice at all apart from stopping what you're doing and hiring an expert.
0x or or snor perron?!
If your company buys 'big data', I have a bridge to sell you.
Know your data. Don't build a castle in the sky; that's how SAP happened.
All rites reversed 2010
1. Hire some bonehead that is expendable and ask him to make the decision.
2. Fire him when the project fails.
3. Nobody will ever bring this up again.
Got Code?
The ELK Stack might be an option. In my field, (many) web servers can stream all their logs off-site in Real-Time using Logstash Forwarder (or instead they might use rysnc, or rsyslog, or...). A central server, in the secure private intranet perhaps reads and indexes this log data, (that's ElasticSearch, which is sort of like a personal Google for your logs, any logs of any kind, or other Big Data). Kibana is a user-friendly Angular.js application and presentation layer. If you're familiar with NewRelic for server monitoring, you can save views just like when using that tool.
http://jakege.blogspot.nl/2014...
Okay, maybe this is sort of like 'when all you have is a hammer, everything looks like a nail', but this suggestion is the extent of my background in this area. Although I have had an itch to scratch, and so far, this is my best open-source result.
There's a ton of citations you should search for yourself, but I'll provide one I found that might start to help. Using this tool, it is fairly easy to parse out the myriad of hacker efforts at attacking the servers for example; even when you're the NY Times.
Microsoft doesn't win the real "Big Data" contracts, but there's many medium data contracts with delusions of grandeur. I work with a TB-size (as in, >1 TB...) database and while it's certainly no longer small data it's not "Big Data". It fits in a traditional RDBMS, when we get past the buzzwords what our users want are fairly traditional cubes/reports with drilldown that OLAP systems provide. If Microsoft is bad, the alternatives like Oracle, SAS, SAP or IBM are worse. Looking at an open source stack replacing the database is actually the easy bit, I'm sure we'd do fine running on PostgreSQL or MariaDB. Reporting tools on par with Reporting Services are also easy to come by. I've seen nothing as user-friendly as Integration Services on the data flow side which we use a lot, but I guess we could use it with foreign sources and destinations too.
Probably the biggest lack on the data warehouse side is an open source OLAP server. The wikipedia page lists two, one is Palo/Jedox which is a very limited marketing version for their commercial product and the other is Mondarian which by closer inspection seems to just translate MDX to SQL and let the RDBMS database do the aggregation which I suppose is okay for small data sets but will choke on any significant volume. Basically it comes down to all the Microsoft tools being "good enough" and working nicely together, while the rest ends up being a mix of different pieces from here and there. Either that or you're looking at a whole different stack, and I got lots of requirements that'd make a NoSQL solution squirm.
Live today, because you never know what tomorrow brings
Former back-office Teradata employee here. Teradata makes a very powerful product, but if security and availability of your data is critical, then I would look elsewhere. I'm not going to divulge any company secrets, but I will copy some snippets from employee reviews on Glassdoor:
"Security is nonexistent. LAN credentials are sent in plain text (unencrypted) everywhere... CUSTOMER credentials to CUSTOMER systems (IP addresses and credentials) are sent in plain text (unencrypted)"
"IT outages are frequent, long, and completely avoidable. This is true for all aspects of IT (Network, Data Storage Solutions, Servers, Application, and Databases etc)"
"Disaster Recovery is always a second thought and most applications have no or very little 'actual' DR capability"
"Customer data, including IP addresses and passwords for their production systems, is not secure and is not treated with respect"
"Customer contracts are not accessible to IT so that we can claim 'plausible deniability'"
"Patching is a joke and no preventative maintenance is ever approved or done"
"If our customers knew how their data was being treated, they wouldn't be our customers for very long."
Ahh, yes. Cloud stuff. Where you are processing a lot of data and where your processing and I/O resources are not your own. I always laugh at people who say "Oh, we don't need all that infrastructure stuff" and start moaning "Oh, why does it cost so much and why do we have to spend so much more when we add data?" Not to mention putting your important data on a platform that is financially questionable, has outages that providers simply don't care about and where it's going to be one hell of a PITA to move at any time later owing to the amount of data.
Sounds like a recipe for success.
I recommend against MSSQL not because it's not a good DB (it is -- it was originaly Sybase) but because it's cumbersome to work with outside of the Microsoft ecosystem. You mainly interface with it using ODBC and that's a pain outside of Windows. You're stuck with windows boxes on the back end AND on the front end. You can add ODBC systems to the mid-layer/server boxes you'd rather have (Linux, usually) but now you're paying money to add a kludge. Furthermore, because it absolutely needs to run on Windows on the back end, you have to pay employees who are generally of the sort who are going to want more Microsoft tools, so you'll be creeping more and more away from free stuff which is easy to maintain to a bunch of licenses and a complex setup. (Had to get a bunch of Windows boxes set up with precisely this sort of issue just a few weeks ago -- man! was it painful.)
You could start your project with Postgres and find out why you're unhappy with it and plan for a migration to something which is better for you post-hoc: Don't write SQL procs, and don't weave your SQL through a whole lot of code. Though frankly, the suggestions for Red Shift seem right on the money. They use Postgres drivers, JDBC, and ODBC, so you're set on any platform you want to work on without any added cost. They have a two-month free trial. You could try that out first and figure out what you're unhappy with there as a first step. Same rules apply -- keep things simple.
DBs are not for chewing data -- they're for giving you just the data you need so you can chew on it. You use the right tool for the chewing job once you have the data. (Some DB pre-chew is fine in situations where it's efficient and easy -- group by's, mostly.) So it doesn't matter that much how long the feature set of your DB is. What matters is that it's fast and you can get data in and out of it just about anywhere you want to. I've seen shops where they do all their data chewing in SQL server. They write reams of ugly, ugly code. They do this because they know how, and don't realize that a little work learning other things would make them vastly more efficient. The thing to always remember is that you don't buy a hammer and assume everything is a nail. Buy something which works with lots of other tools and pick the right ones for your job.
I recommend against MSSQL not because it's not a good DB
I'm assuming this is based on your extensive MSSQL experience, right?
but because it's cumbersome to work with outside of the Microsoft ecosystem.
Well, at least MS has half-decent tools for it. Other than Oracle, they're the only player with a decent GUI interface.
You mainly interface with it using ODBC and that's a pain outside of Windows.
Or JBDC. It depends. Its not really Microsoft's fault if it doesn't work in your environment, is it?
You're stuck with windows boxes on the back end AND on the front end.
You just summarized 2/3rds of the corporate world.
You could start your project with Postgres and find out why you're unhappy with it and plan for a migration to something which is better for you post-hoc: Don't write SQL procs, and don't weave your SQL through a whole lot of code.
You could, and then figure out how to integrate it with your environment. And *WHICH* ODBC driver to choose. So, the pain you just described previously, its right there. With a half-assed, subpar connection driver.
The only OSS solution that comes somewhere *near* what MSSQL does is PostgreSQL, and its a second-class citizen in Windows. And even PgSQL is easily suprassed when looking at features and replication options.
Note: I'm a huge PostgreSQL fan, to the point of writing C# applications with the native PostgreSQL driver without LINQ support.I'd take PostgreSQL over MSSQL everyday of the week if reporting tools, support, features, replication and integration doesn't matter. But saying that MSSQL is bad, is just a silly mantra.
Personally, I think that the RedShift suggestion is perfect for OP. Judging by the vague requirements ("the big boss wants to get on the Big Data bandwagon!"), OP's company has no clue what it wants to do with its Big Data yet. So why throw down a ton of cash on a solution without having a good idea of what problem needs solving?
Playing around with RedShift a bit and seeing what value they can extract from their data would be a great pilot program. Later, once they know what they're doing, they can implement their "real" solution.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock