Slashdot Mirror


Ask Slashdot: Choosing a Data Warehouse Server System?

New submitter puzzled_decoy writes The company I work for has decided to get in on this "big data" thing. We are trying to find a good data warehouse system to host and run analytics on, you guessed it, a bunch of data. Right now we are looking into MSSQL, a company called Domo, and Oracle contacted us. Google BigQuery may be another option. At its core, we need to be able to query huge amounts of data in sometimes rather odd ways. We need a strong ETL layer, and hopefully we can put some nice visual reporting service on top of wherever the data is stored. So, what is your experience with "big data" servers and services? What would you recommend, and what are the pitfalls you've encountered?

147 comments

  1. Skip Oracle. by iamwhoiamtoday · · Score: 2

    Oregon Resident here. After the recent issues with Oracle..... yup. Not gonna recommend 'em again. Not a big fan of my tax money being wasted.

    1. Re:Skip Oracle. by RuffMasterD · · Score: 5, Informative

      Just from a technical and financial point of view, I wouldn't recommend Oracle either. Oracle Advanced Analytics just seems to be a very expensive way to get R.

      Financially - R is open source and free (as in both free as a bird, and free beer), so you don't need to buy it from Oracle. No doubt Oracle will make you buy their DBMS as well to work with Advanced Analytics, and a big server to run it on, plus support to get it up and running.

      Technically - Oracle make a good DBMS for sure, but you don't need all the advanced features their DBMS is good at, such as record level locking, three phase commit, redo logs, conflict resolution etc. You need that sort of stuff to maintain data integrity on transaction processing systems, but not for analysis. For analysis you just need a giant de-normalised table, and maybe indexes if you want to pick out specific subsets of records without full table scans.

      Personally I use SAS. It's not sexy, but I have never found a dataset too large to handle. It will thrash the hard drive all night if it has to to get a result, but it won't crash. SPSS will definitely crap itself with even moderate datasets. Stata does OK, but even that can't handle the larger datasets. I haven't pushed R hard enough to find its limit.

      --
      Human Rights, Article 12: Freedom from Interference with Privacy, Family, Home and Correspondence
    2. Re:Skip Oracle. by Anonymous Coward · · Score: 0

      1. Why did you recommend them the first time?

      Cluser should use AWS until he figures out which end is up. Alternatively, he can hack together a cheap iomega NAS hanging off an old linksys wrt54g router with a used T5500 as a head node. Total cost should be less than a grand. - It's the poor man's entry level "Nigerian Prince startup kit".
      Or you could just set yourself up as a dropbox reseller.

    3. Re: Skip Oracle. by Livius · · Score: 5, Informative

      There was a crime, and Oracle was a willing accomplice.

    4. Re:Skip Oracle. by Anonymous Coward · · Score: 1

      While your post is informative, it sort of misses the mark. Granted, TFS is clearly more on the clueless side, but you should have realized that 'analytics' here does not mean actual analytics, as much as simple BI reporting. The main requirement is ETL, reports are 'nice to have'.

      Back to your post, it's nice to know about SAS. For SPSS you want to try and push as much processing into the DB as possible, otherwise you need to get the server version and throw hardware at it, the local server that comes with the regular Modeler client is more of a 'demo mode' thing. As for R, to go around the 'all objects live in memory' restrictions you need either to go again with DB processing or use bigmemory and friends. However, in the end the point is that for some algorithms huge amounts of data are simply a no-no and not only due to access pattern needs that make scaling awful, but also because the correct approach is sampling anyway, since in trying to use all the data you're very likely long past the point where meaningful statistical precision is leveling off. (And sampling includes things like running several instances of R in parallel on different samples to create an ensemble of models)
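
      The sampling point can be sketched in a few lines. This is only an illustration in Python rather than R, with a made-up population and a plain mean standing in for a real model; the dataset, the sample sizes and the helper name sample_mean are all assumptions, not anything from the post.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical dataset standing in for "all the data".
population = [random.gauss(100.0, 15.0) for _ in range(200_000)]
true_mean = statistics.fmean(population)

def sample_mean(data, n):
    """Estimate the mean from a simple random sample of n rows."""
    return statistics.fmean(random.sample(data, n))

# The standard error shrinks roughly with 1/sqrt(n), so each extra
# order of magnitude of data buys less and less precision.
for n in (100, 1_000, 10_000):
    print(n, abs(sample_mean(population, n) - true_mean))

# Ensemble: several independent samples, each estimated separately,
# then combined. In practice each sample could go to its own process.
estimates = [sample_mean(population, 1_000) for _ in range(8)]
ensemble_estimate = statistics.fmean(estimates)
print(ensemble_estimate, true_mean)
```

      With a real model the combining step would be voting or averaging predictions instead of averaging means, but the shape of the workflow is the same.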

    5. Re:Skip Oracle. by wonkavader · · Score: 1

      I don't get it. Why are you denormalizing your tables?

      If you're talking about denormalizing, you're talking about a relatively complicated data set, else there would be nothing to denormalize. Almost nothing you'll do in SAS on any reasonably complex data requires all the fields. So any DB on the back end (postgres, mysql) should be able to join up what you need from a well-normalized dataset quickly.

      Or do you mean you're just making a big text file (or SAS data blob) and using that in SAS? If that's the case, I'm still left with more confusion. SAS is a terrible programming language -- truly ghastly. So adding even an SQL layer can be useful in fixing up data for SAS churning.

      What's your general technique?

    6. Re:Skip Oracle. by Anonymous Coward · · Score: 1

      I don't get it. Why are you denormalizing your tables?

      Most likely to simplify a star schema with many dimensions, it's a standard approach to keep your query run times relatively sane.
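
      As a rough sketch of what that looks like, here is a tiny star schema flattened into one wide table, using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_store, sales_flat) are invented for the example, and a real warehouse would do this in its own load step rather than in SQLite.

```python
import sqlite3

# Hypothetical star schema: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region   TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_store   VALUES (10, 'east'), (20, 'west');
    INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 20, 7.0), (2, 10, 11.0);

    -- Denormalise once: pay for the joins at load time, not per query.
    CREATE TABLE sales_flat AS
        SELECT p.category, s.region, f.amount
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_store   s ON s.store_id   = f.store_id;
""")

# Analysis queries now scan a single wide table with no joins.
totals_by_category = dict(conn.execute(
    "SELECT category, SUM(amount) FROM sales_flat GROUP BY category"
).fetchall())
print(totals_by_category)
```

      The trade-off is storage and load time for simpler, more predictable queries, which is usually acceptable when the data is written once and read many times.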

    7. Re:Skip Oracle. by DriveDog · · Score: 1

      SAS may be the best answer to "query huge amounts of data in sometimes rather odd ways". Using SQL Server for storage is fine, but not using anything else in front of it (SSAS is useless) is bringing a knife to a gun fight. Trying to do everything in a relational way means tying a hand and a foot behind your back. The real world doesn't neatly fit the model, hard as you might try to make it, so performance suffers greatly and doing unusual ad hoc things takes longer to figure out. Get SAS to send pure relational operations to the DBMS to do but perform other operations within. SAS's own SQL engine gives the user much more convenience since it supports SAS's functions and macro language, far richer than plain DBMS's, but I haven't found it to be particularly quick. In interoperability, SQL Server continues to improve, but SAS still works better with many more other applications. It has always been a best choice for moving data around. Organizations often choke on the licensing model, since most do "capital investments" every few years instead of paying a "licensing fee" every year (hefty, but does include some of the best support going). All this was about plain SAS. SAS/BI is really the product SAS will try to sell you to do what you describe; I haven't used it so can't rate it.

      As far as those comments about writing code with SAS being "terrible", well, it can be inconsistent, but mostly those people have just never grasped the somewhat unique models it uses of handling observations. I find T-SQL to be seriously lacking for many tasks. If going all-MS, get VS and use a regular procedural language along with SQL Server.

  2. First step by Anonymous Coward · · Score: 5, Insightful

    The first step is to ask Slashdot a really vague question to a highly technical and expensive undertaking.

  3. Elastic Search by Anonymous Coward · · Score: 0

    Elastic Search is getting there as a tool for this, but it isn't really ready yet.

  4. KISS by Anonymous Coward · · Score: 0, Insightful

    AWS RedShift. Don't bother with old school operating servers, patching OS's, etc.... Just focus on data + business logic. That's where you really add value, right?

    1. Re:KISS by Zarmvenius · · Score: 1, Insightful

      This. Redshift is far and away the cheapest and most straightforward solution. Hooks up nicely with Tableau to help analysts, efficient ingestion.

    2. Re:KISS by salesgeek · · Score: 0

      Redshift is a fantastic way to get started... the kind where you end up not needing to migrate to something else.

      --
      -- $G
    3. Re:KISS by segedunum · · Score: 2, Insightful

      Ahh, yes. Cloud stuff. Where you are processing a lot of data and where your processing and I/O resources are not your own. I always laugh at people who say "Oh, we don't need all that infrastructure stuff" and start moaning "Oh, why does it cost so much and why do we have to spend so much more when we add data?" Not to mention putting your important data on a platform that is financially questionable, has outages that providers simply don't care about and where it's going to be one hell of a PITA to move at any time later owing to the amount of data.

      Sounds like a recipe for success.

  5. Define the goals by Anonymous Coward · · Score: 1

    Define the goals. Don't mistake software for creativity and insight. If your company is going to crunch a lot of data find someone qualified to think analytically and recommend the correct tools for the job.

    I hear that R is very upcoming in statistical work. I also hear that any other 'big data' solution is going to cost you as much as a full time employee anyway.

    Also, yes, skip Oracle. If you put that much effort in to tuning a system/the way you're asking the question nearly anything could come up with a valid answer that quickly.

  6. Dear Slashdot, by ArchieBunker · · Score: 3, Funny

    Help do my job for me.

    --
    Only the State obtains its revenue by coercion. - Murray Rothbard
    1. Re:Dear Slashdot, by Sesostris+III · · Score: 5, Insightful

      Maybe. However I would also be interested in any answer (especially any answer involving FLOSS software). Interested not because it's my job or my company is looking to use such software, but because I'm curious and like to expand my knowledge.

      In general I don't mind such questions on Slashdot, as they're usually interesting and informative to the rest of us. And if they're not, then I (we) don't read the article!

      --
      You never know what is enough unless you know what is more than enough. - Blake
    2. Re:Dear Slashdot, by mlookaba · · Score: 1

      Are you one of those people that think developers should do everything themselves without asking for assistance? That shit leads to really, really bad code.

      It may not be fashionable in your circles, but human communication is, and will always be, a basic element of engineering.

  7. Call up your favourite channel distie by Anonymous Coward · · Score: 2

    The way you're going at it you're basically burning money. "We must have this big data thing too!" is every hardware vendor's eyes going "ka-ching" and you'll be overpaying whatever you do. Even if you think you're getting a good price.

    The problem with big data as a thing (BDaaT) is that without a clear goal you'll be gathering too much data and storing it for too long. Thereby you "need" too much processing power to shoot through it, and the only way left is downhill. This creates myriads of problems, of which overpaying for too much hardware is but the least.

    So, you think you're serious about this big data thing? Just bring sacks of money to your fave distie. That is all.

    1. Re:Call up your favourite channel distie by Anonymous Coward · · Score: 0

      BDaaT is clearly a pitfall of IT Buzzword Bingo ITBB. However, you can mitigate the crap out of everything you just listed above if you know what the hell you are doing (RTFM). EMC published (paid someone to publish) some silly white paper talking about how their Data Lake architecture was 2-4 times faster at various Hadoop operations. What they neglected to mention was that they compared 6 nodes each with 8 10k SAS drives against a cluster of 4 nodes each with 34 drives, 2 SSD cache drives and a crap load of caching memory. They also paid their subsidiary VMWare an unnecessary 60k for the privilege of virtualizing it all. Rough price comparison was 60k vs 600k.

      If you are going to build it on premise, you need to have people who know what they are doing and build it on commodity/FOSS products. This is SuperMicro/whitebox servers with efficiently priced hardware - Buy 7.2k hard drives, 8GB DIMMs, and e5-2620 or lower... The idea is to scale out, not up. A 12-node cluster at 24TB with 64GB of RAM on a dual e5-2620 costs about $4500 per node (or $54,000). If you did that with Dell you're looking at close to double that ($100,000) AFTER you beat them down on the volume pricing. Forget Cisco/IBM/HP as they will be even more. You are paying for the engineering and warranty for a clustered system that is designed to not require it. Know what you are doing and you will save a lot of money.

        The most expensive of all of this is people. But since you need people in any instance, might as well save a few hundred K.
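
      For anyone checking the arithmetic, the comparison works out as below. The figures are only the rough ones quoted in this comment, not a price list, and the variable names are made up for the sketch.

```python
# Figures as quoted in the comment; treat them as rough estimates only.
nodes = 12
whitebox_per_node = 4_500   # USD per whitebox node, spec as quoted
dell_cluster = 100_000      # USD, the comment's estimate after discounts

whitebox_cluster = nodes * whitebox_per_node
saving = dell_cluster - whitebox_cluster
ratio = round(dell_cluster / whitebox_cluster, 2)

print(whitebox_cluster)  # total for the whitebox cluster
print(saving)            # difference against the Dell estimate
print(ratio)             # how many times more the Dell estimate costs
```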

  8. Postgres-XL by bruce_the_loon · · Score: 4, Informative

    Open-source so you don't have to cough up millions of dollars to see if you can get business.

    Clusterable, scalable and standards-based so you're not locking down too far into one solution-space.

    --
    Trying to become famous by taking photos. Visit my homepage please.
  9. Please run away from oracle by Anonymous Coward · · Score: 0

    I am in luck to witness how bad their ETL tool is. In the end it works (In same way that assembler would also work..).
    There are bugs all over the place (the most "pleasant" one is the one that occurs randomly during the saving and previously unsaved work gets lost). Also it would be good to have very very good relations with your admins who would need to spend enormous amount of time optimizing it (otherwise you will have lots of time for drinking coffee, while things are "opening"). Some features are completely mis-implemented (e.g. "copying" feature, analogous would be to have some C++ object representing hierarchy (tree), and just doing memory copy without adjusting any pointers, which would mean that in the copy of the object all references, that should be internal to the object, actually point to old object). And on all top of that, logging seems to be slapped on really as an afterthought by somebody's nephew..
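
    The copy bug described above is the classic shallow-versus-deep copy problem. A minimal sketch, in Python rather than C++ and with a made-up Node class, shows how a naive copy of a tree leaves its internal references pointing into the original:

```python
import copy

class Node:
    """Hypothetical tree node with a back-reference to its parent."""
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

root = Node("root")
leaf = root.add(Node("leaf"))

# Shallow copy: a new top-level object, but the child list and the
# children's parent references are shared with the original tree.
shallow = copy.copy(root)
shallow_points_to_old = shallow.children[0].parent is root
print(shallow_points_to_old)

# Deep copy: every internal reference is rewritten to point inside the copy.
deep = copy.deepcopy(root)
deep_is_self_contained = deep.children[0].parent is deep
print(deep_is_self_contained)
```

    Editing the shallow copy would silently change the original as well, which matches the behaviour the poster is complaining about.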

  10. What would you recommend.. by Anonymous Coward · · Score: 1

    do your job or go apply at mcdonalds.

  11. Check out Amazon Redshift by Anonymous Coward · · Score: 2, Insightful

    Pretty easy to try it out immediately... http://aws.amazon.com/redshift

  12. Microsoft? by Anonymous Coward · · Score: 0

    MSSQL?

    why would anyone in their right mind go with MICROSOFT for a company database ? specially a big data database ?

    I will not claim any "big data" experience.

    but make realistic goals that are testable and expect to pay thru the nose when dealing with Oracle and the other big money options.

    and you will need to clean your data up. databases collect a lot of crap over time.

    1. Re:Microsoft? by WaffleMonster · · Score: 3, Informative

      MSSQL?

      why would anyone in their right mind go with MICROSOFT for a company database ? specially a big data database ?

      I will not claim any "big data" experience.

      At least you have an opinion informed by no experience.

    2. Re:Microsoft? by Anonymous Coward · · Score: 0

      that's no big data

      i have lots of database experience, i was building them for fast access / update

      big data is a whole different beast than anything normal

      the size of the data, blows most vendors products out of the water.

    3. Re:Microsoft? by Anonymous Coward · · Score: 0

      At least it isn't SAP.

    4. Re: Microsoft? by jrumney · · Score: 1

      I'm not sure if you're trying to be funny here, or just lacking in knowledge about the history of Sybase.

    5. Re:Microsoft? by Livius · · Score: 1

      I'm guessing he has experience with Microsoft, with respect to which his opinion is highly informed.

    6. Re: Microsoft? by Anonymous Coward · · Score: 0

      We use MSSQL clusters to crunch financial data for very large financial institutions. Yes they need some serious hardware but they rock and are reliable. We do offload a few things to Azure with mixed results.

    7. Re:Microsoft? by Kjella · · Score: 3, Interesting

      Microsoft doesn't win the real "Big Data" contracts, but there's many medium data contracts with delusions of grandeur. I work with a TB-size (as in, >1 TB...) database and while it's certainly no longer small data it's not "Big Data". It fits in a traditional RDBMS, when we get past the buzzwords what our users want are fairly traditional cubes/reports with drilldown that OLAP systems provide. If Microsoft is bad, the alternatives like Oracle, SAS, SAP or IBM are worse. Looking at an open source stack replacing the database is actually the easy bit, I'm sure we'd do fine running on PostgreSQL or MariaDB. Reporting tools on par with Reporting Services are also easy to come by. I've seen nothing as user-friendly as Integration Services on the data flow side which we use a lot, but I guess we could use it with foreign sources and destinations too.

      Probably the biggest lack on the data warehouse side is an open source OLAP server. The Wikipedia page lists two, one is Palo/Jedox which is a very limited marketing version for their commercial product and the other is Mondrian which on closer inspection seems to just translate MDX to SQL and let the RDBMS database do the aggregation which I suppose is okay for small data sets but will choke on any significant volume. Basically it comes down to all the Microsoft tools being "good enough" and working nicely together, while the rest ends up being a mix of different pieces from here and there. Either that or you're looking at a whole different stack, and I got lots of requirements that'd make a NoSQL solution squirm.

      --
      Live today, because you never know what tomorrow brings
    8. Re:Microsoft? by Anonymous Coward · · Score: 0

      SSIS, user-friendly? that's funny.

    9. Re:Microsoft? by Anonymous Coward · · Score: 0

      except not...he wasn't referring to ms themselves, but to MS SQL. Of which, in the realm of big data, he's not informed at all by his own admission.

      Thus, his opinion is worthless here in this context

    10. Re:Microsoft? by Bengie · · Score: 1

      Data Warehouses are a completely different beast. I've had to do research into several DWH offerings in the past, and Microsoft actually does a very good job. Each system has a lot of pros and cons and different performance characteristics for different kinds of loads, but there are plenty of 100TB+ Microsoft Data Warehouses.

    11. Re:Microsoft? by wonkavader · · Score: 3, Informative

      I recommend against MSSQL not because it's not a good DB (it is -- it was originally Sybase) but because it's cumbersome to work with outside of the Microsoft ecosystem. You mainly interface with it using ODBC and that's a pain outside of Windows. You're stuck with windows boxes on the back end AND on the front end. You can add ODBC systems to the mid-layer/server boxes you'd rather have (Linux, usually) but now you're paying money to add a kludge. Furthermore, because it absolutely needs to run on Windows on the back end, you have to pay employees who are generally of the sort who are going to want more Microsoft tools, so you'll be creeping more and more away from free stuff which is easy to maintain to a bunch of licenses and a complex setup. (Had to get a bunch of Windows boxes set up with precisely this sort of issue just a few weeks ago -- man! was it painful.)

      You could start your project with Postgres and find out why you're unhappy with it and plan for a migration to something which is better for you post-hoc: Don't write SQL procs, and don't weave your SQL through a whole lot of code. Though frankly, the suggestions for Red Shift seem right on the money. They use Postgres drivers, JDBC, and ODBC, so you're set on any platform you want to work on without any added cost. They have a two-month free trial. You could try that out first and figure out what you're unhappy with there as a first step. Same rules apply -- keep things simple.

      DBs are not for chewing data -- they're for giving you just the data you need so you can chew on it. You use the right tool for the chewing job once you have the data. (Some DB pre-chew is fine in situations where it's efficient and easy -- group by's, mostly.) So it doesn't matter that much how long the feature set of your DB is. What matters is that it's fast and you can get data in and out of it just about anywhere you want to. I've seen shops where they do all their data chewing in SQL server. They write reams of ugly, ugly code. They do this because they know how, and don't realize that a little work learning other things would make them vastly more efficient. The thing to always remember is that you don't buy a hammer and assume everything is a nail. Buy something which works with lots of other tools and pick the right ones for your job.

    12. Re:Microsoft? by Anonymous Coward · · Score: 1

      For a more SQL-oriented approach with open source, take a look at the madlib library that extends PostgreSQL with user-defined types and many stored procedures for in-database analysis. It can also be scaled up with $$$ by running it on Greenplum instead of pure PostgreSQL, but you can go a long way with PostgreSQL on a modern, commodity server with large RAM (256GB...1TB) and/or fast disks (hardware RAID and/or SSD). You may be able to focus more funds on this hardware rather than a bunch of software licenses.

      My experience is that many SAS and STATA developers are blind to the extremely inefficient data handling they do. The steps they perform to produce extracts as new intermediate files before running a final aggregation are shockingly primitive. It's like they write a 1970s RDBMs query plan by hand and put all intermediate results to disk as new files: Load a table. Filter it. Sort it. Dump a table. Load another table. Filter it. Sort it. Dump it. Load the first dump. Load-and-merge the second dump. We finally have a trivial join with two tables and a where clause!

      Writing a normal SQL query to extract the same intermediate table from the RDBMS is a night and day speedup, and incorporating the actual aggregate calculations into that SQL query can often obviate the need for the SAS or STATA code entirely. Allowing the query planner to optimize the whole data flow is a big win, compared to the naive sequence of tasks the programmer would write by hand.
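
      To make the contrast concrete, here is a small sketch using Python's built-in sqlite3 module. The patients and visits tables are invented for the example, and the "manual" half is only an imitation of the load, filter, sort and merge style, not actual SAS or STATA code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE visits   (patient_id INTEGER, cost REAL);
    INSERT INTO patients VALUES (1, 'OR'), (2, 'WA'), (3, 'OR');
    INSERT INTO visits   VALUES (1, 50.0), (1, 70.0), (2, 20.0), (3, 10.0);
""")

# Hand-written plan: load each table, filter, sort, then merge in code.
patients = conn.execute("SELECT patient_id, state FROM patients").fetchall()
visits = conn.execute("SELECT patient_id, cost FROM visits").fetchall()
oregon_ids = sorted(pid for pid, state in patients if state == "OR")
manual_total = {}
for pid, cost in sorted(visits):
    if pid in oregon_ids:
        manual_total[pid] = manual_total.get(pid, 0.0) + cost

# Declarative version: one query, and the planner picks the join strategy.
sql_total = dict(conn.execute("""
    SELECT v.patient_id, SUM(v.cost)
    FROM visits v
    JOIN patients p ON p.patient_id = v.patient_id
    WHERE p.state = 'OR'
    GROUP BY v.patient_id
""").fetchall())

print(manual_total, sql_total)
```

      Both halves give the same totals; the difference is that the second one moves only the aggregated rows out of the database and leaves the join order and access paths to the query planner.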

      However, the main obstacle is cultural. Those SAS and STATA developers often have no interest in learning declarative SQL nor allowing their existing programming skills to look unnecessary to management. So, you can get a big backlash just suggesting that the legacy methods in an analysis group might be part of the problem rather than state-of-the-art wizardry. It's a bit like telling Java ORM users that they should actually use the strong capabilities of their RDBMs instead of treating it like a dumb store and doing all the filtering, sorting, and joining in their Java code.

    13. Re: Microsoft? by jedidiah · · Score: 1

      I've worked for large financial institutions and Unix is something that would be considered "small" in that environment. I've also seen departments shoehorn Microsoft products into a big problem and fail just to have to turn around and use something else.

      --
      A Pirate and a Puritan look the same on a balance sheet.
    14. Re:Microsoft? by rev0lt · · Score: 2

      I recommend against MSSQL not because it's not a good DB

      I'm assuming this is based on your extensive MSSQL experience, right?

      but because it's cumbersome to work with outside of the Microsoft ecosystem.

      Well, at least MS has half-decent tools for it. Other than Oracle, they're the only player with a decent GUI interface.

      You mainly interface with it using ODBC and that's a pain outside of Windows.

      Or JDBC. It depends. It's not really Microsoft's fault if it doesn't work in your environment, is it?

      You're stuck with windows boxes on the back end AND on the front end.

      You just summarized 2/3rds of the corporate world.

      You could start your project with Postgres and find out why you're unhappy with it and plan for a migration to something which is better for you post-hoc: Don't write SQL procs, and don't weave your SQL through a whole lot of code.

      You could, and then figure out how to integrate it with your environment. And *WHICH* ODBC driver to choose. So, the pain you just described previously, it's right there. With a half-assed, subpar connection driver.

      The only OSS solution that comes somewhere *near* what MSSQL does is PostgreSQL, and it's a second-class citizen in Windows. And even PgSQL is easily surpassed when looking at features and replication options.

      Note: I'm a huge PostgreSQL fan, to the point of writing C# applications with the native PostgreSQL driver without LINQ support. I'd take PostgreSQL over MSSQL every day of the week if reporting tools, support, features, replication and integration don't matter. But saying that MSSQL is bad is just a silly mantra.

    15. Re:Microsoft? by wonkavader · · Score: 1

      I recommend against MSSQL not because it's not a good DB

      I'm assuming this is based on your extensive MSSQL experience, right?

      Yes, it is.

      You're right on the replication. I think that's Postgres's obvious weak point. It's what you'd find that you didn't like. I assume that's why you ignored Red Shift. The rest of your arguments simply prove my point.

    16. Re:Microsoft? by rev0lt · · Score: 1

      It's what you'd find that you didn't like.

      It still is a huge limitation, as you cannot easily sync a local dataset with a remote one.

      I assume that's why you ignored Red Shift.

      RedShift has a limitation of 16TB per node. It's nice, but not really "big data". It's more like RDS on steroids, and I think RDS is "so-so". Also, you either use sync from/to amazon interfaces, or you're stuck with JDBC, so basically the same limitations you mentioned apply.

    17. Re:Microsoft? by inline_four · · Score: 1

      DBs are not for chewing data -- they're for giving you just the data you need so you can chew on it. You use the right tool for the chewing job once you have the data. (Some DB pre-chew is fine in situations where it's efficient and easy -- group by's, mostly.)

      Seems like there aren't many responses here talking about columnar databases. This is a class of relational databases very well suited for data warehousing. I have been working with Vertica, which is a proprietary technology, but the license terms are much more favorable and fair than what you get out of Oracle (they aren't comparable anyway). It's a mindset change when you get into columnar databases, but on the whole they can be simpler than what you get trying to tune a traditional relational database for big data warehousing purposes.
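
      For readers new to the idea, the difference in layout can be sketched with plain Python lists. This is a toy model with invented data; it says nothing about how Vertica or any other product stores data internally.

```python
# Hypothetical three-row table, first stored row by row.
rows = [
    {"order_id": 1, "region": "east", "amount": 5.0},
    {"order_id": 2, "region": "west", "amount": 7.0},
    {"order_id": 3, "region": "east", "amount": 11.0},
]

# Row store: an aggregate walks whole records even though it needs one field.
row_total = sum(r["amount"] for r in rows)

# Column store: each column is kept as its own contiguous array.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# An aggregate now reads only the column it needs.
col_total = sum(columns["amount"])

# Filters line columns up by position instead of by record.
east_total = sum(
    amount
    for amount, region in zip(columns["amount"], columns["region"])
    if region == "east"
)
print(row_total, col_total, east_total)
```

      Warehouse queries typically touch a handful of columns across very many rows, which is why reading only the needed columns pays off, and values of one type stored together also tend to compress well.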

      You will still need to think about your ETL and reporting technologies. This can be difficult depending on the nature and stability of your data and reporting customers' needs. On the whole, some things to think about are adherence to standards, not being afraid to operate multiple data marts, separating different reporting functions into different applications (internal vs external, raw data extracts vs analytics, etc.), and look at some map-reduce technologies (some shared-nothing databases give you that under the hood for free, some make it explicit, like Hadoop).

      --
      Alexey
    18. Re:Microsoft? by Anonymous Coward · · Score: 0

      I recommend against MSSQL not because it's not a good DB

      I'm assuming this is based on your extensive MSSQL experience, right?

      Does anybody else think he didn't see the first 'not' in that sentence? His response seems rather aggressive, as though he thought you were claiming it wasn't a good DB.

    19. Re:Microsoft? by Walking+The+Walk · · Score: 1

      Oops, down-modded by accident, sorry about that. Wish I could undo mods within 10 seconds or something. Posting to undo my mod.

      --
      A recursive sig
      Can impart wisdom and truth
      Call proc signature()
    20. Re:Microsoft? by rev0lt · · Score: 1

      Yah, you are actually right. I didn't see it. My mistake.

  13. In source by Anonymous Coward · · Score: 0

    I would in my very limited experience recommend an in house solution. Leaning toward a mssql setup myself. Currently my company uses a third party app which we have to shovel money into year round. Save your money.

  14. Big data is even bigger than web scale!! by Anonymous Coward · · Score: 0

    Lifted this secret dictionary from a marketing professional while they were asleep at great risk. If they find out I took it they will use big data to track me down and kill me so please hush.

    Page 1.

    "The Cloud" - Market speak for our terms may change at any time and when they do you're fucked Pinocchio.

    "IoT" - Shorthand for "idiot" a derogatory term used to describe those who believe building Internet connected toasters make them "kewl"

    "Big Data" - Is a "BFD" stated in most sarcastic tone humanity can deliver. It's also realization after stalking millions of people, collecting countless trillions of data points all of your efforts were in fact completely worthless.

    "Map Re-Duce" - This is an appalling socially unacceptable practice of counting duces as they make contact with the bowl after having fully submerged beneath the water line. Once all duces are counted they are picked up by hand and dropped one at a time on a map where the Gaussian distribution of droppings is carefully catalogued for further statistical analysis.

    "NoSQL" - Shorthand for "No SQuirreLs" .. This is often shouted as cry for help by the hopelessly clueless who believe squirrels have cursed them when in actuality their problems were self-inflicted.

  15. gigo by Anonymous Coward · · Score: 0

    " query huge amounts of data in sometimes rather odd ways. "

    clients who spend over 2k/month and also are interested in dating shemales? you don't say....

    my personal experiences with big data and "odd queries" is that unless used for targeted advertising (which can harm the brand if done TOO well or too clunkily) it is collected and never sees any real use. Why? People required to provide information usually decide that their time is more important than your data point and write in whatever they think will pass a cursory inspection. This makes trends impossible to spot, and thus: gigo.

     

  16. Hadoop by MrEcho.net · · Score: 2

    Don't waste your time and money, just go with Hadoop.
    Need ETL? Well for one there is PIG, but if you want to do stream processing Apache Storm / Kafka.
    Take a look at this, http://hortonworks.com/hdp/
    All completely Open Source.

    1. Re: Hadoop by Anonymous Coward · · Score: 0

      Having worked in BI for the last decade. This is quite possibly the most ridiculous response to a warehouse question ever. But also one that is becoming more common.

    2. Re: Hadoop by Anonymous Coward · · Score: 0

      Why is it ridiculous?

    3. Re:Hadoop by mattcasters · · Score: 1

      And if you want visual drag and drop ETL development and orchestration, use Pentaho Data Integration (a.k.a. Kettle). Comes in open source with an Apache license or professionally supported. Supports visual Map/Reduce development, integration with Pig, Sqoop, Oozie, ...
      For SQL you can use Hive but try one of the alternative engines like Impala as well.

      --
      News about the Kettle Open Source project: on my blog
    4. Re: Hadoop by mattcasters · · Score: 2

      And yet, data warehouse data off-load and outright replacement is one of the more popular Big Data applications right now.
      The main driver is the prohibitively expensive storage, user license and "per core" cost of traditional databases.

      There are also fundamental questions hidden underneath the big data vs BI dilemma: how do you model against requirements (Kimball) when you don't have the requirements yet and you still want to keep all options open? Another one is how you can successfully open up PBs to end-users without breaking the bank and without a query taking weeks to complete.

      --
      News about the Kettle Open Source project: on my blog
    5. Re: Hadoop by Dishwasha · · Score: 1

      If you want your Hadoop cluster to be fast and easy to use, go with Spark https://spark.apache.org/.

    6. Re: Hadoop by lucm · · Score: 4, Informative

      Because that kind of setup works mostly for highly specialized requirements, such as processing ad clicks or log files. That's totally different from a data warehouse, where you store a lot of data with the idea that users can do a bit of exploration and analysis on their own using client tools like Excel, Tableau or MicroStrategy.

      There are 3 kinds of setup for Big Data:

      1) Massively parallel processing, such as AWS Redshift or Google BigQuery (or IBM Netezza if you have money). Those are regular databases on steroids and they let you query data on your own. Redshift is basically a huge multi-tenant Postgres cluster.

      2) MapReduce, such as AWS EMR. This is more or less a clunky kind of ETL where you need to code every single question to which you want an answer. It scales well on the volume side (because of Hadoop distributed file system) but it is extremely tedious to implement and offers zero self-service capabilities for data analysts beyond what is hard-coded in your setup. The ETL language from Apache, Pig, is very basic - for just about everything you need to fire up Eclipse and write Java code. There are a few SQL frameworks that can sit on top of Hadoop, but none are blazing fast or immensely reliable, and for the most part with those SQL solutions it ends up being a cheapskate alternative to a proper DW.

      3) Machine learning, such as Spark or Mahout (also based on the Hadoop file system). Those also require extensive programming and typically won't offer clear answers; they are mostly useful to find trends or patterns. It's all the rage right now with "data scientists", just like MR was all the rage 3 years ago and did not really stick because it's too clunky. Again, this is a scenario where you know what you are looking for, because you have to "train" your system for specific tasks.
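
      To make the "code every single question" point in #2 concrete, here is a minimal single-process sketch of the map/shuffle/reduce pattern in plain Python. It is not Hadoop code; the click records and the names map_click, reduce_clicks and run_job are made up for illustration.

```python
from collections import defaultdict

# Hypothetical ad-click records; a real job would read these from HDFS.
CLICKS = [
    {"campaign": "spring", "country": "US"},
    {"campaign": "spring", "country": "NL"},
    {"campaign": "summer", "country": "US"},
]

def map_click(record):
    """Map step: emit one (key, 1) pair per click."""
    yield record["campaign"], 1

def reduce_clicks(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

def run_job(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

clicks_per_campaign = run_job(CLICKS, map_click, reduce_clicks)
```

      Answering a different question (say, clicks per country) means writing and deploying another mapper, which is the tedium described above.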

      HortonWorks is an all-inclusive Hadoop setup that includes most of what is needed for #2 or #3, but since AWS and Azure offer a totally scalable Hadoop environment for pennies, in my experience HortonWorks is for companies who want nothing to do with the cloud or for total newbies who want to see what that Hadoop thing is. But it does not offer the benefit of letting you learn what the moving pieces are, because it comes all configured.

      So unless you have a very specific set of reports or indicators and a shitload of data, the only serious answer is to keep doing what BI people have been doing for decades: build data warehouses and use a decent front-end that includes a flexible reporting platform and self-service capabilities (such as OLAP). And only if you have tons of data should you even bother with Big Data products, as none of those are cheap. Redshift is in the $1000-$5000/TB/year range. For a large organization that's nothing, but for some guy trying to start a vague BI initiative that's expensive.

      When it comes to non-Big Data BI (i.e. something to set up on a few servers at most), the options are the following:

      1) SQL Server and its built-in BI suite, or Oracle and its built-in BI suite. A bit expensive but very flexible. Not ideal for self-service unless you have experienced DBAs.

      2) Any RDBMS + IBM Cognos or + SAP BusinessObjects. Expensive, but you can define a data universe and then let users build their own reports. Ideal for self-service and for situations where you don't have a full-time DBA who can write queries or build OLAP cubes.

      3) A patchwork of FOSS: MySQL, Mondrian, Jasper, Talend, etc. Free but not integrated so it requires a bit of work.

      Big Data != BI. It just means that you have more data than you could process on a regular database cluster. Even with social networks, ads and blogs, I haven't seen that many situations where this is truly needed.

      --
      lucm, indeed.
    7. Re: Hadoop by Anonymous Coward · · Score: 0

      This post is amazing in that it uses all the right tool names, but describes the situation almost entirely incorrectly and manages to come across as insulting towards someone making a perfectly reasonable suggestion. There are plenty of packaged Hadoop based solutions that address all 3 of the false-divisions Mr. Echo is trying to make - MapReduce is in fact a form of massively parallel processing, and that type of processing is necessary for performing machine learning so treating them as though they are 3 totally different things makes no sense. Incidentally, Hadoop's primary function is to implement MapReduce, so it easily accomplishes the first two and with Mahout thrown in, it covers all 3.

      If you want to make the jump and just pick Hadoop rather than messing with all the nonsense described above, Hortonworks is a good option, but there's also Cloudera and DataStax. Those all include a tool called Hive that is designed for doing exactly the sort of business intelligence data warehouse exploration MrEcho described - Tableau can even connect to it. So we can drop all this nonsense of making false distinctions. I'd actually say that I consider the MapReduce only focus as a limitation of Hadoop, but the fact that so many other tools have been built on top and so many things integrate is definitely a huge asset in its favor.

    8. Re: Hadoop by lucm · · Score: 2

      I'd actually say that I consider the MapReduce only focus as a limitation of Hadoop, but the fact that so many other tools have been built on top and so many things integrate is definitely a huge asset in its favor.

      Most of the tools built on top of Hadoop use HDFS (the Hadoop filesystem) and no Map Reduce at all. I think you are a textbook example of someone who learned Hadoop by using HortonWorks and therefore has no idea what the various underlying moving parts are.

      --
      lucm, indeed.
    9. Re: Hadoop by Anonymous Coward · · Score: 0

      A (kind of -- they've since been purchased and dissolved) former employer of mine ran a profitable ad business using a purely relational IBM DB2 database and a ColdFusion front-end. It wasn't the prettiest or sexiest system, but it worked quite well. Its replacement ran on SQL Server with all of the OLAP extensions (with planned uses of Hadoop where it made sense.... mainly to replace the log processing system versus replacing the database).

  17. Analytics + mssql = fail by TyFoN · · Score: 3, Informative

    Whatever you do, don't go mssql as you will end up processing most of your data in the analytics tool.
    I've seen it lock tables even on only reads causing other processes to be terminated.
    The closest it has got to materialized views are clustered indexed views which suck and can barely do any processing.

    1. Re:Analytics + mssql = fail by Anonymous Coward · · Score: 1

      Yeah, I use only the SQL side, and I really hate SSRS and I never use SSAS.
      SQL Server Reporting Services is great, until you want to do something really cool and complex, and then it is a hellish wasteland of tears.
      I WARNED YOU!

    2. Re:Analytics + mssql = fail by lucm · · Score: 1

      I've seen it lock tables even on only reads causing other processes to be terminated.

      That's because someone who does not understand how the product works has configured a serializable transaction isolation level. I would suggest to RTFM but maybe you need to start with the basics: http://en.wikipedia.org/wiki/I...

      --
      lucm, indeed.
    3. Re:Analytics + mssql = fail by Bengie · · Score: 1

      SSRS is free with any SQL license. There are paid 3rd party reporting services.

    4. Re:Analytics + mssql = fail by Anonymous Coward · · Score: 1

      Whatever you do, don't go mssql as you will end up processing most of your data in the analytics tool.

      Why?

      I've seen it lock tables even on only reads causing other processes to be terminated.

      Try enabling snapshot isolation if you want MVCC

      The closest it has got to materialized views are clustered indexed views which suck and can barely do any processing.

      Try columnstore indexes if you want your mind blown.

    5. Re:Analytics + mssql = fail by greenwow · · Score: 0

      As with Oracle, READ COMMITTED is the default transaction isolation setting for Microsoft SQL Server, so if this is the problem, it is because someone strangely changed the default.

      FYI, for data warehouse work, MySQL does pretty darn well if you relax the transaction isolation level. By default, InnoDB uses a stronger transaction isolation than either Oracle or Microsoft. You can change it with this command:

      SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

      InnoDB defaults to REPEATABLE READ, which is very nice (and nicer than Oracle's READ COMMITTED), but it is too strict for either large tables or high loads.

      For our data warehouse with several billion row tables that are queried on every user login to do correlations between product purchases and views in order to recommend other products, changing that setting reduced the time it took to log in from about three seconds down to less than 200 milliseconds. There's a good reason Oracle and Microsoft default to a looser transaction isolation setting in order to try to make their databases appear more performant.

    6. Re:Analytics + mssql = fail by Anonymous Coward · · Score: 0

      Wow, marked as troll. I guess one of the moderators is a Microsoft fanboi.

      I worked on an accounting system for school districts using Microsoft's attempt at an SQL server, and every so often our end of the month reports would be incorrect. We finally tracked the problem down to data that was changed while a few of our stored procs ran. Many of the districts had hugely complicated distribution rules so the logic was very complicated. In other words, the data was changed after it was read. We couldn't simply set the transaction isolation with Microsoft SQL to repeatable read, because we had a ton of deadlocks which we could never find. Instead, we serialized our batch processing so no more than one process would run at a time and disabled user logins. That meant the system was unavailable for four or more hours a month for the larger districts, but it was better than having incorrect data.
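
      The "data was changed after it was read" problem is what snapshot isolation is meant to address. Below is a minimal sketch of the idea using SQLite in WAL mode from Python's standard library as a stand-in. It is not SQL Server, and the ledger table and amounts are invented for the example; it only shows a reader keeping a consistent view while a writer is not blocked.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ledger.db")

    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    writer = sqlite3.connect(path, isolation_level=None)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE ledger (amount INTEGER)")
    writer.execute("INSERT INTO ledger VALUES (100)")

    reader = sqlite3.connect(path, isolation_level=None)
    reader.execute("BEGIN")  # the report starts its transaction
    total_before = reader.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]

    # Another connection changes the data while the report is still running.
    writer.execute("INSERT INTO ledger VALUES (50)")

    # Same transaction, same snapshot: the report still sees the old total.
    total_during = reader.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]
    reader.execute("COMMIT")

    # A new read after the transaction ends sees the committed change.
    total_after = reader.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]

    reader.close()
    writer.close()
```

      Whether the equivalent SQL Server setting would have removed the deadlocks described above is not something this sketch can show.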

  18. we need more details on this "big data thing" by Anonymous Coward · · Score: 5, Informative

    Big data is an entire field of study. This is not "should I use vi or emacs or nano", and even that requires a shitload of context and will be the source of flame wars until the end of time.

    Think about your budget, your audience, and the value that you can add by spending time and money on this.

    MapReduce (Hadoop) is awesome and open source, you can run it in house or in multiple cloud offerings, and it has a tremendous community. BUT it sucks at relationships (foreign keys), graph calculations and others.

    Graph databases can make connections between things that are impossible in other systems, but are only good for graph relationships.

    OLAP data stored in n-dimensional cubes allows reporting and analysis in familiar tools that many analysts (not programmers) think are the cat's pajamas.
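
    As a rough picture of what an n-dimensional cube does, the sketch below pre-aggregates a few made-up sales facts over every combination of three dimensions, so that any roll-up becomes a single lookup. The facts, the dimension order and the ALL marker are assumptions for illustration; real OLAP engines are far more elaborate.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker meaning "rolled up over this dimension"

# Hypothetical facts: (region, product, year, sales amount)
FACTS = [
    ("EU", "widget", 2013, 10),
    ("EU", "gadget", 2013, 5),
    ("US", "widget", 2013, 7),
    ("US", "widget", 2014, 8),
]

def build_cube(facts):
    """Sum the measure for every mix of kept and rolled-up dimensions."""
    cube = defaultdict(int)
    for *dims, amount in facts:
        # For each dimension, either keep its value or replace it with ALL.
        for cell in product(*[(value, ALL) for value in dims]):
            cube[cell] += amount
    return cube

cube = build_cube(FACTS)
widget_sales = cube[(ALL, "widget", ALL)]   # all regions, all years
eu_2013_sales = cube[("EU", ALL, 2013)]     # all products in the EU in 2013
grand_total = cube[(ALL, ALL, ALL)]
```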

    Your best bet is to slow down and talk to your users, while reading Seven Databases in Seven Weeks
    https://pragprog.com/book/rwdata/seven-databases-in-seven-weeks
    And then realize that you probably need to hire a consultant so you have somebody to fire when the whole thing goes south.

    1. Re:we need more details on this "big data thing" by dotancohen · · Score: 1

      I'm out of modpoints but I would like to stress that _this post_ is an example of why Ask Slashdot is so successful at answering questions that boil down to "I don't know what I need to know to get this job done". This is the type of answer that will put the OP on the right track to figuring out what he needs.

      --
      It is dangerous to be right when the government is wrong.
    2. Re:we need more details on this "big data thing" by Cytotoxic · · Score: 2

      Plus the strategic element of bringing in a consultant. Outside expertise is valuable not only for the expertise, but also because of other less tangible benefits. The outside guy is always more trusted by the business units. It is just human nature. You can lecture everyone on the benefits of some new initiative until you are blue in the face and get nowhere, but bring in a consulting firm to say the same thing and everyone suddenly thinks it is a great idea.

      The same goes for having a scapegoat when things go south. A huge change like moving to a new data warehousing technology has a very high probability of hitting major snags and having lots of growing pains as end users figure out what it is that they really want it to do. Having a place outside the shop to shoulder the blame is a big deal, as is having someone outside say "your requirements specified X", something that is often not well received when it comes from the in-house team.

    3. Re:we need more details on this "big data thing" by Anonymous Coward · · Score: 0

      I have used "vi" with big data, and you're right it is not very good. I use grep now instead.

  19. That isn't big data by thogard · · Score: 4, Insightful

    If the data fits in a database, it is not Big Data.

    1. Re:That isn't big data by Anonymous Coward · · Score: 1

      You'd make a great CIO

    2. Re:That isn't big data by YA_Python_dev · · Score: 1

      There are companies with multi-PB databases (PB, not TB).

      --
      There's a hidden treasure in Python 3.x: __prepare__()
  20. Apache family by Sesostris+III · · Score: 2

    If I was tasked with coming up with ideas for a Data Warehouse Server System, and given that I know almost nothing about such systems, my first port of call would probably be Apache. What about Cassandra, Hadoop, Hive, Mahout or Pig (or combinations thereof)? All of these are downloadable and playable-with (and being Apache, FLOSS).

    As a previous poster pointed out, there is also PostgreSQL, again FLOSS. Again downloadable and playable-with.

    --
    You never know what is enough unless you know what is more than enough. - Blake
  21. But what do you need? by zmooc · · Score: 4, Insightful

    Sounds like you're very good in the buzzword-department but have no idea what you're doing at all.... What kind of data are we talking about? Lots of writes? Lots of reads? Is the data suitable for splitting up? What kind of queries will you need to run? Do you need uptime? Or consistency?

    Also if you're looking at MSSQL or Oracle, you obviously DO NOT HAVE Big Data. Big Data is data that cannot be dealt with using regular RDBMSes. Do you really have or plan to have multiple terabytes of data? If not, you don't have big data.

    Based on the information you've given us we cannot give you any advice at all apart from stopping what you're doing and hiring an expert.

    --
    0x or or snor perron?!
    1. Re:But what do you need? by jchevali · · Score: 1

      My thoughts exactly. This question is stupid.

    2. Re:But what do you need? by Anonymous Coward · · Score: 0

      I work as a data warehousing consultant. I mostly do MSSQL.

      I totally agree with the parent. If it fits in a database, it is not big data. My company would be happy to sell you a solution with your selection of buzzwords, if you so desire (and have the cash).

      --

      On a sidenote, Microsoft does have this:
      http://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/

      But seriously, you need to be more specific about your needs. I have run a system with 1 GB ETL every day, and still in a regular RDBMS. How big is your data?

    3. Re:But what do you need? by leuk_he · · Score: 1

      Exactly

      Big data is a different thing from data warehousing.

      In a big data scenario you have lots of data that you process with a highly scalable solution.

      In a data warehouse you collect data from different sources and transform it in several steps into a data model from which you can create reports in a simple way.

      And there is the other option, where you just have to process lots of records from a similar source (measuring data), and you carefully monitor and tune the processing of that data.

      The question does not even distinguish between these. Most of the above options are not a single-person job if you want it done in a reasonable timeframe (that is, before business needs catch up).

      The hardware depends on the size of the data. Hardware sellers are happy to sell their high-end servers with a scalable SAN, for a price at which you could buy a mid-class car, or even a price that is not listed. But depending on your needs, a simple Dell server with a good backup facility for under $1000 might suffice.

      No way to tell.

      PS, "services" might be a way to convince you that the solution needs to run in the cloud (to complete the buzzword circle) or a way to sell consultancy hours.

    4. Re:But what do you need? by jellomizer · · Score: 1

      The key problem is that most businesses run into their own insecurities.
      They are afraid of picking the uncool system that in 5 years would be scoffed at.
      Such as creating a new web app in Perl: nothing technically wrong, but it isn't cool anymore.
      It's the "no one has gotten fired for choosing IBM" mentality. It is more about picking the name that is supposed to impress your customers, not what is best for the job.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    5. Re:But what do you need? by drolli · · Score: 1

      "Big Data is data that cannot be dealt with using regular RDBMSes"

      Let's say: you may need or use a "regular" RDBMS for some things in big data, but it's not going to be your "Data Warehouse".

    6. Re:But what do you need? by cat_jesus · · Score: 1

      If you get a chance to work on APS, you should. I'm very impressed with it and it's just going to get better. MPP data warehouse combined with Hadoop? It's like combining peanut butter and chocolate.

    7. Re:But what do you need? by Anonymous Coward · · Score: 0

      A corollary is that just because you use some map-reduce tool or other trivial scale-out technology does not mean you have big data.

      I've seen a shocking number of projects that cobble together some dozen-node cluster and run a bunch of flaky "big data" methods on datasets that would easily fit into a single $5k-$10k server and perform much better with traditional RDBMS tools.

      People have this weird notion that databases are stupid tools and they can do much better with a few thousand lines of Python, Ruby, Perl, etc. Then they proceed to write the most naive code imaginable, which has big-O complexity problems and terribly inefficient data I/O, and then they think they can fix it all with brute force and a scaling factor of 10x from their small rack of servers. It's hilarious and yet sad.
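
      The big-O trap often looks like the first function below: a hand-rolled join that compares every row with every row. A database would typically pick something closer to the second, a hash join. Both are sketches on made-up order and customer rows with hypothetical function names, and the hash version assumes customer ids are unique.

```python
def nested_loop_join(orders, customers):
    """O(len(orders) * len(customers)): rescans customers for every order."""
    joined = []
    for order_id, customer_id in orders:
        for cid, name in customers:
            if cid == customer_id:
                joined.append((order_id, name))
    return joined

def hash_join(orders, customers):
    """O(len(orders) + len(customers)): build a lookup table once, then probe it."""
    names_by_id = dict(customers)  # assumes each customer id appears once
    return [
        (order_id, names_by_id[customer_id])
        for order_id, customer_id in orders
        if customer_id in names_by_id
    ]

# Hypothetical rows: (order_id, customer_id) and (customer_id, name)
ORDERS = [(1, "c1"), (2, "c2"), (3, "c1"), (4, "c9")]
CUSTOMERS = [("c1", "Alice"), ("c2", "Bob")]

slow = nested_loop_join(ORDERS, CUSTOMERS)
fast = hash_join(ORDERS, CUSTOMERS)
```

      On a few rows the difference is invisible; on millions of rows the first version does millions times millions of comparisons, and no small rack of servers makes up for that.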

    8. Re:But what do you need? by Anonymous Coward · · Score: 0

      Predictably, the /. hivemind has downvoted you for using SQL Server: "oh noes, it's teh ebil".

    9. Re:But what do you need? by Anonymous Coward · · Score: 0

      Lots of writes? Lots of reads?

      Lot of writes? Lot of WRITES?

  22. Open source by phantomfive · · Score: 1

    I would use open source and my own servers, but since you're considering Oracle and Microsoft, you should look at IBM Bluemix. I've heard good things about it. Watson integration.

    --
    "First they came for the slanderers and i said nothing."
  23. Ob by Anonymous Coward · · Score: 0

    Just avoid anything that uses systemd and you'll be fine.

    1. Re:Ob by Anonymous Coward · · Score: 0

      systemctl start stfu

  24. You are looking for the wrong product/service by Afty0r · · Score: 1

    We are trying to find a good data warehouse system to host and run analytics on

    You're asking the wrong questions. You should start higher up the chain in business-value land - WHY do you need a data warehouse system (to run analytics)... great, WHY do you need to run analytics (to discover XXXXX from the data we generate/own/handle). OK, now you're getting closer... now, armed with the knowledge about what data you will be storing, and what kind of insights you would like to generate, you need to approach a specialist data analysis & insights company who can help you to select the correct products and platforms for your data storage, processing and analysis needs.

    The way you have phrased the questions in your post makes it obvious you don't really have a lot of experience in this arena, and this is not a decision you can afford to get wrong. This company may also be able to offer consultancy about generating your queries, reports, and carrying out some of the data analysis, but it sounds like you want to do this yourself - now that's actually quite reasonable to attempt in-house.

    1. Re:You are looking for the wrong product/service by JaredOfEuropa · · Score: 1

      I would definitely recommend going with a reputable external consultant when it comes to getting started with queries and reports. They will be able to come up with good questions to ask of your data, but more importantly they can help avoid bad answers. For instance, given the initials, height, eye color, age and other such data of presidential candidates, I can probably come up with a filter that will correctly indicate whether or not the candidate won the elections, based on the data. But how useful is that filter for predicting the outcome of the next election? That is the pitfall of big data.
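
      The pitfall can be shown with pure noise. In the sketch below every candidate gets random attributes and a coin-flip outcome, so there is nothing to learn, yet a "filter" that simply memorizes past candidates still scores perfectly on the past. All names, attributes and values are invented for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def random_candidate(index):
    """Made-up candidate: a unique name plus attributes unrelated to the outcome."""
    return (
        f"candidate-{index}",
        rng.choice("ABCDEFGH"),                  # initial
        rng.randint(160, 200),                   # height in cm
        rng.choice(["blue", "brown", "green"]),  # eye color
    )

def make_elections(start, count):
    """Pair each candidate with a coin-flip 'won' flag."""
    return [(random_candidate(i), rng.random() < 0.5) for i in range(start, start + count)]

past = make_elections(0, 40)
future = make_elections(40, 40)

# The "filter": a lookup table that memorizes every past candidate.
memorized = dict(past)

def predict(candidate):
    # Unseen candidates fall back to a fixed guess.
    return memorized.get(candidate, False)

def accuracy(elections):
    hits = sum(predict(candidate) == won for candidate, won in elections)
    return hits / len(elections)

past_accuracy = accuracy(past)      # perfect by construction
future_accuracy = accuracy(future)  # no better than guessing, on average
```

      A filter tuned on eye color and height is a less obvious version of the same lookup table, which is why it needs to be checked against data it has never seen.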

      --
      If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
    2. Re:You are looking for the wrong product/service by salesgeek · · Score: 1

      Right now, if you are starting with "Data Warehouse" you probably are using the wrong answer key to score your wrong questions.

      --
      -- $G
  25. Good grief... by Ceriel+Nosforit · · Score: 2

    If your company buys 'big data', I have a bridge to sell you.

    Know your data. Don't build a castle in the sky; that's how SAP happened.

    --
    All rites reversed 2010
    1. Re:Good grief... by Anonymous Coward · · Score: 0

      Just curious, what's wrong with SAP?

  26. You must follow the correct process. by codepunk · · Score: 4, Insightful

    1. Hire some bonehead that is expendable and ask him to make the decision.
    2. Fire him when the project fails.
    3. Nobody will ever bring this up again.

    --


    Got Code?
    1. Re:You must follow the correct process. by CreatureComfort · · Score: 2

      This Ask Slashdot question is the direct result of Step 1.

      --
      "Unheard of means only it's undreamed of yet,
      Impossible means not yet done." ~~ Julia Ecklar
  27. Have a look at Teradata by golodh · · Score: 1

    I've recently had good experiences with running SQL queries on fairly large (# records: 200 mln. plus) databases on a Teradata machine in a corporate environment. I wasn't involved in any sysadmin work, just the statistical modeling / analysis side of things.

    The company I consulted for uses SAS (on the mainframe, AIX boxes, and PCs) for almost all of its data processing needs, including ETL work. Now they're looking at "Big Data" and discovered they need parallel processing to make it cost-effective (outperforms the mainframe, no per CPU-second charges, ability to let analysts work on AIX boxes or PCs etc.).

    I was able to show significant cost and performance savings in SQL queries over the mainframe (and AIX boxes). Interestingly substantial (50%-100%) speedups were also possible by accessing the Teradata machine in its native SQL (bypassing the SAS "in-database" Teradata support).

    The interesting thing about Teradata is that they offer genuine parallel processing (like Hadoop), but offer it as an end-user ready SQL interface to a database engine (you still need sysadmins though). Contrast this to Hadoop, where the Hadoop layer is basically the start of the road and you usually have to worry about hardware issues and software architecture issues (such as which database engine to choose) as well. Sometimes you have to take the custom-made route (e.g. Wall Street firms doing automated trading) but sometimes it's an outright liability in a DIY-hostile environment (e.g. in large corporations).

    The Teradata machine I worked with supports SQL, SAS, and R (which competes with SAS of course, and usually out-competes it when it comes to advanced statistics if you know what you're doing, but we had to use SAS exclusively, by order) and could easily handle terabytes of data.

    So my suggestion is to take a look at it.

    It's not Open Source (although it does support R), it's less fun for tinkerers, and it's harder to custom-parallelise your own algorithms on (I hear, I never tried). On the other hand it does provide a ready-to-run parallelised SQL database and lots of storage. It's not cheap though, but in a corporate environment that's usually not the first consideration.

    1. Re:Have a look at Teradata by Anonymous Coward · · Score: 0

      +1 for Teradata, especially if you use their expertise to help you architect it - it's what they're good at, so use their knowledge and experience

    2. Re: Have a look at Teradata by Anonymous Coward · · Score: 0

      Worked a little bit with Teradata and seen our pros work with it. It's very fast and powerful even when juggling terabytes of data in a query.

    3. Re:Have a look at Teradata by Anonymous Coward · · Score: 2, Informative

      Former back-office Teradata employee here. Teradata makes a very powerful product, but if security and availability of your data is critical, then I would look elsewhere. I'm not going to divulge any company secrets, but I will copy some snippets from employee reviews on Glassdoor:

      "Security is nonexistent. LAN credentials are sent in plain text (unencrypted) everywhere... CUSTOMER credentials to CUSTOMER systems (IP addresses and credentials) are sent in plain text (unencrypted)"

      "IT outages are frequent, long, and completely avoidable. This is true for all aspects of IT (Network, Data Storage Solutions, Servers, Application, and Databases etc)"

      "Disaster Recovery is always a second thought and most applications have no or very little 'actual' DR capability"

      "Customer data, including IP addresses and passwords for their production systems, is not secure and is not treated with respect"

      "Customer contracts are not accessible to IT so that we can claim 'plausible deniability'"

      "Patching is a joke and no preventative maintenance is ever approved or done"

      "If our customers knew how their data was being treated, they wouldn't be our customers for very long."

  28. I recommend by Anonymous Coward · · Score: 0

    dBASE with Crystal Reports.

    1. Re:I recommend by ihtoit · · Score: 1

      ooh you 'orrible cunt!

      --
      Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
    2. Re: I recommend by Anonymous Coward · · Score: 0

      Peoplesoft FTW

  29. I've got some REALLY big MS Access DBs by Anonymous Coward · · Score: 0

    Some of them are several HUNDRED megabytes :O

  30. Finally... by eddy_crim · · Score: 1

    .. a slashdot topic on which I might actually be qualified to comment. I’ve spent a lot of time analysing suitable databases for data warehouses. As other commenters have mentioned, you don’t really give enough detail about the types of data and likely use cases; however, I can assume you’re going to do similar things to most of our customers.

    We have used 2 products in our business. Both are column stores, which tend to have the characteristics of very fast read/join and query but should not be used for anything remotely transactional. Initially we used Infobright, which has an OSS community edition that, for a Kimball-style data warehouse, will happily take you up to 2-3 million rows before the query performance on more complex joins starts to creep over 1-2 seconds. As we took on larger clients we switched to Amazon Redshift. This is essentially a fairly distant cousin of Postgres with a bunch of technology thrown in from ParAccel. We found it the best performer by far in terms of bang for buck (you need to use the SSD disk option) when compared to things like Teradata (mentioned above); it supports encryption and is very easy to get up and running with.

    If you follow Kimball’s http://www.kimballgroup.com/ design patterns you can’t go far wrong, but keep it simple at all times. We use Talend for ETL, but are in the process of developing our own technology, and Jasper-server Commercial for our front end.

    Disclaimer: I have no direct interest in the products mentioned; however, I am CTO of a BI/Data Warehousing start-up (www.matillion.com) and have spent plenty of time in the trenches with

    --
    hmmm.
  31. Sounds like a cool company by Anonymous Coward · · Score: 0

    "Hey guys, let's pivot our company to target a field we know absolutely nothing about! Because everyone's buying into that Big Data Analytics buzzword these days -- we'll clearly become millionaires!"
    Seriously, your company should hire some professionals to answer these questions for your very specific use-case, since you clearly don't have the skills in-house.
    I don't know, that's what I would do if I had a *great* idea but *zero* expertise on a subject...
    Don't expect to have a proper answer after a couple hours of Google and an "Ask Slashdot".

  32. ElasticSearch, Logstash, Kibana (ELK Stack) by operator_error · · Score: 2

    The ELK Stack might be an option. In my field, (many) web servers can stream all their logs off-site in real time using Logstash Forwarder (or instead they might use rsync, or rsyslog, or...). A central server, perhaps in the secure private intranet, reads and indexes this log data (that's ElasticSearch, which is sort of like a personal Google for your logs, any logs of any kind, or other Big Data). Kibana is a user-friendly Angular.js application and presentation layer. If you're familiar with NewRelic for server monitoring, you can save views just like when using that tool.

    http://jakege.blogspot.nl/2014...

    Okay, maybe this is sort of like 'when all you have is a hammer, everything looks like a nail', but this suggestion is the extent of my background in this area. Although I have had an itch to scratch, and so far, this is my best open-source result.

    There's a ton of citations you should search for yourself, but I'll provide one I found that might start to help. Using this tool, it is fairly easy to parse out the myriad of hacker efforts at attacking the servers for example; even when you're the NY Times.

    1. Re:ElasticSearch, Logstash, Kibana (ELK Stack) by kiphat · · Score: 1

      We use the ELK stack for our log management. Its primary engine is the NoSQL store Elasticsearch. It works great, it's fast and is extremely flexible. Wikimedia recently moved to Elasticsearch as its primary search engine. It's definitely worth a look.

  33. Oracle users hate them... by gweihir · · Score: 1

    I know a few. They are all looking at options to get rid of Oracle, and often of Solaris as well. On the other hand, MSSQL is still basically a toy. It really depends on your data model and the queries you run. Key-value stores ("no SQL"), for example, are really easy to distribute over many servers.
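
    To make the "easy to distribute" point concrete, here is a toy Python sketch of hash-based sharding. The names `shard_for` and `ShardedStore` are hypothetical, and the "servers" are just in-memory dicts; it is an illustration of the idea under those assumptions, not how any particular product does it.

```python
import hashlib

def shard_for(key, num_servers):
    """Map a key to a server index with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

class ShardedStore:
    """Toy key-value store that spreads keys across in-memory 'servers'."""

    def __init__(self, num_servers):
        self.servers = [dict() for _ in range(num_servers)]

    def put(self, key, value):
        self.servers[shard_for(key, len(self.servers))][key] = value

    def get(self, key, default=None):
        return self.servers[shard_for(key, len(self.servers))].get(key, default)

store = ShardedStore(num_servers=4)
store.put("customer:42", {"name": "Ada"})
print(store.get("customer:42"))
```

    Because each key lives on exactly one server and lookups never join across keys, adding servers adds capacity. Plain modulo sharding does reshuffle most keys when the server count changes, which is why real systems tend to use something like consistent hashing instead.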

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  34. Exadata by Durandal1979 · · Score: 1

    Hi, I would take a look at Exadata if you really need great performance. The Oracle RDBMS is one of the most reliable I know of, so pairing it with hardware specifically designed to perform is a nice thing. Otherwise you could try to work with flash storage (flash cards) for really high performance, but you will still need a good database. I don't know how well all these SQL DBs (open source, MS) work, but I know for sure that they don't play a vital role in the enterprise environments I work in and know of. So stick with IBM DB2 (maybe BLU as a performance boost for Big Data) or Oracle (with 12c they have created in-memory queries like in SAP HANA, so this could work well in your case, buuuuttt you need physical memory for every bit of data you want to query fast, so maybe this gets costly.....)

  35. First step - hire a consultant by bjk002 · · Score: 1

    This is a MASSIVE undertaking, requiring deep and profound strategic decisions to be made at the highest levels of the company/organization.

    To go all in on whatever advice you might receive from slashdot is foolhardy at best.

    Do yourself and your company a favor, hire a world class consultant to come in and provide some advice.

    --
    Opinion:=TMyOpinion.Create(Me);
  36. Couple of cents by Anonymous Coward · · Score: 0

    I enjoy the whole "you have to have this or that, and oh, you can't use Microsoft or Oracle and be able to consume Big Data" line. I've worked with Oracle, Hadoop, Microsoft, Netezza, and DB2. All of these support Big Data, but each has its problems. Since you're just getting started, Microsoft will fit your needs well, as it has the whole stack for getting you off the ground quickly. You're still going to experience some pain, because the data modeling piece is going to hold you back at first as you learn the difference between star schema and snowflake schema. Your data source will become a problem too as you de-normalize the data in the ETL process and then put it into your staging tables before figuring out where it goes. I've had terabytes in a Microsoft data warehouse, along with a 7 terabyte SSAS tabular model, so take the comments here with a grain of salt. Good luck on your endeavor.
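
    For anyone new to the star schema idea mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`dim_product`, `dim_date`, `fact_sales`), the columns, and the rows are all invented for illustration; a real warehouse on SQL Server or anything else would have far more dimensions and its own DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per descriptive entity.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
    -- Fact table: one row per sale, holding keys into the dimensions plus measures.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        quantity INTEGER,
        amount REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, "2014-11-01", "2014-11"), (2, "2014-12-01", "2014-12")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 20.0), (2, 1, 1, 15.0), (3, 2, 4, 40.0), (1, 2, 1, 10.0)])

def sales_by_category(connection):
    """Join the fact table to a dimension and aggregate the measure."""
    return connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

print(sales_by_category(conn))
```

    In a snowflake schema the `category` column would move out of `dim_product` into its own table, trading one extra join for less repeated data.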

  37. Not sure you know what you are talking about... by bjk002 · · Score: 1

    Not that I am a fanboi of Oracle, but ODI is a fantastic tool.

    --
    Opinion:=TMyOpinion.Create(Me);
  38. dont do it by bloodhawk · · Score: 1

    If you need to ask this question on Slashdot then chances are you don't have the skills to build and run such a system properly.

  39. Microsoft APS(formerly PDW) by cat_jesus · · Score: 1

    I'm a little late to the party so this might get buried but here goes.

    I would strongly recommend looking into Microsoft's Analytics Platform System (APS), formerly Parallel Data Warehouse (PDW). It's an MPP appliance that combines PDW and Hadoop. I got to spend a week on one of these appliances recently and I can't wait to get back on it. It supports combined queries using PolyBase across Hadoop and the data warehouse (as well as the cloud).

    Typically data scientists will want to work in Hadoop and use R; this makes it easy to migrate your data warehouse into Hadoop so the data scientist can do his analysis without affecting the traditional BI clients that are using the warehouse.

    I would also recommend SQL Server Analysis Tabular mode to build cubes off your Data warehouse. I have one client that uses the old PDW and creates cubes in SSAS tabular as well as Powerview in Excel and Sharepoint and it is loved by the end users. It's fast and the data visualizations are great. I will admit that Tableau is beautiful and I really like it, but users almost always want their data in Excel. It's best to just start them there.

    The good news for you is that MS is offering subsidized POC installations with their gold partners. What this means is you contact MS and tell them you are interested in a Proof of Concept and they will provide you with vouchers to pay for one of their gold certified partners to come in and set up a data warehouse with your data using their appliance. Then the gold partner bills MS instead of you for their time. It's a win win win. If it doesn't work out, tell MS to take the appliance back. You can also have an off-premise POC if you like. Those are a little easier to set up because you don't need to get your organization's IT server team involved.

    I've been building data warehouses since 1998, been coding since 1985 and I am very impressed with this technology. It seems clear to me that in a massive data warehouse scenario this appliance is a winner. I'm still excited about how easy it is to move massive amounts of data between the PDW and Hadoop. That's incredibly useful for a number of scenarios.

    Now before anyone starts skewering me for being an MS fanboi, let me point out that there are a few things that MS does well. Databases is one and Excel is the other. MS pisses me off to no end for many other things, but these two spaces are impressive.

    Oh and I forgot to mention one of the great things about it being an appliance is that a lot of the configuration headaches are taken away from you. Need more space? Just plug a few more nodes into the rack, tell the appliance to redistribute the data and off you go. That gives you the freedom to focus more on your data and less on administrative tasks that you shouldn't have to worry about.

    1. Re:Microsoft APS(formerly PDW) by See+Attached · · Score: 1

      "Those are a little easier to set up because you don't need to get your organization's IT server team involved" Your IT brethren might be a little nonplussed at being skipped in the process.

      --
      Time for a new Political party in the US (or two!) One is off the rails Other cant pony up a leader.
    2. Re:Microsoft APS(formerly PDW) by cat_jesus · · Score: 1

      I would only recommend that for the Proof of concept. You will definitely need your infrastructure team involved when it comes to installing the appliance. And you need an executive to remove roadblocks and help make things happen once you get moving.

      With that said, I do understand that IT infrastructure can get a little butt hurt about installing appliances. Like I said, I've been doing tech for 30 years now and one trend I have noticed is that a lot of IT departments have drifted away from the customer centered mindset they had when I first started out. This is not true for all companies but for a huge number of them, the marginal IT staff are more worried about the appearance that they are geniuses and maintaining control. Installing an appliance like this strikes fear into their hearts on both fronts and you need to be mindful of their inflated and fragile egos.

      Let me give you an example. I did some work for a fortune 100 company that largely outsourced its IT to India. They still have large IT departments in the states which tend to be more useful than their Indian counterparts who play all sorts of games to close "tickets" without ever addressing problems. I had an Excel based BI solution that needed to be rolled out to sales people. IT in their infinite wisdom opened the gates for people to choose Apple laptops but made anyone ordering one sign an agreement that basically said IT is not responsible if shit doesn't work.

      They had a lot of issues with internal sites that required IE so the solution was to make IE available through Citrix. OK, that's cool. They figured that one out before rolling out the MacBooks. When it came to a custom Excel solution they were not going to add Excel to the Citrix box. When asked why not, they claimed licensing costs of Excel. After looking into that I discovered that there would be no licensing issue at all and was able to prove it. Then it became a "security" issue. When pressed on the nature of the security issue, they came up with a lot of doublespeak that amounted to a giant heaping pile of bullshit. It would have taken less than an hour to publish Excel on Citrix.

      They spent more time arguing against adding Excel to Citrix than it would have taken to just add the damned thing.

      The upshot of all this is people expect a lot less from IT now and they suffer needlessly and waste a shitload of money because IT doesn't give a damn about their customers anymore. They are not in the problem solving business and they just want you(the business user) to go away unless it's something they feel like playing with.

      This is one of the reasons I love doing consulting work. I like to help people, I like to automate and I want to make the computers do the mindless work, rather than people. IT, too often, gets in the way and causes more inefficiencies. In the case of this particular client, they have to hire more staff to manually deal with data and processes that they cannot get IT to automate.

    3. Re:Microsoft APS(formerly PDW) by mikaere · · Score: 1

      +1 to this. I did some training on this stuff earlier this year, and I was impressed with the overall offering. I particularly like the fact that you buy a pre-set appliance, so no mucking around with config and installations etc. Also, the SQL 2014 columnstore indexes should give massive query speed improvements.

      --
      It's good luck to be superstitious
  40. Go to Big Data Meetup in your area by salesgeek · · Score: 1

    You can find out a lot in a few hours just by going to a Big Data meetup. Traditional database vendors are trying to hijack big data and make it their buzzword. Real big data players are using tools like Hadoop, Spark, Solr, Elastic Search and other tools that allow you to use commodity hardware to get a much more performant platform for big data. The appliance vendors have some interesting off the shelf stuff... you should really take some time to see what is going on... it's wild west time.

    --
    -- $G
  41. Why "Big Data" by Anonymous Coward · · Score: 0

    Why do you think of your problem as "Big Data"? How many gigabytes of data are we talking about? How much data is added on a daily basis? One of the reasons I ask is because sometimes people talk about big data but really a traditional database would handle the problem perfectly.

    1. Re:Why "Big Data" by umdesch4 · · Score: 1

      Totally this. I work for a company that has a 5TB database that's currently holding all granular transaction data for a few thousand companies over 10 years. The main transaction detail table grows by 1-200k records per hour on average (around 50 new inserts a second), which amounts to about 1-2 GB a day. With the way things are ramping, we're on track to increase by around 1 TB a year on that database. We allow several levels of reporting to those companies, with details vs. aggregation, and all kinds of data warehouse slicing and dicing for everything they could possibly want. There are issues with some reports being slow sometimes, and data warehouse problems occasionally making it fall as much as a whole day behind (oh, the horror!), but it generally works.

      As a rule of thumb, we don't consider this anywhere near big data. A large Oracle database, and some standard (by now we could call them "traditional") tools for cubes and data warehouses is all we need.
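
      The figures above hang together if you run the arithmetic. The sketch below assumes "1-200k records per hour" means 100k to 200k; the helper names are made up for the example.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365

def rows_per_hour(inserts_per_second):
    """Convert a sustained insert rate into rows per hour."""
    return inserts_per_second * SECONDS_PER_HOUR

def yearly_growth_gb(gb_per_day):
    """Project a constant daily growth rate over one year."""
    return gb_per_day * DAYS_PER_YEAR

# 50 inserts a second sits near the top of the quoted hourly range.
print(rows_per_hour(50))
# 1-2 GB a day, held constant, over a year.
print(yearly_growth_gb(1), yearly_growth_gb(2))
```

      That gives 180,000 rows an hour and roughly 365 to 730 GB a year at a flat rate, so "around 1 TB a year" is plausible once the ramp-up is included.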

  42. Red Brick by Anonymous Coward · · Score: 0

    Red Brick

  43. Postgres-XL is really impressive by Anonymous Coward · · Score: 0

    It's open source, scales horizontally, and runs big queries in parallel (it does well at transactional processing, too). At one point it was pretty hard to deploy but that's gotten significantly better. Most of the Postgres tools ecosystem works with it. It's TransLattice's open source stack and they provide services for it, along with a few other vendors.

  44. Greenplum or Redshift by fdicostanzo · · Score: 1

    We use http://en.wikipedia.org/wiki/G... which is a clustered Postgres implementation. It has its problems (Postgres 8.2? seriously?) but it is very fast for ETL and batch queries on large data sets. We house 100+TB and get excellent performance. It's commercial and you pay by the TB.

    Then there is also AWS Redshift. We have found it to be quicker at some things and possibly cheaper but immature in its feature set (no UDF, etc). The thinking here is that if you have a separate system for ETL, Redshift would make an excellent data warehouse/data mart SELECT server. Pay by usage/hour.

    --
    Synergies are basically awesome, and they're even better when you leverage them. -PA
  45. Are you sure its Big Data? by g8oz · · Score: 1

    Don't confuse a regular data warehouse with Big Data. If Big Data is a "thing" your company wants to get into, it probably does not apply to you.

    As for your data warehouse, MS SQL Server is a good enough base to start with. IBM's DB2 is another underrated platform. Don't feed Oracle please.

  46. Nature of the load makes one thing simple... by See+Attached · · Score: 1

    Use SSD-based storage for the data, so you don't have to wait for spindles. Seems that Pure Storage does it best of late, whereas other vendors have optimized spindle-based storage. PS did it from the ground up. Best part is the documentation: it's ALL written on a single 3x5 card. No matter what software you use, skip the spindles.

    --
    Time for a new Political party in the US (or two!) One is off the rails Other cant pony up a leader.
  47. A complete response would take too much space... by ZahrGnosis · · Score: 1

    This is a wildly nontrivial question. Volumes are written about building data warehouses, and there's a lot to consider. In a large complicated environment, you could spend weeks doing comparisons (some people spend years, but that seems extreme); and some of the decisions are worth weighing.

    The first question is what capability are you looking for -- why are you sure one of these vendors is correct, and have you truly explored your options? If you want a place to capture and gather lots of near-real-time sensor data, then Hadoop might be good; if you want a more traditional Kimball or Inmon style warehouse for a small or mid size amount of data, then Microsoft, Oracle, Teradata, IBM, MySQL, and others have decades of experience that is, in fact, useful. But that's just a single-source vendor, and your question is focused on database vendors. Asking what "capability" you need includes ETL, Reporting, Meta Data, Master Data, Data Quality, User Interaction, Training, Methodology... if you're going to in-house all of that, or spread those things to multiple vendors, then your answers will be different.

    All of those lead to follow-on questions. Where does cost play a role? Watch your up front costs vs long-term TCO. Do you have a development team with any expertise that may make it easier to in-house decisions and developments for one platform over another? Is your corporate buy-in strong so you can weather people second-guessing your decision? There are technical issues, personnel issues, cost issues...

    The first ANSWER is really that any vendor will work, and every vendor will have different headaches. Older vendors have very specific ways of doing things, but that can make developers less expensive and more uniformly capable (although you'll always find extremes). Asking several Oracle DBAs to question each other and report back on each other's competencies is rather easy. With newer capabilities like Amazon, Google, and other cloud-big-data vendors, the landscape is newer, people are using different approaches (each of which may be valid), and it's not clear which are going to survive long enough to have the richest eco systems. But again, these systems came into being for a reason -- Hadoop and NoSQL databases can perform better and more cheaply than older databases in raw throughput, or unstructured data, or other areas but they sacrifice different things -- ACID compliance, strong typing or data models, or what have you.

    Some of it just depends on taste. Some people avoid a single provider "lock-in" and pick and choose different ETL tools (see Informatica), Reporting Tools (Cognos, Microstrategy, Tableau, Jasper, Pentaho), and other tools (Talend DQ/MDM comes to mind... there are many), while some people prefer single vendors due to massive integration (particularly Microsoft if you're a Windows farm). If you're Gmail based, then Google's apps have good integration; if you have an Oracle ERP then several tools speak nice to it.

    I'm generalizing a lot of examples that don't always apply, to keep things shortish, but the bottom line is that every option has strengths and weaknesses. I wish it were easier.

  48. greenplum by Anonymous Coward · · Score: 0

    If you like postgres, Greenplum is your big data solution.

  49. The AWS option by Anonymous Coward · · Score: 0

    Have you considered Amazon AWS? If a Hadoop back-end works for you, Redshift (optionally with CloudHSM for encryption at rest) for the analytics, or RDS (MySQL, MSSQL, Oracle, or PostgreSQL) coupled with the various compute offerings if you want to roll your own, S3 for massive storage, and DirectConnect if you need private data circuits directly into the AWS network infrastructure.

  50. Miosoft by Anonymous Coward · · Score: 0

    I work for Miosoft (no, not Microsoft). We have a product you may want to examine. Live data feeds, continuous consolidation of contexts from data fragments, parallel reporting on thousands of CPUs with petabytes of data.

  51. Unlikely scenario by lucm · · Score: 1

    In SSIS (the ETL tool that comes with SQL Server), the default isolation level is serializable. People often use SSIS to stage data and/or feed a denormalized data warehouse.

    Someone claiming that an analytics tool is causing locks in SQL Server does not know what they are talking about. The most recent BI engine from Microsoft (Tabular) does everything in-memory, and with the older one, which is OLAP-based, data is typically moved out of SQL Server and into an SSAS cube.

    There's the possible scenario of someone deciding to use ROLAP: feeding a cube from a live production database. But if someone took pains to set up that kind of thing and yet used a locking isolation level, then he should not complain about it on Slashdot, he should RTFM.

    --
    lucm, indeed.
  52. One word by Anonymous Coward · · Score: 0

    postgresql, you'll wish you did.
    If you have money to burn oracle it is.

  53. You're dropping nickels all over the place! by jds62f · · Score: 1

    I see lots of buzz words, but they don't make much sense together. Big Data and a Data Warehouse are not the same thing. If you *only* care about big data, you don't need to care about ETL. All of these things require you to know your data (and to have a goal). It's one of those things where the execution is a lot more important than the product chosen. Your goal cannot be 'get into this "big data" thing'. I'd recommend finding some user groups for the tools you're interested in and asking a few other companies what they are doing.

  54. Insight from a Few Years Experience by JCaptainP · · Score: 1

    No offense, but from the sound of it you have no clue about a BI infrastructure, which is what you're talking about. If your company is serious they'll hire a team of 10 people w/ an average salary well north of 100k and have a couple million dollar budget per year for IT systems, including an analytic data base, ETL system, and BI application.

    My guess is that you just want to start off by incrementally building a DW and want ad hoc analytic capabilities. My proof of concept solution would be to use Pentaho Data Integration (PDI) as the ETL layer, PostgreSQL as the db, and Tableau for visualizations. As you move into the big data space and build out your data model you should move to an analytical DB, and the cheapest good solution is Redshift from Amazon. Most of the analytical DBs are derived from an old version of PostgreSQL anyway, so as long as you don't custom code the ETL solutions and use standard SQL, migrations should be very easy. Also, as you grow, you can migrate away from Tableau to a real BI application like Cognos or Microstrategy. Also, as your data grows you may need even more storage for persistent staging areas and can then consider Hadoop. I would not recommend it to start, unless you really know what you're doing. As for advanced statistics, everyone is using R now, but it is problematic w/ big data as it pulls data into memory for processing, so you may have to pre-aggregate the data if super big sets are involved.
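
    The pre-aggregation point can be sketched in a few lines. The example below uses Python with the built-in sqlite3 module as a stand-in for the warehouse; the `events` table, its columns and rows, and the `regional_summary` helper are assumptions for illustration only. The idea carries over to R or any other in-memory tool: let the database do the GROUP BY and pull only the summary rows into memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0), ("west", 10.0)],
)

def regional_summary(connection):
    """Aggregate in the database so only one row per region reaches memory."""
    return connection.execute("""
        SELECT region, COUNT(*), SUM(amount), AVG(amount)
        FROM events
        GROUP BY region
        ORDER BY region
    """).fetchall()

print(regional_summary(conn))
```

    With five rows it makes no difference, but with billions of detail rows the summary is what fits in memory and the detail is not.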

    I've been in the business intelligence space for over 10 years. My two top lessons learned: you need leadership in this space, and you should only implement custom code as a last resort. For the former, BI has the ability to be implemented somewhat via an agile incremental model, but it's still a large solution and will require long term resources. Therefore, if you can't count on leadership to back you, you shouldn't start the project. Secondly, custom code in this space can make a mountain out of a mole hill. For example, while you may be able to write a customized script or stored proc that's 30% faster than the ETL solution, I wouldn't suggest it. ETL, used appropriately, will help you manage your data long term. You'll be able to visually understand what's going on and switch DBMS rapidly.

  55. FreeTDS by Anonymous Coward · · Score: 1

    FreeTDS works well. Why would you have to use ODBC?

  56. Don't start a BI or BD project by choosing technol by Anonymous Coward · · Score: 0

    Whether this question is genuine, or weekend astroturf, we should not perpetuate a myth that a Big Data, or Business Intelligence project should all be about choosing a technology.

    Yes we all get excited about our big racks of flashy equipment, and we can brag about being terascale, petascale or whatever. This is not what it is all about.

    I suggest your first search is for recent Big Data projects that really aren't unlocking the value that they thought they would. Find companies that have these enormous deployments, and huge datastores, but can't seem to find any ROI. You will then understand the basic principle.

    Finally, understand that 'Big Data' does have a firm root in certain approaches and technologies, but it has become 90% hyperbole and marketing speak. You hardly ever hear about BI anymore, and a lot of Big Data proponents would happily see you 'throw out' a lot of the history - mainly because they have a box/service to sell and wouldn't want you to catch on that there might be more to it.

    If you want to do this, and you are not just the weekend marketing machine, then talk to some of the guys around here to understand more. If you ARE, then please stop perpetuating a myth.

    (disclosure: I have worked with VLDB for many years, using ORACLE, SQL Server, mySQL, Teradata, Netezza, Qlikview and a bunch of other technologies. ALL of them can unlock value in your data. ALL of them can process very large datasets. Yes really, I have processed billions of rows through SQL Server even in its early incarnations.
    None of them can TELL you where the value is in your data.)

  57. Get your feet wet by Slashdot+Parent · · Score: 2

    Personally, I think that the RedShift suggestion is perfect for OP. Judging by the vague requirements ("the big boss wants to get on the Big Data bandwagon!"), OP's company has no clue what it wants to do with its Big Data yet. So why throw down a ton of cash on a solution without having a good idea of what problem needs solving?

    Playing around with RedShift a bit and seeing what value they can extract from their data would be a great pilot program. Later, once they know what they're doing, they can implement their "real" solution.

    --
    They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
  58. Consider your spend by Anonymous Coward · · Score: 0

    Get a good ETL person... Generally worth more than the hardware.

    A bad one can grind ANY hardware you might get into dust.

    (I roll an R610 with FusionIO and the Microsoft BI stack for sales, purchasing, and inventory analysis; 300GB post-process worth of data)

    LOL = captcha "salaries"

  59. You've got one shot to store your data by clenhart · · Score: 1

    When a piece of data comes in, store it everywhere you need it. This might be aggregated tables (if you don't use indexed views) or whatever you may need. If you have background processes like ETL, you'll use a lot of your hardware for processing at the expense of queries.

    Avoid ETL. You've got one shot to store your data everywhere.
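
    A minimal sketch of the "store it everywhere at write time" approach, using Python's built-in sqlite3 module. The table names (`sales_detail`, `sales_daily`), the `record_sale` and `daily_totals` helpers, and the sample rows are hypothetical; the point is only that the detail row and the aggregate are updated together in one transaction, so no background job has to rebuild the aggregate later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (day TEXT, product TEXT, amount REAL);
    CREATE TABLE sales_daily (
        day TEXT PRIMARY KEY,
        total REAL NOT NULL DEFAULT 0,
        n INTEGER NOT NULL DEFAULT 0
    );
""")

def record_sale(connection, day, product, amount):
    """Write the detail row and update the daily aggregate in one transaction."""
    with connection:  # commits on success, rolls back on exception
        connection.execute(
            "INSERT INTO sales_detail VALUES (?, ?, ?)", (day, product, amount))
        # Create the aggregate row on first sight of a day, then bump it.
        connection.execute(
            "INSERT OR IGNORE INTO sales_daily (day) VALUES (?)", (day,))
        connection.execute(
            "UPDATE sales_daily SET total = total + ?, n = n + 1 WHERE day = ?",
            (amount, day))

def daily_totals(connection):
    """Read the pre-computed aggregate; no scan of the detail table needed."""
    return connection.execute(
        "SELECT day, total, n FROM sales_daily ORDER BY day").fetchall()

record_sale(conn, "2014-11-22", "widget", 10.0)
record_sale(conn, "2014-11-22", "gadget", 15.0)
record_sale(conn, "2014-11-23", "widget", 10.0)

print(daily_totals(conn))
```

    The trade-off is that every write does more work and every new report needs its own write path, which is roughly the cost ETL is meant to move elsewhere.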