Ask Slashdot: Choosing a Data Warehouse Server System?
New submitter puzzled_decoy writes The company I work for has decided to get in on this "big data" thing. We are trying to find a good data warehouse system to host and run analytics on, you guessed it, a bunch of data. Right now we are looking into MSSQL, a company called Domo, and Oracle contacted us. Google BigQuery may be another option. At its core, we need to be able to query huge amounts of data in sometimes rather odd ways. We need a strong ETL layer, and hopefully we can put some nice visual reporting service on top of wherever the data is stored. So, what is your experience with "big data" servers and services? What would you recommend, and what are the pitfalls you've encountered?
Oregon Resident here. After the recent issues with Oracle..... yup. Not gonna recommend 'em again. Not a big fan of my tax money being wasted.
The first step is to ask Slashdot a really vague question to a highly technical and expensive undertaking.
Elastic Search is getting there as a tool for this, but it isn't really ready yet.
AWS RedShift. Don't bother with old school operating servers, patching OS's, etc.... Just focus on data + business logic. That's where you really add value, right?
Define the goals. Don't mistake software for creativity and insight. If your company is going to crunch a lot of data find someone qualified to think analytically and recommend the correct tools for the job.
I hear that R is very upcoming in statistical work. I also hear that any other 'big data' solution is going to cost you as much as a full time employee anyway.
Also, yes, skip Oracle. If you put that much effort in to tuning a system/the way you're asking the question nearly anything could come up with a valid answer that quickly.
Help do my job for me.
Only the State obtains its revenue by coercion. - Murray Rothbard
The way you're going at it you're basically burning money. "We must have this big data thing too!" is every hardware vendor's eyes going "ka-ching" and you'll be overpaying whatever you do. Even if you think you're getting a good price.
The problem with big data as a thing (BDaaT) is that without a clear goal you'll be gathering too much data and storing it for too long. Thereby you "need" too much processing power to shoot through it, and the only way left is downhill. This creates myriads of problems, of which overpaying for too much hardware is but the least.
So, you think you're serious about this big data thing? Just bring sacks of money to your fave distie. That is all.
Open-source so you don't have to cough up millions of dollars to see if you can get business.
Clusterable, scalable and standards-based so you're not locking down too far into one solution-space.
I am in luck to witness how bad their ETL tool is. In the end it works (In same way that assembler would also work..).
There are bugs all over the place (the most "pleasant" one occurs randomly during saving, so previously unsaved work gets lost). Also it would be good to have very, very good relations with your admins, who would need to spend an enormous amount of time optimizing it (otherwise you will have lots of time for drinking coffee while things are "opening"). Some features are completely mis-implemented (e.g. the "copying" feature: an analogy would be having a C++ object representing a hierarchy (tree) and just doing a memory copy without adjusting any pointers, so that in the copy all references that should be internal to the object actually point to the old object). And on top of all that, logging seems to have been slapped on as an afterthought by somebody's nephew.
do your job or go apply at mcdonalds.
Pretty easy to try it out immediately... http://aws.amazon.com/redshift
MSSQL?
why would anyone in their right mind go with MICROSOFT for a company database ? specially a big data database ?
I will not claim any "big data" experience.
but make realistic goals that are testable and expect to pay thru the nose when dealing with Oracle and the other big money options.
and you will need to clean your data up. databases collect a lot of crap over time.
I would in my very limited experience recommend an in house solution. Leaning toward a mssql setup myself. Currently my company uses a third party app which we have to shovel money into year round. Save your money.
Lifted this secret dictionary from a marketing professional while they were asleep at great risk. If they find out I took it they will use big data to track me down and kill me so please hush.
Page 1.
"The Cloud" - Market speak for our terms may change at any time and when they do your fucked Pinocchio.
"IoT" - Shorthand for "idiot" a derogatory term used to describe those who believe building Internet connected toasters make them "kewl"
"Big Data" - Is a "BFD" stated in most sarcastic tone humanity can deliver. It's also realization after stalking millions of people, collecting countless trillions of data points all of your efforts were in fact completely worthless.
"Map Re-Duce" - This is an appalling socially unacceptable practice of counting duces as they make contact with the bowl after having fully submerged beneath the water line. Once all duces are counted they are picked up by hand and dropped one at a time on a map where the Gaussian distribution of droppings is carefully catalogued for further statistical analysis.
"NoSQL" - Shorthand for "No SQuirreLs" .. This is often shouted as cry for help by the hopelessly clueless who believe squirrels have cursed them when in actuality their problems were self-inflicted.
" query huge amounts of data in sometimes rather odd ways. "
clients who spend over 2k.month and also are interested in dating shemales? you don't say....
my personal experiences with big data and "odd queries" is that unless used for targeted advertising (which can harm the brand if done TOO well or too clunkily) it is collected and never sees any real use. Why? People required to provide information usually decide that their time is more important than your data point and write in whatever they think will pass a cursory inspection. This makes trends impossible to spot, and thus: gigo.
Don't waste your time and money, just go with Hadoop.
Need ETL? Well for one there is PIG, but if you want to do stream processing Apache Storm / Kafka.
Take a look at this, http://hortonworks.com/hdp/
All completely Open Source.
Whatever you do, don't go mssql as you will end up processing most of your data in the analytics tool.
I've seen it lock tables even on only reads causing other processes to be terminated.
The closest it has got to materialized views are clustered indexed views which suck and can barely do any processing.
Big data is an entire field of study, this is not "should I use vi or emacs or nano" and even that requires a shitload of context and the source of flame wars until the end of time.
Think about your budget, your audience, and the value that you can add by spending time and money on this.
MapReduce (hadoop) is awesome and open source, you can run it in house or in multiple cloud offerings and has a tremendous community. BUT it sucks at relationships (foreign keys) graph calculations and others.
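For anyone who has not seen the model, here is a rough single-process sketch of the map, shuffle, and reduce steps in plain Python. This is not Hadoop's API; the function names and the sample order data are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record: here (customer, amount)."""
    customer, amount = record
    yield customer, amount


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, shuffle by key, then reduce each group."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: (customer, order amount)
orders = [("alice", 30), ("bob", 12), ("alice", 5)]
totals = run_job(orders)
print(totals)
```

A per-key sum like this is the easy case. A join between two datasets has to push both through the same shuffle, keyed on the join column, which is part of why relationships feel awkward in this model.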
Graph databases can make connections between things that are impossible in other systems, but are only good for graph relationships.
OLAP data stored in n-dimensional cubes allows reporting and analysis in familiar tools that many analysts (not programmers) think is the cat's pajamas.
Your best bet is to slow down and talk to your users, while reading Seven Databases in Seven Weeks
https://pragprog.com/book/rwdata/seven-databases-in-seven-weeks
And then realize that you probably need to hire a consultant so you have somebody to fire when the whole thing goes south.
If the data fits in a database, it is not Big Data.
If I was tasked with coming up with ideas for a Data Warehouse Server System, and given that I know almost nothing about such systems, my first port of call would probably be Apache. What about Cassandra, Hadoop, Hive, Mahout or Pig (or combinations thereof)? All of these are downloadable and playable-with (and being Apache, FLOSS).
As a previous poster pointed out, there is also PostgreSQL, again FLOSS. Again downloadable and playable-with.
You never know what is enough unless you know what is more than enough. - Blake
Sounds like you're very good in the buzzword-department but have no idea what you're doing at all.... What kind of data are we talking about? Lots of writes? Lots of reads? Is the data suitable for splitting up? What kind of queries will you need to run? Do you need uptime? Or consistency?
Also if you're looking at MSSQL or Oracle, you obviously DO NOT HAVE Big Data. Big Data is data that cannot be dealt with using regular RDBMSes. Do you really have or plan to have multiple terabytes of data? If not, you don't have big data.
Based on the information you've given us we cannot give you any advice at all apart from stopping what you're doing and hiring an expert.
0x or or snor perron?!
I would use open source and my own servers, but since you're considering Oracle and Microsoft,
You should look at IBM Bluemix. I've heard good things about it. Watson integration.
"First they came for the slanderers and i said nothing."
Just avoid anything that uses systemd and you'll be fine.
You're asking the wrong questions. You should start higher up the chain in business-value land - WHY do you need a data warehouse system (to run analytics)... great WHY do you need to run analytics (to discover XXXXX from the data we generate/own/handle). OK now you're getting closer... now, armed with the knowledge about what data you will be storing, and what kind of insights you would like to generate, you need to approach a specialist data analysis & insights company who can help you to select the correct products and platforms for your data storage, processing and analysis needs.
The way you have phrased the questions in your post makes it obvious you don't really have a lot of experience in this arena, and this is not a decision you can afford to get wrong. This company may also be able to offer consultancy about generating your queries, reports, and carrying out some of the data analysis, but it sounds like you want to do this yourself - now that's actually quite reasonable to attempt in-house.
If your company buys 'big data', I have a bridge to sell you.
Know your data. Don't build a castle in the sky; that's how SAP happened.
All rites reversed 2010
1. Hire some bonehead that is expendable and ask him to make the decision.
2. Fire him when the project fails.
3. Nobody will ever bring this up again.
Got Code?
The company I consulted for uses SAS (on the mainframe, AIX boxes, and PCs) for almost all of its data processing needs, including ETL work. Now they're looking at "Big Data" and discovered they need parallel processing to make it cost-effective (outperforms the mainframe, no per CPU-second charges, ability to let analysts work on AIX boxes or PCs etc.).
I was able to show significant cost and performance savings in SQL queries over the mainframe (and AIX boxes). Interestingly substantial (50%-100%) speedups were also possible by accessing the Teradata machine in its native SQL (bypassing the SAS "in-database" Teradata support).
The interesting thing about Teradata is that they offer genuine parallel processing (like Hadoop), but offer it as an end-user ready SQL interface to a database engine (you still need sysadmins though). Contrast this to Hadoop where the Hadoop layer is basically the start of the road and you usually have to worry about hardware issues and software architecture issues (such as which database engine to choose) as well. Sometimes you have to take the custom-made route (e.g. Wall-street firms doing automated trading) but sometimes it's an outright liability in a DIY-hostile environment (e.g. in large corporations).
The teradata machine I worked with supports SQL, SAS, and R (which competes with SAS of course, and usually out-competes it when it comes to advanced statistics if you know what you're doing but we had to use SAS exclusively, by order) and could easily handle terabytes of data.
So my suggestion is to take a look at it.
It's not Open Source (although it does support R), it's less fun for tinkerers, and it's harder to custom-parallelise your own algorithms on (I hear, I never tried). On the other hand it does provide a ready-to-run parallelised SQL database and lots of storage. It's not cheap, but in a corporate environment that's usually not the first consideration.
dBASE with Crystal Reports.
Some of them are several HUNDRED megabytes :O
.. a slashdot topic on which I might actually be qualified to comment. I’ve spent a lot of time analysing suitable databases for data warehouses. As other commenters have mentioned, you don’t really give enough detail about the types of data and likely use cases; however, I can assume you’re going to do similar things to most of our customers. We have used 2 products in our business. Both are column stores, which tend to offer very fast reads, joins and queries but should not be used for anything remotely transactional. Initially we used Infobright, which has an OSS community edition which for a Kimball-style data warehouse will happily take you up to 2-3 million rows before the query performance on more complex joins starts to creep over 1-2 seconds. As we took on larger clients we switched to Amazon Redshift. This is essentially a fairly distant cousin of Postgres with a bunch of technology thrown in from ParAccel. We found it the best performer by far in terms of bang for buck (you need to use the SSD disk option) when compared to things like Teradata (mentioned above); it supports encryption and is very easy to get up and running with. If you follow Kimball’s http://www.kimballgroup.com/ design patterns you can’t go far wrong, but keep it simple at all times. We use Talend for ETL but are in the process of developing our own technology, and Jasper-server Commercial for our front end. Disclaimer: I have no direct interest in the products mentioned; however, I am CTO of a BI/Data Warehousing start-up (www.matillion.com) and have spent plenty of time in the trenches with
hmmm.
"Hey guys, let's pivot our company to target a field we know absolutely nothing about! Because everyone's buying into that Big Data Analytics buzzword these days -- we'll clearly become millionaires!"
Seriously, your company should hire some professionals to answer these questions for your very specific use-case, since you clearly don't have the skills in-house.
I don't know, that's what I would do if I had a *great* idea but *zero* expertise on a subject...
Don't expect to have a proper answer after a couple hours of Google and an "Ask Slashdot".
The ELK Stack might be an option. In my field, (many) web servers can stream all their logs off-site in real time using Logstash Forwarder (or instead they might use rsync, or rsyslog, or...). A central server, perhaps in the secure private intranet, reads and indexes this log data (that's ElasticSearch, which is sort of like a personal Google for your logs, any logs of any kind, or other Big Data). Kibana is a user-friendly Angular.js application and presentation layer. If you're familiar with NewRelic for server monitoring, you can save views just like when using that tool.
http://jakege.blogspot.nl/2014...
Okay, maybe this is sort of like 'when all you have is a hammer, everything looks like a nail', but this suggestion is the extent of my background in this area. Although I have had an itch to scratch, and so far, this is my best open-source result.
There's a ton of citations you should search for yourself, but I'll provide one I found that might start to help. Using this tool, it is fairly easy to parse out the myriad of hacker efforts at attacking the servers for example; even when you're the NY Times.
I know a few. They are all looking at options to get rid of Oracle, and often of Solaris as well. On the other hand, MSSQL is still basically a toy. It really depends on your data model and the queries you run. Key-value stores ("no SQL"), for example, are really easy to distribute over many servers.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Hi, I would take a look at Exadata if you really need great performance. The Oracle RDBMS is one of the most reliable I know of, so pairing it with hardware specifically designed to perform is a nice thing. Otherwise you could try to work with flash storage (flash cards) for really high performance, but you will still need a good database. I don't know how well all these SQL DBs (open, MS) work, but I know for sure that they don't play a vital role in the enterprise environment I work in and know of. So stick with IBM/DB2 (maybe BLU as a performance boost for Big Data) or Oracle (with 12c they have created in-memory queries like in SAP HANA, so this could work well in your case, but you need physical memory for every bit of data you want to query fast, so maybe this gets costly).
This is a MASSIVE undertaking, requiring deep and profound strategic decisions to be made at the highest levels of the company/organization.
To go all in on what advice you might receive from slashdot is foolhardy at best.
Do yourself and your company a favor, hire a world class consultant to come in and provide some advice.
Opinion:=TMyOpinion.Create(Me);
I enjoy the whole "you have to have this or that, and oh, you can't use Microsoft or Oracle and be able to consume Big Data" thing. I've worked with Oracle, Hadoop, Microsoft, Netezza, and DB2. All of these support Big Data, but each has its problems. When you're just getting started, Microsoft will fit your needs well, as it has the whole stack for getting you off the ground quickly. You're still going to experience some pain, because the data modeling piece is going to hold you back at first as you learn the difference between star schema and snowflake schema. Your data source will become a problem too, as you de-normalize the data in the ETL process and then put it into your staging tables before figuring out where it goes. I've had terabytes in a Microsoft data warehouse, along with a 7 terabyte SSAS tabular model, so take the comments here with a grain of salt. Good luck on your endeavor.
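If the star schema part is new to you, a minimal sketch may help: one fact table of measurements surrounded by denormalized dimension tables. This uses SQLite through Python's sqlite3 module purely so it runs anywhere; the table names, columns, and rows are invented for illustration and have nothing to do with any particular vendor's stack. A snowflake schema would split `category` out of `dim_product` into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT      -- denormalized: category sits on the product row
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20140901
        year     INTEGER,
        month    INTEGER
    );
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
""")

# Hypothetical sample rows
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Widget", "Hardware"),
    (2, "Gizmo", "Hardware"),
    (3, "Manual", "Books"),
])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20140901, 2014, 9),
    (20141001, 2014, 10),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20140901, 1, 2, 20.0),
    (20140901, 3, 1, 15.0),
    (20141001, 2, 1, 40.0),
    (20141001, 1, 1, 10.0),
])

# A typical warehouse query: join the fact table to its dimensions, then group.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```

Every reporting query ends up with the same shape (fact joined to a few dimensions, then grouped), which is a large part of why reporting tools get along with this layout.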
Not that I am a fanboi of Oracle, but ODI is a fantastic tool.
If you need to ask this question on Slashdot then chances are you don't have the skills to build and run such a system properly.
I'm a little late to the party so this might get buried but here goes.
I would strongly recommend looking into Microsoft's Analytics Platform System (APS), formerly Parallel Data Warehouse (PDW). It's an MPP appliance that combines PDW and Hadoop. I got to spend a week on one of these appliances recently and I can't wait to get back on it. It supports combined queries using PolyBase across Hadoop and the data warehouse (as well as the cloud).
Typically data scientists will want to work in Hadoop and use R, this makes it easy to migrate your Data warehouse into Hadoop so the data scientist can do his analysis without affecting the traditional BI clients that are using the warehouse.
I would also recommend SQL Server Analysis Tabular mode to build cubes off your Data warehouse. I have one client that uses the old PDW and creates cubes in SSAS tabular as well as Powerview in Excel and Sharepoint and it is loved by the end users. It's fast and the data visualizations are great. I will admit that Tableau is beautiful and I really like it, but users almost always want their data in Excel. It's best to just start them there.
The good news for you is that MS is offering subsidized POC installations with their gold partners. What this means is you contact MS and tell them you are interested in a Proof of Concept and they will provide you with vouchers to pay for one of their gold certified partners to come in and set up a data warehouse with your data using their appliance. Then the gold partner bills MS instead of you for their time. It's a win win win. If it doesn't work out, tell MS to take the appliance back. You can also have an off-premises POC if you like. Those are a little easier to set up because you don't need to get your organization's IT server team involved.
I've been building data warehouses since 1998, been coding since 1985 and I am very impressed with this technology. It seems clear to me that in a massive data warehouse scenario this appliance is a winner. I'm still excited about how easy it is to move massive amounts of data between the PDW and Hadoop. That's incredibly useful for a number of scenarios.
Now before anyone starts skewering me for being an MS fanboi, let me point out that there are a few things that MS does well. Databases is one and Excel is the other. MS pisses me off to no end for many other things, but these two spaces are impressive.
Oh and I forgot to mention one of the great things about it being an appliance is that a lot of the configuration headaches are taken away from you. Need more space? Just plug a few more nodes into the rack, tell the appliance to redistribute the data and off you go. That gives you the freedom to focus more on your data and less on administrative tasks that you shouldn't have to worry about.
You can find out a lot in a few hours just by going to a Big Data meetup. Traditional database vendors are trying to hijack big data and make it their buzzword. Real big data players are using tools like Hadoop, Spark, Solr, Elastic Search and other tools that allow you to use commodity hardware to get a much more performant platform for big data. The appliance vendors have some interesting off the shelf stuff... you should really take some time to see what is going on... it's wild west time.
-- $G
Why do you think of your problem as "Big Data"? How many gigabytes of data are we talking about? How much data is added on a daily basis? One of the reasons I ask is because sometimes people talk about big data but really a traditional database would handle the problem perfectly.
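Those two numbers (what you have now, what you add per day) are enough for a back-of-envelope projection. A tiny sketch, assuming steady linear growth; the 200 GB, 2 GB/day, and 1 TB figures are placeholders, not anything the submitter stated:

```python
import math


def projected_size_gb(current_gb, daily_growth_gb, days):
    """Linear projection: what you have now plus steady daily growth."""
    return current_gb + daily_growth_gb * days


def days_until(limit_gb, current_gb, daily_growth_gb):
    """Days until the data reaches limit_gb, or None if it is not growing."""
    if daily_growth_gb <= 0:
        return None
    return max(0, math.ceil((limit_gb - current_gb) / daily_growth_gb))


# Placeholder inputs: 200 GB today, growing by 2 GB per day
three_years = projected_size_gb(current_gb=200, daily_growth_gb=2, days=3 * 365)
until_1tb = days_until(limit_gb=1000, current_gb=200, daily_growth_gb=2)
print(three_years, until_1tb)
```

If the three-year figure still fits comfortably on one well-specced server, that is a hint that a traditional database may be all you need.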
Red Brick
It's open source, scales horizontally, and runs big queries in parallel (it does well at transactional processing, too). At one point it was pretty hard to deploy but that's gotten significantly better. Most of the Postgres tools ecosystem works with it. It's TransLattice's open source stack and they provide services for it, along with a few other vendors.
We use http://en.wikipedia.org/wiki/G... which is a clustered Postgres implementation. It has its problems (Postgres 8.2? seriously?) but it is very fast for ETL and batch queries on large data sets. We house 100+TB and get excellent performance. It's commercial and you pay by the TB.
Then there is also AWS Redshift. We have found it to be quicker at some things and possibly cheaper but immature in its feature set (no UDF, etc). The thinking here is that if you have a separate system for ETL, Redshift would make an excellent data warehouse / data mart SELECT server. Pay by usage/hour.
Synergies are basically awesome, and they're even better when you leverage them. -PA
Don't confuse a regular data warehouse with Big Data. If Big Data is a "thing" your company wants to get into, it probably does not apply to you.
As for your data warehouse, MS SQL Server is a good enough base to start with. IBM's DB2 is another underrated platform. Don't feed Oracle please.
Use SSD based storage for the data, so you don't have to wait for spindles. Seems that Pure Storage does it best of late, whereas other vendors have optimized the spindle based storage. PS did it from the ground up. Best part is the documentation: it's ALL written on a single 3x5 card. No matter what software you use, skip the spindles.
Time for a new Political party in the US (or two!) One is off the rails Other cant pony up a leader.
This is a wildly nontrivial question. Volumes are written about building data warehouses, and there's a lot to consider. In a large complicated environment, you could spend weeks doing comparisons (some people spend years, but that seems extreme); and some of the decisions are worth weighing.
The first question is what capability are you looking for -- why are you sure one of these vendors is correct, and have you truly explored your options? If you want a place to capture and gather lots of near-real-time sensor data, then Hadoop might be good, if you want a more traditional Kimball or Inmon style warehouse for a small or mid size amount of data, then Microsoft, Oracle, Teradata, IBM, MySQL, and others have decades of experience that is, in fact, useful. But that's just a single-source vendor, and your question is focused on database vendors. Asking what "capability" you need includes ETL, Reporting, Meta Data, Master Data, Data Quality, User Interaction, Training, Methodology... if you're going to in-house all of that, or spread those things to multiple vendors then your answers will be different.
All of those lead to follow-on questions. Where does cost play a role? Watch your up front costs vs long-term TCO. Do you have a development team with any expertise that may make it easier to in-house decisions and developments for one platform over another? Is your corporate buy-in strong so you can weather people second-guessing your decision? There are technical issues, personnel issues, cost issues...
The first ANSWER is really that any vendor will work, and every vendor will have different headaches. Older vendors have very specific ways of doing things, but that can make developers less expensive and more uniformly capable (although you'll always find extremes). Asking several Oracle DBAs to question each other and report back on each other's competencies is rather easy. With newer capabilities like Amazon, Google, and other cloud-big-data vendors, the landscape is newer, people are using different approaches (each of which may be valid), and it's not clear which are going to survive long enough to have the richest eco systems. But again, these systems came into being for a reason -- Hadoop and NoSQL databases can perform better and more cheaply than older databases in raw throughput, or unstructured data, or other areas but they sacrifice different things -- ACID compliance, strong typing or data models, or what have you.
Some of it just depends on taste. Some people avoid a single provider "lock-in" and pick and choose different ETL tools (see Informatica), Reporting Tools (Cognos, Microstrategy, Tableau, Jasper, Pentaho), and other tools (Talend DQ/MDM comes to mind... there are many), while some people prefer single vendors due to massive integration (particularly Microsoft if you're a Windows farm). If you're Gmail based, then Google's apps have good integration; if you have an Oracle ERP then several tools speak nice to it.
I'm generalizing a lot of examples that don't always apply, to keep things shortish, but the bottom line is that every option has strengths and weaknesses. I wish it were easier.
If you like postgres, Greenplum is your big data solution.
Have you considered Amazon AWS? If a Hadoop back-end works for you, Redshift (optionally with CloudHSM for encryption at rest) for the analytics, or RDS (MySQL, MSSQL, Oracle, or PostgreSQL) coupled with the various compute offerings if you want to roll your own, S3 for massive storage, and DirectConnect if you need private data circuits directly into the AWS network infrastructure.
I work for Miosoft (no, not Microsoft). We have a product you may want to examine. Live data feeds, continuous consolidation of contexts from data fragments, parallel reporting on thousands of CPUs with petabytes of data.
In SSIS (the ETL tool that comes with SQL Server), the default isolation level is serializable. People often use SSIS to stage data and/or feed a denormalized data warehouse.
Someone claiming that an analytics tool is causing locks in SQL Server does not know what they are talking about. The most recent BI engine from Microsoft (Tabular) does everything in-memory, and with the older one, which is OLAP-based, data is typically moved out of SQL Server and into an SSAS cube.
There's the possible scenario of someone deciding to use ROLAP; feeding a cube from a live production database. But if someone took pains to setup that kind of thing and yet used a locking isolation level, then he should not complain about it on Slashdot, he should RTFM.
lucm, indeed.
postgresql, you'll wish you did.
If you have money to burn oracle it is.
I see lots of buzz words, but they don't make much sense together. Big Data and a Data Warehouse are not the same thing. If you *only* care about big data, you don't need to care about ETL. All of these things require you to know your data (and to have a goal). It's one of those things where the execution is a lot more important than the product chosen. Your goal cannot be 'get into this "big data" thing'. I'd recommend finding some user groups for the tools you're interested in and asking a few other companies what they are doing.
No offense, but from the sound of it you have no clue about a BI infrastructure, which is what you're talking about. If your company is serious they'll hire a team of 10 people w/ an average salary well north of 100k and have a couple million dollar budget per year for IT systems, including an analytic data base, ETL system, and BI application.
My guess is that you just want to start off by incrementally building a DW and want ad hoc analytic capabilities. My proof of concept solution would be to use Pentaho Data Integration (PDI) as the ETL layer, PostgreSQL as the db, and Tableau for visualizations. As you move into the big data space and build out your data model you should move to an analytical DB, and the cheapest good solution is Redshift from Amazon. Most of the Analytical DBs are derived from an old version of Postgresql anyway, so as long as you don't custom code the ETL solutions and use standard sql, migrations should be very easy. Also, as you grow, you can migrate away from Tableau to a real BI application like Cognos or Microstrategy. Also, as your data grows you may need even more storage for persistent staging areas and can then consider Hadoop. I would not recommend it to start, unless you really know what you're doing. As for advanced statistics everyone is using R now but is problematic w/ big data as it pulls data into memory for processing, so you may have to pre-aggregate the data if super big sets are involved.
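To make the pre-aggregation point concrete: let the database do the GROUP BY and pull only the summary into your stats tool. The sketch below uses Python and an in-memory SQLite table as a stand-in for the warehouse and for R; the `events` table and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, day TEXT, amount REAL)")

# Hypothetical raw data: 100 users x 30 days = 3000 rows
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(u, f"2014-09-{d:02d}", float(u + d))
     for u in range(1, 101) for d in range(1, 31)],
)

raw_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# Aggregate inside the database; only one row per day leaves it.
summary = conn.execute("""
    SELECT day, COUNT(*), SUM(amount), AVG(amount)
    FROM events
    GROUP BY day
    ORDER BY day
""").fetchall()

print(raw_count, "raw rows reduced to", len(summary), "summary rows")
```

The same standard SQL should carry over to PostgreSQL or Redshift largely unchanged, which is the migration argument made above. Whether a daily summary is fine-grained enough depends on the statistics you actually want to run.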
I've been in the business intelligence space for over 10 years. My two top lessons learned: you need leadership in this space, and you should only implement custom code as a last resort. For the former, BI has the ability to be implemented somewhat via an agile incremental model, but it's still a large solution and will require long term resources. Therefore, if you can't count on leadership to back you, you shouldn't start the project. Secondly, custom code in this space can make a mountain out of a mole hill. For example, while you may be able to write a customized script or stored proc that's 30% faster than the ETL solution, I wouldn't suggest it. ETL, used appropriately, will help you manage your data long term. You'll be able to visually understand what's going on and switch DBMS rapidly.
FreeTDS works well. Why would you have to use ODBC?
Whether this question is genuine, or weekend astroturf, we should not perpetuate a myth that a Big Data, or Business Intelligence project should all be about choosing a technology.
Yes we all get excited about our big racks of flashy equipment, and we can brag about being terascale, petascale or whatever. This is not what it is all about.
I suggest your first search is for recent Big Data projects that really aren't unlocking the value that they thought they would. Find companies that have these enormous deployments, and huge datastores, but can't seem to find any ROI. You will then understand the basic principle.
Finally, understand that 'Big Data' does have a firm root in certain approaches and technologies, but it has become 90% hyperbole and marketing speak. You hardly ever hear about BI anymore, and a lot of Big Data proponents would happily see you 'throw out' a lot of the history - mainly because they have a box/service to sell and wouldn't want you to catch on that there might be more to it.
If you want to do this, and you are not just the weekend marketing machine, then talk to some of the guys around here to understand more. If you ARE, then please stop perpetuating a myth.
(disclosure: I have worked with VLDB for many years, using ORACLE, SQL Server, mySQL, Teradata, Netezza, Qlikview and a bunch of other technologies. ALL of them can unlock value in your data. ALL of them can process very large datasets. Yes really, I have processed billions of rows through SQL Server even in its early incarnations.
None of them can TELL you where the value is in your data.)
Personally, I think that the RedShift suggestion is perfect for OP. Judging by the vague requirements ("the big boss wants to get on the Big Data bandwagon!"), OP's company has no clue what it wants to do with its Big Data yet. So why throw down a ton of cash on a solution without having a good idea of what problem needs solving?
Playing around with RedShift a bit and seeing what value they can extract from their data would be a great pilot program. Later, once they know what they're doing, they can implement their "real" solution.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Get a good ETL person... Generally worth more than the hardware.
A bad one can grind ANY hardware you might get into dust.
(I roll an R610 with FusionIO and the Microsoft BI stack for sales, purchasing, and inventory analysis; 300GB post-process worth of data)
LOL = captcha "salaries"
When a piece of data comes in, store it everywhere you need it. This might be aggregated tables (if you don't use indexed views) or whatever you may need. If you have background processes like ETL, you'll use a lot of your hardware for processing at the expense of queries.
Avoid ETL. You've got one shot to store your data everywhere.
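A minimal sketch of that "store it everywhere at write time" idea, assuming a raw table plus one hand-maintained aggregate table updated in the same transaction. SQLite via Python is used only so it runs anywhere; the table names and the `ingest` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_raw (sku TEXT, day TEXT, amount REAL);
    CREATE TABLE sales_by_day (
        day   TEXT PRIMARY KEY,
        total REAL NOT NULL,
        n     INTEGER NOT NULL
    );
""")


def ingest(conn, sku, day, amount):
    """Write the raw row and update the daily aggregate in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("INSERT INTO sales_raw VALUES (?, ?, ?)", (sku, day, amount))
        cur = conn.execute(
            "UPDATE sales_by_day SET total = total + ?, n = n + 1 WHERE day = ?",
            (amount, day),
        )
        if cur.rowcount == 0:  # first sale seen for this day
            conn.execute("INSERT INTO sales_by_day VALUES (?, ?, 1)", (day, amount))


# Hypothetical incoming records
ingest(conn, "A", "2014-09-01", 10.0)
ingest(conn, "B", "2014-09-01", 5.5)
ingest(conn, "A", "2014-09-02", 3.0)

daily = conn.execute(
    "SELECT day, total, n FROM sales_by_day ORDER BY day"
).fetchall()
print(daily)
```

The trade-off is that every write now costs more, and any aggregate you did not think of up front has to be backfilled from the raw table, which is a batch job again.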