Replacing Traditional Storage, Databases With In-Memory Analytics
storagedude writes "Traditional databases and storage networks, even those sporting high-speed solid state drives, don't offer enough performance for the real-time analytics craze sweeping corporations, giving rise to in-memory analytics, or data mining performed in memory without the limitations of the traditional data path. The end result could be that storage and databases get pushed to the periphery of data centers and in-memory analytics becomes the new critical IT infrastructure. From the article: 'With big vendors like Microsoft and SAP buying into in-memory analytics to solve Big Data challenges, the big question for IT is what this trend will mean for the traditional data center infrastructure. Will storage, even flash drives, be needed in the future, given the requirement for real-time data analysis and current trends in design for real-time data analytics? Or will storage move from the heart of data centers and become merely a means of backup and recovery for critical real-time apps?'"
The marginalization of long-term data storage can only be a good thing -- the big advertising and other firms get the analytical data that actually matters to their bottom line, and to the extent that the average joe's privacy is being invaded at the very least the fruits of that invasion will become increasingly accessible.
Discarding data is something that, as a programmer, I don't often do. Too often I will need it later. Real time analytics are not going to change this. As long as hard drive storage continues to get cheaper, there's going to be more data stored. Partially because the easier it is to store large blocks the more likely I am to store bigger packets. I'd LOVE to store entire large XML blocks in databases sometimes, and we decide not to because of space issues. So, yeah, no. Datacenters aren't going anywhere. Things just get more complicated on the hosting side.
Note that the article writer is a strong stakeholder in his earthshattering predictions coming true.
Will storage, even flash drives, be needed in the future, given the requirement for real-time data analysis and current trends in design for real-time data analytics?
Of course storage will be needed in the future! It was needed in the past and it's needed in the present. What kind of question is that?
Or will storage move from the heart of data centers and become merely a means of backup and recovery for critical real-time apps?
Oy-yoy-yoy.
I'm getting another drink.
Just... wow... goatse in 2011? Are you a time traveler from 1999?
For the cutting edge in this area, see what the "high frequency traders" are doing. Computers aren't fast enough for that any more. The trend is toward writing trading algorithms in VHDL and compiling them into FPGAs, so the actual trading decisions are made in special-purpose hardware. Transaction latency (from trade data in on the wire to action out) is dropping below 10 microseconds. In the high-frequency trading world, if you're doing less than 1000 trades per second, you're not considered serious.
More generally, we have a fundamental problem in the I/O area: UNIX. UNIX I/O has a very simple model, which is now used by Linux, DOS, and Windows. Everything is a byte stream, and byte streams are accessed by making read and write calls to the operating system. That was OK when I/O was slower. But it's a terrible way to do inter-machine communication in clusters today. The OS overhead swamps the data transfer. Then there's the interaction with CPU dispatching. Each I/O operation usually ends by unblocking some thread, so there's a pass through the scheduler at the receive end. This works on "vanilla hardware" (most existing computers), which is why it dominates.
Bypassing the read/write model is sometimes done by giving one machine remote direct memory access ("RDMA") into another. This is usually too brutal, and tends to be done in ways that bypass the MMU and process security. So it's not very general. Still, that's how most Ethernet packets are delivered, and how graphics units talk to CPUs.
The supercomputer interconnect people have been struggling with this for years, but nothing general has emerged. RDMA via Infiniband is about where that group has ended up. That's not something a typical large hosting cluster could use safely.
Most inter-machine operations are of two types - a subroutine call to another machine, or a queue operation. Those give you the basic synchronous and asynchronous operations. A reasonable design goal is to design hardware which can perform those two operations with little or no operating system intervention once the connection has been set up, with MMU-level safety at both ends. When CPU designers have put in elaborate hardware of comparable complexity, though, nobody uses it. 386 and later machines have hardware for rings of protection, call gates, segmented memory, hardware context switching, and other stuff nobody uses because it doesn't map to vanilla C programming. That has discouraged innovation in this area. A few hardware innovations, like MMX, caught on, but still are used only in a few inner loops.
It's not that this can't be done. It's that unless it's supported by both Intel and Microsoft, it will only be a niche technology.
Even a single consumer hard drive is a terabyte of storage.... how many servers at any cost have a terabyte of RAM?
It's funny that only today I chatted with some folks on the PostgreSQL IRC support channel about this, asking whether it is at all possible to have 2 postmasters running at the same time, one to do in memory SQL against an all-in-memory database, and the other to write to the database (and no, they think that it is not possible to have 2 postmasters talking to the same database this way, they believe it will corrupt the data). The suggestion was just to increase shared_buffers and file system block buffer size. I am thinking that maybe also it's useful to try and set up the streaming replication (xlog shipping) to another PostgreSQL database store/instance and use the other database as read only, then increase shared_buffers and OS disk block buffers.
I don't really know whether there is any significant advantage of one approach over the other (except for having 2 databases, of course, so they become spares).
You can't handle the truth.
This is all well and good until someone accidentally knocks out the power. Then all of that stuff needs to be recomputed if it's not stored to disk.
Are you really sure you want them to come up with something new?
I'm getting sick and tired of hearing about yet another hype in IT-land where everything has to be done in yet another new way.
All developers understand that different problems require different solutions. Will the managers who shove this crap up our asses please stop doing so? It's not productive, and you're not going to get a better solution by forcing it to be implemented in whatever buzzword falls off the last bandwagon of an ever-growing parade of buzzwords.
"In-memory analytics" is what we started out with before databases, and guess what: it's never gone away. We've never stopped using it. Now just tell us what problem you have and let us developers decide how to solve it.
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
Dear internet: Set your photoshops to "Goatse Tron Guy" and you will glimpse mankind's unutterably horrible future!
...are not usually applied to "big data". I'm not sure what technologies are being referred to, but a few billion rows is the limit to what I've seen. This is NOT what I would call "big data".
Download a free (as in the beer) app http://www.qlikview.com/us/explore/experience/free-download and see for yourself what current commercial software can do. I load as much as a hundred GB into the RAM for analytics with this application. Just keep in mind that star schema is the best for this software. Get your tables from an existing database as flat files, load them "as is" and start analysis immediately.
...a stunned silence fell upon the hall.
But putting data in system RAM means harder reboots, since you need to dump it to disk first. And what about UPSes? You need one with enough capacity to last for the time it takes to do that dump as well.
Decentralization is the way.
You will always want that data so you can manipulate it in some other manner that wasn't taken into account by the in-memory analysis, or even the scope of your project. These marketing blokes sure like to seize the day, don't they?
How do I do my restores from that? All I seem to find are core dumps, and remnants of memory leaks.
Hard drives aren't really as slow as people think. The problem is that mechanical hard drives are slow at seeking, but if seeking can be eliminated, you can quite easily saturate your CPU on even a moderately complex calculation.
Case in point: http://www.youtube.com/watch?v=WQw7c-PliB4
In-memory data storage is fine as long as it isn't primary data storage. Yes it's faster but there are a lot of downsides as well. The most important is that it isn't easy to share between servers (a close second is that it's hard to replicate to a remote site for disaster recovery purposes) so each server needs to have its own copy of the data and there needs to be some way of keeping all that data in sync.
The alternative is to have good old "traditional" storage sitting where it always sits and when the servers boot up or start their processing they load in the appropriate data set from the storage in to memory. This gives you all of the benefits of the fast in-memory processing without worrying about all of the downsides you create by using it as primary storage. So the memory isn't storage, it's cache.
So the real battle that will take place is not between hard disks and memory, it will be between RAM and SSDs.
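To make the "memory is cache, not storage" setup concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration of the load-at-startup pattern, not what any vendor mentioned in TFA does; the `sales` table, the file name and the helper names are all made up for the example.

```python
import os
import sqlite3
import tempfile


def load_into_memory(disk_path):
    """Copy an on-disk SQLite database into a fresh in-memory database."""
    disk = sqlite3.connect(disk_path)
    mem = sqlite3.connect(":memory:")
    try:
        disk.backup(mem)  # page-by-page copy from the disk file into RAM
    finally:
        disk.close()
    return mem


def demo():
    """Build a tiny on-disk database, load it into RAM, and query the RAM copy."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sales.db")  # hypothetical file name
        disk = sqlite3.connect(path)
        disk.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        disk.executemany(
            "INSERT INTO sales VALUES (?, ?)", [("east", 10), ("west", 32)]
        )
        disk.commit()
        disk.close()

        # "Boot-up": the disk file stays the primary copy, RAM holds a working copy.
        mem = load_into_memory(path)
        total = mem.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
        mem.close()
    return total
```

Calling `demo()` runs the query against the in-memory copy only. Anything written to that copy is lost on power failure unless it is also written back to the disk file, which is exactly the trade-off described above.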
I believe VoltDB (http://voltdb.org) uses in-memory and MPP, if anyone is interested in giving it a test spin. It's from Michael Stonebraker, of various databases (Ingres, Vertica, etc.).
They've been doing a number of presentations on the topic you can probably find on the site.
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
Although TFA doesn't say so explicitly, I think it's talking about the race to get the best targeted advertising analytics in place for global applications like eBay, FB etc. These applications don't have the same database requirements as traditional business apps. It makes sense to talk about new ways of doing things for them, but TFA's author and a lot of other people make the mistake of thinking or implying that these new techniques will apply directly to traditional business apps as well. Sorry, not.
----------
Happy New Year, may it suck less for ya than the last one.
Most OSes and programming languages will let you map your in-memory data structure to a contiguous disk file, so your disk I/O is performed at paging speeds. The file system is only touched when the file is mapped (opened). Your system can then be configured to choose to what degree your data is in memory vs. on disk.
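For anyone who hasn't tried it, a rough sketch of that file-mapping approach with Python's standard mmap module follows. The record layout (one 64-bit counter per slot), the file name and the function names are assumptions made for the example, not anything from TFA.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # assumed layout: one signed 64-bit integer per slot


def create_store(path, slots):
    """Create a zero-filled file large enough to hold `slots` records."""
    with open(path, "wb") as f:
        f.truncate(slots * RECORD.size)


def bump_counter(path, slot):
    """Increment one slot in place through a memory mapping; return the new value."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:  # length 0 maps the whole file
            offset = slot * RECORD.size
            (value,) = RECORD.unpack_from(view, offset)
            RECORD.pack_into(view, offset, value + 1)
            view.flush()  # ask the OS to write dirty pages back to the file
            return value + 1


def read_counter(path, slot):
    """Read one slot through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            (value,) = RECORD.unpack_from(view, slot * RECORD.size)
            return value


def demo():
    """Create a small store, bump one slot twice, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "counters.bin")  # hypothetical file name
        create_store(path, 8)
        bump_counter(path, 3)
        bump_counter(path, 3)
        return read_counter(path, 3)
```

The reads and writes here are plain memory accesses on the mapped region; the OS pager decides which pages are resident, which is the "memory vs. disk" knob mentioned above.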
Remember when the first 64-bit machines became commercially available?
"zOMG, now we can keep whole databases in RAM with the 4GB limit gone!"
This is just CS101. Memory hierarchy - you keep your data in the fastest memory it'll fit in (that you can afford.)
Now we can afford more RAM so we can do more per unit time because we don't have to wait for IO. Duh.
Decentralization is the way.
If you're a consultant and find a client working in a centralized way, you sell decentralization as the way to solve all their woes. If you find them working in a decentralized way, you sell them on centralizing to solve all their woes.
There are only two constants here: 1) every business has woes, regardless of structure; 2) consultants extract lots of value by shifting those woes around.
Your plan failed! I was curious if that site still exists, so I defocused my eyes before looking. All I saw was a vague blur with a red blur in the middle.
WTF is SoulSkill still drunk?
This is SO nothing new, nor is it even interesting.
In-memory DBs are nothing new. They are simply prone to failure, and this is why hardware storage, be it spinning drives or flash, will always be around.
All it takes is one hiccup by the memory logic, an interrupt controller, or a DMA channel and all your in-memory data is toast, forcing a reload from the last checkpoint, which can take quite a while when you are talking about, say, a terabyte of information.
Clifford Hersh and Jeffery Spirn cooked up the ANTS database a few years back. It was BLAZINGLY fast. It outran all of them, including Times Ten, yet it never got any traction, and it was a fully in-memory database.
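A back-of-the-envelope sketch of that reload cost, in Python. The 500 MB/s sustained read rate is purely an assumed figure for illustration, not a measurement, and the estimate is a lower bound because it ignores parsing and index rebuilds.

```python
def reload_seconds(dataset_bytes, read_bytes_per_second):
    """Lower-bound reload time: a purely sequential read, nothing else."""
    if read_bytes_per_second <= 0:
        raise ValueError("read rate must be positive")
    return dataset_bytes / read_bytes_per_second


TERABYTE = 10**12                 # the terabyte from the comment above
ASSUMED_READ_RATE = 500 * 10**6   # assumption: 500 MB/s sustained sequential read

# 10**12 / (5 * 10**8) = 2000 seconds, a bit over half an hour
minutes = reload_seconds(TERABYTE, ASSUMED_READ_RATE) / 60
```

Plug in your own storage numbers; the point is only that the reload time scales linearly with the size of the in-memory data set.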
Hey KID! Yeah you, get the fuck off my lawn!
You are one sick puppy! I mean, you had an argument about a fucking HOSTS file and you didn't agree. What do you do? Do you go back to your private real-world life and ignore the other person's comment? No, you find out when he posts regarding a completely unrelated topic and flame him there.
Get a life, man.
Oh, and you still didn't find the time to register a username on /. (or you really are a coward). Sweet.
Whenever in an argument, remember this.
Too bad businesses are typically run by dolts who don't have the slightest idea how to interpret the data. I've been in charge of "medical informatics" for a large firm and spent a startling amount of time having to explain the difference between a mean and a median to high-level executives.
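For the executives in the audience, a tiny Python illustration of why the distinction matters. The lengths of stay below are invented numbers, chosen so that a single outlier drags the mean far away from the median.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days; the one long stay is the outlier.
stays = [2, 3, 3, 4, 60]


def summarize(values):
    """Return both summary statistics so they can be compared side by side."""
    return {"mean": mean(values), "median": median(values)}


summary = summarize(stays)
# mean   = (2 + 3 + 3 + 4 + 60) / 5 = 14.4
# median = middle value of the sorted list = 3
```

The "typical" patient here stays 3 days, yet the mean suggests about two weeks, which is the kind of misreading the comment above is complaining about.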
Yes, it is the wave of the... present, not the future.
Walmart already knows, when you buy a product, how old you are, your marriage classification, gender, sexual preference, criminal background, and what food and/or products you statistically will buy in the next 3 to 6 months, and offers coupons based on them. They get this data from the credit card companies and by sharing data with other suppliers.
This is a marketer's dream and the wave of the future. Statistical analysis is why Oracle and DB2 are still huge despite MySQL. It is because the databases and their APIs support these functions, and you can help beat the competition by being statistically significant in all you do and knowing trends. Forecasting is another big one, based on datamining, errr, in-memory analytics.
I.T. is called information technology for a reason. Using it to produce and predict rather than look up makes it much more useful.
No, we will simply keep everything powered on and start anew if we lose power.
Also, we will all be mega-corps, even at home. No one will start with datasets under a few petabytes. Not even for photos and text.
"The number of flash drives or PCIe flash devices needed to achieve the performance of main memory is not cost-effective, given the number of PCIe buses needed, the cost of the devices and the complexity of using them compared to just using memory."
... "Even if flash device latency improves, it still has to go through the OS and PCIe bus compared to a direct hardware translation to address memory."
... "Knowledge of how to do I/O efficiently is limited because I/O programming is not taught in schools."
... "The cost of I/O in terms of operating system interrupt overhead, latency and the path through the I/O stack is another limitation."
If you don't like the game, change the rules! The problem here is the multiple hardware and software layers between the flash memory (for now) and the processor. Take the bus and extend it to a box, if necessary, that has a ton of directly addressable flash memory, IOW Flash RAM is your system memory. DRAM, if any, should be used as level 3 cache. As your database grows, you add more flash RAM. All the quotes above only make the self-imposed complexity even more ridiculous. Stupid humans!
"[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go