Slashdot Mirror


Replacing Traditional Storage, Databases With In-Memory Analytics

storagedude writes "Traditional databases and storage networks, even those sporting high-speed solid state drives, don't offer enough performance for the real-time analytics craze sweeping corporations, giving rise to in-memory analytics, or data mining performed in memory without the limitations of the traditional data path. The end result could be that storage and databases get pushed to the periphery of data centers and in-memory analytics becomes the new critical IT infrastructure. From the article: 'With big vendors like Microsoft and SAP buying into in-memory analytics to solve Big Data challenges, the big question for IT is what this trend will mean for the traditional data center infrastructure. Will storage, even flash drives, be needed in the future, given the requirement for real-time data analysis and current trends in design for real-time data analytics? Or will storage move from the heart of data centers and become merely a means of backup and recovery for critical real-time apps?'"

28 of 124 comments

  1. Totally inane by MrAnnoyanceToYou · · Score: 5, Insightful

    Discarding data is something that, as a programmer, I don't often do. Too often I will need it later. Real time analytics are not going to change this. As long as hard drive storage continues to get cheaper, there's going to be more data stored. Partially because the easier it is to store large blocks the more likely I am to store bigger packets. I'd LOVE to store entire large XML blocks in databases sometimes, and we decide not to because of space issues. So, yeah, no. Datacenters aren't going anywhere. Things just get more complicated on the hosting side.

    Note that the article writer is a strong stakeholder in his earthshattering predictions coming true.

    1. Re:Totally inane by fuzzyfuzzyfungus · · Score: 3, Insightful

      Also, it isn't really all that earthshattering. The fact that RAM is faster and offers lower latency than just about anything else in the system has been true more or less forever. Essentially all OSes of remotely recent vintage already opportunistically use RAM caching to make the apparent speed of disk access suck less (nicer RAID controllers will often have another block of RAM for the same purpose). Programs, at the individual discretion of their creators, already hold on to the stuff that they will need to chew over most often in RAM, and only dump to disk as often as prudence requires.

      The idea that, as advances in semiconductor fabrication make gargantuan amounts of RAM cheaper, high-end users will do more of their work in RAM just doesn't seem like a very bold prediction...

    2. Re:Totally inane by Kilrah_il · · Score: 3, Funny

      As advances in semiconductor fabrication make gargantuan amounts of RAM cheaper, high-end users will do more of their work in RAM.

      Now you have a bold prediction.
      Sincerely,
      me

      --
      Whenever in an argument, remember this.
    3. Re:Totally inane by tomhudson · · Score: 3, Insightful
      Good one - except that in this case, a lot of the so-called "work" is BS, consumers are pushing against being data-mined, regulators are getting into the act, and if your business model is so dependent on being a rude invasive pr*ck, perhaps you deserve to die ...

      And the same thing will happen when revenue-strapped governments slap a transfer tax and/or minimum hold periods on stocks - something that should have been done a long time ago.

    4. Re:Totally inane by MrAnnoyanceToYou · · Score: 2

      There must be some way to solve a problem like that, where you have a series of pointers to files, if not the files themselves as well, with the ability to add markers of some kind to each of those pointers. (maybe we can call them, "Records!!!" like CD's used to be called) And then! Then! We can disguise how the management of these 'records' is organized from the user, so they don't have to think about it. And give them a simple, logical way to get data about those 'records' out of the big, organized whole. It'd be, like, a whole new basic way to store our records! We could easily find what we wanted in our basic data storage. I can't believe no one's thought of it before. ;)

      My point here isn't that you should use a database to store your data about your files, (unfortunately, a unified markup system for files doesn't exist yet; it would be nice, but all that stuff is in the OS right now) my point is that the author of the article is missing that even if in-memory data systems do become extremely large, the underlying theory of the technology will not change much.

      And the underlying theory relies heavily on caching, limiting how much of your overall dataset is currently relevant, and so on. While I will admit it's possible many databases' useful data size will eventually be outgrown by RAM-style memory storage, when that happens market forces will probably make it comparatively expensive to hold all your data in memory at once. Partially because clean, concise code is generally far more expensive to produce than sloppy crap that chews through your data storage.

    5. Re:Totally inane by quanticle · · Score: 3, Informative

      I didn't really see the author mention anything about discarding data. Rather, it seems like he's saying that existing databases (which attempt to commit data to persistent storage as soon as possible) will be marginalized as the speed gap between persistent storage and RAM widens. Instead, business applications are going to hold data in RAM, and rely on redundancy to prevent data loss when a system fails before its data has been backed up to the database.
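
      The arrangement described above (serve from RAM, copy each write to a second node, persist later) could be sketched roughly like this. The class and method names are made up for illustration, the "replica" is just a second dict standing in for another machine's memory, and the backend is any dict-like store; this is a sketch of the idea, not how any particular product does it.

```python
from collections import deque


class WriteBehindStore:
    """In-memory key/value store that mirrors writes to a replica and
    flushes to a slower persistent backend later (hypothetical design)."""

    def __init__(self, backend):
        self.primary = {}        # working copy, served from RAM
        self.replica = {}        # stands in for a second machine's RAM
        self.pending = deque()   # writes not yet persisted
        self.backend = backend   # any dict-like persistent store

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value          # redundancy before persistence
        self.pending.append((key, value))

    def get(self, key):
        return self.primary[key]

    def flush(self):
        """Drain queued writes to the backend; return how many were written."""
        written = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.backend[key] = value
            written += 1
        return written

    def recover(self):
        """Rebuild the primary copy from the replica after a crash."""
        self.primary = dict(self.replica)


disk = {}                              # pretend this is the database
store = WriteBehindStore(disk)
store.put("order:1", {"total": 42})
store.primary.clear()                  # simulate losing the primary before any flush
store.recover()                        # the replica still has the write
```

      The point of the sketch is only that the write survives the simulated failure even though nothing had reached the backend yet.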

      --
      We all know what to do, but we don't know how to get re-elected once we have done it
    6. Re:Totally inane by Firehed · · Score: 2

      Fraud detection doesn't need microsecond timing. Fraud detection is based on good data, not "fast data"

      Sorry, but that's just wrong. Fraud analysis on credit transactions needs to be performed extremely quickly (and payment sites that process ACH need to do that quickly as well) in order for the networks to be usable. So while it requires good data, it also needs fast data - and a lot of it. At a minimum, it often looks at the user's complete payment history, the history on that credit card (did the user suddenly change? if so, the card number was probably stolen) not specific to the user, the activity at that IP address and other IPs that user has logged in from (which may include many other users and/or cards), etc. There's a lot of work to be done in less than a second or two.
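
      As a rough illustration of why those lookups want to live in memory, here is a minimal sketch of two of the checks mentioned above, done against in-memory indexes. The index names, the threshold, and the flag wording are all assumptions for the example; a real fraud system looks at far more than this.

```python
from collections import defaultdict

# Hypothetical in-memory indexes, keyed the way the checks above need them.
users_by_card = defaultdict(set)   # card number -> users seen with it
users_by_ip = defaultdict(set)     # IP address  -> users seen from it


def record(user, card, ip):
    """Remember a completed transaction."""
    users_by_card[card].add(user)
    users_by_ip[ip].add(user)


def risk_flags(user, card, ip, max_users_per_ip=3):
    """Return a list of reasons a transaction looks suspicious."""
    flags = []
    previous_users = users_by_card.get(card, set())
    if previous_users and user not in previous_users:
        flags.append("card used by a different user")
    if len(users_by_ip.get(ip, set()) - {user}) >= max_users_per_ip:
        flags.append("many users on one IP")
    return flags


record("alice", "4111", "10.0.0.1")
known = risk_flags("alice", "4111", "10.0.0.1")
stolen = risk_flags("mallory", "4111", "10.0.0.9")
```

      Each check is a couple of hash lookups, which is the kind of work that fits in a sub-second budget only if the history is already sitting in RAM.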

      --
      How are sites slashdotted when nobody reads TFAs?
  2. The cutting edge is in high frequency trading by Animats · · Score: 5, Informative

    For the cutting edge in this area, see what the "high frequency traders" are doing. Computers aren't fast enough for that any more. The trend is toward writing trading algorithms in VHDL and compiling them into FPGAs, so the actual trading decisions are made in special-purpose hardware. Transaction latency (from trade data in on the wire to action out) is dropping below 10 microseconds. In the high-frequency trading world, if you're doing less than 1000 trades per second, you're not considered serious.

    More generally, we have a fundamental problem in the I/O area: UNIX. UNIX I/O has a very simple model, which is now used by Linux, DOS, and Windows. Everything is a byte stream, and byte streams are accessed by making read and write calls to the operating system. That was OK when I/O was slower. But it's a terrible way to do inter-machine communication in clusters today. The OS overhead swamps the data transfer. Then there's the interaction with CPU dispatching. Each I/O operation usually ends by unblocking some thread, so there's a pass through the scheduler at the receive end. This works on "vanilla hardware" (most existing computers), which is why it dominates.

    Bypassing the read/write model is sometimes done by giving one machine remote direct memory access ("RDMA") into another. This is usually too brutal, and tends to be done in ways that bypass the MMU and process security. So it's not very general. Still, that's how most Ethernet packets are delivered, and how graphics units talk to CPUs.

    The supercomputer interconnect people have been struggling with this for years, but nothing general has emerged. RDMA via Infiniband is about where that group has ended up. That's not something a typical large hosting cluster could use safely.

    Most inter-machine operations are of two types - a subroutine call to another machine, or a queue operation. Those give you the basic synchronous and asynchronous operations. A reasonable design goal is to design hardware which can perform those two operations with little or no operating system intervention once the connection has been set up, with MMU-level safety at both ends. When CPU designers have put in elaborate hardware of comparable complexity, though, nobody uses it. 386 and later machines have hardware for rings of protection, call gates, segmented memory, hardware context switching, and other stuff nobody uses because it doesn't map to vanilla C programming. That has discouraged innovation in this area. A few hardware innovations, like MMX, caught on, but still are used only in a few inner loops.

    It's not that this can't be done. It's that unless it's supported by both Intel and Microsoft, it will only be a niche technology.

    1. Re:The cutting edge is in high frequency trading by Gorobei · · Score: 3, Interesting

      Yep, the article is 10-20 years out of date.

      HFT has been using statistical synchronization of dbs for years.

      Big financial shops switched to in-memory dbs decades ago. With co-lo on the compute farms.

      I don't know why he's even talking about 32G boxes as servers. That's a desktop, real db hosts are an order of magnitude bigger.

      His "push the disks to the edge of the network?" Um, that's already happened - it's called tier 2. Tier 1 is the terabytes of solid-state storage we keep just in case.

      This is a blast from the 1990s.

    2. Re:The cutting edge is in high frequency trading by Rich0 · · Score: 3, Insightful

      There is another simple solution to optimizing HFT - just aggregate and execute all trades once per minute, with the division between each minute taking place in UTC plus/minus a random offset (a few seconds on average - with 98% of divisions being within 5 seconds either way).
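
      A minimal sketch of that batching scheme, under some assumptions: the offset is drawn from a normal distribution whose spread is picked so that about 98% of draws land within 5 seconds of the minute mark, and orders are plain (timestamp, order) pairs. The function names are made up for the example.

```python
import bisect
import random

SIGMA = 5 / 2.326  # normal offset with about 98% of draws within +/-5 s


def boundaries(n_minutes, rng):
    """Minute marks in seconds, each shifted by a random offset."""
    cuts = [60 * k + rng.gauss(0, SIGMA) for k in range(1, n_minutes + 1)]
    return sorted(cuts)


def batch(orders, cuts):
    """Group (timestamp, order) pairs into the batch closed by each cut."""
    batches = [[] for _ in cuts]
    for ts, order in orders:
        i = bisect.bisect_left(cuts, ts)   # first cut at or after the order
        if i < len(cuts):                  # orders after the last cut wait
            batches[i].append(order)
    return batches


cuts = boundaries(3, random.Random(0))
grouped = batch([(10, "a"), (59, "b"), (70, "c")], [60.0, 120.0])
```

      With cuts fixed at 60 and 120 seconds, the first two orders execute together and the third waits for the next batch, so shaving microseconds off arrival time buys nothing.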

      Boom, now there is no need to spend huge amounts of money coming up with lightning-fast implementations that don't actually create real value for ordinary people.

      Business ought to be about improving the lives of ordinary people. Sure, sometimes the link isn't direct, and I'm fine with that. However, we're putting far too much emphasis on optimizing what amounts to numbers games that do nothing to produce real things of value for anybody...

    3. Re:The cutting edge is in high frequency trading by BitZtream · · Score: 2, Informative

      So I'm guessing you've never actually done any development?

      The 'byte stream' model is not from UNIX, it's just the way the hardware is laid out physically.

      IPC happens in an entirely different way unless you're using something simplistic like pipes

      RDMA is pretty much a staple of high speed cluster computing; however, it's DMA that allows pretty much everything in your PC to work without slowing the processor down. Even your keyboard controller uses DMA to get the characters into somewhere useful.

      As far as what you're calling RDMA via Infiniband, I've seen massive clusters (some of the largest in the world) using it ... safely.

      If you think nothing uses the protections provided by the x86 family I'd like to know what shitty OS you're using? Not only does everyone actually use it on the x86, they do it in ... get this ... C! Perhaps you should take a look at a few open source OSes and notice that while there is some assembly in specific places for speed and the required lowest level libraries ... you'll be surprised by the fact that all of that memory management stuff is written in ... C and utilized by .... C programs.

      I guess you're also ignoring the fact that Intel and AMD added more protection hardware to the x86 architecture JUST FOR VIRTUALIZATION ... I suppose you think the fact modern hypervisors won't work without these features present is just a silly little annoyance that the software vendors throw in to make us buy new hardware to pad their bank accounts?

      I'm not sure what development you do, but my standard C library uses MMX for many functions that require me to do nothing to take advantage of their speedup.

      You really have no clue do you?

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    4. Re:The cutting edge is in high frequency trading by Bill,+Shooter+of+Bul · · Score: 2

      You really do not understand the domain in question. The whole idea behind HFT is to analyze real time data and make a near instantaneous stock trade that capitalizes on that data analysis *before* anyone else does. Waiting a second is too long in this case. The value they add to their customers: Cold hard cash. The value to the stock market: liquidity (fair argument if it's too much liquidity).

      --
      Well.. maybe. Or Maybe not. But Definitely not sort of.
    5. Re:The cutting edge is in high frequency trading by Rich0 · · Score: 2

      Then, why did the price of gasoline drop $1.50 in a few weeks from record highs when the hedge funds dried up?

      I am sure that lots of effort goes into the logistics of oil distribution, etc. That is all effort well-spent.

      The part I don't like is when people buy oil futures speculating that prices will rise without any intention to take delivery of the oil. That just results in people bidding up the price.

      I'm certainly not the only one suggesting that needless speculation drives up the cost of commodities. Just look at housing prices with the mortgage bubble/etc.

      Again, my issue isn't with markets - it is with people buying derivatives purely on speculation, without any interest in actually dealing with the product that is being tracked. Corn growers who want to hedge the value of their crops is fine. Airlines that want to hedge the value of oil is also fine. However, this should be limited to the value of the actual material being traded.

    6. Re:The cutting edge is in high frequency trading by carnalforge · · Score: 2

      [...]

      More generally, we have a fundamental problem in the I/O area: UNIX. UNIX I/O has a very simple model, which is now used by Linux, DOS, and Windows. Everything is a byte stream, and byte streams are accessed by making read and write calls to the operating system. That was OK when I/O was slower. But it's a terrible way to do inter-machine communication in clusters today. The OS overhead swamps the data transfer. Then there's the interaction with CPU dispatching. Each I/O operation usually ends by unblocking some thread, so there's a pass through the scheduler at the receive end. This works on "vanilla hardware" (most existing computers), which is why it dominates.

      This is true, though you're underestimating "modern" OSes. Think of it as defensive planning. Who knew ~20+ years ago that we would have solid state disks? Who knew we would have 10Gb NICs? SATA?
      But the fundamental design of IO streams works and is easily adapted to new devices. Add to that the simplicity of /dev and the whole concept of input and output in UNIX. Think about it.

      [...]

      The supercomputer interconnect people have been struggling with this for years, but nothing general has emerged.
      RDMA via Infiniband is about where that group has ended up. That's not something a typical large hosting cluster could use safely.

      Add to that fibrechannel. And NUMA is an old and tried technology.

      Most inter-machine operations are of two types - a subroutine call to another machine, or a queue operation. Those give you the basic synchronous and asynchronous operations. A reasonable design goal is to design hardware which can perform those two operations with little or no operating system intervention once the connection has been set up, with MMU-level safety at both ends. When CPU designers have put in elaborate hardware of comparable complexity, though, nobody uses it. 386 and later machines have hardware for rings of protection, call gates, segmented memory, hardware context switching, and other stuff nobody uses because it doesn't map to vanilla C programming. That has discouraged innovation in this area. A few hardware innovations, like MMX, caught on, but still are used only in a few inner loops.

      At the cost of my mod points or whatever, now I call bullshit.
      Rings of protection? Used at least in Linux.
      Call gates? You mean SYSENTER? Used at least in Linux from ~2002 if I'm not wrong.
      Segmented memory? Hello 32 bits? Is that what you mean? Correct me if I'm wrong, but I thought it was a thing of the past.
      Hardware context switching? You mean VMX (Intel) or SVM (AMD)? At least on Linux those instructions are used.

      C is the limiting factor on this? Please.

      MMX? SSE/2 etc?
      gcc -mmmx -msse -msse2 -msse3 -mssse3 -msse4 -msse4.1 -msse4.2

      (talking about gcc because that is what I know, though I'm sure other compilers can use those instructions too)

      It's not that this can't be done. It's that unless it's supported by both Intel and Microsoft, it will only be a niche technology.

      yep right.

      --
      :wq!
    7. Re:The cutting edge is in high frequency trading by Rich0 · · Score: 2

      Simple - the kinds of people who were up to their eyeballs in hedging the price of oil futures, were also up to their eyeballs in hedging the prices of real-estate, mortgage-backed securities, and credit-default swaps. They lost their shirts, and for a little while they couldn't afford to keep buying oil futures. Suddenly the price of oil plummeted tremendously, and now ordinary people who buy oil for the purpose of actually burning it and not trading it can afford to do so.

      Derivatives can serve a legitimate purpose in stabilizing markets. However, they are out of control today and if anything tend to destabilize markets as a result.

  3. Re:Terabyte RAM? by fuzzyfuzzyfungus · · Score: 2

    1TB is still in the realm of rather specialized; but 512GB systems (while not inexpensive) are actually pretty available. A quick glance at Dell shows that (even without the benefits of a rep, volume pricing, or any sort of negotiation), a 2U R815 with 512GB of RAM can be yours for a hair under $40,000. Kitted out with the specs you actually want, of course, it might run you another $20k above that. If AMD isn't your flavor, the Intel-based but otherwise similar R810 will run five to ten thousand more than the R815 with otherwise similar options...

    At those prices, I'd venture to say that Flash still has a reasonably bright future ahead of it in the high-speed/low-latency storage market (not to mention the volatility issue); but (especially if your problem can handle being broken up across multiple systems with only modestly fast interconnects) the cost of enormous amounts of RAM has dropped pretty significantly.

    Now, if you can't deal with the limitations of commodity cluster interconnects, and have to have more than a half terabyte of RAM in a single memory space, I get the impression that your options get more expensive pretty fast. Phrases like "up to 16TB shared global memory" and "single system image", are generally your cue to hold on to your wallet and run... If that is what you want, though, you can buy it.

  4. Can we please stop already? by mwvdlee · · Score: 5, Insightful

    I'm getting sick and tired of hearing about yet another hype in IT-land where everything has to be done in yet another new way.

    All developers understand that different problems require different solutions. Will the managers who shove this crap up our asses please stop doing so? It's not productive, you're not going to get a better solution by forcing it to be implemented in whatever buzzword falls off the last bandwagon of an ever-growing parade of buzzwords.

    "In-memory analytics" is what we started out with before databases, and guess what; it's never gone away. We've never stopped using it. Now just tell us what problem you have and let us developers decide how to solve it.

    --
    Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    1. Re:Can we please stop already? by Desert+Raven · · Score: 3

      Agreed, someone comes up with something new to solve a very specific issue, and all of a sudden someone's predicting how it will completely replace everything else in the next month.

      Grow up.

      Physical storage and relational databases aren't going anywhere anytime soon. In-memory this and non-relational that are all well and good for the specific problems they were designed for, but physically stored and relational data fits the needs of 90% of data storage and retrieval. I sure as HECK don't want my bank storing my financial data purely in memory.

      So keep yelling to yourselves about how the sky is falling on traditional techniques. Meanwhile the rest of us have real work to do.

    2. Re:Can we please stop already? by AllenNg · · Score: 2

      I think you're missing a few evolutionary pieces. Most data analytics systems that I'm aware of are not currently relational. Long ago, the data lived in memory, but memory was expensive, so everything was moved to disk. The relational model added the formalisms of normalization (to cut down on space, among other reasons), but the types of multi-dimensional queries used by the analytics apps required too many joins for this to work. So the data was de-normalized (eg. OLAP) to improve performance. As memory prices came down, people started putting the OLAP indexes and aggregates into memory to get a performance boost. Moving the data back to memory and returning to a normalized, relational model isn't so much "drastic new thing" as it is "logical next step". For me, the upsetting thing is that just as I'm getting good at the data warehousing thing, it seems we're going to be switching to being relational again.

  5. Re:Goodbye Orwell by quanticle · · Score: 5, Informative

    You're misinterpreting the post. No one said anything about long term data storage being marginalized or eliminated. Instead, the author is talking about the difference between persistent and non-persistent storage. He's saying that existing database technologies that rely on persistent storage are being marginalized as the speed difference between spinning disks and RAM widens, and the low cost of RAM makes it practical to hold large data sets entirely in memory. According to the author, data processing and analysis will increasingly move towards in-memory systems, while traditional databases will be relegated to a "backup and restore" role for these in-memory systems.

    --
    We all know what to do, but we don't know how to get re-elected once we have done it
  6. I know am being your stereotypical anarchist but.. by Nrrqshrr · · Score: 2

    Decentralization is the way.

  7. Re:Terabyte RAM? by fuzzyfuzzyfungus · · Score: 2

    That one wasn't even intentional, unfortunately. My love of puns has, apparently, seeped directly into whatever part of my brain is responsible for day-to-day verbal and written work...

  8. Global-scale analytics != standard IT load by drdrgivemethenews · · Score: 2

    Although TFA doesn't say so explicitly, I think it's talking about the race to get the best targeted advertising analytics in place for global applications like eBay, FB etc. These applications don't have the same database requirements as traditional business apps. It makes sense to talk about new ways of doing things for them, but TFA's author and a lot of other people make the mistake of thinking or implying that these new techniques will apply directly to traditional business apps as well. Sorry, not.

    ----------

    Happy New Year, may it suck less for ya than the last one.

  9. Re:Goodbye Orwell by hairyfeet · · Score: 2

    Exactly. It really doesn't matter if you have the slowest (and thus longer lasting and cheaper to operate) HDD on the planet if all the important data is in RAM and kept there. RAM since DDR has gotten so ridiculously fast that NO SSD has a snowball's chance of catching up anytime soon, if at all, and the economies of scale have made RAM one of the cheapest if not the cheapest upgrades you can add to any system.

    Even in the consumer market falling RAM prices and changes to OS design make the hard drive pretty much a backup and long term storage medium more than anything else. I advise my customers on new builds to go ahead and let me install 4GB, because with Superfetch after a week of Windows 7 learning their usage patterns all of their apps are preloaded into RAM making launching and using instantaneous, and with suspend to RAM booting is pretty much a thing of the past. It cost less than $100 to add 8GB to mine and now everything I use is ALWAYS preloaded, making the speed just insane. Everyone that comes by the shop is always amazed at how I can launch a half a dozen apps while another 4 or 5 are doing various jobs and it is always instantaneous. But with 6GB reserved by the OS for Superfetch all the apps I use are simply waiting for me in RAM.

    So I have to agree with TFA. With the prices of RAM cheap and only getting cheaper having data you actually use often needing to swap in and out of the HDD or SSD is just nuts. And then if you have it all in RAM you can use the lower speed and less power hungry "green" drives for persistent backup instead of using SSDs which haven't come anywhere near the GB per $ ratio of spinning platters yet, although their speed is incredible. But if everything is already in RAM, do you really need to spend the crazy $$$ for the large SSD?

    --
    ACs don't waste your time replying, your posts are never seen by me.
  10. Re:I know am being your stereotypical anarchist bu by hazem · · Score: 2

    Decentralization is the way.

    If you're a consultant and find a client working in a centralized way, you sell decentralization as the way to solve all their woes. If you find them working in a decentralized way, you sell them on centralizing to solve all their woes.

    There are only two constants here: 1) every business has woes, regardless of structure; 2) consultants extract lots of value by shifting those woes around

  11. Re:But puting data in system ram = harder reboots by Joe+The+Dragon · · Score: 2

    What help is diesel when the main power room with the transfer switch is on fire and the UPSes don't have the power to run the systems for a long time, as they are set up just to be there for the time it takes for the diesel to start up?

  12. Re:Terabyte RAM? by Anonymous Coward · · Score: 2, Interesting

    I think, perhaps, that you're missing the point, at least of the article. It has nothing to do with whether to store information in memory or in the database and everything to do with the current trend of using dedicated analytics products (i.e. OLAP) to do data analysis. Whereas we used to use the same relational databases to store, retrieve and analyze all data with SQL as the Swiss Army knife that enabled it all, we're moving towards a model where the relational database is responsible for storage and retrieval of information only and dedicated analytics products have their own cache of the information for reporting and analysis purposes.

    The point is that relational databases are being marginalized and one of their major selling points (i.e. the ability to analyze data based on the relationship between different types of data) is increasingly less relevant. Once you're limiting your RDBMS usage to simple CRUD operations, the rationale for choosing an RDBMS (especially an expensive one like Oracle and its ilk) over NoSQL options or open source databases with limited support for power-user options starts to disappear. MySQL may lack a lot of the features that experienced DBAs consider mandatory, but it can do INSERTs, UPDATEs and DELETEs as well as anything and it has no problems with SELECTs based on keyed columns. Similarly, Cassandra, Voldemort and such can also easily support that limited subset of functionality.
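
    To show how small that "limited subset" really is, here is a minimal sketch of plain CRUD with a keyed SELECT. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in, not MySQL, and the table and column names are made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for any "dumb storage" back-end.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)"
)

# Create
conn.execute("INSERT INTO accounts VALUES (?, ?, ?)", (1, "alice", 100))
conn.execute("INSERT INTO accounts VALUES (?, ?, ?)", (2, "bob", 20))
# Update
conn.execute("UPDATE accounts SET balance = ? WHERE id = ?", (80, 1))
# Delete
conn.execute("DELETE FROM accounts WHERE id = ?", (2,))
# Read, by primary key only
row = conn.execute(
    "SELECT owner, balance FROM accounts WHERE id = ?", (1,)
).fetchone()
remaining = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```

    Nothing here needs joins, stored procedures, or a query planner worth paying for, which is the argument being made above.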

    That is why RDBMSs are becoming marginalized. Applications are increasingly being designed to either avoid an RDBMS back-end or to use it as simple "dumb" storage and rely on a separate analytics product to accomplish all the complicated logic that previously would be accomplished with complicated SQL and stored procedures. Beyond that, OLAP concepts allow the data-mining interface to require less development effort. It's simple to write an interface around (an) OLAP cube(s) and allow the user to choose the dimensions and measures and allow the user to pivot, drill-down and such. In fact, most analytics products do this stuff out of the box without any development necessary. With a SQL database, an interface needs to be created that will translate the user's instructions into SQL, which can often become very complex and requires significant effort to ensure that the resulting SQL will perform well.
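
    The "choose the dimensions and measures" idea can be sketched in a few lines over an in-memory fact table. The rows, dimension names, and the roll_up function below are hypothetical, and a real analytics product would precompute and index these aggregates rather than scan a list.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, year, sales)
FACTS = [
    ("EU", "widget", 2010, 100),
    ("EU", "gadget", 2010, 50),
    ("US", "widget", 2010, 70),
    ("US", "widget", 2011, 30),
]
DIMENSIONS = ("region", "product", "year")


def roll_up(facts, by):
    """Sum the sales measure grouped by the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


by_region = roll_up(FACTS, ["region"])
by_product_year = roll_up(FACTS, ["product", "year"])
```

    Pivoting or drilling down is just calling roll_up with a different list of dimensions, with no SQL generated along the way.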

    This isn't about RDBMSs becoming unnecessary, it's about them now being best served in a much more limited role than they've previously occupied in the application architecture.

  13. Re:Goodbye Orwell by More_Cowbell · · Score: 2

    I work for a large (global) web hosting company, and I'd just like to counter the 'low cost of RAM' idea... Yes, most RAM is cheap, but when you start looking at 'large data sets', cheap is a relative term.

    For example, the HP DL580 G7 can hold a Terabyte of RAM, but to do so it uses 16GB DIMMs, at $1000 each. http://h30094.www3.hp.com/product/sku/5100299/mfg_partno/500666-B21

    When you add that up, it's $64,000 just for RAM in ONE server. And we don't sell it to you, (in fact we only lease it from HP ourselves) we add a ridiculous additional monthly charge to your bill, well above what it costs us. Also keep in mind, anybody spending that kind of money has a multiple times redundant system... So, no, I would not call it 'low cost'.
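
    For anyone checking the arithmetic, it works out as below, using only the figures quoted above (a terabyte taken as 1024 GB, 16GB DIMMs, $1000 per DIMM). The variable names are just for the example.

```python
TOTAL_RAM_GB = 1024      # one terabyte, as in the post
DIMM_SIZE_GB = 16
PRICE_PER_DIMM = 1000    # US dollars, the figure quoted above

dimm_count = TOTAL_RAM_GB // DIMM_SIZE_GB    # DIMMs needed to reach 1 TB
ram_cost = dimm_count * PRICE_PER_DIMM       # total for one server
cost_per_gb = ram_cost / TOTAL_RAM_GB        # dollars per gigabyte
```

    That is 64 DIMMs, $64,000 per server, or $62.50 per gigabyte before any redundancy is added.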

    --
    Experience teaches only the teachable. -AH