A New Approach To Database-Aided Data Processing

← Back to Stories (view on slashdot.org)

A New Approach To Database-Aided Data Processing

Posted by Soulskill on Wednesday February 27, 2013 @09:49AM from the think-of-the-database-as-free-child-labor dept.

An anonymous reader writes "The Parallel Universe blog has a post about parallel data processing. They start off by talking about how Moore's Law still holds, but the shift from clock frequency to multiple cores has stifled the rate at which hardware allows software to scale. (Basically, Amdahl's Law.) The simplest approach to dealing with this is sharding, but that introduces its own difficulties. The more you shard a data set, the more work you need to do to separate out the data elements that can't interact. Optimizing for 2n cores takes more than twice the work of optimizing for n cores. The article says, 'If we want to continue writing compellingly complex applications at an ever-increasing scale we must come to terms with the new Moore's law and build our software on top of solid infrastructure designed specifically for this new reality; sharding just won't cut it.' Their solution is to transfer some of the processing work to the database. 'This because the database is in a unique position to know which transactions may contend for the same data items, and how to schedule them with respect to one another for the best possible performance. The database can and should be smart.' They demonstrate how SpaceBase does this by simulating a 10,000-spaceship battle on different sets of hardware (code available here). Going from a dual-core system to a quad-core system at the same clock speed actually doubles performance without sharding."

45 comments

Min score:

Reason:

Sort:

SpaceBase? by Anonymous Coward · 2013-02-27 09:54 · Score: 0

Surely you mean EVE Online.
1. Re:SpaceBase? by Anonymous Coward · 2013-02-27 10:02 · Score: 0
  
  their presentation sounds a lot as if they are aiming at CCP (who are hindered by python's GIL) as a customer.
Ludicrous by TechyImmigrant · 2013-02-27 10:03 · Score: 2

This is ludicrous. Paraphrasing: "We do databases, so we'll say that the solution to scaling parallel software resides in databases".
The applications for parallel processing are many and diverse. Databases are relevant to only some of them.

--
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
1. Re:Ludicrous by Anonymous Coward · 2013-02-27 10:12 · Score: 0
  
  Well, MapReduce/Hadoop is about spreading the big data around a cluster moving most of the processing to the distributed file stores. So this isn't that big a leap.
2. Re:Ludicrous by Anonymous Coward · 2013-02-27 10:32 · Score: 0
  
  Exactly. With agent-based software, the database knows jack about anything because usually there isn't one. Each agent stores the (miniscule) information necessary to do it's little task. The scalability bottlenecks of the database and the CPU are both gone. You just need to make sure the agents have good rules for self-organization (or can be easily shepherded) to prevent the network from becoming the bottleneck. And that's just one example of a parallel data processing system that doesn't use a database. There are plenty others.
  Have a parallel processing problem? Add a database to the mix and now you have two problems.
3. Re:Ludicrous by Common+Joe · 2013-02-27 10:46 · Score: 1
  
  I agree. Databases come with a lot of overhead, but in some (not all) situations, the extra overhead can speed things up. I find it hard to believe that databases would speed up something as simple as what they are suggesting. Something more complex? Perhaps, but I'd have to see what they propose.
  On a separate note: Did I miss something? They are talking about coding this up in databases, yet the code they give is in Java. It would be nice to see code in both Java and SQL... and which machines, which operating systems, and what database they are using. Without this, all I see is talk. Without the database code, everything falls into the "maybe" and "what if" categories. We need something concrete before we can have an actual discussion about coding something up in a threaded Java program vs an ACID compliant database.
4. Re:Ludicrous by davester666 · 2013-02-27 18:52 · Score: 1
  
  If all you make is a hammer, everything better be a nail.
  
  --
  Sleep your way to a whiter smile...date a dentist!
5. Re:Ludicrous by julesh · 2013-02-27 19:46 · Score: 2
  
  On a separate note: Did I miss something? They are talking about coding this up in databases, yet the code they give is in Java. It would be nice to see code in both Java and SQL
  Their database is a NoSQL database that uses queries implemented as Java objects; it doesn't have a query language other than those objects.
6. Re:Ludicrous by TechyImmigrant · 2013-02-28 06:51 · Score: 1
  
  The hammer they wrote for that particular nail is fine, but the claims in the story are plain silly. Particularly this bit (my emphasis):
  "but the shift from clock frequency to multiple cores has stifled the rate at which hardware allows software to scale. (Basically, Amdahl's Law.) The simplest approach to dealing with this is sharding"
  
  --
  I should use this sig to advertise my book ISBN-13 : 978-1501515132.
Modularity combined with self generating software by Anonymous Coward · 2013-02-27 10:12 · Score: 0

We need to have greater code modularity along with the utilization of algorithms that can generate and optimize code en masse. That will allow programmers not to 'dirty their hands' with the details of how to implement their programs on ever more complex hardware, but rather to deal with the larger picture elements of interface design and general purpose design without worrying about Moore's or Amdahl's "Laws" (of which neither is).
-___-
*Squinty*
Can someone explain... by Synerg1y · 2013-02-27 10:13 · Score: 2

What the difference between threading an app and sharding it are? I'm kinda leaning towards writing this off as a bunch of theoretical BS, not the kind that makes sense either. Database servers are the highest load servers on most networks, distributing data process to them sounds idiotic at best.
1. Re:Can someone explain... by Kjella · 2013-02-27 10:45 · Score: 1
  
  What the difference between threading an app and sharding it are? I'm kinda leaning towards writing this off as a bunch of theoretical BS, not the kind that makes sense either. Database servers are the highest load servers on most networks, distributing data process to them sounds idiotic at best.
  Different scales, the only way I've heard sharing used is to distribute the load between many servers, not like a cluster but that each server has their own form of dedicated area, like for example in a MMORPG you can divide the game world and pass people from shard to shard as they cross the game world. Or splitting a data set by rows depending on which "belong" together, a bit like NUMA for data - local data interacts fastest with local data, less fast with remote data but it all acts seamless as one big database. It works quite well if your range of interaction is limited like in a space shooter, but it's not a magic bullet.
  
  --
  Live today, because you never know what tomorrow brings
2. Re:Can someone explain... by FlyingGuy · 2013-02-27 16:05 · Score: 2
  
  Sharding
  is the secret ingredient in the the Web scale sauce ! >/p>
  
  --
  Hey KID! Yeah you, get the fuck off my lawn!
3. Re:Can someone explain... by Synerg1y · 2013-02-28 04:36 · Score: 1
  
  That makes sense, so like a thread = entire server kind of scope from my example. Is it really appropriate to call it a database, or a smart distributed cache? In addition to local data interacting faster with local data, memory is even faster, I'd imagine that would make it a ways more complicated to design/implement though.
Shared RAM by naroom · 2013-02-27 10:15 · Score: 4, Interesting

Yep, it totally ignores cases where multiple threads can be chewing on the same piece of RAM without conflict. My domain is image processing, and as long as each thread can access its own sub-chunk of the image, parallelizing my code takes near-zero overhead. I don't have to split the data into chunks at all.
1. Re:Shared RAM by toastar · 2013-02-27 10:25 · Score: 1
  
  mmm.... Embarrassingly Parallel...
2. Re:Shared RAM by julesh · 2013-02-27 19:41 · Score: 1
  
  mmm.... Embarrassingly Parallel...
  When you come down to it, it's amazing how many of the problems that actually require large amounts of processor time _are_ embarrassingly parallel. Presumably because the more calculations you have to perform, the more likely those calculations are to be partitionable into entirely independent data sets.
I have a solution by stewsters · 2013-02-27 10:17 · Score: 5, Funny

We can make a new language that can do processing in the database. That way we don't need to get all the rows we want to do operations on. It will look like this: "Select sum(`widgit_count`) from warehouses where state = 4 "
1. Re:I have a solution by Bacon+Bits · 2013-02-27 11:16 · Score: 4, Interesting
  
  This is not a case of "let's do processing at the database". This isn't "holy crap SQL has functions!" or "try to use set-based queries that return the data you want rather than getting a dozen record sets and looping through them with 'until (RecordSet.eof())'". This is what you do when you've done all that and you still have performance problems because of data size and complexity of queries.
  It's a case of needing to maintain data consistency in processing when you have 10,000 concurrent users all changing data but you want to process something very complex with your real time data set. Think things like geocoding in real time all cell phones attached to your cellular network, then running tower load balancing applications that can be made aware of the fact that the data has changed as it's changing and taking that into account. A tower could see that a high data user on an adjacent tower is approaching and could begin preparing for that. The 10s of thousands of spaceships is just a simple example. Let's say you want to resurface the highway system for Los Angeles, and you want to use real time data of the number of cars on the road at different times to model how traffic patterns might change when you close lanes so you can determine how to close the lanes and test the best method for how you should re-route traffic.
  The key idea here is that each spaceship or cell phone user or automobile can interact with each other based on their data (in this case, proximity data). How can we write applications that might need to signal 20,000 other processes that their data just changed? RDBMSs are already incredibly good at dealing with data consistency and concurrency, and for large data sets that can interact arbitrarily with the rest of the data, sharding doesn't work.
  Now let's say you want to do something really difficult, like modelling the human body at the cellular level. Each cell is it's own process, but each cell can interact with any number of other cells with signalling mechanisms. This chemical signalling would have to be translated to data signalling to the application processes, and it would all need to be kept consistent to maintain the reality of the simulation. Now give the simulation cancer. Now test an experimental treatment. Now do it 500,000 times each for all 10,000 types of cancer and each of the 1,000s of possible cures, and speed up the timeline to go as quickly as possible. You can have entire planetary populations of simulated humans with every disease ever known, and you can try every possible treatment simultaneously. Trillions of simulated humans dying from failed treatments advancing your knowledge in the real world by hundreds of thousands of years in a fraction of the time. Now do the same with astrological bodies, or subatomic particles.
  We use simulations now to model things that we understand but can rarely observe, but rarely do we do so as quickly as they occur in the natural world. What will happen when we can model anything and everything... instantly... simultaneously.
  
  --
  The road to tyranny has always been paved with claims of necessity.
2. Re:I have a solution by Anonymous Coward · 2013-02-27 12:18 · Score: 0
  
  Make sure you teach the devs this wonderful thing. I still see RBAR on new code from devs who don't know what sets are.
3. Re:I have a solution by stewsters · 2013-02-27 13:47 · Score: 2
  
  I understand the concept of big data, I used to do Hadoop back in 2010. But my point is that their example code just it seems to implement spatial hashing in a distributed database, which has been around for a while. I think the summary missed what makes these guy's approach better.
  
  It seems pretty obvious that you should use some type of indexing in the database to select items rather than do some cool O(n^2) operations when you have billions of items.
  
  Also, webscale
4. Re:I have a solution by pron · 2013-02-28 01:08 · Score: 1
  
  Author here. The point of the demo is not to show the benefits of indexing. That, you are quite right, is obvious. The point was to show how the database can parallelize application code. Instead of asking the database to perform a query or a transaction, whether synchronously or asynchronously, you pass your data-processing code to the database to schedule and run. The idea is to unite the database and the application, or make the database not just a data-store but also a parallelization framework. But don't mistake this database to be some sort of monolithic monster running on a huge server and programmed in an obscure language. This database runs the application on the application server, and it uses whatever language the application employs.
Oracle would love this. by Anonymous Coward · 2013-02-27 10:45 · Score: 0

They charge for CPU usage of their DB software. More specifically they charge an arm and a leg for CPU usage of their DB software. Yes, moving more logic to the database sounds like a brilliant idea...
From my perspective it seems like things have been heading in the opposite direction with NoSQL solutions and simpler distributed data stores. Having core logic in both the app layer and the database layer adds quite a bit of complexity.
Also, when did spaceship battles become a database benchmark, and who would freely tightly integrate a performance critical real time application like a computer game with a database. If you have to use one, you can bet it will be your main bottleneck.
Yet another slashvertising ... by Taco+Cowboy · 2013-02-27 10:46 · Score: 2

This submit is yet another example of how advertisement is disguised as Slashdot article
This "SpaceBase" thing is but a database product
This "parallel processing a-la database" thing is but part of an advertising campaign being pushed by the company "parallel universe" to advertise their "SpaceBase" database package
That's all

--
Muchas Gracias, Señor Edward Snowden !
1. Re:Yet another slashvertising ... by machine321 · 2013-02-28 00:58 · Score: 1
  
  Yeah, what a piece of shard.
2. Re:Yet another slashvertising ... by Anonymous Coward · 2013-02-28 02:10 · Score: 0
  
  I sharded once. I had to go home and change my pants.
What keeps everything from being in the same place by Animats · 2013-02-27 10:46 · Score: 1

Space is what keeps everything from being in the same place. If you can partition your problem spatially, it gets easier. You have to be able to handle interaction across boundaries, though. This is OK as long as you don't have interaction across multiple boundaries. Grid systems have trouble at corners where four grid squares meet.. (There's an argument for using hexes, because you never have more than three hexes meeting at a point.
Hard cases include fast moving objects, big objects, and groups of connected objects that cross multiple boundaries. Back when I did physics engines, I was talking to some people developing a planet-sized MMORPG, and said "Now let us all join hands across the world". Big groan.
Facebook hit this problem. They were college-based originally, and assumed that most interaction would be with people who were physically nearby. As Facebook scaled up, that was no longer true. They needed a lot more bandwidth between their data centers.
I could make more comments on this, but I'm going to go outside and ride my horse at the beach instead.
Re:Modularity combined with self generating softwa by Tablizer · 2013-02-27 10:48 · Score: 2

The problem is that it's hard to optimize parallel-ization for all useful factors/dimensions. Generally optimizing data for one grouping de-optimizes for another.
Replication may improve reading by copying and re-grouping the copies by the different dimensions (often on diff servers), but this makes writing more complex and slow because then the replication and reconstitution of copies for each dimension becomes a bottleneck.
The real problem is that we live in a 3-D universe. If we move to a 12-D universe, then our queries will scream (if we don't go above 11 factors). But they only allow celebrities like Elvis and Hoffa in there.

--
Table-ized A.I.
Oh sharDing by Anonymous Coward · 2013-02-27 10:50 · Score: 0

I could have sworn that said sharting.
Scale up vs. scale out by Nefarious+Wheel · 2013-02-27 10:55 · Score: 1

By moving processing to the database, you're implicitly changing from scale-out (parallelism) to scale-up, aren't you? Unless you have a cluster with really, really fast row-level lock processing, the solution for faster DB is usually faster CPU and more memory for transaction buffers, not more computers. Larger buffers puts an additional overhead on memory-to-memory transfer (as well as more lock traffic & the delays that introduces) on a scale-out basis for databases, so the tendency is to scale up a database with fewer, but larger computers.
Innit?
(I'm using the almost archaic term "computers" here to indicate individual processing nodes. Saying "processors" has become ambiguous with the term "cores" I think.)

--
Do not mock my vision of impractical footwear
1. Re:Scale up vs. scale out by pron · 2013-02-28 00:48 · Score: 1
  
  Author here. Take a look at our open-source grid: http://puniverse.github.com/galaxy/
Disgusting by Anonymous Coward · 2013-02-27 11:13 · Score: 0

All this talk of sharting and spreading loads leaves a horrible taste in my mouth.
Like Hadoop? by booch · 2013-02-27 11:23 · Score: 1, Interesting

This doesn't sound at all ground-breaking. They've basically discovered what Hadoop already does -- if you shard your data, it makes sense to run the processing where the data is, to reduce communication overhead. And Hadoop didn't pioneer the idea, either. It's based on Google's MapReduce, and I'm pretty certain that the ideas go back much further than that.

--
Software sucks. Open Source sucks less.
1. Re:Like Hadoop? by Anonymous Coward · 2013-02-27 16:58 · Score: 0
  
  A Hadoop cluster of one laptop. Sounds like what we had in the 70's with IBM batch processing.
How about actually writing efficient software? by Anonymous Coward · 2013-02-27 11:29 · Score: 0

How about actually writing efficient software instead of bloating it with useless crap and excess features?
Hardware has gotten better, the market is driving the bling and those improvements end up wasted on crap.
What does it matter? by Mister+Liberty · 2013-02-27 11:53 · Score: 0

All that counts is explosion.png.
Intelligent design by WaffleMonster · 2013-02-27 11:54 · Score: 2

I'm a fan of databases, DSLs, query languages and parallelizing compilers. I think there are huge opportunities to punt problems to all manners of optimizers which dynamically figure out which resources are to be used to crunch a problem. It is in my view inevitable this is the future.
The problem is this only takes you so far. At some level you actually have to design a system that scales and you still have to get into the weeds to do it unless there is some serious human level AI involved.
There is a reason people pay big money for large single system image machines. Not everyone has the luxury of googles and facebooks problems.
What do cores have to do with it? by Anonymous Coward · 2013-02-27 12:41 · Score: 1

Most databases, especially "large" ones wind up being I/O bound. A large database being defined, somewhat arbitrarily, as 3x or more larger than the amount of RAM available to store it.
Sharding is a solution designed to deal with horizontally or vertically partitioning a database. By splitting the database you can store the database on different mass storage systems. In so doing you achieve parallel I/O, which directly addresses the I/O binding.
All this ignores the effects of RAID, drive/controller caches, SSDs and so forth which help but are external to the OS.
Since processor cores are usually running at relatively low utilization rates anyway, adding more cores (or using the existing ones more efficiently) is rarely all that helpful. I'd never turn down an optimization, especially if it was meaningful and nondisruptive but this is perhaps not the place to devote lots of optimizing resources.
Having said all that, there is still something to what they are saying. One of the most scalable systems I know of (single image, commercially available, not a lab demo or v0.0001 pre-alpha whatever) is the IBM iSeries. It achieves scalability by tightly binding the database to the OS which allows the 2 to work better together. You still have to have the I/O channels to support additional processors/cores/whatever, but it's an integrated system and that is done at the time of system configuration.
The result is linear scaling over 2 orders of magnitude processor count increases. Few systems can do that.
However at some point I'd still expect to see scaling problems. Single image systems start to have severe locking problems at high process (and processor) counts. Supercomputers deal with this routinely. Eventually you have to migrate to shared-nothing designs. Which works but introduces significant communications overhead (that's why supercomputer designs prize high speed, low latency network communications).
1. Re:What do cores have to do with it? by greg1104 · 2013-02-27 13:57 · Score: 2
  
  The other underappreciated benefit of sharding is that it brings more caching RAM to bear on the problem. In traditional hardware, and this is even more true of cloud setups like Amazon's EC2, the maximum amount of memory you can configure in an instance isn't that high. This number isn't going up as fast anymore either. You can get 256GB of RAM in a machine, but from the perspective of speed to any one core it will not even be close to 32X as fast as 8GB.
  Adding another shard doubles the amount of RAM for caching and the underlying I/O capacity. That it also has more cores for processing is a bonus, but not the primary design reason for sharding as a database scaling operation. The approach outlined here is a slightly more clever than average approach for CPU limited programs that are not quite classic parallel processing workloads. But that doesn't make it suddenly a replacement for sharded databases in general. There are three main ways that splitting work across nodes can help--CPU, memory, and disk--and this helps a bit with one type. It's a pretty narrow use case.
2. Re:What do cores have to do with it? by pron · 2013-02-28 01:29 · Score: 1
  
  Author here. Our products are currently in-RAM, and any disk IO, if present, is asynchronous (we're not durable).
abstraction might have a performance penalty? by jsprenkle · 2013-02-27 14:04 · Score: 0

Really? Wow. Who would have thought ;)

--
- I've got bad karma because I won't parrot everyone else's opinion
Old news by Anonymous Coward · 2013-02-27 14:10 · Score: 0

Wasn't that the rationale behind PICK?
MS: Bad ideas are GOOD! by Futurepower(R) · 2013-02-27 14:43 · Score: 1

"Under Gates, MS used to steal the best ideas; now they steal the bad ones."
Inside Microsoft, that is considered an improvement... in making people suffer. Still, however, Microsoft is forced to release bad operating system versions only every other release. MS doesn't have complete control yet. In the future, all releases will have new problems.
Heresy! by hanshotfirst · 2013-02-28 05:17 · Score: 2

Everyone knows database systems designed for concurrency and parallel processing across threads, cores, servers, and data centers are best used as a bit bucket for a warehouse full of Java VMs.

--
Why, oh why, didn't I take the Blue Pill?
The technique WILL work (especially if/when...) by Anonymous Coward · 2013-02-28 13:22 · Score: 0

Distributed across MULTIPLE servers, especially with smaller data, will work well for performance!
This even works with simple text files, albeit even on a single machine (& is very effective with thread usage in combination with it on a multi-core machine)...
So - how do I KNOW this?
Couple ways, actually (in databasing first of all, but more recently, even with this):
Simple -> http://www.start64.com/index.php?option=com_content&id=5851:apk-hosts-file-engine-64bit-version&Itemid=74
That version of my application for host file mgt. has an "update" (version 6.0++) that I have been testing prior to re-release that is FASTER & uses something MUCH like what's being done here in fact...
The updated model takes the single, relatively LARGE custom hosts file data intake (between new data & merging the OLD data, deduplicating & normalizing it) as a single file, & essentially "shards" it too!
See - 1st, I pushed as far as I could go with it (does the job in under 7 minutes on an Intel I7 Core 3970/top-of-the-line afaik) & that was even using "tricks" like upping the CPU priority as well of the parent process thread doing that work...
Now - the new one's architecture varies on that & does the job, better/faster, so far:
I busted the data up as I process it & sent it to process simultaneously on multiple threads, 1 to a 'shard'...
(Threads for as many as the datasets I broke the single file into)
I cut the time down to THAT, albeit on a FAR lesser machine in testing here @ home!
(Intel I7 Core 920 - basically "middle of the road" in performance CPU's from INTEL nowadays, if that (it changes so far, more cache etc./et al on the better more costly chips - just like the one above, which I had a gamer I know who ALWAYS has the "latest/greatest" equipment test for me... same data I used, same all the way around, EXCEPT for my use of "sharding" (sort of))).
Thus - It shows you an algorithm/technique IS superior to "brute horsepower" which you should only use as a last resort...
APK
P.S.=> Now, above ALL else here - I've been doing database work in information systems since 1994 professionally, & anybody that's been doing it for awhile WILL tell you this - watch it with "big" JOINS: They're an example of where things get "too hairy" & SLOW...
So, by doing this 'sharding' technique, they're doing essentially what I noted above in my app that works on a single text file, & working across multiple smaller parts that make up a whole ordinarily & seeing speed/performance boosts!
That'd only get BETTER in a distributed computing scenario (just like how commonplace multiple hundreds of PC's can outdo supercomputers in a SINGLE rig)...
Breaking it into multiple parts for threaded concurrent processing IF NOT across multiple machines, will work...
Lastly: I even noted that here when I 1st built it, to a "naysayer" on HOW I was going to "eke more oomph" outta it using the SAME BASIC IDEA of busting the data up into smaller parts & using threads to get more outta it, AROUND THIS TIME LAST YEAR no less -> http://slashdot.org/comments.pl?sid=2019504&cid=35365340 ...
No question it works, and the 'variation' they're doing above will too, no questions asked! LOL, makes me wonder IF they "bit off my idea" (there's very little original thought though, but you never know, however I'd think that anyone with a bit of experience in DB work would know this or come to that realization independently, anyhow)...
... apk