A New Approach To Database-Aided Data Processing
An anonymous reader writes "The Parallel Universe blog has a post about parallel data processing. They start off by talking about how Moore's Law still holds, but the shift from clock frequency to multiple cores has stifled the rate at which hardware allows software to scale. (Basically, Amdahl's Law.) The simplest approach to dealing with this is sharding, but that introduces its own difficulties. The more you shard a data set, the more work you need to do to separate out the data elements that can't interact. Optimizing for 2n cores takes more than twice the work of optimizing for n cores. The article says, 'If we want to continue writing compellingly complex applications at an ever-increasing scale we must come to terms with the new Moore's law and build our software on top of solid infrastructure designed specifically for this new reality; sharding just won't cut it.' Their solution is to transfer some of the processing work to the database. 'This because the database is in a unique position to know which transactions may contend for the same data items, and how to schedule them with respect to one another for the best possible performance. The database can and should be smart.' They demonstrate how SpaceBase does this by simulating a 10,000-spaceship battle on different sets of hardware (code available here). Going from a dual-core system to a quad-core system at the same clock speed actually doubles performance without sharding."
This is ludicrous. Paraphrasing: "We do databases, so we'll say that the solution to scaling parallel software resides in databases".
The applications for parallel processing are many and diverse. Databases are relevant to only some of them.
I should use this sig to advertise my book ISBN-13 : 978-1501515132.
What the difference between threading an app and sharding it are? I'm kinda leaning towards writing this off as a bunch of theoretical BS, not the kind that makes sense either. Database servers are the highest load servers on most networks, distributing data process to them sounds idiotic at best.
Yep, it totally ignores cases where multiple threads can be chewing on the same piece of RAM without conflict. My domain is image processing, and as long as each thread can access its own sub-chunk of the image, parallelizing my code takes near-zero overhead. I don't have to split the data into chunks at all.
We can make a new language that can do processing in the database. That way we don't need to get all the rows we want to do operations on. It will look like this: "Select sum(`widgit_count`) from warehouses where state = 4 "
This submit is yet another example of how advertisement is disguised as Slashdot article
This "SpaceBase" thing is but a database product
This "parallel processing a-la database" thing is but part of an advertising campaign being pushed by the company "parallel universe" to advertise their "SpaceBase" database package
That's all
Muchas Gracias, Señor Edward Snowden !
Space is what keeps everything from being in the same place. If you can partition your problem spatially, it gets easier. You have to be able to handle interaction across boundaries, though. This is OK as long as you don't have interaction across multiple boundaries. Grid systems have trouble at corners where four grid squares meet.. (There's an argument for using hexes, because you never have more than three hexes meeting at a point.
Hard cases include fast moving objects, big objects, and groups of connected objects that cross multiple boundaries. Back when I did physics engines, I was talking to some people developing a planet-sized MMORPG, and said "Now let us all join hands across the world". Big groan.
Facebook hit this problem. They were college-based originally, and assumed that most interaction would be with people who were physically nearby. As Facebook scaled up, that was no longer true. They needed a lot more bandwidth between their data centers.
I could make more comments on this, but I'm going to go outside and ride my horse at the beach instead.
The problem is that it's hard to optimize parallel-ization for all useful factors/dimensions. Generally optimizing data for one grouping de-optimizes for another.
Replication may improve reading by copying and re-grouping the copies by the different dimensions (often on diff servers), but this makes writing more complex and slow because then the replication and reconstitution of copies for each dimension becomes a bottleneck.
The real problem is that we live in a 3-D universe. If we move to a 12-D universe, then our queries will scream (if we don't go above 11 factors). But they only allow celebrities like Elvis and Hoffa in there.
Table-ized A.I.
By moving processing to the database, you're implicitly changing from scale-out (parallelism) to scale-up, aren't you? Unless you have a cluster with really, really fast row-level lock processing, the solution for faster DB is usually faster CPU and more memory for transaction buffers, not more computers. Larger buffers puts an additional overhead on memory-to-memory transfer (as well as more lock traffic & the delays that introduces) on a scale-out basis for databases, so the tendency is to scale up a database with fewer, but larger computers.
Innit?
(I'm using the almost archaic term "computers" here to indicate individual processing nodes. Saying "processors" has become ambiguous with the term "cores" I think.)
Do not mock my vision of impractical footwear
This doesn't sound at all ground-breaking. They've basically discovered what Hadoop already does -- if you shard your data, it makes sense to run the processing where the data is, to reduce communication overhead. And Hadoop didn't pioneer the idea, either. It's based on Google's MapReduce, and I'm pretty certain that the ideas go back much further than that.
Software sucks. Open Source sucks less.
I'm a fan of databases, DSLs, query languages and parallelizing compilers. I think there are huge opportunities to punt problems to all manners of optimizers which dynamically figure out which resources are to be used to crunch a problem. It is in my view inevitable this is the future.
The problem is this only takes you so far. At some level you actually have to design a system that scales and you still have to get into the weeds to do it unless there is some serious human level AI involved.
There is a reason people pay big money for large single system image machines. Not everyone has the luxury of googles and facebooks problems.
Most databases, especially "large" ones wind up being I/O bound. A large database being defined, somewhat arbitrarily, as 3x or more larger than the amount of RAM available to store it.
Sharding is a solution designed to deal with horizontally or vertically partitioning a database. By splitting the database you can store the database on different mass storage systems. In so doing you achieve parallel I/O, which directly addresses the I/O binding.
All this ignores the effects of RAID, drive/controller caches, SSDs and so forth which help but are external to the OS.
Since processor cores are usually running at relatively low utilization rates anyway, adding more cores (or using the existing ones more efficiently) is rarely all that helpful. I'd never turn down an optimization, especially if it was meaningful and nondisruptive but this is perhaps not the place to devote lots of optimizing resources.
Having said all that, there is still something to what they are saying. One of the most scalable systems I know of (single image, commercially available, not a lab demo or v0.0001 pre-alpha whatever) is the IBM iSeries. It achieves scalability by tightly binding the database to the OS which allows the 2 to work better together. You still have to have the I/O channels to support additional processors/cores/whatever, but it's an integrated system and that is done at the time of system configuration.
The result is linear scaling over 2 orders of magnitude processor count increases. Few systems can do that.
However at some point I'd still expect to see scaling problems. Single image systems start to have severe locking problems at high process (and processor) counts. Supercomputers deal with this routinely. Eventually you have to migrate to shared-nothing designs. Which works but introduces significant communications overhead (that's why supercomputer designs prize high speed, low latency network communications).
"Under Gates, MS used to steal the best ideas; now they steal the bad ones."
Inside Microsoft, that is considered an improvement... in making people suffer. Still, however, Microsoft is forced to release bad operating system versions only every other release. MS doesn't have complete control yet. In the future, all releases will have new problems.
Everyone knows database systems designed for concurrency and parallel processing across threads, cores, servers, and data centers are best used as a bit bucket for a warehouse full of Java VMs.
Why, oh why, didn't I take the Blue Pill?