Slashdot Mirror


A New Approach To Database-Aided Data Processing

An anonymous reader writes "The Parallel Universe blog has a post about parallel data processing. They start off by talking about how Moore's Law still holds, but the shift from clock frequency to multiple cores has stifled the rate at which hardware allows software to scale. (Basically, Amdahl's Law.) The simplest approach to dealing with this is sharding, but that introduces its own difficulties. The more you shard a data set, the more work you need to do to separate out the data elements that can't interact. Optimizing for 2n cores takes more than twice the work of optimizing for n cores. The article says, 'If we want to continue writing compellingly complex applications at an ever-increasing scale we must come to terms with the new Moore's law and build our software on top of solid infrastructure designed specifically for this new reality; sharding just won't cut it.' Their solution is to transfer some of the processing work to the database. 'This because the database is in a unique position to know which transactions may contend for the same data items, and how to schedule them with respect to one another for the best possible performance. The database can and should be smart.' They demonstrate how SpaceBase does this by simulating a 10,000-spaceship battle on different sets of hardware (code available here). Going from a dual-core system to a quad-core system at the same clock speed actually doubles performance without sharding."
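For readers who want the Amdahl's Law reference made concrete, here is a minimal sketch. The function name `amdahl_speedup` and the 90% parallel fraction are illustrative assumptions, not figures from the linked post.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work scales with core count.

    parallel_fraction: share of the runtime that can be parallelized (0.0 to 1.0).
    cores: number of cores the parallel share is spread across.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the runtime is parallelizable.
    for cores in (1, 2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Under this model, going from two cores to four only doubles throughput when the serial share of the work is close to zero, which is the property the summary attributes to the SpaceBase demo.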

13 of 45 comments

  1. Ludicrous by TechyImmigrant · · Score: 2

    This is ludicrous. Paraphrasing: "We do databases, so we'll say that the solution to scaling parallel software resides in databases".

    The applications for parallel processing are many and diverse. Databases are relevant to only some of them.

    --
    I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    1. Re:Ludicrous by julesh · · Score: 2

      On a separate note: Did I miss something? They are talking about coding this up in databases, yet the code they give is in Java. It would be nice to see code in both Java and SQL.

      Their database is a NoSQL database that uses queries implemented as Java objects; it doesn't have a query language other than those objects.

  2. Can someone explain... by Synerg1y · · Score: 2

    What is the difference between threading an app and sharding it? I'm kinda leaning towards writing this off as a bunch of theoretical BS, and not the kind that makes sense either. Database servers are the highest-load servers on most networks; distributing data processing to them sounds idiotic at best.

    1. Re:Can someone explain... by FlyingGuy · · Score: 2

      Sharding

      is the secret ingredient in the Web scale sauce!

      --
      Hey KID! Yeah you, get the fuck off my lawn!
  3. Shared RAM by naroom · · Score: 4, Interesting

    Yep, it totally ignores cases where multiple threads can be chewing on the same piece of RAM without conflict. My domain is image processing, and as long as each thread can access its own sub-chunk of the image, parallelizing my code takes near-zero overhead. I don't have to split the data into chunks at all.
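A minimal sketch of the pattern described above, assuming a toy grayscale image stored as a list of rows: every worker thread writes to its own band of rows in the same in-memory image, so no locks and no copies are needed. The names `invert_rows` and `invert_parallel` are hypothetical. In standard CPython the global interpreter lock keeps this pure-Python loop from actually running faster, so treat it as an illustration of the access pattern rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_rows(image, start, stop):
    """Invert pixel values in rows [start, stop) of the shared image, in place."""
    for y in range(start, stop):
        row = image[y]
        for x in range(len(row)):
            row[x] = 255 - row[x]


def invert_parallel(image, workers=4):
    """Give each worker a disjoint band of rows of the same image object."""
    height = len(image)
    # Ceiling division so every row lands in exactly one band.
    step = max(1, -(-height // workers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(invert_rows, image, start, min(start + step, height))
            for start in range(0, height, step)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return image
```

Because the row bands never overlap, the threads never touch the same pixel, which is why this kind of workload needs no sharding step at all.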

  4. I have a solution by stewsters · · Score: 5, Funny

    We can make a new language that can do processing in the database. That way we don't need to get all the rows we want to do operations on. It will look like this: "Select sum(`widgit_count`) from warehouses where state = 4 "
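To spell out the joke for anyone who has not lived it, here is a small sketch using Python's built-in `sqlite3` module. The table layout and sample rows are made up for illustration; only the `warehouses` table name, the `widgit_count` column, and the `state = 4` filter come from the query above.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few hypothetical warehouse rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warehouses ("
        "id INTEGER PRIMARY KEY, state INTEGER, widgit_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO warehouses (state, widgit_count) VALUES (?, ?)",
        [(4, 10), (4, 25), (7, 99), (4, 5)],
    )
    conn.commit()
    return conn


def total_client_side(conn, state):
    """Pull every matching row into the application and add them up there."""
    total = 0
    query = "SELECT widgit_count FROM warehouses WHERE state = ?"
    for (count,) in conn.execute(query, (state,)):
        total += count
    return total


def total_in_database(conn, state):
    """Let the database do the aggregation and return a single value."""
    query = "SELECT COALESCE(SUM(widgit_count), 0) FROM warehouses WHERE state = ?"
    return conn.execute(query, (state,)).fetchone()[0]
```

Both functions return the same number; the second one just moves one row across the connection instead of all of them.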

    1. Re:I have a solution by Bacon+Bits · · Score: 4, Interesting

      This is not a case of "let's do processing at the database". This isn't "holy crap SQL has functions!" or "try to use set-based queries that return the data you want rather than getting a dozen record sets and looping through them with 'until (RecordSet.eof())'". This is what you do when you've done all that and you still have performance problems because of data size and complexity of queries.

      It's a case of needing to maintain data consistency in processing when you have 10,000 concurrent users all changing data but you want to process something very complex with your real time data set. Think things like geocoding in real time all cell phones attached to your cellular network, then running tower load balancing applications that can be made aware of the fact that the data has changed as it's changing and taking that into account. A tower could see that a high data user on an adjacent tower is approaching and could begin preparing for that. The 10s of thousands of spaceships is just a simple example. Let's say you want to resurface the highway system for Los Angeles, and you want to use real time data of the number of cars on the road at different times to model how traffic patterns might change when you close lanes so you can determine how to close the lanes and test the best method for how you should re-route traffic.

      The key idea here is that each spaceship or cell phone user or automobile can interact with each other based on their data (in this case, proximity data). How can we write applications that might need to signal 20,000 other processes that their data just changed? RDBMSs are already incredibly good at dealing with data consistency and concurrency, and for large data sets that can interact arbitrarily with the rest of the data, sharding doesn't work.

      Now let's say you want to do something really difficult, like modelling the human body at the cellular level. Each cell is its own process, but each cell can interact with any number of other cells with signalling mechanisms. This chemical signalling would have to be translated to data signalling to the application processes, and it would all need to be kept consistent to maintain the reality of the simulation. Now give the simulation cancer. Now test an experimental treatment. Now do it 500,000 times each for all 10,000 types of cancer and each of the 1,000s of possible cures, and speed up the timeline to go as quickly as possible. You can have entire planetary populations of simulated humans with every disease ever known, and you can try every possible treatment simultaneously. Trillions of simulated humans dying from failed treatments advancing your knowledge in the real world by hundreds of thousands of years in a fraction of the time. Now do the same with astronomical bodies, or subatomic particles.

      We use simulations now to model things that we understand but can rarely observe, but rarely do we do so as quickly as they occur in the natural world. What will happen when we can model anything and everything... instantly... simultaneously.

      --
      The road to tyranny has always been paved with claims of necessity.
    2. Re:I have a solution by stewsters · · Score: 2

      I understand the concept of big data, I used to do Hadoop back in 2010. But my point is that their example code just seems to implement spatial hashing in a distributed database, which has been around for a while. I think the summary missed what makes these guys' approach better.

      It seems pretty obvious that you should use some type of indexing in the database to select items rather than do some cool O(n^2) operations when you have billions of items.
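For anyone unfamiliar with spatial hashing, a minimal single-process sketch follows. It is not SpaceBase's API, which this thread does not show; the class name `SpatialHash`, the 2D coordinates, and the cell size are all assumptions made for illustration.

```python
from collections import defaultdict
from math import floor, hypot


class SpatialHash:
    """Bucket 2D points into a uniform grid so proximity queries skip far-away items."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Map a coordinate to the integer grid cell that contains it.
        return (floor(x / self.cell_size), floor(y / self.cell_size))

    def insert(self, item_id, x, y):
        self.cells[self._cell(x, y)].append((item_id, x, y))

    def neighbors(self, x, y, radius):
        """Return ids of items within `radius` of (x, y)."""
        min_cx, min_cy = self._cell(x - radius, y - radius)
        max_cx, max_cy = self._cell(x + radius, y + radius)
        found = []
        # Only visit the grid cells that overlap the query's bounding box.
        for cx in range(min_cx, max_cx + 1):
            for cy in range(min_cy, max_cy + 1):
                for item_id, ix, iy in self.cells.get((cx, cy), ()):
                    if hypot(ix - x, iy - y) <= radius:
                        found.append(item_id)
        return found
```

With items spread roughly evenly, each query touches a handful of cells instead of every item, which is the difference between an indexed lookup and comparing all pairs.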

      Also, webscale

  5. Yet another slashvertising ... by Taco+Cowboy · · Score: 2

    This submission is yet another example of how an advertisement is disguised as a Slashdot article

    This "SpaceBase" thing is but a database product

    This "parallel processing a-la database" thing is but part of an advertising campaign being pushed by the company "parallel universe" to advertise their "SpaceBase" database package

    That's all

    --
    Muchas Gracias, Señor Edward Snowden !
  6. Re:Modularity combined with self generating softwa by Tablizer · · Score: 2

    The problem is that it's hard to optimize parallelization for all useful factors/dimensions. Generally, optimizing data for one grouping de-optimizes it for another.

    Replication may improve reading by copying and re-grouping the copies by the different dimensions (often on different servers), but this makes writing more complex and slow because then the replication and reconstitution of copies for each dimension becomes a bottleneck.

    The real problem is that we live in a 3-D universe. If we move to a 12-D universe, then our queries will scream (if we don't go above 11 factors). But they only allow celebrities like Elvis and Hoffa in there.

  7. Intelligent design by WaffleMonster · · Score: 2

    I'm a fan of databases, DSLs, query languages and parallelizing compilers. I think there are huge opportunities to punt problems to all manner of optimizers which dynamically figure out which resources are to be used to crunch a problem. It is in my view inevitable that this is the future.

    The problem is this only takes you so far. At some level you actually have to design a system that scales and you still have to get into the weeds to do it unless there is some serious human level AI involved.

    There is a reason people pay big money for large single system image machines. Not everyone has the luxury of Google's and Facebook's problems.

  8. Re:What do cores have to do with it? by greg1104 · · Score: 2

    The other underappreciated benefit of sharding is that it brings more caching RAM to bear on the problem. In traditional hardware, and this is even more true of cloud setups like Amazon's EC2, the maximum amount of memory you can configure in an instance isn't that high. This number isn't going up as fast anymore either. You can get 256GB of RAM in a machine, but from the perspective of speed to any one core it will not even be close to 32X as fast as 8GB.

    Adding another shard doubles the amount of RAM for caching and the underlying I/O capacity. That it also has more cores for processing is a bonus, but not the primary design reason for sharding as a database scaling operation. The approach outlined here is a slightly more clever than average approach for CPU limited programs that are not quite classic parallel processing workloads. But that doesn't make it suddenly a replacement for sharded databases in general. There are three main ways that splitting work across nodes can help--CPU, memory, and disk--and this helps a bit with one type. It's a pretty narrow use case.

  9. Heresy! by hanshotfirst · · Score: 2

    Everyone knows database systems designed for concurrency and parallel processing across threads, cores, servers, and data centers are best used as a bit bucket for a warehouse full of Java VMs.

    --
    Why, oh why, didn't I take the Blue Pill?