Slashdot Mirror


Distributed Databases?

yamla asks: "I am interested in learning about distributed, fault-tolerant databases. That is, a database (not necessarily SQL) where the data is spread out (not replicated) amongst a large number of computers and furthermore, any reasonable number of those computers could disconnect or reconnect at any time without making it impossible to retrieve the stored data. I think this is a far more interesting problem than peer-to-peer because, provided such a solution scales, it would seem to solve the decentralised peer-to-peer problems. It would also seem to open up all kinds of new applications which we have hardly begun to think about yet. So I'm interested in good places to go to read up on (potential) solutions."

5 of 12 comments (clear)

  1. Slightly simplier problems by coyote-san · · Score: 3

    A slightly simplier problem can give you insights into possible solutions. How do you manage a distributed file system? That is, something that looks like a single file system but can operate and recover from partitioning?

    There are a couple solutions, but all (iirc) in turn ultimately reduce to an even simplier question: how do you manage distributed messaging? A classic example of this is "buy" and "sell" orders in a distributed stock exchange - there needs to be some way of ensuring that all parties can agree on <b>the</b> ordering of all messages. Disagreement on ordering can have major ramifications since it can affect the price paid, possibly even whether the stock was obtained at all. Likewise, ordering of file systems reads and writes can determine what gets written to disk and/or what gets fed to running applications.

    Once you have that, you can start looking at recovery issues in filesystems. IIRC, all come down to a question of how many systems you write data to, and how many systems you read data from. When you read data you'll often get multiple versions of the same information (because of update latency) and you need to know how to determine which is the most current.

    The two extremes are "everyone has everything" (total replication) and "only one server has each item" (multiple independent and disjoint servers). Depending on expected loads (esp. the ratio of reads to writes) you might see a policy of reading from a third of all systems, writing to 2/3 of them. No system will have all data, but all will have most.

    All of this points to an unstated assumption in your question. "Distributed" means more than one thing - to someone who has studied algorithms they usually refer to designs that maximize availability despite network partitioning (e.g., line cuts or court injuctions against some servers). These algorithms require substantial, if not complete, data replication.

    To many people, "distributed" also means what we would call "partitioned" algorithms where multiple sites work on a small part of the problem and the results are combined later. Examples are factorization efforts and SETI-at-home. These algorithms don't require replication, but they are highly vulnerable to partitioning.

    What problem are you trying to solve with this distributed database?

    --
    For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
  2. There are seminal research papers on this by MemRaven · · Score: 2
    and they all come down to one thing: it can't be done very well, and we should all stop trying. It all got summed up by Jim Gray in a paper I can't find a link to right now.

    IBM had a distributed database project going on back in the System-R days, and they never really got it working. I worked on the Mariposa project at U.C. Berkeley which attempted to solve some of this problem, and it didn't really get that far beyond a data warehousing context. The problem of ensuring that replication along with ownership and transactional semantics were preserved just became too difficult to solve in a purely generic way.

    If you're just interested in high availability query processing, the Mariposa work is probably pretty relevant (a company called Cohera tried to commercialize it). If you're interested in distributed transactions, you've walked into the realm of Tuxedo (by BEA systems, caveat, a former employer). While specific instances of the problem CAN be solved, one general purpose system is going to have significant problems, so it's best to categorize what you're interested in solving.

    I highly recommend that you dive into the big Stonebraker/Hellerstein book on database system implementation research papers and start reading up. It's a VERY difficult problem. Hellerstein is part of a new project which is also trying to solve some of the problems in a different way.

  3. Ignore me by MemRaven · · Score: 2
    I was specifically referring to the issues with a transactional SQL database, and I failed to read one parenthetical bit on the original post.

    I'm feeling particularly dumb right now.

  4. dns by po_boy · · Score: 2

    the DNS setup is a pretty good example of a distributed database.

    All your events are belong to us.

  5. ummm... by nomadic · · Score: 3

    Wait, if they're not replicated, how could you get data from a machine that's been brought offline?
    --