Slashdot Mirror


Distributed Versus Centralized DB?

OSXCPA asks: "I am developing a framework for processing accounting information. What kinds of advice would you have for setting up an environment where several (eventually, many) remote data aggregation and processing centers would update a central server 'after hours' to avoid processing delays. For those with ERP experience, I am trying to process transactions locally, instead of generating a batch in the central server, then update and validate the result to the central server, so my remote users can process at full speed all day, and central DB can get caught up later. Remote locations are all self-contained, in that no one will be posting transactions to another remote location. No big iron - this is all networked PC/Mac hardware and a central server (eventually to be a server farm, as needed). Feedback or constructive criticism would be appreciated. Technical advice is great, but I'm also looking for stuff like 'Gee, if I had only known XXX when I set this up, I would have done YYY...'"

3 of 47 comments (clear)

  1. Run away while you still have your sanity. ;) by Shag · · Score: 3, Insightful

    I am not a professional DBA, but I tend to wind up working with them. The one thing I remember from a big project where replication and whatnot came into play was that none of that stuff ever worked as well as the vendors said it would, and it all required a lot more work on the DBA's part (and typically lots more $$$ in general) than it was supposed to.

    --
    Village idiot in some extremely smart villages.
  2. Been there, done that! by itwerx · · Score: 3, Informative

    I did that awhile back for a data warehouse and reporting system with remote data collection points. (Half a TB of data with 50 million queries a day and a couple Gig to integrate every night). I did all the data validation and scrubbing and some preprocessing remotely to utilize the processing power of the collection machines as much as possible and kept the backend stored procedures strictly limited to import and reporting functions. The front end reporting also had some functions built into the client so I could harness the power of those machines as well. YMMV in that regard though depending on the architecture and intended purpose.

    Tip 1 - don't use any clustering software on the back end. Every clustering model out there tries to be generic enough to handle a variety of apps. This adds unnecessary overhead and complexity. If you design the app for a multiplicity of servers in the first place your performance and scalability will be much better! Not to mention everything will be a lot simpler which makes troubleshooting much easier! In my app I just had a table with a list of server names. To add a server you just loaded a blank database on it, plugged the machine in and put the name in that table. A few key tables (such as that one) were automatically replicated every few minutes and if the schema was missing on a server it got built and populated on the fly using self-generating scripts. (Very handy if you're making changes to the structure! :)

    Tip 2 - when importing drop your indexes! It's infinitely faster to bulk import into non-indexed tables and then rebuild the index once afterwards when you're ready for post-processing than it is to try to maintain the indexes for every little insert. (Literally about a thousand to one speed ratio if the data is any size at all).

    Tip 3 - like tip 2 this is more general database design advice. And that is to base your normalization on the nature of the data and the queries, not necessarily on the technical granularity. In other words a little bit of de-normalization is okay if it reduces the complexity of your queries enough. In our case we had an insane number of cases where many tables could be cross related in different ways. Based on the reporting we needed to do we purposely duplicated a few fields (mostly indexes) here and there between tables. This cut the memory utiliation of some of the more complex queries by up to 50%. When some of the joins can result in gigabyte matrices that's a significant savings!

    Tip 4 - try to spread the load. In the complex queries above we would of course try to break them down into smaller queries that could be run in sequence but that often resulted in the final report taking too long. The answer was to move the aggregation to the client. I.e. have the client pass the different parts of the report off to multiple servers to run in parallel and then pull the results together at the client side.

    That's all I can think of off the top of my head. If you want me to consult further just email me, I'm not terribly expensive. :)

  3. AfterHours == Bad by iwadasn · · Score: 3, Informative


    Be very careful about trying to "fix everything" after hours. That's how we do some things (and did the same thing at my previous place), and it's a nightmare. If anything is wrong in the morning, then it's hard to go back and re-run the previous night's batch processing. Often it takes too long, often you would just corrupt data further, etc... Very often this results in us having to just have incorrect data for a day, and then get it right the next day. In addition, due to global expansion, we only have 2 hours each day in which to run this stuff. If a market were to stay open an extra little while, or we try to expand somewhere in the forbidden slot, I don't know what would become of us.

    You are far better off trying to get everything to be realtime, as much as possible, and build in mechanisms for the system to recover from corruption and incrementally do batching throughout the day. Do NOT rely on the fact that the system goes offline for 8 hours each night, that's an easy way to produce a system that costs you your job.