Distributed Versus Centralized DB?

← Back to Stories (view on slashdot.org)

Distributed Versus Centralized DB?

Posted by Cliff on Saturday July 16, 2005 @07:30AM from the updates-without-terrible-bottlenecks dept.

OSXCPA asks: "I am developing a framework for processing accounting information. What kinds of advice would you have for setting up an environment where several (eventually, many) remote data aggregation and processing centers would update a central server 'after hours' to avoid processing delays. For those with ERP experience, I am trying to process transactions locally, instead of generating a batch in the central server, then update and validate the result to the central server, so my remote users can process at full speed all day, and central DB can get caught up later. Remote locations are all self-contained, in that no one will be posting transactions to another remote location. No big iron - this is all networked PC/Mac hardware and a central server (eventually to be a server farm, as needed). Feedback or constructive criticism would be appreciated. Technical advice is great, but I'm also looking for stuff like 'Gee, if I had only known XXX when I set this up, I would have done YYY...'"

13 of 47 comments (clear)

Min score:

Reason:

Sort:

Use unique identifiers... by stefanlasiewski · 2005-07-16 07:41 · Score: 2, Informative

Whatever solution you choose, be sure to give each DB entry a unique identifier that is unique across systems, and make sure the identifier does not change (E.g. Don't use the customer's name, phone number or any other changeable information).

One such solution: Each customer transaction has an internal ID number, like "Order #1000". Each system prepends it's own identifier to the order number, so order #1000 made on System 03 will have a UID of 03.1000. 03.1000 won't conflict with order #1000 made on system 02, because the latter UID is 02.1000.

Using a unique internal UID also lets you keep all order history around. If th customer wants to cancel their order and retry, you simply mark 03.1000 as cancelled, and start a new order with 03.1001.

This may seem obvious to you, but I can't tell you how many times I've seen a product fail because the development team decided to use the customer's name/phone number/address as the UID, which meant the customer couldn't change their name/phone number/address without breaking the index or other features in the DB. Not sure why, but this seems to be a common problem in databases.

--
"Can of worms? The can is open... the worms are everywhere."
Run away while you still have your sanity. ;) by Shag · 2005-07-16 07:44 · Score: 3, Insightful

I am not a professional DBA, but I tend to wind up working with them. The one thing I remember from a big project where replication and whatnot came into play was that none of that stuff ever worked as well as the vendors said it would, and it all required a lot more work on the DBA's part (and typically lots more $$$ in general) than it was supposed to.

--
Village idiot in some extremely smart villages.
1. Re:Run away while you still have your sanity. ;) by unitron · 2005-07-16 18:24 · Score: 2, Funny
  
  "...none of that stuff ever worked as well as the vendors said it would, and it all required a lot more work on the DBA's part (and typically lots more $$$ in general) than it was supposed to."
  So in other words it's a lot like any other kind of software you buy?
  
  --
  I see even classic Slashdot is now pretty much unusable on dial up anymore.
Been there, done that! by itwerx · 2005-07-16 07:55 · Score: 3, Informative

I did that awhile back for a data warehouse and reporting system with remote data collection points. (Half a TB of data with 50 million queries a day and a couple Gig to integrate every night). I did all the data validation and scrubbing and some preprocessing remotely to utilize the processing power of the collection machines as much as possible and kept the backend stored procedures strictly limited to import and reporting functions. The front end reporting also had some functions built into the client so I could harness the power of those machines as well. YMMV in that regard though depending on the architecture and intended purpose.

Tip 1 - don't use any clustering software on the back end. Every clustering model out there tries to be generic enough to handle a variety of apps. This adds unnecessary overhead and complexity. If you design the app for a multiplicity of servers in the first place your performance and scalability will be much better! Not to mention everything will be a lot simpler which makes troubleshooting much easier! In my app I just had a table with a list of server names. To add a server you just loaded a blank database on it, plugged the machine in and put the name in that table. A few key tables (such as that one) were automatically replicated every few minutes and if the schema was missing on a server it got built and populated on the fly using self-generating scripts. (Very handy if you're making changes to the structure! :)

Tip 2 - when importing drop your indexes! It's infinitely faster to bulk import into non-indexed tables and then rebuild the index once afterwards when you're ready for post-processing than it is to try to maintain the indexes for every little insert. (Literally about a thousand to one speed ratio if the data is any size at all).

Tip 3 - like tip 2 this is more general database design advice. And that is to base your normalization on the nature of the data and the queries, not necessarily on the technical granularity. In other words a little bit of de-normalization is okay if it reduces the complexity of your queries enough. In our case we had an insane number of cases where many tables could be cross related in different ways. Based on the reporting we needed to do we purposely duplicated a few fields (mostly indexes) here and there between tables. This cut the memory utiliation of some of the more complex queries by up to 50%. When some of the joins can result in gigabyte matrices that's a significant savings!

Tip 4 - try to spread the load. In the complex queries above we would of course try to break them down into smaller queries that could be run in sequence but that often resulted in the final report taking too long. The answer was to move the aggregation to the client. I.e. have the client pass the different parts of the report off to multiple servers to run in parallel and then pull the results together at the client side.

That's all I can think of off the top of my head. If you want me to consult further just email me, I'm not terribly expensive. :)
Comment removed by account_deleted · 2005-07-16 08:05 · Score: 2, Insightful

Comment removed based on user account deletion
The most crucial thing by richg74 · 2005-07-16 08:06 · Score: 2, Informative

I've implemented a couple of database applications across multiple geographical locations (in "Wall Street" contexts), and I think the one thing that you must be sure of is that you really understand all the ways in which the data is used in and impacts the real-world business.
You mention that this is for an accounting application. You suggest a scheme with local databases that periodically send updates to the central facility. If the books and records are in a strict hierarchy (each node is contained in exactly one "higher" node), this can work fine for some purposes, like financial reporting.

In the situations I worked on, people initially agreed that this sort of setup would be fine. However, as we dug a little deeper, we realized it wouldn't work, because financial reporting wasn't the only real need. In the context of an investment bank, the capacity to take risk is a (finite) asset, and the amount of capacity that's available at any time is a function of all the trades that have been done anywhere -- since everything is ultimately reflected back to the parent's accounts. Had we not realized that, we would have built a fine system that we would have started rebuilding as soon as we'd finished it (cf. There's never time to do it right, but always time to do it over).

I won't belabor things like redundancy, backups, and so on; I assume you have taken careful thought for those.

Finally, IMO, the trickiest thing to get right in a distributed DB system is maintaining the referential and logical integrity of the database. I personally know of one well-known DB vendor's initial replication solution (names omitted to protect the guilty) that had a subtle but potentially very serious logical flaw. Moral: do your own homework.
data integrity above all else by Anonymous Coward · 2005-07-16 10:01 · Score: 2, Insightful

I can't stress this enough: the purpose of a database is to guarantee data integrity. Be sure you understand that: storing/retreving/manipulating data is part of it too, but without data integrity, you're lost.

Unfortunately the state of the database industry (both vendors and users) is so fucked up right now I'm not even sure people know what "data integrity" means. I'll give you one definition: it means that every possible combination of values in your database has a legal business meaning. Does it makes sense to have order items without corresponding orders? No? Then make sure it's not possible.

Another definition: don't ever allow the two facts "X" and "not X" to ever be derived from the same database! With my order item example above: if order #123 doesn't exist, but there are order items referencing #123, then you can prove both "order #123 exists" and "order #123 doesn't exist" from the same database. BAD!

As for replicated databases, one approach is usually to partition the data. For instance, use a compound key (REGION#, ID#), where REGION# is hardcoded on each database. I.e. you have a constraint that says REGION# must always be "100" on database #1, "101" on database #2, etc.

You can combine all the data regularly into one BIG databases, or you can just go out and query each individual database depending on what region it's in. I'm maintaining an in-house app currently that keeps customer data for each individual city in a cluster physically located in that city. Then an outside app polls the databases and combines the data for presentation. Works pretty well, but my queries are pretty simple.
Not economiclal by duffbeer703 · 2005-07-16 11:19 · Score: 2, Informative

We looked into doing something like this a year ago, and determined that with the cost of networking and clusters of cheap computers being the way they are, a centralized solution was far cheaper.

My scenario was a distributed environment with approximately 3,000 remote sites. If you had less remote sites, the distributed processing may be cost effective, but you need to use extreme caution as distributed solutions can escalate in cost quickly.

--
Conformity is the jailer of freedom and enemy of growth. -JFK
Small scale solution by j()nty · 2005-07-16 11:26 · Score: 2, Insightful
I'm in the same situation, writing an application with about 500 users spread over 30 offices, a local database in each office, a central database, and overnight updates from the offices to the central database.
I'm using globally-unique-id's for all records. These are generated by the database system I'm using and are guaranteed to be different across all the offices. No conflict between new records created at different offices. And all I have to do is use an existing feature from the database system.

I don't delete records. Just mark dead records using a timestamp. This way records don't have to be deleted from the central database and other offices; instead just propagate the dead timestamp.

I've decided to write my own update program that pushes changes from one office to the central database. I can use this same program in reverse to update the offices from the central database. And I can use it to update laptop users who only connect once a week. And, and, and...

It's a chore to write the update program. But I see some advantages:
- The update program can be as simple as possible, but no simpler.
- I decide when, where, and how the update program runs.
- If there are conflicts between updates from different offices I can code the procedure for resolving the conflicts. The rules for resolving conflicts are decided by the users, they get exactly what they want, and the pay for me to code it.
AfterHours == Bad by iwadasn · 2005-07-16 14:07 · Score: 3, Informative

Be very careful about trying to "fix everything" after hours. That's how we do some things (and did the same thing at my previous place), and it's a nightmare. If anything is wrong in the morning, then it's hard to go back and re-run the previous night's batch processing. Often it takes too long, often you would just corrupt data further, etc... Very often this results in us having to just have incorrect data for a day, and then get it right the next day. In addition, due to global expansion, we only have 2 hours each day in which to run this stuff. If a market were to stay open an extra little while, or we try to expand somewhere in the forbidden slot, I don't know what would become of us.

You are far better off trying to get everything to be realtime, as much as possible, and build in mechanisms for the system to recover from corruption and incrementally do batching throughout the day. Do NOT rely on the fact that the system goes offline for 8 hours each night, that's an easy way to produce a system that costs you your job.
Distributed by JoeCommodore · 2005-07-17 08:52 · Score: 2, Insightful

I've not been trained in building distributed systems but have learned on my own from the school of hard knocks.

As metioned a good Record ID strategy is a must; I used a packed timestamp (base 60, using upper/lowr case numbers) with a node/user identifier, besides keeping the records unique you can also use the ID to track down where/when problems mikght have occurred.

Also time stamps and delete logs!!! You need to know what data is new or old or should have been deleted.

Work on a way to automate updates, you are going to be coding not for one computer but for all your nodes on the network, set something out there that can keep your data/apps in check and updated. A big stumbling block is getting the data transfer structure and protocol worked out.

Look at the importance and frequency of data and make sure your system can handle it, if nightly is OK, then it's fine, though there might be times when your satte;lites might be down, don't paint yourself into a corner by expecing every transfer to work right every time.

Choose your apps wisely, since you mentioned Macs that narrows your choice of what you can use. We had used FoxBase +/Mac, but 4th dimension is cross-platform and looks real flexible. It could be done using LAMP at the nodes, but you will have to work out your own distributed data import/export scripts. (DBF xfer was sooo eay)

I find you have to implement a lot more remote validation, as you expect your data to be clean when it comes in from the other offfices.

Also things happen to different sites, and they may have to restore from backups etc. Work out how to re-sync the sites when such events occur.

You are looking at a very complex system to manage. (those thousand or so lines of spaghetti code will keep you employed)

Normalize and clean up your database as much as possible now include planned data element (even if you currently hide it for now), making structure changes when distributed can be a real pain.

With that said, after years of being distributed; technology here in our rural community now makes it possible to centralize, and I'm relieved that I can make the switch. (though I may keep bits distibuted, field laptops, and such.) It's a good skill to learn, adds a whole new level of paranoia when you're coding. :-)

--
"Enjoy what you're doing! If it becomes drudgery, you're doing it wrong!" - Jim Butterfield
A list of considerations by Kefaa · 2005-07-18 03:47 · Score: 2, Informative

Here is my short list based on real life experiences:

*Read Been There, Done that above.
*If your system is geographically dispersed there is no night time. If you can handle the transactions at night, then handle them "near realtime" by sending the transactions into a queue to be processed when bandwidth, CPU, etc. are available.
*Assume the Central database will not respond for three days (a tornado, etc. at the site) and see what your model will do versus what the business wants.
*Unplug the network during a transfer to the Central site. What do you expect it to do, when an admin pulls the cable for a second to clean behind the server.
*Identify units of work. Those transactions the business believes must occur together. These are often different than DB units of work.
*Expect bad data. People will tell you that it does not happen, but if you send transactions via a secondary process, it will. A record that was good on the remote box will be unusable on the Central or vise versa. Now what should happen?
*Compress all transactions being sent to/fro. While some will be tiny and hardly seem worth it, a consistent approach makes life at 3:00 a.m. a lot easier.
* XML is your best friend and worst enemy. If you choose to use it, know why it is good for your application.
*Avoid any product, template, approach or design built for the future. If you are not going to use it in the initial production roll-out. Stay away. If you are that good at reading the future, by lottery tickets. Unless this is a very small/fast project your options twelve months from now will be very different and you will have spent the time/money/energy building a solution without a use.
*Consider what areas are most likely to change, based on past business experience, and make them easy to change. (No, this does not conflict with the previous bullet
*New tools and additional bodies will not help when you are late. Add more people early and your life will be much easier. Rapid Development by S. McConnell is still the bible for PM.
*Create a lab where you connect the Centeral DB to a set of remote servers (also in the lab). They should be blazing fast as they are connected locally, and give you the ability to create "what if" cases in-house.
*Create a step by step process for rebuilding a server from scratch. From the time the hardware is turned on, to the time the server can be released. This is a painful experience and will take a week or better full time, but will make every additional site easy to implement. Have a standard for what is dynamic based on the server. (i.e. Step 26: Change ##siteName## to local host name) so people type "www.abcCompany.com" instead of siteName. Then have someone not involved in the documentation try it.

Hope this helps.
Cheers.
some retail experience... by ecklesweb · 2005-07-18 06:28 · Score: 2, Informative

I'm in a situation where we have a few thousand retail locations that send data up at night. A few points of pain:

1. Use an extensible format for your batch uploads.
What you think you want to upload today will be different (probably less than) what you want to upload tomorrow. Use an extensible format for your batch uploads (XML is awfully nice in this regard, it just eats up bandwidth). That way when you make upgrades to part of the chain, your server software can gracefully handle differences in the data being sent up from different locations. If you try to "bleed the turnip" in terms of bandwidth and go with something like a fixed width or comma separated file, you're going to pay for it in maintenance down the road.

2. Keep a backup of the data at the remote location until the remote location receives a well-formed acknowledgement form the central location. Just because the remote location sends up the data doesn't mean it's captured; a lot of things can go wrong, and it makes life much easier if you can recover the data at the remote location later. Furthermore, PLAN to miss some uploads; it's just going to happen sometimes, and if your system is designed to handle that fact, you'll be much better off.

3. Document your data formats - the format at the remote location and the central location (presumably those are relational databases and can be documented with detailed ER diagrams) and the format used for transmission. Troubleshooting is a pain in the ass if you are unsure about how to map a transmission field to a remote or central DB field. Seems like a no-brainer, but if you don't do it from the beginning you're not ever going to do it.

4. You might want to give your remote locations random upload times within the overall upload window so as not to "slashdot" yourself. If you want the remote locations to upload between 1 AM and 4 AM, write a script and schedule it in cron that will trigger the upload at some random time during that period. If all your remote locations start uploading at 01:00:00.00, you're going to have trouble scaling.

This kind of thing isn't rocket science, but once you start down the wrong path, it's easy to get entrenched to the point where it's easier to live with your mistakes than to take corrective action.