When is Database Muscle Too Much?
DBOrNotDB asks: "At some of the places I've worked in the past, there have been DBAs who generally insisted that given accurate specifications and enough hardware and software, you could stuff nearly anything you wanted to into a database, manipulate it, and pull it back out again in a reasonable time. The feeling at my current workplace seems to be that very few projects lend themselves to database usage and that a customized one-off data storage solution should be developed for each project. This seems like a violation of many major software engineering principals (e.g. reuse) to me. My question is, what kind of success or horror stories does the community have about trying put different projects into databases? Numbers (# of rows, tables, total data storage, cost, etc) would be nice, but even just anecdotes would be helpful."
Slightly off topic as it were, but I've noticed that a lot of people seem to think that Excel works very nicely as a database. In some cases this might be true, but the bigger you get the more problems you have, and I just can't seem to convince those of a less-than-technical mind (read: management) otherwise.
This sig no verb.
One company I worked for contracted out an application we had to build to a provider who brought in this crazy, dishevled, brilliant Russian database engineer. I remember that with every issue that would come up he would say, "Is database application" and go off muttering to himself. Content management software needed? "Is database application." File system problems? "Is database application." new mouse drivers? "Is database application." What to order for lunch? "Is database application."
The moral of the story? Any computer application is built most quickly and easily using a database solution.
"Moderate drinking can help prevent amputated limbs" -- Abigail Zuger, NYTimes, 12/31/02
Speaking of Excel ... I work at an engineering firm where a couple of the manger types seem to think that excel is superior to matlab (which they also have installed on their machines ... but never use). Sometimes I am amazed (although usually disgusted ;) by the things that the excel spreadsheets they hand me do.
... I then have to reboot into w2k in order to extract the data into a text file, reboot into linux and then do the real work <sigh> they are management so I can't complain tooo loudly.
... excel should NEVER be used to plot more than about 10 data points ... EVER!
They will set up sheets to produce input data sets for decent sized monte carlo runs and then hand them to me to do the actual runs
And don't get me started on the BUT UGLY graphs that excel produces
i was always under the impression that the filesystem was a better place for this, assuming the directory structure was simple or fixed (i.e. you wouldn't be creating thousands of subdirs dynamically). why store all your banners in a table as BLOBs when you can simply have a web-accessible directory and store them there?
i never really found a clear reason on which was better performance-wise, though i suspect the filesystem-based way is. i also found it to be less of a hassle to implement. any intelligent thoughts?
You obviously are working with morons. Very few data-oriented applications need to write their own data-stores. Almost anything you can imagine (complex relational data, object-oriented data, xml stuff, photos, video footage, 3/4D spatial data, etc, etc..), someone has written database software tuned for it. Use it and be happy.
11*43+456^2
A custom DB will be faster than a general purpose DB (by definition).
A relational DB offers great flexibility (pull any data from the database, add most any index to greatly improve the performance, etc).
I would base the decision mostly on the interfaces to the application. I've worked on applications where the company prefered to access the SQL database directly for reporting and I've worked on projects where the only interface to the application was via HTTP/XML. In the latter case, no one cared how we stored the data, so we dumped XML into a filesystem.
Joe
Joe Batt Solid Design
I have to agree that database reuse is among the most essential parts of running a profitible business. I've worked with all sorts of RDBMSes, from MS to Oracle to PostgresSQL, on everything from the lowliest hand-me-down Linux server to top-of-the-line Big Iron, and I can tell you that any modern database is going to be able to take whatever you can throw at it. I like being able to whip out whatever data we have, shove it in and pull it out again, repeatedly and at a moment's notice. It's this kind of flexibility that makes us keep coming to database systems in the first place.
Karma: Good (despite my invention of the Karma: sig)
When you work in an environment where dog + uncle is either an Oracle DBA, MCSE or VP programmer, you wind up with projects like this: import random length string data from an embedded device into Oracle table, write some really convoluted stored procedures to parse the string data, write a nifty VB program to fetch the results for display to the users, who don't edit the data, just view it.
In short, you get some expensive licenses and a lot of work to manage what could be done with a short shell script.
Anybody ever hear about: grep, cut, sed, awk, sort, uniq... God forbid Perl?
When your only tool is a hammer...
Sure, there are times when writing an RDBMS-based solution seems like a big overhead. But there's a good reason for using RDBMS on projects that are likely to mutate and add new features over time, and/or have to interoperate with other programs and systems.
On the other hand, if you just want to stores a small array of data that fits in a 100 line text file, and the program is completely closed and self contained, there's no need for the flexibility of a RDBMS.
Imagine a business that has to "send and receive stuff"...
If you're moving two or three little packages to nearby local area businesses, only, you can get by with a small car.
But imagine your regularly ship objects large and small to locations local and international... Then you need an intermodal transportation system. Sure, your interface might be "the shipping guy", but the backbone of the transportation is heavy duty...
The reason we use the database instead of the filesystem is deathly simple. The database is god-like. I can do point it time recovery, and guarunttee that the database is completely consistant with the recovery point. If I had all that in a filesystem it'd be harder. It means I have hot-rollover capaibility from server to server, without having to duplicate my filesystem from machine to machine, I just let the hot archive logs deal with that. It means I have one backup system, and one failure recovery plan. It means all I have to monitor is the Oracle tablespace to ensure I don't run out of space. It means when I say "commit", I can hold Oracle accountable for ensuring the data is there, rather then having myself held accountable by the management.
If I was a good little boy and swallowed all the kool-aid, I'd use iFS (Oracle's Internet Filesystem) and it'd be all good. However, I don't I just use a huge array of blob's in my Oracle database.
Now that said, I have a remote filesystem that all of this data gets spooled to. Once spooled there it gets written to CD. Once the CD's are written, they are then used to find, compare, and if they match delete the blobs out of the Database. The CD then deletes the files out of the spool. Duplicate the CD, compare the two, send one offsite.
The other reason we use the database, that it's easier to deal with in our application, because writting a join against the filesystem is tricky...
Kirby
That's the first question that you should ask yourself when looking at going outside of the datbase for persistent storage. I would imagine that the data is quite important, and unless you want to write all the necessary functaionality in your one-off data storage solution (which may or may not be slower than the a RDBMS or ODBMS solution) you should have a pretty good argument for why a real database should be used. It sounds like you're working with a pretty uninformed technical staff, to be honest.
I'm curious about all-in-memory "anti-databases" like Prevayler (http://www.prevayler.org). Check it out; it sounds stupid at first, but I think the arguments for it are coherent and sensible.
Obviously, these won't be a good fit for massive terabytes of data, but for applications with small to medium data storage needs, hmmm? After all, you can always port to a database later....
Perhaps it's just that I can't think "outside the square" or something, but I can't really think of ANY application I've worked on where you couldn't make a good case for storing the data in a database. In some cases I've used XML files (when there was only a very small amount of data to be stored) but anything bigger than that I've always stored in a database. Perhaps under some circumstances (that I can't think of right now) you _might_ want to roll your own storage system....but I think these sort of projects would be the exception (handled of course!) rather than the rule.
In one scenario, you maintain business logic, info storage, display, and all sorts of other crap. In the other, you only maintain logic and display. It's easier to force your customers into lockin with some proprietary mish mash, but there are others (like myself) who will turn you down flat.
Internal stuff is similar. Do you want to maintain EVERYTHING, or just half of it?
Jesus was all right but his disciples were thick and ordinary. -John Lennon
Pluses: the database takes care of synchronization, and nearly takes care of backup/recover. Very nice. Some DB rigor may rub off on your designers.
Minuses: the DBM is large (in MB, in install/config requirements, and in CPU usage) and your customer may not be running the DBM brand/version which you have tested your app with. Supporting multiple DB vendors is a pain. SQL is sort of standard, but the table definitions tend to vary. Ick.
I have found from the projects I worked on for my company that business applications almost always need some sort of database. Databases are just easy to work with and are fast.
For example a recent project required that it stores 500 000 records (relating to cotton btw) of around 15 columns each. Now if think of that number and how long it would find a particular row or change a set of data. the MS SQL database we used on modest hardware was damn quick, quicker than could be imagined in fact.
Anyway SQL is like so usefull. With a proper structured SQL query you can do some amazing stuff, very quickly and with very little effort.
If the data is not critical or can be easily re-created, then a filesystem will suffice.
But if the data is critical to the business, and/or not easily be re-created,
the data should go into a real DB that is Managed Properly(tm).
You are being MICROattacked, from various angles, in a SOFT manner.
But please, for the love of God, don't listen to those moron college profs who say normalization is key. That's all good and fun until you have a million records in each of four tables you need to join to provide a solution. That, or your data model charts take up 300 sq. ft of wall space.
Good DBMSes can break complex joins catching the criteria piece by piece. You can also create run time extracts which are used by real time / almost real time systems for read access. However your advice is simply terrible. Once you lose normal form you lose the associative law on your table algebra. That means join operations are not defined independently of order they are performed in and that is very bad. Rick mentions an example of this in terms of addresses but it can get far worse.
Databases are more than just storage. Disregarding storing and retrieval, a good database has good design.
If the database won't be designed properly (as in many just-get-the-job-done small businesses) then a specific application may be better. But, if someone will spend the time doing design, the database forces logic and structure onto the system. While this may be an annoyance to sloppy coders, this helps ease usage (because of strict guidelines) and understanding. Yes, understanding. There are times that data is to some extent known, but to a lesser extent understood. A decent database layout increases understanding as the objects and relations must be logical.
I speak this as a DBA. And, as a DBA, for good or for bad, there is hardly a project that wouldn't benefit from clear data definitions.
Have you read my journal today?
Sometimes, well-thought out denormalisation can make a huge, positive impact on an application. Yes, it can be difficult to make sure you don't have any anomalies in your data, but with a rigourous design and development methodology, these problems can be minimised.
Don't denormalise for the sake of denormalising - the trick is to know when to break the rules and to do so very carefully.
Denormalisation is only one tool that can be used to improve the performance of a system, and of course, other options like more memory, faster CPUs and better code should be addressed first.
But please, for the love of God, don't listen to those moron college profs who say normalization is key. That's all good and fun until you have a million records in each of four tables you need to join to provide a solution. That, or your data model charts take up 300 sq. ft of wall space.
Don't normalise your database at your own peril!
Learn how to properly normalise a database (3NF is usually good enough) and then learn how to write decent queries and tune your indexes appropriately. In some circumstances it may be worthwhile caching some data in extra tables which are refreshed periodically. In even rarer circumstances it may be necessary to denormalise the database, but always normalise it and only denormalise if you encounter some intractible performance issues.
We have some incredibly complex queries at my current client and we have managed to gain performance improvements by re-writing certain queries. Splitting complex queries up and using temp tables, derived tables, sub-queries, etc can help and you will have less of a problem with data anomalies which can creep into a denormalised structure.
Serving images from the filesystem is always faster than serving them from a database, since the database is also on the filesystem. (If you think that the DB could cache the results and therefore be faster, just serve your most frequently requested images from a filesystem in RAM (a RAM-disk) -- of course you can't cache more than what would fit in RAM, no matter if you do it yourself or with the DB.)
Using DB for serving images can make sense when you serve different images in different cases like banners, where you want to have simpler CGI scripts (it's because of convenience, not because of performance). But using database doesn't have to mean storing images in a database.
If I found out that serving some of my images would be easier (and that means less error prone, better to maintain, etc.) if I used a database, I would do something like this: I'd store the digests of my images in the DB (it could be a binary MD5 for example -- just 16 bytes per image). Then, when my CGI script gets the digest, it would use it to find image on the filesystem.
So, if it gets 9743a66f914cc249efca164485a19c5c it serves /images/97/43/a66f/914cc249efca164485a19c5c.png
(splitting a digest to get different directories depth would
depend on the filesystem and number of images of course,
this is just an example; also one could use less than 128 bits of
the digest if it would be enough to have 96 or 64 (depanding
on the number of images) to have shorter paths).
The .png suffix could be also stored in the database to allow
easy use of few different image formats.
(The DB could store more human-friendly paths instead of message digests of images, but would need more human interaction -- it's probably a matter of taste).
This way you still can have a cluster of very simple statical image serving servers in the future, while having the benefits of databse. Also the database traffic is much much lower. The only difference with your scripts is that you send a redirect instead of actual content, which is even easier.
The most important benefit, however, is that you can have statical images, database and CGI frontends split into three independent machines or even groups of machines, when your traffic become to high for an all-in-one sollution. Because with the DB-only BLOB images you better have lots of money for a database cluster (and DB-CGI bandwidth of redundant internal traffic). Using a database-stored images when all you need is easier searching with SQL queries is in my opinion just using database as an expensive filesystem.
root@aio:~# nmap -sX -iR -p1- # Ho, ho, ho! Merry Xmas, everyone!
This is a horrible approach to analyzing any problem. Normailized and denormalized data both have their place in today's RDBMS-driven world. Take any one of the large ERP packages available today: Oracle Financials, PeopleSoft, SAP, etc.
When working with specific data, for example Accounts Payable data, you really don't want to duplicate all of that customer data again and again and again for each row in the database, hence you normalize it. Yes, you pay a bit of a speed penalty when joining against the CUSTOMERS, CUSTOMER_ADDRESSES , INVOICES and INVOICE_LINES. In reality, that difference is never larger than a few seconds for large (read: 10 million+ record tables) when using a properly optimized (read: good index scheme) set of normalized tables.
RDBMS' are work very well when finding the 10 rows out of those 10 million that fit your search request. Where they puke is trying to manipulate 50% or more of the data contained in multiple tables: the Data Warehouse/Data Mart.
Those same Accounts Payable tables make reporting a real pain when they're normalized, so you go through a denormalization (or summarization) procedure to fill out your reporting infrastructure. Pre-summarize your data into a single row with multiple 'buckets' for every strange query procedure you want to view the numbers by. Duplicate data on every row and get those 10+ millon records per month down to a few hundred thousand at most.
If you have to write custom programs in C, Java, Perl or your language of choice in order to operate on an exported version of the data, so much the better; the database won't perform as well when acting on every row in the database as a program optimized to summarize it. Once that's done, load it back in.
By using this approach, you're able to use the best of both the normalized and denormalized approaches and satisfy both the data entry clerks -- because the data is entered quickly -- and the managers -- because they can get virtually any report in under 30 seconds.
Sweeping comments that suggest that one method or another is bad in all cases just screams: "Look at me! I don't know what I'm doing but I'm going to tell you that what you're doing is wrong, anyway!"
There's so little difference between politics and jihad lately...
It really depends on what data you are storing. How much data, how critical, what are you doing with it, etc...
Sometimes, Excel is good enough. Or XML. Or plain text files. Or a custom file format. etc...
If the people working on your project have only ever worked with databases, they will want to use databases for everything. Most stuff will fit into a database, but sometimes it is not apprpriate to do so - as other have mentioned here, storing images in a database is not always a good idea, but you would probably want to store the location of the images in a database.
If you need to be able to ship the data around to different machines/offices/clients/over the net/etc, then maybe an XML file will be best. Custom file formats may be appropriate in some cases too. (Though I'd lean towards a more open file format.)
Huh? Do you create a custom C library for every application as well? How about a custom UI toolkit? Custom preprocessor/compiler?
Sounds kind of silly, doesn't it.
So why do these folks think a "customized one-off data storage solution" sounds any better? It's the same problem - you can either use something that's already been debugged, tested, and tweaked for performance, or you can spend your own time and effort to create it yourself. That's time and effort that could go towards coding and testing the final product, but is instead spent elsewhere (probably because someone thinks that using a dabatbase for storage would make the application "bloated").
I think the problem is probably that when you mention using a "database", most people equate that term with "general purpose database server" (Oracle, SQL Server, Postgress, MySQL, etc.) There are libraries available that were specifically designed to offer programs lightweight database access without the pain of using a full-fledged RDBMS. Search Google for embedded database, xbase library, or open source database library to start... there are any number of toolktis that will allow you to create a very customized storage solution without having to create "one-off" code for each and every project.
"Great men are not always wise: neither do the aged understand judgement." Job 32:9
I would just like to point out that the filesystem is a particular case of specialized database so this whole debate fs versus db does not make much sense. Some of its implementations may suck, but it's still a database that has a well-known, familiar interface (access mechanism). One can build a file store using a filesystem and replace it with a database-backed "file system" later if need be.
And we should also keep in mind that while some databases do give killer performance, you have to pay a lot for it, while the ol' filesystem comes with the OS already.
Now, for any other use than filesystem-related stuff, it does not make sense to try to invent your own small-scale storage mechanism when there are so many good, cheap/free database servers out there well-suited for the job.
I didn't see this mentioned anywhere, but just because you have a multi proc system, it doesn't mean you have to run MSSQL on all the procs. You can purchase a single processor license of sql server, but be running it on a quad proc box. To be compliant, you just need to be sure you are setting the processors the service can use (er, its on the 'Processor' tab under the sql server properties).
This isn't necessarily a bad thing to do either. When you are having to be conservative with your cash, a lot of the times these boxes have to serve multiple purposes. Having the sql server running on only procs 3 & 4 would leave 1 & 2 available to do 'other stuff' (web services? perl scripts?).
With SQL2k you can even have the development and the production sql server be the same system and generally not effect each other performance-wise when you are thrashing the procs. You just need to setup multiple instances of the service and assign each to separate processors (of course, they won't be completely autonomous since they *are* on the same box, but at least you won't get competition for the processor)
In any case, I'd go with the quad proc box. Only get one CPU if you want. You can always add to it later and purchase further licenses *if* you need them.
(-7 Offtopic, Posted to wrong ask slashdot database topic for the day. Hoohah)
When you want speed and flexibility and scalability and reliability and extendability and particularly developer productivity you will undoubtedly end up shooting yourself in the foot later if you avoid some form of DBMS up front.
Where you have a particularly well-defined, narrow functionality, and performance in a small footprint is a requirement, an RDBMS may not be such a good choice, but DB libraries like berkeley db can still be very useful.
And with PostgreSQL, Firebird , MySQL and so many other free, open-source projects out there covering such a broad spectrum of needs for a database, why would you not use that expert work?
A customer of ours is currently considering installing an ERP system an all locations, which are connected via satellite link with a main location, where the main database server is running.
:-) Which means one has to go :-) But it won't be the satellite ;-)
The ERP software has even its' GUI in the database, which is simply plain stupid when it comes to bandwidth (although it's very flexible on the other hand). The ERP people want to have 100MBit bandwidth, but the satellite has only 1.5MBit
Was it in Dallas?
Lol, probably not but it's amazing what you see in the "real world" of corporate IT departments.
Things you don't want to learn about your local prison:
1. They use excel for tracking inmates
2. They have no less than 20 people in the same copy of said spreadsheet
What am I missing here? A catalogue of bibliographical info seems to be the perfect application for a database. Like most applications the data seeems quite simple on the surface each entry will have:
You could easily roll your own file format to store this data. Maybe some sort of xml.
But what happens if your users decide they need more functionality? The bibliography has grown from 50 entries to 50,000 and now they want to search by Author's last name, publisher, and publishing date. You could write your own search methods, and create indexes for your data but why bother? The kind developers working on firebird, postres, and mySQL have already worked through some sleepless nights to solve this problem for you, and are giving their solution away for free.
I always look for intersting, elegant ways to build a project so that program is as small, and fast as possible. Writing a filesystem that is perfect for the task at hand often looks like a good idea at the start of a project but when additional features are added and we start talking about delivery times using a database looks more attractive.
When you can imagine querying the data you are entering - you can't easily query images, or other binary data (although I guess there must be someone working on this problem somewhere...). If you can't query it, you should usually find a better place to store it - NAS is usually fine - and maintain a pointer to it (e.g. a filename). Yes, it's something that can get un-synced, but most databases suck when it comes to actually dealing with binary data, and you can use that capacity a lot more effectively elsewhere.
When the structure of the data is likely to remain stable. If your application deals with well-understood entities, whose properties are unlikely to change over time, a database is a great solution.
Databases are, however, relatively change-resistant - it's typically a pain in the backside to change the datatype of a column, remove columns etc. So, if you're working in a domain where you continuously learn new things about your core entities, or if your development processes are highly iterative, you might be better off using an alternative data storage mechanism.
When more than a single user is likely to access the data - yes, you can create locking mechanisms yourself. You can also take your own garbage to the local dump. It's usually not a good use of your time, and the cost of not dealing with the issues involved are expensive, both for garbage and concurrent access to shared data.
When you require consistency accross transactions - the good old ACID (atomicity, consistency, isolation, durability) principles which become important for many non-trivial applications.
If you care about enforcing rules of referential integrity - do you want to ensure that all the tracks in your record collection can be tied back to a recording ? Do all orders have to have a customer ? Those things are far simpler to implement with an RDBMS than in code.
There are instances where using an RDBMS is not appropriate. Ones that spring to mind are :
- your business domain is not well understood or liable to rapid change. In this case, the cost of change for database objects is likely to be a problem - consider storing data in a self-describing format like XML.
- the application domain doesn't lend itself to being described in relational terms - image manipulation tools, word processors etc. which deal with mainly binary information probably should not use a relational model for their core data structures.
Alternatives exist - Object Oriented databases are becoming more and more popular. I have way too little experience with these to comment on their use.
It's all very well in practice, but it will never work in theory.
Well, yes and no. Yes, you're right that the 64-bit parts of MD5 digests are not unique, but so are the full 128-bit digests. Any n-bit digest, provided it's randomly distributed, will be the same statistically every 2^n times, since there are only 2^n different results. Now it's up to you if you think 281474976710656 (48 bits) different digests is enough for you, or you need 590295810358705651712 (69 bits) or full MD5 340282366920938463463374607431768211456 (128 bits).
You have to use enough bits to make sure (well, you never can be sure, like you can't be sure that you won't win a lottery 1000 times in a row -- you get the idea) that two files having the same digest is practically impossible (because it's always theoretically possible, however unlikely). It depends on the number of files you have. For n-bit digest and m files there are 2^nm different results and (2^n)!/(2^n - m)! good results (i.e. those results without collisions).
So, the probability of not having any collisions is (2^n)!/(2^nm (2^n - m)!) but since calculating (2^128)! is not what you want to do (trust me -- a 1000 teraflop supercomputer would need half a million times more time than the age of our universe, provided it would have so much RAM and could handle so long numbers, which I don't even dare estimating). You better write this from the command line, it's a little Perl one-liner I just hacked out of boredom -- yes, I know, I should take my medicine and get some sleep:
perl -le'($n,$m)=@ARGV; for($w=$z=2**$n,++$_;$m;--$m,--$z){$_*=$z/$w} print' n m
It will compute (2^n)!/(2^nm (2^n - m)!) (rounded to your floating point resolution) i.e. it will give you the probability of not having any collisions using n-bit digests with m files (-0 means it's impossible and 1 means it's sure or so possible that almost sure). If anyone asks how does it work -- it's magic. Copyright © 2002 alfaiomega. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. There is NO warranty; not even for MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE or READABILITY.
Great, I thought no one would read my comment with Score:1. That's good to hear that someone liked it more than the moderators.
root@aio:~# nmap -sX -iR -p1- # Ho, ho, ho! Merry Xmas, everyone!
perl -le'($n,$m)=@ARGV; for($w=$z=2**$n,++$_;$m;--$m,--$z){$_*=$z/$w} print' n m
It will compute (2^n)!/(2^nm (2^n - m)!) (rounded to your floating point resolution) i.e. it will give you the probability of not having any collisions using n-bit digests with m files
you are a god, dude. i've no idea why, but it works. i'm starting to learn perl today...