RAID Solutions For Terrabyte Databases?
gullevek asks: "We are about to implement a huge database. In the first two years it will grow up to 650 GB and it will get larger and larger afterwards and could grow up to 2-4 TB. I never before implemented such a large DB. The DB Software has been choosen but now I have to find the right hardware. The basic components are not a problem, but what about storage? I would prefer to use RAID, of course, but what type of RAID? RAID 5 may be the best for disk failure, but it can be quite slow. RAID1 is fastest but the most expensive. Especially when it is at this sizes.
And what about the type of drive: SCSI-3? FCA? FibreChannel? Do you folks at Slashdot have any suggestions?"
Folks like EMC make their living selling, servicing, upgrading, and guaranteeing uptime for big systems like this.
Put the word out that you need a some RAID, and a service contract to go with it. Start talking with other folks who have similar sized solutions, and see who their vendor is.
The first thing is to talk to your DBA and get his/her input. DBAs, competant ones, have done a lot of this type of work in the past, and they'll have an enormous amount of help to provide you. They'll know your usage pattern by heart, and be able to provide you with some help as to usage.
The first thing to realize is that for most RDBMS usage patterns, RAID is a Very Very Very Bad Thing. But when I say "most", I mean "most with updates to live data."
RDBMS' use data in 4 main types of storage, and it's important to understand them:
- Main Table Storage. This is where your data actually "lives", and is ironically the least important storage wise.
- Temporary Table Storage. This is the storage space for temporary space and temporary tables, which is extremely useful for performance management.
- Index Storage. This is where data indexing structures live, and is extremely performance critical.
- Log Storage. This is where the log for your system lives (physical and logical) and is also extremely important for performance.
The most important thing for performance is to PHYSICALLY segregate ALL four types as much as possible. For example, if you're going into multi-terabyte databases, you might want all four types of data not just on different disk arrays (i.e. RAID controllers), but also on different SCSI channels, and different host adapters (i.e. multiple SCSI or FC-AL cards) altogether.You also want to bear in mind that your update speed is limited by the ability to handle log writes. Log writes aren't limited by bandwidth. They're limited by the latency of each disk. Every disk can handle a certain number of operations per second. Even if you add more disks in a RAID configuration, you're never going to be able to handle more transactions per second, because you're not increasing the number of operations of any of them, and all of them must be touched for every transactional write.
So with that in mind, allow me to recommend something:
The number one advice I can give is to consult with others. If you haven't done this before, there are people (your DBA, your database vendor, your hardware vendor, your systems integrator) who have. This is serious business, and not something to screw around with. Terabyte-level databases are still NOT so common that everyone can and should attempt them. Having terabyte-levels of data throughout an enterprise is, but in one application it isn't. You'll probably not get it right the first time, so take your time and consult with every one of your vendors on capacity and performance planning.
Not to be crass or mean, but if you're asking slashdot, you probably shouldn't be doing this all by yourself.