The Future of Databases
gManZboy writes "Ever wonder where database technology is going? This is something that Turing award winner Jim Gray from Microsoft has given a lot of thought to. He recently published an article in which he looks at the many forces pushing database technologies forward, and what those new technologies will look like. Gray writes, 'the greatest of these [research challenges] will have to do with the unification of approximate and exact reasoning. Most of us come from the exact-reasoning world -- but most of our clients are now asking questions that require approximate or probabilistic answers.'"
As in, he passed the Turing test?
Starsucks
How many times have we heard of huge sites going down because databases become corrupt or unrecoverable, or of the huge resource strain (memory and CPU) from a large database?
In my opinion, the future of databases is nothing so complicated as pitched here -- but rather a move to simpler, more reliable back ends where the filesystem is the database. This is certainly the vision pitched by Hans Reiser and reiserfs, which aims to put more database-like intelligence within the filesystem. So you eliminate unnecessary layers that just eat up resources and create fragile databases.
Data is data. Just data. Save it, read it, sort it how you like. Efficient results mean having rapid, low-latency access to data.
... except that you still do, because SQL isn't a way of avoiding logical errors. ... and that they still don't save time. At best, they allow for some parallelism, external access to the data, and a separation of concerns.
Add code to it, and you have data+code.
Of course, code is data, and thus data can be treated as code, and handled by other code. LISP does this moderately well.
But you can't avoid the fact that, as it stands, databases are just engines for keeping your data structures outside your code, or when you add code to them, engines for reading your data structure for you so that you don't have to think about how to do it.
I'm getting rather tired of the fad that databases should be tacked on to everything, ranging from a shopping list to guidance systems. When did adding overhead become the mark of skill?
The requirements for a database today aren't much different from those of twenty years ago - except for what we want to get out of them.
Now that data mining is a $[insert large number here]million industry, databases are being asked to do a lot more processing with this data than before. For example: old database query = get these attributes from tuples that match this pattern. New database query = determine how likely a user who has accessed 30 or more times this last month is to subscribe to the second-level pay service within the next ninety days, with or without an email advertising said service.
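To make the old-versus-new contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `User` record, the six sample rows, and both function names. The "new" query is just an observed rate over historical data, which is the crudest possible estimate and not a real data-mining model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical record layout, invented for this example.
    name: str
    visits_last_month: int
    got_email: bool
    subscribed_within_90_days: bool


# Made-up historical data.
USERS = [
    User("ann", 42, True, True),
    User("bob", 35, True, False),
    User("cat", 31, False, False),
    User("dan", 50, False, True),
    User("eve", 12, True, False),
    User("fay", 33, True, True),
]


def old_style_query(users, min_visits):
    """Exact query: names of the tuples that match the pattern."""
    return [u.name for u in users if u.visits_last_month >= min_visits]


def new_style_query(users, min_visits, got_email):
    """Approximate query: observed subscription rate among matching users.

    Returns None when no user matches, because no rate can be estimated.
    """
    group = [
        u for u in users
        if u.visits_last_month >= min_visits and u.got_email == got_email
    ]
    if not group:
        return None
    return sum(u.subscribed_within_90_days for u in group) / len(group)


heavy_users = old_style_query(USERS, 30)
rate_with_email = new_style_query(USERS, 30, got_email=True)
rate_without_email = new_style_query(USERS, 30, got_email=False)
```

The first function has one right answer. The second returns a number that is only as good as the sample behind it, which is the whole point of the exact-versus-approximate distinction.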
I [may] disapprove of what you say, but I will defend to the death your right to say it.
Imagine, for example, that you want to use a database to store information about packets flowing through your network. That's all dandy on normal network links, but on a multigigabit link it is likely that your hard disk can't keep up with storing that data, or that the hardware to do so would cost too much. So instead you could take every second packet, look at that, and approximate. This particular example refers to data stream management systems.
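A rough sketch of the every-second-packet idea, in Python. The function names and the packet sizes are assumptions for illustration only; a real data stream management system would do far more than this.

```python
from itertools import islice


def sample_every_kth(packet_sizes, k=2):
    """Yield every k-th packet size from a stream, starting with the first."""
    return islice(packet_sizes, 0, None, k)


def estimate_stream(packet_sizes, k=2):
    """Estimate packet count and total bytes from a 1-in-k sample."""
    count = 0
    total = 0
    for size in sample_every_kth(packet_sizes, k):
        count += 1
        total += size
    # Scale the sampled figures back up by the sampling interval.
    return count * k, total * k


# Made-up packet sizes in bytes, consumed as a stream.
sizes = [1500, 40, 1500, 40, 576, 576, 1500, 40]
est_count, est_bytes = estimate_stream(iter(sizes), k=2)
actual_bytes = sum(sizes)
```

Note that the byte estimate here overshoots the true total by a wide margin, because the made-up traffic alternates large and small packets and the sampler always lands on the large ones. Sampling at fixed intervals is vulnerable to periodic traffic like this; picking packets at random is the usual way around it.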
... MBAs want the magic glowy box to do their thinking for them.
Fortunately, Microsoft will be there to take their money.
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
$technology is dying. It will be replaced with $replacement. Insert 4000 more words sprinkled with $random_buzzwords. I am so smart! The end.
most of our clients are now asking questions that require approximate or probabilistic answers
Bioinformatics databases are a good example of this. DNA and protein sequence databases are often searched by approximate string-matching algorithms, ranging from "dynamic programming" to hidden Markov models and other stochastic grammars.
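For anyone who hasn't seen it, this is roughly what dynamic-programming string matching looks like. It is a minimal sketch using plain edit distance (Levenshtein) in Python; real sequence-search tools use biologically motivated scoring schemes rather than unit costs, and the function names and sample sequences here are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    # prev[j] is the distance between the first i-1 chars of a and the
    # first j chars of b; it starts as the row for the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def approximate_search(query, sequences, max_distance):
    """Return the stored sequences within max_distance edits of the query."""
    return [s for s in sequences if edit_distance(query, s) <= max_distance]


# Made-up sequence "database".
database = ["GATTACA", "GATTTCA", "CATTAGA", "TTTTTTT"]
hits = approximate_search("GATTACA", database, max_distance=1)
```

Each comparison costs time proportional to the product of the two lengths, which is why a brute-force scan over a large sequence database was worth building special hardware for.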
Historically, drug target-hunters in Big Pharma created a market for accelerated hardware to facilitate dynamic programming searches, some of which (e.g. Paracel's Fast Data Finder chip) was originally marketed to government agencies who, um, shared an interest in approximate string-matching ;)
The "next great advancement" in databases will be when I can set up 2 or more Linux servers and have them act as a single database server. Our database server is the most expensive item in our datacenter because it's an N-way IBM server.
Could someone summarize it without using the letter 'e'?
Sure.
Th Futur of Databass
Postd by timothy on Monday May 02, @08:12PM
from th your-flight-status-is-'mayb' dpt.
gManZboy writs "vr wondr whr databas tchnology is going? This is somthing that Turing award winnr Jim Gray from Microsoft has givn a lot of thought to. H rcntly publishd an articl in which h looks at th many forcs pushing databas tchnologis forward, and what thos nw tchnologis will look lik. Gray writs, 'th gratst of ths [rsarch challngs] will hav to do with th unification of approximat and xact rasoning. Most of us com from th xact-rasoning world -- but most of our clints ar now asking qustions that rquir approximat or probabilistic answrs.'"
Hmmm, I kind of like 'databass'.
> The "next great advancement" in databases will be when I can set up 2 or more Linux servers and have
> them act as a single database server. Our database server is the most expensive item in our datacenter
> because it's an N-way IBM server.
lol, IBM has supported *exactly* what you are talking about for at least five years.
That is, you can spread your DB2 database across 10, 100, or 1000+ commodity Linux boxes (ideally blades). Or you can use Windows, or AIX, or Solaris, or HP-UX, etc. Of course, those individual boxes can be SMPs in their own right - so a thousand 8-way AIX boxes is certainly possible, if not cheap.
Oracle is now in this game as well - Oracle 10g can certainly support 32, and maybe 64, individual Linux boxes in a cluster. The techniques are different between the two - Oracle might be better at transactional systems. DB2 is definitely better at data warehousing, data mining, etc.
Of course, there are still benefits to a big SMP: a single P570 16-way will cost you $250k. But each of those 16 CPUs is multi-core (and far faster than Intel or AMD), and with its micro-partitioning it can run at least 150 Linux or AIX LPARs (logical partitions). These LPARs can grow or shrink as they need - so you aren't always over-buying for size, buying new hardly-used hardware, or having to colocate apps on a busy server when a different OS would be preferable. Not to say everyone should go this way - but there are definite benefits.
I have been a DBA for 15 years, and we are currently working with some of the concepts described in this paper that would seem far-reaching to most people. The idea of a TRUE SQL debugger is, like, so big it's sick. Quest offers some tools that kinda sorta do this for Oracle systems, but a true realtime debugger would save me YEARS of work during my career as a SQL coder. For an idea of scale, the last replication project I wrote for an employer propagated over Oracle DB_LINKS via triggers to synchronize a dataset in two cities, log it, and do something with errors. Because this particular system was a PeopleSoft installation, it was a subset of 6800 tables and 15k lines of code, give or take some triggers, with NO debugger. OMG, it's like a "finally" moment to have someone even claim to be fixing this soon in their architecture.
Next, there was some inane reference to reiserfs above, which clearly ignores what a database fundamentally both is and is becoming. It really began (and I hate to admit this as a former Solaris/Oracle admin) with SQL Server 7 and Oracle 8, and the concept that a database should be object programmable. Reiser is not going to be streaming still frames of image data fast enough to a remote client to rebuild seamlessly into a movie, for instance. Or recalculate all of a company's business logic for point of sale systems so that, for instance, the wrong type of credit card gets rejected, or so a supply chain gets populated; the list is endless. Reiser, and for that matter VFS and the myriad other database-enhanced filesystems, are tools. Good ones, but tools...
It's interesting to note that MS has finally figured out that the "n-tier" was a dumb idea. It's almost like, well, you take all this shit, then sell it through a middle man, but expect to not have to pay him anything for brokering. Like, duh. We actively benchmarked this process, in fact, and discovered that it does, not surprisingly, take time to pass data through an extra server.
Workflow is life. It's what makes this page exist (SD is, I believe, run on MySQL). The idea of publish-subscribe with atomic transactions is hardly new, but I agree with the authors that this is the direction of the market, simply because businesses now are getting spread all over. Read - if your job just went to India, learn to be a DBA, cuz when all that shit they sent over there comes back, you can bet it's going to be a mess (and is a mess actually already, which is why, in particular, people in ERP fields that intertwine with mine (as a DBA) demand and receive very large salaries; US$200 an hour is not unusual). The reason this particular ramble is relevant is that lots of global companies are either looking at, or are already implementing, the idea of data grids, where all the data servers inside a global network stay in sync. Suzy the secretary checks out a document in Baltimore, and that document flags as in use in Madrid through transactional replication within a kind of database trust-relationship network. It's a very, very good way for companies with lots of data to keep it all together, but today it's still a pain in the ass to manage.
Vertical partitioning is pretty much worthless except to data warehousing installations, most of whom are probably running on strong equipment already (to have that much data). Not to mention, I believe (I'd have to check, since it's not a feature I'd really use) Oracle's 10g product allows for this already if you really want it. Materialized views are another point here that raises my hackles. This guy is writing about the wonder of materialized views and column partitions, which ARE a cool performance cheat in large systems, but make no mistake: by the time you get to this point, you are probably rearranging deck chairs on the Titanic anyway. Essentially, materialized views precache SQL resultsets into a temporary table which gets constantly updated, so it can always provide a full resultset without having to parse the parent table. This is processor and space expensive. Vertical par
Imagination is the silver lining of Intelligence.
[misc drivel] Read more at:
http://www.prevayler.org/
Oh my dear god. You've never actually used Prevayler, have you? Prevayler isn't nearly as useful on actual data problems as Prevayler's worshippers would have you believe.
I know this because I tried to use it. If you'd ever tried to use it, you'd know how unbelievably poorly it performed when attempting to implement real world queries. You have to implement every query in Java, and Java is a particularly poor implementation choice for creating complex queries.
What if I said that this can be as fast as 8000 times faster than Oracle?
This "performance comparison" that the Prevayler group trots out is particularly funny as their test uses a single ArrayList of objects as in-memory "storage" and then "queries" it by index. Not exactly a realistic problem. Try a query across four classes with a few million instances of each class and you'll quickly discover what relational databases are good for.
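To illustrate the difference, here is a small sketch of what a cross-class query costs you when you write it by hand. It is Python rather than Java and uses nothing from Prevayler; the record layout, function names, and sample data are all assumptions made up for the example. It joins only two collections, and a four-way join multiplies the bookkeeping.

```python
from collections import defaultdict


def nested_loop_join(customers, orders):
    """Naive join: compares every customer with every order, O(n * m)."""
    return [
        (c["name"], o["item"])
        for c in customers
        for o in orders
        if o["customer_id"] == c["id"]
    ]


def hash_join(customers, orders):
    """Build an index on the join key first, then probe it, O(n + m)."""
    by_customer = defaultdict(list)
    for o in orders:
        by_customer[o["customer_id"]].append(o)
    return [
        (c["name"], o["item"])
        for c in customers
        for o in by_customer.get(c["id"], [])
    ]


# Made-up in-memory "tables".
customers = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [
    {"customer_id": 2, "item": "disk"},
    {"customer_id": 1, "item": "tape"},
    {"customer_id": 2, "item": "cpu"},
]
slow = nested_loop_join(customers, orders)
fast = hash_join(customers, orders)
```

Looking up a list element by index involves none of this. With a few million instances per class, the naive version is unusable, and choosing and maintaining the right index for each query is the work a relational query planner does for you.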
Regards,
Ross
Yes, a good DBA and/or Database Developer is a very valuable addition to any team.
The problem is that in a lot of corporations (e.g., the one I work for), they -- and all other admins -- have been taken and put in a different building. And more importantly they don't actually have to cooperate with any team.
Their job's goal is no longer the same as the developers': to get a program done by a deadline. They've been turned into a bureaucracy whose only job is to see that the servers run. No more.
That's an _awful_ job description, because it directly makes the developers their enemy. I'm not even talking "slippery slope", but direct cause-effect. Instead of being "the other half of the team that will make this program work", developers just become "those assholes who crash our servers."
It's not hard to get from that point of view to pathological cases like the admin that limited our production servers to 3 connections per server. He kept his own servers running perfectly (which is his job description) at the expense of making the company's production programs grind to a halt (hey, it's not in his job description to care about those.)
That's the problem with that kind of internal organization. As one BOFH-wannabe once said "The source of the problems on my network are the users. Would you prefer that I cut your access? Then there wouldn't be any problems any more." Another one threw a hissy fit that we dared ask that he do his job, during work hours. Yeah, how dare we bother him by asking if he could please reboot the test server he's managing.
That's the underlying problem. Instead of providing a service _to_ the users, a whole caste has been created whose job is to serve the computer, and the users are just those pesky assholes disturbing his majesty the computer. That's a very unproductive situation to create.
Worse yet, a bunch of companies invented the devastating practice of internal invoices. The admins in one department won't even go to the toilet unless they can send a bill to another department for it.
They won't even talk to each other (e.g., the WebSphere admin telling the DBA and the Unix admin that he needs a Solaris patch and a newer version of Oracle for the "transactionBranchesLooselyCoupled" setting.) No, you have to personally talk to all three of them, because otherwise they can't send three bills for it.
And predictably, they'll do _nothing_ more than the bare minimum that was requested and billed. E.g., you have to tell the DBA explicitly to set this and that, to this and that value, because she won't do that on her own. Which basically means you already need to have all the knowledge of a DBA, and she is just acting as a proxy over the phone... and sending you a bill for it.
Basically if you're not that kind of a DBA, you have my respect. All I'm saying is that when you read about "teams of clowns" or about people who'd rather invent their own storage than deal with a DBA... well, they're not necessarily avoiding _your_ kind, but the kind of clown I've described above.
A polar bear is a cartesian bear after a coordinate transform.