> Trolltech provides the library in two licences - the free licence which mandates that the applications developed be released under GPL and a commercial non-free licence which allows one to develop closed source applications using Qt.
So, I wonder if, like MySQL, people will always assume that it is free?
yeah, I bought a copy of Agenda sometime around 1988 for something like $400.
I remember liking it a lot but realizing that the possibilities it offered were far beyond the ability of most mortals to master.
Remember, PIMs were a hot market prior to Windows 3.0 - but most products were never ported to Windows because there wasn't enough revenue being made. This was because users would buy the hype, buy the product but weren't committed enough to get past the learning curve and dedicate time to maintaining data in it. So, most PIMs just became shelf-ware. And it wasn't because they weren't powerful or didn't have good interfaces - they did (as Grandview and Agenda easily demonstrate).
It's just that most people aren't that sophisticated when it comes to how they think of information (whether personal or not). Most people are barely at the list-stage of information mastery - giving them hierarchies with outlining functionality or anything beyond that like in Agenda completely overwhelms them.
So, I'm hoping that Chandler's a success and delivers something really cool. But then again, I've been hearing about it for over a year - it's time to stop talking and start delivering. And if they deliver, I'll happily be one of those 5% in the market that'll use it.
> It's easy. You write a DAL which abstracts away any sql you need to write. You then create a code generator, which not only creates a DAL class for each table, but generates the procs automatically. It works quite nicely for me.
Can you describe that in more detail?
What I think you're saying is that you generate a set of generic select, insert, update, delete procedures for each table based upon metadata in the database catalog.
If this is the case:
- how do you handle reporting queries?
- how do you handle query tuning around performance and concurrency?
- how do you handle joins?
- how do you handle updates? set all columns?
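To make sure we're talking about the same thing, here's a minimal sketch of what I'm picturing. It's in python with sqlite standing in for the real database catalog; the `customer` table and the `generate_crud` name are made up for illustration, and your generator may well work quite differently:

```python
import sqlite3

def generate_crud(conn, table):
    """Build generic CRUD statements for one table from catalog metadata."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    names = [c[1] for c in cols]
    keys = [c[1] for c in cols if c[5]]          # pk > 0 means part of the primary key
    non_keys = [n for n in names if n not in keys]

    col_list = ", ".join(names)
    placeholders = ", ".join("?" for _ in names)
    set_list = ", ".join(f"{n} = ?" for n in non_keys)
    where = " AND ".join(f"{k} = ?" for k in keys)

    return {
        "select": f"SELECT {col_list} FROM {table} WHERE {where}",
        "insert": f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})",
        "update": f"UPDATE {table} SET {set_list} WHERE {where}",
        "delete": f"DELETE FROM {table} WHERE {where}",
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
crud = generate_crud(conn, "customer")
for op, sql in crud.items():
    print(op, "->", sql)
```

Note that the generated update sets every non-key column, and nothing here knows about joins or reporting queries - which is exactly what my questions are getting at.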
> I find it saddening that you've not had a good experience with Notes. For me, it's one of the most fascinating, capable and resilient pieces of software ever written. In the geek community, I'm in the minority. In business, that's far more debatable - with 120 million licences sold, people must be seeing value in Notes somewhere. Sadly, not here on Slashdot. :-(
Yeah, it's a drag when you're really good with some product or technology, can really make it sing, know where the weak areas are and how to work around them - and then see that others look down their noses at it.
It's kind of how I feel when I talk to developers that dislike relational databases and want to use some kind of database encapsulation instead. They explain how bad databases are, and how much better their code is. I explain how they'll eventually want multiple products to connect to the database - and SQL is the common language. How they'll eventually need to create reports, dashboards, portals - all using SQL. Possibly migrate some of their data to a warehouse for more powerful analysis - again, using SQL. And while SQL isn't the best language out there, it is very powerful and very common.
Good luck with notes though. Maybe something will happen and it'll get a great boost.
> I'd love to see the results for a vehicle that was less overspecialized though.
sure, but if you want to compare them to the jeep on the highway, it's only fair to have them also compete offroad. :-)
I've got an olds aurora with a v8 (a very aerodynamic sedan), and have been doing quite a lot of the same analysis as this guy, though less formally. I can get 30 mpg if I'm very gentle on acceleration, keep it under 75 mph, etc. Drafting can probably get it to 35 mpg with two car lengths between, but the guys that drive the semis don't like that much. I think you might be able to get this car to 40 mpg through modifications, running on flat terrain at sea level and keeping the speed to 60-65 mph. Maybe. I'd be surprised if you could get an accord to 80 mpg with a similar approach - maybe 50-55 I'd think.
Also noticed as the author mentioned that cruise control doesn't save gas on large hills. It's way better to pull off on the acceleration until you're just doing 35 mph on the way up, and give gas until you're doing 120+ mph on the way down.
But back to my original point, this car can't tow more than 3,000 pounds (no horse trailers), has miserable clearance, and can't pull a stump worth a damn.
> You don't need a tank like a Landcruiser or Jeep.
A jeep wrangler isn't a tank - it's a very small SUV - only has two full-sized seats with two small seats in the back that double as the storage area. I suspect that it weighs less than a Subaru Forester or Suzuki Grand Vitara, but I'm not sure about that. But they do only get around 25 mpg.
The jeep cherokee is the bigger version: this supports two people up-front, another 2-3 in the back seats, plus storage behind this. It's the medium-sized SUV (and the one that in my opinion everyone started to copy in the late-80s).
It's still far lighter in weight than a landcruiser, or a full-sized truck like a ford f150, or a suburban-like thing. These are the ones getting 10 mpg - and often are poor performers offroad anyway (where their excessive width & length prevents them from being used on some trails, etc).
Personally, I've done quite a lot of offroading in an old International Harvester Scout (old tech from the 60s & 70s). See these at www.binderbulletin.org. They're very heavy and incredibly durable, but get 16 mpg on the highway best case (unless you've got one of the 70s diesels).
I use mine as a daily-driver, but at this point that totals less than 5,000 miles a year. It is also used for offroading, but I found that getting out via even a very capable vehicle only gets you so far. Then you really want to hike, take a mountain bike, dirt bike, or quad-runner. The truck is still useful - and will take you and your family & gear far beyond where a Grand Vitara will go. But at some point it is just much more pleasant to hop out and get on the bike.
Sorry for the late response, just drove 2000 miles with the family and noticed this response...
> OK, hotshot. What makes you think that Notes applications are nightmares in terms of maintenance?
Direct experience with in-house notes support teams struggling to support a small handful of applications on notes.
> Maintenance? Easy, because you have no DB schema to care about. Changes are much easier for the developer to handle, and don't require hours of extensive database maintenance - they're pretty much just a form change and perhaps a cleanup agent to remove any "retired" fields. Not only do I not see a maintenance nightmare, but I actually see a clear advantage.
That's an advantage in some terms of maintenance - everything is dynamic. But it's also a huge disadvantage to data quality: the ability to dynamically change a schema also means that you generally lose the ability to get a consistent picture from the data across time. Some data has attribute x, some has y, some has z. It's much more useful (though more time-consuming) to keep everything consistent.
> And quantify your concerns on scalability, please.
Applications all over the world are growing in terms of data - our notes apps with a few gbytes of data were struggling to stay online. The kind of applications that php + db2/oracle could handle easily were killing notes.
> Data quality? In what sense?
In the sense that relational databases support declarative data quality enforcement - ensuring that the data is consistent across the database is generally very simple. Ensuring that any entered 'customer-id' actually exists is trivial in a relational database. Ensuring that the only disposition_codes allowed are 'prod','test','dev','trans' is trivial in a relational database. It wasn't trivial enough in notes - and so of course, didn't get done. The resulting data was a horror.
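To show how little it takes, here's a rough sketch of those two rules as declarative constraints. It uses python's bundled sqlite so it can run anywhere; the `deployment` table and its columns are invented for the example, and the same foreign key and check constraint syntax is standard SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite only enforces foreign keys when asked
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE deployment (
        deployment_id    INTEGER PRIMARY KEY,
        customer_id      INTEGER NOT NULL REFERENCES customer (customer_id),
        disposition_code TEXT NOT NULL
            CHECK (disposition_code IN ('prod', 'test', 'dev', 'trans'))
    )
""")
conn.execute("INSERT INTO customer VALUES (1)")
conn.execute("INSERT INTO deployment VALUES (1, 1, 'prod')")  # valid row

def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

bad_customer = rejected("INSERT INTO deployment VALUES (2, 99, 'test')")   # no customer 99
bad_code = rejected("INSERT INTO deployment VALUES (3, 1, 'staging')")     # not an allowed code
row_count = conn.execute("SELECT COUNT(*) FROM deployment").fetchone()[0]
print(bad_customer, bad_code, row_count)
```

Two lines of DDL, and no application (or replicating laptop) can put bad rows in, whatever front-end it uses.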
Then again, there were the times that users replicated old data to the central servers. Sometimes caused by old users replicating up, another time by an experienced admin trying to do a restore. Ick.
On one application we had to retype 100% of the data by hand to clean it up. Note that this was after we had implemented an LEI bridge to automate the export. We just had to give up entirely on it.
> Data quality? You've lost me. You're not one of those weird people who thinks all data should be relational, are you? I've never understood that. Some data and working processes lend themselves well to relational schemas. But most just don't. It's a restricting, cumbersome, maintenance intensive abstraction which is often unnecessary and just used out of habit.
No, most data fits best into a network model. But, unfortunately there are no great network modeling databases out there. Of the options we have we can immediately toss out hierarchical databases (and xml data storage) as a rehashing of previously discarded technology. The OODBMS has never been able to scale to handle simple scans - nor able to handle networks gracefully. The relational model scales well and also supports them adequately.
As far as non-network models - what has the scalability of a relational database? What products have survived as long? I've heard countless developers insist on something like java container-managed persistence in order to avoid lock-in with relational technology. Guess what? Ten years later relational databases will still be around, but two years later container-managed persistence was discredited and that company wanted to move from java anyway.
Relational technology isn't perfect, but it's unfortunately better than most other options today.
> Microsoft tried for years to get a relational database backend to the way we store data - it was called WinFS, and failed despite their massive resources.
So? they've also failed to create a secure os, does that mean it is impossible?
> It allows you to create workflow apps which are truly quite impressive.
and nightmares in terms of maintenance, scalability and data quality.
Honestly, every one of these things I run into is a catastrophe. I'm sure that they were better than the manual processes that they usually replace, but I wish that they could have been implemented in php & postgresql/db2/oracle/whatever.
ah, and did I mention usability? Notes has its own usability patterns - which are different from everything else. The client has millions of configuration parameters - that are distributed in an arbitrary fashion across dozens of overlapping menus.
Teamrooms? ick, we've been moving that stuff to wikis for years. Yep, even the documents - go into our wiki as attachments, and yes we can lock down the security.
It's too bad though - if the right people (just a few with a vision and real experience), the right processes (probably 2% of what they're actually buried under), and the right budget had all intersected about 5 years ago this could be a good product today. But now it's just a nightmare.
And sure, running on linux is good. But accessing my notes from Thunderbird would be *far* better.
> Woah, woah, woah! Any shop using an ad-hoc collection of Access DBs and Excel spreadsheets is probably a small business that can't afford Oracle.
Not necessarily - since oracle for a small database (4 gbytes of data I think) is free now anyway. But *oracle* doesn't matter - use of any database, even mysql, would be a drastic improvement.
What's probably more important is:
1. there's no network for a centralized solution, they use client software instead
2. there may be no funding to do this right
3. management may be of the type that doesn't like to tackle big improvements that it doesn't understand well
Ok, so lots of unknowns. But here's a potential approach:
1. A centralized solution using a single database is the ideal approach. But perhaps the network connectivity simply cannot be overcome. Or at least not immediately - so first implement a small database on each laptop. This means something really tiny like MySQL. Perfectly fine to start with, and compatible with everything else - so you could convert to whatever later on once the network issue is resolved.
2. You are probably stuck with the excel & access - since it sounds like they are the output of required applications. Fine, then you just need a way to import that data into MySQL. Some databases (like db2) have built-in import tools for excel - so you might get lucky. Otherwise, I'd shop around for the simplest utility to help with the task. I'd avoid anything that's too much of a distraction here - .net, etc. Keep it extremely simple.
3. I'd make the import/export process as simple as possible. Ideally a big green icon they punch.
4. You could use a light-weight http server along with php for the reporting. Again, very simple to implement.
Once the above is working fine on the laptops, then if the network problems can be overcome it wouldn't be too difficult to centralize everything. The same web reports that ran on the laptops can run on a server, along with the same database schema as well. Could theoretically even be mysql if the amount of writes is small enough. Uploading the files, or transferring data from the local copy of mysql would be the only new development required.
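For a sense of scale, the import step can be about this small. This is only a sketch under some assumptions: the spreadsheet has been saved as CSV with a header row, python's bundled sqlite stands in for the local MySQL, and the `visits` table and its columns are invented sample data:

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from one of the spreadsheets
SAMPLE_EXPORT = """site,visit_date,amount
north,2006-01-03,120.50
south,2006-01-04,80.00
"""

def import_csv(conn, table, csv_file):
    """Load a CSV file with a header row into a table, creating it if needed."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    rows = [row for row in reader if row]   # skip blank lines
    conn.executemany(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = import_csv(conn, "visits", io.StringIO(SAMPLE_EXPORT))
total = conn.execute("SELECT SUM(CAST(amount AS REAL)) FROM visits").fetchone()[0]
print(loaded, total)
```

Hang that off the big green icon and the users never have to see it. A real version would want declared column types and some validation, but the shape stays the same when you later point it at a central server.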
> Unfortunately many parts of DB2 are buggy, slow and bloated.
Not the core database. Like anything else, the fringe functionality that fewer people use or that is newer tends to have more problems. Stuff like table inheritance, xml, replication, cube views, client tools, etc. But the engine works very well. That ultimately is what I focus 99% of my time on.
And back to how it was clunky a few years ago: anyone on the 7.* versions really needs to upgrade. v8 is a very good product, and now v9 is going to be out in a couple of months.
Note also that in my experience most problems in which people "had to switch to oracle" started out because they had an oracle staff that did everything the "oracle way" - using oracle-style partitioning, etc. Then discovered that db2 didn't work as well as oracle in doing things the oracle way. Then they switched. However, it was a foregone conclusion that they would switch. With v9 coming out and oracle-style partitioning built-in perhaps those that insist on developing for oracle will find it much easier to get their code to work with db2.
>>Fortunately for me, most of my massive databases get php front-ends these days. And hopefully RoR soon.
> Then they probably aren't the kind of substantial applications being discussed. Large high-transaction-rate systems tend to use considerable amounts of in-memory storage and avoid falling through to the database where possible, as the database can be the slowest part of the system. The 'let the database store everything' approach of PHP and RoR doesn't scale to the highest levels of performance.
Keep in mind that "substantial applications" take a variety of forms. Some have high transaction rates with relatively simple queries and many concurrent users. Perhaps J2EE is well-suited here, though I suspect that the complexity it incurs isn't worth the benefit, and there are a variety of ways to cache database data, including within the database of course. And of course, most people don't need "the highest theoretical levels of performance" - they need reasonable performance.
Other substantial applications have low to no transaction rates with extremely complex queries and fewer users. In this latter situation j2ee is no better suited than RoR, actually worse in a variety of ways such as complexity & cost. Enterprise reporting and information dashboards & consoles are perfect examples in which applications may have just a few hundred simultaneous users running thousands of extremely complex queries.
Back to RoR & DB2 - i'm currently working on a multi-terabyte DB2 database application that supports hundreds of large customers running extremely complex queries around the clock. It worked fine in php, there's no reason to believe that it won't work even better in RoR. Saying that this isn't an enterprise app, isn't a substantial app, or doesn't apply to anyone else is highly misleading.
> I imagine that all three people who use DB2 are quite pleased with this development :) What kind of market share does it have?
Depends on how you count it. I think by revenue IBM databases account for about 1/3 of the market. This primarily consists of db2 but also includes numbers from informix and ims.
> Seriously, I haven't seen any person or development shop in my area using DB2. I've never heard of it being used at all.
It's used most heavily on the mainframes, but also works very well on windows & unix/linux. After having spent decades developing databases using db2, oracle, informix, sybase, sql server, postgresql, mysql, access, clipper, ims-db, vsam, etc - I've grown to like db2 quite a bit (and Informix the most). DB2 is far faster than mysql or postgresql, and about 1/2 the cost of oracle.
The primary reason it doesn't have a larger marketshare is that IBM isn't very good at marketing, and until about four years ago its unix/windows version was a little clunky.
DB2 works fine for small projects where it is very cost effective, but you typically see it the most on very large projects. It especially shines when your data volumes keep growing - then it gives a ton of different scalability options - all the way up to very robust beowulf-like clustering capabilities in which you can spread your database across hundreds of separate servers. For large projects like this its only real competition is Informix or Teradata.
> There is good reason why you may not want it to be. The Java/J2EE/Websphere approach often uses clustering and caching to give high performance and scalability. You would not want to let a small RoR application (or any other type of application) loose on such a system.
It depends:
- I'd bet that 9 out of 10 websphere/weblogic implementations don't use clustering
- many massive databases have relatively modest websites, ie the heavy-lifting is all backend not presentation
- many massive databases have small related "helper" applications that also need presentation layers
- even clustering & caching applications should be able to handle changes to reference data through simple cache-refresh methods.
Fortunately for me, most of my massive databases get php front-ends these days. And hopefully RoR soon.
> I work in banking and my experience has seen DB2 used to support very heavy applications (e.g. internet banking with 1+ million customers). Is rails being used in enterprise for heavy web apps (not my field)?
Massive databases often support multiple presentation layers - ecommerce, internal administration, partner access, internal reporting, etc, etc. Even if you're doing one in Java/websphere there's no reason another couldn't be in RoR.
> Please bear in mind that I do NOT speak for MySQL AB in this or any other matter - this is just my take on the policy, and I could be dead wrong about the rationale behind it.
which is exactly what's wrong with mysql licensing: it's deliberately vague. And who exactly wants to consult with their lawyer before using a database product?
and of course, just like AT&T just changed their privacy policy to no one's surprise, who's going to be surprised when mysql gets sufficient market share to tighten up their licensing?
> Is there any reason you're not also partitioning the PostgreSQL database, other than to make it look bad in the fictional benchmark? Maybe you can do more advanced partitioning with DB2 or Oracle - don't know, haven't used 'em - but PostgreSQL is certainly capable of the trivial example you mentioned.
Nah, I usually don't consider approaches using inheritance or union alls, except in desperate conditions. It theoretically works, but in my experience is much more work to implement, can be a problem to alter within a transaction, often isn't 100% compatible with other sql operations (load?), doesn't support etc.
From the doc you provided:
"As we can see, a complex partitioning scheme could require a substantial amount of DDL. "
another comment ends with "So performance can be drastically worse if you use partitions."
So, yeah - this is like partitioning-lite. Enough to work in some narrowly-scoped situations, but not so well that you want to make it a part of your general solutions. You certainly wouldn't want to create 365 daily partitions this way: that would be 365 or 366 tables within a single ddl script (ugh). Plus, it still doesn't address parallelism - so your 4 or 8 way smp is stuck scanning the data with a single cpu.
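To put a number on the "ugh": here's a quick sketch that generates the child-table DDL for a year of daily partitions in the inheritance style. The `events` parent table and `event_date` column are hypothetical, and this only covers the child tables - you'd still need something (rules or triggers) to route inserts to the right child:

```python
from datetime import date, timedelta

def daily_partition_ddl(parent, year):
    """Yield one CREATE TABLE per day of the year, inheritance-style."""
    day = date(year, 1, 1)
    while day.year == year:
        next_day = day + timedelta(days=1)
        yield (
            f"CREATE TABLE {parent}_{day:%Y_%m_%d} (\n"
            f"    CHECK (event_date >= DATE '{day}' AND event_date < DATE '{next_day}')\n"
            f") INHERITS ({parent});"
        )
        day = next_day

statements = list(daily_partition_ddl("events", 2006))
leap_statements = list(daily_partition_ddl("events", 2008))
print(len(statements), len(leap_statements))
print(statements[0])
```

Generating it isn't hard, but you then own 365 or 366 tables, their indexes, and the script that keeps adding next year's.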
Postgresql is a very cool database, but it still has a way to go with this kind of functionality. It's moving quickly, so it'll be cool to see a strong solution here within a couple of years. But until it has something bullet-proof I'm in no hurry to bet my career on it.
> The big three database manufacturers all charge pretty much the same for the same feature set. Oracle costs the same as sql server and db/2 within a percent or two.
Not in my experience: db2 is often 50% the cost of Oracle, especially since partitioning is an extra $10k/cpu for oracle and a completely usable form of partitioning is included within the base db2 product.
Right now I've got a multi-terabyte data warehouse running on a db2 license that costs $1500/cpu. If I wanted it directly accessible on the internet then it would run $7500/cpu.
> Personally I don't see how anybody can charge for databases these days except to the largest organizations. The killer feature seems to be real and reliable replication and clustering.
No, those are just the features that the open source community seems to want to target. But why? They're both typically used for failover, and the commercial products have far better failover solutions (and ones that actually work across geographically separated data centers).
Bi-directional replication is at best a pain in the butt and when used to actually consolidate data its lack of transformation abilities stinks.
Clustering can deliver either availability or speed. The oracle solution is geared towards availability, the informix/db2 solution towards speed (it's like a beowulf architecture, but been around for ten years). The former is ok, but again doesn't work across data centers, the latter is ok - and ideal when you've got 20 TB, but otherwise overkill.
What about partitioning & parallelism? Why use a product like db2 or oracle? How about because they can save you huge dollars on hardware? Take this example: you've got a million rows of data a day for 365 days on a 4-way SMP with 8 gbytes of memory and four disk arrays. Users run a wide variety of queries for reporting & analysis. Assume that your hardware cost $88,000 (list) for high-end models of this type.
If you're using postgresql or mysql many of those queries are going to result in tablescans - in which the database has to read every single one of those 365 million rows. This is because btree indexes don't work if you're selecting more than 1-3% of the data. So, want to see all data for previous month for monthly reports? Fine, but you'll have to wait four minutes to scan all data.
On the other hand, if you're using db2 for example you'll probably want to partition (with MDC - available even in free product) on day. When you query on a month of data DB2 will scan just that one month - 1/12 of the table. Then when it does that it'll run the query in parallel - giving you 4x the performance of the single-threaded queries from mysql & postgresql. Then you've also got fine-grained memory tuning, a wide variety of optimizations and a fast optimizer - capable of handling complex queries. Ignoring the performance benefits of the latter features (only because they're difficult to quantify) and just using the first two - you're going to get 48x faster performance (12 x 4) from db2 than mysql or postgresql. That query that took four minutes on mysql? It'll run in about five seconds on db2.
How much would you spend on hardware to try to get that mysql query to run in five seconds? Far more than the cost of oracle or db2!
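For anyone who wants to check the arithmetic, here it is spelled out. It takes the four-minute full scan from above as the baseline and assumes the best case on both counts: the query touches exactly one month, and the scan scales perfectly across all four cpus:

```python
FULL_SCAN_SECONDS = 4 * 60   # full-table scan time quoted above
MONTHS_IN_TABLE = 12         # partition elimination leaves 1/12 of the data
PARALLEL_WORKERS = 4         # 4-way SMP, assuming perfect scaling

speedup = MONTHS_IN_TABLE * PARALLEL_WORKERS
partitioned_seconds = FULL_SCAN_SECONDS / speedup
print(speedup, partitioned_seconds)
```

Real queries won't hit the ideal on either factor, but even half of it is a big gap to close with hardware.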
> Windows is like a house of cards made from a million decks, so many co-dependencies. It's why Vista has taken so long and will continue to cause problems.
Yeah, it would be interesting to see how many lines of code, classes, function points, whatever have gone into each release - along with the number of years, programmers and dollars it took to get there.
Windows XP took quite a long time and really didn't offer much. Vista is taking what? twice as long? and is again offering very little. The interesting question is what do the economies of the next three versions of windows look like based on this trend?
Assume that it takes:
- 4 hours to write a given program in python, 32 hours to write same program in C++
- 10 seconds to run the python program, but just 2 seconds to run the faster C++ program
- the program is run 20 times a day
- assume the developer time costs as much as the time of the person that runs it
Ok, so it'll take 630 days of running this program for the faster C++ program to make up for the extra time to develop it. So, if you can wait two years for a payback then C++ is the way to go, otherwise code it in python.
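And since the topic is python, here's the back-of-the-envelope calculation as a few lines of it, using exactly the assumptions listed above (the variable names are mine):

```python
PYTHON_DEV_HOURS = 4
CPP_DEV_HOURS = 32
PYTHON_RUN_SECONDS = 10
CPP_RUN_SECONDS = 2
RUNS_PER_DAY = 20

# Extra development time for C++, in seconds: 28 hours
extra_dev_seconds = (CPP_DEV_HOURS - PYTHON_DEV_HOURS) * 3600
# Run time saved per day by the faster program: 8 seconds x 20 runs
saved_seconds_per_day = (PYTHON_RUN_SECONDS - CPP_RUN_SECONDS) * RUNS_PER_DAY
break_even_days = extra_dev_seconds / saved_seconds_per_day
print(break_even_days)
```

Change the run count or the run times and the answer swings wildly, which is the real point: it depends on how often the thing actually runs.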
There that was easy. Ok, any other simple problems out there? Which editor you should use? What's just the right amount of comments per program? Which is better - cvs or subversion?
> You just hit queries per hour, what our software hits in 12 seconds. We average 5,000 queries a second, for a large EMR (Electronic Medical Record) application, with approximately 4,000 users. Our software is web based, written in C, using MySQL as the backend. We're currently sitting at 16T of storage as well, with a growth of 2G per day of new data.
Not to take anything away from your achievement, but we're talking apples & oranges here: the queries I'm talking about are reporting queries - not simple operational/transactional ones. They aren't just looking up a status on an item. They're each often scanning tens of thousands of rows of data, sometimes millions, joining to a separate temp table - itself created as a cartesian product from multiple dimension tables, then grouping the final result. These are queries that require parallelism and partitioning to achieve fast results. Completely doable in db2, oracle, informix and to a much lesser extent sql server. Not in mysql or postgresql yet though since these products lack parallelism and partitioning.
Additionally, we can manage the entire query load from a single four-way power5 aix box running a low-end version of db2. Not only can mysql not match that reporting performance (it's about 1/40th the speed of db2 for reporting), but I don't think that it would meet your figures above either. How many servers do you have supporting your mysql database?
> PHP has its places - extremely large, mission critical applications are not one of them.
Odd that you said that given that you're using mysql. ;-)
However, I think your comment is misplaced: while this system does a ton of heavy moving, the front-end isn't one of the performance-challenged parts of the system: the extraction of data from other systems (python), the transformation of data (python), and the processing of queries (db2) are the performance-oriented parts. The display of data (php) handles just a fraction of the workload of the other pieces. So, in this case - php works ok in an extremely large, mission critical application. Its biggest challenge in our context is manageability rather than performance. Would I recommend php for your application? Perhaps not - but that doesn't mean it doesn't work fine for ours.
Likewise, perhaps mysql can work well within a mission-critical application. You tend to have to buy a lot more hardware, and spend a lot more time on testing. And porting to other databases can be tough. But you can still do it. And maybe sometimes it makes sense. I'd be hesitant to say "mysql cannot do enterprise computing". It's just not true.
> If you don't mind me asking, what python ETL software are you using? Is this something available commercially, through open source, or something of your own design.
We wrote it ourselves. It wasn't difficult to write, but we focused on compliance with some patterns that made it both easy to develop and easy to manage. For example, one pattern is that all processes are completely autonomous, typically run out of cron every minute. If there is no data to process they die quietly, if they are already running they die quietly, if their schedule is suppressed they die quietly. All they produce is a file used by a downstream process. But they don't run that process - that process like the above one, is also constantly checking for a file.
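Roughly, each process follows a skeleton like the one below. This is a simplified sketch written for this post rather than our actual code: the directory layout, the `suppress` flag file, the lock file and the `run_once` name are all stand-ins.

```python
import os
import tempfile
from pathlib import Path

def run_once(workdir, process_file):
    """One cron invocation. Returns a short status string instead of raising."""
    workdir = Path(workdir)
    inbox, outbox = workdir / "inbox", workdir / "outbox"
    lock, suppress = workdir / "running.lock", workdir / "suppress"
    inbox.mkdir(exist_ok=True)
    outbox.mkdir(exist_ok=True)

    if suppress.exists():                 # schedule suppressed: die quietly
        return "suppressed"
    pending = sorted(inbox.glob("*.dat"))
    if not pending:                       # no data: die quietly
        return "no data"
    try:
        # O_EXCL makes creation atomic, so only one invocation holds the lock
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:               # already running: die quietly
        return "already running"
    try:
        os.close(fd)
        for src in pending:
            tmp = outbox / (src.name + ".tmp")
            tmp.write_text(process_file(src.read_text()))
            tmp.rename(outbox / src.name)  # publish only complete files downstream
            src.unlink()
        return f"processed {len(pending)}"
    finally:
        lock.unlink()

# Demo in a scratch directory, with str.upper standing in for the real transformation
demo = Path(tempfile.mkdtemp())
first = run_once(demo, str.upper)                    # nothing in the inbox yet
(demo / "inbox" / "a.dat").write_text("hello")
second = run_once(demo, str.upper)
print(first, second)
```

The downstream process is the same skeleton pointed at this one's outbox. One thing a real version has to deal with that this sketch doesn't is a stale lock left behind by a process that was killed hard.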
So, we didn't spend a fortune on etl software, have something that can scale very cheaply, and is very simple to build and run. It would have been great to use a product that already provided these benefits, but we couldn't find one. Oh yeah, one other benefit: until we got into production (and earned credibility) we had almost no budget for software, training, etc.
> I have been looking for software to replace the ease of SQL Server DTS for quite a while and using DB2 (if necessary) would not be a problem.
yeah, DTS is terrible. I replaced one large etl environment in DTS with cygwin and python. It immediately reduced our labor costs. There are some open source etl projects, but I can't recommend any. That's too bad, since it would be great to have an alternative to Data Junction, Informatica, Ab Initio, etc.
>> Scans of a hundred million rows at a time aren't uncommon (though seldom happen more than a few dozen times a day).
> Yes they are. Go read what you wrote.
I'm intimately familiar with this application. When I say that only a few dozen queries a day are scanning a hundred million rows, that is the case.
>> This app is completely written in korn shell, python, php and sql (db2).
> One guess where 99% of the cycles are in that (and 90% of the dollars).
No need to guess, I'll provide the info.
First off, cycles? really? what a simplistic way to think about such a system - you normally also look at io and network performance and memory. Especially io performance when you're slinging this much data around. And - python and c have extremely similar io performance.
And cost? You're implying that 90% of the cost is spent on hardware, apparently because python is wasting cpu cycles. This is not the case: etl servers running python consume less than 10% of the total hardware cost. And hardware is less than 10% of the total project cost. So:
- php & python have had a negligible impact on hardware cost which is 10% of project cost
- php & python have had a hugely beneficial impact on labor cost which is 90% of project cost
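Multiply those shares out and the picture is even starker. The two 10% figures are the upper bounds given above; everything else follows from them:

```python
HARDWARE_SHARE_OF_PROJECT = 0.10   # upper bound from above
ETL_SHARE_OF_HARDWARE = 0.10       # upper bound from above
LABOR_SHARE_OF_PROJECT = 0.90

# Most the python ETL servers can possibly cost, as a share of the whole project
etl_share_of_project = HARDWARE_SHARE_OF_PROJECT * ETL_SHARE_OF_HARDWARE
# Labor saving that would pay for ALL of the ETL hardware
offsetting_labor_saving = etl_share_of_project / LABOR_SHARE_OF_PROJECT
print(f"{etl_share_of_project:.1%} {offsetting_labor_saving:.1%}")
```

So even if native code needed no etl hardware at all, it could save at most about 1% of the project - and a labor saving of just over 1% already cancels that out.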
I'm not going to say that python, php and ruby will achieve the same benefits on all projects. But on this "hard" project those technologies far outperformed what native code would have achieved.
Re: Have you tried coding anything hard? (on "The End of Native Code?", Score: 2, Informative)
> How much app. processing goes on in the script languages? How much hardware do you dedicate to those vs. db2?
Well, db2 is obviously managing quite a lot of it. Certainly all of the queries, but also the very fast loads. DB2 is running on four-way Power4 & Power5 hardware with 4-12 disk arrays per server with 64-bit architecture and typically 8 gbytes of memory. It's running extremely fast.
By the time the data hits PHP it is typically just small result sets - that is, a scan of a few million rows will typically generate just a hundred rows or less that will go into a chart, graph or table. The PHP component is just running on older intel four-way SMPs. For a while much of it was statically generating all possible query pages - which meant a vast amount of processing, which worked fine - while we had slower databases.
But *100%* of the data pours through python. Every single row has to be reformatted, has to be validated, has to have multiple fields replaced with identifiers to other tables (which require lookups in arrays). All of this happens in python. The python processes can hit 500 million events a day comfortably on older 1.4 ghz intel four-way smps with 4 gbytes of memory and a single slow disk array. They can hit about 3 billion events a day on fast 2.8+ ghz intel four-ways with multiple disk arrays.
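To make that per-row work concrete, here is a minimal sketch of the reformat / validate / replace-with-identifier step. It is not the actual ETL code: the pipe-delimited layout, the field names, the two lookup tables and the function name `transform_row` are all hypothetical, chosen only to illustrate the shape of the work.

```python
# Hypothetical lookup tables: natural key -> identifier in another table.
PROTOCOL_IDS = {"tcp": 1, "udp": 2, "icmp": 3}
SEVERITY_IDS = {"low": 10, "medium": 20, "high": 30}


def transform_row(raw_line):
    """Reformat and validate one delimited event; return a tuple or None."""
    # Reformat: split the (assumed) pipe-delimited record and trim whitespace.
    fields = [f.strip() for f in raw_line.rstrip("\n").split("|")]
    if len(fields) != 4:
        return None                      # wrong shape: reject the row
    timestamp, protocol, severity, count = fields
    # Validate: the count must be a non-negative integer.
    if not count.isdigit():
        return None
    # Replace text codes with identifiers via in-memory lookups.
    protocol_id = PROTOCOL_IDS.get(protocol.lower())
    severity_id = SEVERITY_IDS.get(severity.lower())
    if protocol_id is None or severity_id is None:
        return None                      # unknown code: nothing to map it to
    return (timestamp, protocol_id, severity_id, int(count))
```

The lookups are plain dictionary hits held in memory, which is why this kind of step tends to be bound by io rather than by the interpreter.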
> Is this externally visible (like search_used_books.amazon.com...)?
Yes, but only if you're one of hundreds of our customers.
> And of course to be an argumentative /.er, "Which interpreted language is db2 written in?" :)
you've got me there - we do run several pieces of software written in native code: aix, redhat linux, apache, db2, etc. And they do quite a lot of work - I'm not taking anything from them or suggesting that they be ported to python. However, our python ETL server is still churning out a vast amount of data every hour of the day.
Re:Have you tried coding anything hard? on The End of Native Code? · Score: 5, Informative
> When your web-based-datastore gets 50,000 inserts per second, hovers between 15 and 20 billion rows and endures a sustained query rate
> of 43,000 queries per hour, tell me which part of it you want coded in PHP.
hmm, the warehouse I work on has multiple databases with billions of rows in them, can hit insert rates of 100,000 rows a second, can experience 60,000 queries/hour - many of which are trending data over 13 months, has hundreds of users. Many of these users are allowed to directly hit some of the databases with whatever query tool they want. Scans of a hundred million rows at a time aren't uncommon (though seldom happen more than a few dozen times a day).
This app is completely written in korn shell, python, php and sql (db2). Looks like Ruby is also coming into the picture now, will probably supplant much of the php in order to improve manageability.
Oh yeah, and the frequency of releases is quick and its defect rate is low. And we're planning to begin adding over 400 million events a day soon. I've done similar projects in C and java. Never anywhere near as successfully as in python and php.
We might consider rewriting a few select python classes in c. Maybe, if we port the ETL over to the Power5 architecture, where psyco doesn't run. Otherwise, it's cheaper to just buy more hardware at this point - since each ETL server can handle about 3 billion rows of data/day with our python programs.
It's important that in a book entitled "The Art of SQL" they followed the organizational structure of "The Art of War".
Well, it really isn't at all important, nor should it be a surprise.
And The Art of War faddish? The book is over 2500 years old, influenced Emperor Napoleon, General Patton, BH Liddell Hart (who in turn influenced the creation of the WWII German military strategies), General MacArthur, etc, and has sold well to non-military types for at least 20 years. I think the world could use a few more books that survive the test of time 1/10th as well.
> Its easy. You write a DAL which abstracts away any sql you need to write. You then create a code generator, which not
> only creates a DAL class for each table, but generates the procs automatically. It works quite nicely for me.
Can you describe that in more detail?
What I think you're saying is that you generate a set of generic select, insert, update, delete procedures for each table based upon metadata in the database catalog.
If this is the case:
- how do you handle reporting queries?
- how do you handle query tuning around performance and concurrency?
- how do you handle joins?
- how do you handle updates? set all columns?
Thanks
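For anyone following along, this is roughly the kind of generator I understand the parent to be describing - a minimal sketch, not the parent's actual tool. It uses Python's sqlite3 module and SQLite's `PRAGMA table_info` as a stand-in for a real database catalog; the `customer` table and the function name `generate_crud` are made up for illustration.

```python
import sqlite3


def generate_crud(conn, table):
    """Build generic CRUD statements for `table` from catalog metadata."""
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    cols = [row[1] for row in info]
    keys = [row[1] for row in info if row[5] > 0]
    non_keys = [c for c in cols if c not in keys]

    col_list = ", ".join(cols)
    placeholders = ", ".join("?" for _ in cols)
    assignments = ", ".join(f"{c} = ?" for c in non_keys)
    where = " AND ".join(f"{k} = ?" for k in keys)

    return {
        "select": f"SELECT {col_list} FROM {table} WHERE {where}",
        "insert": f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})",
        # Sets every non-key column, whether or not it actually changed.
        "update": f"UPDATE {table} SET {assignments} WHERE {where}",
        "delete": f"DELETE FROM {table} WHERE {where}",
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
crud = generate_crud(conn, "customer")
```

Note that everything generated this way is single-table and keyed on the primary key, which is exactly why the questions about joins, reporting queries and set-all-columns updates come up.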
> I find it saddening that you've not had a good experience with Notes. For me, it's one of the most fascinating, capable and resilient
> pieces of software ever written.
> In the geek community, I'm in the minority. In business, that's far more debatable - with 120 million licences sold, people must
> be seeing value in Notes somewhere. Sadly, not here on Slashdot. :-(
Yeah, it's a drag when you're really good with some product or technology, can really make it sing, know where the weak areas are and how to work around them - and then see that others look down their noses at it.
It's kind of how I feel when I talk to developers that dislike relational databases and want to use some kind of database encapsulation instead. They explain how bad databases are, and how much better their code is. I explain how they'll eventually want multiple products to connect to the database - and SQL is the common language. How they'll eventually need to create reports, dashboards, portals - all using SQL. Possibly migrate some of their data to a warehouse for more powerful analysis - again, using SQL. And while SQL isn't the best language out there, it is very powerful and very common.
Good luck with notes though. Maybe something will happen and it'll get a great boost.
ken
> I'd love to see the results for a vehicle that was less overspecialized though. :-)
sure, but if you want to compare them to the jeep on the highway, it's only fair to have them also compete offroad.
i've got an olds aurora with a v8 (a very aerodynamic sedan), and have been doing quite a lot of the same analysis as this guy, though less formally. I can get 30 mpg if i'm very gentle on acceleration, keep it under 75 mph, etc. Drafting can probably get it to 35 mpg with two car lengths between, but the guys that drive the semis don't like that much. I think you might be able to get this car to 40 mpg through modifications, running on flat terrain at sea level and keeping the speed to 60-65 mph. Maybe. I'd be surprised if you could get an accord to 80 mpg with a similar approach - maybe 50-55 I'd think.
Also noticed as the author mentioned that cruise control doesn't save gas on large hills. It's way better to pull off on the acceleration until you're just doing 35 mph on the way up, and give gas until you're doing 120+ mph on the way down.
But back to my original point, this car can't tow more than 3,000 pounds (no horse trailers), has miserable clearance, and can't pull a stump worth a damn.
> You don't need a tank like a Landcruiser or Jeep.
A jeep wrangler isn't a tank - it's a very small SUV - only has two full-sized seats with two small seats in the back that double as the storage area. I suspect that it weighs less than a Subaru Forester or Suzuki Grand Vitara, but I'm not sure about that. But they do only get around 25 mpg.
The jeep cherokee is the bigger version: this supports two people up-front, another 2-3 in the back seats, plus storage behind this. It's the medium-sized SUV (and the one that in my opinion everyone started to copy in the late-80s).
It's still far lighter in weight than a landcruiser, or a full-sized truck like a ford f150, or a suburban-like thing. These are the ones getting 10 mpg - and often are poor performers offroad anyway (where their excessive width & length prevents them from being used on some trails, etc).
Personally, I've done quite a lot of offroading in an old International Harvester Scout (old tech from the 60s & 70s). See these at www.binderbulletin.org. They're very heavy and incredibly durable, but get 16 mpg on the highway best case (unless you've got one of the 70s diesels).
I use mine as a daily-driver, but at this point that totals less than 5,000 miles a year. It is also used for offroading, but I found that getting out via even a very capable vehicle only gets you so far. Then you really want to hike, take a mountain bike, dirt bike, or quad-runner. The truck is still useful - and will take you and your family & gear far beyond where a Grand Vitara will go. But at some point it is just much more pleasant to hop out and get on the bike.
Sorry for the late response, just drove 2000 miles with the family and noticed this response...
> OK, hotshot. What makes you think that Notes applications are nightmares in terms of maintenance?
Direct experience with in-house notes support teams struggling to support a small handful of applications on notes.
> Maintenance? Easy, because you have no DB schema to care about. Changes are much easier for the developer to handle, and don't
> require hours of extensive database maintenance - they're pretty much just a form change and perhaps a cleanup agent to remove any "retired" fields.
> Not only do I not see a maintenance nightmare, but I actually see a clear advantage.
That's an advantage in some terms of maintenance - everything is dynamic. But it's also a huge disadvantage to data quality: the ability to dynamically change a schema also means that you generally lose the ability to get a consistent picture from the data across time. Some data has attribute x, some has y, some has z. It's much more useful (though more time-consuming) to keep everything consistent.
> And quantify your concerns on scalability, please.
Applications all over the world are growing in terms of data - our notes apps with a few gbytes of data were struggling to stay online. The kind of applications that php + db2/oracle could handle easily were killing notes.
> Data quality? In what sense?
In the sense that relational databases support declarative data quality enforcement - ensuring that the data is consistent across the database is generally very simple. Ensuring that any entered 'customer-id' actually exists is trivial in a relational database. Ensuring that the only disposition_codes allowed are 'prod', 'test', 'dev', 'trans' is trivial in a relational database. It wasn't trivial enough in notes - and so of course, didn't get done. The resulting data was a horror.
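To show how little it takes, here is a small sketch of both rules declared once in the schema. It uses SQLite through Python's sqlite3 module purely as a convenient stand-in for db2/oracle; the `deployment` table and the helper `try_insert` are hypothetical names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE deployment (
        customer_id      INTEGER NOT NULL REFERENCES customer (customer_id),
        disposition_code TEXT NOT NULL
            CHECK (disposition_code IN ('prod', 'test', 'dev', 'trans'))
    )
""")
conn.execute("INSERT INTO customer VALUES (1)")


def try_insert(customer_id, code):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO deployment VALUES (?, ?)", (customer_id, code))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place `try_insert(1, "prod")` is accepted, while an unknown customer or an unlisted disposition code is rejected by the database itself, no matter which application or user tried to write the row.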
Then again, there were the times that users replicated old data to the central servers. Sometimes caused by old users replicating up, another time by an experienced admin trying to do a restore. Ick.
On one application we had to retype 100% of the data by hand to clean it up. Note that this was after we had implemented an LEI bridge to automate the export. We just had to give up entirely on it.
> Data quality? You've lost me. You're not one of those weird people who thinks all data should be relational, are you?
> I've never understood that. Some data and working processes lends themselves well to relational schemas. But most just don't.
> It's a restricting, cumbersome, maintenance intensive abstraction which is often unnecessary and just used out of habit.
No, most data fits best into a network model. But, unfortunately there are no great network modeling databases out there. Of the options we have we can immediately toss out hierarchical databases (and xml data storage) as a rehashing of previously discarded technology. The OODBMS has never been able to scale to handle simple scans - nor able to handle networks gracefully. The relational model scales well and also supports them adequately.
As far as non-network models - what has the scalability of a relational database? What products have survived as long? I've heard countless developers insist on something like java container-managed persistence in order to avoid lock in with relational technology. Guess what? ten years later relational databases will still be around, but two years later container-managed persistence was discredited and that company wanted to move from java anyway.
Relational technology isn't perfect, but it's unfortunately better than most other options today.
> Microsoft tried for years to get a relational database backend to the way we store data - it was called WinFS, and failed
> despite their massive resources.
So? they've also failed to create a secure os, does that mean it is impossible?
IBM created an os twenty year
> It allows you to create workflow apps which are truly quite impressive.
and nightmares in terms of maintenance, scalability and data quality.
Honestly, every one of these things I run into is a catastrophe. I'm sure that they were better than the manual processes that they usually replace, but I wish that they could have been implemented in php & postgresql/db2/oracle/whatever.
ah, and did I mention usability? Notes has its own usability patterns - which are different from everything else. The client has millions of configuration parameters - that are distributed in an arbitrary fashion across dozens of overlapping menus.
Teamrooms? ick, we've been moving that stuff to wikis for years. Yep, even the documents - go into our wiki as attachments, and yes we can lock down the security.
It's too bad though - if the right people (just a few with a vision and real experience), the right processes (probably 2% of what they're actually buried under), and the right budget had all intersected about 5 years ago this could be a good product today. But now it's just a nightmare.
And sure, running on linux is good. But accessing my notes from Thunderbird would be *far* better.
> Woah, woah, woah! Any shop using an ad-hoc collection of Access DBs and Excel spreadsheets is probably a small business that can't afford Oracle.
.net, etc. Keep it extremely simple.
Not necessarily - since oracle for a small database (< 4 gbytes of data I think) is free now anyway. But *oracle* doesn't matter - use of any database, even mysql, would be a drastic improvement.
What's probably more important is:
1. there's no network for a centralized solution, they use client software instead
2. there may be no funding to do this right
3. management may be of the type that doesn't like to tackle big improvements that it doesn't understand well
Ok, so lots of unknowns. But here's a potential approach:
1. A centralized solution using a single database is the ideal approach. But perhaps the network connectivity simply cannot be overcome. Or at least not immediately - so first implement a small database on each laptop. This means something really tiny like MySQL. Perfectly fine to start with, and compatible with everything else - so you could convert to whatever later on once the network issue is resolved.
2. You are probably stuck with the excel & access - since it sounds like they are the output of required applications. Fine, then you just need a way to import that data into MySQL. Some databases (like db2) have built-in import tools for excel - so you might get lucky. Otherwise, I'd shop around for the simplest utility to help with the task. I'd avoid anything that's too much of a distraction here -
3. I'd make the import/export process as simple as possible. Ideally a big green icon they punch.
4. You could use a light-weight http server along with php for the reporting. Again, very simple to implement.
Once the above is working fine on the laptops, then if the network problems can be overcome it wouldn't be too difficult to centralize everything. The same web reports that ran on the laptops can run on a server, along with the same database schema as well. Could theoretically even be mysql if the amount of writes is small enough. Uploading the files, or transferring data from the local copy of mysql would be the only new development required.
> Unfortanly many parts of DB2 is buggy, slow and bloated.
Not the core database. Like anything else, the fringe functionality that fewer people use or that is newer tends to have more problems. Stuff like table inheritance, xml, replication, cube views, client tools, etc. But the engine works very well. That ultimately is what I focus 99% of my time on.
And back to how it was clunky a few years ago: anyone on the 7.* versions really needs to upgrade. v8 is a very good product, and now v9 is going to be out in a couple of months.
Note also that in my experience most problems in which people "had to switch to oracle" started out because they had an oracle staff that did everything the "oracle way" - using oracle-style partitioning, etc. Then discovered that db2 didn't work as well as oracle in doing things the oracle way. Then they switched. However, it was a foregone conclusion that they would switch. With v9 coming out and oracle-style partitioning built-in perhaps those that insist on developing for oracle will find it much easier to get their code to work with db2.
>>Fortunately for me, most of my massive databases get php front-ends these days. And hopefully RoR soon.
> Then they probably aren't the kind of substantial applications being discussed. Large high-transaction-rate systems
> tend to use considerable amounts of in-memory storage and avoid falling through to the database where possible, as the database
> can be the slowest part of the system. The 'let the database store everything' approach of PHP and RoR doesn't scale to the highest levels of performance.
Keep in mind that "substantial applications" take a variety of forms. Some have high transaction rates with relatively simple queries and many concurrent users. Perhaps J2EE is well-suited here, though I suspect that the complexity it incurs isn't worth the benefit, and there are a variety of ways to cache database data, including within the database of course. And of course, most people don't need "the highest theoretical levels of performance" - they need reasonable performance.
Other substantial applications have low to no transaction rates with extremely complex queries and fewer users. In this latter situation j2ee is no better suited than RoR, actually worse in a variety of ways such as complexity & cost. Enterprise reporting and information dashboards & consoles are perfect examples in which applications may have just a few hundred simultaneous users running thousands of extremely complex queries.
Back to RoR & DB2 - i'm currently working on a multi-terabyte DB2 database application that supports hundreds of large customers running extremely complex queries around the clock. It worked fine in php, there's no reason to believe that it won't work even better in RoR. Saying that this isn't an enterprise app, isn't a substantial app, or doesn't apply to anyone else is highly misleading.
> I imagine that all three people who use DB2 are quite pleased with this development :)
> What kind of market share does it have?
Depends on how you count it. I think by revenue IBM databases account for about 1/3 of the market. This primarily consists of db2 but also includes numbers from informix and ims.
> Seriously, I haven't seen any person or development shop in my area using DB2. I've never heard of it being used at all.
It's used heaviest on the mainframes, but also works very well on windows & unix/linux. After having spent decades developing databases using db2, oracle, informix, sybase, sql server, postgresql, mysql, access, clipper, ims-db, vsam, etc - I've grown to like db2 quite a bit (and Informix the most). DB2 is far faster than mysql or postgresql, and about 1/2 the cost of oracle.
The primary reason it doesn't have a larger marketshare is that IBM isn't very good at marketing, and until about four years ago its unix/windows version was a little clunky.
DB2 works fine for small projects where it is very cost effective, but you typically see it the most on very large projects. It especially shines when your data volumes keep growing - then it gives a ton of different scalability options - all the way up to very robust beowulf-like clustering capabilities in which you can spread your database across hundreds of separate servers. For large projects like this its only real competition is Informix or Teradata.
> There is good reason why you may not want it to be. The Java/J2EE/Websphere approach often uses clustering and caching to give
> high performance and scalability. You would not want to let a small RoR application (or any other type of application) loose on such a system.
It depends:
- I'd bet that 9 out of 10 websphere/weblogic implementations don't use clustering
- many massive databases have relatively modest websites, ie the heavy-lifting is all backend not presentation
- many massive databases have small related "helper" applications that also need presentation layers
- even clustering & caching applications should be able to handle changes to reference data through simple cache-refresh methods.
Fortunately for me, most of my massive databases get php front-ends these days. And hopefully RoR soon.
> I work in banking and my experience has seen DB2 used to support very heavy applications (e.g. internet banking with 1+ million customers).
> Is rails being used in enterprise for heavy web apps (not my field)?
Massive databases often support multiple presentation layers - ecommerce, internal administration, partner access, internal reporting, etc, etc. Even if you're doing one in Java/websphere there's no reason another couldn't be in RoR.
> Please bear in mind that I do NOT speak for MySQL AB in this or any other matter - this is just my take on the policy, and I
> could be dead wrong about the rationale behind it.
which is exactly what's wrong with mysql licensing: it's deliberately vague. And who exactly wants to consult with their lawyer before using a database product?
and of course, just like AT&T just changed their privacy policy to no one's surprise; who's going to be surprised when mysql gets sufficient market share to tighten up their licensing?
> Is there any reason you're not also partitioning the PostgreSQL database, other than to make it look bad in the fictional benchmark?
> Maybe you can do more advanced partitioning with DB2 or Oracle - don't know, haven't used 'em - but PostgreSQL is certainly capable of
> the trivial example you mentioned.
Nah, I usually don't consider approaches using inheritance or union alls, except in desperate conditions. It theoretically works, but in my experience is much more work to implement, can be a problem to alter within a transaction, often isn't 100% compatible with other sql operations (load?), doesn't support etc.
From the doc you provided:
"As we can see, a complex partitioning scheme could require a substantial amount of DDL. "
another comment ends with "So performance can be drastically worse if you use partitions."
So, yeah - this is like partitioning-lite. Enough to work in some narrowly-scoped situations, but not so well that you want to make it a part of your general solutions. You certainly wouldn't want to create 365 daily partitions this way: that would be 365 or 366 tables within a single ddl script (ugh). Plus, it still doesn't address parallelism - so your 4 or 8 way smp is stuck scanning the data with a single cpu.
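To give a feel for the volume of DDL involved, here is a rough sketch that generates one child table per day using the inheritance-plus-CHECK-constraint form of partitioning. The parent table `events`, the column `event_date` and the function name are hypothetical, and the rules or triggers needed to route inserts to the right child are left out entirely, so the real script would be longer still.

```python
from datetime import date, timedelta


def daily_partition_ddl(parent, column, year):
    """Return one CREATE TABLE ... INHERITS statement per day of `year`."""
    statements = []
    day = date(year, 1, 1)
    while day.year == year:
        next_day = day + timedelta(days=1)
        statements.append(
            f"CREATE TABLE {parent}_{day:%Y%m%d} (\n"
            f"    CHECK ({column} >= DATE '{day}' AND {column} < DATE '{next_day}')\n"
            f") INHERITS ({parent});"
        )
        day = next_day
    return statements


ddl = daily_partition_ddl("events", "event_date", 2006)
print(len(ddl), "statements")
print(ddl[0])
```

Generating the script is the easy part - it's maintaining, altering and loading 365 or 366 separate tables afterwards that makes this approach unattractive as a general solution.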
Postgresql is a very cool database, but it still has a way to go with this kind of functionality. It's moving quickly, so it'll be cool to see a strong solution here within a couple of years. But until it has something bullet-proof I'm in no hurry to bet my career on it.
> The big three database manufacturers all charge pretty much the same for the same feature set. Oracle costs the same as
> sql server and db/2 within a percent or two.
Not in my experience: db2 is often 50% the cost of Oracle, especially since partitioning is an extra $10k/cpu for oracle and a completely usable form of partitioning is included within the base db2 product.
Right now I've got a multi-terabyte data warehouse running on a db2 license that costs $1500/cpu. If I wanted it directly accessible on the internet then it would run $7500/cpu.
> Personally I don't see how anybody can charge for databases these except to the largest organizations. The killer feature
> seems to be real and reliable replication and clustering.
No, those are just the features that the open source community seems to want to target. But why? They're both typically used for failover, and the commercial products have far better failover solutions (and ones that actually work across geographically separated data centers).
Bi-directional replication is at best a pain in the butt and when used to actually consolidate data its lack of transformation abilities stinks.
Clustering can deliver either availability or speed. The oracle solution is geared towards availability, the informix/db2 solution towards speed (it's like a beowulf architecture, but has been around for ten years). The former is ok, but again doesn't work across data centers, the latter is ok - and ideal when you've got 20 TB, but otherwise overkill.
What about partitioning & parallelism? Why use a product like db2 or oracle? How about because they can save you huge dollars on hardware? Take this example: you've got a million rows of data a day for 365 days on a 4-way SMP with 8 gbytes of memory and four disk arrays. Users run a wide variety of queries for reporting & analysis. Assume that your hardware cost $88,000 (list) for high-end models of this type.
If you're using postgresql or mysql many of those queries are going to result in tablescans - in which the database has to read every single one of those 365 million rows. This is because btree indexes don't work if you're selecting more than 1-3% of the data. So, want to see all data for the previous month for monthly reports? Fine, but you'll have to wait four minutes to scan all data.
On the other hand, if you're using db2 for example you'll probably want to partition (with MDC - available even in free product) on day. When you query on a month of data DB2 will scan just that one month - 1/12 of the table. Then when it does that it'll run the query in parallel - giving you 4x the performance of the single-threaded queries from mysql & postgresql. Then you've also got fine-grained memory tuning, a wide variety of optimizations and a fast optimizer - capable of handling complex queries. Ignoring the performance benefits of the latter features (only because they're difficult to quantify) and just using the first two - you're going to get 48x faster performance from db2 than mysql or postgresql. That query that took four minutes on mysql? It'll run in about five seconds on db2.
How much would you spend on hardware to try to get that mysql query to run in five seconds? Far more than the cost of oracle or db2!
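The 48x figure is just the two factors multiplied together, which a few lines of arithmetic make explicit. This is a back-of-the-envelope sketch only: it assumes the scan cost is proportional to the rows read and that the 4-way SMP scales linearly, neither of which holds exactly in practice. The function name `estimate` is made up.

```python
def estimate(full_scan_seconds, months_scanned, months_total, cpus):
    """Return (speedup, seconds) after partition elimination and parallelism."""
    # Partition elimination reads only the requested months;
    # parallelism then spreads that scan across the cpus (assumed linear).
    speedup = (months_total / months_scanned) * cpus
    return speedup, full_scan_seconds / speedup


speedup, seconds = estimate(full_scan_seconds=4 * 60,
                            months_scanned=1, months_total=12, cpus=4)
print(speedup, seconds)                 # 48.0 5.0

# The same 48x applied to a ten-minute full scan:
print(estimate(10 * 60, 1, 12, 4)[1])   # 12.5
```

So under these assumptions a four-minute single-threaded scan comes down to about five seconds, and a ten-minute one to about twelve and a half.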
> Windows is like a house of cards made from million decks, so many co-dependancies. It's why Vista has taken so long and will
> continue to cause problems.
Yeah, it would be interesting to see how many lines of code, classes, function points, whatever have gone into each release - along with the number of years, programmers and dollars it took to get there.
Windows XP took quite a long time and really didn't offer much. Vista is taking what? twice as long? and is again offering very little. The interesting question is what do the economics of the next three versions of windows look like based on this trend?
Pretty damn bleak if you ask me.
Assume that it takes:
- 4 hours to write a given program in python, 32 hours to write same program in C++
- 10 seconds to run the python program, but just 2 seconds to run the faster C++ program
- the program is run 20 times a day
- assume the developer time costs as much as the time of the person that runs it
Ok, so it'll take 630 days of running this program for the faster C++ program to make up for the extra time to develop it. So, if you can wait two years for a payback then C++ is the way to go, otherwise code it in python.
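For anyone who wants to check the 630 days or plug in their own numbers, the calculation is short enough to write down. The function name and parameter names are made up; the inputs are the assumptions listed above.

```python
def payback_days(python_dev_hours, cpp_dev_hours,
                 python_run_seconds, cpp_run_seconds, runs_per_day):
    """Days of use before the faster program repays its extra development time."""
    extra_dev_seconds = (cpp_dev_hours - python_dev_hours) * 3600
    saved_seconds_per_day = (python_run_seconds - cpp_run_seconds) * runs_per_day
    return extra_dev_seconds / saved_seconds_per_day


# 28 extra hours = 100,800 seconds; 8 seconds saved x 20 runs = 160 seconds/day
print(payback_days(4, 32, 10, 2, 20))   # 630.0
```

Run it 200 times a day instead of 20 and the payback drops to 63 days, which is the whole point: the answer depends on the inputs, not on the language.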
There, that was easy. Ok, any other simple problems out there? Which editor you should use? What's just the right amount of comments per program? Which is better - cvs or subversion?
> You just hit queries per hour, what our software hits in 12 seconds. We average 5,000 queries a second, for a large EMR (Electronic Medical Record)
> application, with approximately 4,000 users. Our software is web based, written in C, using MySQL as the backend. We're currently sitting at 16T of
> storage as well, with a growth of 2G per day of new data. ;-)
Not to take anything away from your achievement, but we're talking apples & oranges here: the queries I'm talking about are reporting queries - not simple operational/transactional ones. They aren't just looking up a status on an item. They're each often scanning tens of thousands of rows of data, sometimes millions, joining to a separate temp table - itself created as a cartesian product from multiple dimension tables, then grouping the final result. These are queries that require parallelism and partitioning to achieve fast results. Completely doable in db2, oracle, informix and to a much lesser extent sql server. Not in mysql or postgresql yet though since these products lack parallelism and partitioning.
Additionally, we can manage the entire query load from a single four-way power5 aix box running a low-end version of db2. Not only can mysql not match that reporting performance (it's about 1/40th the speed of db2 for reporting), but I don't think that it would meet your figures above either. How many servers do you have supporting your mysql database?
> PHP has its places - extremely large, mission critical applications are not one of them.
Odd that you said that given that you're using mysql.
However, I think your comment is misplaced: while this system does a ton of heavy moving, the front-end isn't one of the performance-challenged parts of the system: the extraction of data from other systems (python), the transformation of data (python), and the processing of queries (db2) are the performance-oriented parts. The display of data (php) handles just a fraction of the workload of the other pieces. So, in this case - php works ok in an extremely large, mission critical application. Its biggest challenge in our context is manageability rather than performance. Would I recommend php for your application? Perhaps not - but that doesn't mean it doesn't work fine for ours.
Likewise, perhaps mysql can work well within a mission-critical application. You tend to have to buy a lot more hardware, and spend a lot more time on testing. And porting to other databases can be tough. But you can still do it. And maybe sometimes it makes sense. I'd be hesitant to say "mysql cannot do enterprise computing". It's just not true.
> If you don't mind me asking, what python ETL software are you using? Is this something available commercially, through open source, or something of your own design.
We wrote it ourselves. It wasn't difficult to write, but we focused on compliance with some patterns that made it both easy to develop and easy to manage. For example, one pattern is that all processes are completely autonomous, typically run out of cron every minute. If there is no data to process they die quietly, if they are already running they die quietly, if their schedule is suppressed they die quietly. All they produce is a file used by a downstream process. But they don't run that process - that process, like the one above, is also constantly checking for a file.
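A bare-bones sketch of that pattern, to give the idea - this is an illustration, not our actual code. The file names (`inbox.dat`, `outbox.dat`, `suppress`, `lock`), the function `run_once` and the `transform` callback are all hypothetical; the real thing would be a script that cron starts every minute and that exits without noise on any of the early returns.

```python
import os
from pathlib import Path


def run_once(work_dir, transform):
    """One cron-style invocation; returns a short status string."""
    work_dir = Path(work_dir)
    inbox = work_dir / "inbox.dat"      # written by the upstream process
    outbox = work_dir / "outbox.dat"    # picked up by the downstream process
    suppress = work_dir / "suppress"    # touch this file to pause the job
    lock = work_dir / "lock"

    if suppress.exists():
        return "suppressed"             # schedule suppressed: die quietly
    if not inbox.exists():
        return "no data"                # nothing to do: die quietly
    try:
        # O_EXCL makes creation atomic, so only one instance gets the lock.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "already running"        # another instance is busy: die quietly
    try:
        os.close(fd)
        tmp = work_dir / "outbox.tmp"
        with inbox.open() as src, tmp.open("w") as dst:
            for line in src:
                dst.write(transform(line))
        os.replace(tmp, outbox)         # publish only a complete file
        inbox.unlink()
        return "processed"
    finally:
        lock.unlink()
```

The process never calls the downstream step; it only leaves a finished file behind. One thing this sketch doesn't handle is a stale lock left by a process that was killed hard - that needs a pid or age check on the lock file.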
So, we didn't spend a fortune on etl software, have something that can scale very cheaply, and is very simple to build and run. It would have been great to use a product that already provided these benefits, but we couldn't find one. Oh yeah, one other benefit: until we got into production (and earned credibility) we had almost no budget for software, training, etc.
> I have been looking for software to replace the ease of SQL Server DTS for quite a while and using DB2 (if necessary) would not be a problem.
yeah, DTS is terrible. I replaced one large etl environment in DTS with cygwin and python. It immediately reduced our labor costs. There are some open source etl projects, but I can't recommend any. That's too bad, since it would be great to have an alternative to Data Junction, Informatica, Ab Initio, etc.
>> Scans of a hundred million rows at a time aren't uncommon (though seldom happen more than a few dozen times a day).
> Yes they are. Go read what you wrote.
I'm intimately familiar with this application. When I say that only a few dozen queries a day scan a hundred million rows, that is the case.
> This app is completely written in korn shell, python, php and sql (db2).
>> One guess where 99% of the cycles are in that (and 90% of the dollars).
No need to guess, I'll provide the info.
First off, cycles? Really? What a simplistic way to think about such a system - you normally also look at io and network performance and memory. Especially io performance when you're slinging this much data around. And - python and c have extremely similar io performance.
And cost? You're implying that 90% of the cost is spent on hardware, apparently because python is wasting cpu cycles. This is not the case: etl servers running python consume less than 10% of the total hardware cost. And hardware is less than 10% of the total project cost.
So:
- php & python have had a negligible impact on hardware cost, which is 10% of project cost
- php & python have had a hugely beneficial impact on labor cost, which is 90% of project cost
I'm not going to say that python, php and ruby will achieve the same benefits on all projects. But on this "hard" project those technologies far outperformed what native code would have achieved.
> How much app. processing goes on in the script languages? How much hardware do you dedicate to those vs. db2?
/.er, "Which interpreted language is db2 written in?" :)
Well, db2 is obviously managing quite a lot of it. Certainly all of the queries, but also the very fast loads. DB2 is running on four-way Power4 & Power5 hardware with 4-12 disk arrays per server with 64-bit architecture and typically 8 gbytes of memory. It's running extremely fast.
By the time the data hits PHP it is typically just small result sets - that is, a scan of a few million rows will typically generate just a hundred rows or less that will go into a chart, graph or table. The PHP component is just running on older intel four-way SMPs. For a while much of it was statically generating all possible query pages - which meant a vast amount of processing, which worked fine - while we had slower databases.
But *100%* of the data pours through python. Every single row has to be reformatted, has to be validated, has to have multiple fields replaced with identifiers to other tables (which require lookups in arrays). All of this happens in python. The python processes can hit 500 million events a day comfortably on older 1.4 ghz intel four-way smps with 4 gbytes of memory and a single slow disk array. They can hit about 3 billion events a day on fast 2.8+ ghz intel four-ways with multiple disk arrays.
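Roughly, the per-row work looks like the sketch below. The field names (`host`, `event_type`, `count`) and the rule of handing out a new id for an unseen value are assumptions for the example, not a description of our real schema or rules.

```python
def make_transformer(lookups):
    """Build a per-row transform function.

    `lookups` maps a field name to an in-memory dict of {raw value: id}.
    Field names and validation rules below are hypothetical.
    """
    def transform(row):
        # Reformat: trim whitespace everywhere, normalise case on one field.
        out = {k: v.strip() for k, v in row.items()}
        out["event_type"] = out["event_type"].lower()
        # Validate: reject the row if the count isn't a plain integer.
        if not out["count"].isdigit():
            return None
        out["count"] = int(out["count"])
        # Replace raw values with identifiers via dict lookups.
        for field, table in lookups.items():
            key = out.pop(field)
            if key not in table:
                # Assumption: unseen values get the next id. A real system
                # might reject the row or insert into the database instead.
                table[key] = len(table) + 1
            out[field + "_id"] = table[key]
        return out
    return transform

lookups = {"host": {"web1": 1}, "event_type": {}}
transform = make_transformer(lookups)
row = transform({"host": "web1", "event_type": " Login ", "count": "3"})
```

Keeping the lookup tables in plain dicts means each replacement is a hash lookup rather than a database round trip, which matters when every single row goes through this path.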
> Is this externally visible (like search_used_books.amazon.com...)?
Yes, but only if you're one of hundreds of our customers.
> And of course to be an argumentative
you've got me there - we do run several pieces of software written in native code: aix, redhat linux, apache, db2, etc. And they do quite a lot of work - I'm not taking anything from them or suggesting that they be ported to python. However, our python ETL server is still churning out a vast amount of data every hour of the day.
> When your web-based-datastore gets 50,000 inserts per second, hovers between 15 and 20 billion rows and endures a sustained query rate
> of 43,000 queries per hour, tell me which part of it you want coded in PHP.
hmm, the warehouse I work on has multiple databases with billions of rows in them, can hit insert rates of 100,000 rows a second, can experience 60,000 queries/hour - many of which are trending data over 13 months, has hundreds of users. Many of these users are allowed to directly hit some of the databases with whatever query tool they want. Scans of a hundred million rows at a time aren't uncommon (though seldom happen more than a few dozen times a day).
This app is completely written in korn shell, python, php and sql (db2). Looks like Ruby is also coming into the picture now, will probably supplant much of the php in order to improve manageability.
Oh yeah, and the frequency of releases is quick and its defect rate is low. And we're planning to begin adding over 400 million events a day soon. I've done similar projects in C and java. Never anywhere near as successfully as in python and php.
We might consider rewriting a few select python classes in c. Maybe, if we port the ETL over to the Power5 architecture, where psyco doesn't run. Otherwise, it's cheaper to just buy more hardware at this point - since each ETL server can handle about 3 billion rows of data/day with our python programs.
It's important that in a book entitled "The Art of SQL" they followed the organizational structure of "The Art of War".
Well, it really isn't at all important, nor should it be a surprise.
And The Art of War faddish? The book is over 2500 years old, influenced Emperor Napoleon, General Patton, BH Liddell Hart (who in turn influenced the creation of the WWII German military strategies), General MacArthur, etc, and has sold well to non-military types for at least 20 years. I think the world could use a few more books that survive the test of time 1/10th as well.