Domain: dbms2.com
Stories and comments across the archive that link to dbms2.com.
Comments · 20
-
Forget it
I call total BS on this post for a few reasons:
- 100 TB is about 250 SSD drives. Ebay runs a few hundred petabytes of storage over several datacenters throughout the US, not including all other countries.
- We run several racks at Telx. Our cost per month in NYC including electric is only about 3k per cabinet. Does Ebay really care about 3 cabinets?
- Over half the systems now access the shared storage which contains the drives? Yes, if I map a drive to a particular SAN I guess I am now accessing the data. Does that mean I am actually leveraging it?
All in all this is barely a dent in anything Ebay does. It sounds more like an experiment and hype for the drives they used.
-
An architectural perspective
First, a few helpful links:
- Facebook architecture overview
- Facebook architecture presentation by Aditya Agarwal
- Amdahl's law
- Facebook uses many Hadoop nodes
- Facebook has 2nd largest Hadoop cluster
Amdahl's law says that if Facebook were to switch from PHP to C++, the best possible improvement in the overall processing time is limited by the fraction of total time spent in PHP now. If PHP processing accounts for 90% of the time and they reduce that to zero, they'd have a 10x speedup. However, if it accounts for 10% of the time and they reduce it to zero, they'd have only about an 11% speedup.
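To make that arithmetic concrete, here is a minimal sketch of the formula. The function name `amdahl_speedup` and the example PHP shares are my own illustration, not measurements from Facebook; `factor` is how much faster the PHP portion becomes, with infinity standing in for "reduced to zero".

```python
import math

def amdahl_speedup(fraction, factor=math.inf):
    """Overall speedup when `fraction` of total runtime is made `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Time left over: the untouched part plus the shrunken part.
    remaining = (1.0 - fraction) + fraction / factor
    return math.inf if remaining == 0 else 1.0 / remaining

# Hypothetical shares of request time spent in PHP.
for share in (0.9, 0.5, 0.1):
    print(f"PHP share {share:.0%}: at most {amdahl_speedup(share):.2f}x overall")
```

With a 90% share the ceiling is 10x; with a 10% share it is roughly 1.11x, no matter how fast the replacement language is.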
So, the question is: How much time (overall) is spent in PHP processing? My guess is not very much. As other posters have pointed out, there are disk accesses and MySQL. And quite a bit is cached in Memcached.
The original article is slashdotted now, so I'm not sure if it says what those 30k servers are doing, but Facebook has more than just PHP running. Perhaps a thousand of those servers are running Hadoop, probably calculating the social network.
From an architectural perspective, it probably does not make sense for them to optimize for processing speed (i.e., switch PHP to C++) if their performance is acceptable. That's because they face larger risks: modifiability and time to market pressures. They may worry that switching to a statically typed language (such as C++, but Java would be similar) would make new feature development slower. If they could have both, great, but these two quality attributes often trade off against each other. A design with better performance may hinder modifiability, and vice versa.
I don't mean to start a language war -- I'm speaking broadly about the idea that dynamically typed languages (PHP, Smalltalk, Ruby, Python, ...) yield programs that are faster to write and modify compared to statically typed languages (C, C++, Java). You may disagree with that generalization, but you may agree that others think it is true, and are therefore acting rationally if they choose a dynamic language when they want modifiability.
Disclaimers: I knew Aditya in school but haven't spoken to him about Facebook; I am writing a book on software architecture.
-
Re:Wise They Are
How many of the top 10 largest (petabyte sized, or approaching) are running on commercial databases, and how many are on postgresql again? Wasn't there a slashdot article just a few weeks ago? Hmm, one link I have is http://www.dbms2.com/2008/08/25/greenplum-is-in-the-big-leagues/
-
What is MapReduce SPECIFICALLY useful for?
At the risk of quoting myself,
Proponents of MapReduce highlight two advantages:
1. MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
2. MapReduce runs in massively parallel mode "for free," without extra programming.
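As a toy illustration of those two points, here is a sketch of the map/shuffle/reduce pattern building an inverted text index. It is not Google's or Hadoop's API: the function names are made up for this example, and a local thread pool stands in for the cluster purely to show that the user-written `map_phase` and `reduce_phase` contain no parallel code of their own.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(doc):
    """User-written map step: emit (word, doc_id) pairs for one document."""
    doc_id, text = doc
    return [(word.lower(), doc_id) for word in text.split()]

def reduce_phase(item):
    """User-written reduce step: collapse all doc ids seen for one word."""
    word, doc_ids = item
    return word, sorted(set(doc_ids))

def shuffle(mapped):
    """Framework step: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def run_mapreduce(docs, workers=4):
    """Framework step: fan map and reduce calls out over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, docs))
        groups = shuffle(mapped)
        return dict(pool.map(reduce_phase, list(groups.items())))

if __name__ == "__main__":
    docs = [(1, "big data"), (2, "Big deal"), (3, "data data everywhere")]
    print(run_mapreduce(docs))
```

In a real deployment the framework would also handle partitioning across machines, failures, and storage; none of that is modeled here.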
Based on those advantages, MapReduce would indeed seem to have significant uses, including:
* Specialized indexing of large quantities of data. Obviously, MapReduce was built for text indexing of the Web. But it would likely also be useful for, say, preprocessing satellite telemetry or intelligence intercepts, or for doing early steps in large-scale network traffic analysis. MapReduce may not be good for data management, but it looks good for banging stuff into specialized data management systems.
* Computer-scientific research. If you're trying to figure out better ways to, say, digest and analyze huge amounts of astronomical data, MapReduce seems like a great platform. Today's researchers - even the students - aren't nearly as adept at parallel algorithms as one would hope. Perhaps we should take those complications away to let them focus on the unique parts of their work. Breakthrough programming is hard enough anyway, especially if you're trying to do all the work yourself.
-
TIBCO is the logical next candidate to be bought
TIBCO is the logical next candidate to be bought. If not SAP.
-
Vertica is far from the only columnar game in town
And it's not just Sybase IQ, either. There are lots of columnar players. Kognitio also has a columnar VLDB offering, but it's quite different from Vertica's. And the columnar memory-centric BI offerings are interesting as well, such as QlikTech's and SAP's. Also, full-text indexing is pretty columnar itself.
-
Very promising for SOME applications
I had a long chat with Mike Stonebraker a few weeks ago, and came away with the following tentative opinions about Vertica's prospects, and those for columnar systems in general.
* Pinpoint data lookup doesn't seem like a great fit for columnar systems. Indeed, traditional rows-and-B-trees would seem to be best.
* Constrained query and reporting would seem to be a sweet spot, even though it's a sweet spot for some of the best competition as well.
* Cube-filling calculations involve big intermediate result sets. I'm not sure that's a great fit for columnar systems.
* Hardcore tabular data crunching would seem in many cases to be another sweet spot, again against a lot of competition, at least in some of its sub-categories.
* Text and media search are best done by specialized systems that, at least in the case of text, wind up being quasi-columnar. The same goes for other specialty areas. Systems like Vertica's have nothing to offer directly to these applications. However, it might be possible for Vertica to integrate with them fairly quickly, given that they're starting from vaguely similar philosophical roots.
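For readers who want to see why pinpoint lookups and tabular crunching pull in opposite directions, here is a rough sketch of the two layouts. It says nothing about how Vertica is actually implemented; the table, field names, and functions are invented for illustration.

```python
# The same tiny table stored two ways (hypothetical data).
rows = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 75.5},
    {"id": 3, "region": "east", "amount": 30.0},
]
# Column store: one list per field, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def lookup_row_store(rows, row_id):
    """Pinpoint lookup: the whole record sits together once the row is found."""
    for row in rows:
        if row["id"] == row_id:
            return row
    return None

def lookup_column_store(columns, row_id):
    """Pinpoint lookup: the record has to be stitched back together from every column."""
    if row_id not in columns["id"]:
        return None
    pos = columns["id"].index(row_id)
    return {name: values[pos] for name, values in columns.items()}

def total_row_store(rows, field):
    """Aggregate: every full row is visited even though one field is needed."""
    return sum(row[field] for row in rows)

def total_column_store(columns, field):
    """Aggregate: only the one relevant column is scanned."""
    return sum(columns[field])

if __name__ == "__main__":
    print(lookup_row_store(rows, 2), lookup_column_store(columns, 2))
    print(total_row_store(rows, "amount"), total_column_store(columns, "amount"))
```

Both layouts return the same answers; the difference is how much unrelated data each one has to touch on the way, which is where the row-and-B-tree versus columnar trade-off comes from.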
There also are some technical details in that article; a link to a short, somewhat hagiographic intro to Mike himself; and so on.
-
He's endorsing DBMS2. He just doesn't know it.
Just as Amazon was using SOA long before it was named, the same is true of DBMS2. Add that to SAP's adoption, and we're getting somewhere.
:)
-
Seriously, there are a lot of things to try
1. Multiple e-commerce models. Something will probably work.
2. Multiple approaches to network analysis, collaborative filtering, etc. (Obligatory shameless plug: The hot new company in network analysis is Cogito.)
3. Various communications things.
4. Various real time monitoring things, both narrowly filtered and for overall trends.
I bet if I'd logged onto the site a single time in my whole life I might be able to come up with even more ideas. ;)
-
Juvenal delinquency
Juvenal is the ancient Roman who asked "Who will watch the watchmen?" For George Bush, the answer is evidently "Preferably, nobody."
-
The way we protect liberty will have to change
A lot of people seem to be overlooking two basic facts:
1. The amount of information government truly needs to gather to protect us is also sufficient to greatly threaten our liberty.
2. Governments will inevitably gather much more information than they really need.
As a result, it is necessary to design legal systems (and where possible to restrain the design of technical systems) so that even though government has the information, it doesn't commonly use it in nefarious ways. I've written a series of articles about that. Most of them can be found starting from the link http://www.monashreport.com/2006/06/06/freedom-even-without-data-privacy/, or more generally from http://www.monashreport.com/category/public-policy-and-privacy/privacy/
Examples of why we should expect government to gather huge amounts of information include, in no particular order:
A. All the call/e-mail/whatever connection information they're already getting, as documented in the news around NSA surveillance, AT&T's involvement, and so on.
B. Laws to require ISPs or information service providers to keep records of which IP addresses connect to which sites (so as to fight child porn, piracy, whatever).
C. Britain's moves towards complete video tracking of car movements (I get my reporting on this from The Register).
D. Credit card transaction records.
E. Forthcoming integrated electronic health records. (Those will have huge benefits to the saving of lives, quality of life, cost and efficiency of health care, etc. Whatever the privacy risks, they need to be managed so that health care is allowed to improve.)
And that's even without mentioning RFID.
What's slowing all this down is some political opposition, plus the huge technical difficulty of the required system integration projects. But in a small number of decades, it will all have happened. Our laws and oversight systems need to have evolved drastically by then. We have to start now.
I'm definitely not saying that we should cripple government in gathering and using information. Indeed, I'm an advisor to Cogito, a company with some of the most powerful relationship analysis software out there. http://www.dbms2.com/category/object-oriented-and-xml-technology/cogito/ But I think we need to radically upgrade our legal structures in response to these technological trends.
-
You might want to look at some Oracle alternatives
Besides the obvious MySQL/commodity low end stuff, you might want to look at specialty high end alternatives too. I have my shoes off as I write this, so counting SAP's BI Accelerator customers is no problem for me; still, that's an interesting product heralding an interesting trend. And I think DATallegro will be on brandname boxes soon too.
My backup for these opinions can be found at various places on http://www.dbms2.com/
-
OS-DBMS integration is anyway pretty common
If you think about it, there are a lot of cases of OS-DBMS integration, or at least highly OS-aware implementations. Examples include Teradata, mainframe DB2, data warehouse appliances, the AS/400 case mentioned in another note -- and arguably Oracle itself! Unlike other portable DBMS vendors, Oracle does a lot of OS-specific integration/interface work for each platform it supports.
I posted a little support for that argument at http://www.dbms2.com/2006/07/09/os-dbms-integration/.
-
If Intel's partners are to be believed ...
.. Woodcrest is the real deal. Companies that held their noses and supported Intel in the past for financial reasons now say that Woodcrest has actually caught up with or leapfrogged Opteron. DATallegro is just the most visible example. At the risk of yet another shameless plug, you can see some details via http://www.dbms2.com/2006/06/28/good-datallegrointel-white-paper/
-
But there's also an opposing trend
System architecture is changing in a profound way that will somewhat limit the commoditization on which virtualization depends. It's not just a matter any more of CPUs doing calculation and ordering up random disk accesses. RAM speeds, memory bus speeds, interprocessor pipeline speeds -- that stuff all matters a lot now. This is most evident in data warehousing/analytics, where data warehouse appliances (Netezza, DATallegro) and even memory-centric technologies (SAP, Applix) are becoming more important, but it could also be a broader trend.
I've written about some of the details at http://www.dbms2.com/
No way do I dispute the benefits of virtualization in OLTP, messaging, and so on. It's just not the be all and end all.
-
Re:Not silly at all
Date and Fabian Pascal have been fairly clear that it's an actual company - they've discussed the owner's name, which I don't recall, I'd have to look it up on my hard drive somewhere [Steve Tarin, apparently, is the inventor and owner, I've just looked it up].
According to Pascal, Date has seen a working implementation of the TRM, and is writing a book about it tentatively entitled "Go Faster! The Transrelational Approach to DBMS Implementation."
The company name is Required Technologies Inc., 39141 Civic Center Dr. Ste. 250, Fremont, CA 94538. Their Web site remains "under construction".
The patent for TRM is United States Patent 6,009,432. There is also the patent application: United States Patent Application 0010000536.
There is also a resume of one Vincent Poydenot who described his employment with Required Technologies as Vice-President of Software Development. He describes a 15-man development team which developed a full implementation for Windows NT 4.0 in Visual C++, with a port to Solaris and Linux.
Links for the above here http://dmoz.org/Computers/Software/Databases/Relational/Implementations/Required_Technologies//
An ad for programmers to work for the company when they were apparently in New York is here:
http://www.codeguru.com/forum/archive/index.php/t-188697.html/
According to Pascal's DBdebunk Web site:
"Not only had Date been exposed to a working TRM implementation - a prototype built by Required Technologies that included update and disk operations - but so have other highly respected database researchers and implementers. Moreover, several potential customers ran their own benchmarks against this prototype using their own real-world data and their own live complex queries. The results were extraordinary. In every case, TRM delivered orders-of-magnitude performance improvements over existing RDBMSs, in a large dynamic disk-based environment. These results can be demonstrated to anyone seriously interested in TRM....
Not only does the prototype implementation of TRM (referenced above) still exist, but also a full-blown commercial disk-based updatable RDBMS based on TRM (with standard SQL, ODBC, JDBC, and third-party tool interfaces, plus all standard subsystems) is nearly complete."
The above was as of January 2005. Back in late 2004 Pascal was describing "large transactional databases with subsecond response." Note that these were not in-memory databases but disk-based.
Apparently there is some legal or financial issue involved that is threatening the owner with "having his company taken away from him", according to one reference. They claim the guy has been fighting tooth and nail to resolve the issues, but there hasn't been any recent info.
I would have assumed the whole thing would have been resolved by now in most cases, unless the people involved are waiting for some court case.
I have now found a post on Curt Monash's blog in which he apparently - I say apparently because I have no idea whether his information is correct - debunks the entire project and the company:
http://www.dbms2.com/category/memory-centric-data-management//
Fabian Pascal's response on Curt's blog:
Monash knows zilch about TRM. But then he knows zilch about RM too,and lack of knowldge has not stopped him ever before from generating crappola. In fact, he is not even aware of how ignorant he is.
Nonsense indeed, but the only one is from Monash.
Unskilled and unaware of it. Typical american.
Comment by fabian pascal -- November 14, 2005 @ 12:41 pm
There follows a ton of incredibly acrimonious comments between Monash and Pascal in which both accuse the other of various incompeten