Slashdot Mirror


Internet Archive Gets 4.5PB Data Center Upgrade

Lucas123 writes "The Internet Archive, the non-profit organization that scrapes the Web every two months in order to archive web page images, just cut the ribbon on a new 4.5 petabyte data center housed in a metal shipping container that sits outside. The data center supports the Wayback Machine, the Web site that offers the public a view of the 151 billion Web page images collected since 1997. The new data center houses 63 Sun Fire servers, each with 48 1TB hard drives running in parallel to support both the web crawling application and the 200,000 visitors to the site each day."

14 of 235 comments

  1. Re:Where do they store 4.5TB off site by fuzzyfuzzyfungus · · Score: 3, Informative

    TFA indicates that they have a mirror at the Library of Alexandria. Unless things have changed since last I read about them, the mirroring is pretty much it. The Internet Archive does very impressive work, but they don't have that much money. No Real Big Serious Enterprise tape silos here.

  2. Re:Story is meaningless without LOC measurement by Anonymous Coward · · Score: 5, Informative
  3. Re:Story is meaningless without LOC measurement by commodore64_love · · Score: 3, Informative

    83 terabytes in the LOC, so 4.5 petabytes == 54 Libraries of Congress

    4.5 petabytes == 4,500 one-terabyte hard drives, times $75 each == ~$340,000 == how much taxpayers spend, each hour, to maintain the LOC
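
    The arithmetic above can be checked with a short sketch. The 83 TB Library of Congress estimate and the $75 drive price are the commenter's figures, taken as assumptions here, and the function names are made up for illustration.

```python
# Figures quoted in the comment; none are independently verified here.
LOC_TB = 83            # commenter's estimate of the Library of Congress, in TB
ARCHIVE_TB = 4500      # 4.5 PB expressed in decimal terabytes
PRICE_PER_TB_USD = 75  # commenter's assumed price of one 1 TB drive


def libraries_of_congress(archive_tb=ARCHIVE_TB, loc_tb=LOC_TB):
    """How many LOC-sized collections fit in the archive."""
    return archive_tb / loc_tb


def drive_cost_usd(archive_tb=ARCHIVE_TB, price=PRICE_PER_TB_USD):
    """Cost of bare 1 TB drives only (no servers, power, or container)."""
    return archive_tb * price


print(f"{libraries_of_congress():.1f} LOCs")  # 54.2 LOCs
print(f"${drive_cost_usd():,}")               # $337,500
```

    The drive total comes to $337,500, which the comment rounds to ~$340,000.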

    --
    "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
  4. In Other News by Erik Fish · · Score: 5, Informative

    Incidentally: FileFront is closing in five days, taking with it any files that aren't hosted elsewhere.

    I am told that many of the Half-Life mods hosted there are not available anywhere else, so get while the getting is good...

  5. Re:What about 1996 and earlier? by Profane MuthaFucka · · Score: 2, Informative

    Because after 1996 women shaved all their hair off due to a mistaken belief that men prefer their women to look like little girls. We don't, we like the big bushes, and that is why you must save that porn for the good of mankind.

    --
    Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
  6. Math by PowerKe · · Score: 3, Informative

    63 servers * 48 disks of 1 TB each = 3024 TB. According to the announcement on archive.org, 3 petabytes would be right.

  7. "Sun Fire" by fm6 · · Score: 3, Informative

    The new data center houses 63 Sun Fire servers

    That's not very specific. "Sun Fire" is a brand that for a while got applied to all of Sun's rack-mount servers (except for NEBS-compliant servers, which were and are called "Sun Netra"). A little confusing, of course, which is why they've started calling new SPARC boxes "Sun SPARC Enterprise" to differentiate them from those mangy x64 "Sun Fire" systems. Except that there are still SPARC systems called "Sun Fire", so I guess the confusion factor didn't get any better...

    Anyway, the specific server being used here is the Sun Fire X4500, a system with no less than 48 1 TB disks in a 4U space. Notice that this model is EOLed; presumably iarchive got a deal on some remaindered machines.

    The shipping container is something we've seen before.

    1. Re:"Sun Fire" by ximenes · · Score: 2, Informative

      Since they're using one of Sun's modular datacenters that is actually on the Sun campus, I would imagine that they got some financial incentives / support from Sun for all of this.

      The X4500 is EOL as you mention, although it was still sold a few months back. It lives on as the X4540, which really isn't that different; the main thing is it's moved to a newer Opteron processor type and is a fair bit cheaper. So they didn't really miss out on anything.

      It's kind of interesting to me that they went this route, as opposed to a bunch of servers talking to a bunch of storage separately. This seems to be an exact use case for the X4500-type system, which as far as I'm aware is pretty unique.

    2. Re:"Sun Fire" by Anonymous Coward · · Score: 2, Informative

      Anyway, the specific server being used here is the Sun Fire X4500 [sun.com], a system with no less than 48 1 TB disks in a 4U space. Notice that this model is EOLed; presumably iarchive got a deal on some remaindered machines.

      There are newer X4540s which are mostly the same, but have newer CPUs, and can hold more memory (16 -> 64 GB).

  8. Re:What about 1996 and earlier? by scottrocket · · Score: 1, Informative

    Yes, "The Wayback Machine", at archive.org. Coincidentally, I was there just last night, looking at a January '98 Slashdot.

  9. Re:63 x 48 = 3024Tb by spinkham · · Score: 4, Informative

    TFA says "...eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives." (emphasis mine)

    I would guess this means there's an X4500 with 24TB in local disks, and 48TB in attached storage per machine. (24+48)*63 = 4536TB, which does give us the quoted 4.5PB.
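
    A quick sketch comparing the two readings of TFA. The 24 local + 48 attached split is this comment's guess, not something TFA confirms, and the helper name is hypothetical.

```python
SERVERS = 63   # server count quoted in TFA
DRIVE_TB = 1   # each drive is 1 TB


def raw_capacity_tb(disks_per_server, servers=SERVERS, drive_tb=DRIVE_TB):
    """Raw (unformatted, no redundancy) capacity in decimal terabytes."""
    return servers * disks_per_server * drive_tb


# Reading 1: 48 disks per server in total.
print(raw_capacity_tb(48))       # 3024

# Reading 2 (the guess above): 24 local disks plus a 48-disk array.
print(raw_capacity_tb(24 + 48))  # 4536
```

    Only the second reading lands near the 4.5PB headline figure; the first gives the roughly 3PB mentioned elsewhere in the thread.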

    --
    Blessed are the pessimists, for they have made backups.
  10. Re:Where do they store 4.5TB off site by Anonymous Coward · · Score: 2, Informative

    In Brewster Kahle's December 2007 TED talk he mentions a third mirror in the Netherlands.
    http://www.ted.com/index.php/talks/brewster_kahle_builds_a_free_digital_library.html

    As he puts it, the Archive is mirrored on 'a fault line, a flood plain, and in the Middle East'.

    Funny thing is I can't find another reference to the Netherlands mirror. The Bibliotheca Alexandrina site mentions a plan to eventually have four sites (California, Alexandria, Europe, and Asia), but that's it. Anyone know what happened with the Netherlands site?

  11. Re:63 x 48 = 3024Tb by rackserverdeals · · Score: 2, Informative

    Sun has more information and an interactive tour of the Internet Archive modular data center on their site.

    The total raw capacity of the container is 3 petabytes. In reality it's going to be less than that. First, 2 disks are likely to be set up in a mirrored pool for the system disks. I believe the root pool only supports mirrors, not raidz. Not sure if this has changed.

    That leaves you with 46 disks for data. Maybe they partitioned part of the root pool to include in the data pools, not sure, but ZFS works better with whole disks.

    In the interactive tour, they weren't clear on how they set up the pools.
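
    As a rough sketch of what that does to the total, assuming 2 of the 48 disks per server go to a mirrored root pool (my guess, not confirmed by the tour) and ignoring whatever redundancy the data pools use:

```python
SERVERS = 63
DISKS_PER_SERVER = 48
DRIVE_TB = 1


def data_capacity_tb(system_disks=2, servers=SERVERS,
                     disks=DISKS_PER_SERVER, drive_tb=DRIVE_TB):
    """Raw TB left for data pools after reserving system disks.

    system_disks=2 is an assumption (a mirrored root pool). Parity or
    mirroring inside the data pools is NOT subtracted, because the pool
    layout is unknown.
    """
    if not 0 <= system_disks <= disks:
        raise ValueError("system_disks must be between 0 and disks")
    return servers * (disks - system_disks) * drive_tb


print(data_capacity_tb(0))  # 3024, every disk counted
print(data_capacity_tb())   # 2898, with 46 data disks per server
```

    Whatever raidz or mirror layout they picked for the data pools would bring the usable number down further.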

    Side note: maybe I'm cynical, but if this were the other way around, with Linux servers replacing Sun/Solaris servers, that probably would have been the headline.

    Pretty neat to find out that the Internet Archive is powered by Java too. The Wayback Machine is Java, as are the crawlers.

    --
    Dual Opteron < $600
  12. Re:Where do they store 4.5TB off site by Rural · · Score: 2, Informative

    Their aim is to preserve the content found on the Web. They need the hardware for that. I assume they don't need much for the "serving users" part.