Slashdot Mirror


Internet Archive Gets 4.5PB Data Center Upgrade

Lucas123 writes "The Internet Archive, the non-profit organization that scrapes the Web every two months in order to archive web page images, just cut the ribbon on a new 4.5 petabyte data center housed in a metal shipping container that sits outside. The data center supports the Wayback Machine, the Web site that offers the public a view of the 151 billion Web page images collected since 1997. The new data center houses 63 Sun Fire servers, each with 48 1TB hard drives running in parallel to support both the web crawling application and the 200,000 visitors to the site each day."

31 of 235 comments

  1. Where do they store 4.5TB off site by wjh31 · · Score: 5, Interesting

    One would assume that something like this does regular off-site backups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?

    1. Re:Where do they store 4.5TB off site by LiquidCoooled · · Score: 5, Funny

      One would assume that something like this does regular off-site backups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?

      floppy disks.
      lots of floppy disks.

      --
      liqbase :: faster than paper
    2. Re:Where do they store 4.5TB off site by MichaelSmith · · Score: 4, Funny

      It's like the two USB hard disks I use for backups. Pick up the container and swap it with the container from secure storage.

    3. Re:Where do they store 4.5TB off site by MrEricSir · · Score: 4, Funny

      It's simple, the backups are compressed -- they simply remove all those useless zeroes from the binary data.

      --
      There's no -1 for "I don't get it."
    4. Re:Where do they store 4.5TB off site by DigiShaman · · Score: 4, Interesting

      Umm, how many forklifts and 18-wheelers does it take to swap out 4.5 petabytes worth of data each day?

      --
      Life is not for the lazy.
    5. Re:Where do they store 4.5TB off site by commodore64_love · · Score: 5, Funny

      They'd better have it backed up. Last time the Alexandria library burned down, we lost about one thousand years of collected information from ancient Greece and Rome. Ooopsie.

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    6. Re:Where do they store 4.5TB off site by Anonymous Coward · · Score: 5, Funny

      Can you say, Parallelism?

      Parallelogram.... crap
      Parallellellell... dammit
      Parapalouza... >

      Why did you have to point that out to everyone? :(

    7. Re:Where do they store 4.5TB off site by notthepainter · · Score: 4, Interesting

      Sadly, even modern day archives get wrecked. See http://www.spiegel.de/international/germany/0,1518,611311,00.html

    8. Re:Where do they store 4.5TB off site by zach297 · · Score: 5, Funny

      I'd suggest also using stone slabs. Water can do serious damage to paper, and don't get me started on fire hazards. Good old Stone Slabs resist both of those really well. I'm not sure what the write speed is, however, so you'll probably need to hire many stonecutters to work in parallel.

      A math problem. My favorite. I don't know much about stonecutters, but let's assume they can write one bit every 2 seconds. That's 1 byte in 16 seconds. The Internet Archive is (4.5 x 1,125,899,906,842,624) 5,066,549,580,791,808 (5 quadrillion) bytes. That works out to 81,064,793,292,668,928 (81 quadrillion) seconds, or about 2,570,547,732 (2.5 billion) years. That is far too long for their stringent 2-month backup cycle. They would need 15,423,286,395 (15.4 billion) stonecutters to keep schedule, assuming they had unlimited stone. Last time I checked there were only between 6 and 7 billion people, with only a small fraction of them being stonecutters. That leaves but one solution. Force the web developers to become stonecutters. This would not only increase the workforce but also reduce the amount needed to back up, because fewer people will be making more web pages to back up.
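For anyone who wants to check the stonecutter math, here is a minimal sketch. It assumes the same inputs as the comment (1 PB = 2**50 bytes, one bit every 2 seconds, a 2-month backup window) plus a 365-day year; the function name `stonecutter_estimate` is made up for illustration.

```python
# Assumptions (from the comment, not from the Internet Archive):
# 1 PB = 2**50 bytes, one bit carved every 2 seconds, 2-month backup window.
PETABYTE = 2**50
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years


def stonecutter_estimate(petabytes=4.5, seconds_per_bit=2, window_years=2 / 12):
    """Return (bytes, seconds, years for one cutter, cutters needed)."""
    total_bytes = int(petabytes * PETABYTE)
    total_seconds = total_bytes * 8 * seconds_per_bit
    years_for_one = total_seconds / SECONDS_PER_YEAR
    # Work is assumed to split perfectly across cutters.
    cutters_needed = years_for_one / window_years
    return total_bytes, total_seconds, years_for_one, cutters_needed


total_bytes, total_seconds, years, cutters = stonecutter_estimate()
print(f"{total_bytes:,} bytes")
print(f"{total_seconds:,} seconds for one stonecutter")
print(f"{years:,.0f} years for one stonecutter")
print(f"{cutters:,.0f} stonecutters for a 2-month cycle")
```

Under those assumptions the figures in the comment hold up.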

    9. Re:Where do they store 4.5TB off site by Omniscient+Lurker · · Score: 4, Interesting

      Instead of writing in binary you could write the data in a base-36 format and then convert back to binary. The stone cutters could then store more data per glyph increasing their write rate considerably (and decreasing read rate) by amounts I am unwilling to calculate.
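A rough sketch of the amount left uncalculated above. It assumes a base-36 glyph takes the same carving time as a single binary mark, which is an assumption of this example and probably generous to the stonecutters; `glyphs_needed` is a hypothetical helper name.

```python
import math


def glyphs_needed(total_bytes, alphabet_size):
    """Minimum number of glyphs from the alphabet needed to encode total_bytes."""
    bits_per_glyph = math.log2(alphabet_size)
    return math.ceil(total_bytes * 8 / bits_per_glyph)


ARCHIVE_BYTES = int(4.5 * 2**50)  # same 4.5 PB figure as the parent comment

binary_marks = glyphs_needed(ARCHIVE_BYTES, 2)
base36_glyphs = glyphs_needed(ARCHIVE_BYTES, 36)

print(f"bits per base-36 glyph: {math.log2(36):.3f}")
print(f"glyph count shrinks by a factor of about {binary_marks / base36_glyphs:.2f}")
```

Each base-36 glyph carries a little over 5 bits, so the number of glyphs drops by roughly that factor. Whether the write rate improves by as much depends on how long a glyph takes to carve.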

  2. Story is meaningless without LOC measurement by Dr_Banzai · · Score: 5, Funny

    I have no idea how much 4.5 PB is until it's given in units of Libraries of Congress.

    1. Re:Story is meaningless without LOC measurement by Anonymous Coward · · Score: 5, Informative
    2. Re:Story is meaningless without LOC measurement by Wingman+5 · · Score: 5, Interesting

      From http://www.lesk.com/mlesk/ksg97/ksg.html: The 20-terabyte size of the Library of Congress is widely quoted and as far as I know is derived by assuming that LC has 20 million books and each requires 1 MB. Of course, LC has much other stuff besides printed text, and this other stuff would take much more space.

      1. Thirteen million photographs, even if compressed to a 1 MB JPG each, would be 13 terabytes.
      2. The 4 million maps in the Geography Division might scan to 200 TB.
      3. LC has over five hundred thousand movies; at 1 GB each they would be 500 terabytes (most are not full-length color features).
      4. Bulkiest might be the 3.5 million sound recordings, which at one audio CD each, would be almost 2,000 TB.

      This makes the total size of the Library perhaps about 3 petabytes (3,000 terabytes).

      so 230 libraries by the old standard or 1.5 by the new standard
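The conversion can be sketched as follows, using only the estimates quoted above and assuming 1 PB = 1024 TB. The dictionary keys are illustrative labels, not official collection names.

```python
# Per-collection size estimates in terabytes, as quoted in the comment.
loc_estimates_tb = {
    "books": 20,
    "photographs": 13,
    "maps": 200,
    "movies": 500,
    "sound recordings": 2000,
}

TB_PER_PB = 1024  # assumption: binary units
archive_tb = 4.5 * TB_PER_PB

loc_old_tb = loc_estimates_tb["books"]     # the widely quoted 20 TB figure
loc_sum_tb = sum(loc_estimates_tb.values())  # itemized total
loc_new_tb = 3 * TB_PER_PB                   # the rounded "about 3 petabytes"

print(f"itemized LoC total: {loc_sum_tb} TB")
print(f"old standard: {archive_tb / loc_old_tb:.1f} libraries")
print(f"new standard: {archive_tb / loc_new_tb:.1f} libraries")
```

The itemized estimates sum to a bit under 3 PB, which is where the rounded "about 3 petabytes" comes from.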

    3. Re:Story is meaningless without LOC measurement by Anonymous Coward · · Score: 4, Funny

      460.8 Lines of Code? What's that supposed to be? Hello World in COBOL?

  3. Storage Envy by jacksinn · · Score: 5, Funny

    Does lusting after all their space make me a peta-phile?

    --
    Life==Jeopardy. All the answers are right in front us - the hard part is coming up with the correct question.
  4. Own the internet! by Anonymous Coward · · Score: 5, Funny

    so all one needs to do to "own the internet" is to drive a big rig and ... lift the container off their parking lot?

    1. Re:Own the internet! by peragrin · · Score: 5, Funny

      well if you plug in a laser printer you can print off a hard copy for your boss.

      --
      i thought once I was found, but it was only a dream.
  5. Slight problem? by girlintraining · · Score: 5, Funny

    I can now theoretically steal "the internet" with a flatbed truck and a lift. There's something to be said for conventional data centers: They're rather hard to load onto a truck and drive off with.

    --
    #fuckbeta #iamslashdot #dicemustdie
    1. Re:Slight problem? by rackserverdeals · · Score: 4, Interesting

      Here's a video tour of one if you need it for reference.

      Don't forget to turn off the water and unplug the ethernet cables. Just be very careful with the power cords.

      --
      Dual Opteron < $600
  6. What about 1996 and earlier? by commodore64_love · · Score: 4, Interesting

    Are there any resources that let us see websites from 1996, 95, 94, or 93? I would love to revisit the web as it appeared when I first discovered it (1994 at psu.edu).

    --
    "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    1. Re:What about 1996 and earlier? by Tumbleweed · · Score: 4, Funny

      I would love to revisit the web as it appeared when I first discovered it (1994 at psu.edu).

      No, you wouldn't.

  7. They store 4.5PB in Egypt! by CannonballHead · · Score: 4, Funny

    The Internet Archive also works with about 100 physical libraries around the world whose curators help guide deep Internet crawls. The Internet Archive's massive database is mirrored to the Bibliotheca Alexandrina, the new Library of Alexandria in Egypt, for disaster recovery purposes.

    1. Re:They store 4.5PB in Egypt! by Anonymous Coward · · Score: 4, Funny

      Egypt could be a good choice. The area is fairly famous for reliable persistent storage. From papyrus scrolls to stone engravings, things tend to keep there better than most places. There really aren't many other geographical areas on earth that can claim the same kind of data retention rates over the time periods they've dealt with. Though despite their impeccable track record with avoiding hardware failures, they've done significantly worse when it comes to data loss due to theft and/or hackers/pirates.

      The one curious part about that choice is that the library at Alexandria is the one notable case where mass amounts of data were irreparably lost. So it's odd that they'd choose to entrust their data to that specific institution. Perhaps they felt that since it's under new management, the previous problems will have been resolved.

      However, had the choice been mine, I would have chosen to store my offsite data in Luxor. Its data retention was quite good, and included one data store that was preserved in its entirety for over 3000 years. As an added benefit, it seems that they've opened a second location that's significantly more convenient for the IA since there's no overseas transmission to worry about.

  8. In Other News by Erik+Fish · · Score: 5, Informative

    Incidentally: FileFront is closing in five days, taking with it any files that aren't hosted elsewhere.

    I am told that many of the Half-Life mods hosted there are not available anywhere else, so get while the getting is good...

  9. Never underestimate the bandwidth ... by Ungrounded+Lightning · · Score: 4, Insightful

    ... of a 4.5 petabyte datacenter in a shipping container in transit.

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
  10. You can ship it over OC-192... by Ungrounded+Lightning · · Score: 4, Interesting

    ... one would assume that something like this does regular off-site back-ups, which must add up to a hell of a-lot,..

    As I recall from one of Brewster's talks: Part of the idea was that you can install redundant copies of this data center around the world and keep 'em synced.

    You can ship 4.5 petabytes over a single OC-192 link in about 71 days.
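A back-of-the-envelope sketch of the transfer time, for reference. It assumes the nominal OC-192 line rate of 9953.28 Mbit/s; the `utilization` parameter is an assumption added here to stand in for framing and protocol overhead and anything else that keeps the link from running flat out.

```python
OC192_BITS_PER_SECOND = 9_953_280_000  # nominal OC-192 line rate
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes, utilization=1.0, link_bps=OC192_BITS_PER_SECOND):
    """Days needed to push total_bytes over the link at the given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = total_bytes * 8 / (link_bps * utilization)
    return seconds / SECONDS_PER_DAY


decimal_days = transfer_days(4.5 * 10**15)  # 1 PB = 10**15 bytes
binary_days = transfer_days(4.5 * 2**50)    # 1 PB = 2**50 bytes

print(f"decimal petabytes, full line rate: {decimal_days:.1f} days")
print(f"binary petabytes, full line rate:  {binary_days:.1f} days")
print(f"binary petabytes, 65% utilization: {transfer_days(4.5 * 2**50, 0.65):.1f} days")
```

At the full line rate this comes out somewhere between 40 and 50 days depending on how a petabyte is counted. A sustained utilization of roughly 60 to 70 percent would land near the 71-day figure.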

    --
    Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
    1. Re:You can ship it over OC-192... by TheGratefulNet · · Score: 5, Funny

      You can ship 4.5 petabytes over a single OC-192 link in about 71 days.

      yeah, but just at the 70th day, someone will pick up the phone and the whole thing will have to be resent.

      --
      "It is now safe to switch off your computer."
    2. Re:You can ship it over OC-192... by aaarrrgggh · · Score: 4, Insightful

      Or, you can ship the 40' containers in just under two weeks!

  11. Re:63 x 48 = 3024Tb by spinkham · · Score: 4, Informative

    TFA says "...eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives." (emphasis mine)

    I would guess this means there's an x4500 with 24TB in local disks, and 48TB in attached storage per machine. (24+48)*63 does give us the quoted number.
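The two readings can be compared in a few lines. The 24 TB local and 48 TB attached split is the guess made in this comment, not a figure from TFA.

```python
SERVERS = 63

# Reading 1: each server has 48 x 1 TB drives and nothing else.
drives_only_tb = SERVERS * 48

# Reading 2 (the guess above): 24 TB local plus a 48 TB attached array per server.
local_plus_array_tb = SERVERS * (24 + 48)

print(f"48 TB per server:      {drives_only_tb} TB")
print(f"24 + 48 TB per server: {local_plus_array_tb} TB (about 4.5 PB)")
```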

    --
    Blessed are the pessimists, for they have made backups.
  12. Re:"Sun Fire" by fm6 · · Score: 4, Interesting

    This seems to be an exact use case for the X4500-type system, which as far as I'm aware is pretty unique.

    Indeed. Sun is on a density kick. Check out the X4600, which does for processing power what the X4500 did for storage.

    In both cases, there actually are competing products that are sort of the same. The most conspicuous difference is that the Sun versions cram the whole caboodle into 4 rack units per system, about half the space required by their competitors.

    More absurdly-dense Sun products:

    http://www.sun.com/servers/x64/x4240/
    http://www.sun.com/servers/x64/x4140/

    The point of these systems is that they take up less expensive rack space than equivalent competitors. They're also "greener": if you broke all that storage and computing power down into less dense systems, you'd need a lot more electricity to run them and keep them cool. That not only saves money, it gives the owner the ability to claim they're working on the carbon footprint.

  13. The off-site backup IS the Internet. by billstewart · · Score: 4, Funny

    They're keeping the offsite backup distributed around the Internet, using the World-Wide Web to store it in real time.

    Part of it may even be on *your* machine! We've really got to stop Brewster from leeching all your storage and make him store his backup himself - this business of using the originals to back up the backup just isn't sustainable!

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks