Slashdot Mirror


PetaBox: Big Storage in Small Boxes

An anonymous reader writes "LinuxDevices.com is reporting that a Linux-based system comprising more than a petabyte of storage has been delivered to the Internet Archive, the non-profit organization that creates periodic snapshots of the Internet. The PetaBox products, made by Capricorn Technologies, are based on Via mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article. Now to strap one of those puppies to my iPod!" The Internet Archive continues to astound.

57 of 295 comments

  1. Good to see. by Anonymous Coward · · Score: 5, Funny

    For all the jokes out there about people 'downloading the internet' it's good to know someone is actually doing it.

    1. Re:Good to see. by FireballX301 · · Score: 4, Funny

      Who the heck cares about the rest of the internet, can this thing hold all the pr0n?

    2. Re:Good to see. by Anonymous Coward · · Score: 5, Funny

      But does it run Lin... um.

      How about a Beo.. oh damn

    3. Re:Good to see. by bigberk · · Score: 2, Insightful

      people from my univ might recognize this... there was a famous guy in our engineering faculty who, back in the 90s, had written some kind of an automated porn downloading app. It was running on their UNIX servers but he left it running unattended. apparently he had no quota because within a few days he had filled up the entire system storage with porn, several hundreds of megabytes worth which was very substantial back then.

      I had a similar experience, I was playing around on irc back when we were swapping video files through DCC. apparently some downloading got out of hand and paged the admin, who contacted me and politely pointed out that I had a process running wild and filling /tmp... oops, must be an experiment gone wrong I had to say

    4. Re:Good to see. by Council · · Score: 5, Interesting

      In one of the weirder perspective exercises I've ever conceived:

      5 petabytes of storage is enough for a brief five-minute DVD-quality sex scene for each person of legal age in the US (two to a scene). 100 petabytes would be five minutes of porn of every pair of people in the world.

      I actually wonder about this a little; how many women have posed nude on the internet? There seem to be an awful lot; I haven't been able to see them all (though I will continue to try). Where do they mostly come from, I wonder.

      --
      xkcd.com - a webcomic of mathematics, love, and language.
    5. Re:Good to see. by Mark+Hood · · Score: 4, Funny

      There seem to be an awful lot; I haven't been able to see them all (though I will continue to try). Where do they mostly come from, I wonder.

      Let me get this straight, you're trying to see all the porn in the world, and you still don't know where babies come from? :)

      --
      Liked this comment? Why not buy me something nice
  2. Storage galore! by Bananatree3 · · Score: 2, Funny

    If, If only I could get a hold of one of those, I could Rival GOOGLE! Yes! I can become the next internet craze with my super, duper search engine crawling the web! I have the space, now I just need a connection in the middle of Alaska fast enough to rival google...

  3. You hear about the Petabox? by Dancin_Santa · · Score: 5, Funny

    Michael Jackson was heard breathing a sigh of relief. He thought it was where they sent Petafiles.

    R. Kelly was scrambling to find the company's phone number.

    1. Re:You hear about the Petabox? by pyrrhonist · · Score: 4, Funny
      Michael Jackson was heard breathing a sigh of relief. He thought it was where they sent Petafiles.

      Hmmm, this seems almost familiar...

      Let's analyze this situation:

      • The time on our posts is exactly the same.
      • There's a difference of only 3 in the post id values.
      • I was unable to foresee the R. Kelly connection.
      This can only mean one thing... You are the Kwisatz Haderach!

      GET OUT OF MY MIND!!!

      --
      Show me on the doll where his noodly appendage touched you.
  4. archive.org by Nasarius · · Score: 4, Interesting
    Internet Archive, the non-profit organization that creates periodic snapshots of the Internet.

    They do a lot more than that! I've just been downloading some Warren Zevon shows from their Live Music Archive.

    --
    LOAD "SIG",8,1
  5. copyright by DualG5GUNZ · · Score: 5, Interesting

    Not to sound like an advocate or anything... But how is it that the Internet Archives project resists claims of copyright infringement and the likes when they have copies of entire websites in their records?

    --
    "I'm a philosophy major. That means I can think deep thoughts about being unemployed." -- Bruce Lee
    1. Re:copyright by seifried · · Score: 3, Informative

      You can exclude them from your website using the robots.txt:

      User-agent: ia_archiver
      Disallow: /

      For example if you go to archive.org and plug my site into the wayback machine:

      We're sorry, access to http://www.seifried.org/ has been blocked by the site owner via robots.txt.

      and you can also request them to expunge your site from the archive.

      They go out of their way to make it easy to prevent your site being copied (more so than most search engines).
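
      A minimal sketch of how a crawler could honor the rule quoted above, using Python's standard `urllib.robotparser`. The `is_allowed` helper, the example.org URL, and the "SomeOtherBot" agent name are assumptions made for illustration; this is not a description of how the Internet Archive's crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the comment above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and second agent name, used only for the demonstration.
archiver_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", "http://www.example.org/index.html")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://www.example.org/index.html")

print("ia_archiver allowed:", archiver_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

      The rule only names `ia_archiver`, so other crawlers are unaffected unless they get their own entry or a `User-agent: *` block.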

    2. Re:copyright by IntergalacticWalrus · · Score: 2, Interesting

      If they actually did that, the archive would be worthless.

      Besides, the IA only archives HTML pages, and small images in them, nothing else. If you consider your HTML content to be unreproducible copyrighted material, might I ask why the hell is it publicly accessible on the Web in the first place?

    3. Re:copyright by spacefight · · Score: 2, Interesting

      The Internet Archive is fucking up big time with their robots.txt stuff. If you exclude a site from being shown, it doesn't show anything, correct. But if this site goes offline, the archived pages of that former site are all available, not blocked at all.

    4. Re:copyright by trifish · · Score: 2, Interesting

      But how is it that the Internet Archives project resists claims of copyright infringement and the likes when they have copies of entire websites in their records?


      Did you ask this question when Google introduced site cache several years ago?

    5. Re:copyright by generic-man · · Score: 2, Insightful

      Yes, I did. I got two responses, neither of which answered my question.

      1. FAIR USE!
      2. Google is merely providing a service. If you don't like it you can opt out.

      The Google Cache is not fair use, as it reproduces the entirety of a web page's text for none of the purposes for which Fair Use is defined. (Under Fair Use you are entitled to use a portion of a copyrighted work, not the whole thing.)

      The second one just cracks me up. I thought the Slashdot crowd didn't like being asked to opt out.

      Now, trifish, how can the Internet Archive evade copyright laws by reproducing the entirety of many copyrighted pages? Don't try and argue that they're a library. Libraries buy books; they don't photocopy them.

      --
      For more information, click here.
    6. Re:copyright by generic-man · · Score: 2, Insightful

      If you consider your HTML content to be unreproducible copyrighted material, might I ask why the hell is it publicly accessible on the Web in the first place?

      If you consider your music to be copyrighted material, might I ask why the hell it's being played on the radio in the first place?

      If you consider your book to be copyrighted material, might I ask why the hell it's being lent out in the library in the first place?

      If you consider your movie to be copyrighted material, might I ask why the hell it's being broadcast on HBO in the first place?

      Just because something is available for free doesn't mean that the producer has granted you a permanent license to distribute it for commercial gain, as Google does with its cache.

      --
      For more information, click here.
    7. Re:copyright by generic-man · · Score: 2, Informative

      Just because you have a cache of something doesn't give you the right to redistribute it for commercial gain. The initial author still retains ownership.

      Imagine if you had a device designed to record audio and reproduce it. That doesn't mean that you can resell your recordings; the original author retains ownership.

      I'm not claiming that it is unethical to cache web pages, just that companies such as Google presume that they have the right to redistribute content to which they own no rights. The web is not like Usenet, where each server hosts others' posts; content is served by an author for as long as the author wants.

      --
      For more information, click here.
  6. Petabox? by eclectro · · Score: 4, Funny


    Isn't that what naked girls climb out of to protest fur coats?

    Thank you, I'll be here all week.

    --
    Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
    1. Re:Petabox? by Anonymous Coward · · Score: 2, Funny

      Actually, it's what geeks would like to do, but are seldom given the chance.

  7. IPod? by NegativeOneUserID · · Score: 2, Funny

    Right, sure, like anyone believes that you want that much storage for music. You just want to use it for pr0n.

    1. Re:IPod? by BlackMesaLabs · · Score: 2, Funny

      Decide to use it for "Pr0n" and you're gonna NEED a beowulf cluster of them...

  8. great usage. by Bananatree3 · · Score: 4, Informative

    Seriously, I think archive.org deserves such a storage system. I have very often wanted to go back to view an archive of a website from a while ago, but the cache on Google was from yesterday. It also gives multiple archives of the website based on day, which can be quite handy, especially for news-related sites. I think they quite well deserve it.

  9. Re:Downloading Kazaa by HyperChicken · · Score: 3, Informative

    Not "periodic", continuous. Own a website? Check your logs for the user-agent "ia_archiver".

    --
    Free of Flash! Free of Flash!
  10. 'small box' by MonoSynth · · Score: 5, Funny

    So the inventor of the microprocessor dies and suddenly the definition of 'small box' for computer components is again reduced to 'fits in a big room'....

  11. Puppies by Sinner · · Score: 3, Funny
    An anonymous reader writes "LinuxDevices.com is ... according to the article. Now to strap one of those puppies to my iPod!"
    I'm sorry, baby dogs? That's so last week. I've got an arctic seal pup strapped to my iPod. You should see the looks I get on the subway. Bling, baby, Bling.
    --
    fish and pipes
    1. Re:Puppies by Sinner · · Score: 2, Funny

      You gonna eat that?

      --
      fish and pipes
  12. maybe i'll be quoted in 15 years.. by qda · · Score: 4, Funny

    "nobody needs more than a perabyte of storage"

    1. Re:maybe i'll be quoted in 15 years.. by Anonymous Coward · · Score: 2, Funny

      Well, I'd hope somewhere along the line somebody will fix that typo for you. Otherwise, you'll forever be quoted as "nobody needs more than a perabyte [sic] of storage."

  13. Electricity $$$ ? by kasnol · · Score: 3, Funny

    Wow - have they calculated how much the running cost per day is? I might just stay with my iPod instead for the time being~
    Haha~

    1. Re:Electricity $$$ ? by TheFlyingGoat · · Score: 2, Informative

      50kW at 10 cents per kilowatt hour = $120/day.

      I doubt it draws at a constant 50kW, though. It's probably an average (was given in TFA).

      My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily? :)
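
      For what it's worth, the arithmetic above can be checked with a short sketch. It assumes a constant 50 kW draw around the clock and a flat $0.10 per kWh, which are the figures from the comment rather than measured values; the function name is made up for this example.

```python
def daily_energy_cost(power_kw: float, price_per_kwh: float, hours_per_day: float = 24.0) -> float:
    """Cost of running a load for one day.

    Energy in kWh is simply power in kW multiplied by the hours it runs.
    """
    energy_kwh = power_kw * hours_per_day
    return energy_kwh * price_per_kwh


# 50 kW for 24 hours is 1200 kWh; at $0.10/kWh that is about $120.
cost = daily_energy_cost(50.0, 0.10)
print(f"Daily cost: ${cost:.2f}")
```

      So yes, it is just kW times hours used, then times the price per kWh.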

      --
      You have enemies? Good. That means you've stood up for something, sometime in your life. --Winston Churchill
  14. 1.5 Petabytes? by TheFlyingGoat · · Score: 3, Interesting

    Where can you purchase 600GB drives these days? (1.5PB / 2500 drives)

    The math doesn't work when you multiply the number of systems out either: 600 systems * 1.6TB/system = 960TB. That's just under a petabyte, or am I missing something?

    Also, if you've got those in a RAID5 setup, you're 'only' talking about approx 800TB of usable space. That's far less than the 1.5 petabytes claimed.

    800TB is a lot of space, but there must be a cheaper/easier way than purchasing 600 systems to do it.
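
    A rough sketch of the capacity figures discussed above, using decimal units (1 PB = 1000 TB). The RAID5 group sizes are assumptions: the comment does not say which layout its "approx 800TB" is based on. With one parity drive per 4-drive node the usable figure would be 720 TB, while 800 TB would correspond to 6-drive groups.

```python
def per_drive_gb(total_tb: float, drives: int) -> float:
    """Average capacity per drive in GB, given a total in TB."""
    return total_tb * 1000.0 / drives


def raw_capacity_tb(nodes: int, tb_per_node: float) -> float:
    """Raw capacity of a cluster of identical nodes, in TB."""
    return nodes * tb_per_node


def raid5_usable_tb(raw_tb: float, drives_per_group: int) -> float:
    """Usable capacity when one drive's worth per RAID5 group holds parity."""
    return raw_tb * (drives_per_group - 1) / drives_per_group


drive_size = per_drive_gb(1500.0, 2500)   # implied drive size for 1.5 PB
raw = raw_capacity_tb(600, 1.6)           # 600 nodes at 1.6 TB each
usable_4 = raid5_usable_tb(raw, 4)        # assumed: RAID5 inside each 4-drive node
usable_6 = raid5_usable_tb(raw, 6)        # assumed: 6-drive RAID5 groups

print(f"Implied drive size: {drive_size:.0f} GB")
print(f"Raw capacity: {raw:.0f} TB")
print(f"RAID5 usable (4-drive groups): {usable_4:.0f} TB")
print(f"RAID5 usable (6-drive groups): {usable_6:.0f} TB")
```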

    --
    You have enemies? Good. That means you've stood up for something, sometime in your life. --Winston Churchill
    1. Re:1.5 Petabytes? by TheFlyingGoat · · Score: 2, Informative

      No. They say 2500 drives (actually 2400 since it's 4 per system in 600 systems), which comes out to 600GB per drive for 1.5PB.

      --
      You have enemies? Good. That means you've stood up for something, sometime in your life. --Winston Churchill
  15. Slashdotted .... by theoddbot · · Score: 4, Informative
  16. No redundancy? WTF? by melted · · Score: 2, Informative

    I've actually read TFA. They recommend JBOD configurations to their clients. One drive goes titsup and you've lost 400GB of data. Do they at least offer some kind of mirroring/redundancy solution to back the data up to another array?

    1. Re:No redundancy? WTF? by Depili · · Score: 4, Informative

      According to archive.org (http://www.archive.org/web/petabox.php) they do indeed have some redundancy, but not RAID. They operate each system as a separate node and mirror the nodes. The above link also sheds light on other questions regarding TFA.

    2. Re:No redundancy? WTF? by puhuri · · Score: 2, Interesting

      Archive.org maintains its archives in several geographically different locations, and files are mirrored between those sites. If one disk or node breaks, you still have two or more copies of that material.

      If you archive serious amounts of data, redundancy within a node is not the best solution; distributing information between systems is. For very important data, you can have as many copies as you have nodes; less important data may have just a single copy. If it gets lost, then ok, shit happens, but so what. For example, I have just a single copy (no backups, partly RAID) of 10 TiB of data (and that data is not available from a P2P shop) because it is not economically viable to make backups. On the other hand, I have some data in 5 geographically diverse copies, both on-line and off-line.

  17. A Great Historical Tool by simrook · · Score: 5, Insightful

    The Internet represents a great historical tool. Case in point is what happened on 9/11. Being able to go back and see the progression, paranoia, patriotism, and early Iraq/Afghanistan/bin Laden/Hussein posts and opinions on various news sites is amazing. CNN, Fox, the NY Times, all are archived several times on 9/11 on archive.org.

    I for one think that archive.org should turn into some UN effort, with a mission to chronicle and store daily/timely snapshots of the internet and the culture at the time, preserving it for future generations. What a tool for future historians!

    The ability to look at a large representation of society at one single critical moment in time, and being able to have first-hand sources for all that information, is something that can truly change the way history is recorded (and not in the bad newspeak ingsoc way either). In fact, a holistic archive of what happens day-to-day, in an easily accessible format, might well help written history to be more representative of actual history (instead of, say, the history Bush wants us to believe: that the Iraq war was for human rights and not WMDs). I love Foucault.

    The internet archive rocks... really hope this project continues full blast.

    - Peace

    --
    'Truth' is linked in a circular relation with systems of power which produce and sustain it...
    1. Re:A Great Historical Tool by venicebeach · · Score: 2, Funny


      Yes, otherwise such cultural gems as goatse.cx would be lost into the void forever...

    2. Re:A Great Historical Tool by Anonymous Coward · · Score: 2, Insightful

      The 9/11 targets were chosen in a way everyone would notice. Not exactly amazing that it's well reported on; it would have been if it had happened 20 years ago. But that was just a single attack. If you look at the much bigger recent events that you mention, like the war on Iraq, you'll see that there really is hardly any detailed reporting. You have a lot of propaganda by the attackers, some propaganda from the Iraqi government, and some reports by angry people getting in the middle. You still have a completely unclear view of what happened.

      We already had people writing diaries and making lots of pictures in WWII. The improvement isn't that great.

    3. Re:A Great Historical Tool by PReDiToR · · Score: 2, Insightful

      (and not in the bad newspeak ingsoc way either)

      Funny you should mention that, but this whole "Internet as history" thing has me wound up tight.

      Books cannot be changed. They can be destroyed, reprinted and banned but the first edition will always exist in a collection.
      The first edition of a website only exists in digital form and there is no way to stop the original from being edited and timestamped back to the expected date.

      The IA is the MiniTruth's dream come true.

      But who cares? History has always been written by the victorious, hasn't it?

      --

      Do not meddle in the affairs of geeks for they are subtle and quick to anger
  18. The MPAA and RIAA by PrivateDonut · · Score: 3, Interesting

    are going to make a killing off the IA when they have finished. It isn't like they haven't made enough money off others as it is, so they may let this one slide in the name of conserving data. On that note, is the IA downloading EVERYTHING or selectively downloading to prevent such issues as copyright infringement?

  19. Wayback and Slashdot by mcrbids · · Score: 4, Funny

    Go ahead. Try Slashdot in the wayback machine.

    Slashdot has looked virtually identical since 1998!

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
    1. Re:Wayback and Slashdot by pcgabe · · Score: 2, Informative
      Linky Goodness:

      http://web.archive.org/web/19981111190256/http://slashdot.org/

      Highlights:
      • Episode 1 teaser sheets
      • Does the world really need a 25 gig drive?
      • Patents: how do we keep software free?
      Oh, how far we've come.
      --
      Don't put advice in your sig.
    2. Re:Wayback and Slashdot by hawk · · Score: 4, Funny
      Oh, c'mon. It's not that bad.

      Why, just last year they introduced an entirely new story into the rotation of duplicates . . .

      :)

      hawk

  20. Re:Mega Systems by name773 · · Score: 2, Funny

    large bundles of neatly organized cable... ohh man.

  21. Re:What's wrong with hot swap and RAID 5? by imsabbel · · Score: 2, Interesting

    Because you are comparing apples to oranges.

    They don't use hot swap and RAID5 for the same reason Google doesn't run on mainframes:
    It's just cheaper to let higher-level logic take care of that stuff instead of strapping redundancy onto every node...
    Why hot swap if it isn't needed? The rest of the node will be mirrored somewhere else, so for the cost of fitting out everything with HS bays you could get 5 or 10% more nodes...
    Same for RAID5: good high-performance RAID5 controllers would increase the system cost by 50% or something. And then it's not less expensive than just mirroring nodes.

    --
    HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
  22. Re:What's wrong with hot swap and RAID 5? by tim_uk · · Score: 2, Interesting
    Why the hell are the reports of these guys so far from what the accepted industry practice is, according to IT magazines?

    GOK, I have 3PB of storage synchronised across two data centres here, all in 7+1 RAID5. Mostly self-healing too: if a drive pops, then a spare drive in the same array builds itself into that stripe set, enabling hot replacement of the dead drive.

    I would love to know what their "painful experience" was!

    Using JBOD for this seems a tad courageous, to say the least.

    And then, of course, there's backup...

  23. They don't like RAID by billstewart · · Score: 4, Interesting
    I was a bit puzzled by that also - the article said the things come in racks of 40 or 64TB, and 16 racks times 64TB is about 1PB, not 1.5.

    Also, the article says they don't like RAID, due to bad experiences with RAID5, and the system is configured as JBOD (Just a Bunch Of Disks). It doesn't say why, or what users should do to get equivalent protection. My guess is that depending on RAID within a box means you're still vulnerable if the box's CPU or disk controller decides to scribble the disks, or the power supply decides to catch fire or short out and deliver 240VAC on the +5V line or whatever. So if you want a RAID-like set of redundancy, set up your applications or file system mounting or something to calculate the protection disk in software and hand it off to another 1U box for storage.

    The overhead of the motherboards here is not that high - they're about $150-200, and support 4 disks that probably cost $200-300 each, so they're only about 20% of the cost, which is not bad. The article didn't say they're using SATA, and it sounded like it's some IDE variant instead, but if you're only using 100 Mbps Ethernet to connect to the box and not the optional GigE, it's not the bottleneck anyway. If you wanted an alternative design, you could probably do something with a couple of 4-way SATA controllers per CPU, with a lot of disks stacked vertically in a 3-4U box looking like an X-serve or something. But that wouldn't necessarily have much of an advantage.
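
    The motherboard overhead can be sanity-checked with a quick sketch. It uses only the price ranges quoted above ($150-200 per board, $200-300 per disk, 4 disks per board) and ignores chassis, power supplies, and networking, so it is an illustration rather than a real cost model; the function name is made up for this example.

```python
def board_cost_share(board_price: float, disk_price: float, disks_per_board: int = 4) -> float:
    """Fraction of the board-plus-disks cost that goes to the motherboard."""
    return board_price / (board_price + disks_per_board * disk_price)


# Cheapest board paired with the most expensive disks: smallest share.
low_share = board_cost_share(150.0, 300.0)
# Most expensive board paired with the cheapest disks: largest share.
high_share = board_cost_share(200.0, 200.0)

print(f"Motherboard share of node cost: {low_share:.0%} to {high_share:.0%}")
```

    Under these assumptions the share runs from roughly 11% to 20%, so "about 20%" is the upper end of the range.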

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
    1. Re:They don't like RAID by budgenator · · Score: 2, Informative
      "Although Hitachi does not offer an 'enterprise' or '24x7' SATA drive, our testing found their drives to be as reliable as anything out there, enterprise distinction or not," Saikley said.

      I read that as SATA drives. What I wonder about is
      PetaBoxes are ~$2.00/GB per the article
      while
      Coraid, priced at $1,995.00 + (4*$314.99 hard drives) = 3918.94 + 664.00( 15U tabletop rackmount) or ~$0.41/GB per my calculations;
      looks like a price war is brewing here unless PetaBox has some serious kW-in/BTU-out or performance advantages.
      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
  24. Re:Mandatory by MrDoh! · · Score: 2, Funny

    Ah, you must be new here.
    (sorry)

    --
    Waiting for an amusing sig.
  25. Not a big improvement... by paulatz · · Score: 2, Interesting

    It was 3 or 4 years ago when I saw a 600-terabyte (0.6-petabyte) tape-based storage system at CERN.

    --
    this post contain no useful information, no need to mod it down
  26. Re:No RAID?! by iamplasma · · Score: 3, Insightful

    Yeah, but the thing is that the storage is spread out between lots of different 1U units, each with either 1 or 1.6TB. So to make a RAID5 over 1.6TB in size, you'd have to cross over multiple machines, adding a serious overhead, especially when you have to calculate parity for the parity drive. On the other hand, if you only did RAID 5 in the individual units, it'd be pretty pointless, because with that many units you'd be crazy to rely on no entire machine failures.

    So, while yes, if it really was just one giant supercomputer with a bajillion hard drives in it, RAID 50 would be an ideal solution (as long as the stripes were large enough to prevent too many accesses crossing too many drives, the one big advantage of JBOD here), but that's not what's really in use here.

  27. Two points by Salamander · · Score: 4, Interesting

    First off, this isn't quite an example of a company suddenly deciding to donate stuff to the Archive. As can be seen on their own website, Capricorn was spun off from the Archive on July 1, 2004. To a large extent, Capricorn exists for the specific purpose of providing storage to the Archive, and if that same storage can be sold to others so much the better.

    Second, what about interconnects and performance? The product descriptions say nothing about SCSI or FC or other storage-oriented connectivity, so one must assume that the connection to these boxes is through a network. That would mean each node is an NFS server (or similar), serving up 1.6TB using a 1GHz C3 processor, a maximum of 1GB of memory (for caching etc.) and what appears to be a single GigE link. Can you say unbalanced? The Internet Archive might be the only system with an access pattern so sparse that the ratio between capacity and performance wouldn't be crippling. Don't try using one of these with any other kind of application if performance is a concern...and BTW they don't seem to say anything about high availability or other storage functionality (e.g. integrated backup or snapshots) either. Capricorn's big play seems to be power consumption, but there are other players that can beat them on density (e.g. Copan with 224TB per rack) and multitudes who can offer better performance/functionality. I hate to sound negative, but this is a product so specialized as to be uninteresting.

    Disclaimer: I think I met some of the Copan guys once and they seemed cool enough, but there's no other relationship between me and them. That just happened to be the first name I thought of in this space.

    --
    Slashdot - News for Herds. Stuff that Splatters.
  28. Once upon a time by QMO · · Score: 3, Funny

    I was driving to work. It wasn't a long drive, but more than 5 minutes.

    "Macarena" was on the radio when I started the car. A few minutes later "Macarena" was still on, and I was thinking that the song must be longer than I thought, or something. About then the DJ came on and said "We're playing 'Macarena' until you vomit." Then played the song again.

    After that iteration of the song the DJ came back and played some phone calls of people begging him to change the song, but he just said that it was "Macarena" until you vomit.

    I don't know when the thing started, but by the time I got to work it was the 17th or so "Macarena" in a row.

    --
    Exam 4/C again. Maybe I'll do better this time.
    1. Re:Once upon a time by NaDrew · · Score: 2, Informative
      About then the DJ came on and said "We're playing 'Macarena' until you vomit." Then played the song again.

      After that iteration of the song the DJ came back and played some phone calls of people begging him to change the song, but he just said that it was "Macarena" until you vomit.

      I don't know when the thing started, but by the time I got to work it was the 17th or so "Macarena" in a row.

      This is called stunting. Radio stations do it to mark a transition between formats, apparently in an attempt to drive off listeners to their previous format.
      --
      Vista:XPSP2::ME:98SE
  29. Re:NAS or SAN or ??? by TTK+Ciar · · Score: 2, Informative

    The Petabox is shipped to a customer running Debian Linux by default (though of course you can install whatever you want), so there are a number of solutions to choose from. OpenAFS and (as you pointed out) GFS are made specifically for this kind of setup, providing fairly good abstraction of the underlying cluster and easy access to random data. Within The Archive, we have experimented with different approaches, the one currently in production using an API based on a UDP locator service and rsync.

    Another approach uses a /net directory under which remote filesystems are NFS-mounted on demand (I'm not sure how it works, our chief sysadmin set it up for testing, but if /net/ia105783/0/foo is not mounted, and then you type 'ls /net/ia105783/0/foo' (or any other command which opens a hypothetical file off /net), the remote filesystem is automagically NFS-mounted so that the command can complete).

    I'm not sure that we'll ever use it in production to access our distributed information, though; NFS has a very, very low error rate, but when you have thousands of NFS mounts going on at once (as we do NFS-mount users' /home directories everywhere), "very, very low" translates to "tripping over errors every few days". I've seen some really weird NFS failures and partial failures at The Archive, and I've written some software to be tolerant of them, but most of our software is not, and realistically speaking never will be. It's written to be tolerant of rsync errors instead. *shrug*, six of one, half a dozen of the other. This is one of those things where you need to just pick a solution and use it, whether it's OpenAFS, GFS, NFS, or some homespun thing. All have their pros and cons, and you'll learn to deal with their problems as you use them.

    -- TTK