Slashdot Mirror


Big Data's Invisible Open Source Community

itwbennett writes "Hadoop, Hive, Lucene, and Solr are all open source projects, but if you were expecting the floors of the Strata Conference to be packed with intense, bootstrapping hackers, you'd be sorely disappointed. Instead, says Brian Proffitt, 'community' where Big Data is concerned is 'acknowledged as a corporate resource', something companies need to contribute back to. 'There is no sense of the grass-roots, hacker-dominated communities that were so much a part of the Linux community's DNA,' says Proffitt."

8 of 49 comments

  1. Sorry by discord5 · · Score: 3, Insightful

    My basem^H^H^H^H^H hacker cave simply doesn't have any room for a storage array in the PB order.

    1. Re:Sorry by Anonymous Coward · · Score: 4, Interesting

      Parent poster nailed it.
      Try to get support from "the community" when you discover a bug in a code path that nobody except you encounters. Suddenly the community becomes very small indeed.
      There just aren't that many geeks out there who handle petabyte datasets. Prove me wrong, dear reader.

  2. So... I read the article... by bmo · · Score: 4, Interesting

    And I have to ask...

    What was the point of the article? That the trade show is like every trade show ever?

    Really, I'll write a report the next time I go to EASTEC and whine about the lack of "Makers" (in the geek culture sense of the word) among the vendors of Big Machinery.

    --
    BMO

  3. Some open source advocates... by blahplusplus · · Score: 3, Insightful

    ... must face the fact that lots of code is boring to maintain and update. Not to mention that, unless you are independently wealthy, contributing to open source is a drain on one's time and resources. No one should really be concerned that many corporations see value in open source; it's like seeing value in roads or sewers. There is much code that is just like roads and sewers, and it would be hard to maintain on a volunteer basis.

  4. A very simple explanation by Anonymous Coward · · Score: 5, Insightful

    "There is no sense of the grass-roots, hacker-dominated communities that were so much a part of the Linux community's DNA"

    This is for one simple reason: most hackers don't need "BigData".

    Perhaps if the typical hacker had a cluster of servers to play with, this would change. But as long as most hackers are bound to using a single personal computer, they're just not going to be very concerned with clusterware.

    They're also not concerned with plenty of other things that are essential to big corporations, like payroll software and CRM (customer relationship management) software.

    1. Re:A very simple explanation by evilviper · · Score: 4, Informative

      This is for one simple reason: most hackers don't need "BigData".

      Perhaps if the typical hacker had a cluster of servers to play with, this would change.

      "Most hackers" don't need a lot of things that are nevertheless developed as successful open source projects. Anybody think there's a huge audience for DReaM?

      Storage is getting big... Even a tiny shop can afford obscene amounts of storage. Each 2U server can have 6 x 2TB SATA (3.5") drives pretty inexpensively. As soon as you've got a dataset that needs more space than you can store on one of those, you'd benefit from these "big data" solutions, rather than the standby (more expensive) solution of "throw in a monster SAN".

      And you don't even need that much infrastructure. Virtual server (cloud) service providers aren't very expensive, particularly when you don't care about SLA, and will give you as big a cluster "to play with" as you could want.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    2. Re:A very simple explanation by scheme · · Score: 3, Insightful

      OTOH I'm sure hadoop and friends would be very useful for the LHC and other big science projects, but they are mostly taxpayer funded and are fighting to keep the dollars they're getting, not looking for new ways to spend it.

      HDFS is already used by CMS (one of the detectors at the LHC) to store and manage distributed filesystems at various regional centers. After all, when you are generating multiple petabytes each year and need to process it and keep various subsets of it around for analysis by various groups, you need filesystems that can handle multiple PB of files. And yes, I believe patches are being fed upstream as necessary. Other filesystems being used in the US include lustre, dcache, and xrootdfs.

      Although funding is an issue, continuing to run and analyze data from the LHC means that money needs to be spent to buy more storage and servers as needed and to pay people to develop and maintain the systems needed to distribute and analyze all the data being generated. Having multiple PB of particle collision data is useless if you can't analyze it and look for interesting events.

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
  5. How small is your basement? by oneiros27 · · Score: 3, Informative

    Internet Archive's last published generation Petabox (now more than a year old, so they were using smaller drives) would take two racks ... which is still reasonable, but you could probably fit it in a single rack with today's drives. A Backblaze Pod is 42 disks in 4U, so you could do it yourself and, assuming you can get enough large disks after that whole flooding thing, be able to get a PB in a single rack easily. The Sun Thumper took 48 disks in 4U ... I don't know if the X4540 ever supported larger than 1TB disks, though.

    My department just got a Nexsan E60 in yesterday ... 60 3TB disks in 4U, so you can squeeze 1.8PB raw in a 42U rack. (usable space ... still more than a PB, even with spares.)

    So, space isn't the issue ... power and cooling may be, though.

    --
    Build it, and they will come^Hplain.
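
The rack arithmetic quoted in the thread (60 x 3TB disks in 4U giving about 1.8PB raw per 42U rack, and 6 x 2TB drives per 2U server) can be checked with a short sketch. The function name and the assumption that the whole rack is filled with identical chassis, with no space reserved for switches or power, are illustrative and not from the original posts. Decimal units (1PB = 1000TB) are assumed.

```python
def raw_capacity_tb(disks_per_chassis, disk_tb, chassis_u, rack_u=42):
    """Raw (pre-RAID, pre-spare) capacity in TB of one rack.

    Assumes the rack holds only identical chassis; leftover rack
    units that cannot fit a whole chassis are left empty.
    """
    chassis_per_rack = rack_u // chassis_u
    return chassis_per_rack * disks_per_chassis * disk_tb


# Figures from the Nexsan E60 comment: 60 disks of 3TB in 4U.
nexsan_tb = raw_capacity_tb(disks_per_chassis=60, disk_tb=3, chassis_u=4)

# Figures from the 2U server comment: 6 disks of 2TB in 2U.
# Filling a full rack with these servers is an assumption.
small_server_tb = raw_capacity_tb(disks_per_chassis=6, disk_tb=2, chassis_u=2)

print(f"Nexsan E60 rack: {nexsan_tb / 1000} PB raw")
print(f"2U server rack:  {small_server_tb} TB raw")
```

Under these assumptions a 42U rack holds ten 4U chassis with 2U left over, which matches the 1.8PB raw figure; usable space after RAID and spares would be lower, as the comment notes.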