Slashdot Mirror


Big Data's Invisible Open Source Community

itwbennett writes "Hadoop, Hive, Lucene, and Solr are all open source projects, but if you were expecting the floors of the Strata Conference to be packed with intense, bootstrapping hackers you'd be sorely disappointed. Instead, says Brian Proffitt, 'community' where Big Data is concerned is 'acknowledged as a corporate resource', something companies need to contribute back to. 'There is no sense of the grass-roots, hacker-dominated communities that were so much a part of the Linux community's DNA,' says Proffitt."

49 comments

  1. Sorry by discord5 · · Score: 3, Insightful

    My basem^H^H^H^H^H hacker cave simply doesn't have any room for a storage array in the PB order.

    1. Re:Sorry by Anonymous Coward · · Score: 0

      My basem^H^H^H^H^H hacker cave simply doesn't have any room for a storage array in the PB order.

      That's because you got the order backwards, it's BP (Brian Proffitt) not PB.

    2. Re:Sorry by Anonymous Coward · · Score: 4, Interesting

      Parent poster nailed it.
      Try to get support from "the community" when you discover a bug in a code path that nobody except you encounters. Suddenly the community becomes very small indeed.
      There just aren't that many geeks out there who handle petabyte datasets. Prove me wrong, dear reader.

    3. Re:Sorry by martin-boundary · · Score: 2, Insightful
      Well, you really shouldn't be debugging code on petabyte datasets to begin with. If there's a bug that shows, there's a minimal dataset on which the bug shows, and that's the dataset you can ask help with.

      In general, you should always develop code on a tiny sample of the dataset. Once it's fully debugged and works correctly, then you apply it on your petabyte dataset.
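
      One way to hunt for such a minimal dataset, sketched in Python. This is a simple greedy reducer, not a full delta-debugging implementation; `shrink` and `fails` are hypothetical names, and `fails` stands in for whatever check reproduces the bug on a list of records.

```python
def shrink(records, fails):
    """Greedily drop chunks of records while the failure still reproduces."""
    assert fails(records), "the full dataset must reproduce the bug"
    chunk = len(records) // 2
    while chunk >= 1:
        i = 0
        while i < len(records):
            # Try the dataset with records[i:i+chunk] removed.
            candidate = records[:i] + records[i + chunk:]
            if candidate and fails(candidate):
                records = candidate   # chunk was irrelevant, keep it dropped
            else:
                i += chunk            # chunk is needed, move past it
        chunk //= 2                   # retry with finer-grained chunks
    return records


# Toy bug that only shows when two particular records are both present.
needs_both = lambda rs: 137 in rs and 842 in rs
minimal = shrink(list(range(1000)), needs_both)
print(minimal)
```

      The reducer only helps when the bug is a property of the data; it will not reproduce problems that depend on sheer volume or timing.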

    4. Re:Sorry by Fluffeh · · Score: 1

      Try to get support from "the community" when you discover a bug in a code path that nobody except you encounters. Suddenly the community becomes very small indeed.

      I disagree. If you know how to identify the bug properly and present a solution on how to solve it, show that you did a little research and aren't just a) totally lazy, b) incompetent or c) whining that it doesn't solve all your problems out of the box without understanding it then you will often find the folks helpful. The open source community aren't any different to say the folks that support the software in your office. If you start talking to a tech with "I can't send email, can you fix my windows?" you will likely get the same sort of reply. If you say, "Hey, it looks like my network connection is being refused, I rebooted and I haven't changed any credentials recently, can you look into my account?" you will more likely get a more helpful answer.

      --
      Moved to http://soylentnews.org/. You are invited to join us too!
    5. Re:Sorry by Hal_Porter · · Score: 2, Interesting

      http://adequacy.org/stories/2001.10.2.33542.4010.html

      The Linux Fault Threshold is the point in any conversation about Linux at which your interlocutor stops talking about how your problem might be solved under Linux and starts talking about how it isn't Linux's fault that your problem cannot be solved under Linux. Half the time, the LFT is reached because there is genuinely no solution (or no solution has been developed yet), while half the time, the LFT is reached because your apologist has floundered way out of his depth in offering to help you and is bullshitting far beyond his actual knowledge base. In either case, a conversation which has reached the LFT has precisely zero chance of ever generating useful advice for you; it is safe at this point to start calling the person offering the advice a fucking moron, and basically take it from there. Here's an example taken from IRC logs to help you understand the concept.

      <jsm> Why won't my fucking Linux computer print?
      <linuxbabe> what printer r u using?
      <jsm> I don't know. It's a Hewlett Packard desktop inkjet number
      <linuxbabe> hewlett r lamers. they dont open source drivers <------LFT closely approached!
      <linuxbabe> but we reverse engineered them lol. check the web. or ask hewlett for linux suuport??<------ but avoided, he's still talking about the problem
      <jsm> Thanks. I already did that. But I can't install the drivers on my fucking computer. I've got a floppy disk from HP, but my floppy drive is a USB drive and Linux doesn't have fucking USB support.
      <linuxbabe> linux DOES have USB support!!!!!!
      <jsm> yeh for fucking infrared mice, and for about a thousand makes of webcam it does. Get real here. For my fucking floppy disk drive, I am telling you through bitter experience it does not. Even if someone has written the drivers in the last week
      <jsm> which I sincerely doubt, how the hell am I going to install them given that my floppy drive doesnt work?????
      <jsm> this ought to be in the kernel. what good is a fucking operating system that doesnt operate?
      <linuxbabe> Imacs dont have floppy drives at all <----- useless point, but not LFT. All apologists make pointless jabs at other OSs
      <linuxbabe> so you ought to be greateful that Linux does. drivers like that shouldn't be bundled in the kernel
      <linuxbabe> makes it into fucking M$ bloatware. bleh
      <linuxbabe> download drivers from the web!!!! apt-get is your friend
      <jsm> So everyone keeps telling me. Unfortunately the fucking modem doesn't work under Linux either, and since the Linux installation destroyed Windows, that leaves me kind of fucked.
      <linuxbabe> Linux doesnt destroy windows
      <jsm>mandrake installer does. It "resized" my Windows partition and now the fucker won't work
      <linuxbabe> you shuold have defragmented. windows scatters data all over your hard drive so the installer cant just find a clean chunk to install into. it isn't linux fault <---- distinct signs of LFT being approached
      <linuxbabe> that windoze disk management blows
      <jsm> so why doesn't my fucking modem work?
      <linuxbabe> what computer hav u got
      <jsm> A Sony Vaio PCG
      <linuxbabe> that doesn't have a modem
      <jsm> I assure you it fucking does. I used to use it to check my email back in the days when Windows worked.
      <linuxbabe> its got a winmodem. thats not a modem <----- nitpicking over technical terms is a sign of impending LFT
      <jsm> what do you mean?
      <linuxbabe> a winmodem isnt a proper modem. it just uses proprietary windoze apis. doesnt do the work of a modem at all.

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    6. Re:Sorry by Anonymous Coward · · Score: 1

      I don't see where the problem is. The solution for jsm's problem was pretty clear from the start. All he had to do was type in the binary driver using gestures from his infrared mouse. All that swearing seemed uncalled for.

    7. Re:Sorry by Anonymous Coward · · Score: 1

      My point about the community becoming small was not that you get shut out because you don't know how to ask questions politely or properly, but because you genuinely are encountering behaviour so rare that almost nobody in the mainstream community knows how to help you.
      For what it's worth, I do my research, including reading the source and using gdb to interrupt running processes.

    8. Re:Sorry by scheme · · Score: 2

      Well, you really shouldn't be debugging code on petabyte datasets to begin with. If there's a bug that shows, there's a minimal dataset on which the bug shows, and that's the dataset you can ask help with.

      In general, you should always develop code on a tiny sample of the dataset. Once it's fully debugged and works correctly, then you apply it on your petabyte dataset.

      Some bugs and issues don't show up until you get to a certain scale. Consider race conditions that occur only rarely: unless you hit a certain scale, you may never see them. To give another pertinent example, consider something that corrupts one byte in a PB (maybe it's a very infrequent condition or something); until your dataset grows to multiple PB, you may not even see it. Or consider the issue that occurs on RAID arrays where you get a second drive failure when rebuilding an array after a drive has failed and been replaced. Until individual drives hold enough data that rebuilding the array takes a significant amount of time, you'll probably never see this failure scenario, and your code may not even be aware that this is something it needs to be able to handle.
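
      A rough back-of-the-envelope sketch of the one-byte-in-a-PB case, in Python. It assumes corruption events are independent and arrive at an average rate of one per PB (a Poisson model, which is my assumption and not something measured); `p_at_least_one` is a hypothetical helper name.

```python
import math


def p_at_least_one(dataset_pb, events_per_pb=1.0):
    """Chance of seeing at least one corruption event in a dataset of the
    given size, assuming independent events at a fixed average rate."""
    expected_events = events_per_pb * dataset_pb
    # 1 - exp(-x), computed via expm1 to stay accurate for tiny x
    return -math.expm1(-expected_events)


for size_pb in (0.001, 0.1, 1.0, 5.0):
    print(f"{size_pb:>6} PB -> {p_at_least_one(size_pb):.4f}")
```

      Under that model a terabyte-sized test set almost never shows the problem, while a multi-PB dataset almost always does.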

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
    9. Re:Sorry by justforgetme · · Score: 2

      Well yes, that is primarily how you do it.

      Big data work is much closer to academic research than it is to casual software development work. As is ML and the like.
      It is quite obvious that at higher strata of specialization there are fewer specialists. Ask any scientist seriously involved in research
      where he finds community specialists to discuss various bugs. The fact is that they don't. They go around mostly asking for
      opinions and fix the bug themselves (which usually includes writing some documentation about it).

      The whole article is stating the obvious: there are fewer specialists at the bleeding edge of research. Which is true in and of itself and
      is made worse by the fact that this research is done by huge proprietary enterprises.

      --
      -- no sig today
    10. Re:Sorry by Anonymous Coward · · Score: 0

      > There just aren't that many geeks out there who handle petabyte datasets.
      There are if you count their, err, "movie" collection.

    11. Re:Sorry by Anonymous Coward · · Score: 0

      A PB takes up about 2m^3, if your hacker cave does not have this much space, it's not called a cave but a hole.

    12. Re:Sorry by amorsen · · Score: 1

      Everything was lost here:

      Why won't my fucking Linux computer print?

      The rest could have been easily avoided by doing a kick/ban at that point.

      --
      Finally! A year of moderation! Ready for 2019?
    13. Re:Sorry by Anonymous Coward · · Score: 0

      ZFS on FreeBSD:

      "It is a 128 bit file system, enabling it to address 18 quintillion times more data than 64 bit systems. The limitations found in ZFS are designed specifically to be large enough to never be encountered (within the known limits of physics, and the number of atoms in the earth’s crust to construct a storage device of this magnitude). The other features include a copy on write transactional model, snapshots and clones, dynamic striping, variable block sizes, lightweight file system creation, cache management, adaptive endianness, and deduplication (to name a few of the more common features)."

      nuff said. ;-)

      http://wiki.freebsd.org/ZFS

    14. Re:Sorry by Hal_Porter · · Score: 1

      Trolololol!

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  2. So... I read the article... by bmo · · Score: 4, Interesting

    And I have to ask...

    What was the point of the article? That the trade show is like every trade show ever?

    Really, I'll write a report the next time I go to EASTEC and whine about the lack of "Makers" (in the geek culture sense of the word) among the vendors of Big Machinery.

    --
    BMO

  3. Some open source advocates... by blahplusplus · · Score: 3, Insightful

    ... must face the fact that lots of code is boring to maintain and update. Not to mention that unless you are independently wealthy, contributing to open source is a drain on one's time and resources. No one should really be concerned that many corporations see value in open source; it's like seeing value in roads or sewers. There is much code that is just like roads and sewers, which would be hard to maintain on a volunteer basis.

    1. Re:Some open source advocates... by Anonymous Coward · · Score: 0

      I understand completely. So... you're saying corporate communities like code that is sewer ... that is Open. An open sewer! Wait ... what?

  4. Scratching Itches by Anonymous Coward · · Score: 2, Interesting

    A big part of the grass-roots movement that Linux and other open-source projects benefit from comes about because hackers (in the good sense) contribute to software that they themselves want or need. There probably aren't many programmers that want (or can afford) to store and analyze petabytes of data in their free time. That's important to corporations, though, so I suspect that's why you see primarily corporate interests in open-source Big Data projects.

  5. So, in other words... by king+neckbeard · · Score: 2

    It's pretty much a purely open source community instead of a free software community.

    --
    This is my signature. There are many like it, but this one is mine.
    1. Re:So, in other words... by Anonymous Coward · · Score: 0

      well said.
      also, hackers i know do things because they are an improvement over what exists, and/or because it interests them, or is good for society at large or some larger community. BigData exists to make BigMoney for BigCorporations, to hell with improvements, to hell with interesting and especially to hell with helping society at large. i don't believe these two communities will ever see "eye to eye", and this makes me happy.
      GNU . . . It's Freedom, baby YEAH!!!

  6. A very simple explanation by Anonymous Coward · · Score: 5, Insightful

    "There is no sense of the grass-roots, hacker-dominated communities that were so much a part of the Linux community's DNA"

    This is for one simple reason: most hackers don't need "BigData".

    Perhaps if the typical hacker had a cluster of servers to play with, this would change. But as long as most hackers are bound to using a single personal computer, they're just not going to be very concerned with clusterware.

    They're also not concerned with plenty of other things that are essential to big corporations, like payroll software and CRM (customer relationship management) software.

    1. Re:A very simple explanation by Anonymous Coward · · Score: 1

      That's generally true, but some of the cluster management software out there installs in pretty low end environments.

      The Apache Incubator Tashi project for example allows for fast startup of VMs. These can be used to run a virtual cluster for a specific purpose, at the end of which the instances can be thrown away. This saves on having one-off installs polluting your main machine.

      I had it provide VMs inside a single VMware Fusion instance, as well as run a real cluster with >100 large nodes and many different users.

      Disclaimer: I am a Tashi developer.

    2. Re:A very simple explanation by Anonymous Coward · · Score: 1

      There's a lot of startup activity in the big data area, along with job opportunities for software engineers. But it seems that the majority of it is about mining behavioral trends in consumer activity and enabling targeted ads and other personalized online experiences. It's a little bit creepy.

      OTOH I'm sure hadoop and friends would be very useful for the LHC and other big science projects, but they are mostly taxpayer funded and are fighting to keep the dollars they're getting, not looking for new ways to spend them.

    3. Re:A very simple explanation by evilviper · · Score: 4, Informative

      This is for one simple reason: most hackers don't need "BigData".

      Perhaps if the typical hacker had a cluster of servers to play with, this would change.

      "Most hackers" don't need a lot of things that are, nevertheless, developed as successful open source projects. Anybody think there's a huge audience for DReaM?

      Storage is getting big... Even a tiny shop can afford obscene amounts of storage. Each 2U server can have 6 x 2TB SATA (3.5") drives pretty inexpensively. As soon as you've got a dataset that needs more space than you can store on one of those, you'd benefit from these "big data" solutions, rather than the standby (more expensive) solution of "throw in a monster SAN".

      And you don't even need that much infrastructure. The virtual servers (cloud) service providers aren't very expensive, particularly when you don't care about SLA, and will give you as big of a cluster "to play with" as you could want.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    4. Re:A very simple explanation by scheme · · Score: 3, Insightful

      OTOH I'm sure hadoop and friends would be very useful for the LHC and other big science projects, but they are mostly taxpayer funded and are fighting to keep the dollars they're getting, not looking for new ways to spend them.

      HDFS is already used by CMS (one of the detectors at the LHC) to store and manage distributed filesystems at various regional centers. After all, when you are generating multiple petabytes each year and need to process it and keep various subsets of it around for analysis by various groups, you need filesystems that can handle multiple PB of files. And yes, I believe patches are being fed upstream as necessary. Other filesystems being used in the US include lustre, dcache, and xrootdfs.

      Although funding is an issue, continuing to run and analyze data from the LHC means that money needs to be spent to buy more storage and servers as needed and to pay people to develop and maintain the systems needed to distribute and analyze all the data being generated. Having multiple PB of particle collision data is useless if you can't analyze it and look for interesting events.

      --
      "When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
  7. Re:biTcH by Anonymous Coward · · Score: 0

    it was more of a challenge when they used URL shorteners.

  8. Re:biTcH by Anonymous Coward · · Score: 0

    And when they put together a post that looked like it wasn't a troll

  9. overflow and "working correctly" by ChipMonk · · Score: 2

    If it isn't working correctly on a petabyte dataset, then it isn't "working correctly", period, no matter how well-hidden the bugs are with gigabyte and terabyte datasets. An unhandled overflow error that doesn't manifest until you exceed 2^64, is still an unhandled overflow error.

    For a trivial example of my point, try using 32-bit signed integers to calculate the Collatz iteration of 113,383.
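
    A small Python sketch of that Collatz example. Python integers never overflow, so the 32-bit limit is simulated here; `wrap_int32` mimics two's-complement wraparound as a language like Java would do it (in C, signed overflow is undefined behaviour). All function names are illustrative.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(x):
    """Reinterpret x as a two's-complement signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x > INT32_MAX else x


def first_int32_overflow(n):
    """Walk the Collatz trajectory of n with exact arithmetic and return the
    first value that does not fit in a signed 32-bit int, or None."""
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        if n > INT32_MAX:
            return n
    return None


small = first_int32_overflow(27)       # stays far below the limit
big = first_int32_overflow(113383)     # crosses the limit partway through
print(small, big, wrap_int32(big))
```

    Every value before the overflow is perfectly ordinary, which is why tests on smaller starting numbers pass without complaint.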

    1. Re:overflow and "working correctly" by martin-boundary · · Score: 2
      True, but again, this overflow will show up much sooner in a smaller setting, say when the algorithm is compiled with 16-bit or even 8-bit integer variables. You haven't shown that 2^64 is an inherent lower bound for the appearance of the overflow bug.

      Incidentally, people who don't know about computer architecture wouldn't be aware about overflows, so wouldn't know to check these conditions. Something about semi-educated programmers and their ability to debug code?

    2. Re:overflow and "working correctly" by ChipMonk · · Score: 1

      True, but again, this overflow will show up much sooner in a smaller setting, say when the algorithm is compiled with 16-bit or even 8-bit integer variables. You haven't shown that 2^64 is an inherent lower bound for the appearance of the overflow bug.

      I picked 2^64 only because I'm currently using an AMD64X2. It will vary from one architecture to another, anyway, unless the code uses types with explicit bit-widths, like "uint64" or "float80". The point is, know the hardware and software specs, and their accompanying limitations, and make sure you don't exceed them.

      Incidentally, people who don't know about computer architecture wouldn't be aware about overflows, so wouldn't know to check these conditions. Something about semi-educated programmers and their ability to debug code?

      More like their ability to develop quality code to begin with. I seriously doubt they would get hired, or their software used, by the Big Data described in the article.

    3. Re:overflow and "working correctly" by adolf · · Score: 1

      Incidentally, people who don't know about computer architecture wouldn't be aware about overflows, so wouldn't know to check these conditions. Something about semi-educated programmers and their ability to debug code?

      I grok this discussion as an exchange between a user who is experiencing a real problem and needs help with it but is unable to find useful answers, and a programmer who is patiently trying to explain to the user that they are somehow asking the wrong questions, while insinuating that the user should divine the knowledge to reformulate the question to meet the programmer's artificial specifications.

      I submit that no solution to this problem will ever avail itself, given these incompatible mindsets.

      And I fault neither of you.

      Which really is OK: The problem exists, and will continue to exist. It is the nature of the beast for all manner of systems -- not just computers. TFA is about resolving the issue.

      So give up, kids. You'll never fix this on your own, because you don't think the same way. It's alright.

  10. WTF is "Big Data" by Anonymous Coward · · Score: 0

    That reads like an article from a journalist who was paid to attend a 5 day conference on something they know very little about, and spent the entire time at the nearby pub.

    It sounds like he either wrote the article based on a 5 minute conversation with one of the nerds of yesteryear who misses the 'old community feeling', or spent 2 minutes at the conference 10 years ago, and 2 minutes at the conference this year, and has written their 'deep analysis' based on their first impression.

    What the hell is Big Data anyway?

  11. How small is your basement? by oneiros27 · · Score: 3, Informative

    Internet Archive's last published generation Petabox (now more than a year old, so they were using smaller drives), would take two racks ... which is still reasonable, but you could probably fit it in a single rack with today's drives. A Backblaze Pod is 42 disks in 4U, so you could do it yourself and, assuming you can get enough large disks after that whole flooding thing, be able to get a PB in a single rack easily. The Sun Thumper took 48 disks in 4U ... I don't know if the X4540 ever supported larger than 1TB disks, though.

    My department just got a Nexsan E60 in yesterday ... 60 3TB disks in 4U, so you can squeeze 1.8PB raw in a 42U rack. (usable space ... still more than a PB, even with spares.)
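
    The rack arithmetic, as a quick Python sketch. `raw_capacity_tb` is a hypothetical helper; it assumes the whole rack is filled with identical enclosures and ignores space for switches, servers, spares and RAID overhead.

```python
def raw_capacity_tb(rack_units, enclosure_units, disks_per_enclosure, tb_per_disk):
    """Raw (pre-RAID) capacity in TB of a rack filled with one enclosure type."""
    enclosures = rack_units // enclosure_units   # only whole enclosures fit
    return enclosures * disks_per_enclosure * tb_per_disk


# 60 x 3TB disks per 4U enclosure in a 42U rack
e60_rack = raw_capacity_tb(42, 4, 60, 3)
# 42-disk 4U pods, assuming the same 3TB disks
pod_rack = raw_capacity_tb(42, 4, 42, 3)
print(e60_rack, pod_rack)
```

    Ten 4U enclosures fit in 42U, which is where the 1.8PB raw figure comes from.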

    So, space isn't the issue ... power and cooling may be, though.

    --
    Build it, and they will come^Hplain.
    1. Re:How small is your basement? by Anonymous Coward · · Score: 0

      You're so cool, man!
      May I touch you?

    2. Re:How small is your basement? by Anonymous Coward · · Score: 0

      The Sun X4500 already supports 3TB Drives, i tried it in my attic ;)

  12. Data is big by Anonymous Coward · · Score: 1

    Really Really big
    You just won't believe how vastly hugely mindbogglingly big it is.

    1. Re:Data is big by yahwotqa · · Score: 1

      It's even bigger.

  13. Where's the big data by Anonymous Coward · · Score: 1

    Given an individual can get their hands on storage and clusters ... Where is the interesting data?
    Where is PB sized data of interest to a hacker they can download?
    Where's the fun payoff ?

    1. Re:Where's the big data by evilviper · · Score: 1

      Google's "big data" is just web pages. Start a spider, feed the output to Solr, and see if you can beat Google at web search.

      --
      Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  14. Wrong Conference by allenw · · Score: 1

    I really hate the reporting around Hadoop. Most of these people have absolutely no clue what they are talking about, and this article is just another example of that. Any bit of simple research would have revealed that the actual open source community of developers around Hadoop, Hive, Solr, etc, can be found at ApacheCon. Of course Strata is amazingly commercial: O'Reilly, being a corporate entity, is trying to make cash around the latest craze. If they weren't, they'd make sure the ASF and the other OSS organizations that help make the software had some space and would actually attend.

  15. hacker by Anonymous Coward · · Score: 0

    Hackers follow the signatures like the paper work junk coming out of the company to know the password. SEOWDC

  16. Another reason: JAVA by Anonymous Coward · · Score: 1

    Those programs named are all written in Java, which is more of interest to corporate programmers than hackers.

  17. The Cloud by tommeke100 · · Score: 1

    Sure, most hackers don't have a personal cluster at their disposal to really test the limits of their BigData, web-scale and - insert buzzword here - deployment. There are, however, some free 'cloud' alternatives (PaaS), OpenShift by Red Hat for example (http://openshift.redhat.com/), that give you the opportunity to play around a bit.

  18. Re:Scratching (other) Itches by Anonymous Coward · · Score: 0

    Surely that's just a medium sized porn collection?

  19. Do you want Big Data to take off? by Lord+Grey · · Score: 1

    Do you want Big Data solutions to appeal to the masses? For open source hackers to tackle petabyte-size problems? Hundreds or thousands of possible solutions for each variation of a problem, like what is found on SourceForge?

    It's dead simple.

    Rename the problem to Big Porn and create a couple of frameworks as examples. The technology will just take right off.

    --
    // Beyond Here Lie Dragons
  20. Just Stop by Anonymous Coward · · Score: 0

    Not another fucking buzzword. Just stop, and murder every marketing exec that ever wants to use it.

  21. Big data users - finance - closed community by Anonymous Coward · · Score: 0

    One of the things that separates big data and its open source tools from the rest of the universe is that SO many users are large financial institutions that just think differently from other open source communities. Their thoughts on public cloud - "scary if we don't know where our data is" and "regulators won't let us".

    Their thoughts on sharing with the open source community - "what we're doing with Hadoop is our Secret Sauce. We sure don't want to share that with Citi/BofA/Schwab/nameyourevilempire."

    The more regulations an industry deals with the less likely they will fully participate in an old-school open source project the way most of us think about participation.