Big Data's Invisible Open Source Community
itwbennett writes "Hadoop, Hive, Lucene, and Solr are all open source projects, but if you were expecting the floors of the Strata Conference to be packed with intense, bootstrapping hackers you'd be sorely disappointed. Instead, says Brian Proffitt, 'community' where Big Data is concerned is 'acknowledged as a corporate resource', something companies need to contribute back to. 'There is no sense of the grass-roots, hacker-dominated communities that were so much a part of the Linux community's DNA,' says Proffitt."
My basem^H^H^H^H^H hacker cave simply doesn't have any room for a storage array in the PB order.
And I have to ask...
What was the point of the article? That the trade show is like every trade show ever?
Really, I'll write a report the next time I go to EASTEC and whine about the lack of "Makers" (in the geek culture sense of the word) among the vendors of Big Machinery.
--
BMO
... must face the fact that lots of code is boring to maintain and update. Not to mention that, unless you are independently wealthy, contributing to open source is a drain on one's time and resources. No one should really be concerned that many corporations see value in open source; it's like seeing value in roads or sewers. There is much code that is just like roads and sewers, which would be hard to maintain on a volunteer basis.
A big part of the grass-roots movement that Linux and other open-source projects benefit from comes about because hackers (in the good sense) contribute to software that they themselves want or need. There probably aren't many programmers that want (or can afford) to store and analyze petabytes of data in their free time. That's important to corporations, though, so I suspect that's why you see primarily corporate interests in open-source Big Data projects.
It's pretty much a purely open source community instead of a free software community.
This is my signature. There are many like it, but this one is mine.
"There is no sense of the grass-roots, hacker-dominated communities that were so much a part of the Linux community's DNA"
This is for one simple reason: most hackers don't need "BigData".
Perhaps if the typical hacker had a cluster of servers to play with, this would change. But as long as most hackers are bound to using a single personal computer, they're just not going to be very concerned with clusterware.
They're also not concerned with plenty of other things that are essential to big corporations, like payroll software and CRM (customer relationship management) software.
it was more of a challenge when they used URL shorteners.
And when they put together a post that looked like it wasn't a troll
If it isn't working correctly on a petabyte dataset, then it isn't "working correctly", period, no matter how well-hidden the bugs are with gigabyte and terabyte datasets. An unhandled overflow error that doesn't manifest until you exceed 2^64 is still an unhandled overflow error.
For a trivial example of my point, try using 32-bit signed integers to calculate the Collatz iteration of 113,383.
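A minimal sketch of that Collatz example, assuming Python (whose integers are unbounded) and simulating 32-bit signed arithmetic by hand. The function names and the `INT32_MAX` constant are made up for illustration; they do not come from any of the projects discussed here.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def collatz_peak(n):
    """Return (steps, peak) for the Collatz trajectory of n, using unbounded ints."""
    steps, peak = 0, n
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        peak = max(peak, n)
        steps += 1
    return steps, peak


def wrap_int32(x):
    """Reinterpret x as a two's-complement 32-bit signed integer (silent wraparound)."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


def first_int32_overflow(n):
    """Return the first value whose 3n+1 step does not fit in int32, or None."""
    while n != 1:
        if n % 2 == 0:
            n //= 2
        else:
            if 3 * n + 1 > INT32_MAX:
                return n  # a real int32 would silently wrap at this step
            n = 3 * n + 1
    return None


if __name__ == "__main__":
    steps, peak = collatz_peak(113383)
    print("steps:", steps, "peak:", peak, "fits in int32:", peak <= INT32_MAX)
    bad = first_int32_overflow(113383)
    print("overflow when tripling:", bad, "wrapped result:", wrap_int32(3 * bad + 1))
```

Small starting values such as 27 never trip the check, which is the point being made above: the bug is there all along, it just stays hidden until the input is large enough.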
That reads like an article from a journalist who was paid to attend a 5-day conference on something they know very little about and spent the entire time at the nearby pub.
It sounds like he either wrote the article based on a 5 minute conversation with one of the nerds of yesteryear who misses the 'old community feeling', or spent 2 minutes at the conference 10 years ago, and 2 minutes at the conference this year, and has written their 'deep analysis' based on their first impression.
What the hell is Big Data anyway?
Internet Archive's last published generation Petabox (now more than a year old, so they were using smaller drives) would take two racks ... which is still reasonable, but you could probably fit it in a single rack with today's drives. A Backblaze Pod is 45 disks in 4U, so you could do it yourself and, assuming you can get enough large disks after that whole flooding thing, get a PB in a single rack easily. The Sun Thumper took 48 disks in 4U ... I don't know if the X4540 ever supported larger than 1TB disks, though.
My department just got a Nexsan E60 in yesterday ... 60 3TB disks in 4U, so you can squeeze 1.8PB raw in a 42U rack. (usable space ... still more than a PB, even with spares.)
So, space isn't the issue ... power and cooling may be, though.
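The rack arithmetic for the E60 figures above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses decimal units (1 PB = 1000 TB), the function names are made up, and the 25% overhead for spares and RAID is an assumed figure chosen for illustration, not a vendor number.

```python
def raw_rack_capacity_tb(disks_per_enclosure, disk_tb, enclosure_u, rack_u=42):
    """Raw capacity in TB when a rack is filled with identical enclosures."""
    enclosures = rack_u // enclosure_u  # whole enclosures only; leftover U is unused
    return enclosures * disks_per_enclosure * disk_tb


def usable_tb(raw_tb, overhead_fraction):
    """Capacity left after spares/RAID; overhead_fraction is an assumption."""
    return raw_tb * (1 - overhead_fraction)


if __name__ == "__main__":
    raw = raw_rack_capacity_tb(disks_per_enclosure=60, disk_tb=3, enclosure_u=4)
    print("raw TB per 42U rack:", raw)
    print("usable TB at an assumed 25% overhead:", usable_tb(raw, 0.25))
```

Ten 4U enclosures fit in 42U with 2U to spare, which is where the 1.8PB raw figure comes from.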
Build it, and they will come^Hplain.
Really Really big
You just won't believe how vastly hugely mindbogglingly big it is.
Given an individual can get their hands on storage and clusters ... Where is the interesting data?
Where is PB sized data of interest to a hacker they can download?
Where's the fun payoff?
I really hate the reporting around Hadoop. Most of these people have absolutely no clue what they are talking about, and this article is just another example of that. Any bit of simple research would have revealed that the actual open source community of developers around Hadoop, Hive, Solr, etc, can be found at ApacheCon. Of course Strata is amazingly commercial: O'Reilly, being a corporate entity, is trying to make cash around the latest craze. If they weren't, they'd make sure the ASF and the other OSS organizations that help make the software had some space and would actually attend.
Hackers follow the signatures like the paper work junk coming out of the company to know the password. SEOWDC
Those programs named are all written in Java, which is more of interest to corporate programmers than hackers.
Sure, most hackers don't have a personal cluster at their disposal to really test the limits of their BigData, web-scale and - insert buzzword here - deployment. There are, however, some free 'cloud' alternatives (PaaS), OpenShift by Red Hat for example (http://openshift.redhat.com/), that give you the opportunity to play around a bit.
Surely that's just a medium sized porn collection?
Do you want Big Data solutions to appeal to the masses? For open source hackers to tackle petabyte-size problems? Hundreds or thousands of possible solutions for each variation of a problem, like what is found on SourceForge?
It's dead simple.
Rename the problem to Big Porn and create a couple of frameworks as examples. The technology will just take right off.
Not another fucking buzzword. Just stop, and murder every marketing exec that ever wants to use it.
One of the things that separates big data and its open source tools from the rest of the universe is that SO many users are large financial institutions that just think differently from other open source communities. Their thoughts on public cloud - "scary if we don't know where our data is" and "regulators won't let us".
Their thoughts on sharing with the open source community - "what we're doing with Hadoop is our Secret Sauce. We sure don't want to share that with Citi/BofA/Schwab/nameyourevilempire."
The more regulations an industry deals with, the less likely it is to fully participate in an old-school open source project the way most of us think about participation.