rockmuelle · Slashdot Mirror

Hadoop was never really the right solution... on Is Big Data Leaving Hadoop Behind? · 2015-05-13 13:59 · Score: 5, Insightful

A scripting language with a good math/stats library (e.g., NumPy/Pandas) and a decent RAID controller are all most people really need for most "big data" applications. If you need to scale a bit, add a few nodes (and put some RAM in them) and a job scheduler into the mix and learn some basic data decomposition methods. Most big data analyses are embarrassingly parallel. If you really need 100+ TB of disk, set up Lustre or GPFS. Invest in some DDN storage (it's cheaper and faster than the HDFS system you'll build for Hadoop).
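As a rough sketch of the decomposition idea above: split the records into independent chunks, count within each chunk, then merge the partial counts. The function names and the "count the first field" task are hypothetical, chosen only for illustration. A thread pool keeps the sketch self-contained; on a real workload the same map-then-merge shape would more likely run on a process pool or as separate jobs under a scheduler.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Decompose a list of records into roughly n_chunks independent slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_chunk(chunk):
    """Count the first whitespace-separated field of each record in one chunk."""
    counts = Counter()
    for line in chunk:
        fields = line.split()
        if fields:  # skip blank records
            counts[fields[0]] += 1
    return counts


def parallel_count(records, n_workers=4):
    """Map count_chunk over independent chunks, then merge the partial counts."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

No chunk depends on any other, which is what makes the problem embarrassingly parallel: the only coordination is the final merge.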

Here's the breakdown of that claim in more computer sciencey terms: Almost all big data problems are simple counting problems with some stats thrown in. For more advanced clustering tasks, most math libraries have everything you need. Most "big data" sizes are under a few TB of data. Most big data problems are also I/O bound. Single nodes are actually pretty powerful and fast these days. 24 cores, 128 GB RAM, 15 TB of disk behind a RAID controller that can give you 400 MB/s data rates will cost you just barely 5 figures. This single node will outperform a standard 8 node Hadoop cluster. Why? Because the local, high density disks that HDFS encourages are slow as molasses (30 MB/s). And...

Hadoop has a huge abstraction penalty for each record access. If you're doing minimal computation for each record, the cost of delivering the record dominates your runtime. In Hadoop, the cost is fairly high. If you're using a scripting language and reading right off the file system, your cost for each record is low. I've found Hadoop record access times to be about 20x slower than Python line read times from a text file, using the _same_ file system for Hadoop and Python (of course, Hadoop puts HDFS on top of it). In Big-O terms, the 'c' we usually leave out actually matters here - O(1*n) vs. O(20*n). 1 hour or 20 hours, you pick.
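A back-of-envelope version of these numbers, as a minimal sketch. The 400 MB/s and 30 MB/s rates and the 20x per-record factor come from the comment; the 1 TB data size and the function names are assumptions for illustration.

```python
def scan_hours(data_tb, rate_mb_per_s):
    """Hours to stream data_tb terabytes once at rate_mb_per_s (decimal units)."""
    megabytes = data_tb * 1_000_000
    return megabytes / rate_mb_per_s / 3600


def runtime_hours(base_hours, per_record_cost_factor):
    """Scale a baseline runtime by the constant factor hidden inside O(c*n)."""
    return base_hours * per_record_cost_factor


# Assumed 1 TB data set, scanned once at each of the quoted disk rates.
raid_hours = scan_hours(1, 400)   # about 0.69 hours
hdfs_hours = scan_hours(1, 30)    # about 9.26 hours

# The 20x per-record overhead turns a 1 hour job into a 20 hour one.
slow_job_hours = runtime_hours(1, 20)
```

Both effects are constant factors, so neither shows up in the big-O class, yet each changes the wall-clock time by an order of magnitude.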

If you're really doing big data stuff, it helps to understand how data moves through your algorithms and architect things accordingly. Almost always, a few minutes of big-O thinking and some basic knowledge of your hardware will give you an approach that doesn't require Hadoop.

tl;dr: Hadoop and Spark give people the illusion that their problems are bigger than they actually are. Simply understanding your data flow and algorithms can save you the hassle of using either.

-Chris

Re:Old browsers on Ask Slashdot: What's the Future of Desktop Applications? · 2015-05-11 06:47 · Score: 2

Two years is our horizon for browser support. Two other trends that have helped us in this regard are (1) that most browsers auto-update or at least nag you a lot and (2) IT departments are more accepting of users running Chrome/Safari/Firefox alongside IE. We're targeting enterprise/internal users, not everyone on the Web, so we can also put some requirements in place when we deploy.

Most of our functionality uses standard HTML/CSS/DOM features, so we haven't had any issues with features dropping. We don't rely on 3rd party extensions such as Flash or APIs/features that don't have broad support. The decision to use Canvas over SVG for complex visualizations is partly due to this - SVG support is spotty across browsers, Canvas is pretty stable now. Canvas is also much faster at rendering large data sets, which is the other reason for using it.

-Chris

Why we targeted the browser... on Ask Slashdot: What's the Future of Desktop Applications? · 2015-05-11 04:06 · Score: 5, Interesting

I run a company that develops a laboratory informatics platform for data intensive science applications that mix wet lab and analytics operations into single workflows, with gene sequencing as the motivating application - think LIMS with a pipeline and visualization engine, if you're familiar with the space. (Lab7 Systems, if you're curious - http://www.lab7.io/)

When we started development a few years ago, we had to make the decision as to whether or not to build a desktop application or a browser-based application. At the time, this wasn't an easy decision. Some aspects of the UI are straightforward form-style interfaces, but others are graphics heavy visualizations of very large data sets (100+ GB in some cases). Scientific and information visualization have almost always benefitted from local graphics contexts and native rendering engines. In addition, the data decomposition tasks often require efficient implementations in compiled languages. Our platform also controls analysis processes on large clusters, another task not well suited for the browser.

We gambled a bit and decided that the browser would be our primary user interface. Two trends at the time helped us make the decision (and luckily they both held steady):

(1) The JavaScript engines in all the major browsers get faster with each new release and now outperform other scripting languages for many tasks.
(2) The JavaScript development community is maturing, with more well-engineered and stable libraries available.

A few other considerations helped us make the call:

(1) Our platform is a multi-user system. A desktop client would add to the support burden for our customers.
(2) Our backend needs to integrate with compute clusters, scientific instruments, and large, high-performance file systems. It is server-based, regardless of the client.
(3) The data scales we were dealing with also required "out-of-core" (to use an older term) algorithms for rendering, so the client would never get entire data sets at once.
(4) REST/json... XML, XMLRPC, SOAP, and all the others are a pain to develop for (I speak from experience), REST/json significantly reduced the amount of code we needed to maintain state between the client and server.
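To illustrate the out-of-core point in (3), here is a minimal sketch of one common approach: stream the data in fixed-size chunks and reduce each chunk to a (min, max) pair, so the renderer only ever sees one small summary per pixel column. This is not the platform's actual code; the function names and the min/max reduction are assumptions for illustration.

```python
def iter_chunks(stream, chunk_size):
    """Yield successive lists of up to chunk_size values from any iterable."""
    chunk = []
    for value in stream:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


def min_max_envelope(stream, chunk_size):
    """Reduce each chunk to one (min, max) pair, e.g. one pixel column of a plot."""
    return [(min(c), max(c)) for c in iter_chunks(stream, chunk_size)]
```

Only one chunk is held in memory at a time, and the output size depends on the display width rather than on the size of the data set.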

Since we made the call to use the browser, we haven't looked back. Early on there were some user interactions that were tricky to implement across all browsers, but today they've all caught up. Our application looks much more like a desktop or (*shudder*) Flash application, with a very rich UI (designed by an actual UX team that gets scientific software ;) ) and complex visualizations. It's also been relatively straight-forward to implement, thanks in large part to the maturity of some JavaScript libraries (we use jQuery, D3 (for complex filtering, but not for visualization), Canvas, Backbone, and a few others).

Personally, I can't imagine ever writing a desktop application again. The browser is just too convenient and, in the last few years, finally powerful enough for most tasks.

-Chris

Re:Investments? on Study Reveals Wikimedia Foundation Is 'Awash In Money' · 2015-05-11 01:58 · Score: 1, Insightful

$20M on salaries sounds about right for an organization with a complex IT infrastructure and global reach. Not sure what the outrage is here, unless you're expecting the people that keep the site up to work for free.

If they were developing the content as well, I'd expect their salaries to be in the $30-50M range. $1M probably gives you 6-8 editorial FTEs, so $30-50M would give you the few hundred editors and their support staff necessary to produce the content. The numbers are different for IT staff - 4-5 FTEs/$1M, so $20M could cover 80 technical staff and a few managers. Of course, there's all other staff as well, so the technical staff numbers are probably lower.

Other posts have already pointed out that $50M in the bank is a smart move for a non-profit.

-Chris

Re:HIPAA violation on Apple's Plans For Your DNA · 2015-05-06 07:50 · Score: 1

But the FDA did smack down 23andMe pretty hard for making medical claims based on SNP profiling.* While HIPAA isn't the right regulatory regime here, the FDA definitely is. 23andMe tried the Uber approach to flouting regulations and found that when actual human health is involved, "trust us" doesn't cut it.

-Chris

*Can we please stop calling what these companies are doing "DNA sequencing"? It's not and never has been. It's just looking for specific, known markers in your genome. Sequencing is actually getting a readout of your genome.

Just Like the "Liberal Media" on House Panel Holds Hearing On "Politically Driven Science" - Without Scientists · 2015-05-04 07:58 · Score: 5, Insightful

Growing up in the 80s, all I heard was how liberal the media was and how we had to fight against it. Now, with the benefit of hindsight, it's clear that the phrase "liberal media" was a conservative talking point that they repeated ad infinitum until people stopped questioning it and just assumed it was true.

The same thing is happening now with claiming scientists are politically or monetarily motivated (the conservative machine hasn't settled on which script to stick with).

Look, I'm a scientist. I know scientists. I know scientists at NOAA, NCAR, NIST, the Labs, in academia, in industry, at biotechs, at agri-science companies, at space exploration companies, and at oil and gas companies. I know conservative scientists, liberal scientists, agnostic scientists, religious scientists, and hedonistic scientists.

You know what motivates scientists? Science. And to a lesser extent, their ego. If someone doesn't love science, there's no way they can cut it as a scientist. There are no political or monetary rewards available to scientists in the same way they're available to lawyers and lobbyists.

Science is hard work for little pay and possibly some recognition. Unfortunately, the conservative noise machine is slowly building a narrative that scientists are all politically and monetarily motivated. The public doesn't really know any better and will believe this to be true if they hear it enough.

This attempt to paint scientists as political actors is pure bullshit and demeans the hard work and great sacrifices working scientists make every day.

-Chris

Did a paid shill write this summary? on NASA Gets Its Marching Orders: Look Up! Look Out! · 2015-05-02 13:23 · Score: 5, Informative

Seriously. The real story with this bill is that the republicans are defunding the climate monitoring programs. It will take decades to regain the capabilities we'll lose by defunding them now. There's no turf war between NASA and NOAA, just one between republicans and science.

Nice job trying to write a summary for geeks that attempts to bury the real story.

Re:39/100 is the new passing grade. on Results Are In From Psychology's Largest Reproducibility Test: 39/100 Reproduced · 2015-05-01 04:08 · Score: 4, Insightful

Gah. I have mod points but want to add to this conversation.

The point of publishing is to share results of an experiment or study. Basically, a scientific publication tells the audience what the scientist was studying, how they did the experiment, what they found, and what they learned from it. The point of peer review is to review the work to make sure appropriate methods were followed and that the general results agree with the data. Peer review is not meant to verify or reproduce the results, but rather just make sure that the methods were sound.

Scientific papers are _incremental_ and meant to add to the body of knowledge. It's important to know that papers are never the last word on a subject and the results may not be reproducible. It's up to the community to determine which results are important enough to warrant reproduction. It's also up to the community to read papers in the context of newly acquired knowledge. An active researcher in any field can quickly scan old papers and know which ones are likely no longer relevant.

That said, there is a popular belief that once something is published, it is irrefutable truth. That's a problem with how society interacts with science. No practicing scientist believes any individual paper is the gospel truth on a topic.

The main problem in science that this study highlights is not that papers are difficult to reproduce (that's expected by how science works), but that some (most?) fields currently allow large areas of research to move forward fairly unchecked. In the rush to publish novel results and cover a broad area, no one goes back to make sure the previous results hold up. Thus, we end up with situations where there are a lot of topics that should be explored more deeply but aren't due to the pursuit of novelty.

If journals encouraged more follow-up and incremental papers, this problem would resolve itself. Once a paper is published, there's almost always follow-up work to see how well the results really hold up. But, publishing that work is more difficult and doesn't help advance a career, especially if the original paper was not yours, so the follow-up work rarely gets done.

tl;dr: for the general public, it's important to understand that the point of publishing is to share work, peer review just makes sure the work was done properly and makes no claims on correctness, and science is fluid. For scientists, yeah, there are some issues with the constant quest for novel publications vs. incremental work.

-Chris

Yes, Please!!! on Has the Native Vs. HTML5 Mobile Debate Changed? · 2015-04-27 06:11 · Score: 5, Interesting

For 99% of the applications out there, there's no reason not to do it in the browser if you're starting from scratch today. Most (useful) mobile apps simply display remote content in a way that's contextually relevant to the moment (Yelp, shopping (ordering and product reviews), *Maps, news sites, social media, etc). There's no reason for any of those to be app based. Most apps that aggregate content are poorly designed and not updated frequently. Couple that with the fact that most do not have useful offline modes (the only reason to have an app for content, IMHO), it just makes sense to optimize for the mobile browser rather than spend all the time and effort on an app. Hell, even most games I play casually have no reason being written as apps any more - any word game or puzzler would work fine in the browser.

Instead, put the effort into good mobile design and development practices. Hire good developers to optimize for JavaScript. Hire good developers to optimize your backend operations to reduce latency. Find what features are missing in HTML/JavaScript (e.g., a good client side persistence layer) and encourage the browser vendors to improve there so everyone can benefit.

For context, I develop complex scientific software. We use the browser (desktop) as our client and push the limits of what you can do there. Mobile is not far behind and should be the first choice for new development.

-Chris

Re:Lets use correct terminology. on MakerBot Lays Off 20 Percent of Its Employees · 2015-04-17 08:30 · Score: 4, Insightful

As others have pointed out already in this thread: in the US, if you're laid off you can collect the unemployment insurance you've already paid for. If you're fired or leave voluntarily, you can't collect unemployment insurance.

I'm sure there are other legal differences, but as an employee, this is the important one.

If you are planning on leaving a job under good terms, it's always worth scheduling it around a layoff. You can tell your boss (discreetly) and see if you can be laid off instead. The win for your boss is that two employees won't be lost (you plus the person who'd be laid off). The win for you is that you get severance and can collect unemployment.

We restrict our kids' access to YouTube on Consumer Groups Bemoan Google's "Deceptive" Ads for Kids In FTC Complaint · 2015-04-07 03:04 · Score: 2

We cut the cord years ago and have used a mix of Hulu, Netflix, and the various network apps for content (PBS Kids, etc). YouTube has always been problematic, not just for the ads, but also for the content and the "next up" algorithm. As a result, we only let the kids use YouTube (and YouTube Kids) when we're in the room with them and have our finger on the remote.

Here are the specific problems with YouTube:

Ads: The ads are not targeted at all. If you've ever paid attention to ads, you already know the promise of targeted advertising is bunk. The problem with YouTube is that it's doubly bunk when it comes to kids programming on normal YouTube (and apparently on kids' YouTube as well). Completely inappropriate ads will pop up after kids shows. It's not rocket science to tweak your algorithm to play a kid-appropriate ad after a cartoon, even if it means the occasional adult will get the wrong ad.

Content: This is trickier. A lot of the cartoon content on YouTube consists of collections of episodes bundled into a single video. The problem is, the bundles are created by fans and you have no idea what's in it until you watch it. Sometimes they're crappy screen captures. Sometimes they're dubbed in another language (without calling it out in the title). In those cases, you spend 10 minutes with the kids just trying to find one they can watch. The worst, however, are the ones that are "archival" and created by superfans. My best example is a compilation of Donald Duck cartoons that includes the WWII episode where Donald fights Hitler*. Great episode... for adults who understand the context. Terrible episode for kids. YouTube has no good way of warning parents about this.

Next up: This is easy. The algorithm appears to randomly pick something that has the same word in the title as the previous or has been tagged to be similar. It's very easy to go from Donald Duck to Duck Hunting to Duck Dynasty to an unhinged Phil Robertson rant. Leave your kids alone with YouTube at your own risk!

Look, Google has more money than God and a lot of smart engineers. If they cared about this, they could fix it. YouTube Kids isn't the solution.

-Chris

*does that count for Godwin?

Re:I wonder on A Robo-Car Just Drove Across the Country · 2015-04-05 02:09 · Score: 1

The same thing train engineers are thinking.

Trains have solved the problem that driverless cars are trying to solve. Instead of cameras, GPS, and detailed maps, they simply use tracks to guide them. Guess what? After a few hundred years of using trains, we've found it helps to have a human on board. Same will be true of "driverless" cars and trucks.

-Chris

Didn't we just learn why this isn't a good idea? on The Democratization of Medical Diagnosis and Discovery · 2015-04-03 05:24 · Score: 1

http://search.slashdot.org/sto...

Sure, I can Google my symptoms and get a superficial understanding of some medical conditions, but that doesn't really mean I have the context to make any sense of them.

Do you really want the person using stackoverflow as their "brain" building your app? No. You want someone who already knows how to build apps and uses it as a reference on occasion. Big difference and the same one with medicine.

-Chris

Re:Suck it Millenials on Millennial Tech Workers Losing Ground In US · 2015-03-27 02:13 · Score: 1

Nice points. I have two kids under 6 right now and was starting to worry about how smart phones might replace computers for most of what they do and thus never expose them to an easy to program platform. What's really exciting for them is the abundance of hobbyist computers and embedded project kits available now. They're going to grow up in a world where simple microcontroller-style projects are completely accessible to them. Makes me almost want to be 6 again!

Today, I can teach my kids some basic UI programming with HTML/CSS/Javascript (not much harder than VB) to get them familiar with high level concepts. I can also get them a BrickPi or any other embedded(ish) system and teach them how hardware works and how to interface with external devices. What a great time to learn technology!

Millennials, by and large, got shafted when it comes to learning how computers work. Most of them went to school when Java was the only language being taught and Linux was becoming too complicated to easily understand for the casual user. When they started working, a little Javascript and CSS got them really far. There weren't many opportunities to really understand how the full stack works. And, with the rise of social media and apps, their exposure to technology was more social than technical. As others in this thread have pointed out, being able to use a simple UI on an iPhone doesn't make you the technology whiz that the media keeps saying you are.

Millennials can still catch up, but I think the next generation is the one that's really going to be primed to do amazing things.

-Chris

Re:It's just hard work and machine learning on Ask Slashdot: What Happened To Semantic Publishing? · 2015-03-24 09:29 · Score: 2

I don't think it's that computers and machine learning really trump an exact model. It's more that manually curated semantic information is difficult to do well and even when done well is simply the curator's interpretation of the key points. Ontologies and controlled vocabularies (necessary to make semantic solutions work) are always biased towards their creators' view of the world. Orthogonal interpretations rarely fit with the ontologies and require mapping between knowledge systems. Rather than simplifying things, this just creates another layer of abstraction and meta-data that now must be managed.*

Machine learning, on some level, basically admits this flaw in structured knowledge representation and punts. Instead, it provides tools for querying knowledge bases and finding patterns in them. I think the latter part is just as flawed as manual curation, but the query tools combined with a human are incredibly powerful.

A simple example: Yahoo originally indexed and categorized the Web. When I interviewed there in '96 (and, silly me, turned down the offer), they had a room full of people that did just that. Google, on the other hand, used a graph algorithm combined with standard text search methods to leverage the structure of the web to give good search results. Yahoo eventually bailed on manual curation and we learned how to leverage Google's approach to search to mine knowledge.
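The comment doesn't name the graph algorithm; the one usually associated with early Google is PageRank. As a minimal sketch under that assumption (a toy power iteration, nothing like a production search ranker, with a hypothetical `links` dictionary as input):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share from the "random jump" term.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

The ranking comes entirely from the link structure: no one has to categorize a page by hand for it to score well.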

tl;dr: manual and automated curation will never properly capture humans' representation of knowledge. Instead, better tools plus the human brain will improve our ability to leverage knowledge.

-Chris

*and there's that old saying: every software problem can be solved with another layer of abstraction.

Re:'Virtual Water': Fee Fie Foe Fum, I Smell ENRON on How 'Virtual Water' Can Help Ease California's Drought · 2015-03-22 02:23 · Score: 2

I was going to make a similar post...

It's my understanding that the current almond tree bubble is driven by (wall street?) investors who noticed the price mismatch in water and are using it to make a quick buck, the rest of the state be damned. Of course, these funds have deep pockets and probably can lobby effectively to keep prices where they are until they cash out.

Seems very much like a variation on ENRON but with water instead of gas.

Re:this is just nonsense. on Go R, Young Man · 2015-03-08 04:06 · Score: 1

Feeding a troll here, I know...

I'm pretty sure I had more Legos than you growing up. But, I didn't make the Legos, I built with them. I also was never under the delusion that my lego skillz would translate to a job building lego buildings. It was a fun, creative activity that required almost no learning and occupied most of my childhood.

The current push for everyone to program is the exact opposite of that. Learning to program for all but us autodidacts requires coursework and commitment. Sure, once you can do it it's a lot of fun. But, to keep the lego analogy going, it's like requiring a basic understanding of mechanical engineering before being allowed to use Legos (sorry Susan Williams - they'll always be Legos, not lego bricks).

Re:this is just nonsense. on Go R, Young Man · 2015-03-07 13:28 · Score: 1

Bricks are also a fundamental building block of our modern world. But I'll be damned if I know how to make one.

Not everyone needs to know everything. I love to code but I also appreciate that my friends who build houses for a living could give a shit about learning to code.

I'm amazed that people on a tech forum don't get that.

R is not a programming language on Go R, Young Man · 2015-03-07 06:53 · Score: 5, Insightful

It's a statistical computing environment. R is much closer to what VB was pre-VB6 - a loosely defined domain specific language with lots of libraries aimed at a specific task. It's not really a general purpose programming language and not a great one to learn if you want to learn to program.

If you do a lot of number crunching and want to move beyond Excel, R is a great choice (as is matlab, s-plus, or any of the others aimed at analytics).

If you do analytics AND want to learn to program, go Python and NumPy/Pandas.

If you just want to learn to program, VB, JavaScript, Python, Java are all good. Just find what you'd like to program and see what languages people are using.

And yes, at some point, pick up a few more languages if you find you like programming.

-Chris

So we'll all have our own CRM for friends? on In 10 Years, Every Human Connected To the Internet Will Have a Timeline · 2015-03-06 04:10 · Score: 2

A few random thoughts on this:

Influencing people by having instant recall is a classic sales trick. Old school sales people wrote notes in their Rolodex to remember spouses' names, birthdays, etc. Today, Salesforce, Zoho, and the like (hell, even linkedin) handle this role. However, as soon as you realize that the sales person remembered something using a CRM rather than actually remembering it, that interaction quickly becomes awkward. In the past, sales techniques like these weren't well known outside of sales circles. Nowadays, everyone knows about them and they're less effective. The value in the technique is that people weren't aware it was being used and mistook the sales person remembering personal details as actual friendship, rather than just a sales trick. Same will happen with timelines - we'll quickly sort those who use it as a gimmick and those who are sincere.

Another angle is the fitbit/life tracking. You know who obsessively tracks everything they do in hopes of improving themselves? People who obsessively track everything in hopes of improving themselves. The rest of us don't. Those people will always be around and will use these tools, the rest of us won't.

More importantly on the personal side of things: anyone who's accumulated a lifetime's worth of photos knows you never really go back and look at them in any detail. Sure, once in a while you'll reminisce, but you never do the detailed analysis of your past that these data hoarding stories predict. Instead, you live your life in the present, learning from the past with an eye toward the future. A few million years of evolution has made our brains very good at that. Every attempt to document and catalog our lives externally has failed to really live up to what our brain already does (hint: we likely don't have perfect recall for evolutionarily important reasons).

From the corporate side, data will be tracked as long as it can be traced back to profits. Right now, most of the profits are going to companies selling big data analysis services. It's only a matter of time before their customers move on to the next marketing trend.

tl;dr: live in the present and stop trying to cheat nature. :)

-Chris

ps: yes, the government collecting all this data is scary as hell. Voting can help fix that (at least in America - it'll take a few elections, but it's possible).

Re:What is this ? Keep asking the same question on Should We Really Try To Teach Everyone To Code? · 2015-02-14 08:53 · Score: 1

This. Most of the workforce would benefit from basic education in all aspects of business. Sales, marketing, finance, project management, business development, etc.

In our neck of the corporate world (software), too few employees understand how business actually functions and what it really takes to make a business work. The current culture of "just build an app and you're set for life" leaves out many of the key steps needed to build a business. As a result, most promising applications go nowhere and most "successful" exits are really just acqui-hires (making the entrepreneur just a well paid headhunter, which has nothing to do with coding ability).

Simple things like knowing how to develop top down and bottom up models of a market would help app developers understand who their users are and how they might generate revenue to continue to fund their app. Even something as simple as understanding that revenue is actually necessary for success is lost on most developers I know.

-Chris

Control key is in the wrong place on Building the Developer's Dream Keyboard · 2015-02-11 08:16 · Score: 1

C'mon! Every programmer worth their salt knows that Control belongs to the left of 'a'. 'Mouse' is cute, but stick that on the bottom (and not where the Meta key goes!).

I'll go back to Emacs now...

-Chris

This is a Computer Science Test... on AP Test's Recursion Examples: An Exercise In Awkwardness · 2015-02-08 08:13 · Score: 1

... not a programming test. Recursion is a key concept in CS and is the foundation of many techniques and principles. Sure, no one uses it in practice that often, but that doesn't mean you shouldn't know it if you're learning CS.
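As one small illustration of recursion as a foundation for other techniques, here is a minimal merge sort sketch: the function solves a problem by calling itself on smaller halves and combining the results, which is the divide-and-conquer pattern. The example is chosen for illustration and is not taken from the AP test.

```python
def merge_sort(items):
    """Return a new sorted list using recursive divide and conquer."""
    # Base case: zero or one element is already sorted.
    if len(items) <= 1:
        return list(items)

    # Recursive case: sort each half independently.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```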

If you want to train people for job interviews, send them to a trade program. If you want them to understand the field, you have to teach them the fundamentals.

I never use the central limit theorem directly when doing stats, but knowing how it underlies the methods helps provide a better understanding of results.
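A minimal simulation sketch of what the theorem says, assuming uniform(0, 1) draws and hypothetical function names: the means of repeated samples cluster around the true mean (0.5), with a spread close to sigma / sqrt(n), where sigma is the standard deviation of a single draw.

```python
import random
import statistics


def sample_means(n_per_sample, n_samples, seed=0):
    """Means of n_samples independent samples, each of n_per_sample uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        draws = [rng.random() for _ in range(n_per_sample)]
        means.append(sum(draws) / n_per_sample)
    return means


def summarize(means):
    """Return (mean, standard deviation) of the sample means."""
    return statistics.mean(means), statistics.stdev(means)


# Theory for uniform(0, 1): sigma = sqrt(1/12), so with n = 30 the spread
# of the sample means should be near sqrt(1/12) / sqrt(30).
observed_mean, observed_sd = summarize(sample_means(30, 2000))
expected_sd = (1 / 12) ** 0.5 / 30 ** 0.5
```

This is why a standard error can be trusted even when the underlying data is not normally distributed.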

-Chris

Re:No experience teaching no particular gift for i on What Happens When the "Sharing Economy" Meets Higher Education · 2015-02-02 04:48 · Score: 4, Interesting

I have a Ph.D. and am now fully qualified to teach university courses. The funny thing about that is that in the course of getting my Ph.D., I never once had to take a course on how to teach or even teach/TA a course (I was a research assistant the whole time I was in grad school).

I'm an outlier on not having to teach/TA a course in grad school (I did TA an undergrad, though), but I don't know of any graduate programs that require actual training for teaching.

The person cited in the summary is just as qualified as most Ph.D.s. :)

As for the big bucks, two of my good friends from grad school (both computer scientists) spent their first two years working for free waiting for tenure track positions to open up. They get decent salaries now, but over the course of their careers, it's not what I'd call big bucks.

-Chris

Data scientists == web masters on Cutting Through Data Science Hype · 2015-01-30 13:48 · Score: 2

Data scientists are this bubble's web masters. 'Nuff said.

Comments · 364