Domain: apache.org
Stories and comments across the archive that link to apache.org.
Stories · 484
-
Apache Web Server Bug Grants Root Access On Shared Hosting Environments (zdnet.com)
An anonymous reader quotes a report from ZDNet: This week, the Apache Software Foundation has patched a severe vulnerability in the Apache (httpd) web server project that could --under certain circumstances-- allow rogue server scripts to execute code with root privileges and take over the underlying server. The vulnerability, tracked as CVE-2019-0211, affects Apache web server releases for Unix systems only, from 2.4.17 to 2.4.38, and was fixed this week with the release of version 2.4.39. According to the Apache team, less-privileged Apache child processes (such as CGI scripts) can execute malicious code with the privileges of the parent process. Because on most Unix systems Apache httpd runs under the root user, any threat actor who has planted a malicious CGI script on an Apache server can use CVE-2019-0211 to take over the underlying system running the Apache httpd process, and inherently control the entire machine.
"First of all, it is a LOCAL vulnerability, which means you need to have some kind of access to the server," Charles Fol, the security researcher who discovered this vulnerability told ZDNet in an interview yesterday. This means that attackers either have to register accounts with shared hosting providers or compromise existing accounts. Once this happens, the attacker only needs to upload a malicious CGI script through their rented/compromised server's control panel to take control of the hosting provider's server to plant malware or steal data from other customers who have data stored on the same machine. "The web hoster has total access to the server through the 'root' account. If one of the users successfully exploits the vulnerability I reported, he/she will get full access to the server, just like the web hoster," Fol said. "This implies read/write/delete any file/database of the other clients." -
Apache NetBeans 10.0 Now Available (apache.org)
The Apache Software Foundation has released NetBeans 10.0, the second major release of the Apache NetBeans IDE. The release, said the Apache Software Foundation, is focused in adding support for JDK 11, JUnit 5, PHP, JavaScript and Groovy, as well in solving many issues. From a blog post: JDK 11 support has been enhanced in the following ways: Integration with the nb-javac project, adding support for JDK 11, removed the CORBA modules, support for JEP 309, Dynamic Class-File Constants, support for JEP 323, Local-Variable Syntax for Lambda Parameters, and support for LVTI Support for Lamdba Parameters.
PHP 7.3: You can now add trailing commas in function calls under PHP 7.3 (mailing list thread), and also use the list reference assignment, the flexible Heredoc and Nowdoc Syntaxes are also supported. [...] And more: context sensitive lexer, PHPStan support, debugger, twig, hints, suggestions, code completionâ¦â visit PHP Features Page and NetBeans 10 New and Noteworthy for more details on PHP support. JUnit 5.3.1 has been added as a new Library to NetBeans, so you can quickly add it to your Java projects. For Maven projects without no existing tests, JUnit 5 is now the default JUnit version. -
Apache NetBeans 10.0 Now Available (apache.org)
The Apache Software Foundation has released NetBeans 10.0, the second major release of the Apache NetBeans IDE. The release, said the Apache Software Foundation, is focused in adding support for JDK 11, JUnit 5, PHP, JavaScript and Groovy, as well in solving many issues. From a blog post: JDK 11 support has been enhanced in the following ways: Integration with the nb-javac project, adding support for JDK 11, removed the CORBA modules, support for JEP 309, Dynamic Class-File Constants, support for JEP 323, Local-Variable Syntax for Lambda Parameters, and support for LVTI Support for Lamdba Parameters.
PHP 7.3: You can now add trailing commas in function calls under PHP 7.3 (mailing list thread), and also use the list reference assignment, the flexible Heredoc and Nowdoc Syntaxes are also supported. [...] And more: context sensitive lexer, PHPStan support, debugger, twig, hints, suggestions, code completionâ¦â visit PHP Features Page and NetBeans 10 New and Noteworthy for more details on PHP support. JUnit 5.3.1 has been added as a new Library to NetBeans, so you can quickly add it to your Java projects. For Maven projects without no existing tests, JUnit 5 is now the default JUnit version. -
Apache NetBeans 10.0 Now Available (apache.org)
The Apache Software Foundation has released NetBeans 10.0, the second major release of the Apache NetBeans IDE. The release, said the Apache Software Foundation, is focused in adding support for JDK 11, JUnit 5, PHP, JavaScript and Groovy, as well in solving many issues. From a blog post: JDK 11 support has been enhanced in the following ways: Integration with the nb-javac project, adding support for JDK 11, removed the CORBA modules, support for JEP 309, Dynamic Class-File Constants, support for JEP 323, Local-Variable Syntax for Lambda Parameters, and support for LVTI Support for Lamdba Parameters.
PHP 7.3: You can now add trailing commas in function calls under PHP 7.3 (mailing list thread), and also use the list reference assignment, the flexible Heredoc and Nowdoc Syntaxes are also supported. [...] And more: context sensitive lexer, PHPStan support, debugger, twig, hints, suggestions, code completionâ¦â visit PHP Features Page and NetBeans 10 New and Noteworthy for more details on PHP support. JUnit 5.3.1 has been added as a new Library to NetBeans, so you can quickly add it to your Java projects. For Maven projects without no existing tests, JUnit 5 is now the default JUnit version. -
Apache NetBeans 10.0 Now Available (apache.org)
The Apache Software Foundation has released NetBeans 10.0, the second major release of the Apache NetBeans IDE. The release, said the Apache Software Foundation, is focused in adding support for JDK 11, JUnit 5, PHP, JavaScript and Groovy, as well in solving many issues. From a blog post: JDK 11 support has been enhanced in the following ways: Integration with the nb-javac project, adding support for JDK 11, removed the CORBA modules, support for JEP 309, Dynamic Class-File Constants, support for JEP 323, Local-Variable Syntax for Lambda Parameters, and support for LVTI Support for Lamdba Parameters.
PHP 7.3: You can now add trailing commas in function calls under PHP 7.3 (mailing list thread), and also use the list reference assignment, the flexible Heredoc and Nowdoc Syntaxes are also supported. [...] And more: context sensitive lexer, PHPStan support, debugger, twig, hints, suggestions, code completionâ¦â visit PHP Features Page and NetBeans 10 New and Noteworthy for more details on PHP support. JUnit 5.3.1 has been added as a new Library to NetBeans, so you can quickly add it to your Java projects. For Maven projects without no existing tests, JUnit 5 is now the default JUnit version. -
Apache NetBeans 10.0 Now Available (apache.org)
The Apache Software Foundation has released NetBeans 10.0, the second major release of the Apache NetBeans IDE. The release, said the Apache Software Foundation, is focused in adding support for JDK 11, JUnit 5, PHP, JavaScript and Groovy, as well in solving many issues. From a blog post: JDK 11 support has been enhanced in the following ways: Integration with the nb-javac project, adding support for JDK 11, removed the CORBA modules, support for JEP 309, Dynamic Class-File Constants, support for JEP 323, Local-Variable Syntax for Lambda Parameters, and support for LVTI Support for Lamdba Parameters.
PHP 7.3: You can now add trailing commas in function calls under PHP 7.3 (mailing list thread), and also use the list reference assignment, the flexible Heredoc and Nowdoc Syntaxes are also supported. [...] And more: context sensitive lexer, PHPStan support, debugger, twig, hints, suggestions, code completionâ¦â visit PHP Features Page and NetBeans 10 New and Noteworthy for more details on PHP support. JUnit 5.3.1 has been added as a new Library to NetBeans, so you can quickly add it to your Java projects. For Maven projects without no existing tests, JUnit 5 is now the default JUnit version. -
Apache OpenOffice, the Schrodinger's Application: No One Knows If It's Dead or Alive, No One Really Wants To Look Inside (theregister.co.uk)
British IT news outlet The Register looks at the myriad of challenges Apache OpenOffice faces today. From the report: Last year Brett Porter, then chairman of the Apache Software Foundation, contemplated whether a proposed official blog post on the state of Apache OpenOffice (AOO) might discourage people from downloading the software due to lack of activity in the project. No such post from the software's developers surfaced. The languid pace of development at AOO, though, has been an issue since 2011 after Oracle (then patron of the project) got into a fork-fight with The Document Foundation, which created LibreOffice from the OpenOffice codebase, and asked developers backing the split to resign.
Back in 2015, Red Hat developer Christian Schaller called OpenOffice "all but dead." Assertions to that effect have continued since, alongside claims to the contrary. Almost a year ago, Jim Jagielski, a member of the Apache OpenOffice Project Management Committee, insisted things were going well and claimed there was renewed interest in the project. For all the concern about AOO, no issues have been raised recently before the Apache Foundation board to suggest ongoing difficulties. The project is due to provide an update this month, according to a spokesperson for the foundation. -
A Critical Apache Struts Security Flaw Makes It 'Easy' To Hack Fortune 100 Firms (zdnet.com)
An anonymous reader quotes a report from ZDNet: A critical security vulnerability in open-source server software enables hackers to easily take control of an affected server -- putting sensitive corporate data at risk. The vulnerability allows an attacker to remotely run code on servers that run applications using the REST plugin, built with Apache Struts, according to security researchers who discovered the vulnerability. All versions of Struts since 2008 are affected, said the researchers. Apache Struts is used across the Fortune 100 to provide web applications in Java, and it powers front- and back-end applications. Man Yue Mo, a security researcher at LGTM, who led the effort that led to the bug's discovery, said that Struts is used in many publicly accessible web applications, such as airline booking and internet banking systems. Mo said that all a hacker needs "is a web browser." "I can't stress enough how incredibly easy this is to exploit," said Bas van Schaik, product manager at Semmle, a company whose analytical software was used to discover the vulnerability. The report notes that "a source code fix was released some weeks prior, and Apache released a full patch on Tuesday to fix the vulnerability." It's now a waiting game for companies to patch their systems. -
Deserialization Issues Also Affect .NET, Not Just Java (bleepingcomputer.com)
"The .NET ecosystem is affected by a similar flaw that has wreaked havoc among Java apps and developers in 2016," reports BleepingComputer. An anonymous reader writes: The issue at hand is in how some .NET libraries deserialize JSON or XML data, doing it in a total unsecured way, but also how developers handle deserialization operations when working with libraries that offer optional secure systems to prevent deserialized data from accessing and running certain methods automatically. The issue is similar to a flaw known as Mad Gadget (or Java Apocalypse) that came to light in 2015 and 2016. The flaw rocked the Java ecosystem in 2016, as it affected the Java Commons Collection and 70 other Java libraries, and was even used to compromise PayPal's servers.
Organizations such as Apache, Oracle, Cisco, Red Hat, Jenkins, VMWare, IBM, Intel, Adobe, HP, and SolarWinds , all issued security patches to fix their products. The Java deserialization flaw was so dangerous that Google engineers banded together in their free time to repair open-source Java libraries and limit the flaw's reach, patching over 2,600 projects. Now a similar issue was discovered in .NET. This research has been presented at the Black Hat and DEF CON security conferences. On page 5 [of this PDF], researchers included reviews for all the .NET and Java apps they analyzed, pointing out which ones are safe and how developers should use them to avoid deserialization attacks when working with JSON data. -
Facebook Petitioned To Change License For ReactJS (github.com)
mpol writes: The Apache Software Foundation issued a notice last weekend indicating that it has added Facebook's BSD+Patents [ROCKSDB] license to its Category X list of disallowed licenses for Apache Project Management Committee members. This is the license that Facebook uses for most of its open source projects. The RocksDB software project from Facebook already changed its license to a dual Apache 2 and GPL 2. Users are now petitioning on GitHub to have Facebook change the license of React.JS as well.
React.JS is a well-known and often used JavaScript Framework for frontend development. It is licensed as BSD + Patents. If you use React.JS and agreed to its license, and you decide to sue Facebook for patent issues, you are no longer allowed to use React.JS or any Facebook software released under this license. -
Facebook Petitioned To Change License For ReactJS (github.com)
mpol writes: The Apache Software Foundation issued a notice last weekend indicating that it has added Facebook's BSD+Patents [ROCKSDB] license to its Category X list of disallowed licenses for Apache Project Management Committee members. This is the license that Facebook uses for most of its open source projects. The RocksDB software project from Facebook already changed its license to a dual Apache 2 and GPL 2. Users are now petitioning on GitHub to have Facebook change the license of React.JS as well.
React.JS is a well-known and often used JavaScript Framework for frontend development. It is licensed as BSD + Patents. If you use React.JS and agreed to its license, and you decide to sue Facebook for patent issues, you are no longer allowed to use React.JS or any Facebook software released under this license. -
Apache Servers Under Attack Through Easily Exploitable Struts 2 Flaw (helpnetsecurity.com)
Orome1 quotes a report from Help Net Security: A critical vulnerability in Apache Struts 2 is being actively and heavily exploited, even though the patch for it has been released on Monday. The vulnerability (CVE-2017-5638) affects the Jakarta file upload Multipart parser in Apache Struts 2. It allows attackers to include code in the "Content-Type" header of an HTTP request, so that it is executed by the web server. Almost concurrently with the release of the security update that plugs the hole, a Metasploit module for targeting it has been made available. Unfortunately, the vulnerability can be easily exploited as it requires no authentication, and two very reliable exploits have already been published online. Also, vulnerable servers are easy to discover through simple web scanning. "Struts 2 is a Java framework that is commonly used by Java-based web applications," reports SANS ISC in their blog. "It is also known as 'Jakarta Struts' and 'Apache Struts.' The Apache project currently maintains Struts." Cisco Talos also has a blog detailing the attack. -
Will Oracle Surrender NetBeans to Apache? (infoworld.com)
An anonymous Slashdot reader quotes InfoWorld: Venerable open source Java IDE NetBeans would move from Oracle's jurisdiction to the Apache Software Foundation under a proposal... endorsed by Java founder James Gosling, a longtime fan of the IDE. Moving NetBeans to a neutral venue like Apache, with its strong governance model, would help the project attract more contributions from various organizations, according to the proposal posted in the Apache wiki.
"Large companies are using NetBeans as an application framework to build internal or commercial applications and are much more likely to contribute to it once it moves to neutral Apache ground," the proposal says. While Oracle will relinquish its control over NetBeans under the proposal, individual contributors from Oracle are expected to continue contributing to the project.
On Facebook, Gosling posted the proposal meant "folks like me can more easily contribute to our favorite IDE. The finest IDE in existence will be getting even better, faster!" InfoWorld reports that when aked if Oracle had neglected NetBeans, Gosling said, "Oracle didn't single out NetBeans for neglect, they neglect everything... I'm thrilled that the NetBeans community will now be able to chart its own course." -
Apache PDFBox Hits 2.0 (sdtimes.com)
mmoorebz writes: After three years of development and with over 150 contributors to the code, Apache PDFBox 2.0 has been released. With this release comes enhancements and improvements. The Apache PDFBox library is an open-source Java tool for working with PDF documents. The project allows creation and manipulation of PDF documents, and the ability to extract content from them. Support for forms in open-source PDF viewers is currently disappointing, and I hope this heralds improvement on that front. -
Meet Flink, the Apache Software Foundation's Newest Top-Level Project
Open source data-processing language Flink, after just nine months' incubation with the Apache Software Foundation, has been elevated to top-level status, joining other ASF projects like OpenOffice and CloudStack. An anonymous reader writes The data-processing engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to Hadoop's MapReduce component with its own runtime. Yet the system still provides access to Hadoop's distributed file system and YARN resource manager. The open-source community around Flink has steadily grown since the project's inception at the Technical University of Berlin in 2009. Now at version 0.7.0, Flink lists more than 70 contributors and sponsors, including representatives from Hortonworks, Spotify and Data Artisans (a German startup devoted primarily to the development of Flink). (For more about ASF incubation, and what the Foundation's stewardship means, see our interview from last summer with ASF executive VP Rich Bowen.) -
The Apache Software Foundation Now Accepting BitCoin For Donations
rbowen writes The Apache Software Foundation is the latest not-for-profit organization to accept bitcoin donations, as pointed out by a user on the Bitcoin subreddit. The organization is well known for their catalog of open-source software, including the ubiquitous Apache web server, Hadoop, Tomcat, Cassandra, and about 150 other projects. Users in the community have been eager to support their efforts using digital currency for quite a while. The Foundation accepts donations in many different forms: Amazon, PayPal, and they'll even accept donated cars. On their contribution page the Apache Software Foundation has published a bitcoin address and QR code. -
The Apache Software Foundation Now Accepting BitCoin For Donations
rbowen writes The Apache Software Foundation is the latest not-for-profit organization to accept bitcoin donations, as pointed out by a user on the Bitcoin subreddit. The organization is well known for their catalog of open-source software, including the ubiquitous Apache web server, Hadoop, Tomcat, Cassandra, and about 150 other projects. Users in the community have been eager to support their efforts using digital currency for quite a while. The Foundation accepts donations in many different forms: Amazon, PayPal, and they'll even accept donated cars. On their contribution page the Apache Software Foundation has published a bitcoin address and QR code. -
Old Apache Code At Root of Android FakeID Mess
chicksdaddy writes: A four-year-old vulnerability in an open source component that is a critical part of Android leaves hundreds of millions of mobile devices susceptible to silent malware infections. The vulnerability affects devices running Android versions 2.1 to 4.4 ("KitKat"), according to a statement released by Bluebox. The vulnerability was found in a package installer in affected versions of Android. The installer doesn't attempt to determine the authenticity of certificate chains that are used to vouch for new digital identity certificates. In short, Bluebox writes, "an identity can claim to be issued by another identity, and the Android cryptographic code will not verify the claim."
The security implications of this are vast. Malicious actors could create a malicious mobile application with a digital identity certificate that claims to be issued by Adobe Systems. Once installed, vulnerable versions of Android will treat the application as if it was actually signed by Adobe and give it access to local resources, like the special webview plugin privilege, that can be used to sidestep security controls and virtual 'sandbox' environments that keep malicious programs from accessing sensitive data and other applications running on the Android device. The flaw appears to have been introduced to Android through an open source component, Apache Harmony. Google turned to Harmony as an alternative means of supporting Java in the absence of a deal with Oracle to license Java directly.
Work on Harmony was discontinued in November, 2011. However, Google has continued using native Android libraries that are based on Harmony code. The vulnerability concerning certificate validation in the package installer module persisted even as the two codebases diverged. -
Apache OpenOffice Reaches 100 Million Downloads. Now What?
We're thankfully long past the days when an emailed Word document was useless without a copy of Microsoft Word, and that's in large part thanks to the success of the OpenOffice family of word processors. "Family," because the OpenOffice name has been attached to several branches of a codebase that's gone through some serious evolution over the years, starting from its roots in closed-source StarOffice, acquired and open-sourced by Sun to become OpenOffice.org. The same software has led (via some hamfisted moves by Oracle after its acquisition of Sun) to the also-excellent LibreOffice. OpenOffice.org's direct descendant is Apache OpenOffice, and an anonymous reader writes with this excellent news from that project: "The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 170 Open Source projects and initiatives, announced today that Apache OpenOffice has been downloaded 100 million times. Over 100 million downloads, over 750 extensions, over 2,800 templates. But what does the community at Apache need to do to get the next 100 million?" If you want to play along, you can get the latest version of OpenOffice from SourceForge (Slashdot's corporate cousin). I wonder how many government offices -- the U.S. Federal government has long been Microsoft's biggest customer -- couldn't get along just fine with an open source word processor, even considering all the proprietary-format documents they're stuck with for now. -
Apache OpenOffice Reaches 100 Million Downloads. Now What?
We're thankfully long past the days when an emailed Word document was useless without a copy of Microsoft Word, and that's in large part thanks to the success of the OpenOffice family of word processors. "Family," because the OpenOffice name has been attached to several branches of a codebase that's gone through some serious evolution over the years, starting from its roots in closed-source StarOffice, acquired and open-sourced by Sun to become OpenOffice.org. The same software has led (via some hamfisted moves by Oracle after its acquisition of Sun) to the also-excellent LibreOffice. OpenOffice.org's direct descendant is Apache OpenOffice, and an anonymous reader writes with this excellent news from that project: "The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 170 Open Source projects and initiatives, announced today that Apache OpenOffice has been downloaded 100 million times. Over 100 million downloads, over 750 extensions, over 2,800 templates. But what does the community at Apache need to do to get the next 100 million?" If you want to play along, you can get the latest version of OpenOffice from SourceForge (Slashdot's corporate cousin). I wonder how many government offices -- the U.S. Federal government has long been Microsoft's biggest customer -- couldn't get along just fine with an open source word processor, even considering all the proprietary-format documents they're stuck with for now. -
New Apache Allura Project For Project Development Hosting
New submitter brondsem writes: "Today the Apache Software Foundation announced the Allura project for hosting software development projects. Think GitHub or SourceForge on your own servers — Allura has git, svn, hg, wiki, tickets, forums, news, etc. It's written in python and has a modular and extensible platform so you can write your own tools and extensions. It's already used by SourceForge, DARPA, German Aerospace Center, and Open Source Projects Europe. Allura is open source; available under the Apache License v2.0. When you don't want all your project resources in the cloud on somebody else's walled garden, you can run Allura on your own servers and have full control and full data access." (SourceForge shares a corporate overlord with Slashdot). -
New Apache Allura Project For Project Development Hosting
New submitter brondsem writes: "Today the Apache Software Foundation announced the Allura project for hosting software development projects. Think GitHub or SourceForge on your own servers — Allura has git, svn, hg, wiki, tickets, forums, news, etc. It's written in python and has a modular and extensible platform so you can write your own tools and extensions. It's already used by SourceForge, DARPA, German Aerospace Center, and Open Source Projects Europe. Allura is open source; available under the Apache License v2.0. When you don't want all your project resources in the cloud on somebody else's walled garden, you can run Allura on your own servers and have full control and full data access." (SourceForge shares a corporate overlord with Slashdot). -
Subversion Project Migrates To Git
New submitter gitficionado (3600283) writes "The Apache Subversion project has begun migrating its source code from the ASF Subversion repo to git. Last week, the Subversion PMC (project management committee) voted to migrate, and the migration has already begun. Although there was strong opposition to the move from the older and more conservative SVN devs, and reportedly a lot of grumbling and ranting when the vote was tallied, a member of the PMC (who asked to remain anonymous) told the author that 'this [migration] will finally let us get rid of the current broken design to a decentralized source control model [and we'll get] merge and rename done right after all this time.'" Source for the new git backend. -
Subversion Project Migrates To Git
New submitter gitficionado (3600283) writes "The Apache Subversion project has begun migrating its source code from the ASF Subversion repo to git. Last week, the Subversion PMC (project management committee) voted to migrate, and the migration has already begun. Although there was strong opposition to the move from the older and more conservative SVN devs, and reportedly a lot of grumbling and ranting when the vote was tallied, a member of the PMC (who asked to remain anonymous) told the author that 'this [migration] will finally let us get rid of the current broken design to a decentralized source control model [and we'll get] merge and rename done right after all this time.'" Source for the new git backend. -
Spinoffs From Spyland: How Some NSA Technology Is Making Its Way Into Industry
An anonymous reader writes with this news from MIT's Technology Review: "Like other federal agencies, the NSA is compelled by law to try to commercialize its R&D. It employs patent attorneys and has a marketing department that is now trying to license inventions ... The agency claims more than 170 patents ... But the NSA has faced severe challenges trying to keep up with rapidly changing technology. ... Most recently, the NSA's revamp included a sweeping effort to dismantle ... 'stovepipes,' and switch to flexible cloud computing ... in 2008, NSA brass ordered the agency's computer and information sciences research organization to create a version of the system Google uses to store its index of the Web and the raw images of Google Earth. That team was led by Adam Fuchs, now Sqrrl's chief technology officer. Its twist on big data was to add 'cell-level security,' a way of requiring a passcode for each data point ... that's how software (like the infamous PRISM application) knows what can be shown only to people with top-secret clearance. Similar features could control access to data about U.S. citizens. 'A lot of the technology we put [in] is to protect rights," says Fuchs. Like other big-data projects, the NSA team's system, called Accumulo, was built on top of open-source code because "you don't want to have to replicate everything yourself," ... In 2011, the NSA released 200,000 lines of code to the Apache Foundation. When Atlas Venture's Lynch read about that, he jumped—here was a technology already developed, proven to work on tens of terabytes of data, and with security features sorely needed by heavily regulated health-care and banking customers.'" -
Spark Advances From Apache Incubator To Top-Level Project
rjmarvin writes "The Apache Software Foundation announced that Spark, the open-source cluster-computing framework for Big Data analysis has graduated from the Apache Incubator to a top-level project. A project management committee will guide the project's day-to-day operations, and Databricks cofounder Matei Zaharia will be appointed VP of Apache Spark. Spark runs programs 100x faster than Apache Hadoop MapReduce in memory, and it provides APIs that enable developers to rapidly develop applications in Java, Python or Scala, according to the ASF." -
Spark Advances From Apache Incubator To Top-Level Project
rjmarvin writes "The Apache Software Foundation announced that Spark, the open-source cluster-computing framework for Big Data analysis has graduated from the Apache Incubator to a top-level project. A project management committee will guide the project's day-to-day operations, and Databricks cofounder Matei Zaharia will be appointed VP of Apache Spark. Spark runs programs 100x faster than Apache Hadoop MapReduce in memory, and it provides APIs that enable developers to rapidly develop applications in Java, Python or Scala, according to the ASF." -
Google Launches Cordova Powered Chrome Apps For Android and iOS
An anonymous reader writes "Google has launched Chrome apps for Android and iOS. The company is offering an early developer preview of a toolchain based on Apache Cordova, an open-source mobile development framework for building native mobile apps using HTML, CSS, and JavaScript. Developers can use the tool to wrap their Chrome app with a native application shell that enables them to distribute it via Google Play and Apple's App Store." -
Ask Slashdot: Best Way To Implement Wave Protocol Self Hosted?
First time accepted submitter zeigerpuppy writes "It's time to revisit Wave, or is it? I have been looking to implement a Wave installation on my server for private group collaboration. However, all evolutions of Wave seem to be closed-source or experiencing minimal development. I was excited about Kune, but its development looks stalled and despite Rizzoma claiming to be Open-Source, their code is nowhere to be found! Wave-in-a-box looks dead. So Slashdotters, do any of you have a working self-hosted Wave implementation?" -
Apache OpenOffice 4.0 Released With Major New Features
An anonymous reader writes "Still the most popular open source office suite, Apache OpenOffice 4 has been released, with many new enhancements and a new sidebar, based on IBM Symphony's implementation but with many improvements. The code still has comments in German but as long as real new features keep coming and can be shared with other office suites no one is complaining." The sidebar mentioned brings frequently used controls down and beside the actual area of a word-processing doc, say, which makes some sense given how wide many displays have become. This release comes with some major improvements to graphics handling, too; anti-aliasing makes for smoother bitmaps. In conjunction with this release, SourceForge (also under the Slashdot Media umbrella) has announced the launch of an extensions collection for OO. Extensions mean that Open Office can gain capabilities from outside contributors, rather than being wrapped up in large, all-or-nothing updates. You can download the latest version of Apache OpenOffice here. -
Subversion 1.8 Released But Will You Still Use Git?
darthcamaro writes "Remember back in the day when we all used CVS? Then we moved to SVN (subversion) but in the last three yrs or so everyone and their brother seems to have moved to Git, right? Well truth is Subversion is still going strong and just released version 1.8. While Git is still faster for some things, Greg Stein, the former chair of the Apache Software Foundation, figures SVN is better than Git at lots of things. From the article: '"With Subversion, you can have a 1T repository and check out just a small portion of it, The developers don't need full copies," Stein explained. "Git shops typically have many, smaller repositories, while svn shops typically have a single repository, which eases administration, backup, etc."'" Major new features of 1.8 include switching to a new metadata storage engine by default instead of using Berkeley DB, first-class renames (instead of the CVS-era holdover of deleting and recreating with a new name) which will make merges involving renamed files saner, and a slightly simplified branch merging interface. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Why the 'Star Trek Computer' Will Be Open Source and Apache Licensed
psykocrime writes "The crazy kids at Fogbeam Labs have a new blog post positing that there is a trend towards advanced projects in NLP, Information Retrieval, Big Data and the Semantic Web moving to the Apache Software Foundation. Considering that Apache UIMA is a key component of IBM Watson, is it wrong to believe that the organization behind Hadoop, OpenNLP, Jena, Stanbol, Mahout and Lucene will ultimately be the home of a real 'Star Trek Computer'? Quoting: 'When we talk about how the Star Trek computer had “access to all the data in the known Universe”, what we really mean is that it had access to something like the Semantic Web and the Linked Data cloud. Jena provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine. ... In addition to supporting the natural language interface with the system, OpenNLP is a powerful library for extracting meaning (semantics) from unstructured data - specifically textual data in an unstructured (or semi structured) format. An example of unstructured data would be the blog post, an article in the New York Times, or a Wikipedia article. OpenNLP combined with Jena and other technologies, allows “The computer” to “read” the Web, extracting meaningful data and saving valid assertions for later use.'" Speaking of the Star Trek computer, I'm continually disappointed that neither Siri nor Google Now can talk to me in Majel Barrett's voice. -
Apache OpenOffice Downloaded 50 Million Times In a Year
An anonymous reader writes with this quick bite from the H: "Just a few days after the one year anniversary of the release of the first version of OpenOffice from the Apache Foundation (Apache OpenOffice 3.4) on 8 May 2012, the project can now boast 50 million downloads of the Open Source office suite. 10 million of those downloads happened since the beginning of March. In contrast, LibreOffice claimed it had 15 million unique downloads of its office suite in all of 2012." -
Apache CloudStack Becomes a Top-level Project
ke4qqq writes with an excerpt from an ASF press release: "The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced that Apache CloudStack has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the Project's community and products have been well-governed under the ASF's meritocratic process and principles." -
Apache CloudStack Becomes a Top-level Project
ke4qqq writes with an excerpt from an ASF press release: "The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced that Apache CloudStack has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the Project's community and products have been well-governed under the ASF's meritocratic process and principles." -
Apache CloudStack Becomes a Top-level Project
ke4qqq writes with an excerpt from an ASF press release: "The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced that Apache CloudStack has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the Project's community and products have been well-governed under the ASF's meritocratic process and principles." -
Book Review: Hadoop Beginner's Guide
First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review. Hadoop Beginner's Guide author Garry Turkington pages 374 publisher Packt Publishing rating 9/10 reviewer Si Dunn ISBN 9781849517300 summary Explains and shows how to use Hadoop software in Big Data settings. Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queens University of Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.
Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.
Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).
Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."
His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.
He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."
The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.
Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)
"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."
In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.
Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]
Chapter 6, titled "When Things Break" zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen.much of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.
The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.
Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.
Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."
Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.
But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.
Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.
The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.
A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."
Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.
Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.
Running Hadoop on a Windows PC typically involves installing Cygwin and openSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and Hadoop on Windows with Eclipse".
To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.
There are other ways get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.
Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.
You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."
Si Dunn is an author, screenwriter, and technology book reviewer.
You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Book Review: Hadoop Beginner's Guide
First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review. Hadoop Beginner's Guide author Garry Turkington pages 374 publisher Packt Publishing rating 9/10 reviewer Si Dunn ISBN 9781849517300 summary Explains and shows how to use Hadoop software in Big Data settings. Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queens University of Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.
Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.
Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).
Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."
His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.
He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."
The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.
Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)
"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."
In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.
Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]
Chapter 6, titled "When Things Break" zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen.much of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.
The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.
Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.
Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."
Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.
But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.
Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.
The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.
A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."
Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.
Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.
Running Hadoop on a Windows PC typically involves installing Cygwin and openSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and Hadoop on Windows with Eclipse".
To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.
There are other ways get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.
Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.
You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."
Si Dunn is an author, screenwriter, and technology book reviewer.
You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Book Review: Hadoop Beginner's Guide
First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review. Hadoop Beginner's Guide author Garry Turkington pages 374 publisher Packt Publishing rating 9/10 reviewer Si Dunn ISBN 9781849517300 summary Explains and shows how to use Hadoop software in Big Data settings. Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queens University of Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.
Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.
Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).
Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."
His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.
He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."
The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.
Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)
"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."
In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.
Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]
Chapter 6, titled "When Things Break" zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen.much of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.
The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.
Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.
Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."
Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.
But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.
Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.
The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.
A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."
Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.
Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.
Running Hadoop on a Windows PC typically involves installing Cygwin and openSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and Hadoop on Windows with Eclipse".
To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.
There are other ways get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.
Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.
You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."
Si Dunn is an author, screenwriter, and technology book reviewer.
You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Book Review: Hadoop Beginner's Guide
First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review. Hadoop Beginner's Guide author Garry Turkington pages 374 publisher Packt Publishing rating 9/10 reviewer Si Dunn ISBN 9781849517300 summary Explains and shows how to use Hadoop software in Big Data settings. Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queens University of Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.
Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.
Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).
Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."
His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.
He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."
The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.
Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)
"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."
In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.
Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]
Chapter 6, titled "When Things Break" zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen.much of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.
The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.
Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.
Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."
Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.
But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.
Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.
The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.
A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."
Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.
Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.
Running Hadoop on a Windows PC typically involves installing Cygwin and openSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and Hadoop on Windows with Eclipse".
To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.
There are other ways get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.
Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.
You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."
Si Dunn is an author, screenwriter, and technology book reviewer.
You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Book Review: Hadoop Beginner's Guide
First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review. Hadoop Beginner's Guide author Garry Turkington pages 374 publisher Packt Publishing rating 9/10 reviewer Si Dunn ISBN 9781849517300 summary Explains and shows how to use Hadoop software in Big Data settings. Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queens University of Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.
Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.
Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).
Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."
His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.
He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."
The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.
Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)
"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."
In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.
Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]
Chapter 6, titled "When Things Break" zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen.much of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.
The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.
Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.
Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."
Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.
But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.
Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.
The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.
A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."
Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.
Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.
Running Hadoop on a Windows PC typically involves installing Cygwin and openSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and Hadoop on Windows with Eclipse".
To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.
There are other ways get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.
Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.
You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."
Si Dunn is an author, screenwriter, and technology book reviewer.
You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
PunkSPIDER Project Puts Vulnerabilities On (Searchable) Display
First time accepted submitter punk2176 writes "Recently I started a free and open source project known as the PunkSPIDER project and presented it at ShmooCon 2013. If you haven't heard of it, it's at heart, a project with the goal of pushing for improved global website security. In order to do this we built a Hadoop distributed computing cluster along with a website vulnerability scanner that can use the cluster. Once we finished that we open sourced the code to our scanner and unleashed it on the Internet. The results of our scans are provided to the public for free in an easy-to-use search engine. The results so far aren't pretty." The Register has an informative article, too. -
OpenOffice: Worth $21 Million Per Day, If It Were Microsoft Office
rbowen of SourceForge writes with an interesting way to look at the value of certain free software options: "Apache OpenOffice 3.4.1 has averaged 138,928 downloads per day. That is an average value to the public of $21 million per day, as calculated by savings over buying the competing product. Or $7.61 billion (7.61 thousand million) per year." (That works out to about $150 per copy of MS Office. There are some holes in the argument, but it holds true for everyone who but for a free office suite would have paid that much for Microsoft's. The numbers are even bigger if you toss in LibreOffice, too.) -
How Open Source Could Benefit Academic Research
dp619 writes "Ross Gardler, of Apache Fame, has written a guest post on the Outercurve Foundation blog advocating that universities accelerate the research process through a collaborative sharing and development of research software while examining reasons why many have been reluctant to publish their source code. Quoting: 'These highly specialized software solutions are not rarely engineered for reuse. They are often hacks to answer a specific question quickly. ... What many academic researchers fail to understand is that this specialization problem is not unique to research projects. Most software developers will seek to provide an adequate solution to their specific problem, as quickly as possible. They don't seek to build a perfect, all-purpose, tool set that can be reused in every conceivable circumstance. They simply solve the problem at hand and move on to the next one. The difference is that open source developers will do this incremental problem solving using shared code. They will share that code in incremental steps rather than wait until they've built the complete system they need but is too specific for others to use. Other people will reuse and improve on the initial solution, perhaps generalizing it a little in the process. There is no need to share the details of why one needs a 'green widget' nor is there any reason to prevent someone modifying it so it can be either a 'green widget' or a 'blue widget.'"