disobey.com · Domains · Slashdot Mirror

Spidering Hacks

Developers · Programming · 2003-12-16 07:00 · posted by timothy · from the use-for-good-not-evil dept. · 121 comments

DrCarbonite writes "Spidering Hacks is a well-written guide to scripting and automating your data-seeking forays onto the Internet. It offers an attractive combination of solving the problems you have and exposing you to solutions that you weren't aware you needed." Read on for Martin's review of the book.

Spidering Hacks
Author: Kevin Hemenway and Tara Calishain
Pages: 402
Publisher: O'Reilly
Rating: 8
Reviewer: Jeff Martin
ISBN: 0596005776
Summary: A wide-ranging collection of hacks detailing how to be more productive in Internet research and data retrieval

Introduction

Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, allowing the computer to obtain data, leaving the user free to spend more time on analysis.

SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still learn a lot of useful tips that will help them with their own projects.

Overview

Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers an appreciation for the power they are utilizing.

Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, respecting robots.txt, and various topics (Perl's LWP and WWW::Mechanize modules, for example) that will provide the reader with a solid foundation throughout the rest of the book. SH does a great job introducing some topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
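As a rough illustration of what respecting robots.txt involves, here is a minimal sketch. It is not taken from the book, whose examples are in Perl; it uses Python's standard urllib.robotparser instead. The robots.txt content, the agent name ExampleSpider/0.1, and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the target site before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user-agent string for our spider.
AGENT = "ExampleSpider/0.1"


def build_parser(robots_text):
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this URL open to our agent?
allowed_public = parser.can_fetch(AGENT, "http://example.com/index.html")
allowed_private = parser.can_fetch(AGENT, "http://example.com/private/data.html")

# Seconds the site asks robots to wait between requests (None if unspecified).
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

A polite spider would skip any URL for which can_fetch returns False and sleep for the requested delay between requests.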

Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While the individual sites described here may not appeal to everyone, each hack serves as a specific example of a general concept that can be applied to sites of the reader's choosing.

Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.

Hack #48, Super Word Lookup, is a good example of why SH is so intriguing. While utilizing a dictionary or thesaurus via a browser is simple, having the ability to do so with a command-line program allows the user an automated approach, reducing distractions.

Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.

Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.
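To give a sense of how little is needed to publish a basic feed, here is a minimal sketch that builds an RSS 2.0 document with Python's standard library. It is an illustration only, not the book's Perl code; the feed title, the example.org links, and the build_rss helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    `items` is a list of dicts with "title" and "link" keys
    (a hypothetical shape chosen for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # Required channel-level elements.
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text

    # One <item> per entry.
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]

    return ET.tostring(rss, encoding="unicode")


feed_xml = build_rss(
    "Example Site Updates",
    "http://example.org/",
    "Recent changes to an example site",
    [
        {"title": "First update", "link": "http://example.org/updates/1"},
        {"title": "Second update", "link": "http://example.org/updates/2"},
    ],
)
print(feed_xml)
```

Writing the returned string to a file served by the site is enough for aggregators and other spiders to pick up the updates without scraping HTML.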

Conclusion

There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.

The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.

You can purchase Spidering Hacks from bn.com. Slashdot welcomes readers' book reviews -- to submit a review for consideration, read the book review guidelines, then visit the submission page.

Practical RDF

Tech · Internet · 2003-09-23 04:00 · posted by timothy · from the tools-applied dept. · 120 comments

briandonovan writes "World Wide Web Consortium (W3C) Director Tim Berners-Lee and his compatriots would like to transform the current Web into a 'Semantic Web' where 'software agents roaming from page to page can readily carry out sophisticated tasks for users' using 'structured collections of information and sets of inference rules.' The Resource Description Framework (RDF), designed as a language for expressing information about resources on the Web, and allied technologies are the result to date of ongoing efforts at the W3C to furnish Semantic Web proponents with the requisite tools. While it's far too early to predict whether TimBL's grand vision will be realized, RDF/XML (the XML serialization of RDF) is already in widespread use, having been incorporated into a surprising array of applications." Read on below for briandonovan's link-stuffed review of O'Reilly's Practical RDF.

Practical RDF: Solving Problems with the Resource Description Framework
Author: Shelley Powers
Pages: 331
Publisher: O'Reilly & Associates
Rating: 9/10
Reviewer: Brian Donovan
ISBN: 0596002637
Summary: Great introduction to RDF, an assortment of tools and utilities for working with RDF, and some real-world applications.

RDF first hit my radar screen a couple of years ago while I was working on a barebones tool to manage my personal website. I was writing the code to generate RSS feeds ("What is RSS?") for my site and had to choose whether to support RSS 0.9x (non-RDF) or RSS 1.0 (RDF-based) or both. Long story short: I went with RSS 1.0 and was able to implement the feeds, but never got any further into RDF afterwards. I couldn't make headway through the RDF-related working drafts rapidly enough to justify the time that I was spending, there weren't any worthwhile-looking books available at the time, and the few online tutorials that I found were sorely lacking -- possibly because the specs themselves were still evolving as the RDF Core Working Group hashed out some remaining issues.

Fast forward a few years: the dust in RDF-land seems to be settling a bit (although new working drafts of all of the current RDF specs were released on September 5th, most of the changes from previous versions appear to be relatively minor) and, with the publication of Shelley Powers' Practical RDF: Solving Problems with the Resource Description Framework, there's finally a good book available on the subject.

Overview

After an introductory chapter that touches on the history of RDF and some applications of RDF/XML (the preferred, W3C-blessed serialization of RDF), the book is divided into three broad sections. In the first, the reader is guided through the raft of documentation produced by the RDF Core WG, including: Resource Description Framework (RDF): Concepts and Abstract Data Model, RDF/XML Syntax Specification, RDF Model Theory (formerly Semantics), and RDF Vocabulary Description Language 1.0: RDF Schema. Before moving on to Part II, where she surveys programming language support and tools available for working with RDF (with code snippets where appropriate), Powers spends a chapter developing an RDF vocabulary, "PostCon," that's used throughout the remainder of the book for demo purposes.
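For readers new to the abstract data model those specifications describe, RDF statements are triples of subject, predicate, and object. The following is a minimal sketch of that idea in plain Python, not code from the book and not any of the toolkits it covers. The example.org resources and the match helper are hypothetical; the predicates use the real Dublin Core element namespace.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Dublin Core element namespace, used here for the predicates.
DC = "http://purl.org/dc/elements/1.1/"

# A tiny graph about two hypothetical resources.
graph = {
    Triple("http://example.org/post/1", DC + "title", "First post"),
    Triple("http://example.org/post/1", DC + "creator", "A. Author"),
    Triple("http://example.org/post/2", DC + "title", "Second post"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# "What titles does the graph know about?"
titles = [t.obj for t in match(graph, predicate=DC + "title")]
print(titles)
```

Matching triple patterns with wildcards is, in much simplified form, the idea that RDF query languages build on.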

Chapter 7, the first in the tools-focused portion of Practical RDF, is dedicated to (mostly Java-based) editors, parsers, validators, browsers, etc. for desktop use. Next, she dives into Jena, the Java RDF toolkit that began life as the labor of love of HP Labs researcher Brian McBride before being elevated to the status of a formal HP Labs project under their Semantic Web Research umbrella. Another HP Labs Semantic Web project, Damian Steer's BrownSauce, a slick little Java-based RDF browser, was introduced back in Chapter 7. Means for manipulating RDF/XML in Perl (RDF::Core, part of Ginger Alliance's PerlRDF project), PHP (RAP, the RDF API for PHP), and Python (RDFLib) are addressed in Chapter 9. RDF query engines/languages are taken up next -- rdfDB QL, the query language of R.V. Guha's rdfDB (written in C); SquishQL, implemented in the Java-based Inkling query engine (built atop PostgreSQL); RDQL, used within Jena; and Sesame, a JSP/Servlet querying engine that supports both RDQL and its own query language, RQL, and can be deployed atop MySQL or PostgreSQL. Powers rounds out this part of her book with a chapter that deals briefly with the leftovers. Drive, an RDF API for C#, is briefly discussed along with RDF APIs for less fashionable programming languages: Nokia's Wilbur for CLOS, XOTcl for Tcl, and RubyRDF for Ruby. Redland, an RDF toolkit written in C with Java, Perl, PHP, Python, Ruby, and Tcl wrappers, is covered at some length (about half a dozen pages) and a couple more are given over to Redfoot, a Python RDF framework consisting of RDFLib (mentioned earlier in the Perl/PHP/Python chapter), a small-footprint HTTP server (according to the changelog at redfoot.net, they're using Medusa), and a native scripting language called Hypercode that lives within CDATA blocks in RDF/XML.

The last third of Practical RDF is devoted to uses of RDF and begins with a chapter on the OWL Web Ontology Language, an extension to RDF that's designed to supply more constraints for RDF vocabularies than can be provided by RDF Schema alone. This chapter would have been better situated after Chapter 5, which addresses RDF Schema, and feels a bit out of place here. RSS 1.0, the RDF-based syndication format, gets a chapter all of its own, beginning with a short synopsis of the evolution of RSS and the rift between the RSS 0.9x/2.0 and RSS 1.0 camps, progressing through descriptions of the RSS elements, some discussion of the use of modules, RSS autodiscovery, and aggregators (Amphetadesk, Meerkat, and NetNewsWire are mentioned), and finishing with an example RSS file (a syndicated list of book recommendations), producing RSS 1.0 using the Informa RSS Library (a set of Java classes), and merging two RSS 1.0 files using the XML::RSS Perl module. Two "Applications Based on RDF" (commercial and noncommercial) chapters top off the book. Noncommercial applications of RDF are visited first: Mozilla, where history and bookmarks, among other classes of information, are stored in RDF; the Creative Commons licensing scheme, whose proponents encourage content creators to embed RDF snippets into their documents and applications to provide information about the work itself and the restrictions placed on its reuse under the particular CC license that they've chosen; a Java and PostgreSQL based digital library system jointly developed by MIT and HP that uses RDF; and FOAF (Friend-of-a-Friend), an RDF vocabulary designed to express personal information and interpersonal relationships. Among the list of commercial applications utilizing RDF that comprises the final chapter in the book is Chandler, the same as yet very-alpha personal information manager that's managed to garner multiple mentions on this site.

The Verdict

The real meat of Practical RDF, for me, was in Chapters 1 through 6 (plus the OWL chapter, Chapter 12). This is not to say that the material in the last 2/3 of the book isn't useful or interesting. The section on RDF software tools is a great annotated survey of what's out there right now ... and I would imagine that installing and test-driving each of the software applications featured in those chapters must have been an extremely time-consuming process. The chapters describing real-world applications of RDF could be useful to someone trying to convince a manager that RDF is a viable, widely-used technology. Given a choice, though, I would rather have seen those pages spent on additional coverage of RDF, RDFS, and OWL with more example RDF vocabularies developed (like PostCon, which the author formulated, then refined through RDFS and OWL). The displaced material could have been made available online at the author's site for the book. A lot of that information will become less accurate over time anyway as the software evolves and people come up with more applications for RDF.

All nitpicking aside, though, if you're looking for a book on RDF, then you can't go wrong with Shelley Powers' Practical RDF.

You can purchase Practical RDF from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

Fitting Slashdot Into Your Schedule

Apple · Apnetworking · 2002-10-15 00:45 · posted by pudge · from the i-am-posting-this-only-so-i-can-see-it-appear-in-ical dept. · 16 comments

droleary writes "Looking for more ways to fit the new iCal into your life, or just a way to check web site updates without it looking like you're not working? Well, Subsume Technologies has just announced a cool new way to do it: wCal. You can subscribe to frequently updated calendars that are headlines of (hopefully a growing number of) web sites, including a constant-refresh-ending Slashdot: Apple calendar (the press release has the subscribe link)." I first heard of this idea from Morbus Iff back on Sept. 11, and am still not convinced of the utility, but it's an interesting idea. Maybe it will catch on.

Net Cemetery

Internet · 2001-06-22 01:29 · posted by ryuzaki0 · from the of-playing-taps-at-doubletime dept. · 56 comments

Ant wrote to us regarding coverage of the .com dead - the Net Cemetery. It's a fun piece, which gets into the problems of covering and reviewing a medium that's changing every day. If you're into wandering through the .com wasteland, you should also check out Ghost Sites, which does a great job of "museumifing" (sounds like transmogrify) the same type of sites.

Mirsky Makes "Open Business Plans"

Humor · 2000-05-12 06:06 · posted by ryuzaki0 · from the i-love-this-idea dept. · 36 comments

Mirsky, the guy who brought you "Worst of the Web" until 1996, has returned to the Web. He e-mailed me about his new "open business plans." I do have to say I think that the Valueporn is a great idea - and Mirsky has the ultimate sticky EULA *grin*. That, and multiple online weddings.

Join the NetSlaves!

Internet · 1999-01-15 23:17 · posted by ryuzaki0 · from the good-stuff-to-look-at dept. · 14 comments

Well, with the spare time (huh?) that comes with the weekend, I've been poking around inside of NetSlaves. Good site that looks at real-life "Dilberts", with real-life examples of disgruntled tech workers from inside the industry. Careful -- sometimes the stories, like the current one about "K" who goes from support drone to production, only to meet his doom at the hands of his PHB, ring /way/ too close to real life. My comps to Bill Lessard and Steve Baldwin.

Ghost Sites Catalogs the Dead Web

Internet · 1998-12-28 00:12 · posted by ryuzaki0 · from the ain't-that-amusing dept. · 0 comments

Otter writes "Ghost Sites is a monthly (?) publication devoted to identifying and cataloging web sites that have been abandoned on their servers. It's the digital equivalent of wandering through a deserted town." Otter found this over at Upside.

Domain: disobey.com

Stories · 7