Domain: cpan.org
Stories and comments across the archive that link to cpan.org.
Stories · 56
-
Learn About the FRDCSA 'Weak AI' Project (Video)
Today's interviewee, Andrew Dougherty, has a Web page that says he is "...an autodidact mathematician and computer scientist specializing in Artificial Intelligence (AI) and Algorithmic Information Theory (AIT). He is the founder of the FRDCSA (Formalized Research Database: Cluster Study & Apply) project, a practical attempt at weak AI aimed primarily at collecting and interrelating existing software with theoretical motivation from AIT. He has made over 90 open source applications, 400 (unofficial) Debian GNU/Linux packages and 800 Perl5 modules (see http://frdcsa.org/frdcsa)." Tim Lord says Andrew's project "brings together a lot of AI algorithms, collects large sets of data for those algorithms to chew on, and writes software to do things like ... guide your whole life." As you might guess, Andrew occupies a pretty far edge of the eccentric programmer world, as you'll see from this video (and transcript). He calls himself "a serious Stallmanite" (his word), and has chosen the GPL for his software in the hopes that it will therefore help the greatest number of people. (Speaking of help, he's looking for interesting data sets and various "life rules" that can be integrated with his planning software, and one of the reasons he presented at the recent YAPC::NA was to solicit help in putting his hundreds of Perl modules onto CPAN.) -
Perl 5.16.0 Released
An anonymous reader writes "Perl 5.16.0 is now available with plenty of improvements all around. You can view a summary and all the change details here. With Perl on an annual release schedule, and projects like Mojolicious, Dancer, perlbrew, Plack, and Moose continuing to gain in popularity, are we in the middle of a Perl renaissance?" -
RubyGems' Module Count Soon To Surpass CPAN's
mfarver writes "According to the data gathered by modulecounts.com, the total number of modules checked into RubyGems (18,894, and growing at about 27/day) will probably exceed CPAN (18,928, and growing about 8/day) this week." -
Something For (Almost) Every Developer
First up, reader martinjlogan sends along a tutorial for setting up a workable Erlang/OTP development environment on a Mac. Next, reader acid06 notes news of Perl 5.12, including what may be the first delivered fix for the Y2K38 bug. (Hit the Read More link below for some details on Perl's new release strategy.) "After two years of development, the new major version of Perl is now available. Notable new features are: better Unicode support, proper support for time after the Y2038 barrier, new APIs to allow developers to extend Perl with 'pluggable' keywords and syntax, warnings for deprecated features and more. From the linked post: You can get it from the CPAN right now or wait for a platform-specific release (such as Strawberry Perl for Windows)." Finally, from reader snydeq: "InfoWorld's Martin Heller provides an in-depth review of Visual Studio 2010 and finds Microsoft taking several large steps away from its legacy IDE code. 'Visual Studio 2010 is a major upgrade in functionality and capability from its predecessor. Developers, architects, and testers will all find areas where the new version makes their jobs easier. Despite the higher pricing for this version, most serious Microsoft-oriented shops will upgrade to Visual Studio 2010 and never look back,' Heller writes. Chief among the improvements are Microsoft's revamping the core editing and designer views to use WPF, its overhaul of IntelliSense and support for test-driven development, and its intelligent support for multiple versions of the .Net Framework."
Re: Perl. This release cycle marks a change to a time-based release process. Beginning with version 5.11.0, we make a new development release of Perl available on the 20th of each month. Each spring, we will release a new stable version of Perl. One month later, we will make a minor update to deal with any issues discovered after the initial ".0" release. Future releases in the stable series will follow quarterly. In contrast to releases of Perl, maintenance releases will contain fixes for issues discovered after the .0 release, but will not include new features or behavior. -
Microsoft Bots Effectively DDoSing Perl CPAN Testers
at_slashdot writes "The Perl CPAN Testers have been suffering issues accessing their sites, databases and mirrors. According to a posting on the CPAN Testers' blog, the CPAN Testers' server has been being aggressively scanned by '20-30 bots every few seconds' in what they call 'a dedicated denial of service attack'; these bots 'completely ignore the rules specified in robots.txt.'" From the Heise story linked above: "The bots were identified by their IP addresses, including 65.55.207.x, 65.55.107.x and 65.55.106.x, as coming from Microsoft." -
Where's the "IronPerl" Project?
pondlife writes "A friend asked me today about using some Microsoft server components from Perl. Over the years he's built up a large collection of Perl/COM code using Win32::OLE and he had planned on doing the same thing here. The big problem is that as with many current MS APIs, they're available for .NET only because COM is effectively deprecated at this point. I did some Googling, expecting to find quickly the Perl equivalent of IronPython or IronRuby. But to my surprise I found almost nothing. ActiveState has PerlNET, but there's almost no information about it, and the mailing list 'activity' suggests it's dead or dying anyway. So, what are Perl/Windows shops doing now that more and more Microsoft components are .NET? Are people moving to other languages for Windows administration? Are they writing wrappers using COM interop? Or have I completely missed something out there that solves this problem?" -
Slashdot's Setup, Part 2- Software
Today we have Part 2 in our exciting 2 part series about the infrastructure that powers Slashdot. Last week Uriah told us all about the hardware powering the system. This week, Jamie McCarthy picks up the story and tells us about the software... from pound to memcached to mysql and more. Hit that link and read on.The software side of Slashdot takes over at the point where our load balancers -- described in Friday's hardware story -- hand off your incoming HTTP request to our pound servers.
Pound is a reverse proxy, which means it doesn't service the request itself, it just chooses which web server to hand it off to. We run 6 pounds, one for HTTPS traffic and the other 5 for regular HTTP. (Didn't know we support HTTPS, did ya? It's one of the perks for subscribers: you get to read Slashdot on the same webhead that admins use, which is always going to be responsive even during a crush of traffic -- because if it isn't, Rob's going to breathe down our necks!)
The pounds send traffic to one of the 16 apaches on our 16 webheads -- 15 regular, and the 1 HTTPS. Now, pound itself is so undemanding that we run it side-by-side with the apaches. The HTTPS pound handles SSL itself, handing off a plaintext HTTP request to its machine's apache, so the apache it redirects traffic to doesn't need mod_ssl compiled in. One less headache! Of our other 15 webheads, 5 also run a pound, not to distribute load but just for redundancy.
(Trivia: pound normally adds an X-Forwarded-For header, which Slash::Apache substitutes for the (internal) IP of pound itself. But sometimes if you use a proxy on the internet to do something bad, it will send us an X-Forwarded-For header too, which we use to try to track abuse. So we patched pound to insert a special X-Forward-Pound header, so it doesn't overwrite what may come from an abuser's proxy.)
The other 15 webheads are segregated by type. This segregation is mostly what pound is for. We have 2 webheads for static (.shtml) requests, 4 for the dynamic homepage, 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl), and 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose). We segregate partly so that if there's a performance problem or a DDoS on a specific page, the rest of the site will remain functional. We're constantly changing the code and this sets up "performance firewalls" for when us silly coders decide to write infinite loops.
But we also segregate for efficiency reasons like httpd-level caching, and MaxClients tuning. Our webhead bottleneck is CPU, not RAM. We run MaxClients that might seem absurdly low (5-15 for dynamic webheads, 25 for static) but our philosophy is if we're not turning over requests quickly anyway, something's wrong, and stacking up more requests won't help the CPU chew through them any faster.
All the webheads run the same software, which they mount from a /usr/local exported by a read-only NFS machine. Everyone I've ever met outside of this company gives an involuntary shudder when NFS is mentioned, and yet we haven't had any problems since shortly after it was set up (2002-ish). I attribute this to a combination of our brilliant sysadmins and the fact that we only export read-only. The backend task that writes to /usr/local (to update index.shtml every minute, for example) runs on the NFS server itself.
The apaches are versions 1.3, because there's never been a reason for us to switch to 2.0. We compile in mod_perl, and lingerd to free up RAM during delivery, but the only other nonstandard module we use is mod_auth_useragent to keep unfriendly bots away. Slash does make extensive use of each phase of the request loop (largely so we can send our 403's to out-of-control bots using a minimum of resources, and so your page is fully on its way while we write to the logging DB).
Slash, of course, is the open-source perl code that runs Slashdot. If you're thinking of playing around with it, grab a recent copy from CVS: it's been years since we got around to a tarball release. The various scripts that handle web requests access the database through Slash's SQL API, implemented on top of DBD::mysql (now maintained, incidentally, by one of the original Slash 1.0 coders) and of course DBI.pm. The most interesting parts of this layer might be:
(a) We don't use Apache::DBI. We use connect_cached, but actually our main connection cache is the global objects that hold the connections. Some small chunks of data are so frequently used that we keep them around in those objects.
(b) We almost never use statement handles. We have eleven ways of doing a SELECT and the differences are mostly how we massage the results into the perl data structure they return.
(c) We don't use placeholders. Originally because DBD::mysql didn't take advantage of them, and now because we think any speed increase in a reasonably-optimized web app should be a trivial payoff for non-self-documenting argument order. Discuss!
(d) We built in replication support. A database object requested as a reader picks a random slave to read from for the duration of your HTTP request (or the backend task). We can weight them manually, and we have a task that reweights them automatically. (If we do something stupid and wedge a slave's replication thread, every Slash process, across 17 machines, starts throttling back its connections to that machine within 10 seconds. This was originally written to handle slave DBs getting bogged down by load, but with our new faster DBs, that just never happens, so if a slave falls behind, one of us probably typed something dumb at the mysql> prompt.)
(e) We bolted on memcached support. Why bolted-on? Because back when we first tried memcached, we got a huge performance boost by caching our three big data types (users, stories, comment text) and we're pretty sure additional caching would provide minimal benefit at this point. Memcached's main use is to get and set data objects, and Slash doesn't really bottleneck that way.
Slash 1.0 was written way back in early 2000 with decent support for get and set methods to abstract objects out of a database (getDescriptions, subclassed _wheresql) -- but over the years we've only used them a few times. Most data types that are candidates to be objectified either are processed in large numbers (like tags and comments), in ways that would be difficult to do efficiently by subclassing, or have complicated table structures and pre- and post-processing (like users) that would make any generic objectification code pretty complicated. So most data access is done through get and set methods written custom for each data type, or, just as often, through methods that perform one specific update or select.
Overall, we're pretty happy with the database side of things. Most tables are fairly well normalized, not fully but mostly, and we've found this improves performance in most cases. Even on a fairly large site like Slashdot, with modern hardware and a little thinking ahead, we're able to push code and schema changes live quickly. Thanks to running multiple-master replication, we can keep the site fully live even during blocking queries like ALTER TABLE. After changes go live, we can find performance problem spots and optimize (which usually means caching, caching, caching, and occasionally multi-pass log processing for things like detecting abuse and picking users out of a hat who get mod points).
In fact, I'll go further than "pretty happy." Writing a database-backed web site has changed dramatically over the past seven years. The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.
-
Slashdot's Setup, Part 2- Software
Today we have Part 2 in our exciting 2 part series about the infrastructure that powers Slashdot. Last week Uriah told us all about the hardware powering the system. This week, Jamie McCarthy picks up the story and tells us about the software... from pound to memcached to mysql and more. Hit that link and read on.The software side of Slashdot takes over at the point where our load balancers -- described in Friday's hardware story -- hand off your incoming HTTP request to our pound servers.
Pound is a reverse proxy, which means it doesn't service the request itself, it just chooses which web server to hand it off to. We run 6 pounds, one for HTTPS traffic and the other 5 for regular HTTP. (Didn't know we support HTTPS, did ya? It's one of the perks for subscribers: you get to read Slashdot on the same webhead that admins use, which is always going to be responsive even during a crush of traffic -- because if it isn't, Rob's going to breathe down our necks!)
The pounds send traffic to one of the 16 apaches on our 16 webheads -- 15 regular, and the 1 HTTPS. Now, pound itself is so undemanding that we run it side-by-side with the apaches. The HTTPS pound handles SSL itself, handing off a plaintext HTTP request to its machine's apache, so the apache it redirects traffic to doesn't need mod_ssl compiled in. One less headache! Of our other 15 webheads, 5 also run a pound, not to distribute load but just for redundancy.
(Trivia: pound normally adds an X-Forwarded-For header, which Slash::Apache substitutes for the (internal) IP of pound itself. But sometimes if you use a proxy on the internet to do something bad, it will send us an X-Forwarded-For header too, which we use to try to track abuse. So we patched pound to insert a special X-Forward-Pound header, so it doesn't overwrite what may come from an abuser's proxy.)
The other 15 webheads are segregated by type. This segregation is mostly what pound is for. We have 2 webheads for static (.shtml) requests, 4 for the dynamic homepage, 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl), and 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose). We segregate partly so that if there's a performance problem or a DDoS on a specific page, the rest of the site will remain functional. We're constantly changing the code and this sets up "performance firewalls" for when us silly coders decide to write infinite loops.
But we also segregate for efficiency reasons like httpd-level caching, and MaxClients tuning. Our webhead bottleneck is CPU, not RAM. We run MaxClients that might seem absurdly low (5-15 for dynamic webheads, 25 for static) but our philosophy is if we're not turning over requests quickly anyway, something's wrong, and stacking up more requests won't help the CPU chew through them any faster.
All the webheads run the same software, which they mount from a /usr/local exported by a read-only NFS machine. Everyone I've ever met outside of this company gives an involuntary shudder when NFS is mentioned, and yet we haven't had any problems since shortly after it was set up (2002-ish). I attribute this to a combination of our brilliant sysadmins and the fact that we only export read-only. The backend task that writes to /usr/local (to update index.shtml every minute, for example) runs on the NFS server itself.
The apaches are versions 1.3, because there's never been a reason for us to switch to 2.0. We compile in mod_perl, and lingerd to free up RAM during delivery, but the only other nonstandard module we use is mod_auth_useragent to keep unfriendly bots away. Slash does make extensive use of each phase of the request loop (largely so we can send our 403's to out-of-control bots using a minimum of resources, and so your page is fully on its way while we write to the logging DB).
Slash, of course, is the open-source perl code that runs Slashdot. If you're thinking of playing around with it, grab a recent copy from CVS: it's been years since we got around to a tarball release. The various scripts that handle web requests access the database through Slash's SQL API, implemented on top of DBD::mysql (now maintained, incidentally, by one of the original Slash 1.0 coders) and of course DBI.pm. The most interesting parts of this layer might be:
(a) We don't use Apache::DBI. We use connect_cached, but actually our main connection cache is the global objects that hold the connections. Some small chunks of data are so frequently used that we keep them around in those objects.
(b) We almost never use statement handles. We have eleven ways of doing a SELECT and the differences are mostly how we massage the results into the perl data structure they return.
(c) We don't use placeholders. Originally because DBD::mysql didn't take advantage of them, and now because we think any speed increase in a reasonably-optimized web app should be a trivial payoff for non-self-documenting argument order. Discuss!
(d) We built in replication support. A database object requested as a reader picks a random slave to read from for the duration of your HTTP request (or the backend task). We can weight them manually, and we have a task that reweights them automatically. (If we do something stupid and wedge a slave's replication thread, every Slash process, across 17 machines, starts throttling back its connections to that machine within 10 seconds. This was originally written to handle slave DBs getting bogged down by load, but with our new faster DBs, that just never happens, so if a slave falls behind, one of us probably typed something dumb at the mysql> prompt.)
(e) We bolted on memcached support. Why bolted-on? Because back when we first tried memcached, we got a huge performance boost by caching our three big data types (users, stories, comment text) and we're pretty sure additional caching would provide minimal benefit at this point. Memcached's main use is to get and set data objects, and Slash doesn't really bottleneck that way.
Slash 1.0 was written way back in early 2000 with decent support for get and set methods to abstract objects out of a database (getDescriptions, subclassed _wheresql) -- but over the years we've only used them a few times. Most data types that are candidates to be objectified either are processed in large numbers (like tags and comments), in ways that would be difficult to do efficiently by subclassing, or have complicated table structures and pre- and post-processing (like users) that would make any generic objectification code pretty complicated. So most data access is done through get and set methods written custom for each data type, or, just as often, through methods that perform one specific update or select.
Overall, we're pretty happy with the database side of things. Most tables are fairly well normalized, not fully but mostly, and we've found this improves performance in most cases. Even on a fairly large site like Slashdot, with modern hardware and a little thinking ahead, we're able to push code and schema changes live quickly. Thanks to running multiple-master replication, we can keep the site fully live even during blocking queries like ALTER TABLE. After changes go live, we can find performance problem spots and optimize (which usually means caching, caching, caching, and occasionally multi-pass log processing for things like detecting abuse and picking users out of a hat who get mod points).
In fact, I'll go further than "pretty happy." Writing a database-backed web site has changed dramatically over the past seven years. The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.
-
Slashdot's Setup, Part 2- Software
Today we have Part 2 in our exciting 2 part series about the infrastructure that powers Slashdot. Last week Uriah told us all about the hardware powering the system. This week, Jamie McCarthy picks up the story and tells us about the software... from pound to memcached to mysql and more. Hit that link and read on.The software side of Slashdot takes over at the point where our load balancers -- described in Friday's hardware story -- hand off your incoming HTTP request to our pound servers.
Pound is a reverse proxy, which means it doesn't service the request itself, it just chooses which web server to hand it off to. We run 6 pounds, one for HTTPS traffic and the other 5 for regular HTTP. (Didn't know we support HTTPS, did ya? It's one of the perks for subscribers: you get to read Slashdot on the same webhead that admins use, which is always going to be responsive even during a crush of traffic -- because if it isn't, Rob's going to breathe down our necks!)
The pounds send traffic to one of the 16 apaches on our 16 webheads -- 15 regular, and the 1 HTTPS. Now, pound itself is so undemanding that we run it side-by-side with the apaches. The HTTPS pound handles SSL itself, handing off a plaintext HTTP request to its machine's apache, so the apache it redirects traffic to doesn't need mod_ssl compiled in. One less headache! Of our other 15 webheads, 5 also run a pound, not to distribute load but just for redundancy.
(Trivia: pound normally adds an X-Forwarded-For header, which Slash::Apache substitutes for the (internal) IP of pound itself. But sometimes if you use a proxy on the internet to do something bad, it will send us an X-Forwarded-For header too, which we use to try to track abuse. So we patched pound to insert a special X-Forward-Pound header, so it doesn't overwrite what may come from an abuser's proxy.)
The other 15 webheads are segregated by type. This segregation is mostly what pound is for. We have 2 webheads for static (.shtml) requests, 4 for the dynamic homepage, 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl), and 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose). We segregate partly so that if there's a performance problem or a DDoS on a specific page, the rest of the site will remain functional. We're constantly changing the code and this sets up "performance firewalls" for when us silly coders decide to write infinite loops.
But we also segregate for efficiency reasons like httpd-level caching, and MaxClients tuning. Our webhead bottleneck is CPU, not RAM. We run MaxClients that might seem absurdly low (5-15 for dynamic webheads, 25 for static) but our philosophy is if we're not turning over requests quickly anyway, something's wrong, and stacking up more requests won't help the CPU chew through them any faster.
All the webheads run the same software, which they mount from a /usr/local exported by a read-only NFS machine. Everyone I've ever met outside of this company gives an involuntary shudder when NFS is mentioned, and yet we haven't had any problems since shortly after it was set up (2002-ish). I attribute this to a combination of our brilliant sysadmins and the fact that we only export read-only. The backend task that writes to /usr/local (to update index.shtml every minute, for example) runs on the NFS server itself.
The apaches are versions 1.3, because there's never been a reason for us to switch to 2.0. We compile in mod_perl, and lingerd to free up RAM during delivery, but the only other nonstandard module we use is mod_auth_useragent to keep unfriendly bots away. Slash does make extensive use of each phase of the request loop (largely so we can send our 403's to out-of-control bots using a minimum of resources, and so your page is fully on its way while we write to the logging DB).
Slash, of course, is the open-source perl code that runs Slashdot. If you're thinking of playing around with it, grab a recent copy from CVS: it's been years since we got around to a tarball release. The various scripts that handle web requests access the database through Slash's SQL API, implemented on top of DBD::mysql (now maintained, incidentally, by one of the original Slash 1.0 coders) and of course DBI.pm. The most interesting parts of this layer might be:
(a) We don't use Apache::DBI. We use connect_cached, but actually our main connection cache is the global objects that hold the connections. Some small chunks of data are so frequently used that we keep them around in those objects.
(b) We almost never use statement handles. We have eleven ways of doing a SELECT and the differences are mostly how we massage the results into the perl data structure they return.
(c) We don't use placeholders. Originally because DBD::mysql didn't take advantage of them, and now because we think any speed increase in a reasonably-optimized web app should be a trivial payoff for non-self-documenting argument order. Discuss!
(d) We built in replication support. A database object requested as a reader picks a random slave to read from for the duration of your HTTP request (or the backend task). We can weight them manually, and we have a task that reweights them automatically. (If we do something stupid and wedge a slave's replication thread, every Slash process, across 17 machines, starts throttling back its connections to that machine within 10 seconds. This was originally written to handle slave DBs getting bogged down by load, but with our new faster DBs, that just never happens, so if a slave falls behind, one of us probably typed something dumb at the mysql> prompt.)
(e) We bolted on memcached support. Why bolted-on? Because back when we first tried memcached, we got a huge performance boost by caching our three big data types (users, stories, comment text) and we're pretty sure additional caching would provide minimal benefit at this point. Memcached's main use is to get and set data objects, and Slash doesn't really bottleneck that way.
Slash 1.0 was written way back in early 2000 with decent support for get and set methods to abstract objects out of a database (getDescriptions, subclassed _wheresql) -- but over the years we've only used them a few times. Most data types that are candidates to be objectified either are processed in large numbers (like tags and comments), in ways that would be difficult to do efficiently by subclassing, or have complicated table structures and pre- and post-processing (like users) that would make any generic objectification code pretty complicated. So most data access is done through get and set methods written custom for each data type, or, just as often, through methods that perform one specific update or select.
Overall, we're pretty happy with the database side of things. Most tables are fairly well normalized, not fully but mostly, and we've found this improves performance in most cases. Even on a fairly large site like Slashdot, with modern hardware and a little thinking ahead, we're able to push code and schema changes live quickly. Thanks to running multiple-master replication, we can keep the site fully live even during blocking queries like ALTER TABLE. After changes go live, we can find performance problem spots and optimize (which usually means caching, caching, caching, and occasionally multi-pass log processing for things like detecting abuse and picking users out of a hat who get mod points).
In fact, I'll go further than "pretty happy." Writing a database-backed web site has changed dramatically over the past seven years. The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.
-
Slashdot's Setup, Part 2- Software
Today we have Part 2 in our exciting 2 part series about the infrastructure that powers Slashdot. Last week Uriah told us all about the hardware powering the system. This week, Jamie McCarthy picks up the story and tells us about the software... from pound to memcached to mysql and more. Hit that link and read on.The software side of Slashdot takes over at the point where our load balancers -- described in Friday's hardware story -- hand off your incoming HTTP request to our pound servers.
Pound is a reverse proxy, which means it doesn't service the request itself, it just chooses which web server to hand it off to. We run 6 pounds, one for HTTPS traffic and the other 5 for regular HTTP. (Didn't know we support HTTPS, did ya? It's one of the perks for subscribers: you get to read Slashdot on the same webhead that admins use, which is always going to be responsive even during a crush of traffic -- because if it isn't, Rob's going to breathe down our necks!)
The pounds send traffic to one of the 16 apaches on our 16 webheads -- 15 regular, and the 1 HTTPS. Now, pound itself is so undemanding that we run it side-by-side with the apaches. The HTTPS pound handles SSL itself, handing off a plaintext HTTP request to its machine's apache, so the apache it redirects traffic to doesn't need mod_ssl compiled in. One less headache! Of our other 15 webheads, 5 also run a pound, not to distribute load but just for redundancy.
(Trivia: pound normally adds an X-Forwarded-For header, which Slash::Apache substitutes for the (internal) IP of pound itself. But sometimes if you use a proxy on the internet to do something bad, it will send us an X-Forwarded-For header too, which we use to try to track abuse. So we patched pound to insert a special X-Forward-Pound header, so it doesn't overwrite what may come from an abuser's proxy.)
The other 15 webheads are segregated by type. This segregation is mostly what pound is for. We have 2 webheads for static (.shtml) requests, 4 for the dynamic homepage, 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl), and 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose). We segregate partly so that if there's a performance problem or a DDoS on a specific page, the rest of the site will remain functional. We're constantly changing the code and this sets up "performance firewalls" for when us silly coders decide to write infinite loops.
But we also segregate for efficiency reasons like httpd-level caching, and MaxClients tuning. Our webhead bottleneck is CPU, not RAM. We run MaxClients that might seem absurdly low (5-15 for dynamic webheads, 25 for static) but our philosophy is if we're not turning over requests quickly anyway, something's wrong, and stacking up more requests won't help the CPU chew through them any faster.
All the webheads run the same software, which they mount from a /usr/local exported by a read-only NFS machine. Everyone I've ever met outside of this company gives an involuntary shudder when NFS is mentioned, and yet we haven't had any problems since shortly after it was set up (2002-ish). I attribute this to a combination of our brilliant sysadmins and the fact that we only export read-only. The backend task that writes to /usr/local (to update index.shtml every minute, for example) runs on the NFS server itself.
The apaches are versions 1.3, because there's never been a reason for us to switch to 2.0. We compile in mod_perl, and lingerd to free up RAM during delivery, but the only other nonstandard module we use is mod_auth_useragent to keep unfriendly bots away. Slash does make extensive use of each phase of the request loop (largely so we can send our 403's to out-of-control bots using a minimum of resources, and so your page is fully on its way while we write to the logging DB).
Slash, of course, is the open-source perl code that runs Slashdot. If you're thinking of playing around with it, grab a recent copy from CVS: it's been years since we got around to a tarball release. The various scripts that handle web requests access the database through Slash's SQL API, implemented on top of DBD::mysql (now maintained, incidentally, by one of the original Slash 1.0 coders) and of course DBI.pm. The most interesting parts of this layer might be:
(a) We don't use Apache::DBI. We use connect_cached, but actually our main connection cache is the global objects that hold the connections. Some small chunks of data are so frequently used that we keep them around in those objects.
(b) We almost never use statement handles. We have eleven ways of doing a SELECT and the differences are mostly how we massage the results into the perl data structure they return.
(c) We don't use placeholders. Originally because DBD::mysql didn't take advantage of them, and now because we think any speed increase in a reasonably-optimized web app should be a trivial payoff for non-self-documenting argument order. Discuss!
(d) We built in replication support. A database object requested as a reader picks a random slave to read from for the duration of your HTTP request (or the backend task). We can weight them manually, and we have a task that reweights them automatically. (If we do something stupid and wedge a slave's replication thread, every Slash process, across 17 machines, starts throttling back its connections to that machine within 10 seconds. This was originally written to handle slave DBs getting bogged down by load, but with our new faster DBs, that just never happens, so if a slave falls behind, one of us probably typed something dumb at the mysql> prompt.)
(e) We bolted on memcached support. Why bolted-on? Because back when we first tried memcached, we got a huge performance boost by caching our three big data types (users, stories, comment text) and we're pretty sure additional caching would provide minimal benefit at this point. Memcached's main use is to get and set data objects, and Slash doesn't really bottleneck that way.
Slash 1.0 was written way back in early 2000 with decent support for get and set methods to abstract objects out of a database (getDescriptions, subclassed _wheresql) -- but over the years we've only used them a few times. Most data types that are candidates to be objectified either are processed in large numbers (like tags and comments), in ways that would be difficult to do efficiently by subclassing, or have complicated table structures and pre- and post-processing (like users) that would make any generic objectification code pretty complicated. So most data access is done through get and set methods written custom for each data type, or, just as often, through methods that perform one specific update or select.
Overall, we're pretty happy with the database side of things. Most tables are fairly well normalized, not fully but mostly, and we've found this improves performance in most cases. Even on a fairly large site like Slashdot, with modern hardware and a little thinking ahead, we're able to push code and schema changes live quickly. Thanks to running multiple-master replication, we can keep the site fully live even during blocking queries like ALTER TABLE. After changes go live, we can find performance problem spots and optimize (which usually means caching, caching, caching, and occasionally multi-pass log processing for things like detecting abuse and picking users out of a hat who get mod points).
In fact, I'll go further than "pretty happy." Writing a database-backed web site has changed dramatically over the past seven years. The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.
-
Making Sense of Census Data With Google Earth
mikemuch writes "Imran Haque has developed a mashup of Google Earth with data from the U.S. Census Bureau, called gCensus. The app uses the XML format known as KML (Keyhole Markup Language), which can create shapes and colors on the maps displayed by GE. Haque had to build custom code libraries (which he's made available as open source) that could generate KML for the project. He also had to extract the relevant data from the highly counter-intuitive Census Bureau files and store them in a database that could handle geographic data. gCensus lets you do stuff like create colorful overlays on maps showing population ages, race, and family size distributions." -
Why Is "Design by Contract" Not More Popular?
Coryoth writes "Design by Contract, writing pre- and post-conditions on functions, seemed like straightforward common sense to me. Such conditions, in the form of executable code, not only provide more exacting API documentation, but also provide a test harness. Having easy to write unit tests, that are automatically integrated into the inheritance hierarchy in OO languages, 'just made sense'. However, despite being available (to varying degrees of completeness) for many languages other than Eiffel, including Java, C++, Perl, Python, Ruby, Ada, and even Haskell and Ocaml, the concept has never gained significant traction, particularly in comparison to unit testing frameworks (which DbC complements nicely), and hype like 'Extreme Programming'. So why did Design by Contract fail to take off?" -
mod_perl 2.0.0 Released
-
mod_perl 2.0.0 Released
-
Run Perl 6 Today: Pugs 6.0.11 released
autrijus writes "I am delighted to announce that Pugs 6.0.11 is released, with experimental pugscc support to turn Perl 6 programs into stand-alone executables, as well as many new features. Pugs is an implementation of Perl 6, written in Haskell. For more information, see Pugs Apocryphon 1 and this perl.com interview." -
Parrot 0.1.1 'Poicephalus' Released
Pan T. Hose writes "The long awaited release of Parrot 0.1.1 "Poicephalus" has been finally announced on perl.perl6.internals newsgroup and perl6-internals mailing list simultaneously by Leopold Toetsch followed by an announcement on use Perl by Will Coleda and now also on Slashdot." (Read on for a list of changes since the last release, as well as a number of useful links.) Pan T. Hose continues "The most important changes since the previous version 0.1.0 (code-named 'Leaping Kakapo' and released in February) are:- Python support: Parrot runs 4/7 of the pie-thon test suite
- Better OS support: more platforms, compilers, OS functions
- Improved PIR syntax for method calls and <op>= assignment
- Dynamic loading reworked including a "make install" target
- MMD - multi method dispatch for binary vtable methods
- Library improvement and cleanup
- BigInt, Complex, *Array, Slice, Enumerate, None PMC classes
- IA64 and hppa JIT support
- Tons of fixes, improvements, new tests, and documentation updates
-
CPAN: $677 Million of Perl
Adam K writes "It had to happen eventually. CPAN has finally gotten the sloccount treatment, and the results are interesting. At 15.4 million lines of code, CPAN is starting to approach the size of the entire Redhat 6.2 distribution mentioned in David Wheeler's original paper. Could this help explain perl's relatively low position in the SourceForge.net language numbers?" -
Gopher ProtocolHandler for Apache2 Released
hardburn writes "One of the stated goals of the Gopher Manifesto (previously mentioned on Slashdot) was to create a Gopher plugin for Apache. That goal has now been realized with the release of Apache::GopherHandler. Get it off Gopher itself or off CPAN." -
Exegesis 7 Released (Perl 6 Text Formatting)
chromatic writes "Perl.com has just published Exegesis 7, Damian Conway's explanation of how text formatting will work Perl 6 (and now, Perl 5, thanks to his Perl6::Form module) will work. Think of it as Perl 1 for the 21st century. Also, Parrot 0.1.0, the virtual machine for Perl 6 and several other dynamic languages, released on Leap Day -- ever wanted to program in an object oriented assembly language?" -
Perl is Sweet Sixteen
surflorida writes "Perl turned sweet 16 yesterday. 'Larry Wall released Perl 1 on this day in 1987, so today Perl is 16 years old. Happy birthday Perl! You can read more about the timeline of Perl releases in perlhist.pod and at history.perl.org.' Happy birthday Perl! You are now old enough to get a US drivers license." -
Perl 5.8.1 Released
langles writes "Perl 5.8.1 has been released. Read the official announcement, then download it from a CPAN mirror near you. If you've been following the Perl5 Porters List, you'll know this version was very well tested before they released it. However, there may be some modules that will need to be fixed before they will work with this release." The announcement also contains full details on incompatibilities and known issues, so give it a once over before upgrading, especially if from a pre-5.8 version. -
Perl 5.8.1 RC1 Released
moksliukas writes "Perl 5.8.1 RC1 has been released. A major change is that due to security reasons, the "random ordering" of hashes has been made even more random. As always, there are other numerous changes, including some improved utf-8 handling support, deprecation of 5.005 style threads, $* magic variable, pseudo-hashes and a number of other improvements. As usual, perldelta summarises all changes." -
Perl 5.8.1 RC1 Released
moksliukas writes "Perl 5.8.1 RC1 has been released. A major change is that due to security reasons, the "random ordering" of hashes has been made even more random. As always, there are other numerous changes, including some improved utf-8 handling support, deprecation of 5.005 style threads, $* magic variable, pseudo-hashes and a number of other improvements. As usual, perldelta summarises all changes." -
JSP and Tag Libraries for Web Development
PotPieMan writes "I recently finished reading JSP and Tag Libraries for Web Development, a book for JSP developers wanting to improve their skillset. Read on for my review." It's not a new book, but still relevant. JSP and Tag Libraries for Web Development author Wellington L.S. da Silva pages 420, including appendices publisher New Riders rating 6 reviewer PotPieMan ISBN 0735710953 summary A guide to designing and implementing JSP applications, with a focus on tag libraries.
The Scoop Web developers and designers have long wrestled with strategies for combining their efforts. Web developers don't mind looking at code but dislike dealing with the look of a page, while Web designers are the opposite. Dynamic Web page technologies, such as Microsoft's ASP, Perl's many template systems and Web frameworks (Text::Template, HTML::Template, HTML::Mason, CGI::Application, etc.), and PHP, were designed to give both developers and designers a chance to do their work without stepping on each other's toes.Sun's answer was to release the Servlet API and later extend that to make JavaServer Pages. Initially, there was no clear role separation for servlets and JSPs, since a servlet could generate and display HTML just as easily as a JSP could perform business logic. The Model 2 architecture, based on Smalltalk's Model-View-Controller (MVC) design pattern, showed that servlets and JSPs complemented each other. Tag libraries extended the functionality of JSPs in a way that made it easier for developers and designers to collaborate.
JSP and Tag Libraries for Web Development is mostly targeted at Web developers who want advice on designing JSP applications and incorporating tag libraries. The book covers custom tag libraries, the Jakarta Struts framework, and various commercial and noncommercial tag libraries, such as Jakarta Taglibs.
What's to Like? The author starts with an introduction to servlets and JSPs, including a decent explanation of MVC. If you are comfortable with servlets and JSPs, this discussion is really more of a review than anything else.The next two chapters introduce tag libraries and the author's example application (a simple article and author tracking system). The author illustrates the lifecycle of a tag, which helps if you haven't really used or written custom tags before. Da Silva also gives a very detailed discussion of tag library descriptors (TLDs). Some details might have been better left as an appendix, but it is nice to see such a comprehensive explanation of what you can put in a TLD.
Da Silva then spends about 100 pages on writing simple tags, iteration tags, body tags, and making all of these types of tags cooperate. The discussion is again very detailed, but seems unfocused in many parts. Very little of the code in these chapters ties in with his example application.
Next, the author spends three chapters on the Jakarta Struts framework. He explains how Struts naturally fits into the MVC design pattern and gives various examples of how to structure your Struts application. He also includes an entire chapter on finishing his example application, going over Struts ActionForms, Struts Actions (including a method to prevent double submission that I had not seen before), and Struts' method of internationalization on JSPs.
Finally, the author runs through the Jakarta Taglibs project and some commercial tag libraries. Brief examples are provided, but this chapter really needed more attention than da Silva gave it.
What's to Consider Overall, JSP and Tag Libraries for Web Development feels unfocused. The author's central points are explained well in many places, but lost in many others. With some reorganization, I think the book could make a much stronger case for appropriate uses of tag libraries, both application-specific and general (e.g. Struts and Taglibs).Sections where general tag libraries are discussed read very much like the documentation available on project Web sites, such as the struts-html tag library documentation. These really should have been left as an appendix, with better explanations and usage examples provided in their place.
I was also very disappointed in the author's use of Struts Action classes. He combined various actions (add, edit, delete, etc.) to perform on a specific object and tested for a URL parameter to decide what to do. In my opinion, each action should be encapsulated in one Action class (AddObjectAction, EditObjectAction, and DeleteObjectAction). The author's design leads to URL hackery which Struts tries to avoid.
Recently, Struts released a stable version of the 1.1 series, which this book does not cover (it was published in early 2002). Readers should be familiar with the Struts documentation for this release before picking up this book.
The book's Web site is under construction, and I've been able to find little information on the publisher's site.
The Summary A okay book with room for improvement. While the author shows his technical knowledge, the book loses its direction in places. Most developers can probably get by with the documentation available on the Web. Table of Contents- Understanding the Tag Library Extension API
- Introduction to Servlets and JavaServer Pages
- Introduction to Tag Libraries
- Writing Custom Tags
- Cooperating Tags and Validation
- Design Considerations
- The Struts Framework
- The Jakarta Struts Project
- Struts Tag Libraries
- Anatomy of a Struts Application
- The Jakarta Taglibs and Other Resources
- The Jakarta Taglibs Project
- Commercial Tag Libraries
- Other Resources
- Appendices
- Tomcat
- Allaire JRun
- Orion
- MySQL
- Mapping Servlet-JSP Objects
- The Apache Software License, Version 1.1
You can purchase the JSP and Tag Libraries for Web Development from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
JSP and Tag Libraries for Web Development
PotPieMan writes "I recently finished reading JSP and Tag Libraries for Web Development, a book for JSP developers wanting to improve their skillset. Read on for my review." It's not a new book, but still relevant. JSP and Tag Libraries for Web Development author Wellington L.S. da Silva pages 420, including appendices publisher New Riders rating 6 reviewer PotPieMan ISBN 0735710953 summary A guide to designing and implementing JSP applications, with a focus on tag libraries.
The Scoop Web developers and designers have long wrestled with strategies for combining their efforts. Web developers don't mind looking at code but dislike dealing with the look of a page, while Web designers are the opposite. Dynamic Web page technologies, such as Microsoft's ASP, Perl's many template systems and Web frameworks (Text::Template, HTML::Template, HTML::Mason, CGI::Application, etc.), and PHP, were designed to give both developers and designers a chance to do their work without stepping on each other's toes.Sun's answer was to release the Servlet API and later extend that to make JavaServer Pages. Initially, there was no clear role separation for servlets and JSPs, since a servlet could generate and display HTML just as easily as a JSP could perform business logic. The Model 2 architecture, based on Smalltalk's Model-View-Controller (MVC) design pattern, showed that servlets and JSPs complemented each other. Tag libraries extended the functionality of JSPs in a way that made it easier for developers and designers to collaborate.
JSP and Tag Libraries for Web Development is mostly targeted at Web developers who want advice on designing JSP applications and incorporating tag libraries. The book covers custom tag libraries, the Jakarta Struts framework, and various commercial and noncommercial tag libraries, such as Jakarta Taglibs.
What's to Like? The author starts with an introduction to servlets and JSPs, including a decent explanation of MVC. If you are comfortable with servlets and JSPs, this discussion is really more of a review than anything else.The next two chapters introduce tag libraries and the author's example application (a simple article and author tracking system). The author illustrates the lifecycle of a tag, which helps if you haven't really used or written custom tags before. Da Silva also gives a very detailed discussion of tag library descriptors (TLDs). Some details might have been better left as an appendix, but it is nice to see such a comprehensive explanation of what you can put in a TLD.
Da Silva then spends about 100 pages on writing simple tags, iteration tags, body tags, and making all of these types of tags cooperate. The discussion is again very detailed, but seems unfocused in many parts. Very little of the code in these chapters ties in with his example application.
Next, the author spends three chapters on the Jakarta Struts framework. He explains how Struts naturally fits into the MVC design pattern and gives various examples of how to structure your Struts application. He also includes an entire chapter on finishing his example application, going over Struts ActionForms, Struts Actions (including a method to prevent double submission that I had not seen before), and Struts' method of internationalization on JSPs.
Finally, the author runs through the Jakarta Taglibs project and some commercial tag libraries. Brief examples are provided, but this chapter really needed more attention than da Silva gave it.
What's to Consider Overall, JSP and Tag Libraries for Web Development feels unfocused. The author's central points are explained well in many places, but lost in many others. With some reorganization, I think the book could make a much stronger case for appropriate uses of tag libraries, both application-specific and general (e.g. Struts and Taglibs).Sections where general tag libraries are discussed read very much like the documentation available on project Web sites, such as the struts-html tag library documentation. These really should have been left as an appendix, with better explanations and usage examples provided in their place.
I was also very disappointed in the author's use of Struts Action classes. He combined various actions (add, edit, delete, etc.) to perform on a specific object and tested for a URL parameter to decide what to do. In my opinion, each action should be encapsulated in one Action class (AddObjectAction, EditObjectAction, and DeleteObjectAction). The author's design leads to URL hackery which Struts tries to avoid.
Recently, Struts released a stable version of the 1.1 series, which this book does not cover (it was published in early 2002). Readers should be familiar with the Struts documentation for this release before picking up this book.
The book's Web site is under construction, and I've been able to find little information on the publisher's site.
The Summary A okay book with room for improvement. While the author shows his technical knowledge, the book loses its direction in places. Most developers can probably get by with the documentation available on the Web. Table of Contents- Understanding the Tag Library Extension API
- Introduction to Servlets and JavaServer Pages
- Introduction to Tag Libraries
- Writing Custom Tags
- Cooperating Tags and Validation
- Design Considerations
- The Struts Framework
- The Jakarta Struts Project
- Struts Tag Libraries
- Anatomy of a Struts Application
- The Jakarta Taglibs and Other Resources
- The Jakarta Taglibs Project
- Commercial Tag Libraries
- Other Resources
- Appendices
- Tomcat
- Allaire JRun
- Orion
- MySQL
- Mapping Servlet-JSP Objects
- The Apache Software License, Version 1.1
You can purchase the JSP and Tag Libraries for Web Development from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
JSP and Tag Libraries for Web Development
PotPieMan writes "I recently finished reading JSP and Tag Libraries for Web Development, a book for JSP developers wanting to improve their skillset. Read on for my review." It's not a new book, but still relevant. JSP and Tag Libraries for Web Development author Wellington L.S. da Silva pages 420, including appendices publisher New Riders rating 6 reviewer PotPieMan ISBN 0735710953 summary A guide to designing and implementing JSP applications, with a focus on tag libraries.
The Scoop Web developers and designers have long wrestled with strategies for combining their efforts. Web developers don't mind looking at code but dislike dealing with the look of a page, while Web designers are the opposite. Dynamic Web page technologies, such as Microsoft's ASP, Perl's many template systems and Web frameworks (Text::Template, HTML::Template, HTML::Mason, CGI::Application, etc.), and PHP, were designed to give both developers and designers a chance to do their work without stepping on each other's toes.Sun's answer was to release the Servlet API and later extend that to make JavaServer Pages. Initially, there was no clear role separation for servlets and JSPs, since a servlet could generate and display HTML just as easily as a JSP could perform business logic. The Model 2 architecture, based on Smalltalk's Model-View-Controller (MVC) design pattern, showed that servlets and JSPs complemented each other. Tag libraries extended the functionality of JSPs in a way that made it easier for developers and designers to collaborate.
JSP and Tag Libraries for Web Development is mostly targeted at Web developers who want advice on designing JSP applications and incorporating tag libraries. The book covers custom tag libraries, the Jakarta Struts framework, and various commercial and noncommercial tag libraries, such as Jakarta Taglibs.
What's to Like? The author starts with an introduction to servlets and JSPs, including a decent explanation of MVC. If you are comfortable with servlets and JSPs, this discussion is really more of a review than anything else.The next two chapters introduce tag libraries and the author's example application (a simple article and author tracking system). The author illustrates the lifecycle of a tag, which helps if you haven't really used or written custom tags before. Da Silva also gives a very detailed discussion of tag library descriptors (TLDs). Some details might have been better left as an appendix, but it is nice to see such a comprehensive explanation of what you can put in a TLD.
Da Silva then spends about 100 pages on writing simple tags, iteration tags, body tags, and making all of these types of tags cooperate. The discussion is again very detailed, but seems unfocused in many parts. Very little of the code in these chapters ties in with his example application.
Next, the author spends three chapters on the Jakarta Struts framework. He explains how Struts naturally fits into the MVC design pattern and gives various examples of how to structure your Struts application. He also includes an entire chapter on finishing his example application, going over Struts ActionForms, Struts Actions (including a method to prevent double submission that I had not seen before), and Struts' method of internationalization on JSPs.
Finally, the author runs through the Jakarta Taglibs project and some commercial tag libraries. Brief examples are provided, but this chapter really needed more attention than da Silva gave it.
What's to Consider Overall, JSP and Tag Libraries for Web Development feels unfocused. The author's central points are explained well in many places, but lost in many others. With some reorganization, I think the book could make a much stronger case for appropriate uses of tag libraries, both application-specific and general (e.g. Struts and Taglibs).Sections where general tag libraries are discussed read very much like the documentation available on project Web sites, such as the struts-html tag library documentation. These really should have been left as an appendix, with better explanations and usage examples provided in their place.
I was also very disappointed in the author's use of Struts Action classes. He combined various actions (add, edit, delete, etc.) to perform on a specific object and tested for a URL parameter to decide what to do. In my opinion, each action should be encapsulated in one Action class (AddObjectAction, EditObjectAction, and DeleteObjectAction). The author's design leads to URL hackery which Struts tries to avoid.
Recently, Struts released a stable version of the 1.1 series, which this book does not cover (it was published in early 2002). Readers should be familiar with the Struts documentation for this release before picking up this book.
The book's Web site is under construction, and I've been able to find little information on the publisher's site.
The Summary A okay book with room for improvement. While the author shows his technical knowledge, the book loses its direction in places. Most developers can probably get by with the documentation available on the Web. Table of Contents- Understanding the Tag Library Extension API
- Introduction to Servlets and JavaServer Pages
- Introduction to Tag Libraries
- Writing Custom Tags
- Cooperating Tags and Validation
- Design Considerations
- The Struts Framework
- The Jakarta Struts Project
- Struts Tag Libraries
- Anatomy of a Struts Application
- The Jakarta Taglibs and Other Resources
- The Jakarta Taglibs Project
- Commercial Tag Libraries
- Other Resources
- Appendices
- Tomcat
- Allaire JRun
- Orion
- MySQL
- Mapping Servlet-JSP Objects
- The Apache Software License, Version 1.1
You can purchase the JSP and Tag Libraries for Web Development from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
JSP and Tag Libraries for Web Development
PotPieMan writes "I recently finished reading JSP and Tag Libraries for Web Development, a book for JSP developers wanting to improve their skillset. Read on for my review." It's not a new book, but still relevant. JSP and Tag Libraries for Web Development author Wellington L.S. da Silva pages 420, including appendices publisher New Riders rating 6 reviewer PotPieMan ISBN 0735710953 summary A guide to designing and implementing JSP applications, with a focus on tag libraries.
The Scoop Web developers and designers have long wrestled with strategies for combining their efforts. Web developers don't mind looking at code but dislike dealing with the look of a page, while Web designers are the opposite. Dynamic Web page technologies, such as Microsoft's ASP, Perl's many template systems and Web frameworks (Text::Template, HTML::Template, HTML::Mason, CGI::Application, etc.), and PHP, were designed to give both developers and designers a chance to do their work without stepping on each other's toes.Sun's answer was to release the Servlet API and later extend that to make JavaServer Pages. Initially, there was no clear role separation for servlets and JSPs, since a servlet could generate and display HTML just as easily as a JSP could perform business logic. The Model 2 architecture, based on Smalltalk's Model-View-Controller (MVC) design pattern, showed that servlets and JSPs complemented each other. Tag libraries extended the functionality of JSPs in a way that made it easier for developers and designers to collaborate.
JSP and Tag Libraries for Web Development is mostly targeted at Web developers who want advice on designing JSP applications and incorporating tag libraries. The book covers custom tag libraries, the Jakarta Struts framework, and various commercial and noncommercial tag libraries, such as Jakarta Taglibs.
What's to Like? The author starts with an introduction to servlets and JSPs, including a decent explanation of MVC. If you are comfortable with servlets and JSPs, this discussion is really more of a review than anything else.The next two chapters introduce tag libraries and the author's example application (a simple article and author tracking system). The author illustrates the lifecycle of a tag, which helps if you haven't really used or written custom tags before. Da Silva also gives a very detailed discussion of tag library descriptors (TLDs). Some details might have been better left as an appendix, but it is nice to see such a comprehensive explanation of what you can put in a TLD.
Da Silva then spends about 100 pages on writing simple tags, iteration tags, body tags, and making all of these types of tags cooperate. The discussion is again very detailed, but seems unfocused in many parts. Very little of the code in these chapters ties in with his example application.
Next, the author spends three chapters on the Jakarta Struts framework. He explains how Struts naturally fits into the MVC design pattern and gives various examples of how to structure your Struts application. He also includes an entire chapter on finishing his example application, going over Struts ActionForms, Struts Actions (including a method to prevent double submission that I had not seen before), and Struts' method of internationalization on JSPs.
Finally, the author runs through the Jakarta Taglibs project and some commercial tag libraries. Brief examples are provided, but this chapter really needed more attention than da Silva gave it.
What's to Consider Overall, JSP and Tag Libraries for Web Development feels unfocused. The author's central points are explained well in many places, but lost in many others. With some reorganization, I think the book could make a much stronger case for appropriate uses of tag libraries, both application-specific and general (e.g. Struts and Taglibs).Sections where general tag libraries are discussed read very much like the documentation available on project Web sites, such as the struts-html tag library documentation. These really should have been left as an appendix, with better explanations and usage examples provided in their place.
I was also very disappointed in the author's use of Struts Action classes. He combined various actions (add, edit, delete, etc.) to perform on a specific object and tested for a URL parameter to decide what to do. In my opinion, each action should be encapsulated in one Action class (AddObjectAction, EditObjectAction, and DeleteObjectAction). The author's design leads to URL hackery which Struts tries to avoid.
Recently, Struts released a stable version of the 1.1 series, which this book does not cover (it was published in early 2002). Readers should be familiar with the Struts documentation for this release before picking up this book.
The book's Web site is under construction, and I've been able to find little information on the publisher's site.
The Summary A okay book with room for improvement. While the author shows his technical knowledge, the book loses its direction in places. Most developers can probably get by with the documentation available on the Web. Table of Contents- Understanding the Tag Library Extension API
- Introduction to Servlets and JavaServer Pages
- Introduction to Tag Libraries
- Writing Custom Tags
- Cooperating Tags and Validation
- Design Considerations
- The Struts Framework
- The Jakarta Struts Project
- Struts Tag Libraries
- Anatomy of a Struts Application
- The Jakarta Taglibs and Other Resources
- The Jakarta Taglibs Project
- Commercial Tag Libraries
- Other Resources
- Appendices
- Tomcat
- Allaire JRun
- Orion
- MySQL
- Mapping Servlet-JSP Objects
- The Apache Software License, Version 1.1
You can purchase the JSP and Tag Libraries for Web Development from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Perl Design Patterns - Free Book
Scott Walters writes "Perl programmers and Java programmers have a lot to learn from each other. Perl 5 is providing more and more optional stricture, and Perl 6 is expected to provide stricture by default and other goodies as type safety. Perhaps better named "Stupid Perl Object Tricks", Perl Design Patterns proves that a little design, a little computer science, and some style are nothing to fear. This is dedicated to all of the newbies screaming, "The code is fine! Just tell me what the bug is!", and to all of the talented programmers who think Perl deserves a better reputation. Object::PerlDesignPatterns on CPAN is updated less often but has a fatter pipe." -
Open Source Design Tools?
mbogosian asks: "Recently, my broadened responsibilities have me doing some database design and modeling, and I'm happy for the new knowledge and experience, but I'm a bit frustrated about the tool selection. I know most of us have had plenty of experience with at least a handful of all the wonderful Open Source development tools out there (like GCC, GNU Make, Subversion , and Perl to name a few). My question is this: where are OpenSource design tools? I've tried what I could find on SourceForge, but (as usual?) most of the projects that sounded promising were either still in the planning stages or seemed abandoned. Of course something which allowed be to create nifty class charts and output them to UML and/or SQL would be really cool, but I've yet to find something that works (especially in Linux). What are your favorite Open Source design tools and what do you like about them?" -
iCalendar, Project Management, Agenda, CVS and Perl?
parasew asks: "I am searching for Web-based Project Management Software, which should be (mod-)perl based, so I can enhance it or put it into an existing environment using MovableType, which is in a sort of alpha-state. I found a site about Call Center, Bug Tracking and Project Management Tools for Linux and also this short listing, but sadly they are just a bunch of projects which only come close to the kind of tool I am searching for. Gantt and Chronos, seem to be a very nice Web-Calendar packages written in Perl. I was just wondering why no one is using iCalendar (does anyone know of Perl-based Software using iCalendar), as most of the Agenda Software uses iCalendar, and even Mozilla Calendar is capable of subscribing to remote-Calendars. This looks very interesting to me. In general, I wanted to ask you Monks for the best way to do this. Should I create a new app from scratch or reusing existing stuff?""Here are the features I am looking for:
- The use of Calendars (multiple users) and iCalendar Support
- File-Pool for projects (CVS-based or similar)
- Progress-bar for showing the current state of a project
- A public calendar where users can publish events from their private calendars
Please also see my topics on PerlMonks and MovableType
Thanks for any help, hints or suggestions." -
Content Syndication With RSS
Alex Moskalyuk writes "Ben Hammersley's Content Syndication with RSS is a step-by-step guide to implementing RSS. This standard is gaining popularity among the Web community, and some of your favorite sites might syndicate their content as RSS feeds. The new O'Reilly publication focuses on many aspects of this standard, and is of primary interest to developers, Web site designers, data architects and anyone interested in distributing their data around the Web." So if you have a steady stream of information for your customers, family, or fans, read on for the rest of Alex's review. Content Syndication With RSS author Ben Hammersley pages 222 publisher O'Reilly rating 8/10 reviewer Alex Moskalyuk ISBN 0596003838 summary Introduction and guide for RSS implementationsThe first three chapters are primarily discussing the multiplicity of RSS standards. While with some other technologies it might seem a bit excessive, remember that RSS is a forked project with the forks at this moment bearing little resemblance to one another. The abbreviations even have different abbreviations - RSS means Really Simple Syndication if you are using RSS 0.91 or RSS 0.92, that was developed by Dave Winer. RSS means RDF Site Summary if the version you're using RSS 1.0. The development credits in this case go to RSS DEV team. To confuse you even more, the RSS 2.0 standard is deciphered as... correct, Really Simple Syndication again.
Hence chapter 4 discusses Winer's implementation (simplistic and user-friendly), while chapter 6 focuses on RSS 1.0 (RDF-compliant and data-architect-friendly), and chapter 8 talks about RSS 2.0 (improved RSS 0.9x). Chapter 4 is available online as a PDF file. Section 4.4 is recommended for those interested in promoting their RSS feeds as it provides pretty good reference to meta data.
Chapter 9 is perhaps of special interest to Web developers and administrators out there. It presents several code samples to properly parse RSS and present the result in readable HTML. The examples include (a) parsing with XML::Simple in Perl, (b) parsing with Perl regular expressions, (c) parsing with XML::Simple and sending the headlines to cell phones via WWW::SMS, (d) parsing via XSLT transformation. Python, PHP and ASP folks might feel left out due to the abundance of Perl examples, but if you got so far in the book, you can probably apply the regular expressions example or search for appropriate support for RSS format in your preferred language.
Going beyond the standard itself, RSS directories, aggregators and readers are discussed. Author makes a distinction between the last two by classifying Meerkat-like services into aggregators and desktop or Web applications designed to present the information to the user into readers. The chapter also provides information about Syndic8, its API, and describes the feed registration process. OReilly's Meerkat is also discussed in chapter, together with reference table for its API (you can make Meerkat generate HTML or RSS news headlines on certain topic or using certain keywords by providing a right query to its Web interface).
The book is quite a smooth read for a text describing the details of data specification. The chapters are informative and the book is not overloaded with useless information just to increase the page count. The tips are quite useful for someone, who is knew to the field and answers some questions not covered by standards (e.g., how often should you request an RSS feed, what to do if you're being screen-scraped, etc.)
I like the way the author divided the chapters into RSS 0.9x/2.0 and RSS 1.0 and kept two worlds apart. Most of the time you probably won't be interested in developing a feed to support both standards, but would like to focus just on one. The examples in Perl are perfect with me, although for someone new to Perl or programming in general those examples with abundant regular expressions might look a bit convoluted. Kudos to the author for not expanding on the topic, like many do, and providing an example of a script for RSS manipulation in every possible language out there.
What's missing? I wish more pages were dedicated to desktop RSS readers. FeedReader, HotSheet, Syndirella, Beaver and SharpReader are excellent end user applications currently gaining some popularity among those who'd prefer to browse the favorite headlines at a glance, instead of going to a dozen of sites every morning. To be fair, there's a huge list of readers in Appendix, and some applications mentioned above only came around in the last few months, which was probably after the book hit the press. Some sites also didn't make it into the book. I like DailyRotation and FreshNews that borrow from Meerkat's versatility and provide their own feed portal.
Overall, the book is a pretty good developer's guide to RSS standard. Accompanied with helpful illustrations and numerous tips it's an excellent resource for those unfamiliar with RSS and a helpful reference for those who have been doing Web syndication for a while.
You can purchase Content Syndication With RSS from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Content Syndication With RSS
Alex Moskalyuk writes "Ben Hammersley's Content Syndication with RSS is a step-by-step guide to implementing RSS. This standard is gaining popularity among the Web community, and some of your favorite sites might syndicate their content as RSS feeds. The new O'Reilly publication focuses on many aspects of this standard, and is of primary interest to developers, Web site designers, data architects and anyone interested in distributing their data around the Web." So if you have a steady stream of information for your customers, family, or fans, read on for the rest of Alex's review. Content Syndication With RSS author Ben Hammersley pages 222 publisher O'Reilly rating 8/10 reviewer Alex Moskalyuk ISBN 0596003838 summary Introduction and guide for RSS implementationsThe first three chapters are primarily discussing the multiplicity of RSS standards. While with some other technologies it might seem a bit excessive, remember that RSS is a forked project with the forks at this moment bearing little resemblance to one another. The abbreviations even have different abbreviations - RSS means Really Simple Syndication if you are using RSS 0.91 or RSS 0.92, that was developed by Dave Winer. RSS means RDF Site Summary if the version you're using RSS 1.0. The development credits in this case go to RSS DEV team. To confuse you even more, the RSS 2.0 standard is deciphered as... correct, Really Simple Syndication again.
Hence chapter 4 discusses Winer's implementation (simplistic and user-friendly), while chapter 6 focuses on RSS 1.0 (RDF-compliant and data-architect-friendly), and chapter 8 talks about RSS 2.0 (improved RSS 0.9x). Chapter 4 is available online as a PDF file. Section 4.4 is recommended for those interested in promoting their RSS feeds as it provides pretty good reference to meta data.
Chapter 9 is perhaps of special interest to Web developers and administrators out there. It presents several code samples to properly parse RSS and present the result in readable HTML. The examples include (a) parsing with XML::Simple in Perl, (b) parsing with Perl regular expressions, (c) parsing with XML::Simple and sending the headlines to cell phones via WWW::SMS, (d) parsing via XSLT transformation. Python, PHP and ASP folks might feel left out due to the abundance of Perl examples, but if you got so far in the book, you can probably apply the regular expressions example or search for appropriate support for RSS format in your preferred language.
Going beyond the standard itself, RSS directories, aggregators and readers are discussed. Author makes a distinction between the last two by classifying Meerkat-like services into aggregators and desktop or Web applications designed to present the information to the user into readers. The chapter also provides information about Syndic8, its API, and describes the feed registration process. OReilly's Meerkat is also discussed in chapter, together with reference table for its API (you can make Meerkat generate HTML or RSS news headlines on certain topic or using certain keywords by providing a right query to its Web interface).
The book is quite a smooth read for a text describing the details of data specification. The chapters are informative and the book is not overloaded with useless information just to increase the page count. The tips are quite useful for someone, who is knew to the field and answers some questions not covered by standards (e.g., how often should you request an RSS feed, what to do if you're being screen-scraped, etc.)
I like the way the author divided the chapters into RSS 0.9x/2.0 and RSS 1.0 and kept two worlds apart. Most of the time you probably won't be interested in developing a feed to support both standards, but would like to focus just on one. The examples in Perl are perfect with me, although for someone new to Perl or programming in general those examples with abundant regular expressions might look a bit convoluted. Kudos to the author for not expanding on the topic, like many do, and providing an example of a script for RSS manipulation in every possible language out there.
What's missing? I wish more pages were dedicated to desktop RSS readers. FeedReader, HotSheet, Syndirella, Beaver and SharpReader are excellent end user applications currently gaining some popularity among those who'd prefer to browse the favorite headlines at a glance, instead of going to a dozen of sites every morning. To be fair, there's a huge list of readers in Appendix, and some applications mentioned above only came around in the last few months, which was probably after the book hit the press. Some sites also didn't make it into the book. I like DailyRotation and FreshNews that borrow from Meerkat's versatility and provide their own feed portal.
Overall, the book is a pretty good developer's guide to RSS standard. Accompanied with helpful illustrations and numerous tips it's an excellent resource for those unfamiliar with RSS and a helpful reference for those who have been doing Web syndication for a while.
You can purchase Content Syndication With RSS from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
XML Is Too Hard, Part 2
orangerobot writes "A new article on XML.com summarizes some of the response from the XML-DEV mailing list to Tim Bray's recent comments about his frustrations with XML. The overall feedback is mixed but several parsing packages are mentioned that satisfy some of Bray's complaints about the difficulty of using DOM and SAX-based APIs. The packages include Pyxie, XML::Filter::Dispatcher and XML::Essex." -
XML Is Too Hard, Part 2
orangerobot writes "A new article on XML.com summarizes some of the response from the XML-DEV mailing list to Tim Bray's recent comments about his frustrations with XML. The overall feedback is mixed but several parsing packages are mentioned that satisfy some of Bray's complaints about the difficulty of using DOM and SAX-based APIs. The packages include Pyxie, XML::Filter::Dispatcher and XML::Essex." -
POE 0.25 Released
Casey West writes "Version 0.25 of the award-winning POE networking and multitasking framework has been released. This version is mainly a bug fix release." Read on for more... Thanks go out to everyone who helped make this release happen, especially our new committers and testers.- ActivePerl 5.8.0 is supported.
- Gentoo Linux is supported. Previously Perl would segfault.
- TCP clients and servers now support different kinds of sessions (Session, NFA, and custom types).
- TCP servers now gracefully handle aborted connections. This prevents them from stopping under heavy load.
- TCP clients and servers are more configurable in general.
- Several unimplemented features in Wheel::Run have been completed.
- POE::Kernel's call() honors array vs. scalar context now.
- Fixed a bug that sometimes prevented POE::Kernel from returning.
- Fixed a leak in signal dispatching. Terminal signals now destroy sessions at the proper times.
POE's web site contains detailed changes for every public release.
http://poe.perl.org/?POE_CHANGES
The latest tarball should be heading towards your favorite CPAN mirror. It is also on the web, and so is a Windows PPD. Users who need advanced notice of changes can follow it via anonymous CVS or POE's mailing list.
http://poe.perl.org/?Where_to_Get_POE http://poe.perl.org/?POE_Support_Resources
Thanks again to everyone who helped with this release.
About POE
POE is an award-winning networking and multitasking framework. It has been in active, open development for over four years. Its developer community has created a large and growing list of reusable components.
http://poe.perl.org/?What_POE_Is http://search.cpan.org/search?mode=module&query=PO E::Component
POE's robustness and performance have made it an integral part of mission critical applications since 1998. It is used in a wide variety of fields and in projects ranging from just a few lines of code to tens of thousands.
- Financial:
Market servers, clients, billing systems, and automated trading agents. - Web:
Commerce servers, content management systems, application servers, data warehousing, WAP proxies, ad exchanges, web crawlers/spiders, and a variety of specialized agents. - System Administration:
Large-scale host monitoring and maintenance, distributed load testing, a distributed file system (InterMezzo), radius monitoring, system log management and reporting, and spam detection. - Entertainment:
Interactive TV servers; mp3 jukeboxes and streaming servers; multi-server multi-game server monitoring, management, billing, and tournament control; and a plethora of IRC applications and agents. - Software Development:
Compile farm management, build management, distributed testing. - Monitoring and Automation:
X10 home control, weather station monitoring, alarm monitoring.
We look forward to hearing how POE has helped you.
-- Rocco Caputo / troc@pobox.com / poe.perl.org / poe.sf.net
" -
Controlling iTunes with Perl
EccentricAnomaly writes "brian d foy has created perl modules for controlling iTunes. His modules, Mac::iTunes and Apache::iTunes, can be found on the CPAN. Now perl mongers can run iTunes remotely via the command line or via a web interface on a Mac hooked-up to a nice stereo to use as a home or office jukebox. I shudder to think what else may be possible now that iTunes is in perl's clutches." -
Controlling iTunes with Perl
EccentricAnomaly writes "brian d foy has created perl modules for controlling iTunes. His modules, Mac::iTunes and Apache::iTunes, can be found on the CPAN. Now perl mongers can run iTunes remotely via the command line or via a web interface on a Mac hooked-up to a nice stereo to use as a home or office jukebox. I shudder to think what else may be possible now that iTunes is in perl's clutches." -
Controlling iTunes with Perl
EccentricAnomaly writes "brian d foy has created perl modules for controlling iTunes. His modules, Mac::iTunes and Apache::iTunes, can be found on the CPAN. Now perl mongers can run iTunes remotely via the command line or via a web interface on a Mac hooked-up to a nice stereo to use as a home or office jukebox. I shudder to think what else may be possible now that iTunes is in perl's clutches." -
Writing Perl Modules for CPAN
chromatic writes with the review below of Writing Perl Modules for CPAN, which explains at a level "between novice and intermediate user" (and in a minimum of space) how to contribute to Perl's own Library of Alexandria. Writing Perl Modules for CPAN author Sam Tregar pages 288 publisher Apress rating Recommended. reviewer chromatic ISBN 159059018X summary A guide to the use and production of Perl modules, from start to finish, in C and in Perl. The ScoopBesides Perl's abilities as a rapid development language, it's widely believed that the CPAN is its most valuable feature. This network of freely distributable code allows competent developers to achieve great heights of productivity, reusing the work of a generous community of programmers.
Of course, just as some will argue that Perl's copious documentation (spread over two thousand pages) is not immediately obvious to beginners, neither is how to use and even to contribute to the CPAN. For every coder who's successfully published a module, how many more would jump at the chance? How many registered CPAN authors would like to improve their skills?
With that audience in mind, Sam Tregar's Writing Perl Modules for CPAN plants itself firmly in the gap between novice and intermediate user. While much of the book presents information present in a multitude of FAQs, manpages, and the bittersweet experiences of those of us who did things the hard way, he's collected much knowledge into a short and readable guide.
What's to Like?Tregar starts by describing the history and usage of the CPAN itself. This includes the three most popular approaches to building modules: through the CPAN shell (including its configuration), by hand, and with ActiveState's PPM tool. Next, he explains module development in forty pages. This is pretty dense stuff for the intended audience and might require several passes by newer coders. Only after re-reviewing the chapter for this summary did I realize how much he covered. The next chapter covers design and style, from naming schemes to appropriate laziness through code reuse. It's more philosophical and more important.
The next two chapters cover bundling and submitting modules to the CPAN, as well as being a good author and maintainer. The general tone is quite similar to the impressive Open Source Development with CVS. While manpages usually describe the mechanics of making a distribution, for example, they rarely explain the reasons why things are done that way. As with previous chapters, several code examples illustrate the concepts under discussion.
After a brief chapter discussing a few very effective CPAN modules, Tregar dives into XS (the interface between Perl and C). In 60 pages, he describes just enough of XS and the Perl API to teach careful programmers how to be effective at extending Perl. This introduction compares favorably to the first few chapters of the new (and excellent) Extending and Embedding Perl. As expected in an overview, he provides links to more information. The writing and example style is clear enough that a decent coder with sufficient C knowledge should be able to write a Perl wrapper to a C library with relative ease.
The last two chapters describe Inline::C, an abstraction layer that makes XS much easier, and CGI::ApplicationC, a state machine framework for Perl CGI applications. It's not quite clear why the last chapter was included (besides Tregar's desire to see more CPAN modules extending CGI::Application), but it serves as an example of using and extending a CPAN module. Perhaps a future version of the book will elaborate further.
What's to ConsiderThe book's code samples are generally good. In the first half, they are all related parts of a larger project. The rest of the book moves away from this approach. Perhaps it would have been worthwhile to continue the theme, though the nature of the material makes it difficult to see exactly how to accomplish this.
Tregar also avoids the use of strictures and warnings in his code examples, claiming that they would make the examples too verbose. I disagree with the given reasoning -- teaching is the best time to enforce good habits, especially when encouraging the students to distribute their code to the world. This is a minor issue, though, as the code is readable and reasonable.
In the past few months, two projects have gained a great deal of momentum in Perl space. These are the CPANPLUS (disclaimer: I am contributing to this project and have contributed to CPAN.pm) and Module::Build. They may become the new standards, replacing CPAN.pm and MakeMaker as early as Perl 5.10. The book omits mention of these. This is understandable, given the time frame -- and the current tools will not be disappearing any time soon. Potential replacements for h2xs are described in a sidebar, though.
The SummaryThis is a readable book. It took only a couple of hours to read (though I'm assuredly not the target audience), and is well packed with good advice. Fresher Perl programmers who aren't yet comfortable enough with packages and interfaces will get the most benefit, but there's plenty of information for intermediate hackers as well.
Table of Contents- CPAN
- Perl Module Basics
- Module Design and Implementation
- CPAN Module Distribution
- Submitting Your Module to CPAN
- Module Maintenance
- Great CPAN Modules
- Programming Perl in C
- Writing C Modules with XS
- Writing C Modules with Inline::C
- CGI Application Modules for CPAN
You can purchase Writing Perl Modules for CPAN from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Writing Perl Modules for CPAN
chromatic writes with the review below of Writing Perl Modules for CPAN, which explains at a level "between novice and intermediate user" (and in a minimum of space) how to contribute to Perl's own Library of Alexandria. Writing Perl Modules for CPAN author Sam Tregar pages 288 publisher Apress rating Recommended. reviewer chromatic ISBN 159059018X summary A guide to the use and production of Perl modules, from start to finish, in C and in Perl. The ScoopBesides Perl's abilities as a rapid development language, it's widely believed that the CPAN is its most valuable feature. This network of freely distributable code allows competent developers to achieve great heights of productivity, reusing the work of a generous community of programmers.
Of course, just as some will argue that Perl's copious documentation (spread over two thousand pages) is not immediately obvious to beginners, neither is how to use and even to contribute to the CPAN. For every coder who's successfully published a module, how many more would jump at the chance? How many registered CPAN authors would like to improve their skills?
With that audience in mind, Sam Tregar's Writing Perl Modules for CPAN plants itself firmly in the gap between novice and intermediate user. While much of the book presents information present in a multitude of FAQs, manpages, and the bittersweet experiences of those of us who did things the hard way, he's collected much knowledge into a short and readable guide.
What's to Like?Tregar starts by describing the history and usage of the CPAN itself. This includes the three most popular approaches to building modules: through the CPAN shell (including its configuration), by hand, and with ActiveState's PPM tool. Next, he explains module development in forty pages. This is pretty dense stuff for the intended audience and might require several passes by newer coders. Only after re-reviewing the chapter for this summary did I realize how much he covered. The next chapter covers design and style, from naming schemes to appropriate laziness through code reuse. It's more philosophical and more important.
The next two chapters cover bundling and submitting modules to the CPAN, as well as being a good author and maintainer. The general tone is quite similar to the impressive Open Source Development with CVS. While manpages usually describe the mechanics of making a distribution, for example, they rarely explain the reasons why things are done that way. As with previous chapters, several code examples illustrate the concepts under discussion.
After a brief chapter discussing a few very effective CPAN modules, Tregar dives into XS (the interface between Perl and C). In 60 pages, he describes just enough of XS and the Perl API to teach careful programmers how to be effective at extending Perl. This introduction compares favorably to the first few chapters of the new (and excellent) Extending and Embedding Perl. As expected in an overview, he provides links to more information. The writing and example style is clear enough that a decent coder with sufficient C knowledge should be able to write a Perl wrapper to a C library with relative ease.
The last two chapters describe Inline::C, an abstraction layer that makes XS much easier, and CGI::ApplicationC, a state machine framework for Perl CGI applications. It's not quite clear why the last chapter was included (besides Tregar's desire to see more CPAN modules extending CGI::Application), but it serves as an example of using and extending a CPAN module. Perhaps a future version of the book will elaborate further.
What's to ConsiderThe book's code samples are generally good. In the first half, they are all related parts of a larger project. The rest of the book moves away from this approach. Perhaps it would have been worthwhile to continue the theme, though the nature of the material makes it difficult to see exactly how to accomplish this.
Tregar also avoids the use of strictures and warnings in his code examples, claiming that they would make the examples too verbose. I disagree with the given reasoning -- teaching is the best time to enforce good habits, especially when encouraging the students to distribute their code to the world. This is a minor issue, though, as the code is readable and reasonable.
In the past few months, two projects have gained a great deal of momentum in Perl space. These are the CPANPLUS (disclaimer: I am contributing to this project and have contributed to CPAN.pm) and Module::Build. They may become the new standards, replacing CPAN.pm and MakeMaker as early as Perl 5.10. The book omits mention of these. This is understandable, given the time frame -- and the current tools will not be disappearing any time soon. Potential replacements for h2xs are described in a sidebar, though.
The SummaryThis is a readable book. It took only a couple of hours to read (though I'm assuredly not the target audience), and is well packed with good advice. Fresher Perl programmers who aren't yet comfortable enough with packages and interfaces will get the most benefit, but there's plenty of information for intermediate hackers as well.
Table of Contents- CPAN
- Perl Module Basics
- Module Design and Implementation
- CPAN Module Distribution
- Submitting Your Module to CPAN
- Module Maintenance
- Great CPAN Modules
- Programming Perl in C
- Writing C Modules with XS
- Writing C Modules with Inline::C
- CGI Application Modules for CPAN
You can purchase Writing Perl Modules for CPAN from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Writing Perl Modules for CPAN
chromatic writes with the review below of Writing Perl Modules for CPAN, which explains at a level "between novice and intermediate user" (and in a minimum of space) how to contribute to Perl's own Library of Alexandria. Writing Perl Modules for CPAN author Sam Tregar pages 288 publisher Apress rating Recommended. reviewer chromatic ISBN 159059018X summary A guide to the use and production of Perl modules, from start to finish, in C and in Perl. The ScoopBesides Perl's abilities as a rapid development language, it's widely believed that the CPAN is its most valuable feature. This network of freely distributable code allows competent developers to achieve great heights of productivity, reusing the work of a generous community of programmers.
Of course, just as some will argue that Perl's copious documentation (spread over two thousand pages) is not immediately obvious to beginners, neither is how to use and even to contribute to the CPAN. For every coder who's successfully published a module, how many more would jump at the chance? How many registered CPAN authors would like to improve their skills?
With that audience in mind, Sam Tregar's Writing Perl Modules for CPAN plants itself firmly in the gap between novice and intermediate user. While much of the book presents information present in a multitude of FAQs, manpages, and the bittersweet experiences of those of us who did things the hard way, he's collected much knowledge into a short and readable guide.
What's to Like?Tregar starts by describing the history and usage of the CPAN itself. This includes the three most popular approaches to building modules: through the CPAN shell (including its configuration), by hand, and with ActiveState's PPM tool. Next, he explains module development in forty pages. This is pretty dense stuff for the intended audience and might require several passes by newer coders. Only after re-reviewing the chapter for this summary did I realize how much he covered. The next chapter covers design and style, from naming schemes to appropriate laziness through code reuse. It's more philosophical and more important.
The next two chapters cover bundling and submitting modules to the CPAN, as well as being a good author and maintainer. The general tone is quite similar to the impressive Open Source Development with CVS. While manpages usually describe the mechanics of making a distribution, for example, they rarely explain the reasons why things are done that way. As with previous chapters, several code examples illustrate the concepts under discussion.
After a brief chapter discussing a few very effective CPAN modules, Tregar dives into XS (the interface between Perl and C). In 60 pages, he describes just enough of XS and the Perl API to teach careful programmers how to be effective at extending Perl. This introduction compares favorably to the first few chapters of the new (and excellent) Extending and Embedding Perl. As expected in an overview, he provides links to more information. The writing and example style is clear enough that a decent coder with sufficient C knowledge should be able to write a Perl wrapper to a C library with relative ease.
The last two chapters describe Inline::C, an abstraction layer that makes XS much easier, and CGI::ApplicationC, a state machine framework for Perl CGI applications. It's not quite clear why the last chapter was included (besides Tregar's desire to see more CPAN modules extending CGI::Application), but it serves as an example of using and extending a CPAN module. Perhaps a future version of the book will elaborate further.
What's to ConsiderThe book's code samples are generally good. In the first half, they are all related parts of a larger project. The rest of the book moves away from this approach. Perhaps it would have been worthwhile to continue the theme, though the nature of the material makes it difficult to see exactly how to accomplish this.
Tregar also avoids the use of strictures and warnings in his code examples, claiming that they would make the examples too verbose. I disagree with the given reasoning -- teaching is the best time to enforce good habits, especially when encouraging the students to distribute their code to the world. This is a minor issue, though, as the code is readable and reasonable.
In the past few months, two projects have gained a great deal of momentum in Perl space. These are the CPANPLUS (disclaimer: I am contributing to this project and have contributed to CPAN.pm) and Module::Build. They may become the new standards, replacing CPAN.pm and MakeMaker as early as Perl 5.10. The book omits mention of these. This is understandable, given the time frame -- and the current tools will not be disappearing any time soon. Potential replacements for h2xs are described in a sidebar, though.
The SummaryThis is a readable book. It took only a couple of hours to read (though I'm assuredly not the target audience), and is well packed with good advice. Fresher Perl programmers who aren't yet comfortable enough with packages and interfaces will get the most benefit, but there's plenty of information for intermediate hackers as well.
Table of Contents- CPAN
- Perl Module Basics
- Module Design and Implementation
- CPAN Module Distribution
- Submitting Your Module to CPAN
- Module Maintenance
- Great CPAN Modules
- Programming Perl in C
- Writing C Modules with XS
- Writing C Modules with Inline::C
- CGI Application Modules for CPAN
You can purchase Writing Perl Modules for CPAN from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Writing Perl Modules for CPAN
chromatic writes with the review below of Writing Perl Modules for CPAN, which explains at a level "between novice and intermediate user" (and in a minimum of space) how to contribute to Perl's own Library of Alexandria. Writing Perl Modules for CPAN author Sam Tregar pages 288 publisher Apress rating Recommended. reviewer chromatic ISBN 159059018X summary A guide to the use and production of Perl modules, from start to finish, in C and in Perl. The ScoopBesides Perl's abilities as a rapid development language, it's widely believed that the CPAN is its most valuable feature. This network of freely distributable code allows competent developers to achieve great heights of productivity, reusing the work of a generous community of programmers.
Of course, just as some will argue that Perl's copious documentation (spread over two thousand pages) is not immediately obvious to beginners, neither is how to use and even to contribute to the CPAN. For every coder who's successfully published a module, how many more would jump at the chance? How many registered CPAN authors would like to improve their skills?
With that audience in mind, Sam Tregar's Writing Perl Modules for CPAN plants itself firmly in the gap between novice and intermediate user. While much of the book presents information present in a multitude of FAQs, manpages, and the bittersweet experiences of those of us who did things the hard way, he's collected much knowledge into a short and readable guide.
What's to Like?Tregar starts by describing the history and usage of the CPAN itself. This includes the three most popular approaches to building modules: through the CPAN shell (including its configuration), by hand, and with ActiveState's PPM tool. Next, he explains module development in forty pages. This is pretty dense stuff for the intended audience and might require several passes by newer coders. Only after re-reviewing the chapter for this summary did I realize how much he covered. The next chapter covers design and style, from naming schemes to appropriate laziness through code reuse. It's more philosophical and more important.
The next two chapters cover bundling and submitting modules to the CPAN, as well as being a good author and maintainer. The general tone is quite similar to the impressive Open Source Development with CVS. While manpages usually describe the mechanics of making a distribution, for example, they rarely explain the reasons why things are done that way. As with previous chapters, several code examples illustrate the concepts under discussion.
After a brief chapter discussing a few very effective CPAN modules, Tregar dives into XS (the interface between Perl and C). In 60 pages, he describes just enough of XS and the Perl API to teach careful programmers how to be effective at extending Perl. This introduction compares favorably to the first few chapters of the new (and excellent) Extending and Embedding Perl. As expected in an overview, he provides links to more information. The writing and example style is clear enough that a decent coder with sufficient C knowledge should be able to write a Perl wrapper to a C library with relative ease.
The last two chapters describe Inline::C, an abstraction layer that makes XS much easier, and CGI::ApplicationC, a state machine framework for Perl CGI applications. It's not quite clear why the last chapter was included (besides Tregar's desire to see more CPAN modules extending CGI::Application), but it serves as an example of using and extending a CPAN module. Perhaps a future version of the book will elaborate further.
What's to ConsiderThe book's code samples are generally good. In the first half, they are all related parts of a larger project. The rest of the book moves away from this approach. Perhaps it would have been worthwhile to continue the theme, though the nature of the material makes it difficult to see exactly how to accomplish this.
Tregar also avoids the use of strictures and warnings in his code examples, claiming that they would make the examples too verbose. I disagree with the given reasoning -- teaching is the best time to enforce good habits, especially when encouraging the students to distribute their code to the world. This is a minor issue, though, as the code is readable and reasonable.
In the past few months, two projects have gained a great deal of momentum in Perl space. These are the CPANPLUS (disclaimer: I am contributing to this project and have contributed to CPAN.pm) and Module::Build. They may become the new standards, replacing CPAN.pm and MakeMaker as early as Perl 5.10. The book omits mention of these. This is understandable, given the time frame -- and the current tools will not be disappearing any time soon. Potential replacements for h2xs are described in a sidebar, though.
The SummaryThis is a readable book. It took only a couple of hours to read (though I'm assuredly not the target audience), and is well packed with good advice. Fresher Perl programmers who aren't yet comfortable enough with packages and interfaces will get the most benefit, but there's plenty of information for intermediate hackers as well.
Table of Contents- CPAN
- Perl Module Basics
- Module Design and Implementation
- CPAN Module Distribution
- Submitting Your Module to CPAN
- Module Maintenance
- Great CPAN Modules
- Programming Perl in C
- Writing C Modules with XS
- Writing C Modules with Inline::C
- CGI Application Modules for CPAN
You can purchase Writing Perl Modules for CPAN from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Perl and XML
prostoalex writes: "In the world of information technology, information, as the name suggests, is as important as technology itself. Erik T. Ray's and Jason McIntosh's Perl and XML is an attempt to take a look at perhaps the most popular languages for data processing. XML is an open-standard specification for documents, while Perl's natural powers lie in the area of data processing, and, as the name suggests, practical extracting and reporting." Prostalex has reviewed Perl & XML below; read on for his take on the book. Perl and XML author Erik T. Ray, Jason McIntosh pages 216 publisher O'Reilly rating 4/5 reviewer Alex Moskalyuk ISBN 059600205X summary Introduction to XML processing with PerlWith qualities like these, one might think that the marriage of Perl and XML would be total bliss, and the two languages would live happily ever after. In reality, however, the marriage has been far from perfect, and has produced an enormous number of kids: some uglier, some prettier, some simpler, some more sophisticated. Perl & XML is a good attempt to provide an overview of XML processing techniques existing nowadays in the Perl world.
The book does not even make an attempt to give you a brief introduction to Perl, and thus eliminates the weak point of trying to be another Camel book, as many publications in the field attempt to do. The logical assumption is that you know Perl and have heard something about XML. The first chapter of the book tells you why there are so many variations of Perl modules for XML processing, who is behind the well-known modules and why the interaction of Perl with XML has been rather disorganized. Indeed, a short visit to the XML section of CPAN brings up dozens of available modules, most of which characterized by some intimidating or non-descriptive names like SAX, Grove, YAWriter, etc.
The second chapter is titled "XML Recap"; the contents of the chapter, though, are good enough to be called "Concise but Informative Introduction to XML". Don't get your expectations too high -- O'Reilly has a whole bundle of books related just to learning XML, and thus a single chapter can barely touch the surface of what you might need to know, but it provides a good introduction to the world of markup, elements, namespaces, character encoding, processing instructions, schemas and transformations in XML.
Chapter 3 goes from theory to practice, and gives the reader an opportunity to try his first Perl script on XML data. The parsers covered in this chapter are XML::Parser, XML::LibXML, XML::XPath and XML::Writer. Document validation and well-formedness are also explained, and luckily enough this exact chapter is what O'Reilly Publishing decided to publish as a free chapter available on the Web. In this chapter, the authors make a distinction between stream-based and tree-based XML processing, and thus it doesn't come as a big surprise that the next four chapters are dedicated to examples of such processing.
Chapter 4, Event Streams, discusses the issues of processing XML document as a stream of data, where your application has to react to various input without really knowing where the end of the document is. XML::PYX and XML::Parser are covered in this chapter.
Chapter 5 shows examples of using SAX for XML processing with Perl, and also provides an overview of SAX history, which in a nutshell tells you that SAX has been designed for Java with its strong type checking and interface classes. It goes to explain that using it in Perl, which is known for its forgiving nature, thus requires a certain responsibility on the part of programmer. XML::Handler::YAWriter is also discussed in this chapter.
From stream processing, the authors take you to parsing XML trees. In this case, the document is assumed to be loaded into memory and Perl script can safely assume that the whole XML document has been loaded. XML::Simple, XML::SimpleObject, XML::TreeBuilder and XML::Grove are discussed in this chapter, with XML::Parser revisited.
DOM (Document Object Model) is another standard recommended by W3C and it is mostly concerned with how an XML document is stored in computer's memory. XML::DOM is discussed in this chapter with XML::LibXML revisited. The authors also provide a good overview of DOM standard.
The last three chapters deal with applications of Perl in XML data processing that go beyond stream and tree processing -- XPath and XSLT are explained with copious examples. Remember though, that both technologies have several-hundred-page books written about them, and thus several pages in a Perl and XML book can serve at best as good introduction. Chapter 9 deals with RSS and writing SOAP with Perl and XML, with XML::RSS and SOAP::Lite being explained. The last chapter deals with such issues as namespacing, subclassing and for Web designers provides a handy tutorial on converting your XML data into HTML via XSLT stylesheets.
The table of contents is posted on the publisher's Web site.
The first three chapters of the book are easy to read, since they provide a general overview of the data-processing world, history of XML with reference to appropriate events in the Perl community. However, data processing can hardly be called an exciting topic and thus bulk of the book is about routinely introducing particular modules, telling you what you can do with each, and then giving you an example of Perl code processing some XML document. The examples are apt and relate to some of data processing that some us had to do, i.e. shopping lists, address books, recipes, diaries of mad professors, etc.
The code examples are numerous, and if you get tired after looking at pages and pages of Perl lines, you better plan accordingly, as sometimes the subchapter consists of nothing more than an XML file and related Perl processing code with author's notes. For a 200-page book Perl and XML provides a great introduction into the area, provided you have good knowledge of Perl, using CPAN modules and just general knowledge about data processing. The book would probably have a more exact title if it had the word "Cookbook" in its name -- some might consider it a good reference. However, for those just getting acquainted with XML, another tutorial might be needed to get a full comprehension of XML's power.
You can purchase Perl & XML from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page. -
Perl & LWP
When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of module to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML." Perl & LWP author Sean M. Burke pages 264 publisher O'Reilly and Associates rating 9 reviewer mir ISBN 0596001789 summary Excellent introduction to extracting and processing information from web sites.
The good: The book has a nice style and good coverage of the subject, includes introduction to all the modules used, reference material and includes good, well-developed examples. I really liked the way the authors describe the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.
The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful. More Info:If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML , its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that to provide a jump start and avoid many mistakes.
Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.
It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.
Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.
Breakdown by chapter:- Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
-
Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
-
The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction that lays out the ground work for the good stuff ahead.
-
URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
-
Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
-
Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
-
HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods an alternate way of using the module, with short examples. This is the kind of reference I find the most useful, it is the simplest way to understand how to use a module.
-
Tokenizing walkthrough (13p) a long Example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
-
HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
-
Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function I really like HTML::TreeBuilder BTW, it is simple yet powerful.
-
Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
-
Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
-
Appendices
I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text too be sightly confusing though, in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Content is complete and the chapters are focused enough that I have never needed to use the index.
-
Perl & LWP
When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of module to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML." Perl & LWP author Sean M. Burke pages 264 publisher O'Reilly and Associates rating 9 reviewer mir ISBN 0596001789 summary Excellent introduction to extracting and processing information from web sites.
The good: The book has a nice style and good coverage of the subject, includes introduction to all the modules used, reference material and includes good, well-developed examples. I really liked the way the authors describe the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.
The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful. More Info:If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML , its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that to provide a jump start and avoid many mistakes.
Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.
It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.
Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.
Breakdown by chapter:- Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
-
Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
-
The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction that lays out the ground work for the good stuff ahead.
-
URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
-
Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
-
Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
-
HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods an alternate way of using the module, with short examples. This is the kind of reference I find the most useful, it is the simplest way to understand how to use a module.
-
Tokenizing walkthrough (13p) a long Example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
-
HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
-
Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function I really like HTML::TreeBuilder BTW, it is simple yet powerful.
-
Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
-
Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
-
Appendices
I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text too be sightly confusing though, in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Content is complete and the chapters are focused enough that I have never needed to use the index.
-
Perl & LWP
When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of module to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML." Perl & LWP author Sean M. Burke pages 264 publisher O'Reilly and Associates rating 9 reviewer mir ISBN 0596001789 summary Excellent introduction to extracting and processing information from web sites.
The good: The book has a nice style and good coverage of the subject, includes introduction to all the modules used, reference material and includes good, well-developed examples. I really liked the way the authors describe the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.
The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful. More Info:If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML , its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that to provide a jump start and avoid many mistakes.
Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.
It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.
Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.
Breakdown by chapter:- Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
-
Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
-
The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction that lays out the ground work for the good stuff ahead.
-
URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
-
Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
-
Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
-
HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods an alternate way of using the module, with short examples. This is the kind of reference I find the most useful, it is the simplest way to understand how to use a module.
-
Tokenizing walkthrough (13p) a long Example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
-
HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
-
Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function I really like HTML::TreeBuilder BTW, it is simple yet powerful.
-
Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
-
Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
-
Appendices
I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text too be sightly confusing though, in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Content is complete and the chapters are focused enough that I have never needed to use the index.
-
Perl & LWP
When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of module to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML." Perl & LWP author Sean M. Burke pages 264 publisher O'Reilly and Associates rating 9 reviewer mir ISBN 0596001789 summary Excellent introduction to extracting and processing information from web sites.
The good: The book has a nice style and good coverage of the subject, includes introduction to all the modules used, reference material and includes good, well-developed examples. I really liked the way the authors describe the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.
The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful. More Info:If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML , its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that to provide a jump start and avoid many mistakes.
Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.
It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.
Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.
Breakdown by chapter:- Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
-
Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
-
The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction that lays out the ground work for the good stuff ahead.
-
URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
-
Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
-
Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
-
HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods an alternate way of using the module, with short examples. This is the kind of reference I find the most useful, it is the simplest way to understand how to use a module.
-
Tokenizing walkthrough (13p) a long Example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
-
HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
-
Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function I really like HTML::TreeBuilder BTW, it is simple yet powerful.
-
Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
-
Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
-
Appendices
I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text too be sightly confusing though, in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Content is complete and the chapters are focused enough that I have never needed to use the index.
-
Perl & LWP
When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of module to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML." Perl & LWP author Sean M. Burke pages 264 publisher O'Reilly and Associates rating 9 reviewer mir ISBN 0596001789 summary Excellent introduction to extracting and processing information from web sites.
The good: The book has a nice style and good coverage of the subject, includes introduction to all the modules used, reference material and includes good, well-developed examples. I really liked the way the authors describe the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.
The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful. More Info:If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML , its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that to provide a jump start and avoid many mistakes.
Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.
It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.
Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.
Breakdown by chapter:- Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
-
Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
-
The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction that lays out the ground work for the good stuff ahead.
-
URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
-
Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
-
Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
-
HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods an alternate way of using the module, with short examples. This is the kind of reference I find the most useful, it is the simplest way to understand how to use a module.
-
Tokenizing walkthrough (13p) a long Example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
-
HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
-
Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function I really like HTML::TreeBuilder BTW, it is simple yet powerful.
-
Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
-
Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
-
Appendices
I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text too be sightly confusing though, in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Content is complete and the chapters are focused enough that I have never needed to use the index.
-
Perl & LWP
When direct database access to the information you need isn't available, but web pages with the right data are, you might pursue "screen-scraping" -- fetching a web page and scanning its text for the appropriate pieces of text in order to do further processing. LWP (Library for WWW access in Perl) is a collection of module to help you do this. mir writes: " Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. An introduction to LWP and related modules from simple to advanced uses and various ways to extract information from the returned HTML." Perl & LWP author Sean M. Burke pages 264 publisher O'Reilly and Associates rating 9 reviewer mir ISBN 0596001789 summary Excellent introduction to extracting and processing information from web sites.
The good: The book has a nice style and good coverage of the subject, includes introduction to all the modules used, reference material and includes good, well-developed examples. I really liked the way the authors describe the basic methodology to develop screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.
The bad: Not much is bad, really. Some chapters are a little dry, though, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application engages in more dialogue with the server. In addition, the appendixes are not really useful. More Info:If it had not been published by O'Reilly, Perl and LWP could have been titled Leveraging the Web: Object-Oriented techniques for information re-purposing, or Web Services, Generation 0. An even better title would have been Screen-scraping for fun and profit: one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code to connect to a web server, retrieve a page and extract the information from the HTML. In short, "screen-scraping." This will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith essence, the essentials) for your own usage. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.
At 200 pages, plus 40 pages of appendices and index, this one is part of that line of compact O'Reilly books which covers only a narrow topic in each volume but which covers those topics well. Just like Perl & XML , its target audience is Perl programmers who need to tackle a new domain. It gives them a toolbox and basic techniques that to provide a jump start and avoid many mistakes.
Perl & LWP starts from the basics: installing LWP, using LWP::Simple to retrieve a file from a URL, then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with five chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that builds a tree from the HTML. It goes on with an explanation of how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys, and not only the hows, of the various pieces of code he shows us.
It is interesting to note that going from regular expressions to ever more powerful modules is a path followed also by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.
Finally I would like to thank the author for following his own advice by including interesting examples and above all for not including anything about retrieving stock-quotes.
Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.
Breakdown by chapter:- Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
-
Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
-
The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction that lays out the ground work for the good stuff ahead.
-
URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete it includes little explanation as to why you will need to process URLs and it is not even mentioned in the introduction roadmap.
-
Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
-
Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
-
HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods an alternate way of using the module, with short examples. This is the kind of reference I find the most useful, it is the simplest way to understand how to use a module.
-
Tokenizing walkthrough (13p) a long Example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
-
HTML processing with Trees (16p): even more powerful than an HTML tokenizer: HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
-
Modifying HTML with Trees (17p): More on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function I really like HTML::TreeBuilder BTW, it is simple yet powerful.
-
Cookies, Authentication and Advanced Requests (13p): Back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description on how to code a complete session with cookies.
-
Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
-
Appendices
I think the Appendices are actually the weakest part of the book, most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ascii/unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text too be sightly confusing though, in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Content is complete and the chapters are focused enough that I have never needed to use the index.
-
CPAN Shifts Focus
cascadefx writes "Looks like CPAN has changed its focus to support Java now. A look at their page shows that is is now CJAN, the Comprehensive Java Archive Network where you will find all things Java." This should be a great boon to Java, a language renown for, well, sucking. But at the expense of the greatest of all languages? It's just too sad for me to express in words. I mean, who uses java anyway?