Thousands of Sites Wrongly Blocked
Ben Edelman writes: "In the context of the ACLU's pending challenge to the Children's Internet Protection Act (PDF), I recently prepared a list of some 6000+ web sites that, by and large, fail to meet the category definitions of popular Internet filtering programs yet are blocked by at least one such program. This topic may be old hat, but my work is new: I have prepared an unusually large list of sites (including police departments, libraries, home-schooling sites, candidates for political office, and on and on), and I have retested these sites over a period of several months."
Troller_Park_Trash, if you're already knowledgeable about how filtering software operates, you may find that the newest and most interesting part of the http://cyber.law.harvard.edu/people/edelman/mul-v-us/ page is the Appendices listing specific sites that have been, by and large, wrongly classified by filtering programs.
For example, http://cyber.law.harvard.edu/people/edelman/mul-v-us/index-subset.html ("Blocked Site Archives - Subset with Linked Pages - Appendix A") gives information about 395 such URLs. You'll likely find yourself surprised that many of these are blocked -- I know I was.
Regarding the blacking out of certain text from my report: As http://cyber.law.harvard.edu/people/edelman/mul-v-us/ mentions, a protective order (from the court in which the underlying case is pending) limits distribution of certain portions of my report -- namely anything I learned from reviewing confidential documents from filtering companies, or from attending confidential portions of depositions of their employees. But the work you, and most others here, are likely to find of greatest interest is the listings of specific sites blocked. (I'm presently adding a bit of text and formatting to help folks find this content more quickly and easily.)
Ben Edelman
[Originally sent to a mailing-list]
In honor of the censorware material just released by ACLU, I thought I'd try a little experiment in distributed verification.
I took one example from Edelman's report:
16. Southern Alberta Fly Fishing Outfitters #6809
http://www.albertaflyfish.com
Blocked by: N2H2 (Pornography - Sep 11, Oct 7), Websense (Sex - Jul 5, Aug 18, Sep 11)
Yahoo: /Regional/Countries/Canada/Business and Economy/Shopping and Services/Outdoors/Fishing/Fly Fishing/Lodges/
Google: /Regional/North America/Canada/Alberta/Recreation and Sports/Fishing
Fly fishing in Alberta Canada on the world famous Bow River.
Now, what does censorware have against this site? Maybe it doesn't like too many 'Fly' references in one place? No, it turns out that this site has the misfortune to be virtually hosted and share an internet address with:
http://clubexoticx.com - Club Exoticx
There's a bunch of other completely innocuous sites suffering the same collective guilt of the censorware blacklist. I'd like people to go to N2H2's lookup, at http://database.n2h2.com/cgi-perl/catrpt.pl and *verify* this for themselves by testing the following sites:
http://albertaflyfish.com - Southern Alberta Fly Fishing Outfitters
http://alistairbrown.com - Alistair Brown Folksinger
http://eclothing.com - 'The Game Is On Sportswear Company Ltd.'
http://effectivemanagementsolutions.com - Effective Solutions
http://eligh.com - Springboard Consulting
http://eyepowered.com - E Y E P O W E R E D - 360 Degree Panoramas
http://friendlyfacesonline.com - Create personalized family cartoon
http://gear4pickups.com - Gear4Trucks: HitchHoist Portable Truck Crane,
http://informationonhold.com - Information On-Hold
http://letsmakewine.com - Let's Make Wine
http://planetregister.com - Planet Registe
http://ppt-slides.com - 35mm Slides from your computer file
http://proteach.net - Pro Teach Main Page - Baseball instruction
http://rosiedonovan.com - Rosie Donovan Photography
http://springboardtoinnovation.com - Springboard Consulting
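One way to check the shared-address claim independently of any filtering vendor is to resolve each name and group the names by address. The following is only a minimal sketch: `group_by_address` and the demo resolver are hypothetical helpers, and the 192.0.2.x addresses are documentation placeholders, not the sites' real addresses. It also assumes the names still resolve, which may no longer be true.

```python
import socket
from collections import defaultdict


def group_by_address(hostnames, resolve=socket.gethostbyname):
    """Map each resolved IP address to the list of hostnames that share it.

    `resolve` is injectable so the grouping logic can be tried offline.
    Names that fail to resolve are collected under the key None.
    """
    groups = defaultdict(list)
    for host in hostnames:
        try:
            groups[resolve(host)].append(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            groups[None].append(host)
    return dict(groups)


# Offline demo with a stand-in resolver (placeholder addresses, not real data).
DEMO_TABLE = {
    "albertaflyfish.com": "192.0.2.10",
    "clubexoticx.com": "192.0.2.10",
    "example.org": "192.0.2.20",
}


def demo_resolve(host):
    if host not in DEMO_TABLE:
        raise OSError(f"cannot resolve {host}")
    return DEMO_TABLE[host]


shared = group_by_address(
    ["albertaflyfish.com", "clubexoticx.com", "example.org", "gone.invalid"],
    resolve=demo_resolve,
)
```

Calling `group_by_address(names)` without the `resolve` argument would use real DNS lookups; any address whose list holds more than one name is a candidate for the kind of shared virtual hosting described above.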
Here, I'll make this easy. Just click these URLs:
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://albertaflyfish.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://alistairbrown.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://eclothing.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://effectivemanagementsolutions.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://eligh.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://eyepowered.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://friendlyfacesonline.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://gear4pickups.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://informationonhold.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://letsmakewine.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://planetregister.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://ppt-slides.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://proteach.net
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://rosiedonovan.com
http://database.n2h2.com/cgi-perl/catrpt.pl?req_URL=http://springboardtoinnovation.com
You should get
The Site: [all sites above]
is categorized by N2H2 as:
Pornography
If there's some error-message text in a red font, that means the N2H2 program itself wasn't working; try again.
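For anyone who would rather generate the lookup links than click them, here is a rough sketch. It assumes the query parameter is `req_URL`, as the links above suggest, and that the result page has already been reduced to plain text lines like the expected output quoted above. `lookup_url` and `parse_category` are hypothetical helper names, and the real page's markup may well differ.

```python
LOOKUP = "http://database.n2h2.com/cgi-perl/catrpt.pl"

SITES = [
    "albertaflyfish.com",
    "alistairbrown.com",
    "eclothing.com",
]


def lookup_url(site):
    """Build the lookup link for one site, in the same form as the links above."""
    return f"{LOOKUP}?req_URL=http://{site}"


def parse_category(page_text):
    """Return the line following 'is categorized by N2H2 as:', or None.

    Assumes plain text input; an error page, or a page in a different
    layout, yields None.
    """
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    for i, ln in enumerate(lines[:-1]):
        if ln == "is categorized by N2H2 as:":
            return lines[i + 1]
    return None


links = [lookup_url(site) for site in SITES]
```

The sketch deliberately stops short of fetching anything: it only builds links and reads a result that has already been saved as text.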
Now, since I've publicized this, I expect it'll be changed very rapidly for this one item. I have a saying: "Alacrity varies directly with publicity". But this is just one example in a HUGE blacklist. What else is lurking in there?
Sig: What Happened To The Censorware Project (censorware.org)
Sh00z, two thoughts:
1) I agree that some portions of content on some of the sites on my list have been correctly categorized. But in the instance you described, it sounds like the specific URL on my list doesn't contain content meeting filtering programs' category definitions. As a result, even if there's reason to categorize other content on that same server, there's no need to categorize this specific page.
(To put this a different way: Many of the filtering programs seem to classify entire sites -- all content on an entire domain name, for example. But there's no reason why pages couldn't instead be rated on a page-by-page basis [and indeed some filtering companies report that they do this, too, in at least some instances]. To the extent that programs fail to review and separately categorize every individual page, they may overblock pages without content meeting their criteria.)
2) There's no doubt that some URLs on my lists actually do meet filtering companies' category definitions. I'm no librarian, and neither am I otherwise trained in content categorization, so it wasn't my job to identify this content. (Plus, as you can imagine, it's a large task to view many thousands of sites!) Instead, librarians reviewed certain of the sites (including a random sample of the entire list) to attempt to estimate the proportion of sites from my lists that are, in their professional opinions, suitable for use within a library. It's my understanding that the results of their study are forthcoming.
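The site-level versus page-level distinction in point 1 can be illustrated with a small sketch. This is not how any particular filtering product works; the function names, the example.org URLs, and the blocklist contents are all made up for illustration.

```python
from urllib.parse import urlsplit


def blocked_site_level(url, blocked_hosts):
    """Site-level rule: block every page whose host is on the list."""
    return urlsplit(url).hostname in blocked_hosts


def blocked_page_level(url, blocked_pages):
    """Page-level rule: block only the exact (host, path) pairs listed."""
    parts = urlsplit(url)
    return (parts.hostname, parts.path or "/") in blocked_pages


# Hypothetical lists: one page on the host meets a category definition.
BLOCKED_HOSTS = {"example.org"}
BLOCKED_PAGES = {("example.org", "/adult/")}

innocuous = "http://example.org/library/hours.html"
flagged = "http://example.org/adult/"

site_rule_blocks_innocuous = blocked_site_level(innocuous, BLOCKED_HOSTS)
page_rule_blocks_innocuous = blocked_page_level(innocuous, BLOCKED_PAGES)
page_rule_blocks_flagged = blocked_page_level(flagged, BLOCKED_PAGES)
```

Under the site-level rule the innocuous page is blocked along with the flagged one; under the page-level rule only the flagged page is. The cost of the page-level approach is that every page has to be reviewed separately.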