Slashdot Mirror


Open Source Filtering?

David Guichard asks: "Maybe I've just missed it, but has there been any talk or action on an open source Internet filter? I'm thinking of something that would allow libraries and schools to comply with the law, but would not hide the list of forbidden sites and would allow complete local control, and certainly would not track user browsing. I realize a lot of people wouldn't want to get anywhere near this on principle, but it seems like a winner to me. For example, would junkbuster satisfy the law already? What is missing that the law requires?" If you have to have some form of filtering in place, better an open solution than a closed one.

2 of 13 comments

  1. Two reasons why not by jamiemccarthy · · Score: 3
    There are two important considerations here.

    First, the whole point of censorware is that you can't get around it. If you have a choice of whether to run it or not, it might be searching, filtering, categorizing, whatever, but it's not censorware.

    The idea of an "open" solution which is forced upon people is a little silly. Apart from the philosophical absurdity, censorware can never work on an open-source operating system without stringent physical controls as well.

    (Recall the first rule of security: anyone who has physical access to your machine has the potential to compromise it. This may be as simple as booting from floppy!)

    Second, making up a blacklist of porn sites is trivial if you just want to list the ones that want to be listed. Use RSACi. It's already built into your browser. Almost all porn sites rate with RSACi, and they want to be blacklisted, because it helps immunize them from prosecution for providing porn to kids (or at least that's the perception).
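
    As a rough illustration of what "use RSACi" amounts to: RSACi ratings were published as PICS labels, usually in a page's meta tag, scoring nudity, sex, violence and language from 0 to 4. The sketch below parses the short-form ratings out of such a label and applies a local limit. The sample label, its URL, and the function names are made up for the example, and real labels can use longer forms this regex does not handle.

```python
import re

# Illustrative PICS label of the kind an RSACi-rated page might embed.
# The rated URL and the values are invented for this example.
SAMPLE_LABEL = (
    '(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l gen true '
    'for "http://example.com/" r (n 4 s 3 v 0 l 1))'
)

# RSACi categories: n = nudity, s = sex, v = violence, l = language (0-4 each).
CATEGORIES = ("n", "s", "v", "l")


def parse_rsaci(label):
    """Return e.g. {'n': 4, 's': 3, 'v': 0, 'l': 1}, or {} if no ratings are found."""
    # Only the short form "r (...)" is handled here.
    match = re.search(r"\br\s*\(([^)]*)\)", label)
    if not match:
        return {}
    tokens = match.group(1).split()
    ratings = {}
    # Ratings come as alternating category/value tokens.
    for key, value in zip(tokens[0::2], tokens[1::2]):
        if key in CATEGORIES and value.isdigit():
            ratings[key] = int(value)
    return ratings


def is_blocked(label, limits):
    """True if any rated category exceeds the locally chosen limit."""
    ratings = parse_rsaci(label)
    return any(ratings.get(cat, 0) > limit for cat, limit in limits.items())


if __name__ == "__main__":
    print(parse_rsaci(SAMPLE_LABEL))
    print(is_blocked(SAMPLE_LABEL, {"n": 0, "s": 0}))
```

    Note that a page with no label at all is never blocked by this check, which is exactly the limitation described next: self-rating only catches the sites that choose to be caught.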

    If you want to make up a blacklist of sites which don't want to be blacklisted, you have a fight on your hands. It's a phenomenal amount of work to scan the web. Consider the massive server farms and pipes of unholy size that Google or Alta Vista have to use to spider the web. Who's going to volunteer to set up a similar installation to spider porn sites?

    If you think you're just going to provide a way for volunteers to send in "hey, I found another porn site" URLs, don't be silly. Most of those submissions are going to be RSACi-rated; almost all the rest will be overlap. The web is huge. Porn is about 1% of it. One percent of huge is still huge.

    And then, the big question: who's going to make decisions about these allegedly porn (but not self-rated) sites? Some human being has to categorize them, or you'll be no more accurate than the existing closed-source blacklists (which is to say, laughably inaccurate).

    That takes time, and with millions of new or changed pages on the web every hour, do the math and figure out how much time you can expect to get out of your volunteers. How many dollars of free labor does this hypothetical project depend on? Do porn-hating geeks really hate porn that much, that they'll sit in front of a monitor all day for free and surf porn sites?
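
    To make "do the math" concrete, here is a back-of-the-envelope sketch. Every input is an assumption picked for illustration: one million changed pages an hour (the low end of "millions"), the 1% figure mentioned earlier, thirty seconds for a human to judge a page, and eight-hour volunteer shifts.

```python
# All inputs are assumptions for illustration, not measurements.
PAGES_PER_HOUR = 1_000_000   # low end of "millions of new or changed pages" per hour
PORN_FRACTION = 0.01         # the "about 1%" estimate
SECONDS_PER_REVIEW = 30      # assumed time for a human to classify one page
SHIFT_HOURS = 8              # assumed length of one volunteer's day


def reviewers_needed(pages_per_hour, fraction, seconds_per_review, shift_hours):
    """Full-time reviewers needed just to keep up with one day's candidate pages."""
    pages_per_day = pages_per_hour * 24
    candidates_per_day = pages_per_day * fraction
    review_hours_per_day = candidates_per_day * seconds_per_review / 3600
    return review_hours_per_day / shift_hours


if __name__ == "__main__":
    needed = reviewers_needed(
        PAGES_PER_HOUR, PORN_FRACTION, SECONDS_PER_REVIEW, SHIFT_HOURS
    )
    print(round(needed))  # prints 250
```

    Under those assumptions that is roughly 250 people reviewing full time, every day, and it generously assumes someone has already found the 1% for them.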

    Short version: if it were easy to do, someone already would have done it. In fact there already exist several places that keep an "open" list of porn sites which can be dropped into any Squid proxy. Most of them are years old and will never be maintained again:

    • squidblock.tgz from July 1999
    • sxcontrol, last change February 2000
    • INfilter, last revised March 2000
    • Linux Center's squidblock.tgz
      Click the "Latest" link, which is there "just to show that someone is using it!" Note that the "latest" additions to the blacklist include such obscure sites as playboy.com, and such recent new sites as dailydirt.com (domain registered on Jan 12, 1998).
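
    For a sense of how little machinery such a list needs, here is a minimal sketch of domain blacklist matching. It is not Squid's implementation (Squid does its own matching through its ACL mechanism), and the function names are hypothetical. A host is treated as listed if it, or any parent domain, appears in the list.

```python
def load_blacklist(lines):
    """Normalize entries: lowercase, drop a leading dot, skip blanks and # comments."""
    entries = set()
    for line in lines:
        entry = line.strip().lower().lstrip(".")
        if entry and not entry.startswith("#"):
            entries.add(entry)
    return entries


def is_listed(host, blacklist):
    """True if the host or any of its parent domains is in the blacklist."""
    labels = host.lower().rstrip(".").split(".")
    # Check "www.example.com", then "example.com", then "com".
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))


if __name__ == "__main__":
    blacklist = load_blacklist(["# sample list", "playboy.com", ".dailydirt.com", ""])
    for host in ("www.playboy.com", "dailydirt.com", "notplayboy.com", "example.org"):
        print(host, is_listed(host, blacklist))
```

    The matching is the easy part. Keeping the list current is the part nobody has volunteered for.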

    Jamie McCarthy

    --

    jamie.mccarthy.vg

  2. grammar-based filtering, not keyword by scotpurl · · Score: 3

    The solution to the entire problem is not, NOT, keyword filtering. It's grammar-based filtering.

    What the @#!! is grammar-based filtering?

    It's where the parsing engine has enough intelligence to figure out what's going on. What the subtleties are. What the nuances are. If there are any double-entendres or hidden meanings.

    Then, and only then, can you use the computer to make value-based decisions using fuzzy rules about whether or not the content should be seen. And once that happens, I'll gladly use filtering. Why? Because I'll be able to filter out advertisements at a minimum. :-)
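
    The objection to keyword filtering is easy to demonstrate. The sketch below is a naive substring filter with a hypothetical two-word keyword list and made-up sample strings. It blocks two harmless pages and passes one that avoids the listed words.

```python
BLOCKED_KEYWORDS = ("sex", "breast")  # hypothetical keyword list


def keyword_blocks(text, keywords=BLOCKED_KEYWORDS):
    """Naive keyword filter: block if any keyword appears anywhere in the text."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)


SAMPLES = [
    "Middlesex County Library opening hours",      # harmless, but contains "sex"
    "Breast cancer screening guidelines",          # harmless, but contains "breast"
    "Explicit material described in coded slang",  # avoids every listed keyword
]

if __name__ == "__main__":
    for text in SAMPLES:
        print(keyword_blocks(text), text)
```

    A filter that understood what the text was actually saying would get all three right, which is the whole argument for parsing grammar instead of matching strings.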