Slashdot Mirror


Meet Cyveillancebot

gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."

7 of 47 comments

  1. This guy is a bit stupid, right? by swmccracken · · Score: 5, Informative

    This guy is a moron, right?

    Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.

    I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt is "disallow everyone".)

    "It is, in fact a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: crawling all of Slashdot would fetch far more than Slashdot itself holds, because of all the different views of comments you can have. That's why the robots.txt of /. discourages spiders in the dynamically generated views.) Yes, in theory he's right, but reality beckons.

    Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

    Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
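To illustrate the point that robots.txt is a request rather than a lock, here is a minimal sketch using Python's standard `urllib.robotparser`. The file contents mirror the "disallow everyone" policy described above; the bot name and URL are hypothetical. The parser only tells a cooperative crawler what the site owner asked for. Nothing in it stops a client that never bothers to ask.

```python
from urllib.robotparser import RobotFileParser

# A "disallow everyone" robots.txt, as the comment above describes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def polite_may_fetch(user_agent, url):
    """Return what a well-behaved crawler would conclude from ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical agent name and URL, for illustration only.
allowed = polite_may_fetch("ExampleBot", "http://example.com/private/notes.html")
print("polite crawler may fetch:", allowed)
```

A crawler that skips this check gets the page anyway, which is why anything meant to stay private needs real access control on the server.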

  2. Cyveillance in a nutshell by Anonymous Coward · · Score: 5, Informative

    Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

    The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.

    Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.
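The loop described above is easy to avoid. As a rough sketch (not any particular spider's implementation), a crawler can remember which URLs it has already queued and skip them. The link graph below is a made-up stand-in for a forum whose threads all link back to the index.

```python
from collections import deque

# Hypothetical link graph: the index links to threads, and every
# thread links straight back to the index.
LINKS = {
    "/forum": ["/forum/thread/1", "/forum/thread/2"],
    "/forum/thread/1": ["/forum"],
    "/forum/thread/2": ["/forum", "/forum/thread/1"],
}

def crawl(start, links):
    """Breadth-first crawl that visits each URL at most once."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real crawler would fetch here, then pause
        for target in links.get(url, []):
            if target not in visited:  # skip pages already queued or fetched
                visited.add(target)
                queue.append(target)
    return order

fetched = crawl("/forum", LINKS)
print(fetched)
```

Each page is fetched once, so the index-to-thread-to-index cycle ends after three requests. A polite crawler would also sleep between fetches to keep the request rate low.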

    Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cable modem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).

    Drop 63.148.99.0/24 into the bit bucket and save your server some strain.
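For anyone unsure what that /24 covers, here is a small sketch using Python's standard `ipaddress` module to test whether a client address falls inside the block. The two sample addresses are made up for illustration. In practice the drop would be done at the firewall or router, not in application code.

```python
import ipaddress

# The block named in the comment above.
BLOCKED = ipaddress.ip_network("63.148.99.0/24")

def is_blocked(addr):
    """True if addr lies inside the blocked /24."""
    return ipaddress.ip_address(addr) in BLOCKED

# Sample addresses are hypothetical.
print(is_blocked("63.148.99.17"))
print(is_blocked("63.148.100.1"))
print(BLOCKED.num_addresses)
```

A /24 spans 256 addresses, 63.148.99.0 through 63.148.99.255, so a single rule covers the whole range.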

    (By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)

    1. Re:Cyveillance in a nutshell by PurpleFloyd · · Score: 4, Insightful
      To me, these actions (hammering databases, getting caught in recursive loops that could be easily avoided) are much worse than ignoring robots.txt. While the whole robots.txt issue could be justifiable from their position (so people couldn't hide copyrighted info via robots.txt), bringing down servers through what amounts to a DoS attack is simply inexcusable.

      There are any number of spiders out there that are smart enough to index whole sites, including dynamically-generated pages, without taking a site down or even hitting it harder than a couple of simultaneous users. This behavior is not only negligent, but malicious. Any site brought down by Cyveillance would probably have good grounds for legal action (I am not a lawyer, this is not legal advice, talk to a lawyer if you want legal advice, etc.).

      --

      That's it. I'm no longer part of Team Sanity.
  3. IP-BLOCK TO BLOCK by Oriumpor · · Score: 4, Informative

    I used SAMSPADE to look up the IP block they own (from the wonderful article). This is most definitely not their ONLY IP block, but if anyone has more, it would be great to compile a whole list of "mean" IPs.

    I do not care for this kind of intrusion (I equate it to exactly what spammers do to harvest your email...). With a list like that you can block these IPs (route 'em to never-never land).

  4. Another guy's experiences by plsuh · · Score: 4, Informative

    Take a look at one guy's experiences with blocking rude bots and spiders. Mark is a buddy of mine and this got him pretty steamed.

    --Paul

  5. Re:Saddam, Cyveillance, etc. etc. by gulker · · Score: 5, Interesting

    The point isn't that I'm shocked to see material downloaded from a public Web site... the point is that Cyveillance brags about how it protects copyright: their PR placed a BusinessWeek piece about how they had forced a site that was using Washington Post content to pay up.

    Cyveillance is basically reselling content from thousands of Web sites - original thinking, research and writing, that is not theirs... they are exactly what they claim to protect the corporate copyright owners from - they basically rip off work, including copyrighted material, and resell it.

    Good scam, they make a ton of money according to their press releases, but a scam, nevertheless.

    --
    Rules? We have no rules. We're trying to accomplish something. - Thomas Edison
  6. CYVEILLANCEBOT by moc.tfosorcimgllib · · Score: 4, Funny

    C EVIL BOT CAN LYE