Slashdot Mirror


Meet Cyveillancebot

gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."

14 of 47 comments

  1. Amusement! by fm6 · · Score: 2, Insightful
    What's really dumb about this article is the belief that any documents on a public web site can be considered "private". Indeed, the guy seems to totally misunderstand the purpose of robots.txt. It's not there to specify what's private, it's there to control the way your site is presented on public web servers, and also to help spiders avoid overloading your site.

    And in any case, Cyveillancebot is hardly a real threat to security, compared to script kiddies and the like. If you're trying to keep your private information private, you should be thinking in terms of passwords and encryption, not robots.txt files!

    Oh well, those who can, do. Those who can't, write columns.

    1. Re:Amusement! by You're+All+Wrong · · Score: 2, Insightful

      There's a difference between private and copyright.
      All my website is copyright me, but not private. I have no problem with sharing the results of my research with humans, however, I don't want my copyrights violated. I'm happy with google caching them, I consider that a favour, as it does a public service like a library. This is different though, it's not a public resource.

      If every website were to contain a query-response entry page which screened out non-humans (or unintelligent ones, or ones that can't read English well, or do maths well, or whatever query I set them), then I'd piss off many hundreds of humans.
      It's ungentlemanly to force me to piss off hundreds of people just to keep those who I don't want to read my site out.

      Where has honour gone?

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    2. Re:Amusement! by You're+All+Wrong · · Score: 2, Insightful

      I think highly of Spyveillance's bot in the same way that I'd like every airport security guard to stick his finger up my arse in order to see if I was smuggling heroin.

      Maybe some people approve of such things, but I ain't one of them.

      YAW

      --
      Your head of state is a corrupt weasel, I hope you're happy.
  2. This guy is a bit stupid, right? by swmccracken · · Score: 5, Informative

    This guy is a moron, right?

    Anyone who has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.

    I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt disallows everyone.)

    "It is, in fact, a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: Crawling all of slashdot would be much larger than slashdot itself because of all the different views of comments you can have. That's why the robots.txt of /. discourages spiders in the dynamically generated views.) Yes, in theory he's right, but reality beckons.

    Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

    Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
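
    To make the "sign on the door" point concrete, here is a minimal sketch of the check a well-behaved crawler performs voluntarily, using Python's standard urllib.robotparser. The robots.txt text, the "ExampleBot" name and the example.com URL are made up for illustration; nothing here is taken from the site in the article. The check runs entirely on the crawler's side, so a rude bot simply never calls it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical disallow-everyone robots.txt, like the one described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks first and stays out when the answer is no.
# The server cannot enforce the answer; it only publishes the request.
allowed = is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/essays/index.html")
print(allowed)
```

    Anything that actually has to stay private still needs a password on the server side, because the server never learns whether this check was run.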

    1. Re:This guy is a bit stupid, right? by 91degrees · · Score: 2, Insightful

      He's a bit of an idiot.

      I agree with the basic principles that this robot is being a little impolite though. The guy opens up his website, hoping that people will act in a civil manner. Cyveillancebot marches in there with the digital equivalent of hobnail boots, ignores the signs, and takes copies of everything, assuming that anything there is probably stolen.

      Equating it to mugging or breaking and entering is a bit much, but the shifty unshaven lurker seemed quite apt.

  3. Cyveillance in a nutshell by Anonymous Coward · · Score: 5, Informative

    Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

    The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.

    Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.
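
    The three courtesies described above (skip query strings, throttle per host, never refetch a page) can be sketched in a few lines of Python. This is an illustration under assumptions, not how any real spider is implemented: the PoliteFrontier class, its should_fetch method and the 30-second delay are all invented for the example, and the current time is passed in so the behaviour is easy to follow.

```python
from urllib.parse import urlsplit


class PoliteFrontier:
    """Decides whether a crawler may fetch a URL right now (hypothetical helper)."""

    def __init__(self, min_delay: float = 30.0):
        self.min_delay = min_delay  # seconds between hits to one host (assumed value)
        self.seen = set()           # URLs already fetched, to break link loops
        self.next_ok = {}           # host -> earliest time of the next fetch

    def should_fetch(self, url: str, now: float) -> bool:
        parts = urlsplit(url)
        if parts.query:
            return False            # dynamic page, potentially expensive: skip it
        if url in self.seen:
            return False            # already fetched: index -> thread -> index stops here
        if now < self.next_ok.get(parts.netloc, 0.0):
            return False            # too soon for this host; the caller may retry later
        self.seen.add(url)
        self.next_ok[parts.netloc] = now + self.min_delay
        return True


frontier = PoliteFrontier(min_delay=30.0)
first = frontier.should_fetch("http://example.com/forum/", now=0.0)
too_soon = frontier.should_fetch("http://example.com/forum/thread1", now=1.0)
later = frontier.should_fetch("http://example.com/forum/thread1", now=31.0)
loop = frontier.should_fetch("http://example.com/forum/", now=100.0)
```

    With these rules the forum index is fetched once, the thread waits its turn, and the link from the thread back to the index is refused instead of restarting the cycle.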

    Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cablemodem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).

    Drop 63.148.99.0/24 into the bit bucket and save your server some strain.
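
    Dropping a block is normally done at the router or firewall, but the membership test behind it is easy to show. The following is a sketch using Python's standard ipaddress module; the BLOCKED_NETWORKS list and the is_blocked helper are hypothetical names, and the only range in it is the one quoted above.

```python
import ipaddress

# The one range mentioned above; extend the list as more ranges are identified.
BLOCKED_NETWORKS = [ipaddress.ip_network("63.148.99.0/24")]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


inside = is_blocked("63.148.99.17")    # address within the /24
outside = is_blocked("63.148.100.17")  # neighbouring /24, not covered
```

    A /24 covers 256 addresses, 63.148.99.0 through 63.148.99.255, so a request handler could consult is_blocked before doing any expensive database work.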

    (By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)

    1. Re:Cyveillance in a nutshell by PurpleFloyd · · Score: 4, Insightful
      To me, these actions (hammering databases, getting caught in recursive loops that could be easily avoided) are much worse than ignoring robots.txt. While the whole robots.txt issue could be justifiable from their position (so people couldn't hide copyrighted info via robots.txt), bringing down servers through what amounts to a DoS attack is simply inexcusable.

      There are any number of spiders out there that are smart enough to index whole sites, including dynamically-generated pages, without taking a site down or even hitting it harder than a couple of simultaneous users. This behavior is not only negligent, but malicious. Any site brought down by Cyveillance would probably have good grounds for legal action (I am not a lawyer, this is not legal advice, talk to a lawyer if you want legal advice, etc.).

      --

      That's it. I'm no longer part of Team Sanity.
    2. Re:Cyveillance in a nutshell by mdielmann · · Score: 2, Interesting

      The ironic part is, they may well download material copyrighted by the web host, protected by a digital notice of the unacceptability of doing so...sounds like these guys want to play with the DMCA...

      --
      Sure I'm paranoid, but am I paranoid enough?
    3. Re:Cyveillance in a nutshell by Cy+Guy · · Score: 2, Insightful

      Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

      What I don't understand is why scouring the web for copyrighted material is considered a violation. If you are depending on the copyright laws, then you must abide by the limitations on those rights. Once the copyright owner has made the document publicly accessible without encryption, fair use would dictate that anyone who comes across it can at least read and index the text. They may not be able to keep a complete copy, but they would be able to keep their index, and even profit from the sale/rental of access to that index. If they are caching the page à la Google, then pursue them under the copyright laws. If they are merely scouring and indexing and you don't want that done, then don't allow public access to the document. As noted elsewhere, robots.txt is not the method for denying public access; some combination of userid/password and/or encryption where you control the encryption key is.

  4. IP-BLOCK TO BLOCK by Oriumpor · · Score: 4, Informative

    I used SAMSPADE to look up the IP block they own (taken from the wonderful article). This is most definitely not their ONLY IP block, but if anyone has more, it would be great to compile a whole list of "mean" IPs.

    I do not care for this kind of intrusion (I equate it to exactly what spammers do to harvest your email...). With such a list, you can block these IPs (route 'em to never-never land).

  5. Another guy's experiences by plsuh · · Score: 4, Informative

    Take a look at one guy's experiences with blocking rude bots and spiders. Mark is a buddy of mine and this got him pretty steamed.

    --Paul

  6. Re:And this is why many ISPs don't give log access by km790816 · · Score: 3, Insightful

    I totally agree...but...

    These are classic American business practices.

    We are a good, upstanding corporation.
    We want to protect our turf.
    We employ a company to help us.
    We don't ask about that company's means or, more likely, turn a blind eye.

    Dell would never agree that applications on the Internet should, in general, act the way that Cyveillancebot does.

    I believe that the author understands your point. He's not whining.

    He is, however, pointing out the hypocrisy, which I think is valuable. I'll think twice about buying another Dell.

  7. Re:Saddam, Cyveillance, etc. etc. by gulker · · Score: 5, Interesting

    The point isn't that I'm shocked to see material downloaded from a public Web site... the point is that Cyveillance brags about how it protects copyright: their PR placed a BusinessWeek piece about how they had forced a site that was using Washington Post content to pay up.

    Cyveillance is basically reselling content from thousands of Web sites - original thinking, research and writing, that is not theirs... they are exactly what they claim to protect the corporate copyright owners from - they basically rip off work, including copyrighted material, and resell it.

    Good scam, they make a ton of money according to their press releases, but a scam, nevertheless.

    --
    Rules? We have no rules. We're trying to accomplish something. - Thomas Edison
  8. CYVEILLANCEBOT by moc.tfosorcimgllib · · Score: 4, Funny

    C EVIL BOT CAN LYE