Slashdot Mirror


Meet Cyveillancebot

gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."

47 comments

  1. On the web? by ObviousGuy · · Score: 1

    Available to all.

    ObviousGuy's axiom

    --
    I have been pwned because my /. password was too easy to guess.
  2. shock! by kevin+lyda · · Score: 0

    oh my god! a web crawler not honouring the robots.txt file! and lying about what it is!

    what has the world come to?!?!?!

    for the sarcasm impaired: why are you even reading things on the web anyway? just give up already.

    --
    US Citizen living abroad? Register to vote!
  3. NetObjects Fusion does that too by infonography · · Score: 1

    It's even friendly enough to grab that robots.txt file. If you want to snatch a whole site for (uhum) research, just tell it it's your site, and wait for the great slurping sound.

    --
    Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
  4. This is the same as dealing with Gnutella by Rares+Marian · · Score: 1

    Don't blame the software, blame the users.

    --
    The message on the other side of this sig is false.
  5. Amusement! by fm6 · · Score: 2, Insightful
    What's really dumb about this article is the belief that any documents on a public web site can be considered "private". Indeed, the guy seems to totally misunderstand the purpose of robots.txt. It's not there to specify what's private, it's there to control the way your site is presented on public web servers, and also to help spiders avoid overloading your site.

    And in any case, Cyveillancebot is hardly a real threat to security, compared to script kiddies and the like. If you're trying to keep your private information private, you should be thinking in terms of passwords and encryption, not robots.txt files!

    Oh well, those who can, do. Those who can't, write columns.

    1. Re:Amusement! by You're+All+Wrong · · Score: 2, Insightful

      There's a difference between private and copyright.
      All my website is copyright me, but not private. I have no problem with sharing the results of my research with humans, however, I don't want my copyrights violated. I'm happy with google caching them, I consider that a favour, as it does a public service like a library. This is different though, it's not a public resource.

      If every website were to contain a query-response entry page which screened out non-humans (or unintelligent ones, or ones that can't read English well, or do maths well, or whatever query I set them), then I'd piss off many hundreds of humans.
      It's ungentlemanly to force me to piss off hundreds of people just to keep those who I don't want to read my site out.

      Where has honour gone?

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    2. Re:Amusement! by fm6 · · Score: 1

      Well then, you must think very highly of Cyveillance's intrusive spybot. Its only purpose is to sniff out copyright violations!

    3. Re:Amusement! by You're+All+Wrong · · Score: 2, Insightful

      I think highly of Spyveillance's bot in the same way that I'd like every airport security guard to stick his finger up my arse in order to see if I was smuggling heroin.

      Maybe some people approve of such things, but I ain't one of them.

      YAW

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    4. Re:Amusement! by Anonymous Coward · · Score: 0

      Your .sig is a couple years out of date, you may want to update it.

    5. Re:Amusement! by fm6 · · Score: 1

      Sloppy of me, I forgot the smiley. What's the smiley for "irony", anyway?

  6. This guy is a bit stupid, right? by swmccracken · · Score: 5, Informative

    This guy is a moron, right?

    Anyone that has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and they want it kept private is some kind of moron.

    I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt is disallow everyone.)

    "It is, in fact a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: a crawl of all of Slashdot would be much larger than Slashdot itself because of all the different views of comments you can have. That's why the robots.txt of /. discourages spiders in the dynamically generated views.) Yes, in theory he's right, but reality beckons.

    Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

    Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
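
    To make the "sign on the door" point concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content is an assumed "disallow everyone" file like the one described above, and the bot name and example.com URL are made up for illustration. The parser only answers the question; nothing forces a crawler to ask it.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the "disallow everyone" form described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_may_fetch(user_agent: str, url: str) -> bool:
    """A well-behaved crawler asks robots.txt before every request."""
    return parser.can_fetch(user_agent, url)


def rude_may_fetch(user_agent: str, url: str) -> bool:
    """A rude crawler never asks, so robots.txt cannot stop it."""
    return True


url = "http://example.com/private/notes.html"  # hypothetical URL
print(polite_may_fetch("ExampleBot", url))  # False
print(rude_may_fetch("ExampleBot", url))    # True
```

    Actual enforcement has to happen on the server side (authentication, IP blocks), which is what the rest of the thread discusses.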

    1. Re:This guy is a bit stupid, right? by Erebus · · Score: 1

      Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...

      Just because the Donut Brigade won't move enough to make the powdered sugar fall from their rotund bellies doesn't make breaking and entering any more legal, regardless of how little effort *you* think 'breaking' requires...

    2. Re:This guy is a bit stupid, right? by swmccracken · · Score: 1

      Ironically, it takes *more* effort to not "break in" in this case.

      Yeah, it's a bit of a stretch though, I know.

      But, thinking that "reading links on your site that you don't want them to even though you didn't try and stop them is an invasion" is just naive and stupid.

    3. Re:This guy is a bit stupid, right? by You're+All+Wrong · · Score: 1

      """
      Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.

      Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
      """

      You need to see "Bowling for Columbine", particularly the parts about Canada and front doors.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    4. Re:This guy is a bit stupid, right? by Hard_Code · · Score: 1
      Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
      On the contrary, if there were a strange figure that was rifling through my house and refused to identify himself, I would sure as hell hope that the police would concern themselves, despite the fact that my doormat says "Welcome"...

      You are equating technology with law, and that is a very dangerous thing to do. That I have a technological means to commit a crime does not invalidate the fact that it is a crime. That this person puts copyrighted material on the web does not annul the copyright.

      It seems like attaching a simple copyright license to the web site that prohibits this behavior would be a legitimate counter. (Cyveillance, you DID read the license on my site, didn't you? Oh, you didn't? Here's your subpoena.)
      --

      It's 10 PM. Do you know if you're un-American?
    5. Re:This guy is a bit stupid, right? by www.sorehands.com · · Score: 1

      Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.


      No, it is more like a sign at the airport that says "Employees only" and then when you are surrounded by the police, you claim "but there was no lock on the door."


      Or at a Radio Shack, there is a sign on the back room door, "Private, employees only."

    6. Re:This guy is a bit stupid, right? by 91degrees · · Score: 2, Insightful

      He's a bit of an idiot.

      I agree with the basic principles that this robot is being a little impolite though. The guy opens up his website, hoping that people will act in a civil manner. Cyveillancebot marches in there with the digital equivalent of hobnail boots, ignores the signs, and takes copies of everything, assuming that anything there is probably stolen.

      Equating it to mugging or breaking and entering is a bit much, but the shifty unshaven lurker seemed quite apt.

  7. Cyveillance in a nutshell by Anonymous Coward · · Score: 5, Informative

    Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

    The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.

    Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.
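
    For comparison, the two courtesies described here (never refetch a URL, and pause between requests) take only a few lines. This is a rough Python sketch, not any real spider's code: polite_crawl, get_links and the FORUM link graph are hypothetical names, and the 30-second default is simply "twice per minute" taken from the paragraph above.

```python
import time
from collections import deque
from typing import Callable, Iterable


def polite_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    min_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Breadth-first crawl that fetches each URL at most once."""
    seen = {start}  # URLs already queued; this breaks index <-> thread loops
    queue = deque([start])
    fetched: list[str] = []
    while queue:
        url = queue.popleft()
        if fetched:
            sleep(min_delay)  # pause between consecutive requests to the site
        fetched.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


# Hypothetical forum: the index links to a thread, which links straight back.
FORUM = {
    "/forum": ["/forum?thread=1"],
    "/forum?thread=1": ["/forum"],
}

pages = polite_crawl("/forum", lambda u: FORUM.get(u, []), sleep=lambda s: None)
print(pages)  # ['/forum', '/forum?thread=1']
```

    The demo passes a no-op sleep so it finishes instantly; a real crawler would keep the delay and fetch over HTTP inside get_links.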

    Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cablemodem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).

    Drop 63.148.99.0/24 into the bit bucket and save your server some strain.

    (By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)

    1. Re:Cyveillance in a nutshell by TubeSteak · · Score: 1
      Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window.

      you'd think a corp. would take more care. is it too hard to believe this bot will get stuck in a loop and tie up someone's bandwidth... perhaps sending them over their limit and costing them money? when you've got a name and an address that can be sued, it's best to use some common sense.

      maybe this bot doesn't do that, so feel free to explain why

      --
      [Fuck Beta]
      o0t!
    2. Re:Cyveillance in a nutshell by PurpleFloyd · · Score: 4, Insightful
      To me, these actions (hammering databases, getting caught in recursive loops that could be easily avoided) are much worse than ignoring robots.txt. While the whole robots.txt issue could be justifiable from their position (so people couldn't hide copyrighted info via robots.txt), bringing down servers through what amounts to a DOS attack is simply inexcusable.

      There are any number of spiders out there that are smart enough to index whole sites, including dynamically-generated pages, without taking a site down or even hitting it harder than a couple of simultaneous users. This behavior is not only negligent, but malicious. Any site brought down by Cyveillance would probably have good grounds for legal action (I am not a lawyer, this is not legal advice, talk to a lawyer if you want legal advice, etc.).

      --

      That's it. I'm no longer part of Team Sanity.
    3. Re:Cyveillance in a nutshell by mdielmann · · Score: 2, Interesting

      The ironic part is, they may well download material copyrighted by the web host, protected by a digital notice of the unacceptability of doing so...sounds like these guys want to play with the DMCA...

      --
      Sure I'm paranoid, but am I paranoid enough?
    4. Re:Cyveillance in a nutshell by toastyman · · Score: 1

      Actually, they only have a /27

      OrgName: Cyveillance
      OrgID: CYVEIL
      Address: 1555 Wilson Blvd., Ste. 404
      City: Arlington
      StateProv: VA
      PostalCode: 22209-2405
      Country: US

      NetRange: 63.148.99.224 - 63.148.99.255
      CIDR: 63.148.99.224/27


      If you block the whole /24, you're hitting a few unrelated (probably innocent) organizations.
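
      The difference between the two blocks is easy to check with Python's standard ipaddress module. A small sketch: the networks are the ones quoted in this thread, and is_cyveillance is just an illustrative helper name.

```python
import ipaddress

cyveillance = ipaddress.ip_network("63.148.99.224/27")  # from the whois output above
whole_block = ipaddress.ip_network("63.148.99.0/24")    # the broader block suggested earlier

print(cyveillance.num_addresses)           # 32
print(whole_block.num_addresses)           # 256
print(cyveillance[0], cyveillance[-1])     # 63.148.99.224 63.148.99.255
print(cyveillance.subnet_of(whole_block))  # True


def is_cyveillance(addr: str) -> bool:
    """True if addr falls inside the /27 registered to Cyveillance."""
    return ipaddress.ip_address(addr) in cyveillance


print(is_cyveillance("63.148.99.230"))  # True
print(is_cyveillance("63.148.99.10"))   # False
```

      So blocking the /24 drops 224 addresses that are not in the registered range.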

    5. Re:Cyveillance in a nutshell by jago25_98 · · Score: 1

      OK, it looks like the abuse could lead to change. What will the inconvenience likely be for us?

      -> what are the defences for aggressive spiders and
      --> what is the impact of these defences?

      And, a case study. What happens if I copy+paste a WP posting to my own free site when:

      - site is hosted under cuban domain?
      - I copy data to paper word for word and fly to cuba, then submit and host there?

      ^ Laws for US/EU?

      Where might be a good source to answer these ridiculous legal copyright related questions? They seem to crop up whenever I do so much as think of publishing or creating online.

    6. Re:Cyveillance in a nutshell by Cy+Guy · · Score: 2, Insightful

      Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.

      What I don't understand is why scouring the web for Copyrighted material is considered being violated. If you are depending on the copyright laws, then you must abide by the limitations on those rights. Once the copyright owner has made the document publicly accessible without encryption, fair use would dictate that anyone who comes across it can at least read and index the text. They may not be able to keep a complete copy, but they would be able to keep their index, and even profit from the sale/rental of access to that index. If they are caching the page à la Google, then pursue them under the copyright laws. If they are merely scouring and indexing and you don't want that done, then don't allow public access to the document. As noted elsewhere, robots.txt is not the method for denying public access; some combination of userid/password and/or encryption where you control the encryption key is.

  8. And this is why many ISPs don't give log access by stienman · · Score: 0, Troll

    And this is why many ISPs don't give log access to the people they host for

    Too many users will run around screaming "Somebody's stealing my stuff! WAAAAAAH!"

    Look, robots.txt is a gentleman's agreement. The internet is open for all, not just gentlemen.

    WELCOME TO THE INTERNET!
    HOW WOULD YOU LIKE TO BE ABUSED TODAY?


    Please take some time to re-read that, and disabuse yourself of the idea that you can control other people - the only control you have over them is going to be control they will also have over you.

    Secondly, if it really bothers you, block, ban, tarpit, data-spam, whatever the heck you want out of them. If they dare contact you, you can give them whatever you want, however you want. Send them data from random at a rate of 2 bytes per second. Drop half the packets they send you. Abuse the TCP/IP protocol and see if their software is robust enough to handle it. Make it just annoying enough to contact your website in certain ways that they will put you on their 'do not spider' list. Here's a simple example.

    But, man, get a grip. Seriously. Do you honestly think you are the first person to discover the dark underbelly of corporate money making schemes on the net?

    -Adam

    1. Re:And this is why many ISPs don't give log access by km790816 · · Score: 3, Insightful

      I totally agree...but...

      This is classic American business practices.

      We are a good, upstanding corporation.
      We want to protect our turf.
      We employ a company to help us.
      We don't ask about that company's means or, more likely, turn a blind eye.

      Dell would never agree that applications on the Internet should, in general, act the way that Cyveillancebot does.

      I believe that the author understands your point. He's not whining.

      He is, however, pointing out the hypocrisy, which I think is valuable. I'll think twice about buying another Dell.

  9. IP-BLOCK TO BLOCK by Oriumpor · · Score: 4, Informative

    I used SAMSPADE to look up the IP block they own (off the wonderful article). This is most definitely not their ONLY IP block, but if anyone does have more, it would be great to compile a whole list of "mean" IPs.

    If you do not care for this kind of intrusion (I equate this to exactly what spammers do to harvest your email...), then you can block these IPs (route 'em to never-never land).

  10. Another guy's experiences by plsuh · · Score: 4, Informative

    Take a look at one guy's experiences with blocking rude bots and spiders. Mark is a buddy of mine and this got him pretty steamed.

    --Paul

  11. current state of things by danoatvulaw · · Score: 1
    Anyone that has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and they want it kept private is some kind of moron.
    I mean, I could easily spider his site using wget, ignoring his robots.txt. (For the record, his robots.txt is disallow everyone.)
    And this would get you sued if they didn't like what you were doing anymore, as you would be a trespasser. While cases such as eBay v. Bidder's Edge didn't explicitly hold that robots.txt constituted notice that robots were not allowed, I think that's going to be the next wave of cases to be litigated. The bottom line is that a site doesn't have to provide the best protection out there; it only has to put you on notice that certain conduct is not wanted, and once you overstep that, you turn from licensee/invitee to trespasser and become liable. Weak latch or not, the conduct still gives a cause of action. (Please note that my background is CS, and I don't necessarily agree with allowing suits of this kind to be filed, since there should be a more adequate self-help remedy taken beforehand other than robots.txt, but this is the current state of things. Don't blame the person who takes advantage of the architecture unless they are clearly up to no good.)
    1. Re:current state of things by swmccracken · · Score: 1

      Thankfully, US Case Law isn't binding on me, yet.

      Really, you're arguing that robots.txt is just a special case of "Terms of Use" that you see around the place.

      (Don't get me started on the so-called "justice" system. :-)

      I would prefer to hope that it becomes accepted knowledge that putting anything on a website is considered publication of that information... but this could just be idle hope.

  12. Re:Saddam, Cyveillance, etc. etc. by gulker · · Score: 5, Interesting

    The point isn't that I'm shocked to see material downloaded from a public Web site... the point is that Cyveillance brags about how it protects copyright: their PR placed a BusinessWeek piece about how they had forced a site that was using Washington Post content to pay up.

    Cyveillance is basically reselling content from thousands of Web sites - original thinking, research and writing, that is not theirs... they are exactly what they claim to protect the corporate copyright owners from - they basically rip off work, including copyrighted material, and resell it.

    Good scam, they make a ton of money according to their press releases, but a scam, nevertheless.

    --
    Rules? We have no rules. We're trying to accomplish something. - Thomas Edison
  13. Traps, ripoffs. by fm6 · · Score: 1
    I'm not a webmaster, but it sounds like a spambot trap is close to being a necessary feature for a small web site. But I can't say I like the idea of using a firewall this way. Mark also provides a link to a site that supposedly does the same thing with Apache, but that site is offline. (!)

    I find it interesting that he can lock out Cyveillancebot and other spybots simply by banning their IP addresses. Sounds like Cyveillance and other "ebusiness intelligence" companies are being less than diligent in providing the service that their customers are paying for. I'm reminded of Bruce Schneier's dictum that security is something you can't just buy and forget. Only in this case, it's anti-security!

  14. Sorry, but no sympathy at all. by The+Fink · · Score: 1
    Sorry, mate, but as much as I dislike abuse of copyright (I've had some of my own works pillaged in the past), if you don't take steps to protect it, you can assume someone will copy it and use it illegitimately.

    The best you can do is chase - legally if necessary - those who steal your work, and gain whatever compensation you can. Oh, and make sure that copyright is broadly proclaimed in the first instance, too.

    No, the `bot shouldn't crawl past robots.txt (rfc-ignorant, anyone?). But, given that it does, the next best bet is to IP/domain/UA block it, and/or password protect (using whatever passwords you like, if it's meant to be somewhat viewable).

    It's a simple, albeit should-be-unnecessary, rule. And yes, it's sad that there are unscrupulous people out there, but that's the way it is.

  15. Block it with Apache and mod_rewrite! by scrod · · Score: 1

    RewriteCond %{REMOTE_HOST} ^www\.cyveillance\.com$
    RewriteRule ^.*$ - [F]


    Of course the actual address of the bot may vary.

    1. Re:Block it with Apache and mod_rewrite! by hansk · · Score: 1

      Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.

      See the above comment: Cyveillance in a nutshell

      You need to block its IP:

      # Cyveillance
      RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.(22[4-9]|2[3-5][0-9])$

      # FILTER BOTS : 403-Forbidden
      RewriteRule ^.* - [F,L]
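
      A regex standing in for a CIDR range is easy to get subtly wrong, so it is worth checking against the real range. The sketch below is Python rather than Apache config and assumes the 63.148.99.224/27 range reported earlier in this thread. It uses the pattern with the dots escaped, since an unescaped "." matches any character (harmless in practice for REMOTE_ADDR, but sloppy).

```python
import ipaddress
import re

# The pattern with literal dots escaped.
PATTERN = re.compile(r"^63\.148\.99\.(22[4-9]|2[3-5][0-9])$")
NETWORK = ipaddress.ip_network("63.148.99.224/27")

# Compare the regex against true CIDR membership for every address in the /24.
mismatches = [
    str(addr)
    for addr in ipaddress.ip_network("63.148.99.0/24")
    if bool(PATTERN.match(str(addr))) != (addr in NETWORK)
]
print(mismatches)  # []

# The unescaped form also accepts strings that are not dotted quads at all.
LOOSE = re.compile(r"^63.148.99.(22[4-9]|2[3-5][0-9])$")
print(bool(LOOSE.match("63x148y99z230")))  # True
```

      An empty mismatch list means the pattern and the /27 agree on every address in the surrounding /24.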
    2. Re:Block it with Apache and mod_rewrite! by scrod · · Score: 1
      Nope. Because the Cyveillance bot doesn't announce itself. It masks its user-agent.
      You need to block its IP


      Uh huh, and did you see my rule mention HTTP_USER_AGENT anywhere in it? No. Look at what you wrote--the only difference between your rule and mine is that you followed my advice and used an IP address range instead of the host name.
    3. Re:Block it with Apache and mod_rewrite! by hansk · · Score: 1

      Yup, you are correct. But, using "remote_host" may not work if your server does not have a reliable reverse dns lookup. Also, it can add additional overhead because of the lookup time. Therefore, banning by IP is better.

    4. Re:Block it with Apache and mod_rewrite! by Anonymous Coward · · Score: 0

      I hate when stupid people still try to salvage some dignity and backpeddle out of things. Fag.

  16. CYVEILLANCEBOT by moc.tfosorcimgllib · · Score: 4, Funny

    C EVIL BOT CAN LYE

  17. What robots.txt? by eet23 · · Score: 1
    64.68.82.39 - - [05/May/2003:15:18:23 -0700] "GET /robots.txt HTTP/1.0" 404 275 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

    Unless I am misunderstanding the log entry, robots.txt doesn't actually exist on this guy's server. So why does he spend so much time complaining about this thing not looking for it?
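
    For anyone who wants to check a log line like this mechanically, here is a small Python sketch that pulls the fields out of it. It assumes the server writes Apache's Combined Log Format, which the quoted line appears to follow; LOG_RE is just an illustrative name.

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = (
    '64.68.82.39 - - [05/May/2003:15:18:23 -0700] '
    '"GET /robots.txt HTTP/1.0" 404 275 "-" '
    '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"'
)

m = LOG_RE.match(line)
print(m["path"], m["status"])  # /robots.txt 404
print(m["agent"])              # Googlebot/2.1 (+http://www.googlebot.com/bot.html)
```

    The 404 status is the field that says the server had no /robots.txt to return for that request.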

  18. And now, Cyveillance's robots.txt file by tregoweth · · Score: 1
    HTMLized version of Cyveillance's robots.txt file, for your browsing pleasure:


    User-agent: *
    Disallow: /web/us/partners/submit_pw.asp
    Disallow: /web/uk/partners/submit_pw.asp
    Disallow: /web1/us/partners/submit_pw.asp
    Desallow: /web1/uk/partners/submit_pw.asp


    Notice how they misspelled "Disallow" in the fourth item, and that none of the pages seem to exist. Good job, Cyveillance!
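
    What the typo does in practice depends on the parser, but as one data point, Python's standard urllib.robotparser silently skips field names it does not recognise. A short sketch using the file exactly as quoted above (ExampleBot is a made-up user agent):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, typo included.
ROBOTS_TXT = """\
User-agent: *
Disallow: /web/us/partners/submit_pw.asp
Disallow: /web/uk/partners/submit_pw.asp
Disallow: /web1/us/partners/submit_pw.asp
Desallow: /web1/uk/partners/submit_pw.asp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

spelled_right = parser.can_fetch("ExampleBot", "/web1/us/partners/submit_pw.asp")
misspelled = parser.can_fetch("ExampleBot", "/web1/uk/partners/submit_pw.asp")
print(spelled_right)  # False: the Disallow rule applies
print(misspelled)     # True: the "Desallow" line is ignored
```

    Other crawlers may handle unknown fields differently, so treat this as the behaviour of one parser only.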
  19. that is a good statement. by www.sorehands.com · · Score: 1

    Really, you're arguing that robots.txt is just a special case of "Terms of Use" that you see around the place.

    I like that wording. robots.txt is a terms of use that a computer can usually understand.

  20. Intrusive Spybots by fm6 · · Score: 1
    What I don't understand is why scouring the web for Copyrighted material is considered being violated.
    Well, I certainly don't consider it wrong for copyright holders to search the web for theft of their IP. Problem is, Cyveillance does it in an extremely disruptive manner. It's probably not reasonable to expect the cyveillancebot to honor robots.txt, as Chris Gulker thinks it should. But if it doesn't act nicer than it currently does, then web masters will just lock it out -- and it will defeat its own purpose.
  21. Perl is magic by Anonymous Coward · · Score: 0

    Write a CGI to feed the bot a neverending file at 12 bytes a second.
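
    The same idea as a rough sketch in Python rather than a Perl CGI: a generator that yields 12 random bytes per second forever. The name tarpit and its parameters are hypothetical, and the wiring to a web server is not shown (a WSGI application could, for instance, return such a generator as its response body).

```python
import itertools
import os
import time
from typing import Callable, Iterator


def tarpit(
    bytes_per_second: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield random bytes forever, one small chunk per second."""
    while True:
        yield os.urandom(bytes_per_second)
        sleep(1.0)  # throttle: the client gets one chunk, then waits


# Demo: take three chunks with a no-op sleep so nothing actually waits.
chunks = list(itertools.islice(tarpit(sleep=lambda s: None), 3))
print([len(c) for c in chunks])  # [12, 12, 12]
```

    Keep in mind that every connection held open this way also ties up a worker on your own server, so this is only sensible for a handful of known-bad addresses.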

  22. I'm stupid. What does all that mean? by Anonymous Coward · · Score: 0
  23. Re:Saddam, Cyveillance, etc. etc. by Anonymous Coward · · Score: 0

    how is cyveillance reselling content? it doesn't look like they do that...