Meet Cyveillancebot
gulker writes "A rant about making a new 'acquaintance'... Googlebot is like the UPS driver who comes to the door in a uniform, and will happily show you his ID and business card: Cyveillancebot is like a coarse, unshaven, itchy guy with his hat pulled down lurking near your half-open bedroom window. This after Cyveillance defeats a 'protection mechanism' - robots.txt - and grabs 155 copyrighted files from my Web server, which files it will presumably share with others, for a profit..."
Available to all.
ObviousGuy's axiom
I have been pwned because my
oh my god! a web crawler not honouring the robots.txt file! and lying about what it is!
what has the world come to?!?!?!
for the sarcasm impaired: why are you even reading things on the web anyway? just give up already.
US Citizen living abroad? Register to vote!
It's even friendly enough to grab that robots.txt file. If you want to snatch a whole site for (uhum) research just tell it it's your site, and wait for the great slurping sound.
Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
Don't blame the software, blame the users.
The message on the other side of this sig is false.
And in any case, Cyveillancebot is hardly a real threat to security, compared to script kiddies and the like. If you're trying to keep your private information private, you should be thinking in terms of passwords and encryption, not robots.txt files!
Oh well, those who can, do. Those who can't, write columns.
This guy is a moron, right?
"It is, in fact a mechanism for safeguarding content that owners wish to keep private from crawlers." WRONG! It is a mechanism for discouraging crawlers from downloading vast hunks of your site. (Good example: Crawling all of slashdot would be much larger than slashdot itself because of all the different views of comments you can have. That's why the robots.txt of /. discourages spiders in the dynamically generated views.) Yes, in theory he's right, but reality beckons.
Anyone that has *anything* on a public web server that isn't protected with a username and password (and that isn't very difficult, now, is it?) and wants it kept private is some kind of moron.
I mean, I could easily spider his site using wget ignoring his robots.txt. (For the record, his robots.txt is disallow everyone.)
Robots.txt is not like locking your door with a weak latch. It's like leaving the door unlocked with a "please behave while inside" sign on it.
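To make the "sign on the door" point concrete, here is a minimal Python sketch of what honouring robots.txt looks like from the crawler's side. The robots.txt contents, the function name and the example URL are assumptions made up for illustration; only the standard-library `urllib.robotparser` module is real. The check runs entirely inside the client, so a bot that skips it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def polite_crawler_may_fetch(user_agent: str, url: str,
                             robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honours robots.txt would fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # A well-behaved bot asks first and is told to stay away.
    print(polite_crawler_may_fetch("Googlebot",
                                   "http://example.com/essays/rant.html"))
    # A rude bot never calls this function at all; the server still
    # answers its requests unless something else blocks them.
```

Anything that has to stay private needs a server-side control (a password, an IP block), because the server never learns whether the client ran this check.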
Oddly enough, I don't think the police would have much sympathy for anyone whose house got burgled like this...
Cyveillance runs a web robot. That web robot has one purpose, and one purpose only: to scour the web looking for "copyrighted material" owned by its clients. What happens when such material is found, I don't know; it's probably reported back to the Mother Ship for C&D processing.
The reason they're widely hated is that their bot misbehaves. Badly. Not only does it send bogus User-Agent headers and disregard the robots.txt file, it'll literally hammer a site. It's one of the most aggressive bots I've ever come across, and it seems its operators don't care. I've seen a server go down because a spider in Cyveillance's IP space was hitting a MySQL-based message board thousands of times per minute.
Most spiders either ignore URLs with query strings in them, recognize them as potentially resource-intensive and avoid fetching more than once or twice per minute, or are at least smart enough to avoid getting caught in a recursive loop. Not Cyveillance; the damned thing would fetch the forum index, then fetch a thread, then follow the link from that thread right back to the forum index, ad nauseam.
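For comparison, avoiding the index/thread/index loop described above takes very little code. The following is a rough Python sketch, not anything Cyveillance or any real spider is known to run: the `LINKS` table is a made-up stand-in for a forum's link graph and `crawl` is a hypothetical name. The whole trick is a set of URLs already seen.

```python
from collections import deque
from urllib.parse import urldefrag

# Hypothetical link graph standing in for a forum: the index links to
# threads, and every thread links straight back to the index.
LINKS = {
    "/forum/index.php": ["/forum/thread.php?id=1", "/forum/thread.php?id=2"],
    "/forum/thread.php?id=1": ["/forum/index.php"],
    "/forum/thread.php?id=2": ["/forum/index.php", "/forum/thread.php?id=1"],
}

def crawl(start: str, links: dict[str, list[str]]) -> list[str]:
    """Breadth-first crawl that fetches each URL at most once."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue:
        url = queue.popleft()
        fetched.append(url)  # a real bot would issue the HTTP request here
        for link in links.get(url, []):
            link, _fragment = urldefrag(link)  # "#anchor" is the same page
            if link not in seen:  # skip anything already queued or fetched
                seen.add(link)
                queue.append(link)
    return fetched

if __name__ == "__main__":
    print(crawl("/forum/index.php", LINKS))
```

A considerate bot would also pause between requests, especially for URLs with query strings; that delay is left out here to keep the sketch short.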
Cyveillance doesn't just crawl the IP space of webhosting and colo companies, either. They hit my cablemodem all the time - I'm not sure whether they scan all cable modems, or whether they've just grown fond of me because I'm running a web server (which serves nothing externally, save for a tiny index page that shows my uptime).
Drop 63.148.99.0/24 into the bit bucket and save your server some strain.
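If the block has to happen in application code rather than at the router or firewall, the same check can be sketched in Python with the standard `ipaddress` module. `BLOCKED_NETWORKS` and `is_blocked` are illustrative names, and the list holds only the one range quoted above; whether that range is still current is not something this sketch can confirm.

```python
import ipaddress

# The range named in the comment above; further ranges would be
# appended here as they are identified.
BLOCKED_NETWORKS = [ipaddress.ip_network("63.148.99.0/24")]

def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

if __name__ == "__main__":
    for ip in ("63.148.99.17", "63.148.100.1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Dropping the packets at the firewall is cheaper, since the request never reaches the web server at all; the application-level check is only for hosts where that is not an option.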
(By the way, why the fuck do I have to logout to post as AC now? Are registered users only allowed one AC post per month or something?)
And this is why many ISPs don't give log access to the people they host for.
Too many users will run around screaming "Somebody's stealing my stuff! WAAAAAAH!"
Look, robots.txt is a gentleman's agreement. The internet is open for all, not just gentlemen.
WELCOME TO THE INTERNET!
HOW WOULD YOU LIKE TO BE ABUSED TODAY?
Please take some time to re-read that, and disabuse yourself of the idea that you can control other people - the only control you have over them is going to be control they will also have over you.
Secondly, if it really bothers you, block, ban, tarpit, data-spam, whatever the heck you want out of them. If they dare contact you, you can give them whatever you want, however you want. Send them data from random at a rate of 2 bytes per second. Drop half the packets they send you. Abuse the TCP/IP protocol and see if their software is robust enough to handle it. Make it just annoying enough to contact your website in certain ways that they will put you on their 'do not spider' list. Here's a simple example.
But, man get a grip. Seriously. Do you honestly think you are the first person to discover the dark underbelly of corporate money making schemes on the net?
-Adam
I used SAMSPADE to look up the IP block they own (taken from the wonderful article). This is most definitely not their ONLY IP block, but if anyone does have more, it would be great to compile a whole list of "mean" IPs.
If you do not care for this kind of intrusion (I equate this to exactly what spammers do to harvest your email...) then you can block these IPs (route 'em to never-never land).
Take a look at one guy's experiences with blocking rude bots and spiders. Mark is a buddy of mine and this got him pretty steamed.
--Paul
The point isn't that I'm shocked to see material downloaded from a public Web site... the point is that Cyveillance brags about how it protects copyright: their PR placed a BusinessWeek piece about how they had forced a site that was using Washington Post content to pay up.
Cyveillance is basically reselling content from thousands of Web sites - original thinking, research and writing, that is not theirs... they are exactly what they claim to protect the corporate copyright owners from - they basically rip off work, including copyrighted material, and resell it.
Good scam, they make a ton of money according to their press releases, but a scam, nevertheless.
Rules? We have no rules. We're trying to accomplish something. - Thomas Edison
I find it interesting that he can lock out Cyveillancebot and other spybots simply by banning their IP addresses. Sounds like Cyveillance and other "ebusiness intelligence" companies are being less than diligent in providing the service that their customers are paying for. I'm reminded of Bruce Schneier's dictum that security is something you can't just buy and forget. Only in this case, it's anti-security!
The best you can do is chase - legally if necessary - those who steal your work, and gain whatever compensation you can. Oh, and make sure that copyright is broadly proclaimed in the first instance, too.
No, the `bot shouldn't crawl past robots.txt (rfc-ignorant, anyone?). But, given that it does, the next best bet is to IP/domain/UA block it, and/or password protect (using whatever passwords you like, if it's meant to be somewhat viewable).
It's a simple, albeit should-be-unnecessary, rule. And yes, it's sad that there are unscrupulous people out there, but that's the way it is.
RewriteEngine On
# Answer 403 Forbidden when the client's hostname matches exactly
RewriteCond %{REMOTE_HOST} ^www\.cyveillance\.com$
RewriteRule ^.*$ - [F]
Of course the actual address of the bot may vary.
C EVIL BOT CAN LYE
Unless I am misunderstanding the log entry, robots.txt doesn't actually exist on this guy's server. So why does he spend so much time complaining about this thing not looking for it?
Notice how they misspelled "Disallow" in the fourth item, and that none of the pages seem to exist. Good job, Cyveillance!
I like that wording. robots.txt is a terms of use that a computer can usually understand.
Fight Spammers!
Write a CGI to feed the bot a neverending file at 12 bytes a second.
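A rough idea of what that could look like, sketched in Python as a generator plus a bare WSGI wrapper rather than a classic CGI script. The names `drip` and `tarpit_app` are made up for this example, and the injectable `sleep` parameter exists only so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterator

def drip(bytes_per_second: int = 12,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield one filler byte at a time, forever, at the requested rate."""
    delay = 1.0 / bytes_per_second
    while True:
        yield b"x"
        sleep(delay)  # pause before producing the next byte

def tarpit_app(environ, start_response):
    """Minimal WSGI app: a response body that never ends."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return drip()
```

Keep in mind that every bot stuck in the tarpit ties up one of your own server's workers or connections for as long as it stays, so this is only sensible on a server that can spare them and only for clients already identified as misbehaving.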
how is cyveillance reselling content? it doesn't look like they do that...