'Scrapers' Dig Deep For Data On Web
srwellman writes "The practice of Web 'scraping' is growing as many firms offer to collect personal, and potentially incriminating, data about users from their social networking profiles and discussions. Many companies even collect online conversations and personal details from social networks, job sites and forums where people might discuss their lives and even potentially sensitive data, such as health issues. These scrapers operate in a legal grey area leaving many users exposed." We ban scrapers like this regularly here simply for not adhering to the rules spelled out in robots.txt.
You mean like Google already does for its advertisers? In fact, one of the related links in the article is a story about Google titled "Google Agonizes on Privacy as Ad World Vaults Ahead," discussing their plans for utilizing their vast archive of valuable user data. The battle for online privacy was lost long ago.
I'm not on FB, Twitter, MyCloud or whatever else, so there's no data out there about me. If there's nothing to harvest then they can't harvest it - I'd rather be classified as 'boring' or 'not with it' (whatever the fuck 'It' is), than have stuff out there that might come back to bite me in the ass in 10 or 20 years' time.
That Anonymous Coward guy is going to have a mailbox full of goatse spam.
Now what kind of individual stands to gain from generating this rumour? Let's see now ...
The purpose of existence is to make money.
He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.
I don't think we need to dig any deeper to come to the conclusion that this guy is an idiot.
0 = 1 + e^(Alt something)
This was talked about back in October:
http://yro.slashdot.org/story/10/10/15/1340244/Data-Miners-Scraping-Away-Our-Privacy?from=rss
I thought the guy in the picture looked familiar...
"We ban scrapers like this regularly here simply for not adhering to the rules spelled out in robots.txt." Hah! robots.txt doesn't stop any decent crawler
Known robots, and scrapers
IP addresses that do not honor /robots.txt.
and IP addresses that robotically submit spam on robots.txt-disallowed HTML feedback forms
Much web scraping can be automatically detected.
Sites like Facebook/social networking sites are perfect places to trap/detect scrapers, if they would be willing to contribute to a DNSBL
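For anyone wondering what checking such a list would look like: a DNSBL lookup is just a DNS query for the IP's octets in reverse order, prepended to the list's zone. Here is a minimal sketch in Python. The zone name `scrapers.dnsbl.example` is hypothetical (no such list is named in this thread), and `is_listed` assumes the common convention that any A record in the answer means "listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a DNSBL client looks up for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers for this IP (assumed to mean 'listed')."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No record (or lookup failure): treat the address as not listed.
        return False


# Hypothetical zone, for illustration only.
print(dnsbl_query_name("192.0.2.10", "scrapers.dnsbl.example"))
```

A site that traps a scraper would publish the offending address into the zone; everyone else only needs the lookup side.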
I've always wondered -- how would this work for future politicians from our generation?
All your comments, history etc are probably available in a multitude of places, and anyone with enough motivation can go around digging and find some pretty serious material. Combined with the fact that most people know (or care) little to nothing about privacy, you will have an entire generation of users with a good chunk of their private lives and opinions shared out on the Internet for everyone to see.
And knowing how we all have skeletons in our closets, and how we've all been immature at some point in time or the other in our lives, how many future political candidates can claim to be "squeaky clean"?
I mean, I see this primarily as a problem for the right more than the left, given how their voter base expects them to have "conservative values" or some such nonsense.
Getting banned sure will though.
What's to stop me from 'scraping' the info? What's to stop me from simply downloading the entire site with something like this? Slowly if needed to avoid arousing suspicion.
For justice, we must go to Don Corleone
I did expect the Spanish Inquisition!
A feeling of having made the same mistake before: Deja Foobar
I don't think there can be such a "ban" - if humans can browse a website, then crawlers can crawl.
robots.txt isn't meant to have any enforcement capability; by its nature it's just an advisory mechanism telling bots which crawlers and which paths the site will and will not accept. If a bot chooses to ignore it (as pretty much all of the types of bots described in this article do), it's up to the site admins to enforce it via IP bans etc.
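To make the "advisory" part concrete, here is a rough sketch of what a well-behaved crawler does before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are made up for illustration. Nothing forces a crawler to run this check; that is the whole point.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str) -> bool:
    """Ask the parsed robots.txt rules whether this agent may fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite bot checks first and skips the URL when the answer is False.
print(allowed("ExampleBot", "https://example.com/private/profile", ROBOTS_TXT))
print(allowed("ExampleBot", "https://example.com/public/page", ROBOTS_TXT))
```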
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
Soon as I click to read the comments, the ad on the right is for a web scraping solution.
However, there are patterns of browsing that are clearly not human. Humans do not make 100 requests in a 10 second timespan, nor do humans traverse every post made by every user.
Yes, it is imperfect and you might ban an occasional human, but this is essentially the situation we have with spam filtering. It is a bit sad that the Internet is becoming so adversarial, but that is what we face.
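A sliding-window counter is one simple way to catch the "100 requests in 10 seconds" pattern. The sketch below is only an illustration: the class name, the defaults and the idea of keying on client IP are my assumptions, not anything a particular site is known to run.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flag clients that make too many requests inside a sliding time window."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request timestamps

    def record(self, client_ip: str, timestamp: float) -> bool:
        """Record one request; return True once the client reaches the limit."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] >= self.window_seconds:
            hits.popleft()
        return len(hits) >= self.max_requests
```

A caller would feed it one (IP, timestamp) pair per request from the access log and decide what to do with flagged addresses. As noted above, a fast human could trip it now and then, so the thresholds are a judgment call.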
Palm trees and 8
You're telling me that stuff on a public web site is public?
Because the public sector has very little time to handle FOIA requests and they sometimes cost more money to complete than I'm willing to pay (usually because they don't do much of their own data work in-house and have to call on a contractor to do it for me), I use their websites to glean the data I want.
Last week I gave a talk about using SAS to do screen scraping and then perform analysis on the data of jail inmate registries and level 3 sex offenders in MN. I have dashboards of the data available on my website and as I mentioned in my presentation it has even been used to help one county avoid what could have been a serious privacy issue.
So while there are any number of pitfalls to screen scraping (not understanding the meaning of the data and trends, being fed incomplete or purposefully incorrect data, or even being banned outright) screen scraping can be great for learning about and reporting on the public sector when they are physically or financially incapable or simply unwilling to do it themselves.
I think they are 2 distinct issues that do not combine the way you suggest.
1. If you violate a website's TOS the website can come after you.
2. The info they gain spidering a website is pretty much free for them to use to discriminate against you.
Anything I post on slashdot/FB/any online forum I treat like it is viewable by every future and past employer, insurer, lender, ex girlfriend etc. Anything online will exist forever and if it's not already permanently linked to you, it will be before you die. If that's right or wrong, legal or illegal is really beside the point IMHO.
TODO create witty sig.
Slashdot is filled to the brim with people who take the time to create an alias and then list their homepage on their profile, which of course, is displayed in a link on the same line as their alias in the post they just made.
I click on those homepages whenever I read something really stupid or ridiculous or inflammatory or completely polar opposite my perspective. Which is to say, I click on them A LOT. I am amazed at how many of these "homepages" are links to commerce sites, or sites advertising some kind of service.
"Why," I inevitably ask myself, "would I ever buy anything from you, you knucklehead, you?"
It's like the guy who walks into a business meeting with a potential new client, someone he's never met before, wearing a big "I Love Obama!" button on his jacket. Or an equally large "Palin/Romney '12" button. Sure, you appreciate their passion -- maybe... if you agree with their POV -- but you immediately question their common sense, maturity, and business acumen.
The company was SEM/SEO, then they moved to social optimization and scraping. It was a black art, like the SEO stuff, and totally dependent on the provider (in this case Facebook and Twitter) to not change anything. It's the same basic problem as with SEO and Google; if Facebook's (or Google's) API coughs, the social media scrapers (or SEM/SEO people) get pneumonia. If Facebook wants to stop it, they can do so fairly easily.
Unfortunately for privacy, a huge part of FB's business model (like Google) is selling that data to the scrapers and the scrapers' clients.
I think the point they're making is that crawlers which do not obey the rules spelled out in robots.txt are blocked.
Face it, the type of people who go into marketing have very little to offer this world. Their whole reason for existence is to hopefully sell something to somebody who might not otherwise buy it. The only redeeming aspect of marketing is that it is a non-violent sinkhole in which to drop money, vs say a war in some God forsaken desert.
Have you ever met a marketing/advertising person who actually liked people?
Average Intelligence is a Scary Thing
Humans do not make 100 requests in a 10 second timespan, nor do humans traverse every post made by every user.
That's what I use a Greasemonkey script for, you insensitive clod!
Collecting data about others is somewhat an essential freedom. But my view and the modern view differ as most people do not feel the same way. But if we take the usual view any company collecting data about a specific person could be charged with stalking. We usually think of a pervert stalking a child or pretty girl. But stalking is stalking regardless of whether it is a corporation or a pervert. The motive for the stalking is irrelevant. Considering the current mood huge civil suits might take place and even criminal prosecutions might be applied. This is one demonstration of why hacking and social engineering need to be legal. After all, how will you ever know to what degree others are studying you without being able to penetrate their data? Restricting hacking is a path to tyranny that is quite direct and predictable. The natural balance is to allow all people and groups to completely study each other in great depth.
iptables -A INPUT -s $Bad_Scraper_IP_Address -j DROP
mod_security is pretty handy at spotting crawler patterns (you have to be a really weird human or a well designed crawler to look like something you're not).
Add a line in your acceptable use / EULA section stating that you expect the user of the account to be human and that any attempt to scrape the data off of the server is fined at $100,000 per message, plus $10,000 to each message author.
III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIII
A smart, discreet scraper will scrape breadth-first, i.e., scrape 100 websites alternating the next page from each site in turn, instead of the next page on a single site until that site is finished. Some scraping on active sites like Slashdot, or just Google's spidering, is never done; it just continues on as new content is created. It would be easy for a scraper to act just like a human on Slashdot: just keep clicking 'refresh' every once in a while. An astro-turf post from GNA would really throw the admins off the trail.
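The breadth-first ordering described here is basically a round-robin over per-site queues, which is also how polite crawlers avoid hammering a single host. A small sketch of just the ordering (no fetching); the function name and the site and page names are made up for illustration:

```python
from collections import deque


def interleave(pages_by_site):
    """Yield (site, page) pairs, taking one page from each site in turn."""
    queues = deque((site, deque(pages)) for site, pages in pages_by_site.items())
    while queues:
        site, pages = queues.popleft()
        if pages:
            yield site, pages.popleft()
            queues.append((site, pages))  # back of the line until it runs dry


# Hypothetical sites and paths.
for site, page in interleave({"a.example": ["/1", "/2", "/3"], "b.example": ["/1"]}):
    print(site, page)
```

Consecutive requests to any one site end up spaced out by however long the other sites take, which is why a per-site request counter sees so little.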
- For the complete works of Shakespeare: cat
The report is back sir, and the results are disturbing. Almost everybody likes sex, and a lot of them are weird. The ones that don't like sex have very strange hobbies. The ones that don't abuse illegal drugs are abusing legal drugs, and almost nobody weighs what they say or looks like their online picture. What should we do?
(boss pauses for a moment) "Don't hire anybody ever again".
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
It's ridiculous to expect users to anticipate and thwart privacy invasions. These companies could be shut down overnight (or at least rendered illegal) with common-sense legislation. The problem is not users, it is their bought-and-paid-for "representative" government(s) which sell out their constituents to be deceived and abused by sleazy industries.
Our SiteTruth system does some "scraping". We're looking for the name and address of the company behind the web site, so we can check the business out. We also look for ad links and a few other things, like BBBonline seals, which we check. We use a user agent name of SiteTruth.com site rating system. We don't look very deeply into a site; if after examining the most likely 20 pages, we haven't found out who runs the site, we figure they're not going to tell us. The site is down-rated accordingly.
Our experience is that 0.1% of sites have a "robots.txt" file that tells us to not look at any pages at all. We don't look at those sites, and their SiteTruth rating information says "Blocked". Total exclusion of crawlers is rare. Most sites want some visibility.
One of the more amusing uses of a "robots.txt" file used to be seen on Marchex (the "What you need, when you need it" domainer) pages. The site wasn't blocked from crawling, but the link to the page that told you about Marchex was. That, we suspect, was to keep search engines from noticing that all those domains were really one business. That didn't help Marchex much. Marchex (NASDAQ: MCHX) is still around, stock way down from the peak and reporting a slight loss this quarter.
We do have one exception to obeying the "robots.txt" file. We look at the home page of the site to see if it's a redirect before looking at the "robots.txt" file. Some sites have both a redirect and a "keep out" robots.txt file on the same domain. This is like posting signs that say "Keep Out" and "Please Use Other Door" on the same entrance. That contradiction was apparently a workaround for an old Google crawler bug. Google would index both "example.com" and "www.example.com" separately, then consider them duplicates, which caused some SEO problems.
Actually logging into sites from a crawler is just wrong. I'm amazed that a deep pocket like Nielsen would do that.
I don't know how good of a comparison this is.
So if I write a book, can I include TOS that makes it illegal for anyone to use the information within the book? If I write a book about how much my boss sucks, and how I slack off at work, can I include TOS so that nobody is allowed to relay that information to him? Even if I only sell my book to members of a book club, I wouldn't think this changes anything.
If you intentionally post information about yourself on a widely viewable forum, I would expect other people might read it.
... nor do humans traverse every post made by every user.
...unless they have a fistful of mod points to spend...heck, sometimes I'm just very interested in a story and want to see what everyone has to say about it. True, that doesn't happen often, and I certainly don't read 10 posts a second, but it does happen...
"I love animals! Some are cute, others are tasty, what's not to like?" - Betsy Schroeder, Jeopardy contestant
If the scrapers are already not following the rules laid out in the robots.txt file, what's to say they'll honor your ban? They'll find some way around any technical means of blocking them, in time.
I use irony whenever I can, but my shirts are still wrinkled...
Well, the problem with (1) is that a TOS is an agreement with no signature, no confirmation of acceptance (implicit is unlikely to hold up in court) and no proof that the TOS was even visible to the user (since what is visible to the user is a function of the browser and cannot be established at the server-side).
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
On this topic, here are some bad practices in HR that need to end:
1. Hiring based on stereotypes is NOT a good idea.
2. The purpose of HR should not be to minimize legal liability.
3. The illusion that celebrities are perfect needs to end.
4. Filtering people based on health problems to minimize health insurance costs is not a good idea.
5. Not hiring people based on debt creates a paradox for those who have to pay it off.
And as a side note, companies with seriously broken HR often have other problems too.
Actually, it stops ALL "decent" crawlers. It's the ones that behave indecently that ignore robots.txt.
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
Even though you never post a thing, someone else may post something about you. You may already be tagged in multiple photos on Facebook. You may have loan applications visible on the web. Your information is not entirely under your control - with pervasive digital storage, constant security challenges, and an increasing cultural trend to blurring the line between public and private, there is a growing chance that your information will leak out into the public.
Would that be legal? Could I set up a company that collected DNA samples without their owners' permission (say, by tying the hair clippings from a salon to the CC that paid for the cut)? Could I sell that info to the government?
If no one's done it, someone should, if for no other reason than to scare the shit out of people and hopefully wake them up.
http://www.masturbateforpeace.com/
When they say ban, they mean IP ban presumably. As in, the robot doesn't follow robots.txt, and because of this, they get their ass kicked, and banned. That makes a lot more sense I think.
Appended to the end of comments you post. The maximum is 120 characters.
...between generations. I'm not sure how children or students will take you seriously once they are able to see every dumb thing you did when you were their age.
Certain kinds of discrimination are illegal in specific cases, of course, and remain illegal regardless of how you obtained the information.
John
Open source has an uphill battle educating the masses as more uneducated people join it with zero expectations of passing some required level of readiness prior to being let loose online.
Merge a good version of a "secure" OS, like Debian or, say, Ubuntu, with a paranoid version out there where your proposed security is ON by default (no need to know where to get Adblock for grandma's Firefox). Test and tweak to ensure the security doesn't cripple the top 50 websites (youtube, facebook, myspace, hotmail, google services, etc.) and call it "Securiva 2012" so that the newbies go "hmm, it *must* be good because it's selling a year in *advance* of 2011, like any good new car model" (free discourages people, but good enough things will get pirated anyway). Sell it at the bargain bins next to those 10 dollar games. Next year, do the same battery of tests to remove/add sites, and release "Securiva 2013". Better yet, make it automated by default a la Chrome. Make sure your users understand that their data / programs need to be manually checked between scheduled upgrades, or perhaps charge extra for use of "the cloud" to keep the data safe and just test the programs.
Speaking of forking, I have marveled how forks of Good(TM) Open Source distros are so obscure to even us IT geeks that even if good, they have no chance of getting the attention they deserve and helping out the common unprotected newb. For every, say, 10000 Windows users there may be 1 user of $TOP_BRAND_LINUX, but why doesn't every $TOP_BRAND_LINUX user know and PREFER $NEWERTOP_BRAND_LINUX_FORK? To illustrate more or less, pretend instead of OSs, we're comparing adoption of Google Chrome among geeks to how many geeks even KNOW about Chromium. Let's ignore informed /. geeks --think about your wife's or grandma's "assisted" choices when all they have is US for security consultation.
Well, considering that there are two additional escalation steps:
*) emulate a human-like access pattern that works at a human-speed.
*) passively record data via a proxy when you normally browse.
Add to this multiple IP addresses, and catching your scraper becomes so much more problematic.
Until you get a virus/trojan that decides to overwrite your HOSTS file first thing after it roots your machine.
Oops.
@Mindless Drivel: 100% of Twitter posts ever Tweeted.