Slashdot Mirror


'Scrapers' Dig Deep For Data On Web

srwellman writes "The practice of Web 'scraping' is growing as many firms offer to collect personal, and potentially incriminating, data about users from their social networking profiles and discussions. Many companies even collect online conversations and personal details from social networks, job sites and forums where people might discuss their lives and even potentially sensitive data, such as health issues. These scrapers operate in a legal grey area leaving many users exposed." We ban scrapers like this regularly here simply for not adhering to the rules spelled out in robots.txt.

32 of 158 comments

  1. Like Google? by bonch · · Score: 3, Interesting

    Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

    You mean like Google already does for its advertisers? In fact, one of the related links in the article is a story about Google titled Google Agonizes on Privacy as Ad World Vaults Ahead, discussing their plans for utilizing their vast archive of valuable user data. The battle for online privacy was lost long ago.

    1. Re:Like Google? by betterunixthanunix · · Score: 4, Insightful

      The battle for online privacy was lost long ago.

      Only because one side of the battle never bothered to fight. Nobody was forced to go to social networking websites and post their life story, anyone could encrypt their email and IM conversations, and ad blocking software is widely available. Large amounts of the information that these companies are aggregating could have been made far more difficult to obtain if the majority of computer users could have been bothered.

      Sadly, the Internet has become more of an adversarial game than a way to unite people.

      --
      Palm trees and 8
    2. Re:Like Google? by hoggoth · · Score: 2

      / sheepishly pulls sleeve over tribal armband tattoo...

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    3. Re:Like Google? by VolciMaster · · Score: 2

      The battle for online privacy was lost long ago.

      Only because one side of the battle never bothered to fight. Nobody was forced to go to social networking websites and post their life story, anyone could encrypt their email and IM conversations, and ad blocking software is widely available. Large amounts of the information that these companies are aggregating could have been made far more difficult to obtain if the majority of computer users could have been bothered. Sadly, the Internet has become more of an adversarial game than a way to unite people.

      forced to use social tools? no.

      encryption available? yes

      understood by anyone in the general public? nope

    4. Re:Like Google? by jd · · Score: 2

      There's that and there's the fact that the US (one of the largest consumers of data) has no data privacy laws and has been pressuring places that do (such as the EU) to violate their own laws. The laws don't solve the problem in and of themselves; what they do is make the public more* aware that the problem even exists. (*You can have more than nothing.)

      The older ITAR laws and RSA patents didn't help - they effectively criminalized any effort to produce a product, since you'd need to sell the product in the US to be able to generate enough interest.

      The problem now is that the legacy protocols are too widely used to be easily replaced and legacy products have so much staying power that a backwards-compatible system would remain effectively insecure for decades.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  2. They won't get me by Tigger's+Pet · · Score: 2

    I'm not on FB, Twitter, MyCloud or whatever else, so there's no data out there about me. If there's nothing to harvest then they can't harvest it - I'd rather be classified as 'boring' or 'not with it' (whatever the fuck 'It' is), than have stuff out there that might come back to bite me in the ass in 10 or 20 years time.

    1. Re:They won't get me by yog · · Score: 2

      Definitely avoid using a real or traceable name in online discussion forums and social sites. Also, avoid embedding your real name into your email address, such as "JohnSurfer@cox.net" or the like.

      Unfortunately, my real name is embedded in one of my email addresses, and it's all over the web by now. I guess I can eventually switch to a different address, but the damage is done.

      If you have someone's name, you can now obtain their current and past addresses, their age, their schools, possibly where they work, possibly their political party affiliation, and possibly a ton of other information if they have used their real name in online activities. It's not rocket science to do this; the information is just sitting out there waiting to be grabbed.

      I suppose if you have nothing to hide and have avoided getting too controversial in your online discussions, or too outrageous in your social network photos and statuses, you're probably safe from major problems. Employers are going to be looking for extreme behavior, not slightly out of the ordinary behavior. If an employer doesn't like some minor thing about you, e.g. a picture of you on Facebook wearing green antennas at a Halloween party, then probably they're not someone you'd want to work for anyway.

      --
      it's = "it is"; its = possessive. E.g., it's flapping its wings.
    2. Re:They won't get me by Anonymous Coward · · Score: 2, Funny

      I suppose if you have nothing to hide and have avoided getting too controversial in your online discussions, or too outrageous in your social network photos and statuses, you're probably safe from major problems.

      Yep. That's why my pic on chatroulette is an exact average size penis.

    3. Re:They won't get me by Anonymous Coward · · Score: 2, Funny

      That's OK, Phillip Wilkerson of Midland, MI. We still know all about you. Tell Donna and the kids hi for us. Don't forget to pick up dog food on your way home from the tanning salon.

      Sincerely,

      Google

    4. Re:They won't get me by sakti · · Score: 3, Insightful

      IMO it's better to have an easy to find public 'you' online for these people to track. You use that for everything 'safe'. You then use multiple anonymous accounts for anything you don't want tracked.

      If you have nothing tracking online I think it might start looking more suspicious than not. Plus having nothing might encourage 'them' to dig in and try to relate you to your anonymous account(s).

      --
      "It is better to die on one's feet than to live on one's knees." - Albert Camus
    5. Re:They won't get me by hoggoth · · Score: 3, Insightful

      Wow, that's pretty inappropriate for an interviewer to require you to open your personal family or friends circle to him. What if my family is discussing my alcoholic father, my pregnant niece, my HIV+ friend, and my habit of killing interviewers and burying them in my backyard?

      --
      - For the complete works of Shakespeare: cat /dev/random (may take some time)
    6. Re:They won't get me by networkBoy · · Score: 2

      fundamentally that's what I do.
      There is a real me on FB. Then there is me here (and this ID is shared across multiple sites) which would not be too hard to link to the real me.
      For stuff I really don't want tied to me in re. job interviews, non-gov't background checks etc. I use other identities. For something that I would be afraid of coming out in a relatively thorough discovery && || government background check I simply don't post it on line. At all.
      -nB

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    7. Re:They won't get me by SuricouRaven · · Score: 2

      There are many applicants for each job, so employers can be picky. If they have a set of candidates who are all qualified and of similar levels of experience, they'll pick the one who is most 'normal' in their personal life, and thus least likely to somehow embarrass the company or to just not get on with other employees.

  3. They're coming for you, AC by blair1q · · Score: 2

    That Anonymous Coward guy is going to have a mailbox full of goatse spam.

  4. Bravo by swanzilla · · Score: 2

    Example 'scrape' FTA:

    He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

    I don't think we need to dig any deeper to come to the conclusion that this guy is an idiot.

    1. Re:Bravo by TypoNAM · · Score: 4, Funny

      He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

      I don't think we need to dig any deeper to come to the conclusion that this guy is an idiot.

      Indeed, Joseph Swanson.

      --
      This space is not for rent.
  5. The link in the summary is a dupe by Nero+Nimbus · · Score: 5, Informative

    This was talked about back in October:

    http://yro.slashdot.org/story/10/10/15/1340244/Data-Miners-Scraping-Away-Our-Privacy?from=rss

    I thought the guy in the picture looked familiar...

  6. "We (/.) ban scrapers..." LOL by billrp · · Score: 2, Insightful

    "We ban scrapers like this regularly here simply for not adhering to the rules spelled out in robots.txt." Hah! robots.txt doesn't stop any decent crawler

  7. Re:"We (/.) ban scrapers..." LOL by betterunixthanunix · · Score: 2

    However, there are patterns of browsing that are clearly not human. Humans do not make 100 requests in a 10 second timespan, nor do humans traverse every post made by every user.

    Yes, it is imperfect and you might ban an occasional human, but this is essentially the situation we have with spam filtering. It is a bit sad that the Internet is becoming so adversarial, but that is what we face.
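As a rough illustration of the "100 requests in a 10 second timespan" heuristic above, here is a minimal sliding-window sketch. The class name `RateFlagger`, the `client_id` key and the default thresholds are assumptions made for the example; nothing here describes how Slashdot actually detects scrapers.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flag clients that reach max_requests within window_seconds (hypothetical helper)."""

    def __init__(self, max_requests=100, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps of that client's recent requests
        self._hits = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one request; return True if the client now looks automated."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) >= self.max_requests
```

A client that sends 100 requests 50 ms apart gets flagged on the hundredth one, while a client that sends one request per second never does. As the comment says, a heuristic like this can still catch the occasional human.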

    --
    Palm trees and 8
  8. Scraping public data to save money for them and us by garcia · · Score: 2

    Because the public sector has very little time to handle FOIA requests and they sometimes cost more money to complete than I'm willing to pay (usually because they don't do much of their own data work in-house and have to call on a contractor to do it for me), I use their websites to glean the data I want.

    Last week I gave a talk about using SAS to do screen scraping and then perform analysis on the data of jail inmate registries and level 3 sex offenders in MN. I have dashboards of the data available on my website and as I mentioned in my presentation it has even been used to help one county avoid what could have been a serious privacy issue.

    So while there are any number of pitfalls to screen scraping (not understanding the meaning of the data and trends, being fed incomplete or purposefully incorrect data, or even being banned outright) screen scraping can be great for learning about and reporting on the public sector when they are physically or financially incapable or simply unwilling to do it themselves.

  9. Re:the darker side of grey by Loether · · Score: 2

    I think they are 2 distinct issues that do not combine the way you suggest.

    1. If you violate a website's TOS the website can come after you.

    2. The info they gain spidering a website is pretty much free for them to use to discriminate against you.

    Anything I post on slashdot/FB/any online forum I treat like it is viewable by every future and past employer, insurer, lender, ex girlfriend etc. Anything online will exist forever and if it's not already permanently linked to you, it will be before you die. If that's right or wrong, legal or illegal is really beside the point IMHO.

    --
    TODO create witty sig.
  10. He's an Idiot with Plenty of Company by RobotRunAmok · · Score: 2

    Slashdot is filled to the brim with people who take the time to create an alias and then list their homepage on their profile, which of course, is displayed in a link on the same line as their alias in the post they just made.

    I click on those homepages whenever I read something really stupid or ridiculous or inflammatory or completely polar opposite my perspective. Which is to say, I click on them A LOT. I am amazed at how many of these "homepages" are links to commerce sites, or sites advertising some kind of service.

    "Why," I inevitably ask myself, "would I ever buy anything from you, you knucklehead, you?"

    It's like the guy who walks into a business meeting with a potential new client, someone he's never met before, wearing a big "I Love Obama!" button on his jacket. Or an equally large "Palin/Romney '12" button. Sure, you appreciate their passion -- maybe... if you agree with their POV -- but you immediately question their common sense, maturity, and business acumen.

    1. Re:He's an Idiot with Plenty of Company by plover · · Score: 2

      "Why," I inevitably ask myself, "would I ever buy anything from you, you knucklehead, you?"

      You aren't supposed to buy from them. The link isn't there for your benefit. It's an SEO trick, part of the strategy in trying to raise the page rank for that site.

      If you run a blog, you'll find you'll get commenters who say stuff like "hi, your site is a good understand! one for my book marks." It's flatteringly nice, and obviously English isn't their native tongue, so you thank them for their kind words. And with luck, you may not follow the link in their user name, which you might then discover links to some Russian site, which if you bother to visit with a translator looks like some kind of news aggregator page. "Even weirder", you think.

      Eventually, you realize that the comment they posted is utterly generic, and could have applied equally to a cooking site or a fishing tutorial site. But why link to a news aggregator? You can peel the onion further, dig around the news site, and never find anything that appears to be of value. If you look at the collection of them, however, you discover it's but one plot in a link farm that ultimately links to a lot of sister sites, and all of them have links to the companies that paid them for the optimization. You'll finally realize there's a whole fake web of links out there that exist strictly to boost Google's page rank of their customers' sites.

      The best way to fight them is to make sure your blog software adds rel="nofollow" to any href tags providing links to user-supplied URLs. Most SEO spammers know that Google won't use those links when computing pagerank, and will hopefully leave your blog alone.
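A minimal sketch of that idea, assuming the blog software builds commenter links in one place. The function name `render_user_link` is hypothetical; real blog engines have their own templating and usually their own setting for this.

```python
from html import escape


def render_user_link(url, text):
    """Return an anchor tag for a user-supplied URL, marked rel="nofollow"."""
    # Escape both values so a commenter cannot break out of the attribute or tag.
    safe_url = escape(url, quote=True)
    safe_text = escape(text)
    return f'<a href="{safe_url}" rel="nofollow">{safe_text}</a>'
```

Escaping only covers markup injection. A real implementation would presumably also validate the URL scheme, which this sketch leaves out.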

      --
      John
  11. I worked for a social scraping company... by sdguero · · Score: 2

    The company was SEM/SEO then they moved to social optimization and scraping. It was a black art, like the SEO stuff, and totally dependent on the provider (in this case Facebook and Twitter) to not change anything. It's the same basic problem as with SEO and Google; if Facebook's (or Google's) API coughs the social media scrapers (or SEM/SEO people) get pneumonia. If Facebook wants to stop it, they can do so fairly easily.

    Unfortunately for privacy, a huge part of FB's business model (like Google) is selling that data to the scrapers and the scrapers' clients.

  12. Re:"We (/.) ban scrapers..." LOL by Anonymous Coward · · Score: 2, Interesting

    Humans do not make 100 requests in a 10 second timespan, nor do humans traverse every post made by every user.

    That's what I use a Greasemonkey script for, you insensitive clod!

  13. Re:"We (/.) ban scrapers..." LOL by Culture20 · · Score: 2

    mod_security is pretty handy at spotting crawler patterns (you have to be a really weird human or a well designed crawler to look like something you're not).

  14. Re:"We (/.) ban scrapers..." LOL by hoggoth · · Score: 2

    A smart discreet scraper will scrape breadth-first, i.e., scrape 100 websites alternating the next page from each site in turn, instead of the next page on a single site until that site is finished. Some scraping on active sites like Slashdot or just Google's spidering is never done; it just continues on as new content is created. It would be easy for a scraper to act just like a human on Slashdot, just keep clicking 'refresh' every once in a while. An astro-turf post from GNA would really throw the admins off the trail.
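The ordering described here, one page from each site in turn, can be sketched as a simple round-robin. This only shows the scheduling; it fetches nothing, and the function name `interleave` and the input shape (a dict of site to list of pages) are assumptions made for the example.

```python
from collections import deque


def interleave(pages_by_site):
    """Yield (site, page) pairs, taking one page from each site in turn."""
    queues = deque((site, deque(pages)) for site, pages in pages_by_site.items())
    while queues:
        site, pages = queues.popleft()
        if not pages:
            continue
        yield site, pages.popleft()
        # Sites with pages left go to the back of the line.
        if pages:
            queues.append((site, pages))
```

With 100 sites in the rotation, consecutive requests to any one site are separated by requests to the other 99, which is why per-site rate checks have a harder time spotting this pattern.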

    --
    - For the complete works of Shakespeare: cat /dev/random (may take some time)
  15. Reporting Back... by istartedi · · Score: 2

    The report is back sir, and the results are disturbing. Almost everybody likes sex, and a lot of them are weird. The ones that don't like sex have very strange hobbies. The ones that don't abuse illegal drugs are abusing legal drugs, and almost nobody weighs what they say or looks like their online picture. What should we do?

    (boss pauses for a moment) "Don't hire anybody ever again".

    --
    For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
  16. Re:the darker side of grey by Americium · · Score: 2

    I don't know how good of a comparison this is.

    So if I write a book, can I include TOS that makes it illegal for anyone to use the information within the book? If I write a book about how much my boss sucks, and how I slack off at work, can I include TOS so that nobody is allowed to relay that information to him? Even if I only sell my book to members of a book club, I wouldn't think this changes anything.

    If you intentionally post information about yourself on a widely viewable forum, I would expect other people might read it.

  17. Re:the darker side of grey by jd · · Score: 2

    Well, the problem with (1) is that a TOS is an agreement with no signature, no confirmation of acceptance (implicit is unlikely to hold up in court) and no proof that the TOS was even visible by the user (since what is visible to the user is a function of the browser and cannot be established at the server-side).

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  18. Some bad practices in HR that needs to end by yuhong · · Score: 2

    On this topic, here are some bad practices in HR that need to end:
    1. Hiring based on stereotypes is NOT a good idea.
    2. The purpose of HR should not be to minimize legal liability.
    3. The illusion that celebrities are perfect needs to end.
    4. Filtering people based on health problems to minimize health insurance costs is not a good idea.
    5. Not hiring people based on debt creates a paradox for those who have to pay it off.
    And as a side note, companies with seriously broken HR often have other problems too.

  19. Re:"We (/.) ban scrapers..." LOL by sharkey · · Score: 2

    Actually, it stops ALL "decent" crawlers. It's the ones that behave indecently that ignore robots.txt.
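For what a "decent" crawler does before fetching a page, here is a small sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the user agent `ExampleBot` and the URLs are made up for illustration and are not Slashdot's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from a string so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite crawler would call something like `allowed(ROBOTS_TXT, "ExampleBot", url)` and skip the URL when it returns False. Nothing enforces that check, which is the point both sides of this subthread are making.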

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.