Slashdot Mirror


Google, Bing, Yahoo Data Retention Doesn't Improve Search Quality, Study Claims (theregister.co.uk)

A new paper released on Monday via the National Bureau of Economic Research claims that retaining search log data doesn't do much for search quality. "Data retention has implications in the debate over Europe's right to be forgotten, the authors suggest, because retained data undermines that right," reports The Register. "It's also relevant to U.S. policy discussions about privacy regulations." From the report: To determine whether retention policies affected the accuracy of search results, Chiou and Tucker used data from metrics biz Hitwise to assess web traffic being driven by search sites. They looked at Microsoft Bing and Yahoo! Search during a period when Bing changed its search data retention period from 18 months to 6 months and when Yahoo! changed its retention period from 13 months to 3 months, as well as when Yahoo! had second thoughts and shifted to an 18-month retention period. According to Chiou and Tucker, data retention periods didn't affect the flow of traffic from search engines to downstream websites. "Our findings suggest that long periods of data storage do not confer advantages in search quality, which is an often-cited benefit of data retention by companies," their paper states. Chiou and Tucker observe that the supposed cost of privacy laws to consumers and to companies may be lower than perceived. They also contend that their findings weaken the claim that data retention affects search market dominance, which could make data retention less relevant in antitrust discussions of Google.

38 comments

  1. Data retention at all, or more than 3 months? by AvitarX · · Score: 3, Informative

    Because I bet the 3 month retention is a huge boost, if only in giving me history of older searches in auto complete.

    Much more than that doesn't seem too helpful though, three months is a whole lot of searches, and should give plenty of information about what I'm searching for right now.

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    1. Re:Data retention at all, or more than 3 months? by dgatwood · · Score: 1

      Much more than that doesn't seem too helpful though, three months is a whole lot of searches, and should give plenty of information about what I'm searching for right now.

      Maybe, but maybe not. There are very good reasons for keeping data longer than three months. Not all data will be valuable after three months, but it doesn't take much effort at all to come up with counterexamples in which longer retention could make a significant difference in search quality.

      For example, consider people who are in college (and younger). For about three months out of the year, they're doing something entirely different from what they do during the other nine months. If you retained data for only three months, you'd lose their entire history of school-related searches by the time they started back at the end of the summer.

      Or consider people who are planning their summer vacations. The search history from when they planned their previous summer vacation is suddenly very relevant, even though it is about a year old.

      That historical data starts to become important when words have multiple meanings. For example, when I'm searching for "Apple yield", I'm probably looking for information about the DOA rate for a model of iPhone, whereas a farmer is probably looking for ways to keep the caterpillars off of fruit. Mind you, you're unlikely to need very much history to distinguish between those two particular examples, but there's no reason that semantic information in general can't be intuited based on things that you have searched for less frequently.

      I also question the fundamental assumptions upon which these conclusions were based. They appear to be assuming that you can derive search quality based on the number of people coming to a site from a particular search engine over a couple of years' time scale. That's not a given, for several reasons:

      • It can take some time for changes in search quality to produce meaningful changes in the total number of users using a given service, thanks to a combination of momentum, pre-installation of software on devices, and other factors entirely unrelated to the service itself.
      • This ignores the critical question of whether Yahoo and Bing had comparable search quality before making those potentially quality-reducing changes. If not, then the factors keeping those users on those search sites were something other than search quality, and thus the numbers would be largely unaffected by additional reductions in search quality.
      • Changes to downstream sites' rankings for common queries could trivially produce swings in referrer count from a given search engine that would dwarf the swings caused by users switching search engines.
      • Search quality does not merely affect the number of people sent to a site. It also affects how many of the right people go to a given site versus people who were actually searching for something unrelated that is spelled similarly. So even if everything else is constant, search quality could be changing significantly.

      That said, I think perhaps the best refutation involves pointing out that these companies are highly data-driven, constantly using statistics to justify everything they do. They are spending billions of dollars on data centers to store all of that data, and they are constantly running new data processing jobs that expend tremendous computing resources to reprocess that historical data. Don't you think if the benefits of storing and regularly reprocessing all of that extra data were truly minimal, they would have run an experiment on whether processing only the most recent 'n' months gives comparable results, come to that conclusion, and started retaining less data by now?

      I mean, I'm not saying that they're definitely wrong, because anything is possible, but I wouldn't take that bet in a million years.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    2. Re:Data retention at all, or more than 3 months? by AvitarX · · Score: 1

      Valid points.

      The way you describe things, it sounds like how Google Now handles my weekly routine.

      It did a very good job of finding out where I go on Friday and Sunday with no input from me, I just started getting travel times about 4 hours before I would go. I can definitely picture a world where they can find these long cycles (or with your examples of vacation and school figure it out the first time from words used and time of year, and be better year two).

      I don't know if they're that smart yet, but they very likely are (or will be soon). Either way, I see diminishing returns after 3 months. I'm not arguing against retention, though; I will happily give away my (logged-in) search data to them in exchange for better results. Similarly, if I had super private business needs, I'd buy a second burner phone, because what Google Now does with my search and location history is well worth the risk of my minor nefariousness being discovered.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    3. Re:Data retention at all, or more than 3 months? by Anonymous Coward · · Score: 0

      For example, consider people who are in college (and younger). For about three months out of the year, they're doing something entirely different from what they do during the other nine months. If you retained data for only three months, you'd lose their entire history of school-related searches by the time they started back at the end of the summer.

      Perhaps that's a good age to learn to be specific enough in the search terms you choose to get the results you want.

      That historical data starts to become important when words have multiple meanings. For example, when I'm searching for "Apple yield", I'm probably looking for information about the DOA rate for a model of iPhone, whereas a farmer is probably looking for ways to keep the caterpillars off of fruit.

      Perhaps you should learn to include search terms like "DOA" or "iPhone" or "AAPL" and perhaps the farmer should learn to include the word "fruit" in his searches.

      I have enough curiosity to often search for information on subjects I encounter that I know nothing or little about. When Google started to take my "preferences" into account the quality of its results deteriorated for me, because it assumes that my preference is to be shown something I already know while in reality it quite often is to be shown something new. I felt the effects of a filter bubble being forced upon me before I first heard the term. There are alternative search engines, fortunately.

      Something to think about is this. Western cultures, especially US culture, put a lot of emphasis on individual choice and personal responsibility. Then why are so many people happy when large internet companies like Google and Facebook decide for them what they get to see?

    4. Re:Data retention at all, or more than 3 months? by olau · · Score: 2

      Won't argue about the study which may very well be flawed, but I don't think your last assertion is correct.

      The data centers certainly aren't full of search histories. Let's say each person generates 1 KB of data per day in search history (with compression) - that's 1 TB/day to store data from 1 billion. What's the marginal cost of storing that data per year? 100,000 dollars?

      One thing you need to keep in mind is that a company like Google ultimately isn't storing data because of the value it provides to their users. They are storing the data because of the value Google themselves derive from it.

      This old data may be a treasure trove for Google, but only of marginal value to each user, and they would still fight very hard to keep it.

      The best way to keep the data is of course to tell people how important it is to help us, and not tell us about any analysis that people may find disgusting.

      The other day I read an example about data mining transactions in a bank. One of the goals was to identify alcoholics by looking at how much you're spending on booze or in bars. Telling, isn't it, how much you can infer about people just by looking at where they've been and what they've bought.

      Examples like that make you wonder what kind of labels we all might have inside the googleplex and similar.

    5. Re:Data retention at all, or more than 3 months? by AmiMoJo · · Score: 1

      1 TB/day to store data from 1 billion. What's the marginal cost of storing that data per year? 100,000 dollars?

      Google charges me much less than $100k to store 1TB of data for a year, so I can assure you it costs them much less. Let's see, 1TB of HDD space is a few tens of dollars, multiplied by 3 for redundancy, electricity cost for a year, maintenance costs are going to be pretty low alongside the million other HDDs they have spinning... Maybe $100, max? I bet it's actually closer to $20 for someone like Google.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    6. Re:Data retention at all, or more than 3 months? by swillden · · Score: 1

      Don't forget to multiply by 365.2422. Still a lot less than $100K, of course.
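
      For anyone who wants to check the arithmetic in this subthread, here is a rough back-of-envelope sketch in Python. Every input is an assumption lifted from the comments above (1 KB per user per day, 1 billion users, a guessed $20 to $100 per TB-year), not a measured figure, and the variable names are made up for illustration.

```python
# Back-of-envelope check of the storage figures quoted in this subthread.
# All inputs are the commenters' guesses, not real cost data.
BYTES_PER_USER_PER_DAY = 1_000      # assumed 1 KB/day of compressed search history
USERS = 1_000_000_000               # assumed 1 billion users
DAYS_PER_YEAR = 365.2422            # the multiplier suggested above

# 1 KB x 1 billion users = 1e12 bytes = 1 TB (decimal) per day.
tb_per_day = BYTES_PER_USER_PER_DAY * USERS / 1e12
tb_per_year = tb_per_day * DAYS_PER_YEAR


def yearly_cost(dollars_per_tb_year):
    """Cost of holding one full year of logs for a year at a given unit price."""
    return tb_per_year * dollars_per_tb_year


low = yearly_cost(20)     # the optimistic "$20 for someone like Google" guess
high = yearly_cost(100)   # the "$100, max" guess

print(f"{tb_per_year:.1f} TB/year, ${low:,.0f} to ${high:,.0f} per year")
```

      Under those assumptions, holding a full year of logs for a year comes out to roughly $7,300 to $36,500, which is still well under the $100K guess.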

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    7. Re:Data retention at all, or more than 3 months? by vtcodger · · Score: 1

      "Much more than that doesn't seem too helpful though, three months is a whole lot of searches, and should give plenty of information about what I'm searching for right now."

      I should have thought that three DAYS would be sufficient. But what do I know?

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    8. Re:Data retention at all, or more than 3 months? by AvitarX · · Score: 1

      So your argument is that the search engine that forces people to put in more effort is the one that gives the best results?

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    9. Re:Data retention at all, or more than 3 months? by AvitarX · · Score: 1

      I'd even bet they spend more on the systems that analyze the data (especially people that figure out how to) than the storage, likely by close to an order of magnitude.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    10. Re:Data retention at all, or more than 3 months? by AvitarX · · Score: 1

      I would think at least a month.

      Sometimes I try to find an article to share with someone or some such.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    11. Re:Data retention at all, or more than 3 months? by Anonymous Coward · · Score: 0

      I think it amounts to less effort. The effort to add the word "fruit" to your search for "apple", if you find you get a lot of results about Apple computers, is small compared to, say, walking to the kitchen to get an apple and eat it. To figure out why a search for concrete terms gives biased results and to add terms that undo the bias takes understanding of what might be going on under the hood and results in a bigger mental effort. A search engine that searches for what you type and not for what it "thinks" you mean by it is more predictable, understandable and straightforward. I don't see what is wrong with that.

  2. No shit by Anonymous Coward · · Score: 0

    Constantly looking at old, trivial info leads to jack shit.

    News at 11.

  3. Well of course by Anonymous Coward · · Score: 0

    SEO, sponsored results and the well established fact that almost nobody goes past a page or two of results... this seems like an obvious result.

  4. I show over 100K posts, each a dead end search by Trax3001BBS · · Score: 1

    As they are being taken out of context. Many websites imported UseNet newsgroups, a popular one was one I frequented.
    Those would be best removed, yet none I regret; other than some of the websites they ended up at.

  5. It helps identify nazi cowards by Anonymous Coward · · Score: 0

    for liquidation.

    1. Re: It helps identify nazi cowards by Anonymous Coward · · Score: 0

      Most of them would be over 100 years old by now. Doubt they use the internet.

  6. Fuck em by Snotnose · · Score: 1

    The only reason for data retention is tracking. 3 months ago I was searching for info on what DVDs came out recently. Yesterday I searched for what DVDs came out recently.

    My searches tend to be pretty random. Someone started showing The Incredible Hulk a few months ago. I searched for the show, Bill Bixby, the guy who played the reporter, and Lou Ferrigno. Why? Not cuz I want to buy them, but because I don't have it in me to just sit back and watch a TV show nowadays.

    Hey, Hill Street Blues! Never caught it 30 years ago, let's google the hell out of it.

    Hey, NYPD Blue! Never caught it 30 years ago, let's google the hell out of it.

  7. Suckers buy "predict the past" by davecb · · Score: 2

    And then the suckers "have" their advertiser send me ads for something they know I like... because I just bought it, and the advertisers know they can prove my interest to their customers/suckers.

    Net result? You get ads for stuff you bought.

    --
    davecb@spamcop.net
    1. Re:Suckers buy "predict the past" by rtb61 · · Score: 1

      There are areas of worthwhile data retention. One requires a login, and that is of course blocking specific sites from turning up in your searches; the more that happens for specific sites, the further they drop down the search rankings (it would require thousands of down votes). Next up of course is how to better align searches, i.e. locality-based, and how local (country, state, city), and making that easy to use. Next is the type of service you are searching for (info, sales, repair, showroom, online only, etc.) and how to smoothly incorporate that into a search via pull-down menus, i.e. trying to carefully squeeze some of the more advanced search features into the basic search.

      The big thing for search is auto-translate, and to make sure that works properly, it can only be done as a free, open, shared resource, not only translating across written languages but also, and especially, translating verbal communications into written communications. This is an extremely important resource to be free, open and shared; otherwise a monopoly will establish itself and censor the hell out of everything via that utility function.

      --
      Chaos - everything, everywhere, everywhen
  8. How does this affect I.T.? by Monster_user · · Score: 1

    What impact does this have on I.T.?

    I regularly search for things three to five years old, sometimes I even find my own solutions on a website somewhere.

    If the data retention has no effect on searches three to five years apart, on well aged data, then I've no problem with lower data retention.

  9. Works good for me. by PhrostyMcByte · · Score: 1

    Over time I've noticed various programming-related phrases that come up as the first result if I'm on my account, but are buried if I'm not.

    So, I'd say it works good for me. Now, if things need to be stored long-term to get the same benefit versus, say, only a couple months, I have no idea.

    1. Re:Works good for me. by coofercat · · Score: 1

      ...although they could just be fiddling with the 'settings' for your search experience - they don't need data retention to do that - they can do it incrementally as you search/browse or whatever.

      Retention is useful (for them) because they can look for new patterns they don't have 'settings' for yet. They can also pigeon-hole you into new categories that they can sell to advertisers.

  10. How's that again? by 93+Escort+Wagon · · Score: 3, Insightful

    What has data retention got to do with search results? Advertising is why they want to hold onto all your data.

    --
    #DeleteChrome
    1. Re:How's that again? by GuB-42 · · Score: 2

      Both search results and advertising work the same : try to find the most relevant site for you. The fundamental difference is that one is paid and the other is not.
      And in both cases short term data retention definitely helps. Long term may give a marginal improvement. One area where long term may help is with periodic tasks. For example, if you are doing your taxes, remembering what you did the year before may be helpful for both you (ex: you found a great site listing deductibles) and advertisers (ex: you considered hiring an accountant).

    2. Re:How's that again? by JohnFen · · Score: 1

      Both search results and advertising work the same : try to find the most relevant site for you.

      And I really, really wish they'd stop doing that for both search results and advertising. It works poorly for both and entails a loss of privacy.

    3. Re:How's that again? by 93+Escort+Wagon · · Score: 1

      For example, if you are doing your taxes, remembering what you did the year before may be helpful for both you (ex: you found a great site listing deductibles) and advertisers (ex: you considered hiring an accountant).

      If I've found a useful site I may want to use in the future - I bookmark it.

      And, unless that "great site" was #1 on my initial search for information... I probably clicked on the links which were presented above it in the Google search results. So it seems unlikely Google is going to know that result #3 was actually the one I preferred rather than result #1 or #2.

      --
      #DeleteChrome
    4. Re:How's that again? by vtcodger · · Score: 1

      Google's business is selling exposure to advertisers. Advertisers are, basically, nuts. Nonetheless, advertisers are nutcases with money burning holes in their pockets. (Why would anyone give real money to marketing people?) It doesn't matter if Google's vast trove of data has any real value in matching advertisers to potential buyers. (I'm guessing that it mostly doesn't).

      As long as the advertisers believe it is effective, it works.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
  11. Edge cases by aaarrrgggh · · Score: 1

    For me, there were two articles from the 80's that I remember in either Popular Science or Popular Mechanics that were relevant to /. stories in the past couple months. Unfortunately, not even the respective websites could be of any help. I would have really gotten a kick out of reading both articles, but it wasn't to be.

    It comes down to information quality. Most forum answers to a question have a five-year or less value, and while an archive of my travel stories from a couple decades ago might be fun for my personal nostalgia... I doubt people are really going to be searching Google for Scuba J 2000 and stumble across something with all the noise from Facebook and Instagram today.

    1. Re:Edge cases by coofercat · · Score: 1

      "the noise" is a failing of the search engine.

      Back in the day, Altavista or Yahoo or whomever used to show you a glorified 'grep' of the Internet. That ended up being a pretty poor experience because fledgling SEO hacks were promoting irrelevant content over more useful stuff. Then Google showed up, and did a far better job of it with their PageRank algorithm. Nowadays we're back to the 'noise' era of old, with a much bigger Internet and far more well funded, well motivated 'SEO hackers'. We need a new algorithm, but the barriers to entry for any new company are huge, and the algorithm hasn't yet been discovered/proven (and if it was, Google could probably implement it quite quickly before any competitor got any market share).

      So yeah... ultimately, blame Google ;-)

  12. Retention matches the warranty by wolfheart111 · · Score: 1

    A 12 month warranty, with the 13 month retention you can bet they'll be looking for the same product once it goes kaput shortly after the warranty expires... another brick in the wall...

    --
    [($)]
  13. I could have told them that by Anonymous Coward · · Score: 0

    I could have told them that. The quality of Google searches has been going down for at least a decade, probably for as long as they have been tracking users.

    As others have mentioned, the problem is that all that retention can be used for is predicting the past. Using my previous searches to predict what I'm going to search for? More like predicting what I'm not going to search for again. Using what I previously clicked on to predict which search results I'm likely to click on again? The reason I'm still searching is that none of those results solved my problem.

    Now that I think about it, I wonder if that's how StackOverflow and competitors keep getting on top. By having hundreds of posts without solutions, they get a ton of clicks from people hoping to solve a problem, and every single one of those clicks pushes those posts up even further. A site with a solution to every problem would get one click from a person with a problem. One with hundreds of questions and no solutions gets hundreds of clicks.

    1. Re:I could have told them that by JohnFen · · Score: 1

      I could have told them that. The quality of Google searches has been going down for at least a decade, probably for as long as they have been tracking users.

      This. Google's search results were better before they started to customize them based on your history, and the quality of the results only continues to fall as time goes by.

    2. Re:I could have told them that by vtcodger · · Score: 1

      "The quality of Google searches has been going down for at least a decade"

      Perhaps. But it doesn't seem so to me. I try DuckDuckGo every now and then, and it seems to me that Google does a significantly better job of finding the (often pretty obscure) stuff that I am interested in. I haven't tried to analyze why/how.

      Since I have pretty much all advertising blocked in my hosts file, I don't really care all that much about the quite unpleasant associates Google is shopping my personal information to. Perhaps I should care. But I don't. If I cared, I'd probably try to set up a false online persona. But that's a hell of a lot of work.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
  14. What a horrible way to measure search accuracy. by Anonymous Coward · · Score: 0

    "As a measure of accuracy, we examine whether a consumer repeats a search or navigates to a new site." This is a pretty awful way to measure search accuracy. It's much more complicated than this. There's a gradation of results. I might click on a result that's "good enough" even though it's not really what I'm looking for. I might click on a result because I can't tell from the title and snippet whether or not it's what I want. There could be a degradation in quality that this simple metric completely misses. They really need to do some search quality annotation to be sure.

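    To make the metric quoted above a bit more concrete, here is a rough Python sketch of how "repeats a search or navigates to a new site" might be tallied from a single user's session. The event format, the Event type, and the classify_searches function are all hypothetical, made up for illustration; nothing here is taken from the paper's actual method or data.

```python
from collections import namedtuple

# Hypothetical event record: kind is "search" or "visit";
# value is the query text or the site visited.
Event = namedtuple("Event", ["kind", "value"])


def classify_searches(events):
    """Tally what the user did immediately after each search in one session."""
    counts = {"repeat_search": 0, "new_site": 0, "other": 0}
    visited = set()
    # Pair each event with the one that follows it (None at the end).
    for current, following in zip(events, events[1:] + [None]):
        if current.kind == "visit":
            visited.add(current.value)
            continue
        if following is None:
            counts["other"] += 1           # session ended after the search
        elif following.kind == "search":
            counts["repeat_search"] += 1   # user searched again right away
        elif following.value not in visited:
            counts["new_site"] += 1        # user went to a site not yet seen
        else:
            counts["other"] += 1           # user went back to a known site
    return counts


session = [
    Event("search", "apple yield"),
    Event("search", "apple yield iphone"),
    Event("visit", "example.com/a"),
    Event("search", "hulk"),
    Event("visit", "example.com/a"),
    Event("search", "x"),
]
print(classify_searches(session))
```

    Note that a tally like this only sees whether a click happened, so a "good enough" click and a perfect result both count as new_site, which is exactly the gradation the metric can miss.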
  15. Well of course it doesn't! by Rick+Schumann · · Score: 1

    Logging your searches has nothing whatsoever to do with improving the quality of search results because Google, Bing, and Yahoo don't give a damn about YOU, you're just a farm animal that produces data that they sell to so-called 'partner companies' that turn around and shove ads in your feed-box, expecting you to gobble them up, then defecate money that businesses scoop up to put in their pockets. I'm only half-surprised that they don't claim rights to our corpses when we die so they can sell our organs and render the rest of our body parts to make glue or lampshades or whatever. #CapitalismGoneBad

  16. That's not the purpose by JohnFen · · Score: 1

    Data retention does improve the only thing they care about: monetization.

  17. How do they evaluate "search quality"? by Guidii · · Score: 1

    So I'm confused... the study is about "search quality", but I don't understand how they define that term. They were looking at search engines that changed their retention policy. They evaluated search quality before and after. That part sounds good.

    It seems that they counted the number of users coming from search engine A and landing at site B before and after. Can anyone explain how that's an indicator of search quality? Perhaps they want to measure if the search engine lost or gained users?