Slashdot Mirror


Siri Keeps Your Data For Two Years

New submitter LeadSongDog writes with news that Apple has provided information on how long it holds onto voice search data used by its digital assistant software Siri. Speaking to Wired, an Apple representative said the data is kept for two years after the initial query. "Here’s what happens. Whenever you speak into Apple’s voice activated personal digital assistant, it ships it off to Apple’s data farm for analysis. Apple generates a random number to represent the user and it associates the voice files with that number. This number — not your Apple user ID or email address — represents you as far as Siri’s back-end voice analysis system is concerned. Once the voice recording is six months old, Apple “disassociates” your user number from the clip, deleting the number from the voice file. But it keeps these disassociated files for up to 18 more months for testing and product improvement purposes." This information came in response to requests for clarification of Siri's privacy policy, which was not very clear as written. The director of privacy group Big Brother Watch said, "There needs to be a very high justification for retaining such intrusive data for longer than is absolutely necessary to provide the service."
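
As a rough illustration of the timeline described in the summary, and not Apple's actual implementation, the policy could be sketched like this. The names `VoiceClip`, `new_user_number` and `apply_retention` are hypothetical, the day counts are approximations of "six months" and "two years", and deletion at the two-year mark is assumed from the "up to 18 more months" wording.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional
import uuid

# Assumed approximations of the periods quoted in the summary.
DISASSOCIATE_AFTER = timedelta(days=183)  # roughly six months
DELETE_AFTER = timedelta(days=730)        # roughly two years in total


@dataclass
class VoiceClip:
    """Hypothetical record for one stored voice query."""
    recorded_at: datetime
    user_number: Optional[str]  # random per-user number, never an Apple ID or email


def new_user_number() -> str:
    """Generate a random identifier that carries no account information."""
    return uuid.uuid4().hex


def apply_retention(clips: List[VoiceClip], now: datetime) -> List[VoiceClip]:
    """Return the clips that survive the policy, stripping old user numbers."""
    kept = []
    for clip in clips:
        age = now - clip.recorded_at
        if age >= DELETE_AFTER:
            continue  # past the assumed two-year limit: drop the clip entirely
        if age >= DISASSOCIATE_AFTER:
            clip.user_number = None  # "disassociate" the clip from the user number
        kept.append(clip)
    return kept
```

Under these assumptions, a clip recorded 30 days ago keeps its user number, one recorded 200 days ago is kept without it, and one recorded 800 days ago is dropped.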

11 of 124 comments

  1. Comparison with Google search? by Anubis IV · · Score: 4, Interesting

    Anyone have the timeline for Google's disassociation and destruction of search queries? I'm curious how Apple's policies compare against those.

    1. Re:Comparison with Google search? by fazey · · Score: 5, Insightful

      You mean Google has an option to hide your search history from you?

    2. Re:Comparison with Google search? by Anubis IV · · Score: 4, Interesting

      From what I can tell, disabling Google History doesn't seem to come with a promise that Google doesn't keep that data somewhere else. What they say they'll do is stop using your History to present targeted advertising for you across their services, or you can choose to delete individual items from your search history, that way they aren't considered when it comes to determining your interests and the like. What they very carefully seem to avoid saying is that they completely delete your queries from all of their systems, so I wouldn't be surprised if they're still using them in some sort of anonymized form for product improvement purposes, tracking trends, or other things of that sort.

    3. Re:Comparison with Google search? by Anubis IV · · Score: 4, Informative

      Well, I've been searching since I made the comment, and the best I've found so far is this thread where a Google rep confirms that for every image search they keep a thumbnail of the item that was clicked on, as well as the IP address for 9 months (after which it gets anonymized), and identifying information for the cookie associated with you for 18 months (after which it gets anonymized and the IP address gets partially destroyed). What that means is that they never fully destroy the data, and that if the query was self-identifying in some way, someone could still tie all of the queries you made together since they would still be associated with the cookie data, even if that cookie data is no longer associated with you.

      Take it with a grain of salt, however, since that's from back in 2011. As we all know, these tech companies have made big strides to protect our privacy better since then. Wait, no, I have that backwards.

    4. Re:Comparison with Google search? by sqrt(2) · · Score: 3, Insightful

      Perfectly reasonable. Myself, I've never seen an advertisement that was legitimately helpful. I'm dubious that there ever could be such a thing because advertising is fundamentally an adversarial relationship between the advertiser and the target of the ad (you): you have money that you want to keep, or get the most value for when you do spend it; they want to give you as little as possible while taking as much of your money as they can. You are fighting each other, you have competing interests. You can see why there's a huge incentive for them to lie, or get as close to lying as they legally can, and emotionally manipulate you in their pursuit of your dollars. I find attempts at such manipulation repugnant, which is probably why I walk around most of the day with a mild nauseated sensation. Still, I'd choose that over the syrupy haze of blissful ignorance.

      Google's official ads might be the least intrusive, but their disguised ads are rather pernicious, IMO. For example, every product you are shown when using Google Shopping is a paid product advertisement, every single product. They are ALL ads, and nowhere is this disclosed clearly. They are trying to pass it off as a store like Amazon (which has plenty of hidden ads too, but they at least make a passing nod towards identifying them) but it's more like the yellow pages. You have to pay Google for your product to appear there.

      --
      If you build it, nerds will come. Soylentnews.org
  2. Siri sucks! Stop making it better! by Maxwell · · Score: 3, Insightful

    My guess is the overlap between "people who complained Siri wasn't accurate" and "people who don't want Apple keeping any Siri data so they can make it better" is pretty close to perfect.

    Google reads your mail. Apple listens to your ravings. Don't like it, don't use it. And they only keep 'your' (i.e. identifiable) data for 6 months.

    1. Re:Siri sucks! Stop making it better! by nine-times · · Score: 3, Interesting

      Yeah, I find myself not minding this so much. I do think electronic records should somehow "sunset" at some point, even if it's after a few years, for various reasons. However, I don't see what the big deal is whether Apple retains the data for 1 month vs. 6 months vs. 2 years.

      When I used Siri for the first time and realized it was sending my questions to a datacenter somewhere, I had an immediate reaction of "that's a bit creepy and disconcerting." But once the data is sent out to the datacenter for processing, you've already opened the door for the data to be misused. Once you assume that the data will be stored for some amount of time, you increase the chances for the data to be misused. But if you extend the time that the data is stored for by a few months or a year, I don't feel like you're greatly increasing your exposure.

      What holding on to the data actually does is give Apple some time to process and analyze the data, improving the speech recognition and heuristic models. I'd expect them to want to keep it for a couple years, especially since Siri is new and they're probably still developing their methods for analyzing the data. In this sort of situation, having more data means being able to create a more accurate analysis.

  3. Re:Rotten to the core. by Megahard · · Score: 4, Informative

    I just tried it with Siri and it also punts to Wolfram Alpha, so the answers are identical. There are no lakefront properties.

    --
    I eat only the real part of complex carbohydrates.
  4. Sample data... by sl3xd · · Score: 4, Interesting

    Everyone I've ever spoken to or read about in the field of voice recognition tells me that having samples of people's voices is critical to improving it... and getting those samples (mainly the raw quantity of samples) is the biggest problem they face.

    So it doesn’t surprise me at all that anyone keeps a massive archive of samples... the sample data can be critical in improving voice recognition.

    As an aside: Google Voice's voice mail feature does more or less the same thing... and the reasoning is the same also: More sample data means better voice recognition.

    I can't help but shake my head at the comparison:

    Google samples user voices, reads (and transcribes) voice mail, reads your email, your stock information and then feeds it into their advertising engine, and does this for four years and counting; reaction: Meh...

    Apple samples voices, anonymizes them, uses them to improve voice recognition over a period of two years; reaction: EVIL! APPLE MUST DIE!

    --
    -- Sometimes you have to turn the lights off in order to see.
    1. Re:Sample data... by sl3xd · · Score: 3, Interesting

      Voice prints are a real thing, of course; my point isn't that it's not possible to identify people from a voice sample.

      My point is that Apple doesn't make its money by selling you, me, and everyone else to the highest bidder, nor does its business have any real advantage in profiling us. Apple's business isn't advertising, it's selling hardware. (The flop that is iAd notwithstanding.)

      Google, on the other hand, is entirely different: Their entire revenue stream is from collecting our personal information, categorizing and analyzing it, and then selling or otherwise making that data useful to its actual customers, i.e. its advertisers.

      "Hell, I create my own image and speech recognition software from scratch, and I don't need all those fucking samples. I just need to run the samples through my algorithms at most twice -- Once, then again to test if the changes were beneficial or not."

      If you honestly believe that, then you've never spent even a minute actually learning the basics of speech recognition, let alone the level of complexity involved in modern algorithms. Signal processing isn't like database programming, where you get a nice result that fits into a box, and can easily reduce unwanted side effects.

      Also keep in mind, there's a difference between "automatic speech recognition", where whole sentences are parsed and understood (such as used with Siri or Google), versus "discrete speech recognition", where very limited actions are understood (like older cell phones when you spoke "dial ").

      The problem is that while you might have improved the recognition for one specific sample, you've now made it considerably worse for another... so you have to build up a massive library of samples to do regression testing. One of the biggest challenges in speech recognition over the years is the utter lack of sample data for a wide populace, coupled with computers that are unable to hold enough samples in memory to do any meaningful comparisons.

      We've only recently started to see speech recognition of that calibre, and even then, it's accomplished by sending a recording off to a datacenter so fraking huge that it'd easily sit at the top of the TOP500 supercomputer list if their owners bothered to run linpack on it. It's no coincidence that it's also only been in the past couple of years that speech recognition has become anything more than a lame joke.

      --
      -- Sometimes you have to turn the lights off in order to see.
  5. They make data anonymous after 18 months by tuppe666 · · Score: 3, Insightful

    ...and have since 2007. These two great blog posts cover the details: "Taking steps to further improve our privacy practices" http://googleblog.blogspot.co.uk/2007/03/taking-steps-to-further-improve-our.html and "How long should Google remember searches?" http://googleblog.blogspot.co.uk/2007/06/how-long-should-google-remember.html An example from it: "By anonymizing our server logs after 18-24 months, we think we’re striking the right balance between two goals: continuing to improve Google’s services for you, while providing more transparency and certainty about our retention practices." Google are surprisingly forthcoming about how and what they do with your data, which clashes sharply with Apple (who pretend they don't) or Microsoft (who run hate campaigns).