Slashdot Mirror


Siri Keeps Your Data For Two Years

New submitter LeadSongDog writes with news that Apple has provided information on how long it holds onto voice search data used by its digital assistant software Siri. Speaking to Wired, an Apple representative said the data is kept for two years after the initial query. "Here’s what happens. Whenever you speak into Apple’s voice activated personal digital assistant, it ships it off to Apple’s data farm for analysis. Apple generates a random number to represent the user and it associates the voice files with that number. This number, not your Apple user ID or email address, represents you as far as Siri’s back-end voice analysis system is concerned. Once the voice recording is six months old, Apple “disassociates” your user number from the clip, deleting the number from the voice file. But it keeps these disassociated files for up to 18 more months for testing and product improvement purposes." This information came in response to requests for clarification of Siri's privacy policy, which was not very clear as written. The director of privacy group Big Brother Watch said, "There needs to be a very high justification for retaining such intrusive data for longer than is absolutely necessary to provide the service."
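
The timeline in the summary can be sketched as a small state function. This is only an illustration of the policy as described to Wired, not Apple's implementation: the names `retention_state`, `months_between`, and `new_user_number` are hypothetical, and treating "up to 18 more months" as a hard 24-month cutoff is an assumption.

```python
import uuid
from datetime import date

# Thresholds taken from the summary: the recording stays linked to a random
# user number for 6 months, then is kept unlinked for up to 18 more months.
DISASSOCIATE_AFTER_MONTHS = 6
DELETE_AFTER_MONTHS = 24  # assumption: "up to 18 more months" used as a hard limit


def new_user_number() -> str:
    """Random identifier standing in for the user (hypothetical helper)."""
    return uuid.uuid4().hex


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the current month has not fully elapsed yet
    return months


def retention_state(recorded_on: date, today: date) -> str:
    """Classify a recording as 'linked', 'disassociated', or 'deleted'."""
    age = months_between(recorded_on, today)
    if age >= DELETE_AFTER_MONTHS:
        return "deleted"
    if age >= DISASSOCIATE_AFTER_MONTHS:
        return "disassociated"
    return "linked"
```

For example, a clip recorded on 15 January 2013 would be "linked" until 14 July 2013, "disassociated" from 15 July 2013, and "deleted" from 15 January 2015 under these assumptions.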

6 of 124 comments

  1. Comparison with Google search? by Anubis+IV · · Score: 4, Interesting

    Anyone have the timeline for Google's disassociation and destruction of search queries? I'm curious how Apple's policies compare against those.

    1. Re:Comparison with Google search? by Anubis+IV · · Score: 4, Interesting

      From what I can tell, disabling Google History doesn't seem to come with a promise that Google doesn't keep that data somewhere else. What they say they'll do is stop using your History to present targeted advertising to you across their services. You can also choose to delete individual items from your search history so that they aren't considered when it comes to determining your interests and the like. What they very carefully seem to avoid saying is that they completely delete your queries from all of their systems, so I wouldn't be surprised if they're still using them in some sort of anonymized form for product improvement purposes, tracking trends, or other things of that sort.

  2. Sample data... by sl3xd · · Score: 4, Interesting

    Everyone I've ever spoken to or read about in the field of voice recognition tells me that having samples of people's voices is critical to improving it... and getting those samples (mainly the raw quantity of samples) is the biggest problem they face.

    So it doesn’t surprise me at all that anyone keeps a massive archive of samples... the sample data can be critical in improving voice recognition.

    As an aside: Google Voice's voice mail feature does more or less the same thing... and the reasoning is the same also: More sample data means better voice recognition.

    I can't help but shake my head at the comparison:

    Google samples user voices, reads (and transcribes) voice mail, reads your email, your stock information and then feeds it into their advertising engine, and does this for four years and counting; reaction: Meh...

    Apple samples voices, anonymizes them, uses them to improve voice recognition over a period of two years; reaction: EVIL! APPLE MUST DIE!

    --
    -- Sometimes you have to turn the lights off in order to see.
    1. Re:Sample data... by VortexCortex · · Score: 1, Interesting

      Anonymized voice sample you say? "Voice Print Identified" I say. Hell, I create my own image and speech recognition software from scratch, and I don't need all those fucking samples. I just need to run the samples through my algorithms at most twice -- Once, then again to test if the changes were beneficial or not. If I have a constant stream of users (new samples), and I'm smart -- read: Not fucking daft -- then I can just run the samples through once, and let the users of the system rate the samples in order to rate the sub-systems' efficiency and promote or demote the changes, meanwhile saving a fortune on voice data storage costs. (I use genetic algorithms, so the +1 ratings lead to more "breeding" advantage when spawning the next generation -- no need for data samples, just continued use.)
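
      The rating-driven "breeding" idea above can be sketched roughly as fitness-proportional selection. This is a minimal illustration, not the poster's actual code: `selection_weights` and `pick_parents` are hypothetical names, and shifting the scores so the lowest-rated candidate keeps a small chance is an assumption.

```python
import random
from typing import Dict, List


def selection_weights(ratings: Dict[str, int]) -> Dict[str, float]:
    """Turn net user ratings per candidate into selection probabilities."""
    # Shift scores so the worst-rated candidate still gets weight 1 (assumption).
    lowest = min(ratings.values())
    shifted = {name: score - lowest + 1 for name, score in ratings.items()}
    total = sum(shifted.values())
    return {name: value / total for name, value in shifted.items()}


def pick_parents(ratings: Dict[str, int], count: int, seed: int = 0) -> List[str]:
    """Draw parents for the next generation, favouring higher-rated candidates."""
    weights = selection_weights(ratings)
    names = list(weights)
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return rng.choices(names, weights=[weights[n] for n in names], k=count)
```

      With net ratings of 3, 0, and -1, the three candidates would be drawn with probabilities 0.625, 0.25, and 0.125 respectively.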

      Now, I suppose the longer I keep that data the more tests I can run, but think about it really: Which human is going to verify if the algorithm is producing a better match for tons of fucking voice data? No. That's fucking dumb -- That's not what happens to improve the system. That means paying tons of people to listen to the service and re-rate the output after changes have occurred. To improve the system you can collect a SMALL representative sample of those voice recordings to use as a test data set. You have a human transcriber convert those select recordings into actual text. Then you use them as the dataset -- AND YOU CAN KEEP ONLY THOSE on file. It could be a totally opt-in thing: "[_] Improve Siri by Saving Your Search". There's no reason to keep the entire fucking database of voice recordings. That's asinine, it's not helping anyone, except maybe the feds, and the data storage requirements are stupidly taxing for no other really beneficial reason.

      If you compare two voice samples you can damn well verify they came from the same person or not. It's called Voiceprinting -- Like Fingerprinting. And as the "anonymized" AOL search data debacle proved: You can't really anonymize search data.

    2. Re:Sample data... by sl3xd · · Score: 3, Interesting

      Voice prints are a real thing, of course; my point isn't that it's not possible to identify people from a voice sample.

      My point is that Apple doesn't make its money by selling you, me, and everyone else to the highest bidder, nor does its business have any real advantage in profiling us. Apple's business isn't advertising, it's selling hardware. (The flop that is iAd notwithstanding)

      Google, on the other hand, is entirely different: Their entire revenue stream is from collecting our personal information, categorizing and analyzing it, and then selling or otherwise making that data useful to its actual customers, i.e. its advertisers.

      "Hell, I create my own image and speech recognition software from scratch, and I don't need all those fucking samples. I just need to run the samples through my algorithms at most twice -- Once, then again to test if the changes were beneficial or not"

      If you honestly believe that, then you've never spent even a minute actually learning the basics of speech recognition, let alone the level of complexity involved in modern algorithms. Signal processing isn't like database programming, where you get a nice result that fits into a box, and can easily reduce unwanted side effects.

      Also keep in mind, there's a difference between "automatic speech recognition" - where whole sentences are parsed and understood (such as used with Siri or Google), versus "discrete speech recognition" where very limited actions are understood (like older cell phones when you spoke "dial ").

      The problem is that while you might have improved the recognition for one specific sample, you've now made it considerably worse for another... so you have to build up a massive library of samples to do regression testing. One of the biggest challenges in speech recognition over the years is the utter lack of sample data for a wide populace, coupled with computers that are unable to hold enough samples in memory to do any meaningful comparisons.
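
      The regression testing described above can be sketched as follows: score the old and the new recognizer against a library of human-transcribed samples and list every sample that got worse. This is a toy illustration under stated assumptions. `word_error_rate` and `find_regressions` are hypothetical names, the recognizers are plain callables standing in for real models, and word error rate is just one common metric, not necessarily what any vendor uses.

```python
from typing import Callable, Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def find_regressions(
    samples: Dict[str, Tuple[str, str]],
    old_model: Callable[[str], str],
    new_model: Callable[[str], str],
) -> List[str]:
    """Return ids of samples the new model transcribes worse than the old one."""
    worse = []
    for sample_id, (audio, transcript) in samples.items():
        old_wer = word_error_rate(transcript, old_model(audio))
        new_wer = word_error_rate(transcript, new_model(audio))
        if new_wer > old_wer:
            worse.append(sample_id)
    return worse
```

      The larger and more varied the sample library, the more likely it is that a change which helps one speaker and hurts another shows up in the returned list.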

      We've only recently started to see speech recognition of that calibre, and even then, it's accomplished by sending a recording off to a datacenter so fraking huge that it'd easily sit at the top of the TOP500 supercomputer list if their owners bothered to run linpack on it. It's no coincidence that it's also only been in the past couple of years that speech recognition has become anything more than a lame joke.

      --
      -- Sometimes you have to turn the lights off in order to see.
  3. Re:Siri sucks! Stop making it better! by nine-times · · Score: 3, Interesting
    Yeah, I find myself not minding this so much. I do think electronic records should somehow "sunset" at some point, even if it's after a few years, for various reasons. However, I don't see what the big deal is whether Apple retains the data for 1 month vs. 6 months vs. 2 years.

    When I used Siri for the first time and realized it was sending my questions to a datacenter somewhere, I had an immediate reaction of "that's a bit creepy and disconcerting." But once the data is sent out to the datacenter for processing, you've already opened the door for the data to be misused. Once you assume that the data will be stored for some amount of time, you increase the chances for the data to be misused. But if you extend the time that the data is stored for by a few months or a year, I don't feel like you're greatly increasing your exposure.

    What holding on to the data actually does is give Apple some time to process and analyze the data, improving the speech recognition and heuristic models. I'd expect them to want to keep it for a couple years, especially since Siri is new and they're probably still developing their methods for analyzing the data. In this sort of situation, having more data means being able to create a more accurate analysis.