Siri Keeps Your Data For Two Years
New submitter LeadSongDog writes with news that Apple has provided information on how long it holds onto voice search data used by its digital assistant software Siri. Speaking to Wired, an Apple representative said the data is kept for two years after the initial query.
"Here’s what happens. Whenever you speak into Apple’s voice-activated personal digital assistant, it ships the recording off to Apple’s data farm for analysis. Apple generates a random number to represent the user and associates the voice files with that number. This number, not your Apple user ID or email address, represents you as far as Siri’s back-end voice analysis system is concerned. Once the voice recording is six months old, Apple “disassociates” your user number from the clip, deleting the number from the voice file. But it keeps these disassociated files for up to 18 more months for testing and product improvement purposes."
This information came in response to requests for clarification of Siri's privacy policy, which was not very clear as written. The director of privacy group Big Brother Watch said, "There needs to be a very high justification for retaining such intrusive data for longer than is absolutely necessary to provide the service."
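For readers who want the timeline spelled out, here is a rough Python sketch of the retention scheme as the Wired quote describes it: a random per-user number, disassociation at six months, deletion at two years. Apple has not published how its back end actually works, so every name here (VoiceClip, new_random_id, apply_retention) is hypothetical, and the day counts are approximations of "six months" and "18 more months".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional
import secrets

# Durations taken from the summary; the exact day counts are assumptions.
DISASSOCIATE_AFTER = timedelta(days=183)  # roughly six months
DELETE_AFTER = timedelta(days=730)        # roughly two years in total


@dataclass
class VoiceClip:
    recorded_at: datetime
    random_id: Optional[str]  # random per-user number, not an Apple ID or email
    audio: bytes


def new_random_id() -> str:
    """Generate a random identifier that carries no account information."""
    return secrets.token_hex(16)


def apply_retention(clips: List[VoiceClip], now: datetime) -> List[VoiceClip]:
    """Return the clips that may still be kept at time `now`.

    Clips older than six months lose their random number; clips older
    than two years are dropped entirely.
    """
    kept = []
    for clip in clips:
        age = now - clip.recorded_at
        if age >= DELETE_AFTER:
            continue  # past the two-year limit: discard the clip
        if age >= DISASSOCIATE_AFTER:
            clip.random_id = None  # "disassociate" the clip from the user number
        kept.append(clip)
    return kept
```

Running apply_retention on a schedule (say, daily) would be one way to enforce both deadlines. Note that removing the number does nothing about identifying details in the audio itself.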
Anyone have the timeline for Google's disassociation and destruction of search queries? I'm curious how Apple's policies compare against those.
Everyone I've ever spoken to or read about in the field of voice recognition tells me that having samples of people's voices is critical to improving it... and getting those samples (mainly the raw quantity of samples) is the biggest problem they face.
So it doesn’t surprise me at all that anyone keeps a massive archive of samples... the sample data can be critical in improving voice recognition.
As an aside: Google Voice's voice mail feature does more or less the same thing... and the reasoning is the same also: More sample data means better voice recognition.
I can't help but shake my head at the comparison:
Google samples user voices, reads (and transcribes) voice mail, reads your email, your stock information and then feeds it into their advertising engine, and does this for four years and counting; reaction: Meh...
Apple samples voices, anonymizes them, uses them to improve voice recognition over a period of two years; reaction: EVIL! APPLE MUST DIE!
-- Sometimes you have to turn the lights off in order to see.
When I used Siri for the first time and realized it was sending my questions to a datacenter somewhere, I had an immediate reaction of "that's a bit creepy and disconcerting." But once the data is sent out to the datacenter for processing, you've already opened the door for the data to be misused. Once you assume that the data will be stored for some amount of time, you increase the chances for the data to be misused. But if you extend the time that the data is stored by a few months or a year, I don't feel like you're greatly increasing your exposure.
What holding on to the data actually does is give Apple some time to process and analyze it, improving the speech recognition and heuristic models. I'd expect them to want to keep it for a couple of years, especially since Siri is new and they're probably still developing their methods for analyzing the data. In this sort of situation, having more data means being able to create a more accurate analysis.