I'm not quite sure why it is surprising to people that Apple keep Siri data - how else are they supposed to use the massive amounts of data to improve?
It sounds like it's not even voice data, but the questions and the answers given. This would probably be useful for optimising for things people are asking that Siri is getting wrong?
It's not surprising to me that they keep it, but they do not need to keep personally identifying information for six months in order to do that. Seems a little improper, and also unnecessary, for employees to have access to that kind of thing if they're working on improving Siri.
Hazarding a guess, history associated with a user could help accuracy or determine usage that causes problems? I mean, sure, don't keep the user data forever (and I imagine you can't, depending on local laws), but as long as it's useful? Six months seems a reasonable timescale if you have to identify, track and test problems and then implement fixes, and six months per user probably doesn't give you a whole load of statistics on its own.
It does not need to be personalized, it only needs to be associated with a given data set. How it's applied to the personal device in question is an implementation problem, not a matter of whether anonymization should have been implemented.
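To make "associated with a given data set" a bit more concrete, here's a minimal sketch of one way it could be done: replace the account identifier with a keyed hash, so a user's queries still group together without the stored record naming them. Everything here (`new_rotation_key`, `pseudonymize`, `make_record`, the record fields) is hypothetical and made up for illustration; it is not anything Apple has described.

```python
import hashlib
import hmac
import secrets


def new_rotation_key() -> bytes:
    """Generate a random secret key (hypothetical helper).

    Discarding the key at the end of a retention window means old
    tokens can no longer be recomputed from a user ID.
    """
    return secrets.token_bytes(32)


def pseudonymize(user_id: str, key: bytes) -> str:
    """Map a user ID to a stable token using HMAC-SHA256."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def make_record(user_id: str, query: str, answer: str, key: bytes) -> dict:
    """Build a log record that carries a token instead of the raw user ID."""
    return {
        "token": pseudonymize(user_id, key),
        "query": query,
        "answer": answer,
    }


if __name__ == "__main__":
    key = new_rotation_key()
    print(make_record("user-123", "weather tomorrow", "Sunny", key))
```

Note this only hides the identifier. The query text itself can still be identifying, so it's pseudonymization rather than full anonymization.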
Not so long ago, the primary goal of the 1-800-GOOG-411 service was to record people's voice queries for the purpose of bettering their voice recognition technology. After they got sufficient data samples, they discontinued the service.
> "Google had stated that the company originally implemented GOOG-411 to build a large phoneme database from users' voice queries. This phoneme database, in turn, allowed Google engineers to refine and improve the speech recognition engine that Google uses to index audio content for searching."
Fascinating. I had no idea that's what that service was being used for. Interesting, it was more or less retired as soon as Android blew up and voice search was everywhere.
The general population doesn't even understand that Facebook can read their messages. I wouldn't be surprised if some people didn't even know that Siri needs an internet connection and a data center to process voice commands, never mind the storage of that data.
I think we seriously have an issue with educating the public about how modern technologies work. I don't expect everyone to know how to code, but they should have a basic understanding of what is happening with their data while they use all these 'magical' devices.
Actually, I'm surprised they get away with keeping that data for so short a period of time. For some reason, I thought I read somewhere that US law mandated service providers to keep that sort of data significantly longer. I can't find the reference now... so I must have been mistaken.
I guess I'm more surprised that it's only kept for 2 years, than I am that the data is kept.
Apple is one of the few companies left where I don't mind them collecting my data, because it's only going to be used for their own products and services.
So, you're right that no one knows something like that. However, I think that Apple and Google (to use two specific companies) have very different approaches to gaining their profits. Apple tries to roll out higher-margin devices and get you to part with your dollars to buy them. Google, on the other hand, tries to roll out free things and get advertisers to part with their advertising dollars, serving you up as the product. That sounds a little more anti-ad than I mean it to, but I think it illustrates the "if you aren't paying, you're the product" point. There's nothing wrong with being the product if you get something awesome in return. Gmail is awesome and free.
However, once you're scanning emails for targeting an ad specific to the email, why not try to build a profile of a person? Why not try to figure out as much as you can to target them? With a mobile device, why not track their location (linked to an account) so you can tell where they are usually going? Build their voice queries into that profile? Now when they walk to a certain area and it's Tuesday near lunch, maybe you know what restaurant they like and serve an ad for a competitor.
Again, some of this is hyperbole to illustrate a point and I'm not accusing Google of anything. But Google makes its money off of being able to sell ads that are relevant to you. They try to keep it to a non-creepy level, but it's their profit engine. Apple, on the other hand, makes their money by getting you to part with your dollars in exchange for one of their shiny toys. In fact, when you look at Apple's revenues, only 3% come from sources other than the iPhone, iPad, iTunes, iPod, and computers. Unless Apple wants to dramatically change course, the data is much more likely to be used for improvement of the system rather than targeting of individual users.
That doesn't guarantee that your data will be used in any particular fashion by any company. But I think Apple's revenue streams show that, at least right now, they aren't heavily into any area where targeting you via these recordings would be useful for them. In fact, if anything it would jeopardize their already highly-profitable operations and that would make it a hard decision to exploit it.
I can't speak for the parent, but this is my reasoning: They make close to a hundred billion dollars a year by selling hardware, so they don't need to sell my private information for some quick cash.
It's stupid, I know. But the same couldn't be said about advertising-driven companies (like Google and Facebook).
In other words, being mindful of user privacy is a luxury they can have right now, and it's good for their business (i.e., they're not doing it because they're "good", or others that do are "bad" or "evil" - it's just that they're now in a position where respecting user privacy is better for their bottom line than not being respectful).
How else would Apple make Siri better? If they did no research on their products, the end product would be a big failure. Apple Maps on release was far from accurate. Apple Maps now is a lot better.