How differential privacy can crowdsource meaningful info without exposing your secrets
Apple threw out a loaded term and little information in its keynote about a new approach it's rolling out to learn user behavior while preserving privacy. It's good.
Security and privacy expert Matthew Green reassures us, “Your iPhone is not going to kill you.” That’s good. But in his recent explanation of how Apple’s differential privacy approach will send an obscured subset of our private activities to Apple, he notes that some studies show serious consequences when privacy protections are made too strict in collecting data related to medical research.
Apple proposes initially to gather data in iOS 10 from typing to improve emoji substitution and predictive word suggestions for previously unrecognized words, and from deep links within apps (non-private internal destinations) to improve Spotlight search results. In macOS Sierra, it will use data to improve autocorrect. And in both, it will watch which Lookup Hints are selected in Notes to provide better help.
For instance, when the word “twerk” was first used, it appeared in no dictionaries. With differential privacy, the term could have become recognized rapidly and added to iOS’s dictionary as it spread in common usage. But nobody would be able to determine whether you personally used the word “twerk.”
Differential privacy discovers the usage patterns of a large number of users without compromising individual privacy—if it’s set up and run correctly. To obscure an individual’s identity, differential privacy adds mathematical noise to an individual’s usage pattern, and only a small amount of data is ever uploaded from a given person.
As more people share the same pattern—like visiting the same popular song in a music-streaming app—general patterns begin to emerge, which can make the operating systems produce more desirable results. (This is all distinct from Apple’s anonymization techniques with Siri and other aggregated results, where data is tied to a randomly generated ID number that can be reset by the user.)
For the most part, this technique as applied by Apple should allow meaningful information to be gathered in a form that can’t be reconstructed to obtain accurate answers, whether captured on a device, in transit, or at the destination by Apple, criminals, or other parties. Ostensibly, a malicious or unwanted agent could gain access to all of Apple’s details collected from users and still remain unable to reconstruct and reconnect a single piece of data, much less a profile, of any individual.
Balancing the privacy budget
Differential privacy is a relatively recent technique in data gathering that relies on a survey tactic that’s decades old. It tries to split the baby of privacy: providing enough information about people’s decisions that the responses can be used to summon the wisdom of a crowd and train deep-learning systems, but not so much that it associates those actions or answers with an individual.
In the classic randomized response approach developed in the 1960s, a coin flip adds randomness. For instance, a researcher might ask the then-fraught question, “Are you a member of the Communist Party?” The subject would flip a coin out of view of the researcher. If the coin comes up heads, they always answer “yes.” If tails, they answer truthfully. This gives them plausible deniability, because neither the researcher nor any other party knows whether a given “yes” is truthful. People are more willing to answer honestly as a result, which leads to better survey results. With enough answers, the noise of that randomness can be calculated and subtracted to produce a relatively accurate distribution.
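The coin-flip survey can be sketched in a few lines of Python. This is a minimal illustration of the randomized response idea described above, not Apple's implementation; the function names, the 10 percent true rate, and the population size are all made up for the example.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Heads: always answer yes. Tails: answer truthfully."""
    if rng.random() < 0.5:  # simulated coin flip came up heads
        return True
    return truth

def estimate_true_rate(answers: list) -> float:
    """Subtract the known noise.

    P(yes) = 0.5 + 0.5 * p, so p = 2 * P(yes) - 1.
    """
    observed_yes = sum(answers) / len(answers)
    return max(0.0, min(1.0, 2 * observed_yes - 1))

# Hypothetical population: 10% would truthfully answer "yes".
rng = random.Random(42)
true_rate = 0.10
population = [rng.random() < true_rate for _ in range(100_000)]

answers = [randomized_response(t, rng) for t in population]
estimate = estimate_true_rate(answers)
print(f"share answering yes: {sum(answers) / len(answers):.3f}")
print(f"estimated true rate: {estimate:.3f}")
```

More than half of the recorded answers are “yes,” yet the estimate lands close to the real rate. Note that in this particular scheme only a “yes” is deniable; a “no” can only come from a truthful answer.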
Differential privacy is effectively a modern, more complicated iteration of the same idea. Instead of flipping a coin, a system adds more sophisticated random values that produce a result that can’t be reverse engineered. Instead of one coin flip and one answer, there can be the equivalent of many—sometimes many dozens.
But there are four related problems with differential privacy that could allow some recovery of the original data:
- How much data from any individual party is collected.
- How the information is obscured before transmission and on reception.
- How many questions are asked that are similar enough.
- How many times an individual is asked the same or similar questions across a period of time.
On the first point, Apple plans to send just a subset of all data collected. With a massive number of users, only a relatively small number of data points provides nearly as much certainty as a huge number.
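A rough way to see why a subset is enough: the uncertainty in an estimated proportion shrinks with the square root of the sample size. The sketch below uses the standard margin-of-error formula for a simple random sample and ignores the extra noise that differential privacy adds; the sample sizes are illustrative and are not figures from Apple.

```python
import math

def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical numbers of reports collected.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} reports -> +/- {margin_of_error(n):.4%}")
```

Going from ten thousand to a hundred million reports is a 10,000-fold increase in data but only a 100-fold improvement in precision, which is why collecting more from each person buys relatively little.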
On the second, Apple apparently plans to add noise to data it collects before it stores it on an iPhone or in macOS. In an interview, Green, a professor at Johns Hopkins University who has uncovered issues with Apple’s cryptography and security in the past, says Apple told him in a briefing that some of that data is discarded and some uploaded on a daily basis. “Even if somebody got into your phone and got through your passcode lock, they’d get a big database with a lot of noise in it,” he says.
Users can opt to not send any data at all, he says, and Apple will additionally discard IP addresses before storing information on its end to avoid connecting even noisy data with its origin. (Apple didn’t respond to a request for confirmation on the points Green was briefed about.)
Green notes that on the third point, asking too many similar questions, matters become more subtle. If you ask someone if they’re a member of the Communist Party, then ask if they admire Joseph Stalin, then ask them to name the ideal economic and political system, and so on, an outside observer could eventually see through the noise and determine their attitude on a given topic. Preventing this requires a system design that correlates related questions.
On the last point, asking questions over time, Apple will establish a “privacy budget,” which will limit how much data about the same or related things a device transmits, whether within a given period or ever. The answer to a question will often remain the same over time, and providing the same answer multiple times can make it possible to determine the truthful answer.
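To see why repetition is a problem, take the coin-flip scheme from earlier (heads means always “yes,” tails means the truth) and ask the same person the same question again and again. A short Bayes calculation shows how a streak of “yes” answers erodes deniability. The `PrivacyBudget` class is a hypothetical stand-in for a cap on reports; the article does not describe how Apple's budget actually works.

```python
def posterior_yes(prior: float, yes_streak: int) -> float:
    """Probability the true answer is yes after `yes_streak` noisy yeses.

    A true "yes" always reports yes; a true "no" reports yes only on
    heads, so a streak of k yeses has probability 0.5 ** k.
    """
    likelihood_if_no = 0.5 ** yes_streak
    return prior / (prior + (1 - prior) * likelihood_if_no)

class PrivacyBudget:
    """Hypothetical cap on how many noisy reports about one topic are sent."""

    def __init__(self, max_reports: int) -> None:
        self.max_reports = max_reports
        self.sent = 0

    def try_send(self) -> bool:
        """Return True and count the report if budget remains, else False."""
        if self.sent >= self.max_reports:
            return False
        self.sent += 1
        return True

# Assume an observer starts out thinking 10% of people would say yes.
for k in (1, 3, 10):
    print(f"{k:>2} yes answers in a row -> {posterior_yes(0.10, k):.1%} likely true")

budget = PrivacyBudget(max_reports=2)
print([budget.try_send() for _ in range(3)])
```

One noisy “yes” barely moves the observer's belief, but ten in a row leave almost no doubt. Capping the number of reports keeps the observer near the low end of that curve.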
“If I’m using the poop emoji today, I’m probably going to use it tomorrow,” Green notes, or if he goes into Starbucks today, he might enter it again tomorrow. “None of the stuff they’re doing today seems to matter,” he says, but that will change over time as Apple becomes more confident of the technique.
Why differential privacy matters
The academic work behind differential privacy largely dates back a decade or so; Apple’s deployment may be the largest-scale publicly identified test to date. Google discussed its research in 2014, but it’s not clear how broadly the company uses it, outside a few cited examples.
In the example Green cited in his blog, a study involving dosing of an anticoagulant drug, having a very restrictive privacy budget delivered information that provided the wrong guidance for personalized medicine, in which a person’s characteristics have a large bearing on the correct medical approach. Suggesting the wrong emoji won’t injure someone, but there’s a continuum from Apple’s initial limited approach to life-changing (or life-ending) outcomes.
Apple is attempting to prove in practice that, even at the scale at which it operates, it doesn’t have to track every action or collect raw data to gain the advantages of artificial-intelligence processing. Apple has to make the right choices and perform its own internal analysis of how much truth can be traced back to individuals. But it’s a smart approach that balances improvement with obscurity.