Differential privacy is a system that enables the analysis of databases containing personal information without divulging the identity of the individuals. Differential privacy provides a mathematically provable guarantee of privacy, protecting against a wide range of privacy attacks (including differencing attacks, linkage attacks, and reconstruction attacks).
This is achieved by adding randomized “noise” to an aggregate query result in order to protect individual entries without significantly changing the result. Differentially private algorithms prevent attackers from learning anything about specific individuals while still allowing researchers to obtain valuable information about the database as a whole. One of the simplest such algorithms is the Laplace mechanism, which adds Laplace-distributed noise to the results of aggregate queries. Differentially private algorithms remain an active field of research.
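As a rough illustration (a minimal sketch, not the implementation of any particular library), the following Python snippet shows the idea behind the Laplace mechanism: the true answer to a counting query is perturbed with zero-centered Laplace noise whose scale is the query's sensitivity divided by the privacy parameter ε. The dataset, function name, and parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Laplace mechanism: add zero-centered Laplace noise with scale = sensitivity / epsilon.
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: an aggregate counting query (sensitivity 1) answered with epsilon = 0.5.
ages = [23, 37, 41, 52, 29, 64]
true_count = sum(1 for a in ages if a > 40)                     # real answer: 3
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(true_count, round(noisy_count, 2))
```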
The amount of sensitive data recorded digitally is increasing as people rely on digital services for applications ranging from payments, shopping, and health to transportation and navigation. While this data has many advantageous use cases, it also presents significant privacy challenges. Differential privacy aims to protect the privacy of an individual's data while enabling data scientists and researchers to continue the aggregate analysis of the data collected.
Companies have typically relied on data masking (also called de-identification) to protect privacy in datasets. Data masking removes personally identifiable information (PII) from each record within the dataset. However, research and real-life incidents have shown that simply removing PII from datasets doesn't guarantee the privacy of individuals. Combining anonymous datasets with auxiliary information allows for people's identities to be discovered. Examples include the following:
- In 1996, an MIT researcher matched masked health records with public voter registration data and identified the personal health information of the governor of Massachusetts.
- In 2006, researchers from the University of Texas at Austin analyzed a public, anonymized dataset from Netflix and, by combining it with information from IMDb, re-identified the movies watched by thousands of people.
- In 2022, researchers used AI to re-identify over half of the mobile phone records within an anonymized dataset.
Differential privacy aims to prevent these types of attacks by sharing data with random noise introduced. It is possible to add a level of noise such that the output prevents an attacker from discovering anything statistically significant about individuals in the dataset while also ensuring the dataset remains useful to analysts. By introducing an appropriate level of random noise, the same output could come from a database with or without the target's information.
It is possible to apply differential privacy to a wide range of systems, including recommendation systems, social networks, and location-based services. Variations of differentially private algorithms are also utilized in machine learning, game theory and economic mechanism design, statistical estimation, and many more. The following are examples of differential privacy in use:
- Apple accumulating anonymous user data from their devices
- Amazon analyzing users' personalized shopping preferences while hiding sensitive information about their past purchases
- Facebook gathering behavioral data for targeted advertising campaigns without violating various nations’ privacy policies
A variety of challenges are associated with implementing differential privacy:
- Restricting analysts to aggregate questions
- Difficulties determining the sensitivity of a question
- Potentially capping the number of questions due to data becoming noisier and less reliable
- Highly sensitive queries returning noisier data
Differential Privacy is one of many privacy-enhancing technologies (PETs) available. Others include the following:
- Homomorphic encryption
- Secure multiparty computation (SMPC)
- Zero-knowledge proof (ZKP)
- Federated learning
- Generative adversarial networks (GANs)
- Pseudonymization/obfuscation/data masking
- On-device learning
- Synthetic data generation (SDG)
Differentially private algorithms are the result of decades of research on technologies for privacy-preserving data analysis. Two earlier concepts that directly influenced differential privacy are:
- Dalenius’s statistical disclosure definition
- Minimum query set size
In 1977, statistician Tore Dalenius proposed a strict definition of data privacy: it should be impossible to learn anything about an individual from a database that could not be learned without access to the database. While later work showed this guarantee to be unachievable, it became a key building block for differential privacy.
Minimum query set size is a constraint that aims to ensure the privacy of individuals in aggregate queries (queries whose returned value is calculated across a subset of records in a dataset). It blocks queries that do not include data from a minimum number of records: if a query draws on fewer records than a defined threshold, the query is blocked.
In 1979, Dorothy Denning, Peter J. Denning, and Mayer D. Schwartz published a paper titled "The tracker: a threat to statistical database security." The paper describes a type of attack, the tracker, showing that it is possible to learn confidential information from a series of targeted aggregate queries; minimum query set sizes therefore do not ensure privacy.
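The sketch below illustrates the idea with made-up records and a hypothetical threshold of five records per query: both aggregate queries are individually allowed, yet their difference exposes one person's confidential value.

```python
MIN_QUERY_SET_SIZE = 5  # assumed threshold; queries over fewer records are blocked

salaries = {
    "alice": 52000, "bob": 61000, "carol": 58000,
    "dave": 47000, "erin": 66000, "target": 75000,
}

def total_salary(names):
    # Aggregate query guarded by a minimum query set size.
    if len(names) < MIN_QUERY_SET_SIZE:
        raise ValueError("query blocked: too few records")
    return sum(salaries[n] for n in names)

everyone = list(salaries)
everyone_but_target = [n for n in everyone if n != "target"]

# Both queries cover at least five records, so both are allowed...
q1 = total_salary(everyone)             # 6 records
q2 = total_salary(everyone_but_target)  # 5 records

# ...yet their difference isolates the target's confidential salary.
print(q1 - q2)  # 75000
```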
Differential privacy was defined through years of research applying algorithmic ideas to the study of privacy. Many cite Cynthia Dwork's 2006 paper as the first definition of differential privacy. In the paper, Dwork showed that Dalenius's definition could not be achieved, as auxiliary information could always lead to re-identifying individuals when querying a dataset. For this reason, Dwork proposed a new definition, known as differential privacy, stating that the techniques
can achieve any desired level of privacy under this measure. In many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.
Differential privacy guarantees that an attacker can learn nothing more about an individual than they could if the target's information were removed from the dataset. While weaker than Dalenius’s definition of privacy, the guarantee means individual records are almost irrelevant to the output of the system and therefore the organization handling a participant's data will not violate their privacy.
During the 2010s, large tech companies, including Apple, Facebook, and Amazon, began implementing differential privacy to protect the users of their services. Google has released multiple open-source differential privacy libraries to aid developers. In 2020, the US Census Bureau adopted differential privacy techniques to protect respondents' personal information in the census.
Early differential privacy patents include patent #7698250B2, first filed on December 16th, 2005 by Microsoft with inventors Cynthia Dwork and Frank D. McSherry. The patent described differentially private systems and methods for controlling privacy loss during database participation by introducing an appropriate noise distribution based on the sensitivity of the query. The patent was granted and published on April 13th, 2010.
A number of companies and institutions have been granted patents related to differential privacy, including Apple, Microsoft, and NortonLifeLock. The table below shows a list of patents related to differential privacy.
Patents - Differential Privacy
Differentially private algorithms add random noise to query results. This decreases the influence of individual records, preventing attackers from breaching the privacy of people within the data set.
Imagine a database of credit ratings in which 3 people have a bad rating, 1510 have a normal rating, and 200 have a good rating. An attacker wants to know the number of people with a bad credit rating. Instead of returning the real answer (N = 3), a query of the database returns the true value combined with some random noise (N + L). The noise (L) is drawn from a zero-centered Laplace distribution with a standard deviation of 2.
The attacker begins querying the database and receives a different result each time.
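The simulation below sketches this behavior, assuming fresh, independent Laplace noise is drawn for each query (a standard deviation of 2 corresponds to a Laplace scale of 2/√2); the values here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

true_answer = 3               # people with a bad credit rating
std_dev = 2                   # standard deviation of the noise in this example
scale = std_dev / np.sqrt(2)  # Laplace scale b, since std = b * sqrt(2)

def noisy_query():
    # Each query returns the truth plus fresh Laplace noise.
    return true_answer + rng.laplace(0.0, scale)

# A handful of queries each return a different value...
print([round(noisy_query(), 1) for _ in range(5)])

# ...but averaging many queries converges toward the real answer.
print(round(np.mean([noisy_query() for _ in range(10_000)]), 2))
```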
By averaging multiple attempts, the attacker gets close to the real answer, and with more queries they can uncover sensitive data and breach the privacy of the data set. The width of a 90% confidence interval around the attacker's estimate shows it will take significantly more queries to be statistically confident in the real number of people with a bad credit rating.
While adding noise conceals the real answer somewhat, it can be circumvented by repeatedly querying the database. One can increase the number of queries required by increasing the level of noise introduced to the results (a higher standard deviation). However, to properly defend sensitive data, one cannot simply pick an arbitrary level of noise, such as the standard deviation of 2 in the example above. The level of noise needed to obscure the real answer is different for each function and depends on the function's sensitivity.
Take a function $f$ that maps a dataset to $k$ real numbers, $f : \mathcal{D} \rightarrow \mathbb{R}^k$. The sensitivity of the function is

$$\Delta f = \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_1$$
for data sets D1 and D2 differing by at most one element. The equation above states the sensitivity of a function is the largest possible difference one row can have on the result of the whole function, for any dataset. For example, a counting function has a sensitivity of 1 as adding or removing a single row from any dataset changes the count by at most 1. If the dataset were grouped using multiples of 5 (i.e., 0, 5, 10, 15, etc.), then the sensitivity would increase to 5. Determining the sensitivity of an arbitrary function is more difficult and became an area of significant research.
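As a quick, non-rigorous check, one can measure how much a query changes when any single record is removed from a particular dataset; for a counting query this matches the worst-case sensitivity of 1. The function names and data below are illustrative only.

```python
def count_bad_ratings(ratings):
    return sum(1 for r in ratings if r == "bad")

def empirical_sensitivity(f, dataset):
    # Largest change caused by dropping any single record from this dataset.
    # (True sensitivity is the maximum over *all* possible datasets, not just this one.)
    base = f(dataset)
    return max(abs(base - f(dataset[:i] + dataset[i + 1:])) for i in range(len(dataset)))

ratings = ["bad"] * 3 + ["normal"] * 1510 + ["good"] * 200
print(empirical_sensitivity(count_bad_ratings, ratings))  # 1 for a counting query
```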
For an attacker to learn nothing about an individual, any change in the attacker's beliefs must be insignificantly small, i.e., the output should be essentially the same whether it is computed on a dataset or on an identical dataset with a single person's records removed.
The level of noise introduced to query results is determined by the privacy loss parameter ε (epsilon). In the Laplace mechanism, the noise is drawn from a Laplace distribution whose scale is the sensitivity divided by ε, so ε bounds how much the results can change if a single piece of data is excluded from the dataset. The extent to which an attacker can change their belief about an individual is controlled by ε: it sets the boundary on the change in probability of any outcome.
ε determines the maximum difference between a query of the original data and the same query of a parallel database missing a single record.
A small value of ε means the results change very little when any single user's data is removed from the dataset, i.e., the results are more random and an attacker can learn only very little. If ε = 0, the distribution of query results is identical whether or not a record is removed. Higher values of ε produce more accurate but less private results. The optimal value of ε depends on the trade-off between privacy and accuracy in a given scenario and remains an open question.
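A small simulation (with illustrative values only) makes the trade-off concrete: with the Laplace mechanism the noise scale is the sensitivity divided by ε, so smaller values of ε spread the answers more widely.

```python
import numpy as np

rng = np.random.default_rng(1)
true_count, sensitivity = 3, 1

for epsilon in (0.1, 0.5, 1.0, 5.0):
    noisy = true_count + rng.laplace(0.0, sensitivity / epsilon, size=100_000)
    # Smaller epsilon -> noisier answers (more privacy); larger epsilon -> more accurate answers.
    print(f"epsilon={epsilon:>4}: typical error ~ {np.mean(np.abs(noisy - true_count)):.2f}")
```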
A randomized algorithm $K$ is ε-differentially private if, for all data sets $D_1$ and $D_2$ differing on at most one element, and for all subsets $S$ of the possible outputs of $K$:

$$\Pr[K(D_1) \in S] \leq e^{\varepsilon} \cdot \Pr[K(D_2) \in S]$$
The algorithm, or mechanism, K satisfying this expression addresses concerns that any participant has about their personal information being leaked. Even if a participant's information is removed from the dataset, no outputs would become significantly more or less likely.
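The inequality can be checked numerically for a simple case. The sketch below (an illustration, not a proof) runs a Laplace-mechanism counting query many times on two neighboring datasets whose true counts are 3 and 2, and compares the empirical probabilities of an arbitrary event S on each side.

```python
import numpy as np

rng = np.random.default_rng(7)
epsilon, sensitivity, n_trials = 1.0, 1, 200_000

def mechanism(true_count):
    # epsilon-DP Laplace mechanism applied to a counting query.
    return true_count + rng.laplace(0.0, sensitivity / epsilon, size=n_trials)

out_d1 = mechanism(3)  # D1 contains the target individual
out_d2 = mechanism(2)  # D2 is D1 with the target's record removed

# Event S: "the reported count is at least 2.5".
p1, p2 = np.mean(out_d1 >= 2.5), np.mean(out_d2 >= 2.5)
print(round(p1 / p2, 2), "<=", round(np.exp(epsilon), 2))  # ratio stays below e^epsilon
```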
Beyond guaranteeing privacy, differential privacy also has the following characteristics:
- Composability: if two queries are answered with privacy levels ε1 and ε2, the combined result is guaranteed at a level of ε1 + ε2 (see the budget-tracking sketch after this list)
- Strength against arbitrary background information: the guarantee does not depend on what auxiliary information the attacker knows
- Security against post-processing: there are no restrictions on what an analyst can do with the results; they remain differentially private regardless of any post-processing
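A toy budget tracker (purely illustrative; real systems use more sophisticated accounting) shows how composability is typically used in practice: each answered query spends part of a total ε budget.

```python
class PrivacyBudget:
    """Toy epsilon accountant: sequential composition simply sums the epsilon spent per query."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted: refuse to answer")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)        # first query answered at epsilon_1 = 0.4
budget.spend(0.4)        # second query at epsilon_2 = 0.4; combined guarantee is 0.8
print(budget.remaining)  # 0.2 of the budget left for future queries
```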
There are two common approaches to differential privacy: global (sometimes called central) and local. The main difference between them is who is granted access to the raw data.
In global differential privacy a trusted central aggregator, or curator, has access to the raw data. Generally, this aggregator is a service or research organization collecting data about individuals. They receive user data without noise and are responsible for transforming it using a differentially private algorithm. The algorithm is only applied once at the end of the process before any analysis is published or shared with other parties.
When an individual's data is being queried, global differential privacy ensures they are able to deny their participation in the dataset used to produce the result, reducing the likelihood of re-identification. Global differential privacy improves accuracy, reducing the level of noise needed to produce valuable results at a low ε. It also protects against post-processing (including by attackers with access to auxiliary information).
Global differential privacy does require individuals to trust the central aggregator with their information. Moreover, with all the information held by a single organization, the risk of cyberattacks and data leaks increases. Other downsides include limiting analysts to questions that generate aggregates.
The alternative approach is local differential privacy in which the aggregator does not have access to the raw data. Instead, differentially private algorithms are applied locally to each user's data before transfer to the aggregator. The aggregator can compute statistics and publish results from this noisy data without further acting on the dataset. In theory, the aggregator could publish all the data they receive as it has already been anonymized locally.
With noise added to each individual's data, the total noise is higher, reducing accuracy and often leading to analysts needing a larger data set. However, the main advantage of local differential privacy is the removal of the trusted aggregator. Local differential privacy is a good alternative if the aggregates are too broad for the level of analysis required. With local differential privacy, individuals cannot deny participation in the data set but they can deny the contents of their records. A local approach to differential privacy also has great potential for supervised machine learning.
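A classic example of a locally applied mechanism is randomized response, sketched below (an illustration with made-up parameters, not any particular product's implementation). Each user flips coins on their own device before reporting a sensitive yes/no attribute, yet the aggregator can still estimate the population rate.

```python
import random

rng = random.Random(0)

def randomized_response(true_answer: bool) -> bool:
    # Classic randomized response (local DP with epsilon = ln 3):
    # answer honestly with probability 1/2, otherwise answer uniformly at random.
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

# Each user perturbs their own bit locally before it ever reaches the aggregator.
true_bits = [rng.random() < 0.3 for _ in range(20_000)]   # ~30% of users have the attribute
reported = [randomized_response(b) for b in true_bits]

# The aggregator sees only the noisy reports but can still estimate the true rate:
# P(report True) = 0.5 * p + 0.25, so p is roughly 2 * (observed_rate - 0.25).
observed = sum(reported) / len(reported)
print(round(2 * (observed - 0.25), 3))                    # close to 0.30
```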