Enhanced Membership Inference Attacks against Machine Learning Models
Abstract
How much does a given trained model leak about each individual data record in its training set? Membership inference attacks are used as an auditing tool to quantify the private information that a model leaks about the individual data points in its training set. The attacks are influenced by different uncertainties that an attacker has to resolve about the training data, the training algorithm, and the underlying data distribution. Thus, the success rates of many attacks in the literature do not precisely capture the information leakage of models about their data, as they also reflect other uncertainties that the attack algorithm has. In this paper, we explain the implicit assumptions and simplifications made in prior work using the framework of hypothesis testing. We also derive new attack algorithms from this framework that achieve a high AUC score, while also highlighting the different factors that affect their performance. Our algorithms capture a very precise approximation of the privacy loss in models, and can be used as a tool to perform an accurate and informed estimation of privacy risk in machine learning models. We provide a thorough empirical evaluation of our attack strategies on various machine learning tasks and benchmark datasets.
1 Introduction
Machine learning systems have come under intense scrutiny from regulatory authorities in the past few years. In particular, Article 35 of the GDPR (https://gdprinfo.eu/art35gdpr/) emphasizes the importance of performing a data protection impact assessment (DPIA) when building machine learning systems on sensitive data. Veale et al. (2018) argue that machine learning models could themselves be considered personal data, due to their susceptibility to inference attacks that can recover sensitive information about the training data from the models alone. Membership inference attacks (Homer et al., 2008; Dwork et al., 2015; Shokri et al., 2017) and reconstruction attacks (Dinur and Nissim, 2003; Song et al., 2017; Carlini et al., 2020) are the main inference attacks that highlight, and can quantify, the privacy risk of releasing aggregate information computed on sensitive data (Dwork et al., 2017). The focus of this paper is on membership inference attacks for measuring privacy risk. Organizations such as the ICO (UK) and NIST (US) have highlighted membership inference as a potential confidentiality violation and privacy threat to the training data (Murakonda and Shokri, 2020). This has led to the development of open-source tools (https://github.com/privacytrustlab/ml_privacy_meter) and capabilities in widely-used ML libraries (https://blog.tensorflow.org/2020/06/introducingnewprivacytestinglibrary.html) for measuring privacy risk from machine learning models using membership inference attacks.
Although the approach of quantifying privacy risk through membership inference attacks is gaining traction, the attack success, as measured in many works, cannot be completely attributed to information leakage from the models and hence to their privacy risk. Various factors, such as the distribution of the training data or a mismatch between the train and test distributions, may lead to an overestimate or underestimate of the actual privacy risk of the model (Erlingsson et al., 2019; Humphries et al., 2020). Theoretical analyses that connect the success of membership inference to privacy risk through the framework of differential privacy avoid this issue by slightly modifying how the attack performance is measured (Yeom et al., 2018; Jagielski et al., 2020; Nasr et al., 2021; Malek et al., 2021). Instead of measuring the leakage from a particular model, these works aim at measuring the worst-case leakage of the training algorithm (for a given model architecture): they construct multiple models with and without one training point, keeping the rest of the training set fixed. The performance of the attack (false positive and false negative errors) is then computed over these sampled models. However, in practice and during auditing, the performance needs to be computed based on the points in the training set versus the population data (test set) for a given fixed model. Various works rely on measuring privacy risk through the performance of membership inference attacks, but the subtle differences in how the attack is formulated mean that they might attribute different causes of attack success to privacy loss. When auditing machine learning models, we need to pay attention to what exactly we are measuring, and how it relates to the information leakage of the machine learning algorithm rather than to other factors such as the prior knowledge of the attacker.
With few exceptions, previous membership inference attacks are designed with the objective of achieving high overall performance, i.e., to succeed for most member (or non-member) data points of most target models. Therefore, previous attacks train shadow models Shokri et al. (2017) (or reference models Long et al. (2020)) on datasets randomly sampled from an overall population, to mimic the general behavior of target models on member (or non-member) data points. This general behavior, however, does not necessarily capture the behavior of specific target models on specific target samples. As a result, the performance of shadow model-based attacks is limited, and does not precisely measure the model-specific and/or sample-specific information leakage.
In this work, we focus on designing membership inference attacks that more precisely measure what a specific target model leaks about each individual training data point, in a binary hypothesis testing framework. We start from the (constant) loss threshold attack derived from the LRT (likelihood ratio test), and study the central problem: how can we design different attack strategies (loss thresholds) for different target models and/or target samples? We derive multiple attack strategies (existing and new) from this hypothesis test formulation through approximations of the null hypothesis at different granularity. We also formulate the performance of an attack strategy for specific target models and/or specific target samples within the binary hypothesis testing framework, via the tradeoff between the attack’s false positive rate and true positive rate. Following this methodology, we not only derive existing shadow (reference) model-based attacks (Attacks S and R), and explain why they measure the overall leakage, but also design a new attack (Attack D) that offers a more accurate privacy audit for machine learning models, by constructing a model-dependent and sample-dependent attack strategy (loss threshold) via distillation. We empirically evaluate the attack performance (TPR-FPR curve and AUC score) and computation cost of our new attacks (notably Attack D[istillation]) and the shadow model-based prior attacks (Attack S[hadow] and R[eference]) on multiple datasets to compare the effectiveness of these strategies.
2 Related Work
The current work in this domain can be broadly grouped into three categories: 1) empirical works improving existing attack strategies or adapting them to different settings and models, 2) theoretical/empirical analyses of the privacy risk in various systems using existing attack strategies, 3) explorations of the connections with differential privacy, used to establish lower bounds on leakage and/or to select privacy parameters. Below, we provide a brief summary of works in all three categories.
Empirical Attack Strategies for Membership Inference:
Shokri et al. (2017) demonstrated the vulnerability of machine learning models to membership inference attacks in the black-box setting, where the adversary has only query access to the target model. The attack algorithm is based on the concept of shadow models, which are models trained on some attacker dataset that is similar to the training data. Membership inference is modeled as a binary classification task for an attack model that is trained on the predictions of shadow models on the attacker dataset. A substantial body of literature followed this work, extending the attacks to different settings such as white-box analysis (Nasr et al., 2019; Sablayrolles et al., 2019; Leino and Fredrikson, 2020), label-only access (Li and Zhang, 2020; Choquette-Choo et al., 2021), federated learning (Nasr et al., 2019; Melis et al., 2019), transfer learning (Zou et al., 2020), and different types of data and models, such as aggregate location data (Pyrgelis et al., 2017), generative models (Hayes et al., 2019), language models (Carlini et al., 2019; Song and Shmatikov, 2019; Carlini et al., 2020), sentence embeddings (Song and Raghunathan, 2020), and speech recognition models (Shah et al., 2021). Multiple works have looked at improving the attack methodology through a more fine-grained analysis, or by reducing the background knowledge and the compute power required to execute the attack (Long et al., 2018; Song and Mittal, 2021; Salem et al., 2018). All these works follow the same attack framework for membership inference, but they either exploit a slightly different signal that is correlated with membership of a point in the training set, or find a more efficient way to exploit already known signals.
Privacy Risk Analysis with Membership Inference Attacks:
Homer et al. (2008) performed the first membership inference attack on genome data, to identify the presence of an individual’s genome in a mixture of genomes. Sankararaman et al. (2009) provided a formal analysis of this risk of detecting the presence of an individual from aggregate statistics computed on independent and binary attributes. Murakonda et al. (2021) extended this analysis to the case of releasing discrete Bayesian networks learned from data with dependent attributes. Backes et al. (2016) performed an analysis similar to that of (Sankararaman et al., 2009), but for microRNA data (aggregate statistics computed on independent and continuous attributes). Dwork et al. (2015) provided a more extensive analysis for the case where the released statistics are noisy and the attacker has only one reference sample to perform the attack. The key results of all these works establish the privacy risk of releasing aggregate statistics by quantifying the success of membership inference attacks as a function of the number of statistics released and the number of individuals in the dataset. Similar attempts were made to analyze the privacy risk of machine learning models through membership inference via the lens of mutual information (Farokhi and Kaafar, 2020) and generalization error (Yeom et al., 2018; Del Grosso et al., 2021). Beyond these theoretical analyses, membership inference attacks are also used to empirically study the tradeoffs between privacy and other desirable characteristics of machine learning models, such as fairness (Chang and Shokri, 2020), robustness to adversarial examples (Song et al., 2019), and model explanations (Shokri et al., 2021).
Differential Privacy and Membership Inference:
The definitions of differential privacy (Dwork et al., 2006) and membership inference (Homer et al., 2008; Dwork et al., 2015; Shokri et al., 2017) are very closely connected, and the hypothesis testing interpretation of differential privacy provides a clear view of the relationship between them. Satisfying differential privacy is equivalent to imposing a bound on the ability to distinguish any two neighboring datasets that differ by the presence of one individual, i.e., on inferring the presence/absence of that individual. The bound is stated as a tradeoff between the type-I and type-II errors in distinguishing the two neighboring datasets. The hypothesis testing interpretation of differential privacy is very useful in deriving tight compositions (Kairouz et al., 2015) and has even motivated a new relaxed notion of differential privacy called f-DP (Dong et al., 2019). By definition, differentially private algorithms bound the success of membership inference attacks in distinguishing between two neighboring datasets. Multiple works (Yeom et al., 2018; Erlingsson et al., 2019; Humphries et al., 2020), each improving on the previous, have provided upper bounds on the success of membership inference attacks as a function of the differential privacy parameters. Jayaraman and Evans (2019) evaluated the performance of membership inference attacks on machine learning models trained with different values of epsilon under different relaxed notions of differential privacy. Rahman et al. (2018) also use membership inference attacks to measure the privacy loss of models trained with differentially private algorithms. The empirical performance of membership inference attacks has also been used to provide lower bounds on the privacy guarantees achieved by various differentially private algorithms (Jagielski et al., 2020; Nasr et al., 2021; Malek et al., 2021).
The key difference between the empirical analysis of membership inference in the last three works and that in other works is that they simulate the exact adversary of differential privacy: they train multiple models with and without one particular training point, keeping the rest of the training set fixed. The performance of the attack (false positive and false negative errors) is averaged over these models, whereas in the other works the model is fixed and the performance is computed as an average over points in the training and test sets. This simulation of the DP adversary helps remove the effects of other points in the dataset when measuring the leakage through the model about a particular point of interest.
3 Attack Framework
Our objective is to design a framework that enables auditing the privacy loss of machine learning models in the black-box setting (where only the model's outputs, and not its parameters or internal computations, are observable). This framework needs three elements: (i) the inference game as the evaluation setup; (ii) the indistinguishability metric to measure the privacy risk; and (iii) the construction of membership inference attacks as hypothesis tests. The notion of privacy underlying our framework is primarily based on differential privacy, and multiple pieces of this framework are generalizations of existing inference attacks against machine learning algorithms. We present the important design choices in constructing and evaluating membership inference attacks, for the purpose of a precise privacy audit. We identify different sources of uncertainty that influence the error of inference attacks, and that can lead to miscalculation of the privacy loss of a model.
We quantify the privacy loss of a model in a hypothetical inference game between a challenger and an adversary. We are given a private training set $D$, and a model $\theta$ which is trained on $D$ using a training algorithm $\mathcal{T}$. The challenger samples a random data point from the training set (a member), and a random data point from the data population outside the training set (a non-member). He then randomly selects one of the two (member or non-member) with probability $\frac{1}{2}$, and shares the selected data point $z$ and the model's output with the adversary. The adversary's task is to determine if the data point is a member or not (i.e., to guess the challenger's secret bit $b$).
Footnote: We need to emphasize that there is a difference between the privacy loss of a machine learning algorithm and that of the specific models trained with the algorithm. A model is a given instance of a training algorithm, thus its leakage needs to be computed with respect to the individual data records in its specific training set. This subsequently means that the privacy loss of an algorithm varies depending on the randomness in the sampling of its training data, and the randomness of the training algorithm. One can analyze the privacy loss of an algorithm as, for example, its worst-case privacy loss (as in differential privacy). Differentially private algorithms enforce an upper bound on the privacy loss over all models with respect to all possible training data. Thus, in the inference game for differential privacy, the privacy loss of the training algorithm would be the worst-case privacy loss over all choices of $D$, $\theta$, and $z$ in our inference game for model privacy.
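The game above can be simulated end-to-end on toy loss values. The sketch below is illustrative only (the function names and the loss distributions are our own, not the paper's): the challenger flips a fair coin $b$ and reveals the loss of either a random member or a random non-member, and a simple loss-threshold adversary guesses $b$.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_inference_game(member_losses, nonmember_losses, attack, n_rounds=2000):
    # Simulate the game: the challenger flips a fair coin b; for b = 1 it
    # reveals the loss of a random member, otherwise of a random non-member.
    # The adversary's guess is compared against b over many rounds.
    correct = 0
    for _ in range(n_rounds):
        b = int(rng.integers(0, 2))
        loss = rng.choice(member_losses if b == 1 else nonmember_losses)
        correct += int(attack(loss) == b)
    return correct / n_rounds

# Toy setting (assumed, for illustration): members tend to have lower loss.
member_losses = rng.normal(0.1, 0.05, 500).clip(min=0.0)
nonmember_losses = rng.normal(0.5, 0.2, 500).clip(min=0.0)
attack = lambda loss: int(loss < 0.3)  # guess "member" when the loss is small
accuracy = play_inference_game(member_losses, nonmember_losses, attack)
```

With well-separated loss distributions like these, even a fixed-threshold adversary wins clearly more than half the rounds; the rest of the paper is about how to choose that threshold well.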
We use an indistinguishability measure (which is the basis of differential privacy) to define the privacy of the individual training data of a model. According to this measure, the privacy loss of the model with respect to its training data is the adversary's success in distinguishing between the two possibilities over multiple repetitions of the inference game. Naturally, the inference attack is a hypothesis test, and the adversary's error is composed of the false positive error (i.e., inferring a non-member as a member) and the false negative error of the test. In practice, the error of the adversary in each round of the inference game depends on multiple factors:
• The true leakage of the model $\theta$ about the target data $z$ when $z \in D$.
• The uncertainty (belief or background knowledge) of the attack algorithm about the population data.
• The adversary's uncertainty about the training algorithm $\mathcal{T}$.
• The uncertainty about all training data except the target data, i.e., $D \setminus \{z\}$.
• The attack's dependency on the target data $z$ and the model $\theta$.
In the ideal setting, we only want the attack error to depend on the true leakage of the model about the target data $z$ (i.e., whether the same model trained with and without $z$ is distinguishable). To this end, and to cancel out the effect of other uncertainties, the attack algorithm and the evaluation setup of the inference game need to be designed based on the following principle: the population data used for constructing the attack algorithm, and for evaluating the inference game, need to be similar, in distribution, to the training data. This is to minimize the impact of the prior belief (what could have been sampled for the training set) on the performance of the inference attack. This is not hard to achieve, as the entire process (of constructing the hypothesis testing attack, and evaluating it) is controlled by the auditor. By violating this principle, we might overestimate the privacy loss (by making the test dependent on a distinct prior knowledge) or underestimate it (by evaluating the inference attack on a population data distribution for which it was not constructed).
Another crucial requirement is that the privacy audit needs to output a detailed report which captures the uncertainty of the attack. Reporting just the accuracy of the attack, as is mostly done in the literature, is not informative. Given that the attack is a hypothesis test, the audit report needs to include an analysis of the error versus the power of the test: if we can tolerate a certain false positive rate for the inference attack, what would the true positive rate of the attack be, over random samples of member and non-member data? The area under the curve for such an analysis reflects the chance that the membership of a random data point from the population or the training set can be inferred correctly.
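Such an error-versus-power report can be produced directly from member and non-member loss samples. A minimal sketch (the function name and toy data are illustrative) that sweeps the loss threshold, records the TPR-FPR tradeoff, and integrates the curve with the trapezoid rule:

```python
import numpy as np

def attack_tradeoff(member_losses, nonmember_losses):
    # Sweep the decision threshold over all observed losses and record, for
    # each threshold, the true positive rate (members correctly flagged) and
    # false positive rate (non-members wrongly flagged). "Positive" means
    # "predicted member", and a lower loss votes for membership.
    member_losses = np.asarray(member_losses, dtype=float)
    nonmember_losses = np.asarray(nonmember_losses, dtype=float)
    thresholds = np.sort(np.concatenate([member_losses, nonmember_losses]))
    tpr = np.array([0.0] + [(member_losses <= t).mean() for t in thresholds])
    fpr = np.array([0.0] + [(nonmember_losses <= t).mean() for t in thresholds])
    # Area under the TPR-FPR curve via the trapezoid rule.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    return fpr, tpr, auc

# Perfectly separable toy case: members always have lower loss.
fpr, tpr, auc = attack_tradeoff([0.1, 0.2], [0.8, 0.9])
```

On perfectly separable losses the AUC is 1; on identical member and non-member distributions it falls back to 0.5, the random-guessing baseline.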
4 Constructing Membership Inference Attacks
The adversary in the membership inference game can observe the output of a target machine learning model $\theta$, trained on an unknown dataset $D$. He also gets a target data point $z$ as input, and is expected to output 0 or 1, to guess whether the sample is in the dataset $D$ or not. We use the likelihood ratio test (LRT) as the most powerful criterion for choosing among the membership hypotheses, under the following assumptions.

• The adversary knows, and can sample from, the underlying data distribution $\pi$ over the population.

• A randomized training algorithm $\mathcal{T}$ takes in a training dataset $D$, which consists of $n$ i.i.d. samples from the data population, and produces a model $\theta$ that incurs a low loss $\ell(\theta, z)$ on the points $z \in D$. We denote by $P_{\theta \mid D}$ the posterior distribution of the trained model $\theta$ given the training dataset $D$.
Definition 4.1 (Approximated LRT for Membership Inference)
Let $(\theta, z)$ be random samples from the joint distribution of the target model and target data point, specified by one of the following membership hypotheses.

(1) $H_0$ (non-member): $z \sim \pi$, $D \sim \pi^n$, $\theta \sim P_{\theta \mid D}$, with $z$ sampled independently of $D$

(2) $H_1$ (member): $z \sim \pi$, $D \sim \pi^{n-1}$, $\theta \sim P_{\theta \mid D \cup \{z\}}$

The likelihood functions of hypotheses $H_0$ and $H_1$, given the observed target model $\theta$ and target data point $z$, are as follows (detailed derivations are in the Appendix).

(3) $p(\theta, z \mid H_0) = \pi(z) \cdot \mathbb{E}_{D \sim \pi^n}\left[\, p(\theta \mid D) \,\right]$

(4) $p(\theta, z \mid H_1) = \pi(z) \cdot \mathbb{E}_{D \sim \pi^{n-1}}\left[\, p(\theta \mid D \cup \{z\}) \,\right]$
Now we follow the previous construction of the Bayes-optimal membership inference attack Sablayrolles et al. (2019), and model the posterior distribution of the trained model as follows.

(5) $p(\theta \mid D) \;\propto\; \exp\!\left(-\frac{1}{T} \sum_{z' \in D} \ell(\theta, z')\right)$

where $T$ is a temperature constant that allows for different randomness in different training algorithms $\mathcal{T}$. This posterior holds for Bayesian learning algorithms, such as stochastic gradient descent Polyak and Juditsky (1992). Intuitively, for $T \to 0$, equation 5 becomes concentrated around the optimum of the loss function, and recovers deterministic MAP (Maximum A Posteriori) inference. For $T = 1$, equation 5 captures Bayesian posterior sampling under a uniform prior, with log-likelihood based on the loss function Welling and Teh (2011). Therefore, the LRT statistic can be computed as follows (detailed derivations are in the Appendix).

(6) $\Lambda(\theta, z) \;=\; \frac{p(\theta, z \mid H_0)}{p(\theta, z \mid H_1)}$

(7) $\;\approx\; C \cdot \exp\!\left(\frac{1}{T}\, \ell(\theta, z)\right)$, treating the normalization term as a constant $C$

The LRT hypothesis test rejects $H_0$ when the LRT statistic is small. By equation 7, the rejection region can be approximated as follows.

(8) Reject $H_0$ (i.e., predict member) if $\ell(\theta, z) < c$, for a constant threshold $c$.
The above approximated LRT strategy compares a constant threshold $c$ with the loss of the target model $\theta$ on the target data point $z$. This recovers the commonly used (constant) loss threshold attacks in the literature Nasr et al. (2019); Sablayrolles et al. (2019). However, by the derivations in the appendix (equation 52), a more accurate LRT attack strategy would compare the loss with a threshold function that depends on the target model $\theta$. Therefore, the attack in equation 8 with a constant threshold can overly simplify the LRT, thus limiting its performance. This motivates our design of attacks with model-dependent and sample-dependent thresholds, with the objective of improving the attack performance.
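The constant-threshold baseline can be sketched in a few lines (the function name and toy values are illustrative): one global constant decides membership for every model and every sample alike.

```python
import numpy as np

def constant_threshold_attack(losses, c):
    # Baseline loss-threshold rule: predict "member" (1) exactly when the
    # target model's loss on the point falls below one global constant c,
    # regardless of the target model or the target sample.
    return (np.asarray(losses) < c).astype(int)

# Losses of the target model on three candidate points, and a fixed threshold.
guesses = constant_threshold_attack([0.05, 0.7, 0.29], c=0.3)
```

The attacks in the rest of this section replace the constant `c` with threshold functions of the target model and/or target sample.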
Our General Template for Attack Strategy.
Building on the approximate LRT in equation 8, we derive the following new variant of sample-dependent and/or model-dependent attack strategies.

(9) $M(\theta, z) = \mathbb{1}\left\{\ell(\theta, z) < c(\theta, z)\right\}$

where $c(\theta, z)$ is a threshold function chosen by the attacker under a tolerance $\alpha$ of false positive rate (FPR), i.e., $c$ satisfies the following equation, which controls the false positive rate.

(10) $\Pr_{(\theta, z) \sim p(\theta, z \mid H_0)}\left[\ell(\theta, z) < c(\theta, z)\right] = \alpha$
Our General Method for Determining the Attack Threshold
To build new sample-dependent and/or model-dependent attacks, we need to compute new attack thresholds that change with the target sample and/or target model. To be valid, a threshold must ensure a false positive rate of $\alpha$ for the corresponding attack, via equation 10. By approximating the joint distribution under the null hypothesis with the empirical distribution over sampled models and data points, we solve equation 10 and obtain a valid attack threshold for each constructed attack strategy.
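In all the instantiations below, solving equation 10 under an empirical distribution amounts to taking an $\alpha$-quantile of losses sampled under the null hypothesis. A sketch with illustrative names:

```python
import numpy as np

def fpr_calibrated_threshold(null_losses, alpha):
    # Empirical version of the FPR constraint: pick the threshold as the
    # alpha-quantile of losses observed under the null (non-member)
    # hypothesis, so that the attack "loss < threshold" flags roughly an
    # alpha fraction of non-members as members.
    return np.quantile(null_losses, alpha)

# Toy null losses (assumed Gaussian for illustration).
null_losses = np.random.default_rng(1).normal(1.0, 0.3, 10_000)
c = fpr_calibrated_threshold(null_losses, alpha=0.05)
empirical_fpr = (null_losses < c).mean()
```

The attacks S, P, R, and D differ only in which losses populate `null_losses`: shadow losses per label, the target model's losses on population data, reference model losses on the target point, or distilled model losses on the target point.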
Attack | Threshold            | Dependencies
S      | $c_\alpha(y)$         | label of sample
P      | $c_\alpha(\theta)$    | model
R      | $c_\alpha(z)$         | sample
D      | $c_\alpha(\theta, z)$ | model, sample
4.1 Attack S: MIA via Shadow models
Starting from Shokri et al. (2017), a substantial body of literature Nasr et al. (2019); Sablayrolles et al. (2019); Leino and Fredrikson (2020); Long et al. (2018); Song and Mittal (2021); Salem et al. (2018) has studied and improved the shadow model attack methodology. All these works follow the same attack framework for membership inference, but they either use a different attack statistic (such as loss or confidence score), or find a more efficient way to perform the attack.
In this paper, we focus on membership inference attacks based solely on a loss threshold. Therefore, we first formalize a strong baseline shadow model membership inference Attack S, based on Shokri et al. (2017), that effectively uses a label-dependent attack threshold $c_\alpha(y)$, as follows.

(11) $M_S(\theta, z) = \mathbb{1}\left\{\ell(\theta, z) < c_\alpha(y)\right\}$, for target sample $z = (x, y)$

where the threshold function $c_\alpha(y)$ satisfies equation 10 and ensures a false positive rate of $\alpha$.
To estimate a valid attack threshold under FPR $\alpha$, the attacker approximates the joint distribution in equation 3 with a set of shadow models and shadow data points.

(12) $D'_1, \ldots, D'_k \sim \pi^n$ (shadow datasets sampled from the population)

(13) $\theta'_i \sim P_{\theta \mid D'_i}, \quad i = 1, \ldots, k$ (shadow models)

(14) $z'_1, \ldots, z'_m \sim \pi$ (shadow data points), giving the sample set $S = \{(\theta'_i, z'_j)\}$

The shadow models trained on population datasets are samples from the posterior distribution in equation 3, and the shadow points are samples from the population data distribution $\pi$ in equation 3. Therefore, by replacing the joint distribution in equation 10 with the empirical distribution over the samples in $S$, we prove that the attack threshold under FPR $\alpha$ satisfies the following empirical constraint.

(15) $\frac{1}{k\, m} \sum_{i=1}^{k} \sum_{j=1}^{m} \mathbb{1}\left\{\ell(\theta'_i, z'_j) < c_\alpha(y'_j)\right\} = \alpha$

One sufficient condition for equation 15 to hold is that, for any fixed data label $y$, we have

(16) $\frac{1}{k \cdot |\{j : y'_j = y\}|} \sum_{i=1}^{k} \sum_{j : y'_j = y} \mathbb{1}\left\{\ell(\theta'_i, z'_j) < c_\alpha(y)\right\} = \alpha$

By solving equation 16 for every label $y$, we compute that $c_\alpha(y)$ equals the $\alpha$-percentile of the histogram of loss values over the samples $\{\ell(\theta'_i, z'_j) : y'_j = y\}$. This recovers the class-dependent attack threshold function that we use in Attack S.
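The class-dependent thresholding can be sketched as follows (the helper names and the synthetic two-class loss data are illustrative; `shadow_losses` pools the losses of all shadow models on all shadow points):

```python
import numpy as np

def label_dependent_thresholds(shadow_losses, shadow_labels, alpha):
    # Attack S sketch: one threshold per class label, computed as the
    # alpha-quantile of the shadow loss histogram restricted to that label.
    shadow_losses = np.asarray(shadow_losses)
    shadow_labels = np.asarray(shadow_labels)
    return {y: np.quantile(shadow_losses[shadow_labels == y], alpha)
            for y in np.unique(shadow_labels)}

def attack_s(target_loss, target_label, thresholds):
    # Predict member (1) iff the target model's loss on the point falls
    # below the threshold of the point's class.
    return int(target_loss < thresholds[target_label])

# Toy shadow losses: class 0 is "easy" (low loss), class 1 is "hard".
rng = np.random.default_rng(2)
losses = np.concatenate([rng.normal(0.5, 0.1, 1000), rng.normal(2.0, 0.3, 1000)])
labels = np.array([0] * 1000 + [1] * 1000)
c = label_dependent_thresholds(losses, labels, alpha=0.1)
```

A harder class gets a higher threshold, so the same raw loss value can mean "member" for one class and "non-member" for another.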
4.2 Attack P: model-dependent MIA via Population data
Can we design an attack with better performance by exploiting the dependence of the loss threshold on different models? In this section, we design a new model-dependent membership inference Attack P that applies a different attack threshold $c_\alpha(\theta)$ for each target model $\theta$. The rationale for this design is to construct an inference attack that exploits similar statistics as Attack S in a more accurate way, by computing them only on the target model (instead of on all shadow models), yet with less computation (without the need to train shadow models). Similar techniques utilizing population data to infer membership are used in genomics Homer et al. (2008); however, to the best of our knowledge, this technique has not previously been used in MIA for machine learning. The hypothesis test with a model-dependent attack threshold is as follows.
(17) $M_P(\theta, z) = \mathbb{1}\left\{\ell(\theta, z) < c_\alpha(\theta)\right\}$

where the threshold function $c_\alpha(\theta)$ satisfies equation 10 and ensures a false positive rate of $\alpha$. Using the chain rule of the joint distribution, and the independence of $\theta$ and $z$ under the null hypothesis in equation 1, we prove that

(18) $p(\theta, z \mid H_0) = p(\theta \mid H_0) \cdot \pi(z)$

where $p(\theta \mid H_0)$ is the target model distribution specified by the null hypothesis $H_0$. Using equation 18, we rewrite the false positive rate in equation 10 as follows.

(19) $\mathbb{E}_{\theta \sim p(\theta \mid H_0)}\left[\Pr_{z \sim \pi}\left(\ell(\theta, z) < c_\alpha(\theta)\right)\right] = \alpha$

One sufficient condition for equation 19 to hold is that, for any fixed target model $\theta$, we have

(20) $\Pr_{z \sim \pi}\left[\ell(\theta, z) < c_\alpha(\theta)\right] = \alpha$

To approximate the data distribution $\pi$ in equation 20, the attacker samples the following set of population data points from distribution $\pi$.

(21) $Z = \{z'_1, \ldots, z'_m\}, \quad z'_j \sim \pi$

Therefore, we approximate $\pi$ in equation 20 with the empirical distribution over the samples in $Z$.

(22) $\frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\left\{\ell(\theta, z'_j) < c_\alpha(\theta)\right\} = \alpha$

By solving equation 22 for every target model $\theta$, we compute that $c_\alpha(\theta)$ equals the $\alpha$-percentile of the histogram of loss values over the population data samples $Z$. This recovers the model-dependent attack threshold for Attack P.
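A sketch of Attack P on a toy "model" (here just a loss lookup; all names and the toy loss function are illustrative): only the target model itself is queried, on fresh population samples.

```python
import numpy as np

def attack_p(loss_fn, target_point, population_points, alpha):
    # Attack P sketch: the threshold is the alpha-quantile of the *target*
    # model's own losses on population samples; no shadow models are
    # trained. loss_fn is an illustrative handle for z -> loss on z.
    pop_losses = np.array([loss_fn(z) for z in population_points])
    c = np.quantile(pop_losses, alpha)
    return int(loss_fn(target_point) < c)

# Toy target "model" that memorized its training points (assumed behavior).
train_set = {0, 1, 2}
loss_fn = lambda z: 0.01 if z in train_set else 1.0 + 0.1 * (z % 5)
population = list(range(10, 110))  # disjoint from the training set
is_member = attack_p(loss_fn, target_point=1, population_points=population, alpha=0.1)
```

Because the threshold is recomputed per model, a generally over- or under-fit target model automatically shifts its own decision boundary.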
4.3 Attack R: sample-dependent MIA via Reference models
The privacy loss of the model with respect to the target data $z$ could be directly related to how susceptible the target data is to being memorized (e.g., being an outlier) Feldman (2020). Based on this finding, we design the membership inference Attack R that applies a different attack threshold $c_\alpha(z)$ for each target data sample $z$ (both its input features $x$ and its label $y$). This attack relies on training many reference models on population datasets (excluding the target data sample) and evaluating their loss on one specific target sample, as described in Long et al. (2020). This attack is also very similar to the membership inference attacks designed for summary statistics and graphical models, which use reference models to compute the probability of the null hypothesis Sankararaman et al. (2009); Murakonda et al. (2021). We define the test with a sample-dependent attack threshold in our case as follows.

(23) $M_R(\theta, z) = \mathbb{1}\left\{\ell(\theta, z) < c_\alpha(z)\right\}$

where the threshold function $c_\alpha(z)$ satisfies equation 10 and ensures a false positive rate of $\alpha$. Using equation 18, we rewrite the false positive rate constraint in equation 10 as follows.

(24) $\mathbb{E}_{z \sim \pi}\left[\Pr_{\theta \sim p(\theta \mid H_0)}\left(\ell(\theta, z) < c_\alpha(z)\right)\right] = \alpha$

One sufficient condition for equation 24 to hold is that, for any fixed data point $z$,

(25) $\Pr_{\theta \sim p(\theta \mid H_0)}\left[\ell(\theta, z) < c_\alpha(z)\right] = \alpha$

To approximate the target model distribution $p(\theta \mid H_0)$ in equation 25, the attacker then samples the following set of reference models.

(26) $\Theta' = \{\theta'_1, \ldots, \theta'_k\}, \quad \theta'_i \sim P_{\theta \mid D'_i}, \; D'_i \sim \pi^n$

Because the reference models are trained on population datasets, they serve as samples from the distribution $p(\theta \mid H_0)$. Therefore, we replace $p(\theta \mid H_0)$ in equation 25 with the empirical distribution over the samples in $\Theta'$, as follows.

(27) $\frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\left\{\ell(\theta'_i, z) < c_\alpha(z)\right\} = \alpha$

By solving equation 27 for a fixed target sample $z$, we compute that $c_\alpha(z)$ equals the $\alpha$-percentile of the histogram of loss values over the reference model samples $\Theta'$. This computes the sample-dependent attack threshold used in Attack R.
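A sketch of the resulting test (names and toy numbers are illustrative). Note how a hard, outlier-like point with large reference losses automatically gets a large threshold, so even a moderate target loss counts as strong evidence of membership:

```python
import numpy as np

def attack_r(target_loss, reference_losses_on_z, alpha):
    # Attack R sketch: the threshold for target point z is the alpha-quantile
    # of the losses that reference models -- trained on population data,
    # without z -- incur on z itself.
    c = np.quantile(reference_losses_on_z, alpha)
    return int(target_loss < c)

# Toy hard point: every reference model fits it poorly (losses around 3),
# while the target model's loss on it is only 0.4.
reference_losses = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 3.2, 2.7]
decision = attack_r(target_loss=0.4, reference_losses_on_z=reference_losses, alpha=0.1)
```

A target loss close to the reference losses (e.g. 3.4 here) would instead be rejected as consistent with non-membership.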
4.4 Attack D: model-dependent and sample-dependent MIA via Distillation
Can we design an attack that takes advantage of all the information available in the target model $\theta$ and the target data $z$ that can increase the chance of identifying the right hypothesis? We design a membership inference Attack D whose threshold function depends on both the target sample $z$ and the target model $\theta$, as follows.

(28) $M_D(\theta, z) = \mathbb{1}\left\{\ell(\theta, z) < c_\alpha(\theta, z)\right\}$

where the threshold function $c_\alpha(\theta, z)$ satisfies equation 10 and ensures a false positive rate of $\alpha$. However, the degree of freedom in the threshold function is still too large for us to directly solve equation 10. Therefore, we restrict $c_\alpha(\theta, z)$ to take the following form.

(29) $c_\alpha(\theta, z) = \tilde{c}_\alpha(D_\theta, z)$

where $D_\theta$ is the training dataset for the target model $\theta$. For simplicity of the derivation, let us first assume that the randomized training algorithm $\mathcal{T}$ has a deterministic inverse mapping $\mathcal{T}^{-1}(\theta) = D_\theta$, i.e., the training dataset for a given model is uniquely specified. (Later we also show how to approximate the training dataset of the model $\theta$ when the training algorithm is not invertible.) Plugging equation 29 into equation 10, we rewrite the FPR constraint as follows.

(30) $\Pr_{(\theta, z) \sim p(\theta, z \mid H_0)}\left[\ell(\theta, z) < \tilde{c}_\alpha(\mathcal{T}^{-1}(\theta), z)\right] = \alpha$
By the deterministic mapping from $\theta$ to $D_\theta$, the chain rule of the joint distribution, and the independence between the random variables $\theta$ and $z$ under the null hypothesis in equation 1, we prove that

(31) $p(\theta, z \mid H_0) = p(z \mid H_0) \cdot p(\theta \mid H_0)$

(32) $= \pi(z) \cdot p(\theta \mid H_0)$

(33) $= \pi(z) \cdot \mathbb{E}_{D \sim \pi^n}\left[\, P_{\theta \mid D}(\theta) \,\right]$

where $P_{\theta \mid D}$ is the distribution of the trained model under a fixed training dataset $D$, as specified in equation 5. Plugging equation 31 into equation 30, we rewrite the false positive rate constraint as

(34) $\mathbb{E}_{D \sim \pi^n}\, \mathbb{E}_{\theta \sim P_{\theta \mid D}}\left[\Pr_{z \sim \pi}\left(\ell(\theta, z) < \tilde{c}_\alpha(D, z)\right)\right] = \alpha$

One sufficient condition for equation 34 to hold is that, for any fixed sample $z$ and target model $\theta$,

(35) $\Pr_{\theta' \sim P_{\theta \mid D_\theta}}\left[\ell(\theta', z) < \tilde{c}_\alpha(D_\theta, z)\right] = \alpha$

where $D_\theta$ is the implicitly fixed training dataset for the target model $\theta$, and the distribution $P_{\theta \mid D_\theta}$ captures models retrained on the training dataset of the given target model, as follows.

(36) $\theta' \sim P_{\theta \mid D_\theta}$, with $D_\theta = \mathcal{T}^{-1}(\theta)$
To approximate this distribution of retrained models $P_{\theta \mid D_\theta}$, the attacker generates the following set of self-distilled models using the target model $\theta$.

(37) $\Theta^D = \{\theta^D_1, \ldots, \theta^D_k\}$, where each $\theta^D_i$ is trained on a distillation dataset of population inputs soft-labeled by the target model $\theta$

These distilled models approximate samples from the retrained model distribution $P_{\theta \mid D_\theta}$. This is because each $\theta^D_i$ is trained on a distillation dataset consisting of population data points which are soft-labeled with the target model $\theta$. This roughly recovers the target model trained on $D_\theta$, however without its potential dependence on $z$. Therefore, the attacker approximates $P_{\theta \mid D_\theta}$ in the false positive rate constraint of equation 35 with the empirical distribution over the distilled model samples in $\Theta^D$, as follows.

(38) $\frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\left\{\ell(\theta^D_i, z) < c_\alpha(\theta, z)\right\} = \alpha$

By solving equation 38 for a fixed target model $\theta$ and target point $z$, we compute that $c_\alpha(\theta, z)$ equals the $\alpha$-percentile of the histogram of loss values of the distilled models $\Theta^D$ on the target sample $z$. This recovers the model-dependent and sample-dependent attack threshold in Attack D.
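A toy end-to-end sketch of Attack D, with noisy least-squares line fits standing in for the randomized training algorithm and squared error as the loss (all names and this instantiation are our own; the paper's setting trains actual machine learning models):

```python
import numpy as np

rng = np.random.default_rng(3)

def distilled_models(target_predict, population_x, train_fn, n_models=50):
    # Attack D sketch: soft-label population inputs with the target model,
    # then retrain n_models times on that distillation set. The randomness
    # of train_fn stands in for the training algorithm's own noise.
    soft_labels = target_predict(population_x)
    return [train_fn(population_x, soft_labels) for _ in range(n_models)]

def attack_d(target_loss, models, loss_fn, z, alpha):
    # Threshold = alpha-quantile of the distilled models' losses on z.
    losses = np.array([loss_fn(m, z) for m in models])
    return int(target_loss < np.quantile(losses, alpha))

# Toy instantiation: the target model is the line y = 2x; "training" is a
# noisy polynomial fit; the loss is the squared prediction error.
population_x = np.linspace(0.0, 1.0, 50)
target_predict = lambda x: 2.0 * x
train_fn = lambda X, y: np.polyfit(X, y + rng.normal(0.0, 0.1, len(y)), 1)
loss_fn = lambda m, z: float((np.polyval(m, z[0]) - z[1]) ** 2)

models = distilled_models(target_predict, population_x, train_fn)
# A point lying exactly on the target line (zero target loss) is flagged
# as a member; distilled retraining noise keeps the threshold above zero.
member_guess = attack_d(target_loss=0.0, models=models, loss_fn=loss_fn,
                        z=(1.0, 2.0), alpha=0.1)
```

The key property the sketch illustrates is that the distilled models depend on the target model only through its input-output behavior, so the procedure needs no knowledge of the actual training set.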
4.5 Attack L: Leave-one-out attack
An ideal attack, one that removes the randomness over the training data (except for the target data $z$ that could potentially be part of the training set), is the leave-one-out attack. In this attack, the adversary trains reference models on $D \setminus \{z\}$, where the randomness only comes from the training algorithm $\mathcal{T}$. This attack is in the same class as Attack D, as it is a model-dependent and data-dependent attack. It also runs a similar hypothesis test; however, it requires assuming that the adversary already knows the data records in $D \setminus \{z\}$. This is a totally acceptable assumption in the setting of privacy auditing.
Note that Attack D aims at reproducing the results of the leave-one-out attack without assuming knowledge of the data records in $D \setminus \{z\}$.
4.6 Summary and Comparison of Different Attacks
For identifying whether a data point has been part of the training set of , here are the main underlying questions for the attacks we present in this section:
• How likely is the loss to be a sample from the distribution of loss of random samples from the population on (Attack P: the same model) (Attack S: models trained on the population data)? Depending on the tolerable false positive rate and the estimated distribution of loss, we reject the null hypothesis.
• How likely is the loss to be a sample from the distribution of loss of on (Attack R: models trained on population data) (Attack D: models trained to be as close as possible to the target model, using distillation) (Attack L: models trained on records from excluding )? Depending on the tolerable false positive rate and the estimated distribution of loss, we reject the null hypothesis.
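The bullet points above share a single decision rule; the attacks differ only in where the null-hypothesis loss samples come from. A minimal sketch (hypothetical function, a plain empirical quantile in place of the exact LRT):

```python
def membership_test(observed_loss, null_losses, alpha=0.05):
    """Shared skeleton of the attacks above (a sketch): each attack differs
    only in how `null_losses` is produced --
    P: losses of random population points on the target model itself;
    S: losses of population points on shadow models;
    R / D / L: losses of the target point on reference / distilled /
    leave-one-out models.
    Reject the null hypothesis ("non-member") when the observed loss falls
    below the alpha-quantile of the null distribution."""
    threshold = sorted(null_losses)[int(alpha * len(null_losses))]
    return observed_loss < threshold
```

Under this view, reducing the spread of `null_losses` (as Attacks D and L do) directly tightens the test.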
Effectively, these questions cover different types of hypothesis tests that could be designed for performing membership inference attacks. We expect these attacks to have different errors due to the uncertainties that influence their performance.
Attacks S and P are of the same nature. However, Attack S could have a higher error due to the imprecision of using other models to approximate the loss distribution of the target model on population data. Attacks R, D, and L are also of the same nature. However, we expect Attack D to be more confident in its tests, as it reduces the uncertainty over the other training data that can influence the model’s loss. Thus, we expect Attack D to be the closest to the strongest attack, the leave-one-out attack.
5 Empirical Evaluation
In this section, we empirically measure and study the performance of different membership inference attacks, in terms of how the internal mechanics of the attacks would play a role in their performance, and how the attacks compare to each other. Here we report the results of attacks on models trained using the Purchase100 dataset. Results on more datasets can be found in Appendix A.3. We provide details of the experimental setup in Appendix A.2.
Illustration of Attack Threshold
How differently does the attack threshold in Attacks S, P, R, and D depend on the target model and the target data point ? How do these different dependencies of the attack threshold on and affect an attack’s success? To investigate these questions, we plot, in Figure 1, the loss histograms that the different attacks use to compute their thresholds for different target models and member data points. The loss histogram approximates the distribution of the LRT statistic under the null hypothesis in equation 1, with different levels of uncertainty for different attacks, as discussed in Section 4. We see that model-dependence and sample-dependence of the attack threshold reduce the uncertainty in the loss histogram. Attacks with more concentrated loss histograms (i.e., Attack D) are more likely to succeed.
Evaluation of Attack Performance
Which of Attacks S, P, R, and D has the best performance? How do we evaluate the strength of an attack beyond its accuracy? How can we design fine-grained attack evaluation metrics to identify vulnerable samples and vulnerable models?

Firstly, we quantify the attacker’s average performance using two metrics: its true positive rate (TPR) and its false positive rate (FPR), over random samples of member and non-member data and (or) random target models. The ROC curve captures the trade-off between the TPR and FPR of an attack as its threshold is varied across different FPR tolerances . The AUC (area under the ROC curve) score measures the strength of an attack. We plot the ROC curves of all attacks on the Purchase100 dataset and compute their AUC scores in Figure 2. The attack with the highest AUC score on Purchase100 is Attack D, which has the lowest level of uncertainty, as discussed in Section 4. Table 2 compares the average performance (AUC) of the different attacks under more settings.
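Since the AUC equals the probability that a randomly chosen member is scored more member-like than a randomly chosen non-member (ties counting one half), it can be sketched without any library, given the attack’s scores (e.g., negated losses) on both groups:

```python
def roc_auc(member_scores, nonmember_scores):
    """Sketch: AUC as the probability that a random member outscores a
    random non-member; equivalent to integrating the ROC curve."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5   # ties count one half
    return wins / (len(member_scores) * len(nonmember_scores))
```

An AUC of 0.5 corresponds to random guessing; the closer it is to 1, the stronger the attack.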

Secondly, we quantify the vulnerability of one specific target model by evaluating the TPR and FPR of attacks on random samples of member and non-member data. If the AUC of an attack is significantly higher for one target model than for other models, then that model is more vulnerable. This facilitates comparing different target models, so as to release the one with the least information leakage, i.e., the lowest AUC.

Thirdly, we quantify the vulnerability of one specific sample by evaluating the TPR and FPR of attacks over random target models. If the AUC of an attack is significantly higher for one target sample than for other samples, then is a vulnerable record. This facilitates identifying (and removing) vulnerable records from the training dataset.
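This per-sample auditing loop can be sketched as follows (the container layout, helper names, and cutoff value are all hypothetical): for each sample we compare the attack’s scores over target models that did and did not train on it, and flag samples whose per-sample AUC is high.

```python
def auc(pos, neg):
    # probability a positive outscores a negative (ties count one half)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def vulnerable_samples(scores_in, scores_out, cutoff=0.9):
    """Sketch: scores_in[z] / scores_out[z] hold an attack's membership
    scores for sample z over random target models that did / did not train
    on z. Samples whose per-sample AUC exceeds `cutoff` are flagged as
    vulnerable and are candidates for removal from the training set."""
    return [z for z in scores_in
            if auc(scores_in[z], scores_out[z]) > cutoff]
```

The same computation with the roles of samples and models swapped gives the per-model vulnerability described in the previous paragraph.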
Effect of Attack Hyperparameters
How many models do Attacks S, R, and D need to achieve good attack performance? How many data points per class does Attack P need to perform well? We plot the effect of varying these attack hyperparameters on the AUC scores in Figure 3. We observe that attack performance improves significantly as the number of reference or distilled models increases, for Attacks R and D respectively. For Attack S, we observe a slight increase in its AUC score with the number of shadow models. For Attack P, we similarly observe a slight increase in its AUC score with the number of data points per class.
Config | Train Acc. | Test Acc. | Attack S | Attack P | Attack R | Attack D
I   | 96.2 ± 0.031  | 52.5 ± 0.026  | 0.809 ± 0.017 | 0.822 ± 0.014 | 0.84 ± 0.023  | 0.876 ± 0.009
II  | 99.5 ± 0.004  | 65.4 ± 0.009  | 0.752 ± 0.008 | 0.755 ± 0.006 | 0.799 ± 0.009 | 0.821 ± 0.004
III | 100.0 ± 0.0   | 75.5 ± 0.004  | 0.687 ± 0.003 | 0.687 ± 0.003 | 0.755 ± 0.004 | 0.768 ± 0.002
IV  | 95.74 ± 0.01  | 71.71 ± 0.009 | 0.647 ± 0.004 | 0.55 ± 0.005  | 0.682 ± 0.009 | 0.70 ± 0.005
Comparing the Attacks
Besides attack strength, how differently do the attacks perform on random input target models and target points? Do the attacks have different levels of confidence on the same input? How do the different internal mechanics of the attacks change the attacker’s guess qualitatively? Do the attacks succeed on similar or different samples of member data and target models? How far is the attack performance from the ideal leave-one-out attack described in Section 4? Answering these questions requires understanding how and why the attacks perform differently, for which we perform the following detailed comparisons between attacks.

Similarity of Attacks with Each Other. The scatter plot comparing Attacks S and P in Figure 4 is roughly linear and centered around the diagonal, with slightly more points in the north-west than in the south-east. This shows that Attacks S and P almost always make similar membership guesses for train points, with similar (correlated) confidence levels (shown on the x-axis and y-axis of Figure 4), while Attack P performs slightly better than Attack S. Meanwhile, Attacks R and D tend to output different membership guesses with different confidence levels, and Attack D dominates Attack R in correctly guessing the membership of train points, as their comparison scatter plot is biased towards the north-west direction.

Gap Between Attacks and the Ground Truth. From Table 3, among all attacks, Attack D agrees with the ground truth the most on train points, and the least on test points. This matches our observation from Figure 1 that Attack D has a larger threshold under a given , which causes it to guess more points as members. Besides that, we see that the agreement rate between Attack S and Attack P is high both on train points and test points, which matches their linear comparison scatter plot in Figure 4. This may be because Attack P and Attack R both reduce one degree of uncertainty in the joint distribution under the null hypothesis , as discussed in Section 4.

Closeness of Attacks to the Ideal Leave-one-out Attack. One interesting observation from Table 3 is that, among all the attacks, Attack D agrees with Attack L the most, both on train points and on test points. We believe this is because Attack D is highly similar in nature to Attack L: it approximates the training dataset of the target model and performs retraining, as discussed in Section 4.
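The pairwise agreement rates in Table 3 reduce to a simple computation over the attacks’ binary membership guesses on a common set of (target model, target sample) inputs; as a sketch:

```python
def agreement_rate(guesses_a, guesses_b):
    """Sketch: fraction of target (model, sample) pairs on which two attacks
    output the same membership guess, as used to compare each attack against
    the ground truth (GT) and against the leave-one-out attack (L)."""
    assert len(guesses_a) == len(guesses_b)
    return sum(a == b for a, b in zip(guesses_a, guesses_b)) / len(guesses_a)
```

Applying this to every pair of attacks (and to the ground-truth labels) yields the agreement matrix reported in Table 3.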
GT  L  S  P  R  D  
GT  0.968  0.696  0.662  0.772  0.992  
L  0.41  0.692  0.654  0.796  0.968  
S  0.656  0.73  0.874  0.592  0.696  
P  0.664  0.706  0.948  0.586  0.662  
R  0.656  0.722  0.756  0.732  0.78  
D  0.372  0.822  0.716  0.708  0.692 
6 Summary
We provide a framework for auditing the privacy risk of machine learning models through membership inference attacks. The framework is used to derive attack strategies and to highlight the factors, beyond leakage from the models, that affect attack performance. We also empirically analyze the performance of these attack strategies against models trained on benchmark datasets.
References
 Membership privacy in microRNA-based studies. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 319–330.
 The secret sharer: evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284.
 Extracting training data from large language models. arXiv preprint arXiv:2012.07805.
 On the privacy risks of algorithmic fairness. arXiv preprint arXiv:2011.03731.
 Label-only membership inference attacks. In International Conference on Machine Learning, pp. 1964–1974.
 Bounding information leakage in machine learning. arXiv preprint arXiv:2105.03875.
 Revealing information while preserving privacy. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210.
 Gaussian differential privacy. arXiv preprint arXiv:1905.02383.
 Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
 Robust traceability from trace amounts. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp. 650–669.
 Exposed! A survey of attacks on private data. Annual Review of Statistics and Its Application, pp. 61–84.
 That which we call private. arXiv preprint arXiv:1908.03566.
 Modelling and quantifying membership information leakage in machine learning. arXiv preprint arXiv:2001.10648.
 Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959.
 LOGAN: membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, pp. 133–152.
 Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, pp. e1000167.
 Differentially private learning does not bound membership inference. arXiv preprint arXiv:2010.12112.
 Auditing differentially private machine learning: how private is private SGD? arXiv preprint arXiv:2006.07709.
 Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1895–1912.
 The composition theorem for differential privacy. In International Conference on Machine Learning, pp. 1376–1385.
 Stolen memories: leveraging model memorization for calibrated white-box membership inference. In 29th USENIX Security Symposium (USENIX Security 20), pp. 1605–1622.
 Membership leakage in label-only exposures. arXiv preprint arXiv:2007.15528.
 Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889.
 A pragmatic approach to membership inferences on machine learning models. In 2020 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 521–534.
 Antipodes of label differential privacy: PATE and ALIBI. arXiv preprint arXiv:2106.03408.
 Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706.
 Quantifying the privacy risks of learning high-dimensional graphical models. In International Conference on Artificial Intelligence and Statistics, pp. 2287–2295.
 ML Privacy Meter: aiding regulatory compliance by quantifying the privacy risks of machine learning. arXiv preprint arXiv:2007.09339.
 Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In IEEE Symposium on Security and Privacy (SP), pp. 1022–1036.
 Adversary instantiation: lower bounds for differentially private machine learning. arXiv preprint arXiv:2101.04535.
 Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30 (4), pp. 838–855.
 Knock knock, who’s there? Membership inference on aggregate location data. arXiv preprint arXiv:1708.06145.
 Membership inference attack against differentially private deep learning model. Transactions on Data Privacy, pp. 61–79.
 White-box vs black-box: Bayes optimal strategies for membership inference. In International Conference on Machine Learning, pp. 5558–5567.
 ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246.
 Genomic privacy and limits of individual detection in a pool. pp. 965.
 Evaluating the vulnerability of end-to-end automatic speech recognition models to membership inference attacks. Proc. Interspeech 2021, pp. 891–895.
 On the privacy risks of model explanations. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 231–241.
 Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 3–18.
 Information leakage in embedding models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 377–390.
 Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 587–601.
 Auditing data provenance in text-generation models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 196–206.
 Systematic evaluation of privacy risks of machine learning models. In 30th USENIX Security Symposium (USENIX Security 21).
 Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 241–257.
 Algorithms that remember: model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, pp. 20180083.
 Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688.
 Privacy risk in machine learning: analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282.
 Privacy analysis of deep learning in the wild: membership inference attacks against transfer learning. arXiv preprint arXiv:2009.04872.
Appendix A Appendix
A.1 Detailed derivation of the approximated LRT for membership inference
We first prove two useful approximation inequalities about the posterior distribution as follows.

For an arbitrary data point and an arbitrary dataset , we have
(39) Proof: By equation 5, we have
(40) (41) (42) (43) 
Let be i.i.d. samples from the data distribution . Then when is large enough, for any model parameter , we have
(44) Proof: This is ensured by the convergence in distribution of the posterior random variable of model on large number of training data samples , as .
We now offer details for deriving the approximated likelihood ratio test (LRT) for membership inference.
Definition A.1 (Approximated LRT for membership inference)
Let be random samples from the joint distribution of target model and target data point, specified by one of the following membership hypotheses.
(45)  
(46) 
The likelihood functions of hypotheses and , given the observed target model and target data point , are as follows.
Therefore, the LRT statistic is
(52) 
The LRT hypothesis test rejects when the LRT statistic is small. By equation 6, the rejection region can be approximated as follows.
(53) 
A.2 Experimental Setup
For each configuration, we train up to models on random IID splits of the data. Details of the datasets are given below. For the Purchase100 configurations, we use a 4-layer MLP with layer units [512, 256, 128, 64]. For CIFAR10, we use AlexNet and a 3-block VGGNet; for the CIFAR100 and MNIST configurations, we use a 2-layer CNN with filters [32, 64] and max pooling. For all target models on Purchase100, CIFAR100, and MNIST, we use SGD as the optimizer; for CIFAR10 we use Adam. All target models use categorical cross-entropy as the loss function.
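A minimal, library-free sketch of the Purchase100 target architecture described above (the input width of 600 features, the tanh activation, and the weight initialization are our illustrative assumptions, not the paper’s; only the hidden widths [512, 256, 128, 64] and the 100 output classes come from the text):

```python
import math
import random

def make_mlp(sizes=(600, 512, 256, 128, 64, 100), seed=0):
    """Sketch of a 4-hidden-layer MLP with units [512, 256, 128, 64] and a
    100-class output, as a list of weight matrices (nested lists).
    600 input features and the init scheme are hypothetical choices."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(layers, x):
    """Forward pass: linear layers with tanh on hidden layers, softmax output
    (matching the categorical cross-entropy objective mentioned above)."""
    h = x
    for i, w in enumerate(layers):
        h = [sum(wij * hj for wij, hj in zip(row, h)) for row in w]
        if i < len(layers) - 1:          # tanh on hidden layers only
            h = [math.tanh(v) for v in h]
    z = [math.exp(v - max(h)) for v in h]  # numerically stable softmax
    s = sum(z)
    return [v / s for v in z]
```

The attacks in this paper only need black-box access to such a model’s losses, so any framework implementation (e.g., a standard deep learning library) would serve equally well.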
• Target models: For each dataset, we randomly select 10 target models out of the trained models. The performance of Attacks S, P, and R is evaluated on all 10 models, whereas Attack D is evaluated on 3 of the models (due to its computational cost).
• Given a target model, we use the remaining models as shadow models for Attack S or reference models for Attack R. For Attack D, we use the training data of these models to distill the target model.
• Population data for Attack P: We sample 1000 data points per class for the target models in the Purchase100 and MNIST configurations, and 400 data points per class for the CIFAR10 and CIFAR100 configurations.
• Using models: For Configuration IV of Purchase100 and Configuration III of CIFAR10, we report the attack performance results using shadow, reference, and distilled models. For Configurations I and II of CIFAR10 we report the attack performance results using shadow, reference, and distilled models.
Datasets

The Purchase100 dataset is based on Kaggle’s "Acquire Valued Shoppers Challenge", which contains shopping histories for thousands of individuals (https://www.kaggle.com/c/acquirevaluedshopperschallenge). We use the simplified preprocessed purchase dataset provided by Shokri et al. [2017]. There are data points in this dataset, where each data point has binary features. The data points are clustered into classes representing different shopping styles.

The CIFAR10 and CIFAR100 datasets are widely used benchmark datasets for image classification. The CIFAR10 dataset consists of data points in classes, whereas the CIFAR100 dataset consists of data points in classes. Here each data point is a color image, and there are training images and test images in total.

The MNIST dataset is a widely used benchmark dataset for image classification. The MNIST dataset consists of data points in classes, where each data point is a handwritten digit image. In total, there are training images and test images.
A.3 Additional Empirical Results
CIFAR10 (III)  Purchase100 (IV) 
Config | Train Acc. | Test Acc. | Attack S | Attack P | Attack R | Attack D
I   | 96.2 ± 0.046 | 40.9 ± 0.029 | 0.870 ± 0.018 | 0.857 ± 0.023 | 0.874 ± 0.018 | 0.871 ± 0.007
II  | 97.8 ± 0.012 | 45.9 ± 0.010 | 0.860 ± 0.014 | 0.868 ± 0.008 | 0.858 ± 0.019 | 0.889 ± 0.011
III | 97.4 ± 0.004 | 68.2 ± 0.011 | 0.706 ± 0.011 | 0.708 ± 0.009 | 0.737 ± 0.014 | 0.742 ± 0.003
Config | Train Acc. | Test Acc. | Attack S | Attack P | Attack R | Attack D
I  | 97.4 ± 0.013 | 14.9 ± 0.009 | 0.959 ± 0.005 | 0.960 ± 0.004 | 0.964 ± 0.006 | 0.957 ± 0.0003
II | 97.9 ± 0.006 | 20.4 ± 0.006 | 0.944 ± 0.004 | 0.945 ± 0.003 | 0.945 ± 0.006 | 0.936 ± 0.0
Config | Train Acc. | Test Acc. | Attack S | Attack P | Attack R | Attack D
I  | 97.9 ± 0.004 | 95.8 ± 0.007 | 0.50 ± 0.004  | 0.50 ± 0.005  | 0.557 ± 0.009 | 0.549 ± 0.006
II | 98.6 ± 0.001 | 97.1 ± 0.002 | 0.496 ± 0.005 | 0.496 ± 0.006 | 0.551 ± 0.011 | 0.544 ± 0.004