University of Twente Student Theses

Membership inference attacks and synthetic data generation with differential privacy

Micali, G. (2022) Membership inference attacks and synthetic data generation with differential privacy.

Abstract:	Machine learning (ML) is one of the most popular methods for data analysis. A carefully chosen design supplied with a set of training data yields a predictive model for the problem. However, it has been documented that parts of the training data, or even complete records, can be extracted from the model, which exposes the model to privacy attacks. One such example is a membership inference attack (MIA), in which the attacker uses the trained model to determine whether a single record was in the training data. Differential privacy (DP) is the preferred tool for obtaining strong and quantifiable privacy guarantees, because it introduces randomization into the algorithm, which obfuscates the private data in the model. However, noise and randomness reduce the utility of the data analysis, so a balance must be found between privacy and utility. How to resolve this trade-off is a highly non-trivial question, and addressing it is the main objective of this project. To show that such a balance is possible, we worked on two aspects simultaneously: attacking and defending the model. An attack takes advantage of the fact that ML models are trained multiple times over the same training data set, so the outcome for a point in the data set can be predicted more easily. A defence is therefore needed in response to the increased concerns about the privacy of individuals whose data is used during training. In more detail, we show that the effectiveness of an MIA is reduced when the attacked ML model is trained using a DP learning algorithm, in particular DP stochastic gradient descent (DP-SGD) and DP Frank-Wolfe (DP-FW), which are suitable for privatization. Concretely, we run two metric-based attacks on three simple ML models, Linear Regression, Logistic Regression and Multiclass Logistic Regression, each trained using both DP-FW and DP-SGD, and then compare the results.
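As an illustration of what a metric-based attack can look like, the following is a minimal sketch of a loss-threshold attack: a record is guessed to be a training member when the model's loss on it falls below a threshold. The choice of the loss as the metric, the function names, and the example loss values are all assumptions made for illustration; the abstract does not state which metrics the thesis uses.

```python
import numpy as np

def loss_threshold_attack(losses, threshold):
    """Guess 'member' (True) for every record whose loss is below the threshold."""
    return np.asarray(losses) < threshold

def attack_accuracy(member_losses, nonmember_losses, threshold):
    """Fraction of correct membership guesses over both groups of records."""
    guesses_in = loss_threshold_attack(member_losses, threshold)
    guesses_out = loss_threshold_attack(nonmember_losses, threshold)
    # Correct guesses: members flagged as members, non-members not flagged.
    correct = guesses_in.sum() + (~guesses_out).sum()
    return correct / (len(member_losses) + len(nonmember_losses))

# Hypothetical per-record losses of an overfitted model (illustrative values):
# training records tend to have a lower loss than unseen records.
member_losses = [0.05, 0.10, 0.20, 0.90]
nonmember_losses = [0.40, 0.70, 1.10, 0.15]

acc = attack_accuracy(member_losses, nonmember_losses, threshold=0.3)
```

An accuracy close to 0.5 on a balanced evaluation set would mean the attack does no better than a random guess, which is the baseline against which a defence is judged.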
The results show that implementing DP decreases the effectiveness of these attacks without compromising the target accuracy of the ML models too much. In particular, both DP-FW and DP-SGD provide satisfactory privacy protection, but DP-SGD has better computational performance and also converges faster. Furthermore, the effectiveness of our MIAs is model dependent. The most meaningful results are obtained on Multiclass Logistic Regression, where the defence through DP is sharper. For the other models, the effects of DP are only clearly visible when we impose additional assumptions on the data sets used for both training and testing. Without these assumptions, the MIA is barely distinguishable from a random-guess attack. The experiments for Logistic and Multiclass Logistic Regression also show that the attacks are highly sensitive to changes in a threshold that measures how much the model overfitted the data set. Such a threshold therefore depends on the data, and it is needed to build an adequate metric-based attack. In principle, this information must be kept private as well, so we also explain how to set the threshold adequately. In the second part of the thesis, we switch points of view. Instead of creating a mechanism that guarantees private computation over a data set, we generate artificial data with the purpose of preserving privacy. Synthetic data is created using different types of algorithms, such as Multiplicative Weights. The output is a data set whose statistical properties are similar to those of the original data, but which does not reveal any information about the real data. More specifically, we create synthetic data by minimizing the error incurred when querying the data set over a fixed set of statistical queries. Such a formulation yields a saddle point optimization problem, for which different types of regularization are used.
Since real data sets usually do not contain many repetitions of the same individual's data, more focus was given to regularizations that promote high entropy.
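The idea of fitting synthetic data to a fixed set of statistical queries with Multiplicative Weights can be sketched as follows. This is a simplified illustration in the spirit of the MWEM algorithm, not the thesis's own method: the histogram representation, the 0/1 counting queries, the even split of the privacy budget between selection and measurement, the assumption that the record count is public, and all names and example values are assumptions.

```python
import numpy as np

def mwem(true_hist, queries, epsilon, rounds, rng):
    """Simplified MWEM-style sketch: returns a synthetic histogram.

    true_hist: counts over a finite domain (the private data).
    queries:   0/1 matrix, one counting query per row (sensitivity 1).
    """
    true_hist = np.asarray(true_hist, dtype=float)
    queries = np.asarray(queries, dtype=float)
    n = true_hist.sum()  # record count, assumed to be public here
    synth = np.full(true_hist.shape, n / true_hist.size)  # uniform start
    eps_half = epsilon / (2 * rounds)  # budget per select / measure step
    true_answers = queries @ true_hist
    for _ in range(rounds):
        # Select a badly approximated query with the exponential mechanism.
        scores = np.abs(queries @ synth - true_answers)
        logits = eps_half * scores / 2
        probs = np.exp(logits - logits.max())  # shifted for numerical stability
        probs /= probs.sum()
        i = rng.choice(len(queries), p=probs)
        # Measure the selected query with Laplace noise (sensitivity 1).
        measured = true_answers[i] + rng.laplace(scale=1 / eps_half)
        # Multiplicative weights update towards the noisy measurement.
        synth *= np.exp(queries[i] * (measured - queries[i] @ synth) / (2 * n))
        synth *= n / synth.sum()  # renormalise to the original record count
    return synth

# Hypothetical data: 100 records over a domain of 8 values, 4 counting queries.
rng = np.random.default_rng(0)
true_hist = np.array([40, 30, 20, 10, 0, 0, 0, 0])
queries = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
])
synthetic = mwem(true_hist, queries, epsilon=1.0, rounds=10, rng=rng)
```

The multiplicative update keeps every entry of the synthetic histogram strictly positive, which is one way such schemes favour spread-out, high-entropy solutions over ones that concentrate mass on a few records.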
Item Type:	Essay (Master)
Faculty:	EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:	31 mathematics
Programme:	Applied Mathematics MSc (60348)
Link to this item:	https://purl.utwente.nl/essays/92447