Synthetic data is a hammer. Not everything is a nail.
Synthetic data was built to generate data, not to protect privacy.
February 2, 2022
Saibal Banerjee, Ph.D.
CTO & Co-founder
Synthetic data is popular in the MLOps space, but it’s often used for the wrong applications. It was designed to amplify data when there isn’t enough original data; it’s the wrong method for sensitive data that must be kept private, especially where GDPR requirements apply. Worse than its privacy pitfalls, it destroys data utility and breaks all relationships to the original sensitive data, which makes it hard to apply where artificial intelligence work demands precision. Think of life sciences, where the relationship between genomics and health data is important in determining health conditions. Every data scientist or MLOps professional aspires to achieve the highest correlation between the real data and the safe data.
Data Creation. Not Privacy
Synthetic data originated with engineers who had insufficient data for training their ML models and simply wanted to amplify their datasets. Privacy was not a requirement; amplification was the only goal, because not enough data existed to train the models. The objective was to ensure that the amplified data retained the statistical properties of the original dataset.
Privacy-enhancing technology (PET) companies that use synthetic data claim that their output is private, based on the belief that the generated fake data is in no way connected to the original data. According to multiple studies by credible global centers of excellence (see references below), this belief continues to be proven false. The statistics that synthetic data solutions model on sensitive data give away sensitive information. When new data is generated from those statistics, a generated data point may reproduce a sensitive value from the original data, which breaks privacy.
To address this privacy deficiency, synthetic data solutions attempt to patch their technique with differential privacy methods. Unfortunately, this patch adds so much noise that it significantly lowers utility and materially reduces the effectiveness of the ML models, rendering them unreliable.
Privacy and Scale
The best practice for MLOps professionals and data scientists to achieve quality output is to anonymize a large dataset, provided that the solution used is GDPR compliant (the highest privacy standard). Large datasets are a problem, of course, when the data is sensitive or regulated. Unlike synthetic data, which was created to deal with data that does not exist, anonymization was created for the situation where data to train ML models does exist but cannot be accessed due to privacy concerns. Anonymization guarantees privacy and has better data utility.
Instead of applying differential privacy as an afterthought for generating new data, an anonymization process must add differentially private noise to each sensitive data point from the outset to achieve privacy and utility. This eliminates the inaccuracy stemming from the sampling process that generates new data. As a result, data utility soars while data privacy is guaranteed to be high.
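To make the idea concrete, here is a minimal sketch of adding noise to each sensitive value directly. It uses the textbook Laplace mechanism on a clipped numeric field. That choice, along with the function names, the bounds, the sample ages, and the epsilon value, is an assumption made for illustration and is not taken from any particular anonymization product.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def anonymize_value(value: float, lower: float, upper: float,
                    epsilon: float, rng: random.Random) -> float:
    """Add Laplace noise to a single sensitive value (hypothetical helper)."""
    # Clip first so that one value can move the output by at most upper - lower.
    clipped = min(max(value, lower), upper)
    sensitivity = upper - lower
    return clipped + laplace_noise(sensitivity / epsilon, rng)


def anonymize_column(values, lower, upper, epsilon, seed=0):
    """Noise every value in place of sampling brand-new records."""
    rng = random.Random(seed)
    return [anonymize_value(v, lower, upper, epsilon, rng) for v in values]


if __name__ == "__main__":
    ages = [34, 51, 29, 62, 45]  # made-up sensitive values
    noisy = anonymize_column(ages, lower=18, upper=90, epsilon=1.0)
    for original, anonymized in zip(ages, noisy):
        print(f"{original} -> {anonymized:.1f}")
```

Each output value descends from exactly one input value, because no separate sampling step creates new records. A smaller epsilon means more noise and stronger privacy; suitable bounds and budgets depend on the data.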
Also, since this process adds noise to the original sensitive data instead of generating new noisy data, there is a relationship between an original data point and the anonymized data points arising from it. This relationship, which we call a “true relationship,” is an advantage that synthetic data can never hope to achieve.
It’s All About the Use Case
The only dependable use case for synthetic data is data amplification when privacy isn’t a concern. True anonymization can achieve that as well, because multiple anonymized data points can arise from a single sensitive data point through a process called supersampling, all while retaining privacy.
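The post does not spell out how supersampling works, so the sketch below simply assumes it means drawing several independent noisy copies of each sensitive value. The function names, the `source_key` field, the bounds, and the sample readings are all hypothetical. The sketch also splits one total privacy budget evenly across the copies, since under standard differential privacy composition, k releases at epsilon/k each cost epsilon in total.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, built from two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def supersample(records, copies, lower, upper, total_epsilon, seed=0):
    """Return `copies` noisy versions of every (source_key, value) pair.

    Assumption: the total budget is divided evenly, so each copy is
    released with total_epsilon / copies.
    """
    rng = random.Random(seed)
    per_copy_epsilon = total_epsilon / copies
    scale = (upper - lower) / per_copy_epsilon
    amplified = []
    for source_key, value in records:
        clipped = min(max(value, lower), upper)
        for _ in range(copies):
            # source_key is kept only to show which row each copy came from.
            amplified.append((source_key, clipped + laplace_noise(scale, rng)))
    return amplified


if __name__ == "__main__":
    # Made-up blood pressure readings keyed by a pseudonymous id.
    records = [("p1", 120.0), ("p2", 135.0), ("p3", 110.0)]
    amplified = supersample(records, copies=4, lower=80.0, upper=200.0,
                            total_epsilon=2.0)
    print(len(records), "records ->", len(amplified), "anonymized points")
```

The trade-off is visible in the `scale` line: more copies per record means a smaller budget per copy, and therefore more noise on each one.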
Because true anonymization enjoys the “true relationship” property, anonymized data can be used in place of its sensitive equivalent for any purpose for which the real data may be used. In addition, one can join two anonymized datasets, such as genomics and health data. The genomics data is the “Nature” side of a person: their predisposition for getting a disease. Their health data reveals their “Nurture” side, for example comorbidities. When joined, the two provide the likelihood of a person having the disease, a whole greater than the sum of its parts.
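As a rough illustration of such a join, the sketch below combines two already-anonymized tables on a shared key. The post does not say how the datasets are keyed, so the pseudonymous `pid` column, the field names, and every value here are made up for the example.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows on `key`, keeping only matching rows."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two rows into one combined record.
            joined.append({**row, **match})
    return joined


if __name__ == "__main__":
    # Hypothetical anonymized genomics table ("Nature" side).
    genomics = [
        {"pid": "a1", "risk_score": 0.71},
        {"pid": "b2", "risk_score": 0.18},
        {"pid": "c3", "risk_score": 0.44},
    ]
    # Hypothetical anonymized health table ("Nurture" side).
    health = [
        {"pid": "a1", "bmi": 31.2},
        {"pid": "b2", "bmi": 22.9},
        {"pid": "d4", "bmi": 27.5},
    ]
    for row in inner_join(genomics, health, "pid"):
        print(row)
```

Only rows present in both tables survive the join, so each combined record carries both sides for the same person. A join like this is possible because each anonymized row still corresponds to one original row.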
If both high data utility and privacy are needed, then true anonymization is the go-to method.
References
M. Elliot, “Final Report on the Disclosure Risk Associated with the Synthetic Data Produced by the SYLLS Team,” University of Manchester, October 2014.
L. Rocher, J.M. Hendrickx, Y.-A. de Montjoye, “Estimating the success of re-identifications in incomplete datasets using generative models,” Nature Communications, vol. 10, Art. 3069, July 2019.
M. Hittmeir, R. Mayer, A. Ekelhart, “A Baseline for Attribute Disclosure Risk in Synthetic Data,” CODASPY ’20: Proceedings of the Tenth ACM Conference on Data and Application Security and Privacy, March 2020, pp. 133-143.
K. El Emam, L. Mosquera, J. Bass, “Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation,” Journal of Medical Internet Research, vol. 22, no. 11, November 2020.