Data privacy is important to consumers and a material compliance obligation for the enterprises that serve them, yet it is difficult to achieve – and to prove – in the absence of clear standards. Pew Research reports that over 80% of Americans highly value their data privacy and don’t trust enterprises to protect it properly. “Privacy” itself is defined by the data privacy laws of the jurisdiction in which an enterprise uses sensitive data. Under GDPR – the most stringent data privacy law in force globally – applicability and enforcement continue to evolve. An enterprise playing the long game has already built a compliance framework that it continues to improve; one that has not faces significant financial penalties, with a maximum fine of €20 million or 4% of global annual turnover for the preceding financial year, whichever is higher.

Data privacy is especially pressing for enterprises that leverage big data to build AI that matters to their business, because they must create a new, safe data pipeline for data scientists, ML engineers, generative AI engineers, and data controllers – among many others – that minimizes the downside risk of data accessibility. Many aspects of data privacy bear on an enterprise’s risk mitigation approach; here we are talking about leveraging production data and real-world data to drive growth through data-centric product innovation. Sophisticated enterprises strive for the holy grail of data portability: anonymization. If data is properly anonymized under GDPR, it can be used for any purpose rather than being restricted to the purposes for which it was collected. If you want to dig deeper into anonymization under GDPR, please request our free tomtA.ai True Atomic Privacy Buyer’s Guide.
Anonymization describes personally identifiable data that has been rendered no longer identifiable – i.e., a state in which re-identification is unlikely. Once data is anonymized, the data owner is relieved of certain ongoing data-use requirements, such as the obligations to:
- Limit the collection of personal information to what is directly relevant and necessary to accomplish a specified purpose;
- Process any such personal information in compliance with GDPR; and
- Retain the data only for as long as is necessary to fulfill that purpose.
An anonymized dataset may be used for any other purpose – for example, combining it with other patient data to develop AI/ML that predicts further illness – without obtaining data subjects’ consent for the additional processing or relying on the “scientific or historical research purposes or statistical purposes” category under GDPR Article 89(1).
One way to anonymize data is to use privacy-enhancing technologies (“PETs”), which are recommended – but not endorsed – by the UK’s ICO. PETs span a range of techniques that deliver a protected or modified dataset that may or may not be suitably free from re-identification risk under GDPR; the burden of establishing that sensitive data is anonymized remains with the data controller. The devil is in the details, of course: what constitutes proper anonymization, and in what form the EU data protection authorities will accept it. GDPR Recital 26 sets out the test:
“To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments”.
This guidance can be quantified into a risk analysis framework: a tool that objectively determines the risk of re-identification based on realistic re-identification workflows. What the industry still lacks is a standard, easily portable privacy measure. Airbus and Boeing customers expect adequate instrumentation to ensure the safe and efficient operation of their aircraft; why don’t data product and PET purveyors, such as synthetic data companies, follow suit with transparent and verifiable privacy metrics? When we entered the market, the lack of transparency and objectivity was apparent, so we invested heavily in metrics that help our customers validate and trust the results of our True Atomic Privacy (TAP) data preparation tool. Because our customers use data at scale, it is imperative that all of their stakeholders can trust its privacy and precision – the consequences of getting either wrong are severe.
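For illustration only, here is a minimal sketch of how the objective factors named in Recital 26 (cost, time, available technology) might be scored inside such a framework. The factor names, scores, and weights below are hypothetical and are not tomtA’s model:

```python
from dataclasses import dataclass

@dataclass
class ReidentificationFactor:
    """One objective factor from Recital 26, scored 0.0 (hard for an adversary) to 1.0 (easy)."""
    name: str
    score: float   # estimated likelihood contribution, 0.0 - 1.0
    weight: float  # relative importance in the overall assessment

def reidentification_risk(factors: list[ReidentificationFactor]) -> float:
    """Weighted average of factor scores -- a simple stand-in for a fuller workflow-based analysis."""
    total_weight = sum(f.weight for f in factors)
    return sum(f.score * f.weight for f in factors) / total_weight

# Hypothetical assessment of one anonymized dataset against three Recital 26 factors.
factors = [
    ReidentificationFactor("cost of identification", score=0.2, weight=1.0),
    ReidentificationFactor("time required for identification", score=0.3, weight=1.0),
    ReidentificationFactor("available technology (e.g. linkage attacks)", score=0.5, weight=2.0),
]
print(f"Estimated re-identification risk: {reidentification_risk(factors):.2f}")  # 0.38
```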
tomtA’s True Atomic Privacy solution provides the user with privacy, accuracy, and utility metrics after every sensitive data transformation. All of our metrics are based on scientific and objective measures attained by TAP’s computational differential privacy + generative AI technology. Currently, tomtA has developed three privacy metrics, each addressing a different facet of the transformed data, which together provide a comprehensive analysis of the risk of data re-identification by an adversary. They are:
- Our Epsilon Loss Bound metric provides an information-theoretic guarantee of how well tomtA TAP can anonymize a sensitive dataset through noise addition and suppression. Loosely speaking, epsilon measures how well an adversary, observing anonymized outputs, can determine whether a particular individual does or does not belong to the dataset (a minimal illustration follows this list). tomtA’s computational differential privacy automatically determines the best privacy value for the particular sensitive dataset, unlike less sophisticated products that leave the user to set a “privacy budget” and guess the appropriate value.
- Our Attribute Disclosure Risk metric quantifies the information leaked about the attribute value in each cell of the sensitive table by comparing it with the corresponding cell of the anonymized table using Fano’s information-leakage measure; a generic sketch of a Fano-style bound follows this list. Prof. Robert M. Fano’s inequality is among the most widely accepted information-theoretic measures of privacy leakage in use today.
- Our Membership Disclosure Risk metric measures how well the mere presence of any individual in the sensitive dataset is protected (a simplified membership-inference check follows this list). This matters most for extremely sensitive datasets under GDPR – e.g. in a dataset of HIV patients, revealing that a single individual is included, even without revealing any of their record, harms their privacy: their condition is disclosed, which would be an unacceptable re-identification under GDPR.
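To make the epsilon guarantee concrete, here is a minimal, generic differential-privacy sketch using the textbook Laplace mechanism on a counting query. It illustrates what an epsilon bound means for an adversary; it is not TAP’s computational differential privacy, which selects and applies its parameters internally:

```python
import numpy as np

def laplace_count(data: list[int], epsilon: float, rng: np.random.Generator) -> float:
    """Release a noisy count. A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    return len(data) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
cohort = list(range(1000))   # 1,000 records in a hypothetical sensitive dataset
neighbour = cohort[:-1]      # the same dataset with one individual removed

# With small epsilon, the two releases are statistically hard to tell apart: any output is at
# most e^epsilon times more likely under one dataset than under its neighbour.
for eps in (0.1, 1.0, 10.0):
    with_member = laplace_count(cohort, eps, rng)
    without_member = laplace_count(neighbour, eps, rng)
    print(f"epsilon={eps:>4}: with individual {with_member:8.1f}, without {without_member:8.1f}")
```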
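Next, a generic illustration of a Fano-style attribute-leakage bound. It uses a simple empirical estimate of conditional entropy between an original attribute column and its anonymized counterpart and the weak form of Fano’s inequality; the column values are hypothetical, and tomtA’s actual attribute metric may be computed differently:

```python
import numpy as np
from collections import Counter

def entropy(values) -> float:
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x, y) -> float:
    """H(X | Y) estimated from paired samples: H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)

def fano_error_lower_bound(x, y) -> float:
    """Weak form of Fano's inequality: an adversary guessing X from the released Y errs with
    probability at least (H(X|Y) - 1) / log2(|support(X)|)."""
    support = len(set(x))
    return max(0.0, (conditional_entropy(x, y) - 1.0) / np.log2(support))

# Hypothetical attribute column before and after anonymization (ages generalized into bands).
original   = [23, 25, 31, 34, 47, 52, 58, 61, 66, 70]
anonymized = ["20-39", "20-39", "20-39", "20-39", "40-59",
              "40-59", "40-59", "60-79", "60-79", "60-79"]
print(f"H(X|Y) = {conditional_entropy(original, anonymized):.2f} bits")
print(f"Adversary's guessing error is at least {fano_error_lower_bound(original, anonymized):.0%}")
```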
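Finally, a simplified, generic membership-inference check of the kind commonly used to audit anonymized or synthetic releases: it compares how close released records sit to known members of the sensitive set versus comparable non-members. The data, distance measure, and threshold are illustrative assumptions, not tomtA’s metric:

```python
import numpy as np

def nearest_distance(released: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Euclidean distance from each reference record to its nearest released record."""
    diffs = reference[:, None, :] - released[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def membership_disclosure_score(released, members, non_members) -> float:
    """Fraction of members sitting closer to the release than a typical non-member.
    Values near 0.5 mean membership is well hidden; values near 1.0 mean it leaks."""
    member_d = nearest_distance(released, members)
    threshold = np.median(nearest_distance(released, non_members))
    return float((member_d < threshold).mean())

rng = np.random.default_rng(1)
members = rng.normal(size=(200, 5))          # hypothetical sensitive records (e.g. lab values)
non_members = rng.normal(size=(200, 5))      # comparable individuals not in the dataset
released = members + rng.normal(scale=1.5, size=members.shape)  # heavily perturbed release

print(f"Membership disclosure score: {membership_disclosure_score(released, members, non_members):.2f}")
```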
The stakes are too high for sensitive data pipelines to rely on sales promises and belief systems for privacy. If you can’t measure it, you can’t manage it, and managing data precision and privacy is a critical deliverable for any enterprise in the AI era. Deliver trust and confidence to your stakeholders and customers by measuring what matters, with trusted and verified metrics that serve the enterprise and its customers alike.