Discussions with customers, regulators, analyst firms, and academics have shown a clear need for synthetic data privacy and accuracy standards. However, because this technology is new and there is not yet an established synthetic data community spanning producers, users, academia, and regulators, we will first launch this Synthetic Data Industry Connections (IC) activity to build that community and discuss how standardization of this new technology could best be approached. Besides laying the groundwork for the submission of a proposal for synthetic data standards, this IC activity will also seek to advance the concept of fair synthetic data and to support regulators in understanding this new technology and how it can be evaluated.
Data is now at the core of every technological, societal, and economic advance, and organizations are under increasing pressure to become data-driven and to offer personalized services that meet their customers’ expectations. This creates a rising need to utilize customer data. However, Gartner estimates that by 2023, 65% of the world’s population will have its personal information covered by modern privacy regulations, up from 10% in 2020, and already the European Union’s General Data Protection Regulation (GDPR) and the United States’ California Consumer Privacy Act (CCPA) challenge organizations to find privacy-preserving ways of utilizing customer data. Anonymization would appear to be a solution, since fully anonymous data is exempt from privacy legislation, but researchers have repeatedly shown that legacy anonymization techniques (e.g., masking, obfuscation) fail in the era of big data and cannot protect individuals from re-identification in supposedly anonymous datasets. Moreover, because these approaches are destructive (every potentially re-identifying part of a dataset must be deleted), traditionally anonymized data loses much of its utility, which significantly limits its usability for analytical purposes and AI training. As a result, customer data remains largely locked up, creating a barrier to data-driven innovation.
Recent advancements in deep learning and increases in computational power have facilitated the development and early adoption of an emerging anonymization and privacy-protection technique: AI-generated synthetic data. Synthetic data is artificial data generated by a model trained on original customer data. It is highly realistic and statistically representative of the original data, and is thus suitable as a drop-in replacement for it (e.g., for AI training). Yet, when generated with appropriate privacy mechanisms, synthetic data is fully anonymous and cannot be re-identified.
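The core idea — fit a model to original data, then sample artificial records that preserve the joint statistics without copying any real individual — can be shown with a deliberately simple sketch. This toy uses a parametric Gaussian model rather than the deep-learning generators described above, and all data and parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "original" customer data: age and income for 1,000 people.
original = np.column_stack([
    rng.normal(40, 12, 1000),       # age
    rng.normal(55000, 15000, 1000)  # income
])

# Fit a simple parametric model (mean vector and covariance matrix).
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Draw synthetic records from the fitted model: artificial rows that
# mirror the statistics of the original data, not any individual row.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
```

Real synthetic data platforms replace the Gaussian with far more expressive generative models, but the train-then-sample structure is the same.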
Besides creating replica datasets, synthetic data can also augment data to reduce bias and correct imbalances. Gartner estimates that by 2022, 85% of algorithms will be erroneous due to bias. Bias and discrimination in AI systems are problems that are already being taken seriously, and synthetic data can help mitigate them: fair synthetic datasets can represent the world not as it is, but as we would like it to be, for instance, free of gender-based or racial discrimination.
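A minimal sketch of the rebalancing idea behind fair synthetic data: generate extra synthetic rows for an under-represented group until the groups are equally represented. The dataset, group labels, and the simple Gaussian sampler below are all hypothetical; real fair-data generators model the minority group far more carefully:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 80% of records belong to group "A".
groups = np.array(["A"] * 800 + ["B"] * 200)
scores = np.concatenate([rng.normal(0.6, 0.1, 800),
                         rng.normal(0.5, 0.1, 200)])

# Augmentation sketch: synthesize extra records for group "B"
# (here by sampling from its fitted mean and spread) up to parity.
n_extra = 800 - 200
b_scores = scores[groups == "B"]
extra = rng.normal(b_scores.mean(), b_scores.std(), n_extra)

balanced_groups = np.concatenate([groups, ["B"] * n_extra])
balanced_scores = np.concatenate([scores, extra])

print((balanced_groups == "A").sum(), (balanced_groups == "B").sum())
# prints: 800 800
```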
Due to synthetic data’s immense potential to reconcile data utilization with privacy protection, enterprise organizations in the financial services, insurance, healthcare, and telecommunications industries, as well as public sector organizations, are already using synthetic data for AI training, analytics, digital product development, cross-border data sharing, and testing. However, there are no commonly agreed criteria for measuring the accuracy and privacy of synthetic data-generating platforms.
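To make the measurement gap concrete, here is one illustrative fidelity check: compare the marginal distribution of each column in the synthetic data against the original. This is a toy metric written for this sketch, not a proposed standard, and the bin count and test data are arbitrary assumptions:

```python
import numpy as np

def marginal_fidelity(original, synthetic, bins=10):
    """Illustrative accuracy score: per-column histogram agreement,
    reported as 1 - total variation distance (1.0 = identical marginals)."""
    scores = []
    for col in range(original.shape[1]):
        lo = min(original[:, col].min(), synthetic[:, col].min())
        hi = max(original[:, col].max(), synthetic[:, col].max())
        p, _ = np.histogram(original[:, col], bins=bins, range=(lo, hi))
        q, _ = np.histogram(synthetic[:, col], bins=bins, range=(lo, hi))
        p = p / p.sum()
        q = q / q.sum()
        scores.append(1.0 - 0.5 * np.abs(p - q).sum())
    return float(np.mean(scores))

rng = np.random.default_rng(1)
orig = rng.normal(0, 1, (1000, 2))
synth = rng.normal(0, 1, (1000, 2))  # a faithful "synthetic" sample
print(round(marginal_fidelity(orig, synth), 2))
```

A score near 1.0 only says the one-dimensional marginals match; an agreed standard would also have to cover joint structure, downstream utility, and privacy leakage.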
We welcome new participants from large and small corporations, academia, industry, and government agencies interested in this Synthetic Data activity. Membership will include, but is not limited to:
- Providers of AI-generated synthetic data/vendors
- Users of AI-generated synthetic data
- Data protection authorities and other regulators
- Privacy researchers
- Deep Learning/AI-generated synthetic data researchers
- Digital human rights activist groups
- Privacy lawyers
- Consulting firms with a focus on (Ethical) AI and Privacy
Deliverables and outcomes from Industry Connections activities may include documents (e.g., white papers or reports), proposals for standards, conferences, workshops, etc. The deliverables of this Synthetic Data IC activity will include:
- Building an international community for structured, AI-generated synthetic data
- A recommendation for a proposed AI-generated synthetic data privacy and accuracy standard
Once these major milestones are reached, the group will dedicate its time to further deliverables, including:
- Harmonizing the definitions of different synthetic data types and categories
- Definition of typical uses, best practices, and application orientation for AI-generated synthetic data
- An educational AI-generated synthetic data white paper for data protection authorities and other regulatory bodies
- A series of workshops and a final report to advance the concept of fair synthetic data
- A report on AI-generated synthetic data for AI auditing and explainable AI
- Defining criteria catalogs for standardized open AI-generated synthetic datasets to enable less-resourced countries and small and medium-sized enterprises (SMEs) to innovate at a more competitive level
How to Participate
To join this Synthetic Data IC activity, please express your interest by sending an inquiry to:
- ICAID (PDF)