Responsible Synthetic Data: Unlocking Insights While Safeguarding Privacy
December 3, 2025
From electronic health records (EHRs) to federal statistics, synthetic data are rapidly transforming how organizations share and analyze information, offering new ways to unlock insights while protecting individual privacy. Although this article focuses on synthetic data generation for privacy and confidentiality purposes, such as safeguarding sensitive information and enabling secure data access, synthetic data have many additional applications. They are increasingly employed to improve machine learning (ML) model performance, facilitate data sharing for scientific collaboration, and simulate rare or hard-to-observe scenarios.
The responsible generation and use of synthetic data require more than advanced algorithms; these processes demand a deep understanding of the statistical, ethical, and operational challenges involved. Here, Minsun Riddles, PhD, a Principal Statistical Associate in Statistics and Data Science, examines the technical dimensions and practical aspects of synthetic data. She emphasizes that valid inference from synthetic datasets depends on rigorous planning, validation, and governance at every stage, from data generation to analysis to policy application. By pairing technical innovation with transparency and sound governance, the use of synthetic data can enable broader data access and faster evidence generation without compromising public trust.
Q. For clients considering synthetic data, what are the most common pitfalls that can lead to invalid inference, and how do we diagnose and prevent them up front (e.g., during data generation vs. during analysis)?
A. Drawing valid statistical inferences from synthetic datasets requires careful planning and validation. Otherwise, common pitfalls can lead to a failure to preserve key statistical properties (e.g., means, variances, correlations). Here are the most common pitfalls and how to identify and mitigate them.
- Bias in Source Data. Synthetic data can carry over biases from the original dataset. Before creating synthetic data, the source should be checked for known or suspected biases. If found, the data can be rebalanced or adjusted after generation (e.g., through reweighting or resampling) to reduce these biases.
- Loss of Causal Structure. Synthetic data often capture correlations but not cause-and-effect relationships, which can result in misleading information. This issue can be detected by testing whether known causal patterns appear in the synthetic data. To prevent the problem, domain experts should be involved early, and models should be chosen or designed to preserve causal relationships, especially for data used in policy development or decision-making.
- Overfitting. When a synthesis model overfits, it may copy real records too closely, risking privacy and limiting usefulness. This problem can be detected by testing the model on separate data and checking for unrealistically close matches. Controlling model complexity can help prevent overfitting.
- Loss of Rare but Critical Events. Uncommon but important cases may be missing or underrepresented in synthetic data. These can be found by comparing the frequency of rare events in real and synthetic datasets. When rare cases are involved, models should be trained to preserve these distributions, and oversampling may be necessary.
- Inference–Prediction Mismatch. A model trained on synthetic data might predict well but produce incorrect inferences about population relationships. During validation, parameter estimates and confidence intervals from the real and synthetic datasets should be compared. Reliable validation systems and proper uncertainty estimates are key when synthetic data are used for inference. (Several of these diagnostics are sketched in the example following this list.)
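Several of these diagnostics can be automated. The Python sketch below is a minimal illustration of checks drawn from the list above: comparing marginal statistics and correlations, checking the rate of a rare event, measuring how close synthetic records sit to real ones (a crude overfitting and disclosure signal), and comparing coefficient estimates and confidence intervals from the same model fit to both datasets. The column names (y, x1, x2, rare_event) and the model formula are hypothetical placeholders, and these checks are a starting point rather than a complete validation protocol.

```python
# Minimal sketch of post-synthesis diagnostics, assuming pandas DataFrames
# `real` and `synth` with the same (hypothetical) columns: a continuous
# outcome "y", continuous predictors "x1" and "x2", and a binary flag
# "rare_event". Names and formula are illustrative, not from a real dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.spatial import cKDTree

def compare_marginals(real: pd.DataFrame, synth: pd.DataFrame, cols):
    """Compare means, standard deviations, and the largest correlation gap."""
    summary = pd.DataFrame({
        "real_mean": real[cols].mean(), "synth_mean": synth[cols].mean(),
        "real_sd": real[cols].std(), "synth_sd": synth[cols].std(),
    })
    corr_gap = (real[cols].corr() - synth[cols].corr()).abs().max().max()
    return summary, corr_gap

def rare_event_rates(real, synth, flag="rare_event"):
    """Check whether uncommon but critical cases survive synthesis."""
    return real[flag].mean(), synth[flag].mean()

def closest_record_distances(real, synth, cols):
    """Distance from each synthetic record to its nearest real record
    (standardized); unrealistically small distances suggest the model
    memorized real cases (overfitting / disclosure risk)."""
    mu, sd = real[cols].mean(), real[cols].std()
    tree = cKDTree(((real[cols] - mu) / sd).to_numpy())
    d, _ = tree.query(((synth[cols] - mu) / sd).to_numpy(), k=1)
    return np.percentile(d, [1, 5, 50])  # report low percentiles

def compare_inference(real, synth, formula="y ~ x1 + x2"):
    """Fit the same model to both datasets; compare coefficients and CIs."""
    fit_r = smf.ols(formula, data=real).fit()
    fit_s = smf.ols(formula, data=synth).fit()
    return pd.DataFrame({
        "real_coef": fit_r.params, "synth_coef": fit_s.params,
        "real_ci_low": fit_r.conf_int()[0], "real_ci_high": fit_r.conf_int()[1],
        "synth_ci_low": fit_s.conf_int()[0], "synth_ci_high": fit_s.conf_int()[1],
    })
```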
Q. Variance estimation presents a challenge in properly capturing uncertainty. Which approaches do you recommend in practice, and how do you explain the uncertainty to nontechnical stakeholders making policy or budget decisions?
A. Uncertainty is inherent in any model or estimate due to varying conditions in the underlying population. A practical approach is multiple imputation-style synthesis, generating several synthetic datasets to estimate variance due to the synthesis process. For more complex scenarios or small samples, bootstrapping over the synthesis process can yield a robust measure of uncertainty, though it requires access to the generation process, which is often unavailable to end users. In surveys, replicate weights can account for sampling and synthesis-related uncertainty. Bayesian methods provide another way to capture uncertainty by estimating a range of possible outcomes and their likelihoods. However, these approaches can be harder to explain to people without a technical background.
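To make the multiple imputation-style route concrete, the sketch below applies a combining rule often cited for partially synthetic data: average the point estimates across the m synthetic datasets, and estimate total variance as the average within-dataset variance plus the between-dataset variance divided by m. This is a generic illustration under that assumption, not a description of any specific production workflow; fully synthetic data call for a different combining rule, so the applicable regime should be confirmed first.

```python
# Minimal sketch of multiple-imputation-style variance combining for
# partially synthetic data (Reiter-style rules), assuming the analyst has
# fit the same estimator to m independently generated synthetic datasets.
# Fully synthetic data use a different combining rule.
import numpy as np
from scipy import stats

def combine_partial_synthesis(estimates, variances, alpha=0.05):
    """estimates: point estimates q_1..q_m from the m synthetic datasets.
    variances: the corresponding within-dataset variance estimates u_1..u_m."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                            # combined point estimate
    b = q.var(ddof=1)                           # between-synthesis variance
    u_bar = u.mean()                            # average within-dataset variance
    T = u_bar + b / m                           # total variance (partial synthesis)
    df = (m - 1) * (1 + m * u_bar / b) ** 2     # approximate degrees of freedom
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(T)
    return {"estimate": q_bar, "variance": T, "ci": (q_bar - half, q_bar + half)}

# Example with made-up numbers from m = 5 hypothetical synthetic datasets.
print(combine_partial_synthesis([2.1, 1.9, 2.3, 2.0, 2.2],
                                [0.04, 0.05, 0.04, 0.06, 0.05]))
```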
When communicating uncertainty to decision-makers, visualizing confidence intervals can make variability more intuitive. Framing uncertainty as the likelihood that an estimate falls within a policy-relevant range is often more meaningful to stakeholders than reporting a variance alone.
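As a small illustration of that framing, the sketch below converts a point estimate and its standard error into the approximate probability that the true value lies within a stated range, assuming an approximately normal sampling distribution; the numbers are made up for illustration.

```python
# Reframing variance for nontechnical audiences: report the approximate
# chance that the true value lies in a policy-relevant range. Assumes an
# approximately normal sampling distribution; values below are hypothetical.
from scipy import stats

def prob_in_range(estimate, std_error, low, high):
    dist = stats.norm(loc=estimate, scale=std_error)
    return dist.cdf(high) - dist.cdf(low)

# "There is roughly a 90% chance the true rate is between 40% and 50%."
print(round(prob_in_range(0.45, 0.03, 0.40, 0.50), 2))
```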
Q. How do you balance disclosure risk and analytic utility for different use cases, and what governance tests or metrics do you use to quantify that trade-off for clients?
A. Balancing privacy protection with data usefulness is at the heart of creating and using synthetic data. The right balance depends on how the data will be used. For example, when exploring patterns or testing ideas, it may be fine if the synthetic data are less precise, as long as individual privacy is well protected. However, when developing statistical models or testing specific hypotheses, it is important that the synthetic data closely reflect the real relationships in the original dataset. For research intended for publication, both privacy risk and data quality must be carefully evaluated, often resulting in more cautious approaches or methods that use real data for final checks and validations.
To guide this balance, risk-utility maps are often used to visualize the trade-off between disclosure risk and analytic utility. Governance policies such as tiered access controls, synthetic data validation protocols, and expert review panels enable organizations to customize synthesis approaches aligned with data sensitivity and analysis goals.
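As a simple illustration of the idea (not a governance standard), the sketch below plots hypothetical risk and utility scores for a few candidate synthesis settings; in practice the scores would come from formal disclosure-risk and utility metrics, such as the diagnostics sketched earlier.

```python
# Minimal sketch of a risk-utility map. Assumes that, for each candidate
# synthesis setting, a disclosure-risk score and an analytic-utility score
# have already been computed; the labels and numbers below are hypothetical.
import matplotlib.pyplot as plt

# label -> (disclosure risk, analytic utility), both scaled to [0, 1]
candidates = {
    "light smoothing": (0.30, 0.95),
    "moderate smoothing": (0.15, 0.90),
    "heavy smoothing": (0.05, 0.70),
}

fig, ax = plt.subplots()
for label, (risk, utility) in candidates.items():
    ax.scatter(risk, utility)
    ax.annotate(label, (risk, utility), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("Disclosure risk (lower is better)")
ax.set_ylabel("Analytic utility (higher is better)")
ax.set_title("Risk-utility map of candidate synthesis settings")
plt.show()
```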
Q. From a capabilities standpoint, what in-house expertise and toolchains does Westat bring to clients, and where do we see gaps that require new methods or partnerships as the field evolves?
A. Westat offers a comprehensive and multidisciplinary in-house team that possesses deep technical, analytical, and subject matter expertise. Our dedicated privacy and confidentiality group has extensive experience in designing and implementing cutting-edge synthetic data and privacy-preserving solutions. These experts focus on applying privacy-enhancing technologies that enable secure and compliant data sharing and analysis in line with evolving regulations.
Complementing this group is a robust statistics and data science team skilled in advanced statistical methods, ML, and artificial intelligence (AI), ensuring that models and tools are fit for purpose across a range of analytical challenges. In addition, our broad network of subject matter experts spans key policy areas, such as public health, behavioral health, clinical research, education, and transportation. These domain specialists work closely with our technical teams to ensure that solutions are relevant, actionable, and grounded in real-world contexts.
At the same time, we recognize that this is a rapidly evolving field, especially in terms of the ethical use of AI, and we remain committed to continuous innovation and strategic collaboration to stay ahead of emerging challenges.
Q. Beyond individual projects, what societal benefits can synthetic data realistically unlock, and what safeguards must be in place so these benefits do not come at the expense of confidentiality or public trust?
A. By generating realistic yet nonidentifiable datasets, synthetic data can broaden data access, particularly for individuals and institutions with limited resources. Synthetic data also support faster, more iterative evidence generation, which is essential for timely decision-making. In contexts like data linkage, synthetic records can serve as a safe scaffold for designing and testing linkage strategies without exposing sensitive identifiers or violating privacy regulations.
Transparency is essential, not only in documenting synthesis methods but also in clearly disclosing the known limitations and intended use of synthetic datasets. Robust governance practices—such as clear terms of use, role-based access, ethical oversight, and auditing—are key in ensuring data use aligns with the intended purpose. Public trust depends on accountability (e.g., independent validation) and clear communication of both the benefits and limitations of synthetic data.