Machine Learning for MDs Weekly Digest


What’s New in ML for MDs
Welcome to the ML for MDs Newsletter. The mission of ML for MDs is to connect physicians interested in machine learning. This newsletter provides the most relevant news, journal articles, and jobs at the intersection of medicine and machine learning.
Fun Facts
- Finland provides AI courses to ensure the country’s competitiveness. Their goal is to teach at least one percent of residents the fundamentals of artificial intelligence.
- Sophia, a lifelike humanoid, was granted citizenship in Saudi Arabia in 2017.
This Week’s Top Stories
- Kaiser bought Geisinger! Covered nicely in the Health Tech Nerds newsletter.
- Did anyone else miss that Best Buy reached a deal with Atrium Health to install hospital-at-home technology? I can’t be the only one who thinks this is another doomed business for Best Buy: most people already have tech in their homes, and in a few years it should be able to handle most hospital-at-home functions without a special installation. But I guess it’ll give the Geek Squad a few more years of life.
- In reassuring news about humans, most 4- to 11-year-olds believe neither Alexa nor Roomba should be yelled at.
- Robots with a tactile sense continue to improve and can now sense proximity as well.
Weekly Summary
A review of synthetic healthcare data generation
Creating synthetic health datasets helps solve the obvious problem of acquiring datasets that contain no PHI and are large enough for machine learning. Without synthetic datasets, doctors and innovators sometimes wait months for institutional approval to test an idea that could be proven conceptually in a matter of hours.
This is a brief review of synthetic health datasets, with links to good papers on the topic and a focus on the Synthea dataset.
Synthetic health datasets are used for:
- Simulation and prediction research
- Hypothesis, methods, and algorithm testing
- Public health research
- Health IT development/testing
- Education and training
- Public release of datasets
- Linking data
Synthetic data sit on a spectrum from low disclosure risk and less detail to high disclosure risk and more detail. The basic idea is that the more realistic and useful the data are, the higher the risk of disclosure, because they were likely built from some real data that could be re-identified.

[Figure: the synthetic data spectrum, from the UK’s Office for National Statistics.]
There are many synthetic data generation (SDG) approaches, but often:
- Their methods are not well described, which makes them difficult to reproduce and validate.
- They’re not validated, or only superficially validated.
- Some of them produce implausible records, such as elderly men who get pregnant.
- There’s currently no benchmark or standard for validating synthetic data.
- This is a good paper from Nature summarizing benchmarking of synthetic data.
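For readers who like to see things in code, here’s a minimal sketch of the kind of per-record plausibility check that would catch a pregnant elderly man. Everything in it is an assumption for illustration: the `SyntheticPatient` fields, the condition label, and the age cutoffs are hypothetical and don’t come from any particular generator or clinical guideline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyntheticPatient:
    """Hypothetical record layout, for illustration only."""
    patient_id: str
    age: int
    sex: str  # "M" or "F"
    conditions: List[str] = field(default_factory=list)


# Illustrative cutoffs only; a real check would use clinically reviewed limits.
MIN_PREGNANCY_AGE = 10
MAX_PREGNANCY_AGE = 60


def implausible_findings(patient: SyntheticPatient) -> List[str]:
    """Return human-readable problems found in a single synthetic record."""
    problems = []
    if "pregnancy" in patient.conditions:
        if patient.sex == "M":
            problems.append("pregnancy recorded for a male patient")
        if not MIN_PREGNANCY_AGE <= patient.age <= MAX_PREGNANCY_AGE:
            problems.append(f"pregnancy recorded at age {patient.age}")
    return problems


# Three made-up records: only the first should be flagged.
patients = [
    SyntheticPatient("p1", 82, "M", ["hypertension", "pregnancy"]),
    SyntheticPatient("p2", 31, "F", ["pregnancy"]),
    SyntheticPatient("p3", 67, "F", ["type 2 diabetes"]),
]

# Keep only the patients that have at least one problem.
flagged = {p.patient_id: implausible_findings(p) for p in patients}
flagged = {pid: problems for pid, problems in flagged.items() if problems}
print(flagged)
```

Rules like these only catch the errors someone thought to write down, so they complement a broader validation rather than replace it.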
PADARSER/Synthea – Developed in 2018, it models the top 10 reasons patients visit their PCP and the top 10 chronic conditions responsible for loss of life, using data from the CDC, US Census, and NIH but no EHR data. Treatments for the synthetic patients were based on public clinical guidelines or expert opinion.
The validation is particularly interesting: at first, almost all synthetic patients with type 2 diabetes developed neuropathy, as did some kids under 5. The authors modified the transition probabilities and the data improved. Still, this experience underscores the importance of validating these datasets.
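To see why a single transition probability matters so much, here’s a rough Python sketch of a two-state model in which a diabetic patient has a fixed chance each year of moving to a neuropathy state. This is not Synthea’s module format, and the probabilities (0.50 and 0.03 per year), the 10-year horizon, and the function names are all hypothetical values chosen only to show the effect.

```python
import random


def expected_fraction(annual_probability: float, years: int) -> float:
    """Closed-form share of patients who have transitioned after `years`."""
    return 1 - (1 - annual_probability) ** years


def simulate_fraction(annual_probability: float, years: int,
                      n_patients: int, seed: int = 0) -> float:
    """Simulate each patient year by year and return the share who transition."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    affected = 0
    for _ in range(n_patients):
        for _ in range(years):
            if rng.random() < annual_probability:
                affected += 1
                break  # neuropathy is treated as an absorbing state here
    return affected / n_patients


# Hypothetical probabilities: one far too high, one more modest.
for p in (0.50, 0.03):
    simulated = simulate_fraction(p, years=10, n_patients=10_000)
    print(f"annual p={p:.2f}: simulated {simulated:.1%}, "
          f"expected {expected_fraction(p, 10):.1%}")
```

With the higher value nearly every simulated patient ends up with neuropathy within a decade, which is the sort of result that should prompt a second look at the inputs.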
A further validation study in 2019 showed the Synthea data:
- Had no patients develop complications after hip/knee replacement
- Had no patients control their blood pressure after developing hypertension
The authors of the validation study note that “synthetic patient generators do not currently model for deviations in care and the potential outcomes that may result from care deviations.”
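Findings like these show up when you compare aggregate rates in the synthetic cohort with what you’d expect to see in practice. Here’s a small Python sketch of that comparison. The counts and expected ranges are made-up placeholders, not published benchmarks or results from the validation study, and `check_rate` is a hypothetical helper.

```python
def check_rate(measure, events, total, expected_low, expected_high):
    """Compare an observed event rate with an expected range."""
    if total <= 0:
        raise ValueError("total must be greater than zero")
    rate = events / total
    return {
        "measure": measure,
        "rate": rate,
        "plausible": expected_low <= rate <= expected_high,
    }


# All counts and ranges below are placeholders for illustration only.
checks = [
    check_rate("complication after hip/knee replacement", 0, 1_000, 0.01, 0.10),
    check_rate("blood pressure controlled after hypertension", 0, 5_000, 0.30, 0.80),
    check_rate("hypothetical measure that passes", 120, 1_000, 0.05, 0.20),
]

for result in checks:
    status = "ok" if result["plausible"] else "REVIEW"
    print(f'{status:6} {result["measure"]}: {result["rate"]:.1%}')
```

A rate of exactly zero is flagged here for the same reason a rate near 100% would be: real care includes deviations, so a cohort without any is a sign the model is missing something.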
In 2022, the Coherent Data Set was developed from the Synthea data. It includes imaging DICOM files, narrative notes, and physiologic data.
Approach | Uses real EHR data? | Free? | Notes |
medGAN | yes | yes | |
MDClone | yes | no | “converts real EHR records into a synthetic version that is statistically comparable and maintains correlations among its variables” |
PatientGen | yes | no | Generates statistically based synthetic patients to test health IT systems for privacy leaks |
PADARSER/Synthea | no | yes | Started with a MassChallenge Health Tech Sprint |
SynSys | yes | no | Semi-supervised ML, uses hidden Markov models and regression models that are initially trained on real data sets |
iPatient Data Generator (iPDG) | no | yes | Uses ML to generate records from publicly available data |
Examples of Synthetic Datasets from Gonzales et al:
Synthetic Dataset | Data Owner/ Distributor | Type of Synthetic Dataset | Data Characteristics (type and quantity) | Use | Use Case Example |
CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) | Centers for Medicare and Medicaid Services (Public domain) | Fully synthetic | 6.8 Million beneficiary records; 112 million claims records; and 111 million prescription drug events records | Data entrepreneur analysis, software and application development, research training | Used a sub-set of the DE-SynPUF dataset to test different classification algorithms to accurately predict inpatient health care expenditure [60]. |
Synthea-Generated Datasets | MITRE Corporation | Fully Synthetic | One million longitudinal clinical synthetic patient records (SyntheticMass) | Innovation, development, education, and other nonclinical secondary uses | A pilot project used SyntheticMass data to assess whether data could be extracted from EHR through FHIR resources to support clinical trials [61]. |
US Synthetic Household Population | RTI International | Fully synthetic | Location and descriptive sociodemographic attributes of households (116 million records) and persons living in those households (300 million records) | Agent-based modeling, disease outbreak simulation, distribution of resources analysis, sociodemographic pattern recognition, and disaster planning and response. | Used the dataset to simulate the impact of different influenza epidemics and the impact of utilizing pharmacies in addition to traditional (hospitals, clinic/physician offices, and urgent care centers) locations for vaccination [62]. |
CMS Synthetic data in Blue Button Sandbox | Centers for Medicare and Medicaid Services (inside a sandbox with access requirement) | No information | 30,000 synthetic beneficiaries with claims data (Blue Button 2.0 Sandbox) | Development and testing of applications and information systems that will need to interact with CMS data systems | Blue Button 2.0 sandbox has more than 2,000 developers using the sandbox to test data exchange [63]. |
Community News
- If you haven’t introduced yourself, please do so in the #intros channel.
Thanks for being a part of this community! As always, please let me know if you have questions/ideas/feedback.
Sarah
Sarah Gebauer, MD