Machine Learning for MDs Weekly Digest

What’s New in ML for MDs

Welcome to the ML for MDs Newsletter. The mission of ML for MDs is to connect physicians interested in machine learning. This newsletter provides the most relevant news, journal articles, and jobs at the intersection of medicine and machine learning.

Fun Facts

Finland provides AI courses to ensure the country’s competitiveness. Their goal is to teach at least one percent of residents the fundamentals of artificial intelligence.
Sophia, a lifelike humanoid, was granted citizenship in Saudi Arabia in 2017.

This Week’s Top Stories

Kaiser bought Geisinger! Covered nicely in the Health Tech Nerds newsletter.
Did anyone else miss that BestBuy reached a deal with Atrium Health to install hospital at home technology? I can’t be the only one who thinks this is another doomed business for BestBuy since most people already have tech in their homes and it should be able to do most hospital at home functions in a few years without a special installation. But I guess it’ll give the Geek Squad a few more years of life.
In reassuring news about humans, most 4- to 11-year-olds believe neither Alexa nor Roomba should be yelled at.
Robots with a tactile sense continue to improve and can now sense proximity as well.

Weekly summary

A review of synthetic healthcare data generation

Creating synthetic health datasets helps solve the obvious problems of acquiring datasets that are free of PHI and large enough for machine learning. Without synthetic datasets, doctors and innovators sometimes wait months for institutional approval to test an idea that could be proven conceptually in a matter of hours.

This is a brief review of synthetic health datasets, with links to good papers on the topic and a focus on the Synthea dataset.

Synthetic health datasets are used for:

Simulation and prediction research
Hypothesis, methods, and algorithm testing
Public health research
Health IT development/testing
Education and training
Public release of datasets
Linking data

Synthetic data sit on a spectrum from low disclosure risk and less detail to high disclosure risk and more detail. The basic idea is that the more realistic and useful the data are, the higher the risk of disclosure, because they likely drew on some real data that could be re-identified:

https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot

This figure is from the UK’s Office for National Statistics.

There are many synthetic data generation (SDG) approaches, but often:

Their methods are not well described, which makes them difficult to repeat and validate.
They’re not validated, or only superficially validated:
1. Some of them have elderly men who get pregnant, etc.
2. There’s currently no benchmark or standard for validating synthetic data.
3. This is a good paper from Nature summarizing benchmarking of synthetic data.
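
One way to catch implausible records like these is a set of simple rule-based checks run over every synthetic patient. The sketch below is only an illustration: the SyntheticPatient fields, the plausibility_issues helper, and the age cutoffs are all assumptions made up for this example, not part of any SDG tool or published benchmark.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SyntheticPatient:
    # Hypothetical record layout, invented for this example.
    patient_id: str
    age: int
    sex: str  # "M" or "F"
    conditions: Tuple[str, ...]


def plausibility_issues(patient: SyntheticPatient) -> List[str]:
    """Return human-readable descriptions of implausible combinations."""
    issues = []
    if not 0 <= patient.age <= 120:
        issues.append("age outside 0-120")
    if "pregnancy" in patient.conditions:
        if patient.sex == "M":
            issues.append("pregnancy recorded for a male patient")
        elif not 12 <= patient.age <= 55:
            # Illustrative cutoffs only, not a clinical standard.
            issues.append("pregnancy recorded outside the assumed age range")
    return issues


patients = [
    SyntheticPatient("p1", 30, "F", ("pregnancy",)),
    SyntheticPatient("p2", 82, "M", ("pregnancy", "hypertension")),
    SyntheticPatient("p3", 3, "F", ("pregnancy",)),
]

# Keep only the patients that trip at least one rule.
flagged = {}
for p in patients:
    issues = plausibility_issues(p)
    if issues:
        flagged[p.patient_id] = issues

for patient_id, issues in flagged.items():
    print(patient_id, issues)
```

Checks like this only catch the errors someone thought to write a rule for, which is part of why a shared benchmark would help.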

PADARSER/Synthea – Developed in 2018, it uses models for the top 10 reasons patients visit their PCP and the top 10 chronic conditions responsible for loss of life. The models draw on data from the CDC, US Census, and NIH, but no EHR data. Treatments for the synthetic patients were based on public clinical guidelines or expert opinion.

The validation is particularly interesting: almost all synthetic patients with Type 2 diabetes developed neuropathy, as did some kids under 5. After the authors modified the transition probabilities, the data improved. Still, this experience underscores the importance of validating these datasets.
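
To see why a mis-set transition probability can push almost every patient into a complication, it helps to compound a yearly probability over time. The sketch below assumes a deliberately simplified model: a fixed yearly chance of moving from "diabetes" to "neuropathy". The function name and both probabilities are invented for illustration and are not Synthea's actual values or module format.

```python
def cumulative_risk(annual_probability: float, years: int) -> float:
    """Chance of having made the transition at least once within `years`,
    assuming the same independent probability every year."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    # Probability of *never* transitioning is (1 - p) ** years.
    return 1.0 - (1.0 - annual_probability) ** years


# Two made-up settings for the same transition, compared over 20 years.
for label, p in [("too high", 0.30), ("adjusted", 0.03)]:
    print(f"{label}: {cumulative_risk(p, 20):.1%} of patients affected")
```

With these made-up numbers, the higher setting leaves well over 99% of patients with neuropathy after 20 years, while the lower one leaves fewer than half. A probability that looks modest for a single year can still saturate a simulated population over a lifetime.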

A further validation study in 2019 showed the Synthea data:

Had no patients develop complications after hip/knee replacement
Had no patients control their blood pressure after developing hypertension

The authors of the validation study mention that “synthetic patient generators do not currently model for deviations in care and the potential outcomes that may result from care deviations”.
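
Findings like "no patients developed complications" can be surfaced with a quick check for outcomes that never vary across a cohort. The sketch below is a minimal example under stated assumptions: the toy cohort, the field names, and the helper functions are hypothetical and do not reflect the validation study's actual method or Synthea's output format.

```python
from typing import Dict, List


def outcome_rate(records: List[Dict[str, bool]], outcome: str) -> float:
    """Fraction of records in which the outcome flag is True."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(1 for r in records if r[outcome]) / len(records)


def outcomes_without_variation(
    records: List[Dict[str, bool]], outcomes: List[str]
) -> List[str]:
    """Outcomes that happen to nobody or to everybody in the cohort."""
    return [o for o in outcomes if outcome_rate(records, o) in (0.0, 1.0)]


# Toy cohort of synthetic hip-replacement patients (made-up values).
cohort = [
    {"complication": False, "readmitted": False},
    {"complication": False, "readmitted": True},
    {"complication": False, "readmitted": False},
    {"complication": False, "readmitted": True},
]

suspicious = outcomes_without_variation(cohort, ["complication", "readmitted"])
print(suspicious)
```

A rate of exactly 0% or 100% is not always wrong, but in a large cohort it is a cheap signal that the generator may not be modeling deviations in care.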

In 2022, the Coherent Data Set was developed from the Synthea data. It includes DICOM imaging files, narrative notes, and physiologic data.

Approach	Uses real EHR data?	Free?	Notes
medGAN	yes	yes
MDClone	yes	no	“converts real EHR records into a synthetic version that is statistically comparable and maintains correlations among its variables”
PatientGen	yes	no	generates statistically based synthetic patients to test health IT systems for privacy leaks
PADARSER/Synthea	no	yes	Started with a MassChallenge Health Tech Sprint
SynSys	yes	no	Semi-supervised ML, uses hidden Markov models and regression models that are initially trained on real data sets
iPatient Data Generator (iPDG)	no	yes	Uses ML to generate records from publicly available data

Examples of synthetic datasets from Gonzales et al.:

Synthetic Dataset	Data Owner/ Distributor	Type of Synthetic Dataset	Data Characteristics (type and quantity)	Use	Use Case Example
CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF)	Centers for Medicare and Medicaid Services (Public domain)	Fully synthetic	6.8 million beneficiary records; 112 million claims records; and 111 million prescription drug events records	Data entrepreneur analysis, software and application development, research training	Used a sub-set of the DE-SynPUF dataset to test different classification algorithms to accurately predict inpatient health care expenditure [60].
Synthea-Generated Datasets	MITRE Corporation	Fully synthetic	One million longitudinal clinical synthetic patient records (SyntheticMass)	Innovation, development, education, and other nonclinical secondary uses	A pilot project used SyntheticMass data to assess whether data could be extracted from EHR through FHIR resources to support clinical trials [61].
US Synthetic Household Population	RTI International	Fully synthetic	Location and descriptive sociodemographic attributes of households (116 million records) and persons living in those households (300 million records)	Agent-based modeling, disease outbreak simulation, distribution of resources analysis, sociodemographic pattern recognition, and disaster planning and response.	Used the dataset to simulate the impact of different influenza epidemics and the impact of utilizing pharmacies in addition to traditional (hospitals, clinic/physician offices, and urgent care centers) locations for vaccination [62].
CMS Synthetic data in Blue Button Sandbox	Centers for Medicare and Medicaid Services (inside a sandbox with access requirement)	No information	30,000 synthetic beneficiaries with claims data (Blue Button 2.0 Sandbox)	Development and testing of applications and information systems that will need to interact with CMS data systems	Blue Button 2.0 sandbox has more than 2,000 developers using the sandbox to test data exchange [63].

Community News

If you haven’t introduced yourself, please do so in the #intros channel.

Thanks for being a part of this community! As always, please let me know if you have questions/ideas/feedback.

Sarah

Sarah Gebauer, MD

MlforMDs.com

May 1, 2023