May 1, 2023

Machine Learning for MDs Weekly Digest

What’s New in ML for MDs

Welcome to the ML for MDs Newsletter. The mission of ML for MDs is to connect physicians interested in machine learning. This newsletter provides the most relevant news, journal articles, and jobs at the intersection of medicine and machine learning. 

Fun Facts

  • Finland provides AI courses to ensure the country’s competitiveness. Its goal is to teach at least one percent of residents the fundamentals of artificial intelligence.
  • Sophia, a lifelike humanoid, was granted citizenship in Saudi Arabia in 2017.

This Week’s Top Stories

  • Kaiser bought Geisinger! Covered nicely in the Health Tech Nerds newsletter.
  • Did anyone else miss that Best Buy reached a deal with Atrium Health to install hospital-at-home technology? I can’t be the only one who thinks this is another doomed business for Best Buy, since most people already have tech in their homes that should be able to handle most hospital-at-home functions in a few years without a special installation. But I guess it’ll give the Geek Squad a few more years of life.
  • In reassuring news about humans, most 4-11 year olds believe neither Alexa nor Roomba should be yelled at.
  • Robots with tactile sense are continuing to improve and can now sense proximity as well.

Weekly summary

A review of synthetic healthcare data generation

Creating synthetic health datasets helps solve the obvious problem of acquiring datasets that contain no PHI and are large enough for machine learning. Without synthetic datasets, doctors and innovators sometimes wait months for institutional approval to test an idea that could be proven conceptually in a matter of hours.

This is a brief review of synthetic health datasets, with links to good papers on the topic and a focus on the Synthea dataset.

Synthetic health datasets are used for:

  • Simulation and prediction research
  • Hypothesis, methods, and algorithm testing
  • Public health research
  • Health IT development/testing
  • Education and training
  • Public release of datasets
  • Linking data

Synthetic data fall on a spectrum from low disclosure risk and less detail to high disclosure risk and more detail. The basic idea is that the more realistic and useful the data are, the higher the disclosure risk, because they were likely derived from real data that could be re-identified:

Figure: the synthetic data spectrum, from the UK’s Office for National Statistics (https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot).

There are many synthetic data generation (SDG) approaches, but often:

  1. Their methods are not well described, which makes them difficult to reproduce and validate.
  2. They’re not validated, or only superficially validated.
    1. Some of them contain implausible records, such as elderly men who get pregnant.
    2. There’s currently no benchmark or standard for validating synthetic data.
    3. This is a good paper from Nature summarizing benchmarking of synthetic data.
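
Since no standard benchmark exists, one common first pass is a set of hand-written plausibility rules run over every synthetic record. The following is a minimal sketch of that idea, not a validated tool: the `SyntheticPatient` fields, the rule descriptions, and the age cut-offs are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SyntheticPatient:
    # Hypothetical record layout; real generators use much richer schemas.
    patient_id: str
    age: int
    sex: str  # "M" or "F"
    conditions: Tuple[str, ...]


# Each rule is (description, predicate). A predicate returns True when the
# record looks implausible. The thresholds are illustrative assumptions.
Rule = Tuple[str, Callable[[SyntheticPatient], bool]]

RULES: List[Rule] = [
    ("pregnancy recorded for a male patient",
     lambda p: p.sex == "M" and "pregnancy" in p.conditions),
    ("pregnancy recorded outside ages 10-60",
     lambda p: "pregnancy" in p.conditions and not 10 <= p.age <= 60),
]


def implausible_findings(patient: SyntheticPatient) -> List[str]:
    """Return the description of every rule the record violates."""
    return [text for text, is_violated in RULES if is_violated(patient)]


patients = [
    SyntheticPatient("a1", 82, "M", ("pregnancy", "hypertension")),
    SyntheticPatient("b2", 31, "F", ("pregnancy",)),
]
report = {p.patient_id: implausible_findings(p) for p in patients}
```

Rules like these only catch the errors someone thought to write down, which is part of why superficial validation is a recurring concern.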

PADARSER/Synthea: Developed in 2018, it models the top 10 reasons patients visit their PCP and the top 10 chronic conditions responsible for loss of life, using data from the CDC, US Census, and NIH but no EHR data. Treatments for the synthetic patients were based on public clinical guidelines or expert opinion.

The validation is particularly interesting: almost all synthetic patients with Type 2 diabetes developed neuropathy, as did some kids under 5. The authors modified the transition probabilities and the data improved. Still, this experience underscores the importance of validating these datasets.
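
To see why a single transition probability matters so much, here is a toy sketch of a two-state model in which a diabetic patient may move to a "neuropathy" state once per simulated year. This is not Synthea's actual module format or its real parameters; the yearly probabilities, the 20-year horizon, and the function names are hypothetical.

```python
import random


def cumulative_probability(p_per_year: float, years: int) -> float:
    """Chance of at least one transition across independent yearly draws."""
    return 1.0 - (1.0 - p_per_year) ** years


def simulate_fraction(p_per_year: float, years: int, n_patients: int,
                      seed: int = 0) -> float:
    """Simulate patients and return the fraction who ever develop the complication."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    affected = 0
    for _ in range(n_patients):
        # Each simulated year the patient may move from "diabetes" to "neuropathy".
        if any(rng.random() < p_per_year for _ in range(years)):
            affected += 1
    return affected / n_patients


# Hypothetical values chosen only to contrast a too-high and a lower setting.
too_high = simulate_fraction(p_per_year=0.5, years=20, n_patients=5000)
adjusted = simulate_fraction(p_per_year=0.02, years=20, n_patients=5000)
```

With the higher setting nearly every simulated patient ends up with the complication, while the lower one leaves roughly a third affected over the same horizon, so a small-looking parameter can dominate the cohort-level picture.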

A further validation study in 2019 showed that the Synthea data:

  • Had no patients develop complications after hip/knee replacement
  • Had no patients control their blood pressure after developing hypertension

The authors of the validation study mention that “synthetic patient generators do not currently model for deviations in care and the potential outcomes that may result from care deviations”.
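
Findings like these can be surfaced with a simple cohort-level check: pick a cohort, compute how often an outcome occurs, and flag any outcome that never appears. The sketch below assumes toy records with hypothetical field names (`procedures`, `complications`); it is not the method used in the validation study.

```python
from typing import Any, Callable, Dict, List, Optional

Patient = Dict[str, Any]


def outcome_rate(patients: List[Patient],
                 in_cohort: Callable[[Patient], bool],
                 has_outcome: Callable[[Patient], bool]) -> Optional[float]:
    """Fraction of the cohort with the outcome, or None if the cohort is empty."""
    cohort = [p for p in patients if in_cohort(p)]
    if not cohort:
        return None
    return sum(1 for p in cohort if has_outcome(p)) / len(cohort)


# Toy records with hypothetical field names.
patients = [
    {"id": "p1", "procedures": ["hip replacement"], "complications": []},
    {"id": "p2", "procedures": ["knee replacement"], "complications": []},
    {"id": "p3", "procedures": [], "complications": []},
]

rate = outcome_rate(
    patients,
    in_cohort=lambda p: any("replacement" in proc for proc in p["procedures"]),
    has_outcome=lambda p: len(p["complications"]) > 0,
)

# A rate of exactly zero in a sizeable cohort is a cue for clinical review.
needs_review = rate == 0.0
```

A zero rate is not proof of an error on its own, but for outcomes that clinicians expect to see at least occasionally it is a cheap signal worth checking.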

In 2022, the Coherent Data Set was developed from the Synthea data. It includes DICOM imaging files, narrative notes, and physiologic data.

| Approach | Uses real EHR data? | Free? | Notes |
|---|---|---|---|
| medGAN | yes | yes | |
| MDClone | yes | no | “converts real EHR records into a synthetic version that is statistically comparable and maintains correlations among its variables” |
| PatientGen | yes | no | Generates statistically based synthetic patients to test health IT systems for privacy leaks |
| PADARSER/Synthea | no | yes | Started with a MassChallenge Health Tech Sprint |
| SynSys | yes | no | Semi-supervised ML; uses hidden Markov models and regression models that are initially trained on real datasets |
| iPatient Data Generator (iPDG) | no | yes | Uses ML to generate records from publicly available data |

Examples of synthetic datasets from Gonzales et al.:

| Synthetic Dataset | Data Owner/Distributor | Type of Synthetic Dataset | Data Characteristics (type and quantity) | Use | Use Case Example |
|---|---|---|---|---|---|
| CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) | Centers for Medicare and Medicaid Services (public domain) | Fully synthetic | 6.8 million beneficiary records; 112 million claims records; and 111 million prescription drug events records | Data entrepreneur analysis, software and application development, research training | Used a subset of the DE-SynPUF dataset to test different classification algorithms to accurately predict inpatient health care expenditure [60]. |
| Synthea-Generated Datasets | MITRE Corporation | Fully synthetic | One million longitudinal clinical synthetic patient records (SyntheticMass) | Innovation, development, education, and other nonclinical secondary uses | A pilot project used SyntheticMass data to assess whether data could be extracted from EHR through FHIR resources to support clinical trials [61]. |
| US Synthetic Household Population | RTI International | Fully synthetic | Location and descriptive sociodemographic attributes of households (116 million records) and persons living in those households (300 million records) | Agent-based modeling, disease outbreak simulation, distribution of resources analysis, sociodemographic pattern recognition, and disaster planning and response | Used the dataset to simulate the impact of different influenza epidemics and the impact of utilizing pharmacies in addition to traditional locations (hospitals, clinic/physician offices, and urgent care centers) for vaccination [62]. |
| CMS Synthetic data in Blue Button Sandbox | Centers for Medicare and Medicaid Services (inside a sandbox with access requirement) | No information | 30,000 synthetic beneficiaries with claims data (Blue Button 2.0 Sandbox) | Development and testing of applications and information systems that will need to interact with CMS data systems | Blue Button 2.0 sandbox has more than 2,000 developers using the sandbox to test data exchange [63]. |

Community News

  • If you haven’t introduced yourself, please do so in the #intros channel.

Thanks for being a part of this community! As always, please let me know if you have questions/ideas/feedback.

Sarah

Sarah Gebauer, MD

MlforMDs.com
