March 13, 2023

Weekly feature: AlphaFold

As an undergraduate, I worked in a lab that used X-ray crystallography to determine protein structures. I chose that lab partly because of the beautiful portraits of twisting proteins on its walls, which I appreciated more as works of art than as science. Determining a protein's structure back then could take years. With the advent of DeepMind's AlphaFold, that time has decreased to minutes.

AlphaFold’s ability to predict protein folding and structure was first described in a Nature paper in July 2021 and was immediately heralded as an innovation that “will change everything.”

Prior Approaches

Finding a less time-intensive way to determine protein structure had been an open problem for decades. Physical-interaction programs modeled how amino acids should interact with each other given thermodynamic or kinetic forces; however, this was difficult to model from a practical standpoint. Evolutionary programs use bioinformatics to examine how proteins evolved and which known proteins they resemble. These programs fail when confronted with a protein for which there’s no existing homologue in the databank.

How AlphaFold Works

*Note: this explanation is, by necessity, a simplification*

From the paper: “The AlphaFold network directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and aligned sequences of homologues as inputs.” 

The Evoformer: The trunk of the network

  • Inputs are processed through repeated layers of the Evoformer to produce two arrays:
    • A sequences × residues array representing the multiple sequence alignment (MSA)
    • A residues × residues array representing residue pairs
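The two Evoformer arrays can be pictured with placeholder shapes; the sizes below are illustrative stand-ins, not AlphaFold's actual channel dimensions:

```python
import numpy as np

# Illustrative sizes only (not AlphaFold's real hyperparameters)
n_seq, n_res = 128, 256      # aligned sequences, residues in the protein
c_msa, c_pair = 256, 128     # per-entry feature channels (assumed values)

# MSA representation: one row per aligned sequence, one column per residue
msa_repr = np.zeros((n_seq, n_res, c_msa))

# Pair representation: one entry per ordered residue pair
pair_repr = np.zeros((n_res, n_res, c_pair))

print(msa_repr.shape)   # (128, 256, 256)
print(pair_repr.shape)  # (256, 256, 128)
```

The Evoformer's layers repeatedly exchange information between these two arrays before handing them to the Structure Module.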

The Structure Module

  • “Introduces an explicit 3D structure in the form of a rotation and translation for each residue of the protein”
  • Iterative refinement that’s related to computer vision approaches

Training

  • Use the trained network to predict structures for new sequences -> creation of a new (self-distillation) dataset
  • Train the same architecture from scratch on a mixture of the established data and the new dataset
  • Mask out or mutate individual residues and use a BERT-style objective to predict the masked elements
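The BERT-style masking step can be sketched roughly as follows. The mask token, probabilities, and the `mask_sequence` helper are my own illustrative choices, not AlphaFold's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
MASK = "#"  # hypothetical mask token

def mask_sequence(seq, mask_prob=0.15, rng=rng):
    """Randomly mask or mutate residues, BERT-style.

    Returns the corrupted sequence and (index, true residue) pairs
    that a model would be trained to recover.
    """
    seq = list(seq)
    targets = []
    for i in range(len(seq)):
        if rng.random() < mask_prob:
            targets.append((i, seq[i]))           # remember the true residue
            if rng.random() < 0.8:
                seq[i] = MASK                     # usually replace with mask
            else:
                seq[i] = rng.choice(AMINO_ACIDS)  # sometimes mutate instead
    return "".join(seq), targets

corrupted, targets = mask_sequence("MKTAYIAKQR" * 5)
print(corrupted)
```

Training the network to fill these blanks forces it to learn statistical dependencies between residues, much as BERT learns dependencies between words.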

Limitations

  • Accuracy goes down when the median alignment depth is less than 30 sequences (i.e., it helps to have at least that much alignment information)
  • Worse for proteins with bridging domains in which the shape is mostly created by the interactions of the chains

Progress since AlphaFold’s introduction

  • Prior to its introduction, the 3D structure of 17% of the ~20,000 proteins in the human proteome had been experimentally determined. By 2022, structures for 98% had been predicted. Jobs like the one I had in my college lab became obsolete nearly overnight.
  • A recent MIT paper from the Collins lab showed the limitations of AlphaFold-predicted structures for antibiotic discovery

Fun Fact

CAPTCHA stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart 

This Week’s Top Stories

  • A lab is trying to use skin-derived stem cells to develop brain organoids to power supercomputers.
  • Listen to a podcast about why pharma needs to incorporate AI/ML, especially with profits expected to decrease
  • Machine learning improved predictions for ER patient volume compared to simple univariate time-series analysis, but not by much.
    • Trained on two years of data from a single institution, with a wide variety of inputs including patient data, weather, and air quality; tested on a third year
    • The random forest model performed best, followed by the gradient boosted model (GBM)
  • For those interested in AI security, the National Institute of Standards and Technology (NIST) has an open comment period for terminology related to AI
  • More news on academic medical centers partnering with AI companies to improve clinical documentation
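To get a feel for the ER-volume comparison above, here is a sketch on synthetic data using scikit-learn's default models; the feature names merely mimic the kinds of inputs the study describes, and none of the study's actual data is reproduced:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the study's data (which isn't public)
rng = np.random.default_rng(42)
n_days = 1095  # three years: train on two, test on the third
day_of_week = rng.integers(0, 7, n_days)
temperature = rng.normal(15, 10, n_days)
air_quality = rng.normal(50, 20, n_days)
X = np.column_stack([day_of_week, temperature, air_quality])

# Synthetic daily ER volume: weekday effect + weather effect + noise
y = 100 + 5 * day_of_week - 0.5 * temperature + rng.normal(0, 5, n_days)

X_train, y_train = X[:730], y[:730]  # first two years
X_test, y_test = X[730:], y[730:]    # held-out third year

for model in (RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    preds = model.fit(X_train, y_train).predict(X_test)
    print(type(model).__name__, round(mean_absolute_error(y_test, preds), 1))
```

On real hospital data the margin between these models (and a univariate baseline) was reportedly small, which is the study's main point.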

Machine Learning Paper of the Week

  • This paper from May 2022 is a nice review of the “plagues” of AI in healthcare:
  1. Problem relevance: We built a beautiful model but no one WILL use it
  2. Practicality: We built a beautiful model but no one CAN use it
  3. Generalization to new populations: Getting enough good data is hard, especially in healthcare
  4. Discriminatory bias: Medical datasets underrepresent many populations
  5. Emergence of new trends: Medical practice and equipment changes but the training data doesn’t
  6. Replicability: Minor changes in the algorithm or training data can create big differences
  7. Explainability: Better models are harder for doctors to understand
  8. Accidental fitting of confounders: Common to many studies – this is something doctors understand
  9. Hacking (one-pixel attack): Changing just a small amount of input, even a single pixel, can ruin the model’s prediction
  10. Model drift: Model performance often worsens over time
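To build intuition for plague #9, here is a deliberately contrived toy: a linear classifier (not a deep network) whose weight on one pixel is so large that flipping that single pixel flips the decision. Real one-pixel attacks search for such sensitive pixels in trained networks; everything below is an illustrative assumption:

```python
import numpy as np

w = np.zeros(64)   # weights for a flattened 8x8 "image"
w[:5] = 0.1        # a few mildly informative pixels
w[10] = 5.0        # one pixel the model leans on far too heavily
bias = -1.0

def predict(img):
    return int(img @ w + bias > 0)  # 1 if the decision score is positive

img = np.zeros(64)
img[:5] = 1.0             # score = 0.5 - 1.0 = -0.5 -> class 0
adversarial = img.copy()
adversarial[10] = 1.0     # flip one pixel: score = 4.5 -> class 1

print(predict(img), predict(adversarial))  # 0 1
```

Deep networks can harbor analogous sensitivities that are far harder to spot than a single oversized weight.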

The study also briefly discusses two other major issues: 

  • How to communicate the limitations of a model with physicians
    • False positive/negative rates 
  • Who is liable when the computer is wrong?
    • I’m guessing liability will fall on the physician, since that’s what happens currently when technology is wrong (monitors, etc)
