A transformer model for de novo sequencing of data-independent acquisition mass spectrometry data
Justin Sanders, Bo Wen, Paul Rudnick, Rich Johnson, Christine C. Wu, Sewoong Oh, Michael J. MacCoss, William Stafford Noble. A transformer model for de novo sequencing of data-independent acquisition mass spectrometry data. bioRxiv 2024.06.03.597251; doi: https://doi.org/10.1101/2024.06.03.597251
- Organism: Homo sapiens, Mus musculus
- Instrument: Orbitrap Astral, Q Exactive HF-X
- SpikeIn: No
- Keywords: de novo, data independent acquisition, sequence variant, extracellular vesicles, Mag-Net
- Lab head: Michael MacCoss
- Submitter: Michael MacCoss
A core computational challenge in the analysis of mass spectrometry data is the de novo sequencing problem, in which the generating amino acid sequence is inferred directly from an observed fragmentation spectrum without the use of a sequence database. Recently, deep learning models have made significant advances in de novo sequencing by learning from massive datasets of high-confidence labeled mass spectra. However, these methods are primarily designed for data-dependent acquisition (DDA) experiments. Over the past decade, the field of mass spectrometry has been moving toward data-independent acquisition (DIA) protocols for the analysis of complex proteomic samples because of their superior specificity and reproducibility.
Hence, we present a new de novo sequencing model called Cascadia, which uses a transformer architecture to handle the more complex data generated by DIA protocols. In comparisons with existing approaches for de novo sequencing of DIA data, Cascadia achieves improved performance across a range of instruments and experimental protocols. Additionally, we demonstrate Cascadia’s ability to accurately discover de novo coding variants and peptides from the variable region of antibodies.
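As a concrete illustration of the de novo sequencing problem described above, the following minimal Python sketch computes the theoretical singly charged b- and y-ion m/z values for an unmodified peptide, then reads residues back from the spacing of consecutive b ions. It is not Cascadia's method. The function names `fragment_ladder` and `residues_from_b_ladder`, the 0.01 Da tolerance, and the example peptide are assumptions made for illustration, and the masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Da (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O in Da
PROTON = 1.007276   # mass of a proton in Da


def fragment_ladder(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # prefixes of length 1 .. n-1
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # suffixes of length 1 .. n-1
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def residues_from_b_ladder(b_ions, tol=0.01):
    """Infer the residues spanned by a complete, noise-free b-ion ladder.

    Isoleucine is isobaric with leucine, so it is always reported as "L".
    """
    candidates = {aa: m for aa, m in RESIDUE_MASS.items() if aa != "I"}
    sequence = []
    previous = PROTON                # a hypothetical "b0" ion: just the charge
    for mz in b_ions:
        delta = mz - previous
        best = min(candidates, key=lambda aa: abs(candidates[aa] - delta))
        if abs(candidates[best] - delta) > tol:
            raise ValueError(f"no residue matches mass difference {delta:.4f}")
        sequence.append(best)
        previous = mz
    return "".join(sequence)


if __name__ == "__main__":
    b, y = fragment_ladder("PEPTK")  # hypothetical example peptide
    print(residues_from_b_ladder(b))
```

Real spectra are far harder than this idealized ladder: peaks are missing or noisy, and in DIA each fragmentation spectrum contains fragments from several co-isolated precursors, which is the setting Cascadia is designed for.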
For our experiments with narrow-window DIA data, we use a training dataset of 878,217 labeled augmented spectra derived from 77 mouse plasma DIA mass spectrometry runs with 4 Th isolation windows collected on the Orbitrap Astral.
Peptide detections for training were generated using the DIA_Speclib_Quant workflow in MSFragger-DIA, and precursor features for DeepNovo-DIA and Casanovo were selected using DIA-Umpire.
To demonstrate Cascadia’s ability to discover de novo peptides, we test it in a setting where ground truth labels are available through an orthogonal sequencing modality. We generated 40 DIA runs by sampling the surface of human skin with D100 squame sampling disks, and used three of these runs, derived from three different individuals. Targeted exome sequencing was then performed on 549 genes in these same individuals, yielding lists of 357, 368, and 595 ground truth single-nucleotide variants (SNVs), one list per sample.
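To show how a ground truth variant can be matched against de novo peptide predictions, the sketch below applies a single amino acid substitution to a protein sequence and returns the peptide that would carry the variant after digestion. It is an illustration only, not the pipeline used for this dataset. It assumes tryptic cleavage (after K or R, not before P) with no missed cleavages, starts from an amino acid change rather than the nucleotide-level SNV, and uses hypothetical function names and a made-up protein sequence.

```python
def tryptic_peptides(protein):
    """Return (start_index, peptide) pairs for a full tryptic digest.

    Cleaves after K or R unless the next residue is P; no missed cleavages.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    return peptides


def variant_peptide(protein, position, ref, alt):
    """Return the tryptic peptide containing a substituted residue.

    `position` is 1-based; `ref` must match the residue found there.
    """
    idx = position - 1
    if protein[idx] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {protein[idx]}"
        )
    mutated = protein[:idx] + alt + protein[idx + 1:]
    # Digest the mutated protein, since the substitution itself may
    # create or remove a cleavage site.
    for start, peptide in tryptic_peptides(mutated):
        if start <= idx < start + len(peptide):
            return peptide
    raise RuntimeError("variant position not covered by any peptide")


if __name__ == "__main__":
    protein = "AAKGGSLLRTTK"  # made-up sequence for illustration
    print(variant_peptide(protein, position=6, ref="S", alt="T"))
```

A predicted peptide that equals such a variant peptide, and is absent from the reference digest, would count as support for the corresponding variant. Because isoleucine and leucine have identical masses, any such comparison would in practice need to treat them as equivalent.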
Mouse plasma samples were prepared from 50 µl of EDTA plasma with the Mag-Net protocol for enrichment of extracellular vesicles, as described previously (https://doi.org/10.1101/2023.06.10.544439).
The surface of the skin was sampled using a D100 Squame disk, which was placed in an Eppendorf tube and vortexed in the presence of 2% SDS buffer. The proteins in each sample were digested to peptides using S-Traps following the manufacturer's instructions (https://files.protifi.com/protocols/s-trap-micro-long-4-7.pdf).
Created on 6/21/24, 4:53 PM