An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics
Ulrich Omasits, Adithi R. Varadarajan, Michael Schmid, Sandra Goetze,
Damianos Melidis, Marc Bourqui, Olga Nikolayeva, Maxime Québatte,
Andrea Patrignani, Christoph Dehio, Juerg E. Frey, Mark D. Robinson,
Bernd Wollscheid, and Christian H. Ahrens
- Organism: Bartonella henselae strain Houston-1 (Bhen)
- Instrument: Orbitrap Fusion Mass Spectrometer (Thermo Fisher Scientific)
Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations.
Our strategy towards accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics dataset against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.
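As a rough illustration of the in silico ORF component described above, the following Python sketch enumerates open reading frames in all six frames and treats ATG, GTG and TTG as possible start codons. It is not the released iPtgxDB software: the function name find_orfs, the choice of start codons, and the min_aa length cutoff are assumptions made here for illustration, and the input is assumed to contain only upper-case A, C, G and T.

```python
# Illustrative sketch only; NOT the released iPtgxDB software.
# Assumptions: upper-case ACGT input, start codons ATG/GTG/TTG,
# and a hypothetical minimum ORF length (min_aa).

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard codon -> amino acid mapping ('*' marks a stop codon).
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(genome, starts=("ATG", "GTG", "TTG"), min_aa=10):
    """List (strand, frame, start_codon, protein) for each start-to-stop ORF.

    Every in-frame start codon upstream of a stop gives its own entry,
    so alternative start sites of the same ORF are all reported.
    """
    results = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            open_starts = []  # start codon positions since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if CODON_TABLE[codon] == "*":
                    for s in open_starts:
                        # The initiator codon is translated as Met,
                        # whichever start codon is used.
                        protein = "M" + "".join(
                            CODON_TABLE[seq[j:j + 3]]
                            for j in range(s + 3, i, 3))
                        if len(protein) >= min_aa:
                            results.append((strand, frame, seq[s:s + 3], protein))
                    open_starts = []
                elif codon in starts:
                    open_starts.append(i)
    return results


if __name__ == "__main__":
    for orf in find_orfs("ATGAAAGTGCCCTAA", min_aa=1):
        print(orf)
```

Reporting every in-frame start separately is one simple way to keep alternative start sites distinguishable at the peptide level; a real database would also need to track genomic coordinates and collapse redundant sequences.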
Peptides for novel ORFs, start sites, expressed pseudogenes, or assembly-specific changes were selected based on spectral count, number of tryptic sites, number of missed cleavage sites, and PeptideRank prediction (Qeli et al. 2014). Heavy-labeled reference peptides were purchased from JPT Peptide Technologies GmbH (Berlin, Germany) and used to set up PRM assays (Bartonella_referencePeptides). Specific transitions were measured in Cyt and TM extracts of biological replicates of both conditions (i.e., induced vs. uninduced; see below).
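Two of the selection criteria above, tryptic sites and missed cleavage sites, can be derived from a protein sequence by in silico digestion. The sketch below is a minimal example under stated assumptions: it uses the common convention that trypsin cleaves C-terminal to K or R unless the next residue is P, and the function name tryptic_peptides and its max_missed and min_len parameters are hypothetical. The thresholds actually used for peptide selection are not given in this description.

```python
import re

# Illustrative sketch only; the cleavage rule (after K/R, not before P)
# is a common convention and an assumption here, as are the parameters.


def tryptic_peptides(protein, max_missed=1, min_len=6):
    """Return (peptide, missed_cleavages) pairs from an in silico digest."""
    # Split after every K or R that is not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        for missed in range(max_missed + 1):
            if i + missed < len(pieces):
                # Joining missed + 1 fully cleaved pieces leaves
                # `missed` internal cleavage sites uncut.
                peptide = "".join(pieces[i:i + missed + 1])
                if len(peptide) >= min_len:
                    peptides.append((peptide, missed))
    return peptides


if __name__ == "__main__":
    for peptide, missed in tryptic_peptides("AAKCCRDD", max_missed=1, min_len=1):
        print(peptide, missed)
```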
Protein extracts of Bartonella cytoplasmic (Cyt) and total membrane (TM) fractions were prepared from MQB307 grown under uninduced and induced conditions as described (Omasits et al. 2013).
For tryptic digestion, Bhen protein extracts from Cyt and TM fractions were precipitated with 80% acetone before reduction. Alkylation was carried out with 10 mM iodoacetamide to modify cysteine residues. After trypsin digestion, samples were purified by reversed-phase C18 chromatography (Sep-Pak, Waters). For each condition and fraction, two experimental replicates were processed.
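Alkylation with iodoacetamide converts cysteine to carbamidomethyl-cysteine, which adds 57.02146 Da (monoisotopic) to each modified residue and therefore has to be accounted for when peptide masses are calculated. The following sketch shows the arithmetic using standard monoisotopic residue masses; the function name peptide_mass and its alkylated flag are hypothetical, and complete alkylation of all cysteines is assumed.

```python
# Illustrative sketch only; assumes every cysteine is carbamidomethylated
# when alkylated=True. Masses are standard monoisotopic values in Da.

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056            # H2O added for the free peptide termini
CARBAMIDOMETHYL = 57.02146  # mass shift per alkylated cysteine


def peptide_mass(peptide, alkylated=True):
    """Monoisotopic mass of a neutral peptide, optionally with alkylated Cys."""
    mass = WATER + sum(RESIDUE_MASS[aa] for aa in peptide)
    if alkylated:
        mass += CARBAMIDOMETHYL * peptide.count("C")
    return mass


if __name__ == "__main__":
    print(round(peptide_mass("ACK", alkylated=False), 5))
    print(round(peptide_mass("ACK", alkylated=True), 5))
```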
(Omasits, U., Quebatte, M., Stekhoven, D.J., Fortes, C., Roschitzki, B., Robinson, M.D., Dehio, C., and Ahrens, C.H. 2013. Directed shotgun proteomics guided by saturated RNA-Seq identifies a complete expressed prokaryotic proteome. Genome Research 23: 1916-1927).
Created on 6/14/17, 9:38 AM