A quantitative tagless co-purification method to validate and identify protein-protein interactions
Maxim Shatsky, Ming Dong, Haichuan Liu, Lee Lisheng Yang, Megan Choi, Mary E. Singer, Jil T. Geller, Susan J. Fisher, Steven C. Hall, Terry C. Hazen, Steven E. Brenner, Gareth Butland, Jian Jin, H. Ewa Witkowska, John-Marc Chandonia, and Mark D. Biggin
- Organism: Desulfovibrio vulgaris Hildenborough wild-type ATCC29579
- Instrument: AB Sciex 4800 TOF/TOF, AB Sciex 5800 TOF/TOF
A quantitative tagless method that uses iTRAQ™ mass spectrometry (Ross et al., 2004. Mol. Cell. Proteomics 3:1154-69) to measure the co-purification of endogenous proteins through orthogonal chromatographic fractionation steps (Dong et al., 2008. J. Proteome Res. 7:1836-49) was employed to characterize protein-protein interactions (PPIs) in D. vulgaris (DvH). 5,273 fractions from a four-step fractionation of a D. vulgaris protein extract were assayed, leading to the detection of 1,242 proteins. Shotgun LC MALDI analyses were performed on AB Sciex 4800 and 5800 TOF/TOF mass spectrometers. ProteinPilot™ software was used for protein identification and relative quantitation. Pearson cross-correlation values (CC values) were computed for each iTRAQ multiplex, separately for the SEC and HIC dimensions. Interologs of protein pairs in the established benchmark protein-protein interaction sets are much more likely to have high maximum CC values in both the HIC and SEC dimensions than all protein pairs or negative protein pairs. We identified 200 high confidence D. vulgaris PPIs based on tagless co-purification and co-localization in the genome, 140 of which are not part of our D. vulgaris affinity purification-MS interactome (Shatsky et al., submitted).
The library was generated using all data acquired in the course of the project. All the highest ranking DvH proteins identified on the basis of at least one peptide matched with a confidence above 0.95 are shown, including proteins that were later filtered out during the protein-protein interaction analysis, as described in the paper. Competitor proteins based on the evidence used for the higher ranking counterparts are not shown.
Identifying protein-protein interactions (PPIs) at an acceptable false discovery rate (FDR) using large scale screens is challenging. Previously we identified several hundred PPIs from affinity purification - mass spectrometry (AP-MS) data for the bacteria Escherichia coli and Desulfovibrio vulgaris. These two interactomes have lower FDRs and are much more enriched in protein pairs annotated with similar functions or validated by other interaction data than the pairs from nine other bacterial AP-MS or yeast two hybrid (Y2H) interactomes. To more thoroughly determine the accuracy of protein interactomes and identify PPIs de novo, here we present a quantitative tagless method that employs iTRAQ™ mass spectrometry to measure the co-purification of endogenous proteins through orthogonal chromatographic fractionation steps. 5,273 fractions from a four-step fractionation of a D. vulgaris protein extract were assayed, leading to the detection of 1,242 proteins. Protein partners in our D. vulgaris and E. coli AP-MS interactomes show highly correlated co-purification as frequently as pairs belonging to three benchmark datasets of well characterized PPIs. In contrast, the protein pairs from the nine other Y2H or AP-MS screens co-purify 2- to 20-fold less often. We also identify 200 high confidence D. vulgaris PPIs based on tagless co-purification and co-localization in the genome, 140 of which are not part of our D. vulgaris AP-MS interactome. These novel PPIs include examples validated by other experiments and also identify additional members of complexes first detected by AP-MS. Our results establish that a quantitative tagless method can be used to validate and identify PPIs, but that such data must be analyzed carefully to minimize the FDR.
10 g of soluble protein was extracted from a crude cell lysate of 400 L of wild type D. vulgaris cell culture (D. vulgaris Hildenborough wild-type ATCC29579). This crude extract was separated by ammonium sulfate precipitation, followed by three successive highly parallel chromatographic steps: MonoQ anion exchange chromatography (Q-IEC); hydrophobic interaction chromatography (HIC); and size exclusion chromatography (SEC). To avoid redundantly analyzing similar fractions, every second or third fraction from each preceding separation step was used as input to the subsequent step. Each fraction from the SEC dimension was digested with trypsin and the resulting peptides labeled with iTRAQ™ reagents to quantitate relative abundances of each protein between fractions. Samples were combined to form iTRAQ multiplexes that contain between 3 and 8 SEC fractions for simultaneous mass spectrometry. Two patterns of iTRAQ labeling were used. In one, successive fractions from the same SEC column were labeled to determine the elution profiles of each protein across that column. In the second, the equivalent fractions from multiple SEC columns (i.e. fractions with the same retention time, same sized proteins) were labeled to allow the elution of proteins across the HIC column to be inferred. A total of 1,472 distinct iTRAQ-labeled multiplexes were obtained and assayed by shotgun MALDI LC MS.
Pearson cross-correlation values (CC values) were computed for each iTRAQ multiplex, separately for the SEC and HIC dimensions. Each co-occurring protein pair was assigned the maximum CC value for that pair for the SEC and, separately, for the HIC dimension. We used logistic regression, a machine learning method, to combine up to eight features and rank co-occurring pairs by the confidence that they are bona fide PPIs. Five features derive only from the tagless mass spectrometry data and include the CC values in the HIC and SEC dimensions as well as the frequency with which protein pairs co-occur in the same fractions. The remaining three features are based on genome location and capture the tendency for two genes to be present in the same operon across a range of species, using information from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING). The logistic regression was trained on a gold standard positive set of likely PPIs and a gold standard negative set of non-interacting protein pairs. All eight features show strong enrichment of pairs from the gold positive set over pairs from the gold negative set, indicating that each feature can partially distinguish true positive from false positive PPIs. Cross-validation ensured that gold standard complexes used for training were excluded from the validation step.
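The per-pair maximum CC value described above can be illustrated with a minimal sketch. It assumes each multiplex is represented as a mapping from protein name to a list of relative abundances across the labeled fractions; the function names, the data layout, and the toy profiles are illustrative assumptions, not the code used in the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles; None if undefined."""
    n = len(x)
    if n != len(y) or n < 2:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return None  # a flat profile has no defined correlation
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)

def max_cc(multiplexes, protein_a, protein_b):
    """Maximum CC for a pair over all multiplexes in which both proteins occur."""
    best = None
    for profiles in multiplexes:
        if protein_a in profiles and protein_b in profiles:
            cc = pearson(profiles[protein_a], profiles[protein_b])
            if cc is not None and (best is None or cc > best):
                best = cc
    return best

# Hypothetical toy data: two SEC multiplexes with invented abundance profiles.
sec_multiplexes = [
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 4.0, 6.0, 8.0], "C": [4.0, 3.0, 2.0, 1.0]},
    {"A": [1.0, 3.0, 2.0], "B": [3.0, 1.0, 2.0]},
]

if __name__ == "__main__":
    print(max_cc(sec_multiplexes, "A", "B"))
    print(max_cc(sec_multiplexes, "A", "C"))
```

The same function would be applied once to the SEC multiplexes and once to the HIC multiplexes, giving two of the features that feed the logistic regression.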
Tryptic digestion of SEC fractionated proteins was performed on 96-well PVDF membrane plates (Multiscreen-IP 0.45 µm, Millipore, MAIPN4510) using a protocol based on a method originally introduced for protein N-deglycosylation (Papac et al., 1998. Glycobiology 8:445-54). The tryptic peptides were eluted from the membranes into a 96-well collection plate under vacuum. Two formats of the iTRAQ derivatization procedure were used: the manufacturer’s protocol and our own miniaturized protocol that employed 1/8th of the reagent. The resulting samples were combined to form multiplexes comprising up to 4 or 8 SEC fractions for simultaneous MS analysis when using iTRAQ 4- and 8-plex reagents, respectively. To allow protein elution profiles to be quantitated across all selected fractions from a single column, one “joint” fraction was labeled twice with a common iTRAQ label (e.g. 113) that was used in two otherwise non-overlapping multiplexes.
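One way a shared “joint” fraction could link two multiplexes is sketched below. The rescaling rule (multiplying the second multiplex by the ratio of the joint fraction's values) is an assumption made for illustration; the text does not state how the profiles were actually placed on a common scale, and the function name and fraction labels are hypothetical.

```python
def stitch_profiles(first, second, joint):
    """Merge two {fraction: relative abundance} profiles that share `joint`.

    Assumption: the second multiplex is rescaled so that its value for the
    joint fraction equals the value observed in the first multiplex.
    """
    if joint not in first or joint not in second:
        raise ValueError("joint fraction must be present in both multiplexes")
    if second[joint] == 0:
        raise ValueError("cannot rescale: joint fraction has zero abundance")
    scale = first[joint] / second[joint]
    merged = dict(first)
    for fraction, value in second.items():
        if fraction != joint:
            merged[fraction] = value * scale
    return merged

# Hypothetical profile of one protein in two multiplexes sharing fraction "f3".
multiplex_1 = {"f1": 0.5, "f2": 1.0, "f3": 2.0}
multiplex_2 = {"f3": 1.0, "f4": 0.5, "f5": 0.25}

if __name__ == "__main__":
    print(stitch_profiles(multiplex_1, multiplex_2, "f3"))
```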
iTRAQ-labeled peptide mixtures were separated by reversed phase HPLC using an Ultimate 3000 dual column HPLC system (Dionex, Sunnyvale, CA) that was set up in a parallel configuration and equipped with a pair of reversed phase LC Packings/Dionex Monolithic PepSwift-DVB trap and analytical columns (200 µm x 1 cm and 200 µm x 5 cm, respectively). The LC system was operated in a swinging fashion to allow simultaneous peptide fractionation and column equilibration using an active and a resting column, respectively. Each sample was fractionated into 129 fractions over an 8-min collection time, at 3.66 seconds per spot, mixed with MALDI matrix [5 mg/mL α-cyano-4-hydroxycinnamic acid (CHCA) in 80% acetonitrile / 0.05% TFA] containing 10 mM ammonium phosphate and 20 fmol/µL of [Glu1]-fibrinopeptide B (Glu-Fib) as an internal calibration standard, and spotted onto a blank MALDI plate (AB Sciex) using a SunChrom Fraction Collector/Spotter (SunChrom, Friedrichsdorf, Germany). The majority of analyses were performed using a 4800 MALDI TOF/TOF mass spectrometer (AB Sciex) operated using 4000 Series Explorer software (version 3.5.28193; build 1011, AB Sciex). External calibration based on Plate Model software (AB Sciex) was applied. Internal one-point calibration using the monoisotopic mass of the spiked Glu-Fib (m/z 1570.677) as a reference was performed for all spectra that met the preset internal standard data quality criteria (minimum accuracy of 0.2 Da and signal-to-noise (S/N) of 50). The vendor-supplied “stop conditions” software was employed to automatically stop MS/MS data acquisition once all the specified spectrum quality criteria were reached. A small portion of the data was acquired using an AB Sciex 5800 mass spectrometer while employing an iterative MS/MS acquisition routine (Liu et al., 2011. Anal. Chem. 83:6286-93).
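Two of the numerical settings above can be checked with a short sketch. It assumes that "minimum accuracy of 0.2 Da" means an absolute mass error of at most 0.2 Da for the Glu-Fib peak and that an S/N of 50 is a lower bound; both readings, and the function names, are assumptions rather than the vendor software's logic. The spotting arithmetic (129 spots at 3.66 s each, about 472 s or 7.9 min) agrees with the stated 8-min collection time.

```python
GLU_FIB_MZ = 1570.677        # monoisotopic m/z of Glu-Fib, from the text
MAX_MASS_ERROR_DA = 0.2      # assumed to be a maximum absolute mass error
MIN_SIGNAL_TO_NOISE = 50.0   # assumed to be a minimum S/N

def passes_internal_standard_criteria(observed_mz, signal_to_noise):
    """True if the Glu-Fib peak is good enough to use for internal calibration."""
    mass_error = abs(observed_mz - GLU_FIB_MZ)
    return mass_error <= MAX_MASS_ERROR_DA and signal_to_noise >= MIN_SIGNAL_TO_NOISE

def collection_time_minutes(n_spots=129, seconds_per_spot=3.66):
    """Total spotting time for one LC run, in minutes."""
    return n_spots * seconds_per_spot / 60.0

if __name__ == "__main__":
    print(passes_internal_standard_criteria(1570.80, 120.0))
    print(passes_internal_standard_criteria(1570.90, 120.0))
    print(round(collection_time_minutes(), 2))
```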
The AB Sciex search engine ProteinPilot™ v. 3.0 and 4.0 with the Paragon™ Method algorithm (Shilov et al., 2007. Mol. Cell. Proteomics 6:1638-55) was employed for protein identification and calculation of relative protein abundances. The ProteinPilot “Add TOF/TOF Data” module was used to extract raw MS data stored in an Oracle database for direct submission to the search engine. The majority of analyses (~88%) utilized a custom database (a total of 51,283 entries) that included 6-frame translated products of the D. vulgaris genome and common contaminants. The presence of at least one peptide matched with a confidence of 95% was used as a threshold for considering a protein for further analysis. Competitor protein identifications based on the same evidence (spectra) explained by alternate hypotheses of the same confidence were included. After subsequent filtering described below, however, all proteins present in pairs that co-occur with CC values >0.85 or are part of the 200 high confidence PPIs were detected by at least one peptide with a confidence of 99% and were ranked as primary identifications. Default settings of the search engine algorithm were used to calculate average relative abundances of each polypeptide. Neither bias correction nor background subtraction options were employed.
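The peptide-confidence threshold can be expressed as a simple filter. The sketch below assumes peptide confidences are available as fractions between 0 and 1 per protein and treats the threshold as inclusive; the data layout, the function name, and the example identifiers and values are hypothetical and do not reflect the ProteinPilot export format.

```python
def proteins_passing(peptide_confidences, threshold=0.95):
    """Return proteins with at least one peptide at or above the threshold.

    `peptide_confidences` maps a protein identifier to a list of peptide
    confidences in the range 0-1 (an assumed layout, for illustration only).
    """
    return sorted(
        protein
        for protein, confidences in peptide_confidences.items()
        if any(c >= threshold for c in confidences)
    )

# Hypothetical identifiers and confidences, invented for this example.
example = {
    "protein_1": [0.99, 0.60],
    "protein_2": [0.94],
    "protein_3": [0.96],
}

if __name__ == "__main__":
    print(proteins_passing(example))                  # 95% threshold
    print(proteins_passing(example, threshold=0.99))  # stricter 99% threshold
```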
Created on 12/18/15 10:51 AM