Here we present a protocol for the rapid identification of proteins produced by genomically sequenced pathogenic bacteria using MALDI-TOF-TOF tandem mass spectrometry and top-down proteomic analysis with software developed in-house. Metastable protein ions fragment because of the aspartic acid effect and this specificity is exploited for protein identification.
This protocol identifies the immunity proteins of the bactericidal enzymes: colicin E3 and bacteriocin, produced by a pathogenic Escherichia coli strain using antibiotic induction, and identified by MALDI-TOF-TOF tandem mass spectrometry and top-down proteomic analysis with software developed in-house. The immunity protein of colicin E3 (Im3) and the immunity protein of bacteriocin (Im-Bac) were identified from prominent b- and/or y-type fragment ions generated by the polypeptide backbone cleavage (PBC) on the C-terminal side of aspartic acid, glutamic acid, and asparagine residues by the aspartic acid effect fragmentation mechanism. The software rapidly scans in silico protein sequences derived from the whole genome sequencing of the bacterial strain. The software also iteratively removes amino acid residues of a protein sequence in the event that the mature protein sequence is truncated. A single protein sequence possessed mass and fragment ions consistent with those detected for each immunity protein. The candidate sequence was then manually inspected to confirm that all detected fragment ions could be assigned. The N-terminal methionine of Im3 was post-translationally removed, whereas Im-Bac had the complete sequence. In addition, we found that only two or three non-complementary fragment ions formed by PBC are necessary to identify the correct protein sequence. Finally, a promoter (SOS box) was identified upstream of the antibacterial and immunity genes in a plasmid genome of the bacterial strain.
Analysis and identification of undigested proteins by mass spectrometry is referred to as top-down proteomic analysis1,2,3,4. It is now an established technique that utilizes electrospray ionization (ESI)5, high-resolution mass analyzers6, and sophisticated dissociation techniques, e.g., electron transfer dissociation (ETD), electron capture dissociation (ECD)7, ultraviolet photo-dissociation (UV-PD)8, etc.
The other soft ionization technique is matrix-assisted laser desorption/ionization (MALDI)9,10,11, which has been less extensively utilized for top-down analysis, in part because it is primarily coupled to time-of-flight (TOF) mass analyzers, which have limited resolution compared to other mass analyzers. Despite these limitations, MALDI-TOF and MALDI-TOF-TOF instruments have been exploited for the rapid top-down analysis of pure proteins and fractionated and unfractionated mixtures of proteins. For the identification of pure proteins, in-source decay (ISD) is a particularly useful technique because it allows mass spectrometry (MS) analysis of ISD fragment ions, as well as tandem mass spectrometry (MS/MS) of protein ion fragments, providing sequence-specific fragments often from the N- and C-termini of the target protein, analogous to Edman sequencing12,13. A drawback to the ISD approach is that, as in Edman sequencing, the sample must contain only one protein. The one protein requirement is due to the need for unambiguous attribution of fragment ions to a precursor ion. If two or more proteins are present in a sample, it may be difficult to assign which fragment ions belong to which precursor ions.
Fragment ion/precursor ion attribution can be addressed using MALDI-TOF-TOF-MS/MS. As with any classical MS/MS experiment, precursor ions are mass-selected/isolated prior to fragmentation, and the fragment ions detected can be attributed to a specific precursor ion. However, the dissociation techniques available for this approach are restricted to primarily high energy collision-induced dissociation (HE-CID)14 or post-source decay (PSD)15,16. HE-CID and PSD are most effective at fragmenting peptides and small proteins, and the sequence coverage can, in some cases, be limited. In addition, PSD results in polypeptide backbone cleavage (PBC) primarily on the C-terminal side of aspartic and glutamic acid residues by a phenomenon called the aspartic acid effect17,18,19,20.
MALDI-TOF-MS has also found a niche application in the taxonomic identification of microorganisms: bacteria21, fungi22, and viruses23. For example, MS spectra are used to identify unknown bacteria by comparison to a reference library of MS spectra of known bacteria using pattern recognition algorithms. This approach has proved highly successful because of its speed and simplicity, although it requires overnight culturing of the isolate. The protein ions detected by this approach (usually under 20 kDa) comprise an MS fingerprint allowing taxonomic resolution at the genus and species level and in some cases at the sub-species24 and strain level25,26. However, there remains a need to not only taxonomically classify potentially pathogenic microorganisms but also identify specific virulence factors, toxins, and antimicrobial resistance (AMR) factors. To accomplish this, the masses of peptides, proteins, or small molecules are measured by MS, and the ions are subsequently isolated and fragmented by MS/MS.
Pathogenic bacteria often carry circular pieces of DNA called plasmids. Plasmids, along with prophages, are a major vector of horizontal gene transfer between bacteria and are responsible for the rapid spread of antimicrobial resistance and other virulence factors across bacteria. Plasmids may also carry antibacterial (AB) genes, e.g., colicin and bacteriocin. When these genes are expressed and the proteins secreted, they act to disable the protein translation machinery of neighboring bacteria occupying the same environmental niche27. However, these bactericidal enzymes can also pose a risk to the host that produced them. In consequence, a gene is co-expressed by the host that specifically inhibits the function of an AB enzyme and is referred to as its immunity protein (Im).
DNA-damaging antibiotics such as mitomycin-C and ciprofloxacin are often used to induce the SOS response in Shiga toxin-producing E. coli (STEC) whose Shiga toxin gene (stx) is found within a prophage genome present in the bacterial genome28. We have used antibiotic induction, MALDI-TOF-TOF-MS/MS, and top-down proteomic analysis previously to detect and identify Stx types and subtypes produced by STEC strains29,30,31,32. In the previous work, STEC O113:H21 strain RM7788 was cultured overnight on agar media supplemented with mitomycin-C. However, instead of detecting the anticipated B-subunit of Stx2a at m/z ~7816, a different protein ion was detected at m/z ~7839 and identified as a plasmid-encoded hypothetical protein of unknown function33. In the current work, we identified two plasmid-encoded AB-Im proteins produced by this strain using antibiotic induction, MALDI-TOF-TOF-MS/MS, and top-down proteomic analysis using standalone software developed to process and scan in silico protein sequences derived from whole-genome sequencing (WGS). In addition, the possibility of post-translational modifications (PTMs) involving sequence truncation was incorporated into the software. The immunity proteins were identified using this software from the measured mass of the mature protein ion and sequence-specific fragment ions from PBC caused by the aspartic acid effect and detected by MS/MS-PSD. Finally, a promoter was identified upstream of the AB/Im genes in a plasmid genome that may explain the expression of these genes when this strain is exposed to a DNA-damaging antibiotic. Portions of this work were presented at the National American Chemical Society Fall 2020 Virtual Meeting & Expo (August 17-20, 2020)34.
1. Microbiological sample preparation
2. Mass spectrometry
3. In silico protein database construction
4. Operating Protein Biomarker Seeker software
5. Post-search confirmation of protein sequence
Figure 3 (top panel) shows the MS of STEC O113:H21 strain RM7788 cultured overnight on LBA supplemented with 400 ng/mL mitomycin-C. Peaks at m/z 7276, 7337, and 7841 had been identified previously as cold-shock protein C (CspC), cold-shock protein E (CspE), and a plasmid-borne protein of unknown function, respectively33. The protein ion at m/z 9780 [M+H]+ was analyzed by MS/MS-PSD as shown in Figure 3 (bottom panel). The precursor ion was isolated with a timed-ion selector (TIS) window ±100 Da. Fragment ions are identified by their m/z and type/number. The fragment ion at m/z 2675.9 (highlighted with a star) is spillover from the dissociation of the metastable protein ion at m/z 9655 shown in Figure 3 (top panel). The theoretical average m/z of each fragment ion is shown in parentheses based on PBC of the sequence of colicin E3 immunity protein (Im3) shown above. Sites of PBC are highlighted with a red asterisk with the corresponding fragment ion(s) produced. The N-terminal methionine is underlined signifying that it is post-translationally removed in the mature protein. The sequence has a single cysteine residue (boxed) and is therefore considered in its reduced state.
Using the mass of the protein biomarker and a few prominent non-complementary fragment ions: m/z 1813.8, 2128.9, and 4293.7 (±1.5 tolerance) (Figure 1, bottom panel) and restricting PBC to the C-terminal side of D- and E-residues, only one candidate sequence was reported by the software: Im3 protein sequence (without its N-terminal methionine) (Figure 2, bottom panel). When selecting fragment ions for a search, it should be emphasized that a group of fragment ions is non-complementary only if summing the m/z of any two fragment ions in the group (and subtracting two protons) gives a mass sum that does not fall within the biomarker mass and its associated mass tolerance (±10 Da). Draft WGS of RM7788 revealed 5008 protein sequences (open reading frames)37. Of these ~5,000 full protein sequences, 189,490 full and partial sequences (unrestricted truncation) met the biomarker mass criteria (Figure 2, top panel). Those sequences passing the mass criteria then undergo in silico PBC on the C-terminal side of D- and/or E-residues. The resulting fragment ions are then compared to the observed fragment ions entered. The candidate sequence reported by the software was based solely on its mass and three D- and/or E-specific PBC sites. The specificity achieved by such a small amount of information will be discussed in the next section.
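The non-complementarity criterion described above can be checked with a few lines of code. The following is a minimal sketch, not the Protein Biomarker Seeker implementation: the function names are hypothetical, the proton mass is a standard constant, and the protein mass of 9779 Da is the value measured in MS-linear mode for Im3.

```python
from itertools import combinations

PROTON = 1.00728  # mass of a proton (Da)


def is_complementary(mz_a, mz_b, protein_mass, tol=10.0):
    """True when two singly charged fragment ions sum to the protein mass.

    The two ionizing protons are subtracted before comparison, as in the text.
    """
    return abs(mz_a + mz_b - 2 * PROTON - protein_mass) <= tol


def non_complementary(fragments, protein_mass, tol=10.0):
    """True when no pair of fragment ions in the group is complementary."""
    return not any(
        is_complementary(a, b, protein_mass, tol)
        for a, b in combinations(fragments, 2)
    )


# Fragment ions entered for the Im3 search; 9779 Da is the measured protein mass.
im3_search_ions = [1813.8, 2128.9, 4293.7]
print(non_complementary(im3_search_ions, 9779.0))
```

Replacing m/z 1813.8 with m/z 7645.1 would make the group fail the check, because m/z 7645.1 and 2128.9 sum (minus two protons) to within 10 Da of the protein mass.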
As shown in Figure 3 (bottom panel), the most abundant fragment ions are the result of PBC on the C-terminal side of D- and E-residues via the aspartic acid effect fragmentation mechanism19,20. Two complementary fragment ion pairs (CFIP) are observed: b67/y17 (m/z 7645.1 / m/z 2128.9) and b70/y14 (m/z 7959.4 / m/z 1813.8). These CFIP can be used to more accurately calculate the mass of the protein precursor ion using the simple formula: b (m/z) + y (m/z) – 2H+ = protein mass (Da)33. Using the two CFIP, we obtain an average mass of the protein: 9771.6 Da, which is closer to its theoretical value of 9772.5 Da than the measured mass of the protein ion in MS-linear mode: 9779 Da (Figure 3, top panel). Only a few CFIP were detected because most of the precursor ions have the ionizing proton sequestered at the only arginine residue: R80. The higher gas phase basicity of arginine (237.0 kcal/mol38) compared to a lysine residue (K) (221.8 kcal/mol38) is likely responsible for preferential sequestration of the ionizing proton at the only R-residue.
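As a worked check of the formula above, the short sketch below recomputes the protein mass from the two fragment ion pairs reported for Im3. The function name is hypothetical, and the proton mass (1.00728 Da) is a standard constant rather than a value taken from the protocol.

```python
PROTON = 1.00728  # mass of a proton (Da)


def protein_mass_from_cfip(b_mz, y_mz):
    """Protein mass (Da) from a complementary b/y pair of singly charged ions."""
    return b_mz + y_mz - 2 * PROTON


# b67/y17 and b70/y14 from Figure 3 (bottom panel)
cfips = [(7645.1, 2128.9), (7959.4, 1813.8)]
masses = [protein_mass_from_cfip(b, y) for b, y in cfips]
average_mass = sum(masses) / len(masses)

print([round(m, 1) for m in masses])  # one estimate per pair
print(round(average_mass, 1))         # average over both pairs
```

The average comes out at about 9771.6 Da, in agreement with the value quoted above and within 1 Da of the theoretical mass of 9772.5 Da.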
Figure 4 (top panel) shows the MS of STEC O113:H21 strain RM7788 cultured overnight on LBA supplemented with 800 ng/mL mitomycin-C. Figure 4 (top panel) is quite similar to Figure 3 (top panel), although there are differences in the relative abundance of some protein ions due to the differences in antibiotic concentrations utilized. There are also slight shifts in protein biomarker m/z that reflect differences in external calibration of the instrument on different days. Once again, the protein ions at m/z 7272, 7335, and 7838 are CspC, CspE, and a plasmid-borne protein, respectively. In addition, we detect the Im3 protein ion at m/z 9778 (albeit with less abundance than in Figure 3) as well as a protein ion at m/z 9651 [M+H]+. Figure 4 (bottom panel) shows MS/MS-PSD of the protein precursor ion at m/z 9651. The precursor ion was isolated using a narrower and asymmetric TIS window of -75/+60 Da to eliminate contributions of adjacent protein ions at m/z 9539 and 9778. Fragment ions are identified by their m/z and type/number. The sequence of the immunity protein of bacteriocin (Im-Bac) is shown above. Sites of PBC are highlighted with a red asterisk with their corresponding fragment ion(s). The theoretical average m/z of each fragment ion is also shown in parentheses in the spectrum. The Im-Bac sequence also has a single cysteine residue (boxed) and is therefore considered in its reduced state.
Using the protein biomarker mass, three prominent non-complementary fragment ions: m/z 2675.4, 3853.5, and 5772.8 (±1.50 tolerance) from Figure 4 and restricting PBC to only the C-terminal side of D- and/or E- and/or asparagine (N)-residues, only one candidate sequence was reported by the software: Im-Bac protein. The candidate sequence was retrieved after scanning 191,375 full or partial sequences that met the biomarker mass and tolerance (±10 Da) criteria. The candidate sequence was identified by the software based solely on its mass and three D- and/or E- and/or N-specific PBC sites.
The most prominent fragment ions in Figure 4 (bottom panel) were, once again, the result of PBC on the C-terminal side of D- and/or E-residues and also on the N-terminal side of one of the P-residues20. We also observe PBC on the C-terminal side of an N-residue that is also likely to occur by an aspartic acid effect-like fragmentation mechanism39,40. The weakness of the protein precursor ion signal results in a limited number of interpretable fragment ions. The accuracy of the fragment ion m/z declines with fragment ion abundance. No CFIP were detected, presumably because the ionizing proton is sequestered at the only arginine residue (R74) of the protein ion sequence. All fragment ions contain the R74 residue, consistent with this hypothesis.
The promoter of antibacterial immunity genes
Figure 5 shows a portion of the 6482 bp contig00100 of E. coli strain RM7788 (GenBank: NWVS01000096.1) from whole-genome shotgun sequencing37. The coding regions for colicin E3, its immunity protein (Im3), the immunity protein of bacteriocin (Im-Bac), and a lysis protein are highlighted in yellow. Upstream of the coding region for the colicin E3 gene are the -35 region, the Pribnow box (PB), the inverted repeat of the SOS box, and the Shine-Dalgarno/ribosomal binding site (SD/RBS)27. There is a nine base-pair intergenic region between colicin E3 and Im3. LexA (a repressor protein and an autopeptidase) binds to the SOS box, blocking the expression of genes downstream. Upon DNA damage (e.g., UV radiation or DNA-damaging antibiotics), LexA undergoes self-cleavage, allowing expression of genes downstream27,28. Thus, the expression of these two immunity proteins is consistent with exposure of this strain to a DNA-damaging antibiotic.
Figure 1: Screen shots of Protein Biomarker Seeker software. Top panel: Graphical user interface (GUI) of the Protein Biomarker Seeker software. Bottom panels: Pop-up windows of Protein Mass Calculator Tool, Fragment Page, Confirm Search Parameters, and Search progress bar.
Figure 2: Search results of a protein identification using Protein Biomarker Seeker software. Top panel: Summary of search results displayed in the Log Field of the software GUI. Bottom panel: A pop-up window displaying a protein identification using the software.
Figure 3: Mass spectrometry analysis of STEC O113:H21 strain RM7788. Top panel: MS of STEC O113:H21 strain RM7788 cultured overnight on LBA supplemented with 400 ng/mL mitomycin-C. Bottom panel: MS/MS-PSD of the protein precursor ion at m/z 9780 (top panel). The precursor ion was isolated with a TIS window ±100 Da. Fragment ions are identified by their m/z and ion type. The sequence of the immunity protein for colicin E3 (Im3) is shown. Basic residues (sites of possible charge sequestration) are highlighted in blue. Sites of PBC are highlighted with a red asterisk with the corresponding fragment ion(s) generated. The theoretical average m/z of each fragment ion is shown in parentheses.
Figure 4: Mass spectrometry analysis of STEC O113:H21 strain RM7788. Top panel: MS of STEC O113:H21 strain RM7788 cultured overnight on LBA supplemented with 800 ng/mL mitomycin-C. Bottom panel: MS/MS-PSD of the protein precursor ion at m/z 9651 (top panel). The precursor ion was isolated with an asymmetric TIS window of -75 on the low m/z side of the precursor ion and +60 on the high m/z side of the precursor ion. Fragment ions are identified by their m/z and ion type. The sequence of the immunity protein of bacteriocin (Im-Bac) is shown. Basic residues (sites of possible charge sequestration) are highlighted in blue. PBC are highlighted with a red asterisk with the corresponding fragment ion(s) generated. The theoretical average m/z of each fragment ion is shown in parentheses.
Figure 5: Analysis of a section of the plasmid genome carried by E. coli O113:H21 strain RM7788. A portion of the 6482 bp contig00100 of E. coli O113:H21 strain RM7788 (GenBank: NWVS01000096.1) from whole genome shotgun sequencing37.
Supplementary File 1 (S1 Im3): Results of benchmarking analysis of software using select fragment ions of Im3 (from Figure 3, bottom panel).
Supplementary File 2 (S2 ImBac): Results of benchmarking analysis of software using select fragment ions of Im-Bac (from Figure 4, bottom panel).
Protocol considerations
The primary strengths of the current protocol are its speed, simplicity of sample preparation, and use of an instrument that is relatively easy to operate, be trained on, and maintain. Although bottom-up and top-down proteomic analysis by liquid chromatography-ESI-HR-MS are ubiquitous and far superior in many respects to top-down by MALDI-TOF-TOF, they require more time, labor, and expertise. Instrument complexity can often affect whether certain instrument platforms are likely to be adopted by scientists not formally trained in mass spectrometry. The top-down approach with MALDI-TOF-TOF is meant to extend the analysis of MALDI-TOF-MS beyond its current use for taxonomic identification of bacteria in clinical microbiology labs while not dramatically increasing the labor, complexity, or expertise required for analysis.
The protocol does not employ any mechanical (or electrical) cell lysis step. Although secreted or extracellular proteins may be detected using the protocol, an earlier version of this method was first developed for detection of Shiga toxin (Stx) from STEC strains wherein antibiotic induction triggers the bacterial SOS response resulting in expression of phage genes, including stx as well as late phage genes responsible for bacterial cell lysis41. We found that antibiotic-induced cell lysis has certain advantages for the detection of Stx as well as plasmid proteins that have SOS promoters (current work). Certainly, mechanical cell lysis (e.g., bead-beating) can also be used (although not used in the current work). However, mechanical lysis results in all bacterial cells being lysed (not simply induced cells) resulting in the sample being enriched with abundant, highly conserved host proteins that can make detection of phage and plasmid proteins from an unfractionated sample more challenging.
For a given bacterial strain, the antibiotic-induced proteins detected were found to be generally reproducible across antibiotic concentrations, although we noted variations in their relative abundance. Since our analysis is qualitative (not quantitative), protein biomarker abundance need only be sufficient for adequate MS/MS analysis. A putative STEC strain is first cultured with a range of antibiotic concentrations (e.g., 300 ng/mL to 2,000 ng/mL of mitomycin-C) to determine the optimum concentration such that it triggers the bacterial SOS response while still providing enough bacterial cells for harvesting. For the STEC strain RM7788, we found that the optimum antibiotic concentration for detection of the biomarkers identified was 400 to 800 ng/mL of mitomycin-C.
In addition to protein sequence truncation, E. coli proteins can have PTMs that involve addition of mass, e.g., phosphorylation, glycosylation, etc. As MS/MS utilizes PSD for dissociation of singly charged metastable protein ions (under 20 kDa in mass) generated by MALDI, such PTMs attached to residue side chains would likely undergo facile dissociative loss because PSD is an ergodic dissociation technique. The presence of such PTMs could be inferred from the appearance of a fragment ion close in mass to the original precursor ion (minus the mass of the PTM) in the MS/MS data. However, neither PSD nor the software would be able to identify where such PTMs are attached. In addition, the software can only identify proteins from fragment ions of PBC and not dissociative loss of small molecules (e.g., water or ammonia) or PTMs attached to the side-chains of residues. However, if fragment ions from PBC are detected, the protein could still be identified using the software by either widening the protein mass tolerance window to include the mass of the PTM or simply entering the mass of the protein fragment ion corresponding to dissociative loss of the suspected PTM. Any identification by the software would be of the protein sequence without the PTM. Interestingly, we have not detected proteins having phosphorylation, glycosylation, etc. in our bacterial work thus far. However, that may be due to: their relative abundance by MALDI, the mass range being used: 2-20 kDa, that such PTMs may be unusually labile and may not survive application of the MALDI matrix, or that such PTMs may undergo very rapid dissociative loss in the source before ions are accelerated from the source.
Currently, the software does not include cysteine alkylation, and our sample protocol does not include a disulfide reduction step for cysteine residues. The protocol has been clarified to indicate that the search is to be operated with cysteine residues in their Oxidized state, and if no identification is obtained, then to execute the search again with cysteine residues in their Reduced state. If no identifications are found again, widening the fragment ion tolerance to ±2 or ±3 lowers the threshold for fragment ion matching allowing sequences with cysteines to be matched whether they are present in their oxidized and/or reduced states.
Top-down proteomic analysis by MALDI-TOF-TOF mass spectrometry
Most top-down proteomic analysis has been achieved using ESI and high-resolution mass spectrometry platforms. By contrast, fewer top-down proteomic analyses have been conducted using MALDI-TOF-TOF platforms. In consequence, there is very little top-down proteomic software for analysis of singly charged metastable protein ions generated and analyzed by MALDI-TOF-TOF-MS/MS-PSD that exploits the aspartic acid effect for fragmentation15,42. There are a number of reasons for this. First, the ionization efficiency of MALDI is biased toward lower molecular weight peptides and proteins, and this bias is particularly apparent with a mixture of proteins as would be found in an unfractionated bacterial cell lysate. Second, MALDI generates low charge states, and there is little or no Coulomb repulsion to facilitate protein ion dissociation. Third, PSD sequence coverage is quite limited, unlike that of other techniques, e.g., ECD7, ETD7, UV-PD8, etc. Fourth, the fragmentation efficiency of PSD declines with increasing mass of the protein ion. Fifth, ergodic dissociation techniques, such as PSD, tend to result in facile dissociative loss of PTMs attached to residues, e.g., phosphorylation, glycosylation, etc., making it challenging to determine the site of PTM attachment. In spite of these severe limitations, top-down analysis using MALDI-TOF-TOF-MS/MS-PSD has clear advantages, e.g., simplicity of sample preparation, absence of LC separation, isolation of metastable protein ions by MS/MS allowing attribution of fragment ions to precursor ions, identification of PTMs involving sequence truncation and intramolecular disulfide bonds, and, most importantly, the speed of analysis. When combined with in silico protein sequences derived from WGS data, this technique can provide rapid information before other more time-consuming and labor-intensive analyses are completed.
The Protein Biomarker Seeker software was developed using IntelliJ and written in Java to efficiently process and search protein amino acid sequences derived from WGS of a bacterial strain. The software was modified from an earlier algorithm that operated as a macro within Excel33. We decided to develop a standalone version of the software with a GUI to make it more user-friendly as well as provide further improvements.
In the event of PTMs involving protein sequence truncation, the software sequentially removes one amino acid residue at a time from the N-terminus and, for each resulting start position, iteratively adds residues of the sequence until the mass sum meets or exceeds the measured mass of the detected protein biomarker. Although this process can result in a very large number of protein mass fragments (~200,000 from ~5000 full protein sequences), it has the advantage of not excluding any potential protein fragments from truncation at the N-terminus or C-terminus (or both), however improbable such truncation may be from a biological perspective. This approach is referred to as unrestricted truncation. However, the most common bacterial PTMs involving truncation are removal of the N-terminal methionine or N-terminal signal peptide. In consequence, the software also allows the operator to select an upper limit (50 residues) for residue truncation from the N-terminus, which results in far fewer protein fragments that meet the protein biomarker mass criteria.
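The truncation scan described above can be sketched as a pair of nested loops over start and end positions. This is an illustrative reconstruction under stated assumptions, not the Java code of the software: the function name, the `max_n_trunc` parameter, and the toy sequence are hypothetical, and the table holds standard average residue masses with cysteine in its reduced state.

```python
# Standard average residue masses (Da); cysteine is in its reduced state.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once per intact polypeptide chain (Da)


def truncated_candidates(sequence, target_mass, tol=10.0, max_n_trunc=None):
    """Return (start, end, mass) for every sub-sequence within target_mass +/- tol.

    max_n_trunc=None mimics unrestricted truncation; an integer caps the number
    of residues that may be removed from the N-terminus.
    """
    hits = []
    n_starts = len(sequence) if max_n_trunc is None else min(len(sequence), max_n_trunc + 1)
    for start in range(n_starts):          # residues removed from the N-terminus
        mass = WATER
        for end in range(start, len(sequence)):
            mass += AVG_RESIDUE_MASS[sequence[end]]
            if mass > target_mass + tol:   # mass exceeded: stop extending
                break
            if mass >= target_mass - tol:
                hits.append((start, end + 1, round(mass, 2)))
    return hits


# Toy sequence: only the form lacking the N-terminal methionine matches 246.2 Da.
print(truncated_candidates("MGGGG", 246.2, tol=1.0))
```

Setting `max_n_trunc` to a small integer corresponds to the restricted mode described above and sharply reduces the number of candidates that must then be fragmented in silico.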
PBC on the C-terminal side of D- and E-residues as well as on the N-terminal side of P-residues is consistent with the aspartic acid effect mechanism, which has been studied extensively both experimentally and theoretically17,18,19,20. PBC on the C-terminal side of N-residues was included in the software because of an aspartic acid effect-like mechanism that has been observed for a number of metastable protein ions in our laboratory39,40. The most abundant fragment ions from the dissociation of singly charged metastable protein ions analyzed by MS/MS-PSD are due to the aspartic acid effect fragmentation mechanism. The operator selects the most prominent fragment ions from the MS/MS-PSD data and enters their m/z into the software as well as an associated fragment ion tolerance (±m/z). The fragment ion tolerance can be adjusted for each fragment ion to reflect its relative abundance. An appropriate fragment ion tolerance may vary from ±1.0 to ±2.5 m/z depending on the absolute abundance of the fragment ion as well as its relative abundance compared to background chemical noise. Typically, the more abundant a fragment ion, the better its mass accuracy, which allows a narrower fragment ion tolerance to be used.
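To illustrate how residue-specific in silico PBC and tolerance matching might work, the sketch below generates singly charged b- and y-type fragment ions for cleavage on the C-terminal side of selected residues and compares them with observed m/z values. It is a simplified assumption-laden example: the function names and the toy peptide are hypothetical, cleavage on the N-terminal side of P-residues is omitted for brevity, and a single tolerance is applied to all fragment ions although the text allows one per ion.

```python
# Standard average residue masses (Da); cysteine is in its reduced state.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528
PROTON = 1.00728


def pbc_fragments(sequence, residues="DE"):
    """Average m/z of b and y ions from cleavage C-terminal to the given residues."""
    total = sum(AVG_RESIDUE_MASS[aa] for aa in sequence)
    fragments = {}
    running = 0.0
    # The last residue is skipped: cleavage after it yields no fragment pair.
    for i, aa in enumerate(sequence[:-1], start=1):
        running += AVG_RESIDUE_MASS[aa]
        if aa in residues:
            fragments[f"b{i}"] = running + PROTON
            fragments[f"y{len(sequence) - i}"] = total - running + WATER + PROTON
    return fragments


def all_observed_match(observed, fragments, tol=1.5):
    """True when every observed m/z lies within tol of some in silico fragment."""
    return all(
        any(abs(mz - theo) <= tol for theo in fragments.values())
        for mz in observed
    )


toy = pbc_fragments("GDGG")             # one D-residue, so one PBC site
print({k: round(v, 2) for k, v in toy.items()})
print(all_observed_match([173.0], toy))
```

Passing `residues="DEN"` would add cleavage on the C-terminal side of N-residues, mirroring the D & E & N search option mentioned in the text.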
MS/MS-PSD data of metastable protein ions can vary dramatically in terms of their complexity. Some MS/MS-PSD spectra are more easily interpretable than others. There are several reasons for this phenomenon. First, the protein ion may not fragment efficiently on the timescale of the analysis (~10-30 µs) perhaps because it remains folded or partially folded even after solubilization in the MALDI matrix solution. Second, in addition to PBC, metastable protein ions can undergo dissociative loss of small molecules, i.e., ammonia (-17 Da) or water (-18 Da)15. A significant contributor to spectral complexity appears to be dissociative loss of ammonia from the side-chain of R-residues33. We have observed an increase in spectral complexity of MS/MS-PSD data with the number of R-residues in the protein sequence. Proteins with no R-residues (YahO protein36 and cold-shock protein CspC33,43), with one R-residue (cold-shock protein CspE33 and B-subunit of Stx241), or with two R-residues (hypothetical protein33) produce MS/MS-PSD spectra that are relatively uncomplicated and easy to interpret. However, when the number of R-residues increases to three (HU protein44) or four (ubiquitin35 and cold-shock protein CsbD33,43), spectral complexity increases significantly. The software compares fragment ions from PBC at residues specific to the aspartic acid effect mechanism only, as this is the most accessible dissociation channel of singly charged metastable protein ions analyzed by MS/MS-PSD. The software does not include fragment ions resulting from dissociative loss (or losses) of small neutral molecule(s). In consequence, it is important that the operator does not select fragment ions that include small neutral dissociative losses. Fragment ions from PBC are typically the most prominent fragment ions; however, when the number of R-residues in a protein increases to three or four, the most abundant fragment ion at a PBC site may be one that includes a small dissociative loss (or losses).
If such a cluster of fragment ions (separated by multiples of 17 or 18 m/z) is detected, the fragment ion with the highest m/z within a cluster should be the one entered into the fragment ion search parameters.
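The selection rule above (keep the highest m/z peak of a cluster separated by multiples of 17 or 18 m/z) can be sketched as a simple filter. This is only an illustration of the rule, which in the protocol is applied by the operator rather than by the software; the function names, the limit of three losses, the ±0.7 m/z window, and the two satellite peaks in the example are all hypothetical.

```python
NH3 = 17.03  # average mass of ammonia (Da)
H2O = 18.02  # average mass of water (Da)


def is_satellite(mz, parent, max_losses=3, tol=0.7):
    """True if mz equals parent minus some combination of NH3/H2O losses."""
    gap = parent - mz
    for n_nh3 in range(max_losses + 1):
        for n_h2o in range(max_losses + 1 - n_nh3):
            if n_nh3 + n_h2o == 0:
                continue
            if abs(gap - (n_nh3 * NH3 + n_h2o * H2O)) <= tol:
                return True
    return False


def cluster_heads(peaks):
    """Keep only peaks that are not neutral-loss satellites of a higher peak."""
    return [
        mz for mz in sorted(peaks)
        if not any(is_satellite(mz, other) for other in peaks if other > mz)
    ]


# m/z 4293.7 and 2128.9 are from Figure 3; 4276.6 and 4259.5 are invented
# satellites (one and two ammonia losses) added purely for illustration.
print(cluster_heads([4293.7, 4276.6, 4259.5, 2128.9]))
```

Only the cluster heads would then be entered as fragment ion search parameters.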
It should be emphasized that the software was not designed for operator-free proteomic identification. The operator must select which fragment ions from MS/MS-PSD data are to be included in the search. However, based on numerous experiments that have confirmed the aspartic acid effect by MS/MS-PSD, the most prominent fragment ions are always the result of PBC on the C-terminal side of D- or E- or N-residues. The utility of the software is that it eliminates many obviously incorrect sequences and retrieves only a few likely candidates. Some candidate sequences may be eliminated based on the absence of a fragment ion where a D-residue in a sequence would be expected to generate a prominent fragment ion. Invariably, D-residues result in prominent fragment ions throughout the polypeptide backbone except when they are located within a few residues of the N- or C-termini where the efficiency of the aspartic acid effect declines36.
Minimum number of PBC sites needed for tentative protein identification
A CFIP is formed from two identical protein precursor ions that dissociate at the same PBC site but have their ionizing proton on opposite sides of the cleavage site. Although a CFIP can be used to calculate the mass of the protein biomarker more accurately (allowing a narrowing of the protein mass tolerance during a search), it is less useful for sequence-specific identification than two non-complementary fragment ions formed from two different cleavage sites, which provide greater identification specificity. The ease with which the two AB-Im proteins were identified led us to speculate as to the minimum number of fragment ions necessary to tentatively identify the correct protein sequence from thousands of proteins or protein fragment sequences. We quickly determined that it was not the number of fragment ions per se but the number of non-complementary fragment ions that is important because each non-complementary fragment ion represents one PBC site whereas a CFIP represents the same cleavage site. Thus, identification specificity is derived from the number of PBC sites detected, not the number of fragment ions.
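The distinction between fragment ions and PBC sites can be made concrete by collapsing each complementary pair into a single site. The sketch below is a hypothetical helper, not part of the software; it assumes singly charged ions, a protein mass of 9779 Da as measured for Im3, and the ±10 Da mass tolerance used in the searches.

```python
PROTON = 1.00728  # mass of a proton (Da)


def count_pbc_sites(fragments, protein_mass, tol=10.0):
    """Count distinct cleavage sites, treating each complementary pair as one site."""
    remaining = sorted(fragments)
    sites = 0
    while remaining:
        mz = remaining.pop(0)
        # Look for a partner whose m/z sums with this one to the protein mass.
        partner = next(
            (other for other in remaining
             if abs(mz + other - 2 * PROTON - protein_mass) <= tol),
            None,
        )
        if partner is not None:
            remaining.remove(partner)
        sites += 1
    return sites


# Five Im3 fragment ions from Figure 3: two CFIP plus one unpaired ion.
im3_ions = [1813.8, 2128.9, 7645.1, 7959.4, 4293.7]
print(count_pbc_sites(im3_ions, 9779.0))
```

Five fragment ions therefore contribute only three PBC sites, which is the quantity that determines identification specificity.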
It is possible that the success in identification with only three fragment ions may have been simply fortuitous. To test this hypothesis and to eliminate bias in the selection of fragment ions, we created a benchmarking module within the software that randomly selects fragment ions from a larger pool of complementary and/or non-complementary fragment ions. The larger fragment ion pool was selected from the 14 prominent fragment ions identified in Figure 3 (bottom panel) based upon their relative abundance.
The testing protocol was as follows. Using a binary search, three fragment ions were randomly selected from the pool of 14 prominent fragment ions in Figure 3 (bottom panel) (m/z 1813.8, 2128.9, 3881.3, 4293.7, 5158.0, 6505.0, 6619.9, 6939.4, 7645.1, 7959.4, 8022.7, 8136.2, 8583.3, and 8961.5). Each three-fragment ion cohort was compared against in silico fragment ions from PBC on the C-terminal side of D-, E-, or N-residues, as well as the combinations D & E and D & E & N. This comparison was performed for each individual fragment ion of a cohort, for the three fragment ion pairs of a cohort, and for the three-fragment ion combination of a cohort. For a comparison to be counted as a match, both fragment ions of a pair, or all three fragment ions of a combination, must match in silico fragment ions. After completion of the analysis, another three-fragment ion cohort was randomly selected, and the analysis was repeated. Repetition in fragment ion selection was allowed. Although there are 364 possible combinations [n!/(r!(n-r)!)] of a three-fragment ion cohort (r) from a pool of 14 fragment ions (n), only 10 analyses were performed, as shown in S1Im3 (Supplementary Information).
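The testing protocol can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the benchmarking module itself: the function names, the candidate dictionary, and its in silico values are hypothetical, and "repetition allowed" is interpreted here as permitting the same fragment ion to appear in more than one cohort (ions within a cohort are distinct). The pool is the list of 14 m/z values given above, and the tolerance is the ±1.5 m/z used in the analysis.

```python
import random
from itertools import combinations
from math import comb

# Pool of 14 prominent Im3 fragment ions (m/z) listed in the text.
POOL = [1813.8, 2128.9, 3881.3, 4293.7, 5158.0, 6505.0, 6619.9,
        6939.4, 7645.1, 7959.4, 8022.7, 8136.2, 8583.3, 8961.5]


def matches(mz, in_silico_mzs, tol=1.5):
    """True if an observed ion lies within tol of any in silico fragment."""
    return any(abs(mz - f) <= tol for f in in_silico_mzs)


def count_candidates(cohort, candidates, tol=1.5):
    """For every single ion, pair, and the full triple of a cohort, count the
    candidate sequences for which ALL ions of that subset are matched."""
    counts = {}
    for r in (1, 2, 3):
        for subset in combinations(cohort, r):
            counts[subset] = sum(
                all(matches(mz, frags, tol) for mz in subset)
                for frags in candidates.values()
            )
    return counts


def random_cohort(pool, rng, size=3):
    """Draw one cohort of distinct ions; separate draws may reuse ions."""
    return tuple(rng.sample(pool, size))


print(comb(14, 3))  # number of possible three-ion cohorts from 14 ions

# Hypothetical candidates: name -> in silico fragment m/z values.
candidates = {
    "seq_A": [1813.9, 2128.5, 3881.0, 7959.0],
    "seq_B": [1813.0, 5000.0],
}
rng = random.Random(0)  # seeded only so the sketch is repeatable
for _ in range(10):     # ten analyses, as in the text
    cohort = random_cohort(POOL, rng)
    triple_count = count_candidates(cohort, candidates)[cohort]
    print(cohort, triple_count)
```

Because every ion of a subset must be matched, a single ion that falls outside the tolerance is enough to drop the count for any pair or triple that contains it.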
The three-fragment ion identification requirement appears to be a general phenomenon as shown in column 3_ABC of Tables 2-7, 9-10 (S1Im3). All counts of 1 in the 3_ABC column correspond to the Im3 sequence (without N-terminal methionine). The only failure in identification occurred because the fragment ion at m/z 8136.2 (shown in Figure 3, bottom panel and highlighted in gray in Tables 1 and 8) exceeded the fragment ion tolerance (±1.5 m/z) entered for the analysis. Since the testing algorithm requires that all fragment ions of a three-fragment ion cohort be matched, any group that included the m/z 8136.2 fragment ion would fail to identify/count the correct protein sequence.
Table 6 in S1Im3 shows that when two of the three fragment ions are complementary (highlighted in yellow), more incorrect sequences matched the criteria than when all three fragment ions were non-complementary. As noted previously, this is because a CFIP corresponds to a single PBC site, a threshold that many more incorrect in silico sequences can meet, whereas two non-complementary fragment ions correspond to two PBC sites, a more stringent criterion.
A similar analysis was performed on six prominent fragment ions (m/z 2675.4, 2904.5, 3076.2, 3853.5, 5657.5, and 5772.8) of Im-Bac shown in Figure 4 (bottom panel). Unlike Im3, Im-Bac has no discernible CFIP; the six fragment ions therefore presumably correspond to six PBC sites. Although there are 20 possible combinations of a three-fragment ion cohort selected from a pool of six fragment ions, only 10 analyses were performed, as shown in the tables of S2 Im-Bac (Supplementary Information). The Im-Bac sequence was correctly identified/counted for all three-fragment ion groups in column 3_ABC in all analyses. In four analyses, one or two incorrect sequences were also matched. However, such a small number of incorrect sequences is manageable for manual confirmation.
Overall, complementary and/or non-complementary fragment ions that correspond to two or three PBC sites appear to provide enough specificity to retrieve one or two candidate sequences. Of course, the fragment ions selected by the operator should be relatively abundant and have good S/N. One or two fragment ions from a single PBC site do not provide enough specificity to avoid retrieving an unworkable number of incorrect sequences that must be checked by the operator. It is not clear why two or three PBC sites are adequate, whereas a single PBC site is apparently not specific enough. Although unrestricted truncation results in ~200,000 protein and protein fragment sequences that meet the protein mass criteria, it is probable that the site/residue-specific nature of the cleavage sites, i.e., the C-terminal side of D-, E-, and N-residues, contributes to the sharp narrowing of possible sequences during fragment ion comparison. This may be due, in part, to the frequency of D-, E-, and N-residues in bacterial protein sequences as well as their unique locations in protein sequences across the bacterial proteome. Acidic residues play critical roles in protein structure and solvent interactions. Consequently, their frequency and locations in the primary sequence are critical, if not unique, for protein function, which may explain why only a few PBC sites are necessary to tentatively identify the correct protein sequence among hundreds of thousands of incorrect sequences.
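To make the size of the search space concrete, the truncation step can be sketched as follows. This is a simplified illustration under assumptions: the function name, the average residue masses (commonly tabulated values), and the test sequence are introduced here, and truncation is assumed to be allowed from both termini, which the text does not specify. The sketch enumerates every contiguous sub-sequence of a protein whose neutral average mass falls within a tolerance of the measured mass.

```python
# Average residue masses in Da (commonly tabulated values; assumed here).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # average mass of H2O


def truncated_candidates(sequence, measured_mass, tol):
    """Return (start, end, mass) for each sub-sequence sequence[start:end]
    whose neutral average mass is within tol of measured_mass."""
    # Prefix sums give the mass of any sub-sequence in constant time.
    prefix = [0.0]
    for aa in sequence:
        prefix.append(prefix[-1] + AVG_RESIDUE_MASS[aa])
    hits = []
    n = len(sequence)
    for start in range(n):
        for end in range(start + 1, n + 1):
            mass = prefix[end] - prefix[start] + WATER
            if abs(mass - measured_mass) <= tol:
                hits.append((start, end, round(mass, 2)))
    return hits


# Hypothetical example: the mature form lacks the N-terminal methionine.
print(truncated_candidates("MGDAK", measured_mass=389.41, tol=1.0))
```

A protein of length n has n(n+1)/2 such sub-sequences, so applying this to every entry of a whole-genome-derived proteome produces a large pool of mass-matched sequences, which the fragment ion comparison must then narrow down.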
From a gas-phase chemistry perspective, the importance of D-, E-, and N-residues stems from their participation in a dissociation channel that is accessible at the low internal energies of singly charged metastable protein ions that are generated by MALDI and decay by PSD20. The relatively long timescale (~10-30 µs) of molecular ion fragmentation by PSD means that the internal energy of the protein ion is randomized among all vibrational and rotational degrees of freedom of the molecular ion, such that dissociation is ergodic and statistical. It should also be pointed out that the mechanism of the aspartic acid effect involves a molecular ion rearrangement that occurs by a sequence of steps, or by a single concerted step involving multiple atoms, until a favorable geometry is achieved that lowers the activation barrier of PBC17,18,19.
Two plasmid-encoded antibacterial immunity proteins produced by a STEC strain were identified using a protocol involving antibiotic induction, MALDI-TOF-TOF-MS/MS-PSD, and top-down proteomic analysis. These proteins were identified using software developed in-house that incorporates the measured mass of the protein and a relatively small number of sequence-specific fragment ions formed as a result of the aspartic acid effect. The software compares the MS and MS/MS data to in silico protein and protein fragment sequences derived from WGS data. Although the software does not provide identification metrics or scoring, it eliminates a very high percentage of incorrect sequences, leaving a very small number of candidate sequences (one or two) that can be easily confirmed by manual inspection. Finally, manual inspection of the WGS data of this bacterial strain revealed a promoter (SOS box) upstream of the AB and Im genes in a plasmid genome, which rationalizes the expression of these genes upon exposure to DNA-damaging antibiotics.
The authors have nothing to disclose.
Protein Biomarker Seeker software is freely available (at no cost) by contacting Clifton K. Fagerquist at clifton.fagerquist@usda.gov. We wish to acknowledge support of this research by ARS, USDA, CRIS grant: 2030-42000-051-00-D.
Name | Company | Catalog Number | Comments |
4000 Series Explorer software | AB Sciex | Version 3.5.3 | |
4800 Plus MALDI TOF/TOF Analyzer | AB Sciex | | |
Acetonitrile Optima LC/MS grade | Fisher Chemical | A996-1 | |
BSL-2 biohazard cabinet | The Baker Company | SG403A-HE | |
Cytochrome-C | Sigma | C2867-10MG | |
Data Explorer software | AB Sciex | Version 4.9 | |
Focus Protein Reduction-Alkylation kit | G-Biosciences | 786-231 | |
GPMAW software | Lighthouse Data | Version 10.0 | |
Incubator | VWR | 9120973 | |
LB Agar | Invitrogen | 22700-025 | |
Luria Broth | Invitrogen | 12795-027 | |
Lysozyme | Sigma | L4919-1G | |
Microcentrifuge Tubes, 2 mL, screw-cap, O-ring | Fisher Scientific | 02-681-343 | |
MiniSpin Plus Centrifuge | Eppendorf | 22620207 | |
Mitomycin-C (from Streptomyces) | Sigma-Aldrich | M0440-5MG | |
Myoglobin | Sigma | M5696-100MG | |
Shaker MaxQ 420HP Model 420 | Thermo Scientific | Model 420 | |
Sinapinic acid | Thermo Scientific | 1861580 | |
Sterile 1 µL loops | Fisher Scientific | 22-363-595 | |
Thioredoxin (E. coli, recombinant) | Sigma | T0910-1MG | |
Trifluoroacetic acid | Sigma-Aldrich | 299537-100G | |
Water Optima LC/MS grade | Fisher Chemical | W6-4 | |