The goal of this protocol is to develop a reference for divergent proteins in a group that lacks coherent criteria for nomenclature and classification. This reference will facilitate analyses and discussion of the group as a whole and can be used in addition to established names.
Related proteins that have been studied in different labs using varying organisms may lack a uniform system of nomenclature and classification, making it difficult to discuss the group as a whole and to place new sequences into the appropriate context. A reference that prioritizes important sequence features related to structure and/or activity can be used in addition to established names to add some coherence to a diverse group of proteins. This paper uses the cysteine-stabilized alpha-beta (CS-αβ) superfamily as an example to show how a reference generated in spreadsheet software can clarify relationships between existing proteins in the superfamily, as well as facilitate the addition of new sequences. It also shows how the reference can help to refine sequence alignments generated in commonly used software, which impacts the validity of phylogenetic analyses. The use of a reference will likely be most helpful for protein groups that include highly divergent sequences from a broad spectrum of taxa, with features that are not adequately captured by molecular analyses.
A protein's name should reflect its characteristics and relationship to other proteins. Unfortunately, names are generally assigned at the time of discovery and, as research continues, the understanding of the larger context may change. This can lead to multiple names if a protein was independently identified by more than one lab, to changes in nomenclature or in the characteristics thought to be definitive when assigning the name, and to the name no longer sufficiently differentiating the protein from others.
Invertebrate defensins provide a good example of degeneration in nomenclature and classification. The first invertebrate defensins were reported from insects, and the name "insect defensin" was proposed based on the perceived homology to mammalian defensins1,2. The term defensin is still used, even though it is now clear that invertebrate and mammalian defensins do not share a common ancestor3,4. Depending on the species, an invertebrate "defensin" may have six or eight cysteines (that form three or four disulfide bonds) and a variety of antimicrobial activities. To complicate the situation, proteins with the same characteristics as defensins are not always called "defensins," such as the recently identified cremycins from Caenorhabditis remanei5. In addition, invertebrate big defensins are more likely to be evolutionarily related to vertebrate β-defensins than to other invertebrate defensins6. Despite this, researchers sometimes rely on the name "defensin" when determining which sequences should be included in analyses.
Structural studies revealed the similarity between insect defensins and scorpion toxins7, and the CS-αβ fold was subsequently established as the defining structural characteristic of insect defensins8. This fold defines the scorpion toxin-like (CS-αβ) superfamily in the Structural Classification of Proteins (SCOP) database9, which currently includes five families: insect defensins, short-chain scorpion toxins, long-chain scorpion toxins, MGD-1 (from a mollusk), and plant defensins. This superfamily is synonymous with the recently described cis-defensins4 and Superfamily 3.30.30.10 in the CATH/Gene 3D database10,11. Studies from a variety of invertebrate taxa, plants, and fungi show that the names of proteins that contain this fold are not clearly related to cysteine number or bonding pattern, antimicrobial activity, or evolutionary history12.
The lack of consistency and clear criteria makes it challenging to name and classify newly identified sequences in this superfamily. A major obstacle to comparing proteins in this superfamily is that cysteines are numbered with respect to each individual sequence (the first cysteine in each sequence is C1), with no way to account for the structural role. This means that only sequences with the same number of cysteines can be compared. There is little sequence conservation other than the cysteines forming the CS-αβ fold, which makes alignments and phylogenetic analyses difficult. By developing a numbering system that prioritizes structural features, superfamily sequences can be more easily compared and aligned. Conserved features, as well as those defining subgroups, can be visualized quickly, and new sequences can be more easily placed into the appropriate context.
This paper uses spreadsheet software (e.g., Excel) to generate a reference numbering system for the CS-αβ superfamily. It shows how this system clarifies comparisons between sequences and applies it to new CS-αβ sequences identified from tardigrades. Using the CS-αβ superfamily as an example, the protocol was written to provide guidance when using sequences of interest; however, it is not intended to be specific to this superfamily or to cysteine-rich sequences. This method will likely be most useful for groups of proteins that have been researched independently in divergent taxa and/or have little overall sequence homology, with discrete characteristics that may not be easily recognized by molecular analysis software. This method requires some a priori decisions regarding important features, so it will be of limited utility if no important features have been identified. The primary goal is to show how a simple visualization of the sequence relationships can be achieved. This can then be used to inform sequence alignment and analysis, but if alignment and analysis are the primary goals, a barcode method would be a suitable alternative that has more capacity for automation13. The current method displays the features of each peptide in a linear form, so it will not be helpful for the direct visualization of 3D structure.
1. Determine the Defining Features of the Protein Group of Interest
2. Collect Relevant Sequences
3. Generate a Reference in a Spreadsheet Based on the Important Features That Were Identified
4. Use the Reference to Refine Amino Acid Alignments
NOTE: There are many programs that can be used for multiple sequence alignments, but this demonstration will use Molecular Evolutionary Genetics Analysis (MEGA6)19 because it is available to download for free.
5. Compare the Groups Identified Using the Reference with Results from Phylogenetic Analyses
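The kind of spreadsheet row described for step 3 (each cysteine followed by the number of amino acids separating it from the next one) can also be produced programmatically before being pasted into the reference. The following is only a minimal sketch, not the authors' procedure: the function names are hypothetical, and the sequence is a made-up toy string rather than a real CS-αβ peptide.

```python
import csv
import io


def cysteine_profile(sequence):
    """Return the cysteine count and the number of residues between consecutive cysteines."""
    positions = [i for i, residue in enumerate(sequence.upper()) if residue == "C"]
    gaps = [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
    return len(positions), gaps


def spreadsheet_row(name, sequence):
    """Build one row: name, cysteine count, then C / spacing / C / spacing / ... / C."""
    count, gaps = cysteine_profile(sequence)
    cells = [name, count]
    for gap in gaps:
        cells.extend(["C", gap])
    if count:
        cells.append("C")  # the final cysteine has no spacing after it
    return cells


# Toy sequence (hypothetical, for illustration only).
toy_sequence = "AACGGGCAAACGGGGCACA"

buffer = io.StringIO()
csv.writer(buffer).writerow(spreadsheet_row("toy_peptide", toy_sequence))
print(buffer.getvalue().strip())
```

The comma-separated line that is printed can be opened directly in spreadsheet software. Deciding which reference column each cysteine belongs to still requires the manual, structure-informed judgment described in the protocol.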
Groups of sequences in the CS-αβ superfamily reported in the literature are shown in Figure 4. The cysteine pairings based on the numbering for each sequence suggest five basic groups (Table 1, middle column). Group 1 has six cysteines that form three disulfide bonds and includes sequences from insects, arachnids, mollusks, nematodes, and fungi. Groups 2, 3, and 4 have eight cysteines that form four disulfide bonds. Group 2 includes insect, arachnid, and plant sequences; group 3 includes arachnid, mollusk, and nematode sequences; and group 4 includes sequences from cnidarians, annelids, mollusks, and fungi. Group 5 includes the 10-cysteine macins. Some sequences did not quite fit these patterns but were generally closer to one group than the others.
Groups 1 and 2 seem to share two bonds: C2-C5 and C3-C6; however, beginning the numbering of each sequence with its first cysteine does not acknowledge the structural context of the bonds. C2-C5 in Group 1 sequences forms one of the two bonds in the CSH motif, while C2-C5 in Group 2 sequences forms the final bond needed to stabilize the CS-αβ fold. The homologous bond to the Group 1 C2-C5 is Group 2 C3-C6, which is not obvious from the numbering. It is also not obvious that in Group 3, the C2-C6 bond plays the same structural role.
Using sequences from the literature generated a reference with a total of ten cysteines. The CSH motif is formed from bonds C3-C8 and C4-C9, with C2-C6 completing the CS-αβ fold. Renumbering the cysteine pairs based on the reference numbers clarifies the bonds present in each sequence (Table 1, right column). It is now obvious that all the sequences have C2-C6, C3-C8, and C4-C9, reflecting the structural fold that defines the superfamily. The use of a reference allows for easy comparison between sequences that have inconsistent nomenclature and ambiguous classification criteria. It can also help to identify features that define a subgroup of sequences. For example, the C1-C7 bond may differentiate macins from other superfamily members, making it appropriate to classify sequences with this bond as "macins" rather than "defensins" (Table 1 and Figure 4).
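The renumbering step can be illustrated with a short sketch. It assumes that a six-cysteine Group 1 sequence occupies reference positions C2, C3, C4, C6, C8, and C9 and has the internal pairing C1-C4, C2-C5, C3-C6; this is inferred from the description above rather than copied from Table 1, and the function name is hypothetical.

```python
def renumber_bonds(occupied_reference_positions, internal_bonds):
    """Translate disulfide bonds from per-sequence numbering to reference numbering.

    occupied_reference_positions lists, in sequence order, the reference cysteine
    that each cysteine of the sequence was aligned to.
    """
    # Internal cysteine k (1-based) maps to the k-th occupied reference position.
    to_reference = {
        internal: reference
        for internal, reference in enumerate(occupied_reference_positions, start=1)
    }
    return [(to_reference[a], to_reference[b]) for a, b in internal_bonds]


# Assumed Group 1 layout (inferred from the text, not taken from Table 1).
group1_positions = [2, 3, 4, 6, 8, 9]
group1_internal_bonds = [(1, 4), (2, 5), (3, 6)]

renumbered = renumber_bonds(group1_positions, group1_internal_bonds)
print(["C%d-C%d" % pair for pair in renumbered])
# prints ['C2-C6', 'C3-C8', 'C4-C9']
```

Under these assumptions, the internal C2-C5 bond becomes reference C3-C8, one of the two CSH bonds, which matches the structural role described for Group 1.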
Searches of public online databases revealed sixteen sequences from tardigrades that clearly have the CS-αβ fold, eight each from Hypsibius dujardini and Milnesium tardigradum. Four of the new sequences have six cysteines, nine have eight, one has nine, and two have ten. This gives very little information, but by aligning the sequences to the reference, it becomes clear that tardigrade sequences with the same number of cysteines do not always have the structurally-important cysteines at the same place within the sequence (Figure 5 and Figure 6). The alignment with the reference also allows for the inference of bonding patterns (Table 2, inferred bonding patterns shown in parentheses). Some of the tardigrade sequences clearly fit patterns 1 – 4. Others are most similar to the proposed bacterial ancestor, scorpion Cl- toxin, or a family of fungal defensin-like peptides. Pattern 2 may have two subgroups, one represented by scorpion Na+ toxins, drosomycin, and plant defensins, and the other by scorpion Cl- toxins. Further work investigating the function of the tardigrade proteins is needed to determine if some should be considered toxins rather than defensins.
Phylogenetic analyses are often used to study how a group of proteins may have evolved. The sequences in the CS-αβ superfamily are generally short and highly divergent; resulting trees are often poorly resolved and offer little insight. Both the ML and Bayesian trees for the subset of sequences analyzed here were poorly resolved, with low support for many clades (Figure 10, Supplementary Files 1 – 4). It is common practice to only show bootstrap levels over 70 (or posterior probabilities over 0.7), but Figure 10 retains all numbers to demonstrate the overall low levels of support. Five groups were supported above 70/0.7 in at least one of the two trees: (a) a 6C and an 8C scorpion toxin; (b) macins; (c) tick and scorpion defensins; (d) plant defensins; and (e) 6C defensins from insects, arachnids, and mollusks. In the ML tree, clade e also includes an 8C toxin and an 8C tardigrade defensin, but support was very low (Figure 10A). In general, these reflect the categories identified using the reference cysteine numbering but also show that sequences with different cysteine numbers within a large taxonomical group may be more closely related than sequences with the same pattern from different groups. While only a small number of sequences were used in this study, a larger analysis of 250 sequences did not eliminate the lack of resolution (Supplementary Files 5 – 8)12. The spreadsheet reference alignment may offer easier visualization of similarities with structural or functional relevance compared to phylogenetic trees.
Figure 1: Defining Sequence and Structural Features of the CS-αβ Superfamily. Amino acids and 3D structure are color coded: loop (blue), alpha-helix (green), beta-sheets (gold), and disulfide bonds (pink).
Figure 2: Preliminary Six-cysteine Reference Based on the Sequence of Insect Defensin. Columns indicate the conserved cysteines (C1-C6) and, for the CSH motif, the number of conserved amino acids between the cysteines. The filled boxes indicate that the sequence has the given cysteine and the numbers indicate amino acids between the cysteines.
Figure 3: Refined Ten-cysteine Reference Based on Representative Sequences from Groups of the CS-αβ Superfamily. The columns indicate conserved cysteines and the amino acids between them. Cysteines contributing to the CSH motif (C3, C4, C8, and C9) and to the CS-αβ fold (C2 and C6) are labeled. Sequences are color-coded by taxonomic group: Arachnida (light orange), Bacteria (black), Cnidaria (grey), Hexapoda (orange), Mollusca (blue), Nematoda (purple), and Plantae (green).
Figure 4: Summary of CS-αβ Superfamily Sequences Aligned with Reference by Group Characteristics. The columns indicate conserved cysteines and the amino acids between them. Cysteines contributing to the CSH motif (C3, C4, C8, and C9) and to the CS-αβ fold (C2 and C6) are labeled. Sequences are color-coded by taxonomic group: Annelida (dark red), Arachnida (light orange), Bacteria (black), Cnidaria (grey), Fungi (light green), Hexapoda (orange), Mollusca (blue), Nematoda (purple), and Plantae (green). Numbers separated by a dash indicate a range of intervening amino acids; numbers separated by a slash represent either/or. A "C" indicates an additional cysteine that does not occur with enough frequency to warrant addition to the reference.
Figure 5: Addition of Tardigrade CS-αβ Sequences to Superfamily Alignment with Reference by Group Characteristics. The columns indicate conserved cysteines and the amino acids between them. Cysteines contributing to the CSH motif (C3, C4, C8, and C9) and to the CS-αβ fold (C2 and C6) are labeled. Sequences are color-coded by taxonomic group: Annelida (dark red), Arachnida (light orange), Bacteria (black), Cnidaria (grey), Fungi (light green), Hexapoda (orange), Mollusca (blue), Nematoda (purple), Plantae (green), and Tardigrada (yellow). Numbers separated by a dash indicate a range of intervening amino acids; numbers separated by a slash represent either/or. A "C" indicates an additional cysteine that does not occur with enough frequency to warrant addition to the reference.
Figure 6: Addition of Tardigrade CS-αβ Sequences to Superfamily Alignment with Reference by Taxonomic Group. The columns indicate conserved cysteines and the amino acids between them. Cysteines contributing to the CSH motif (C3, C4, C8, and C9) and to the CS-αβ fold (C2 and C6) are labeled. Sequences are color-coded by taxonomic group: Annelida (dark red), Arachnida (light orange), Bacteria (black), Cnidaria (grey), Fungi (light green), Hexapoda (orange), Mollusca (blue), Nematoda (purple), Plantae (green), and Tardigrada (yellow). Numbers separated by a dash indicate a range of intervening amino acids; numbers separated by a slash represent either/or. A "C" indicates an additional cysteine that does not occur with enough frequency to warrant addition to the reference.
Figure 7: Misaligned Sequence Using Automated Alignment. Amino acids conserved in all sequences are indicated by * in the row above the first sequence (outlined in pink boxes). AlCRP is misaligned. The gap needs to be removed to correctly align the C (pink arrow).
Figure 8: Manual Refinement of the Alignment Preserves the Structurally Important Features of the Sequences. AlCRP is now aligned correctly (pink arrow), and the CXXXC motif is fully conserved for the sequences (pink boxes).
Figure 9: Redundant Sequences in an Alignment. If there are pairs of nearly identical sequences (pink boxes), one can be removed, since these will likely always cluster together and contribute little to the overall topology of the tree.
Figure 10: Comparison of Trees Generated from Phylogenetic Analyses. (A) Maximum likelihood analysis in MEGA, with 1,000 bootstrap replicates using the WAG+G+I model. (B) Bayesian analysis with 1,000,000 generations using the mixed-model setting. Clades supported at 70/0.7 are shown in solid pink lines; dashed pink lines show clades supported at 70/0.7 in the other tree. (a) A 6C and an 8C scorpion toxin; (b) macins; (c) tick and scorpion defensins; (d) plant defensins; and (e) 6C defensins from insects, arachnids, and mollusks.
Table 1: Groups within the CS-αβ Superfamily Based on Cysteine-pairing Patterns. Five basic patterns of bond formation are shown using internal numbers (middle column) or reference numbers (right column). Scorpion Cl- toxins, ASABF 6Cys-alpha, and a group of fungal peptides are placed with the pattern that most closely matches. A cysteine not included in the reference is indicated by a superscript of the cysteines before/after (e.g., C3/4 is between C3 and C4).
Table 2: Addition of Tardigrade CS-αβ Sequences to Cysteine-pairing Pattern Groups. Tardigrade defensins and macins (bold) are put into the previously established groups where possible. Some tardigrade sequences may show a group-specific pattern. A cysteine not included in the reference is indicated by a superscript of the cysteines before/after (e.g., C3/4 is between C3 and C4). The notation "2C/1" indicates there are two cysteines upstream of reference C1.
Supplementary File 1 (S1): Alignment of this Dataset in MEGA.
Supplementary File 2 (S2): Maximum-likelihood Tree MEGA File for This Dataset.
Supplementary File 3 (S3): Alignment of This Dataset in Nexus Format for MrBayes.
Supplementary File 4 (S4): Consensus File from the MrBayes Analysis of This Dataset.
Supplementary File 5 (S5): Alignment of 250 CS-αβ Sequences in MEGA.
Supplementary File 6 (S6): Maximum Likelihood tree of 250 CS-αβ Sequences.
Supplementary File 7 (S7): Alignment of 250 CS-αβ Sequences in Nexus Format for MrBayes.
Supplementary File 8 (S8): Consensus File from the MrBayes Analysis of 250 CS-αβ Sequences.
The criteria for naming a protein within a group should be clear, but this is not always the case. Sequences that have the CS-αβ fold have been studied in many labs using a variety of organisms, resulting in different systems of nomenclature, as well as varying levels of characterization. Attempting to impose a completely new nomenclature is not reasonable and would result in a great deal of confusion when consulting previous literature. A reference numbering system can be used in addition to the name of a protein to clarify its characteristics relative to the superfamily.
Groups of proteins with clear criteria for naming and classification will not likely benefit from generating a reference in a spreadsheet, although it may be useful for summarizing large numbers of sequences and visualizing important characteristics. Sequence alignments and logos are useful for investigating the level of conservation at each site, but do not actively prioritize sequence features important for structure or function. The CS-αβ example focused on the structure, but specific amino acids that form a binding site could also be incorporated as a defining feature. As sequence features that confer specific antimicrobial/toxic activities of CS-αβ peptides are identified, these could be added to the reference to clarify groups based on activity. Although only the predicted mature peptides were used in this example, if the presence of a signal peptide or pro-peptide is important, that information can be added for each sequence. Specific insertion or deletion events, as well as intron locations, can also be included if they are thought to be informative. An advantage of using MrBayes for the phylogenetic analysis is that it is not limited to molecular data; it can also analyze data coding for other characteristics that may have evolutionary importance. These can be coded as present or absent, providing more information than the sequence alone.
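As a rough illustration of coding features as present or absent, each sequence can be reduced to a string of 1s and 0s recording which of the ten reference cysteines it contains. This is a sketch under stated assumptions: the function name is hypothetical, the Group 1 layout is inferred from the bonding patterns described earlier, and the second entry is an invented example rather than a real sequence.

```python
REFERENCE_SIZE = 10  # the refined reference described in this paper has ten cysteines


def presence_string(occupied_reference_positions, size=REFERENCE_SIZE):
    """Code each reference cysteine as present (1) or absent (0)."""
    occupied = set(occupied_reference_positions)
    return "".join("1" if position in occupied else "0" for position in range(1, size + 1))


# Occupied reference positions per sequence (both rows are illustrative assumptions).
examples = {
    "group1_like": [2, 3, 4, 6, 8, 9],
    "hypothetical_8C": [1, 2, 3, 4, 6, 7, 8, 9],
}

for name, positions in examples.items():
    print(name, presence_string(positions))
```

Strings of this kind could then be added as extra characters alongside the amino acid alignment. The exact input format should be taken from the documentation of the phylogenetics program being used.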
Collecting the relevant sequences is a critical step of the protocol. Depending on the scope of the study and the distribution of the group members, this may span broad taxonomical groups. If the goal is to understand an entire group of proteins, consider that some sequences may be found outside the species that they are usually reported from. If a taxon is already well represented and additional sequences are unlikely or redundant, excluding them from the search may be appropriate. A basic rule-of-thumb for retrieving hits in a BLAST search is to use a cutoff of 1e-05 for the e-value. The e-value is the number of hits expected by chance. While this is suitable for some situations, if there is a group of sequences that is highly divergent but shares specific characteristics, it can be less reliable: it may retrieve sequences that are similar but do not have the specific features wanted, and it may not return sequences that have the key characteristics but that are highly divergent. There are some potential ways of addressing this issue. The first is to look at the sequences identified in the search that do not meet the 1e-05 cut-off to see if they meet the inclusion criteria. Second, if there is enough information, use Position-Specific Iterated BLAST (PSI-BLAST)22 or Pattern-Hit Initiated BLAST (PHI-BLAST)23. PSI-BLAST uses the results from an initial search to generate a new model for the next round and can sometimes find divergent sequences that the initial search did not identify. PHI-BLAST requires a pattern to be submitted along with the query sequence. This restricts the retrieved sequences to those containing the pattern of interest. This tool is especially useful if a motif unique to the group can be clearly identified.
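The idea of checking hits against both the e-value cutoff and a defining feature can be sketched as follows. The example assumes the hits have already been exported as (name, e-value, sequence) tuples; it does not call BLAST itself. The hit names, e-values, and sequences are invented for illustration, the function name is hypothetical, and the CXXXC motif is used only as a stand-in for whatever inclusion criteria a project defines.

```python
import re

# A cysteine, any three residues, then another cysteine (the CXXXC motif).
CXXXC = re.compile(r"C.{3}C")


def triage_hits(hits, evalue_cutoff=1e-5):
    """Sort hits by whether they pass the e-value cutoff and contain the motif."""
    bins = {"keep": [], "check_manually": [], "discard": []}
    for name, evalue, sequence in hits:
        passes_cutoff = evalue <= evalue_cutoff
        has_motif = CXXXC.search(sequence.upper()) is not None
        if passes_cutoff and has_motif:
            bins["keep"].append(name)
        elif passes_cutoff or has_motif:
            # Similar but lacking the feature, or divergent but carrying it.
            bins["check_manually"].append(name)
        else:
            bins["discard"].append(name)
    return bins


# Invented hits, for illustration only.
example_hits = [
    ("hit_a", 1e-20, "AACGGGCAAAC"),
    ("hit_b", 0.5, "GGCAAACGG"),
    ("hit_c", 3.0, "GGGAAAGGG"),
    ("hit_d", 1e-10, "AAAGGGAAA"),
]

print(triage_hits(example_hits))
```

Hits placed in the manual-check bin correspond to the two failure modes described above: sequences that are similar overall but lack the feature, and divergent sequences that carry it.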
An accurate alignment is critical for phylogenetic analysis; interpretations of trees are only valid if they are generated using a good alignment. Using the reference to inform the alignment can help to avoid errors that are only obvious when the structure or activity is considered. Sequence redundancy will need to be defined for the project. Two sequences that seem redundant may not be for phylogenetic purposes if they are from widely divergent taxa or are nearly identical in sequence but have different structural or functional properties. If there is ambiguity regarding which sequences should be included, multiple alignments can be generated and analyzed separately to see how alignment changes impact phylogenetic inferences. The method presented here does not eliminate the need for the manual adjustment of alignments, but it can help to clarify how the sequences should align and could possibly be used in conjunction with a more sophisticated barcoding technique than has been described previously13.
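One possible way to flag candidate redundant pairs is to compute pairwise percent identity over the aligned sequences. The sketch below is only illustrative: the 95% threshold, the function names, and the toy alignment are assumptions, and any flagged pair would still need to be judged against the taxonomic and functional considerations described above.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns, ignoring columns where both have a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" and residue_b == "-":
            continue
        compared += 1
        if residue_a == residue_b:
            matches += 1  # a gap opposite a residue is counted as a mismatch
    return 100.0 * matches / compared if compared else 0.0


def near_identical_pairs(alignment, threshold=95.0):
    """Return name pairs whose identity meets the (assumed) redundancy threshold."""
    return [
        (name_a, name_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2)
        if percent_identity(seq_a, seq_b) >= threshold
    ]


# Toy alignment (hypothetical sequences).
toy_alignment = {
    "seq_a": "ACDC-GKC",
    "seq_b": "ACDC-GKC",
    "seq_c": "AC-CWGRC",
}

print(near_identical_pairs(toy_alignment))
```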
For the reference to be useful, it is important to identify defining characteristics that are not currently obvious from the sequence alone. For example, consider the inability to compare cysteine bonding patterns between sequences with different numbers of cysteines when each sequence is numbered with respect to itself. The goal is to facilitate comparison and discussion, not to add another layer of confusion. This may involve several iterations of the reference and judgment calls in deciding which features to include. It is hoped that adopting a common method of discussing divergent sequences in a group will increase the understanding of the group as a whole.
The authors have nothing to disclose.
Ongoing tardigrade antimicrobial peptide research is supported by intramural funding from the Midwestern University Office of Research and Sponsored Programs (ORSP). The ORSP had no role in study design, data collection, analysis, interpretation, or manuscript preparation.
Name | Company | URL
BLAST webpage | | https://blast.ncbi.nlm.nih.gov/Blast.cgi
EditSeq (Lasergene suite) | DNASTAR | https://www.dnastar.com/t-allproducts.aspx
Excel 2013 | Microsoft |
FigTree | | http://tree.bio.ed.ac.uk/software/figtree/
MEGA | | www.megasoftware.net
MrBayes | | http://mrbayes.sourceforge.net/
SCOP database | | http://scop.mrc-lmb.cam.ac.uk/scop/