A method of constructing a phylogenetic tree based on sequence homology of SWEETs from eukaryotes and SemiSWEETs from prokaryotes is described. Phylogenetic analysis is a useful tool for explaining the evolutionary relatedness between homologous proteins or genes from different organism groups.
Phylogenetic analysis uses nucleotide or amino acid sequences or other parameters, such as domain sequences and three-dimensional structure, to construct a tree to show the evolutionary relationship among different taxa (classification units) at the molecular level. Phylogenetic analysis can also be used to investigate domain relationships within an individual taxon, particularly for organisms that have undergone substantial change in morphology and physiology, but for which researchers lack fossil evidence due to the organisms' long evolutionary history or scarcity of fossilization.
In this text, a detailed protocol is described for using the phylogenetic method, including amino acid sequence alignment using Clustal Omega, and subsequent phylogenetic tree construction using both Maximum Likelihood (ML) of Molecular Evolutionary Genetics Analysis (MEGA) and Bayesian Inference via MrBayes. To investigate the origin of eukaryotic Sugars Will Eventually be Exported Transporters (SWEET) genes, 228 SWEETs including 35 SWEET proteins from unicellular eukaryotes and 57 SemiSWEET proteins from prokaryotes were analyzed. Interestingly, SemiSWEETs were found in prokaryotes, but SWEETs were found in eukaryotes. Two phylogenetic trees constructed using theoretically distinct methods have consistently suggested that the first eukaryotic SWEET gene might stem from the fusion of a bacterial SemiSWEET gene and an archaeal SemiSWEET gene. It is worth noting that one should be cautious to draw a conclusion based only on phylogenetic analysis, although it is useful to explain the underlying relationship between different taxa, which is difficult or even impossible to discern through experimental means.
DNA or RNA sequences carry genetic information for underlying phenotypes that can be analyzed through physiological and biochemical methods or observed through morphological and fossil evidence. In a sense, genetic information is more reliable than evaluating external phenotypes because the former is the basis for the latter. In evolutionary study, fossil evidence is very direct and convincing. However, many organisms, such as microorganisms, have little chance to form a fossil during long geologic ages. Therefore, molecular information such as nucleotide sequences and amino acid sequences from related extant organisms are of value for exploring evolutionary relationships1. In the present study, a simple introduction of basic phylogenetic knowledge and an easy-to-learn protocol was provided for newcomers who need to construct a phylogenetic tree on their own.
Both DNA (nucleotide) and protein (amino acid) sequences can be used to infer phylogenetic relationships between homologous genes, organelles, or even organisms2. DNA sequences are more likely to be affected by changes during evolution. In contrast, amino acid sequences are much more stable given that synonymous mutations in nucleotide sequences do not cause mutations in amino acid sequences. As a result, DNA sequences are useful for comparison of homologous genes from closely related organisms, whereas amino acid sequences are appropriate for homologous genes from distantly related organisms3.
A phylogenetic analysis begins with the alignment of amino acid or nucleotide sequences4 retrieved from an annotated genome sequencing database5 listed in FASTA format, i.e., putative or expressed protein sequences, RNA sequences, or DNA sequences. It is worth noting that it is critical to collect high-quality sequences for the analysis, and only homologous sequences can be used to analyze phylogenetic relationships. Many different platforms such as Clustal W, Clustal X, Muscle, T-coffee, MAFFT, can be used for sequence alignment. The most widely used is Clustal Omega6,7 (http://www.ebi.ac.uk/Tools/msa/clustalo/), which can be used online or can be downloaded free of charge. The alignment tool has many parameters that the user can adjust before starting the alignment, but the default parameters work well in most cases. After the process is complete, the aligned sequences should be saved in the correct format for the next step. They should then be edited or trimmed using an editing software, such as BioEdit, because phylogenetic tree construction by MEGA requires the sequences to be of equal length (including both amino acid abbreviations and hyphens. In the aligned sequence, any position without an amino acid or nucleotide is represented by a hyphen "-"). Generally, all of the protruding amino acids or nucleotides at either end of the alignment should be removed. In addition, columns containing poorly aligned sequences in the alignment can be deleted because they convey little valuable information, and can sometimes give confusing or false information3. The columns containing one or more hyphens can be deleted at this time or in the later tree construction stage. Alternatively, they can be used for phylogenetic computation. When the sequence alignment and trimming is finished, the aligned sequences should be saved in FASTA format, or the desired format, for later use.
Many software platforms provide tree construction functions using different methods or algorithms. In general, the methods can be classified as either distance matrix methods or discrete data methods. Distance matrix methods are simple and fast to compute, while discrete data methods are complicated and time-consuming. For very closely related taxa with a high degree of sharing of amino acid or nucleotide sequence identity, a distance matrix method (Neighbor Joining: NJ; Unweighted Pair Group Method with Arithmetic mean: UPGMA) is appropriate; for distantly related taxa, a discrete data method (Maximum Likelihood: ML; Maximum Parsimony: MP; Bayesian Inference) is optimal3,8. In this study, the ML methods in MEGA (6.0.6) and Bayesian Inference (MrBayes 3.2) were applied to construct phylogenetic trees9. Ideally, when the proper model and parameters are used, the results derived from different methods may be consistent, and they are thus more reliable and convincing.
For a ML phylogenetic tree constructed using MEGA10, the aligned sequence file in FASTA format must be uploaded into the program. The first step then is to select the optimal substitution model for the uploaded data. All available substitution models are compared based on the uploaded sequences, and their final scores will be shown in a results table. Select the model with the smallest Bayesian Information Criterion (BIC) score (listed first in the table), set ML parameters according to the recommended model, and start the computation. The computation time varies from several minutes to several days, depending on the complexity of the loaded data (length of the sequences and number of taxa) and the performance of the computer on which the programs are run. When the computation is finished, a phylogenetic tree will be shown in a new window. Save the file as "FileName.mat". After setting parameters to specify the appearance of the tree, save once more. Using this method, MEGA can generate publication grade phylogenetic tree figures.
For tree construction with MrBayes11, the first step is to transform the aligned sequence, which is usually listed in FASTA format, into nexus format (.nex as the file type). Transforming FASTA files into nexus format can be processed in MEGA. Next, the aligned sequence in nexus format can be uploaded into MrBayes. When the file is successfully uploaded, specify detailed parameters for the tree computation. These parameters include details such as amino acid substitution model, variation rates, chain number for Markov chain Monte Carlo (MCMC) coupling, ngen number, average standard deviation of split frequencies, and so on. After these parameters have been specified, start the computation. In the end, two tree figures in ASC II code, one showing clade credibility and the other showing branch lengths, will be displayed on the screen.
The tree result will be saved automatically as "FileName.nex.con". This tree file can be opened and edited by FigTree, and the figure displayed in FigTree can be modified further to make it more suitable for publication.
In this study, 228 SWEET proteins, including 35 SWEETs from unicellular eukaryotes and 57 SemiSWEETs from prokaryotes, were analyzed as an example. Both the SWEETs and SemiSWEETs were characterized as glucose, fructose, or sucrose transporters across membranes12,13. Phylogenetic analysis suggests that the two MtN3/saliva domains containing SWEETs might be derived from an evolutionary fusion of a bacterial SemiSWEET and of an archaeon14.
1. Sequence Alignment
2. Computation of the Phylogenetic Tree
3. Presentation of the Phylogenetic Tree
NOTE: A phylogenetic ML tree will be presented when the computation using MEGA is finished (Figure 10).
4. Analysis of the Relationship of SWEETs and SemiSWEETs Using Sequence Alignment
NOTE: This step may not be needed in ordinary sequence analysis.
5. Phylogenetic Tree Construction with MrBayes
Phylogenetic trees show that all of the first MtN3/saliva domains of the 35 SWEET sequences clustered as one clade and the second MtN3/saliva domains of the SWEET sequences clustered as another clade. In addition, alignment results of the SWEETs and SemiSWEETs show that some SemiSWEETs from α-Proteobacteria aligned with the first MtN3/saliva domain of the SWEET sequences, whereas SemiSWEETs from Methanobacteria (archaea) aligned with the second MtN3/saliva domain of the SWEET sequences. These results together suggest that the two MtN3/saliva domains containing SWEETs might be derived from an evolutionary fusion of a bacterial SemiSWEET and of an archaeon14.
Figure 1: Save the aligned sequences of the 35 putative eukaryotic SWEETs as "35.clustal" via Clustal Omega. Please click here to view a larger version of this figure.
Figure 2: Select path in BioEdit to trim the aligned sequences of "35.clustal," which was prepared in Clustal Omega. Please click here to view a larger version of this figure.
Figure 3: Select and delete the uneven sequences at the left side of the first MtN3/saliva domain sequences of the 35 putative eukaryotic SWEETs in BioEdit. Please click here to view a larger version of this figure.
Figure 4: The trimmed sequences of the first MtN3/saliva domain of the 35 putative eukaryotic SWEETs in BioEdit. Please click here to view a larger version of this figure.
Figure 5: Select and copy the first MtN3/saliva domain sequences of the 35 putative eukaryotic SWEETs in MEGA. The copied sequences will be pasted in a doc file for the editing. Please click here to view a larger version of this figure.
Figure 6: Convert "35 realigned.fas" into "35.nex" (PAUP format) for Bayesian Inference at a later stage. Please click here to view a larger version of this figure.
Figure 7: Search for the best-fit substitution model by MEGA for Maximum Likelihood (ML) phylogenetic tree construction based on the "35 realigned.fas" file. Please click here to view a larger version of this figure.
Figure 8: A table of the best-fit substitution model computed for ML tree based on the "35 realigned.fas" file. Please click here to view a larger version of this figure.
Figure 9: Specify the parameters for ML tree computation based on the best-fit substitution model for "35 realigned.fas" in MEGA. Please click here to view a larger version of this figure.
Figure 10: An original ML tree constructed by MEGA based on "35 realigned.fas". At this stage, many options for figure style, size, color, etc., are available. Please click here to view a larger version of this figure.
Figure 11: Alignment of 228 eukaryotic SWEETs and 57 prokaryotic SemiSWEETs by Clustal Omega. The results were shown in Jalview, integrated into Clustal Omega. In the alignment, some SemiSWEETs from α-Proteobacteria were aligned with the first MtN3/saliva domain of the SWEET sequences, whereas SemiSWEETs from Methanobacteria (archaea) were aligned with the second MtN3/saliva domain of the SWEET sequences. Please click here to view a larger version of this figure.
Figure 12: Load the "35.nex" file into MrBayes in the DOS window. In order to show the overall results, content that was similar was deleted to reduce the figure length. Please click here to view a larger version of this figure.
Figure 13: Information displayed on the screen after computation of the "35.nex" file using MrBayes. To show the overall results, content that was similar was deleted to reduce the figure length. Please click here to view a larger version of this figure.
Figure 14: Summarized samples of model parameters for the "35.nex" file. Please click here to view a larger version of this figure. Please click here to view a larger version of this figure.
Figure 15: Summarized tree samples of the "35.nex" file. To show the overall results, content that was similar was deleted to reduce the figure length. Please click here to view a larger version of this figure.
Figure 16: The phylogenetic tree of "35.nex.con" displayed by FigTree. Please click here to view a larger version of this figure.
It is becoming increasingly popular in biological research to make a phylogenetic tree based on nucleotide or amino acid sequences8. Generally, there are three critical stages of the practice including sequence alignment, evaluation of the aligned sequences with the proper method or algorithm, and visualization of the computational result as a phylogenetic tree. In the presented study, three rounds of sequence alignment were conducted: first, the SWEET protein sequences, including the first and second MtN3/saliva domain, were aligned; second, each of the individual MtN3/saliva domain sequences of the SWEETs as an independent taxon were gathered and aligned together; and finally, SemiSWEET sequences and SWEET sequences were jointly aligned. Only one round of sequence alignment is usually needed for phylogenic tree construction.
In the preliminary stage, homologous sequences can be downloaded from NCBI or other databases. These downloaded sequences may need to be screened if they are not well annotated. In the first and second stage, alignment and computation cannot be started if the sequence format is incorrect. For example, Clustal Omega will reject any departure from the FASTA format in the sequence file. In the computational stage, note that the sequence lengths including both amino acids or nucleotides and hyphens are required to be equal before being evaluated by MEGA.
Despite the wealth of methods and models for tree construction that are available, none of them is foolproof. Robust and convincing results are those that are consistent with each other when different algorithms or models are used to evaluate the same data15. In the ML method, the reliability of tree topology largely depends on the bootstrap value of each clade; a bootstrap value of 70 or greater is generally regarded as reliable. In the present study, all of the first MtN3/saliva domain sequences clustered as a large clade with a bootstrap value of 83. The value of the other clade containing all the second MtN3/saliva domain sequences, however, was only 6 (Figure 10). To verify the tree architecture, MrBayes, which employs a completely different method16 than ML, was used to analyze the relationship of the taxa. The posterior probabilities16 of the first and second domain clades obtained from MrBayes were 100 and 68, respectively (Figure 16).
Another limitation of the ML and the MrBayes computation is that both are time-consuming to run. Using a computer with multi-core processors and graphical processing units (GPU) is helpful to improve computational performance and speed17,18. For the operation of MrBayes, a computer with a discrete graphics card and the appropriate CUDA drivers can significantly speed up the likelihood calculations11.
Selecting the proper model for phylogenetic tree computation is difficult for those with little experience. In this respect, MEGA provides an easy way to find the best model by comparing the BIC scores of candidate models. In addition, the recently upgraded MEGA 6.0 integrates several sequence alignment tools such as MUSCLE and Clustal W10, which are very convenient to use. It also provides both a sequence editing and phylogenetic tree construction function. These features partly explain why this software is so popular in the computational molecular evolution field. As for MrBayes, a significant advantage of this tool is that it can process mixed datatypes together (e.g., morphological and molecular data)11, and thus the results are more comprehensive.
In conclusion, the present study provides a method to analyze the molecular origin of protein-encoding genes that have undergone complex variation such as fusion after duplication or horizontal gene transfer (HGT) during evolution. Hopefully, more findings will be revealed with broad application of phylogenetic analysis in the evolutionary research field.
The authors have nothing to disclose.
This work was supported by the National Natural Science Foundation of China (31371596), the Bio-technology Research Center, China Three Gorges University (2016KBC04), and the Natural Science Foundation of Jiangsu Province, China (BK20151424).
Adobe Illustration | a graphical tool developed by Adobe Systems Software Ireland Ltd. Copyright © 2017 | ||
BioEdit | a biological sequence alignment editor written for Windows 95/98/NT/2000/XP/7. Copyright © Tom Hall | ||
Clustal Omega | a package for making multiple sequence alignments of amino acid or nucleotide sequences. http://www.clustal.org/ | ||
CorelDRAW | a graphic design software. Copyright © 2017 Corel Corporation | ||
FigTree | a graphical viewer of phylogenetic trees designed by the University of Edinburgh | ||
MEGA | MolecularEvolutionary Genetics Analysis version6.0 http://www.megasoftware.net/home | ||
MrBayes | an Bayesian phylogenetic inference tool | ||
NVIDIA | a company designs graphics processing units (GPUs) for the gaming and professional markets. Corporation Copyright © 2017 | ||
PAUP | Phylogenetic Analysis Using Parsimony. David Swofford's program implements the maximum likelihood method under a number of nucleotide models. | ||
Photoshop | a raster graphics editor developed and published by Adobe Systems Software Ireland Ltd. Copyright © 2017 | ||
RHYTHM | a knowledge based prediction of hekix contacts. Charité Berlin – Protein Formatics Group – Copyright 2007-2009 | ||
TMHMM | a tool for prediction of transmembrane helices in proteins. http://www.cbs.dtu.dk/services/TMHMM/ | ||
Compter | 4GB memory, Core 2 or above CPU. Windows 7, Windows 10 |