The purpose of this protocol is to use a combination of computational and bench research to find novel sequences that cannot be easily separated from a co-purifying sequence, which may be only partially known.
Subtractive genomics can be used in any research where the goal is to identify the sequence of a gene, protein, or general region that is embedded in a larger genomic context. Subtractive genomics enables a researcher to isolate a target sequence of interest (T) by comprehensive sequencing and subtracting out known genetic elements (reference, R). The method can be used to identify novel sequences such as mitochondria, chloroplasts, viruses, or germline restricted chromosomes, and is particularly useful when T cannot be easily isolated from R. Beginning with the comprehensive genomic data (R + T), the method uses Basic Local Alignment Search Tool (BLAST) against a reference sequence, or sequences, to remove the matching known sequences (R), leaving behind the target (T). For subtraction to work best, R should be a relatively complete draft that is missing T. Since sequences remaining after subtraction are tested through quantitative Polymerase Chain Reaction (qPCR), R does not need to be complete for the method to work. Here we link computational steps with experimental steps into a cycle that can be iterated as needed, sequentially removing multiple reference sequences and refining the search for T. The advantage of subtractive genomics is that a completely novel target sequence can be identified even in cases in which physical purification is difficult, impossible, or expensive. A drawback of the method is finding a suitable reference for subtraction and obtaining T-positive and negative samples for qPCR testing. We describe our implementation of the method in the identification of the first gene from the germline-restricted chromosome of zebra finch. In that case computational filtering involved three references (R), sequentially removed over three cycles: an incomplete genomic assembly, raw genomic data, and transcriptomic data.
The purpose of this method is to identify a novel target (T) genomic sequence, either DNA or RNA, from a genomic context, or reference (R) (Figure 1). The method is most useful if the target cannot be physically separated, or it would be expensive to do so. Only a few organisms have perfectly finished genomes for subtraction, so a key innovation of our method is the combination of computational and bench methods into a cycle enabling researchers to isolate target sequences when the reference is imperfect, or a draft genome from a non-model organism. At the end of a cycle, qPCR testing is used to determine whether more subtraction is needed. A validated candidate T sequence will show statistically greater detection in known T-positive samples by qPCR.
Incarnations of the method have been implemented in discovery of new bacterial drug targets that do not have host homologs1,2,3,4 and identification of novel viruses from infected hosts5,6. In addition to identification of T, the method can improve R: we recently used the method to identify 936 missing genes from the zebra finch reference genome and a new gene from a germline-only chromosome (T)7. Subtractive genomics is particularly valuable when T is likely to be extremely divergent from known sequences, or when the identity of T is broadly undefined, as in the zebra finch germline-restricted chromosome7.
By not requiring positive identification of T beforehand, a key advantage of subtractive genomics is that it is unbiased. In a recent study, Readhead et al. examined the relationship between Alzheimer's disease and viral abundance in four brain regions. For viral identification, Readhead et al. created a database of 515 viruses8, severely limiting the viral agents that their study could identify. Subtractive genomics could have been used to compare the healthy and Alzheimer's genomes in order to isolate possible novel viruses associated with the disease, regardless of their similarity to known infectious agents. While there are 263 known human-targeting viruses, it has been estimated that approximately 1.67 million undiscovered viral species exist, with 631,000-827,000 of them having a potential to infect humans9.
Isolation of novel viruses is an area in which subtractive genomics is particularly effective, but some studies may not need such a stringent method. For example, studies identifying novel viruses have used unbiased high-throughput sequencing followed by reverse transcription and BLASTx for viral sequences5 or enriching of viral nucleic acids to extract and reverse transcribe viral sequences6. While these studies employed de novo sequencing and assembly, subtraction was not used because the target sequences were positively identified through BLAST. If the viruses were completely novel and not related (or distantly related) to other viruses, subtractive genomics would have been a useful technique. The benefit of subtractive genomics is that sequences that are completely new can be obtained. If the organism's genome is known, it can be subtracted out to leave any viral sequences. For example, in our published study we isolated a novel viral sequence from zebra finch through subtractive genomics, though it was not our original intent7.
Subtractive genomics has also proved useful in the identification of bacterial vaccine targets, motivated by the dramatic rise in antibiotic resistance1,2,3,4. To minimize the risk of autoimmune reaction, researchers narrowed down the potential vaccine targets by subtracting any proteins that have homologs in the human host. One particular study, looking at Corynebacterium pseudotuberculosis, performed subtraction of vertebrate host genomes from several bacterial genomes to ensure that possible drug targets would not affect proteins in the hosts leading to side effects1. The basic work flow of these studies is to download the bacterial proteome, determine vital proteins, remove redundant proteins, use BLASTp to isolate the essential proteins, and BLASTp against host proteome to remove any proteins with host homologs1,2,3,4. In this case, subtractive genomics ensure that the vaccines developed will not have any off-target effects in the host1,2,3,4.
We used subtractive genomics to identify the first protein-coding gene on a germline-restricted chromosome (GRC) (in this case, T), which is found in germlines but not somatic tissue of both sexes10. Before this study, the only genomic information that was known about the GRC was a repetitive region11. De novo assembly was performed on RNA sequenced from ovary and teste tissues (R+T) from adult zebra finches. The computational elimination of sequences was performed using published somatic (muscle) genome sequence (R1)12, its raw (Sanger) read data (R2), and a somatic (brain) transcriptome (R3)13. The sequential use of three references was driven by the qPCR testing at step 5 of each cycle (Figure 2A), showing that additional filtering was required. The discovered α-SNAP gene was confirmed through qPCR from DNA and RNA, and cloning and sequencing. We show in our example that this method is flexible: it is not dependent on matching nucleic acids (DNA vs RNA) and that subtraction can be performed with references (R) that are comprised of assemblies or raw reads.
1. De novo Assemble Starting Sequence
NOTE: Any Next-Generation Sequence (NGS) data can be used, as long as an assembly can be produced from those data. Suitable input data includes Illumina, PacBio, or Oxford Nanopore reads assembled into a fasta file. For concreteness, this section describes an Illumina-based transcriptomic assembly specific to the zebra finch study we performed7; however be aware that the specifics will vary by project. For our example project, raw data were derived from a MiSeq and approximately 10 million paired reads were obtained from each sample.
2. BLAST the Assembly against the Reference Sequence
NOTE: Use this step when the reference is an assembly or long reads like Sanger; if it is composed of raw Illumina reads, see step 3 below for mapping reads to the query. All BLAST steps were completed with version 2.2.29+ though the commands should work on any recent BLAST version.
3. Map Reads onto the Assembly
NOTE: This method can be used if the reference dataset consists of raw genomic reads, rather than assembled sequences or Sanger sequences, in which case use BLAST (step 2.1).
4. Use Python Script to Remove any Matching Sequences
NOTE: Provided scripts work with Python 2.7.
5. Design Primers for the Sequence that Remains
NOTE: At this point there is a fasta file containing candidate T sequences. This section describes qPCR to experimentally test whether they come from T or from previously unknown regions of R. If the subtraction in step 4 removed all sequences, then either the initial assembly failed to include T, or the subtraction may have been too stringent.
6. qPCR Validation of the Remaining Sequence
NOTE: This step requires primers validated and PCR conditions established in step 5.
7. Repeat with a New Reference to Pare Down the Data.
NOTE: If step 6 validated the identified sequences from T, end the cycle here (Figure 2A). However, a variety of considerations may motivate a continuation of the cycle, for example if many R sequences remain in the file or if none of the candidate T sequences were validated by qPCR in step 6.
After running BLAST, the output file will have a list of sequences from the query that match the database. After Python subtraction, a number of nonmatching sequences will be obtained, and tested by qPCR. The results of this, and next steps, are discussed below.
Negative result. There are two possible negative results that can be seen after BLAST to the reference sequence. There may be no BLAST results, meaning that the total sequence does not have any similar sequences to the reference. This may be an error in selecting the right reference sequence for the sample sequenced. Another possibility is that there are no unique sequences in the starting assembly (everything is subtracted away), therefore no genes are found for the sequence of interest. Check where the reference came from and ensure that it is not the same tissue as the query assembly.
After computational filtering, qPCR may yield a negative result, for examples see Figure 3A, 3B, C in which there was no difference in detection across bird tissues. Panels A through C are representative genes from different subtraction cycles, which motivated additional subtractive cycle iterations and the development of the method (Figure 2A, 2B).
Positive result. A positive result–the identification of a true target sequence–is confirmed when genomic DNA qPCR shows statistically greater detection in the tissue / sample of interest relative to the reference (Figure 3D). The subtractive project in this case started with sequencing the RNA from germline tissue of male and female adult zebra finch, obtaining 10 million read pairs from each sex. For brevity, we will describe the processing of the ovary sequence only, in which 167,929 transcripts were obtained by de novo assembly. The subtractive genomics method (BLASTn) was used to eliminate any sequences that matched the published somatic genome12, which left 5,060 transcripts corresponding to 598 unique proteins, indicating that many of the transcripts were noncoding. The Sanger raw reads used to generate the assembly were then used for the next level of subtraction by tBLASTn, yielding 78 proteins. One final subtraction was performed using RNA-seq raw reads from the auditory lobule13, which left eight proteins. When these proteins were run through NCBI nr BLAST, six of the proteins were viral, one was a repetitive region in birds, and the last was an α-SNAP that is germline restricted7 (Figure 2B). During this process, 935 somatic genes that were not previously included in the whole genome annotation were identified; several showed uniform qPCR amplification across tissues (Figure 3A, 3B, 3C). The α-SNAP gene was validated to be germline restricted using qPCR, because it was depleted in somatic tissue relative to testis DNA where it was present at levels equivalent to actin (Figure 3D).
What could go wrong. The main problem that must be overcome when using this method is ensuring that the proper reference sequence is used. The best reference sequence encapsulates, in the broadest sense, the genomic complexity in which the sequence of interest (T) is embedded. This may mean that sequences in different forms; transcriptome, assembly, raw data, or data from multiple studies need to be used as references (Figure 1). In the zebra finch study, we developed primers from RNA sequencing data; however, the primers did not always work due to the presence of introns between or within primer binding sites in DNA. We tested each primer set by PCR off genomic DNA from testis DNA, which encodes both the target (T) and the reference (R), making it a suitable positive control. Primer failure at this stage necessitates the design and testing of new primers until a suitable set is identified. Standard pitfalls of PCR-based methods apply: amplification conditions must be optimized, amplification specificity confirmed by testing and/or cloning, and no-template controls must be included in all experiments. For more information on qPCR assays, see22.
Figure 1. The subtractive approach can iteratively remove multiple references (R) to recover only the target sequence of interest (T) from total genomic data. The reference sequences of individual projects may not overlap in precisely this way and may include datasets not indicated on the figure. Please click here to view a larger version of this figure.
Figure 2. Visual methods. (A) Subtractive cycle schematic. The cycle can be iterated as many times as needed, each time utilizing distinct reference sequences, to obtain the best results.(B) Specific example of the subtractive cycle of steps carried out in Biederman et al.7, with steps numbered as in A, and with the number of sequences remaining at each stage shown. Please click here to view a larger version of this figure.
Figure 3. Example data of qPCR results including negative and positive outcomes. (A) Genomic DNA qPCR of CHD8, a negative outcome. (B) Genomic DNA qPCR of DNMT1, a negative outcome. (C) Genomic DNA qPCR of CHD7, a negative outcome. (D) Genomic DNA qPCR of NAPAG, confirming presence specifically in testis samples and depletion from liver and ovary relative to actin, a positive outcome. All panels indicate average +/- standard deviation of three measurements. Please click here to view a larger version of this figure.
While subtractive genomics is powerful, it is not a cookie-cutter approach, requiring customization at several key steps, and careful selection of reference sequences and test samples. If the query assembly is of poor quality, filtering steps might only isolate assembly artifacts. Therefore, it is important to thoroughly validate the de novo assembly using an appropriate validation protocol to the specific project. For RNA-seq, guidelines are provided on the Trinity website18 and for DNA, a tool like REAPR23 can be used. Another critical step when using BLAST is selection of appropriate e-value, which will determine whether the subtraction will be relaxed or stringent. However, an inversion occurs in the method: a more stringent match to reference is actually a less-stringent subtraction, as non-matching sequences are not subtracted. Therefore, a larger (less stringent) e-value should be used in BLAST for a more stringent subtraction. The final essential step of the protocol is reference selection. For greatest efficiency the reference should be as complete as possible; however, it does not need to be perfect because qPCR testing confirms whether remaining sequences are from T or R, and whether more filtering is necessary. During the implementation of the protocol, new references may be used to further narrow down the genes to be validated. We note that sometimes the matching method may change: for the last subtractive step we used the algorithm BWA to map raw reads onto the query sequences, and used custom python scripts to identify query sequences with no matching reads (Figure 2B).
Limitations of this method include availability of a reference sequence. For example, Meyer et al. evaluated the mitochondrial genome of a new hominin; they used human and Denisovan probes to capture mitochondrial DNA, which was sequenced and mapped to a human reference24. In this case, there were no existing nuclear genome reference data that the researchers could have subtracted against to obtain the mitochondrial genome, necessitating the read-mapping alternative strategy24. Any extensively diverged regions of the novel mitochondrion relative to the human mitochondrial reference would be lost by read-mapping. Subtractive genomics offers a less-biased approach than read-mapping but is not always applicable depending on the research question, and in this case the low levels of ancient DNA precluded the kind of sequence coverage required for de novo assembly (step 1 of subtractive genomics).
Physical purification provides another alternative method to subtractive genomics. Purification of DNA or RNA is often used in sequencing whole chloroplast and mitochondrial genomes because these organellar genomes are much smaller than nuclear genomes25,26,27,28. Human and other smaller mitochondrial genomes can be isolated for sequencing through amplification using two primer sets followed by purification25. However, subtractive genomics may be helpful for cases in which mitochondrial genomes are unusually large, the primer binding sites are divergent or will not result in the full genome. An example of this is in ciliates, which have large, divergent, linear mitochondrial genomes29. Mapping to a reference genome is not a viable option for ciliates due to high divergence across species and lack of homologs even across genuses30. By using subtractive genomics, the ciliate mitochondrial genome can be isolated and analyzed while minimizing the potential of missing segments of the genome. Similarly, while a de novo assembly approach was used in the Sitka spruce chloroplast genome assembly, gap-closing involved comparative read mapping against the white spruce, potentially introducing bias at these sites31.
Depending on the project, subtractive genomics may offer time and cost advantages relative to purification or mapping approaches, while offering less bias in the discovery process. In some situations, the target sequence cannot be easily isolated because it is completely unknown, is vital to cell survival (mitochondria), or too large to separate by standard gel electrophoresis. Size-based electrophoretic purification is slow and requires significant starting material (which may be expensive) while optimizing conditions over multiple attempts. Pulse-field gel electrophoresis (PFGE) enables separation of DNA fragments up to 107 bp (10 Mb) but takes 2-3 days, large amounts of material, and sometimes specialized equipment that is not commercially available32. In Biederman et al., the only sequence that was known from the germline-restricted chromosome was a noncoding repeat7. As this chromosome is the largest in the bird, over 100 Mb in length10, purification would have been impossible; therefore, subtractive genomics was able to do what other methods could not. In the genomic era it is often cheaper and faster to sequence now, and filter by computer later. Enabling the discovery of completely novel sequences, subtractive genomics utilizes a combination of approaches to isolate novel sequences even without a perfect reference sequence.
The authors have nothing to disclose.
The authors acknowledge Michelle Biederman, Alyssa Pedersen, and Colin J. Saldanha for their assistance with the zebra finch genomics project at various stages. We also acknowledge Evgeny Bisk for computing cluster system administration and NIH grant 1K22CA184297 (to J.R.B.) and NIH NS 042767 (to C.J.S).
Accustart II Taq DNA Polymerase | Quanta Bio | 95141 | |
Blasic Local Alignment Search Tool (BLAST) | https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment | ||
Bowtie 2 | https://www.python.org/download/releases/2.7/ | ||
BWA-MEM v. 0.7.12 | https://github.com/BenLangmead/bowtie2 | ||
Geneious | https://blast.ncbi.nlm.nih.gov/Blast.cgi | ||
PEAR v. 0.9.6 | http://www.mybiosoftware.com/reptile-1-1-short-read-error-correction.html | ||
Personal Computer | Biomatters | http://www.geneious.com/ | |
PowerSYBR qPCR mix | ThermoFisher | 4367659 | |
Python v. 2.7 | https://sco.h-its.org/exelixis/web/software/pear/ | ||
Reptile v.1.1 | https://alurulab.cc.gatech.edu/reptile | ||
Stratagene Mx3005P | Agilent Technologies | 401456 | |
TransDecoder v. 3.0.1 | https://sourceforge.net/projects/bio-bwa/files/ | ||
Trinity v. 2.4.0 | https://github.com/TransDecoder/TransDecoder/wiki |