Here, we present a bioinformatic approach and analyses to identify LINE-1 expression at the locus specific level.
Long INterspersed Elements-1 (LINEs/L1s) are repetitive elements that can copy and randomly insert in the genome resulting in genomic instability and mutagenesis. Understanding the expression patterns of L1 loci at the individual level will lend to the understanding of the biology of this mutagenic element. This autonomous element makes up a significant portion of the human genome with over 500,000 copies, though 99% are truncated and defective. However, their abundance and dominant number of defective copies make it challenging to identify authentically expressed L1s from L1-related sequences expressed as part of other genes. It is also challenging to identify which specific L1 locus is expressed due to the repetitive nature of the elements. Overcoming these challenges, we present an RNA-Seq bioinformatic approach to identify L1 expression at the locus specific level. In summary, we collect cytoplasmic RNA, select for polyadenylated transcripts, and utilize strand-specific RNA-Seq analyses to uniquely map reads to L1 loci in the human reference genome. We visually curate each L1 locus with uniquely mapped reads to confirm transcription from its own promoter and adjust mapped transcript reads to account for mappability of each individual L1 locus. This approach was applied to a prostate tumor cell line, DU145, to demonstrate the ability of this protocol to detect expression from a small number of the full-length L1 elements.
Retrotransposons are repetitive DNA elements that can “jump” in the genome in a copy-and-paste mechanism via RNA intermediates. One subset of retrotransposons is known as Long INterspersed Elements-1 (LINEs/L1s) and makes up a sixth of the human genome with over 500,0000 copies1. Despite their abundance, most of these copies are defective and truncated with only an estimated 80-120 L1 elements thought to be active2. A full-length L1 is about 6 kb in length with 5’ and 3’ untranslated regions, an internal promoter and associated anti-sense promoter, two non-overlapping open-reading frames (ORFs), and a signal and polyA tail3,4,5. In humans, L1s are made up of subfamilies distinguished by evolutionary age with the older families having accumulated more unique sequence mutations over time compared to the youngest subfamily, L1HS6,7. L1s are the only autonomous, human retrotransposons and their ORFs encode a reverse transcriptase, endonuclease, and RNPs with RNA-binding and chaperone activities required to retrotranspose and insert in the genome in a process referred to as target-primed reverse transcription8,9,10,11,12.
Retrotransposition of L1s has been reported to cause human germline diseases by a variety of mechanisms including insertional mutagenesis, target-site deletions, and rearrangements13,14,15,16. Recently it has been hypothesized that L1s may play a role in oncogenesis and/or tumor progression as increased expression and insertion events of this mutagenic element have been observed in a variety of epithelial cancers17,18. It is estimated that there is one new L1 insertion in every 200 births19. Therefore, it is imperative to better understand the biology of the actively expressing L1s. The repetitive nature and abundance of defective copies found within transcripts of other genes have made this level of analysis challenging.
Fortunately, with the advent of high throughput sequencing technologies, strides have been made to parse out and identify authentically expressing L1s at the locus-specific level. There are differing philosophies on how to best identify expressed L1s using RNA next-generation sequencing. There have been only two reasonable approaches suggested for mapping L1 transcripts at the locus-specific level. One focuses only on the potential transcription that reads through the L1 polyadenylation signal and into flanking sequences20. Our approach takes advantage of small sequence differences between L1 elements and only maps those RNA-Seq reads that uniquely map to one locus21. Both of these methods have limitations in terms of quantitation of transcript levels. Quantitation can be improved potentially by adding a correction for the ‘unique mappability’ of each L1 locus21, or using more complex algorithms that redistribute the multi-mapped reads that could not be uniquely mapped to a specific locus22. Here, we will detail in a step-by-step manner the RNA extraction and next-generation sequencing and bioinformatics protocol to identify expressed L1 elements at the locus-specific level. Our approach takes maximal advantage of our knowledge of the biology of functional L1 elements. This includes knowing that functional L1 elements must be generated from the L1 promoter, initiated at the beginning of the L1 element, must be translated in the cytoplasm and that their transcripts should be co-linear with the genome. Briefly, we collect fresh, cytoplasmic RNA, select for polyadenylated transcripts, and utilize strand-specific RNA-Seq analyses to uniquely map reads to L1 loci in the human reference genome. These aligned reads then still require extensive manual curation to determine if transcript reads originate from the L1 promoter before designating a locus as an authentically expressed L1. We apply this approach on the DU145 prostate tumor cell line sample to demonstrate how it identifies a relatively few actively transcribed L1 members from the mass of inactive copies.
1. Cytoplasmic RNA extraction
2. Next-Generation sequencing
3. Create annotations (optional if one has an existing annotation)
4. Read alignment pipeline to identify expressed L1s
Option | Description |
–p | This details the number of threads the computer should use running the alignment. Larger computer memory will allow more threads and should be empirically d. |
–m 1 | This tells the program to only accept reads that have one match in the genome that is better than any other genome match. |
– y | This is the tryhard switch which makes the mapping search for all possible matches and not allow it to quit after a fixed number of matches is reached. |
–v 3 | This only allows the program to utilize memory for mapped reads with 3 or less mismatches to the genome. |
–X 600 | This only allows paired reads that map within 600 bases of one another. This makes sure the read pairs are co-linear in the genome and selects against s involving processed RNA molecules. |
–chunkmbs 8184 | This command assigns extra memory for handling the large amount of alignments possible for each L1-related read. |
Table 1: Command line options for Bowtie.
5. Manual curation
6. Read alignment strategy to assess mappability in reference genome (optional if one has an existing aligned genomic DNA dataset)
The steps described above and described graphically in Figure 1 were applied to a human prostate tumor cell line DU145. The RNA sample was cytoplasmically prepped and was next-gen sequenced in a poly-A selected, strand-specific, paired-end protocol. Using Bowtie, the paired-end sequencing files were aligned allowing only unique matches in which the paired-end read matched better to one genomic location compared to any other genomic location. The DU145 sequence files were aligned to the human reference genome creating a bam file, which is available upon author request. Using bedtools, data was extracted from the DU145 strand-separated bam files on the number of reads that mapped to full length L1s. Those reads were sorted in a spreadsheet from largest to smallest and manually curated by examining the genomic environment around each L1 locus in IGV to confirm its authenticity (Supplemental Table 1). If a sample was curated to be authentically expressed, it was color-coded green with an explanation for its acceptance in the right most column. Examples of L1 loci accepted to be authentically expressed following guidelines described in the methods section are shown in Figure 2a-b. If a sample was rejected to be authentically expressed, it was color-coded as red with the reason for rejection on the right most column. Examples of L1 loci rejected because of expression from a promoter other than their own following guidelines described in the methods section are detailed in Figure 2c-e.
Here, only full-length L1s with an intact promoter region were studied. If this distinction is not made, a large source of transcriptional noise originating from truncated L1s is introduced. Examples of truncated L1s in DU145 are shown in Figure 3a-b where they were identified as having uniquely mapped RNA-Seq reads. In IGV, however, it is apparent that those transcripts were not initiated from the truncated L1, but from the inclusion of the L1 sequence in a gene or downstream from an expressed gene.
Overall in DU145, the percentage of full-length L1 loci and reads that are rejected as authentically expressed L1s after manual curation is approximately 50% (Supplemental Table 2) demonstrating the high level of L1 mapped transcript reads that would otherwise be recorded as false positives without manual curation. Specifically, in DU145 there were 114 total full-length L1 loci to have uniquely mapped reads in the sense direction with a total of 3,152 reads, but there were only 60 loci identified to be expressed off their own promoter after manual curation with 1,879 reads (Supplemental Table 1). This is the case even when steps were taken to reduce expression irrelevant to L1 biology by selecting for cytoplasmic mRNA. Note that the locus with the highest level of mapped transcripts in DU145 was rejected because it was not an authentically expressed L1 (Figure 4). Overall the number of mapped transcripts to specific L1 loci ranges similarly between the accepted and rejected L1 loci as authentically expressed after manual curation (Figure 4).
After manual curation, the number of reads that map uniquely to authentically expressed specific L1 loci in DU145 range from 175 reads to an arbitrarily chosen minimum cut off of 10 reads (Figure 5). This approach of identifying uniquely mapped transcript reads to L1s limits the ability to accurately quantify expression. To account for this, a correction factor for each locus based on its mappability was created. To create this correction factor, first bedtools was used to extract the number of uniquely mapped reads from the HeLa genomic bam file that aligned to all full-length L1 loci and graphed those loci from highest to lowest mapped transcript reads (Supplemental Figure 1). It was arbitrarily designated that L1s with 400 reads had full coverage mappability. The number of reads able to map to a L1 locus in HeLa genomic sequencing sample was scaled relative to 400 reads and that scaled number was then multiplied to the number of reads that mapped to each authentically expressed L1 loci in DU145 (Supplemental Table 2). As expected, the L1 elements that had larger correction scores for mappability came from younger subfamilies like L1PA2 (Supplemental Table 2). Once reads were adjusted for mappability scores in each locus, the quantitation for expression for most loci increased (Figure 6). The number of reads that mapped uniquely to authentically expressed specific L1 loci with mappability corrections in DU145 ranged from 612 to 4 reads and there was a re-ordering of highest to lowest expressing loci (Figure 6).
Figure 1: Workflow schematic.
Graphically described are the steps to identify expressed L1s in a human sample. Note that steps 1 and 2 do not need to be repeated if the appropriate files are already available. These appropriate files may be downloaded from Supplement File 1a-b and Supplement File 2. The boxes in red indicate the steps where bedtools coverage program is used to count the number of reads mapping to L1s in the same sense direction. These loci with sense oriented mapping reads are the L1s that should be manually curated. Please click here to view a larger version of this figure.
Figure 2: Examples of curated L1 loci in DU145.
Loaded into IGV are the reference genome, the full-length L1 gff annotation file matching the reference genome version (Supplement File 1), the DU145 bam file, and lastly the genomic HeLa bam file to assess mappability, which are all available upon author request. Arrows have been added to aid in the visualization of direction of the annotated L1. Arrows and reads in red are oriented in sequence from right to left. Arrows and reads in blue are oriented in sequence from left to right. a) In IGV, this L1 locus appears to be expressed off its own promoter as there are no reads upstream of the L1 in the sense orientation for over 5 kb. This L1 has low mappability, it is not in a gene, and has evidence of expected antisense promoter activity26. b) In IGV, this L1 locus appears to be expressed off its own promoter as there are no reads upstream the L1 in the sense orientation for over 5 kb. This L1 has low mappability and is within a gene of opposite direction. c) In IGV, this L1 locus was rejected as an expressed L1 as there are upstream reads in the same orientation within 5 kb. This L1 is within a gene of the same direction so the transcript reads are most likely originating from the promoter of the expressed gene. d) In IGV, this L1 locus was rejected as an expressed L1 as there are upstream reads in the same orientation within 5 kb. This L1 is downstream of a highly expressed gene in the same direction so the transcript reads are most likely originating from the promoter of that expressed gene and extending beyond the normal gene terminator. e) In IGV, this L1 locus was rejected as an expressed L1 as there are upstream reads in the same orientation within 5 kb. This L1 is not within or near an annotated gene in the reference gene so the origin of these transcripts within and upstream of the L1 element suggest an un-annotated promoter. Please click here to view a larger version of this figure.
Figure 3: Background noise originates from truncated L1s as well.
Our L1 annotation does not include truncated L1s as they are a major source of background noise. Arrows have been added to aid in the visualization of direction of the annotated L1. Arrows and reads in blue are oriented in sequence from left to right. a) Demonstrated is an example of a truncated L1 in the L1MB5 sufamily that is 2706 bps. In IGV it is apparent that the reads originate from downstream extension of an expressed gene. b) Shown is another example of a truncated L1. This L1 is an L1PA11 that is 4767 bps long. In IGV it is apparent that the reads mapping uniquely to the L1 originate from the expressed exon, which the L1 is within. Please click here to view a larger version of this figure.
Figure 4: Transcript reads that map uniquely to all full-length intact L1s in the human genome expressed in DU145 prostate tumor cell line.
In black are the specific loci to be identified as authentically expressed after manual curation and in red are the specific loci to be rejected as authentically expressed reads after manual curation. In grey are loci with less than ten reads mapping to each. As these loci represent a small fraction of transcript reads, they were not manually curate. The x-axis tick marks denote every 100 full-length, intact L1s. Approximately 4,500 loci are not graphically shown as they had zero mapped reads. Please click here to view a larger version of this figure.
Figure 5: Transcript reads that map uniquely to authentically expressed full-length intact L1s in DU145 prostate tumor cell line.
Shown are the numbers of transcript reads that map to specific loci in DU145 cells after manual curation. Please click here to view a larger version of this figure.
Figure 6: Reads mapping to authentically expressed L1 when adjusted by mappability.
Shown are the numbers of transcript reads adjusted by loci-specific mappability scores that map to manually curated L1 loci in DU145 cells. Please click here to view a larger version of this figure.
Supplemental File 1: Annotations for full-length, intact human L1s according to orientation. a) FL-L1-BLAST_RM_minus.gff. b) FL-L1-BLAST_RM_plus.gff. Please click here to download this file.
Supplemental File 2: Supercomputer scripts used to automate the bioinformatics pipeline detailed in section 4. Please click here to download this file.
Supplemental Figure 1: Genomic DNA sample used to determine L1 mappability.
Shown are the number of genomic transcript reads from HeLa cell line sample that map uniquely to all 5,000 full-length L1 loci in the genome. It was designated that an L1 has full coverage mappability when 400 reads map to the L1. Please click here to download this figure.
Supplemental Table 1: Manual Curation of L1s in DU145. Please click here to download this table.
Supplemental Table 2: Curated L1s in DU145 with mappability adjustment. Please click here to download this table.
L1 activity has been shown to cause genetic damage and instability contributing to disease27,28,29. Of the approximately 5,000 full-length L1 copies, only a few dozen evolutionarily young L1s account for the majority of retrotransposition activity2. However, there is evidence that even some older, retrotranspositionally-incompentent L1s are still able to produce DNA damaging proteins30. To fully appreciate the role of L1s in genomic instability and disease, L1 expression at the locus-specific level must be understood. However, the high background of L1-related sequences incorporated into other RNAs unrelated to L1 retrotransposition poses a significant challenge in interpreting authentic L1 expression. Another challenge in identifying and therefore understanding expression patterns of individual L1 loci occurs because of their repetitive nature that does not allow many short read sequences to map to a single unique locus. To overcome these challenges, we developed the above-described approach in identifying expression of individual L1 loci using RNA-Seq data.
Our approach filters the high level (over 99%) of transcriptional noise generated from L1 sequences that are unrelated to L1 retrotransposition by taking a number of steps. The first step involves the preparation of cytoplasmic RNA. By selecting for cytoplasmic RNA, L1-related reads found within expressed intronic mRNA in the nucleus are significantly depleted. In the sequencing library preparation, another step taken to reduce transcriptional noise unrelated to L1s include the selection of polyadenylated transcripts. This removes L1-related transcript noise found in non-mRNA species. Another step includes strand-specific sequencing in order to identify and eliminate antisense L1-related transcripts. The use of an annotation for full-length L1s with functional promoter regions when identifying the number of RNA-Seq transcripts that map to L1s also eliminates background noise that otherwise originate from truncated L1s. Finally, the last critical step in eliminating transcriptional noise of L1 sequences unrelated to L1 retrotransposition is the manual curation of full-length L1s identified to have mapped RNA-Seq transcripts. The manual curation involves the visualization of each bioinformatically identified-to-be-expressed L1 locus in the context of its surrounding genomic environment to confirm that expression originates from the L1 promoter. This approach was applied to DU145, a prostate tumor cell line. Even with all the preparation-related steps taken to reduce background noise, approximately 50% of L1 loci identified bioinformatically in DU145 were rejected as L1 background noise originating from other transcriptional sources (Figure 4), emphasizing the rigor required to produce reliable results. This approach using manual curation is labor intensive, but necessary in the development of this pipeline to evaluate and understand the genomic environment surrounding a full-length L1. The next steps include reducing the amount of necessary manual curation by automating some of the curation rules, though due to the still not completely known nature of genomic expression, un-annotated sources of expression in the reference genome, regions of low mappability, and even complicating factors involved with the construction of a reference genome it is not be possible to fully automate L1 curation at this time.
The second challenge in identifying expression of individual L1 loci with sequencing relates to the mapping of repetitive L1 transcripts. In this alignment strategy, it is required that a transcript must align uniquely and co-linearly to the reference genome in order to be mapped. By selecting for paired-end sequences that map concordantly, the amount of transcripts that uniquely align to L1 loci found in the reference genome increases. This unique-mapping strategy provides confidence in the calling of reads mapping specifically to a single L1 locus, though it potentially underestimates the expression quantity of each identified-to-be-authentically expressed, repetitive L1. To approximately correct for this underestimation, a “mappability” score for each L1 locus based on its mappability was developed and applied to the number of uniquely mapped transcript reads (Figure 6). It is of note that ideally, mappability should be scored to full coverage reads across the full-length L1 according to the matched WGS sample. Here, we use WGS of HeLa cells to determine mappability scores of each L1 loci in order to inflate or deflate reads mapping to L1 loci in DU145 prostate tumor cell lines. This mappability calculation is a crude correction score, but the chosen ‘complete coverage mappability’ of 400 reads was determined with the dynamic nature of tumor cell lines in mind. It can be observed in Supplemental Figure 1, that there are a few L1 loci with HeLa WGS with extremely high number of mapped reads. These likely come from duplicated chromosome sequences within HeLa that are not within the reference genome, which is why those loci were not chosen to be representative of complete mappability coverage. Instead it was determined that the average of 100% read coverage occurs around 400 reads according to Supplemental Figure 1 and was then assumed that this average applies to the DU145 tumor prostate cell line as well.
This alignment strategy with 100-200 bp reads from RNA-Seq technology also preferentially selects for evolutionarily older L1s within the reference genome as older L1s have accumulated over time unique mutations that make them more mappable. This approach, therefore, has limited sensitivity when it comes to identifying the youngest of L1s as well as non-reference, polymorphic L1s. To identify the youngest of L1s, we suggest using 5’ RACE selection of L1 transcripts and sequencing technology like PacBio that make use of longer reads21. This permits more unique mapping and therefore confident identification of the expressed, young L1s. Using RNA-Seq and PacBio approaches together can lead to a more comprehensive list of authentically expressed L1s. To identify authentically expressed polymorphic L1s, the first next steps include construction and insertion of polymorphic sequences into the reference genome.
The biological and technical challenges in studying repeat sequences are great, though with the above rigorous procedure to remove transcriptional noise of L1 sequences un-related to retrotransposition using RNA-sequencing technology, we begin to sift through the large levels of transcriptional background noise and being to confidently and stringently identify L1 expression patterns and quantity at the individual locus level.
The authors have nothing to disclose.
We would like to thank Dr. Yan Dong for the DU145 prostate tumor cells. We would like to thank Dr. Nathan Ungerleider for his guidance and advice in creating supercomputer scripts. Some of this work was funded by NIH grants R01 GM121812 to PD, R01 AG057597 to VPB, and 5TL1TR001418 to TK. We would also like to acknowledge support from the Cancer Crusaders and the Tulane Cancer Center Bioinformatics Core.
1 M HEPES | Affymetrix | AAJ16924AE | |
5 M NaCl | Invitrogen | AM9760G | |
Agilent bioanalyzer 2100 | Agilent technologies | ||
Agilent RNA 6000 Nano Kit | Agilent technologies | 5067-1511 | |
bedtools.26.0 | https://bedtools.readthedocs.io/en/latest/content/installation.html | ||
bowtie-0.12.8 | https://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.8/ | ||
Cell scraper | Olympus plastics | 25-270 | |
Chloroform | Fisher | C298-500 | |
Digitonin | Research Products International Corp | 50-488-644 | |
Ethanol | Fisher | A4094 | |
Gibco (Phosphate Buffered Saline) | Invitrogen | 10-010-049 | |
Homogenizer | Thomas Scientific | BBI-8541906 | |
IGV 2.4 | https://software.broadinstitute.org/software/igv/download | ||
Isopropanol | Fisher | A416-500 | |
mac2unix | https://sourceforge.net/projects/cs-cmdtools/files/mac2unix/ | ||
Q-tips | Fisher | 23-400-122 | |
RNAse later solution | Invitrogen | AM7022 | |
RNaseZap RNase Decontamination Solution | Invitrogen | AM9780 | |
samtools-1.3 | https://sourceforge.net/projects/samtools/files/ | ||
sratoolkit.2.9.2 | https://github.com/ncbi/sra-tools/wiki/Downloads | ||
SUPERase·In RNase Inhibitor | Invitrogen | AM2694 | |
Trizol | Invitrogen | 15-596-018 | |
Water (DNASE, RNASE free) | Fisher | BP2484100 |