Here, we describe a protocol for genome-wide mapping of the integration sites of Moloney murine leukemia virus-based retroviral vectors in human cells.
Moloney murine leukemia virus (MLV)-based retroviral vectors integrate predominantly in acetylated enhancers and promoters. For this reason, mLV integration sites can be used as functional markers of active regulatory elements. Here, we present a retroviral scanning tool that allows the genome-wide identification of cell-specific enhancers and promoters. Briefly, the target cell population is transduced with an mLV-derived vector and genomic DNA is digested with a frequently cutting restriction enzyme. After ligation of genomic fragments with a compatible DNA linker, linker-mediated polymerase chain reaction (LM-PCR) allows the amplification of the virus-host genome junctions. Massive sequencing of the amplicons is used to define the mLV integration profile genome-wide. Finally, clusters of recurrent integrations are defined to identify cell-specific regulatory regions responsible for the activation of cell-type specific transcriptional programs.
The retroviral scanning tool allows the genome-wide identification of cell-specific promoters and enhancers in prospectively isolated target cell populations. Notably, retroviral scanning represents an instrumental technique for the retrospective identification of rare populations (e.g. somatic stem cells) that lack robust markers for prospective isolation.
Cell identity is determined by the expression of specific sets of genes. The role of cis-regulatory elements, such as promoters and enhancers, is crucial for the activation of cell-type specific transcriptional programs. These regulatory regions are characterized by specific chromatin features, such as distinctive histone modifications, binding of transcription factors and co-factors, and chromatin accessibility, which have been widely used for their genome-wide identification in several cell types1,2,3. In particular, the genome-wide profile of acetylation of histone H3 lysine 27 (H3K27ac) is commonly used to define active promoters, enhancers and super-enhancers4,5,6.
Moloney murine leukemia virus (MLV) is a gamma-retrovirus that is widely used for gene transfer in mammalian cells. After infecting a target cell, the retroviral RNA genome is reverse-transcribed into a double-stranded DNA molecule that binds viral and cellular proteins to assemble the pre-integration complex (PIC). The PIC enters the nucleus and binds the host cell chromatin. Here, the viral integrase, a key PIC component, mediates the integration of the proviral DNA into the host cell genome. mLV integration in the genomic DNA is not random, but occurs in active cis-regulatory elements, such as promoters and enhancers, in a cell-specific fashion7,8,9,10. This distinctive integration profile is mediated by a direct interaction between the mLV integrase and the cellular bromodomain and extraterminal domain (BET) proteins11,12,13. BET proteins (BRD2, BRD3, and BRD4) act as a bridge between host chromatin and the mLV PIC: through their bromodomains they recognize highly acetylated cis-regulatory regions, while the extraterminal domain interacts with the mLV integrase11,12,13.
Here, we describe retroviral scanning, a novel tool to map active cis-regulatory regions based on the integration properties of mLV. Briefly, cells are transduced with an mLV-derived retroviral vector expressing the enhanced green fluorescent protein (eGFP) reporter gene. After genomic DNA extraction, the junctions between the 3' long terminal repeat (LTR) of the mLV vector and the genomic DNA are amplified by linker-mediated PCR (LM-PCR) and massively sequenced. mLV integration sites are mapped to the human genome and genomic regions highly targeted by mLV are defined as clusters of mLV integration sites.
Retroviral scanning was used to define cell-specific active regulatory elements in several human primary cells14,15. mLV clusters co-mapped with epigenetically defined promoters and enhancers, most of which harbored active histone marks, such as H3K27ac, and were cell-specific. Retroviral scanning allows the genome-wide identification of DNA regulatory elements in prospectively purified cell populations7,14, as well as in retrospectively defined cell populations, such as keratinocyte stem cells, that lack effective markers for prospective isolation15.
1. MLV Transduction of Human Cells
2. Amplification of mLV integration sites by linker-mediated-PCR (LM-PCR)
3. Massive Sequencing of mLV Integration Sites
NOTE: LM-PCR products can be sequenced using commercial platforms (choosing the proper nested primer pair in the second PCR reaction, see subsection 2.5.1). For sequencing on the Roche GS-FLX pyrosequencing platform, refer to previous papers7,14,15. In this section, a newly optimized protocol for the Illumina sequencing platform is described.
Workflow of the retroviral scanning procedure
The workflow of the retroviral scanning procedure is shown schematically in Figure 1. The target cell population is purified and transduced with an mLV-derived retroviral vector expressing an eGFP reporter gene. The transgene is flanked by two identical long terminal repeats (5' and 3' LTR), ensuring synthesis, reverse transcription and integration of the viral genome into host DNA. The transduction efficiency is assessed by FACS analysis of eGFP expression. The cell population containing a high proportion (>30%) of mLV-transduced cells is amplified and subsequently lysed to extract genomic DNA containing the integrated mLV viral cassettes. Genomic DNA is digested and ligated with a compatible linker, and the junctions between the viral 3' LTRs and the host genome are amplified by LM-PCR. Virus-host genome junctions are then massively sequenced using Roche or Illumina platforms. Finally, mLV integration sites are mapped to the human genome to define genomic clusters of recurrent insertion sites.
Amplification of mLV integration sites by LM-PCR
The LM-PCR is shown schematically in Figure 2. Genomic DNA is extracted from mLV-transduced cells and digested with the Tru9I restriction enzyme, which cuts the human genome frequently, generating fragments with a median length of 70 bp. A second restriction enzyme (PstI) is used to prevent amplification of integrated and non-integrated internal 5' LTR fragments. A Tru9I double-stranded linker is then ligated to the genomic fragments and LM-PCR is performed with primers specific for the linker and the 3' LTR to amplify the virus-host genome junctions. Nested PCR can be performed using primers compatible with Roche or Illumina sequencing platforms.
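The digestion and linker design described above can be illustrated with a short in silico sketch. It assumes the Tru9I recognition sequence T^TAA (cut after the first T, leaving a 5'-TA overhang) and uses the linker oligonucleotides listed in the materials table; the function names and the toy input sequence are hypothetical and are not part of the published protocol.

```python
import re

TRU9I_SITE = "TTAA"  # Tru9I recognition sequence (assumed: T^TAA)
CUT_OFFSET = 1       # top strand is cut after the first T

# Linker oligonucleotides as listed in the materials table
LINKER_PLUS = "TAGTCCCTTAAGCGGAG"
LINKER_MINUS = "GTAATACGACTCACTATAGGGCTCCGCTTAAGGGAC"


def digest(seq, site=TRU9I_SITE, offset=CUT_OFFSET):
    """Return top-strand fragments after cutting at every occurrence of site."""
    seq = seq.upper()
    # The lookahead also finds overlapping sites
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def linker_overhang(plus, minus):
    """5' single-stranded overhang of the plus strand once it is annealed
    to the 3' end of the minus strand; None if the strands do not pair."""
    rc = revcomp(plus)
    for k in range(len(plus)):
        # revcomp(plus[k:]) equals rc[:len(rc) - k]
        if minus.endswith(rc[: len(rc) - k]):
            return plus[:k]
    return None


# Toy sequence (hypothetical) containing two TTAA sites
fragments = digest("GGTTAACCCTTAAGG")
overhang = linker_overhang(LINKER_PLUS, LINKER_MINUS)
print(fragments, overhang)
```

Under these assumptions, every fragment downstream of a cut starts with TAA on the top strand, and the annealed linker exposes a 5'-TA overhang on its phosphorylated plus strand, which is what makes it compatible with Tru9I-digested ends.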
Analysis of virus-host genome junction amplicons
In the experiment represented in Figure 3, we purified CD34–CD13+ myeloid progenitor/precursors (MPP) and transduced them with an mLV-derived retroviral vector expressing the eGFP reporter gene. More than 60% of MPP cells expressed eGFP 48 h after transduction (data not shown). Fifteen days after transduction, we collected the cells, extracted gDNA and amplified the virus-host genome junctions, as described above. An aliquot of the pooled LM-PCR products was loaded on a 1% agarose gel to verify the presence and the size of the amplicons. We visualized a DNA smear corresponding to LM-PCR products of different sizes, ranging from 150 to 500 bp (Figure 3A). Amplicons were then concentrated by DNA precipitation and loaded on a 1% agarose gel. LM-PCR products were gel-purified and run on a bioanalyzer system, confirming the expected amplicon sizes (between 150 and 500 bp; Figure 3B).
Mapping of mLV integrations into active and cell-specific regulatory regions
In the experiment reported in Figure 4, hematopoietic stem/progenitor cells (HSPC), erythroid progenitor/precursors (EPP) and myeloid progenitor/precursors (MPP) were transduced with an mLV-derived retroviral vector. These results were obtained by processing and sequencing samples derived from different cell types separately, to avoid potential contamination/collisions18,19. Raw sequence reads generated by massive sequencing were processed by an automated bioinformatics pipeline to eliminate viral and linker sequences. Then, unique sequences of at least 20 bp were mapped on the human genome using Blat17. Raw alignments were filtered by requiring the match to start within the first 3 nucleotides, univocal matches and a minimum of 95% identity. Clusters of recurrent mLV integrations were defined by a statistical comparison with a dataset of random genomic sequences, generated by randomly extracting genomic positions from the human genome with a Tru9I restriction motif at a distance compatible with the sequencing platform. Control sequences were then processed through the same mapping and filtering pipeline used for integration sequences, to generate a random set of unique sites. To define mLV clusters, we applied the DBSCAN clustering algorithm19, comparing the distribution of consecutive mLV integrations with that of an equal number of random sites to identify regions of highly clustered integrations, which define cell-specific regulatory elements7,14,15. To avoid the generation of false clusters, multiple extractions from the random control dataset were performed. By LM-PCR and pyrosequencing, we mapped 32,574, 27,546 and 36,358 mLV integration sites in HSPC, EPP and MPP, respectively. Clusters of recurrent mLV integrations co-mapped with acetylated enhancers and promoters (Figure 4A).
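The alignment filters described above (minimum read length of 20 bp, match starting within the first 3 nucleotides, univocal match, at least 95% identity) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Hit` record and its field names are hypothetical and do not correspond to BLAT output columns, and "univocal" is interpreted here as a read with exactly one alignment passing the other filters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a trimmed read (hypothetical record layout)."""
    read_id: str
    read_len: int    # length of the read after removal of LTR and linker
    q_start: int     # 0-based offset in the read where the match begins
    identity: float  # percent identity of the alignment
    chrom: str
    pos: int         # genomic coordinate of the virus-host junction
    strand: str


def filter_hits(hits, min_len=20, max_q_start=3, min_identity=95.0):
    """Apply the filters and return a sorted list of unique integration sites."""
    passing = defaultdict(list)
    for h in hits:
        if (h.read_len >= min_len
                and h.q_start < max_q_start      # starts within first 3 nt
                and h.identity >= min_identity):
            passing[h.read_id].append(h)
    # Assumption: a read is univocal if exactly one alignment passes
    univocal = [hs[0] for hs in passing.values() if len(hs) == 1]
    # Collapse reads that report the same junction into one site
    return sorted({(h.chrom, h.pos, h.strand) for h in univocal})


example_hits = [
    Hit("r1", 40, 0, 99.0, "chr1", 1000, "+"),   # passes
    Hit("r2", 40, 5, 99.0, "chr1", 2000, "+"),   # match starts too late
    Hit("r3", 40, 1, 90.0, "chr2", 3000, "-"),   # identity too low
    Hit("r4", 40, 0, 98.0, "chr3", 4000, "+"),   # two passing hits:
    Hit("r4", 40, 2, 97.0, "chr5", 500, "-"),    # ambiguous, discarded
    Hit("r5", 18, 0, 100.0, "chr1", 7000, "+"),  # read too short
    Hit("r6", 35, 2, 96.0, "chr1", 1000, "+"),   # same site as r1
]
example_sites = filter_hits(example_hits)
print(example_sites)
```

Collapsing reads to unique (chromosome, position, strand) triples reflects the fact that many reads can derive from the same integration event; a stricter or looser definition of uniqueness is possible and would need to be matched to the actual pipeline.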
Most of the mLV-targeted regulatory regions were cell-specific, such as: (i) the promoter of HSPC-specific SPINK2 gene (Figure 4B); (ii) the Locus Control Region containing potent enhancers of the erythroid-specific β-like globin genes (Figure 4C); (iii) enhancers located upstream of the MPP-specific LYZ gene (Figure 4D). Finally, we used luciferase assays to validate a subset of putative cell-specific mLV-targeted enhancers in EPP and MPP (Figure 4E).
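The clustering step used to define the mLV-targeted regions above can be illustrated with a one-dimensional DBSCAN over integration coordinates on a single chromosome. This is a sketch under stated assumptions: the `eps` and `min_pts` values are hypothetical, and the statistical comparison with random control sites that the protocol uses to set clustering stringency is not reproduced here.

```python
from bisect import bisect_left, bisect_right


def dbscan_1d(positions, eps, min_pts):
    """DBSCAN on 1D coordinates.

    Returns a list of (start, end, n_sites) tuples, one per cluster.
    Sites not assigned to any cluster are treated as noise and omitted.
    """
    xs = sorted(positions)
    n = len(xs)
    # A core point has at least min_pts sites (itself included) within eps
    is_core = [
        bisect_right(xs, x + eps) - bisect_left(xs, x - eps) >= min_pts
        for x in xs
    ]
    labels = [-1] * n
    cluster = -1
    last_core = None
    # In sorted 1D data, consecutive core points belong to the same
    # cluster exactly when they lie within eps of each other
    for i in range(n):
        if is_core[i]:
            if last_core is None or xs[i] - xs[last_core] > eps:
                cluster += 1
            labels[i] = cluster
            last_core = i
    # Border points join the cluster of the nearest core point within eps
    core_idx = [i for i in range(n) if is_core[i]]
    core_xs = [xs[i] for i in core_idx]
    for i in range(n):
        if labels[i] == -1 and core_xs:
            j = bisect_left(core_xs, xs[i])
            candidates = [c for c in (j - 1, j) if 0 <= c < len(core_xs)]
            nearest = min(candidates, key=lambda c: abs(core_xs[c] - xs[i]))
            if abs(core_xs[nearest] - xs[i]) <= eps:
                labels[i] = labels[core_idx[nearest]]
    clusters = {}
    for x, lab in zip(xs, labels):
        if lab >= 0:
            clusters.setdefault(lab, []).append(x)
    return [(pts[0], pts[-1], len(pts)) for _, pts in sorted(clusters.items())]


# Toy coordinates (hypothetical) with two dense groups and two isolated sites
toy_sites = [100, 150, 180, 220, 5000, 9000, 9050, 9100, 20000]
toy_clusters = dbscan_1d(toy_sites, eps=100, min_pts=3)
print(toy_clusters)
```

In practice the same function would be run on an equal number of random control sites, and parameters would be chosen so that the control set yields few or no clusters; that calibration is left out of this sketch.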
Figure 1: A general scheme of the retroviral integration site mapping procedure. Target cells are transduced with an mLV-based retroviral vector containing an eGFP cassette. Genomic DNA obtained from transduced cells is digested with Tru9I and ligated with a compatible Tru9I double-stranded linker. mLV integration sites are amplified by nested LM-PCR and the library of virus-host genome junctions is massively sequenced using Illumina or Roche platforms. The resulting reads are mapped to the human genome to define clusters of recurrent mLV integrations.
Figure 2: Amplification of virus-host genome junctions by LM-PCR. Genomic DNA (gDNA) containing the integrated mLV provirus is digested with Tru9I and PstI restriction enzymes, and ligated with a compatible Tru9I linker. Nested PCR is performed using primers specific for the LTR and the linker. Tru9I and PstI restriction sites in the viral and in the human genome are indicated.
Figure 3: Analysis of LM-PCR amplicons. (A) LM-PCR products (lane +) were run on a 1% agarose gel and visualized by ethidium bromide staining. A no-template sample served as negative control, in which only PCR primers were visible (lane -). (B) After gel purification, LM-PCR product size was checked by microcapillary electrophoresis. Sample size (bp) and fluorescence intensity (FU) are shown on the x and y axes of the electropherogram, respectively.
Figure 4: Mapping of mLV integration into epigenetically-defined regulatory regions. (A) We defined 3,498, 2,989 and 4,103 clusters of recurrent mLV integration sites in HSPC, EPP and MPP, respectively. In each cell population, >95% of mLV clusters overlapped with epigenetically defined enhancers and promoters. (B, C, and D) Cell-specific mLV-targeted regions were highly acetylated and associated with cell-specific expression of the targeted gene (SPINK2, HBB and LYZ, almost exclusively expressed in HSPC, EPP and MPP, respectively, as determined by Cap Analysis of Gene Expression). Single mLV integrations are depicted with small bars. TPM indicates tags per million. (E) Putative cell-specific mLV-targeted enhancers in EPP and MPP. Figure 4 is adapted from reference14. We received permission to re-use this figure under the Creative Commons license.
Here, we described a protocol for genome-wide mapping of the integration sites of mLV, a retrovirus that targets chromatin regions epigenetically marked as active promoters and enhancers. Critical steps and/or limitations of the protocol include: (i) mLV transduction of the target cell population; (ii) amplification of virus-host junctions by LM-PCR; (iii) retrieval of a high fraction of integration sites. mLV-based retroviral vectors efficiently transduce dividing cells. The low efficiency of transduction of non-dividing cells (e.g. post-mitotic neuronal cells) is a potential limitation of this technique. However, it can be overcome through cell sorting of the transduced population based on the expression of the reporter gene (e.g. eGFP). The generation of relatively short amplicons by LM-PCR (150 to 500 bp) is mandatory to generate a library of amplicons compatible with the currently used massive sequencing strategies and to allow a comprehensive genome-wide analysis of mLV integration sites. For example, amplification of LM-PCR products longer than 500 bp can result from either partial genomic DNA digestion or intra-ligation of Tru9I-digested genomic fragments, as evaluated by shotgun cloning of LM-PCR amplicons (data not shown). In the first case, the issue can be resolved by further optimizing genomic DNA digestion conditions, whereas in the latter case, successful massive sequencing of mLV integration sites can be accomplished through gel purification of amplicons between 150 and 500 bp. Finally, the use of restriction enzymes to cut the genomic DNA can lead to the preferential amplification and detection of integration sites that lie close to a restriction site. The percentage of integration sites that the Tru9I restriction enzyme can retrieve is estimated to be ~50%. Thus, this technique can be further optimized to improve integration site retrieval by using multiple restriction enzymes or performing random DNA shearing by sonication20,21.
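The restriction-site bias discussed above can be made concrete with a toy calculation: an integration site is counted as retrievable only if the nearest downstream Tru9I site lies within a distance window that yields an amplicon of suitable size. This is a simplified sketch, not the method used to obtain the ~50% estimate; the window limits, the single-strand treatment and the toy sequence are all hypothetical assumptions.

```python
import re
from bisect import bisect_left


def retrievable_fraction(genome, positions, min_dist, max_dist, site="TTAA"):
    """Fraction of positions whose nearest downstream restriction site
    lies between min_dist and max_dist bases away (inclusive)."""
    if not positions:
        return 0.0
    # Start coordinates of all sites; the lookahead finds overlapping ones
    site_starts = [m.start() for m in re.finditer(f"(?={site})", genome.upper())]
    retrievable = 0
    for p in positions:
        i = bisect_left(site_starts, p)  # first site at or after p
        if i < len(site_starts) and min_dist <= site_starts[i] - p <= max_dist:
            retrievable += 1
    return retrievable / len(positions)


# Toy sequence (hypothetical) with TTAA sites at coordinates 10 and 64
toy_genome = "A" * 10 + "TTAA" + "C" * 50 + "TTAA" + "G" * 10
toy_positions = [0, 5, 12, 60, 70]
toy_fraction = retrievable_fraction(toy_genome, toy_positions, min_dist=5, max_dist=20)
print(toy_fraction)
```

Adding a second enzyme to this sketch amounts to merging the site coordinates of both recognition sequences before the lookup, which shortens the distance to the nearest site and is one way to see why multiple enzymes or random shearing can improve retrieval.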
Recently, we have optimized the sequencing of viral integration sites using the widely-used Illumina platform, as detailed in this paper. This sequencing approach allows the generation of a higher number of reads per run compared to the Roche platform, greatly increasing the number of integration sites retrieved from a single experiment, thus globally reducing time and costs.
Genome-wide identification of cell-specific regulatory regions requires the mapping of one to three histone modifications by ChIP-seq6 (mono- and tri-methylation of histone H3 lysine 4 to identify enhancers and promoters, and H3K27ac to distinguish between active and inactive regulatory elements)4,5,6 and a systematic comparison of different cell types to define cis-acting elements active exclusively in a given cell population. We mapped mLV integration sites in prospectively isolated target cell populations, such as multipotent hematopoietic progenitors and their committed erythroid and myeloid progeny14, embryonic stem cells, neuroepithelial-like stem cells and differentiated keratinocytes15. Retroviral scanning allowed the genome-wide definition of cell-specific regulatory elements in each cell population analyzed, making comparative genome-wide studies not strictly necessary. Importantly, this tool can be used for the identification of active regulatory elements in rare cell populations (e.g. somatic stem cells) that lack robust markers for prospective isolation and cannot be analyzed by ChIP-seq-based analysis of histone modifications15. In these cell populations, mLV integration is a permanent genetic marker of active regulatory regions, allowing their retrospective identification in the more abundant cell progeny in vitro and in vivo. As an example, we successfully used mLV integration clusters as surrogate markers of promoters and enhancers in a retrospectively identified keratinocyte stem cell (KSC) population. We transduced an early-passage, foreskin-derived keratinocyte culture containing KSCs with an mLV vector, then passaged these cells for >35 cell doublings to enrich for the progeny of KSCs, which are defined by their ability to maintain the culture for this number of passages. In this case, mLV integrations permanently marked the regulatory regions active in the original transduced KSC population15.
Future studies will aim at the genome-wide identification of cell-specific regulatory elements in a larger number of rare human stem cell populations.
The authors have nothing to disclose.
This work was supported by grants from the European Research Council (ERC-2010-AdG, GT-SKIN), the Italian Ministry of Education, Universities and Research (FIRB-Futuro in Ricerca 2010-RBFR10OS4G, FIRB-Futuro in Ricerca 2012-RBFR126B8I_003, EPIGEN Epigenomics Flagship Project), the Italian Ministry of Health (Young researchers Call 2011 GR-2011-02352026) and the Imagine Institute Foundation (Paris, France).
PBS, pH 7.4 | ThermoScientific | 10010031 | or equivalent |
Fetal Bovine Serum | ThermoScientific | 16000044 | or equivalent |
0.2 ml tubes | general lab supplier | ||
1.5 ml tubes | general lab supplier | ||
QIAGEN QIAamp DNA mini Kit | QIAGEN | 51306 | or equivalent |
T4 DNA ligase | New England BioLabs | M0202T | |
T4 DNA Ligase Reaction buffer | New England BioLabs | M0202T | |
Linker Plus Strand oligonucleotide | general lab supplier | 5’-PO4-TAGTCCCTTAAGCGGAG-3’ (Purification grade: SDS-PAGE) | |
Linker Minus Strand oligonucleotide | general lab supplier | 5’-GTAATACGACTCACTATAGGGCTCCGCTTAAGGGAC-3’ (Purification grade: SDS-PAGE) | |
Tru9I | Roche-Sigma-Aldrich | 11464825001 | |
SuRE/Cut Buffer M | Roche-Sigma-Aldrich | 11417983001 | |
PstI | Roche-Sigma-Aldrich | 10798991001 | |
SuRE/Cut Buffer H | Roche-Sigma-Aldrich | 11417991001 | |
Platinum Taq DNA Polymerase High Fidelity | Invitrogen | 11304011 | |
10mM dNTP Mix | Invitrogen | 18427013 | or equivalent |
PCR grade water | general lab supplier | ||
96-well thermal cycler (with heated lid) | general lab supplier | ||
linker primer | general lab supplier | 5’-GTAATACGACTCACTATAGGGC-3’ (Purification grade: PCR grade) | |
MLV-3’ LTR primer | general lab supplier | 5’-GACTTGTGGTCTCGCTGTTCCTTGG-3’ (Purification grade: PCR grade) | |
linker nested primer 454 | general lab supplier | 5’-GCCTTGCCAGCCCGCTCAG[AGGGCTCCGCTTAAGGGAC]-3’ (Purification grade: SDS-PAGE) | |
MLV-3’ LTR nested primer 454 | general lab supplier | 5’-GCCTCCCTCGCGCCATCAGTAGC[GGTCTCCTCTGAGTGATTGACTACC]-3’ (Purification grade: SDS-PAGE) | |
linker nested primer Illumina | general lab supplier | 5'-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[AGGGCTCCGCTTAAGGGAC]-3' (Purification grade: SDS-PAGE) | |
MLV-3’ LTR nested primer Illumina | general lab supplier | 5'-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[GGTCTCCTCTGAGTGATTGACTACC]-3' (Purification grade: SDS-PAGE) | |
Sodium Acetate Solution (3M) pH 5.2 | general lab supplier | ||
Ethanol (absolute) for molecular biology | Sigma-Aldrich | E7023 | or equivalent |
Topo TA Cloning kit (with pCR2.1-TOPO vector) | Invitrogen | K4500-01 | |
QIAquick Gel Extraction kit | QIAGEN | 28704 | |
Agarose | Sigma-Aldrich | A9539 | or equivalent |
Ethidium bromide | Sigma-Aldrich | E1510 | or equivalent |
100 bp DNA ladder | Invitrogen | 15628019 | or equivalent |
6X Loading Buffer | ThermoScientific | R0611 | or equivalent |
NanoDrop 2000 UV-Vis Spectrophotometer | ThermoScientific | ND-2000 | |
Nextera XT Index kit | Illumina | FC-131-1001 or FC-131-1002 | |
2x KAPA HiFi Hot Start Ready Mix | KAPA Biosystems | KK2601 | |
Dynal magnetic stand for 2 ml tubes | Invitrogen | 12321D | or equivalent |
Agencourt AMPure XP 60 ml kit | Beckman Coulter Genomics | A63881 | |
Tris-HCl 10 mM, pH 8.5 | general lab supplier | ||
Agilent 2200 TapeStation system | Agilent Technologies | G2964AA | or equivalent |
D1000 ScreenTape | Agilent Technologies | 5067-5582 | or equivalent |
D1000 Reagents | Agilent Technologies | 5067-5583 | or equivalent |
KAPA Library Quantification Kit for Illumina platforms (ABI Prism) | KAPA Biosystems | KK4835 | |
ABI Prism 7900HT Fast Real-Time PCR System | Applied Biosystems | 4329003 | |
NaOH 1.0 N, molecular biology-grade | general lab supplier | ||
HT1 (Hybridization Buffer) | Illumina | Provided in the MiSeq Reagent Kit | |
MiSeq Reagent Kit v3 (150 cycles) | Illumina | MS-102-3001 | |
MiSeq System | Illumina | SY-410-1003 | |
PhiX Control v3 | Illumina | FC-110-3001 |