Intrinsically disordered domains are important for oncogenic fusion transcription factor function. To therapeutically target these proteins, a more detailed understanding of the regulatory mechanisms employed by these domains is required. Here, we use transcriptomics to map important structural features of the intrinsically disordered EWS domain in Ewing sarcoma.
Many cancers are characterized by chromosomal translocations which result in the expression of oncogenic fusion transcription factors. Typically, these proteins contain an intrinsically disordered domain (IDD) fused with the DNA-binding domain (DBD) of another protein and orchestrate widespread transcriptional changes to promote malignancy. These fusions are often the sole recurring genomic aberration in the cancers they cause, making them attractive therapeutic targets. However, targeting oncogenic transcription factors requires a better understanding of the mechanistic role that low-complexity, IDDs play in their function. The N-terminal domain of EWSR1 is an IDD involved in a variety of oncogenic fusion transcription factors, including EWS/FLI, EWS/ATF, and EWS/WT1. Here, we use RNA-sequencing to investigate the structural features of the EWS domain important for transcriptional function of EWS/FLI in Ewing sarcoma. First shRNA-mediated depletion of the endogenous fusion from Ewing sarcoma cells paired with ectopic expression of a variety of EWS-mutant constructs is performed. Then RNA-sequencing is used to analyze the transcriptomes of cells expressing these constructs to characterize the functional deficits associated with mutations in the EWS domain. By integrating the transcriptomic analyses with previously published information about EWS/FLI DNA binding motifs, and genomic localization, as well as functional assays for transforming ability, we were able to identify structural features of EWS/FLI important for oncogenesis and define a novel set of EWS/FLI target genes critical for Ewing sarcoma. This paper demonstrates the use of RNA-sequencing as a method to map the structure-function relationship of the intrinsically disordered domain of oncogenic transcription factors.
A subset of cancers, including many malignancies of childhood and adolescence, are characterized by chromosomal translocations which generate novel fusion oncogenes1,2,3,4,5,6. The resulting fusion proteins frequently function as oncogenic transcription factors, orchestrating widespread changes in transcriptional regulation to promote tumorigenesis7,8. Cancers with these translocations commonly possess an otherwise quiet mutational landscape, with few recurring genomic aberrations aside from the pathognomonic fusion4,9. As such, directly targeting the fusion protein is an attractive therapeutic strategy in these diseases. However, these oncogenic transcription factors commonly consist of a low-complexity, intrinsically disordered, transcriptionally activating domain fused with a DNA-binding domain (DBD)10,11,12,13,14. Both the intrinsically disordered domains (IDDs) and DBDs of these proteins have proven difficult to target with conventional pharmacological approaches. Development of novel therapeutic approaches, therefore, requires a more detailed molecular understanding of the mechanisms employed by these fusions to aberrantly regulate gene expression.
The N-terminal IDD portion of EWSR1 is commonly fused to a DBD in cancer, including EWS/FLI in Ewing sarcoma, EWS/WT1 in diffuse small round cell tumor, and EWS/ATF1 in clear cell sarcoma of soft parts10. The mechanistic role of the EWS IDD in each of these fusions is incompletely understood. The EWS/ETS family of fusions, specifically EWS/FLI, is the most functionally characterized to date. EWS/FLI coordinates genome-wide epigenetic and transcriptional changes leading to the activation and repression of thousands of genes7,11,15,16. Studies have shown that the IDD is important for the recruitment of both transcriptional co-activators (such as p300, WDR5, and the BAF complex), as well as co-repressors (such as the NuRD complex)11,15,17. The fusion of the EWS IDD to the C-terminal portion of FLI1 confers novel DNA-binding specificity to the ETS DBD of FLI1, such that the fusion oncoprotein (EWS/FLI) binds to repetitive GGAA-microsatellite regions of the genome in addition to the consensus ETS motif18,19,20. Combined with the co-activator recruitment function, this emergent DNA-binding activity of EWS/FLI promotes de novo enhancer formation at GGAA-microsatellites distal to transcription start sites (TSS) (“enhancer-like” microsatellites) and recruits RNA polymerase II to promote transcription at GGAA-microsatellites proximal to TSS (“promoter-like” microsatellites)11,15,16,21.
Taken together, these data led us to hypothesize that discrete elements within the EWS domain contribute to the recruitment of distinct co-regulators to different types of EWS/FLI binding sites. However, discerning these elements within the EWS portion of EWS/FLI, and how they function, has been hindered by the highly repetitive and disordered nature of the domain. Here we utilize a previously published knockdown-rescue system in Ewing sarcoma cells to functionally map these elements in the EWS IDD. In this system EWS/FLI is depleted using an shRNA targeting the 3’UTR of the FLI1 gene, and expression is rescued with varying EWS/FLI mutant cDNA constructs lacking the 3’UTR7,17,22. These experiments focused on constructs with various deletions to map the structure-function relationship between the EWS IDD and important oncogenic phenotypes, including activation of a GGAA-microsatellite reporter construct, colony formation assays, and targeted validation of EWS/FLI-activated and -repressed genes7,17,22. However, these studies failed to find discrete sub-domains within the EWS IDD in EWS/FLI that are uniquely important for either activation or repression. All tested constructs were either able to both activate and repress specific target genes, leading to efficient colony formation, or unable to regulate any of the EWS/FLI target genes, leading to loss of colony formation7,17,22.
Transcriptomic analyses enabled by the widespread adoption of the next generation sequencing are commonly used to compare gene expression signatures in two conditions, frequently in the context of screening or descriptive studies. We instead wanted to leverage the ability to capture genome-wide expression data using RNA-sequencing (RNA-seq) to characterize the contributions of IDDs to transcription factor function. In this case RNA-seq is paired with the knockdown-rescue system to explore the structure-function relationship of the EWS domain. This approach is applicable to other fusion transcription factors, including other EWS fusions or wildtype transcription factors with poorly understood function, and has multiple advantages over the other assays used for functional mapping studies, such as reporter assays or targeted qRT-PCR. These include testing structural determinants of function in the relevant chromatin context, the ability to test multiple types of response elements in one assay (i.e., activated and repressed, GGAA-microsatellite and non-microsatellite, etc.), and the resulting ability to better detect partial function.
Successful implementation of this approach depends on a cell-based system that captures the phenotypes of interest (in this case A673 cells with shRNA-mediated EWS/FLI depletion), and a panel of mutant constructs in an expression vector appropriate for the cell-based system (in this case, pMSCV-hygro with various 3x-FLAG-tagged EWS/FLI mutants to be delivered by retroviral transduction). Viral transduction of either CRISPR-based depletion constructs, shRNA-based depletion constructs, and cDNA expression constructs with appropriate selection to generate stable cell lines is recommended over transient transfection. The downstream interpretation of results is strengthened when the transcriptomic data can be paired with other data related to localization of the transcription factor and other phenotypic readouts where available.
In this paper, we apply this approach to characterize the activity of the DAF mutant of EWS/FLI14. The DAF mutant has 17 tyrosine to alanine mutations in the repetitive regions of the EWS IDD of EWS/FLI14. This particular EWS mutant had been previously reported and is unable to activate reporter gene expression when fused to the ATF1 DBD14. However, preliminary qRT-PCR data suggested that this mutant was able to activate transcription of the EWS/FLI target NR0B123. The transcriptomic approach described here enabled successful detection of partial function of the DAF mutant. By pairing these transcriptomic data with information about EWS/FLI binding and recognition motifs we further show that the DAF mutant retains function at GGAA-microsatellite repeats. These results identify DAF as the first partially functional EWS/FLI mutant and highlight function at non-microsatellite genes as important for oncogenesis (as reported23). This demonstrates the power of this transcriptomic structure-function mapping approach to provide insight into the function of oncogenic transcription factors.
1. Set up in vitro panel of constructs
NOTE: This step will vary depending on the specific protein to be analyzed.
2. Collect cells, validate expression of constructs, and set up correlative phenotypic assays
3. Next-Generation Sequencing
4. Alignment and Transcript Counting Pipeline
NOTE: This protocol assumes that following sample submission and processing, a set of paired FASTQ files are returned for each sample. These files are frequently compressed with a suffix of “fastq.gz”. Further analysis of these FASTQ files will require access to a high-performance computing (HPC) facility running a Linux operating system.
5. Differential expression and downstream analysis
6. Comparison with Relevant Phenotypes
Preliminary qRT-PCR data suggested that an EWS/FLI mutant called DAF, with specific tyrosine to alanine mutations in the repetitive and disordered region of EWS, maintained the ability to activate EWS/FLI target genes, but failed to repress critical target genes23. In order to better understand the relationship between these residues in the EWS domain and EWS/FLI function, the protocol described above and outlined in Figure 1 was used. A673 Ewing sarcoma cells were virally transduced with an shRNA targeting the 3’UTR of FLI1, resulting in the depletion of endogenous EWS/FLI. After four days of selection, EWS/FLI function was rescued with viral transduction of different 3XFLAG-tagged EWS/FLI mutant constructs, with empty vector as a control for no rescue. A non-functional mutant lacking the EWS domain, called Δ22, was used as a negative control and wild-type EWS/FLI, called wtEF, was used as a positive control (Figure 2A). DAF was used as the test construct, though more than one test construct can be used if desired. Cells were selected for an additional 10 days to allow construct expression to stabilize and then collected for RNA (with a gDNA removal step), protein and colony forming assays. Four replicates were collected and representative qRT-PCR and western blots showing effective knockdown and rescue are shown in Figure 2B-D. It should be noted that DAF-rescued cells failed to form colonies as shown in Figure 2E, suggesting impaired oncogenic transformation.
Following completion of the replicate validation and phenotypic assays, RNA was submitted to the Institute for Genomic Medicine at Nationwide Children’s Hospital for library preparation and next generation sequencing with ~50 million 150-bp paired-end reads collected. The data was returned as fastq.gz files. Low-quality reads were trimmed from these files with TrimGalore and STAR was used to align reads to the human genome hg19 and count the reads per gene. hg19 was used for purposes of compatibility with the other curated datasets for EWS/FLI used in downstream analysis. These read counts were combined into a single count matrix for all samples, the first 6 rows of which are shown in Figure 3.
Counts were initially run through DESeq2 without batch normalization, however, visual inspection of the sample-to-sample distance showed potential confounding batch effects as shown highlighted with red arrows in Figure 4A. This likely arose due to biological variability introduced by the passage of cells in culture and differences in the processing of each batch. Normalization for batch effects was performed with ComBat and is generally recommended. The sample-to-sample distances of the batch-normalized data are shown in Figure 4B. Following batch normalization, DESeq2 was used to generate transcriptional profiles for the three constructs (wtEF, Δ22, and DAF) relative to the baseline. Note that while “parental” A673 cells (mock knockdown and mock rescue, called “iLuc” here) were included in the differential analysis, the reference for this experiment are the cells with EWS/FLI-depleted, called iEF cells. The transcriptional profile can be generated for the endogenous protein here by comparing the iLuc sample to iEF, and this may be of interest in understanding how the rescue system works, but that is not the goal of this particular analysis. The transcriptional profiles generated for the mutants include positive (wtEF) and negative (Δ22) controls, with respect to iEF, such that these should function as the benchmarks for other mutants. This is important, as the positive control in this example did not completely recapitulate the function of endogenous EWS/FLI as discussed elsewhere7,23.
The principal component analysis (PCA) in Figure 5 suggests that the transcriptional profile of DAF is intermediate between wtEF and Δ22, confirming partial function. Moreover, hierarchical clustering of the 1000 most variable genes across samples showed that DAF failed to repress EWS/FLI target genes, and only partially retained gene activation activity as shown in Figure 6A and Figure S5. ToppGene analysis suggested that the classes of genes that DAF activates are functionally distinct from those EWS/FLI-activated targets where DAF is non-functional (Figure 6B). Interestingly, the function of activated genes rescued by wtEF, but not by DAF, appear to be related to transcriptional control and chromatin regulation. Based on the results of the colony formation assays, the genes from this core gene signature should be further analyzed for their role in EWS/FLI-mediated oncogenesis. The importance of EWS/FLI-mediated gene repression has been previously described17.
It is known that EWS/FLI possesses a unique binding affinity for GGAA-microsatellite repeat elements19,22, and that binding at these elements drives downstream gene regulation11,15,18,20,22. These microsatellites have been characterized as either associated with activation or repression, and either proximal to (< 5 kb) TSS or distal to (> 5 kb) TSS25. In addition, there are EWS/FLI-regulated genes with high affinity (HA) ETS motifs proximal to TSS23. In order to further analyze the characteristics of DAF function and what types of EWS/FLI-activated genes DAF was able to rescue, differential expression of genes associated with these different classes was analyzed. Interestingly, DAF was most able to rescue GGAA-microsatellite activated genes, but unable to rescue activated genes near an HA site as seen in Figure 7. As seen with hierarchical clustering, DAF fails to rescue EWS/FLI-mediated repression across motif classes. These data suggest that DAF retains sufficient structural features of EWS to bind to and activate from GGAA-microsatellites, both proximal and distal to TSS. This likely arises from the intact SYGQ domain thought to be important for EWS/FLI activity at GGAA repeats11. These data also suggest that the specific tyrosines mutated in DAF play important, but poorly understood, roles in EWS/FLI-mediated gene regulation from HA sites, as well as in gene repression, highlighting an important area of further investigation.
Figure 1: Workflow. Depiction of the step-by-step procedure to perform structure-function mapping by transcriptomics. Cells were first prepared to express the suite of constructs required for structure-function mapping. Following expression, cells were harvested for RNA and protein and assayed for correlative phenotypes. Expression of the constructs was validated, and this process was repeated 3-4 times to gather independent biological replicates. RNA was then submitted for next-generation sequencing (NGS). When data was received, data was trimmed for quality, aligned, and counts per transcript were calculated. Batch effects were controlled for and transcriptomic signatures and differential expression were determined using DESeq2. Hierarchical clustering and downstream analysis integrating other -omics datasets and different pathway or functional analysis can be incorporated. Please click here to view a larger version of this figure.
Figure 2: Validation of construct expression and correlative assays. (A) Schematic depicting the constructs tested in this example. (B) Validation of knockdown of endogenous EWS/FLI and expression of 3X-FLAG-tagged constructs by immunoblot. (C,D) Validation of construct activity at an EWS/FLI (C) activated target gene, NR0B1, and (D) repressed target gene, TGFBR2, by qRT-PCR. Data are presented as mean +/- standard deviation. P-values were calculated with a Tukey’s honest significance test. * p < 0.05, ** p < 0.01, *** p < 0.005 (E) Colony counts from soft-agar assays performed to assess transforming activity of constructs. P-values were calculated with a Tukey’s honest significance test. * p < 0.05, ** p < 0.01, *** p < 0.005. This figure is adapted from Theisen, et al.23 Please click here to view a larger version of this figure.
Figure 3: Final collated count data for analysis. Screenshot of the first 6 rows of the count file with gene counts for all the samples to be batch normalized and analyzed. Please click here to view a larger version of this figure.
Figure 4: Sample-to-sample distance heatmaps. (A) Sample-to-sample distance plot showing the sample clustering of the raw count data. Samples which are clustering both by batch and by sample are denoted with red arrows. (B) Sample-to-sample distance plot following batch normalization with ComBat. Here, samples from all replicates cluster together, independent of batch. Please click here to view a larger version of this figure.
Figure 5: Results of differential expression analysis. (A) Principle component analysis (PCA) plot of the transcriptomic signatures generated for all the samples show strong intra-sample clustering and demonstrate that DAF is intermediated between the positive (wtEF) and negative (Δ22) controls. (B) Volcano plots showing the -log(p-value) plotted against the log2FoldChange for genes in each construct. Genes with an adjusted p-value < 0.05 and a |log2(FoldChange)| > 1 are considered significant and are shown in red. Panel 5B is adapted from Theisen, et al.23 Please click here to view a larger version of this figure.
Figure 6: Hierarchical clustering to identify gene classes. (A) Hierarchical clustering of the top 1000 most variable genes across all constructs and the baseline, iEF, shows DAF partially rescues EWS/FLI-mediated gene activation. (B) Gene ontology (molecular function) results from ToppGene showing the functional enrichment of EWS/FLI-activated genes that are either rescued or not rescued by DAF. Panel 6B is adapted from Theisen, et al.23 Please click here to view a larger version of this figure.
Figure 7: Detailed analysis of different transcription factor response elements to different constructs: (A) Schematic depicting the data processing used to generate panels (B) and (C) by incorporating other available datasets with the transcriptomic profiles here. (B,C) Compilation showing the rescue of different classes of direct EWS/FLI- (B) activated and (C) repressed targets. Genes included were only those genes with detectable differential expression by endogenous EWS/FLI. In each pie chart, gray depicts the portion of genes which are not rescued by the construct. Red depicts the portion of genes that are differentially activated, and blue depicts the portion of genes that are differentially repressed. This figure is adapted from Theisen, et al.23 Please click here to view a larger version of this figure.
Figure S1: Loading the fastq.gz files to the HPC environment, trimming and alignment. Please click here to download this figure.
Figure S2: Collating read counts across samples and running batch normalization with ComBat. Please click here to download this figure.
Figure S3: Running DESeq2 and extracting results of differential expression analysis. Please click here to download this figure.
Figure S4: Analyzing output. Please click here to download this figure.
Figure S5: Hierarchical clustering to identify gene classes: Hierarchical clustering of the top 1000 most variable genes across all constructs and the baseline, iEF, sorted into k clusters. In this instance k=7, but this parameter is set by the user as shown in Figure S4D. Please click here to download this figure.
Table S1: List of genes (Ensembl gene ID) with cluster annotation. Please click here to download this table.
Studying the biochemical mechanisms of oncogenic transcription factors is critically important to understand the diseases they cause and to design new therapeutic strategies. This is especially true in malignancies characterized by chromosomal translocations resulting in fusion transcription factors. The domains included in these chimeric proteins may lack meaningful interactions with regulatory domains present in the wild-type proteins, complicating the ability to interpret structure-function information in the context of the fusion26,27,28. Moreover, many of these oncogenic fusions are characterized by low-complexity intrinsically disordered domains10,13,29,30.
The EWS domain is an example of such an intrinsically disordered domain that is involved in a variety of oncogenic fusions10. The intrinsically disordered and repetitive nature has hindered efforts to understand the molecular mechanisms employed by the EWS domain. Prior efforts to investigate the structure-function have largely resorted to using different mutants in the context of reporter gene assays or in cell backgrounds that fail to recapitulate the relevant cellular context, or lack any structural variations which produce meaningful partial function11,17,25. The method presented here addresses these issues. Structure-function mapping is performed in a disease-relevant cell context and next generation sequencing enables transcriptomic profiling to evaluate transcription factor function in the setting of native chromatin. In the specific case of the DAF mutant of EWS/FLI, DAF was reported to show little activity in reporter assays using isolated response elements, but to show activity in the context of the full gene promoter, either in a reporter assay or in native chromatin, suggesting an interesting phenotype23. Using the method described here more directly resolves the question of which type of regulatory elements across the genome are most responsive in the disease setting. By testing all candidate target genes in their native chromatin context simultaneously, a transcriptomic approach is more likely to identify constructs with partial function.
The inherent strength of using a disease-relevant cell background is perhaps the biggest limitation of this technique. One of the most important factors is choosing the appropriate cell system for these experiments. Many cell lines derived from malignancies with pathognomonic transcription factors do not readily tolerate knockdown of that transcription factor, and in many instances, particularly for pediatric cancers, the true cell of origin remains controversial and the expression of the oncogene in other cell backgrounds is prohibitively toxic31,32. In these cases, it may be helpful to perform experiments in a different cell background, so long as the researcher exercises caution in the interpretation of results and appropriately validates any relevant findings in a more disease-relevant cell type.
It is critically important to carefully validate the stability and phenotypic consequences of expression of the oncogene and to only submit samples for sequencing that meet strict criteria. Here, this included western blot to confirm knockdown and rescue, and qRT-PCR of a small number of known target genes to validate the positive control (Figure 2). It is likewise crucial to decrease as much batch variability as possible by carefully performing the cell and RNA preparations as similarly as possible through each batch.
The method described here becomes especially powerful when paired with other types of genomic data that speak to the genome-wide function of the transcription factor under study. Future directions for this type of structure-function analysis would expand to include ChIP-seq and ATAC-seq to determine the binding of the transcription factor and any induced changes in chromatin accessibility. As a suite, this type of data can point to where different structural components of an oncogenic transcription factor contribute to different aspects of function (i.e. DNA binding vs. chromatin modification vs. co-regulator recruitment). Overall, using NGS-based approaches to map the structure-function relationships of fusion transcription factors can reveal new insights in the biochemical determinants of the oncogenic function of these proteins. This is important to further our understanding of the diseases they cause and in enabling the development of new therapeutic strategies.
The authors have nothing to disclose.
This research was supported by the High Performance Computing Facility at the Abigail Wexner Research Institute at Nationwide Children’s Hospital. This work was supported by the National Institutes of Health National Cancer Institute [U54 CA231641 to SLL, R01 CA183776 to SLL]; Alex’s Lemonade Stand Foundation [Young Investigator Award to ERT]; Pelotonia [Fellowship to ERT]; and the National Health and Medical Research Council CJ Martin Overseas Biomedical Fellowship [APP1111032 to KIP].
Wet Lab Reagents | |||
anti-FLI rabbit pAb | Abcam | ab15289 | 1:500 |
anti-lamin B1 rabbit pAb | Abcam | ab16048 | 1:2000 |
Cell-based system for introduction of mutant constructs | Determined by cell system used | ||
Cryotubes | For viral aliquots | ||
DMEM | Corning Cellgro | 10-013-CV | For viral production |
Fetal bovine serum | Gibco | 16000-044 | For viral production |
G418 | ThermoFisher | 10131027 | For viral production |
HEK293-EBNAs | ATCC | CRL-10852 | For viral production |
HEPES | Gibco | 15630106 | |
Hygromycin B | ThermoFisher | 10687010 | |
M2 anti-FLAG mouse mAb | Sigma | F3165 | 1:2000 |
Near IR-secondary antibodies | Li-Cor | ||
Optimem | Gibco | 31985062 | For viral production |
Penicillin/Streptomycin/Glutamine | Gibco | 10378-016 | For viral production |
Polybrene | Sigma | TR-1003-G | For viral transduction |
Puromycin | Sigma | P8833 | Stored at 2 mg/mL stock |
RNeasy Plus kit | Qiagen | 74136 | Has gDNA removal columns |
Selection reagents | As dictated by cell system used | ||
Sodium Pyruvate | Gibco | 11360-070 | For viral production |
Tissue culture media | Determined by cell system used | ||
TransIT-LT1 | Mirus | MIR 2304 | For viral production |
Software | |||
Access to HPC environment | |||
AnnotationDbi | 1.38.2 | ||
Cairo | 1.5-10 | ||
DESeq2 | 1.16.1 | ||
genefilter | 1.58.1 | ||
ggbiplot | 0.55 | ||
ggplot2 | 3.1.1 | ||
org.Hs.eg.db | 3.4.1 | ||
pheatmap | 1.0.12 | ||
PuTTY | |||
R | 3.4.0 | ||
RColorBrewer | 1.1-2 | ||
reshape2 | 1.4.3 | ||
rgl | 0.100.19 | ||
R-studio | |||
STAR | Version 2.6 or later | ||
sva | 3.24.4 | ||
TrimGalore! | |||
WinSCP |