We describe a methodology based on sequence diversification to estimate the amino acid preferences of multispecific binding sites in protein-protein interactions (PPIs). In this strategy, thousands of potential peptide ligands are generated and screened in silico, thus overcoming some limitations of available experimental methods.
Many protein-protein interactions involve the binding of short protein segments to peptide-binding domains. Usually, such interactions require the recognition of linear motifs with variable conservation. The combination of highly conserved and more variable regions in the same ligands often contributes to the multispecificity of binding, a common property of enzymes and cell signaling proteins. Characterization of amino acid preferences of peptide-binding domains is important for the design of mediators of protein-protein interactions (PPIs). Computational methods are an efficient alternative to the often costly and cumbersome experimental techniques, enabling the design of potential mediators that can be later validated in downstream experiments. Here, we describe a methodology using the Pepspec application of the Rosetta molecular modeling package to predict the amino acid preferences of peptide-binding domains. This methodology is useful when the structure of the receptor protein and the nature of the peptide ligand are both known or can be inferred. The methodology starts with a well-characterized anchor from the ligand, which is extended by randomly adding amino acid residues. The binding affinity of peptides generated this way is then evaluated by flexible-backbone peptide docking in order to select the peptides with the best predicted binding scores. These peptides are then used to calculate amino acid preferences and to optionally compute a position-weight matrix (PWM) that can be used in further studies. To illustrate the application of this methodology, we used the interaction between subunits of human interferon regulatory factor 5 (IRF5), previously known to be multispecific but globally guided by a short conserved motif called pLxIS. The estimated amino acid preferences were consistent with previous knowledge about the IRF5 binding surface.
Positions occupied by phosphorylatable serine residues exhibited a high frequency of aspartate and glutamate, likely because their negatively charged side chains are similar to phosphoserine.
Interaction between two proteins often involves the binding of short segments of amino acids to peptide-binding domains, resembling protein-peptide interfaces. Receptor proteins involved in such protein-protein interactions (PPIs) often have the ability to recognize a certain set of overlapping but divergent ligand sequences, a property known as multispecificity1,2. Multispecific recognition is a feature of many cellular proteins, but it is particularly remarkable in enzymes and cell signaling proteins3. Proteins interacting with multispecific binding sites often have a combination of more and less conserved regions in their sequence4,5,6. In this scenario, the more conserved sequence motifs are involved in stringent molecular interactions. Conversely, the more variable sequences interact with somewhat permissive surfaces in the receptor binding site. Usually, these less conserved but still functionally relevant segments are loops that lack defined secondary structure patterns or adopt even more dynamic conformations, such as those typical of intrinsically disordered proteins7.
Identification of potential peptide ligands of binding sites is usually the first step in the design of mediators able to interfere with the corresponding PPIs8. However, in ligands of multispecific binding sites, most sequence positions are unlikely to show a single dominant amino acid residue. Instead, these sites may have particular preferences for a specific class of amino acids according to their chemical properties, e.g., acidic and negatively charged amino acids such as aspartate or glutamate, bulky aromatic amino acids such as phenylalanine, or more hydrophobic residues such as the aliphatic amino acids alanine, valine, leucine, or isoleucine3. Several experimental methods can provide insights into amino acid preferences of protein binding sites, including directed evolution9, multi-codon scanning mutagenesis10, and deep mutational scanning11. All of these methods follow the approach of sequence diversification, which is based on introducing mutations to original ligands and further analyzing their effect on the function of the receptor protein (see Bratulic and Badran12 for a comprehensive review). However, these methods often require the survey of large sequence libraries, which makes them cumbersome, costly, and time-consuming.
Computational methods to infer the amino acid preferences of multispecific binding sites have the potential to circumvent the limitations of wet lab methods. Among these, the in silico sequence diversification approach evaluates the energetic impact of a wide range of amino acid replacements in the ligand sequence as a way to characterize the structural plasticity of the PPI13. This method begins with the structure or model of the peptide ligand bound to the receptor binding site and subsequently introduces mutations to the ligand sequence. Statistical and energy-scoring functions are then used to evaluate the impact of these mutations on stability and binding affinity. The set of best-scoring ligand sequences resulting from the evaluation phase can then be used to compute the amino acid preferences. This strategy has the potential to process a very high number of ligand sequences in an efficient manner. Therefore, it can provide a more complete and consistent inference of amino acid preferences compared to those computed from the more limited number of sequences that can usually be processed in wet lab approaches.
The Pepspec application of the Rosetta molecular modeling suite14 is a tool that performs sequence diversification as a key step of its peptide design mode. This application requires a structure or model of the receptor protein with a bound peptide down to a single amino acid residue in length, which is used as an anchor for the next steps. The sequence of the bound peptide is then extended (if necessary) and diversified to generate a large number of putative peptide ligands. The binding affinity of these peptides is then evaluated by flexible-backbone peptide docking in order to select those with the best predicted binding scores. Although the main output of this application is the best peptide candidates selected at the end of the design phase, the much larger set of peptides accepted during this phase can also be used to compute the amino acid preferences of the target binding site. Amino acid preferences are computed as the frequency of each amino acid residue per position of the ligand sequence represented either as a position weight matrix (PWM) or as a more visual sequence logo.
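The per-position frequency calculation described above can be illustrated with a minimal Python sketch. It is not the script distributed with Rosetta: the function name `position_frequencies` and the toy peptides are hypothetical, and the sketch assumes that all accepted peptides have the same length and use the 20 standard one-letter amino acid codes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def position_frequencies(peptides):
    """Return one dict per ligand position mapping amino acid -> relative frequency."""
    if not peptides:
        raise ValueError("at least one peptide is required")
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    matrix = []
    for i in range(length):
        # Count how often each residue occurs at position i across all peptides.
        counts = Counter(p[i] for p in peptides)
        matrix.append({aa: counts.get(aa, 0) / len(peptides) for aa in AMINO_ACIDS})
    return matrix


# Hypothetical toy peptides (illustration only, not Pepspec output).
toy_peptides = ["ELDIS", "ELEID", "DLDIS", "ELSID"]
freqs = position_frequencies(toy_peptides)
for pos, column in enumerate(freqs, start=1):
    top = max(column, key=column.get)
    print(pos, top, column[top])
```

Each element of `freqs` corresponds to one column of a frequency matrix; a sequence logo is a graphical rendering of the same columns.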
In this article, we describe a protocol to estimate the amino acid preferences of the binding surface of a receptor protein involved in a PPI. The protocol is focused on PPIs in which a linear segment of the ligand protein is known to bind to the receptor protein, so the scenario can be modeled as a protein-peptide interface. In this scenario, conserved motifs from the ligand typically interact with defined pockets in the receptor binding site, although the entire ligand segment involved in the PPI may contain less conserved regions. A flowchart summarizing the major steps of the protocol is shown in Figure 1. The protocol starts with the 3D structure of the protein-protein complex and then reduces the ligand protein to the potential best-interacting segment, leaving the receptor protein intact. The best-interacting segment is inferred by using the BUDE Alanine Scan server15, which conducts computational alanine scanning mutagenesis to identify hot-spot residues between the two interacting proteins. In this approach, residues from the ligand are individually replaced by alanine, and the estimated change in free energy or stability of the complex (ΔΔG) is then used to infer the relevance of the corresponding residue for the target PPI. Once the best-interacting segment is inferred, its complex with the receptor protein is used as the base structure submitted to Pepspec to perform sequence diversification.
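As a rough illustration of how per-residue ΔΔG values could be turned into a candidate segment, the following Python sketch flags residues whose alanine replacement is predicted to be destabilizing and reports the span they cover. Everything here is an assumption made for illustration: the function name, the threshold, the rule of taking the span from the first to the last flagged residue, the convention that positive ΔΔG means destabilization, and the toy values. It does not reproduce the criterion used by the BUDE Alanine Scan server or by this protocol.

```python
def hot_spot_segment(ddg_by_residue, threshold=1.0):
    """Return (first, last) residue numbers spanning all residues with ddG >= threshold.

    ddg_by_residue maps ligand residue number -> predicted ddG upon alanine
    replacement (assumed convention: positive = destabilizing).
    Returns None when no residue reaches the threshold.
    """
    hot = sorted(res for res, ddg in ddg_by_residue.items() if ddg >= threshold)
    if not hot:
        return None
    return hot[0], hot[-1]


# Hypothetical toy values with arbitrary residue numbering (not IRF5 data).
toy_ddg = {1: 0.1, 2: 0.0, 3: 1.4, 4: 0.3, 5: 2.1, 6: 0.2, 7: 1.2, 8: 0.0}
print(hot_spot_segment(toy_ddg))
```

In practice the segment boundaries would be chosen by inspecting the server results together with the structure, so a helper like this is only a starting point.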
Figure 1: Overview of the main steps of the protocol proposed in this work. Numbers match step numbers in the protocol section. Figures were made with the protein-protein complex used as the example described in the text. In this complex, the protein chain considered as the receptor is shown in pink, while the chain considered as the ligand is shown in light blue with its predicted best-interacting segment highlighted in red.
One of the limitations of the suggested protocol is the requirement for a resolved structure of the protein-peptide interface. The protocol may alternatively begin with a model of the target protein-peptide interface, although the specific modeling steps are not described herein. Moreover, although the protocol can be conducted on a personal computer running any operating system, a Linux environment is required for the steps involving the Rosetta applications. A computer cluster is also highly recommended for the sequence diversification step due to the large number of iterations typically performed by Pepspec.
Application of the suggested protocol is illustrated with the estimation of amino acid preferences of the binding surface of IRF5, a member of the human interferon regulatory factor (IRF) family. We chose this protein as an example because, during its activation, two subunits bind to form a dimer whose structure is well characterized16. In IRF dimers, binding can be modeled as a protein-peptide interface in which one subunit provides the binding surface and the other one interacts through a region containing a short conserved motif called pLxIS17,18. In addition, binding to IRF subunits is multispecific; therefore, they can form homodimers, heterodimers, and complexes with other cellular proteins known as coactivators18.
1. Initial preparation of the protein-peptide interface
Figure 2: The Protein Data Bank (PDB) page for the structure used as a representative example in this work. (A) Search box to introduce the PDB accession code of the target structure. (B) Menu to download the structure in several formats. (C) Options to download biological assemblies when the structure has been saved as an asymmetric unit (see step 1.1.2 for more details).
Figure 3: Selection of receptor and ligand in the BUDE Alanine Scan server. (A) Graphic representation of the protein-protein complex. (B) Text box to enter the name of the job in the server. (C) Panel to interactively select the chains that will be considered as receptor and ligand (see step 1.2 for more details).
Figure 4: Results page of the BUDE Alanine Scan server. The potential best-interacting segment in the ligand sequence is indicated with a red box. In the left panel, the residue with the highest predicted energy contribution (Leu433) is highlighted in green.
2. Sequence diversification
NOTE: In the following steps, rosetta_main refers to the main Rosetta installation directory, which is typically located at /opt/rosetta_src_<version>_bundle/main/, where <version> indicates the installed Rosetta version. Also, it is assumed that Rosetta applications are accessible system-wide; if this is not the case, the full path to the executables has to be provided. When compiled from the source, these executables are located in the /rosetta_main/source/bin/ directory.
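The choice between a system-wide executable and a full path can be sketched in Python as shown below. The helper name `resolve_executable`, the executable file name, and the installation path used in the example call are hypothetical; actual executable names depend on the Rosetta build and should be checked in the local source/bin directory.

```python
import shutil
from pathlib import Path


def resolve_executable(name, rosetta_main=None):
    """Return a usable path for a Rosetta executable, or None if it cannot be found.

    First looks for `name` on the system PATH; if it is not there, falls back
    to <rosetta_main>/source/bin/<name>.
    """
    found = shutil.which(name)
    if found:
        return found
    if rosetta_main is not None:
        candidate = Path(rosetta_main) / "source" / "bin" / name
        if candidate.is_file():
            return str(candidate)
    return None


# Hypothetical executable name and installation directory (assumptions).
print(resolve_executable("pepspec.linuxgccrelease", "/opt/rosetta/main"))
```

A return value of None indicates that neither location contains the executable, in which case the installation path or the executable name should be reviewed.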
3. Estimation of amino acid preferences
In this article, we described a protocol to predict the amino acid preferences of the binding surface of IRF5, a member of a family of transcription factors known as human interferon regulatory factors. These proteins are regulators of innate and adaptive immune responses and participate in the differentiation and activation of several immune cells. IRF subunits have highly plastic and multispecific binding surfaces, being capable of forming homodimers, heterodimers, and complexes with other cellular proteins17,18. Dimerization is thought to be the first step in the activation of these factors, and in most family members, it is triggered by the phosphorylation of multiple serine/threonine residues18. During dimerization, each monomer interacts with the binding surface of the other monomer via a highly conserved motif called pLxIS, located towards the C-terminal region of its sequence. The pLxIS abbreviation partially represents the amino acid preferences of the binding surface, which sequentially recognizes a polar amino acid ('p'), followed by two positions with a high frequency of leucine ('L') and isoleucine ('I'), separated by a position occupied by any amino acid ('x') and followed by a phosphorylatable serine residue (Ser436 in this example). Phosphorylation of several serine residues, including that of the pLxIS motif, promotes the bending of the C-terminal segment of one monomer and its interaction with the binding surface of the other monomer19,22.
The protocol described here started with a 3D structure of the IRF5 dimer19, in which one of the monomers was arbitrarily considered as the receptor in the PPI, while the other one was considered as the ligand containing the pLxIS motif. To better define the segment of the ligand interacting with the receptor binding site, we conducted computational alanine scanning mutagenesis (step 1.2). The predicted segment was comprised of 13 amino acid residues from positions 424 to 436, with the pLxIS motif starting at Arg432. The structure of the original dimer was then reduced to a peptide-protein complex in which the sequence of the monomer considered as ligand was cropped to the predicted best-interacting segment, whereas the other monomer was left intact (step 1.3). This structure was then used as input for the sequence diversification strategy (section 2), designating the leucine residue of the pLxIS motif (Leu433) as the anchor required by Pepspec. This process resulted in over 26,000 potential peptide ligands. The top 20% potential ligands with the best energy scores (5,280) were used to estimate the amino acid preferences of the binding surface in the form of a PWM (Figure 5A) and a sequence logo (Figure 5B) (section 3).
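The selection of the best-scoring 20% of peptides can be sketched as follows. This is an illustration under stated assumptions: peptides are represented as (sequence, score) tuples, lower energy scores are treated as better (the usual Rosetta convention), and the helper name `top_percent` is hypothetical. Parsing of the actual Pepspec output files is not shown. As a consistency check, the 5,280 peptides reported above would correspond to 20% of a set of 26,400.

```python
def top_percent(scored_peptides, percent=20):
    """Return the best-scoring `percent` of (sequence, score) tuples.

    Lower scores are assumed to be better. Integer arithmetic is used so
    that the number of kept peptides is not affected by float rounding.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1-100")
    ranked = sorted(scored_peptides, key=lambda item: item[1])
    n_keep = len(ranked) * percent // 100
    return ranked[:n_keep]


# Hypothetical toy data: ten peptides with made-up scores.
toy = [("PEPTIDE%02d" % i, -10.0 + i) for i in range(10)]
best = top_percent(toy, percent=20)
print([seq for seq, _ in best])
```

The retained sequences are the input for the frequency calculation that yields the PWM and the sequence logo.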
Figure 5: Amino acid preferences of the binding surface of IRF5. (A) PWM indicating the frequency of each amino acid residue (rows) per position in the peptide ligand sequence (columns). (B) Sequence logo visually representing the corresponding amino acid frequencies. Positions of the original IRF5 sequence are shown in parentheses below each column of the sequence logo.
In the PWM, each row corresponds to a specific amino acid residue, while each column represents a position in the sequence. Each cell of the matrix contains the relative frequency of the corresponding amino acid at that position, weighted by the overall background frequencies. Sequence logos are constructed by stacking the letters of amino acids so that the total height of the stack at each position indicates the conservation of the overall sequence at that position. In turn, the height of the individual letters within the stack indicates the frequency of the corresponding amino acid. In this example, both the PWM and the sequence logo are consistent with the previous knowledge regarding the binding surface of IRF5, with a higher preference for a polar amino acid (glutamate) at position 432 ('p') and a very high preference for leucine and isoleucine at positions 433 and 435, respectively. Remarkably, positions 427, 429, and 436 were all predicted to have a higher preference for aspartate despite being occupied by serine in the original IRF5 sequence. This finding supports the importance of phosphorylation of these positions for the formation of the IRF5 dimer since the negative charge in the side chains of aspartate and glutamate resembles that of phosphoserine. In fact, a previous study reported that a decoy peptide called IRF5D, in which these serine residues were replaced by aspartate, was able to inhibit IRF5 activity23. Conversely, position 425 was predicted to have a very high preference for serine, suggesting that the serine residue in this position may participate in the PPI in its unphosphorylated form. Indeed, it has been previously reported for other IRFs that phosphorylation of the equivalent serine residue negatively affects dimerization and binding to other coactivators16,24.
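The relationship between stack height and letter height in a sequence logo can be made concrete with a short Python sketch. It uses the standard information-content definition for a 20-letter alphabet, in which the stack height is log2(20) minus the Shannon entropy of the column and each letter is scaled by its frequency. The sketch assumes a uniform background and omits the small-sample correction that logo tools may apply; the function name `logo_heights` and the toy columns are hypothetical.

```python
import math


def logo_heights(column):
    """Return (stack_height_bits, {amino_acid: letter_height_bits}) for one position.

    `column` maps amino acid -> relative frequency and should sum to 1.
    """
    # Shannon entropy of the column; zero-frequency residues contribute nothing.
    entropy = -sum(f * math.log2(f) for f in column.values() if f > 0)
    # Maximum information for 20 amino acids minus the observed uncertainty.
    stack_height = math.log2(20) - entropy
    letters = {aa: f * stack_height for aa, f in column.items() if f > 0}
    return stack_height, letters


# Hypothetical toy columns: one fully conserved, one split between two residues.
conserved = {"L": 1.0}
mixed = {"D": 0.5, "E": 0.5}
for name, col in (("conserved", conserved), ("mixed", mixed)):
    height, letters = logo_heights(col)
    print(name, round(height, 3), {aa: round(h, 3) for aa, h in letters.items()})
```

A fully conserved position reaches the maximum stack height of log2(20) bits, whereas a position shared equally by two residues loses exactly one bit, with each letter taking half of the remaining height.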
The present article describes a protocol to estimate the amino acid preferences of potentially multispecific binding sites based on in silico sequence diversification. Few computational tools have been developed to estimate amino acid preferences of protein-peptide interfaces14,25,26. These tools have a predictive nature, but they differ in the computational algorithms used to perform their predictions and the corrections they implement to improve accuracy. In this work, we used the Pepspec application of the Rosetta molecular modeling suite14. While primarily oriented to peptide design, this application implements a sequence diversification algorithm that can be used to predict amino acid preferences. To the best of our knowledge, this tool is the only one currently available that provides a built-in script to compute a PWM directly from the sequence diversification results. It is important to note that the protocol is focused on PPIs; therefore, the initial structure is expected to be a complex of two protein subunits. Before the actual sequence diversification step, the protein considered as the ligand is cropped to the segment expected to interact with the receptor protein, and it is further treated as a peptide. However, the protocol can also be applied to protein-peptide complexes, a scenario in which steps 1.1-1.3 may not be required. During the preparation step (section 1), it is also essential to correct wrongly formatted residues and heteroatoms, as well as to model segments of the complex structure relevant for the target binding site that could not be properly resolved. These corrections depend on the specific structure under study and were not required for the structure used as an example herein.
The most critical steps of this protocol are those performed with the Rosetta applications, which include an initial repack of side chains with FixBB (step 2.1) and the actual sequence diversification with Pepspec (step 2.2). This initial repack step, called pre-packing, is explicitly stated as a requirement by the Pepspec authors14. Although it can be performed by Pepspec, the authors of this application highly recommend using the FixBB application, which was specifically designed to optimize side chain rotamers in fixed protein backbones. In the sequence diversification step, it is important to consider that the Pepspec application is oriented to peptide design. Consequently, it reports a few best-scoring peptide candidates by default. Since the goal of the protocol presented here is to generate a large number of putative peptide ligands rather than a few best-scoring candidates, we changed the "-pepspec:n_peptides" option from 8 (its default) to 200 (step 2.2.1). Using this setting, Pepspec predicted more than 20,000 peptides as potential ligands. This set of putative peptides provided a very broad view of the binding landscape of the receptor, from which the top 20% of best-scoring peptides were selected for the actual estimation of amino acid preferences. If a lower number of peptides is passed to "-pepspec:n_peptides", significantly fewer candidates will be accepted by Pepspec. Under this scenario, the sampling proposed in the protocol may capture many putative peptide ligands with suboptimal energy scores, potentially resulting in less robust estimations.
One of the main limitations of the protocol presented in this work is that it relies on prior knowledge of the structure of the protein containing the binding surface. However, this structure does not necessarily have to be determined experimentally but can be modeled ab initio or by homology modeling14. In addition, it is also necessary to know the binding mode of at least one amino acid residue (anchor) of the peptide ligand. This anchor will be extended to a particular number of residues via specific anchor extension options from Pepspec to perform the sequence diversification. If the orientation of the entire ligand in the binding site is known, as is the case for the representative example of this study, options related to anchor extension should be left as default (no extension), although a residue from the peptide must still be specified as an anchor to guide the sequence diversification algorithm. The Pepspec application does not support de novo docking of a possible anchor residue, but it can use as input the output of other docking applications or a model of a homologous protein-peptide complex to perform anchor docking14, although these scenarios are beyond the scope of this article.
An important disadvantage of the suggested protocol is its inherent predictive nature, which is directly affected by the resolution and accuracy of the initial structure or model of the protein-protein complex. However, Pepspec authors have stated that the accuracy of this application was significantly improved by treating the input backbone coordinates as an ensemble of structures rather than using a single protein structure and applying background normalization when computing the PWM14. Furthermore, the protocol is an alternative to the cumbersome and costly experimental methods for the estimation of amino acid preferences. All of these experimental methods rely on evaluating large sequence libraries obtained by introducing mutations to the sequence of protein ligands, followed by experimental evaluation of the impact of such mutations (see Bratulic and Badran12 for a review). Computational protocols such as the one proposed in this work allow the screening of thousands of putative peptide ligands in a very efficient manner, potentially providing a more robust set for the estimation of amino acid preferences13,14,25. Our proposed protocol can be applied to any PPI that could be reduced to a protein-peptide interface. Additionally, this protocol may serve as an initial strategy to identify mediators of PPIs, such as potential activators or inhibitors. The identified mediators can be further used to study these PPIs in the laboratory, or they can be evaluated as potential therapeutic agents.
The authors have nothing to disclose.
Financial support from Sistema Nacional de Investigación (SNI) (grant numbers SNI-043-2023 and SNI-170-2021), Secretaría Nacional de Ciencia, Tecnología e Innovación (SENACYT) of Panama, and Instituto para la Formación y Aprovechamiento de Recursos Humanos (IFARHU) is gratefully acknowledged. The authors would like to thank Dr. Miguel Rodríguez for carefully reviewing the manuscript.
BUDE Alanine Scan Server | University of Edinburgh | https://pragmaticproteindesign.bio.ed.ac.uk/balas/ | doi: 10.1021/acschembio.9b00560 |
Rosetta Modeling Software | Rosetta Commons | https://www.rosettacommons.org/software | doi: 10.1002/prot.22851 |
UCSF Chimera | University of California San Francisco | https://www.cgl.ucsf.edu/chimera/ | doi: 10.1002/jcc.20084 |