Overview

Much insight can be obtained from analyzing a query protein within its evolutionary context. Evolutionary data, inferred from the protein’s family, i.e., its homologues, readily highlight conserved regions that are often important for catalysis, binding, and other biological functions. ConSurf-DB includes pre-calculated estimates of the evolutionary profiles of most proteins of known structure. To see these, the user needs only to provide the PDB identifier (or amino acid sequence) of the query protein. The results are displayed immediately in a convenient user interface.

The evolutionary conservation estimates are highly robust because particularly stringent thresholds were used in constructing the database. ConSurf-DB will be periodically updated to keep up with the rapid increase in sequence and structure data.

A detailed description of the construction of ConSurf-DB, as well as a few examples, is available in a recent publication [1]. Briefly, ConSurf-DB was constructed using the fully automated four-step process shown in the flowchart (Figure 1): (a) downloading and parsing non-redundant PDB entries, (b) collecting sequence homologues and building the multiple sequence alignment (MSA), (c) calculating evolutionary rates, and (d) formatting the results for presentation on the ConSurf-DB website.

  1. The first step in building ConSurf-DB is retrieving the PDB entries. Each PDB entry can contain one or more protein chains, which are handled separately in ConSurf-DB. The chains are extracted from a PISCES file [2], which contains all non-redundant (unique) chains in the PDB. After extracting the unique chains, their sequences, and their identical chains from the file, the unique chains are filtered according to several criteria (Table 1): chains shorter than 30 amino acids, chains with large structures, and chains with more than 15% modified residues are removed. Finally, each unique chain is associated with its sequence and identical chains, so that the calculations for a unique chain can be easily mapped to the structures of all its identical chains.

  2. The second step is detecting homologues. Sequence homologues are searched for in UniRef90 [3], a clustered version of the UniProt database [4], using one iteration of HMMER [5] with an E-value threshold of 0.0001. The candidate homologues retrieved by HMMER for a given chain are further filtered by sequence identity (maximum 95%), sequence coverage (minimum 60%), and overlap among homologues (maximum 10%). After this filtering, chains with fewer than 50 homologues are eliminated. Next, CD-HIT [6] removes any redundant homologues with a threshold of 95%. If more than 50 homologues remain after the CD-HIT step, they are sorted by E-value in ascending order, on the principle that the lower the E-value, the more significant the homologue. A maximum of 300 homologues are then sampled evenly from the sorted list to create the final list of homologues of the query protein. Finally, a multiple sequence alignment (MSA) of the homologues is constructed using the MAFFT-LINSi procedure [7].

  3. The third step is estimating the evolutionary rates. It begins by inferring the best amino acid substitution model based on the MSA [8]. Then, a phylogenetic tree is built from the MSA with the Neighbor-Joining method [9], as implemented in the Rate4Site program [10]. Next, Rate4Site assigns an evolutionary rate to each position in the query sequence, based on the phylogenetic tree and the substitution model, using an empirical Bayesian methodology [11]. Finally, the evolutionary rates are categorized into discrete conservation grades ranging from 1 to 9, where grade 1 marks the most variable positions, grade 5 marks positions of intermediate conservation, and grade 9 marks the most conserved positions. Positions whose grades are assigned with low confidence are treated as a separate, tenth category. The nine grades are then mapped to colors that reflect the level of conservation, which allows clear and intuitive detection of the conserved regions in the protein.

  4. The fourth and final step is formatting and visualizing the results. The conservation grades are mapped on the three-dimensional structure of the query protein, which can be viewed using the NGL viewer [12] or FirstGlance in Jmol [13]. The colors are also projected on the query sequence and on the MSA. Moreover, session files, presenting the protein structure colored by the conservation grades, are created using the PyMOL [14] and Chimera [15] programs. All visual results are available in two color scales: the default color scale, which is turquoise-through-maroon, and the color-blind friendly color scale, which is green-through-purple. These color scales correspond to variable (grade 1)-through-conserved (grade 9). Positions with low reliability according to the confidence interval are colored in light yellow in both color scales.
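The homologue selection in step 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ConSurf-DB implementation: the `Hit` record, its field names, and the function names are hypothetical, the overlap filter and the CD-HIT redundancy removal are omitted, and the even sampling is done by picking equally spaced positions in the sorted list, which is only one plausible reading of "sampled evenly".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate homologue (hypothetical record; field names are illustrative)."""
    name: str
    evalue: float
    identity: float  # percent identity to the query
    coverage: float  # percent of the query covered by the hit


def passes_filters(hit, max_evalue=1e-4, max_identity=95.0, min_coverage=60.0):
    """Apply the E-value, identity and coverage thresholds quoted in step 2."""
    return (hit.evalue <= max_evalue
            and hit.identity <= max_identity
            and hit.coverage >= min_coverage)


def sample_evenly(hits, max_hits=300):
    """Sort by ascending E-value and keep at most max_hits evenly spaced entries."""
    ranked = sorted(hits, key=lambda h: h.evalue)
    n = len(ranked)
    if n <= max_hits:
        return ranked
    if max_hits == 1:
        return ranked[:1]
    # Equally spaced indices running from the best hit (0) to the worst (n - 1).
    indices = [i * (n - 1) // (max_hits - 1) for i in range(max_hits)]
    return [ranked[i] for i in indices]


def select_homologues(hits, min_hits=50, max_hits=300):
    """Return the final homologue list, or None if the chain is eliminated."""
    kept = [h for h in hits if passes_filters(h)]
    if len(kept) < min_hits:
        return None  # chains with fewer than 50 homologues are dropped
    return sample_evenly(kept, max_hits)
```

Sampling across the whole sorted list, rather than taking the 300 best hits, keeps both close and distant homologues in the alignment.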

ConSurf-DB Flowchart

Figure 1. A flowchart of the pipeline used to construct ConSurf-DB. The pipeline consists of four steps: retrieving PDB entries, detecting homologues and building the MSA, estimating evolutionary conservation, and formatting the results.
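The discretization in step 3 of the pipeline, from continuous evolutionary rates to grades 1 through 9, can be illustrated with the short Python sketch below. The text does not state how the grade boundaries are chosen, so the equal-width binning used here is an assumption made purely for illustration, as are the function name and the use of `None` for the separate low-confidence category. The only properties taken from the description are that slowly evolving positions receive high grades and that low-confidence positions are set apart.

```python
def rates_to_grades(rates, low_confidence=None, n_grades=9):
    """Map per-position rates to grades 1 (variable) .. n_grades (conserved).

    Equal-width bins over the observed rate range are an illustrative
    assumption. Positions flagged as low confidence get None, standing in
    for the separate tenth category.
    """
    lowest, highest = min(rates), max(rates)
    width = (highest - lowest) / n_grades
    flags = low_confidence or [False] * len(rates)
    grades = []
    for rate, uncertain in zip(rates, flags):
        if uncertain:
            grades.append(None)
        elif width == 0:
            # All rates identical: fall back to the intermediate grade.
            grades.append((n_grades + 1) // 2)
        else:
            # Bin 0 holds the slowest rates; clamp the maximum into the last bin.
            bin_index = min(int((rate - lowest) / width), n_grades - 1)
            grades.append(n_grades - bin_index)  # slow rate -> high grade
    return grades


print(rates_to_grades([0.0, 4.5, 9.0]))                         # [9, 5, 1]
print(rates_to_grades([0.0, 4.5, 9.0], [False, True, False]))   # [9, None, 1]
```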


Table 1 details the ConSurf-DB statistics, including the number of starting proteins, the numbers of proteins filtered using various criteria, and the total number of proteins with available evolutionary profiles.

PDB chains
  Total chains found                                   473,197
  Total non-redundant chains found                     108,958
  Filtered
    Chains shorter than 30 amino acids                   7,054
    Chains with large structures                         4,629
    Chains with more than 15% modified residues            210
  Total chains post initial filtration                 389,863
  Total non-redundant chains post initial filtration    97,065

MSA sizes
  Chains with fewer than 50 homologues                   7,363
  MSAs created
    Chains with 50-100 homologues                        3,238
    Chains with 101-200 homologues                       4,978
    Chains with 201-300 homologues                      81,486
  Total chains processed                                89,702

Table 1. Statistics of ConSurf-DB.
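The counts in Table 1 can be cross-checked with simple arithmetic. The sketch below assumes, based only on the fact that the numbers add up, that the three filtering criteria and the homologue threshold are applied to the non-redundant chains; the variable names are illustrative.

```python
# Counts copied from Table 1.
total_non_redundant = 108_958
filtered = {
    "shorter than 30 amino acids": 7_054,
    "large structures": 4_629,
    "more than 15% modified residues": 210,
}
too_few_homologues = 7_363
msa_size_bins = {"50-100": 3_238, "101-200": 4_978, "201-300": 81_486}

# Non-redundant chains remaining after the initial filtration.
after_initial_filtration = total_non_redundant - sum(filtered.values())

# Chains that also pass the 50-homologue threshold and receive an MSA.
processed = after_initial_filtration - too_few_homologues

print(after_initial_filtration)      # 97065
print(processed)                     # 89702
print(sum(msa_size_bins.values()))   # 89702
```

Both routes to the number of processed chains agree: subtracting the chains with too few homologues from the filtered non-redundant set gives the same total as summing the three MSA size bins.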



References:

  1. Ben Chorin A., Masrati G., Kessel A., Narunsky A., Sprinzak J., Lahav S., Ashkenazy H. and Ben-Tal N. (2020).
    ConSurf-DB: accurate estimate of the evolutionary conservation pattern for 83% of PDB proteins.
    Protein Science (in review).

  2. Wang G. and Dunbrack R.L. (2005).
    PISCES: recent improvements to a PDB sequence culling server.
    Nucleic Acids Research, 33:W94-W98; PMID: 15980589.

  3. Suzek B.E., Wang Y., Huang H., McGarvey P.B., Wu C.H. and UniProt Consortium. (2015).
    UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.
    Bioinformatics, 31:926-932; PMID: 25398609.

  4. UniProt Consortium. (2019).
    UniProt: a worldwide hub of protein knowledge.
    Nucleic Acids Research, 47:D506-D515; PMID: 30395287.

  5. Mistry J., Finn R.D., Eddy S.R., Bateman A. and Punta M. (2013).
    Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions.
    Nucleic Acids Research, 41:e121; PMID: 23598997.

  6. Fu L., Niu B., Zhu Z., Wu S. and Li W. (2012).
    CD-HIT: accelerated for clustering the next-generation sequencing data.
    Bioinformatics, 28:3150-3152; PMID: 23060610.

  7. Katoh K. and Standley D.M. (2013).
    MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
    Molecular Biology and Evolution, 30:772-780; PMID: 23329690.

  8. Darriba D., Taboada G.L., Doallo R. and Posada D. (2011).
    ProtTest 3: fast selection of best-fit models of protein evolution.
    Bioinformatics, 27:1164-1165; PMID: 21335321.

  9. Saitou N. and Nei M. (1987).
    The neighbor-joining method: a new method for reconstructing phylogenetic trees.
    Molecular Biology and Evolution, 4:406-425; PMID: 3447015.

  10. Pupko T., Bell R.E., Mayrose I., Glaser F. and Ben-Tal N. (2002).
    Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues.
    Bioinformatics, 18:S71-S77; PMID: 12169533.

  11. Mayrose I., Graur D., Ben-Tal N. and Pupko T. (2004).
    Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior.
    Molecular Biology and Evolution, 21:1781-1791; PMID: 15201400.

  12. Rose A.S., Bradley A.R., Valasatava Y., Duarte J.M., Prlic A. and Rose P.W. (2018).
    NGL viewer: web-based molecular graphics for large complexes.
    Bioinformatics, 34:3755-3758; PMID: 29850778.

  13. Martz E. (2005).
    FirstGlance in Jmol.
    firstglance.jmol.org

  14. Schrödinger L.L.C. (2015).
    The PyMOL Molecular Graphics System, Version 2.3.3.

  15. Pettersen E.F., Goddard T.D., Huang C.C., Couch G.S., Greenblatt D.M., Meng E.C. and Ferrin T.E. (2004).
    UCSF Chimera--a visualization system for exploratory research and analysis.
    Journal of Computational Chemistry, 25:1605-1612; PMID: 15264254.