THE CONSURF DATABASE
SERVER FOR THE IDENTIFICATION OF FUNCTIONAL REGIONS IN PROTEINS
Overview
Much insight can be obtained from analyzing a query protein within its evolutionary context. Evolutionary data, inferred from the protein’s family, i.e., its homologues, readily highlight conserved regions that are often important for catalysis, binding, and other biological functions. ConSurf-DB includes pre-calculated estimates of the evolutionary profiles of most proteins of known structure. To see these, the user needs only to provide the PDB identifier (or amino acid sequence) of the query protein. The results are displayed immediately in a convenient user interface.
The evolutionary conservation estimates are highly robust because particularly stringent thresholds were used in constructing the database. ConSurf-DB will be periodically updated to keep up with the rapid increase in sequence and structure data.
A detailed description of the construction of ConSurf-DB, together with a few examples, is available in a recent publication [1]. Briefly, ConSurf-DB was constructed using the fully automated four-step process shown in the flowchart (Figure 1): (a) downloading and parsing non-redundant PDB entries, (b) collecting sequence homologues and building the multiple sequence alignment (MSA), (c) calculating evolutionary rates, and (d) formatting the results for presentation on the ConSurf-DB website.
b. The second step is detecting homologues. Sequence homologues are searched for in UniRef90 [3], a clustered version of the UniProt database [4], using one iteration of HMMER [5] with an E-value threshold of 0.0001. The candidate homologues retrieved by HMMER for a given chain are further filtered by sequence identity (maximum 95%), sequence coverage (minimum 60%) and overlap among homologues (maximum 10%). After this filtering, chains with fewer than 50 homologues are eliminated. Next, CD-HIT [6] removes redundant homologues at a threshold of 95%. If more than 50 homologues remain after the CD-HIT step, they are sorted by E-value in ascending order, on the principle that the lower the E-value, the more significant the homologue. A maximum of 300 homologues are sampled evenly from the sorted list to create the final list of homologues of the query protein. Finally, a multiple sequence alignment (MSA) of the homologues is constructed using the MAFFT-LINSi procedure [7].
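The selection logic of step (b) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ConSurf-DB code: the `Hit` record, the function names and the index formula used for even sampling are hypothetical, and the overlap filter and the CD-HIT redundancy step are left out for brevity. Only the thresholds (E-value 0.0001, identity 95%, coverage 60%, 50 and 300 homologues) come from the text above.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MAX_EVALUE = 1e-4
MAX_IDENTITY = 0.95
MIN_COVERAGE = 0.60
MIN_HOMOLOGUES = 50
MAX_HOMOLOGUES = 300


@dataclass
class Hit:
    """Hypothetical record for one candidate homologue."""
    name: str
    evalue: float
    identity: float  # fraction of identical residues relative to the query, 0..1
    coverage: float  # fraction of the query covered by the hit, 0..1


def passes_filters(hit):
    """Apply the E-value, identity and coverage thresholds."""
    return (hit.evalue <= MAX_EVALUE
            and hit.identity <= MAX_IDENTITY
            and hit.coverage >= MIN_COVERAGE)


def sample_evenly(items, k):
    """Pick k items at evenly spaced positions (assumed sampling scheme)."""
    n = len(items)
    if n <= k:
        return list(items)
    return [items[i * n // k] for i in range(k)]


def select_homologues(hits):
    """Return the final homologue list, or None if the chain is eliminated."""
    kept = [h for h in hits if passes_filters(h)]
    if len(kept) < MIN_HOMOLOGUES:
        return None  # too few homologues: the chain is dropped
    kept.sort(key=lambda h: h.evalue)  # ascending: most significant first
    return sample_evenly(kept, MAX_HOMOLOGUES)
```

Because the sampling starts at index 0 of the sorted list, the most significant hit is always retained in this sketch, while the remaining picks are spread across the whole E-value range rather than taken only from the top.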
c. The third step is estimating the evolutionary rates. It begins by inferring the best amino acid substitution model based on the MSA [8]. Then, a phylogenetic tree is built from the MSA with the Neighbor-Joining method [9], as implemented in the Rate4Site program [10]. Next, Rate4Site assigns an evolutionary rate to each position in the query sequence, based on the phylogenetic tree and the substitution model, using an empirical Bayesian methodology [11]. Finally, the evolutionary rates are binned into discrete conservation grades from 1 to 9, where grade 1 marks the most variable positions, grade 5 marks positions of intermediate conservation, and grade 9 marks the most conserved positions. Positions whose grades are assigned with low confidence are treated as a separate, tenth category. The nine grades are then mapped to colors that reflect the level of conservation, which allows clear and intuitive detection of the conserved regions in the protein.
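To illustrate the direction of the mapping from rates to grades, here is a minimal Python sketch. The text above does not say how the bin boundaries are chosen, so the equal-width bins over the observed range of rates are purely an assumption made for illustration, and the function name is hypothetical. The sketch also does not model the separate low-confidence category. The one property taken from the text is that slowly evolving positions receive high grades (9 = most conserved) and rapidly evolving positions receive low grades (1 = most variable).

```python
def conservation_grades(rates):
    """Map per-position evolutionary rates to grades 1 (variable) .. 9 (conserved).

    Assumption: nine equal-width bins spanning the observed range of rates.
    """
    lowest, highest = min(rates), max(rates)
    span = highest - lowest
    grades = []
    for rate in rates:
        if span == 0:
            grades.append(5)  # all rates equal: call everything intermediate
            continue
        # 0.0 = slowest-evolving position, 1.0 = fastest-evolving position
        fraction = (rate - lowest) / span
        bin_index = min(int(fraction * 9), 8)  # 0..8, top edge folded into bin 8
        grades.append(9 - bin_index)  # slow rate -> high grade
    return grades


print(conservation_grades([0.0, 4.5, 9.0]))  # [9, 5, 1]
```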
d. The fourth and final step is formatting and visualizing the results. The conservation grades are mapped onto the three-dimensional structure of the query protein, which can be viewed using the NGL viewer [12] or FirstGlance in Jmol [13]. The colors are also projected onto the query sequence and the MSA. In addition, session files presenting the protein structure colored by conservation grade are created for the PyMOL [14] and Chimera [15] programs. All visual results are available in two color scales: the default scale, which runs from turquoise to maroon, and a color-blind friendly scale, which runs from green to purple. Both scales run from variable (grade 1) to conserved (grade 9). Positions with low reliability according to the confidence interval are colored light yellow in both color scales.
Figure 1: A flowchart of the pipeline used to construct ConSurf-DB. The pipeline consists of four steps: retrieving PDB entries, detecting homologues and building the MSA, estimating evolutionary conservation, and formatting the results.
Table 1 details the ConSurf-DB statistics, including the number of starting proteins, the numbers of proteins filtered using various criteria, and the total number of proteins with available evolutionary profiles.
| PDB chains | Count | MSA sizes | Count |
|---|---|---|---|
| Total chains found | 473,197 | Chains with fewer than 50 homologues | 7,363 |
| Total non-redundant chains found | 108,958 | MSAs created | |
| Filtered | | Chains with 50-100 homologues | 3,238 |
| Chains shorter than 30 amino acids | 7,054 | Chains with 101-200 homologues | 4,978 |
| Chains with large structures | 4,629 | Chains with 201-300 homologues | 81,486 |
| Chains with more than 15% modified residues | 210 | Total chains processed | 89,702 |
| Total chains post initial filtration | 389,863 | | |
| Total non-redundant chains post initial filtration | 97,065 | | |
Table 1. Statistics of ConSurf-DB.
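As a quick consistency check on the MSA-size figures in Table 1, the short Python snippet below confirms that the three size classes add up to the total number of chains processed, and that the same total is obtained by subtracting the chains with too few homologues from the non-redundant chains that passed the initial filtration. The variable names are illustrative; the numbers are copied from the table.

```python
# Counts copied from Table 1.
msa_size_classes = {
    "50-100 homologues": 3238,
    "101-200 homologues": 4978,
    "201-300 homologues": 81486,
}
nonredundant_post_filtration = 97065
fewer_than_50_homologues = 7363

# Sum of the three MSA size classes.
total_processed = sum(msa_size_classes.values())
print(total_processed)  # 89702

# Non-redundant chains minus those dropped for having too few homologues.
remaining_after_homologue_cutoff = nonredundant_post_filtration - fewer_than_50_homologues
print(remaining_after_homologue_cutoff)  # 89702
```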
References
1. Ben Chorin A., Masrati G., Kessel A., Narunsky A., Sprinzak J., Lahav S., Ashkenazy H. and Ben-Tal N. (2020).
ConSurf-DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins.
Protein Science 29:258–267.
2. Wang G. and Dunbrack R.L. (2005).
PISCES: recent improvements to a PDB sequence culling server.
Nucleic Acids Research, 33:W94-W98; PMID: 15980589.
3. Suzek B.E., Wang Y., Huang H., McGarvey P.B., Wu C.H. and UniProt Consortium. (2015).
UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.
Bioinformatics, 31:926-932; PMID: 25398609.
4. UniProt Consortium. (2019).
UniProt: a worldwide hub of protein knowledge.
Nucleic Acids Research, 47:D506-D515; PMID: 30395287.
5. Mistry J., Finn R.D., Eddy S.R., Bateman A. and Punta M. (2013).
Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions.
Nucleic Acids Research, 41:e121; PMID: 23598997.
6. Fu L., Niu B., Zhu Z., Wu S. and Li W. (2012).
CD-HIT: accelerated for clustering the next-generation sequencing data.
Bioinformatics, 28:3150-3152; PMID: 23060610.
7. Katoh K. and Standley D.M. (2013).
MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
Molecular Biology and Evolution, 30:772–780; PMID: 23329690.
8. Darriba D., Taboada G.L., Doallo R. and Posada D. (2011).
ProtTest 3: fast selection of best-fit models of protein evolution.
Bioinformatics, 27:1164-1165; PMID: 21335321.
9. Saitou N. and Nei M. (1987).
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
Molecular Biology and Evolution, 4:406-425; PMID: 3447015.
10. Pupko T., Bell R.E., Mayrose I., Glaser F. and Ben-Tal N. (2002).
Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues.
Bioinformatics, 18:S71-S77; PMID: 12169533.
11. Mayrose I., Graur D., Ben-Tal N. and Pupko T. (2004).
Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior.
Molecular Biology and Evolution, 21:1781-1791; PMID: 15201400.
12. Rose A.S., Bradley A.R., Valasatava Y., Duarte J.M., Prlic A. and Rose P.W. (2018).
NGL viewer: web-based molecular graphics for large complexes.
Bioinformatics, 34:3755-3758; PMID: 29850778.
13. Martz E. (2005).
FirstGlance in Jmol. firstglance.jmol.org
14. Schrödinger L.L.C. (2015).
The PyMOL Molecular Graphics System, Version 2.3.3.
15. Pettersen E.F., Goddard T.D., Huang C.C., Couch G.S., Greenblatt D.M., Meng E.C. and Ferrin T.E. (2004).
UCSF Chimera--a visualization system for exploratory research and analysis.
Journal of Computational Chemistry, 25:1605-1612; PMID: 15264254.