
Overview

ConSurf-DB is a repository of evolutionary conservation analyses for proteins of known structure in the Protein Data Bank (PDB). Sequence homologues of each PDB entry were collected and aligned using standard methods. The evolutionary conservation of each amino acid position in the alignment was calculated using the Rate4Site algorithm, implemented in the ConSurf web-server. The algorithm explicitly takes into account the phylogenetic relations between the aligned proteins and the stochastic nature of the evolutionary process. Rate4Site assigns a conservation level to each position in the multiple sequence alignment using empirical Bayesian inference. Visual inspection of the conservation patterns on the 3-dimensional structure often enables the identification of key residues that comprise the functionally important regions of the protein.

The repository is updated with the latest PDB entries on a monthly basis and will be rebuilt annually.

Introduction

The study of a protein raises many questions: What is the protein's function? Does it have more than one function? How does the protein perform its functions? Does it act alone? Where and when is the protein active? Where are the functional regions of the protein, and what is their nature? Each of these questions can be further refined into additional, more specific, questions.

Advances in sequencing technologies produce ever larger databases containing protein sequences from a large collection of species. Within these databases one can find many protein families that can be analyzed in search of functional regions. Generally speaking, protein function is mediated through clusters of evolutionarily conserved amino acids that are located in close vicinity to each other. These clusters may be involved in enzymatic activity, ligand binding, protein-protein interactions, or in the folding and stabilization of the protein's architecture(1). Detecting these clusters and characterizing their properties is typically useful in the initial investigation of a protein. In addition, correlating the conservation pattern with other data is often insightful. ConSurf-DB leverages these protein databases to aid in the detection of such clusters.

We introduced the original ConSurf, available as an online server(2) at http://consurf.tau.ac.il/, back in 2001(3). ConSurf was developed for the identification of functional regions in proteins based on the conservation of amino acids, taking into account the phylogenetic relations between the proteins. In 2005 we introduced the ConSurf-HSSP(4) database, a pre-calculated repository of ConSurf results based on multiple sequence alignments (MSAs) extracted from the HSSP database(5). The MSAs in HSSP do not include the gaps in the query sequence, i.e., positions in the aligned sequences which do not have corresponding positions in the query sequence are removed from the alignment. Consequently, the phylogenetic reconstruction of the protein family is prone to errors. ConSurf-DB, presented here, replaces ConSurf-HSSP as our repository of pre-calculated ConSurf results. The MSAs in ConSurf-DB include all the sequence data needed for the phylogenetic reconstruction. ConSurf-DB also uses a more advanced version of the Rate4Site(6) algorithm, which relies on Bayesian inference rather than the Maximum Likelihood estimate used in ConSurf-HSSP, and its conservation results are presented in more standard, cross-platform formats.

The sequence homologues of each protein in ConSurf-DB are collected using PSI-BLAST(10) and then filtered in order to represent the protein family reliably and comprehensively. This process requires a delicate balance between two opposing effects. A conservative search would yield only very close homologues and would make it virtually impossible to discriminate between amino acid positions that are truly important and those that did not change simply because of insufficient evolutionary time. On the other hand, an overly permissive search may falsely detect non-homologues that do not share the same structure and/or function. We conducted preliminary investigations and came up with a scheme, presented below, that balances these two extremes. The selected homologues are aligned using MUSCLE(11), and the alignments are available as part of the ConSurf-DB repository.

The Rate4Site program is subsequently used to construct a phylogenetic tree and calculate conservation scores. Rate4Site builds a phylogenetic tree of the homologues using the neighbor-joining algorithm(12). Using an empirical Bayesian approach, it calculates the evolutionary rate of each amino acid position in the MSA, taking into account the stochastic nature of the evolutionary process. Amino acid evolution is traced using the JTT(13) substitution model. A high evolutionary rate indicates a variable position, while a low rate indicates an evolutionarily conserved position.
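To make the tree-building step concrete, the following is a minimal sketch of neighbor-joining tree reconstruction from an MSA, using Biopython as a stand-in. It only illustrates the kind of tree that feeds the rate inference; it is not the Rate4Site implementation, which builds the tree internally and couples it to Bayesian rate estimation under the JTT model. The file names are placeholders.

```python
# Illustrative only: a neighbor-joining tree from an MSA, using Biopython.
# This is not the Rate4Site code; Rate4Site builds the tree itself and then
# infers per-site rates with an empirical Bayesian approach (JTT model).
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("family_msa.fasta", "fasta")    # placeholder: aligned homologues

calculator = DistanceCalculator("blosum62")              # pairwise distances from the MSA
distance_matrix = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(distance_matrix)                   # neighbor joining (Saitou & Nei)

Phylo.write(tree, "family_tree.nwk", "newick")           # Newick, the format Rate4Site also emits
```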

The conservation scores are normalized so that the average over all residues is zero and the standard deviation is one. Low (negative) scores indicate conserved positions, while high scores indicate variable ones. The normalized scores are then binned into nine conservation grades, shown as the 1-9 color codes in Fig. 1, where 1 corresponds to maximal variability and 9 to maximal conservation, and are projected onto the 3D structure of the query protein. It is important to note that even though the same scale is used for all protein families, the conservation scores are not absolute; hence, comparing conservation scores between different protein families may be misleading.
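As a rough illustration of this normalization and binning, the sketch below standardizes raw per-position scores to zero mean and unit standard deviation and maps them onto grades 1-9. The equal-width binning over the observed score range is a simplifying assumption, not necessarily the exact scheme used by ConSurf-DB.

```python
import numpy as np

def normalize_and_grade(raw_scores):
    """Normalize conservation scores (mean 0, std 1) and bin them into grades 1-9.

    Grade 9 = most conserved (lowest score), grade 1 = most variable (highest score).
    Equal-width binning is an illustrative choice, not ConSurf-DB's exact rule.
    """
    scores = np.asarray(raw_scores, dtype=float)
    normalized = (scores - scores.mean()) / scores.std()

    # Nine equal-width bins spanning the observed score range.
    edges = np.linspace(normalized.min(), normalized.max(), 10)
    bins = np.clip(np.digitize(normalized, edges[1:-1]), 0, 8)   # 0 (lowest) .. 8 (highest)

    grades = 9 - bins   # low (conserved) scores map to grade 9, high (variable) to grade 1
    return normalized, grades
```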

The web interface can be used for visual inspection of one or a few proteins. It supports 3D visualization (using FirstGlance in Jmol) and access to all supplementary data. The entire repository can be downloaded via ftp or rsync and used for large-scale automated studies. For advanced uses involving rebuilding variants of the repository, the build scripts can be obtained by contacting us at bioSequence@tauex.tau.ac.il.

Methodology

Building the ConSurf-DB repository consists of four stages: scanning the PDB(14), building the MSA files, calculating the conservation scores, and formatting the results. This design was chosen to allow reuse of the scripts by controlling the data at each step. For instance, an MSA can be created by simply feeding a FASTA-format sequence file to the MSA-building script, and a Rate4Site output obtained with a custom set of parameters can be used to create the 3D visualization.
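The modular design can be pictured as four independent entry points, sketched below. The function names and signatures are hypothetical, chosen only to show how each stage consumes the previous stage's output and can also be run on its own (for example on an arbitrary FASTA file, or on a Rate4Site output produced with custom parameters); they are not the actual ConSurf-DB scripts.

```python
# Hypothetical outline of the four-stage build; names and signatures are illustrative only.

def scan_pdb_entry(pdb_id: str) -> list[str]:
    """Stage 1: return FASTA files for the chains that pass the type/length/modification filters."""
    raise NotImplementedError

def build_msa(fasta_path: str) -> str:
    """Stage 2: collect and filter homologues, align with MUSCLE; usable on any FASTA file."""
    raise NotImplementedError

def run_rate4site(msa_path: str) -> str:
    """Stage 3: compute per-position conservation scores (the CPU-bound step)."""
    raise NotImplementedError

def format_results(rate4site_output: str) -> None:
    """Stage 4: normalize and bin the scores, then emit visualization and HTML outputs.
    Accepts any Rate4Site output, e.g. one produced with a custom parameter set."""
    raise NotImplementedError

def build_entry(pdb_id: str) -> None:
    """Chain the four stages for one PDB entry."""
    for chain_fasta in scan_pdb_entry(pdb_id):
        format_results(run_rate4site(build_msa(chain_fasta)))
```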

ConSurf build
The ConSurf-DB build process starts by scanning the PDB entries; a PDB entry can contain one or more chains, and these are handled separately. When a new PDB entry is found, the SEQRES section of each chain passes through three filters: "type", "length" and "modifications". Nucleic acid chains are discarded by the "type" filter, and amino acid chains of fewer than 30 residues are discarded by the "length" filter, as they do not contain enough data for reliable phylogenetic tree reconstruction. Finally, the "modifications" filter converts non-standard residues into their closest standard amino acid forms and discards highly modified chains with over 15% non-standard residues. The modifications are noted and saved as part of the chain's supplementary data.
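A minimal sketch of these three filters is shown below. The mapping of modified residues to standard amino acids is truncated to a few common examples, and the exact bookkeeping of modifications is an assumption for illustration.

```python
# Illustrative sketch of the per-chain "type", "length" and "modifications" filters.

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
# Truncated example mapping of modified residues to their closest standard form.
MODIFIED_TO_STANDARD = {"MSE": "M", "SEP": "S", "TPO": "T"}
NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}

def filter_chain(seqres):
    """Apply the three filters to one chain.

    seqres: list of three-letter residue codes taken from the SEQRES records.
    Returns (one-letter sequence, list of modifications) or None if the chain is discarded.
    """
    # "type": discard nucleic acid chains.
    if all(res in NUCLEOTIDES for res in seqres):
        return None
    # "length": discard chains shorter than 30 residues.
    if len(seqres) < 30:
        return None
    # "modifications": convert non-standard residues and note them;
    # discard chains with more than 15% non-standard residues.
    sequence, modifications = [], []
    for i, res in enumerate(seqres, start=1):
        if res in THREE_TO_ONE:
            sequence.append(THREE_TO_ONE[res])
        else:
            sequence.append(MODIFIED_TO_STANDARD.get(res, "X"))
            modifications.append((i, res))
    if len(modifications) / len(seqres) > 0.15:
        return None
    return "".join(sequence), modifications
```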

The next two steps rely solely on the amino acid sequence of the chain. Identical sequences are grouped and processed once in order to avoid repetitive calculations. The second step in the process is the creation of the MSAs. Using PSI-BLAST we find potential homologues in the SwissProt(15) database with an e-value cutoff of 10^-3 and three iterations. The results are forwarded to a filtering script that removes sequences according to three criteria: (i) sequence identity - sequences with more than 95% identity to the query sequence are removed; (ii) sequence length - sequences shorter than 60% of the query sequence are removed; (iii) maximum overlap - since BLAST is a local alignment algorithm, fragment sequences that overlap by over 10% are also removed. Redundant sequences are then removed using CD-HIT(16); at most the 300 sequences with the lowest e-values are selected as homologues, and MUSCLE is used to align them. If fewer than 50 homologues are found, the entire process is repeated using the Clean_UniProt database. Clean_UniProt is a modified version of the UniProt database(15) aimed at retaining the more reliable sequences, based on two criteria: (i) if the "Description" (DE) field contains "Disease", "RIKEN", "variant", "mutation", "mutant" or "whole genome shotgun sequence", the sequence is removed; (ii) if the database is "TrEMBL" and the "Comments" (CC) lines contain the word "CAUTION", the sequence is removed. Clean_UniProt includes unreviewed entries and is about 10 times larger than SwissProt.
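A rough sketch of the homologue search and filtering is given below. The psiblast flags reflect the parameters quoted above (three iterations, e-value 10^-3); the database name, the tabular-output parsing, and the omission of the CD-HIT and fragment-overlap steps are simplifications for illustration, not the exact ConSurf-DB invocation.

```python
import subprocess

def search_homologues(query_fasta, database="swissprot"):
    """Run PSI-BLAST (3 iterations, e-value 1e-3) and return the tabular hits.

    Requested columns: subject id, % identity to the query, subject length, e-value.
    The database name/path is a placeholder.
    """
    out = subprocess.run(
        ["psiblast", "-query", query_fasta, "-db", database,
         "-num_iterations", "3", "-evalue", "1e-3",
         "-outfmt", "6 sseqid pident slen evalue"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for line in out.splitlines():
        if not line.strip() or line.startswith("Search has CONVERGED"):
            continue
        sseqid, pident, slen, evalue = line.split("\t")
        hits.append({"id": sseqid, "identity": float(pident),
                     "length": int(slen), "evalue": float(evalue)})
    return hits

def filter_homologues(hits, query_length, max_hits=300):
    """Apply criteria (i) and (ii) from the text and keep the best hits by e-value.

    Criterion (iii), the fragment-overlap check, and CD-HIT redundancy removal are omitted here.
    """
    kept = [h for h in hits
            if h["identity"] <= 95.0                    # (i) remove near-identical sequences
            and h["length"] >= 0.6 * query_length]      # (ii) remove sequences shorter than 60%
    kept.sort(key=lambda h: h["evalue"])
    return kept[:max_hits]                              # at most 300 homologues
```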

The third and most CPU-bound step is the execution of Rate4Site to produce the evolutionary rate scores for each amino acid position in the protein. A Condor(17) job system that is part of the European grid network was used to this end, which allowed us to complete this part of the process for all the polypeptide chains in the PDB in less than 5 days. The Rate4Site output includes a Newick-formatted phylogenetic tree of the homologues and a list of conservation scores, one for each amino acid position in the original sequence.

The last step includes parsing the Rate4Site scores and formatting them to create a range of output data. The scores are normalized and classified into nine conservation levels, as explained in the Introduction above. These levels are subsequently used for visualization with a RasMol(18) coloring script and with FirstGlance in Jmol. The confidence interval assigned to each amino acid position represents the reliability of the conservation score of that position. For example, the conservation score of a position that consists mostly of gaps will have a large confidence interval, i.e., low reliability. Low-reliability positions are marked yellow in the 3D visualization(2).
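The sketch below shows one way the parsed per-position data could be turned into a RasMol coloring script, with wide confidence intervals flagged in yellow. The assumed Rate4Site file layout (comment lines starting with '#', then whitespace-separated position, residue, score and a '[low,high]' interval), the RGB palette, and the width threshold are all illustrative assumptions, not ConSurf-DB's exact formats.

```python
# Illustrative RGB palette, grade 1 (variable) .. grade 9 (conserved); values are assumptions.
GRADE_COLORS = {
    1: (16, 200, 209), 2: (140, 255, 255), 3: (215, 255, 255),
    4: (234, 255, 255), 5: (255, 255, 255), 6: (252, 237, 244),
    7: (250, 201, 222), 8: (240, 125, 171), 9: (160, 37, 96),
}

def parse_rate4site(path):
    """Parse a per-position Rate4Site scores file (assumed layout, see lead-in)."""
    positions = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split()
            low, high = (float(x) for x in fields[3].strip("[]").split(","))
            positions.append({"pos": int(fields[0]), "res": fields[1],
                              "score": float(fields[2]), "ci_width": high - low})
    return positions

def rasmol_script(positions, grades, unreliable_width=2.0):
    """Emit RasMol commands coloring each residue by its grade; wide-interval positions in yellow."""
    lines = []
    for p, grade in zip(positions, grades):
        r, g, b = (255, 255, 0) if p["ci_width"] > unreliable_width else GRADE_COLORS[grade]
        lines.append(f"select {p['pos']}")
        lines.append(f"colour [{r},{g},{b}]")
    return "\n".join(lines)
```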

All data, including intermediate calculations, are saved in each chain's directory, and a user-friendly HTML page is created to allow viewing the results in a web browser.

For a comparison of ConSurf-DB with other servers, please read further...

References

  1. Madabushi, S., Yao, H., Marsh, M., Kristensen, D.M., Philippi, A., Sowa, M.E. and Lichtarge, O. (2002) Structural clusters of evolutionary trace residues are statistically significant and common in proteins. Journal of molecular biology, 316, 139-154.
  2. Landau, M., Mayrose, I., Rosenberg, Y., Glaser, F., Martz, E., Pupko, T. and Ben-Tal, N. (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic acids research, 33, W299-302.
  3. Armon, A., Graur, D. and Ben-Tal, N. (2001) ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. Journal of molecular biology, 307, 447-463.
  4. Glaser, F., Rosenberg, Y., Kessel, A., Pupko, T. and Ben-Tal, N. (2005) The ConSurf-HSSP database: the mapping of evolutionary conservation among homologs onto PDB structures. Proteins, 58, 610-617.
  5. Dodge, C., Schneider, R. and Sander, C. (1998) The HSSP database of protein structure-sequence alignments and family profiles. Nucleic acids research, 26, 313-315.
  6. Mayrose, I., Graur, D., Ben-Tal, N. and Pupko, T. (2004) Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Molecular biology and evolution, 21, 1781-1791.
  7. Morgan, D.H., Kristensen, D.M., Mittelman, D. and Lichtarge, O. (2006) ET viewer: an application for predicting and visualizing functional sites in protein structures. Bioinformatics (Oxford, England), 22, 2049-2050.
  8. Innis, C.A. (2007) siteFiNDER|3D: a web-based tool for predicting the location of functional sites in proteins. Nucleic acids research, 35, W489-494.
  9. Pettit, F.K., Bare, E., Tsai, A. and Bowie, J.U. (2007) HotPatch: a statistical approach to finding biologically relevant features on protein surfaces. Journal of molecular biology, 369, 863-879.
  10. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25, 3389-3402.
  11. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32, 1792-1797.
  12. Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4, 406-425.
  13. Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci, 8, 275-282.
  14. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic acids research, 28, 235-242.
  15. The UniProt Consortium (2008) The universal protein resource (UniProt). Nucleic acids research, 36, D190-195.
  16. Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (Oxford, England), 22, 1658-1659.
  17. Thain, D., Tannenbaum, T. and Livny, M. (2005) Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience, 17, 323-356.
  18. Bernstein, H.J. (2000) Recent changes to RasMol, recombining the variants. Trends in biochemical sciences, 25, 453-455.