
Overview

ConSurf-DB is a repository of evolutionary conservation analyses for proteins of known structure in the Protein Data Bank (PDB). Sequence homologues of each PDB entry were collected and aligned using standard methods. The evolutionary conservation of each amino acid position in the alignment was calculated using the Rate4Site algorithm, implemented in the ConSurf web-server. The algorithm explicitly takes into account the phylogenetic relations between the aligned proteins and the stochastic nature of the evolutionary process. Rate4Site assigns a conservation level to each position in the multiple sequence alignment using empirical Bayesian inference. Visual inspection of the conservation patterns on the 3-dimensional structure often enables the identification of key residues that comprise the functionally important regions of the protein.

The repository is updated with the latest PDB entries on a monthly basis and will be rebuilt annually.

Introduction

The study of a protein raises many questions: What is the protein's function? Does it have more than one function? How does the protein perform its functions? Does it act alone? Where and when is the protein active? Where are the functional regions of the protein, and what is their nature? Each of these questions can be further refined into additional, more specific, questions.

Advances in sequencing technologies produce ever larger databases containing protein sequences from a large collection of species. Within these databases one can find many protein families that can be analyzed in search of functional regions. Generally speaking, protein function is mediated through clusters of evolutionarily conserved amino acids that are located in close vicinity to each other. These clusters may be involved in enzymatic activity, ligand binding, protein-protein interactions, or in the folding and stabilization of the protein's architecture(1). Detecting these clusters and characterizing their properties is typically useful in the initial investigation of a protein. In addition, correlating the conservation pattern with other data is often insightful. ConSurf-DB leverages these protein databases to aid in the detection of such clusters.

We introduced the original ConSurf, available as an online server(2) at http://consurf.tau.ac.il/, back in 2001(3). ConSurf was developed for the identification of functional regions in proteins based on the conservation of amino acids, taking into account the phylogenetic relations between the proteins. In 2005 we introduced the ConSurf-HSSP(4) database, a pre-calculated repository of ConSurf results based on multiple sequence alignments (MSAs) extracted from the HSSP database(5). The MSAs in HSSP do not include the gaps in the query sequence, i.e., positions in the aligned sequences which do not have corresponding positions in the query sequence are removed from the alignment. Consequently, the phylogenetic reconstruction of the protein family is prone to errors. ConSurf-DB, presented here, replaces ConSurf-HSSP as our repository of pre-calculated ConSurf results. The MSAs in ConSurf-DB include all the sequence data needed for the phylogenetic reconstruction. ConSurf-DB also uses a more advanced version of the Rate4Site(6) algorithm, which relies on Bayesian inference rather than the Maximum Likelihood estimate used in ConSurf-HSSP, and its conservation results are presented in more standard, cross-platform formats.

The sequence homologues of each protein in ConSurf-DB are collected using PSI-BLAST(10) and then filtered in order to represent the protein family reliably and comprehensively. This process requires a delicate balance between two opposing effects. A conservative search would yield only very close homologues and would make it virtually impossible to discriminate between amino acid positions that are truly important and those that did not change simply because of insufficient evolutionary time. On the other hand, an overly permissive search may falsely detect non-homologues that do not share the same structure and/or function. We conducted preliminary investigations and came up with a scheme, presented below, that balances these two extremes. The selected homologues are aligned using MUSCLE(11), and the alignments are available as part of the ConSurf-DB repository.

The Rate4Site program is subsequently used to construct a phylogenetic tree and calculate conservation scores. Rate4Site builds a phylogenetic tree of the homologues using the neighbor-joining algorithm(12). Using an empirical Bayesian approach, it calculates the evolutionary rate of each amino acid position in the MSA, taking into account the stochastic nature of the evolutionary process. Amino acid evolution is traced using the JTT(13) substitution model. A high evolutionary rate indicates a variable position, while a low rate indicates an evolutionarily conserved position.
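To make the tree-building step concrete, the following is a minimal sketch of neighbor-joining tree reconstruction from an MSA, using Biopython as a stand-in. It only illustrates the kind of tree that feeds the rate inference; it is not the Rate4Site implementation, which builds the tree internally and couples it to Bayesian rate estimation under the JTT model. The file names are placeholders.

```python
# Illustrative only: a neighbor-joining tree from an MSA, using Biopython.
# This is not the Rate4Site code; Rate4Site builds the tree itself and then
# infers per-site rates with an empirical Bayesian approach (JTT model).
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("family_msa.fasta", "fasta")    # placeholder: aligned homologues

calculator = DistanceCalculator("blosum62")              # pairwise distances from the MSA
distance_matrix = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(distance_matrix)                   # neighbor joining (Saitou & Nei)

Phylo.write(tree, "family_tree.nwk", "newick")           # Newick, the format Rate4Site also emits
```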

The conservation scores are normalized so that the average over all residues is zero and the standard deviation is one. Low (negative) scores indicate conserved positions, while high scores indicate variable ones. The normalized scores are then binned into nine conservation grades, shown as the 1-9 color codes in Fig. 1, where 1 corresponds to maximal variability and 9 to maximal conservation, and are projected onto the 3D structure of the query protein. It is important to note that even though the same scale is used for all protein families, the conservation scores are not absolute; hence, comparing conservation scores between different protein families may be misleading.
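As a rough illustration of this normalization and binning, the sketch below standardizes raw per-position scores to zero mean and unit standard deviation and maps them onto grades 1-9. The equal-width binning over the observed score range is a simplifying assumption, not necessarily the exact scheme used by ConSurf-DB.

```python
import numpy as np

def normalize_and_grade(raw_scores):
    """Normalize conservation scores (mean 0, std 1) and bin them into grades 1-9.

    Grade 9 = most conserved (lowest score), grade 1 = most variable (highest score).
    Equal-width binning is an illustrative choice, not ConSurf-DB's exact rule.
    """
    scores = np.asarray(raw_scores, dtype=float)
    normalized = (scores - scores.mean()) / scores.std()

    # Nine equal-width bins spanning the observed score range.
    edges = np.linspace(normalized.min(), normalized.max(), 10)
    bins = np.clip(np.digitize(normalized, edges[1:-1]), 0, 8)   # 0 (lowest) .. 8 (highest)

    grades = 9 - bins   # low (conserved) scores map to grade 9, high (variable) to grade 1
    return normalized, grades
```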

The web interface can be used for visual inspection of one or a few proteins. It supports 3D visualization (using FirstGlance in Jmol) and access to all supplementary data. The entire repository can be downloaded via ftp or rsync and used for large-scale automated studies. For advanced uses involving rebuilding variants of the repository, the build scripts can be obtained by contacting us at bioSequence@tauex.tau.ac.il.

Methodology

Building the ConSurf-DB repository consists of four stages: scanning the PDB(14), building the MSA files, calculating the conservation scores, and formatting the results. This design was chosen to allow reuse of the scripts by controlling the data at each step. For instance, an MSA can be created by simply feeding a FASTA-format sequence file to the MSA-building script, and a Rate4Site output obtained with a custom set of parameters can be used to create the 3D visualization.
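The modular design can be pictured as four independent entry points, sketched below. The function names and signatures are hypothetical, chosen only to show how each stage consumes the previous stage's output and can also be run on its own (for example on an arbitrary FASTA file, or on a Rate4Site output produced with custom parameters); they are not the actual ConSurf-DB scripts.

```python
# Hypothetical outline of the four-stage build; names and signatures are illustrative only.

def scan_pdb_entry(pdb_id: str) -> list[str]:
    """Stage 1: return FASTA files for the chains that pass the type/length/modification filters."""
    raise NotImplementedError

def build_msa(fasta_path: str) -> str:
    """Stage 2: collect and filter homologues, align with MUSCLE; usable on any FASTA file."""
    raise NotImplementedError

def run_rate4site(msa_path: str) -> str:
    """Stage 3: compute per-position conservation scores (the CPU-bound step)."""
    raise NotImplementedError

def format_results(rate4site_output: str) -> None:
    """Stage 4: normalize and bin the scores, then emit visualization and HTML outputs.
    Accepts any Rate4Site output, e.g. one produced with a custom parameter set."""
    raise NotImplementedError

def build_entry(pdb_id: str) -> None:
    """Chain the four stages for one PDB entry."""
    for chain_fasta in scan_pdb_entry(pdb_id):
        format_results(run_rate4site(build_msa(chain_fasta)))
```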

ConSurf build
The ConSurf-DB build process starts by scanning the PDB entries; a PDB entry can contain one or more chains, and these are handled separately. When a new PDB entry is found, the SEQRES section of each chain passes through three filters: "type", "length" and "modifications". Nucleic acid chains are discarded by the "type" filter, and amino acid chains of fewer than 30 residues are discarded by the "length" filter, as they do not contain enough data for reliable phylogenetic tree reconstruction. Finally, the "modifications" filter converts non-standard residues into their closest standard amino acid forms and discards highly modified chains with over 15% non-standard residues. The modifications are noted and saved as part of the chain's supplementary data.
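A minimal sketch of these three filters is shown below. The mapping of modified residues to standard amino acids is truncated to a few common examples, and the exact bookkeeping of modifications is an assumption for illustration.

```python
# Illustrative sketch of the per-chain "type", "length" and "modifications" filters.

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
# Truncated example mapping of modified residues to their closest standard form.
MODIFIED_TO_STANDARD = {"MSE": "M", "SEP": "S", "TPO": "T"}
NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}

def filter_chain(seqres):
    """Apply the three filters to one chain.

    seqres: list of three-letter residue codes taken from the SEQRES records.
    Returns (one-letter sequence, list of modifications) or None if the chain is discarded.
    """
    # "type": discard nucleic acid chains.
    if all(res in NUCLEOTIDES for res in seqres):
        return None
    # "length": discard chains shorter than 30 residues.
    if len(seqres) < 30:
        return None
    # "modifications": convert non-standard residues and note them;
    # discard chains with more than 15% non-standard residues.
    sequence, modifications = [], []
    for i, res in enumerate(seqres, start=1):
        if res in THREE_TO_ONE:
            sequence.append(THREE_TO_ONE[res])
        else:
            sequence.append(MODIFIED_TO_STANDARD.get(res, "X"))
            modifications.append((i, res))
    if len(modifications) / len(seqres) > 0.15:
        return None
    return "".join(sequence), modifications
```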

The next two steps rely solely on the amino acid sequence of the chain. Identical sequences are grouped and processed once in order to avoid repetitive calculations. The second step in the process is the creation of the MSAs. Using PSI-BLAST we find potential homologues in the SwissProt(15) database with an e-value cutoff of 10^-3 and three iterations. The results are forwarded to a filtering script that removes sequences according to three criteria: (i) sequence identity - sequences with more than 95% identity to the query sequence are removed; (ii) sequence length - sequences shorter than 60% of the query sequence are removed; (iii) maximum overlap - since BLAST is a local alignment algorithm, fragment sequences that overlap by over 10% are also removed. Redundant sequences are then removed using CD-HIT(16); at most the 300 sequences with the lowest e-values are selected as homologues, and MUSCLE is used to align them. If fewer than 50 homologues are found, the entire process is repeated using the Clean_UniProt database. Clean_UniProt is a modified version of the UniProt database(15) aimed at retaining the more reliable sequences, based on two criteria: (i) if the "Description" (DE) field contains "Disease", "RIKEN", "variant", "mutation", "mutant" or "whole genome shotgun sequence", the sequence is removed; (ii) if the database is "TrEMBL" and the "Comments" (CC) lines contain the word "CAUTION", the sequence is removed. Clean_UniProt includes unreviewed entries and is about 10 times larger than SwissProt.
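A rough sketch of the homologue search and filtering is given below. The psiblast flags reflect the parameters quoted above (three iterations, e-value 10^-3); the database name, the tabular-output parsing, and the omission of the CD-HIT and fragment-overlap steps are simplifications for illustration, not the exact ConSurf-DB invocation.

```python
import subprocess

def search_homologues(query_fasta, database="swissprot"):
    """Run PSI-BLAST (3 iterations, e-value 1e-3) and return the tabular hits.

    Requested columns: subject id, % identity to the query, subject length, e-value.
    The database name/path is a placeholder.
    """
    out = subprocess.run(
        ["psiblast", "-query", query_fasta, "-db", database,
         "-num_iterations", "3", "-evalue", "1e-3",
         "-outfmt", "6 sseqid pident slen evalue"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for line in out.splitlines():
        if not line.strip() or line.startswith("Search has CONVERGED"):
            continue
        sseqid, pident, slen, evalue = line.split("\t")
        hits.append({"id": sseqid, "identity": float(pident),
                     "length": int(slen), "evalue": float(evalue)})
    return hits

def filter_homologues(hits, query_length, max_hits=300):
    """Apply criteria (i) and (ii) from the text and keep the best hits by e-value.

    Criterion (iii), the fragment-overlap check, and CD-HIT redundancy removal are omitted here.
    """
    kept = [h for h in hits
            if h["identity"] <= 95.0                    # (i) remove near-identical sequences
            and h["length"] >= 0.6 * query_length]      # (ii) remove sequences shorter than 60%
    kept.sort(key=lambda h: h["evalue"])
    return kept[:max_hits]                              # at most 300 homologues
```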

The third and most CPU-bound step is the execution of Rate4Site to produce the evolutionary rate scores for each amino acid position in the protein. A Condor(17) job system that is part of the European grid network was used to this end, which allowed us to complete this part of the process for all the polypeptide chains in the PDB in less than 5 days. The Rate4Site output includes a Newick-formatted phylogenetic tree of the homologues and a list of conservation scores, one for each amino acid position in the original sequence.

The last step includes parsing the Rate4Site scores and formatting them to create a range of output data. The scores are normalized and classified into nine conservation levels, as explained in the Introduction above. These levels are subsequently used for visualization with a RasMol(18) coloring script and with FirstGlance in Jmol. The confidence interval assigned to each amino acid position represents the reliability of the conservation score of that position. For example, the conservation score of a position that consists mostly of gaps will have a large confidence interval, i.e., low reliability. Low-reliability positions are marked yellow in the 3D visualization(2).
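The sketch below shows one way the parsed per-position data could be turned into a RasMol coloring script, with wide confidence intervals flagged in yellow. The assumed Rate4Site file layout (comment lines starting with '#', then whitespace-separated position, residue, score and a '[low,high]' interval), the RGB palette, and the width threshold are all illustrative assumptions, not ConSurf-DB's exact formats.

```python
# Illustrative RGB palette, grade 1 (variable) .. grade 9 (conserved); values are assumptions.
GRADE_COLORS = {
    1: (16, 200, 209), 2: (140, 255, 255), 3: (215, 255, 255),
    4: (234, 255, 255), 5: (255, 255, 255), 6: (252, 237, 244),
    7: (250, 201, 222), 8: (240, 125, 171), 9: (160, 37, 96),
}

def parse_rate4site(path):
    """Parse a per-position Rate4Site scores file (assumed layout, see lead-in)."""
    positions = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split()
            low, high = (float(x) for x in fields[3].strip("[]").split(","))
            positions.append({"pos": int(fields[0]), "res": fields[1],
                              "score": float(fields[2]), "ci_width": high - low})
    return positions

def rasmol_script(positions, grades, unreliable_width=2.0):
    """Emit RasMol commands coloring each residue by its grade; wide-interval positions in yellow."""
    lines = []
    for p, grade in zip(positions, grades):
        r, g, b = (255, 255, 0) if p["ci_width"] > unreliable_width else GRADE_COLORS[grade]
        lines.append(f"select {p['pos']}")
        lines.append(f"colour [{r},{g},{b}]")
    return "\n".join(lines)
```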

All data, including intermediate calculations, are saved in each chain's directory, and a user-friendly HTML page is created to allow viewing the results in a web browser.

For a comparison of ConSurf-DB with other servers, please read further...

References

  1. Madabushi, S., Yao, H., Marsh, M., Kristensen, D.M., Philippi, A., Sowa, M.E. and Lichtarge, O. (2002) Structural clusters of evolutionary trace residues are statistically significant and common in proteins. Journal of molecular biology, 316, 139-154.
  2. Landau, M., Mayrose, I., Rosenberg, Y., Glaser, F., Martz, E., Pupko, T. and Ben-Tal, N. (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic acids research, 33, W299-302.
  3. Armon, A., Graur, D. and Ben-Tal, N. (2001) ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. Journal of molecular biology, 307, 447-463.
  4. Glaser, F., Rosenberg, Y., Kessel, A., Pupko, T. and Ben-Tal, N. (2005) The ConSurf-HSSP database: the mapping of evolutionary conservation among homologs onto PDB structures. Proteins, 58, 610-617.
  5. Dodge, C., Schneider, R. and Sander, C. (1998) The HSSP database of protein structure-sequence alignments and family profiles. Nucleic acids research, 26, 313-315.
  6. Mayrose, I., Graur, D., Ben-Tal, N. and Pupko, T. (2004) Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Molecular biology and evolution, 21, 1781-1791.
  7. Morgan, D.H., Kristensen, D.M., Mittelman, D. and Lichtarge, O. (2006) ET viewer: an application for predicting and visualizing functional sites in protein structures. Bioinformatics (Oxford, England), 22, 2049-2050.
  8. Innis, C.A. (2007) siteFiNDER|3D: a web-based tool for predicting the location of functional sites in proteins. Nucleic acids research, 35, W489-494.
  9. Pettit, F.K., Bare, E., Tsai, A. and Bowie, J.U. (2007) HotPatch: a statistical approach to finding biologically relevant features on protein surfaces. Journal of molecular biology, 369, 863-879.
  10. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25, 3389-3402.
  11. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32, 1792-1797.
  12. Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4, 406-425.
  13. Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci, 8, 275-282.
  14. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic acids research, 28, 235-242.
  15. The UniProt Consortium (2008) The universal protein resource (UniProt). Nucleic acids research, 36, D190-195.
  16. Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (Oxford, England), 22, 1658-1659.
  17. Thain, D., Tannenbaum, T. and Livny, M. (2005) Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience, 17, 323-356.
  18. Bernstein, H.J. (2000) Recent changes to RasMol, recombining the variants. Trends in biochemical sciences, 25, 453-455.