Source Code

Download the source code.

Main Scripts

  1. The first script creates the directory structure and the FASTA sequence files.
  2. The second script creates the MSA file and runs Rate4Site.
  3. The third script creates various output files according to the Rate4Site output.

This script scans the entire PDB and, for each entry, checks whether a ConSurf-DB directory with the same name already exists. If it does, the PDB directory is skipped. When a PDB directory is not found in ConSurf-DB: (i) it is created, (ii) the PDB file is copied to it, (iii) a new sub-directory is created for each chain, and (iv) a FASTA format file that contains the chain's sequence is saved in each sub-directory.
The names of all newly created directories are stored as a set of Perl hash tables in the root of the ConSurf-DB data directory. The hash data is stored in two files, proteinid.storedhash and proteindirs.storedhash, and is used in the next steps of the creation process. Since this script overwrites these files, it is recommended to back them up before starting the process.
To build a full database, start with an empty staging directory.
*** All parameters for this script can be adjusted in the process_pdb_data.ini file.
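
The scan described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Perl script: the function names, the one-directory-per-entry layout with a single *.pdb file, and the seq.fasta file name are all assumptions made for the example. Sequences are read from the SEQRES records, and non-standard residues are written as X.

```python
import shutil
from pathlib import Path

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} built from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain_id = line[11]               # column 12 of the PDB format
            residues = line[19:70].split()    # columns 20-70 hold residue names
            chains.setdefault(chain_id, []).extend(
                THREE_TO_ONE.get(res, "X") for res in residues
            )
    return {chain: "".join(seq) for chain, seq in chains.items()}


def stage_new_entries(pdb_root, db_root):
    """Create directories for PDB entries that are not yet in db_root.

    Assumed (hypothetical) layout: pdb_root/<entry>/<something>.pdb
    Returns the names of the entries that were newly created.
    """
    created = []
    for pdb_dir in sorted(Path(pdb_root).iterdir()):
        if not pdb_dir.is_dir():
            continue
        target = Path(db_root) / pdb_dir.name
        if target.exists():
            continue  # entry already present: skip it
        pdb_files = sorted(pdb_dir.glob("*.pdb"))
        if not pdb_files:
            continue  # nothing to copy for this entry
        target.mkdir(parents=True)                      # (i) create directory
        shutil.copy2(pdb_files[0], target)              # (ii) copy PDB file
        sequences = seqres_sequences(pdb_files[0].read_text())
        for chain_id, seq in sequences.items():
            chain_dir = target / chain_id
            chain_dir.mkdir()                           # (iii) per-chain dir
            fasta = chain_dir / "seq.fasta"             # hypothetical name
            fasta.write_text(f">{pdb_dir.name}_{chain_id}\n{seq}\n")  # (iv)
        created.append(pdb_dir.name)
    return created
```

Running stage_new_entries a second time on the same directories returns an empty list, which mirrors the skip-if-exists behaviour described above. The sketch does not write the .storedhash files.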

This script creates the MSA file and runs Rate4Site. This is the most CPU-intensive part of the process.
The script accepts a single command line parameter: the full path to a FASTA file containing the sequence to be processed.
There are two assumptions:
  1. The FASTA file should be located in a directory path ending with /{PDB_ID}/{CHAIN_ID}/
  2. A corresponding log directory {LOG_BASE}/{PDB_ID}/{CHAIN_ID}/ exists (the log base directory is specified in single_seq_rate4site.ini)
*** All parameters for this script can be adjusted in the single_seq_rate4site.ini file.
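
The two assumptions above can be checked before a job is started. The following Python sketch shows one way to do it; locate_job is a hypothetical helper, not part of the original scripts, and the log base directory is passed in directly rather than read from single_seq_rate4site.ini.

```python
from pathlib import Path


def locate_job(fasta_path, log_base):
    """Derive (pdb_id, chain_id, log_dir) from a FASTA file path.

    The path is expected to end with /{PDB_ID}/{CHAIN_ID}/<file>, and
    {log_base}/{PDB_ID}/{CHAIN_ID}/ is expected to exist already.
    """
    fasta = Path(fasta_path)
    if not fasta.is_absolute():
        raise ValueError("expected a full path to the FASTA file")

    chain_id = fasta.parent.name          # last directory: chain ID
    pdb_id = fasta.parent.parent.name     # directory above it: PDB ID
    if not chain_id or not pdb_id:
        raise ValueError(f"path does not end with PDB_ID/CHAIN_ID: {fasta}")

    log_dir = Path(log_base) / pdb_id / chain_id
    if not log_dir.is_dir():
        raise FileNotFoundError(f"missing log directory: {log_dir}")
    return pdb_id, chain_id, log_dir
```

Failing early here avoids starting the CPU-intensive step for a sequence whose log directory was never created.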

This script creates files that help interpret the Rate4Site results and view them in a molecular viewer (either FirstGlance in Jmol or RasMol).
Rate4Site outputs conservation scores for the MSA, which is built from the protein sequence in the PDB file's SEQRES records. The script maps those scores onto the residues in the PDB file's ATOM records so that the conservation scores can be displayed on the molecule. The continuous conservation scores produced by Rate4Site are partitioned into a discrete scale of 9 bins for visualization, such that bin 9 contains the most conserved positions and bin 1 contains the most variable positions.
More information on the output files is available on the "OUTPUTS" page.
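
The two operations described above, fitting SEQRES positions to ATOM positions and binning the scores, can be illustrated with a short Python sketch. Both functions are assumptions made for illustration and are not the method used by the original script: the mapping uses difflib's longest-matching-block search instead of a real sequence alignment, and the binning uses equal-width intervals between the minimum and maximum score. The sketch also assumes that lower Rate4Site scores mean more conserved positions.

```python
from difflib import SequenceMatcher


def map_seqres_to_atom(seqres_seq, atom_seq):
    """Map SEQRES indices to ATOM indices for residues found in both.

    Residues missing from the ATOM records (for example disordered
    termini) simply get no entry in the returned dictionary.
    """
    matcher = SequenceMatcher(None, seqres_seq, atom_seq, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset] = block.b + offset
    return mapping


def bin_scores(scores, n_bins=9):
    """Turn continuous scores into grades n_bins..1.

    Assumption: the lowest score is the most conserved position, so it
    gets grade n_bins (9), and the highest score gets grade 1.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: put everything in the middle bin.
        return [(n_bins + 1) // 2 for _ in scores]
    width = (hi - lo) / n_bins
    grades = []
    for score in scores:
        # Index 0 is the most conserved interval; clamp the maximum score.
        index = min(int((score - lo) / width), n_bins - 1)
        grades.append(n_bins - index)
    return grades


if __name__ == "__main__":
    seqres = "MKTAYIAK"
    atom = "TAYIA"  # first two and last residue have no coordinates
    print(map_seqres_to_atom(seqres, atom))
    print(bin_scores([-1.0, 0.0, 1.0]))
```

In this example the ATOM sequence lacks the first two residues and the last one, so only SEQRES positions 2 to 6 receive a mapping, and only those positions could be coloured on the structure.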

Quick Step By Step Update Process

  1. Sync the PDB directory
  2. Check the parameters in process_pdb_data.ini
  3. Back up proteinid.storedhash and proteindirs.storedhash in the base data directory
  4. Run the first main script (directory structure and FASTA files)
  5. Check the parameters in single_seq_rate4site.ini
  6. Run the second main script (MSA and Rate4Site) on all new PDB entries (listed in proteinid.storedhash and proteindirs.storedhash)
  7. Check for failed jobs
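
The backup in step 3 can be automated. The following is a minimal Python sketch, assuming a timestamped sub-directory per backup; backup_hash_files and the backup location are hypothetical and not part of the original scripts.

```python
import shutil
import time
from pathlib import Path

# File names as given in the text above.
HASH_FILES = ("proteinid.storedhash", "proteindirs.storedhash")


def backup_hash_files(data_dir, backup_dir):
    """Copy the stored hash files into backup_dir/<timestamp>/.

    Returns the list of backup copies that were written. Files that do
    not exist yet (for example on a first run) are silently skipped.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_dir) / stamp
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in HASH_FILES:
        src = Path(data_dir) / name
        if src.is_file():
            shutil.copy2(src, dest / name)  # copy2 also keeps timestamps
            copied.append(dest / name)
    return copied
```

Keeping each backup in its own timestamped directory means an earlier copy is not overwritten if the update process is repeated.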