Desktop Molecular Graphics Background Essentials

Biochem 660 - 2008


1 Introduction

Techniques for visualizing the structure of macromolecules are companion tools to sequence analysis algorithms. New DNA sequences are being cloned and sequenced rapidly, but the structure of the putative encoded proteins cannot be determined from their sequence alone. As the number of protein structures solved by x-ray crystallography increases, it will become easier to find structural homologs onto which new protein sequences can be fitted. Molecular graphics play a key role in understanding current structures and creating (structural) models.

Molecular graphics have evolved over the last 40 years from a simple vector display on a high performance oscilloscope to sensor-based virtual reality.

For a beautifully illustrated account of "History of Visualization of Biological Macromolecules" see

.


Desktop computers are now more powerful than the mainframes of past decades, and both free and commercial desktop software is available to manipulate 3 dimensional structures, to create publication quality images for research papers and proposals, and to help visualize target molecules, their structural properties or their interaction with other molecules or ligands.

The ability to manipulate 3 dimensional structures on a desktop computer with molecular graphics software is critical for today's molecular biologist and a necessary complement to sequence analysis projects.

2 Where do 3 dimensional structures come from?

In summary there are three main methods: X-ray crystallography, NMR and 3D image reconstruction from cryo-electron microscopy.

Biochemists and crystallographers have developed techniques to crystallize macromolecules. Indeed proteins, nucleic acids or their complexes can form crystals under specific biochemical conditions. The crystals are very fragile and small (often less than a millimeter) but they can still be placed inside an x-ray beam. Because of the regular arrangement of the molecules within the crystals, the x-rays diffract in a very specific pattern that can be recorded on x-ray photographic film or an electronic array detector. With the help of powerful computers and complex software, the mathematical analysis of the diffraction pattern allows the crystallographer to calculate where the electrons (of the atoms) of the protein are located in 3D space inside the crystal. The crystallographer then fits a wireframe representation of the amino acids inside the electron density. When the positions of the atoms have been refined, the structure is published and usually deposited at the Protein Data Bank. These are the structures that you can fetch with a web browser and display on your desktop computer. A notable exception is structures determined in the private sector: these coordinates are proprietary and the authors are not obligated to make their data public. Each solved structure used to require months or years of involved work, but new streamlined methods allow for faster determination, in as little as one week! (High throughput structure determination, NIH, Protein Structure Initiative.)

(for more information on high throughput (not covered in class) see )

Crystals are placed into an x-ray beam. The atoms of the proteins within the crystals diffract the incident x-ray and create diffraction patterns on a film. With complex mathematical calculations crystallographers obtain an electron density map into which the amino acid sequence is fitted with help of computer graphics.

L(+) lactate dehydrogenase crystals. Bar = 100 µm. Ostendorp et al. (1996) Protein Science 5, 862

Pentalenene synthase crystals. Lesburg et al. (1995) Protein Science 4, 2436

Phosphoribulokinase crystals. Roberts et al. (1995) Protein Science 4, 2442

Diffraction: the diffracted x-ray waves can add to or subtract from each other. The result is the white dots on the diffraction image.

Diffraction image from a pentalenene synthase crystal. Lesburg et al. (1995) Protein Science 4, 2436


Electron density map


3 The Protein Data Bank (PDB): web repository of published structures

3.1 The web site

"The Protein Data Bank (PDB) is an archive of experimentally determined threedimensional structures of biological macromolecules, serving a global community of researchers, educators, and students. The archives contain atomic coordinates, bibliographic citations, primary and secondary structure information, as well as crystallographic structure factors and NMR experimental data."

The majority of published structures are derived from X-ray crystallography. However, every year a larger number are derived from NMR experiments.

The number of structures has been increasing exponentially since the first structures were deposited in 1972.

The PDB database home page is at

On October 14, 2008 there were 53660 PDB entries.

The "PDB Statistics" button on the top left of the page leads to more information about released entries.

3.2 PDB file names

Sequences are found by their "accession number." Similarly PDB files are designated by a PDB ID code, alphanumeric and only 4 characters long. Most authors publish the PDB ID within their publications.
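A quick format check can catch a mistyped ID before a download is attempted. The sketch below encodes only the rule stated above (four alphanumeric characters). The function name `looks_like_pdb_id` is hypothetical, and passing the check says nothing about whether an entry with that ID actually exists.

```python
def looks_like_pdb_id(code: str) -> bool:
    """Return True if code is exactly 4 ASCII alphanumeric characters."""
    return len(code) == 4 and code.isascii() and code.isalnum()


if __name__ == "__main__":
    # IDs quoted in this text, plus two deliberately malformed strings.
    for candidate in ("102L", "1CI0", "1BL8", "12345", "1B-8"):
        print(candidate, looks_like_pdb_id(candidate))
```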

Many proteins are represented multiple times within the database, as reported by different authors or as mutants or because they were crystallized under different conditions such as pH. For example myoglobins account for 277 entries and hemoglobins 405.

The citing of a PDB structure is done by referencing the PDB ID code, followed by the primary journal citation.

See . html


Excerpt from web site:

Structures should be cited with the PDB ID and the primary reference. For example, structure 102L should be referenced as:

PDB ID: 102L
D.W. Heinz, W.A. Baase, F.W. Dahlquist, B.W. Matthews
How Amino-Acid Insertions are Allowed in an Alpha-Helix of T4 Lysozyme.
Nature 361 pp. 561 (1993)

Structures without a published reference can be cited with the PDB ID, author names, and title:

PDB ID: 1CI0
Shi, W., Ostrov, D.A., Gerchman, S.E., Graziano, V., Kycia, H., Studier, B., Almo, S.C., Burley, S.K., New York Structural GenomiX Research Consortium (NYSGXRC)
The Structure of PNP Oxidase from S. Cerevisiae

3.3 File Format

PDB files are plain text files and can be opened with the simplest word processors or text editors (e.g. WordPad on Windows, TextEdit on Macintosh, or pico, nano and vi on Linux). The data is arranged in columns, formatted to a specific width, creating "fields" (70 characters wide by default).

Each line is a "record" and starts with the name of a record content type. For example, the most abundant and pertinent records within a PDB file are the ATOM lines, all starting with the ATOM keyword. Each ATOM record holds the 3D coordinates of one atom within the structure. ATOM records are used only for protein and nucleic acid structures. HETATM records (hetero-atoms) are used for most other atomic coordinates, such as ligands (sugars, solvents, enzymatic substrates), metallic ions (iron, zinc, calcium etc.), and also modified amino acids. However, not all authors choose to label their structures in the same way, and the ability to open the file with a word processor is extremely useful to understand and inspect its contents.
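Because every record begins with its type name, a first inspection of a file can be as simple as counting lines by that name. The following is a minimal sketch, assuming the record name occupies the first six columns of each line. The function name `count_record_types` is hypothetical, and the sample text is assembled from records quoted elsewhere in this text rather than taken from a single real entry.

```python
from collections import Counter


def count_record_types(pdb_text: str) -> Counter:
    """Count records by the name held in the first six columns of each line."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.strip():  # ignore blank lines
            counts[line[:6].strip()] += 1
    return counts


# Illustrative sample only: lines borrowed from different entries.
SAMPLE = """\
ATOM      1  N   HIS     1      49.668  24.248  10.436  1.00 25.00
HETATM 1357 MG    MG   168       4.669  34.118  19.123  1.00  3.16
HETATM 3835 FE   HEM     1      17.140   3.115  15.066  1.00 14.14
END
"""

if __name__ == "__main__":
    for name, n in count_record_types(SAMPLE).items():
        print(name, n)
```

Run on a complete file, a count like this shows at a glance whether HETATM, MODEL or TER records are present before the file is opened in a graphics program.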

Each character is positioned within the file at a specific column position, and it is vital not to disrupt this arrangement when editing a PDB file. For example the following 2 records are not equivalent:

ATOM      1  N   HIS     1      49.668  24.248  10.436  1.00 25.00   1  1GCN  50
ATOM      1  N   HIS     1     49.668  24.248  10.436  1.00 25.00   1  1GCN  50

Because of the position-specific format in the columns, the latter record is in fact equal to:

ATOM 2 N HIS 1 4.966 2.424 1.043 1.00 25.00

The first 3 real numbers on each line are the x, y and z Cartesian coordinates of the atom in three dimensional space. With this erroneous modification, the Nitrogen atom would be placed in a very different position.
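The effect of a column shift can be seen by slicing a record at fixed positions. The sketch below assumes the coordinate fields occupy columns 31-38, 39-46 and 47-54, as in the published PDB format description (the column numbers are not given in this text), and the function names are hypothetical. How a particular program reacts to a shifted record varies; here the digits simply land in the wrong fields and the strict conversion fails.

```python
def coordinate_fields(record: str) -> list[str]:
    """Return the raw x, y, z text from columns 31-38, 39-46 and 47-54."""
    return [record[30:38], record[38:46], record[46:54]]


def parse_xyz(record: str) -> tuple[float, float, float]:
    """Convert the three coordinate fields to floats (ValueError if garbled)."""
    x, y, z = (float(field) for field in coordinate_fields(record))
    return x, y, z


GOOD = "ATOM      1  N   HIS     1      49.668  24.248  10.436  1.00 25.00"
# The same record with one extra space inserted just before the coordinates.
SHIFTED = GOOD[:30] + " " + GOOD[30:]

if __name__ == "__main__":
    print(coordinate_fields(GOOD))
    print(parse_xyz(GOOD))
    print(coordinate_fields(SHIFTED))  # digits spill into neighboring fields
    try:
        parse_xyz(SHIFTED)
    except ValueError as err:
        print("shifted record is unreadable:", err)
```

Reading by column position rather than splitting on whitespace is the design choice that matters here: neighboring fields can touch without a separating space, so whitespace splitting is not reliable either.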

The three dimensional XYZ coordinates are the first three real numbers on each line: for example 49.668 24.248 10.436 for the first ATOM record (Nitrogen atom of Histidine amino-acid number 1) in the previous example. In the Cartesian coordinate system the x, y and z axes are perpendicular to one another and share the same unit length. The unit length is 1 Å (equal to 0.1 nm [nanometer] in the international notation).


This is a preferred system of reference for most biological users, however it is worth knowing that in some cases the frame of reference is the length of the crystallographic "unit cell". In this case the axes are labeled a, b and c. They are not necessarily perpendicular to one another and do not necessarily have the same length. If the coordinates are expressed as a function of these axes they are usually referred to as fractional coordinates. Most chemical databases give the coordinates in this fashion. In the PDB formatted file, the CRYST1 and SCALEn records are related to these axes, but for our purpose, these can be ignored.
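For readers who do meet fractional coordinates, the conversion to Cartesian coordinates can be sketched as follows. This is a minimal sketch assuming a common convention in which the a axis lies along x and the b axis lies in the xy plane; the function name and the example cell dimensions are hypothetical and are not taken from any entry discussed here.

```python
import math


def fractional_to_cartesian(frac, a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Convert fractional (u, v, w) to Cartesian (x, y, z) in the units of a, b, c.

    Assumed convention: a lies along x and b lies in the xy plane.
    Cell angles alpha, beta, gamma are given in degrees.
    """
    u, v, w = frac
    ca, cb, cg = (math.cos(math.radians(t)) for t in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Unit-cell volume divided by (a * b * c).
    vol = math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    x = a * u + b * cg * v + c * cb * w
    y = b * sg * v + c * (ca - cb * cg) / sg * w
    z = c * vol / sg * w
    return x, y, z


if __name__ == "__main__":
    # Center of a made-up orthogonal cell of 10 x 20 x 30 angstroms.
    print(fractional_to_cartesian((0.5, 0.5, 0.5), 10.0, 20.0, 30.0))
```

When all three angles are 90 degrees the conversion reduces to multiplying each fractional value by the corresponding cell length.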

The "ABOUT PDB" button leads to a page with a link to the PDB file format, currently the format file resides at but that dates back to 1996. More information can be found under the menu "Dictionaries and File Formats" on the left panel.

A typical PDB file has a large header: a title section with information about the compound, the journal in which the structure was described, technical information about the crystallization procedure and the crystal symmetry, the amino acid sequence of the protein(s), specific records for the secondary structure (alpha helix or beta sheet), the nucleic acid sequence etc.

The current most common record keywords are in the following table:

Keyword         Definition

TITLE SECTION: section describing experiment and biological macromolecules

HEADER          uniquely identifies entry with the idCode field. Provides a classification for the entry. Contains the date the coordinates were deposited at the PDB.
OBSLTE          appears in entries which have been withdrawn from distribution.
TITLE           contains a title for the experiment or analysis that is represented in the entry.
CAVEAT          warns of severe errors in an entry. Use caution when using an entry containing this record.
COMPND          describes the macromolecular contents of an entry. For each macromolecular component, the molecule name, synonyms, number assigned by the Enzyme Commission (EC), and other relevant details are specified.
SOURCE          specifies the biological and/or chemical source of each biological molecule in the entry.
KEYWDS          contains a set of terms relevant to the entry.
EXPDTA          presents information about the experiment, e.g. X-RAY DIFFRACTION, NMR, NEUTRON DIFFRACTION, THEORETICAL MODEL
AUTHOR          contains the names of the people responsible for the contents of the entry.
REVDAT          contains a history of the modifications made to an entry since its release.
SPRSDE          contains a list of the ID codes of entries that were made obsolete by the given coordinate entry and withdrawn from the PDB release set. One entry may replace many.
JRNL            contains the primary literature citation that describes the experiment which resulted in the deposited coordinate set. Other references are given in REMARK 1.
REMARK          presents experimental details, annotations, comments, and information not included in other records.
REMARK 1        lists important publications related to the structure presented in the entry.
REMARK 2        states the highest resolution, in Angstroms, that was used in building the model.
REMARK 3        presents information on refinement program(s) used and the related statistics.
REMARK 4 - 999  free text annotation

PRIMARY STRUCTURE: section listing sequence of residues in each chain

DBREF           provides cross-reference links between PDB sequences and the corresponding database entry or entries.
SEQADV          identifies conflicts between sequence information in the ATOM records of the PDB entry and the sequence database entry given on DBREF. Please note that these records were designed to identify differences and not errors. No assumption is made as to which database contains the correct data.
SEQRES          contains the amino acid or nucleic acid sequence of residues in each chain of the macromolecule that was studied. Note that sometimes the sequence is limited to the portion of the protein that is solved, other times the complete sequence is present.
MODRES          provides descriptions of modifications (e.g., chemical or post-translational) to protein and nucleic acid residues.

HETEROGEN SECTION: complete description of non-standard residues

HET             describes non-standard residues, such as prosthetic groups, inhibitors, solvent molecules, and ions for which coordinates are supplied.
HETNAM          gives the chemical name of the compound with the given hetID i.e. a unique chemical name.
HETSYN          provides synonyms, if any, for the compound in the corresponding (i.e., same hetID) HETNAM record. This is to allow greater flexibility in searching for HET groups.
FORMUL          presents the chemical formula and charge of a non-standard group. Example for glucose:
                FORMUL 3 GLC C6 H12 O6

SECONDARY STRUCT.: helices, sheets, and turns in protein and polypeptides.

HELIX           identifies the position of helices in the molecule. Helices are both named and numbered. The residues where the helix begins and ends are noted, as well as the total length.
SHEET           identifies the position of sheets in the molecule. Sheets are both named and numbered. The residues where the sheet begins and ends are noted.
TURN            identifies turns and other short loop turns which normally connect other secondary structure segments.

CONNECTIVITY: lists location of disulfide bonds and other linkages if present.

SSBOND          identifies each disulfide bond in protein and polypeptide structures by identifying the two residues involved in the bond.
LINK            specifies connectivity between residues that is not implied by the primary structure. This record supplements information given in CONECT records and is provided for convenience in searching.
HYDBND          specifies hydrogen bonds in the entry.
SLTBRG          specifies salt bridges in the entry.
CISPEP          specifies the prolines and other peptides found to be in the cis conformation.

MISCELLANEOUS FEATURES: describes features such as the active sites.

SITE            supplies the identification of groups comprising important sites in the macromolecule. Each listed SITE needs a corresponding REMARK 800 that details its significance.

CRYSTALLOGRAPHIC & COORDINATE TRANSFORMATION: geometry & coord system

CRYST1          presents the unit cell parameters, space group, and Z value. If the structure was not determined by crystallographic means, CRYST1 simply defines a unit cube. The Z value is the number of polymeric chains in a unit cell. In the case of heteropolymers, Z is the number of occurrences of the most populous chain.
ORIGXn          (n = 1, 2, or 3). Presents the transformation from the orthogonal coordinates contained in the entry to the submitted coordinates. ORIGX relates the coordinates in the ATOM and HETATM records to the coordinates in the submitted file.
SCALEn          (n = 1, 2, or 3). Presents the transformation from the orthogonal coordinates as contained in the entry to fractional crystallographic coordinates. The unit cell parameters are used to calculate SCALE.
MTRIXn          (n = 1, 2, or 3). Presents transformations expressing non-crystallographic symmetry.
TVECT           presents the translation vector for infinite covalently connected structures.

COORDINATES SECTION: collection of atomic coordinates

MODEL           specifies the model serial number when multiple structures are presented in a single coordinate entry, as is often the case with structures determined by NMR or molecular dynamics. Every MODEL record has an associated ENDMDL record.
ATOM            presents the atomic coordinates for standard residues. They also present the occupancy and temperature factor for each atom. Heterogen coordinates use the HETATM record type.
SIGATM          presents the standard deviation of atomic parameters as they appear in ATOM and HETATM records. Each SIGATM record immediately follows the corresponding ATOM/HETATM record.
ANISOU          presents the anisotropic temperature factors.
SIGUIJ          presents the standard deviations of anisotropic temperature factors.
TER             indicates the end of a list of ATOM/HETATM records for a chain. The TER records occur in the coordinate section of the entry, and indicate the last residue presented for each polypeptide and/or nucleic acid chain for which there are coordinates. For proteins, the residue defined on the TER record is the carboxy-terminal residue; for nucleic acids it is the 3'-terminal residue. For a cyclic molecule, the choice of termini is arbitrary.
HETATM          presents the atomic coordinate records for atoms within "non-standard" groups. These records are used for water molecules and atoms presented in HET groups. 2 examples, magnesium and iron ions:
                HETATM 1357 MG    MG   168       4.669  34.118  19.123  1.00  3.16          MG2+
                HETATM 3835 FE   HEM     1      17.140   3.115  15.066  1.00 14.14          FE3+
ENDMDL          paired with MODEL records to group individual structures found in a coordinate entry.

CONNECTIVITY SECTION: information on chemical connectivity.

CONECT          specifies connectivity between atoms for which coordinates are supplied. The connectivity is described using the atom serial number as found in the entry.

BOOKKEEPING SECTION: some final information about the file itself.

MASTER          is a control record for bookkeeping. It lists the number of lines in the coordinate entry or file for selected record types.
END             marks the end of the PDB file.

NOTE: For the casual user, the most important records are HEADER, TITLE, COMPND and SOURCE, which make it possible to verify that the correct PDB file has been retrieved, with the correct data within it.

The HET records reveal the presence of non-protein and non-nucleic-acid moieties.

The ATOM and HETATM records make up the actual 3D coordinate set, and all or part of them can be cut and pasted into a new text document if needed.

MODEL and ENDMDL are not recognized by some software, such as older versions of Rasmol or VMD. The presence of these records may prevent chains beyond MODEL 1 from being shown. Manual editing, or switching to another visualization package, is required in this case.
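One form of manual editing is to keep only the first model and drop the MODEL and ENDMDL lines themselves. The sketch below shows that approach under the assumption that the record name sits in the first six columns. The function name `keep_first_model` is hypothetical, and the coordinates of the second model in the sample are made up for illustration. Whether a given program needs this treatment depends on the program and its version.

```python
def keep_first_model(lines):
    """Drop MODEL/ENDMDL lines and every record inside the second and later models."""
    kept = []
    models_seen = 0
    inside_model = False
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            models_seen += 1
            inside_model = True
            continue
        if record == "ENDMDL":
            inside_model = False
            continue
        # Skip records that belong to the second and later models.
        if inside_model and models_seen > 1:
            continue
        kept.append(line)
    return kept


# Toy input: the second model's coordinates are invented for illustration.
SAMPLE = [
    "MODEL        1",
    "ATOM      1  N   ALA A  23      65.191  22.037  48.576  1.00181.62           N",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  N   ALA A  23      65.300  22.100  48.600  1.00181.62           N",
    "ENDMDL",
    "END",
]

if __name__ == "__main__":
    for line in keep_first_model(SAMPLE):
        print(line)
```

Records outside any MODEL/ENDMDL pair, such as the header and the final END, pass through unchanged.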

TER records mark the end of each chain. However some (mostly older) PDB files may not have these records in place, in which case it may be useful to add them (the Chimera software, for example, draws a line (bond) between molecules A and B if there is no TER separating their ATOM records). On the other hand, it may be useful to remove TER records in other cases.

NOTE: ATOM is reserved for the protein and nucleic-acid atoms.

HETATM is usually used for all other compounds (listed under HET) such as ligands (e.g. NAD, AMP, ATP), solvents (e.g. PO4, SO4), water molecules (HOH or WAT) and metal ions (e.g. MG, FE, CA, ZN). However some authors do not always follow these conventions and it can be useful to inspect the PDB file with a word processor to know how the file is organized.
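A short script can complement visual inspection by listing which residue names appear on HETATM records. This is a minimal sketch, assuming the residue name occupies columns 18-20 as in the published PDB format description; the function name `het_residue_names` is hypothetical, and the sample lines are the records quoted in this text rather than one real entry.

```python
def het_residue_names(lines):
    """Return the sorted set of residue names (columns 18-20) on HETATM records."""
    return sorted(
        {line[17:20].strip() for line in lines if line.startswith("HETATM")}
    )


# Illustrative lines borrowed from examples in this text.
SAMPLE = [
    "ATOM      1  N   HIS     1      49.668  24.248  10.436  1.00 25.00",
    "HETATM 1357 MG    MG   168       4.669  34.118  19.123  1.00  3.16          MG2+",
    "HETATM 3835 FE   HEM     1      17.140   3.115  15.066  1.00 14.14          FE3+",
]

if __name__ == "__main__":
    print(het_residue_names(SAMPLE))
```

A name that is expected among the ligands but missing from this list may have been stored on ATOM records instead, which is the kind of deviation from convention mentioned above.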

Example of ATOM records for 1BL8 (Potassium Channel (Kcsa) From Streptomyces Lividans)

#1      #2  #3  #4 #5 #6      #7      #8      #9    #10 #11_______________
ATOM      1  N   ALA A  23      65.191  22.037  48.576  1.00181.62           N
ATOM      2  CA  ALA A  23      66.434  22.838  48.377  1.00181.62           C
ATOM      3  C   ALA A  23      66.148  24.075  47.534  1.00181.62           C
ATOM      4  O   ALA A  23      65.327  24.916  47.902  1.00181.62           O
ATOM      5  CB  ALA A  23      67.503  21.981  47.702  1.00 74.09           C
ATOM      6  N   LEU A  24      66.837  24.176  46.401  1.00163.39           N
////
ATOM    704  OE1 GLN A 119      79.595  14.626  51.132  1.00193.75           O
ATOM    705  NE2 GLN A 119      79.635  16.611  50.129  1.00193.75           N
TER     706      GLN A 119
ATOM    707  N   ALA B  23      85.298   9.520  40.592  1.00173.57           N
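Since the coordinates are expressed in Å, distances between atoms follow directly from the x, y and z values. As a small worked example, the sketch below computes the distance between atoms 1 (N) and 2 (CA) of ALA A 23 from the 1BL8 records listed above; the function name `distance` is hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y, z) points, in the units of the input."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Coordinates of atoms 1 (N) and 2 (CA) of ALA A 23 in the 1BL8 listing.
n_ala23 = (65.191, 22.037, 48.576)
ca_ala23 = (66.434, 22.838, 48.377)

if __name__ == "__main__":
    d = distance(n_ala23, ca_ala23)
    # 1 angstrom = 0.1 nm
    print(f"N-CA distance: {d:.3f} angstrom = {d / 10:.4f} nm")
```

The result is close to 1.5 Å, the order of magnitude expected for a covalent bond between two backbone atoms.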
