


SHELX Workshop

St. Paul ACA Meeting 22nd July 2000

Contents

1. Workshop program and aims

2. Introduction to SHELX

3. SHELXD – integrated direct and Patterson methods (beta-test)

4. Guide to SHELX for macromolecules: Phasing

5. Guide to SHELX for macromolecules: Refinement

6. Frequently asked questions (by biocrystallographers)

7. References

8. Further useful sources of information

1. Workshop Program

The Workshop is divided into four sessions, with a discussion period after each session. Each discussion is led by a panel consisting of the session chair and the speakers for that session.

Introduction, phasing etc. Chair: Duncan McRee

8:30 – 8:45 George Sheldrick Historical introduction to SHELX

8:45 – 9:10 George Sheldrick Dual-space ab initio direct methods in SHELXD

9:10 – 9:35 Thomas Schneider MAD phasing

9:35 – 10:00 Louis Farrugia The WinGX user interface

10:00 – 10:20 Discussion

10:20 – 10:35 Coffee/tea

Structure refinement. Chair: Ethan Merritt

10:35 – 11:00 Dale Tronrud Introduction to refinement, solvent model

11:00 – 11:20 George Sheldrick Restraints and constraints

11:20 – 11:45 Bill Clegg Weak data, disorder and other problems in small molecules

11:45 – 12:10 Thomas Schneider Disorder in macromolecules

12:10 – 12:30 Discussion

12:30 – 13:30 Buffet lunch

Twinning. Chair: George Sheldrick

13:30 – 13:45 Regine Herbst-Irmer Racemic twinning and the Flack parameter

13:45 – 14:10 Regine Herbst-Irmer Merohedral twins

14:10 – 14:35 Victor Young Non-merohedral twins

14:35 – 15:00 Thomas Schneider Twinning in macromolecules

15:00 – 15:20 Discussion

15:20 – 15:35 Coffee/Tea

Errors, validation and anisotropic refinement. Chair: Bill Clegg

15:35 – 16:00 Ton Spek Small-molecule validation

16:00 – 16:20 George Sheldrick Estimation of parameter errors

16:20 – 16:45 Ethan Merritt Anisotropic refinement of macromolecules

16:45 – 17:10 Duncan McRee Validation of error estimates for metalloproteins

17:10 – 17:30 Discussion

1.1 Aims and organization of the Workshop

Although centered on a particular program system, it is intended that the Workshop should be educational; previous experience of the SHELX programs should not be essential (though it will clearly help). Applications to small molecules and macromolecules have been mixed up as thoroughly as possible; an exchange of ideas must surely be beneficial to both groups of crystallographers. The gap between the two approaches has long since disappeared. Large small molecules that are bigger than small proteins are now being solved by molecular replacement or anomalous dispersion methods, and small proteins are being solved by direct methods. Anisotropic refinement with full-matrix estimation of standard deviations is now practicable for macromolecules that diffract to high resolution, and the techniques used to model disordered solvent in small molecule structures often now involve restraints developed first for macromolecular refinements.

With these notes, we have tried to provide an introduction to the theory and application of the new structure solution program SHELXD, which is proving very useful for the ab initio solution of larger small molecules given data to atomic resolution, as well as for the location of heavier atoms or anomalous scatterers from MAD, SIR, SIRAS and SAS data at much lower resolution. We have also tried to provide a simple introduction to the SHELX system for biological crystallographers using the programs for the first time, e.g. for the refinement of proteins at high resolution or the refinement of twinned macromolecules at any resolution. No attempt has been made to deal with routine small-molecule applications since these are well covered by the existing documentation.

In order to maximize the information content, the Workshop will consist of talks and discussions rather than computer demonstrations. There is a generous allocation of time for discussions and participants are encouraged to make good use of this to ask awkward questions. The SHELX Workshops in Göttingen have covered similar ground in about a week, so the program is intensive and will require good teamwork from the speakers. Computers will be available during exhibit hours at the ACA Meeting, so participants who would like to try out some of the programs on their own data should contact the appropriate speakers.

A useful byproduct of the Workshop is the production of tutorials, documentation and examples that have been made generally available on the Internet (via links from the SHELX homepage).

2. Introduction to SHELX

2.1 History

The original version of SHELX consisted of about 5000 lines of FORTRAN written around 1970 for the solution and refinement of small-molecule and inorganic structures from single crystal diffraction data. Starting in 1976, this version was distributed in compressed form so that the program and test data fitted into one box of ca. 2000 punched cards. SHELX76 was restricted to 160 atoms because of computer limitations! A separate structure solution program SHELXS was released in 1986 to accommodate advances in direct methods, and in 1993 SHELXL replaced the structure refinement part of SHELX76. There was a major update of both SHELXS and SHELXL in 1997 and these are still the current versions. The SHELX97 system includes a program CIFTAB for processing CIF format files that can be used for archiving structural data, and programs SHELXPRO and SHELXWAT designed more specifically for macromolecular applications.

At this Workshop a beta-test version of a new integrated Patterson and direct methods program SHELXD is being released; it is proving particularly useful in MAD phasing of macromolecules as well as for ab initio solution of structures in the range 200-2000 unique atoms, i.e. between small molecules and macromolecules. The SHELX system consists purely of programs that input and output text files. Several excellent graphical interfaces are available from other authors. At the Workshop three such interfaces (PLATON and WinGX for small molecules and XtalView for macromolecules), which like SHELX are available free to academics, will be introduced by their authors. SHELXTL, a commercial version of SHELX incorporating the interactive graphics programs XPREP (reciprocal space exploration) and XP (real space calculations and display), is available from Bruker-AXS.

2.2 Program organization and philosophy

SHELX is written in a simple subset of FORTRAN-77 that has proved to be extremely portable. The programs SHELXS (structure solution) and SHELXL (refinement) both require only two input files: a reflection file (name.hkl) and a file (name.ins) that contains crystal data, atoms (if any), and instructions in the form of keywords followed by free-format numbers, etc. These programs write a listing file name.lst and a file, name.res, that can be renamed or edited to name.ins for the next refinement. The common first part of the filename is read from the command line by typing, e.g., ‘shelxl name’. The programs are executed independently without the use of any hidden files, environment variables, etc.

The programs are general for all space-groups in conventional settings or otherwise and make extensive use of default settings to keep user input and confusion to a minimum. Particular care has been taken to test the programs thoroughly on as many computer systems and crystallographic problems as possible before they were released, a process that often requires several years!

2.3 Distribution of the programs

The programs are provided as sources as well as precompiled executables for common computer systems, and may be downloaded by ftp or using a browser (CDROMs are also available). The programs are free of charge for academics but a modest license fee (currently $2499) is required for for-profit institutions. This license covers the use of the programs for an unlimited time on an unlimited number of computers at one geographical location. This fee is necessary to cover the costs of distribution and support for all users; we do not make a profit but the university requires us to cover our costs. When there is a major new release a new license fee is required for the new version. There will be no additional license fee for the beta-test of SHELXD, but the final version of this program will be released at the same time as the next major SHELX update in 2001 or 2002 and so will require a license fee. To encourage for-profit users to switch to the new version and to prevent a bug-ridden version remaining in circulation, the beta-test is provided in compiled form only and has a built-in expiry date. The final version will be made available as usual in source form without an expiry date. All users are required to fill in and sign an application form before they are given the password for downloading the programs from the SHELX ftp server; this form may be printed from the SHELX homepage.

2.4 Documentation and support

Information about new developments in the SHELX programs, workshops, related programs, frequently asked questions and other sources of information is posted on the SHELX homepage, which should be checked at regular intervals. A detailed SHELX manual may be downloaded from the SHELX ftp server in Microsoft Word or in Postscript format. This was written with small molecule users in mind and contains a full explanation of the test structures that are provided with the programs. Since macromolecular users may be unfamiliar with these examples, these notes include a separate guide for macromolecular Workshop participants. The author is happy to answer questions (email only please, gsheldr@shelx.uni-ac.gwdg.de) provided that the questions are not in the lists of ‘frequently asked questions’!

3. SHELXD – integrated Patterson and direct methods (beta-test)

3.1 Introduction

Although the solution of the crystallographic phase problem is proving more elusive than Fermat’s last theorem, in practice the large majority of small molecule structures are solved in minutes (or even seconds) by conventional direct methods. However the phase probability distributions on which these methods are based become weaker as the number of atoms increases, and few structures with more than about 200 unique equal atoms have been solved in this way. After more than a decade in which little progress was made in solving larger structures, the introduction of the dual-space (also known as Shake & Bake) philosophy by the Buffalo group (Miller et al., 1993) proved to be a significant improvement, increasing the size of structure that could be solved by nearly an order of magnitude (Figure 3.1).

Figure 3.1 A general view of dual-space direct methods. The phase refinement (in reciprocal space) is usually performed using the tangent formula (Karle & Hauptman, 1956) or minimal function (Miller et al., 1993); the atomic model in real space may simply involve picking the highest N peaks or may be more sophisticated.

This procedure, which was implemented in the computer programs SnB (Miller et al., 1994) and later in SHELXD, was of necessity based on the strongest normalized structure factors E, corresponding typically to the largest 15 to 20% of the observed structure factors F in each resolution shell, because the probability formulae only provide significant phase information for the strongest E-values. The number of unique non-hydrogen atoms N is assumed to be approximately known. The dual-space recycling is typically performed for several hundred or more sets of N random starting atoms, with typically 2N cycles for each. In SHELXD, potential solutions are identified by high values of the correlation coefficient CC (Fujinaga & Read, 1987):

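$$\mathrm{CC} = \frac{100\left[\sum w E_o^2 E_c^2 \cdot \sum w - \sum w E_o^2 \cdot \sum w E_c^2\right]}{\left\{\left[\sum w E_o^4 \cdot \sum w - \left(\sum w E_o^2\right)^2\right] \cdot \left[\sum w E_c^4 \cdot \sum w - \left(\sum w E_c^2\right)^2\right]\right\}^{1/2}}$$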

These potential solutions can be improved and extended by means of peaklist optimization (Sheldrick & Gould, 1995) that finds the set of potential atoms that maximizes CC for all reflections.
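As an illustration, the correlation coefficient CC defined above can be evaluated with a few running sums. The following is only a minimal sketch: the function name, the argument names and the default of unit weights are assumptions made for this example, and the way SHELXD chooses the weights w is not specified in these notes.

```python
from math import sqrt

def correlation_coefficient(eo2, ec2, w=None):
    """Weighted correlation coefficient (in %) between observed and
    calculated E^2 values, following the CC formula in the text.
    Unit weights are assumed when w is not given (an assumption)."""
    if w is None:
        w = [1.0] * len(eo2)
    sw = sum(w)
    swo = sum(wi * o for wi, o in zip(w, eo2))
    swc = sum(wi * c for wi, c in zip(w, ec2))
    swoc = sum(wi * o * c for wi, o, c in zip(w, eo2, ec2))
    swoo = sum(wi * o * o for wi, o in zip(w, eo2))
    swcc = sum(wi * c * c for wi, c in zip(w, ec2))
    numerator = swoc * sw - swo * swc
    denominator = sqrt((swoo * sw - swo ** 2) * (swcc * sw - swc ** 2))
    return 100.0 * numerator / denominator
```

The function fails with a division by zero if either list has no variance; a production implementation would need to guard against that case.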

The structure solution, as monitored by the mean phase error, tends to happen quite suddenly over a small number of cycles. Although there is little indication of an impending solution, a single dominant peak in real space typically indicates that the phase refinement is locked in a false minimum (Xu et al., 2000).

3.2 Random omit maps

In the course of testing SHELXD, it was discovered by accident that a very effective procedure is to leave out about 30% of the peaks at random when calculating phases for the next cycle. In retrospect it is possible to understand why this is an effective search strategy, by analogy with the omit maps frequently used in macromolecular crystallography. If the deleted atoms are part of an essentially correct solution, they will probably be regenerated; if not, they will be replaced by different, and possibly better, potential atoms. The effectiveness of this random omit procedure is illustrated in Figure 3.2 using gramicidin A (NS = 317; P212121) as a test structure; gramicidin A was probably the most difficult structure solved by conventional direct methods (Langs, 1988). At least for this structure, the most effective approach involved the combination of the tangent formula in reciprocal space with random omit maps in real space; other attempts at modifying the peak list were much less successful. Note that line (c) corresponds to the original Shake & Bake procedure. A surprising observation in Figure 3.2 is that the combination (d) [no phase refinement/random omit] is able to solve this structure (albeit less efficiently) although no phase probability relations have been employed! This is important because, unlike the random omit maps, the probabilities become weaker as the structure becomes larger. This provides the important clue that for much larger structures, it might be more efficient to discard the probabilistic approach to direct methods completely!

Figure 3.2 Percentage of correct solutions P against cycle number for gramicidin A using various combinations of phase refinement and real space processing: (a) tangent / random omit; (b) minimal function / random omit; (c) minimal function / top N Peaks; (d) no phase refinement / random omit. In the random omit procedure, the highest N peaks were found and 30% of them omitted at random.
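The real-space step of the random omit procedure (find the highest N peaks, then omit 30% of them at random) can be sketched as follows. This is a hedged illustration rather than the SHELXD implementation: the function name, the representation of a peak as a (height, x, y, z) tuple and the rounding of the number of omitted peaks are all assumptions.

```python
import random

def random_omit(peaks, n_atoms, omit_fraction=0.3, rng=random):
    """Keep the n_atoms highest peaks, then discard omit_fraction of them
    at random. Each peak is assumed to be a (height, x, y, z) tuple."""
    top = sorted(peaks, key=lambda p: p[0], reverse=True)[:n_atoms]
    n_keep = len(top) - round(omit_fraction * len(top))
    # sample() draws without replacement, so no peak is kept twice
    return rng.sample(top, n_keep)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing search strategies such as those in Figure 3.2.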

It should be emphasized that direct methods are almost entirely a phase searching problem; phase refinement plays a minor role. There are much better ways of refining phases than the tangent formula or the minimal function. For example, Sayre (1974) showed that it was possible to refine the phases of the small protein rubredoxin with 1.5 Å data by a least-squares fit to his squaring equation:

$$F_h = Q_h \sum_k F_k F_{h-k}$$

Qh is a constant, assuming equal atoms and equal isotropic displacement parameters. This equation equates amplitudes as well as phases, compared with just phases in the case of the tangent formula, and is equally valid for large and small structures, whereas probability formulas become weaker as the size of the structure increases. On the other hand the use of all the data rather than just a small subset of the strongest E-magnitudes probably makes it less suitable for searching phase space.
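The squaring equation can be checked numerically on a toy model. The sketch below is an assumption-laden illustration: it uses a one-dimensional 'structure' of equal point atoms of unit height placed on grid points, for which the squared density equals the density itself, so that the equation holds exactly with the constant equal to 1/n (n being the number of grid points). This value of the constant is a property of the toy normalisation only, and all names are hypothetical.

```python
import cmath

def dft(values):
    """Plain discrete Fourier transform: structure factors of a 1D density."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * cmath.pi * h * x / n)
                for x, v in enumerate(values))
            for h in range(n)]

def sayre_rhs(F, q):
    """Right-hand side of the squaring equation: q * sum_k F_k F_(h-k),
    with indices taken modulo the grid size."""
    n = len(F)
    return [q * sum(F[k] * F[(h - k) % n] for k in range(n))
            for h in range(n)]

n = 16
rho = [0.0] * n
for x in (1, 5, 6, 12):      # four equal point atoms (arbitrary positions)
    rho[x] = 1.0

F = dft(rho)
F_sayre = sayre_rhs(F, q=1.0 / n)
# Largest discrepancy between the two sides; it is at rounding-error level
max_dev = max(abs(a - b) for a, b in zip(F, F_sayre))
```

Note that both amplitudes and phases of F are reproduced, which is the property emphasized in the text.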

Table 3.1 Some previously unsolved structures first solved using SHELXD. SG = space group, N is the number of unique non-hydrogen atoms excluding solvent, NS the number including solvent atoms. HA lists the unique atoms heavier than oxygen, if any, and dmin is the limiting resolution to which data were processed.

|Compound |SG |N |NS |HA |dmin(Å) |

|Hirustasin |P43212 |402 |467 |10S |1.20 |

|Cyclodextrin |P21 |448 |467 | |1.00 |

|Cyclodextrin |P1 |483 |562 | |0.88 |

|Decaplanin |P21 |448 |635 |4Cl |1.00 |

|Amylose CA26 |P1 |624 |771 | |1.10 |

|Mersacidin |P32 |750 |826 |24S |1.04 |

|rc-WT Cv HiPIP |P212121 |1264 |1599 |8Fe |1.20 |

|Cytochrome c3 | P31 |2024 |2208 |8Fe |1.20 |

3.3 Application to unknown structures

The best demonstration of the power of a new method is its ability to solve previously unsolved structures; Table 3.1 shows some examples of this. It should be noted that the presence of heavier atoms definitely improves the chances of success, and reduces the computer time needed per solution, but is not essential. It should also be noted that these successes are limited to structures for which data were available to atomic resolution (ca. 1.2 Å) or better. The only exception is hirustasin (Usón et al., 1999) which could be solved using either the 1.2 Å (low temperature) or the 1.4 Å (room temperature) data, even if the data were truncated to 1.55 Å.

3.4 Integration with other approaches

The extension of these algorithms to lower resolution and to larger structures is the subject of intensive current research by the groups in Bari, Buffalo, Göttingen and York. An obvious extension is to search for small groups of atoms (e.g. the five atoms of a peptide group) rather than for individual atoms, but unfortunately this is very computer time intensive. Peak-picking is after all an extreme form of density modification, and the low density elimination procedure of Woolfson and co-workers (Shiono & Woolfson, 1992; Refaat & Woolfson, 1993) may provide a good compromise between peak picking and techniques normally applied to improve maps at lower resolution. Solvent boundaries have apparently not yet been included in direct methods programs. A promising (but complicated) alternative would be to incorporate the wARP approach (Perrakis, Lamzin et al., 1997, 1999) of refining the positions and B-values of potential atoms, adding new atoms that correspond to high difference density and make chemical sense. The peak positions from dual space direct methods are relatively precise, and simply refining B-values against all data can significantly improve map interpretability (Usón et al., 1999; Parisini et al., 1999), as shown in Figure 3.3:

Figure 3.3 (a) Part of the electron density map produced by dual-space recycling followed by peaklist optimisation in the ab initio solution of a HiPIP protein (Parisini et al., 1999) and (b) The same region of a sigma-A weighted 2mFo-DFc map (Read, 1985) after B-value refinement. Although the atom positions were held fixed in this refinement, density appears at the sites of the missing atoms.

3.5 Direct methods for the location of anomalous scatterers

In principle the MAD approach (Hendrickson, 1991; Smith, 1998), in which data are collected at two or more wavelengths for which the f′ and f″ anomalous scattering factors are non-zero for at least one of the elements present, determines experimental phases directly. There is however a hidden phase problem: it is still necessary to find the positions of the anomalous scatterers in order to calculate reference phases. Without these reference phases the protein phases cannot be found. Although conventional direct methods and Patterson interpretation programs such as SHELXS-97 can be misused to find a small number of anomalous scatterers from the MAD estimates of the structure factors for these atoms alone (FA) or from the SAS (sometimes referred to as SAD) anomalous differences (ΔF = F+ – F–), the number of sites that can be found in this way is limited to about 20. The main problem is that the data are noisy since they are based on differences of observed structure factors; the best antidote is to collect highly redundant data. On the other hand the resolution and completeness of the FA data are not critical; 3.5 Å is adequate, since the anomalous atoms are more than 3.5 Å apart, and the problem is still highly over-determined. Although higher resolution and completeness are not required to find the anomalous scatterers, they do have a major influence on the quality of the resulting electron density maps (Brodersen et al., 2000).

Before attempting to use MAD or SAS data to locate the anomalous scatterers, a critical decision is to which resolution the data should be truncated. If data are used to a higher resolution than there is significant dispersive and anomalous information, the effect will be to add noise. Since direct methods are based on normalised structure factors, which emphasise the high resolution data, they are particularly sensitive to this. Since there is some anomalous signal at all the wavelengths in the MAD experiment, a good test is to calculate the correlation coefficient between the signed anomalous differences ΔF at different wavelengths as a function of the resolution. A good general rule is to truncate the data where this correlation coefficient falls below about 25 to 30%. Table 3.2 illustrates three very different cases. This procedure can also indicate if there is a problem with the wavelength. For SAS data collected at a single wavelength it is still possible to use the correlation coefficient between the anomalous differences collected from two crystals, or from one crystal in two orientations, before merging the two data-sets.

Table 3.2. Correlation coefficients expressed as percentages between the high energy remote data and the two or three other wavelengths collected in MAD experiments on three different proteins. In (a) the high values involving the peak (pk) and inflection point (ip) data show that it is not necessary to truncate the data; there is significant MAD information up to the highest resolution collected. A poorer correlation would be expected with the low energy remote data (lrm), which has a much smaller anomalous signal. In (b) it would be advisable to truncate the data to about 3.9 Å (which indeed led to a successful solution using SHELXD). (c) is clearly hopeless and in fact could not be solved.

(a) Apical Domain (Walsh et al., 1999) 1 x (3 Se-Met in 144aa) C2221

Inf - 8.0 - 6.0 - 5.0 - 4.0 - 3.6 - 3.4 - 3.2 - 3.0 - 2.8 - 2.6 - 2.4 - 2.2

pk 91.2 93.9 93.9 89.6 88.6 89.4 89.4 83.9 76.9 65.7 57.0 44.8

ip 89.7 90.0 87.0 84.4 79.8 78.9 79.4 74.7 71.1 54.3 47.2 39.2

lrm 48.5 52.8 52.9 38.0 28.4 34.6 14.2 21.1 24.7 9.1 5.4 -3.7

(b) RRF (Selmer et al., 1999) 1 x (4 Se-Met in 185aa) P43212

Inf - 8.0 - 6.0 - 5.0 - 4.6 - 4.4 - 4.2 - 4.0 - 3.8 - 3.6 - 3.4 - 3.2 - 3.0

pk 69.3 73.1 62.2 56.9 49.6 45.6 48.6 29.6 20.6 24.6 20.1 14.2

ip 59.4 58.3 41.9 43.3 40.7 50.4 34.6 24.7 17.5 16.6 8.1 3.9

(c) Unknown Protein 4 x (4 Se-Met in 350aa) P21

Inf - 8.0 - 6.0 - 5.0 - 4.6 - 4.4 - 4.2 - 4.0 - 3.8 - 3.6 - 3.4 - 3.2 - 3.0

pk 33.2 29.5 19.9 10.6 7.7 17.4 7.6 9.8 9.3 13.4 6.0 2.8

ip 37.6 38.9 37.8 26.5 13.5 24.0 14.2 27.3 25.9 23.1 24.3 22.8
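The truncation rule (cut the data where the correlation coefficient falls below about 25 to 30%) can be applied mechanically to the shell values listed in Table 3.2. The sketch below is only an illustration of the rule: the function name and the convention of returning the low-resolution edge of the first shell that fails the test are assumptions, and in practice the cutoff is a matter of judgement.

```python
def truncation_resolution(shell_dmin, cc_percent, threshold=30.0):
    """Return the low-resolution edge (in Angstrom) of the first shell whose
    correlation coefficient falls below threshold, or None if no shell does.
    shell_dmin holds the high-resolution limit of each successive shell."""
    previous_edge = float("inf")   # the first shell starts at infinity
    for dmin, cc in zip(shell_dmin, cc_percent):
        if cc < threshold:
            return previous_edge
        previous_edge = dmin
    return None

# Values copied from Table 3.2 (b), RRF, peak (pk) data
rrf_shells = [8.0, 6.0, 5.0, 4.6, 4.4, 4.2, 4.0, 3.8, 3.6, 3.4, 3.2, 3.0]
rrf_pk = [69.3, 73.1, 62.2, 56.9, 49.6, 45.6, 48.6, 29.6, 20.6, 24.6, 20.1, 14.2]

cut_30 = truncation_resolution(rrf_shells, rrf_pk, threshold=30.0)  # 4.0
cut_25 = truncation_resolution(rrf_shells, rrf_pk, threshold=25.0)  # 3.8
```

The two thresholds bracket the value of about 3.9 Å recommended for case (b) in the table caption.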

3.6 Integration of direct and Patterson methods

The original dual-space algorithm is an effective way of locating a specified number of anomalous scatterers from MAD FA or SAS ΔF data. The efficiency can however be improved by at least an order of magnitude by using starting atoms consistent with the Patterson function rather than random starting atoms. In addition, the Patterson provides a reliable relative (but not absolute) indication of the correctness of the solution and also of which atoms are probably correct.

The location of possible starting atoms makes extensive use of a special form of the Patterson minimum function (PMF) which is calculated as follows (Nordman, 1966). Place two atoms in a unit-cell and generate all their symmetry equivalents. Look up the Patterson function values corresponding to the unique vectors between all these atoms and sort them in ascending order, then find the mean value of the lowest (say) 30% of the values in this list. Since it is unlikely that this PMF will have a high value for wrong atom positions, especially when the symmetry is high and there are many vectors, it may be used as a criterion for a translational search for a two-atom fragment.
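The final step of the PMF calculation (sort the Patterson values in ascending order and average the lowest 30%) can be sketched as below. The function and argument names are hypothetical, the rounding of the number of values averaged is an assumption, and the lookup of the Patterson values for the unique interatomic vectors is assumed to have been done already.

```python
def patterson_minimum_function(patterson_values, fraction=0.3):
    """Mean of the lowest `fraction` of the Patterson values found at the
    unique vectors between the trial atoms and their symmetry equivalents."""
    if not patterson_values:
        raise ValueError("at least one Patterson value is required")
    ordered = sorted(patterson_values)            # ascending order
    n_low = max(1, round(fraction * len(ordered)))
    return sum(ordered[:n_low]) / n_low
```

Because only the lowest values are averaged, a few accidentally high vectors cannot rescue a wrong position, which is why the PMF is a useful search criterion.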

Each strong general Patterson peak is in principle a suitable two-atom ‘fragment’ for this translational search, because it may well correspond to a vector between two heavy atoms! Since we are only interested in generating many different sets of atom co-ordinates consistent with the Patterson function, there is no need to determine the global maximum PMF; indeed often this does not give good starting atoms for the dual-space recycling. A simple and effective approach is to try a fixed number (usually in the range 10000 to 99999) of random translations for a vector, and retain the one with the highest PMF. A random selection of vectors from the Patterson peak-list (excluding Harker peaks), biased so that the high peaks are chosen more often, is an effective way to pick the two-atom search fragment. If the atoms are expected to have lower than average B-values, as is the case for iron atoms in heme groups or iron-sulfur clusters, it is advantageous to sharpen the Patterson, e.g. by using coefficients √(E³F) rather than F².
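The random translational search described above (try a fixed number of random translations and retain the one with the highest PMF) amounts to a very simple loop. In the following sketch the scoring function is supplied by the caller and stands in for the PMF of the translated two-atom fragment; the function name, the use of fractional co-ordinates in the range 0 to 1 and the default number of trials are assumptions made for the illustration.

```python
import random

def best_random_translation(score, n_trials=10000, rng=random):
    """Try n_trials random translations (fractional co-ordinates) and return
    the one with the highest score together with that score. `score` is a
    caller-supplied function, e.g. the PMF of the translated fragment."""
    best_shift, best_score = None, float("-inf")
    for _ in range(n_trials):
        shift = (rng.random(), rng.random(), rng.random())
        s = score(shift)
        if s > best_score:
            best_shift, best_score = shift, s
    return best_shift, best_score
```

Since the aim is to generate many different plausible starting positions rather than the global maximum, no local optimisation of the best translation is attempted in this sketch.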

Table 3.3 Application of integrated Patterson / direct methods to the location of the anomalous scatterers from MAD data. In the number of sites column, the first number is the number found and the second is the total number that should be present. aa = amino acid, SG = space group, dmin is the limiting resolution to which the data were processed, and Soln/hr is the number of solutions per hour on a 500 MHz Pentium PC. PAT-ratio is the speedup obtained by using starting atoms consistent with the Patterson function.

|Protein | No. of sites | No. of aa | SG |dmin[Å] |CC[%] |PAT-ratio |Soln/hr |

|Api-dm |3/3 Se |144 |C2221 |2.2 |45 |16 |256 |

|RRF |3/4 Se |185 |P43212 |4.0 |60 |1.4 |283 |

|ModE |6/6 Se |524 |P21212 |3.0 |66 |7.3 |163 |

|9hem |18/18 Fe |584 |P21 |2.9 |73 |4.0 |240 |

|X1 |32/32 Se |~1600 |C2 |3.5 |49 |5.1 |9 |

|Cyanase |40/40 Se |1560 |P1 |2.4 |57 |0.95 |66 |

|X2 |51/60 Se |~1500 |P21 |2.5 |52 |2.8 |13 |

|X3 |66/66 Se |2160 |P21 |2.6 |60 |12.5 |24 |

Before the first dual-space cycle, the two starting atoms need to be extended to N atoms. A difference Fourier synthesis would be effective for a small number of heavy atoms, but a better technique for a large number is to calculate a full-symmetry Patterson superposition minimum function (PSMF) (Buerger, 1959). First all symmetry equivalents are generated for the two starting atoms. Each pixel of the PSMF map is assigned a value equal to the PMF for all vectors between these atoms and a dummy atom placed at the pixel. Peaks are then obtained by map interpolation and sorted in the usual way.

By applying this procedure before each run through the dual-space recycling, it is possible to generate an unlimited number of different sets of starting atoms, all more or less consistent with the Patterson function. Our tests have shown that this combination of direct and Patterson methods produces more complete and precise solutions than Patterson methods alone. It appears that iterative Patterson-only procedures suffer from an accumulation of atomic co-ordinate errors each time a new atom is added. Because it includes phase refinement, the dual-space approach does not suffer from this degradation as the number of atoms increases. Table 3.3 shows some results using this integrated Patterson / dual space recycling procedure on typical MAD problems; note the efficiency in terms of solutions per hour and the completeness of the solutions.

Table 3.4 Crossword table for location of the 8 iron atoms (two Fe4S4 clusters) in a HiPIP from SAS ΔF data collected with Cu-Kα radiation (Rayment et al., 1992). Each entry in the table links the atom forming the row with the atom forming the column; the top number of each pair is the minimum distance between the two atoms, taking symmetry into account, and the bottom number is the corresponding PMF. It is easy to find the two clusters by looking for Fe…Fe distances of about 2.8 Å and, despite the weakness of the anomalous signal, the PMF values for the 8 correct atoms are in general higher than those involving spurious atoms.

Peak x y z self cross-vectors

99.9 0.9201 0.0784 0.1133 27.7

26.6

88.4 0.9719 0.1047 0.1356 27.4 2.4

39.7 5.5

85.5 0.9043 0.1258 0.0884 27.7 2.6 3.0

27.3 23.3 5.5

82.7 0.9546 0.0950 0.0503 26.7 2.3 2.5 2.7

15.2 28.4 43.5 26.4

81.1 0.3542 0.5285 0.2615 31.2 14.6 16.6 14.4 14.6

20.9 41.4 14.8 9.5 21.5

80.5 0.4316 0.5144 0.2451 30.0 16.5 18.7 16.4 16.8 3.0

25.5 24.6 20.0 21.2 8.9 0.0

80.4 0.3942 0.5575 0.1995 29.6 14.4 16.4 13.9 14.6 2.7 2.9

0.0 31.4 7.7 22.6 33.8 26.6 19.4

73.9 0.3920 0.5023 0.1694 29.1 14.3 16.6 14.5 14.8 3.2 2.6 3.0

26.1 22.3 16.0 24.5 18.3 10.9 0.0 17.5

--------------------------------------------------------------------

63.8 0.4025 0.4641 0.2218 29.9 16.1 18.4 16.4 16.5 4.0 2.9 5.0

18.4 17.0 13.1 0.0 4.5 0.0 5.4 0.0

58.9 0.9655 0.0517 0.0945 26.9 2.2 3.0 4.5 2.6 15.2 17.3 15.4

45.9 7.3 15.8 7.8 5.3 0.0 0.0 6.1

The Patterson superposition function is also the basis of the above crossword table that provides a convenient way to assess which of the heavy atom sites are correct, and also in some cases to recognize the presence of non-crystallographic symmetry. In this table the rows and columns correspond to the potential atoms. For each pair of atoms the top number is the minimum distance between them, taking the space group symmetry into account, and the bottom number is the PMF calculated from all vectors between the two atoms, also taking symmetry into account. The first vertical column is based on the self-vectors, i.e. between one atom and its symmetry equivalents. In general wrong sites can be recognized in this table by the presence of several zero PMF values (negative values are replaced by zero). The mean PMF value for a specified number of atoms provides a figure of merit PATFOM, useful for selecting the best solution, though the absolute value depends on the structure in question. Almost always the correct solution has the largest CC and the largest PATFOM; this was the case for all the examples in Table 3.3. Table 3.4 illustrates a typical crossword table. It is fortuitous that all four iron atoms in one cluster appear before those in the other in this table, but one of the two independent molecules did indeed have higher B-values than the other.
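The rule of thumb that wrong sites show several zero PMF values can be applied to the PMF rows of Table 3.4 as in the sketch below. The PMF values are copied from the table (self-vector entry first in each row); everything else is an assumption made for illustration: the function names, the choice of counting zeros row by row (i.e. against the atom itself and the higher-ranked peaks only), the threshold of two zeros, and the reading of the mean PMF as a plain average over the rows of the selected atoms. The exact definition of PATFOM used by SHELXD may differ.

```python
# PMF entries of Table 3.4, one row per peak, self-vector value first
PMF_ROWS = [
    [26.6],
    [39.7, 5.5],
    [27.3, 23.3, 5.5],
    [15.2, 28.4, 43.5, 26.4],
    [20.9, 41.4, 14.8, 9.5, 21.5],
    [25.5, 24.6, 20.0, 21.2, 8.9, 0.0],
    [0.0, 31.4, 7.7, 22.6, 33.8, 26.6, 19.4],
    [26.1, 22.3, 16.0, 24.5, 18.3, 10.9, 0.0, 17.5],
    [18.4, 17.0, 13.1, 0.0, 4.5, 0.0, 5.4, 0.0],
    [45.9, 7.3, 15.8, 7.8, 5.3, 0.0, 0.0, 6.1],
]

def suspicious_rows(pmf_rows, min_zeros=2):
    """Return the 1-based numbers of the peaks whose row contains at least
    min_zeros zero PMF values (assumed reading of 'several')."""
    return [i + 1 for i, row in enumerate(pmf_rows)
            if sum(1 for v in row if v == 0.0) >= min_zeros]

def mean_pmf(pmf_rows, n_atoms):
    """Plain average of all PMF entries in the rows of the first n_atoms
    peaks (one possible reading of the figure of merit in the text)."""
    values = [v for row in pmf_rows[:n_atoms] for v in row]
    return sum(values) / len(values)

flagged = suspicious_rows(PMF_ROWS)   # [9, 10]
```

With these assumptions only the two peaks listed below the dashed line in Table 3.4 are flagged, in agreement with the 8 correct iron sites.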

3.7 The .ins file instructions for SHELXD

SHELXD expects ONE and only one source of starting atoms. This can take the form:

A: Input atoms in normal SHELX format for expansion using PLOP

B: PATS to generate ‘slightly better than random’ atoms consistent with the Patterson

C: GROP and a PDB-format model fragment

D: Random atoms (used if none of the above apply)

The reflection data consist of an .hkl file containing F² (HKLF 4) or F-values (HKLF 3). These may correspond to either native data for ab initio structure solution or structure expansion, or MAD, SAD, SIR or SIRAS FA or ΔF values for heavy or anomalous atom location.

Dual-space recycling, using the largest E-values (FIND) is followed by peaklist optimization (PLOP); one or both of these commands must be present. In the case of structure expansion only PLOP can be used and the program then stops. When the starting atoms are generated randomly or by PATS or GROP, the calculations are repeated for a new set of starting atoms. The total number of such tries may be specified with NTRY, otherwise the program runs for ever; however when the job is running the calculation may be terminated at the end of the current try by creating a name.fin file in the current working directory.
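When NTRY is not given, a running job can be stopped cleanly by creating the name.fin file mentioned above. A minimal sketch of doing this from a script follows; the helper name is hypothetical, and it is an assumption that an empty file is sufficient, since the text only says that the file must be created.

```python
from pathlib import Path

def request_stop(name, workdir="."):
    """Create an empty name.fin file in the working directory of the job so
    that the program stops at the end of the current try."""
    fin = Path(workdir) / f"{name}.fin"
    fin.touch()
    return fin
```

The file should probably be deleted before the job is started again, otherwise a later run in the same directory might stop after its first try; this too is an assumption rather than documented behaviour.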

In the following examples, TITL...UNIT in the normal SHELX format is assumed at the start of the .ins file and HKLF 4 (or 3) followed by END at the end of the file. The cell contents defined by SFAC and UNIT are only used by PLOP; in the FIND stage the atoms are assumed to be of the same type but with occupancies proportional to the square root of the peak height.

1. To solve an approximately equal-atom structure using native data to atomic resolution (1.2 Å or better) the middle of the .ins file (between UNIT and HKLF) might be as follows (for 500 unique non-hydrogen atoms):

FIND 400

PLOP 500 600

2. To solve the same structure by first locating a disulfide bond (PATS with a super-sharp Patterson) then expanding to the complete structure (FIND/PLOP):

PATS -2.06

PSMF -4

FIND 400

PLOP 500 600

3. To locate 30 selenium atoms from MAD data:

PATS

FIND 30

MIND -3.5

[the .hkl file could contain h, k, l, FA and σ(FA) in FORMAT(3I4,2F8.2)].

4. To solve a cyclodextrin structure with four beta-cyclodextrins in the asymmetric unit and with data barely to atomic resolution, the following could be tried:

GROP -1.8

FIND 240

PLOP 320 400

GEOM 4

ATOM 1 C41 MOL 1 -3.859 4.863 7.904 1.000 10.00

ATOM 2 C31 MOL 1 -5.081 4.209 8.524 1.000 10.00

ATOM 3 C21 MOL 1 -5.211 2.740 8.155 1.000 10.00

.... diglucose fragment in PDB format (see test provided) ...

ATOM 21 C52 MOL 1 -0.292 4.714 7.025 1.000 10.00

ATOM 22 O52 MOL 1 -0.642 5.837 6.253 1.000 10.00
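As a rough illustration of the fixed format mentioned in example 3, the following Python sketch writes one reflection per line as FORMAT(3I4,2F8.2). The function name and the sample numbers are invented for illustration; the only assumption is that h, k and l each occupy four columns and that the two real values each occupy eight columns with two decimals.

```python
def format_hkl_line(h, k, l, value, sigma):
    """Return one reflection formatted as Fortran FORMAT(3I4,2F8.2).

    'value' stands for F, F-squared, FA or delta-F depending on the HKLF
    setting; 'sigma' is its estimated standard deviation.
    """
    return f"{h:4d}{k:4d}{l:4d}{value:8.2f}{sigma:8.2f}"


# Hypothetical reflections, for illustration only.
reflections = [(1, -2, 3, 123.4, 7.8), (12, 0, -7, 5.0, 0.25)]
for refl in reflections:
    print(format_hkl_line(*refl))
```

Each line produced this way is 28 characters long; values that do not fit in the field widths would need to be rescaled before writing.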

SHELXD is started with the command line:

shelxd name

and expects to find both input files name.ins and name.hkl in the current directory. It writes a summary to the current window (standard output) and creates the files name.lst (more extensive listing file) and name.res (SHELX format atoms, crystal coordinates).
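The file conventions just described can be wrapped in a small helper script. The sketch below is in Python and uses invented function names; it only checks that name.ins and name.hkl are present and creates an empty name.fin to request termination at the end of the current try. That an empty file is sufficient is an assumption here, since the text only says that the file must be created.

```python
from pathlib import Path


def missing_inputs(name, directory="."):
    """List the SHELXD input files (name.ins, name.hkl) that are absent."""
    base = Path(directory)
    wanted = [f"{name}.ins", f"{name}.hkl"]
    return [fn for fn in wanted if not (base / fn).is_file()]


def request_stop(name, directory="."):
    """Create name.fin so that a running job stops after the current try."""
    fin = Path(directory) / f"{name}.fin"
    fin.touch()
    return fin
```

A typical use would be to call missing_inputs("name") before starting the job and request_stop("name") from another window once the listing suggests that the structure has been solved.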

The following instructions may be included in the .ins file. Default values are given in square brackets; the # sign indicates that the default depends on other instructions:

TITL, CELL, ZERR, LATT, SYMM, SFAC and UNIT as usual (see the SHELX manual).

TRIC (or TRIK)

Flags expansion to non-centrosymmetric triclinic.

SHEL dmax [infinity], dmin [0]

Resolution limits in Å for all calculations.

NTRY ntry [0]

Number of global tries if starting from random atoms, PATS or GROP. If ntry is zero or absent, the program runs until it is interrupted by writing a name.fin file in the current working directory.

PATS +np or -dis [100], npt [#], nf [5]

Calculates and stores Patterson. Using top np peaks or a random orientation vector of length |dis|, tries npt random translations, selecting the one with the best Patterson minimum function PMF (see PSMF). When selecting a vector from the list of unique Patterson peaks, special vectors are ignored and the highest vector is chosen from nf random selections. This favors the highest peaks but (if nf is not too large) also allows lower peaks a chance. For example, with the default np = 100 and nf = 5, the chance is 39.5% that one of the first 10 vectors will be chosen and 91.9% that one of the first 50 will be chosen.

If the first parameter is negative, nf randomly oriented vectors of length |dis| are compared on the basis of their heights in the Patterson and the 'best' is used for the translation search.

If PATS is used together with a second FIND parameter ncy greater than zero (or FIND followed by only one number) a full-symmetry Patterson superposition minimum function (i.e. a superposition based on the two peaks and all their symmetry equivalents) is used to locate the atoms in the first FIND cycle. PATS and GROP are mutually exclusive.

GROP +ZZ or -Egr [0], +/-ngt [99], nor [99], ntr [9999]

6D Patterson search for small rigid group. If the first parameter is positive, the search is performed using the Patterson minimum function PMF (see PSMF), using interatomic vectors for which the product of the two atomic numbers is greater than ZZ. For each of |ngt| attempts, nor random orientations are generated. The orientation with the best PMF (based on intramolecular vectors only) for each attempt is subject to ntr translations. The solution with the best PMF in the translational search (using both intra- and intermolecular vectors) in all the |ngt| attempts is used to generate the starting atoms for the next stage (usually FIND). If the first parameter is negative, an analogous procedure is employed but the function maximized is the sum of Ec²(Eo²–1) for reflections with E > |Egr| and resolution d > dlim (see ESEL).

If the second parameter ngt is negative, the above procedure is used for the rotation and translation search, but then a correlation coefficient (CC20) between Eo² and Ec² is calculated for each 'best' rotation/translation combination using 20% of all reflections up to the limiting resolution of dlim (20% rather than 100% is used to speed up the calculation). Thus one CC20 value is calculated for each of the |ngt| attempts. The solution with the highest CC20 provides the starting atoms for the next stage. This is slower but almost always better than the other criteria.

The search model is read from PDB-format ATOM or HETATM records in the .ins file. All other PDB records should be removed. The atomic number is deduced from the atom name applying PDB rules. The PMF search is recommended for searching for a heavy-atom cluster (e.g. from SAS or MAD data) whereas the (slower) structure-factor based search is suitable for equal-atom fragments such as a short piece of alpha-helix (for solving small proteins) or a diglucose fragment (for solving cyclodextrins).

PSMF pres [4.0], psfac [0.34]

pres is the resolution of the Patterson in terms of the minimum ratio of the number of grid points along an axis and the maximum reflection index along that axis. If pres is negative a 'super-sharp' Patterson with coefficients √(E³F) is calculated, otherwise a normal F² Patterson is used. psfac is the fraction of the lowest values in the sorted list of Patterson heights that is summed to get the PMF.

FRES res [3.0]

Resolution of all Fourier syntheses (including the PSMF but excluding the Patterson itself) in terms of the minimum ratio of the number of grid points along an axis and the maximum reflection index used along that axis.

ESEL Emin [#], dlim [1.0]

Minimum E and high-resolution limit for FIND and TANG. The E² values are normalized to 1 in resolution shells, then smoothed. Emin defaults to 1.2 for ab initio structure solution and to 1.5 for heavy atom location (the absolute value of the first MIND parameter is used to distinguish between these two cases depending on whether it is less than 1.6 or not).

FIND na [0], ncy [#]

Search for na atoms in ncy internal loop cycles (tangent formula + E-Fourier). ncy defaults to 20 (for heavy-atom location) or the maximum of 20 or na (for ab initio direct methods, distinguished using MIND mdis). The highest na / ( 1 – fr ) peaks are selected, where fr is the WEED parameter. The effect is that approximately na peaks remain after the random omit procedure (WEED). The occupancy is made proportional to square root of peak height in the FIND stage.

TANG ftan [0.9], fex [0.4]

Fraction ftan of the ncy FIND cycles are performed using the tangent formula, the rest using a Sim-weighted E-map. fex is the fraction of reflections with the largest Ecalc values to hold fixed when doing tangent expansion to find the remaining phases.

NTPR ntpr [100]

Maximum number of (largest) TPR per reflection; negative for output of mean phase errors (if phases were input).

MIND mdis [1.0], mdeq [2.2]

|mdis| is the shortest distance allowed between atoms for PATS and FIND. If mdis is negative PATFOM is calculated, and the crossword table for the best PATFOM value so far is output to the .lst file. In this case the solution is passed on to the PLOP stage if either the CC is the best so far or the PATFOM is the best so far. mdeq is the minimum distance between symmetry equivalents for FIND (for PATS the |mdis| distance is used). Thus the default setting of mdeq prevents FIND from placing atoms on special positions. This is usually desirable because it helps to avoid pseudo-solutions such as the 'uranium atom solution' that are incorrect but fit the tangent formula, but it might be better to change this setting to -0.1 to allow special positions when looking for e.g. metal ions. For PLOP the PREJ instruction can be used to control whether peaks on special positions are selected. Note also that a |mdis| threshold of 1.6Å is used to decide between all-atom ab initio and heavy atom location for the purpose of setting various defaults for other parameters.

SKIP min2 [0.5]

During FIND, if the second peak height is less than min2 times the first, the first peak is rejected (before applying WEED to reject other peaks). This is sometimes useful to suppress 'uranium atom' solutions. In fact, for large equal-atom structures in space group P1 it is a good idea to specify ‘SKIP 0.999’ so that the first peak is ALWAYS rejected!

WEED fr [0.3]

Randomly OMIT fraction fr of atoms in FIND stage (except in the last cycle). Does not apply to PLOP.

GEOM ngm [0], ndwt [1.0], nha [0], d13 [2.45], dd [0.3]

After the peaksearch in the FIND and PLOP routines, ngm cycles (typically 2 to 5) of geometry optimization are performed so that distances within dd of d13 are brought closer to d13. In addition, all peak heights after the highest nha (heavy atoms) are multiplied by ndwt (typically 0.7; 1.0 for no action) if the peaks have no other atoms or peaks within the distance range (d13+dd) to (d13–dd). This instruction is an attempt to build in a little chemical information and it is hoped that it will enable the resolution requirement to be relaxed a little.

TEST CCmin [#], delCC [#]

Go on to PLOP if CC > CCmin or CC is within delCC of best CC value so far. CCmin is reduced by 0.1% each cycle until a solution passes this test. The defaults are 45 and 1 resp. for ab initio solutions, and 10 and 5 resp. for heavy atom location (MIND mdis test).

KEEP nh [0]

Number of (heavy) atoms to retain during PLOP expansion.

PLOP followed by up to 10 numbers

PLOP specifies the number of peaks to start with in each cycle of the 'peaklist optimization' algorithm of Sheldrick & Gould (1995). Peaks are then eliminated one at a time until either the correlation coefficient cannot be increased any more or 50% of the peaks have been eliminated.

PREJ maxb [3], dsp [-0.01], mf [1]

maxb is the maximum number of bonds to atoms or higher peaks; a peak is deleted if it has more. Peaks are also deleted if they are less than dsp from their symmetry equivalents (PLOP only; FIND uses the second MIND parameter). Atoms are not output to the final .res file if there are fewer than mf atoms in the 'molecule'.

SEED nrand [0]

Set random number seed so that exactly the same results are generated if the job is repeated; each integer nrand defines a different sequence of random numbers. If nrand is omitted or zero, the seed is randomized so a different sequence is always generated.

MOVE dx [0], dy [0], dz [0], sign [1]

Shift following coordinates (not ATOM/HETATM).

ATOM and HETATM

PDB format atoms for GROP

HKLF m

m = 4 for F² in .hkl file, m = 3 for F (or FA or ΔF)

END
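Several of the defaults marked with # above depend on whether |mdis| (the first MIND parameter) is below 1.6. The following Python sketch collects the values quoted in the ESEL, FIND and TEST entries and also evaluates the FIND peak count na / (1 – fr). It is only a summary of the text, not of the program source: the function name is invented, the treatment of |mdis| exactly equal to 1.6 follows the wording 'less than 1.6 or not', and no rounding is applied to the peak count because the text does not say how the program rounds.

```python
def shelxd_defaults(mdis=1.0, na=0, fr=0.3):
    """Summarize the mdis-dependent defaults quoted in the instruction list.

    mdis: first MIND parameter (only its absolute value is used here)
    na:   number of atoms requested with FIND
    fr:   WEED fraction of atoms omitted at random
    """
    heavy = abs(mdis) >= 1.6          # heavy-atom location vs. ab initio
    return {
        "mode": "heavy-atom" if heavy else "ab initio",
        "Emin": 1.5 if heavy else 1.2,            # ESEL default
        "ncy": 20 if heavy else max(20, na),      # FIND default
        "CCmin": 10.0 if heavy else 45.0,         # TEST default
        "delCC": 5.0 if heavy else 1.0,           # TEST default
        "peaks_selected": na / (1.0 - fr),        # before random omit
    }


# Example 3 above: 30 selenium sites with MIND -3.5.
print(shelxd_defaults(mdis=-3.5, na=30))
```

For FIND 30 with the default WEED fraction of 0.3 this gives about 43 peaks selected, of which roughly 30 survive the random omit step.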

4. A guide to SHELX for macromolecules: Phasing

4.1 Introduction

Since small-molecule direct methods and Patterson interpretation algorithms can be used to locate a small number of heavy atoms or anomalous scatterers, SHELXS has been used by macromolecular crystallographers for a number of years, and SHELXD is designed both for the ab initio solution of small proteins given data to atomic resolution (1.2Å or better) and for the location of the anomalous scatterers from MAD or SAS (also known as SAD or OAS) data.

4.2 Heavy atom location using SHELXS and SHELXD

One might expect that a small-molecule direct methods program, such as SHELXS (Sheldrick, 1990), that routinely solves structures with 20-100 unique atoms in a few minutes or even seconds of computer time, would have no difficulty in locating a handful of heavy atom sites from isomorphous or anomalous ΔF data. However, such data can be very noisy, and a single seriously aberrant reflection can invalidate a large number of probabilistic phase relations. The most important direct methods formula is still the tangent formula of Karle & Hauptman (1956); most modern direct methods programs (e.g., Busetta et al., 1980; Debaerdemaeker, Tate & Woolfson, 1985; Sheldrick, 1990) use versions of the tangent formula that have been modified to incorporate information from weak reflections as well as strong reflections, which helps to avoid pseudo-solutions with translationally displaced molecules or a single dominant peak (the so-called uranium atom solution). Isomorphous and anomalous ΔF values represent lower limits on the structure factors for the heavy atom substructure and so do not give reliable estimates of weak reflections; thus, the improvements brought to direct methods by the use of the weak reflections are largely irrelevant when they are applied to ΔF data. This does not apply when FA values are derived from a MAD experiment, since these are true estimates of the heavy atom structure factors; however, aberrant large and small FA estimates are difficult to avoid and often upset the phase determination process. A further problem in applying direct methods to ΔF data is that it is not always clear what the effective number of atoms in the cell should be for use in the probability formulas, especially when it is not known in advance how many heavy atom sites are present.

4.3 The Patterson interpretation algorithm in SHELXS

Space-group general automatic Patterson interpretation was introduced in the program SHELXS-86 (Sheldrick, 1985); completely different algorithms are employed in the current version of SHELXS, based on the Patterson superposition minimum function (Buerger, 1959, 1964; Richardson & Jacobson, 1987; Sheldrick, 1991, 1998a; Sheldrick et al., 1993). The algorithm used in SHELXS-97 is as follows:

1. A single Patterson peak, v, is selected automatically (or input by the user) and used as a superposition vector. A sharpened Patterson (with coefficients √(E³F) instead of F², where E is a normalized structure factor) is calculated twice, once with the origin shifted to –v/2 and once with the origin shifted to +v/2. At each grid point, the minimum of the two Patterson values is stored, and this superposition minimum function is searched for peaks. If a true single-weight heavy atom-to-heavy atom vector has been chosen as the superposition vector, this function should consist ideally of one image of the heavy-atom structure and one inverted image, with two atoms (the ones corresponding to the superposition vector) in common. There are thus about 2N peaks in the map, compared with N² in the original Patterson, a considerable simplification. The only symmetry element of the superposition function is the inversion center at the origin relating the two images.

2. Possible origin shifts are found so that the full space-group symmetry is obeyed by one of the two images, i.e., for about half the peaks, most of the symmetry equivalents are present in the map. This enables the peaks belonging to the other image to be eliminated and, in principle, solves the heavy-atom substructure. In the space-group P1, the double image cannot be resolved in this way.

3. For each plausible origin shift, the potential atoms are displayed as a triangular table that gives the minimum distance and the Patterson superposition minimum function value for all vectors linking each pair of atoms, taking all symmetry equivalents into account. This ‘crossword’ table enables spurious atoms to be eliminated and occupancies to be estimated and also in some cases reveals the presence of non-crystallographic symmetry.

4. The whole procedure is then repeated for further superposition vectors as required. The program gives preference to general vectors (multiple vectors will lead to multiple images), and it is advisable to specify a minimum distance of (say) 8Å for the superposition vector (3.5Å for selenomethionine MAD data) to increase the chance of finding a true heavy atom-to-heavy atom vector.
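Step 1 of the algorithm above can be illustrated with a one-dimensional toy model. The Python sketch below is not the SHELXS implementation: it uses unit point atoms on a periodic grid, an unsharpened Patterson, and the function min(P(x), P(x+v)), which is the same superposition minimum function translated by v/2 so that no half grid steps are needed. The atom positions and cell length are invented and chosen so that no interatomic vectors coincide by accident.

```python
from itertools import product


def patterson(atoms, cell):
    """Point-atom Patterson on a periodic 1D grid as {vector: weight}."""
    p = {}
    for a, b in product(atoms, repeat=2):
        u = (a - b) % cell
        p[u] = p.get(u, 0) + 1
    return p


def superposition_minimum(p, v, cell):
    """Non-zero values of min(P(x), P(x + v)) as {x: height}."""
    smf = {}
    for x, height in p.items():
        m = min(height, p.get((x + v) % cell, 0))
        if m > 0:
            smf[x] = m
    return smf


atoms = [1, 4, 16, 64]          # hypothetical heavy-atom positions
cell = 256                      # grid points in the toy unit cell
p = patterson(atoms, cell)
v = atoms[1] - atoms[0]         # a true single-weight interatomic vector
smf = superposition_minimum(p, v, cell)

# One image of the structure and one inverted image are expected.
image = {(a - atoms[1]) % cell for a in atoms}
inverted = {(atoms[0] - a) % cell for a in atoms}
print(len(p), "Patterson peaks,", len(smf), "superposition peaks")
print(sorted(smf) == sorted(image | inverted))
```

With four atoms the Patterson contains 13 distinct peaks (including the origin), whereas the superposition function contains only six: the two images of four atoms each share the two atoms that define v.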

4.4 Examples of heavy-atom location with SHELXS

First we will consider a very straightforward example, using SIR ΔF-data for the protein barnase, that is provided as a test job for SHELXS-97. SHELXS (and SHELXL and SHELXD) requires two standard text input files that in this case are called barnase.ins and barnase.hkl. The .ins file contains the crystal data (cell, space group and contents) followed by specific instructions; the .hkl file is simply a list of h, k, l, ΔF and σ(ΔF) in fixed format. In this particular case the very short .ins file was created by hand using a text editor and the .hkl file was output by the CCP4 program mtz2various, but there are a number of possible graphical interfaces (e.g. the XPREP program in the Bruker SHELXTL system) that could have set up both files with a minimum of user effort. The barnase.ins file takes the following form:

TITL Barnase Au del(F) in P3(2)

REM Isomorphous delta-F data for Au derivative (3 sites)

REM kindly donated by Eleanor Dodson, University of York, UK

CELL 1.54178 58.970 58.970 81.580 90.00 90.00 120.00

ZERR 1.00 0.008 0.008 0.016 0.00 0.00 0.00

LATT -1

SYMM -Y, X-Y, .66667+Z

SYMM -X+Y, -X, .33333+Z

SFAC N AU

UNIT 200 9 ! fudge unit-cell contents for delta-F data

PATT 2 ! PATT -4 or PATT -10 for more difficult cases

HKLF 3

END

It will be seen that SHELX instructions consist of a four-letter keyword followed by further information in free format on the same line. In the days of punched-card input this was revolutionary; now it looks antique but is still practicable and easy to transfer between different operating systems etc. Comments can be added with the keyword ‘REM’ or may follow ‘!’ on a single line. Some of the information here (e.g. the wavelength and cell esds) will not be used by SHELXS but is required for consistency with SHELXL. ‘LATT -1’ specifies a non-centrosymmetric primitive lattice and is followed by two of the three symmetry operators that define the space group P32 (the operator X, Y, Z is omitted since it is common to all space groups). The last operator could also have been input as ‘SYMM y-x, -x, 1/3+z’; in general lower and upper case letters are equivalent. The SFAC and UNIT instructions define the cell contents, but for heavy-atom location from ΔF-data the contents need to be fudged (square root of the number of light atoms followed by the expected number of heavy atoms in the cell); this is only important for direct methods. HKLF 3 is used to flag ΔF rather than (ΔF)² (HKLF 4).
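The instruction layout just described (keyword, free-format items, REM and ‘!’ comments, case-insensitive keywords) is simple enough to read with a few lines of code. The following Python sketch is a hypothetical reader, not part of SHELX; it makes the simplifying assumption that every ‘!’ starts a comment, which would not be appropriate for a TITL line that happens to contain that character.

```python
def parse_ins_line(line):
    """Split one .ins line into (KEYWORD, [items]); None for comments/blanks."""
    text = line.split("!", 1)[0].strip()   # drop trailing '!' comment
    if not text:
        return None
    keyword, *items = text.split()
    keyword = keyword.upper()              # upper and lower case are equivalent
    if keyword == "REM":
        return None
    return keyword, items


sample = [
    "REM kindly donated by Eleanor Dodson, University of York, UK",
    "latt -1",
    "UNIT 200 9 ! fudge unit-cell contents for delta-F data",
]
for line in sample:
    print(parse_ins_line(line))
```

The items are returned as strings so that the caller can decide how to interpret them, e.g. as numbers for UNIT or as a symmetry operator for SYMM.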

‘PATT 2’ specifies Patterson solution attempts using two different superposition vectors; for a difficult problem ‘PATT -10’ (10 trial vectors, the minus sign reduces the thresholds so that more peaks are tested etc., i.e. the program tries harder) would be more appropriate. Note that only this line needs to be changed (to TREF for an easy problem or TREF 5000 for a difficult one) to use the SHELXS direct methods to find the gold sites. In both cases the resulting heavy atom positions are written to the file barnase.res; in the PATT case the barnase.lst listing file includes the crossword table:

Name  At.No.     x       y       z     s.o.f.   Min. distances / PATSMF

AU1    84.4   0.1318  0.0494  0.5458   1.00     29.64
                                                48.4

AU2    80.0   0.2831  0.5398  0.6667   1.00     29.45   27.48
                                                80.1    62.3

AU3    48.2  -0.2260  0.3630  0.6308   1.00     28.91   35.00   35.46
                                                40.8    34.5    41.1

AU4    21.8   0.0246  0.0134  0.6418   1.00     27.28    9.61   37.97   30.80
                                                12.2     0.0    12.7     0.0

Understanding the crossword tables produced by SHELXS and SHELXD is the key to successful heavy atom location. The names and atomic numbers are invented by the program and need not be taken seriously. They are followed by the crystal co-ordinates of the heavy atoms and their site occupancies (always 1.0 except for atoms on special positions, in which case the value is less than one). Special positions should be treated with suspicion for heavy atom derivatives but can happen; they can be eliminated by making the second PATT parameter (the minimum intersite distance) negative. All the remaining items in the table are double entries; the top value is the minimum distance between one atom and its symmetry equivalents (first column) or between the atom marking the row and the atom marking the column (and all its symmetry equivalents; remaining columns). The bottom number is the corresponding Patterson superposition minimum function for all the vectors between one atom and its equivalents (first column) or between one atom and another, including all equivalents of the latter (remaining columns). Thus 40.8 is the Patterson minimum function for vectors between Au3 and its symmetry equivalents and 35.46 is the minimum distance in Angstroms between Au2 and Au3, taking symmetry into account. Au4 is clearly a spurious atom (or possibly an additional low-occupancy site) because of the low Patterson minimum function values involving it. In this example the distance information is not useful (except as a check that the sites are not too close to one another); formation of a trimer with equal gold-gold distances would not be obvious if an intertrimer Au...Au distance were shorter than the intratrimer distance.
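The way the double entries are indexed can be made explicit with a small lookup routine. The Python sketch below holds the barnase crossword table as two lists per atom (top row: minimum distances, bottom row: PATSMF values); the data structure and the function name are invented for illustration, and the first entry of each list is taken to be the self-vector column as described above.

```python
# (minimum distances, PATSMF values) per atom, copied from the barnase table.
# Entry 0 is the self-vector column; entry j+1 refers to the j-th atom above.
crossword = {
    "AU1": ([29.64], [48.4]),
    "AU2": ([29.45, 27.48], [80.1, 62.3]),
    "AU3": ([28.91, 35.00, 35.46], [40.8, 34.5, 41.1]),
    "AU4": ([27.28, 9.61, 37.97, 30.80], [12.2, 0.0, 12.7, 0.0]),
}
names = list(crossword)


def lookup(a, b):
    """Return (minimum distance, PATSMF) for atoms a and b (a == b: self)."""
    i, j = names.index(a), names.index(b)
    if i < j:
        i, j = j, i                      # the table is lower-triangular
    distances, pmf = crossword[names[i]]
    col = 0 if i == j else j + 1
    return distances[col], pmf[col]


print(lookup("AU3", "AU3"))   # self-vectors of Au3
print(lookup("AU2", "AU3"))   # cross-vectors between Au2 and Au3
print(lookup("AU4", "AU1"))   # a pair involving the doubtful site Au4
```

The first two calls reproduce the values 40.8 and 35.46 quoted in the text; the third shows one of the zero PATSMF entries that make Au4 suspect.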

4.5 Integrated Patterson and direct methods: SHELXD

SHELXD is designed both for the ab initio solution of macromolecular structures from atomic resolution native data alone and for the location of heavy-atom sites from ΔF or FA values at much lower resolution, in particular for the location of larger numbers of anomalous scatterers from MAD data. The dual-space approach of SHELXD was inspired by the Shake and Bake philosophy of Miller et al. (1993, 1994) but differs in many details, in particular in the extensive use it makes of the Patterson function that proves very effective in the applications involving ΔF or FA data. An advantage of the Patterson is that it provides a good noise filter for the ΔF or FA data: negative regions of the Patterson can simply be ignored. On the other hand, the direct methods approach is efficient at handling a large number of sites, whereas the number of Patterson peaks to analyze increases with the square of the number of atoms. Thus, for reasons of efficiency, the Patterson function is employed at two stages in SHELXD: at the beginning to obtain starting atom positions (otherwise random starting atoms would be employed) and at the end, in the form of the triangular crossword table as used in SHELXS, to recognize which atoms are correct. In between, several cycles of real/reciprocal space alternation are employed as in the ab initio structure solution, alternating between tangent refinement, E-map calculation, and peak-search, and possibly random omit maps, in which a specified fraction of the potential atoms are left out at random. Further details of the algorithms used in SHELXD are given by Sheldrick (1998b), Usón & Sheldrick (1999) and in the previous chapter.

From the user’s point of view, the input to SHELXD for the location of heavy atom sites is extremely similar to that for SHELXS. The PATT (or TREF) instruction is replaced by FIND followed by the expected number of heavy atom sites, and the minimum allowed distance between two sites is given after MIND, with a negative sign to indicate that a crossword table should be calculated for the best solutions. Thus for barnase the .ins file would contain TITL...UNIT as before followed by:

PATS

FIND 3

MIND -8

NTRY 100

HKLF 3

END

This would start with atoms consistent with the Patterson, which results in about 90% of solutions being correct. Leaving out ‘PATS’, i.e. starting instead from random atoms, reduces this percentage to about 20%. The NTRY instruction specifies the number of tries; if this instruction is missing the program runs for ever (which can sometimes be convenient; the job can be interrupted, e.g. by creating a file barnase.fin, when it looks as though the structure has been solved).

4.6 Practical considerations for locating heavy atoms

Since the input files for the direct and Patterson methods in SHELXS and the integrated method in SHELXD are so similar, it is easy to try all three methods for a difficult problem. The Patterson interpretation in SHELXS is a good choice if the heavy atoms have variable occupancies and it is not known how many heavy-atom sites need to be found; the direct methods approaches work best with equal atoms. In general, the conventional direct methods in SHELXS will tend to perform best in non-polar space groups that do not possess special positions. However, for more than about a dozen sites, the integrated approach in SHELXD is likely to prove the most effective; the SHELXD algorithm works best when the number of sites is known, at least approximately.

Especially for the MAD method, the quality of the data is decisive; it is essential to collect data with a high redundancy to optimize the signal to noise ratio and eliminate outliers. SHELX does not include a program to extract FA-values from MAD data but the Bruker XPREP program can be used for this. In general, a resolution of 3.5Å is adequate for the location of heavy-atom sites. A critical decision is the resolution limit at which to cut the data. XPREP calculates a correlation coefficient between the signed anomalous differences for each pair of wavelengths as a function of the resolution; in practice the correlation coefficient becomes smaller at high resolution. Experience indicates that the data should be truncated at the resolution at which the correlation coefficient falls below 30%. If this requires throwing away all the data, the anomalous signal is probably too weak for structure solution!
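The 30% rule of thumb can be applied mechanically once the correlation coefficients are available per resolution shell. The Python sketch below assumes that the shells are listed from low to high resolution as pairs (high-resolution limit in Å, correlation coefficient in %); both the function name and the shell values are invented for illustration and are not XPREP output.

```python
def anomalous_cutoff(shells, threshold=30.0):
    """Return the resolution limit of the last shell before CC drops below threshold.

    shells: list of (d_min, cc_percent), ordered from low to high resolution.
    Returns None if even the first shell is below the threshold.
    """
    cutoff = None
    for d_min, cc in shells:
        if cc < threshold:
            break
        cutoff = d_min
    return cutoff


# Hypothetical correlation coefficients between two wavelengths.
shells = [(8.0, 85.0), (6.0, 78.0), (5.0, 66.0), (4.0, 47.0),
          (3.5, 31.0), (3.0, 18.0), (2.7, 9.0)]
print(anomalous_cutoff(shells))
```

For these invented numbers the data would be truncated at 3.5Å; a return value of None corresponds to the case in which the anomalous signal is probably too weak for structure solution.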

When data have been collected at three or more different wavelengths without major problems such as crystal decay, icing, wavelength drift etc., SHELXS and SHELXD tend to perform better with FA-values than with anomalous ΔF-values. It is however debatable as to whether it is better to collect highly redundant data at a single wavelength corresponding to a significant f”-value for an SAS experiment or to spend the same time measuring less precise data at several wavelengths for a MAD experiment.

SHELXS and SHELXD only provide a method of locating the heavy atoms or anomalous scatterers. They do not include facilities for the further calculations necessary to obtain maps. The programs SHARP (de la Fortelle & Bricogne, 1997) and DM (Cowtan, 1999) are recommended for this purpose. Experience indicates that it is only necessary to refine the B-values of the heavy atoms using other programs; their coordinates are already rather precise.

4.7 A selenomethionine MAD example

The following example of GPATase, kindly provided as a test by Janet Smith and Joe Krahn, illustrates the application of SHELXD to a selenomethionine MAD problem. 22 unique selenium atoms were expected, but only 20 can be found, probably because the two N-terminal selenomethionines exhibit high thermal motion. Unusually, in this test the solution with the highest correlation coefficient (CC) is not correct, but the correct solution can easily be identified on the basis of the PATFOM figure of merit and by inspection of the crossword table. First the crossword table for the solution with the highest CC is shown; this is clearly wrong because there are many low and zero Patterson values (the bottom number of each pair):

Solution 20 (false) Initial CC 42.6 PATFOM 2.59

N self cross-vectors

1 19.0

0.0

2 12.8 4.0

0.0 0.0

3 50.4 31.8 33.3

0.2 0.0 0.0

4 50.2 16.1 18.7 33.3

0.0 0.0 0.0 8.7

5 51.5 43.9 43.7 31.2 35.0

15.3 0.0 0.0 0.4 0.0

6 51.1 36.8 35.2 31.9 35.0 18.1

0.0 0.0 0.0 0.0 10.3 0.0

7 34.5 13.6 13.3 36.4 12.2 31.8 26.5

0.0 0.0 0.0 1.1 0.0 0.0 0.0

8 56.0 21.2 23.6 37.8 5.6 31.2 33.8 15.1

1.1 0.0 0.6 0.8 0.0 0.1 0.0 0.0

...etc...

The correct solution (truncated) was as follows:

Solution 1 (correct) Initial CC 33.1 PATFOM 19.63

N self cross-vectors

1 31.4

10.7

2 50.7 33.4

20.7 11.5

3 51.2 35.0 31.2

19.3 11.5 13.1

4 40.8 5.7 35.7 31.2

5.1 7.5 7.1 8.2

5 31.5 53.6 35.4 36.2 49.8

8.0 8.9 15.5 11.6 8.6

6 29.2 14.4 31.5 34.9 16.4 40.9

9.6 9.3 8.9 8.0 7.6 5.7

7 52.6 13.3 36.5 26.5 9.2 46.8 15.4

7.9 9.2 11.7 9.0 6.6 9.1 8.5

8 41.6 45.3 26.7 37.7 44.2 13.3 40.3 41.9

8.8 9.4 8.3 6.5 7.0 5.4 6.0 8.6

...etc...

18 35.0 38.7 33.7 27.9 41.8 16.8 25.9 40.0

0.0 3.5 1.8 2.7 2.1 1.7 2.0 2.5

19 37.8 16.6 25.9 32.3 18.6 39.8 6.1 17.1

0.0 0.0 1.8 1.5 1.8 0.0 2.3 3.6

20 29.3 22.8 27.6 31.8 26.0 31.9 10.7 25.3

2.8 1.5 1.3 0.4 0.0 0.8 2.4 0.0

============================================

21 16.3 20.6 38.8 32.8 22.8 34.5 11.1 22.4

0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0

22 37.4 33.7 26.4 29.7 30.8 46.3 27.3 22.2

0.0 1.4 0.0 0.2 0.0 0.0 0.0 0.0

The sites 21 and 22 are incorrect, as indicated by the many Patterson zeros!
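The visual inspection for 'many Patterson zeros' can be mimicked by counting the zero entries in the bottom row of each site. The Python sketch below uses the (truncated) PMF rows for sites 18 to 22 of the correct solution as printed above; the criterion that more than half of the listed values are zero is an arbitrary choice made for this illustration and is not a threshold used by SHELXD.

```python
# Bottom (PMF) rows for sites 18-22, first eight columns only, as listed above.
pmf_rows = {
    18: [0.0, 3.5, 1.8, 2.7, 2.1, 1.7, 2.0, 2.5],
    19: [0.0, 0.0, 1.8, 1.5, 1.8, 0.0, 2.3, 3.6],
    20: [2.8, 1.5, 1.3, 0.4, 0.0, 0.8, 2.4, 0.0],
    21: [0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    22: [0.0, 1.4, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0],
}


def suspicious_sites(rows, max_zero_fraction=0.5):
    """Return the sites for which more than max_zero_fraction of PMF values are zero."""
    flagged = []
    for site, values in rows.items():
        zeros = sum(1 for v in values if v == 0.0)
        if zeros / len(values) > max_zero_fraction:
            flagged.append(site)
    return flagged


print(suspicious_sites(pmf_rows))
```

With this criterion only sites 21 and 22 are flagged, in agreement with the conclusion drawn from the table; sites 18 to 20 contain a few zeros but mostly positive values.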

5. A guide to SHELX for macromolecules: Refinement

5.1 Refinement of macromolecules with SHELXL

Recently, improvements in cryocrystallography, area detectors, and synchrotron data collection have led to a rapid increase in the number of high resolution ( ...

... LAST ! Approximate isotropic restraints for waters;

! ignored for isotropic

ANIS_* FE SD SG ! Make iron and all sulfur atoms anisotropic

CONN 0 O_201 > LAST ! Don't include water in connectivity array and

BUMP ! generate antibumping restraints automatically

SWAT ! Bulk solvent model

REM HOPE ! Anisotropic scaling not included

MERG 4 ! Remove MERG 4 if Friedel opposites should not be merged

MORE 1 ! MORE 0 for minimum, 2 or 3 for more output for diagnostics

REM Special restraints etc. specific to this structure follow:

REM HFIX 43 C1_1 !

DFIX C1_1 N_1 1.329 ! O=C(H)- (formyl) on N-terminus

DFIX C1_1 O1_1 1.231 ! incorporated into residue 1

DANG N_1 O1_1 2.250 !

DANG C1_1 CA_1 2.435 !

DFIX_52 C OT1 C OT2 1.249 !

DANG_52 CA OT1 CA OT2 2.379 ! Ionized carboxyl at C-terminus

DANG_52 OT1 OT2 2.194 !

SADI_54 0.04 FE SG_6 FE SG_9 FE SG_39 FE SG_42 ! Equal but unknown Fe-S

SADI_54 0.08 FE CB_6 FE CB_9 FE CB_39 FE CB_42 ! distances around Fe

REM HFIX 83 SG_38 SG_138 ! -SH for remaining cysteine (disordered)

DFIX C_18 N_26 1.329 ! Patch break in numbering - residues

DANG O_18 N_26 2.250 ! 18 and 26 are bonded but there is a

DANG CA_18 N_26 2.425 ! gap in numbering for compatibility

DANG C_18 CA_26 2.435 ! with other rubredoxins that have an

FLAT 0.3 O_18 CA_18 N_26 C_18 CA_26 ! extra loop

RTAB Omeg CA_18 C_18 N_26 CA_26 !

RTAB Phi C_18 N_26 CA_26 C_26 !

RTAB Psi N_18 CA_18 C_18 N_26 !

REM DFIX from CSD and R.A.Engh & R.Huber, Acta Cryst. A47 (1991) 392.

REM Remove 'REM ' before HFIX to activate H-atom generation

REM HFIX_ALA 43 N

REM HFIX_ALA 13 CA

REM HFIX_ALA 33 CB

REM HFIX_ASN 43 N

REM HFIX_ASN 13 CA

REM HFIX_ASN 23 CB

REM HFIX_ASN 93 ND2

REM HFIX_ASP 43 N

REM HFIX_ASP 13 CA

REM HFIX_ASP 23 CB

... etc ...

REM HFIX_VAL 43 N

REM HFIX_VAL 13 CA CB

REM HFIX_VAL 33 CG1 CG2

REM Peptide standard torsion angles and restraints

RTAB_* Omeg CA C N_+ CA_+

RTAB_* Phi C_- N CA C

RTAB_* Psi N CA C N_+

RTAB_* Cvol CA

DFIX_* 1.329 C_- N

DANG_* 2.425 CA_- N

DANG_* 2.250 O_- N

DANG_* 2.435 C_- CA

FLAT_* 0.3 O_- CA_- N C_- CA

REM Standard amino-acid restraints etc.

CHIV_ALA C

CHIV_ALA 2.477 CA

DFIX_ALA 1.231 C O

DFIX_ALA 1.525 C CA

DFIX_ALA 1.521 CA CB

DFIX_ALA 1.458 N CA

DANG_ALA 2.462 C N

DANG_ALA 2.401 O CA

DANG_ALA 2.503 C CB

DANG_ALA 2.446 CB N

RTAB_ASN Chi N CA CB CG

CHIV_ASN C CG

CHIV_ASN 2.503 CA

DFIX_ASN 1.231 C O CG OD1

DFIX_ASN 1.525 C CA

DFIX_ASN 1.458 N CA

DFIX_ASN 1.530 CA CB

DFIX_ASN 1.516 CB CG

DFIX_ASN 1.328 CG ND2

DANG_ASN 2.401 O CA

DANG_ASN 2.462 C N

DANG_ASN 2.455 CB N

DANG_ASN 2.504 C CB

DANG_ASN 2.534 CA CG

DANG_ASN 2.393 CB OD1

DANG_ASN 2.419 CB ND2

DANG_ASN 2.245 OD1 ND2

RTAB_ASP Chi N CA CB CG

CHIV_ASP C CG

CHIV_ASP 2.503 CA

DFIX_ASP 1.231 C O

DFIX_ASP 1.525 C CA

DFIX_ASP 1.530 CA CB

DFIX_ASP 1.516 CB CG

DFIX_ASP 1.458 CA N

DFIX_ASP 1.249 CG OD1 CG OD2

DANG_ASP 2.401 O CA

DANG_ASP 2.462 C N

DANG_ASP 2.455 CB N

DANG_ASP 2.504 C CB

DANG_ASP 2.534 CA CG

DANG_ASP 2.379 CB OD1 CB OD2

DANG_ASP 2.194 OD1 OD2

RTAB_CYS Chi N CA CB SG

CHIV_CYS C

CHIV_CYS 2.503 CA

DFIX_CYS 1.231 C O

DFIX_CYS 1.525 C CA

DFIX_CYS 1.458 N CA

DFIX_CYS 1.530 CA CB

DFIX_CYS 1.808 CB SG

DANG_CYS 2.401 O CA

DANG_CYS 2.504 C CB

DANG_CYS 2.455 CB N

DANG_CYS 2.462 C N

DANG_CYS 2.810 CA SG

... etc ...

RTAB_VAL Chi N CA CB CG1

RTAB_VAL Chi N CA CB CG2

CHIV_VAL C

CHIV_VAL 2.516 CA

DFIX_VAL 1.231 C O

DFIX_VAL 1.458 N CA

DFIX_VAL 1.525 C CA

DFIX_VAL 1.540 CA CB

DFIX_VAL 1.521 CB CG2 CB CG1

DANG_VAL 2.401 O CA

DANG_VAL 2.462 C N

DANG_VAL 2.497 C CB

DANG_VAL 2.515 CA CG1 CA CG2

DANG_VAL 2.479 N CB

DANG_VAL 2.504 CG1 CG2

WGHT 0.100000

FVAR 1.00000 0.5 0.5 0.5 0.5

RESI 1 MET

C1 1 -0.01633 0.35547 0.44703 11.00000 0.11817

O1 4 0.01012 0.32681 0.48491 11.00000 0.17896

N 3 0.00712 0.35446 0.37983 11.00000 0.11863

CA 1 0.05947 0.33273 0.35391 11.00000 0.06229

CB 1 0.07411 0.33732 0.27909 11.00000 0.15678

CG 1 0.03196 0.28864 0.22872 11.00000 0.14569

SD 5 0.04907 0.31846 0.14359 11.00000 0.23570

CE 1 0.11380 0.29170 0.12261 11.00000 0.21476

C 1 0.10634 0.38738 0.39766 11.00000 0.09178

O 4 0.10329 0.45513 0.41972 11.00000 0.16480

RESI 2 GLN

N 3 0.14741 0.35678 0.40741 11.00000 0.08599

CA 1 0.18940 0.39931 0.45565 11.00000 0.09291

CB 1 0.22933 0.34643 0.45886 11.00000 0.13253

CG 1 0.27354 0.38674 0.51173 11.00000 0.09866

CD 1 0.24547 0.38838 0.58387 11.00000 0.05748

OE1 4 0.22482 0.32772 0.60689 11.00000 0.16301

NE2 3 0.24704 0.46053 0.62045 11.00000 0.10164

C 1 0.22198 0.47895 0.43826 11.00000 0.08193

O 4 0.25019 0.48377 0.38408 11.00000 0.10402

RESI 3 LYS

N 3 0.21781 0.54034 0.48673 11.00000 0.07413

CA 1 0.25088 0.62006 0.47934 11.00000 0.05181

CB 1 0.21991 0.68311 0.51795 11.00000 0.09646

CG 1 0.16130 0.66288 0.49255 11.00000 0.10455

CD 1 0.12843 0.72146 0.52924 11.00000 0.22324

CE 1 0.10532 0.70085 0.60053 11.00000 0.26354

NZ 3 0.05943 0.74195 0.62796 11.00000 0.40338

C 1 0.30678 0.63497 0.50917 11.00000 0.05714

O 4 0.31462 0.59598 0.55179 11.00000 0.07986

... etc ...

RESI 12 GLU

N 3 0.41413 1.09215 0.48246 11.00000 0.06790

CA 1 0.37955 1.01183 0.48195 11.00000 0.05761

PART 1

CB 1 0.32666 1.01321 0.52971 21.00000 0.12219

CG 1 0.29679 0.93111 0.54638 21.00000 0.15333

CD 1 0.25357 0.93709 0.60700 21.00000 0.20272

OE1 4 0.24346 1.00278 0.63210 21.00000 0.26315

OE2 4 0.23012 0.87537 0.63031 21.00000 0.21375

PART 2

CB 1 0.32549 1.01718 0.52772 -21.00000 0.12065

CG 1 0.27756 0.94582 0.50954 -21.00000 0.15928

CD 1 0.22547 0.95184 0.55635 -21.00000 0.20457

OE1 4 0.20774 0.90241 0.59575 -21.00000 0.22329

OE2 4 0.20259 1.00588 0.55325 -21.00000 0.31441

PART 0

C 1 0.36477 0.97439 0.40859 11.00000 0.04768

O 4 0.34317 1.00861 0.37369 11.00000 0.06890

... etc ...

RESI 38 CYS

N 3 0.77141 0.92674 0.00625 11.00000 0.10936

CA 1 0.78873 0.97402 0.07449 11.00000 0.13706

PART 1

CB 1 0.83868 1.04271 0.05517 41.00000 0.11889

SG 5 0.89948 1.00271 0.02305 41.00000 0.18205

PART 2

CB 1 0.84149 1.03666 0.06538 -41.00000 0.14933

SG 5 0.83686 1.10360 0.01026 -41.00000 0.17328

PART 0

C 1 0.74143 1.01670 0.10383 11.00000 0.08401

O 4 0.70724 1.02319 0.06903 11.00000 0.10188

RESI 39 CYS

N 3 0.74699 1.04547 0.17051 11.00000 0.08888

CA 1 0.70682 1.09027 0.20876 11.00000 0.06869

CB 1 0.72588 1.11964 0.28230 11.00000 0.04269

SG 5 0.67932 1.17560 0.33481 11.00000 0.08016

C 1 0.70922 1.16093 0.17333 11.00000 0.06208

O 4 0.75427 1.20325 0.15858 11.00000 0.07437

... etc ...

RESI 52 ALA

N 3 0.33596 0.63469 0.69557 11.00000 0.04662

CA 1 0.30961 0.68882 0.74487 11.00000 0.08939

CB 1 0.34040 0.77357 0.74194 11.00000 0.13277

C 1 0.24852 0.67507 0.73435 11.00000 0.09032

OT1 4 0.22236 0.72170 0.77321 11.00000 0.11368

OT2 4 0.22682 0.61667 0.69191 11.00000 0.08341

RESI 54 FE

FE 6 0.72017 1.22290 0.43784 11.00000 0.07929

REM Only the waters with high occupancies and low U's have been

REM retained, and all the occupancies have been reset to 1, with

REM a view to running the automatic water divining. Water

REM residue numbers have been changed to start at 201.

RESI 201 HOH

O 4 0.13450 0.53192 0.60802 11.00000 0.13132

RESI 202 HOH

O 4 0.84795 0.53873 0.69488 11.00000 0.15273

RESI 203 HOH

O 4 0.27771 0.95750 0.25086 11.00000 0.11315

RESI 204 HOH

O 4 0.37066 0.71872 0.90376 11.00000 0.10854

... etc ...

RESI 233 HOH

O 4 0.27813 1.38725 0.25914 11.00000 0.10698

HKLF 3

END
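
The occupancies 21.00000 and -21.00000 in the PART 1 / PART 2 blocks of the listing above tie the two disorder components to a free variable on the FVAR line. The following Python sketch decodes such a number, assuming the convention described in the SHELX manual: a parameter written as 10*m + p is refined freely for m = 0, fixed at p for m = 1, set to p*fv(m) for m > 1 and to p*(fv(-m) - 1) for m < -1. The function name and the return format are hypothetical and only serve as illustration.

```python
def decode_parameter(x, fvar):
    """Interpret a SHELXL parameter written as 10*m + p.

    x    : number as it appears in the .ins file (e.g. 21.0, -41.0, 11.0)
    fvar : values from the FVAR line; fvar[0] is fv(1), the overall scale
    Returns (starting value, short description).
    """
    m = round(x / 10.0)          # integer "tens" part, may be negative
    p = x - 10.0 * m             # remainder, expected to satisfy |p| < 5
    if m == 0:
        return p, "refined freely"
    if m == 1:
        return p, "fixed"
    if m > 1:
        return p * fvar[m - 1], f"p * fv({m})"
    if m < -1:
        return p * (fvar[-m - 1] - 1.0), f"p * (fv({-m}) - 1)"
    raise ValueError("m = -1 is not covered by this sketch")


if __name__ == "__main__":
    fvar = [1.0, 0.5, 0.5, 0.5, 0.5]      # FVAR 1.00000 0.5 0.5 0.5 0.5
    for code in (11.0, 21.0, -21.0, 41.0, -41.0):
        print(code, decode_parameter(code, fvar))
```

With this reading, the two components of a disordered side chain have occupancies fv(2) and 1 - fv(2), so they always sum to one while a single parameter is refined.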

6. Frequently asked questions (by biocrystallographers)

Q1: Where is the manual?

A: PostScript and MS Word versions of the manual can be downloaded from the SHELX ftp site. This manual was written for small-molecule crystallographers; you will still need it as a reference book (it even has an index), but you should start by reading the material provided for the Workshop. There is also a lot of useful information on the SHELX homepage or accessible via links from it, including tutorials for which test data are available.

Q2: How do I transfer my data, including Rfree flags, from X-PLOR or other programs to SHELX?

A: Use the ‘Y’ option in SHELXPRO to convert the .fob file to .hkl, and the ‘I’ option to convert .pdb to .ins. Although SHELXL prefers intensities, for macromolecules it is OK to continue to use F-values if you were using them in X-PLOR. In CCP4, the mtz2various program can write SHELX format files. The Bruker XPREP program provides a space-group general option for transferring Rfree flags from one data-set to another, taking equivalent reflections into account.
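
A minimal Python sketch of what such a conversion has to produce, assuming the usual SHELX .hkl layout of h, k, l, F (or F²), sigma and an optional batch number in the fixed format (3I4,2F8.2,I4), with the list ended by a record with h = k = l = 0. The flag convention used here (-1 for Rfree reflections, 1 for the working set) and the function names are assumptions for illustration; in practice SHELXPRO or mtz2various does this for you.

```python
def format_hkl_record(h, k, l, value, sigma, flag=None):
    """One record in the fixed format (3I4, 2F8.2, I4)."""
    record = f"{h:4d}{k:4d}{l:4d}{value:8.2f}{sigma:8.2f}"
    if flag is not None:
        record += f"{flag:4d}"
    return record


def hkl_text(reflections, free_set):
    """Build the text of a .hkl file.

    reflections : iterable of (h, k, l, value, sigma)
    free_set    : set of (h, k, l) indices reserved for Rfree
    """
    lines = []
    for h, k, l, value, sigma in reflections:
        # Assumed convention: -1 marks an Rfree reflection, 1 the working set.
        flag = -1 if (h, k, l) in free_set else 1
        lines.append(format_hkl_record(h, k, l, value, sigma, flag))
    # A record with h = k = l = 0 ends the reflection list.
    lines.append(format_hkl_record(0, 0, 0, 0.0, 0.0, 0))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    data = [(1, 2, 3, 123.456, 4.5), (-1, 0, 7, 20.0, 1.25)]
    print(hkl_text(data, free_set={(-1, 0, 7)}), end="")
```

Because the columns are fixed-width, values that do not fit into eight characters would have to be rescaled before writing; the sketch does not check for this.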

Q3: I have a non-standard ligand, how do I make the topology file?

A: SHELXL doesn’t have a topology file; the restraints etc. are all included in the .ins file. One good way to generate these restraints is to find a suitable fragment in the CSD, then use the ‘J’ option in SHELXPRO. If it’s not in the CSD, you could do a quick small-molecule structure (using SHELX) and feed that into SHELXPRO.
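
To illustrate the kind of restraints meant here, the following Python sketch derives DFIX (1,2-distance) and DANG (1,3-distance) lines from a fragment with known geometry. It is not a description of what SHELXPRO does internally. It assumes Cartesian coordinates in Å and an explicit bond list supplied by the user, and the function name and atom names are hypothetical.

```python
import math
from itertools import combinations


def restraint_lines(coords, bonds):
    """Return DFIX/DANG lines for a fragment.

    coords : dict atom name -> (x, y, z), Cartesian, in Angstrom
    bonds  : list of (name, name) pairs that are chemically bonded
    """
    lines = []
    neighbours = {}
    for a, b in bonds:
        # 1,2-distances become bond-length restraints.
        lines.append(f"DFIX {math.dist(coords[a], coords[b]):.3f} {a} {b}")
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    bonded = {frozenset(pair) for pair in bonds}
    seen = set()
    for centre in sorted(neighbours):
        # Two atoms bonded to the same centre define a 1,3-distance.
        for a, b in combinations(sorted(neighbours[centre]), 2):
            pair = frozenset((a, b))
            if pair in bonded or pair in seen:
                continue
            seen.add(pair)
            lines.append(f"DANG {math.dist(coords[a], coords[b]):.3f} {a} {b}")
    return lines


if __name__ == "__main__":
    fragment = {"C1": (0.0, 0.0, 0.0), "C2": (1.5, 0.0, 0.0), "O1": (1.5, 1.2, 0.0)}
    for line in restraint_lines(fragment, [("C1", "C2"), ("C2", "O1")]):
        print(line)
```

Coordinates taken from a crystal structure are normally fractional and would first have to be converted to Cartesian coordinates using the unit cell; that step is left out here.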

Q4: Why are the R-factors different from X-PLOR etc.?

A: Check that you are using the same data (F or F², resolution cutoffs, Rfree flags?) and that the bulk solvent model is not causing problems (it tends to interact with the B-values, so it might be best to do a few refinement cycles first to sort this out).
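
A small Python sketch of why the choice of data matters: it evaluates the conventional R-factor, sum of ||Fo| - |Fc|| divided by sum of |Fo|, separately for the working and Rfree reflections after an optional resolution cutoff. The input layout and function names are assumptions for illustration, and Fc is assumed to be already on the same scale as Fo.

```python
def r_factor(pairs):
    """Conventional R = sum(| |Fo| - |Fc| |) / sum(|Fo|) over (Fo, Fc) pairs."""
    pairs = list(pairs)
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in pairs)
    denominator = sum(abs(fo) for fo, _ in pairs)
    return numerator / denominator


def r_work_and_free(reflections, d_min=None):
    """Split reflections by Rfree flag and apply a resolution cutoff.

    reflections : iterable of (Fo, Fc, d_spacing, is_free)
    d_min       : keep only reflections with d_spacing >= d_min (if given)
    """
    kept = [r for r in reflections if d_min is None or r[2] >= d_min]
    work = [(fo, fc) for fo, fc, _, is_free in kept if not is_free]
    free = [(fo, fc) for fo, fc, _, is_free in kept if is_free]
    return r_factor(work), r_factor(free)


if __name__ == "__main__":
    data = [
        (10.0, 9.0, 3.0, False),
        (20.0, 22.0, 2.0, False),
        (5.0, 4.0, 2.5, True),
        (8.0, 4.0, 1.2, False),   # removed by a 1.5 A cutoff
    ]
    print(r_work_and_free(data))
    print(r_work_and_free(data, d_min=1.5))
```

Running the two calls on the same list shows that the resolution cutoff alone changes the working R-factor, so two programs can only be compared when they use the same reflections and the same free set.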

Q5: After using SHELXPRO to prepare the .ins file from a PDB file and then running SHELXL, I get the message: ‘** No match for two atoms in DFIX **’ but otherwise everything seems OK.

A: This message probably refers to the fact that SHELXPRO labels the oxygens of the carboxy-terminus OT1 and OT2, so that different bond-length restraints can be applied than to the same type of amino acid within a peptide chain. This is normal and can be safely ignored. Other such messages should always be investigated carefully; they may indicate missing or bad restraints or bad initial connectivity (which can be corrected using BIND and FREE).

Q6: I can solve the structure by molecular replacement in space group P32 but the R-factors are high and the Rsym for P3221 was not much higher than for P32. What should I do?

A: Your structure may well be merohedrally twinned, but don’t panic! The E-statistics can be calculated using e.g. SHELXS, SHELXD or XPREP; ................