User guide for scaling data with DIALS

This document aims to provide a guide to using dials.scale, at various levels of depth. A new user is encouraged to read the symmetry and scaling section of the ‘Processing in detail’ tutorial for a quick overview of scaling in DIALS. For most users, it is likely to be sufficient to read only the ‘Byte-sized guide to common scaling options’ below, and return to the rest of the guide if further help is needed.

As a reminder, this is how to run routine data processing after integration to obtain a merged MTZ file:

dials.symmetry integrated.refl integrated.expt
dials.scale symmetrized.refl symmetrized.expt
dials.merge scaled.refl scaled.expt

The user is also advised to familiarise themselves with the standard program output, which may contain useful information, and the html report generated by scaling, which provides numerous plots relating to the merging statistics.

Byte-sized guide to common scaling options

To get started with using DIALS to scale data, load these 8 bits of information into memory, which cover the most commonly used options (with example values) for scaling routine macromolecular crystallography datasets.

When less is more - cutting back data. It is often the case that not all of the data are suitable for merging, so try these options to get the best of your data:

  • d_min=2.0 Apply a resolution cutoff with this simple command.

  • exclude_images="0:100:120" Remove a section of images, perhaps due to bad per-image R-merge values as result of radiation damage. Use multiple commands like this to exclude multiple ranges. The syntax is “a:b:c”, where a is the experiment number (starting at 0), b is the initial image to exclude and c is the final image to exclude.

  • exclude_datasets="10 50 79" Go all out and remove whole datasets, based on the dataset number; useful for large multi-crystal datasets.

Eyes on the prize - getting those MTZ files. You’re confident that this job is a good one and are already daydreaming about refining a high-quality structure. Use these options to get an mtz file from your final run of dials.scale without the extra step of using the full dials.export/dials.merge programs (which may have more extensive options):

  • merged_mtz=scaled.mtz Create a merged MTZ file, using the merging routines available in cctbx.

  • unmerged_mtz=unmerged.mtz Output the scaled data in unmerged MTZ format (but please try to be more creative in the filename choice).

Life is about choices Philosophy aside, one choice that is made automatically during scaling is whether summation or profile intensities seem to give the best estimate of the integrated intensity (or a combination of the two). Take back control, or just explore the options, with this command:

  • intensity_choice=profile Choose from profile, sum or combine (default is combine)

Impartial about partials By default, reflections with a partiality above 0.4 are included in the output datafiles and merging statistics from dials.scale. If you feel like being more discriminatory, or are concerned about the accuracy of the partial intensities, just change the threshold with this command:

  • partiality_threshold=0.99 Disregard all partialities below this value.

Errors in errors dials.scale adjusts the intensity errors by refining an error model (see the output log for the values). While this is an important correction and should improve the data quality for typical macromolecular crystallographic data, perhaps you have an edge case where the model refined is unrealistic. If so, then this correction can be disabled with this command:

  • error_model=None Don’t adjust error estimates.

Practicalities for large datasets

Depending on the computational resources available, scaling of large datasets ( > 1 million reflections) can become slow and memory intensive. There are several options available for managing this. The first option is separating the data in memory to allow blockwise calculations and parallel processing, using the option nproc= (a value of 4 or 8 is probably a reasonable choice). One of the most computationally-intensive parts of the algorithm is the final round of minimisation, which uses full-matrix methods. One can set full_matrix=False to turn this off, however no errors for the scale factors will be determined. A compromise is to set full_matrix_max_iterations=1 to do at least one iteration. A third option is to reduce the number of reflections used by the scaling algorithm during minimisation. If using reflection_selection.method=auto, the number of reflections should be manageable even for very large datasets, but this can always be controlled by the user. To get started, use the command dials.scale -ce2 to see the full set of available options in the section reflection_selection. Try setting reflection_selection.method=quasi_random alongside some of the quasi_random parameters.

Scaling against a reference dataset

DIALS contains functionality for scaling against a reference dataset, also referred to as targeted scaling. This reference can either be a dataset scaled with dials.scale, or an mtz file containing a scaled dataset. The scaled data (excluding the reference) will be output in a single .refl/.expt file.

Scaling against a dials reference dataset. In this example, reference.refl and reference.expt are from a dataset that has already been scaled with dials.scale. To scale another dataset (datafiles integrated.refl, integrated.expt) against this reference, one should use the following command:

dials.scale only_target=True integrated.refl integrated.expt reference.refl reference.expt

This will scale the intensities of the dataset to agree as closely as possible with the intensities of the reference dataset. The only_target=True command is important, else all the data will be scaled together and output in a joint output file.

Scaling against a reference mtz file. In this case, it is assumed that the intensity and variance columns of the mtz file have already been scaled. Reference scaling would be run with the following command:

dials.scale integrated.refl integrated.expt target_mtz=scaled.mtz

The reference scaling algorithm is the same regardless of the target datafile type.

Advanced use - Controlling the scaling models

There are three available scaling models available in dials.scale, accessible by the command line option model = physical array KB *auto. The physical model is similar to the scaling model used in the program aimless, the array model is based on the approach taken in xscale, while the KB model is a simple two-component model suitable for still-image datasets or very small rotation datasets (~ < 1 degree).

The auto option automatically chooses a default model and sensible parameterisation based on the oscillation range of the experiment. This will choose the physical model unless the oscillation range is < 1.0 degree, when the KB model will be chosen. If the oscillation range is < 60 degrees, the absorption correction of the physical model is disabled, as this may be poorly determined. The parameter spacing as a function of rotation is also adjusted down from the defaults if the oscillation range is below 90 degrees, to try to give a sensible automatic parameterisation.

The physical model consists of up to three components; a smoothly varying scale correction, a smoothly varying B-factor correction and an absorption surface correction (all on by default). These are turned on/off with the command line options physical.scale_correction=True/False physical.decay_correction=True/False physical.absorption_correction=True/False. The smoothly varying terms have a parameter at regular intervals in rotation, which can be specified with the physical.scale_interval and physical.decay_interval options. The number of parameters in the absorption surface is determined by the highest order of spherical harmonics function used, controlled by physical.lmax (recommended to be no higher than 6, 4 by default). There is also a weak physical.decay_restraint and strong physical.surface_weight to restrain the parameters of the decay and absorption terms towards zero. The physical model is suitable for most datasets, although the absorption correction should be turned off for datasets with low reciprocal space coverage.

The KB model applies a single scale factor and single B-factor to the whole dataset (B-factor can be turned off with decay_term=False). This is only suitable for very thin wedge/single-image datasets. If the KB model is used, it may be necessary to set full_matrix=False, as the full matrix minimisation round can be unstable depending on the number of reflections per dataset.

The array model consists of up to three components. The first ( array.decay_correction), consists of a smoothly varying correction calculated over a 2D grid of parameters, as a function of rotation vs resolution (d-value). The parameter interval in rotation is controlled by array.decay_interval, while the number of resolution bins is controlled by array.n_resolution_bins. The second (array.absorption_correction) consists of a smoothly varying correction calculated over a 3D grid of parameters, as a function of rotation, x and y position of the measured reflection on the detector. The spacing in rotation is the same as the decay correction, while the detector beginning is controlled with array.n_absorption_bins. Finally, an array.modulation_correction can be applied, which is a smooth 2D correction as a function of x and y position, controlled with array.n_modulation_bins, although this is off by default. The array model is only suitable for wide-rotation datasets with a high number of reflections and it should be tested whether the absorption correction is suitable, as it may lead to overparameterisation.

Advanced use - Choosing reflections to use for minimisation

To minimise the scaling model, a subset of reflections are used for efficiency. Four methods are available with the following command: reflection_selection.method=auto quasi_random intensity_ranges use_all.

By default, the auto method uses the quasi_random selection algorithm, with automatically determined parameters based on the dataset properties. If the dataset is small (<20k reflections), the use_all option is selected.

For each dataset, the quasi_random algorithm chooses reflection groups that have a high connectedness across different areas of reciprocal space, across all resolution shells. In multi-dataset scaling, a separate selection is also made to find reflection groups that have a high connectedness across the datasets (choosing from groups with an average I/sigma above a cutoff). The parameters of the algorithm are therefore controllable with the following options, if one explicity chooses reflection_selection.method=quasi_random: quasi_random.min_per_area, quasi_random.n_resolution_bins, quasi_random.multi_dataset.min_per_dataset and quasi_random.multi_dataset.Isigma_cutoff. The auto option sets these parameters in order to give sufficient connectedness across reciprocal space/datasets depending on the size of the dataset, number or parameters and number of datasets.

The intensity_ranges option chooses intensities between a range of normalised intensities (E2_range), between a range of I/sigma (Isigma_range) and between a resolution range (d_range). This will typically select around 1/3 of all reflections.

The use_all method simply uses all suitable reflections for scaling model minimisation, but may be prohibitively slow and memory-intensive for large datasets.