User guide for scaling data with DIALS¶
This document aims to provide a guide to using dials.scale, at various levels of depth. A new user is encouraged to read the Symmetry and Scaling sections of the Processing in detail tutorial for a quick overview of scaling in DIALS.
For most users, it is likely to be sufficient to read only the
‘Guide to common scaling options’ below,
and return to the rest of the guide if further help is needed.
As a reminder, this is how to run routine data processing after integration to obtain a merged MTZ file:
dials.symmetry integrated.refl integrated.expt
dials.scale symmetrized.refl symmetrized.expt
dials.merge scaled.refl scaled.expt
The user is also advised to familiarise themselves with the standard program output, which may contain useful information, and the html report generated by scaling, which provides numerous plots relating to the merging statistics.
Guide to common scaling options¶
These sections cover the most commonly used options (with example values) for scaling routine macromolecular crystallography datasets.
Cutting back data
After inspecting the statistics in the dials.scale.html file, such as
the R-merge vs batch plot, it is often the case that not all of the data are
suitable for merging, perhaps due to radiation damage or nonisomorphism.
This can be the case within a single sweep or across multiple sweeps in a
multi-sweep/multi-crystal experiment. These are example options to use:
d_min=2.0
Applies a resolution cutoff at the given resolution (in Angstrom).
exclude_images="100:120"
Removes a section of images for a single-sweep dataset. Multiple commands like this can be used to exclude multiple ranges. In the case of multiple sweeps, one must also provide the experiment ID that the exclusion should apply to, with the syntax exclude_images="a:b:c", where a is the experiment ID (a number starting at 0), b is the initial image to exclude and c is the final image to exclude.
exclude_datasets="10 50 79"
Removes whole datasets, based on the dataset number; useful for large multi-crystal datasets.
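For example, a run that applies a resolution cutoff and removes a range of damaged images from a single-sweep dataset might look like the following (the cutoff and image range are illustrative values only):
dials.scale symmetrized.refl symmetrized.expt d_min=2.0 exclude_images="100:120"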
Anomalous data
During scaling, the option anomalous=[True|False] determines whether anomalous pairs (I+/I-) are combined during scaling model minimisation and outlier rejection. By default, anomalous=False, which is suitable for data with some anomalous signal; however, for strongly anomalous data, the anomalous signal strength may be enhanced by scaling with anomalous=True.
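For example, to preserve a strong anomalous signal during scaling:
dials.scale symmetrized.refl symmetrized.expt anomalous=True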
Controlling the absorption correction
The default physical scaling model applies a relative absorption correction based
on the incoming and outgoing scattering vectors (this accounts for the relative
difference in absorption for different scattering paths through the crystal,
rather than absolute absorption of the beam by the crystal). This correction is
constrained, and the level of constraint and parameterisation can be changed
with the option absorption_level=[low|medium|high]. This aims to give relative absorption corrections of around 1%, 5% and 25% respectively, although the actual extent will depend on the dataset. To see the extent of the correction, check the ‘scaling models’ section in the dials.scale.html file.
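For example, to request a stronger absorption correction (the choice of level here is illustrative):
dials.scale symmetrized.refl symmetrized.expt absorption_level=high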
Generating MTZ files
For convenience, dials.scale
can invoke the exporting and merging programs to
generate unmerged and merged MTZ files (you may want to use the individual
programs to have more extensive control over the program options):
merged_mtz=scaled.mtz
Create a merged MTZ file, using the merging routines available in cctbx.
unmerged_mtz=unmerged.mtz
Output the scaled data in unmerged MTZ format.
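For example, to write out both unmerged and merged MTZ files in a single run (the output filenames are illustrative):
dials.scale symmetrized.refl symmetrized.expt unmerged_mtz=unmerged.mtz merged_mtz=scaled.mtz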
Choosing which integrated intensity to use
One choice that is made automatically during scaling is whether summation or profile intensities seem to give the best estimate of the integrated intensity (or a combination of the two). To see the result of this combination, inspect the table in the scaling log, which scores a set of Imid values on Rpim & CC1/2. To specify which intensity choice to use, there are a couple of options:
intensity_choice=[profile|sum|combine]
Choose from profile, sum or combine (default is combine).
combine.Imid=700.0
Specify the crossover value for profile-summation intensity combination.
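For example, to use only profile-fitted intensities rather than the combined estimate:
dials.scale symmetrized.refl symmetrized.expt intensity_choice=profile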
Adjusting the uncertainties/errors
All scaling programs adjust the uncertainties (sigmas) of the integrated data, to
account for additional systematic errors not sufficiently modelled during integration.
dials.scale
adjusts the intensity errors by refining a two-component error model
(see the output log or dials.scale.html
for the values). While this is
an important correction and should improve the data quality for typical
macromolecular crystallographic data, for poorer quality data the model parameters
may become overinflated.
If so, then this correction can be controlled with the parameters:
error_model=None
Don’t apply an error model.
error_model.basic.minimisation=None
Don’t refine the error model in this scaling run. Will keep the pre-existing error model parameters, or the default error model (a=1.0, b=0.02) on a first scaling run.
For the multi-sweep case, a single error model is applied to the combined dataset, on the assumption that a similar systematic error is affecting all sweeps. This approach may not be optimal for some datasets. As an alternative, a separate error model can be refined on sweeps individually or as groups.
error_model.grouping=[individual|grouped|combined]
If grouped is chosen, then the groups must be specified as below.
error_model_group='0 1' error_model_group='2 3'
e.g. groups the sweeps in pairs for error model refinement.
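For example, a four-sweep dataset could be scaled with a separate error model refined for each pair of sweeps (the grouping shown is illustrative):
dials.scale symmetrized.refl symmetrized.expt error_model.grouping=grouped error_model_group='0 1' error_model_group='2 3'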
Controlling partials
By default, reflections with a partiality above 0.4 are included in the output data files and merging statistics from dials.scale. This threshold can be changed with the parameter:
partiality_threshold=0.95
Disregard all measurements with partialities below this value.
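For example, to restrict the output and merging statistics to close-to-fully recorded reflections:
dials.scale symmetrized.refl symmetrized.expt partiality_threshold=0.95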
Practicalities for large datasets¶
Depending on the computational resources available, scaling of large datasets
( > 1 million reflections) can become slow and memory intensive.
There are several options available for managing this.
The first option is separating the data in memory to allow blockwise calculations
and parallel processing, using the option nproc=
(a value of 4 or 8 is
probably a reasonable choice).
One of the most computationally-intensive parts of the algorithm is the final
round of minimisation, which uses full-matrix methods. One can set
full_matrix=False to turn this off; however, no errors for the scale factors will be determined. A compromise is to set
full_matrix_max_iterations=1
to do at least one iteration.
A third option is to reduce the number of reflections used by the scaling
algorithm during minimisation. If using reflection_selection.method=auto, the number of reflections should be manageable even for very large datasets, but this can always be controlled by the user. To get started, use the command dials.scale -ce2 to see the full set of available options in the section reflection_selection. Try setting reflection_selection.method=quasi_random alongside some of the quasi_random parameters.
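As an illustration, a quicker run on a very large dataset might combine the options above (the values shown are suggestions only and should be adjusted to the available resources):
dials.scale symmetrized.refl symmetrized.expt nproc=8 full_matrix_max_iterations=1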
Scaling against a reference dataset¶
DIALS contains functionality for scaling against a reference dataset, also referred to as targeted scaling. This reference can either be a dataset scaled with dials.scale, or an mtz file containing a scaled dataset. The scaled data (excluding the reference) will be output in a single .refl/.expt file.
Scaling against a dials reference dataset.
In this example, reference.refl and reference.expt are from a dataset that has
already been scaled with dials.scale. To scale another dataset (datafiles integrated.refl integrated.expt) against this reference, one should use the following command:
dials.scale only_target=True integrated.refl integrated.expt reference.refl reference.expt
This will scale the intensities of the dataset to agree as closely as possible
with the intensities of the reference dataset. The only_target=True option is important; otherwise all the data will be scaled together and output in a joint output file.
Scaling against a reference mtz file. In this case, it is assumed that the intensity and variance columns of the mtz file have already been scaled. Reference scaling would be run with the following command:
dials.scale integrated.refl integrated.expt reference=scaled.mtz
The reference scaling algorithm is the same regardless of the target datafile type.
Advanced use - Controlling the scaling models¶
There are three scaling models available in dials.scale, accessible by the command line option model = physical array KB *auto.
The physical model is similar to the scaling model used in the program aimless,
the array model is based on the approach taken in xscale, while the KB model is
a simple two-component model suitable for still-image datasets or very small
rotation datasets (~ < 1 degree).
The auto option automatically chooses a default model and sensible parameterisation based on the oscillation range of the experiment. This will choose the physical model unless the oscillation range is < 1.0 degree, when the KB model will be chosen. If the oscillation range is < 60 degrees, the absorption correction of the physical model is disabled, as this may be poorly determined. The parameter spacing as a function of rotation is also adjusted down from the defaults if the oscillation range is below 90 degrees, to try to give a sensible automatic parameterisation.
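To override the automatic choice, the model can be set explicitly on the command line, for example:
dials.scale symmetrized.refl symmetrized.expt model=physical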
The physical model consists of up to three components; a smoothly varying
scale correction, a smoothly varying B-factor correction and an absorption surface
correction (all on by default). These are turned on/off with the command line options physical.scale_correction=True/False, physical.decay_correction=True/False and physical.absorption_correction=True/False.
The smoothly varying terms have a parameter at regular intervals in rotation, which can be specified with the physical.scale_interval and physical.decay_interval options. The number of parameters in the absorption surface is determined by the highest order of spherical harmonics function used, controlled by physical.lmax (4 by default; recommended to be no higher than 6). There is also a weak physical.decay_restraint and a strong physical.surface_weight to restrain the parameters of the decay and absorption terms towards zero.
The physical model is suitable for most datasets, although the absorption correction
should be turned off for datasets with low reciprocal space coverage.
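For example, for a dataset with low reciprocal space coverage, the absorption surface could be disabled:
dials.scale symmetrized.refl symmetrized.expt physical.absorption_correction=False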
The KB model applies a single scale factor and single B-factor to the whole dataset (the B-factor can be turned off with decay_term=False). This is
only suitable for very thin wedge/single-image datasets. If the KB model is
used, it may be necessary to set full_matrix=False
, as the full matrix
minimisation round can be unstable depending on the number of reflections per
dataset.
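For example, a very narrow wedge could be scaled with the following (illustrative) command:
dials.scale symmetrized.refl symmetrized.expt model=KB full_matrix=False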
The array model consists of up to three components. The first (array.decay_correction) consists of a smoothly varying correction calculated over a 2D grid of parameters, as a function of rotation vs resolution (d-value). The parameter interval in rotation is controlled by array.decay_interval, while the number of resolution bins is controlled by array.n_resolution_bins.
The second (array.absorption_correction) consists of a smoothly varying correction calculated over a 3D grid of parameters, as a function of rotation, x and y position of the measured reflection on the detector. The spacing in rotation is the same as for the decay correction, while the detector binning is controlled with array.n_absorption_bins.
Finally, an array.modulation_correction can be applied, which is a smooth 2D correction as a function of x and y position, controlled with array.n_modulation_bins, although this is off by default.
The array model is only suitable for wide-rotation datasets with a high
number of reflections and it should be tested whether the absorption
correction is suitable, as it may lead to overparameterisation.
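For example, one could test the array model without its absorption component (an illustrative starting point):
dials.scale symmetrized.refl symmetrized.expt model=array array.absorption_correction=False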
Advanced use - Choosing reflections to use for minimisation¶
To minimise the scaling model, a subset of reflections is used for efficiency. Four methods are available, selected with the option reflection_selection.method=auto quasi_random intensity_ranges use_all.
By default, the auto method uses the quasi_random selection algorithm, with
automatically determined parameters based on the dataset properties. If the
dataset is small (<20k reflections), the use_all
option is selected.
For each dataset, the quasi_random algorithm chooses reflection groups that
have a high connectedness across different areas of reciprocal space,
across all resolution shells. In multi-dataset scaling, a separate selection
is also made to find reflection groups that have a high connectedness across
the datasets (choosing from groups with an average I/sigma above a cutoff).
The parameters of the algorithm are therefore controllable with the following options, if one explicitly chooses reflection_selection.method=quasi_random: quasi_random.min_per_area, quasi_random.n_resolution_bins, quasi_random.multi_dataset.min_per_dataset and quasi_random.multi_dataset.Isigma_cutoff. The auto option sets these parameters in order to give sufficient connectedness across reciprocal space/datasets, depending on the size of the dataset, number of parameters and number of datasets.
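For example, to bypass the automatic selection and use the quasi_random method directly (its associated parameters, listed above, can then be adjusted as needed):
dials.scale symmetrized.refl symmetrized.expt reflection_selection.method=quasi_random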
The intensity_ranges option chooses intensities between a range of normalised intensities (E2_range), between a range of I/sigma (Isigma_range) and between a resolution range (d_range). This will typically select around 1/3 of all reflections.
The use_all
method simply uses all suitable reflections for scaling model
minimisation, but may be prohibitively slow and memory-intensive for large datasets.