User guide for scaling data with DIALS¶
This document aims to provide an in-depth guide on how to use several of the dials.scale command line options. It should be considered an ‘expert’ level guide, and the reader is encouraged to first read the ‘scaling beta lactamase’ tutorial for an overview of scaling in dials. This guide includes:
- how to customise the scaling models and general tips
- how to exclude data after a first round of scaling
- how to control which reflections are used for minimisation
- some tips for how to help performance when scaling large datasets
Guide to the different scaling models¶
There are three scaling models available in dials.scale, accessible
through the command line option model=physical array KB *auto.
The physical model is similar to the scaling model used in the program aimless,
the array model is based on the approach taken in xscale, while the KB model is
a simple two-component model suitable for still-image datasets or very small
rotation datasets (~ < 3 degrees).
The auto option automatically chooses a default model and sensible parameterisation based on the oscillation range of the experiment. model=auto will choose the physical model unless the oscillation range is < 1.0 degree, when the KB model will be chosen. The auto parameterisation rules are given at the bottom of this section.
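For example, to override the automatic choice and force a particular model, the option can be set explicitly (following the same command style as the examples later in this guide):
dials.scale ...... model=physical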
The physical model consists of up to three components - a smoothly varying
scale term, a smoothly varying B-factor term and an absorption surface
correction (all on by default). These are turned on/off with the command line
options scale_term=True/False, decay_term=True/False and absorption_term=True/False.
The smoothly varying terms have a parameter at regular intervals in rotation,
which can be specified with the scale_interval and decay_interval options.
The number of parameters in the absorption surface is determined by the
highest order of spherical harmonics function used, controlled by lmax
(4 by default; recommended to be no higher than 6). There is also a weak
decay_restraint and a strong surface_weight to restrain the
parameters of the decay and absorption terms towards 0.
The physical model is suitable for most datasets, although the absorption_term
should be turned off for datasets with low reciprocal space coverage.
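As an illustrative sketch (the interval value is for demonstration only, not a recommendation), a physical-model run with a coarser scale interval and the absorption correction disabled could be run as:
dials.scale ...... model=physical scale_interval=30.0 absorption_term=False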
The KB model applies a single scale factor and single B factor to the whole
dataset (the B-factor can be turned off with decay_term=False). This is
only suitable for very thin wedge/single-image datasets. If the KB model is
used, it may be necessary to set full_matrix=False, as the full matrix
minimisation round can be unstable depending on the number of reflections per
dataset.
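A sketch of such a run, using only the options described above, might be:
dials.scale ...... model=KB full_matrix=False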
The array model consists of up to three components. The first (decay_term)
consists of a smoothly varying correction calculated over a 2D grid of
parameters, as a function of rotation vs resolution (d-value). The parameter
interval in rotation is controlled by decay_interval, while the number
of resolution bins is controlled by n_resolution_bins.
The second (absorption_term) consists of a smoothly varying correction
calculated over a 3D grid of parameters, as a function of rotation, x and y
position of the measured reflection on the detector, controlled with
decay_interval and n_absorption_bins.
Finally, a modulation_term can be applied, which is a smooth 2D correction as a
function of x and y position, controlled with n_modulation_bins,
although this is off by default. The array model is only suitable for
wide-rotation datasets with a high number of reflections, and it should be tested
whether the absorption term is suitable, as it may lead to overparameterisation.
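A sketch of an array-model run using the parameters named above (the values shown are illustrative only):
dials.scale ...... model=array decay_interval=20.0 n_resolution_bins=10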
These rules are designed to give a sensible parameterisation, although not necessarily the best one for a given dataset. All parameters are controllable when model is not set to auto.
Excluding data/image handling after initial scaling¶
After a first round of scaling, it may be apparent that there are datasets, or regions of datasets, that are in poor agreement with the rest of the data, and it would be advantageous to remove this data and rescale (this is particularly relevant for thin-wedge rotation datasets and still image datasets). dials.scale provides two options for removing data, depending on whether one wishes to exclude a whole dataset or only part of a dataset.
To exclude whole datasets, we can take advantage of the fact that unique
experiment identifiers are assigned to the datasets as labels - these are
currently assigned as strings of integers i.e. ‘0’, ‘1’, ‘2’ etc. (these
can also be assigned manually with dev.dials.assign_experiment_identifiers).
The assignment of the identifiers can be seen in the scaling log / terminal
output, in one of the first lines of output:
Dataset unique identifiers are ['0', '1', '2', '3']
To exclude datasets, one therefore uses the exclude_datasets option:
dials.scale ...... exclude_datasets="0 2"
Alternatively, one can use the option use_datasets:
dials.scale ...... use_datasets="1 3"
These datasets are removed at the start of the program before scaling occurs,
and will not be contained in the output scaled.pickle and scaled_experiments.json.
To help with excluding parts of a dataset, image exclusion can be performed
using the command-line syntax exclude_images="exp_id:start:stop". Here
exp_id is the experiment identifier (a string) indicating the dataset,
and start and stop are integers that define the image range to exclude (the
excluded range includes both start and stop), i.e. to exclude images 101 to 200 from
experiment “0”, one would use exclude_images="0:101:200".
In the reflection table, the reflections corresponding to these images are
marked with the user_excluded_for_scaling flag, and the parameters of the
scaling models are adjusted to span the new image range. These data will not
be included in future scaling or data export, and further image exclusion
can be performed in subsequent scaling jobs.
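A follow-up scaling job that removes an image range from the first dataset might therefore look like the following (the input filenames are assumed to be the outputs of the previous scaling run):
dials.scale scaled_experiments.json scaled.pickle exclude_images="0:101:200"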
Note that it is recommended to only exclude data at the beginning or end of a
sweep. One can use it to exclude data in the middle of a sweep, however care
must be taken that only a short image range is excluded. If the interior
excluded range is of the order of the scaling model parameter spacing, this can
cause the scaling model minimisation to fail. In this case it would be better to
split the experiment with dials.slice_sweep
and then proceed with
excluding images at the edge of the new experiments.
Choosing reflections to use for minimisation¶
To minimise the scaling model, a subset of reflections is used for efficiency.
Four methods are available via the following command-line option:
reflection_selection.method=auto quasi_random intensity_ranges use_all.
By default, the auto method uses the quasi_random selection algorithm, with automatically determined parameters based on the dataset properties. If the dataset is small (<20k reflections), the use_all option is selected.
For each dataset, the quasi_random algorithm chooses reflection groups that
have a high connectedness across different areas of reciprocal space,
across all resolution shells. In multi-dataset scaling, a separate selection
is also made to find reflection groups that have a high connectedness across
the datasets (choosing from groups with an average I/sigma above a cutoff).
The parameters of the algorithm are therefore controllable with the following
options, if one explicitly chooses reflection_selection.method=quasi_random:
quasi_random.min_per_area, quasi_random.n_resolution_bins,
quasi_random.multi_dataset.min_per_dataset and
quasi_random.multi_dataset.Isigma_cutoff. The auto option sets these
parameters in order to give sufficient connectedness across reciprocal space/datasets,
depending on the size of the dataset, number of parameters and number of datasets.
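As a sketch, an explicit quasi-random selection could be requested as follows (the value shown is illustrative rather than a recommendation):
dials.scale ...... reflection_selection.method=quasi_random quasi_random.min_per_area=100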
The intensity_ranges option selects reflections within a range of
normalised intensities (E2_range), a range of I/sigma (Isigma_range)
and a resolution range (d_range). This will typically select
around 1/3 of all reflections, resulting in a longer runtime compared to the
quasi_random selection.
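To use this selection, set the method explicitly, optionally adjusting the ranges named above:
dials.scale ...... reflection_selection.method=intensity_ranges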
The use_all method simply uses all suitable reflections for scaling model
minimisation, but may be prohibitively slow and memory-intensive for large datasets.
Practicalities for large datasets¶
Depending on the computational resources available, scaling of large datasets ( > 1 million reflections) can become slow and memory intensive. There are several options available for managing this.
The first option is separating the data in memory to allow blockwise calculations
and parallel processing, using the option nproc= (a value of 4 or 8 is probably a
reasonable choice).
One of the most computationally intensive parts of the algorithm is
full matrix minimisation, which is by default performed after a quicker LBFGS
minimisation round. One can set full_matrix=False to turn this off, however
no errors for the inverse scale factors will be determined. A compromise is
to set full_matrix_max_iterations=1 to do at least one iteration.
A third option is to reduce the number of reflections used by the scaling
algorithm during minimisation. If using reflection_selection.method=auto,
the number of reflections should be manageable even for very large datasets, but
this can always be controlled by the user - see the previous section in this guide.