research papers

Acta Crystallographica Section A: Foundations and Advances
ISSN: 2053-2733

Universal parameters of bulk-solvent masks


(a) Centre for Integrative Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS–INSERM-UdS, 1 rue Laurent Fries, BP 10142, 67404 Illkirch, France, (b) Faculté des Sciences et Technologies, Université de Lorraine, BP 239, 54506 Vandoeuvre-les-Nancy, France, (c) Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA, and (d) Department of Bioengineering, University of California, Berkeley, Berkeley, California, USA
*Correspondence e-mail: pafonine@lbl.gov

Edited by P. M. Dominiak, University of Warsaw, Poland (Received 23 September 2023; accepted 8 January 2024; online 9 February 2024)

The bulk solvent is a major component of biomacromolecular crystals that contributes significantly to the observed diffraction intensities. Accurate modelling of the bulk solvent has been recognized as important for many crystallographic calculations. Owing to its simplicity and modelling power, the flat (mask-based) bulk-solvent model is used by most modern crystallographic software packages to account for disordered solvent. In this model, the bulk-solvent contribution is defined by a binary mask and a scale (scattering) function. The mask is calculated on a regular grid using the atomic model coordinates and their chemical types. The grid step and two radii, solvent and shrinkage, are the three parameters that govern the mask calculation. They are highly correlated and their choice is a compromise between the computer time needed to calculate the mask and the accuracy of the mask. It is demonstrated here that this choice can be optimized using a unique value of 0.6 Å for the grid step irrespective of the data resolution, and the radii values adjusted correspondingly. The improved values were tested on a large sample of Protein Data Bank entries derived from X-ray diffraction data and are now used in the computational crystallography toolbox (CCTBX) and in Phenix as the default choice.

1. Introduction

Bulk solvent (or disordered solvent) on average occupies about half the volume of a macromolecular crystal and noticeably contributes to the medium- and low-resolution structure factor intensities [see e.g. Weichenberger et al. (2015) for a recent review]. It is therefore important to include its contribution in the model-calculated structure factors to account for the entire unit-cell contents adequately. The procedure needs to be fast and accurate because this calculation is repeated many times during atomic model refinement.

The flat (or mask-based) bulk-solvent model (Jiang & Brünger, 1994) is currently the option of choice in most crystallographic software packages. The model first requires the definition of a solvent mask in the unit cell. This mask is a binary function calculated on a regular grid with values of zero inside the molecular region and one outside. The Fourier coefficients Fmask(s) of this binary mask are then calculated and scaled together with the structure factors Fcalc(s) calculated from the atomic model,

[{\bf F}_{\rm model} \left({\bf s} \right) = {k_{\rm total}} \left({\bf s} \right) \left [ {\bf F}_{\rm calc} \left({\bf s} \right) + {k_{\rm mask}} \left({\bf s} \right) {\bf F}_{\rm mask} \left({\bf s} \right) \right ] . \eqno(1)]

The resolution-dependent scales kmask(s) and ktotal(s) are obtained by fitting Fmodel(s) to the experimental data [see e.g. Afonine et al. (2013)].
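As a minimal sketch of how equation (1) is applied, the snippet below combines the two sets of structure factors reflection by reflection and evaluates a conventional R factor. The function names and the toy numbers are illustrative assumptions; the fitting of ktotal(s) and kmask(s) described by Afonine et al. (2013) is not reproduced here, and the scales are simply taken as given.

```python
def f_model(f_calc, f_mask, k_total, k_mask):
    """Eq. (1), applied reflection by reflection to parallel lists."""
    return [kt * (fc + km * fm)
            for fc, fm, kt, km in zip(f_calc, f_mask, k_total, k_mask)]


def r_factor(f_obs, f_mod):
    """R = sum ||F_obs| - |F_model|| / sum |F_obs| (amplitudes assumed on a common scale)."""
    num = sum(abs(fo - abs(fm)) for fo, fm in zip(f_obs, f_mod))
    return num / sum(f_obs)


# Toy numbers, not real data: one low-resolution and one higher-resolution reflection.
f_calc = [10 + 0j, 4 + 3j]
f_mask = [-5 + 0j, 0j]      # the mask term matters mainly at low resolution
k_total = [1.0, 1.0]
k_mask = [0.4, 0.0]         # hypothetical scales, taken as already fitted

f_mod = f_model(f_calc, f_mask, k_total, k_mask)
print(f_mod)
print(r_factor([10.0, 5.0], f_mod))
```

In this toy case the mask term lowers the amplitude of the low-resolution reflection and leaves the other one unchanged, which is the qualitative behaviour the flat model is meant to capture.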

The mask calculation as introduced by Jiang & Brünger (1994) uses the following parameters:

(i) The size of the grid step dgrid.

(ii) The solvent probe radius rprobe or rsolv.

(iii) The shrinking radius rshrink.

(iv) Tabulated atomic van der Waals radii.

The mask calculation procedure involves augmenting the atomic van der Waals radius with the solvent probe radius to create a sphere of combined radius around each atom. Grid points falling outside of these spheres, which define the expanded macromolecular region, are designated as the solvent-accessible region surrounding the macromolecule. Subsequently, all grid points within a distance rshrink from any point of the tentative solvent-accessible region defined above are assigned to the bulk-solvent region. The resulting mask is referred to as the bulk-solvent mask.
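The two-step procedure described above can be sketched as follows. This is a brute-force illustration under simplifying assumptions (a cubic P1 cell with orthogonal axes, periodic boundaries, a handful of atoms); the function name and data layout are hypothetical and do not correspond to the CCTBX implementation.

```python
import itertools


def bulk_solvent_mask(atoms, cell, d_grid, r_solv, r_shrink):
    """atoms: list of (x, y, z, r_vdw) in Å; cell: edge of a cubic P1 cell in Å.
    Returns (n, mask), where mask[(i, j, k)] is 1 in the bulk solvent and 0 elsewhere."""
    n = max(1, round(cell / d_grid))
    step = cell / n

    def dist2(p, q):
        # Squared minimum-image distance in the periodic cubic cell.
        total = 0.0
        for a, b in zip(p, q):
            d = abs(a - b) % cell
            d = min(d, cell - d)
            total += d * d
        return total

    points = list(itertools.product(range(n), repeat=3))

    # Step 1: points outside every sphere of radius r_vdw + r_solv are solvent-accessible.
    accessible = set()
    for idx in points:
        p = [i * step for i in idx]
        if all(dist2(p, a[:3]) >= (a[3] + r_solv) ** 2 for a in atoms):
            accessible.add(idx)

    # Step 2: every point within r_shrink of an accessible point joins the bulk solvent.
    m = int(r_shrink / step)
    offsets = [o for o in itertools.product(range(-m, m + 1), repeat=3)
               if sum(c * c for c in o) * step * step <= r_shrink ** 2]
    solvent = set()
    for idx in accessible:
        for o in offsets:
            solvent.add(tuple((i + c) % n for i, c in zip(idx, o)))

    return n, {idx: int(idx in solvent) for idx in points}


# Toy example: a single carbon-like atom (r_vdw = 1.7 Å, an assumed value) in a 6 Å cell.
n, mask = bulk_solvent_mask([(3.0, 3.0, 3.0, 1.7)], 6.0, 0.6, 1.1, 0.9)
print(n, sum(mask.values()) / len(mask))  # grid size and solvent fraction
```

Note that when r_shrink is smaller than the grid step the only offset is (0, 0, 0), so the second step leaves the mask unchanged in this sketch.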

An optimal choice for these parameters should balance structure factor accuracy and the time required to compute the mask and its Fourier coefficients by Fourier transformation. Based on two cases at 2.2 and 1.8 Å resolution, Jiang & Brünger (1994) determined optimal choices for rprobe, rshrink and dgrid to be 1.0, 1.1 and (dmin/4) Å, respectively, where dmin is the resolution of the data set. Later, Rees et al. (2005) showed that for low-resolution data sets a step size of (dmin/4) Å is too coarse, leading to inaccurate masks, and that a step size somewhere between dmin/5 and dmin/10 is more appropriate. While this solves the problem of mask accuracy at low resolution, at high resolution such a fine grid step will result in a significant (and unnecessary) increase in computational time.

In this work, we suggest that the grid step for mask calculation should not depend on the resolution. We demonstrate that using a step size of 0.6 Å, along with values of rsolv and rshrink set to 1.1 and 0.9 Å, respectively, does not compromise the accuracy of the mask or the calculation time. Therefore, we recommend this combination of parameters for calculating the bulk-solvent mask for structures of any resolution.

2. Method

2.1. Why is a common resolution-independent grid step expected?

The electron-density distribution of a macromolecule in a crystal is a peaky function, while the function that describes the bulk solvent is a flat function with a smooth border [see e.g. Fenn et al. (2010)]. Consequently, the Fourier coefficients that describe the bulk-solvent distribution decrease sharply, and usually become negligibly small, at resolutions better than dsolv ≃ 3.5–4.0 Å [see e.g. Phillips (1980), Jiang & Brünger (1994) or Afonine et al. (2013)]. To understand the consequences of this on the choice of the grid step, we refer to a one-dimensional example below.

For a periodic function of a single variable with period a, the integral Fourier transform gives an infinite number of Fourier coefficients F(h), where h is an integer, both positive and negative. When such a function is sampled on a regular grid with N points per interval, the discrete Fourier transform yields only N independent values Fgrid(h), given by

[{\bf F}_{\rm grid} \left(h \right) = {\bf F} \left(h \right) + \sum \limits_{m = 1}^\infty \, \left [ {\bf F} \left( h + mN \right) + {\bf F} \left( h - mN \right) \right ], \eqno(2)]

for −N/2 < h ≤ N/2 [see, for example, formula (4) in Ten Eyck (1977)]. Fgrid(h) differ from the respective F(h) by a convergent infinite series of correcting terms (Appendix A) where [\lim \nolimits_{\left| h \right| \to \infty} {\bf F}\left(h \right) = 0]. Let us suppose that F(h) values are exactly equal to zero for |h| > H = a/dsolv with some resolution limit dsolv. If N in (2) is sufficiently large, for example, N > 2H, all correcting terms with indices h ± mN are also zero, resulting in Fgrid(h) = F(h) as desired. Taking N larger than 2H, i.e. taking the grid step dgrid = a/N smaller than

[{a \over N} \, \lt \, {a \over {2H}} = {1 \over 2} \, d_{\rm solv}, \eqno(3)]

has no effect. Conversely, taking smaller N makes Fgrid(h) different from F(h) by at least one non-zero term F(h ± mN), m ≠ 0. The analogue of (2) for three-dimensional functions can be found in Sayre (1951), Lunin et al. (2002), Navaza (2002) and Afonine & Urzhumtsev (2004).

This suggests the potential existence of a universally optimal grid step [d_{\rm grid}^0] for the problem under study, which is related to dsolv ≃ 3.5 Å in a manner similar to (3), albeit with a scale factor that may not be equal to [{1 \over 2}]; the latter arises from the facts that these high-resolution structure factors may be different from exactly zero and the Fourier analysis is carried out in three-dimensional space.
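The one-dimensional argument can be checked numerically. The sketch below builds a band-limited periodic function with F(h) = 0 for |h| > H (the coefficient values are arbitrary illustrative choices), samples it on grids with N > 2H and N < 2H, and compares the discrete Fourier coefficients with the true ones. All names are hypothetical.

```python
import cmath


def sample(coeffs, n_grid):
    """Evaluate f(x) = sum_h F(h) exp(i 2 pi h x) at the grid nodes x_n = n / N."""
    return [sum(F * cmath.exp(2j * cmath.pi * h * n / n_grid) for h, F in coeffs.items())
            for n in range(n_grid)]


def dft_coefficient(values, h):
    """F_grid(h) = (1/N) sum_n f(x_n) exp(-i 2 pi h n / N)."""
    n_grid = len(values)
    return sum(v * cmath.exp(-2j * cmath.pi * h * n / n_grid)
               for n, v in enumerate(values)) / n_grid


H = 3
coeffs = {h: 1.0 / (1 + abs(h)) for h in range(-H, H + 1)}  # F(h) = 0 for |h| > H

fine = sample(coeffs, 8)    # N = 8 > 2H: no aliasing expected
coarse = sample(coeffs, 5)  # N = 5 < 2H: F(2 - 5) = F(-3) folds onto h = 2

print(abs(dft_coefficient(fine, 2) - coeffs[2]))    # close to zero
print(abs(dft_coefficient(coarse, 2) - coeffs[2]))  # close to |F(-3)| = 0.25
```

With N = 5 the coefficient at h = 2 picks up the single non-zero correcting term F(−3), as predicted by equation (2).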

2.2. Models and data

The search for the optimal bulk-solvent mask parameters was conducted using 277 quality-filtered models and X-ray diffraction data obtained from the Protein Data Bank (PDB; Burley et al., 2021). The quality filters included a crystallographic R factor better than 0.25, overall and per-resolution shell data completeness above 95%, no data pathologies such as twinning, and a relatively high upper data resolution limit (dmin ≤ 3.0 Å). To accelerate the calculations, we excluded very large models. The results obtained with these 277 models were then validated with a larger set of 2077 structures used in a previous bulk-solvent study (Afonine et al., 2013) and representing a broad range of model sizes and data resolutions, from subatomic to low.

Among the 277 models, 54% were refined using Refmac (Murshudov et al., 2011), 25% using Phenix (Afonine et al., 2012), 14% using CNS/X-plor (Brünger et al., 1998), 4% using BUSTER/TNT (Tronrud et al., 1987; Roversi et al., 2000; Blanc et al., 2004) and 3% using SHELX (Sheldrick, 2015). These programs may potentially employ different types of bulk-solvent models, for example the Babinet-based model (Langridge et al., 1960; Moews & Kretsinger, 1975) in SHELX, Refmac and BUSTER/TNT, as well as different parameters for mask calculations when the mask-based solvent model was used, such as in Phenix, CNS/X-plor and again Refmac. We believe that this diversity in refinement programs, each with its distinct formulation of bulk-solvent modelling, helps mitigate any potential model bias related to the solvent parameters used in these programs.

2.3. Finding optimal values for rsolv, rshrink and dgrid

For each selected PDB entry, the values of dgrid were systematically sampled in the range between 0.2 and 1 Å in steps of 0.1 Å, and both rsolv and rshrink were sampled between 0.5 and 1.5 Å, also in steps of 0.1 Å. For each trial triplet of values (rsolv, rshrink, dgrid), the scales ktotal(s) and kmask(s) in (1) were re-calculated as detailed by Afonine et al. (2013), and the R-factor values, referred to as R(4), were calculated using all reflections up to 4 Å resolution. In what follows, Rn(4) stands for R(4) calculated for the structure numbered n. These values were the principal information to identify potential universal values for [r_{\rm solv}^0], [r_{\rm shrink}^0] and [d_{\rm grid}^0]. The details follow in Section 3.
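The sampling described above amounts to an exhaustive sweep over 11 × 11 × 9 = 1089 triplets per structure. A minimal sketch of the enumeration is given below; the callable `r4` stands in for the full recalculation of the mask, the scales and R(4), and is purely hypothetical.

```python
import itertools


def frange(start, stop, step=0.1):
    """Inclusive list of one-decimal values, built from integers to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n + 1)]


D_GRID = frange(0.2, 1.0)    # grid steps, Å
R_SOLV = frange(0.5, 1.5)    # solvent radii, Å
R_SHRINK = frange(0.5, 1.5)  # shrinkage radii, Å


def sweep(r4):
    """Return {(r_solv, r_shrink, d_grid): R(4)} for one structure.
    r4 is a hypothetical callable that recomputes the mask, the scales and R(4)."""
    return {t: r4(*t) for t in itertools.product(R_SOLV, R_SHRINK, D_GRID)}


# Placeholder R(4) function, only to show the shape of the result.
table = sweep(lambda r_solv, r_shrink, d_grid: 0.2)
print(len(table))
```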

3. Results

3.1. Variation of the optimal R factor with the mask grid step

The search for common parameters was based on the hypothesis that there is a common behaviour of Rn(4) with parameter values for different structures. First, we tried to decouple the search for optimal dgrid and (rsolv, rshrink). This was achieved by finding the combination of (rsolv, rshrink) that minimizes Rn(4) for each trial dgrid and for each structure n,

[{\bar R}_n^{(4)} \left( d_{\rm grid} \right) = \min \limits_{r_{\rm solv}, r_{\rm shrink}} \left \{ R_n^{(4)} \left( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right ) \right \}. \eqno(4)]

Obviously, these values are different for each structure, but they vary with dgrid very similarly. In particular, this is true for their variation around the average of [{\bar R}_n^{(4)} ( d_{\rm grid} )] over grid steps

[\Delta {\bar R}_n^{(4)} \left( d_{\rm grid} \right) = {\bar R}_n^{(4)} \left( d_{\rm grid} \right) - \left \langle {\bar R}_n^{(4)} \left( d_{\rm grid} \right) \right \rangle_{\rm grid}. \eqno(5)]

Subtracting the average in (5) makes the dependency of [\Delta {\bar R}_n^{(4)} ( d_{\rm grid} )] on dgrid similar for all structures, which in turn makes it possible to analyse the average of these dependencies over all structures (Fig. 1).
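Equations (4) and (5) reduce a per-structure table of R(4) values to one curve over the grid step. A minimal sketch, assuming the table is a dictionary keyed by (rsolv, rshrink, dgrid) and using made-up R values:

```python
def best_per_grid_step(table):
    """Eq. (4): minimum of R(4) over (r_solv, r_shrink) for each d_grid."""
    best = {}
    for (r_solv, r_shrink, d_grid), r in table.items():
        best[d_grid] = min(r, best.get(d_grid, float("inf")))
    return best


def deviation_from_mean(best):
    """Eq. (5): subtract the average over grid steps from each value."""
    mean = sum(best.values()) / len(best)
    return {d_grid: r - mean for d_grid, r in best.items()}


# Made-up values for one structure and two grid steps.
table = {
    (1.1, 0.9, 0.4): 0.20, (1.0, 1.0, 0.4): 0.22,
    (1.1, 0.9, 0.6): 0.21, (1.0, 1.0, 0.6): 0.24,
}
best = best_per_grid_step(table)
print(best)
print(deviation_from_mean(best))
```

Averaging the output of `deviation_from_mean` over all structures, grid step by grid step, would give the kind of curve plotted in Fig. 1.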

Figure 1
The variation [\Delta {\bar R}_n^{(4)}] in the [\bar R_n^{(4)}] factor, as defined in the text, as a function of the grid step dgrid. Each data point is the average of [\Delta {\bar R}_n^{(4)}] across all structures. Intervals of 1σ are given for each grid step.

It is to be expected that [\Delta {\bar R}_n^{(4)} ( d_{\rm grid} )] should increase with step size. However, the value does not change significantly for dgrid in the range of 0.2–0.4 Å, suggesting that steps smaller than 0.4 Å are unnecessarily small. Above this step value, [\Delta {\bar R}_n^{(4)} ( d_{\rm grid} )] starts to increase, and the goal is to find a compromise between the introduced errors and the gain in computation time. Increasing the step from 0.4 Å to 0.6 Å or 0.8 Å reduces the grid size, and therefore the number of computing operations, by about four times or eight times, respectively.
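For a fixed unit cell, the number of grid points scales as the inverse cube of the grid step, which is what drives the cost comparison. The short sketch below (function name hypothetical) evaluates this ratio for the step sizes discussed here and for the resolution-dependent step dmin/4 relative to a fixed 0.6 Å step.

```python
def relative_grid_size(d_grid, d_ref):
    """Number of grid points at step d_grid relative to step d_ref, same unit cell."""
    return (d_ref / d_grid) ** 3


# A 0.4 Å grid compared with coarser 0.6 Å and 0.8 Å grids.
print(relative_grid_size(0.4, 0.6))  # about 3.4 times more points
print(relative_grid_size(0.4, 0.8))  # 8 times more points

# Resolution-dependent step d_min / 4 compared with a fixed 0.6 Å step.
for d_min in (1.0, 2.4, 3.0, 4.0):
    print(d_min, relative_grid_size(d_min / 4.0, 0.6))
```

The loop illustrates the trade-off: a step of dmin/4 is finer than 0.6 Å for data better than 2.4 Å resolution and coarser for lower-resolution data.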

A step size of 1.0 Å resulted in very large errors and was excluded from further analysis. Calculations with a step size of 0.9 Å resulted in a large number of outliers with large [\Delta {\bar R}_n^{(4)}], making this step size also unsuitable. This leads to 0.4–0.8 Å as the range for the grid-step search.

3.2. Acceptable combinations of parameters

Next, for each model n, we analysed the parameter values that lead to the lowest Rn(4) value across all combinations of (rsolv, rshrink, dgrid),

[\eqalignno{{\bar R}_n^{(4)} = & \, \min \limits_{r_{\rm solv}, r_{\rm shrink}, d_{\rm grid}} \left \{ R_n^{(4)} \left ( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right ) \right \} \cr = & \, \min \limits_{d_{\rm grid}} \left \{ {\bar R}_n^{(4)} \left( d_{\rm grid} \right) \right \} . &(6)}]

It is possible that, for a given structure, several combinations of (rsolv, rshrink, dgrid) result in Rn(4) values that are close to the global minimum of (6). To address such small fluctuations in the Rn(4) values, we introduce a parameter ɛR considering all values [R_n^{(4)} ( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} ) \le {\bar R}_n^{(4)} + \varepsilon_R] to be as good as [{\bar R}_n^{(4)}], where the value of ɛR varies in the range 0.001–0.002.

The parameter values (rsolv, rshrink, dgrid) corresponding to (6) are expected to vary from one structure to another, and we are looking for the combinations that are persistent over all structures. As a formal quantitative measure, for each set of parameters (rsolv, rshrink, dgrid) and for each structure n, we calculate a non-negative value

[\eqalignno{ & \Delta R_n^{(4)} \left( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \semi \varepsilon_R \right) \cr & \quad = \max \left \{ R_n^{(4)} \left( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right) - {\bar R}_n^{(4)} - \varepsilon_R \semi 0 \right \} . &(7)}]

To be able to combine the distribution of (rsolv, rshrink, dgrid) for each structure into one cumulative distribution over all structures, we convert [\Delta R_n^{(4)}] in (7) into

[P_n \left( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right) = \exp{ \left [ -C \, \Delta R_n^{(4)} \left( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \semi \varepsilon_R \right) \right ]} \eqno(8)]

with constants C > 0 and ɛR. The product of (8) over all structures,

[P \left ( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right) = \prod \limits_n P_n \left( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right) , \eqno(9)]

reflects both the contrast of the lowest Rn(4) for an individual structure and the persistence of the parameter values over all structures. P(rsolv, rshrink, dgrid) varies from 0 to 1; the higher the value of (9), the more preferable the combination of parameters. Calculations with data sets of different sizes suggested the choice of C in the range between 0.01 and 1.0 and ɛR as stated above in order to obtain a decent contrast while keeping points with neighbouring Rn(4) values. In general, we observed that varying the constants C and ɛR around the values given above modifies the contrast of the distribution (9) but does not influence the location of its peaks.
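Equations (6)–(9) can be sketched in a few lines. The snippet assumes that R values are stored as fractions rather than percentages (an assumption made here so that ɛR = 0.002 and C = 0.5 give a usable contrast), that every structure was evaluated on the same set of triplets, and it uses made-up R values; the function name is hypothetical.

```python
import math


def preference(tables, eps_r=0.002, c=0.5):
    """tables: one dict per structure, mapping (r_solv, r_shrink, d_grid) -> R(4).
    Returns P for each triplet, following eqs (6)-(9)."""
    p = {}
    for table in tables:
        r_best = min(table.values())                     # eq. (6)
        for triplet, r in table.items():
            delta = max(r - r_best - eps_r, 0.0)         # eq. (7)
            p[triplet] = p.get(triplet, 1.0) * math.exp(-c * delta)  # eqs (8), (9)
    return p


# Two toy structures, two candidate triplets (values are illustrative only).
a, b = (1.1, 0.9, 0.6), (1.0, 1.0, 0.6)
tables = [{a: 0.200, b: 0.212}, {a: 0.300, b: 0.312}]
p = preference(tables)
print(p[a], p[b])
```

A triplet that stays within ɛR of the best R(4) for every structure keeps P = 1, while each structure in which it falls short multiplies P by a factor below one.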

Fig. 2 shows the results with ɛR = 0.002 and C = 0.5. As expected, the distribution shows that mask shrinking does not impact the results when rshrink < dgrid (rectangular bottom-left regions). Also, the distribution shows a clear nearly linear correlation between rsolv and rshrink, giving preferable values roughly at the line rsolv − rshrink ≃ 0.2 Å for each grid step. Finally, it shows that the optimal radii (rsolv, rshrink) cluster in the range (1.1 ± 0.1 Å, 0.9 ± 0.1 Å).

Figure 2
The distribution P(rsolv, rshrink, dgrid) × 10² of the variation in Rn(4) with respect to the best value [{\bar R}_n^{(4)}] for all combinations of the mask parameters. The values on the axes are rsolv × 10 (horizontal) and rshrink × 10 (vertical). High P values correspond to the (rsolv, rshrink, dgrid) combinations giving Rn(4) close to the minimum best value of [{\bar R}_n^{(4)}]. The colour scheme indicates the P-function value ranges: red (P < 0.01), yellow (0.01 ≤ P < 0.2), light green (0.2 ≤ P < 0.9) and dark green (P ≥ 0.9).

As expected, increasing the grid step makes Rn(4) worse. In agreement with the first test (Fig. 1), most frequently the lowest Rn(4) occurred for a grid step size of 0.3–0.4 Å (P values up to 0.99). Consequently, a step of 0.4 Å may be considered as a candidate for the most accurate calculations since a smaller step of 0.3 Å does not significantly improve the R factors while leading to an increased computational time. Using a grid with step sizes of 0.5–0.6 Å makes it possible for Rn(4) to be close to [{\bar R}_n^{(4)}], indicated by high values of the function P(rsolv, rshrink, dgrid) in the range 0.91–0.93. Increasing the step further reduces the maximum P-function value to 0.83. Since a larger grid step is preferable to reduce the computing time, dgrid = 0.6 Å is a good potential compromise candidate for the universal value.

The values of the radii (rsolv, rshrink) leading to the minimum [{\bar R}_n^{(4)}] value in (6) also varied only slightly over the trial grid step sizes. This provided a relatively small number of tentative combinations (rsolv, rshrink, dgrid) to identify the optimal ones which we denote [ ( r_{\rm solv}^0, r_{\rm shrink}^0, d_{\rm grid}^0 )].

3.3. Optimal set of parameters

The analysis described in Sections 3.1–3.2 results in a range of (rsolv, rshrink, dgrid) parameters minimizing R(4) on average for all test models. These values, however, do not necessarily lead to the lowest R(4) for a particular model.

Next, we ask for what fraction of structures, and by how much, each of these combinations gives a value of [R_n^{(4)} (r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} )] that is larger than the lowest value [{\bar R}_n^{(4)}] in (6). To answer this question, we calculate the fraction p(ΔR) of the structures with the difference

[R_n^{(4)} \left(r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} \right) - {\bar R}_n^{(4)} \, \gt \, \Delta R \eqno(10)]

for different ΔR values. The expected solution corresponds to the minimum of the p(ΔR) function. For the sake of completeness, we calculated p(ΔR) for all triplet values of the parameters considered above.
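The fraction p(ΔR) in (10) is a simple count over structures. A minimal sketch, assuming one dictionary of R(4) values per structure keyed by the parameter triplet and using made-up numbers (the function name is hypothetical):

```python
def fraction_exceeding(tables, triplet, delta_r):
    """Eq. (10): fraction of structures for which R(4) at `triplet` exceeds
    the best R(4) of that structure by more than delta_r."""
    count = sum(1 for table in tables
                if table[triplet] - min(table.values()) > delta_r)
    return count / len(tables)


# Four toy structures, two candidate triplets (illustrative values only).
a, b = (1.1, 0.9, 0.6), (1.0, 1.0, 0.6)
tables = [
    {a: 0.200, b: 0.210},
    {a: 0.251, b: 0.250},
    {a: 0.180, b: 0.190},
    {a: 0.300, b: 0.303},
]
for delta_r in (0.0, 0.002, 0.005):
    print(delta_r, fraction_exceeding(tables, a, delta_r),
          fraction_exceeding(tables, b, delta_r))
```

A combination is preferable when its curve drops to zero at the smallest ΔR, which is how the plots in Fig. 3 are read.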

The combination (rsolv = 1.0 Å, rshrink = 1.0 Å) gives poor results for all grid step sizes and the combination (rsolv = 1.2 Å, rshrink = 0.8 Å) gives results acceptable only for very small grid steps, dgrid = 0.3–0.4 Å (Fig. 3). The best results are observed for the sets of rsolv, rshrink with rsolv − rshrink = 0.2 Å, with a slight preference for the combination (rsolv = 1.1 Å, rshrink = 0.9 Å, dgrid = 0.6 Å). Here, Rn(4) increased by less than 0.5% for all structures of the search set except one, for which this value was below 0.6%. The same plots (Fig. 3) indicate that if more accurate calculations are required, then the combination (rsolv = 1.1 Å, rshrink = 0.8 Å, dgrid = 0.4 Å) is optimal. However, using this finer grid step would lead to a nearly fourfold increase in the number of grid points and, consequently, in the number of computational operations required. Conversely, for a very large structure, if a coarser grid is acceptable, a possible combination would be (rsolv = 1.0 Å, rshrink = 0.8 Å, dgrid = 0.7 Å), resulting in only a slight increase in overall R factors.

Figure 3
The fraction of the models that satisfy [R_n^{(4)} ( r_{\rm solv}, r_{\rm shrink}, d_{\rm grid} ) - {\bar R}_n^{(4)} \, \gt \, \Delta R] shown as a function of ΔR. The colour scheme for the grid step is the same for all plots and is given in the top right plot. The values of [(r_{\rm solv} \semi r_{\rm shrink} )] are indicated in each plot.

3.4. New versus old mask calculation parameters

Finally, for each model from the complete data set, we analysed how much the R(4) and the R factor calculated using all reflection data change if the new mask calculation parameter values ([r_{\rm solv}^0] = 1.1 Å, [r_{\rm shrink}^0] = 0.9 Å, [d_{\rm grid}^0] = 0.6 Å) are used instead of the values of (rsolv = 1.1 Å, rshrink = 1.0 Å, dgrid = dmin/4) used previously.

Fig. 4 shows that Rn(4) varies little and typically remains within ±0.3% for most structures, with the exception of a few cases where it varies within ±0.5%. We consider these variations negligible. As expected, Rn changes even less than Rn(4), since the bulk-solvent contribution vanishes beyond 3–4 Å resolution.

Figure 4
The variation in ΔR(4) calculated for the test set of 2077 models when moving from the conventional set of mask parameters to the selected set [r_{\rm solv}^0, r_{\rm shrink}^0, d_{\rm grid}^0] of optimal values. Each point corresponds to an individual model. The plots show the distribution of the variation in ΔR(4) (vertical) versus (left) the variation ΔR in the overall R factor calculated for the whole set of reflections and (right) the R(4) value itself. For the model of 3b6a (indicated by red arrows) the values (0.64; 0.92) become (0.26; 0.36) after the removal of isolated small-volume regions appearing in the new mask.

The only notable outlier is PDB entry 3b6a (Willems et al., 2008), for which [\Delta R_n^{(4)}] = 0.92% (ΔRn = 0.64%). This structure was solved at a resolution of dmin = 3.0 Å, which means that the original algorithm used a mask with a grid step size of 0.75 Å, coarser than the proposed 0.6 Å. This seemingly counterintuitive result can be rationalized as follows. The bulk-solvent mask typically consists of a large region and several (often many) smaller isolated regions (Afonine et al., 2024). These small regions are typically cavities inside the protein or computational artefacts. The number and size of such regions vary based upon the choice of mask parameters (rsolv, rshrink, dgrid). With a step size [d_{\rm grid}^0] = 0.6 Å, the mask for 3b6a contains about 20 isolated small regions incapable of containing even a single disordered water molecule. Excluding these regions from the bulk-solvent mask reduces [\Delta R_n^{(4)}] and ΔRn to 0.36% and 0.26%, respectively, suggesting that these regions are computational artefacts.
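One way to exclude small isolated regions is to label connected components of the solvent mask and discard those below a size threshold. The sketch below does this with a breadth-first flood fill on a periodic cubic grid using face-sharing neighbours; the connectivity rule, the threshold and the function name are assumptions for illustration and are not taken from the procedure actually used for 3b6a.

```python
from collections import deque


def drop_small_regions(solvent, n, min_points):
    """solvent: set of (i, j, k) indices flagged as bulk solvent on an n x n x n periodic grid.
    Returns the subset belonging to connected regions with at least min_points points."""
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, kept = set(), set()
    for start in solvent:
        if start in seen:
            continue
        # Flood fill from this seed to collect one connected region.
        region, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in neighbours:
                q = ((i + di) % n, (j + dj) % n, (k + dk) % n)
                if q in solvent and q not in seen:
                    seen.add(q)
                    region.add(q)
                    queue.append(q)
        if len(region) >= min_points:
            kept |= region
    return kept


# Toy mask: one solvent slab plus a single isolated grid point.
n = 6
slab = {(0, j, k) for j in range(n) for k in range(n)}
solvent = slab | {(3, 3, 3)}
cleaned = drop_small_regions(solvent, n, min_points=5)
print(len(solvent), len(cleaned))
```

In practice the threshold would be expressed as a volume (number of points times the volume of one grid cell), for example the volume needed to hold one water molecule.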

4. Conclusions

The choice of mask parameters for the flat bulk-solvent model, i.e. solvent and mask shrinkage radii and the grid sampling step size, affects both the accuracy of the fit between model and data at medium to low resolution and the speed of the calculations. Since accounting for the bulk solvent typically occurs in crystallographic calculations that involve atomic model and reflection data, from simple operations like R-factor or map calculation to complex procedures like model refinement and building, the computational efficiency of this step is critical. The parameters governing the speed and accuracy of the flat bulk-solvent model are the solvent radius rsolv, the mask shrinkage radius rshrink and the grid step dgrid for the mask sampling. When this model was introduced (Jiang & Brünger, 1994) the choice of values for these parameters, of rsolv = 1.0 Å, rshrink = 1.1 Å and dgrid = (dmin/4) Å, was based on only two study cases at medium resolution (around 2 Å). A decade later, this choice was revisited for low-resolution cases by Rees et al. (2005), resulting in the suggestion that much finer steps are needed at these resolutions. While these finer steps, calculated as a fraction of dmin, address the problem of accuracy for low-resolution models, they in turn create the problem of substantially increasing the computational time for higher-resolution cases. A universal resolution-independent choice of mask calculation parameters is therefore highly desirable, and we here show that this is possible. A systematic study across hundreds of models from the PDB performed in this work reveals an optimal choice of these parameters to be [r_{\rm solv}^0] = 1.1 Å, [r_{\rm shrink}^0] = 0.9 Å and [d_{\rm grid}^0] = 0.6 Å. Validation of this choice with a much larger test set of models shows that these values are broadly applicable. In the last stages of refinement, a finer grid with a step dgrid = 0.4 Å and radii rsolv = 1.1 Å, rshrink = 0.8 Å may possibly be used to improve the results further. The parameters described here are implemented in CCTBX and are used in the Phenix suite (Liebschner et al., 2019), where applicable, starting from Version 1.20rc4-4425.

APPENDIX A

Discrete Fourier transform and Fourier coefficients

For a periodic function f(x) of a single variable with period a = 1, the integral Fourier transform results in an infinite number of Fourier coefficients F(h), where h is an integer. The infinite Fourier series defined by these coefficients converges to the function f(x). The smoother the function, the faster the convergence is (we do not specify the formal, mathematically strict, conditions of convergence when studying smooth functions like density distributions). When such a function is sampled on a regular grid with N points per interval, using the obvious equation for integer m

[\exp \left ( i2\pi n {{h + mN} \over {N}} \right ) = \exp \left ( i2\pi n {h \over N} \right ) \eqno(11)]

gives the convergent Fourier series on the grid nodes xn = n/N which can be expressed as

[\eqalignno{ f\left(x_n \right) = & \, \sum \limits_{h = - \infty}^\infty {\bf F} \left(h \right) \exp \left ( i2\pi h x_n \right ) \cr = & \, \sum \limits_{h = - \infty}^\infty {\bf F} \left (h \right ) \exp \left ( i2\pi h {n \over N} \right ) \cr = & \, \sum \limits_{h = - \infty}^\infty {\bf F} \left ( h \right ) \exp \left ( i2\pi n {h \over N} \right ) \cr = & \, \sum \limits_{h = H}^{H + N - 1} \left \{ \sum \limits_{m = - \infty}^\infty \left [ {\bf F} \left ( h + mN \right ) \right ] \right \} \exp \left ( i2\pi n {h \over N} \right ) . \cr &&(12)}]

Here H is any integer number, and convergence of the original Fourier series proves convergence of each internal series in the right-hand expression of (12). In other words, given N real numbers on a regular grid, the discrete Fourier transform results in a set of values

[{\bf F}_{\rm grid} \left ( h \right ) = \sum \limits_{m = - \infty}^\infty \left [ {\bf F} \left ( h + mN \right ) \right ] , \eqno(13)]

which possess Hermitian symmetry. There are only N independent values since, according to their definition in (13), Fgrid(h) are periodic and so Fgrid(h) = Fgrid(h + N) for every h. Usually, Fgrid(h) taken with the consecutive indices are used as approximations to F(h). In some applications, the range 0 ≤ h < N can be chosen. However, since the Fourier coefficients F(h) for convergent series generally decrease with |h|, it is more practical to choose −N/2 < h ≤ N/2, which results in a smaller |Fgrid(h) − F(h)| difference.

Funding information

Pavel V. Afonine and Paul D. Adams thank the NIH (grant Nos. R01GM071939, P01GM063210 and R24GM141254) and the Phenix Industrial Consortium for support of the Phenix project. This work was supported in part by the US Department of Energy under Contract No. DE-AC02-05CH11231. Alexandre Urzhumtsev thanks the French Infrastructure for Integrated Structural Biology (FRISBI) ANR-10-INSB-05-01 and Instruct-ERIC.

References

Afonine, P. V., Grosse-Kunstleve, R. W., Adams, P. D. & Urzhumtsev, A. (2013). Acta Cryst. D69, 625–634.
Afonine, P. V., Grosse-Kunstleve, R. W., Echols, N., Headd, J. J., Moriarty, N. W., Mustyakimov, M., Terwilliger, T. C., Urzhumtsev, A., Zwart, P. H. & Adams, P. D. (2012). Acta Cryst. D68, 352–367.
Afonine, P. V., Sobolev, O. V., Adams, P. D. & Urzhumtsev, A. (2024). Protein Science. In the press.
Afonine, P. V. & Urzhumtsev, A. (2004). Acta Cryst. A60, 19–32.
Blanc, E., Roversi, P., Vonrhein, C., Flensburg, C., Lea, S. M. & Bricogne, G. (2004). Acta Cryst. D60, 2210–2221.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905–921.
Burley, S. K., Bhikadiya, C., Bi, C., Bittrich, S., Chen, L., Crichlow, G. V., Christie, C. H., Dalenberg, K., Di Costanzo, L., Duarte, J. M., Dutta, S., Feng, Z., Ganesan, S., Goodsell, D. S., Ghosh, S., Green, R. K., Guranović, V., Guzenko, D., Hudson, B. P., Lawson, C., Liang, Y., Lowe, R., Namkoong, H., Peisach, E., Persikova, I., Randle, C., Rose, A., Rose, Y., Sali, A., Segura, J., Sekharan, M., Shao, C., Tao, Y., Voigt, M., Westbrook, J., Young, J. Y., Zardecki, C. & Zhuravleva, M. (2021). Nucleic Acids Res. 49, D437–D451.
Fenn, T. D., Schnieders, M. J. & Brunger, A. T. (2010). Acta Cryst. D66, 1024–1031.
Jiang, J. S. & Brünger, A. T. (1994). J. Mol. Biol. 243, 100–115.
Langridge, R., Marvin, D. A., Seeds, W. E., Wilson, H. R., Hooper, C. W., Wilkins, M. H. F. & Hamilton, L. D. (1960). J. Mol. Biol. 2, 38–I, N12.
Liebschner, D., Afonine, P. V., Baker, M. L., Bunkóczi, G., Chen, V. B., Croll, T. I., Hintze, B., Hung, L.-W., Jain, S., McCoy, A. J., Moriarty, N. W., Oeffner, R. D., Poon, B. K., Prisant, M. G., Read, R. J., Richardson, J. S., Richardson, D. C., Sammito, M. D., Sobolev, O. V., Stockwell, D. H., Terwilliger, T. C., Urzhumtsev, A. G., Videau, L. L., Williams, C. J. & Adams, P. D. (2019). Acta Cryst. D75, 861–877.
Lunin, V. Y., Urzhumtsev, A. & Bockmayr, A. (2002). Acta Cryst. A58, 283–291.
Moews, P. C. & Kretsinger, R. H. (1975). J. Mol. Biol. 91, 201–225.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Navaza, J. (2002). Acta Cryst. A58, 568–573.
Phillips, S. E. V. (1980). J. Mol. Biol. 142, 531–554.
Rees, B., Jenner, L. & Yusupov, M. (2005). Acta Cryst. D61, 1299–1301.
Roversi, P., Blanc, E., Vonrhein, C., Evans, G. & Bricogne, G. (2000). Acta Cryst. D56, 1316–1323.
Sayre, D. (1951). Acta Cryst. 4, 362–367.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Ten Eyck, L. F. (1977). Acta Cryst. A33, 486–492.
Tronrud, D. E., Ten Eyck, L. F. & Matthews, B. W. (1987). Acta Cryst. A43, 489–501.
Weichenberger, C. X., Afonine, P. V., Kantardjieff, K. & Rupp, B. (2015). Acta Cryst. D71, 1023–1038.
Willems, A. R., Tahlan, K., Taguchi, T., Zhang, K., Lee, Z. Z., Ichinose, K., Junop, M. S. & Nodwell, J. R. (2008). J. Mol. Biol. 376, 1377–1387.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.
