Bioisis

INFORMATION CONTENT ... by Robert P. Rambo, Ph.D.

A SAXS curve is a measurement of a band-limited function. As such, we can apply Shannon's information theory to SAXS datasets, an insight explored by Peter Moore (1982) and D. Taupin and V. Luzzati (1982). The band-limited function in SAXS is the P(r)-distribution; this is the signal that we are attempting to measure in a SAXS experiment to some level of accuracy (i.e., resolution). Shannon showed through the Shannon-Whittaker interpolation formula that a band-limited function can be entirely recapitulated by sampling the signal at discrete, equidistant points (Shannon points). Peter Moore extended this framework to small-angle scattering (SAS) experiments and demonstrated that a SAS dataset, defined by the q-vector range, can be entirely specified by a discrete sampling of n points, where n = qmax*dmax/π, n is an integer, and dmax is the particle's maximum dimension (Figure 1). In modern SAXS experiments, the number of I(q) observations greatly exceeds n. For instance, a modern detector will collect ~500 points to a qmax of 0.32 Å-1, whereas for the 30S ribosomal subunit we require only 23 equidistant I(q) observations to fully describe the particle at the specified resolution limit. A typical SAXS curve from a modern instrument is therefore highly over-sampled.
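The counting argument above can be sketched in a few lines of Python. This is a minimal illustration, not code from Scatter: the function names are hypothetical, rounding n down to a whole number is an assumption, and the dmax of 43 Å is the xylanase value quoted later in this article (the dmax of the 30S subunit is not stated here, so it is not used).

```python
import math

def shannon_number(q_max, d_max):
    """Whole number of Shannon points, n = floor(q_max * d_max / pi).

    Rounding down is an assumption; the text only says n is an integer.
    """
    return math.floor(q_max * d_max / math.pi)

def shannon_points(q_max, d_max):
    """Shannon points q_k = k * pi / d_max for k = 1..n, in inverse angstroms."""
    n = shannon_number(q_max, d_max)
    return [k * math.pi / d_max for k in range(1, n + 1)]

q_max = 0.32       # inverse angstroms, from the text
d_max = 43.0       # angstroms, xylanase value quoted later in the text
n_observed = 500   # approximate number of detector points, from the text

n = shannon_number(q_max, d_max)
points = shannon_points(q_max, d_max)
oversampling = n_observed / n  # observations per Shannon point

print(n)
print([round(q, 4) for q in points])
print(oversampling)
```

For these inputs only a handful of Shannon points fall inside the measured q range, so each one is covered by more than a hundred detector observations.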

Figure 1

The approach by Peter Moore establishes what I refer to as the Moore function. He effectively reduced the scattering curve to a set of coefficients (Moore coefficients) that can be combined in various ways to calculate the particle's correlation function, I(0), Rg, P(r) and the average distance within the molecule. The Moore function provides a target function for merging and refining datasets, which is implemented in Scatter. We exploit the data redundancy in the SAXS curve to test how consistent the SAXS dataset is with itself.

The Shannon points are equally spaced samples separated by an interval of π/dmax. If the data are sampled more coarsely than π/dmax, the signal cannot be recovered without additional independent information. Practically, the minimum sampling rate required to fully recover the information from a SAS signal will also depend on the signal-to-noise (S-to-N) ratio of the dataset (Figure 2). The angular range of the experiment defines the “communication” channel for observing the SAS signal. Here, the maximum rate of information, C, that can be communicated in the presence of noise is given by the Shannon-Hartley theorem. For more information, see Rambo and Tainer (Supporting Information).

Figure 2

C is proportional to the particle’s dmax and is defined in terms of bits of information per Å-1 of SAS data. The noisy-channel coding theorem guarantees the recovery of a nearly error-free signal as long as the sampling interval, Δq, is less than C. More importantly, it demonstrates that data with an S-to-N ratio of less than one contain usable information as long as Δq < C. For xylanase, a dmax of 43 Å and an S-to-N ratio of 0.9 give a C of 0.068 bits per Å-1, well above the sampling interval of modern beamlines (Δq ~ 0.006 Å-1 << C). While the above derivation suggests the minimum requirement for recovering a SAS signal from a noisy dataset, it does not necessarily represent the practical limit, which will depend greatly on the decoding algorithm (e.g., GNOM, GIFT, FOXS, CRYSOL).
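As a numerical check on the xylanase figures, the short sketch below evaluates a Shannon-Hartley style expression. The exact form of C is not given in this article, so the expression used here, (π/dmax)·log2(1 + S/N), is an assumption chosen only because it reproduces the quoted 0.068; the function name is hypothetical, and the derivation in Rambo and Tainer should be treated as authoritative for the definition of C and its dependence on dmax.

```python
import math

def channel_capacity(d_max, s_to_n):
    """Illustrative capacity, ASSUMED to be (pi / d_max) * log2(1 + S/N).

    This form is not stated in the article; it is used here only because
    it reproduces the value quoted for xylanase.
    """
    return (math.pi / d_max) * math.log2(1.0 + s_to_n)

d_max = 43.0     # angstroms, xylanase, from the text
s_to_n = 0.9     # signal-to-noise ratio, from the text
delta_q = 0.006  # inverse angstroms, typical beamline sampling, from the text

c = channel_capacity(d_max, s_to_n)
print(round(c, 3))   # compare with the 0.068 quoted in the text
print(delta_q < c)   # the condition discussed in the text
print(c / delta_q)   # how far the sampling interval sits below C
```

Under this assumed form, the beamline sampling interval is roughly an order of magnitude smaller than C, which is consistent with the Δq << C statement above.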

Modern SAXS curves are highly over-sampled measurements of a signal, implying that the information in the curve is redundant. Application of the Shannon-Whittaker interpolation formula demonstrates that the majority of the observations are correlated and that the information from a SAXS experiment is conveyed not by a single I(q) observation but by the set of observations bounded by the experimental channel, with the minimum restriction Δq << C.
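The redundancy can be made concrete with a small numerical experiment. The sketch below uses the standard form of the Shannon interpolation for small-angle scattering, in which q·I(q) is rebuilt from its values at the Shannon points q_k = kπ/dmax; it is an illustration under stated assumptions, not the procedure implemented in Scatter. The test particle is a hypothetical homogeneous sphere of radius 20 Å (so dmax = 40 Å), the function names are invented for this example, and the series is truncated at 60 terms, far beyond an experimental q range, simply to keep truncation error out of the comparison.

```python
import numpy as np

def sphere_intensity(q, radius):
    """Scattering intensity of a homogeneous sphere, normalized so I(0) = 1."""
    x = np.asarray(q, dtype=float) * radius
    form = 3.0 * (np.sin(x) - x * np.cos(x)) / x**3
    return form**2

def shannon_interpolate(q, q_k, i_k, d_max):
    """Rebuild I(q) for q > 0 from samples i_k taken at Shannon points q_k."""
    q = np.asarray(q, dtype=float)
    u_k = q_k * i_k  # q*I(q) is the quantity that is interpolated
    # np.sinc(x) = sin(pi*x)/(pi*x), so the arguments are scaled by d_max/pi
    minus = np.subtract.outer(q, q_k) * d_max / np.pi
    plus = np.add.outer(q, q_k) * d_max / np.pi
    kernel = np.sinc(minus) - np.sinc(plus)
    return (kernel @ u_k) / q

radius = 20.0           # angstroms, hypothetical test particle
d_max = 2.0 * radius    # maximum dimension of a sphere
n_terms = 60            # hypothetical truncation of the infinite series

q_k = np.arange(1, n_terms + 1) * np.pi / d_max   # Shannon points
i_k = sphere_intensity(q_k, radius)               # the only "measurements" used

q_dense = np.linspace(0.01, 0.30, 500)            # over-sampled grid
i_exact = sphere_intensity(q_dense, radius)
i_rebuilt = shannon_interpolate(q_dense, q_k, i_k, d_max)

max_abs_error = float(np.max(np.abs(i_rebuilt - i_exact)))
print(max_abs_error)
```

All 500 values on the dense grid are recovered from samples spaced π/dmax apart, which is the sense in which neighbouring I(q) observations in an over-sampled curve are correlated rather than independent.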