VSNNormalizer

The VSNNormalizer performs Variance Stabilizing Normalization (VSN). It applies a data-driven arcsinh transformation that flattens variance across the full intensity range, making downstream statistical analyses more reliable and robust to intensity-dependent noise.

As of version 0.2, VSNNormalizer is implemented in pure Python (NumPy + SciPy). The previous R/rpy2 backend has been removed; an R installation is no longer required to use this normalizer.

Overview

Variance Stabilizing Normalization addresses a common problem in high-throughput biological data: the variance of measurements often depends on their intensity level. In proteomics and microarray data, low-intensity measurements typically have lower variance while high-intensity measurements have higher variance. This heteroscedasticity can bias statistical analyses.

VSN works by:

  1. Learning transformation parameters: per-sample offset a_j and log-scale-factor beta_j are estimated by maximum profile likelihood.

  2. Applying transformation: h_ij = arsinh(exp(beta_j) * y_ij + a_j), which behaves like log(y_ij) for large positive y_ij and is well-defined for zero or negative inputs.

  3. Calibrating across samples: parameters are estimated jointly so per-sample intensities become comparable.

  4. Robust subset selection: a least-trimmed-squares (LTS) iteration trims a configurable upper quantile of high-residual rows from each intensity slice, so the fit is not dominated by outliers.

Key Features

  • Variance stabilization: variance becomes approximately constant across intensity ranges.

  • Data-driven: parameters are learned from the data, not predetermined.

  • Cross-sample calibration: per-sample affine pre-transform aligns scales.

  • Robust fitting: LTS reweighting suppresses the influence of outlier features.

  • No R dependency: native NumPy/SciPy; no rpy2, no R install required.

  • Vectorized: closed-form analytic gradient and fully vectorized objective; uses SciPy’s L-BFGS-B.

Algorithm Details

The model is

\[h_{ij} = \mathrm{arsinh}(b_j \cdot y_{ij} + a_j), \quad b_j = \exp(\beta_j) > 0\]

where \(y_{ij}\) is the raw intensity for feature i in sample j. a_j (offset) and β_j (log-scaling-factor) are per-sample parameters; β is parametrized in log-space to keep b_j strictly positive.

Fitting minimizes the negative profile log-likelihood

\[\mathcal{L}(\theta) = \frac{n_t}{2}\,\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i,j} \log(1 + Y_{ij}^2) - n_{\mathrm{features}} \sum_j \beta_j\]

where \(Y_{ij} = b_j y_{ij} + a_j\), the per-feature mean μ_i and the variance σ² are profiled out (closed-form), and n_t is the total number of (feature, sample) cells.

The robust LTS step then iterates:

  1. Apply the current (a, β) to all features.

  2. Compute the per-feature residual variance over samples.

  3. Slice features into 5 intensity bins by rank of the per-feature mean.

  4. Within each bin, keep features whose residual variance is below the lts_quantile percentile (and always keep the lowest-intensity slice in full).

  5. Refit on this subset, warm-starting from the previous parameters.

Up to 7 LTS iterations are performed (matches R-VSN). The final transform is converted to a log2-comparable scale via

\[h_{ij}^{\log_2} = \frac{\mathrm{arsinh}(b_j y_{ij} + a_j)}{\ln 2} - h_{\mathrm{offset}}, \quad h_{\mathrm{offset}} = \log_2\!\big(2 \exp(\overline{\beta_j})\big).\]

Numerical agreement with R-VSN

The native engine matches the Bioconductor vsn package’s vsn2 output to ~1e-6 (and a/β parameters to ~1e-6) on realistic proteomics-shaped inputs (e.g. the canonical kidney 8704×2 dataset). On smaller, harder synthetic inputs, scipy’s L-BFGS-B and R’s lbfgsb may converge to slightly different local optima on the near-flat profile-likelihood surface; the resulting outputs typically agree within a few times 0.01 on the log2 scale, which is well within the noise of any reasonable downstream analysis.

Parameters

class pronoms.normalizers.VSNNormalizer(calib: str = 'affine', reference_sample: int | None = None, lts_quantile: float = 0.75)[source]

Bases: object

Variance-stabilizing normalization (Huber et al., 2002).

The fitted model is h_ij = arsinh(exp(beta_j) * y_ij + a_j), which behaves like log(y_ij) for large y_ij and is well-defined for zero or negative inputs. Per-sample offset a_j and log-scale factor beta_j are estimated by maximum profile likelihood with a least-trimmed-squares robustification step. The output is on a log2-comparable scale.

Parameters:
  • calib (str, optional) – Calibration method, by default "affine". "affine" is the only supported value; other values raise ValueError.

  • reference_sample (Optional[int], optional) – Deprecated. Always treated as None regardless of value. Pass None (the default) to silence the deprecation warning. Will be removed in a future major release.

  • lts_quantile (float, optional) – Quantile for the Least-Trimmed-Squares robust step, by default 0.75. Must lie in (0, 1]. 1.0 disables the robust step.

vsn_params

Fitted parameters as a plain Python dict. Only populated after normalize(). Keys:

coefficientsnp.ndarray, shape (1, n_samples, 2)

Per-sample (a, beta) coefficients in R-VSN’s layout. Slice [0, :, 0] is a; [0, :, 1] is beta = log(b).

anp.ndarray, shape (n_samples,)

Per-sample offset.

b_lognp.ndarray, shape (n_samples,)

Per-sample log-scale factor; the actual scaling is np.exp(b_log).

sigsqfloat

Profiled variance on the natural arsinh scale.

hoffsetfloat

Offset applied during transform to put output on a log2-like scale.

munp.ndarray, shape (n_features,)

Per-feature mean of the transformed data on the natural arsinh scale (rows trimmed by LTS in iterations >1 appear as NaN, matching R’s bookkeeping).

convergedbool

Whether the inner L-BFGS-B converged on the final LTS iteration.

n_lts_iterint

Number of LTS iterations executed (capped at 7).

Type:

Optional[dict]

normalize(X: ndarray, protein_ids: list[str] | None = None, sample_ids: list[str] | None = None) ndarray[source]

Fit VSN on X and return the transformed matrix.

Parameters:
  • X (np.ndarray) – Input data with shape (n_samples, n_features).

  • protein_ids (Optional[List[str]], optional) – Unused; retained for API compatibility with the previous R-backed implementation.

  • sample_ids (Optional[List[str]], optional) – Unused; retained for API compatibility.

Returns:

Normalized data with the same shape as X, on a log2-comparable (variance-stabilized) scale.

Return type:

np.ndarray

Raises:

ValueError – If the input contains NaN or Inf, or if the native VSN engine fails to fit (e.g., fewer than 2 samples).

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (8, 8), gridsize: int = 50, cmap: str = 'viridis') Figure[source]

Plot a hexbin comparison of raw vs. VSN-normalized intensities.

Parameters:
  • before_data (np.ndarray) – Data before normalization.

  • after_data (np.ndarray) – Data after normalization (output of normalize).

  • figsize (Tuple[int, int], optional)

  • gridsize (int, optional)

  • cmap (str, optional)

Return type:

plt.Figure

Usage Example

import numpy as np
from pronoms.normalizers import VSNNormalizer

rng = np.random.default_rng(42)
data = np.array([
    rng.normal([100, 1000, 10000], [10, 100, 1000]),   # Sample 1
    rng.normal([120, 1200, 12000], [12, 120, 1200]),   # Sample 2
    rng.normal([80,  800,  8000],  [8,  80,  800]),    # Sample 3
])

normalizer = VSNNormalizer()
normalized = normalizer.normalize(data)

# Inspect fitted parameters (plain Python dict)
params = normalizer.vsn_params
print("a:",       params["a"])
print("beta:",    params["b_log"])
print("sigma^2:", params["sigsq"])

Visualization:

fig = normalizer.plot_comparison(data, normalized)
fig.show()

When to Use

VSNNormalizer is particularly useful when:

  • Intensity-dependent variance: data shows increasing variance with higher intensities.

  • Proteomics applications: mass-spectrometry data with heteroscedastic noise.

  • Microarray data: gene expression with intensity-dependent variance.

  • Statistical analysis preparation: before methods that assume constant variance.

  • Cross-sample comparison: when samples must be on truly comparable scales.

Considerations

  • At least two samples are required (rows of the input matrix). Single-sample inputs raise ValueError.

  • NaN / Inf values must be handled before calling normalize; non-finite cells raise ValueError.

  • Computational cost: more expensive than simple scaling methods due to L-BFGS-B parameter estimation, but typically completes in well under a second on realistic proteomics matrices.

  • Output scale: the result is on a log2-comparable, variance-stabilized scale; not in original intensity units.

  • Hyperparameter: lts_quantile=0.75 is the pronoms default; pass 1.0 to disable the LTS step (single ML fit).

See Also

Citation

Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18 Suppl 1:S96–104. doi:10.1093/bioinformatics/18.suppl_1.s96, PMID: 12169536