VSNNormalizer

The VSNNormalizer performs Variance Stabilizing Normalization (VSN). It applies a data-driven arcsinh transformation that flattens variance across the full intensity range, making downstream statistical analyses more reliable and robust to intensity-dependent noise.

As of version 0.2, VSNNormalizer is implemented in pure Python (NumPy + SciPy). The previous R/rpy2 backend has been removed; an R installation is no longer required to use this normalizer.

Overview

Variance Stabilizing Normalization addresses a common problem in high-throughput biological data: the variance of measurements often depends on their intensity level. In proteomics and microarray data, low-intensity measurements typically have lower variance while high-intensity measurements have higher variance. This heteroscedasticity can bias statistical analyses.

VSN works by:

Learning transformation parameters: per-sample offset a_j and log-scale-factor beta_j are estimated by maximum profile likelihood.
Applying transformation: h_ij = arsinh(exp(beta_j) * y_ij + a_j), which behaves like log(y_ij) for large positive y_ij and is well-defined for zero or negative inputs.
Calibrating across samples: parameters are estimated jointly so per-sample intensities become comparable.
Robust subset selection: a least-trimmed-squares (LTS) iteration trims a configurable upper quantile of high-residual rows from each intensity slice, so the fit is not dominated by outliers.

Key Features

Variance stabilization: variance becomes approximately constant across intensity ranges.
Data-driven: parameters are learned from the data, not predetermined.
Cross-sample calibration: per-sample affine pre-transform aligns scales.
Robust fitting: LTS reweighting suppresses the influence of outlier features.
No R dependency: native NumPy/SciPy; no rpy2, no R install required.
Vectorized: closed-form analytic gradient and fully vectorized objective; uses SciPy’s L-BFGS-B.

Algorithm Details

The model is

\[h_{ij} = \mathrm{arsinh}(b_j \cdot y_{ij} + a_j), \quad b_j = \exp(\beta_j) > 0\]

where \(y_{ij}\) is the raw intensity for feature i in sample j. a_j (offset) and β_j (log-scaling-factor) are per-sample parameters; β is parametrized in log-space to keep b_j strictly positive.

Fitting minimizes the negative profile log-likelihood

\[\mathcal{L}(\theta) = \frac{n_t}{2}\,\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i,j} \log(1 + Y_{ij}^2) - n_{\mathrm{features}} \sum_j \beta_j\]

where \(Y_{ij} = b_j y_{ij} + a_j\), the per-feature mean μ_i and the variance σ² are profiled out (closed-form), and n_t is the total number of (feature, sample) cells.

The robust LTS step then iterates:

Apply the current (a, β) to all features.
Compute the per-feature residual variance over samples.
Slice features into 5 intensity bins by rank of the per-feature mean.
Within each bin, keep features whose residual variance is below the lts_quantile percentile (and always keep the lowest-intensity slice in full).
Refit on this subset, warm-starting from the previous parameters.

Up to 7 LTS iterations are performed (matches R-VSN). The final transform is converted to a log2-comparable scale via

\[h_{ij}^{\log_2} = \frac{\mathrm{arsinh}(b_j y_{ij} + a_j)}{\ln 2} - h_{\mathrm{offset}}, \quad h_{\mathrm{offset}} = \log_2\!\big(2 \exp(\overline{\beta_j})\big).\]

Numerical agreement with R-VSN

The native engine matches the Bioconductor vsn package’s vsn2 output to ~1e-6 (and a/β parameters to ~1e-6) on realistic proteomics-shaped inputs (e.g. the canonical kidney 8704×2 dataset). On smaller, harder synthetic inputs, scipy’s L-BFGS-B and R’s lbfgsb may converge to slightly different local optima on the near-flat profile-likelihood surface; the resulting outputs typically agree within a few times 0.01 on the log2 scale, which is well within the noise of any reasonable downstream analysis.

Parameters

class pronoms.normalizers.VSNNormalizer(calib: str = 'affine', reference_sample: int | None = None, lts_quantile: float = 0.75)[source]

Bases: object

Variance-stabilizing normalization (Huber et al., 2002).

The fitted model is h_ij = arsinh(exp(beta_j) * y_ij + a_j), which behaves like log(y_ij) for large y_ij and is well-defined for zero or negative inputs. Per-sample offset a_j and log-scale factor beta_j are estimated by maximum profile likelihood with a least-trimmed-squares robustification step. The output is on a log2-comparable scale.

Parameters:

calib (str, optional) – Calibration method, by default "affine". "affine" is the only supported value; other values raise ValueError.
reference_sample (Optional[int], optional) – Deprecated. Always treated as None regardless of value. Pass None (the default) to silence the deprecation warning. Will be removed in a future major release.
lts_quantile (float, optional) – Quantile for the Least-Trimmed-Squares robust step, by default 0.75. Must lie in (0, 1]. 1.0 disables the robust step.

vsn_params

Fitted parameters as a plain Python dict. Only populated after normalize(). Keys:

coefficientsnp.ndarray, shape (1, n_samples, 2): Per-sample (a, beta) coefficients in R-VSN’s layout. Slice [0, :, 0] is a; [0, :, 1] is beta = log(b).
anp.ndarray, shape (n_samples,): Per-sample offset.
b_lognp.ndarray, shape (n_samples,): Per-sample log-scale factor; the actual scaling is np.exp(b_log).
sigsqfloat: Profiled variance on the natural arsinh scale.
hoffsetfloat: Offset applied during transform to put output on a log2-like scale.
munp.ndarray, shape (n_features,): Per-feature mean of the transformed data on the natural arsinh scale (rows trimmed by LTS in iterations >1 appear as NaN, matching R’s bookkeeping).
convergedbool: Whether the inner L-BFGS-B converged on the final LTS iteration.
n_lts_iterint: Number of LTS iterations executed (capped at 7).

Type:: Optional[dict]

normalize(X: ndarray, protein_ids: list[str] | None = None, sample_ids: list[str] | None = None) → ndarray[source]

Fit VSN on X and return the transformed matrix.

Parameters:

X (np.ndarray) – Input data with shape (n_samples, n_features).
protein_ids (Optional[List[str]], optional) – Unused; retained for API compatibility with the previous R-backed implementation.
sample_ids (Optional[List[str]], optional) – Unused; retained for API compatibility.

Returns:

Normalized data with the same shape as X, on a log2-comparable (variance-stabilized) scale.

Return type:

np.ndarray

Raises:

ValueError – If the input contains NaN or Inf, or if the native VSN engine fails to fit (e.g., fewer than 2 samples).

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (8, 8), gridsize: int = 50, cmap: str = 'viridis') → Figure[source]

Plot a hexbin comparison of raw vs. VSN-normalized intensities.

Parameters:

before_data (np.ndarray) – Data before normalization.
after_data (np.ndarray) – Data after normalization (output of normalize).
figsize (Tuple[int, int], optional)
gridsize (int, optional)
cmap (str, optional)

Return type:

plt.Figure

Usage Example

import numpy as np
from pronoms.normalizers import VSNNormalizer

rng = np.random.default_rng(42)
data = np.array([
    rng.normal([100, 1000, 10000], [10, 100, 1000]),   # Sample 1
    rng.normal([120, 1200, 12000], [12, 120, 1200]),   # Sample 2
    rng.normal([80,  800,  8000],  [8,  80,  800]),    # Sample 3
])

normalizer = VSNNormalizer()
normalized = normalizer.normalize(data)

# Inspect fitted parameters (plain Python dict)
params = normalizer.vsn_params
print("a:",       params["a"])
print("beta:",    params["b_log"])
print("sigma^2:", params["sigsq"])

Visualization:

fig = normalizer.plot_comparison(data, normalized)
fig.show()

When to Use

VSNNormalizer is particularly useful when:

Intensity-dependent variance: data shows increasing variance with higher intensities.
Proteomics applications: mass-spectrometry data with heteroscedastic noise.
Microarray data: gene expression with intensity-dependent variance.
Statistical analysis preparation: before methods that assume constant variance.
Cross-sample comparison: when samples must be on truly comparable scales.

Considerations

At least two samples are required (rows of the input matrix). Single-sample inputs raise ValueError.
NaN / Inf values must be handled before calling normalize; non-finite cells raise ValueError.
Computational cost: more expensive than simple scaling methods due to L-BFGS-B parameter estimation, but typically completes in well under a second on realistic proteomics matrices.
Output scale: the result is on a log2-comparable, variance-stabilized scale; not in original intensity units.
Hyperparameter: lts_quantile=0.75 is the pronoms default; pass 1.0 to disable the LTS step (single ML fit).

Citation

Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18 Suppl 1:S96–104. doi:10.1093/bioinformatics/18.suppl_1.s96, PMID: 12169536

VSNNormalizer

Overview

Key Features

Algorithm Details

Numerical agreement with R-VSN

Parameters

Usage Example

When to Use

Considerations

See Also

Citation