VSNNormalizer
The VSNNormalizer performs Variance Stabilizing Normalization (VSN). It applies a data-driven arcsinh transformation that flattens variance across the full intensity range, making downstream statistical analyses more reliable and robust to intensity-dependent noise.
As of version 0.2, VSNNormalizer is implemented in pure Python (NumPy + SciPy). The previous R/rpy2 backend has been removed; an R installation is no longer required to use this normalizer.
Overview
Variance Stabilizing Normalization addresses a common problem in high-throughput biological data: the variance of measurements often depends on their intensity level. In proteomics and microarray data, low-intensity measurements typically have lower variance while high-intensity measurements have higher variance. This heteroscedasticity can bias statistical analyses.
VSN works by:
Learning transformation parameters: per-sample offset
a_jand log-scale-factorbeta_jare estimated by maximum profile likelihood.Applying transformation:
h_ij = arsinh(exp(beta_j) * y_ij + a_j), which behaves likelog(y_ij)for large positivey_ijand is well-defined for zero or negative inputs.Calibrating across samples: parameters are estimated jointly so per-sample intensities become comparable.
Robust subset selection: a least-trimmed-squares (LTS) iteration trims a configurable upper quantile of high-residual rows from each intensity slice, so the fit is not dominated by outliers.
Key Features
Variance stabilization: variance becomes approximately constant across intensity ranges.
Data-driven: parameters are learned from the data, not predetermined.
Cross-sample calibration: per-sample affine pre-transform aligns scales.
Robust fitting: LTS reweighting suppresses the influence of outlier features.
No R dependency: native NumPy/SciPy; no rpy2, no R install required.
Vectorized: closed-form analytic gradient and fully vectorized objective; uses SciPy’s L-BFGS-B.
Algorithm Details
The model is
where \(y_{ij}\) is the raw intensity for feature i in sample j. a_j (offset) and β_j (log-scaling-factor) are per-sample parameters; β is parametrized in log-space to keep b_j strictly positive.
Fitting minimizes the negative profile log-likelihood
where \(Y_{ij} = b_j y_{ij} + a_j\), the per-feature mean μ_i and the variance σ² are profiled out (closed-form), and n_t is the total number of (feature, sample) cells.
The robust LTS step then iterates:
Apply the current
(a, β)to all features.Compute the per-feature residual variance over samples.
Slice features into 5 intensity bins by rank of the per-feature mean.
Within each bin, keep features whose residual variance is below the
lts_quantilepercentile (and always keep the lowest-intensity slice in full).Refit on this subset, warm-starting from the previous parameters.
Up to 7 LTS iterations are performed (matches R-VSN). The final transform is converted to a log2-comparable scale via
Numerical agreement with R-VSN
The native engine matches the Bioconductor vsn package’s vsn2 output to ~1e-6 (and a/β parameters to ~1e-6) on realistic proteomics-shaped inputs (e.g. the canonical kidney 8704×2 dataset). On smaller, harder synthetic inputs, scipy’s L-BFGS-B and R’s lbfgsb may converge to slightly different local optima on the near-flat profile-likelihood surface; the resulting outputs typically agree within a few times 0.01 on the log2 scale, which is well within the noise of any reasonable downstream analysis.
Parameters
- class pronoms.normalizers.VSNNormalizer(calib: str = 'affine', reference_sample: int | None = None, lts_quantile: float = 0.75)[source]
Bases:
objectVariance-stabilizing normalization (Huber et al., 2002).
The fitted model is
h_ij = arsinh(exp(beta_j) * y_ij + a_j), which behaves likelog(y_ij)for largey_ijand is well-defined for zero or negative inputs. Per-sample offseta_jand log-scale factorbeta_jare estimated by maximum profile likelihood with a least-trimmed-squares robustification step. The output is on a log2-comparable scale.- Parameters:
calib (str, optional) – Calibration method, by default
"affine"."affine"is the only supported value; other values raiseValueError.reference_sample (Optional[int], optional) – Deprecated. Always treated as
Noneregardless of value. PassNone(the default) to silence the deprecation warning. Will be removed in a future major release.lts_quantile (float, optional) – Quantile for the Least-Trimmed-Squares robust step, by default
0.75. Must lie in (0, 1].1.0disables the robust step.
- vsn_params
Fitted parameters as a plain Python dict. Only populated after
normalize(). Keys:coefficientsnp.ndarray, shape(1, n_samples, 2)Per-sample (a, beta) coefficients in R-VSN’s layout. Slice
[0, :, 0]isa;[0, :, 1]isbeta = log(b).anp.ndarray, shape(n_samples,)Per-sample offset.
b_lognp.ndarray, shape(n_samples,)Per-sample log-scale factor; the actual scaling is
np.exp(b_log).sigsqfloatProfiled variance on the natural
arsinhscale.hoffsetfloatOffset applied during transform to put output on a log2-like scale.
munp.ndarray, shape(n_features,)Per-feature mean of the transformed data on the natural
arsinhscale (rows trimmed by LTS in iterations >1 appear asNaN, matching R’s bookkeeping).convergedboolWhether the inner L-BFGS-B converged on the final LTS iteration.
n_lts_iterintNumber of LTS iterations executed (capped at 7).
- Type:
Optional[dict]
- normalize(X: ndarray, protein_ids: list[str] | None = None, sample_ids: list[str] | None = None) ndarray[source]
Fit VSN on
Xand return the transformed matrix.- Parameters:
X (np.ndarray) – Input data with shape
(n_samples, n_features).protein_ids (Optional[List[str]], optional) – Unused; retained for API compatibility with the previous R-backed implementation.
sample_ids (Optional[List[str]], optional) – Unused; retained for API compatibility.
- Returns:
Normalized data with the same shape as
X, on a log2-comparable (variance-stabilized) scale.- Return type:
np.ndarray
- Raises:
ValueError – If the input contains NaN or Inf, or if the native VSN engine fails to fit (e.g., fewer than 2 samples).
- plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (8, 8), gridsize: int = 50, cmap: str = 'viridis') Figure[source]
Plot a hexbin comparison of raw vs. VSN-normalized intensities.
- Parameters:
before_data (np.ndarray) – Data before normalization.
after_data (np.ndarray) – Data after normalization (output of
normalize).figsize (Tuple[int, int], optional)
gridsize (int, optional)
cmap (str, optional)
- Return type:
plt.Figure
Usage Example
import numpy as np
from pronoms.normalizers import VSNNormalizer
rng = np.random.default_rng(42)
data = np.array([
rng.normal([100, 1000, 10000], [10, 100, 1000]), # Sample 1
rng.normal([120, 1200, 12000], [12, 120, 1200]), # Sample 2
rng.normal([80, 800, 8000], [8, 80, 800]), # Sample 3
])
normalizer = VSNNormalizer()
normalized = normalizer.normalize(data)
# Inspect fitted parameters (plain Python dict)
params = normalizer.vsn_params
print("a:", params["a"])
print("beta:", params["b_log"])
print("sigma^2:", params["sigsq"])
Visualization:
fig = normalizer.plot_comparison(data, normalized)
fig.show()
When to Use
VSNNormalizer is particularly useful when:
Intensity-dependent variance: data shows increasing variance with higher intensities.
Proteomics applications: mass-spectrometry data with heteroscedastic noise.
Microarray data: gene expression with intensity-dependent variance.
Statistical analysis preparation: before methods that assume constant variance.
Cross-sample comparison: when samples must be on truly comparable scales.
Considerations
At least two samples are required (rows of the input matrix). Single-sample inputs raise
ValueError.NaN / Inf values must be handled before calling
normalize; non-finite cells raiseValueError.Computational cost: more expensive than simple scaling methods due to L-BFGS-B parameter estimation, but typically completes in well under a second on realistic proteomics matrices.
Output scale: the result is on a log2-comparable, variance-stabilized scale; not in original intensity units.
Hyperparameter:
lts_quantile=0.75is the pronoms default; pass1.0to disable the LTS step (single ML fit).
See Also
MedianNormalizer: simple scaling-based normalization.
QuantileNormalizer: making distributions identical across samples.
MADNormalizer: robust normalization via median absolute deviation.
RankNormalizer: rank-based transformation.
Citation
Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18 Suppl 1:S96–104. doi:10.1093/bioinformatics/18.suppl_1.s96, PMID: 12169536