SPLMNormalizer

The SPLMNormalizer implements Stable Protein Log-Mean Normalization (SPLM), which identifies a subset of stably expressed proteins based on their low coefficient of variation and uses them as internal standards for normalization. This method is particularly effective when a subset of proteins can be assumed to remain constant across experimental conditions.

Overview

SPLM normalization addresses the challenge of selecting appropriate reference features for normalization in proteomics data. Rather than assuming all proteins are equally suitable as references, SPLM:

Identifies stable proteins: Selects features with the lowest coefficient of variation (std/mean) computed in linear space
Uses stable proteins as references: Calculates scaling factors based only on these stable features
Normalizes all features: Applies the scaling factors derived from stable proteins to the entire dataset

This approach is particularly powerful when:

A subset of proteins is expected to be housekeeping or constitutively expressed
Technical variation affects all proteins proportionally
You want to avoid bias from highly variable proteins in normalization
You are working with targeted proteomics where reference proteins can be identified

Key Features

Automatic stable protein selection: Identifies the most stable features by linear-space coefficient of variation
Reference-based normalization: Uses only stable proteins for scaling factor calculation
Log-space centering: Removes multiplicative effects through log-transformed centering on stable proteins
Robust to variable proteins: Normalization is not affected by highly variable features
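Preserves biological variation: Rescales each sample by a single factor, so relative differences between proteins are retained while sample-wide technical bias is removed, provided the selected proteins are truly stable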
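
The log-space centering property can be illustrated with plain NumPy. This is a minimal sketch rather than the pronoms implementation: the helper `center_on_stable`, the choice of stable columns, and the simulated dilution factors are all hypothetical, and epsilon is left out so that the algebra is exact.

```python
import numpy as np

def center_on_stable(X, stable_idx):
    """Center each sample on the log-mean of its stable columns (no epsilon)."""
    X_log = np.log(X)
    factors = X_log[:, stable_idx].mean(axis=1)   # one log-space factor per sample
    grand_mean = factors.mean()
    return np.exp(X_log - factors[:, None] + grand_mean)

rng = np.random.default_rng(0)
true_abundance = rng.uniform(50, 500, size=(1, 6))   # identical biology in every sample
dilution = np.array([[0.5], [1.0], [2.0], [4.0]])    # hypothetical per-sample technical factor
observed = true_abundance * dilution                  # shape (4, 6)

normalized = center_on_stable(observed, stable_idx=[0, 1, 2])

# Every row collapses to the same profile: the per-sample factor is gone.
print(np.allclose(normalized, normalized[0]))  # True
```

The common profile equals the true abundances times the geometric mean of the dilution factors, which is what recentering on the grand mean is expected to produce.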

Algorithm Details

The SPLM algorithm works through the following steps:

Calculate per-protein CV in linear space: For each protein j, CV_j = std(X[:, j]) / mean(X[:, j]). Constant proteins (std=0) get CV=0; proteins with mean=0 are assigned CV=+inf so that they are selected last.
Select stable proteins: Choose the num_stable_proteins with lowest CV
Log transformation: X_log = log(X + ε) where ε prevents log(0)
Calculate scaling factors: For each sample i, factor_i = mean(X_log[i, stable_proteins])
Calculate grand mean: grand_mean = mean(all scaling factors)
Normalize in log-space: X_norm_log[i, j] = X_log[i, j] - factor_i + grand_mean
Back-transform: X_normalized = exp(X_norm_log) - ε
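
The steps above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions, not the pronoms source: the function name `splm_sketch` is hypothetical, the standard deviation is taken with `ddof=0`, the std=0 rule is applied after the mean=0 rule, and ties in CV are broken by column order.

```python
import numpy as np

def splm_sketch(X, num_stable_proteins, epsilon=1e-6):
    """Illustrative version of the listed steps; not the library code."""
    X = np.asarray(X, dtype=float)

    # Step 1: per-protein CV in linear space (population std is an assumption).
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    cv = np.full(X.shape[1], np.inf)           # mean == 0 -> +inf, selected last
    nonzero = mean != 0
    cv[nonzero] = std[nonzero] / mean[nonzero]
    cv[std == 0] = 0.0                          # constant proteins -> CV = 0

    # Step 2: indices of the k lowest-CV proteins.
    stable = np.argsort(cv, kind="stable")[:num_stable_proteins]

    # Steps 3-5: log-transform, per-sample factor, grand mean.
    X_log = np.log(X + epsilon)
    factors = X_log[:, stable].mean(axis=1)
    grand_mean = factors.mean()

    # Steps 6-7: center in log space, then back-transform.
    X_norm = np.exp(X_log - factors[:, None] + grand_mean) - epsilon
    return X_norm, stable, factors, grand_mean

X = np.array([[100.0, 200.0,  50.0],
              [200.0, 400.0, 300.0],
              [ 50.0, 100.0,  10.0]])
X_norm, stable, factors, grand_mean = splm_sketch(X, num_stable_proteins=2)

# After normalization every sample has the same stable-protein log-mean.
print(np.log(X_norm + 1e-6)[:, stable].mean(axis=1), grand_mean)
```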

Mathematical representation:

\[\text{CV}_j = \frac{\sigma(X_{:,j})}{\mu(X_{:,j})}\]

\[\text{factor}_i = \frac{1}{k} \sum_{j \in \text{stable}} \log(X_{i,j} + \epsilon)\]

where k is the number of stable proteins.
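
The grand-mean, normalization, and back-transform steps can be written in the same notation. The symbols \(\bar{f}\) (grand mean of the scaling factors) and n (number of samples) are shorthand introduced here, not names used by the library:

\[\bar{f} = \frac{1}{n} \sum_{i=1}^{n} \text{factor}_i\]

\[X^{\text{norm}}_{i,j} = \exp\left(\log(X_{i,j} + \epsilon) - \text{factor}_i + \bar{f}\right) - \epsilon = (X_{i,j} + \epsilon)\, e^{\bar{f} - \text{factor}_i} - \epsilon\]

In linear space, each sample i is therefore multiplied by the single constant \(e^{\bar{f} - \text{factor}_i}\), up to the ε offset.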

Parameters

class pronoms.normalizers.SPLMNormalizer(num_stable_proteins: int = 100, epsilon: float = 1e-06)

Bases: object

Normalizer based on Stable Protein Log-Mean Normalization (SPLM-Norm).

Scales proteomics intensity data using a subset of stably expressed proteins (lowest coefficient of variation, computed in linear space as std(X) / mean(X) per protein across samples). It uses the mean of log-transformed intensities of these stable proteins per sample to define scaling factors, performs normalization in log-space, recenters, and then transforms back to the original scale.

num_stable_proteins

Number of stable proteins used for calculating scaling factors.

Type: int

epsilon

Small constant added before log transformation to avoid log(0).

Type: float

stable_protein_indices

Indices of the proteins identified as stable. Available after normalize().

Type: Optional[np.ndarray]

log_scaling_factors

The per-sample log-space scaling factors derived from stable proteins. Available after normalize().

Type: Optional[np.ndarray]

grand_mean_log_scaling_factor

The mean of the log_scaling_factors across all samples. Available after normalize().

Type: Optional[float]

normalize(X: ndarray) → ndarray

Perform Stable Protein Log-Mean Normalization on input data X.

Parameters:

X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.

Returns:

Normalized data matrix with the same shape as X.

Return type:

np.ndarray

Raises:

ValueError –

If input is not a 2D array with at least one feature.
If input data contains NaN or Inf values.
If num_stable_proteins is greater than the number of features in X.
If stable proteins cannot be determined (e.g., all proteins have zero variance).
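
To catch the first three conditions before calling normalize(), a pre-check along these lines may help. It is a hypothetical helper that mirrors the documented conditions; the function name and error messages are assumptions and do not come from pronoms, and the fourth condition (stable proteins cannot be determined) is not covered.

```python
import numpy as np

def validate_input(X, num_stable_proteins):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] < 1:
        raise ValueError("X must be a 2D array with at least one feature.")
    if not np.isfinite(X).all():
        raise ValueError("X contains NaN or Inf values.")
    if num_stable_proteins > X.shape[1]:
        raise ValueError("num_stable_proteins exceeds the number of features.")
    return X

# A 1D array and a matrix containing NaN are both rejected.
for bad in (np.array([1.0, 2.0]), np.array([[1.0, np.nan]])):
    try:
        validate_input(bad, num_stable_proteins=1)
    except ValueError as err:
        print(f"rejected: {err}")
```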

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'SPLM Normalization Comparison') → Figure

Plot data before vs after normalization using a 2D hexbin density plot.

Parameters:

before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “SPLM Normalization Comparison”.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Usage Example

Basic SPLM normalization:

import numpy as np
from pronoms.normalizers import SPLMNormalizer

# Create sample data with stable and variable proteins

# Stable proteins (low variability)
stable_proteins = np.array([
    [100, 200, 150],  # Sample 1
    [105, 210, 155],  # Sample 2
    [95, 190, 145]    # Sample 3
])

# Variable proteins (high variability)
variable_proteins = np.array([
    [50, 1000],   # Sample 1
    [150, 500],   # Sample 2
    [25, 2000]    # Sample 3
])

# Combine stable and variable proteins
data = np.hstack([stable_proteins, variable_proteins])

# Create and apply normalizer
# Use 3 stable proteins (should select the first 3 columns)
normalizer = SPLMNormalizer(num_stable_proteins=3, epsilon=1.0)
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)

# Examine which proteins were selected as stable
print(f"\nStable protein indices: {normalizer.stable_protein_indices}")
print(f"Log scaling factors: {normalizer.log_scaling_factors}")
print(f"Grand mean log scaling factor: {normalizer.grand_mean_log_scaling_factor}")
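
As a hand check that the first three columns really are the most stable, the linear-space CVs of the example matrix can be computed directly with NumPy. This sketch uses the population standard deviation (`ddof=0`), which is an assumption about the library; the ranking is the same with `ddof=1` because every column has the same number of samples.

```python
import numpy as np

data = np.array([
    [100, 200, 150,  50, 1000],
    [105, 210, 155, 150,  500],
    [ 95, 190, 145,  25, 2000],
], dtype=float)

cv = data.std(axis=0) / data.mean(axis=0)   # per-column CV, population std assumed
lowest_three = np.argsort(cv)[:3]           # the three most stable columns

print(np.round(cv, 3))
print(sorted(lowest_three.tolist()))  # [0, 1, 2]
```

The three stable columns have CVs of roughly 0.03 to 0.04, while the two variable columns are above 0.5.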

Visualization:

# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()

When to Use

SPLMNormalizer is particularly useful when:

Housekeeping proteins present: Dataset contains proteins expected to be stably expressed
Targeted proteomics: Working with a curated set of proteins where some serve as references
Technical variation dominant: When most variation is technical rather than biological
Reference protein selection: When you want data-driven selection of reference features
Proportional scaling needed: When technical effects scale all proteins proportionally

Considerations

Stable protein assumption: Requires that some proteins are truly stable across conditions
Number of stable proteins: Choice of num_stable_proteins can significantly affect results
Log-space processing: Assumes multiplicative rather than additive effects
Minimum protein requirement: Needs enough proteins, measured across enough samples for the CV estimates to be meaningful, to reliably identify stable ones
Biological interpretation: May remove true biological signal if stable proteins are misidentified
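
One way to gauge how sensitive the result is to num_stable_proteins is to compare the per-sample scaling factors obtained for several values of k. The sketch below uses simulated data and a hypothetical helper `log_scaling_factors` written in plain NumPy; it does not call pronoms, and the simulated noise levels are arbitrary choices for illustration.

```python
import numpy as np

def log_scaling_factors(X, k, epsilon=1e-6):
    """Per-sample log-mean over the k lowest-CV columns (sketch, population std)."""
    cv = X.std(axis=0) / X.mean(axis=0)         # assumes strictly positive column means
    stable = np.argsort(cv)[:k]
    return np.log(X + epsilon)[:, stable].mean(axis=1)

rng = np.random.default_rng(1)
n_samples, n_stable, n_variable = 6, 20, 30
base = rng.uniform(100, 1000, size=n_stable + n_variable)
# Log-scale noise: small for the 20 stable proteins, large for the 30 variable ones.
noise = np.concatenate([np.full(n_stable, 0.02), np.full(n_variable, 0.8)])
X = base * np.exp(rng.normal(0.0, noise, size=(n_samples, base.size)))

for k in (5, 20, 50):
    f = log_scaling_factors(X, k)
    centered = f - f.mean()
    # Range of the centered factors: how strongly samples would be rescaled.
    print(k, round(float(centered.max() - centered.min()), 3))
```

There is no technical effect in this simulation, so any spread in the factors is noise from the reference set. If the spread grows sharply once k exceeds the number of genuinely stable proteins, variable proteins are leaking into the reference set.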