MedianNormalizer

The MedianNormalizer divides each sample by its own median, then multiplies all samples by the mean of the sample medians to preserve the overall scale. This simple yet effective normalization method corrects for systematic differences in loading amount or labeling efficiency between samples.

Overview

Median normalization is based on the assumption that most proteins/features are not changing between samples, so differences in median intensity reflect technical rather than biological variation. The method works in two steps:

Sample-wise scaling: Each sample is divided by its own median
Global rescaling: All samples are multiplied by the mean of all original medians to preserve the overall data scale

This approach is particularly effective for:

Correcting for differences in sample loading amounts
Adjusting for labeling efficiency variations
Normalizing systematic technical differences across samples
Preprocessing data where most features are expected to be unchanged

Key Features

Scale preservation: Maintains the overall magnitude of the data through global rescaling
Robust to outliers: Uses median instead of mean, making it less sensitive to extreme values
Simple and fast: Computationally efficient with minimal parameters
Widely applicable: Suitable for most proteomics and genomics datasets
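
The robustness claim above can be illustrated with a small NumPy sketch. The intensities are made-up values chosen only for illustration; the point is that a single extreme value moves the mean substantially while leaving the median unchanged.

```python
import numpy as np

# Hypothetical intensities for one sample (illustrative values only).
sample = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
# Same sample, but the last feature is replaced by an extreme outlier.
sample_with_outlier = np.array([10.0, 11.0, 12.0, 13.0, 1400.0])

# How far does each summary statistic move because of the outlier?
median_shift = np.median(sample_with_outlier) - np.median(sample)
mean_shift = np.mean(sample_with_outlier) - np.mean(sample)

print(f"median shift: {median_shift:.1f}")  # median stays at 12
print(f"mean shift:   {mean_shift:.1f}")    # mean jumps from 12 to 289.2
```

A scaling factor derived from the median is therefore largely unaffected by a few very abundant features, whereas a mean-based factor would be dominated by them.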

Algorithm Details

For a data matrix X with shape (n_samples, n_features):

Calculate sample medians: For each sample i, compute median_i = median(X[i, :])
Scale samples: X_scaled[i, :] = X[i, :] / median_i
Calculate global scaling factor: global_factor = mean(all medians)
Apply global rescaling: X_normalized[i, :] = X_scaled[i, :] * global_factor

Mathematical representation:

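\[X_{\text{normalized}}[i,j] = \frac{X[i,j]}{\operatorname{median}(X[i,:])} \times \frac{1}{n} \sum_{k=1}^{n} \operatorname{median}(X[k,:])\]

Here n is the number of samples (n_samples), so the second factor is the mean of all sample medians.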
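
The four steps above can be sketched in plain NumPy as follows. This is an illustrative re-implementation under the stated formula, not the pronoms source code; the function name median_normalize_sketch is hypothetical, and the sketch omits the input validation that normalize() performs.

```python
import numpy as np

def median_normalize_sketch(X: np.ndarray) -> np.ndarray:
    """Illustrative version of the four documented steps (not the pronoms source)."""
    X = np.asarray(X, dtype=float)
    medians = np.median(X, axis=1)           # step 1: one median per sample (row)
    X_scaled = X / medians[:, np.newaxis]    # step 2: every row now has median 1
    global_factor = medians.mean()           # step 3: mean of the original medians
    return X_scaled * global_factor          # step 4: restore the overall scale

# Two samples with the same profile shape but a 5x loading difference
# in the first two features (made-up values).
X = np.array([
    [2.0, 4.0, 6.0],
    [10.0, 20.0, 60.0],
])
X_normalized = median_normalize_sketch(X)
row_medians = np.median(X_normalized, axis=1)
print(X_normalized)
print(row_medians)
```

The original medians are 4 and 20, so the global factor is 12 and both rows end up with a median of 12.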

Parameters

class pronoms.normalizers.MedianNormalizer

Bases: object

Normalizer that scales each sample by its median.

This normalizer adjusts each sample (row) in the data matrix by dividing that sample by its own median and then multiplying by the mean of all sample medians. After normalization every row’s median equals mean_of_medians, preserving the overall scale of the dataset rather than collapsing every row to a median of 1.

Inputs are arranged as (n_samples, n_features) (rows are samples, columns are proteins/features), following the sklearn convention.

scaling_factors

Per-sample medians used as the divisor (one value per row). Only available after calling normalize().

Type: Optional[np.ndarray]

mean_of_medians

Mean of scaling_factors; the common value every row’s median is rescaled to. Only available after calling normalize().

Type: Optional[float]

normalize(X: ndarray) → ndarray

Perform median normalization on input data X.

Parameters:

X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.

Returns:

Normalized data matrix with the same shape as X.

Return type:

np.ndarray

Raises:

ValueError –

If input is not a 2D array with at least one feature.
If input data contains NaN or Inf values.
If any sample’s median is ≤ 0 (protein quantities must be positive).
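
To make the three documented error conditions concrete, here is a minimal sketch of equivalent checks in NumPy. The helper names check_input_sketch and is_rejected, and the error messages, are hypothetical; the actual checks and messages inside normalize() may differ.

```python
import numpy as np

def check_input_sketch(X) -> None:
    """Hypothetical helper mirroring the documented ValueError conditions."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] < 1:
        raise ValueError("X must be a 2D array with at least one feature")
    if not np.all(np.isfinite(X)):
        raise ValueError("X contains NaN or Inf values")
    if np.any(np.median(X, axis=1) <= 0):
        raise ValueError("every sample median must be > 0")

def is_rejected(X) -> bool:
    """Return True if the sketch check raises ValueError for X."""
    try:
        check_input_sketch(X)
    except ValueError:
        return True
    return False

print(is_rejected([1.0, 2.0, 3.0]))                      # 1D input
print(is_rejected([[1.0, np.nan], [2.0, 3.0]]))          # NaN present
print(is_rejected([[0.0, 0.0, 5.0], [1.0, 2.0, 3.0]]))   # first row has median 0
print(is_rejected([[1.0, 2.0], [3.0, 4.0]]))             # valid input
```

The third case shows why zero-heavy rows are a problem: when more than half of a row's non-negative values are zero, its median is 0 and the row cannot be used as a divisor.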

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'Median Normalization Comparison', log_axes: bool = True) → Figure

Plot data before vs after normalization using a 2D hexbin density plot.

Parameters:

before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “Median Normalization Comparison”.
log_axes (bool, optional) – If True (default), plot log10 of the values on both axes. If False, plot raw values.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Usage Example

Basic median normalization:

import numpy as np
from pronoms.normalizers import MedianNormalizer

# Create sample data with different loading amounts
data = np.array([
    [100, 200, 300, 400],  # Sample 1: high loading
    [50, 100, 150, 200],   # Sample 2: medium loading
    [25, 50, 75, 100]      # Sample 3: low loading
])

# Create and apply normalizer
normalizer = MedianNormalizer()
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)

# Check that sample medians are now equal
print("\nSample medians after normalization:")
for i, sample in enumerate(normalized_data):
    print(f"Sample {i+1}: {np.median(sample):.2f}")
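
For reference, the values this example should produce can be worked out directly from the formula in Algorithm Details. The sketch below uses only NumPy and assumes normalize() implements exactly that formula; it does not call pronoms.

```python
import numpy as np

# Same values as the usage example, stored as floats.
data = np.array([
    [100, 200, 300, 400],
    [50, 100, 150, 200],
    [25, 50, 75, 100],
], dtype=float)

medians = np.median(data, axis=1)       # 250.0, 125.0, 62.5
mean_of_medians = medians.mean()        # 437.5 / 3, about 145.83
expected = data / medians[:, np.newaxis] * mean_of_medians

print(medians)
print(round(mean_of_medians, 2))
print(np.round(expected, 2))
```

Because the three rows are exact multiples of one another, they all normalize to the same row, approximately [58.33, 116.67, 175.0, 233.33], and each has a median of about 145.83.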

Visualization:

# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()

When to Use

MedianNormalizer is particularly useful when:

Sample loading varies: Different amounts of sample were loaded across runs
Labeling efficiency differs: Variations in chemical labeling or sample preparation
Most features unchanged: The majority of proteins/features are not expected to change between conditions
Simple normalization needed: When a straightforward, robust method is preferred
Preprocessing step: As an initial normalization before more complex methods

Considerations

Assumes stable features: Method assumes most features are not changing between samples
May over-normalize: If many features are truly changing, median-based normalization may remove real biological signal
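Not suitable for sparse data: normalize() raises ValueError on NaN or Inf values, and a row in which more than half of the non-negative values are zero has a median of 0, which is also rejected, so missing values and zero-heavy rows must be handled before normalization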
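Global scaling: The global factor is the mean of the medians of the samples passed to normalize(), so normalized values depend on which samples are in the matrix and are not directly comparable across separately normalized datasets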
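
The over-normalization risk can be seen in a small constructed case. The data below are hypothetical: two samples with identical loading, where three of five features genuinely double in the second sample. The normalization is written out in NumPy following the documented formula rather than calling pronoms.

```python
import numpy as np

# Hypothetical data: identical loading, 5 features.
reference = np.array([100.0, 100.0, 100.0, 100.0, 100.0])
# In the second sample, 3 of 5 features truly double; 2 are unchanged.
treated = np.array([200.0, 200.0, 200.0, 100.0, 100.0])

X = np.vstack([reference, treated])
medians = np.median(X, axis=1)                        # 100 and 200
X_norm = X / medians[:, np.newaxis] * medians.mean()  # mean of medians is 150

# Per-feature ratio of treated to reference after normalization.
ratio = X_norm[1] / X_norm[0]
print(X_norm)
print(ratio)
```

After normalization the three features that really doubled show a ratio of 1, and the two unchanged features appear halved. Because the majority of features changed, the median tracked the biological shift and the normalization removed it.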