QuantileNormalizer

The QuantileNormalizer applies quantile normalization, which makes the empirical distributions of all samples identical. After transformation, every sample takes its values from one shared reference distribution (tied values within a sample receive an averaged reference value), so samples are directly comparable regardless of their original distributions.

Overview

Quantile normalization rests on the assumption that samples would share the same distribution once technical variation is removed. The method works by:

Ranking values: Sort each sample independently to get ranks
Computing reference distribution: Average values across samples at each rank position
Reassigning values: Replace original values with the reference distribution values at corresponding ranks

This approach is particularly powerful for:

Removing systematic differences between samples or batches
Making samples directly comparable across different experimental conditions
Integrating data from multiple studies or platforms
Preprocessing for downstream analyses that assume similar distributions

Key Features

Identical distributions: After normalization every sample follows the same reference distribution; only tied values within a sample deviate, because they share an averaged reference value
Rank preservation: The relative ordering within each sample is maintained
Batch effect removal: Effective at removing systematic batch-to-batch variation
Reference-based: Uses the average distribution across all samples as the target

Algorithm Details

For a data matrix X with shape (n_samples, n_features):

Sort each sample: For each sample i, create sorted_i = sort(X[i, :])
Compute reference distribution: ref[j] = mean(sorted_1[j], sorted_2[j], …, sorted_n[j]) for each rank position j, where n = n_samples
Rank with average ties: For each sample, compute the average rank of every value (so tied values share a single fractional rank)
Map ranks to reference: Linearly interpolate the reference distribution at each rank, so tied values receive the average of the reference values at their tied positions (Bolstad et al., 2003)

Example: For samples [1, 3, 2] and [10, 30, 20]:

Sorted: [1, 2, 3] and [10, 20, 30]
Reference: [(1+10)/2, (2+20)/2, (3+30)/2] = [5.5, 11, 16.5]
Result: [5.5, 16.5, 11] and [5.5, 16.5, 11]

Tie example: For sample [5, 5, 10] with reference [5.33, 9.0, 14.33]:

Average ranks: [1.5, 1.5, 3]
Result: [(5.33 + 9.0) / 2, (5.33 + 9.0) / 2, 14.33] = [7.17, 7.17, 14.33]
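The four steps above can be sketched in plain NumPy. This is a minimal illustration written from the description on this page, not the library's actual implementation; the function name quantile_normalize_sketch is hypothetical, and the tie handling follows the "average rank, then interpolate" wording used here.

```python
import numpy as np


def quantile_normalize_sketch(X):
    """Quantile-normalize the rows of X (n_samples, n_features). Illustrative only."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]

    # Steps 1-2: sort each sample, then average across samples at each rank position.
    reference = np.sort(X, axis=1).mean(axis=0)

    positions = np.arange(1, n_features + 1)  # 1-based rank positions
    out = np.empty_like(X)
    for i, row in enumerate(X):
        # Step 3: ordinal ranks first, then replace the ranks of tied values
        # by their mean so that ties share one fractional rank.
        order = np.argsort(row, kind="stable")
        ranks = np.empty(n_features)
        ranks[order] = positions
        for value in np.unique(row):
            tied = row == value
            ranks[tied] = ranks[tied].mean()
        # Step 4: linearly interpolate the reference at each (possibly fractional) rank.
        out[i] = np.interp(ranks, positions, reference)
    return out, reference


normalized, reference = quantile_normalize_sketch([[1, 3, 2], [10, 30, 20]])
print(reference)
print(normalized)
```

Run on the two samples from the first example, the sketch yields the reference [5.5, 11, 16.5] and the result [5.5, 16.5, 11] for both rows. The reference in the tie example is consistent with normalizing [5, 5, 10] together with [1, 2, 3] and [10, 20, 30]; that sample set is an assumption made here for illustration, since the page does not state which samples produced it.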

Parameters

class pronoms.normalizers.QuantileNormalizer

Bases: object

Normalizer that performs quantile normalization across samples.

Quantile normalization makes the distribution of intensities for each sample identical by replacing each value with the mean of the corresponding quantiles across all samples.

Tied values within a row receive the same normalized value (the average of the reference values at the tied ranks), following Bolstad et al. (2003).

reference_distribution

The reference distribution used for normalization. Only available after calling normalize().

Type: Optional[np.ndarray]

normalize(X: ndarray) → ndarray

Perform quantile normalization on input data X.

Parameters: X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.
Returns: Normalized data matrix with the same shape as X.
Return type: np.ndarray
Raises: ValueError – If input data contains NaN or Inf values.
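Because normalize() rejects NaN and Inf values, it can help to locate such entries before calling it. The following is a small sketch using np.isfinite; the helper name non_finite_positions is hypothetical and is not part of pronoms.

```python
import numpy as np


def non_finite_positions(X):
    """Return (row, column) index pairs of NaN or Inf entries in X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return [tuple(int(i) for i in idx) for idx in np.argwhere(~np.isfinite(X))]


data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.inf]])
bad = non_finite_positions(data)
if bad:
    print(f"Found {len(bad)} non-finite value(s) at {bad}; handle them before normalizing.")
```

How to handle the flagged entries (filtering features, imputing, and so on) depends on the dataset and is outside the scope of this sketch.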

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'Quantile Normalization Comparison') → Figure

Plot data before vs after normalization using a 2D hexbin density plot.

Parameters:

before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default 'Quantile Normalization Comparison'.

Returns: Figure object containing the hexbin density plot.
Return type: plt.Figure

Usage Example

Basic quantile normalization:

import numpy as np
from pronoms.normalizers import QuantileNormalizer

# Create sample data with different distributions
data = np.array([
    [1, 5, 10, 20],    # Sample 1: low values
    [100, 500, 1000, 2000],  # Sample 2: high values
    [2, 8, 15, 25]     # Sample 3: intermediate values
])

# Create and apply normalizer
normalizer = QuantileNormalizer()
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)

# Verify identical distributions: every sorted row should show the same values
print("\nSample distributions after normalization:")
for i, sample in enumerate(normalized_data):
    print(f"Sample {i+1}: {np.sort(sample)}")

Visualization:

import matplotlib.pyplot as plt

# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
plt.show()  # blocks until the window is closed when run as a script

When to Use

QuantileNormalizer is particularly useful when:

Batch effects present: Strong systematic differences between experimental batches
Cross-platform integration: Combining data from different measurement platforms
Distribution assumptions: Downstream methods assume samples have similar distributions
Direct comparability needed: When samples must be directly comparable after normalization
Large-scale studies: Multi-center studies where technical variation dominates

Considerations

Strong assumption: Assumes all samples should have identical distributions
May remove biology: Can eliminate real biological differences between sample groups
Rank-based: Only preserves relative ordering, not absolute differences
Reference dependency: Results depend on the composition of the sample set
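Not suitable for sparse data: Many zeros create large groups of tied values that all receive the same normalized value, and missing values must be handled beforehand because normalize() raises ValueError on NaN or Inf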
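The "may remove biology" point can be illustrated with simulated data. The sketch below assumes two hypothetical groups, control and treated, that differ by a genuine global shift. It uses a simplified tie-free variant of quantile normalization written for this example rather than the library's normalize(). All numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=1.0, size=(3, 200))
treated = rng.normal(loc=12.0, scale=1.0, size=(3, 200))  # simulated global shift
X = np.vstack([control, treated])

# Tie-free quantile normalization: with continuous random data,
# ties are practically impossible, so plain ordinal ranks suffice.
reference = np.sort(X, axis=1).mean(axis=0)
ranks = X.argsort(axis=1).argsort(axis=1)  # 0-based rank of each value within its row
X_norm = reference[ranks]

gap_before = treated.mean() - control.mean()
gap_after = X_norm[3:].mean() - X_norm[:3].mean()
print(f"Group mean difference before: {gap_before:.2f}")
print(f"Group mean difference after:  {gap_after:.2f}")
```

After normalization every row is a permutation of the same reference values, so the difference between group means collapses to zero (up to floating-point error) even though the simulated shift was real. If such global shifts are expected to be biological, a less aggressive normalization may be more appropriate.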