QuantileNormalizer

The QuantileNormalizer applies quantile normalization, which forces the empirical distributions of all samples to be identical. After transformation, every sample contains the same set of values (up to the handling of ties), so samples are directly comparable regardless of their original distributions.

Overview

Quantile normalization is based on the principle that samples should have the same distribution if technical variation is removed. The method works by:

  1. Ranking values: Sort each sample independently to get ranks

  2. Computing reference distribution: Average values across samples at each rank position

  3. Reassigning values: Replace original values with the reference distribution values at corresponding ranks

This approach is particularly powerful for:

  • Removing systematic differences between samples or batches

  • Making samples directly comparable across different experimental conditions

  • Integrating data from multiple studies or platforms

  • Preprocessing for downstream analyses that assume similar distributions

Key Features

  • Identical distributions: All samples have exactly the same distribution after normalization

  • Rank preservation: The relative ordering within each sample is maintained

  • Batch effect removal: Effective at removing systematic batch-to-batch variation

  • Reference-based: Uses the average distribution across all samples as the target
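The first two properties can be checked on any before/after pair of matrices. The sketch below uses two hypothetical helper functions, rows_share_distribution and ranks_preserved, which are not part of pronoms; it assumes tie-free data and takes its input from the worked example in Algorithm Details.

```python
import numpy as np


def rows_share_distribution(after, atol=1e-12):
    """True if every row holds the same sorted values (identical distributions)."""
    sorted_rows = np.sort(np.asarray(after, dtype=float), axis=1)
    return bool(np.allclose(sorted_rows, sorted_rows[0], atol=atol))


def ranks_preserved(before, after):
    """True if the within-row ordering is unchanged (assumes tie-free rows)."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    return bool(np.array_equal(
        np.argsort(before, axis=1, kind="stable"),
        np.argsort(after, axis=1, kind="stable"),
    ))


before = np.array([[1, 3, 2], [10, 30, 20]])
after = np.array([[5.5, 16.5, 11.0], [5.5, 16.5, 11.0]])

print(rows_share_distribution(after))   # True
print(ranks_preserved(before, after))   # True
```

The same checks should apply to the output of normalize(), although rows containing ties will not have exactly identical sorted values.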

Algorithm Details

For a data matrix X with shape (n_samples, n_features):

  1. Sort each sample: For each sample i, create sorted_i = sort(X[i, :])

  2. Compute reference distribution: ref[j] = mean(sorted_1[j], sorted_2[j], …, sorted_n[j])

  3. Rank with average ties: For each sample, compute the average rank of every value (so tied values share a single fractional rank)

  4. Map ranks to reference: Linearly interpolate the reference distribution at each rank, so tied values receive the average of the reference values at their tied positions (Bolstad et al., 2003)

Example: For samples [1, 3, 2] and [10, 30, 20]:

  • Sorted: [1, 2, 3] and [10, 20, 30]

  • Reference: [(1+10)/2, (2+20)/2, (3+30)/2] = [5.5, 11, 16.5]

  • Result: [5.5, 16.5, 11] and [5.5, 16.5, 11]

Tie example: For sample [5, 5, 10] with reference [5.33, 9.0, 14.33]:

  • Average ranks: [1.5, 1.5, 3]

  • Result: [(5.33 + 9.0) / 2, (5.33 + 9.0) / 2, 14.33] = [7.17, 7.17, 14.33]
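The four steps can be sketched in plain NumPy as follows. This is a minimal illustration of the procedure as described here, not the pronoms source code; the function names average_ranks and quantile_normalize_sketch are hypothetical, and the tie handling follows the "interpolate at the average rank" wording above.

```python
import numpy as np


def average_ranks(row):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    _, inverse, counts = np.unique(row, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)      # last sorted position of each distinct value
    first = last - counts + 1     # first sorted position of each distinct value
    return ((first + last) / 2.0)[inverse]


def quantile_normalize_sketch(X):
    """Illustrative only; not the pronoms implementation."""
    X = np.asarray(X, dtype=float)
    if not np.isfinite(X).all():
        raise ValueError("Input contains NaN or Inf values.")
    # Steps 1-2: sort each sample, then average across samples per rank position
    reference = np.sort(X, axis=1).mean(axis=0)
    positions = np.arange(1, X.shape[1] + 1)
    # Steps 3-4: average ranks, then interpolate the reference at those ranks
    normalized = np.vstack([
        np.interp(average_ranks(row), positions, reference) for row in X
    ])
    return normalized, reference


normalized, reference = quantile_normalize_sketch([[1, 3, 2], [10, 30, 20]])
print(reference)    # reference values 5.5, 11.0, 16.5
print(normalized)   # both rows become 5.5, 16.5, 11.0
```

With a two-way tie, interpolating at the fractional rank gives the mean of the two neighbouring reference values, which matches the tie example. For ties of three or more, interpolation at the average rank equals the mean of the tied reference values only where the reference is locally linear, so an implementation that averages the tied reference values directly may differ slightly from this sketch.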

Parameters

class pronoms.normalizers.QuantileNormalizer

Bases: object

Normalizer that performs quantile normalization across samples.

Quantile normalization makes the distribution of intensities for each sample identical by replacing each value with the mean of the corresponding quantiles across all samples.

Tied values within a row receive the same normalized value (the average of the reference values at the tied ranks), following Bolstad et al. (2003).

reference_distribution

The reference distribution used for normalization. Only available after calling normalize().

Type:

Optional[np.ndarray]

normalize(X: ndarray) → ndarray

Perform quantile normalization on input data X.

Parameters:

X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.

Returns:

Normalized data matrix with the same shape as X.

Return type:

np.ndarray

Raises:

ValueError – If input data contains NaN or Inf values.
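Because non-finite input is rejected, it can help to locate and handle such entries before calling normalize(). The following is a small sketch using NumPy only; find_non_finite is a hypothetical helper, and dropping affected features is just one possible strategy (imputation is another, and the right choice depends on the analysis).

```python
import numpy as np


def find_non_finite(X):
    """Return the (row, column) index pairs of NaN or Inf entries in X."""
    X = np.asarray(X, dtype=float)
    return np.argwhere(~np.isfinite(X))


data = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.inf],
])

bad = find_non_finite(data)
print(bad.tolist())   # [[0, 1], [1, 2]]

# One option: keep only the features that are finite in every sample
clean = data[:, np.isfinite(data).all(axis=0)]
print(clean.shape)    # (2, 1)
```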

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'Quantile Normalization Comparison') → Figure

Plot data before vs after normalization using a 2D hexbin density plot.

Parameters:
  • before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).

  • after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).

  • figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).

  • title (str, optional) – Plot title, by default “Quantile Normalization Comparison”.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Usage Example

Basic quantile normalization:

import numpy as np
from pronoms.normalizers import QuantileNormalizer

# Create sample data with different distributions
data = np.array([
    [1, 5, 10, 20],    # Sample 1: low values
    [100, 500, 1000, 2000],  # Sample 2: high values
    [2, 8, 15, 25]     # Sample 3: intermediate values
])

# Create and apply normalizer
normalizer = QuantileNormalizer()
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)

# Verify identical distributions
print("\nSample distributions after normalization:")
for i, sample in enumerate(normalized_data):
    print(f"Sample {i+1}: {sorted(sample)}")

Visualization:

import matplotlib.pyplot as plt

# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
plt.show()  # blocks until the window is closed when run as a script

When to Use

QuantileNormalizer is particularly useful when:

  • Batch effects present: Strong systematic differences between experimental batches

  • Cross-platform integration: Combining data from different measurement platforms

  • Distribution assumptions: Downstream methods assume samples have similar distributions

  • Direct comparability needed: When samples must be directly comparable after normalization

  • Large-scale studies: Multi-center studies where technical variation dominates

Considerations

  • Strong assumption: Assumes all samples should have identical distributions

  • May remove biology: Can eliminate real biological differences between sample groups

  • Rank-based: Only preserves relative ordering, not absolute differences

  • Reference dependency: Results depend on the composition of the sample set

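  • Not suitable for sparse data: normalize() raises ValueError on NaN or Inf, so missing values must be imputed or filtered out first, and many zeros form large tie groups that all receive the same normalized value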
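The "may remove biology" caveat can be illustrated with simulated data. The sketch below uses a simplified, tie-free quantile normalization (qn_no_ties, a hypothetical helper rather than the pronoms implementation) on two simulated groups, where the second group has a genuine global shift in intensity.

```python
import numpy as np


def qn_no_ties(X):
    """Tie-free quantile normalization sketch; rows are samples."""
    X = np.asarray(X, dtype=float)
    reference = np.sort(X, axis=1).mean(axis=0)
    # 0-based rank of each value within its own row
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    return reference[ranks]


rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=1.0, size=(3, 200))
treated = rng.normal(loc=12.0, scale=1.0, size=(3, 200))  # real global shift

X = np.vstack([control, treated])
normalized = qn_no_ties(X)

gap_before = treated.mean() - control.mean()
gap_after = normalized[3:].mean() - normalized[:3].mean()
print(f"group gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

Every normalized row is a permutation of the same reference values, so the difference in group means vanishes even though it was real in the simulated input. The reference itself is an average over all six rows, which is why the result also depends on which samples are included.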

See Also