QuantileNormalizer
The QuantileNormalizer applies quantile normalization, which maps every sample onto a common reference distribution. After transformation, all samples share the same empirical distribution (up to tie handling), so they can be compared directly regardless of their original distributions.
Overview
Quantile normalization rests on the assumption that, once technical variation is removed, all samples would share the same distribution. The method works by:
Ranking values: Sort each sample independently to get ranks
Computing reference distribution: Average values across samples at each rank position
Reassigning values: Replace original values with the reference distribution values at corresponding ranks
This approach is particularly powerful for:
Removing systematic differences between samples or batches
Making samples directly comparable across different experimental conditions
Integrating data from multiple studies or platforms
Preprocessing for downstream analyses that assume similar distributions
Key Features
Identical distributions: All samples have the same distribution after normalization (exactly so when no sample contains tied values)
Rank preservation: The relative ordering within each sample is maintained
Batch effect removal: Effective at removing systematic batch-to-batch variation
Reference-based: Uses the average distribution across all samples as the target
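The first two properties above can be checked numerically on any before/after pair. The following is a minimal sketch using only NumPy; the helper names same_distribution and ordering_preserved are hypothetical and are not part of pronoms.

```python
import numpy as np

def same_distribution(after, tol=1e-9):
    """True if every row of `after` contains the same sorted values."""
    sorted_rows = np.sort(np.asarray(after, dtype=float), axis=1)
    return bool(np.allclose(sorted_rows, sorted_rows[0], atol=tol))

def ordering_preserved(before, after):
    """True if, within each row, `after` never reverses the order of `before`."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    for b_row, a_row in zip(before, after):
        order = np.argsort(b_row, kind="stable")
        # Walking through the row in ascending `before` order,
        # the normalized values must be non-decreasing.
        if np.any(np.diff(a_row[order]) < 0):
            return False
    return True

# Values taken from the worked example in Algorithm Details
before = [[1, 3, 2], [10, 30, 20]]
after = [[5.5, 16.5, 11.0], [5.5, 16.5, 11.0]]
print(same_distribution(after), ordering_preserved(before, after))
```

Passing the output of normalize() as `after` should satisfy both checks when the input has no ties; with ties, same_distribution may legitimately fail because tied values are averaged.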
Algorithm Details
For a data matrix X with shape (n_samples, n_features):
Sort each sample: For each sample i, create sorted_i = sort(X[i, :])
Compute reference distribution: ref[j] = mean(sorted_1[j], sorted_2[j], …, sorted_n[j])
Rank with average ties: For each sample, compute the average rank of every value (so tied values share a single fractional rank)
Map ranks to reference: Linearly interpolate the reference distribution at each rank, so tied values receive the average of the reference values at their tied positions (Bolstad et al., 2003)
Example: For samples [1, 3, 2] and [10, 30, 20]:
- Sorted: [1, 2, 3] and [10, 20, 30]
- Reference: [(1+10)/2, (2+20)/2, (3+30)/2] = [5.5, 11, 16.5]
- Result: [5.5, 16.5, 11] and [5.5, 16.5, 11]
Tie example: For sample [5, 5, 10] with reference [5.33, 9.0, 14.33]:
- Average ranks: [1.5, 1.5, 3]
- Result: [(5.33 + 9.0) / 2, (5.33 + 9.0) / 2, 14.33] = [7.17, 7.17, 14.33]
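The four steps can be sketched in plain NumPy as follows. This is an illustration of the described procedure under stated assumptions, not the pronoms source: the function names average_ranks and quantile_normalize_sketch are hypothetical, and the interpolation uses np.interp over 1-based rank positions. For a two-way tie, interpolating at the shared fractional rank coincides with the mean of the two reference values, as in the tie example.

```python
import numpy as np

def average_ranks(row):
    """1-based ranks of a 1-D array; tied values share the mean of their positions."""
    order = np.argsort(row, kind="stable")
    ordinal = np.empty(row.size, dtype=float)
    ordinal[order] = np.arange(1, row.size + 1)
    # Group equal values and average their ordinal ranks
    _, inverse, counts = np.unique(row, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ordinal)
    return rank_sums[inverse] / counts[inverse]

def quantile_normalize_sketch(X):
    """Return (normalized matrix, reference distribution) for X of shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    if not np.isfinite(X).all():
        raise ValueError("Input contains NaN or Inf values.")
    # Steps 1-2: sort each sample, then average across samples at each rank
    reference = np.sort(X, axis=1).mean(axis=0)
    positions = np.arange(1, X.shape[1] + 1)
    # Steps 3-4: average ranks per sample, then interpolate the reference at those ranks
    normalized = np.vstack(
        [np.interp(average_ranks(row), positions, reference) for row in X]
    )
    return normalized, reference

normalized, reference = quantile_normalize_sketch([[1, 3, 2], [10, 30, 20]])
print(reference)   # reference distribution of the first worked example
print(normalized)
```

Running the sketch on [[5, 5, 10], [1, 2, 3], [10, 20, 30]] yields a reference of roughly [5.33, 9.0, 14.33], so its first row reproduces the tie example.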
Parameters
- class pronoms.normalizers.QuantileNormalizer
Bases: object
Normalizer that performs quantile normalization across samples.
Quantile normalization makes the distribution of intensities for each sample identical by replacing each value with the mean of the corresponding quantiles across all samples.
Tied values within a row receive the same normalized value (the average of the reference values at the tied ranks), following Bolstad et al. (2003).
- reference_distribution
The reference distribution used for normalization. Only available after calling normalize().
- Type:
Optional[np.ndarray]
- normalize(X: ndarray) → ndarray
Perform quantile normalization on input data X.
- Parameters:
X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.
- Returns:
Normalized data matrix with the same shape as X.
- Return type:
np.ndarray
- Raises:
ValueError – If input data contains NaN or Inf values.
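Because non-finite input raises ValueError, data with missing values has to be cleaned before it is passed to normalize(). One possible approach, shown here as a sketch, is to drop every feature (column) that contains a NaN or Inf in any sample. The helper drop_nonfinite_features is hypothetical and not part of pronoms; imputation may be the better choice depending on the study.

```python
import numpy as np

def drop_nonfinite_features(X):
    """Remove columns containing NaN or Inf; return (filtered matrix, boolean keep mask)."""
    X = np.asarray(X, dtype=float)
    keep = np.isfinite(X).all(axis=0)
    return X[:, keep], keep

raw = np.array([
    [1.0, np.nan, 3.0, 7.0],
    [4.0, 5.0, np.inf, 9.0],
])
clean, keep = drop_nonfinite_features(raw)
print(keep)    # which features survived
print(clean)   # finite-only matrix, safe to pass to normalize()
```

Keeping the mask makes it possible to map normalized columns back to the original feature indices.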
- plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'Quantile Normalization Comparison') → Figure
Plot data before vs after normalization using a 2D hexbin density plot.
- Parameters:
before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “Quantile Normalization Comparison”.
- Returns:
Figure object containing the hexbin density plot.
- Return type:
plt.Figure
Usage Example
Basic quantile normalization:
import numpy as np
from pronoms.normalizers import QuantileNormalizer
# Create sample data with different distributions
data = np.array([
    [1, 5, 10, 20],          # Sample 1: low values
    [100, 500, 1000, 2000],  # Sample 2: high values
    [2, 8, 15, 25],          # Sample 3: intermediate values
])
# Create and apply normalizer
normalizer = QuantileNormalizer()
normalized_data = normalizer.normalize(data)
print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)
# Verify identical distributions
print("\nSample distributions after normalization:")
for i, sample in enumerate(normalized_data):
    print(f"Sample {i+1}: {sorted(sample)}")
Visualization:
# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()
When to Use
QuantileNormalizer is particularly useful when:
Batch effects present: Strong systematic differences between experimental batches
Cross-platform integration: Combining data from different measurement platforms
Distribution assumptions: Downstream methods assume samples have similar distributions
Direct comparability needed: When samples must be directly comparable after normalization
Large-scale studies: Multi-center studies where technical variation dominates
Considerations
Strong assumption: Assumes all samples should have identical distributions
May remove biology: Can eliminate real biological differences between sample groups
Rank-based: Only preserves relative ordering, not absolute differences
Reference dependency: Results depend on the composition of the sample set
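Not suitable for sparse data: normalize() raises ValueError on NaN or Inf, so missing values must be imputed or filtered out first, and many zeros form large tie groups whose members all receive the same normalized value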
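The "May remove biology" caveat can be made concrete with a toy example. The sketch below uses a simplified, no-ties quantile normalization written in NumPy (the function quantile_normalize_no_ties is hypothetical, not the pronoms implementation) and made-up numbers in which a "treated" group is uniformly three times higher than a "control" group. Whether such a global shift is biology or technical artifact is exactly what the method cannot tell.

```python
import numpy as np

def quantile_normalize_no_ties(X):
    """Simplified quantile normalization; assumes no tied values within a row."""
    X = np.asarray(X, dtype=float)
    reference = np.sort(X, axis=1).mean(axis=0)
    # Double argsort gives each value's 0-based rank within its row
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    return reference[ranks]

# Illustrative data: the treated group is a 3x global shift of the control group
control = np.array([[1.0, 2.0, 4.0, 8.0], [1.5, 2.5, 4.5, 8.5]])
treated = control * 3.0
X = np.vstack([control, treated])

normalized = quantile_normalize_no_ties(X)
gap_before = treated.mean() - control.mean()
gap_after = normalized[2:].mean() - normalized[:2].mean()
print(f"Group mean difference before: {gap_before:.2f}, after: {gap_after:.2f}")
```

Every row ends up with the same values, so the between-group difference in overall level disappears; only differences in the ranking of individual features survive.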
See Also
MedianNormalizer: For simpler scaling-based normalization
RankNormalizer: For rank-based transformation without enforcing identical distributions
VSNNormalizer: For variance-stabilizing normalization
MADNormalizer: For robust normalization using median absolute deviation