MedianPolishNormalizer
The MedianPolishNormalizer implements Tukey’s Median Polish algorithm to iteratively remove row and column medians from a data matrix. This method decomposes the data into overall, row, column, and residual effects, effectively removing systematic biases from both samples (rows) and features (columns).
Overview
Median Polish is a robust exploratory data analysis technique that decomposes a two-way table into additive components:
Data = Overall + Row Effect + Column Effect + Residual
The algorithm works by iteratively:
Removing row medians: Subtracting the median of each row from all values in that row
Removing column medians: Subtracting the median of each column from all values in that column
Updating overall effect: Tracking the cumulative median adjustments
Repeating until convergence: Continuing until changes become negligible
This approach is particularly effective for:
Removing systematic biases affecting entire samples or features
Exploratory analysis of two-way structured data
Preprocessing for downstream analyses that assume additive effects
Microarray and proteomics data where both sample and feature effects are present
Key Features
Dual bias removal: Corrects for both row (sample) and column (feature) effects simultaneously
Robust method: Uses medians instead of means, making it resistant to outliers
Additive decomposition: Provides interpretable components (overall, row, column, residual)
Iterative convergence: Continues until stable solution is reached
Log-space option: Can automatically log-transform data for multiplicative effects
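To make the robustness point concrete, here is a small NumPy illustration (the values are made up for the example). A single outlier shifts the mean far from the typical values, while the median barely moves, so median-centering leaves the well-behaved values close to zero.

```python
import numpy as np

# Four typical values plus one outlier (illustrative numbers only).
row = np.array([10.0, 11.0, 9.0, 10.0, 500.0])

# Mean-centering: the outlier pulls the centre to 108, so every
# typical value ends up near -98.
mean_centered = row - row.mean()

# Median-centering: the centre stays at 10, so typical values
# stay near zero and only the outlier stands out.
median_centered = row - np.median(row)

print(mean_centered)
print(median_centered)
```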
Algorithm Details
The Median Polish algorithm iteratively removes medians until convergence:
1. Initialize: Start with the original data matrix
2. Row sweep: For each row, subtract its median from all values
3. Column sweep: For each column, subtract its median from all values
4. Update overall: Add the median of medians to the overall effect
5. Check convergence: Repeat steps 2-4 until changes are below threshold
6. Return result: The final output is overall + residuals
Mathematical representation:
After convergence: X[i,j] = Overall + Row[i] + Column[j] + Residual[i,j]
The normalized output typically returns: Overall + Residual[i,j]
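The steps above can be sketched in plain NumPy. This is a minimal illustration of Tukey's procedure, not the pronoms implementation: the function name median_polish, its return tuple, and the exact order in which the overall term is updated are assumptions made for the example. The stopping rule follows the tolerance criterion documented for the class.

```python
import numpy as np


def median_polish(x, max_iterations=10, tolerance=0.01):
    """Illustrative median polish (hypothetical helper, not pronoms code)."""
    residuals = np.asarray(x, dtype=float).copy()
    n_rows, n_cols = residuals.shape
    overall = 0.0
    row_effects = np.zeros(n_rows)
    col_effects = np.zeros(n_cols)
    converged = False

    for _ in range(max_iterations):
        # Row sweep: move each row's median from the residuals
        # into the row effects.
        row_medians = np.median(residuals, axis=1)
        residuals -= row_medians[:, np.newaxis]
        row_effects += row_medians

        # Fold the median of the column effects into the overall term.
        shift = np.median(col_effects)
        col_effects -= shift
        overall += shift

        # Column sweep: move each column's median from the residuals
        # into the column effects.
        col_medians = np.median(residuals, axis=0)
        residuals -= col_medians[np.newaxis, :]
        col_effects += col_medians

        # Fold the median of the row effects into the overall term.
        shift = np.median(row_effects)
        row_effects -= shift
        overall += shift

        # Stop once the medians removed in this pass are negligible.
        if max(np.abs(row_medians).max(), np.abs(col_medians).max()) <= tolerance:
            converged = True
            break

    return overall, row_effects, col_effects, residuals, converged


# A purely additive table: 5 + row offset + column offset.
table = (
    5.0
    + np.array([0.0, 2.0, -1.0])[:, np.newaxis]
    + np.array([0.0, 3.0, -2.0, 1.0])[np.newaxis, :]
)
overall, rows, cols, resid, ok = median_polish(table)

# Normalized output in the sense used here: overall + residuals.
normalized = overall + resid
```

Every update moves a quantity from one component to another, so Overall + Row[i] + Column[j] + Residual[i,j] reproduces the input after each sweep. For a purely additive table like the one above, the residuals come out as zero and the normalized output is constant.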
Parameters
- class pronoms.normalizers.MedianPolishNormalizer(max_iterations: int = 10, tolerance: float = 0.01, epsilon: float = 1e-06, log_transform: bool = True)
Bases: object
Normalizer based on Tukey’s Median Polish algorithm.
This algorithm iteratively removes median effects from rows (samples) and columns (features) of a matrix, typically applied to log-transformed data. It decomposes the data X into: X[i, j] = overall_median + row_effect[i] + col_effect[j] + residual[i, j]
The normalized data returned is overall_median + residuals. If log_transform is True, the returned data remains in log-space, as described under normalize().
- max_iterations
Maximum number of iterations allowed for the algorithm.
- Type:
int
- tolerance
Convergence threshold on the per-iteration row/column medians removed from the residual matrix. The algorithm stops as soon as
max(|row_medians|.max(), |col_medians|.max()) <= tolerance.
- Type:
float
- epsilon
Small constant added before log transformation to handle non-positive values.
- Type:
float
- log_transform
Whether to log-transform the data before median polish. The output of normalize() then remains in log-space.
- Type:
bool
- row_effects
The calculated median effects for each row (sample). Available after normalize().
- Type:
Optional[np.ndarray]
- col_effects
The calculated median effects for each column (feature). Available after normalize().
- Type:
Optional[np.ndarray]
- overall_median
The calculated overall median effect. Available after normalize().
- Type:
Optional[float]
- residuals
The final residuals after removing row, column, and overall effects. Available after normalize().
- Type:
Optional[np.ndarray]
- converged
Whether the algorithm converged within max_iterations. Available after normalize().
- Type:
Optional[bool]
- iterations_run
Number of iterations actually performed. Available after normalize().
- Type:
Optional[int]
- normalize(X: ndarray) → ndarray
Apply Tukey’s Median Polish normalization to the data.
If log_transform is True, the input data is log-transformed before polishing. The method returns the normalized data defined as overall_median + residuals. Note: If log_transform was used, the returned data remains in log-space.
- Parameters:
X (np.ndarray) – Input data matrix (n_samples, n_features).
- Returns:
Normalized data matrix (overall_median + residuals). If log_transform=True, this matrix is in log-space.
- Return type:
np.ndarray
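Because normalize() returns log-space values when log_transform=True, downstream code that needs original-scale values has to invert the transform itself. The sketch below assumes the forward transform is a natural log of (value + epsilon); the source does not state the log base or exactly how epsilon is applied, so check the implementation before relying on this inverse.

```python
import numpy as np

epsilon = 1e-6

# Raw intensities, including a zero that a plain log cannot handle.
raw = np.array([[100.0, 200.0], [0.0, 400.0]])

# ASSUMED forward transform: natural log of (value + epsilon).
logged = np.log(raw + epsilon)

# Inverse of the assumed transform: exponentiate, then remove epsilon.
restored = np.exp(logged) - epsilon

print(np.isfinite(logged).all())
print(np.allclose(restored, raw))
```

The same inverse would be applied to the array returned by normalize() rather than to logged, provided the assumption about the forward transform holds.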
- plot_comparison(original_data: ndarray, normalized_data: ndarray, figsize: tuple[int, int] = (10, 8)) → Figure
Generate a hexbin plot comparing original data (log scale) vs. normalized data.
If log_transform was used during normalization, the normalized data (y-axis) will be in log-space. The original data (x-axis) is always plotted on a log scale for comparison consistency, especially when normalization involved log transform.
- Parameters:
original_data (np.ndarray) – The raw data matrix (n_samples, n_features).
normalized_data (np.ndarray) – The data matrix after normalization (n_samples, n_features). This will be in log-space if log_transform=True was used.
figsize (Tuple[int, int], optional) – Figure size for the plot, by default (10, 8).
- Returns:
Matplotlib figure object containing the hexbin plot.
- Return type:
plt.Figure
Usage Example
Basic median polish normalization:
import numpy as np
from pronoms.normalizers import MedianPolishNormalizer
# Create sample data with row and column effects
np.random.seed(42)
base_data = np.random.normal(100, 10, (4, 5))
# Add systematic row effects (sample biases)
row_effects = np.array([0, 20, -10, 15]).reshape(-1, 1)
# Add systematic column effects (feature biases)
col_effects = np.array([0, 50, -20, 30, 10])
# Combine effects
data = base_data + row_effects + col_effects
# Create and apply normalizer
normalizer = MedianPolishNormalizer(log_transform=False, max_iterations=10)
normalized_data = normalizer.normalize(data)
print("Original data:")
print(data)
print("\nNormalized data (residuals + overall):")
print(normalized_data)
# Examine the decomposition
print(f"\nOverall effect: {normalizer.overall_median:.2f}")
print(f"Row effects: {normalizer.row_effects}")
print(f"Column effects: {normalizer.col_effects}")
With log transformation:
# For multiplicative effects, use log transformation
normalizer_log = MedianPolishNormalizer(log_transform=True)
normalized_log = normalizer_log.normalize(data)
print("Log-transformed normalization:")
print(normalized_log)
Visualization:
# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()
When to Use
MedianPolishNormalizer is particularly useful when:
Two-way effects present: Both sample (row) and feature (column) biases exist
Exploratory analysis: Understanding the structure of systematic effects in data
Microarray data: Classic application for gene expression data
Proteomics preprocessing: When both sample preparation and protein-specific effects are present
Robust normalization needed: When outliers might affect mean-based methods
Considerations
Additive assumption: Assumes effects are additive (or multiplicative if log-transformed)
Convergence: May require multiple iterations to reach stable solution
Interpretation: Results are residuals plus the overall effect; with log_transform=True they are in log-space, not on the original scale
Missing values: Algorithm may not handle missing data well
Small datasets: May be unstable with very small sample or feature numbers
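The missing-values consideration above comes down to how medians treat NaN. The NumPy sketch below shows the problem and one possible workaround; it says nothing about what pronoms does internally, and whether ignoring NaN is appropriate depends on why the values are missing.

```python
import numpy as np

# A small matrix with two missing entries (illustrative values).
data = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
])

# np.median propagates NaN: any row containing a missing value
# gets a NaN median, which would then contaminate the whole sweep.
plain = np.median(data, axis=1)

# np.nanmedian ignores NaN entries and uses the observed values only.
nan_aware = np.nanmedian(data, axis=1)

print(plain)
print(nan_aware)
```

Imputing or filtering missing values before calling normalize() is another option when a NaN-aware sweep is not available.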
See Also
MedianNormalizer: For simple median-based scaling without two-way decomposition
MADNormalizer: For robust normalization using median absolute deviation
QuantileNormalizer: For making distributions identical across samples
VSNNormalizer: For variance-stabilizing normalization