MedianPolishNormalizer

The MedianPolishNormalizer implements Tukey’s Median Polish algorithm to iteratively remove row and column medians from a data matrix. This method decomposes the data into overall, row, column, and residual effects, effectively removing systematic biases from both samples (rows) and features (columns).

Overview

Median Polish is a robust exploratory data analysis technique that decomposes a two-way table into additive components:

Data = Overall + Row Effect + Column Effect + Residual

The algorithm works by iteratively:

Removing row medians: Subtracting the median of each row from all values in that row
Removing column medians: Subtracting the median of each column from all values in that column
Updating overall effect: Tracking the cumulative median adjustments
Repeating until convergence: Continuing until changes become negligible
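
A single row sweep followed by a single column sweep can be traced by hand on a small table. The sketch below is illustrative only: the 3x3 table is made up, and it uses plain Python lists with `statistics.median` rather than the pronoms implementation.

```python
from statistics import median

# Hypothetical 3x3 table: rows are samples, columns are features.
# The value 9.0 in the middle row breaks an otherwise additive pattern.
table = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 9.0],
    [7.0, 8.0, 9.0],
]

# Row sweep: subtract each row's median from every value in that row.
row_medians = [median(row) for row in table]
after_rows = [[v - m for v in row] for row, m in zip(table, row_medians)]

# Column sweep: subtract each column's median from every value in that column.
col_medians = [median(col) for col in zip(*after_rows)]
after_cols = [[v - m for v, m in zip(row, col_medians)] for row in after_rows]

print(row_medians)  # [2.0, 5.0, 8.0]
print(col_medians)  # [-1.0, 0.0, 1.0]
print(after_cols)   # [[0.0, 0.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 0.0]]
```

After one pass, the only non-zero residual is the entry that did not fit the additive row-plus-column pattern.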

This approach is particularly effective for:

Removing systematic biases affecting entire samples or features
Exploratory analysis of two-way structured data
Preprocessing for downstream analyses that assume additive effects
Microarray and proteomics data where both sample and feature effects are present

Key Features

Dual bias removal: Corrects for both row (sample) and column (feature) effects simultaneously
Robust method: Uses medians instead of means, making it resistant to outliers
Additive decomposition: Provides interpretable components (overall, row, column, residual)
Iterative convergence: Repeats the sweeps until the removed medians fall within tolerance or max_iterations is reached
Log-space option: Can automatically log-transform data for multiplicative effects
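
The robustness claim can be checked with a small, made-up example: a single extreme value moves the mean substantially but leaves the median unchanged. The numbers below are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical measurements, with and without one extreme outlier.
clean = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = [10.0, 11.0, 12.0, 13.0, 1000.0]

# How far does each summary statistic move when the outlier appears?
mean_shift = mean(with_outlier) - mean(clean)
median_shift = median(with_outlier) - median(clean)

print(median_shift)  # 0.0
print(mean_shift > 100)  # True
```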

Algorithm Details

The Median Polish algorithm iteratively removes medians until convergence:

1. Initialize: Start with the original data matrix
2. Row sweep: For each row, subtract its median from all values
3. Column sweep: For each column, subtract its median from all values
4. Update overall: Add the median of medians to the overall effect
5. Check convergence: Repeat steps 2-4 until the medians removed in a pass are within tolerance, or max_iterations is reached
6. Return result: The output is the overall effect plus the residuals
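
The steps above can be sketched in plain Python as follows. This is a minimal illustration of textbook median polish, not the pronoms source: the function name `median_polish`, the order of the sweeps, and the way the overall effect is updated (by moving the median of the accumulated row and column effects into it) are assumptions.

```python
from statistics import median


def median_polish(table, max_iterations=10, tolerance=0.01):
    """Illustrative median polish on a list-of-lists table (not the pronoms code)."""
    n_rows, n_cols = len(table), len(table[0])
    residuals = [list(row) for row in table]  # working copy of the data
    overall = 0.0
    row_effects = [0.0] * n_rows
    col_effects = [0.0] * n_cols
    converged = False
    iterations_run = 0

    for _ in range(max_iterations):
        iterations_run += 1

        # Row sweep: move each row's median from the residuals into row_effects.
        row_medians = [median(row) for row in residuals]
        for i, m in enumerate(row_medians):
            residuals[i] = [v - m for v in residuals[i]]
            row_effects[i] += m
        # Move the median of the column effects into the overall effect.
        shift = median(col_effects)
        overall += shift
        col_effects = [c - shift for c in col_effects]

        # Column sweep: move each column's median into col_effects.
        col_medians = [median(col) for col in zip(*residuals)]
        for j, m in enumerate(col_medians):
            for i in range(n_rows):
                residuals[i][j] -= m
            col_effects[j] += m
        # Move the median of the row effects into the overall effect.
        shift = median(row_effects)
        overall += shift
        row_effects = [r - shift for r in row_effects]

        # Stop once the medians removed in this pass are within tolerance.
        largest = max(
            max(abs(m) for m in row_medians),
            max(abs(m) for m in col_medians),
        )
        if largest <= tolerance:
            converged = True
            break

    return overall, row_effects, col_effects, residuals, converged, iterations_run


# Hypothetical 3x3 table with one entry (9.0 in the middle row) off-pattern.
table = [[1.0, 2.0, 3.0], [4.0, 5.0, 9.0], [7.0, 8.0, 9.0]]
overall, row_effects, col_effects, residuals, converged, iterations_run = median_polish(table)

# Normalized output in the sense used here: overall effect plus residuals.
normalized = [[overall + r for r in row] for row in residuals]
print(normalized)  # [[5.0, 5.0, 5.0], [5.0, 5.0, 8.0], [5.0, 5.0, 5.0]]
```

At every step of this sketch, each original value equals the overall effect plus its row effect, column effect, and residual, so the sweeps only move information between the four components.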

Mathematical representation:

After convergence: X[i,j] = Overall + Row[i] + Column[j] + Residual[i,j]

The normalized output typically returns: Overall + Residual[i,j]
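
As a quick arithmetic check of the two formulas, returning Overall + Residual[i,j] is the same as subtracting the row and column effects from X[i,j]. The component values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical decomposition of a 2x3 table (values chosen for illustration).
overall = 5.0
row = [-3.0, 3.0]
col = [-1.0, 0.0, 1.0]
residual = [[0.0, 0.0, 0.5], [0.0, -0.5, 0.0]]

# Rebuild X from its four components.
X = [[overall + row[i] + col[j] + residual[i][j] for j in range(3)] for i in range(2)]

# Normalized output as described above: overall effect plus residual.
normalized = [[overall + residual[i][j] for j in range(3)] for i in range(2)]

# Equivalent view: remove the row and column effects from X.
also_normalized = [[X[i][j] - row[i] - col[j] for j in range(3)] for i in range(2)]

print(normalized)  # [[5.0, 5.0, 5.5], [5.0, 4.5, 5.0]]
```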

Parameters

class pronoms.normalizers.MedianPolishNormalizer(max_iterations: int = 10, tolerance: float = 0.01, epsilon: float = 1e-06, log_transform: bool = True)

Bases: object

Normalizer based on Tukey’s Median Polish algorithm.

This algorithm iteratively removes median effects from rows (samples) and columns (features) of a matrix, typically applied to log-transformed data. It decomposes the data X into: X[i, j] = overall_median + row_effect[i] + col_effect[j] + residual[i, j]

The normalized data returned is the residuals plus overall_median. If log transformation was used, the returned values stay in log-space, as noted under normalize().

max_iterations

Maximum number of iterations allowed for the algorithm.

Type: int

tolerance

Convergence threshold on the per-iteration row/column medians removed from the residual matrix. The algorithm stops as soon as max(|row_medians|.max(), |col_medians|.max()) <= tolerance.

Type: float
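
The stopping rule described for tolerance can be written as a small predicate. The helper name `has_converged` is hypothetical and is not part of pronoms; it only restates the documented condition on plain Python lists.

```python
def has_converged(row_medians, col_medians, tolerance=0.01):
    """Return True when the largest absolute median removed is within tolerance."""
    largest_row = max(abs(m) for m in row_medians)
    largest_col = max(abs(m) for m in col_medians)
    return max(largest_row, largest_col) <= tolerance


# Hypothetical medians removed in one pass.
print(has_converged([0.001, -0.005], [0.0, 0.01]))  # True
print(has_converged([0.0, 0.0], [0.02, 0.0]))       # False
```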

epsilon

Small constant added before log transformation to handle non-positive values.

Type: float

log_transform

Whether to log-transform the data before median polish. As noted under normalize(), the returned data then remains in log-space.

Type: bool
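
To illustrate why a log transform suits multiplicative effects, here is a minimal sketch. It assumes a natural logarithm and that epsilon is added to every value before the log is taken; neither detail is confirmed by this page, and the helper `to_log_space` is hypothetical rather than part of pronoms.

```python
import math


def to_log_space(table, epsilon=1e-6):
    """Assumed transform: natural log of (value + epsilon)."""
    return [[math.log(v + epsilon) for v in row] for row in table]


# Hypothetical samples: row_b is row_a scaled by a constant factor of 2.
row_a = [10.0, 100.0, 1000.0]
row_b = [2.0 * v for v in row_a]

log_a, log_b = to_log_space([row_a, row_b])

# In log space the multiplicative factor becomes a near-constant offset of log(2).
differences = [b - a for a, b in zip(log_a, log_b)]
print(all(abs(d - math.log(2.0)) < 1e-6 for d in differences))  # True

# epsilon keeps an exact zero finite under the log.
print(math.isfinite(to_log_space([[0.0]])[0][0]))  # True
```

A constant offset per row is exactly the kind of additive row effect that median polish removes.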

row_effects

The calculated median effects for each row (sample). Available after normalize().

Type: Optional[np.ndarray]

col_effects

The calculated median effects for each column (feature). Available after normalize().

Type: Optional[np.ndarray]

overall_median

The calculated overall median effect. Available after normalize().

Type: Optional[float]

residuals

The final residuals after removing row, column, and overall effects. Available after normalize().

Type: Optional[np.ndarray]

converged

Whether the algorithm converged within max_iterations. Available after normalize().

Type: Optional[bool]

iterations_run

Number of iterations actually performed. Available after normalize().

Type: Optional[int]

normalize(X: ndarray) → ndarray

Apply Tukey’s Median Polish normalization to the data.

If log_transform is True, the input data is log-transformed before polishing. The method returns the normalized data defined as overall_median + residuals. Note: If log_transform was used, the returned data remains in log-space.

Parameters: X (np.ndarray) – Input data matrix (n_samples, n_features).
Returns: Normalized data matrix (overall_median + residuals). If log_transform=True, this matrix is in log-space.
Return type: np.ndarray
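
Because the result stays in log-space when log_transform=True, values may need to be mapped back to the original scale by hand. The sketch below is an assumption, not documented behavior: it supposes the forward transform is a natural log of (value + epsilon), so the inverse is exp(value) - epsilon. The helper `from_log_space` is hypothetical and not part of pronoms; the actual transform should be checked in the library source before relying on this.

```python
import math


def from_log_space(log_table, epsilon=1e-6):
    """Assumed inverse of log(value + epsilon); verify against the library source."""
    return [[math.exp(v) - epsilon for v in row] for row in log_table]


# Hypothetical log-space values built with the assumed forward transform.
log_values = [[math.log(10.0 + 1e-6), math.log(250.0 + 1e-6)]]
restored = from_log_space(log_values)

print([round(v, 6) for v in restored[0]])  # [10.0, 250.0]
```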

plot_comparison(original_data: ndarray, normalized_data: ndarray, figsize: tuple[int, int] = (10, 8)) → Figure

Generate a hexbin plot comparing the original data with the normalized data.

The original data (x-axis) is always plotted on a log scale. If log_transform was used during normalization, the normalized data (y-axis) is in log-space as well, which keeps the two axes comparable.

Parameters:

original_data (np.ndarray) – The raw data matrix (n_samples, n_features).
normalized_data (np.ndarray) – The data matrix after normalization (n_samples, n_features). This will be in log-space if log_transform=True was used.
figsize (Tuple[int, int], optional) – Figure size for the plot, by default (10, 8).

Returns:

Matplotlib figure object containing the hexbin plot.

Return type:

plt.Figure

Usage Example

Basic median polish normalization:

import numpy as np
from pronoms.normalizers import MedianPolishNormalizer

# Create sample data with row and column effects
np.random.seed(42)
base_data = np.random.normal(100, 10, (4, 5))

# Add systematic row effects (sample biases)
row_effects = np.array([0, 20, -10, 15]).reshape(-1, 1)

# Add systematic column effects (feature biases)
col_effects = np.array([0, 50, -20, 30, 10])

# Combine effects
data = base_data + row_effects + col_effects

# Create and apply normalizer
normalizer = MedianPolishNormalizer(log_transform=False, max_iterations=10)
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nNormalized data (residuals + overall):")
print(normalized_data)

# Examine the decomposition (attribute names as documented above)
print(f"\nOverall effect: {normalizer.overall_median:.2f}")
print(f"Row effects: {normalizer.row_effects}")
print(f"Column effects: {normalizer.col_effects}")

With log transformation:

# For multiplicative effects, use log transformation
normalizer_log = MedianPolishNormalizer(log_transform=True)
normalized_log = normalizer_log.normalize(data)

print("Log-transformed normalization:")
print(normalized_log)

Visualization:

# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()

When to Use

MedianPolishNormalizer is particularly useful when:

Two-way effects present: Both sample (row) and feature (column) biases exist
Exploratory analysis: Understanding the structure of systematic effects in data
Microarray data: Classic application for gene expression data
Proteomics preprocessing: When both sample preparation and protein-specific effects are present
Robust normalization needed: When outliers might affect mean-based methods

Considerations

Additive assumption: Assumes effects are additive (or multiplicative if log-transformed)
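Convergence: May require multiple iterations to reach a stable solution; check the converged attribute after normalize()
Interpretation: Results are residuals plus the overall effect, so row and column effects are no longer present; with log_transform=True the values are in log-space, not the original scale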
Missing values: Algorithm may not handle missing data well
Small datasets: May be unstable with very few samples or features
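
Since missing values may not be handled well, one cautious option is to locate them before calling normalize(). The sketch below is a hypothetical pre-check on plain Python lists; the helper `find_missing` is not part of pronoms, and how the library itself treats NaN is not stated on this page.

```python
import math


def find_missing(table):
    """Return (row, column) positions of NaN entries in a list-of-lists table."""
    return [
        (i, j)
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if math.isnan(value)
    ]


# Hypothetical data with one missing measurement.
data = [[1.0, float("nan")], [3.0, 4.0]]
missing = find_missing(data)
print(missing)  # [(0, 1)]
```

How to treat the flagged positions (dropping rows or columns, or imputing) depends on the dataset and is left open here.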