MADNormalizer
The MADNormalizer is a robust scaling method that centers each sample by subtracting its median and scales it by the Median Absolute Deviation (MAD). Because both statistics are medians, outliers and skewed, non-normal distributions distort the result far less than they distort mean and standard deviation-based methods.
Overview
MAD normalization is a robust alternative to z-score normalization that uses median-based statistics instead of mean-based ones. The method transforms each sample to have a median of 0 and a MAD-based scale, making it particularly suitable for data with:
Outliers or extreme values
Non-normal distributions
Skewed data where mean and standard deviation are not representative
Need for robust statistical preprocessing
The approach works by:
(Optional) Log transform: By default the data is transformed to log2(X + 1) before computing statistics. Set log_transform=False to operate on the raw values.
Centering: Subtracting the (log) median from each value
Scaling: Dividing by k * MAD, where k is the consistency constant chosen via scale_to_sigma (1.4826 for σ-equivalent output, 1 for raw MAD)
This creates standardized samples that are less sensitive to outliers than those produced by traditional z-score normalization. Working in log-space is the default because it stabilizes variance and matches the typical multiplicative noise structure of mass-spectrometry intensity data.
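To make the outlier sensitivity concrete, the following sketch compares a classical z-score with a median/MAD score on one row that contains a single extreme value. It uses only Python's standard library rather than pronoms, skips the log transform for simplicity, and the helper names zscore and mad_score are illustrative assumptions, not part of the library.

```python
import statistics

def zscore(row):
    """Classical z-score: center by the mean, scale by the population std."""
    mu = statistics.mean(row)
    sd = statistics.pstdev(row)
    return [(x - mu) / sd for x in row]

def mad_score(row, k=1.4826):
    """Robust score: center by the median, scale by k * MAD."""
    med = statistics.median(row)
    mad = statistics.median([abs(x - med) for x in row])
    return [(x - med) / (k * mad) for x in row]

row = [10, 20, 15, 25, 1000]  # four typical values and one extreme value
z = zscore(row)
r = mad_score(row)
print([round(v, 2) for v in z])
print([round(v, 2) for v in r])
```

In this row the extreme value inflates both the mean and the standard deviation, so its z-score stays near 2, while its MAD-based score is above 100 and the four typical values keep a spread of roughly one unit.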
Note
Pass scale_to_sigma=True to multiply MAD by the standard 1.4826
consistency constant. The output is then a robust z-score (per-row
spread ≈ 1 σ for normal data) and matches R’s mad() and
statsmodels.robust.scale.mad by default. The current implicit
default (raw MAD divisor) is preserved for backward compatibility but
emits a DeprecationWarning and will flip to scale_to_sigma=True
in a future major release. Pass the argument explicitly to lock in the
behavior you want.
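The origin of the 1.4826 constant can be checked with a short standard-library sketch. It computes 1 / Φ⁻¹(0.75) with statistics.NormalDist and then confirms on simulated normal data that k * MAD lands close to the true σ. The sample size, seed, and σ value are arbitrary choices made for illustration.

```python
import random
import statistics

# 1 / Phi^{-1}(0.75): the factor that makes MAD estimate sigma for normal data.
k = 1 / statistics.NormalDist().inv_cdf(0.75)
print(round(k, 4))

# Simulated check on normal data with a known sigma (seeded for repeatability).
rng = random.Random(0)
sigma = 2.0
sample = [rng.gauss(0.0, sigma) for _ in range(20000)]
med = statistics.median(sample)
mad = statistics.median([abs(x - med) for x in sample])
print(round(k * mad, 2))  # expected to land close to sigma
```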
Key Features
Robust to outliers: Uses median instead of mean, reducing outlier influence
Distribution-free: Works well with non-normal and skewed distributions
Standardized output: Centers data around 0 with MAD-based scaling
Preserves relationships: Maintains relative ordering within samples
Algorithm Details
For a data matrix X with shape (n_samples, n_features), let Y denote the data
on which the statistics are computed. With the default log_transform=True the
algorithm uses Y = log2(X + 1); with log_transform=False it uses Y = X.
Calculate median: For each sample i, compute median_i = median(Y[i, :])
Calculate MAD: MAD_i = median(|Y[i, :] - median_i|)
Apply transformation: X_normalized[i, j] = (Y[i, j] - median_i) / (k * MAD_i)
The constant k is 1.4826 when scale_to_sigma=True (the
σ-consistency constant under normality, 1 / Φ⁻¹(0.75)) and 1 when
scale_to_sigma=False.
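The three steps above can be sketched in plain Python as follows. This is a minimal illustration of the described algorithm working on lists of rows, not the library's actual code; the function name mad_normalize is a hypothetical stand-in, and only the parameter names and the constant are taken from the text.

```python
import math
import statistics

MAD_SIGMA_CONSTANT = 1.4826  # sigma-consistency constant named in the text

def mad_normalize(X, log_transform=True, scale_to_sigma=True):
    """Row-wise MAD normalization of a list of rows (illustrative sketch)."""
    k = MAD_SIGMA_CONSTANT if scale_to_sigma else 1.0
    out = []
    for row in X:
        # Y = log2(X + 1) by default, otherwise the raw values.
        y = [math.log2(x + 1) for x in row] if log_transform else list(row)
        med = statistics.median(y)                          # median_i
        mad = statistics.median([abs(v - med) for v in y])  # MAD_i
        if mad == 0:
            raise ValueError("MAD is zero; the row cannot be scaled")
        out.append([(v - med) / (k * mad) for v in y])
    return out

X = [[10, 20, 15, 25, 1000], [100, 120, 110, 130, 105]]
for row in mad_normalize(X):
    print([round(v, 3) for v in row])
```

Each output row has a median of 0, and the scale of the remaining values depends on whether k is 1.4826 or 1.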
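Mathematical representation (with log_transform=True, the default):
\[ X_{\mathrm{normalized}}[i, j] = \frac{\log_2(X[i, j] + 1) - \mathrm{median}_i}{k \cdot \mathrm{MAD}_i} \]
Mathematical representation (with log_transform=False):
\[ X_{\mathrm{normalized}}[i, j] = \frac{X[i, j] - \mathrm{median}_i}{k \cdot \mathrm{MAD}_i} \]
where in either case:
\[ \mathrm{median}_i = \mathrm{median}(Y[i, :]), \qquad \mathrm{MAD}_i = \mathrm{median}\left(\lvert Y[i, :] - \mathrm{median}_i \rvert\right) \]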
Example (log_transform=False, scale_to_sigma=False): For sample
[1, 5, 10, 100]:
Median = 7.5
MAD = median([6.5, 2.5, 2.5, 92.5]) = 4.5
Normalized ≈ [-1.44, -0.56, 0.56, 20.56]
With scale_to_sigma=True every value above is divided by 1.4826,
giving roughly [-0.97, -0.37, 0.37, 13.86], which can be read directly as a
robust z-score.
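The arithmetic of this worked example can be reproduced with the standard library alone. The snippet below is a hand calculation for the single sample [1, 5, 10, 100] and does not call pronoms.

```python
import statistics

sample = [1, 5, 10, 100]
median = statistics.median(sample)                          # 7.5
mad = statistics.median([abs(x - median) for x in sample])  # 4.5

raw = [(x - median) / mad for x in sample]  # scale_to_sigma=False (k = 1)
robust_z = [v / 1.4826 for v in raw]        # scale_to_sigma=True (k = 1.4826)

print(median, mad)
print([round(v, 2) for v in raw])
print([round(v, 2) for v in robust_z])
```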
Parameters
- class pronoms.normalizers.MADNormalizer(log_transform: bool = True, scale_to_sigma: bool = <object object>)[source]
Bases: object
Median Absolute Deviation (MAD) Normalizer.
Centers each sample (row) by subtracting its median and scales it by its Median Absolute Deviation (MAD).
Optionally performs calculations on log2-transformed data (default) to stabilize variance and handle typical intensity distributions.
- If log_transform=True (default):
Calculations (median, MAD) are performed on log2(X + 1). Normalization: (log2(X + 1) - median_log) / (k * MAD_log)
- If log_transform=False:
Calculations are performed directly on X. Normalization: (X - median) / (k * MAD)
Where k is the consistency constant set by scale_to_sigma:
- scale_to_sigma=True: k = 1.4826 (MAD_SIGMA_CONSTANT). The output is a robust z-score: per-row spread ≈ 1 σ for normal data. Matches R’s mad() default and statsmodels.robust.scale.mad.
- scale_to_sigma=False: k = 1 (raw MAD divisor). Per-row spread is ≈ 1.4826 × what a true robust z-score would give. Use this if you explicitly want raw-MAD output and have not standardized to σ.
Deprecated: Calling without scale_to_sigma emits a DeprecationWarning; the implicit default (raw MAD) will be replaced by scale_to_sigma=True in a future major release. Pass the argument explicitly to lock in your intended behavior across versions.
- log_transform
Whether log2 transformation was applied before normalization.
- Type:
bool
- scale_to_sigma
Whether the divisor is MAD_SIGMA_CONSTANT * MAD (σ-equivalent) rather than raw MAD.
- Type:
bool
- row_medians
Median of the (potentially log2-transformed) data for each sample.
- Type:
np.ndarray
- row_mads
Raw Median Absolute Deviation (MAD) of the (potentially log2-transformed) data for each sample. Always the unscaled MAD, regardless of scale_to_sigma.
- Type:
np.ndarray
- normalize(X: ndarray) ndarray[source]
Apply MAD normalization to the input data matrix X.
- Parameters:
X (np.ndarray) – Input data matrix (n_samples, n_features). Must contain non-negative values if log_transform=True.
- Returns:
Normalized data matrix.
- Return type:
np.ndarray
- Raises:
ValueError –
- If input is not a 2D array with at least one feature.
- If input data contains NaN or Inf values.
- If log_transform=True and input data contains negative values.
- If MAD is zero for any sample (which prevents normalization).
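The error conditions listed for normalize can be mirrored by a small pre-flight check. The sketch below is a hypothetical helper named check_input, written with the standard library and assuming X is a list of rows. The order of the checks and the messages are assumptions and may differ from what the library does.

```python
import math
import statistics

def check_input(X, log_transform=True):
    """Raise ValueError for the documented conditions (hypothetical helper)."""
    rows = [list(r) for r in X]
    # A usable 2D input has at least one row, one feature, and equal row lengths.
    if not rows or any(len(r) == 0 for r in rows) or len({len(r) for r in rows}) != 1:
        raise ValueError("expected a 2D array with at least one feature")
    if any(not math.isfinite(v) for r in rows for v in r):
        raise ValueError("input contains NaN or Inf values")
    if log_transform and any(v < 0 for r in rows for v in r):
        raise ValueError("negative values are not allowed with log_transform=True")
    for i, r in enumerate(rows):
        y = [math.log2(v + 1) for v in r] if log_transform else r
        med = statistics.median(y)
        if statistics.median([abs(v - med) for v in y]) == 0:
            raise ValueError(f"MAD is zero for sample {i}")

# Each of these inputs triggers one of the documented conditions.
for bad in ([[1.0, float("nan"), 3.0]], [[-1.0, 2.0, 3.0]], [[4.0, 4.0, 4.0]]):
    try:
        check_input(bad)
    except ValueError as exc:
        print(exc)
```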
- plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'MAD Normalization Comparison') Figure[source]
Plot data before vs after normalization using a 2D hexbin density plot.
- Parameters:
before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “MAD Normalization Comparison”.
- Returns:
Figure object containing the hexbin density plot.
- Return type:
plt.Figure
Usage Example
Basic MAD normalization:
import numpy as np
from pronoms.normalizers import MADNormalizer
# Create sample data with outliers
data = np.array([
    [10, 20, 15, 25, 1000],     # Sample 1: with outlier
    [100, 120, 110, 130, 105],  # Sample 2: normal range
    [5, 8, 6, 9, 7]             # Sample 3: low values
])
# Create and apply normalizer.
# By default, log_transform=True, so statistics are computed on log2(X + 1).
# Pass log_transform=False to operate on the raw values instead.
# scale_to_sigma=True multiplies MAD by 1.4826 so the output is a
# robust z-score (matches R's mad()).
normalizer = MADNormalizer(scale_to_sigma=True)
normalized_data = normalizer.normalize(data)
print("Original data:")
print(data)
print("\nMAD normalized data (computed in log2 space):")
print(normalized_data)
# Check centering (medians should be ~0 in either mode)
print("\nSample medians after normalization:")
for i, sample in enumerate(normalized_data):
    print(f"Sample {i+1}: {np.median(sample):.6f}")
# To use the settings of the worked example above (raw-scale MAD, no σ-consistency):
raw_normalizer = MADNormalizer(log_transform=False, scale_to_sigma=False)
raw_normalized = raw_normalizer.normalize(data)
Visualization:
# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()
When to Use
MADNormalizer is particularly useful when:
Outliers present: Data contains extreme values that would skew mean/std-based methods
Non-normal distributions: Data is skewed or has heavy tails
Robust preprocessing needed: When stability against outliers is important
Proteomics data: Common in mass spectrometry data with occasional extreme measurements
Quality control: When some samples may have measurement artifacts
Considerations
Zero MAD handling: Samples with zero MAD (more than half of the values equal to the sample median, including the case where all values are identical) cannot be scaled and will raise a ValueError
Negative values with log_transform=True: The default log_transform=True requires all input values to be non-negative; negative inputs raise a ValueError. Use log_transform=False for data that may contain negatives
Scale interpretation: MAD-based scaling differs from standard deviation scaling, and with the default log transform the output is on a log2 scale rather than the original scale
σ-consistency: Pass scale_to_sigma=True to multiply MAD by 1.4826 so the output is a robust z-score; otherwise the per-row spread is ≈ 1.4826× larger than a true z-score, which matters for cross-tool comparisons (R, statsmodels), regularized regression with a fixed penalty strength, polynomial/interaction features, and any hard “x σ” thresholding downstream
Computational cost: Slightly more expensive than mean/std-based methods due to median calculations
Distribution assumptions: While robust, still assumes some variability within samples
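The effect of the σ-consistency choice on hard thresholds can be seen in a small hand calculation. The sketch below scores one made-up row both ways and counts how many values exceed a fixed cutoff of 3. The data and the cutoff are illustrative assumptions, and the log transform is left out to keep the numbers easy to follow.

```python
import statistics

row = [98, 101, 100, 99, 102, 107, 93, 100, 103, 97]
med = statistics.median(row)
mad = statistics.median([abs(x - med) for x in row])

raw_scores = [(x - med) / mad for x in row]               # k = 1
sigma_scores = [(x - med) / (1.4826 * mad) for x in row]  # k = 1.4826

threshold = 3.0
n_raw = sum(abs(s) > threshold for s in raw_scores)
n_sigma = sum(abs(s) > threshold for s in sigma_scores)
print(n_raw, n_sigma)
```

With the raw MAD divisor two values cross the cutoff, while with the σ-consistent divisor none do, so a downstream "3 σ" rule behaves differently depending on the setting.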
See Also
MedianNormalizer: For median-based scaling without robust standardization
QuantileNormalizer: For making distributions identical across samples
RankNormalizer: For rank-based transformation that handles outliers differently
VSNNormalizer: For variance-stabilizing normalization