MADNormalizer

The MADNormalizer is a robust scaling method that centers each sample by subtracting its median and scales by the Median Absolute Deviation (MAD). Because the median and MAD are far less sensitive to extreme values than the mean and standard deviation, the result is more stable than standard deviation-based normalization when samples contain outliers or are heavily skewed.

Overview

MAD normalization is a robust alternative to z-score normalization that uses median-based statistics instead of mean-based ones. After the transformation each sample has a median of 0 and a MAD of 1 (or about 0.6745 when scale_to_sigma=True), making it particularly suitable for data with:

Outliers or extreme values
Non-normal distributions
Skewed data where mean and standard deviation are not representative
Need for robust statistical preprocessing

The approach works by:

(Optional) Log transform: By default the data is transformed to log2(X + 1) before computing statistics. Set log_transform=False to operate on the raw values.
Centering: Subtracting the (log) median from each value
Scaling: Dividing by k * MAD where k is the consistency constant chosen via scale_to_sigma (1.4826 for σ-equivalent output, 1 for raw MAD)

This creates standardized samples that are less sensitive to outliers than those produced by traditional z-score normalization. Working in log-space is the default because it stabilizes variance and matches the typical multiplicative noise structure of mass-spectrometry intensity data.
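The point about multiplicative structure can be illustrated with plain NumPy. The following is a minimal sketch on made-up intensities (the values and the 2x factor are assumptions for illustration, not output of MADNormalizer): a multiplicative factor becomes an additive shift of about 1 in log2 space, which median centering then removes.

```python
import numpy as np

# Hypothetical intensities for one sample, and the same sample measured
# with a 2x multiplicative factor (for example twice the loading amount).
x = np.array([1_000.0, 5_000.0, 20_000.0, 80_000.0, 300_000.0])
x_scaled = 2.0 * x

# log2(X + 1) turns the multiplicative factor into an additive shift of ~1.
y = np.log2(x + 1)
y_scaled = np.log2(x_scaled + 1)
shift = y_scaled - y

# Subtracting each row's median removes that shift almost entirely.
centered = y - np.median(y)
centered_scaled = y_scaled - np.median(y_scaled)
max_residual = np.max(np.abs(centered - centered_scaled))

print(shift)         # every entry is close to 1
print(max_residual)  # small compared with the shift
```

The residual is not exactly zero because of the +1 offset inside the logarithm, which matters only for very small intensities.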

Note

Pass scale_to_sigma=True to multiply MAD by the standard 1.4826 consistency constant. The output is then a robust z-score (per-row spread ≈ 1 σ for normal data) and matches the default behavior of R’s mad() and statsmodels.robust.scale.mad. The current implicit default (raw MAD divisor) is preserved for backward compatibility but emits a DeprecationWarning and will flip to scale_to_sigma=True in a future major release. Pass the argument explicitly to lock in the behavior you want.
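The effect of the consistency constant can be checked on simulated normal data. This is a NumPy-only sketch that mimics the two divisors on raw values (the analogue of log_transform=False); it does not call MADNormalizer, and the seed, mean, and σ are arbitrary assumptions.

```python
import numpy as np

MAD_K = 1.4826  # consistency constant quoted in the text

rng = np.random.default_rng(0)
sigma = 3.0
row = rng.normal(loc=10.0, scale=sigma, size=200_000)

median = np.median(row)
mad = np.median(np.abs(row - median))

raw_scaled = (row - median) / mad              # analogue of scale_to_sigma=False
sigma_scaled = (row - median) / (MAD_K * mad)  # analogue of scale_to_sigma=True

print(np.std(raw_scaled))    # close to 1.4826
print(np.std(sigma_scaled))  # close to 1
```

With the constant applied, the spread of the output is on the same scale as an ordinary z-score for normal data, which is what makes thresholds such as "3 σ" comparable across tools.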

Key Features

Robust to outliers: Uses median instead of mean, reducing outlier influence
Distribution-free: Works well with non-normal and skewed distributions
Standardized output: Centers data around 0 with MAD-based scaling
Preserves relationships: Maintains relative ordering within samples

Algorithm Details

For a data matrix X with shape (n_samples, n_features), let Y denote the data on which the statistics are computed. With the default log_transform=True the algorithm uses Y = log2(X + 1); with log_transform=False it uses Y = X.

Calculate median: For each sample i, compute median_i = median(Y[i, :])
Calculate MAD: MAD_i = median(|Y[i, :] - median_i|)
Apply transformation: X_normalized[i, j] = (Y[i, j] - median_i) / (k * MAD_i)

The constant k is 1.4826 when scale_to_sigma=True (the σ-consistency constant under normality, 1 / Φ⁻¹(0.75)) and 1 when scale_to_sigma=False.
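The value of k follows from the 0.75 quantile of the standard normal distribution. A short check using only the Python standard library:

```python
from statistics import NormalDist

# 0.75 quantile of the standard normal distribution, Phi^-1(0.75).
q75 = NormalDist().inv_cdf(0.75)

# For normal data MAD = sigma * Phi^-1(0.75), so sigma = MAD / Phi^-1(0.75).
k = 1.0 / q75

print(round(q75, 4))  # 0.6745
print(round(k, 4))    # 1.4826
```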

Mathematical representation (with log_transform=True, the default):

\[X_{normalized}[i,j] = \frac{\log_2(X[i,j] + 1) - \text{median}(\log_2(X[i,:] + 1))}{k \cdot \text{MAD}(\log_2(X[i,:] + 1))}\]

Mathematical representation (with log_transform=False):

\[X_{normalized}[i,j] = \frac{X[i,j] - \text{median}(X[i,:])}{k \cdot \text{MAD}(X[i,:])}\]

where in either case:

\[\text{MAD}(Y[i,:]) = \text{median}(|Y[i,:] - \text{median}(Y[i,:])|)\]
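The formulas above translate almost directly into NumPy. The function below is an illustrative sketch written from these formulas, not the pronoms implementation; the name mad_normalize_sketch is hypothetical, and its scale_to_sigma default of True is a choice made for this example only.

```python
import numpy as np

MAD_SIGMA_CONSTANT = 1.4826


def mad_normalize_sketch(X, log_transform=True, scale_to_sigma=True):
    """Row-wise MAD normalization following the formulas above (illustrative only)."""
    Y = np.asarray(X, dtype=float)
    if log_transform:
        Y = np.log2(Y + 1)
    # Per-row median and raw MAD, kept as columns so they broadcast over features.
    row_medians = np.median(Y, axis=1, keepdims=True)
    row_mads = np.median(np.abs(Y - row_medians), axis=1, keepdims=True)
    if np.any(row_mads == 0):
        raise ValueError("MAD is zero for at least one sample")
    k = MAD_SIGMA_CONSTANT if scale_to_sigma else 1.0
    return (Y - row_medians) / (k * row_mads)


X = np.array([
    [10, 20, 15, 25, 1000],
    [100, 120, 110, 130, 105],
    [5, 8, 6, 9, 7],
])
Z = mad_normalize_sketch(X)
print(np.median(Z, axis=1))  # each row median is 0 after centering
```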

Example (log_transform=False, scale_to_sigma=False): For sample [1, 5, 10, 100]:

Median = 7.5
MAD = median([6.5, 2.5, 2.5, 92.5]) = 4.5
Normalized ≈ [-1.44, -0.56, 0.56, 20.56]

With scale_to_sigma=True every unrounded value above is divided by 1.4826, giving roughly [-0.97, -0.37, 0.37, 13.86], which can be read directly as a robust z-score.
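The arithmetic of this worked example can be reproduced with a few lines of NumPy (a plain check of the numbers, independent of the MADNormalizer class):

```python
import numpy as np

sample = np.array([1.0, 5.0, 10.0, 100.0])

median = np.median(sample)                # 7.5
mad = np.median(np.abs(sample - median))  # 4.5

raw = (sample - median) / mad  # raw MAD divisor (k = 1)
robust_z = raw / 1.4826        # sigma-equivalent divisor (k = 1.4826)

print(np.round(raw, 2))
print(np.round(robust_z, 2))
```

Dividing the unrounded values, rather than the two-decimal ones, avoids compounding rounding error in the second result.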

Parameters

class pronoms.normalizers.MADNormalizer(log_transform: bool = True, scale_to_sigma: bool = <object object>)

Bases: object

Median Absolute Deviation (MAD) Normalizer.

Centers each sample (row) by subtracting its median and scales it by its Median Absolute Deviation (MAD).

Optionally performs calculations on log2-transformed data (default) to stabilize variance and handle typical intensity distributions.

If log_transform=True (default): Calculations (median, MAD) are performed on log2(X + 1). Normalization: (log2(X + 1) - median_log) / (k * MAD_log)
If log_transform=False: Calculations are performed directly on X. Normalization: (X - median) / (k * MAD)

Where k is the consistency constant set by scale_to_sigma:

scale_to_sigma=True: k = 1.4826 (MAD_SIGMA_CONSTANT). The output is a robust z-score: per-row spread ≈ 1 σ for normal data. Matches R’s mad() default and statsmodels.robust.scale.mad.
scale_to_sigma=False: k = 1 (raw MAD divisor). Per-row spread is ≈ 1.4826 × what a true robust z-score would give. Use this only if you explicitly want output in raw-MAD units rather than σ-equivalent units.

Deprecated: Calling MADNormalizer without scale_to_sigma emits a DeprecationWarning; the implicit default (raw MAD) will be replaced by scale_to_sigma=True in a future major release. Pass the argument explicitly to lock in your intended behavior across versions.

log_transform

Whether log2 transformation was applied before normalization.

Type: bool

scale_to_sigma

Whether the divisor is MAD_SIGMA_CONSTANT * MAD (σ-equivalent) rather than raw MAD.

Type: bool

row_medians

Median of the (potentially log2-transformed) data for each sample.

Type: np.ndarray

row_mads

Raw Median Absolute Deviation (MAD) of the (potentially log2-transformed) data for each sample. Always the unscaled MAD, regardless of scale_to_sigma.

Type: np.ndarray

normalize(X: ndarray) → ndarray

Apply MAD normalization to the input data matrix X.

Parameters:

X (np.ndarray) – Input data matrix (n_samples, n_features). Must contain non-negative values if log_transform=True.

Returns:

Normalized data matrix.

Return type:

np.ndarray

Raises:

ValueError –

If input is not a 2D array with at least one feature.
If input data contains NaN or Inf values.
If log_transform=True and input data contains negative values.
If MAD is zero for any sample (which prevents normalization).
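When it is useful to screen data before calling normalize, the documented error conditions can be mirrored in a small pre-check. The helper below is hypothetical (check_mad_input is not part of pronoms) and only approximates the conditions listed above; the library's own checks and messages may differ.

```python
import numpy as np


def check_mad_input(X, log_transform=True):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] < 1:
        raise ValueError("expected a 2D array with at least one feature")
    if not np.all(np.isfinite(X)):
        raise ValueError("input contains NaN or Inf values")
    if log_transform and np.any(X < 0):
        raise ValueError("negative values are not allowed with log_transform=True")
    # Zero MAD is evaluated in the same space the statistics would be computed in.
    Y = np.log2(X + 1) if log_transform else X
    medians = np.median(Y, axis=1, keepdims=True)
    mads = np.median(np.abs(Y - medians), axis=1)
    if np.any(mads == 0):
        raise ValueError("MAD is zero for at least one sample")
    return X


# One input per documented condition: 1D array, NaN, negative value, constant row.
bad_inputs = ([1.0, 2.0, 3.0], [[1.0, np.nan, 3.0]], [[-1.0, 2.0, 3.0]], [[4.0, 4.0, 4.0]])
for bad in bad_inputs:
    try:
        check_mad_input(bad)
    except ValueError as exc:
        print(f"rejected: {exc}")
```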

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'MAD Normalization Comparison') → Figure

Plot data before vs after normalization using a 2D hexbin density plot.

Parameters:

before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “MAD Normalization Comparison”.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Usage Example

Basic MAD normalization:

import numpy as np
from pronoms.normalizers import MADNormalizer

# Create sample data with outliers
data = np.array([
    [10, 20, 15, 25, 1000],    # Sample 1: with outlier
    [100, 120, 110, 130, 105], # Sample 2: normal range
    [5, 8, 6, 9, 7]            # Sample 3: low values
])

# Create and apply normalizer.
# By default, log_transform=True, so statistics are computed on log2(X + 1).
# Pass log_transform=False to operate on the raw values instead.
# scale_to_sigma=True multiplies MAD by 1.4826 so the output is a
# robust z-score (matches R's mad()).
normalizer = MADNormalizer(scale_to_sigma=True)
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nMAD normalized data (computed in log2 space):")
print(normalized_data)

# Check centering (medians should be ~0 in either mode)
print("\nSample medians after normalization:")
for i, sample in enumerate(normalized_data):
    print(f"Sample {i+1}: {np.median(sample):.6f}")

# To reproduce the worked example above (raw-scale MAD, no σ-consistency):
raw_normalizer = MADNormalizer(log_transform=False, scale_to_sigma=False)
raw_normalized = raw_normalizer.normalize(data)

Visualization:

# Visualize the normalization effect
fig = normalizer.plot_comparison(data, normalized_data)
fig.show()

When to Use

MADNormalizer is particularly useful when:

Outliers present: Data contains extreme values that would skew mean/std-based methods
Non-normal distributions: Data is skewed or has heavy tails
Robust preprocessing needed: When stability against outliers is important
Proteomics data: Common in mass spectrometry data with occasional extreme measurements
Quality control: When some samples may have measurement artifacts

Considerations

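Zero MAD handling: Samples with zero MAD cannot be scaled and will raise a ValueError. This happens when all values are identical, and also whenever more than half of the values in a sample equal its median (for example, a sample dominated by zeros)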
Negative values with log_transform=True: The default log_transform=True requires all input values to be non-negative; negative inputs raise a ValueError. Use log_transform=False for data that may contain negatives
Scale interpretation: MAD-based scaling differs from standard deviation scaling, and with the default log transform the output is on a log2 scale rather than the original scale
σ-consistency: Pass scale_to_sigma=True to multiply MAD by 1.4826 so the output is a robust z-score; otherwise the per-row spread is ≈ 1.4826× larger than a true z-score, which matters for cross-tool comparisons (R, statsmodels), regularized regression with a fixed penalty strength, polynomial/interaction features, and any hard “x σ” thresholding downstream
Computational cost: Slightly more expensive than mean/std-based methods due to median calculations
Distribution assumptions: Although robust, the method still requires variability within each sample, since the MAD must be nonzero for scaling to be possible
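A zero MAD does not require a constant sample. The sketch below uses a made-up row in which more than half of the values are zero; the remaining values vary widely, yet the MAD is still 0, so such a row would hit the zero-MAD error described above.

```python
import numpy as np

# Hypothetical row where more than half of the values are identical
# (for instance many zero intensities), plus a few real measurements.
row = np.array([0.0, 0.0, 0.0, 0.0, 12.0, 850.0, 40.0])

y = np.log2(row + 1)
median = np.median(y)
mad = np.median(np.abs(y - median))

print(median, mad)    # 0.0 0.0
print(np.std(y) > 0)  # True: the row varies, but its MAD is still zero
```

Filtering or imputing such rows before normalization is one way to handle them; which option is appropriate depends on the dataset.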