MADNormalizer
=============

The ``MADNormalizer`` is a robust scaling method that centers each sample by subtracting its median and scales by the Median Absolute Deviation (MAD). This approach reduces the influence of outliers and non-normal distributions, providing a more reliable normalization than standard deviation-based methods.

Overview
--------

MAD normalization is a robust alternative to z-score normalization that uses median-based statistics instead of mean-based ones. The method transforms each sample to have a median of 0 and a MAD-based scale, making it particularly suitable for data with:

- Outliers or extreme values
- Non-normal distributions
- Skewed data where mean and standard deviation are not representative
- Need for robust statistical preprocessing

The approach works by:

1. **(Optional) Log transform**: By default the data is transformed to ``log2(X + 1)`` before computing statistics. Set ``log_transform=False`` to operate on the raw values.
2. **Centering**: Subtracting the (log) median from each value
3. **Scaling**: Dividing by ``k * MAD`` where ``k`` is the consistency constant chosen via ``scale_to_sigma`` (``1.4826`` for σ-equivalent output, ``1`` for raw MAD)

This creates standardized samples that are less sensitive to outliers compared to traditional z-score normalization. Working in log-space is the default because it stabilizes variance and matches the typical multiplicative noise structure of mass-spectrometry intensity data.

.. note::

   Pass ``scale_to_sigma=True`` to multiply MAD by the standard 1.4826
   consistency constant. The output is then a robust z-score (per-row
   spread ≈ 1 σ for normal data) and matches R's ``mad()`` and
   ``statsmodels.robust.scale.mad`` by default. The current implicit
   default (raw MAD divisor) is preserved for backward compatibility but
   emits a ``DeprecationWarning`` and will flip to ``scale_to_sigma=True``
   in a future major release. Pass the argument explicitly to lock in the
   behavior you want.

Key Features
------------

- **Robust to outliers**: Uses median instead of mean, reducing outlier influence
- **Distribution-free**: Works well with non-normal and skewed distributions
- **Standardized output**: Centers data around 0 with MAD-based scaling
- **Preserves relationships**: Maintains relative ordering within samples

Algorithm Details
-----------------

For a data matrix X with shape (n_samples, n_features), let ``Y`` denote the data
on which the statistics are computed. With the default ``log_transform=True`` the
algorithm uses ``Y = log2(X + 1)``; with ``log_transform=False`` it uses ``Y = X``.

1. **Calculate median**: For each sample i, compute median_i = median(Y[i, :])
2. **Calculate MAD**: MAD_i = median(|Y[i, :] - median_i|)
3. **Apply transformation**: X_normalized[i, j] = (Y[i, j] - median_i) / (k * MAD_i)

The constant ``k`` is ``1.4826`` when ``scale_to_sigma=True`` (the
σ-consistency constant under normality, ``1 / Φ⁻¹(0.75)``) and ``1`` when
``scale_to_sigma=False``.

**Mathematical representation** (with ``log_transform=True``, the default):

.. math::

   X_{normalized}[i,j] = \frac{\log_2(X[i,j] + 1) - \text{median}(\log_2(X[i,:] + 1))}{k \cdot \text{MAD}(\log_2(X[i,:] + 1))}

**Mathematical representation** (with ``log_transform=False``):

.. math::

   X_{normalized}[i,j] = \frac{X[i,j] - \text{median}(X[i,:])}{k \cdot \text{MAD}(X[i,:])}

where in either case:

.. math::

   \text{MAD}(Y[i,:]) = \text{median}(|Y[i,:] - \text{median}(Y[i,:])|)

**Example** (``log_transform=False``, ``scale_to_sigma=False``): For sample
[1, 5, 10, 100]:

- Median = 7.5
- MAD = median([6.5, 2.5, 2.5, 92.5]) = 4.5
- Normalized ≈ [-1.44, -0.56, 0.56, 20.56]

With ``scale_to_sigma=True`` every value above is divided by 1.4826,
giving roughly [-0.97, -0.38, 0.38, 13.87] — interpretable directly as a
robust z-score.

Parameters
----------

.. autoclass:: pronoms.normalizers.MADNormalizer
   :members:
   :undoc-members:
   :show-inheritance:

Usage Example
-------------

Basic MAD normalization:

.. code-block:: python

   import numpy as np
   from pronoms.normalizers import MADNormalizer
   
   # Create sample data with outliers
   data = np.array([
       [10, 20, 15, 25, 1000],    # Sample 1: with outlier
       [100, 120, 110, 130, 105], # Sample 2: normal range
       [5, 8, 6, 9, 7]            # Sample 3: low values
   ])
   
   # Create and apply normalizer.
   # By default, log_transform=True, so statistics are computed on log2(X + 1).
   # Pass log_transform=False to operate on the raw values instead.
   # scale_to_sigma=True multiplies MAD by 1.4826 so the output is a
   # robust z-score (matches R's mad()).
   normalizer = MADNormalizer(scale_to_sigma=True)
   normalized_data = normalizer.normalize(data)

   print("Original data:")
   print(data)
   print("\nMAD normalized data (computed in log2 space):")
   print(normalized_data)

   # Check centering (medians should be ~0 in either mode)
   print("\nSample medians after normalization:")
   for i, sample in enumerate(normalized_data):
       print(f"Sample {i+1}: {np.median(sample):.6f}")

   # To reproduce the worked example above (raw-scale MAD, no σ-consistency):
   raw_normalizer = MADNormalizer(log_transform=False, scale_to_sigma=False)
   raw_normalized = raw_normalizer.normalize(data)

Visualization:

.. code-block:: python

   # Visualize the normalization effect
   fig = normalizer.plot_comparison(data, normalized_data)
   fig.show()

When to Use
-----------

MADNormalizer is particularly useful when:

- **Outliers present**: Data contains extreme values that would skew mean/std-based methods
- **Non-normal distributions**: Data is skewed or has heavy tails
- **Robust preprocessing needed**: When stability against outliers is important
- **Proteomics data**: Common in mass spectrometry data with occasional extreme measurements
- **Quality control**: When some samples may have measurement artifacts

Considerations
--------------

- **Zero MAD handling**: Samples with zero MAD (all identical values) cannot be scaled and will raise a ``ValueError``
- **Negative values with log_transform=True**: The default ``log_transform=True`` requires all input values to be non-negative; negative inputs raise a ``ValueError``. Use ``log_transform=False`` for data that may contain negatives
- **Scale interpretation**: MAD-based scaling differs from standard deviation scaling, and with the default log transform the output is on a log2 scale rather than the original scale
- **σ-consistency**: Pass ``scale_to_sigma=True`` to multiply MAD by 1.4826 so the output is a robust z-score; otherwise the per-row spread is ≈ 1.4826× larger than a true z-score, which matters for cross-tool comparisons (R, statsmodels), regularized regression with a fixed penalty strength, polynomial/interaction features, and any hard "x σ" thresholding downstream
- **Computational cost**: Slightly more expensive than mean/std-based methods due to median calculations
- **Distribution assumptions**: While robust, still assumes some variability within samples

See Also
--------

- :doc:`median_normalizer`: For median-based scaling without robust standardization
- :doc:`quantile_normalizer`: For making distributions identical across samples
- :doc:`rank_normalizer`: For rank-based transformation that handles outliers differently
- :doc:`vsn_normalizer`: For variance-stabilizing normalization