RankNormalizer

The RankNormalizer transforms each sample’s values to their ranks, where the smallest value receives rank 1 and the largest receives rank N (number of features). This transformation is useful for making data distributions more uniform and reducing the impact of outliers.

Overview

Rank normalization replaces each value in a sample with its rank position when the values are sorted from smallest to largest. In the absence of ties, every sample ends up with the same set of ranks from 1 to N, which makes the transformation particularly useful for:

Reducing the impact of outliers
Creating comparable scales across different measurement ranges
Preprocessing for non-parametric statistical methods
Making data distributions more uniform

Key Features

Tied Value Handling: When multiple values are identical, they receive the median rank of their group (for consecutive ranks this equals the average rank)
Optional Normalization: Ranks can be divided by N to create values between 1/N and 1 for comparability across datasets
Robust to Outliers: An extreme value can only take the lowest or highest rank, so its magnitude does not shift the ranks of the other values

Algorithm Details

For each sample (row) in the data matrix:

Sort the values from smallest to largest
Assign ranks starting from 1
For tied values, assign the median rank of the group
Optionally divide all ranks by N (number of features)

Example with ties: If values [1, 2, 2, 3] are encountered:

Value 1 gets rank 1
Both values of 2 get rank 2.5 (median of ranks 2 and 3)
Value 3 gets rank 4
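The steps above can be sketched for a single sample in plain Python. This is a minimal, dependency-free illustration of the described procedure, not the library's implementation; the function name rank_row is hypothetical.

```python
from statistics import median


def rank_row(values, normalize_by_n=False):
    """Rank one sample; tied values share the median of their rank positions."""
    n = len(values)
    # Indices of the values in ascending order
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the group while the next sorted value equals the current one
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) hold ranks start+1..end+1
        tied_rank = float(median(range(start + 1, end + 2)))
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    if normalize_by_n:
        # Optional step: scale ranks into [1/N, 1]
        ranks = [r / n for r in ranks]
    return ranks


print(rank_row([1, 2, 2, 3]))                       # [1.0, 2.5, 2.5, 4.0]
print(rank_row([1, 2, 2, 3], normalize_by_n=True))  # [0.25, 0.625, 0.625, 1.0]
```

Applying such a function to every row of a matrix reproduces the per-sample behavior described in this section.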

Parameters

class pronoms.normalizers.RankNormalizer(normalize_by_n: bool = False)

Bases: object

Normalizer that transforms each sample’s values to their ranks.

This normalizer replaces each value in a sample with its rank, where the smallest value gets rank 1 and the largest gets rank N (number of features). Tied values are assigned the median rank of their group.

normalize_by_n

Whether to divide ranks by N (number of features) for comparability.

Type: bool

ranks

The rank-transformed data. Only available after calling normalize().

Type: np.ndarray | None

normalize(X: ndarray) → ndarray

Perform rank transformation on input data X.

Parameters:

X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.

Returns:

Rank-transformed data matrix with the same shape as X. Values range from 1 to N (or 1/N to 1 if normalize_by_n=True).

Return type:

np.ndarray

Raises:

ValueError –

If input is not a 2D array with at least one feature.
If input data contains NaN or Inf values.
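To avoid these errors, inputs can be checked before calling normalize(). The sketch below mirrors the two documented conditions with NumPy; check_input is a hypothetical helper and its error messages are assumptions, not the library's own wording.

```python
import numpy as np


def check_input(X):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    X = np.asarray(X, dtype=float)
    # Condition 1: must be 2D with at least one feature (column)
    if X.ndim != 2 or X.shape[1] < 1:
        raise ValueError("X must be a 2D array with at least one feature.")
    # Condition 2: no NaN or Inf values anywhere in the matrix
    if not np.isfinite(X).all():
        raise ValueError("X must not contain NaN or Inf values.")
    return X


try:
    check_input([[1.0, np.nan, 3.0]])
except ValueError as err:
    print(f"Rejected: {err}")
```

Missing values therefore need to be imputed or filtered out before rank normalization.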

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'Rank Normalization Comparison', log_axes: bool = False) → Figure

Plot data before vs after normalization using a 2D hexbin density plot.

Parameters:

before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “Rank Normalization Comparison”.
log_axes (bool, optional) – If True, plot log10 of the original values on the x-axis. If False (default), plot raw original values on the x-axis. The y-axis always shows the actual rank values from the normalization. Log scaling of x-axis can help visualize data with wide dynamic ranges.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Notes

The y-axis limits and label are set assuming integer ranks in [1, n_features] (the default normalize_by_n=False case). When the normalizer was constructed with normalize_by_n=True, the plotted y-values lie in [1/n_features, 1], so the y-axis label (“Assigned Rank (1 to N)”) and ylim (0, n_features+1) will not match the data. In that case, read the y-tick values rather than the label, or pass the raw integer-rank output through the helper directly.
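If only the normalized output is at hand, one possible workaround is to rescale it to the rank scale before plotting. The following is a minimal sketch under that assumption; the array values are illustrative and the variable names are not part of the library.

```python
import numpy as np

# Illustrative ranks already divided by N (values in [1/N, 1])
normalized = np.array([
    [0.75, 0.25, 0.5, 1.0],
    [0.375, 0.375, 1.0, 0.75],
])
n_features = normalized.shape[1]

# Multiply by N to return to the rank scale. Tied ranks are whole or half
# numbers, so rounding to the nearest 0.5 removes floating-point noise.
raw_ranks = np.round(normalized * n_features * 2) / 2
print(raw_ranks)
```

The rescaled array lies in [1, n_features] and so matches the axis label and limits that plot_comparison assumes.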

Usage Example

Basic rank normalization:

import numpy as np
from pronoms.normalizers import RankNormalizer

# Create sample data
data = np.array([
    [100, 50, 75, 200],  # Sample 1
    [10, 10, 30, 20]     # Sample 2 (with ties)
])

# Create and apply normalizer
normalizer = RankNormalizer()
normalized_data = normalizer.normalize(data)

print("Original data:")
print(data)
print("\nRank-transformed data:")
print(normalized_data)
# Output:
# [[3.  1.  2.  4. ]     # Sample 1: ranks of [100,50,75,200]
#  [1.5 1.5 4.  3. ]]    # Sample 2: the tied 10s share rank 1.5

Normalized ranks (divide by N):

# Normalize ranks to [1/N, 1] range
normalizer = RankNormalizer(normalize_by_n=True)
normalized_data = normalizer.normalize(data)

print("Normalized rank data (divided by N):")
print(normalized_data)
# Output:
# [[0.75  0.25  0.5   1.   ]     # Sample 1: ranks/4
#  [0.375 0.375 1.    0.75 ]]    # Sample 2: ranks/4

Visualization:

# Visualize the transformation effect using integer ranks
# (the y-axis label and limits assume normalize_by_n=False)
rank_normalizer = RankNormalizer()
ranked_data = rank_normalizer.normalize(data)

# By default, x-axis shows raw values (log_axes=False)
fig = rank_normalizer.plot_comparison(data, ranked_data)
fig.show()

# For data with wide dynamic ranges, use log-transformed x-axis
fig = rank_normalizer.plot_comparison(data, ranked_data, log_axes=True)
fig.show()

# The y-axis always shows the actual rank values from normalization
# log_axes only affects the x-axis (original values) transformation

When to Use

RankNormalizer is particularly useful when:

Outliers are present: Rank transformation limits the influence of extreme values
Different measurement scales: When samples have vastly different overall value ranges, since each sample is mapped onto the same rank scale
Non-parametric analysis: As preprocessing for rank-based statistical tests
Distribution uniformity: When you need uniform distributions across samples
Comparative studies: When comparing datasets with different numbers of features (use normalize_by_n=True)

Considerations

Information loss: Rank transformation loses information about the magnitude of differences between values
Tied values: The method for handling ties (median rank) may not be suitable for all applications
Discrete output: Results are discrete ranks rather than continuous values
Sample independence: Each sample is ranked independently, so cross-sample relationships may be altered
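The trade-off between outlier robustness and information loss can be seen in a small example. This is a simplified sketch for samples without ties, written in plain Python; ordinal_ranks is a hypothetical helper and does not use the library.

```python
def ordinal_ranks(values):
    """Ranks 1..N for a sample without ties (simplified for illustration)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


closely_spaced = [1.0, 1.1, 1.2, 1.3]
with_outlier = [1.0, 1.1, 1.2, 1000.0]

# Both samples map to the same ranks: the outlier no longer dominates,
# but the size of the gap between 1.2 and 1000.0 is also lost
print(ordinal_ranks(closely_spaced))  # [1, 2, 3, 4]
print(ordinal_ranks(with_outlier))    # [1, 2, 3, 4]
```

Whether this is desirable depends on whether downstream analysis needs the magnitudes or only the ordering.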