RankNormalizer
The RankNormalizer transforms each sample’s values to their ranks, where the smallest value receives rank 1 and the largest receives rank N (number of features). This transformation is useful for making data distributions more uniform and reducing the impact of outliers.
Overview
Rank normalization replaces each value in a sample with its rank position when the values are sorted from smallest to largest. When a sample has no tied values, its ranks are exactly the integers 1 to N, so every sample ends up on the same uniform scale. This makes it particularly useful for:
Reducing the impact of outliers
Creating comparable scales across different measurement ranges
Preprocessing for non-parametric statistical methods
Making data distributions more uniform
Key Features
Tied Value Handling: When multiple values are identical, they receive the median rank of their group
Optional Normalization: Ranks can be divided by N to create values between 1/N and 1 for comparability across datasets
Robust to Outliers: Extreme values only affect the highest/lowest ranks, not the entire distribution
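The outlier claim can be checked with a few lines of NumPy. This is a minimal sketch, not the library's code: `ordinal_ranks` is a hypothetical helper that assumes the sample has no tied values.

```python
import numpy as np

def ordinal_ranks(row):
    """Ranks 1..N for a 1D array (hypothetical helper; assumes no ties)."""
    # argsort of argsort gives each element's 0-based position in sorted order
    return np.argsort(np.argsort(row)) + 1

sample = np.array([5.0, 1.0, 3.0, 9.0])
with_outlier = np.array([5.0, 1.0, 3.0, 9000.0])  # largest value inflated 1000x

ranks_a = ordinal_ranks(sample)
ranks_b = ordinal_ranks(with_outlier)

print(ranks_a)  # [3 1 2 4]
print(ranks_b)  # [3 1 2 4]
```

The inflated value still receives rank 4 and the other ranks do not move, whereas the sample mean would shift from 4.5 to 2252.25.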
Algorithm Details
For each sample (row) in the data matrix:
Sort the values from smallest to largest
Assign ranks starting from 1
For tied values, assign the median rank of the group
Optionally divide all ranks by N (number of features)
Example with ties: for the values [1, 2, 2, 3]:
- Value 1 gets rank 1
- Both values of 2 get rank 2.5 (median of ranks 2 and 3)
- Value 3 gets rank 4
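The steps above can be sketched in plain NumPy. This is an illustrative reimplementation under stated assumptions, not the source of `RankNormalizer`: the function name `rank_rows` is hypothetical, and it relies on the fact that the median of a run of consecutive ranks equals their mean.

```python
import numpy as np

def rank_rows(X, normalize_by_n=False):
    """Rank each row of X; tied values share the mean (= median) of their ranks.

    Hypothetical reference sketch, not the pronoms implementation.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    ranks = np.empty_like(X)
    ordinal = np.arange(1, n_features + 1, dtype=float)  # ranks 1..N in sorted order
    for i, row in enumerate(X):
        order = np.argsort(row, kind="stable")  # indices that sort the row
        # Group identical values in the sorted row
        _, inverse, counts = np.unique(
            row[order], return_inverse=True, return_counts=True
        )
        # Mean ordinal rank per tie group: sum of ranks / group size
        group_rank = np.bincount(inverse, weights=ordinal) / counts
        # Scatter the group ranks back to the original column positions
        ranks[i, order] = group_rank[inverse]
    return ranks / n_features if normalize_by_n else ranks

print(rank_rows([[1, 2, 2, 3]]))
# [[1.  2.5 2.5 4. ]]
```

Passing `normalize_by_n=True` to this sketch divides every rank by N, which gives 0.25, 0.625, 0.625 and 1.0 for the same input.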
Parameters
- class pronoms.normalizers.RankNormalizer(normalize_by_n: bool = False)
Bases: object
Normalizer that transforms each sample’s values to their ranks.
This normalizer replaces each value in a sample with its rank, where the smallest value gets rank 1 and the largest gets rank N (number of features). Tied values are assigned the median rank of their group.
- normalize_by_n
Whether to divide ranks by N (number of features) for comparability.
- Type:
bool
- ranks
The rank-transformed data. Only available after calling normalize().
- Type:
np.ndarray | None
- normalize(X: ndarray) → ndarray
Perform rank transformation on input data X.
- Parameters:
X (np.ndarray) – Input data matrix with shape (n_samples, n_features). Each row represents a sample, each column represents a feature/protein.
- Returns:
Rank-transformed data matrix with the same shape as X. Values range from 1 to N (or 1/N to 1 if normalize_by_n=True).
- Return type:
np.ndarray
- Raises:
ValueError –
- If input is not a 2D array with at least one feature.
- If input data contains NaN or Inf values.
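The two documented error conditions could be checked along the following lines. This is a sketch only: `check_rank_input` and its error messages are hypothetical and are not taken from the library.

```python
import numpy as np

def check_rank_input(X):
    """Hypothetical validator mirroring the documented ValueError conditions."""
    X = np.asarray(X, dtype=float)
    # Condition 1: must be 2D with at least one feature (column)
    if X.ndim != 2 or X.shape[1] < 1:
        raise ValueError("X must be a 2D array with at least one feature")
    # Condition 2: NaN and +/-Inf are rejected because they have no meaningful rank
    if not np.all(np.isfinite(X)):
        raise ValueError("X must not contain NaN or Inf values")
    return X

for bad in (np.array([1.0, 2.0]), np.array([[1.0, np.nan]])):
    try:
        check_rank_input(bad)
    except ValueError as err:
        print(f"rejected: {err}")
```

In practice this means missing values need to be imputed or filtered out before calling normalize().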
- plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'Rank Normalization Comparison', log_axes: bool = False) → Figure
Plot data before vs after normalization using a 2D hexbin density plot.
- Parameters:
before_data (np.ndarray) – Data before normalization, shape (n_samples, n_features).
after_data (np.ndarray) – Data after normalization, shape (n_samples, n_features).
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “Rank Normalization Comparison”.
log_axes (bool, optional) – If True, plot log10 of the original values on the x-axis. If False (default), plot raw original values on the x-axis. The y-axis always shows the actual rank values from the normalization. Log scaling of x-axis can help visualize data with wide dynamic ranges.
- Returns:
Figure object containing the hexbin density plot.
- Return type:
plt.Figure
Notes
The y-axis limits and label are set assuming integer ranks in [1, n_features] (the default normalize_by_n=False case). When the normalizer was constructed with normalize_by_n=True, the plotted y-values lie in [1/n_features, 1], so the y-axis label (“Assigned Rank (1 to N)”) and ylim (0, n_features+1) will not match the data. In that case, read the y-tick values rather than the label, or pass the raw integer-rank output through the helper directly.
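If only the divided-by-N output is at hand, the 1 to N scale can be recovered before plotting by multiplying by the number of features. The helper below is a hypothetical sketch, not part of the library. It rounds to the nearest 0.5 because tied values produce half-integer ranks.

```python
import numpy as np

def to_rank_scale(scaled_ranks):
    """Undo division by N (hypothetical helper, not part of pronoms)."""
    scaled_ranks = np.asarray(scaled_ranks, dtype=float)
    n_features = scaled_ranks.shape[1]
    # Multiply back by N, then snap to the nearest 0.5 to remove float noise
    return np.round(scaled_ranks * n_features * 2) / 2

scaled = np.array([[0.75, 0.25, 0.5, 1.0],
                   [0.375, 0.375, 1.0, 0.75]])
print(to_rank_scale(scaled))
# [[3.  1.  2.  4. ]
#  [1.5 1.5 4.  3. ]]
```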
Usage Example
Basic rank normalization:
import numpy as np
from pronoms.normalizers import RankNormalizer
# Create sample data
data = np.array([
[100, 50, 75, 200], # Sample 1
[10, 10, 30, 20] # Sample 2 (with ties)
])
# Create and apply normalizer
normalizer = RankNormalizer()
normalized_data = normalizer.normalize(data)
print("Original data:")
print(data)
print("\nRank-transformed data:")
print(normalized_data)
# Output:
# [[3.  1.  2.  4. ]   # Sample 1: ranks of [100, 50, 75, 200]
#  [1.5 1.5 4.  3. ]]  # Sample 2: the two tied 10s share rank 1.5
Normalized ranks (divide by N):
# Normalize ranks to [1/N, 1] range
normalizer = RankNormalizer(normalize_by_n=True)
normalized_data = normalizer.normalize(data)
print("Normalized rank data (divided by N):")
print(normalized_data)
# Output:
# [[0.75  0.25  0.5   1.   ]   # Sample 1: ranks/4
#  [0.375 0.375 1.    0.75 ]]  # Sample 2: ranks/4
Visualization:
# Visualize the transformation effect
# The plot's y-axis assumes integer ranks, so use the default normalizer here
normalizer = RankNormalizer()
rank_data = normalizer.normalize(data)
# By default, x-axis shows raw values (log_axes=False)
fig = normalizer.plot_comparison(data, rank_data)
fig.show()
# For data with wide dynamic ranges, use log-transformed x-axis
fig = normalizer.plot_comparison(data, rank_data, log_axes=True)
fig.show()
# The y-axis always shows the actual rank values from normalization
# log_axes only affects the x-axis (original values) transformation
When to Use
RankNormalizer is particularly useful when:
Outliers are present: Rank transformation limits the influence of extreme values
Different measurement scales: When samples differ widely in overall intensity or range, since every sample is mapped onto the same 1 to N scale
Non-parametric analysis: As preprocessing for rank-based statistical tests
Distribution uniformity: When you need uniform distributions across samples
Comparative studies: When comparing datasets with different numbers of features (use normalize_by_n=True)
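The last point can be illustrated with two samples of different width. This is a minimal sketch with made-up values: `scaled_ranks` is a hypothetical helper that assumes no tied values and stands in for the normalize_by_n=True behaviour described above.

```python
import numpy as np

def scaled_ranks(row):
    """Ranks divided by N for a 1D array (hypothetical helper; assumes no ties)."""
    ranks = np.argsort(np.argsort(row)) + 1
    return ranks / len(row)

small_panel = np.array([12.0, 7.0, 30.0, 18.0])                    # N = 4
large_panel = np.array([3.0, 9.0, 1.0, 4.0, 8.0, 2.0, 7.0, 5.0])   # N = 8

small_scaled = scaled_ranks(small_panel)
large_scaled = scaled_ranks(large_panel)

# Raw ranks would top out at 4 and 8; scaled ranks both top out at 1.0
print(small_scaled.max(), large_scaled.max())  # 1.0 1.0
print(small_scaled.min(), large_scaled.min())  # 0.25 0.125
```

Both samples now share the upper bound 1, while the lower bound 1/N still depends on the number of features.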
Considerations
Information loss: Rank transformation loses information about the magnitude of differences between values
Tied values: The method for handling ties (median rank) may not be suitable for all applications
Discrete output: Results are discrete ranks rather than continuous values
Sample independence: Each sample is ranked independently, so cross-sample relationships may be altered
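The sample-independence caveat is easy to see on a small matrix. In this sketch the data are made up and contain no ties, and the double argsort is a stand-in for the rank transform. Feature 0 has the same raw value in both samples but receives different ranks.

```python
import numpy as np

X = np.array([[5.0, 1.0, 9.0],   # Sample 1
              [5.0, 7.0, 6.0]])  # Sample 2

# Row-wise ordinal ranks (valid here because no row contains ties)
ranks = np.argsort(np.argsort(X, axis=1), axis=1) + 1

print(ranks)
# [[2 1 3]
#  [1 3 2]]
print(X[:, 0], ranks[:, 0])  # [5. 5.] [2 1]
```

The value 5.0 is the middle value in sample 1 and the smallest in sample 2, so comparing ranks of one feature across samples reflects its position within each sample, not its absolute level.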
See Also
QuantileNormalizer: For making distributions identical rather than just ranked
MedianNormalizer: For scaling-based normalization that preserves relative differences
MADNormalizer: For robust normalization that handles outliers differently