RankNormalizer ============== The ``RankNormalizer`` transforms each sample's values to their ranks, where the smallest value receives rank 1 and the largest receives rank N (number of features). This transformation is useful for making data distributions more uniform and reducing the impact of outliers. Overview -------- Rank normalization replaces each value in a sample with its rank position when the values are sorted from smallest to largest. This creates a uniform distribution of ranks from 1 to N for each sample, making it particularly useful for: - Reducing the impact of outliers - Creating comparable scales across different measurement ranges - Preprocessing for non-parametric statistical methods - Making data distributions more uniform Key Features ------------ - **Tied Value Handling**: When multiple values are identical, they receive the median rank of their group - **Optional Normalization**: Ranks can be divided by N to create values between 1/N and 1 for comparability across datasets - **Robust to Outliers**: Extreme values only affect the highest/lowest ranks, not the entire distribution Algorithm Details ----------------- For each sample (row) in the data matrix: 1. Sort the values from smallest to largest 2. Assign ranks starting from 1 3. For tied values, assign the median rank of the group 4. Optionally divide all ranks by N (number of features) **Example with ties**: If values [1, 2, 2, 3] are encountered: - Value 1 gets rank 1 - Both values of 2 get rank 2.5 (median of ranks 2 and 3) - Value 3 gets rank 4 Parameters ---------- .. autoclass:: pronoms.normalizers.RankNormalizer :members: :undoc-members: :show-inheritance: Usage Example ------------- Basic rank normalization: .. code-block:: python import numpy as np from pronoms.normalizers import RankNormalizer # Create sample data data = np.array([ [100, 50, 75, 200], # Sample 1 [10, 10, 30, 20] # Sample 2 (with ties) ]) # Create and apply normalizer normalizer = RankNormalizer() normalized_data = normalizer.normalize(data) print("Original data:") print(data) print("\nRank-transformed data:") print(normalized_data) # Output: # [[4. 1. 2. 3.] # Sample 1: ranks of [100,50,75,200] # [2.5 2.5 4. 1.]] # Sample 2: ranks with ties at 10 Normalized ranks (divide by N): .. code-block:: python # Normalize ranks to [1/N, 1] range normalizer = RankNormalizer(normalize_by_n=True) normalized_data = normalizer.normalize(data) print("Normalized rank data (divided by N):") print(normalized_data) # Output: # [[1. 0.25 0.5 0.75 ] # Sample 1: ranks/4 # [0.625 0.625 1. 0.25 ]] # Sample 2: ranks/4 Visualization: .. code-block:: python # Visualize the transformation effect # By default, x-axis shows raw values (log_axes=False) fig = normalizer.plot_comparison(data, normalized_data) fig.show() # For data with wide dynamic ranges, use log-transformed x-axis fig = normalizer.plot_comparison(data, normalized_data, log_axes=True) fig.show() # The y-axis always shows the actual rank values from normalization # log_axes only affects the x-axis (original values) transformation When to Use ----------- RankNormalizer is particularly useful when: - **Outliers are present**: Rank transformation limits the influence of extreme values - **Different measurement scales**: When features have vastly different ranges - **Non-parametric analysis**: As preprocessing for rank-based statistical tests - **Distribution uniformity**: When you need uniform distributions across samples - **Comparative studies**: When comparing datasets with different numbers of features (use ``normalize_by_n=True``) Considerations -------------- - **Information loss**: Rank transformation loses information about the magnitude of differences between values - **Tied values**: The method for handling ties (median rank) may not be suitable for all applications - **Discrete output**: Results are discrete ranks rather than continuous values - **Sample independence**: Each sample is ranked independently, so cross-sample relationships may be altered See Also -------- - :doc:`quantile_normalizer`: For making distributions identical rather than just ranked - :doc:`median_normalizer`: For scaling-based normalization that preserves relative differences - :doc:`mad_normalizer`: For robust normalization that handles outliers differently