SPLMNormalizer ============== The ``SPLMNormalizer`` implements Stable Protein Log-Mean Normalization (SPLM), which identifies a subset of stably expressed proteins based on their low coefficient of variation and uses them as internal standards for normalization. This method is particularly effective when a subset of proteins can be assumed to remain constant across experimental conditions. Overview -------- SPLM normalization addresses the challenge of selecting appropriate reference features for normalization in proteomics data. Rather than assuming all proteins are equally suitable as references, SPLM: 1. **Identifies stable proteins**: Selects features with the lowest coefficient of variation (``std/mean``) computed in linear space 2. **Uses stable proteins as references**: Calculates scaling factors based only on these stable features 3. **Normalizes all features**: Applies the scaling factors derived from stable proteins to the entire dataset This approach is particularly powerful when: - A subset of proteins are expected to be housekeeping or constitutively expressed - Technical variation affects all proteins proportionally - You want to avoid bias from highly variable proteins in normalization - Working with targeted proteomics where reference proteins can be identified Key Features ------------ - **Automatic stable protein selection**: Identifies the most stable features by linear-space coefficient of variation - **Reference-based normalization**: Uses only stable proteins for scaling factor calculation - **Log-space centering**: Removes multiplicative effects through log-transformed centering on stable proteins - **Robust to variable proteins**: Normalization is not affected by highly variable features - **Preserves biological variation**: Maintains true biological differences while removing technical bias Algorithm Details ----------------- The SPLM algorithm works through the following steps: 1. **Calculate per-protein CV in linear space**: For each protein j, CV_j = std(X[:, j]) / mean(X[:, j]). Constant proteins (std=0) get CV=0; proteins with mean=0 are deprioritized as +inf. 2. **Select stable proteins**: Choose the `num_stable_proteins` with lowest CV 3. **Log transformation**: X_log = log(X + ε) where ε prevents log(0) 4. **Calculate scaling factors**: For each sample i, factor_i = mean(X_log[i, stable_proteins]) 5. **Calculate grand mean**: grand_mean = mean(all scaling factors) 6. **Normalize in log-space**: X_norm_log[i, j] = X_log[i, j] - factor_i + grand_mean 7. **Back-transform**: X_normalized = exp(X_norm_log) - ε **Mathematical representation**: .. math:: \text{CV}_j = \frac{\sigma(X_{:,j})}{\mu(X_{:,j})} .. math:: \text{factor}_i = \frac{1}{k} \sum_{j \in \text{stable}} \log(X_{i,j} + \epsilon) where k is the number of stable proteins. Parameters ---------- .. autoclass:: pronoms.normalizers.SPLMNormalizer :members: :undoc-members: :show-inheritance: Usage Example ------------- Basic SPLM normalization: .. code-block:: python import numpy as np from pronoms.normalizers import SPLMNormalizer # Create sample data with stable and variable proteins np.random.seed(42) # Stable proteins (low variability) stable_proteins = np.array([ [100, 200, 150], # Sample 1 [105, 210, 155], # Sample 2 [95, 190, 145] # Sample 3 ]) # Variable proteins (high variability) variable_proteins = np.array([ [50, 1000], # Sample 1 [150, 500], # Sample 2 [25, 2000] # Sample 3 ]) # Combine stable and variable proteins data = np.hstack([stable_proteins, variable_proteins]) # Create and apply normalizer # Use 3 stable proteins (should select the first 3 columns) normalizer = SPLMNormalizer(num_stable_proteins=3, epsilon=1.0) normalized_data = normalizer.normalize(data) print("Original data:") print(data) print("\nNormalized data:") print(normalized_data) # Examine which proteins were selected as stable print(f"\nStable protein indices: {normalizer.stable_feature_indices_}") print(f"Log-CVs of all proteins: {normalizer.log_cvs_}") print(f"Scaling factors: {normalizer.log_scaling_factors_}") Visualization: .. code-block:: python # Visualize the normalization effect fig = normalizer.plot_comparison(data, normalized_data) fig.show() When to Use ----------- SPLMNormalizer is particularly useful when: - **Housekeeping proteins present**: Dataset contains proteins expected to be stably expressed - **Targeted proteomics**: Working with a curated set of proteins where some serve as references - **Technical variation dominant**: When most variation is technical rather than biological - **Reference protein selection**: When you want data-driven selection of reference features - **Proportional scaling needed**: When technical effects scale all proteins proportionally Considerations -------------- - **Stable protein assumption**: Requires that some proteins are truly stable across conditions - **Number of stable proteins**: Choice of `num_stable_proteins` can significantly affect results - **Log-space processing**: Assumes multiplicative rather than additive effects - **Minimum protein requirement**: Needs sufficient proteins to reliably identify stable ones - **Biological interpretation**: May remove true biological signal if stable proteins are misidentified See Also -------- - :doc:`median_normalizer`: For simple scaling-based normalization - :doc:`quantile_normalizer`: For making distributions identical across samples - :doc:`mad_normalizer`: For robust normalization using median absolute deviation - :doc:`vsn_normalizer`: For variance-stabilizing normalization