DirectLFQNormalizer
The DirectLFQNormalizer applies the DirectLFQ algorithm to quantify proteins from peptide- or ion-level intensity data. Rather than summarizing peptides first, it infers protein abundances from the relationship between peptides and their parent proteins, which supports label-free quantification across many samples and avoids the biases of summary-based approaches.
Overview
DirectLFQ addresses fundamental limitations in traditional label-free quantification approaches that typically summarize peptide intensities (e.g., by taking the top 3 peptides or using all peptides). Instead, DirectLFQ:
Models peptide-protein relationships: Directly accounts for the contribution of each peptide to its parent protein
Handles missing values: Uses all available peptide information without requiring complete data
Scales to large datasets: Efficiently processes hundreds or thousands of samples
Provides dual output: Returns both protein-level and peptide-level quantification
This approach is particularly powerful for:
Large-scale proteomics studies with many samples
Datasets with significant missing values
Comparative studies requiring accurate protein quantification
Clinical proteomics where precision is critical
Key Features
Direct quantification: Bypasses traditional peptide summarization steps
Missing value robust: Utilizes all available peptide evidence
Dual-level output: Provides both protein and peptide quantification
Scalable: Handles large sample numbers efficiently
Normalization integrated: Combines quantification with normalization in one step
Algorithm Details
DirectLFQ infers protein abundances from peptide intensities with a statistical model rather than a per-sample summary. The algorithm:
Constructs design matrix: Maps peptides to their parent proteins
Applies statistical model: Uses robust regression to estimate protein abundances
Handles missing values: Incorporates all available evidence without imputation
Normalizes across samples: Ensures comparable scales between samples
Returns dual quantification: Provides both protein and peptide-level results
The method avoids common pitfalls of peptide summarization approaches by modeling the peptide-to-protein relationship directly.
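The idea of combining peptide profiles without imputation can be illustrated with a toy sketch. This is not the directlfq implementation and will not reproduce its results: it only shows, for a single protein, how log-transformed peptide profiles might be shifted onto a shared level using the samples they have in common, then combined per sample. The function name `toy_protein_profile` and the convention that 0 marks a missing value are assumptions made for this illustration.

```python
import math
from statistics import median


def toy_protein_profile(peptide_rows):
    """Combine the peptide profiles of one protein into one log2 profile.

    peptide_rows: one list per peptide, holding one raw intensity per sample.
    A value of 0 is treated as missing (an assumption of this sketch).
    """
    # Log-transform; missing values stay None instead of being imputed.
    logged = [
        [math.log2(v) if v > 0 else None for v in row]
        for row in peptide_rows
    ]

    # Use the peptide with the most observed values as the reference profile.
    reference = max(logged, key=lambda row: sum(v is not None for v in row))

    aligned = []
    for row in logged:
        # Differences to the reference, only over samples where both exist.
        diffs = [
            r - v for r, v in zip(reference, row)
            if r is not None and v is not None
        ]
        if not diffs:
            continue  # no overlap with the reference, cannot be aligned
        shift = median(diffs)
        aligned.append([None if v is None else v + shift for v in row])

    # Per sample, take the median over the peptides observed in that sample.
    profile = []
    for column in zip(*aligned):
        observed = [v for v in column if v is not None]
        profile.append(median(observed) if observed else None)
    return profile


# Two peptides of one protein across three samples. The second peptide is
# consistently half as intense and is missing in the last sample.
rows = [
    [1000.0, 2000.0, 4000.0],
    [500.0, 1000.0, 0.0],
]
profile = toy_protein_profile(rows)
print(profile)
```

In this toy input the protein doubles from sample to sample, so consecutive log2 values in the resulting profile differ by about 1 even though one peptide is missing in the last sample.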
Parameters
- class pronoms.normalizers.DirectLFQNormalizer(do_between_sample_norm: bool = True, n_quad_samples: int = 50, n_quad_ions: int = 10, min_nonan: int = 1, num_cores: int | None = None)
Bases: object
Normalizer using the DirectLFQ algorithm for in-memory processing.
This normalizer wraps the external directlfq library to perform intensity normalization directly on NumPy arrays without intermediate file I/O. It processes peptide-level data to produce normalized protein-level and peptide-level intensities.
- Parameters:
do_between_sample_norm (bool, optional) – Whether to perform between-sample normalization (median centering based on selected stable proteins), by default True.
n_quad_samples (int, optional) – Number of samples used for quadratic stabilization during between-sample normalization, by default 50.
n_quad_ions (int, optional) – Number of ions used for quadratic stabilization during protein intensity estimation, by default 10.
min_nonan (int, optional) – Minimum number of non-NaN values required per protein for its intensity to be estimated, by default 1.
num_cores (int | None, optional) – Number of CPU cores to use for parallel processing in directlfq. If None, directlfq attempts to use all available cores, by default None.
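A small sketch of how the constructor arguments might be assembled in one place. The keyword names and default values are the ones documented above; the helper name `directlfq_kwargs` and the choice to leave one CPU core free are assumptions of this example, not library behavior.

```python
import os


def directlfq_kwargs(reserve_cores=1):
    """Build keyword arguments for DirectLFQNormalizer (hypothetical helper)."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return {
        "do_between_sample_norm": True,
        "n_quad_samples": 50,
        "n_quad_ions": 10,
        "min_nonan": 1,
        # Keep at least one core, and leave `reserve_cores` for other work.
        "num_cores": max(1, available - reserve_cores),
    }


kwargs = directlfq_kwargs()
print(kwargs)

# With pronoms installed, the dictionary can be passed straight through:
# from pronoms.normalizers import DirectLFQNormalizer
# normalizer = DirectLFQNormalizer(**kwargs)
```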
- do_between_sample_norm
Flag indicating if between-sample normalization is enabled.
- Type:
bool
- n_quad_samples
Number of samples for quadratic stabilization (sample norm).
- Type:
int
- n_quad_ions
Number of ions for quadratic stabilization (protein estimation).
- Type:
int
- min_nonan
Minimum non-NaN values required per protein.
- Type:
int
- num_cores
Number of cores used by directlfq.
- Type:
Optional[int]
- normalize(X: ndarray, proteins: list[str], peptides: list[str]) → tuple[ndarray, ndarray, ndarray, ndarray]
Run DirectLFQ on the given peptide-level intensity matrix in memory.
This method orchestrates the DirectLFQ workflow:
1. Constructs a DataFrame in the format required by directlfq.
2. Applies preprocessing steps (log transform, sorting, NaN removal).
3. Optionally performs between-sample normalization.
4. Estimates protein intensities.
5. Extracts normalized protein and ion matrices and their corresponding IDs.
- Parameters:
X (np.ndarray) – Input data matrix with shape (n_samples, n_features), where features typically represent peptides or ions.
proteins (list[str]) – List of protein identifiers corresponding to each feature (column) in X. The length must equal X.shape[1].
peptides (list[str]) – List of peptide or ion identifiers corresponding to each feature (column) in X. The length must equal X.shape[1].
- Returns:
tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
A tuple containing four NumPy arrays –
protein_matrix: Normalized protein intensities (shape: n_samples, n_proteins).
ion_matrix: Normalized peptide/ion intensities (shape: n_samples, n_peptides).
protein_ids: Array of unique protein identifiers corresponding to the columns of protein_matrix (shape: n_proteins,).
peptide_ids: Array of unique peptide/ion identifiers corresponding to the columns of ion_matrix (shape: n_peptides,).
- Raises:
ValueError –
- If input X is not 2-dimensional.
- If lengths of proteins or peptides do not match X.shape[1].
- If X contains NaN or infinite values.
- If internal DataFrame processing or ID extraction fails.
ImportError – If the ‘directlfq’ library is not installed.
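The documented input conditions can be checked before calling normalize, which gives earlier and more specific error messages. The sketch below mirrors the first three ValueError conditions listed above; it is not the library's own validation code. The function name `check_directlfq_inputs` is hypothetical, and rejecting an empty matrix is an extra assumption of this example.

```python
import math


def check_directlfq_inputs(X, proteins, peptides):
    """Raise ValueError for inputs that normalize() is documented to reject."""
    try:
        # Works for nested lists and for 2-D NumPy arrays alike.
        rows = [[float(v) for v in row] for row in X]
    except TypeError as exc:
        # A 1-D input (rows are scalars) or a deeper nesting ends up here.
        raise ValueError("X must be 2-dimensional (samples x features)") from exc

    if not rows:
        # Assumption of this sketch: an empty matrix is treated as invalid.
        raise ValueError("X must contain at least one sample")

    n_features = len(rows[0])
    if any(len(row) != n_features for row in rows):
        raise ValueError("all rows of X must have the same length")

    if len(proteins) != n_features:
        raise ValueError("len(proteins) must equal the number of columns in X")
    if len(peptides) != n_features:
        raise ValueError("len(peptides) must equal the number of columns in X")

    if not all(math.isfinite(v) for row in rows for v in row):
        raise ValueError("X contains NaN or infinite values")


# Passes silently for a well-formed input.
check_directlfq_inputs(
    [[1000.0, 1100.0], [1200.0, 1300.0]],
    proteins=["ProtA", "ProtA"],
    peptides=["Pep1", "Pep2"],
)
```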
- plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'DirectLFQ Protein Normalization Comparison') → Figure
Plot protein data before vs after DirectLFQ normalization using a hexbin plot.
Note: This plots the protein level intensities. DirectLFQ computes these from the input peptide/ion intensities.
- Parameters:
before_data (np.ndarray) – Protein intensity data before normalization, shape (n_samples, n_proteins). This needs to be calculated/provided separately if the input to normalize was peptide-level.
after_data (np.ndarray) – Normalized protein intensity data after normalization, shape (n_samples, n_proteins). Typically the first element returned by the normalize method.
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “DirectLFQ Protein Normalization Comparison”.
- Returns:
Figure object containing the hexbin density plot.
- Return type:
plt.Figure
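Because before_data must be protein-level, peptide-level input has to be collapsed first. One simple option, shown here as an assumption rather than a recommendation from the library, is to sum the raw peptide intensities of each protein. The helper name `sum_peptides_per_protein` is hypothetical, and the column order is taken from whatever protein order is passed in (for example, the protein_ids array returned by normalize).

```python
def sum_peptides_per_protein(X, proteins, protein_order):
    """Sum the peptide columns of X per protein.

    X: rows are samples, columns are peptides.
    proteins: protein identifier for each column of X.
    protein_order: identifiers defining the output column order.
    """
    # Collect the column indices that belong to each protein.
    columns = {}
    for j, protein in enumerate(proteins):
        columns.setdefault(protein, []).append(j)

    # Proteins absent from `proteins` get an empty index list and sum to 0.
    return [
        [sum(row[j] for j in columns.get(protein, [])) for protein in protein_order]
        for row in X
    ]


peptide_data = [
    [1000, 1100, 500, 600, 0],
    [1200, 1300, 550, 650, 200],
    [900, 1000, 450, 550, 0],
]
protein_ids = ["ProtA", "ProtA", "ProtB", "ProtB", "ProtC"]

before = sum_peptides_per_protein(peptide_data, protein_ids, ["ProtA", "ProtB", "ProtC"])
print(before)  # [[2100, 1100, 0], [2500, 1200, 200], [1900, 1000, 0]]
```

Whether a plain sum is a fair "before" reference depends on the analysis; any summary should be put on the same scale as after_data before plotting.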
Usage Example
Basic DirectLFQ quantification:
import numpy as np
from pronoms.normalizers import DirectLFQNormalizer
# Example peptide-level data (samples x peptides)
# In practice, load from MaxQuant or similar output
peptide_data = np.array([
    [1000, 1100, 500, 600, 0],    # Sample 1
    [1200, 1300, 550, 650, 200],  # Sample 2
    [900, 1000, 450, 550, 0],     # Sample 3
])
# Protein and peptide identifiers
protein_ids = ['ProtA', 'ProtA', 'ProtB', 'ProtB', 'ProtC']
peptide_ids = ['Pep1', 'Pep2', 'Pep3', 'Pep4', 'Pep5']
# Create and apply normalizer
normalizer = DirectLFQNormalizer(num_cores=2)
protein_matrix, peptide_matrix, protein_names, peptide_names = normalizer.normalize(
    peptide_data,
    proteins=protein_ids,
    peptides=peptide_ids,
)
print("Protein quantification:")
print(f"Shape: {protein_matrix.shape}")
print(f"Proteins: {protein_names}")
print(protein_matrix)
print("\nPeptide quantification:")
print(f"Shape: {peptide_matrix.shape}")
print(f"Peptides: {peptide_names}")
print(peptide_matrix)
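The returned matrices and identifier arrays are matched by column position, so it can be convenient to key each column by its identifier. The sketch below uses a hypothetical helper, `columns_by_id`, and placeholder numbers in place of real DirectLFQ output; it works on nested lists as well as NumPy arrays.

```python
def columns_by_id(matrix, ids):
    """Map each identifier to its column of values (one value per sample)."""
    ids = [str(name) for name in ids]
    for row in matrix:
        if len(row) != len(ids):
            raise ValueError("each row must have one value per identifier")
    return {name: [row[j] for row in matrix] for j, name in enumerate(ids)}


# Placeholder numbers standing in for a returned protein_matrix (2 samples).
demo_matrix = [[10.0, 8.5], [10.4, 8.1]]
demo_ids = ["ProtA", "ProtB"]

by_protein = columns_by_id(demo_matrix, demo_ids)
print(by_protein["ProtA"])  # [10.0, 10.4]
```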
Visualization:
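# plot_comparison expects protein-level data for both arguments, so sum the raw
# peptide intensities per protein, in the column order of protein_names
proteins_arr = np.array(protein_ids)
before = np.column_stack([peptide_data[:, proteins_arr == p].sum(axis=1) for p in protein_names])
fig = normalizer.plot_comparison(before, protein_matrix)
fig.show()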
When to Use
DirectLFQNormalizer is particularly useful when:
Large-scale studies: Processing hundreds or thousands of samples
Missing value issues: Datasets with substantial missing peptide measurements
Accurate quantification needed: Clinical or biomarker studies requiring precision
Peptide-level data available: Starting from MaxQuant, Proteome Discoverer, or similar outputs
Comparative proteomics: Studies comparing protein abundances across conditions
Considerations
Computational requirements: More intensive than simple summarization methods
Python dependency: Requires the directlfq Python package
Data format: Needs peptide-to-protein mapping information
Memory usage: Large datasets may require substantial memory
Parameter tuning: May benefit from adjusting algorithm parameters for specific datasets
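Since normalize is documented to raise ImportError when directlfq is missing, the dependency can be checked up front. This sketch uses only the standard library; the helper name `has_module` is an assumption of this example.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if has_module("directlfq"):
    print("directlfq is available")
else:
    print("directlfq is missing; install it before calling normalize()")
```

Checking with find_spec avoids actually importing the package, which keeps the check cheap when it runs at program start.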
See Also
MedianNormalizer: For simple scaling-based normalization at the protein level
QuantileNormalizer: For making distributions identical across samples
VSNNormalizer: For variance-stabilizing normalization
RankNormalizer: For rank-based transformation
Citation
Ammar C, Schessner JP, Willems S, Michaelis AC, Mann M. Accurate Label-Free Quantification by directLFQ to Compare Unlimited Numbers of Proteomes. Mol Cell Proteomics. 2023 Jul;22(7):100581. doi:10.1016/j.mcpro.2023.100581, PMID: 37225017