DirectLFQNormalizer

DirectLFQNormalizer applies the DirectLFQ algorithm to estimate protein intensities from peptide- or ion-level intensity data. Rather than summarizing a fixed subset of peptides per protein, it uses all available peptide or ion intensities of each protein, which supports label-free quantification across many samples.

Overview

DirectLFQ addresses fundamental limitations in traditional label-free quantification approaches that typically summarize peptide intensities (e.g., by taking the top 3 peptides or using all peptides). Instead, DirectLFQ:

Models peptide-protein relationships: Directly accounts for the contribution of each peptide to its parent protein
Handles missing values: Uses all available peptide information without requiring complete data
Scales to large datasets: Efficiently processes hundreds or thousands of samples
Provides dual output: Returns both protein-level and peptide-level quantification

This approach is particularly powerful for:

Large-scale proteomics studies with many samples
Datasets with significant missing values
Comparative studies requiring accurate protein quantification
Clinical proteomics where precision is critical

Key Features

Direct quantification: Bypasses traditional peptide summarization steps
Missing value robust: Utilizes all available peptide evidence
Dual-level output: Provides both protein and peptide quantification
Scalable: Handles large sample numbers efficiently
Normalization integrated: Combines quantification with normalization in one step

Algorithm Details

DirectLFQ infers protein abundances from peptide or ion intensities. According to the publication listed under Citation, it is a ratio-based approach that aligns samples and ion intensity traces by shifting them on top of each other in logarithmic space. As wrapped by this class, the workflow:

Builds the input table: Maps each peptide or ion to its parent protein
Log-transforms intensities: Works in logarithmic space, where intensity ratios become differences
Normalizes across samples: Optionally brings samples onto a comparable scale
Estimates protein intensities: Combines the aligned ion traces of each protein
Handles missing values: Uses the available evidence without imputation
Returns dual quantification: Provides both protein and peptide-level results

By working with ratios between intensity traces instead of a fixed subset of peptides, the method avoids common pitfalls of peptide summarization approaches.
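
The publication listed under Citation describes aligning ion intensity traces by shifting them on top of each other in log space. The following is a deliberately simplified NumPy sketch of that idea for a single protein. The helper name shift_traces_to_reference and the choice of the first trace as the reference are assumptions made for illustration; this is not the directlfq implementation.

```python
import numpy as np

def shift_traces_to_reference(log_traces: np.ndarray) -> np.ndarray:
    """Shift each ion trace so it overlays the first trace (hypothetical helper).

    log_traces: shape (n_ions, n_samples), log2 intensities, NaN = missing.
    """
    reference = log_traces[0]
    shifted = np.empty_like(log_traces, dtype=float)
    for i, trace in enumerate(log_traces):
        # Median log-difference over the samples where both traces are observed.
        offset = np.nanmedian(trace - reference)
        shifted[i] = trace - offset
    return shifted

# Two ions of one protein: the second is 4x more intense (+2 in log2 space)
# and is missing in the last sample.
log_traces = np.array([
    [10.0, 11.0, 9.0],
    [12.0, 13.0, np.nan],
])
aligned = shift_traces_to_reference(log_traces)

# Per-sample protein profile: median over the aligned ion traces.
protein_profile = np.nanmedian(aligned, axis=0)
print(protein_profile)
```

Shifting by a constant in log space leaves the between-sample ratios of each ion unchanged, which is why the missing value in the second trace does not distort the profile of the last sample.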

Parameters

class pronoms.normalizers.DirectLFQNormalizer(do_between_sample_norm: bool = True, n_quad_samples: int = 50, n_quad_ions: int = 10, min_nonan: int = 1, num_cores: int | None = None)

Bases: object

Normalizer using the DirectLFQ algorithm for in-memory processing.

This normalizer wraps the external directlfq library to perform intensity normalization directly on NumPy arrays without intermediate file I/O. It processes peptide-level data to produce normalized protein-level and peptide-level intensities.

Parameters:

do_between_sample_norm (bool, optional) – Whether to perform between-sample normalization (median centering based on selected stable proteins), by default True.
n_quad_samples (int, optional) – Number of samples used for quadratic stabilization during between-sample normalization, by default 50.
n_quad_ions (int, optional) – Number of ions used for quadratic stabilization during protein intensity estimation, by default 10.
min_nonan (int, optional) – Minimum number of non-NaN values required per protein for its intensity to be estimated, by default 1.
num_cores (int | None, optional) – Number of CPU cores to use for parallel processing in directlfq. If None, directlfq attempts to use all available cores, by default None.
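
As a rough picture of what median centering between samples means, the sketch below shifts each sample's log2 intensities onto a common median. It uses every feature rather than a selected stable subset, and median_center is a hypothetical helper written for this illustration, not a function of pronoms or directlfq.

```python
import numpy as np

def median_center(log_x: np.ndarray) -> np.ndarray:
    """Shift each sample (row) so that all rows share the same median."""
    sample_medians = np.nanmedian(log_x, axis=1, keepdims=True)
    # Re-add the median of the sample medians so values stay on the original scale.
    return log_x - sample_medians + np.nanmedian(sample_medians)

log_x = np.log2(np.array([
    [1000.0, 1100.0, 500.0],
    [2000.0, 2200.0, 1000.0],  # same composition, loaded at 2x
]))
centered = median_center(log_x)
```

After centering, the two rows coincide, because a constant loading factor is a constant offset in log space.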

do_between_sample_norm

Flag indicating if between-sample normalization is enabled.

Type: bool

n_quad_samples

Number of samples for quadratic stabilization (sample norm).

Type: int

n_quad_ions

Number of ions for quadratic stabilization (protein estimation).

Type: int

min_nonan

Minimum non-NaN values required per protein.

Type: int

num_cores

Number of cores used by directlfq.

Type: int | None

normalize(X: ndarray, proteins: list[str], peptides: list[str]) → tuple[ndarray, ndarray, ndarray, ndarray]

Run DirectLFQ on the given peptide-level intensity matrix in memory.

This method orchestrates the DirectLFQ workflow:

1. Constructs a DataFrame in the format required by directlfq.
2. Applies preprocessing steps (log transform, sorting, NaN removal).
3. Optionally performs between-sample normalization.
4. Estimates protein intensities.
5. Extracts normalized protein and ion matrices and their corresponding IDs.

Parameters:

X (np.ndarray) – Input data matrix with shape (n_samples, n_features), where features typically represent peptides or ions.
proteins (list[str]) – List of protein identifiers corresponding to each feature (column) in X. The length must equal X.shape[1].
peptides (list[str]) – List of peptide or ion identifiers corresponding to each feature (column) in X. The length must equal X.shape[1].

Returns:

tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
A tuple containing four NumPy arrays –
- protein_matrix: Normalized protein intensities (shape: n_samples, n_proteins).
- ion_matrix: Normalized peptide/ion intensities (shape: n_samples, n_peptides).
- protein_ids: Array of unique protein identifiers corresponding to the columns of protein_matrix (shape: n_proteins,).
- peptide_ids: Array of unique peptide/ion identifiers corresponding to the columns of ion_matrix (shape: n_peptides,).
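
Because each ID array is aligned with the columns of its matrix, the profile of a single protein can be looked up by name. The arrays below are made-up placeholders with the documented shapes, standing in for the output of normalize; they are not real results.

```python
import numpy as np

# Placeholder arrays with the documented shapes (values are invented).
protein_matrix = np.array([
    [20.1, 18.3],
    [20.4, 18.0],
    [19.9, 18.5],
])  # shape (n_samples, n_proteins)
protein_ids = np.array(["ProtA", "ProtB"])  # shape (n_proteins,)

# Map each protein ID to its column index in protein_matrix.
column_of = {str(pid): j for j, pid in enumerate(protein_ids)}

# One value per sample for the requested protein.
prot_b_profile = protein_matrix[:, column_of["ProtB"]]
```

The same pattern applies to ion_matrix and peptide_ids.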

Raises:

ValueError –
- If input X is not 2-dimensional.
- If lengths of proteins or peptides do not match X.shape[1].
- If X contains NaN or infinite values.
- If internal DataFrame processing or ID extraction fails.
ImportError – If the ‘directlfq’ library is not installed.
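
The input-related conditions listed above can be checked before calling normalize. The function below is a hypothetical pre-flight check that mirrors the first three documented ValueError cases; it is a sketch, not the library's own validation code, and its error messages are invented.

```python
import numpy as np

def check_inputs(X, proteins, peptides):
    """Hypothetical check mirroring the documented ValueError conditions."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-dimensional, got {X.ndim} dimension(s)")
    n_features = X.shape[1]
    if len(proteins) != n_features or len(peptides) != n_features:
        raise ValueError("proteins and peptides must each have X.shape[1] entries")
    if not np.isfinite(X).all():
        raise ValueError("X contains NaN or infinite values")
    return X

# Passes: 2-D, matching lengths, all values finite.
X_ok = check_inputs([[1000.0, 1100.0]], ["ProtA", "ProtA"], ["Pep1", "Pep2"])
```

Running such a check first gives a clear message about which condition failed before any DirectLFQ processing starts.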

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'DirectLFQ Protein Normalization Comparison') → Figure

Plot protein data before vs after DirectLFQ normalization using a hexbin plot.

Note: This plots the protein level intensities. DirectLFQ computes these from the input peptide/ion intensities.

Parameters:

before_data (np.ndarray) – Protein intensity data before normalization, shape (n_samples, n_proteins). This needs to be calculated/provided separately if the input to normalize was peptide-level.
after_data (np.ndarray) – Normalized protein intensity data after normalization, shape (n_samples, n_proteins). Typically the first element returned by the normalize method.
figsize (Tuple[int, int], optional) – Figure size, by default (10, 8).
title (str, optional) – Plot title, by default “DirectLFQ Protein Normalization Comparison”.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Usage Example

Basic DirectLFQ quantification:

import numpy as np
from pronoms.normalizers import DirectLFQNormalizer

# Example peptide-level data (samples x peptides)
# In practice, load from MaxQuant or similar output
peptide_data = np.array([
    [1000, 1100, 500, 600, 0],     # Sample 1
    [1200, 1300, 550, 650, 200],   # Sample 2
    [900, 1000, 450, 550, 0]       # Sample 3
])

# Protein and peptide identifiers
protein_ids = ['ProtA', 'ProtA', 'ProtB', 'ProtB', 'ProtC']
peptide_ids = ['Pep1', 'Pep2', 'Pep3', 'Pep4', 'Pep5']

# Create and apply normalizer
normalizer = DirectLFQNormalizer(num_cores=2)

protein_matrix, peptide_matrix, protein_names, peptide_names = normalizer.normalize(
    peptide_data,
    proteins=protein_ids,
    peptides=peptide_ids
)

print("Protein quantification:")
print(f"Shape: {protein_matrix.shape}")
print(f"Proteins: {protein_names}")
print(protein_matrix)

print("\nPeptide quantification:")
print(f"Shape: {peptide_matrix.shape}")
print(f"Peptides: {peptide_names}")
print(peptide_matrix)
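
Search-engine exports are often in long format, with one row per measurement. The sketch below pivots such records into the X, proteins, and peptides arguments. The record layout and values are hypothetical, and unobserved entries are left at 0 to match the example above; whether 0 is the right encoding for missing measurements in your data is an assumption to verify.

```python
import numpy as np

# Hypothetical long-format records: (sample, protein, peptide, intensity).
records = [
    ("S1", "ProtA", "Pep1", 1000.0),
    ("S1", "ProtB", "Pep3", 500.0),
    ("S2", "ProtA", "Pep1", 1200.0),
    ("S2", "ProtB", "Pep3", 550.0),
    ("S2", "ProtC", "Pep5", 200.0),
]

samples = sorted({r[0] for r in records})
# One column per (protein, peptide) pair, in a stable sorted order.
features = sorted({(r[1], r[2]) for r in records})
row = {s: i for i, s in enumerate(samples)}
col = {f: j for j, f in enumerate(features)}

X = np.zeros((len(samples), len(features)))  # unobserved entries stay 0
for sample, protein, peptide, intensity in records:
    X[row[sample], col[(protein, peptide)]] = intensity

# Identifier lists aligned with the columns of X.
proteins = [f[0] for f in features]
peptides = [f[1] for f in features]
```

The resulting X has shape (n_samples, n_features), and both identifier lists have length X.shape[1], as normalize requires.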

Visualization:

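# plot_comparison expects protein-level data for both arguments, so first sum
# peptide intensities per protein (an illustrative baseline, not part of the API)
is_member = np.array(protein_ids)[:, None] == np.asarray(protein_names)[None, :]
before_protein = peptide_data @ is_member  # shape (n_samples, n_proteins)
fig = normalizer.plot_comparison(before_protein, protein_matrix)
fig.show()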

When to Use

DirectLFQNormalizer is particularly useful when:

Large-scale studies: Processing hundreds or thousands of samples
Missing value issues: Datasets with substantial missing peptide measurements
Accurate quantification needed: Clinical or biomarker studies requiring precision
Peptide-level data available: Starting from MaxQuant, Proteome Discoverer, or similar outputs
Comparative proteomics: Studies comparing protein abundances across conditions

Considerations

Computational requirements: More intensive than simple summarization methods
Python dependency: Requires the directlfq Python package
Data format: Needs peptide-to-protein mapping information
Memory usage: Large datasets may require substantial memory
Parameter tuning: May benefit from adjusting algorithm parameters for specific datasets

Citation

Ammar C, Schessner JP, Willems S, Michaelis AC, Mann M. Accurate Label-Free Quantification by directLFQ to Compare Unlimited Numbers of Proteomes. Mol Cell Proteomics. 2023 Jul;22(7):100581. doi:10.1016/j.mcpro.2023.100581, PMID: 37225017
