DirectLFQNormalizer

The DirectLFQNormalizer implements the DirectLFQ algorithm for protein quantification from peptide- or ion-level intensity data. The method infers protein abundances by modeling the relationship between peptides and their parent proteins, which supports label-free quantification across many samples and avoids the biases that summary-based approaches can introduce.

Overview

DirectLFQ addresses fundamental limitations in traditional label-free quantification approaches that typically summarize peptide intensities (e.g., by taking the top 3 peptides or using all peptides). Instead, DirectLFQ:

  1. Models peptide-protein relationships: Directly accounts for the contribution of each peptide to its parent protein

  2. Handles missing values: Uses all available peptide information without requiring complete data

  3. Scales to large datasets: Efficiently processes hundreds or thousands of samples

  4. Provides dual output: Returns both protein-level and peptide-level quantification

This approach is particularly powerful for:

  • Large-scale proteomics studies with many samples

  • Datasets with significant missing values

  • Comparative studies requiring accurate protein quantification

  • Clinical proteomics where precision is critical

Key Features

  • Direct quantification: Bypasses traditional peptide summarization steps

  • Missing value robust: Utilizes all available peptide evidence

  • Dual-level output: Provides both protein and peptide quantification

  • Scalable: Handles large sample numbers efficiently

  • Normalization integrated: Combines quantification with normalization in one step

Algorithm Details

DirectLFQ infers protein abundances from peptide intensities with a statistical model rather than a fixed summary rule. The algorithm:

  1. Constructs design matrix: Maps peptides to their parent proteins

  2. Applies statistical model: Uses robust regression to estimate protein abundances

  3. Handles missing values: Incorporates all available evidence without imputation

  4. Normalizes across samples: Ensures comparable scales between samples

  5. Returns dual quantification: Provides both protein and peptide-level results

The method avoids common pitfalls of peptide summarization approaches by directly modeling the underlying biological relationships.
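To make the idea of using all available peptide evidence without imputation more concrete, here is a minimal sketch. It is a simplified illustration and not the directlfq implementation: the function name align_traces, the choice of reference peptide, and the median-based shift are all assumptions made for this example. Each peptide's log-intensity trace is shifted onto a reference trace using only the samples both share, and the protein profile is the per-sample median of the shifted traces.

```python
import numpy as np

def align_traces(log_traces):
    """Shift peptide traces onto a common scale and summarize them.

    log_traces: (n_peptides, n_samples) log2 intensities, NaN = missing.
    Returns a (n_samples,) relative protein profile.
    Hypothetical helper for illustration; not part of pronoms or directlfq.
    """
    log_traces = np.asarray(log_traces, dtype=float)

    # Use the most completely observed peptide as the reference trace
    n_observed = np.sum(~np.isnan(log_traces), axis=1)
    reference = log_traces[np.argmax(n_observed)]

    shifted = np.full_like(log_traces, np.nan)
    for i, trace in enumerate(log_traces):
        diff = reference - trace
        shared = ~np.isnan(diff)
        if not shared.any():
            continue  # no overlap with the reference: left out in this sketch
        # Constant offset estimated only from samples observed in both traces
        shifted[i] = trace + np.median(diff[shared])

    # Median across aligned peptides, ignoring missing values (no imputation)
    return np.nanmedian(shifted, axis=0)

# Two peptides of one protein; the second is offset by -2 and has a gap
traces = np.array([
    [10.0, 11.0, 12.0],
    [8.0, np.nan, 10.0],
])
profile = align_traces(traces)
print(profile)
```

The missing value in the second peptide does not need to be filled in: the offset is estimated from the two samples where both peptides were measured, and the gap is simply ignored when taking the median.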

Parameters

class pronoms.normalizers.DirectLFQNormalizer(do_between_sample_norm: bool = True, n_quad_samples: int = 50, n_quad_ions: int = 10, min_nonan: int = 1, num_cores: int | None = None)

Bases: object

Normalizer using the DirectLFQ algorithm for in-memory processing.

This normalizer wraps the external directlfq library to perform intensity normalization directly on NumPy arrays without intermediate file I/O. It processes peptide-level data to produce normalized protein-level and peptide-level intensities.

Parameters:
  • do_between_sample_norm (bool, optional) – Whether to perform between-sample normalization (median centering based on selected stable proteins), by default True.

  • n_quad_samples (int, optional) – Number of samples used for quadratic stabilization during between-sample normalization, by default 50.

  • n_quad_ions (int, optional) – Number of ions used for quadratic stabilization during protein intensity estimation, by default 10.

  • min_nonan (int, optional) – Minimum number of non-NaN values required per protein for its intensity to be estimated, by default 1.

  • num_cores (int | None, optional) – Number of CPU cores to use for parallel processing in directlfq. If None, directlfq attempts to use all available cores, by default None.

do_between_sample_norm

Flag indicating if between-sample normalization is enabled.

Type:

bool

n_quad_samples

Number of samples for quadratic stabilization (sample norm).

Type:

int

n_quad_ions

Number of ions for quadratic stabilization (protein estimation).

Type:

int

min_nonan

Minimum non-NaN values required per protein.

Type:

int

num_cores

Number of cores used by directlfq.

Type:

Optional[int]
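
The do_between_sample_norm option is described above as median centering. As a rough illustration of that idea only, the sketch below shifts each sample's log-intensities so that all samples share a common median. The function name median_center is hypothetical, and the sketch uses every feature rather than the selected stable proteins mentioned in the parameter description, so it should not be read as what directlfq computes.

```python
import numpy as np

def median_center(log_X):
    """Shift each sample (row) so that all rows share the same median.

    log_X: (n_samples, n_features) log-intensities, NaN = missing.
    Hypothetical helper for illustration; not part of pronoms or directlfq.
    """
    log_X = np.asarray(log_X, dtype=float)
    # Per-sample median, ignoring missing values
    sample_medians = np.nanmedian(log_X, axis=1, keepdims=True)
    # Common target: the median of the per-sample medians
    target = np.nanmedian(sample_medians)
    return log_X - sample_medians + target

log_X = np.array([
    [1.0, 2.0, 3.0],
    [3.0, np.nan, 5.0],   # same pattern, measured 2 units higher
])
centered = median_center(log_X)
print(centered)
```

Working on log-intensities turns a multiplicative loading difference between samples into an additive shift, which is why a single subtraction per sample is enough here.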

normalize(X: ndarray, proteins: list[str], peptides: list[str]) → tuple[ndarray, ndarray, ndarray, ndarray]

Run DirectLFQ on the given peptide-level intensity matrix in memory.

This method orchestrates the DirectLFQ workflow:

  1. Constructs a DataFrame in the format required by directlfq.

  2. Applies preprocessing steps (log transform, sorting, NaN removal).

  3. Optionally performs between-sample normalization.

  4. Estimates protein intensities.

  5. Extracts normalized protein and ion matrices and their corresponding IDs.

Parameters:
  • X (np.ndarray) – Input data matrix with shape (n_samples, n_features), where features typically represent peptides or ions.

  • proteins (list[str]) – List of protein identifiers corresponding to each feature (column) in X. The length must equal X.shape[1].

  • peptides (list[str]) – List of peptide or ion identifiers corresponding to each feature (column) in X. The length must equal X.shape[1].

Returns:

A tuple containing four NumPy arrays:

  • protein_matrix: Normalized protein intensities (shape: n_samples, n_proteins).

  • ion_matrix: Normalized peptide/ion intensities (shape: n_samples, n_peptides).

  • protein_ids: Array of unique protein identifiers corresponding to the columns of protein_matrix (shape: n_proteins,).

  • peptide_ids: Array of unique peptide/ion identifiers corresponding to the columns of ion_matrix (shape: n_peptides,).

Return type:

tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Raises:
  • ValueError – Raised in any of the following cases:

    • If input X is not 2-dimensional.

    • If lengths of proteins or peptides do not match X.shape[1].

    • If X contains NaN or infinite values.

    • If internal DataFrame processing or ID extraction fails.

  • ImportError – If the ‘directlfq’ library is not installed.
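
The documented ValueError conditions on the inputs can be checked before calling normalize, which gives earlier and more specific feedback when preparing data. The sketch below mirrors the first three conditions listed above; the function name check_normalize_inputs and the error messages are made up for this example and are not part of pronoms.

```python
import numpy as np

def check_normalize_inputs(X, proteins, peptides):
    """Pre-flight check mirroring the documented input requirements.

    Hypothetical helper; the real checks happen inside normalize().
    Returns X as a float array if all checks pass.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-dimensional, got {X.ndim} dimension(s)")
    n_features = X.shape[1]
    if len(proteins) != n_features:
        raise ValueError(f"expected {n_features} protein IDs, got {len(proteins)}")
    if len(peptides) != n_features:
        raise ValueError(f"expected {n_features} peptide IDs, got {len(peptides)}")
    if not np.isfinite(X).all():
        raise ValueError("X contains NaN or infinite values")
    return X

X_ok = check_normalize_inputs(
    [[1000, 1100, 500], [1200, 1300, 550]],
    proteins=["ProtA", "ProtA", "ProtB"],
    peptides=["Pep1", "Pep2", "Pep3"],
)
print(X_ok.shape)
```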

plot_comparison(before_data: ndarray, after_data: ndarray, figsize: tuple[int, int] = (10, 8), title: str = 'DirectLFQ Protein Normalization Comparison') → Figure

Plot protein data before vs after DirectLFQ normalization using a hexbin plot.

Note: This plots the protein-level intensities. DirectLFQ computes these from the input peptide/ion intensities.

Parameters:
  • before_data (np.ndarray) – Protein intensity data before normalization, shape (n_samples, n_proteins). This needs to be calculated/provided separately if the input to normalize was peptide-level.

  • after_data (np.ndarray) – Normalized protein intensity data after normalization, shape (n_samples, n_proteins). Typically the first element returned by the normalize method.

  • figsize (tuple[int, int], optional) – Figure size, by default (10, 8).

  • title (str, optional) – Plot title, by default “DirectLFQ Protein Normalization Comparison”.

Returns:

Figure object containing the hexbin density plot.

Return type:

plt.Figure

Usage Example

Basic DirectLFQ quantification:

import numpy as np
from pronoms.normalizers import DirectLFQNormalizer

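# Example peptide-level data (samples x peptides)
# In practice, load from MaxQuant or similar output
# Unobserved peptides are entered as 0 here; normalize() raises ValueError on NaN or infinite values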
peptide_data = np.array([
    [1000, 1100, 500, 600, 0],     # Sample 1
    [1200, 1300, 550, 650, 200],   # Sample 2
    [900, 1000, 450, 550, 0]       # Sample 3
])

# Protein and peptide identifiers
protein_ids = ['ProtA', 'ProtA', 'ProtB', 'ProtB', 'ProtC']
peptide_ids = ['Pep1', 'Pep2', 'Pep3', 'Pep4', 'Pep5']

# Create and apply normalizer
normalizer = DirectLFQNormalizer(num_cores=2)

protein_matrix, peptide_matrix, protein_names, peptide_names = normalizer.normalize(
    peptide_data,
    proteins=protein_ids,
    peptides=peptide_ids
)

print("Protein quantification:")
print(f"Shape: {protein_matrix.shape}")
print(f"Proteins: {protein_names}")
print(protein_matrix)

print("\nPeptide quantification:")
print(f"Shape: {peptide_matrix.shape}")
print(f"Peptides: {peptide_names}")
print(peptide_matrix)
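
Because the identifiers come back as separate arrays, it is often convenient to pair them with the matrix columns before further analysis. The following sketch shows one way to do that. The helper name to_labeled_columns is hypothetical, and the arrays are stand-in values shaped like the documented protein_matrix and protein_ids outputs, not real DirectLFQ results.

```python
import numpy as np

def to_labeled_columns(matrix, ids):
    """Map each identifier to its column of the matrix (one value per sample).

    Hypothetical helper for illustration; not part of pronoms.
    """
    matrix = np.asarray(matrix)
    ids = list(ids)
    if matrix.ndim != 2 or matrix.shape[1] != len(ids):
        raise ValueError("number of ids must match number of matrix columns")
    return {name: matrix[:, j] for j, name in enumerate(ids)}

# Stand-in values shaped like (n_samples, n_proteins) and (n_proteins,)
example_matrix = np.array([
    [10.0, 9.0],
    [10.5, 9.2],
    [9.8, 8.9],
])
example_ids = np.array(["ProtA", "ProtB"])

by_protein = to_labeled_columns(example_matrix, example_ids)
print(by_protein["ProtB"])
```

The same helper applies to the peptide-level pair of outputs, since both follow the convention that identifiers correspond to matrix columns.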

Visualization:

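# plot_comparison expects protein-level data of the same shape on both sides,
# so sum the raw peptide intensities per protein first (illustration only)
is_member = np.array(protein_ids)[:, None] == np.asarray(protein_names)[None, :]
before_protein = peptide_data @ is_member
fig = normalizer.plot_comparison(before_protein, protein_matrix)
fig.show()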

When to Use

DirectLFQNormalizer is particularly useful when:

  • Large-scale studies: Processing hundreds or thousands of samples

  • Missing value issues: Datasets with substantial missing peptide measurements

  • Accurate quantification needed: Clinical or biomarker studies requiring precision

  • Peptide-level data available: Starting from MaxQuant, Proteome Discoverer, or similar outputs

  • Comparative proteomics: Studies comparing protein abundances across conditions

Considerations

  • Computational requirements: More intensive than simple summarization methods

  • Python dependency: Requires the directlfq Python package

  • Data format: Needs peptide-to-protein mapping information

  • Memory usage: Large datasets may require substantial memory

  • Parameter tuning: May benefit from adjusting algorithm parameters for specific datasets

Citation

Ammar C, Schessner JP, Willems S, Michaelis AC, Mann M. Accurate Label-Free Quantification by directLFQ to Compare Unlimited Numbers of Proteomes. Mol Cell Proteomics. 2023 Jul;22(7):100581. doi:10.1016/j.mcpro.2023.100581, PMID: 37225017