Utils Module

data_validation

Data Validation Module for FTIR Spectral Preprocessing

Provides comprehensive validation checks for FTIR spectral data to ensure data quality before preprocessing and analysis.

xpectrass.utils.data_validation.validate_spectra(df, expected_samples_per_class=500, expected_classes=None, wavenumber_range=(399.0, 4000.0), intensity_range=(0.0, 150.0), verbose=True)

Comprehensive validation of FTIR spectral data.

Parameters:
  • df (pl.DataFrame) – Wide-format DataFrame with columns: ‘sample’, ‘label’, and wavenumber columns.

  • expected_samples_per_class (int, default 500) – Expected number of samples per plastic type.

  • expected_classes (list of str, optional) – Expected class labels. Default: [‘HDPE’, ‘LDPE’, ‘PET’, ‘PP’, ‘PS’, ‘PVC’]

  • wavenumber_range (tuple, default (399.0, 4000.0)) – Expected (min, max) wavenumber range in cm⁻¹.

  • intensity_range (tuple, default (0.0, 150.0)) – Valid intensity range for %T values.

  • verbose (bool, default True) – Print validation report to console.

Returns:

Validation report with keys:
  • ‘valid’: bool - overall pass/fail

  • ‘n_samples’: int - total samples

  • ‘n_wavenumbers’: int - spectral points

  • ‘class_counts’: dict - samples per label

  • ‘missing_values’: int - count of NaN/Inf

  • ‘out_of_range’: dict - samples with intensities outside range

  • ‘wavenumber_check’: dict - actual vs expected range

  • ‘duplicates’: list - duplicate sample names

  • ‘issues’: list - list of issue descriptions

Return type:

dict
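As a rough illustration of how the report keys relate to the checks, the sketch below builds a reduced report from plain Python rows. It is not the xpectrass implementation: `build_report` and the `(sample, label, intensities)` row layout are hypothetical, and only a subset of the documented keys is reproduced.

```python
import math
from collections import Counter


def build_report(rows, expected_samples_per_class=500,
                 intensity_range=(0.0, 150.0)):
    """Hypothetical, reduced stand-in for a validation report.

    rows: list of (sample, label, intensities) tuples.
    """
    lo, hi = intensity_range
    class_counts = Counter(label for _, label, _ in rows)
    name_counts = Counter(sample for sample, _, _ in rows)
    duplicates = sorted(n for n, c in name_counts.items() if c > 1)

    # Count NaN/Inf values and record samples with out-of-range intensities.
    missing = 0
    out_of_range = {}
    for sample, _, values in rows:
        finite = [v for v in values if math.isfinite(v)]
        missing += len(values) - len(finite)
        n_bad = sum(1 for v in finite if v < lo or v > hi)
        if n_bad:
            out_of_range[sample] = n_bad

    # Collect human-readable issue descriptions.
    issues = []
    for label, count in sorted(class_counts.items()):
        if count != expected_samples_per_class:
            issues.append(
                f"{label}: {count} samples, expected {expected_samples_per_class}"
            )
    if missing:
        issues.append(f"{missing} missing (NaN/Inf) values")
    if out_of_range:
        issues.append(f"{len(out_of_range)} samples outside {intensity_range}")
    if duplicates:
        issues.append(f"duplicate sample names: {duplicates}")

    return {
        "valid": not issues,
        "n_samples": len(rows),
        "class_counts": dict(class_counts),
        "missing_values": missing,
        "out_of_range": out_of_range,
        "duplicates": duplicates,
        "issues": issues,
    }


rows = [
    ("s1", "PET", [10.0, 95.5, 80.0]),
    ("s2", "PET", [12.0, float("nan"), 160.0]),
    ("s2", "PP", [5.0, 50.0, 99.0]),
]
report = build_report(rows, expected_samples_per_class=2)
```

In this toy input the report is invalid for four separate reasons (class count, one NaN, one out-of-range intensity, one duplicated name), which mirrors how ‘issues’ accumulates descriptions while ‘valid’ summarizes them.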

xpectrass.utils.data_validation.detect_outlier_spectra(df, method='zscore', threshold=3.0)

Detect outlier spectra based on overall intensity statistics.

Parameters:
  • df (pl.DataFrame) – Wide-format spectral DataFrame.

  • method (str, default "zscore") – Detection method: ‘zscore’, ‘iqr’, or ‘mad’.

  • threshold (float, default 3.0) – Threshold for outlier detection.

Returns:

  • ‘outlier_samples’: list of sample names flagged as outliers

  • ‘outlier_indices’: list of row indices

  • ‘statistics’: dict with mean/std/median per sample

Return type:

dict
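The three detection methods can be illustrated on a list of per-sample mean intensities. This is a sketch under stated assumptions, not the library's code: `flag_outliers` is a hypothetical helper, and the exact statistic and cut-off rules used by detect_outlier_spectra() are not given in this reference. Here ‘zscore’ uses the population standard deviation, ‘iqr’ uses fences at threshold × IQR beyond the quartiles, and ‘mad’ uses the common 0.6745-scaled modified z-score.

```python
import statistics


def flag_outliers(sample_means, method="zscore", threshold=3.0):
    """Return indices of values flagged as outliers (illustrative only)."""
    values = list(sample_means)
    if method == "zscore":
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        if std == 0:
            return []
        return [i for i, v in enumerate(values)
                if abs(v - mean) / std > threshold]
    if method == "iqr":
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        lo, hi = q1 - threshold * iqr, q3 + threshold * iqr
        return [i for i, v in enumerate(values) if v < lo or v > hi]
    if method == "mad":
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:
            return []
        # 0.6745 rescales the MAD so scores are comparable to z-scores.
        return [i for i, v in enumerate(values)
                if 0.6745 * abs(v - med) / mad > threshold]
    raise ValueError(f"unknown method: {method!r}")


# Nine similar spectra and one with a much higher mean intensity.
means = [0.50, 0.52, 0.49, 0.51, 0.48, 0.50, 0.53, 0.47, 0.51, 2.50]
by_zscore = flag_outliers(means, method="zscore", threshold=2.5)
by_iqr = flag_outliers(means, method="iqr", threshold=3.0)
by_mad = flag_outliers(means, method="mad", threshold=3.0)
```

With only ten values, a population z-score cannot exceed √(n − 1) = 3, so a threshold of 3.0 would flag nothing here; the robust ‘iqr’ and ‘mad’ variants do not have that limitation on small sets.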

xpectrass.utils.data_validation.check_wavenumber_consistency(file_paths, skiprows=15, tolerance=0.1)

Check if all files have consistent wavenumber grids.

Parameters:
  • file_paths (list of str) – Paths to CSV spectral files.

  • skiprows (int, default 15) – Header rows to skip.

  • tolerance (float, default 0.1) – Maximum allowed difference in wavenumber values.

Returns:

  • ‘consistent’: bool

  • ‘reference_shape’: tuple

  • ‘mismatched_files’: list

Return type:

dict
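The tolerance check can be sketched on grids that are already in memory. File parsing is left out because the CSV layout is not described here. `check_grids` is a hypothetical helper, and using the first grid as the reference with an element-wise comparison is an assumption about how the comparison works.

```python
def grids_match(reference, candidate, tolerance=0.1):
    """True if both grids have the same length and agree within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= tolerance for r, c in zip(reference, candidate))


def check_grids(grids, tolerance=0.1):
    """grids: dict mapping a file name to its list of wavenumbers."""
    names = list(grids)
    reference = grids[names[0]]  # assumption: first file defines the grid
    mismatched = [
        name for name in names[1:]
        if not grids_match(reference, grids[name], tolerance)
    ]
    return {
        "consistent": not mismatched,
        "reference_shape": (len(reference),),
        "mismatched_files": mismatched,
    }


grids = {
    "a.csv": [400.0, 402.0, 404.0],
    "b.csv": [400.05, 402.0, 403.95],  # within tolerance
    "c.csv": [400.0, 402.5, 404.0],    # one point off by 0.5
    "d.csv": [400.0, 402.0],           # different length
}
result = check_grids(grids, tolerance=0.1)
```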

validate_spectra

validate_spectra(
    df: pl.DataFrame,
    expected_samples_per_class: int = 500,
    expected_classes: Optional[List[str]] = None,
    wavenumber_range: Tuple[float, float] = (399.0, 4000.0),
    intensity_range: Tuple[float, float] = (0.0, 150.0),
    verbose: bool = True
) -> Dict[str, Any]

detect_outlier_spectra

detect_outlier_spectra(
    df: pl.DataFrame,
    method: str = "zscore",  # 'zscore', 'iqr', 'mad'
    threshold: float = 3.0
) -> Dict[str, Any]

baseline

Baseline Correction Module for FTIR Spectral Preprocessing

Provides baseline correction using 50+ algorithms from the pybaselines library, plus custom windowed filters, for FTIR and ToF-SIMS spectra.

IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py.

Features:
  • Single spectrum correction via baseline_correction()

  • Batch DataFrame processing via apply_baseline_correction()

  • Automatic column detection and sorting by wavenumber

  • Performance optimized for large datasets (vectorized operations)

  • Pandas and Polars DataFrame support

  • Method evaluation via evaluate_baseline_correction_methods()

Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:

import logging
logging.getLogger('utils.baseline').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.baseline').setLevel(logging.ERROR)  # Only errors

Available Methods: Run baseline_method_names() to see all available correction algorithms. Common methods: airpls, asls, arpls, iarpls, drpls, mor, snip, poly

xpectrass.utils.baseline.baseline_correction(intensities, wavenumbers=None, method='airpls', window_size=101, poly_order=4, clip_negative=False, return_baseline=False, **kwargs)

Baseline-correct a 1-D FT-IR or ToF-SIMS spectrum with >50 algorithms.

Parameters:
  • intensities (array-like) – Raw y-values (%T or absorbance); will be converted to float64.

  • wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). If provided, passed to pybaselines for correct spacing. If None, assumes uniform spacing with dx=1. For FTIR spectra, should match the length of intensities.

  • method (str, default "airpls") – Name of the baseline algorithm. All pybaselines methods plus two custom filters (“median_filter”, “adaptive_window”) are accepted.

  • window_size (int, default 101) – Odd kernel width for the two custom windowed filters.

  • poly_order (int, default 4) – Polynomial order for the “poly” baseline.

  • clip_negative (bool, default False) – If True, set negative corrected values to 0 (useful for %T spectra).

  • return_baseline (bool, default False) – If True, return (corrected, baseline) instead of just corrected.

  • **kwargs – Extra keyword arguments are forwarded verbatim to the selected pybaselines algorithm (e.g. lam=1e6, p=0.01 for AsLS).

Returns:

  • corrected (np.ndarray) – Baseline-subtracted intensities (float64, same length as input).

  • baseline (np.ndarray , optional) – Returned only if return_baseline=True.

Return type:

ndarray | Tuple[ndarray, ndarray]

Notes

NaN Handling:
  • If input contains NaN values, they are temporarily removed for baseline estimation

  • Baseline is computed only on finite values

  • NaN positions are preserved in output (marked as NaN)

  • If all values are NaN, returns array of NaN
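The NaN-handling steps listed above can be sketched in plain Python. This is an illustrative pattern rather than the module's code: `correct_with_nan_mask` and the stand-in estimator `constant_baseline` are hypothetical names, and a real run would pass a pybaselines algorithm as the estimator.

```python
import math


def correct_with_nan_mask(intensities, estimate_baseline):
    """Subtract a baseline estimated on finite values, keeping NaN positions."""
    finite_idx = [i for i, v in enumerate(intensities) if math.isfinite(v)]
    corrected = [math.nan] * len(intensities)
    if not finite_idx:
        return corrected  # all-NaN input stays all-NaN

    # Estimate the baseline only on the finite values.
    finite_vals = [intensities[i] for i in finite_idx]
    baseline = estimate_baseline(finite_vals)

    # Write corrected values back to their original positions.
    for i, value, base in zip(finite_idx, finite_vals, baseline):
        corrected[i] = value - base
    return corrected


def constant_baseline(values):
    """Stand-in estimator: a flat baseline at the minimum value."""
    floor = min(values)
    return [floor] * len(values)


spectrum = [1.0, 1.5, math.nan, 3.0, 1.2]
result = correct_with_nan_mask(spectrum, constant_baseline)
all_nan = correct_with_nan_mask([math.nan, math.nan], constant_baseline)
```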

xpectrass.utils.baseline.apply_baseline_correction(data, method='airpls', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, window_size=101, poly_order=4, clip_negative=False, show_progress=True, **kwargs)

Apply baseline correction to a DataFrame of FTIR spectra (batch processing).

Works with both pandas and polars DataFrames. Each row is a sample and each numerical column is a wavenumber. Baseline correction is applied to every sample.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).

  • method (str, default "airpls") – Baseline correction method. All pybaselines methods plus custom filters (“median_filter”, “adaptive_window”, “poly”) are supported. Common methods: “airpls”, “asls”, “arpls”, “iarpls”, “drpls”, “iasls”, “aspls”, “psalsa”, “derpsalsa”, “mpls”, “mor”, “imor”, “amormol”, “snip”

  • label_column (str, default "label") – Name of the label/group column to exclude from correction.

  • exclude_columns (list[str], optional) – Additional column names to exclude from correction (e.g., ‘sample’, ‘id’).

  • wn_min (float, optional) – Minimum wavenumber for column detection (default: 200.0 cm⁻¹). Columns with wavenumbers below this value will be excluded.

  • wn_max (float, optional) – Maximum wavenumber for column detection (default: 8000.0 cm⁻¹). Columns with wavenumbers above this value will be excluded.

  • window_size (int, default 101) – Odd kernel width for custom windowed filters (“median_filter”, “adaptive_window”).

  • poly_order (int, default 4) – Polynomial order for the “poly” baseline method.

  • clip_negative (bool, default False) – If True, set negative corrected values to 0 (useful for %T spectra).

  • show_progress (bool, default True) – If True, display a progress bar during processing.

  • **kwargs (additional parameters) – Extra keyword arguments forwarded to the baseline correction algorithm (e.g., lam=1e6, p=0.01 for AsLS/AirPLS methods).

  • sample_id_column (str, default "sample_id") – Name of the sample identifier column.

Returns:

Baseline-corrected DataFrame (same type as input) with spectral data corrected and metadata columns preserved. Output columns are sorted by ascending wavenumber for standardization.

Return type:

pd.DataFrame | pl.DataFrame

Notes

NaN Handling:
  • Robustly handles NaN (missing) values in spectral data

  • NaN values are temporarily removed before baseline estimation

  • Baseline is computed only on finite values

  • NaN positions are preserved in output

  • If an entire spectrum is NaN, it remains as NaN

  • Prevents baseline algorithms from failing on sparse/incomplete data

Performance (optimized for large datasets):
  • Robust wavenumber column detection (parses column names, not dtype)

  • Automatic column sorting to ensure monotonic wavenumber order

  • Vectorized numpy array access (no DataFrame.loc overhead)

  • Pre-allocated output arrays (no dynamic list appending)

  • Progress tracking via tqdm

Warning

  • Warns if spectral columns are reordered during processing

  • Warns if wavenumber bounds are auto-expanded beyond defaults (200-8000 cm⁻¹)
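Column detection by name, with the default 200-8000 cm⁻¹ bounds, might look roughly like the sketch below. `detect_wavenumber_columns` is a hypothetical helper written for illustration; the auto-expansion of bounds and the warnings described above are not reproduced.

```python
def detect_wavenumber_columns(columns, wn_min=200.0, wn_max=8000.0):
    """Return names that parse as in-range wavenumbers, sorted ascending."""
    found = []
    for name in columns:
        try:
            value = float(name)  # parse the column name, ignore the dtype
        except (TypeError, ValueError):
            continue  # metadata columns such as 'sample' or 'label'
        if wn_min <= value <= wn_max:
            found.append((value, name))
    found.sort(key=lambda pair: pair[0])
    return [name for _, name in found]


columns = ["sample", "label", "4000.0", "650.5", "1200", "150.0"]
spectral = detect_wavenumber_columns(columns)
```

Here "150.0" parses as a number but falls below the lower bound, so it is treated like a metadata column, and the remaining names come back in ascending wavenumber order regardless of their order in the input.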

Examples

>>> # Apply AirPLS baseline correction to all samples
>>> df_corrected = apply_baseline_correction(df_wide, method="airpls")
>>> # Use AsLS with custom parameters
>>> df_corrected = apply_baseline_correction(
...     df_wide,
...     method="asls",
...     lam=1e6,
...     p=0.01
... )
>>> # Use polynomial baseline
>>> df_corrected = apply_baseline_correction(
...     df_wide,
...     method="poly",
...     poly_order=3
... )
>>> # Works with both pandas and polars
>>> df_pd_corrected = apply_baseline_correction(df_pandas)
>>> df_pl_corrected = apply_baseline_correction(df_polars)
>>> # Disable progress bar for cleaner output
>>> df_corrected = apply_baseline_correction(df_wide, show_progress=False)
>>> # Works correctly even with NaN values in spectra
>>> df_with_nans = df_wide.copy()
>>> df_with_nans.iloc[0, 10:20] = np.nan  # Introduce NaN values
>>> df_corrected = apply_baseline_correction(df_with_nans, method="airpls")
>>> # NaN positions are preserved in output

xpectrass.utils.baseline.baseline_method_names()

Return a sorted list of method names that can be passed to baseline_correction(method=…).

The list is generated dynamically from pybaselines.Baseline, skipping the deprecated solver helpers, and then augmented with the two custom windowed filters plus the convenient ‘poly’ alias.

Return type:

list[str]

xpectrass.utils.baseline.plot_baseline_correction_metric_boxes(df, metric_name, figsize=(9, 5), mean_bar_width=0.6, color_boxes=None, color_mean=None, plot_mean_sd=False, save_plot=False, save_path='')

Box-plot of a baseline-quality metric (RFZN or NAR) across methods.

Parameters:
  • df (pandas.DataFrame) – Rows = samples, columns = baseline-correction methods.

  • metric_name (str) – Title for the y-axis and plot.

  • figsize ((w, h), default (9, 5)) – Size of the figure in inches.

  • mean_bar_width (float, default 0.6) – Width of the mean ± SD bar overlay (same units as box widths).

  • color_boxes (str | None) – Matplotlib colour for the boxes. None → Matplotlib default cycle.

  • color_mean (str | None) – Colour for the mean ± SD bars. None → Matplotlib default cycle.

  • plot_mean_sd (bool)

  • save_plot (bool)

  • save_path (str)

Return type:

None

xpectrass.utils.baseline.plot_baseline_correction_metric_boxes_masked(df, metric_name, max_value, figsize=(9, 5), mean_bar_width=0.6, color_boxes=None, color_mean=None, plot_mean_sd=False, save_plot=False, save_path='masked_')

Return type:

None

xpectrass.utils.baseline.evaluate_baseline_correction_methods(data, flat_windows, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, negative_clip=False, diagnostic_peaks=None, baseline_methods=None, n_samples=None, sample_selection='random', random_state=None, n_jobs=-1)

Parallel computation of RFZN, NAR, SNR for every (sample, method) pair.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).

  • flat_windows (list of tuples) – Wavenumber ranges to use for baseline noise evaluation. Each tuple is (min_wavenumber, max_wavenumber) for regions expected to contain only baseline (no peaks).

  • label_column (str, default "label") – Name of the label/group column to exclude from evaluation.

  • exclude_columns (list[str], optional) – Additional column names to exclude from evaluation (e.g., ‘sample’, ‘id’).

  • wn_min (float, optional) – Minimum wavenumber for column detection (default: 200.0 cm⁻¹).

  • wn_max (float, optional) – Maximum wavenumber for column detection (default: 8000.0 cm⁻¹).

  • negative_clip (bool, default False) – If True, clip negative values to 0 during baseline correction.

  • diagnostic_peaks (list of tuples, optional) – Specific wavenumber ranges for peak detection in SNR calculation. Each tuple is (min_wavenumber, max_wavenumber) for diagnostic peaks. Example: [(2900, 2930), (2840, 2870)] for CH2/CH3 stretch regions. If None, uses global maximum across entire spectrum.

  • baseline_methods (list of str, optional) – List of baseline correction methods to evaluate. If None (default), evaluates all available methods from baseline_method_names(). Providing a subset significantly speeds up evaluation. Example: [‘asls’, ‘airpls’, ‘arpls’] to test only penalized least-squares variants. Use baseline_method_names() to see all available methods.

  • n_samples (int, optional) – Number of samples to evaluate. If None, evaluates all samples.

  • sample_selection (str, default "random") – How to select samples if n_samples < total samples. Options: “random”, “first”, “last”.

  • random_state (int, optional) – Random seed for reproducible sample selection when using “random”.

  • n_jobs (int, default -1) – Number of parallel jobs. -1 uses all CPU cores.

  • sample_id_column (str, default "sample_id") – Name of the sample identifier column.

Returns:

  • rfzn_tbl (pandas.DataFrame) – Residual Flat-Zone Noise (RFZN) values for each (sample, method) pair. Units: Same as input spectral intensities (e.g., absorbance, %T). Lower values indicate better baseline correction (less residual noise).

  • nar_tbl (pandas.DataFrame) – Negative Area Ratio (NAR) values for each (sample, method) pair. Units: Unitless ratio in range [0, 1]. Ratio of negative area to total area after baseline correction. Lower values indicate better correction (fewer negative artifacts).

  • snr_tbl (pandas.DataFrame) – Signal-to-Noise Ratio (SNR) values for each (sample, method) pair. Units: Unitless ratio (peak_height / noise_level). Higher values indicate better correction (stronger signal relative to noise).

Notes

Metric Interpretations:
  • RFZN: RMS noise in flat zones. Good methods: < 0.01 (absorbance units)

  • NAR: Fraction of negative intensity. Good methods: < 0.05 (5%)

  • SNR: Peak/noise ratio. Good methods: > 10 (depends on sample)
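To make the three metrics concrete, the sketch below computes them for one already-corrected spectrum. The formulas are assumptions chosen to match the one-line descriptions above (RMS in the flat windows, negative area over total absolute area, peak height over flat-zone noise); the helper names are hypothetical and the library's exact definitions may differ.

```python
import math


def in_windows(wn, windows):
    """True if wavenumber wn lies inside any (min, max) window."""
    return any(lo <= wn <= hi for lo, hi in windows)


def rfzn(wavenumbers, corrected, flat_windows):
    """RMS of the corrected signal inside the flat windows."""
    flat = [y for wn, y in zip(wavenumbers, corrected)
            if in_windows(wn, flat_windows)]
    if not flat:
        return math.nan
    return math.sqrt(sum(y * y for y in flat) / len(flat))


def nar(corrected):
    """Negative area divided by total absolute area (unit point spacing)."""
    total = sum(abs(y) for y in corrected)
    negative = sum(-y for y in corrected if y < 0)
    return negative / total if total else 0.0


def snr(wavenumbers, corrected, flat_windows, diagnostic_peaks=None):
    """Peak height over flat-zone noise; global maximum if no peaks given."""
    if diagnostic_peaks:
        candidates = [y for wn, y in zip(wavenumbers, corrected)
                      if in_windows(wn, diagnostic_peaks)]
    else:
        candidates = list(corrected)
    noise = rfzn(wavenumbers, corrected, flat_windows)
    if not candidates or math.isnan(noise):
        return math.nan
    return max(candidates) / noise if noise else math.inf


wavenumbers = [2500, 2550, 2600, 2900, 2915, 2930]
corrected = [0.01, -0.01, 0.01, 0.20, 0.50, 0.20]
flat_windows = [(2500, 2600)]
peaks = [(2900, 2930)]

noise_value = rfzn(wavenumbers, corrected, flat_windows)
nar_value = nar(corrected)
snr_value = snr(wavenumbers, corrected, flat_windows, peaks)
```

For this toy spectrum the flat-zone noise sits right at the 0.01 guideline, the negative fraction is about 1%, and the peak-to-noise ratio is about 50.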

Example Usage:
>>> flat_windows = [(2500, 2600), (3200, 3500)]  # Baseline-only regions
>>>
>>> # Evaluate all available methods
>>> rfzn, nar, snr = evaluate_baseline_correction_methods(
...     df, flat_windows, diagnostic_peaks=[(2900, 2930)]
... )
>>>
>>> # Evaluate only specific methods (faster)
>>> rfzn, nar, snr = evaluate_baseline_correction_methods(
...     df, flat_windows,
...     baseline_methods=['asls', 'airpls', 'arpls', 'snip']
... )
>>>
>>> # Find best method for each sample
>>> best_methods = rfzn.idxmin(axis=1)  # Method with lowest noise per sample

xpectrass.utils.baseline.find_best_baseline_method(rfzn_tbl, nar_tbl, snr_tbl, rfzn_threshold=0.01, nar_threshold=0.05, snr_min=10.0, top_n=5)

Recommend best baseline correction methods based on evaluation metrics.

Analyzes RFZN, NAR, and SNR across all samples to identify methods that consistently perform well. Methods are ranked by a composite score combining all three metrics.

Parameters:
  • rfzn_tbl (pd.DataFrame) – RFZN values from evaluate_baseline_correction_methods()

  • nar_tbl (pd.DataFrame) – NAR values from evaluate_baseline_correction_methods()

  • snr_tbl (pd.DataFrame) – SNR values from evaluate_baseline_correction_methods()

  • rfzn_threshold (float, default 0.01) – Maximum acceptable RFZN (lower is better). Default: 0.01 absorbance units.

  • nar_threshold (float, default 0.05) – Maximum acceptable NAR (lower is better). Default: 0.05 (5%).

  • snr_min (float, default 10.0) – Minimum acceptable SNR (higher is better). Default: 10.

  • top_n (int, default 5) – Number of top methods to return.

Returns:

Ranked methods, sorted by composite_score descending (best methods first), with columns:
  • method: Method name

  • median_rfzn: Median RFZN across samples

  • median_nar: Median NAR across samples

  • median_snr: Median SNR across samples

  • pass_rate: Fraction of samples passing all thresholds (0-1)

  • composite_score: Weighted score (higher is better)

Return type:

pd.DataFrame

Notes

Composite Score Calculation:
  • Normalizes each metric to [0, 1] range

  • RFZN: Lower is better (inverted for scoring)

  • NAR: Lower is better (inverted for scoring)

  • SNR: Higher is better

  • Pass rate: Bonus for consistent performance

  • Composite = (0.3 * RFZN_score) + (0.3 * NAR_score) + (0.3 * SNR_score) + (0.1 * pass_rate)
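The weighting above can be sketched as follows. The weights come from the formula in the notes; the min-max normalization across methods and the handling of ties are assumptions, and `rank_methods` is a hypothetical helper, so the scores will not necessarily reproduce the library's output.

```python
WEIGHTS = {"rfzn": 0.3, "nar": 0.3, "snr": 0.3, "pass_rate": 0.1}


def min_max(values, invert=False):
    """Scale values to [0, 1]; invert so that lower raw values score higher."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # assumption: ties all get full score
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def rank_methods(table):
    """table: method -> (median_rfzn, median_nar, median_snr, pass_rate)."""
    methods = list(table)
    rfzn_s = min_max([table[m][0] for m in methods], invert=True)
    nar_s = min_max([table[m][1] for m in methods], invert=True)
    snr_s = min_max([table[m][2] for m in methods])
    scores = {}
    for i, m in enumerate(methods):
        scores[m] = (WEIGHTS["rfzn"] * rfzn_s[i]
                     + WEIGHTS["nar"] * nar_s[i]
                     + WEIGHTS["snr"] * snr_s[i]
                     + WEIGHTS["pass_rate"] * table[m][3])
    # Best methods first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


table = {
    "airpls": (0.0045, 0.02, 25.3, 0.95),
    "arpls": (0.0052, 0.03, 23.1, 0.92),
    "drpls": (0.0061, 0.04, 21.7, 0.88),
}
ranked = rank_methods(table)
```

Because the four weights sum to 1.0 and every component lies in [0, 1], the composite score is itself bounded by [0, 1].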

Example

>>> rfzn, nar, snr = evaluate_baseline_correction_methods(df, flat_windows)
>>> recommendations = find_best_baseline_method(rfzn, nar, snr, top_n=3)
>>> print(recommendations)
     method  median_rfzn  median_nar  median_snr  pass_rate  composite_score
0   airpls       0.0045        0.02        25.3       0.95             0.89
1    arpls       0.0052        0.03        23.1       0.92             0.85
2    drpls       0.0061        0.04        21.7       0.88             0.81

baseline_correction

baseline_correction(
    intensities: np.ndarray,
    wavenumbers: Optional[np.ndarray] = None,
    method: str = "airpls",
    window_size: int = 101,
    poly_order: int = 4,
    clip_negative: bool = False,
    return_baseline: bool = False,
    **kwargs
) -> Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]

baseline_method_names

baseline_method_names() -> List[str]

Returns a sorted list of the 50+ available baseline correction method names.


denoise

Denoising Module for FTIR Spectral Preprocessing

IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py.

Features:
  • Single spectrum denoising via denoise()

  • Batch DataFrame processing via apply_denoising()

  • Automatic column detection and sorting by wavenumber

  • Performance optimized for large datasets (vectorized operations)

  • Pandas and Polars DataFrame support

  • Method evaluation via evaluate_denoising_methods()

  • Memory-safe evaluation via evaluate_denoising_methods_safe()

  • Composite scoring for method selection via find_best_denoising_method()

Memory Management: For systems with limited RAM or large datasets, use evaluate_denoising_methods_safe() instead of evaluate_denoising_methods(). See MEMORY_MANAGEMENT_GUIDE.md for details.

Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:

import logging
logging.getLogger('utils.denoise').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.denoise').setLevel(logging.ERROR)  # Only errors

Available Methods: Run denoise_method_names() to see all available denoising algorithms. Common methods: savgol, wavelet, moving_average, gaussian, median, whittaker, lowpass

xpectrass.utils.denoise.denoise(intensities, wavenumbers=None, method='savgol', **kwargs)

Denoise a 1-D FTIR spectrum using various filtering methods.

Parameters:
  • intensities (array-like) – Raw intensity values (1-D).

  • wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). If provided, validates that data is sorted in ascending order and warns if not. Ensures API consistency with other preprocessing modules (baseline, atmospheric).

  • method (str, default "savgol") – Denoising method. Options:
      - ‘savgol’: Savitzky-Golay filter (preserves peak shape)
      - ‘wavelet’: Discrete wavelet transform denoising
      - ‘moving_average’: Simple moving average
      - ‘gaussian’: Gaussian filter
      - ‘median’: Median filter (good for spike noise)
      - ‘whittaker’: Penalized least squares smoother
      - ‘lowpass’: Low-pass Butterworth filter

  • **kwargs (method-specific parameters) – Defaults are shown in parentheses:
      - savgol: window_length (15), polyorder (3)
      - wavelet: wavelet (‘db4’), level (3), threshold_mode (‘soft’)
      - moving_average: window (11)
      - gaussian: sigma (2.0)
      - median: kernel_size (5)
      - whittaker: lam (1e4), d (2)
      - lowpass: cutoff (0.1), order (4)

Returns:

Denoised intensity values. NaN values in input are preserved at their original positions; denoising is applied only to finite values.

Return type:

np.ndarray
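As a point of reference for the simplest option, a centered moving average can be sketched in a few lines. This is not the module's implementation: the odd-window requirement and the shrinking window at the edges are assumptions made for the sketch, and NaN handling is omitted.

```python
def moving_average(values, window=11):
    """Centered moving average; the window shrinks near the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
smoothed = moving_average(noisy, window=3)
```

The output has the same length as the input and a smaller spread, which is the basic trade-off every method in the list makes: less noise at the cost of some peak sharpness.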

xpectrass.utils.denoise.denoise_method_names()

Return list of available denoising method names.

Return type:

List[str]

xpectrass.utils.denoise.apply_denoising(data, method='savgol', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, show_progress=True, **kwargs)

Apply denoising to a DataFrame of FTIR spectra (batch processing).

Works with both pandas and polars DataFrames. Each row is a sample and each numerical column is a wavenumber. Denoising is applied to every sample.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).

  • method (str, default "savgol") – Denoising method. Options:
      - ‘savgol’: Savitzky-Golay filter (preserves peak shape)
      - ‘wavelet’: Discrete wavelet transform denoising
      - ‘moving_average’: Simple moving average
      - ‘gaussian’: Gaussian filter
      - ‘median’: Median filter (good for spike noise)
      - ‘whittaker’: Penalized least squares smoother
      - ‘lowpass’: Low-pass Butterworth filter

  • label_column (str, default "label") – Name of the label/group column to exclude from denoising.

  • exclude_columns (list[str], optional) – Additional column names to exclude from denoising (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns.

  • wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default, or auto-expands if no columns found within default range.

  • wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default, or auto-expands if no columns found within default range.

  • show_progress (bool, default True) – If True, display a progress bar during processing.

  • **kwargs (additional parameters) – Method-specific parameters forwarded to the denoising algorithm (defaults in parentheses):
      - savgol: window_length (15), polyorder (3)
      - wavelet: wavelet (‘db4’), level (3), threshold_mode (‘soft’)
      - moving_average: window (11)
      - gaussian: sigma (2.0)
      - median: kernel_size (5)
      - whittaker: lam (1e4), d (2)
      - lowpass: cutoff (0.1), order (4)

  • sample_id_column (str, default "sample_id") – Name of the sample identifier column.

Returns:

Denoised DataFrame (same type as input) with spectral data denoised and metadata columns preserved. Columns are sorted by ascending wavenumber order.

Return type:

pd.DataFrame | pl.DataFrame

Examples

>>> # Apply Savitzky-Golay denoising to all samples
>>> df_denoised = apply_denoising(df_wide, method="savgol")
>>> # Use wavelet denoising with custom parameters
>>> df_denoised = apply_denoising(
...     df_wide,
...     method="wavelet",
...     wavelet="db4",
...     level=3,
...     threshold_mode="soft"
... )
>>> # Use Gaussian smoothing
>>> df_denoised = apply_denoising(
...     df_wide,
...     method="gaussian",
...     sigma=2.0
... )
>>> # Works with both pandas and polars
>>> df_pd_denoised = apply_denoising(df_pandas)
>>> df_pl_denoised = apply_denoising(df_polars)
>>> # Disable progress bar for cleaner output
>>> df_denoised = apply_denoising(df_wide, show_progress=False)

xpectrass.utils.denoise.estimate_snr(y_raw, y_denoised, flat_regions=None, wavenumbers=None)

Estimate Signal-to-Noise Ratio improvement (NaN-aware).

Parameters:
  • y_raw (np.ndarray) – Original noisy spectrum.

  • y_denoised (np.ndarray) – Denoised spectrum.

  • flat_regions (list of tuples, optional) – Regions known to be baseline-only (for noise estimation). Can be either a list of (start_idx, end_idx) integer index tuples (legacy) or a list of (wn_min, wn_max) float wavenumber tuples (recommended). If wavenumbers is provided, flat_regions is interpreted as wavenumber ranges. If None, uses high-frequency residual estimation.

  • wavenumbers (np.ndarray, optional) – Wavenumber array. Required if flat_regions specified as wavenumber ranges.

Returns:

Estimated SNR in dB. Returns np.nan if insufficient finite data.

Return type:

float

Notes

  • Uses NaN-aware statistics to handle missing values in spectra

  • Wavenumber-based regions (recommended): More robust across different spectral resolutions

  • Index-based regions (legacy): Faster but resolution-dependent
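The difference between the two region styles can be shown by resolving each to point indices. `flat_region_indices` is a hypothetical helper; treating index tuples as half-open ranges and wavenumber tuples as inclusive bounds is an assumption, since the reference does not state the endpoint convention.

```python
def flat_region_indices(flat_regions, n_points, wavenumbers=None):
    """Resolve flat regions to a sorted list of point indices."""
    indices = set()
    for start, end in flat_regions:
        if wavenumbers is None:
            # Legacy style: (start_idx, end_idx), treated as half-open.
            indices.update(range(max(0, int(start)), min(n_points, int(end))))
        else:
            # Recommended style: (wn_min, wn_max), treated as inclusive.
            indices.update(
                i for i, wn in enumerate(wavenumbers) if start <= wn <= end
            )
    return sorted(indices)


coarse = [650.0 + 50.0 * i for i in range(10)]  # 650, 700, ..., 1100
fine = [650.0 + 25.0 * i for i in range(19)]    # same span, half the step

by_index = flat_region_indices([(2, 5)], len(coarse))
coarse_hits = flat_region_indices([(700, 800)], len(coarse), coarse)
fine_hits = flat_region_indices([(700, 800)], len(fine), fine)
```

The same wavenumber window selects three points on the coarse grid and five on the fine grid while covering the same spectral region, whereas a fixed index range would cover a different region on each grid. That is the resolution dependence mentioned in the notes.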

Examples

>>> # Index-based (legacy)
>>> snr = estimate_snr(y_raw, y_denoised, flat_regions=[(10, 50), (200, 250)])
>>> # Wavenumber-based (recommended)
>>> wn = np.linspace(650, 4000, 1000)
>>> snr = estimate_snr(y_raw, y_denoised, flat_regions=[(2500, 2600), (3200, 3500)], wavenumbers=wn)

xpectrass.utils.denoise.evaluate_denoising_methods_safe(data, methods=None, **kwargs)

Memory-safe wrapper for evaluate_denoising_methods() with conservative defaults.

This function automatically sets safe defaults to prevent memory issues:
  • n_samples=50 (instead of all samples)

  • n_jobs=2 (instead of all CPU cores)

  • methods=[‘savgol’, ‘gaussian’, ‘median’] if not specified

Use this function for initial exploration, then switch to full evaluate_denoising_methods() with custom parameters if needed.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Spectral DataFrame

  • methods (list of str, optional) – Denoising methods to test. If None, uses [‘savgol’, ‘gaussian’, ‘median’].

  • **kwargs (additional parameters) – Forwarded to evaluate_denoising_methods(). n_samples and n_jobs are set to the safe defaults unless explicitly provided.

Returns:

Evaluation results with columns: sample, method, snr_db, smoothness, fidelity, time_ms

Return type:

pd.DataFrame
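The "safe defaults unless explicitly provided" behaviour is a small wrapper pattern. The sketch below shows it with a stand-in evaluator so it runs without the library: `evaluate_safe` and `fake_evaluate` are hypothetical names, and only the default values are taken from the description above.

```python
def evaluate_safe(evaluate, data, methods=None, **kwargs):
    """Call evaluate() with conservative defaults the caller can override."""
    if methods is None:
        methods = ["savgol", "gaussian", "median"]
    # setdefault leaves any explicitly passed value untouched.
    kwargs.setdefault("n_samples", 50)
    kwargs.setdefault("n_jobs", 2)
    return evaluate(data, methods=methods, **kwargs)


def fake_evaluate(data, methods=None, **kwargs):
    """Stand-in evaluator that just echoes the arguments it received."""
    return {"methods": methods, **kwargs}


defaults = evaluate_safe(fake_evaluate, data=None)
custom = evaluate_safe(fake_evaluate, data=None, methods=["wavelet"], n_jobs=4)
```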

Examples

>>> # Safe evaluation (won't cause memory issues)
>>> results = evaluate_denoising_methods_safe(df)
>>> recommendations = find_best_denoising_method(results)
>>> # With custom methods but still safe
>>> results = evaluate_denoising_methods_safe(df, methods=['savgol', 'wavelet'])

xpectrass.utils.denoise.evaluate_denoising_methods(data, methods=None, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, n_samples=None, sample_selection='random', random_state=None, n_jobs=-1)

Compare denoising methods on a subset of spectra.

MEMORY WARNING: Parallel processing can consume significant memory. For systems with <16 GB RAM or large datasets (>1000 samples):
  • Set n_jobs=2 (not -1)

  • Set n_samples=50 (not None)

  • Test with methods=[‘savgol’, ‘gaussian’] first

See MEMORY_MANAGEMENT_GUIDE.md for detailed recommendations.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Wide-format spectral DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).

  • methods (list of str, optional) – Methods to evaluate. If None, evaluates all available methods. Available: ‘savgol’, ‘wavelet’, ‘moving_average’, ‘gaussian’, ‘median’, ‘whittaker’, ‘lowpass’. Recommendation: Start with 2-3 methods to test memory usage.

  • label_column (str, default "label") – Name of the label/group column to exclude from evaluation.

  • exclude_columns (list[str], optional) – Additional column names to exclude from evaluation (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns.

  • wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default.

  • wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default.

  • n_samples (int, optional) – Number of samples to evaluate. If None, uses all samples.

  • sample_selection (str, default "random") – How to select samples: “random”, “first”, or “last”.

  • random_state (int, optional) – Random seed for reproducibility when sample_selection=”random”.

  • n_jobs (int, default -1) – Number of parallel jobs. -1 uses all CPU cores.

  • sample_id_column (str, default "sample_id") – Name of the sample identifier column.

Returns:

Evaluation metrics for each (sample, method) combination with columns:
  • sample: Sample identifier

  • method: Denoising method name

  • snr_db: Signal-to-noise ratio improvement (dB)

  • smoothness: Inverse of 2nd derivative variance (higher = smoother)

  • fidelity: Correlation with original signal (0-1, higher = better)

  • time_ms: Computation time in milliseconds (for performance comparison)

Return type:

pd.DataFrame
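Two of the returned metrics are simple enough to sketch directly from their descriptions. The helpers below are hypothetical and make assumptions the reference does not spell out: the second derivative is approximated by second differences with population variance, and fidelity is the Pearson correlation.

```python
import math
import statistics


def smoothness(y):
    """Inverse variance of the second differences (higher = smoother)."""
    second_diff = [y[i - 1] - 2.0 * y[i] + y[i + 1]
                   for i in range(1, len(y) - 1)]
    variance = statistics.pvariance(second_diff)
    return 1.0 / variance if variance > 0 else math.inf


def fidelity(y_raw, y_denoised):
    """Pearson correlation between the raw and denoised spectra."""
    mean_r = statistics.fmean(y_raw)
    mean_d = statistics.fmean(y_denoised)
    cov = sum((r - mean_r) * (d - mean_d) for r, d in zip(y_raw, y_denoised))
    var_r = sum((r - mean_r) ** 2 for r in y_raw)
    var_d = sum((d - mean_d) ** 2 for d in y_denoised)
    if var_r == 0 or var_d == 0:
        return math.nan
    return cov / math.sqrt(var_r * var_d)


raw = [0.0, 1.2, 1.8, 3.1, 3.9, 5.2]
denoised = [0.1, 1.0, 2.0, 3.0, 4.0, 5.0]

smooth_raw = smoothness(raw)
smooth_denoised = smoothness(denoised)
fid = fidelity(raw, denoised)
```

A useful denoiser raises smoothness while keeping fidelity close to 1. Over-smoothing shows up as high smoothness with falling fidelity, which is why the two are reported together.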

xpectrass.utils.denoise.plot_denoising_evaluation(eval_df, metrics=None, figsize=(14, 5), show_mean_sd=True, save_plot=None, save_path=None)

Plot evaluation metrics from evaluate_denoising_methods as box plots.

Creates box plots for each metric (SNR, smoothness, fidelity) across different denoising methods to help select the best method.

Parameters:
  • eval_df (pd.DataFrame) – Output from evaluate_denoising_methods() with columns: [‘sample’, ‘method’, ‘snr_db’, ‘smoothness’, ‘fidelity’]

  • metrics (list of str, optional) – Metrics to plot. If None, plots all three: [‘snr_db’, ‘smoothness’, ‘fidelity’] Available: ‘snr_db’, ‘smoothness’, ‘fidelity’

  • figsize (tuple, default (14, 5)) – Figure size (width, height)

  • show_mean_sd (bool, default True) – If True, overlay mean ± SD on box plots

  • save_path (str, optional) – If provided, save the figure to this path (e.g., ‘denoising_eval.pdf’)

  • save_plot (bool | None)

Returns:

Displays matplotlib figure

Return type:

None

Examples

>>> # Evaluate methods
>>> eval_results = evaluate_denoising_methods(df_wide, n_samples=50)
>>>
>>> # Plot all metrics
>>> plot_denoising_evaluation(eval_results)
>>>
>>> # Plot only SNR and fidelity
>>> plot_denoising_evaluation(eval_results, metrics=['snr_db', 'fidelity'])
>>>
>>> # Save to file
>>> plot_denoising_evaluation(eval_results, save_path='denoise_eval.pdf')

xpectrass.utils.denoise.plot_denoising_evaluation_summary(eval_df, figsize=(10, 6), save_plot=None, save_path=None)

Create a summary table showing mean ± SD for all metrics across methods.

Displays a colored heatmap-style visualization to quickly identify the best performing denoising methods.

Parameters:
  • eval_df (pd.DataFrame) – Output from evaluate_denoising_methods()

  • figsize (tuple, default (10, 6)) – Figure size

  • save_plot (bool, optional) – If True, save the figure to save_path

  • save_path (str, optional) – If provided, save the figure to this path

Return type:

None

Examples

>>> eval_results = evaluate_denoising_methods(df_wide, n_samples=50)
>>> plot_denoising_evaluation_summary(eval_results)

xpectrass.utils.denoise.find_best_denoising_method(eval_df, snr_min=10.0, smoothness_min=1000.0, fidelity_min=0.9, time_max_ms=100.0, top_n=5)

Recommend best denoising methods based on evaluation metrics.

Analyzes SNR, smoothness, fidelity, and computation time across all samples to identify methods that consistently perform well. Methods are ranked by a composite score combining all metrics.

Parameters:
  • eval_df (pd.DataFrame) – Output from evaluate_denoising_methods() with columns: [‘sample’, ‘method’, ‘snr_db’, ‘smoothness’, ‘fidelity’, ‘time_ms’]

  • snr_min (float, default 10.0) – Minimum acceptable SNR in dB (higher is better).

  • smoothness_min (float, default 1e3) – Minimum acceptable smoothness (higher is better).

  • fidelity_min (float, default 0.9) – Minimum acceptable fidelity correlation (0-1, higher is better).

  • time_max_ms (float, default 100.0) – Maximum acceptable computation time in milliseconds (lower is better).

  • top_n (int, default 5) – Number of top methods to return.

Returns:

Ranked methods, sorted by composite_score in descending order (best methods first), with columns:

  • method: Method name

  • median_snr_db: Median SNR across samples

  • median_smoothness: Median smoothness across samples

  • median_fidelity: Median fidelity across samples

  • median_time_ms: Median computation time across samples

  • pass_rate: Fraction of samples passing all thresholds (0-1)

  • composite_score: Weighted score (higher is better)

Return type:

pd.DataFrame

Notes

Composite Score Calculation:
  • Normalizes each metric to [0, 1] range

  • SNR: Higher is better

  • Smoothness: Higher is better

  • Fidelity: Higher is better

  • Time: Lower is better (inverted for scoring)

  • Pass rate: Bonus for consistent performance

  • Composite = (0.3 * SNR_score) + (0.25 * smoothness_score) + (0.3 * fidelity_score) + (0.05 * time_score) + (0.1 * pass_rate)
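The weighting described in these notes can be sketched as follows. This is a minimal illustration using only NumPy, not the xpectrass implementation: the helper names (`minmax`, `composite_score`), the min-max normalization across methods, and the tie rule for constant columns are all assumptions made for the example.

```python
import numpy as np

# Weights taken from the formula in the notes above.
WEIGHTS = {"snr": 0.30, "smoothness": 0.25, "fidelity": 0.30,
           "time": 0.05, "pass_rate": 0.10}


def minmax(values):
    """Scale values to [0, 1]; a constant column maps to 0.5 (assumed tie rule)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.full_like(values, 0.5)
    return (values - values.min()) / span


def composite_score(snr_db, smoothness, fidelity, time_ms, pass_rate):
    """Combine per-method medians into one score (hypothetical helper)."""
    snr_s = minmax(snr_db)
    smooth_s = minmax(smoothness)
    fid_s = minmax(fidelity)
    time_s = 1.0 - minmax(time_ms)  # inverted: lower time is better
    return (WEIGHTS["snr"] * snr_s
            + WEIGHTS["smoothness"] * smooth_s
            + WEIGHTS["fidelity"] * fid_s
            + WEIGHTS["time"] * time_s
            + WEIGHTS["pass_rate"] * np.asarray(pass_rate, dtype=float))


# One entry per method: savgol, wavelet, gaussian (illustrative medians).
scores = composite_score(
    snr_db=[18.5, 16.2, 15.8],
    smoothness=[2.5e4, 1.8e4, 2.1e4],
    fidelity=[0.985, 0.972, 0.968],
    time_ms=[12.3, 45.2, 8.5],
    pass_rate=[0.94, 0.88, 0.86],
)
```

Because the normalization here runs over only three methods, the resulting numbers are not expected to match the composite_score column of any real evaluation; only the ranking logic is illustrated.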

Example

>>> eval_results = evaluate_denoising_methods(df, n_samples=50)
>>> recommendations = find_best_denoising_method(eval_results, top_n=3)
>>> print(recommendations)
     method  median_snr_db  median_smoothness  median_fidelity  median_time_ms  pass_rate  composite_score
0   savgol          18.5             2.5e4            0.985            12.3       0.94             0.87
1  wavelet          16.2             1.8e4            0.972            45.2       0.88             0.82
2 gaussian          15.8             2.1e4            0.968            8.5        0.86             0.80
xpectrass.utils.denoise.plot_denoising_comparison(y_raw, wavenumbers, methods=None, sample_name='', figsize=(12, 8))[source]

Plot comparison of multiple denoising methods.

Parameters:
  • y_raw (np.ndarray) – Raw spectrum.

  • wavenumbers (np.ndarray) – Wavenumber axis.

  • methods (list of str, optional) – Methods to compare. Default: all.

  • sample_name (str) – Sample name for title.

  • figsize (tuple) – Figure size.

Return type:

None

denoise

denoise(
    intensities: np.ndarray,
    method: str = "savgol",
    **kwargs
) -> np.ndarray

Methods: savgol, wavelet, moving_average, gaussian, median, whittaker, lowpass


normalization

Normalization Module for FTIR Spectral Preprocessing

IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py

Provides multiple normalization methods for FTIR spectra including SNV, vector, area, min-max, and peak normalization. Also includes mean centering for PCA/PLS preparation.

xpectrass.utils.normalization.normalize(intensities, wavenumbers=None, method='snv', **kwargs)[source]

Normalize a 1-D FTIR spectrum.

Parameters:
  • intensities (array-like) – Intensity values (1-D).

  • wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). Required for peak_wavenumber and adaptive_regional methods.

  • method (str, default "snv") –

    Normalization method. Options:

    • ‘snv’: Standard Normal Variate (mean=0, std=1 within spectrum)

    • ‘vector’: L2 vector normalization (unit length)

    • ‘minmax’: Min-Max scaling to [0, 1]

    • ‘area’: Area normalization (total area = 1)

    • ‘peak’: Normalize by peak intensity

    • ‘range’: Normalize by intensity range

    • ‘max’: Normalize by maximum value

    • ‘detrend’: Polynomial detrending

    • ‘snv_detrend’: SNV followed by detrending

    Novel methods (1D-compatible):

    • ‘robust_snv’: Robust SNV using median/MAD

    • ‘curvature_weighted’: Curvature-weighted normalization

    • ‘peak_envelope’: Peak envelope normalization

    • ‘entropy_weighted’: Entropy-weighted normalization

    • ‘pqn’: Probabilistic quotient normalization

    • ‘total_variation’: Total variation normalization

    • ‘spectral_moments’: Spectral moment normalization

    • ‘adaptive_regional’: Adaptive regional normalization (requires wavenumbers)

    • ‘derivative_ratio’: Derivative ratio normalization

    • ‘signal_to_baseline’: Signal-to-baseline ratio normalization

  • **kwargs (method-specific parameters) –

    • peak: peak_idx (index), peak_wavenumber (float, requires wavenumbers), peak_value (float), use_absolute (bool)

    • minmax: feature_range (tuple, default (0, 1))

    • detrend: order (int, default 1)

    • adaptive_regional: regions (list of tuples), method_per_region (dict)

Returns:

Normalized intensity values.

Return type:

np.ndarray

Notes

Methods that require multiple spectra (2D data) are NOT available here:

  • ‘mean_center’: Use mean_center() directly on a 2D array

  • ‘auto_scale’: Use auto_scale() directly on a 2D array

  • ‘pareto’: Use pareto_scale() directly on a 2D array

These methods compute column-wise statistics and should be applied via normalize_df() for batch processing.

PQN (Probabilistic Quotient Normalization):

  • For a single spectrum WITH a reference: Provide the reference kwarg for true PQN. Example: normalize(y, method='pqn', reference=ref_spectrum)

  • For a single spectrum WITHOUT a reference: Falls back to median scaling (NOT true PQN!). A warning is issued in this case.

  • For batch PQN: Use normalize_df() instead (auto-computes the reference from the dataset)
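As a rough illustration of what the classical single-spectrum options ‘snv’, ‘vector’, and ‘minmax’ compute, the sketch below implements them with plain NumPy. The function names are hypothetical, the use of the population standard deviation (ddof=0) for SNV is an assumption, and no guards for constant spectra are included; the library's own handling of those cases may differ.

```python
import numpy as np


def snv_sketch(y):
    """Standard Normal Variate: zero mean, unit std within one spectrum."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()  # ddof=0 is an assumption


def vector_sketch(y):
    """L2 vector normalization: scale the spectrum to unit Euclidean length."""
    y = np.asarray(y, dtype=float)
    return y / np.linalg.norm(y)


def minmax_sketch(y, feature_range=(0.0, 1.0)):
    """Min-max scaling of one spectrum into feature_range."""
    y = np.asarray(y, dtype=float)
    lo, hi = feature_range
    unit = (y - y.min()) / (y.max() - y.min())
    return lo + unit * (hi - lo)


y = np.array([0.2, 0.5, 1.1, 0.8, 0.3])
y_snv = snv_sketch(y)
y_vec = vector_sketch(y)
y_mm = minmax_sketch(y)
```

All three operate on one row at a time, which is why they are available through normalize() while the column-wise methods are not.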

xpectrass.utils.normalization.normalize_method_names()[source]

Return list of available 1D normalization method names.

Note: Methods requiring 2D data (mean_center, auto_scale, pareto) are not included. Use normalize_df() for batch processing with those methods.

Return type:

List[str]

xpectrass.utils.normalization.normalize_df(data, method='snv', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, show_progress=True, **kwargs)[source]

Normalize multiple spectra (DataFrame or numpy array).

Works with both pandas and polars DataFrames, or numpy arrays. For DataFrames: each row is a sample, numerical columns are wavenumbers. For numpy arrays: shape (n_samples, n_wavenumbers).

Parameters:
  • data (pd.DataFrame | pl.DataFrame | np.ndarray) – Wide-format DataFrame where rows = samples, columns = wavenumbers, OR numpy array of shape (n_samples, n_wavenumbers).

  • method (str, default "snv") –

    Normalization method. Options:

    Single-spectrum methods (row-wise, 1D):

    • ‘snv’: Standard Normal Variate (mean=0, std=1 within each spectrum)

    • ‘vector’: L2 vector normalization (unit length)

    • ‘minmax’: Min-Max scaling to [0, 1]

    • ‘area’: Area normalization (total area = 1)

    • ‘peak’: Normalize by peak intensity

    • ‘range’: Normalize by intensity range

    • ‘max’: Normalize by maximum value

    • ‘detrend’: Polynomial detrending

    • ‘snv_detrend’: SNV followed by detrending

    • Plus all novel methods (robust_snv, curvature_weighted, etc.)

    Multi-spectrum methods (column-wise, 2D, for PCA/PLS prep; each requires ≥2 samples):

    • ‘mean_center’: Column-wise mean centering (mean=0 per wavenumber)

    • ‘auto_scale’: Column-wise auto-scaling (mean=0, std=1 per wavenumber)

    • ‘pareto’: Column-wise Pareto scaling (mean=0, scaled by sqrt(std))

    Dataset-level methods (require a reference from the entire dataset):

    • ‘pqn’: Probabilistic Quotient Normalization (computes a reference spectrum from the dataset median/mean, then normalizes each spectrum relative to it). Requires ≥3 samples

  • label_column (str, default "label") – Name of the label/group column to exclude from normalization. Only used for DataFrame inputs.

  • exclude_columns (list[str], optional) – Additional column names to exclude from normalization (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns. Only used for DataFrame inputs.

  • wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default, or auto-expands if no columns found within default range.

  • wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default, or auto-expands if no columns found within default range.

  • show_progress (bool, default True) – If True, display a progress bar during processing. Only used for DataFrame inputs.

  • **kwargs (method-specific parameters) –

    • peak: peak_idx (index) or peak_wavenumber (requires wavenumbers array)

    • minmax: feature_range (tuple, default (0, 1))

    • detrend: order (int, default 1)

    • pqn: reference_type (str, default ‘median’), either ‘median’ or ‘mean’, for computing the reference spectrum from the dataset

  • sample_id_column (str)

Returns:

Normalized data (same type as input).

Return type:

pd.DataFrame | pl.DataFrame | np.ndarray

Raises:

ValueError – If using 2D methods (mean_center, auto_scale, pareto) with <2 samples. If using PQN with <3 samples.

Examples

>>> # Normalize DataFrame with SNV
>>> df_norm = normalize_df(df_wide, method="snv")
>>> # Normalize with vector normalization
>>> df_norm = normalize_df(df_wide, method="vector")
>>> # Normalize numpy array
>>> spectra_norm = normalize_df(spectra_array, method="snv")
>>> # Min-max normalization to [0, 1]
>>> df_norm = normalize_df(df_wide, method="minmax", feature_range=(0, 1))
>>> # PQN with median reference (proper batch PQN)
>>> df_norm = normalize_df(df_wide, method="pqn", reference_type="median")
>>> # PQN with mean reference
>>> df_norm = normalize_df(df_wide, method="pqn", reference_type="mean")
>>> # Disable progress bar
>>> df_norm = normalize_df(df_wide, method="snv", show_progress=False)
xpectrass.utils.normalization.mean_center(spectra, axis=0, return_mean=False)[source]

Mean-center spectra (essential preprocessing for PCA/PLS).

Parameters:
  • spectra (np.ndarray, shape (n_samples, n_wavenumbers)) – Matrix of spectra.

  • axis (int, default 0) – Axis along which to compute the mean.

    • 0: Column-wise (feature/wavenumber centering), standard for PCA

    • 1: Row-wise (sample centering)

  • return_mean (bool, default False) – If True, return the mean array for later reconstruction.

Returns:

  • centered (np.ndarray) – Mean-centered spectra.

  • mean (np.ndarray, optional) – Mean values (returned if return_mean=True).

Return type:

ndarray | Tuple[ndarray, ndarray]
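The two axis choices, and why keeping the mean is useful for later reconstruction, can be illustrated with plain NumPy. This is a sketch of the arithmetic only and does not call mean_center(); the variable names are illustrative.

```python
import numpy as np

# Three spectra (rows) measured at three wavenumbers (columns).
spectra = np.array([[1.0, 2.0, 3.0],
                    [3.0, 4.0, 7.0],
                    [5.0, 6.0, 8.0]])

# axis=0: subtract the mean spectrum, so every wavenumber column has mean 0.
col_mean = spectra.mean(axis=0, keepdims=True)
centered = spectra - col_mean

# Keeping the mean allows the original data to be restored later.
restored = centered + col_mean

# axis=1: subtract each sample's own mean intensity instead.
row_centered = spectra - spectra.mean(axis=1, keepdims=True)
```

Column-wise centering is the form PCA expects, since principal components describe variation around the mean spectrum.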

xpectrass.utils.normalization.auto_scale(spectra, return_params=False)[source]

Auto-scaling (mean centering + unit variance scaling).

Each variable (wavenumber) is scaled to have mean=0 and std=1. Common preprocessing for PCA/PLS when variables have different scales.

Parameters:
  • spectra (np.ndarray, shape (n_samples, n_wavenumbers)) – Matrix of spectra.

  • return_params (bool, default False) – If True, return mean and std for reconstruction.

Returns:

  • scaled (np.ndarray) – Auto-scaled spectra.

  • mean (np.ndarray, optional)

  • std (np.ndarray, optional)

Return type:

ndarray | Tuple[ndarray, ndarray, ndarray]

xpectrass.utils.normalization.pareto_scale(spectra, return_params=False)[source]

Pareto scaling (mean centering + sqrt(std) scaling).

Less aggressive than auto-scaling; preserves more of the original data structure. Good for spectral data.

Parameters:
  • spectra (np.ndarray, shape (n_samples, n_wavenumbers)) – Matrix of spectra.

  • return_params (bool, default False) – If True, return mean and std for reconstruction.

Returns:

scaled – Pareto-scaled spectra.

Return type:

np.ndarray
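The difference between auto-scaling and Pareto scaling is easiest to see side by side. The sketch below is an assumption-laden illustration in NumPy: the function names are hypothetical, and the choice of the sample standard deviation (ddof=1) may not match what the library uses.

```python
import numpy as np


def auto_scale_sketch(spectra, ddof=1):
    """Column-wise: subtract the mean, divide by the standard deviation."""
    mean = spectra.mean(axis=0)
    std = spectra.std(axis=0, ddof=ddof)
    return (spectra - mean) / std


def pareto_scale_sketch(spectra, ddof=1):
    """Column-wise: subtract the mean, divide by sqrt(standard deviation)."""
    mean = spectra.mean(axis=0)
    std = spectra.std(axis=0, ddof=ddof)
    return (spectra - mean) / np.sqrt(std)


# Synthetic data: three "wavenumbers" with very different variances.
rng = np.random.default_rng(0)
spectra = rng.normal(loc=[1.0, 10.0, 100.0], scale=[0.1, 1.0, 10.0], size=(50, 3))

auto = auto_scale_sketch(spectra)
pareto = pareto_scale_sketch(spectra)
```

After auto-scaling every column has unit standard deviation, whereas after Pareto scaling each column keeps a standard deviation of sqrt(std), so high-variance bands remain more influential than low-variance ones.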

xpectrass.utils.normalization.detrend(intensities, order=1, wavenumbers=None)[source]

Remove polynomial trend from spectrum.

Often used after SNV to remove residual slope.

Parameters:
  • intensities (np.ndarray) – 1-D spectrum.

  • order (int, default 1) – Polynomial order (1 = linear detrending).

  • wavenumbers (np.ndarray, optional) – Wavenumber array for physical slope calculation. If provided, polynomial fit uses actual wavenumber values (cm⁻¹). If None, uses array indices (0, 1, 2, …).

Returns:

Detrended spectrum.

Return type:

np.ndarray

Notes

  • With wavenumbers: Fit uses physical x-axis (cm⁻¹), yielding physically meaningful slope coefficients. Important when comparing spectra on different grids or after resampling.

  • Without wavenumbers: Fit uses indices (0, 1, 2, …), which is grid-dependent. Slope coefficients are in units of intensity/index.
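To make the distinction in these notes concrete, the following sketch fits and subtracts a polynomial with np.polyfit, once against array indices and once against wavenumbers. It is a minimal stand-in rather than the library's detrend(); the function name and the returned coefficient array are assumptions for illustration.

```python
import numpy as np


def detrend_sketch(y, order=1, wavenumbers=None):
    """Subtract a least-squares polynomial; return (residual, coefficients)."""
    y = np.asarray(y, dtype=float)
    if wavenumbers is None:
        x = np.arange(y.size, dtype=float)        # index-based fit
    else:
        x = np.asarray(wavenumbers, dtype=float)  # physical x-axis (cm^-1)
    coeffs = np.polyfit(x, y, order)
    return y - np.polyval(coeffs, x), coeffs


# Uniform grid with 4 cm^-1 spacing, one band plus a sloping baseline.
wn = np.linspace(400.0, 4000.0, 901)
y = np.exp(-0.5 * ((wn - 1700.0) / 30.0) ** 2) + 0.0002 * wn + 0.1

resid_idx, coeffs_idx = detrend_sketch(y)
resid_wn, coeffs_wn = detrend_sketch(y, wavenumbers=wn)
```

On a uniform grid the detrended spectra agree; what changes is the slope coefficient, which is per index in the first case and per cm⁻¹ in the second (a factor equal to the grid spacing). On non-uniform or resampled grids the two fits generally differ.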

Examples

>>> # Index-based detrending (grid-dependent)
>>> detrended = detrend(spectrum)
>>>
>>> # Physical detrending (grid-independent)
>>> detrended = detrend(spectrum, wavenumbers=wn, order=1)
xpectrass.utils.normalization.snv_detrend(intensities, detrend_order=1, wavenumbers=None)[source]

SNV followed by detrending.

Common combined preprocessing for scatter correction.

Parameters:
  • intensities (np.ndarray) – 1-D spectrum.

  • detrend_order (int, default 1) – Polynomial order for detrending (1 = linear).

  • wavenumbers (np.ndarray, optional) – Wavenumber array for physical slope calculation in detrending. If provided, uses actual wavenumber values (cm⁻¹). If None, uses array indices.

Returns:

SNV-normalized and detrended spectrum.

Return type:

np.ndarray

xpectrass.utils.normalization.normalize_curvature_weighted(y, sigma=3.0, min_weight=0.01)[source]

Normalize with weights proportional to local curvature (2nd derivative).

Physical motivation: In FT-IR, peaks have high curvature while baseline regions are flat. This method emphasizes peak regions during normalization, making it more representative of actual chemical information.

The normalization factor is the curvature-weighted L2 norm.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • sigma (float, default 3.0) – Gaussian smoothing sigma for curvature estimation (reduces noise).

  • min_weight (float, default 0.01) – Minimum weight to avoid division issues in flat regions.

Returns:

Curvature-weighted normalized spectrum.

Return type:

np.ndarray

Notes

Handles flat spectra (zero curvature) by falling back to uniform weighting.

xpectrass.utils.normalization.normalize_peak_envelope(y, percentile=95, window_size=50)[source]

Normalize by the upper envelope of the spectrum.

Physical motivation: The upper envelope represents the maximum signal level across the spectrum, accounting for varying peak densities and intensities. This is more representative than single-peak normalization.

Works on absolute values to handle negative/baseline-corrected spectra.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • percentile (float, default 95) – Percentile for envelope estimation (avoids noise spikes).

  • window_size (int, default 50) – Rolling window size for envelope estimation.

Returns:

Envelope-normalized spectrum.

Return type:

np.ndarray

Notes

Uses absolute values to avoid NaN issues with negative/baseline-corrected spectra.

xpectrass.utils.normalization.normalize_entropy_weighted(y, n_bins=50, window_size=30, epsilon=1e-10)[source]

Normalize with weights based on local spectral entropy.

Motivation: Regions with high entropy (high variability/information) should contribute more to normalization. Flat baseline regions have low entropy and contribute less.

Local entropy is computed in sliding windows using histogram-based probability estimation.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • n_bins (int, default 50) – Number of bins for local histogram.

  • window_size (int, default 30) – Window size for local entropy calculation.

  • epsilon (float) – Small value to avoid log(0).

Returns:

Entropy-weighted normalized spectrum.

Return type:

np.ndarray

xpectrass.utils.normalization.normalize_total_variation(y, order=1)[source]

Normalize by total variation (sum of absolute differences).

Physical motivation: Total variation captures the “roughness” or total signal content independent of baseline offset. It’s related to the first derivative energy and is baseline-invariant.

TV = sum(|y[i+1] - y[i]|) for first order

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • order (int, default 1) – Order of differences (1 = first derivative, 2 = second derivative).

Returns:

TV-normalized spectrum.

Return type:

np.ndarray
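The TV formula above translates almost directly into NumPy. The sketch below is an illustration under assumptions, not the library code: the function name is hypothetical, and returning a flat spectrum unchanged when TV is zero is an assumed fallback.

```python
import numpy as np


def total_variation_normalize(y, order=1):
    """Divide a spectrum by the sum of absolute n-th order differences."""
    y = np.asarray(y, dtype=float)
    tv = np.sum(np.abs(np.diff(y, n=order)))
    if tv == 0:
        return y.copy()  # assumed fallback for a perfectly flat spectrum
    return y / tv


y = np.array([0.0, 1.0, 3.0, 2.0, 2.0])
y_tv = total_variation_normalize(y)             # TV = 1 + 2 + 1 + 0 = 4
y_shifted_tv = total_variation_normalize(y + 5.0)
```

Adding a constant offset leaves the normalization factor unchanged, because differences cancel the offset; the normalized values themselves still carry the scaled offset.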

xpectrass.utils.normalization.normalize_spectral_moments(y, moment_order=2, use_central=True)[source]

Normalize using spectral moments.

Physical motivation: Higher-order moments capture distribution characteristics beyond simple mean/variance. The nth moment emphasizes larger deviations, useful for peak-rich spectra.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • moment_order (int, default 2) – Order of moment to use (2 = variance-like, 3 = skewness-like, etc.)

  • use_central (bool, default True) – If True, use central moments (subtract mean first).

Returns:

Moment-normalized spectrum.

Return type:

np.ndarray

xpectrass.utils.normalization.normalize_adaptive_regional(y, wavenumbers, regions=None, method_per_region=None)[source]

Apply different normalization to different spectral regions.

Physical motivation: Different FT-IR regions have different characteristics:

  • 3600-2800 cm⁻¹: O-H, N-H, C-H stretching (often intense)

  • 1800-1500 cm⁻¹: C=O, C=C, amide bands

  • 1500-400 cm⁻¹: Fingerprint region (complex)

Each region may benefit from different normalization.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • wavenumbers (np.ndarray) – Wavenumber axis (cm⁻¹). Required for this method; passing None raises ValueError.

  • regions (list of tuples, optional) – [(start1, end1), (start2, end2), …] defining regions. Default: standard FT-IR regions.

  • method_per_region (dict, optional) – {“region_idx”: “method_name”} mapping. Default: SNV for all regions.

Returns:

Regionally-normalized spectrum.

Return type:

np.ndarray

Raises:

ValueError – If wavenumbers is None.
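A simplified picture of region-wise normalization is SNV applied independently inside each wavenumber window. The sketch below assumes half-open windows [low, high), leaves points outside every window untouched, and uses hypothetical function names; the library's treatment of boundaries and uncovered points may differ.

```python
import numpy as np


def snv(y):
    """Standard Normal Variate on a 1-D segment."""
    return (y - y.mean()) / y.std()


def regional_snv(y, wavenumbers, regions):
    """Apply SNV separately within each (low, high) wavenumber window."""
    if wavenumbers is None:
        raise ValueError("wavenumbers is required for regional normalization")
    y = np.asarray(y, dtype=float)
    wn = np.asarray(wavenumbers, dtype=float)
    out = y.copy()
    for a, b in regions:
        low, high = min(a, b), max(a, b)
        mask = (wn >= low) & (wn < high)   # half-open to avoid overlaps
        if mask.sum() > 1:
            out[mask] = snv(y[mask])
    return out


wn = np.linspace(400.0, 4000.0, 721)
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, wn.size)

# Windows taken from the regions named in the description above.
regions = [(2800.0, 3600.0), (1500.0, 1800.0), (400.0, 1500.0)]
out = regional_snv(y, wn, regions)
```

Because each window is scaled on its own, intensity ratios between regions are not preserved, which is worth keeping in mind before comparing bands across windows.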

xpectrass.utils.normalization.normalize_derivative_ratio(y, sigma=2.0)[source]

Normalize using the ratio of derivative energies.

Physical motivation: The ratio of second to first derivative energy characterizes peak sharpness independent of intensity. This provides baseline-independent normalization.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • sigma (float) – Smoothing parameter before derivative computation.

Returns:

Derivative-ratio normalized spectrum.

Return type:

np.ndarray

xpectrass.utils.normalization.normalize_signal_to_baseline(y, baseline_percentile=10, signal_percentile=90)[source]

Normalize by the ratio of signal to baseline levels.

Physical motivation: This separates the “signal” (peaks) from “baseline” (background) and normalizes by their contrast. Useful when baseline levels vary between samples.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • baseline_percentile (float) – Percentile to estimate baseline level.

  • signal_percentile (float) – Percentile to estimate signal level.

Returns:

Signal-to-baseline normalized spectrum.

Return type:

np.ndarray

xpectrass.utils.normalization.normalize_robust_snv(y, consistency_correction=True, epsilon=1e-10)[source]

Robust Standard Normal Variate using median and MAD.

Traditional SNV uses mean and std, which are sensitive to:

  • Baseline artifacts (shift the mean)

  • Outlier peaks (inflate the std)

  • Asymmetric intensity distributions

RSNV uses median (robust center) and MAD (robust scale).

Formula: (x - median(x)) / MAD(x)

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • consistency_correction (bool, default True) – If True, scale MAD by 1.4826 to be consistent with std for normal data.

  • epsilon (float, default 1e-10) – Small value added to MAD to avoid division by zero and prevent zero vectors (which cause issues with cosine-based methods).

Returns:

Robustly normalized spectrum.

Return type:

ndarray

References

Novel method: combines robust statistics with scatter correction.

Notes

When MAD is very small or zero (flat spectrum), epsilon prevents returning a zero vector, which would cause errors in downstream methods that use cosine similarity (e.g., clustering, PCA with cosine kernel).
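The formula (x - median(x)) / MAD(x) can be sketched in a few lines of NumPy. This is an illustrative reimplementation under the parameter meanings documented above, with a hypothetical function name; it is not taken from the library source.

```python
import numpy as np


def robust_snv_sketch(y, consistency_correction=True, epsilon=1e-10):
    """Center by the median and scale by the (optionally corrected) MAD."""
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    mad = np.median(np.abs(y - med))
    if consistency_correction:
        mad *= 1.4826  # makes MAD comparable to std for normal data
    return (y - med) / (mad + epsilon)  # epsilon avoids division by zero


# One extreme value barely affects the center and scale estimates.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_rsnv = robust_snv_sketch(y)

# A flat spectrum stays finite instead of producing NaN.
flat = robust_snv_sketch(np.ones(5))
```

In this example the median is 3 and the raw MAD is 1, so the outlier at 100 does not inflate the scale the way it would inflate a standard deviation.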

xpectrass.utils.normalization.normalize_pqn(y, reference=None, reference_type='median')[source]

Probabilistic Quotient Normalization (PQN) for FT-IR spectra.

IMPORTANT: True PQN requires a reference spectrum from your dataset. For single-spectrum processing without a reference, this function performs “median scaling” (dividing by median intensity), which is NOT true PQN. Use normalize_df() for proper batch PQN.

Originally from metabolomics (Dieterle et al., 2006), PQN uses median fold-change relative to a reference spectrum, which is robust to varying numbers/intensities of peaks.

Physical motivation: Accounts for dilution effects and path length variations without being dominated by major peaks.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • reference (np.ndarray, optional) –

    Reference spectrum from your dataset.

    • If provided: Performs true PQN using this reference

    • If None: Falls back to “median scaling” (divides by the median of positive values). This is NOT true PQN!

  • reference_type (str, default "median") – Currently unused. Reserved for future use in batch processing via normalize_df() where reference can be computed from dataset.

Returns:

Normalized spectrum.

Return type:

np.ndarray

Notes

Single spectrum (reference=None):

Returns y / median(y[y > 0]). This is “median scaling”, not true PQN.

With reference spectrum:
  1. Compute quotients: q[i] = y[i] / reference[i] (where both > 0)

  2. Normalization factor: median(q)

  3. Returns: y / median(q)

For batch processing:

Use normalize_df() with method=’pqn’ to automatically compute a reference spectrum (median or mean) from your dataset.
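The three reference-based steps listed in these notes can be written out as a short NumPy sketch. The function name is hypothetical and the code is only an illustration of the steps as documented; it uses a synthetic dataset in which each spectrum is a diluted copy of the same base spectrum.

```python
import numpy as np


def pqn_sketch(y, reference):
    """Reference-based PQN following the three steps in the notes."""
    y = np.asarray(y, dtype=float)
    reference = np.asarray(reference, dtype=float)
    valid = (y > 0) & (reference > 0)          # step 1: only where both > 0
    quotients = y[valid] / reference[valid]
    factor = np.median(quotients)              # step 2: median quotient
    return y / factor                          # step 3: rescale the spectrum


# Three copies of one base spectrum at dilution factors 0.5, 1.0 and 2.0.
rng = np.random.default_rng(1)
base = rng.uniform(0.1, 1.0, size=200)
spectra = np.stack([d * base for d in (0.5, 1.0, 2.0)])

# Dataset-level reference: the median spectrum across samples.
reference = np.median(spectra, axis=0)
normalized = np.stack([pqn_sketch(s, reference) for s in spectra])
```

In this idealized case every spectrum collapses onto the base spectrum, which is the dilution correction PQN is designed to provide.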

References

Dieterle et al. (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Anal Chem 78(13):4281-90.

Examples

>>> # Single spectrum: median scaling (NOT true PQN)
>>> y_scaled = normalize_pqn(spectrum)  # reference=None
>>>
>>> # True PQN: provide reference from dataset
>>> ref = np.median(all_spectra, axis=0)  # median spectrum
>>> y_pqn = normalize_pqn(spectrum, reference=ref)
>>>
>>> # Batch PQN: use normalize_df() instead
>>> df_pqn = normalize_df(df, method='pqn')
xpectrass.utils.normalization.normalize_novel(y, method='robust_snv', **kwargs)[source]

Apply novel normalization method.

Parameters:
  • y (np.ndarray) – 1-D spectrum.

  • method (str) – One of: ‘robust_snv’, ‘curvature’, ‘envelope’, ‘entropy’, ‘pqn’, ‘total_variation’, ‘moments’, ‘derivative_ratio’, ‘signal_baseline’

  • **kwargs (method-specific parameters)

Returns:

Normalized spectrum.

Return type:

np.ndarray

xpectrass.utils.normalization.novel_normalize_method_names()[source]

Return list of novel normalization method names.

Return type:

list

normalize

normalize(intensities: np.ndarray, method: str = "snv", **kwargs) -> np.ndarray

Methods: snv, vector, minmax, area, peak, range, max

mean_center

mean_center(
    spectra: np.ndarray,
    axis: int = 0,
    return_mean: bool = False
) -> np.ndarray | Tuple[np.ndarray, np.ndarray]

auto_scale

auto_scale(spectra: np.ndarray, return_params: bool = False) -> np.ndarray | Tuple[np.ndarray, np.ndarray, np.ndarray]

atmospheric

Atmospheric Correction Module for FTIR Spectral Preprocessing

Corrects for CO₂ and H₂O vapor interference in FTIR spectra that result from atmospheric absorption during measurement.

IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py

Default atmospheric regions:

  • CO₂: 2300-2400 cm⁻¹ (asymmetric stretch) and 650-690 cm⁻¹ (bending)

  • H₂O: 1350-1900 cm⁻¹ (bending) and 3550-3900 cm⁻¹ (stretching)

Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:

import logging
logging.getLogger('utils.atmospheric').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.atmospheric').setLevel(logging.ERROR)  # Only errors

Auto-Detection: Use auto_detect=True in atmospheric_correction() to automatically check for atmospheric interference and receive warnings if detected.

xpectrass.utils.atmospheric.atmospheric_correction_spectrum(intensities, wavenumbers, method='interpolate', co2_ranges=None, h2o_ranges=None, reference_spectrum=None, **kwargs)[source]

Public wrapper for correcting a single spectrum (numpy arrays).

Return type:

ndarray
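One plausible reading of the ‘interpolate’ method is a straight-line bridge across each atmospheric window, built from the points just outside it. The sketch below shows that idea with np.interp. It is an assumption about the approach rather than the library's code: the function name is hypothetical, wavenumbers are assumed to be in ascending order, and NaN handling is omitted.

```python
import numpy as np


def interpolate_region(intensities, wavenumbers, region):
    """Replace values inside region with a linear bridge across it."""
    y = np.asarray(intensities, dtype=float).copy()
    wn = np.asarray(wavenumbers, dtype=float)
    low, high = min(region), max(region)
    inside = (wn >= low) & (wn <= high)
    # np.interp needs increasing x values; wn is assumed ascending here.
    y[inside] = np.interp(wn[inside], wn[~inside], y[~inside])
    return y


# Gently sloping baseline with a synthetic CO2 artifact near 2350 cm^-1.
wn = np.linspace(400.0, 4000.0, 1801)
baseline = 0.2 + 1e-5 * wn
y = baseline + 0.3 * np.exp(-0.5 * ((wn - 2350.0) / 15.0) ** 2)

corrected = interpolate_region(y, wn, (2300.0, 2400.0))
```

Points outside the window are left exactly as measured, so the correction only touches the region known to carry atmospheric absorption.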

xpectrass.utils.atmospheric.identify_atmospheric_features(intensities, wavenumbers, threshold=0.1)[source]

Check for presence of atmospheric interference.

Parameters:
  • intensities (ndarray) – Spectral intensity values

  • wavenumbers (ndarray) – Wavenumber grid (cm⁻¹)

  • threshold (float) – Sensitivity threshold as a fraction of total spectral variation. Higher values (e.g., 0.2) reduce false positives but may miss weak interference. Lower values (e.g., 0.05) are more sensitive but may flag noise. Default of 0.1 works well for typical FTIR spectra.

Returns:

Dictionary with ‘co2_detected’, ‘h2o_detected’ (bool), and ‘recommendations’ (list)

Return type:

dict

xpectrass.utils.atmospheric.exclude_and_interpolate_spectrum(intensities, wavenumbers, exclude_ranges=None, interpolate_ranges=None, method='interpolate', reference_spectrum=None, **kwargs)[source]

Public wrapper for excluding/interpolating regions on a single spectrum.

Parameters:
  • intensities (ndarray) – Spectral intensity values

  • wavenumbers (ndarray) – Wavenumber grid (cm⁻¹)

  • exclude_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to physically remove from output

  • interpolate_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to process using specified method

  • method (str) – Method for interpolate_ranges (‘interpolate’, ‘spline’, ‘reference’, ‘zero’, ‘exclude’)

  • reference_spectrum (ndarray | None) – Reference spectrum for ‘reference’ method

  • **kwargs – Additional arguments for specific methods (e.g., reference_scale)

Returns:

Tuple of (corrected_intensities, corrected_wavenumbers)

Return type:

Tuple[ndarray, ndarray]

xpectrass.utils.atmospheric.atmospheric_correction(data, method='interpolate', co2_ranges=None, h2o_ranges=None, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, auto_detect=False, **kwargs)[source]

Apply atmospheric correction to a DataFrame of FTIR spectra.

Works with both pandas and polars DataFrames. Each row is a sample, numerical columns are wavenumbers. Applies correction to all samples.

NaN Handling:

All correction methods robustly handle NaN values in spectral data:

  • ‘interpolate’/‘spline’: Filters NaN from boundary regions; sets atmospheric regions to NaN if insufficient finite boundary data exists

  • ‘reference’: Filters NaN before computing scale factor

  • ‘zero’: Uses nanmean for baselines; skips regions with all-NaN boundaries

  • ‘exclude’: Marks regions as NaN

Performance:

Optimized for large datasets using:

  • Vectorized numpy array access (no DataFrame.loc overhead)

  • Pre-allocated output arrays (no list appending)

  • Progress tracking via tqdm

For maximum performance on very large datasets (100k+ spectra), consider using atmospheric_correction_spectrum() on extracted numpy arrays directly.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Input DataFrame (pandas or polars)

  • method (str) – Correction method (‘interpolate’, ‘spline’, ‘reference’, ‘zero’, ‘exclude’)

  • co2_ranges (List[Tuple[float, float]] | None) – Custom CO₂ regions to correct (default: standard FTIR regions)

  • h2o_ranges (List[Tuple[float, float]] | None) – Custom H₂O regions to correct (default: standard FTIR regions)

  • label_column (str) – Name of label/metadata column to preserve

  • exclude_columns (List[str] | None) – Additional columns to exclude from processing

  • wn_min (float | None) – Minimum wavenumber for column detection (default: auto-detect)

  • wn_max (float | None) – Maximum wavenumber for column detection (default: auto-detect)

  • auto_detect (bool) – If True, automatically check first spectrum for atmospheric interference and warn if detected but no custom ranges provided (default: False)

  • **kwargs – Additional arguments passed to correction methods

  • sample_id_column (str)

Returns:

Corrected DataFrame in same format as input (columns sorted by ascending wavenumber)

Return type:

pd.DataFrame | pl.DataFrame

Warning

  • Warns if spectral columns are reordered during processing

  • Warns if wavenumber bounds are auto-expanded

  • Warns if auto_detect=True and atmospheric interference detected without custom ranges

xpectrass.utils.atmospheric.exclude_and_interpolate_regions(data, exclude_ranges=None, interpolate_ranges=None, method='interpolate', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, reference_spectrum=None, **kwargs)[source]

Exclude wavenumber ranges and process interpolate regions for DataFrame of spectra.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Input DataFrame (pandas or polars)

  • exclude_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to physically remove from output

  • interpolate_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to process using specified method

  • method (str) – Method for interpolate_ranges (‘interpolate’, ‘spline’, ‘reference’, ‘zero’, ‘exclude’)

  • label_column (str) – Name of label/metadata column to preserve

  • exclude_columns (List[str] | None) – Additional columns to exclude from processing

  • wn_min (float | None) – Minimum wavenumber for column detection (default: auto-detect)

  • wn_max (float | None) – Maximum wavenumber for column detection (default: auto-detect)

  • reference_spectrum (ndarray | None) – Reference spectrum for ‘reference’ method

  • **kwargs – Additional arguments for specific methods (e.g., reference_scale)

  • sample_id_column (str)

Returns:

Processed DataFrame in same format as input

Return type:

pd.DataFrame | pl.DataFrame

atmospheric_correction

atmospheric_correction(
    intensities: np.ndarray,
    wavenumbers: np.ndarray,
    method: str = "interpolate",
    co2_range: Tuple[float, float] = (2300, 2400),
    h2o_ranges: List[Tuple[float, float]] = [(1350, 1900), (3550, 3900)],
    **kwargs
) -> np.ndarray

Methods: interpolate, spline, reference, zero, exclude


derivatives

Spectral Derivatives Module for FTIR Preprocessing

Compute smoothed spectral derivatives for resolution enhancement and baseline removal.

xpectrass.utils.derivatives.spectral_derivative(intensities, order=1, window_length=15, polyorder=3, delta=1.0)[source]

Compute smoothed spectral derivative using Savitzky-Golay.

Parameters:
  • intensities (np.ndarray) – 1-D intensity array.

  • order (int, default 1) – Derivative order (1 = first derivative, 2 = second derivative).

  • window_length (int, default 15) – Savitzky-Golay window length (must be odd).

  • polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter.

  • delta (float, default 1.0) – Spacing between samples (affects derivative scaling). Important for FT-IR: Set this to the actual wavenumber spacing (e.g., median of np.diff(wavenumbers)) to get physically meaningful derivatives in units of dI/d(cm⁻¹). Default of 1.0 assumes unit spacing.

Returns:

Derivative spectrum.

Return type:

np.ndarray

Notes

  • 1st derivative: Resolves overlapping peaks, removes constant baseline

  • 2nd derivative: Sharpens peaks, removes linear baseline

  • Higher derivatives increase noise; adjust window_length accordingly

Warning

  • Input must be 1-D array (will raise ValueError if multi-dimensional)

  • Large derivative orders may trigger automatic window expansion (logged as warning)

Examples

>>> import numpy as np
>>> wn = np.linspace(400, 4000, 1000)
>>> intensities = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
>>>
>>> # Correct: specify delta for physical units
>>> delta = np.median(np.diff(wn))
>>> deriv = spectral_derivative(intensities, order=1, delta=delta)
>>>
>>> # Warning: default delta=1.0 gives index-based units
>>> deriv = spectral_derivative(intensities, order=1)  # Not recommended for FT-IR
xpectrass.utils.derivatives.first_derivative(intensities, window_length=15, polyorder=3)[source]

Compute first derivative.

Benefits:
  • Removes constant baseline offset

  • Resolves overlapping bands

  • Enhances small spectral differences

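Parameters:
  • intensities (np.ndarray) – 1-D intensity array.

  • window_length (int, default 15) – Savitzky-Golay window length (must be odd).

  • polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter.
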
Return type:

ndarray

xpectrass.utils.derivatives.second_derivative(intensities, window_length=15, polyorder=4)[source]

Compute second derivative.

Benefits:
  • Removes linear baseline

  • Sharpens peaks (negative peaks in output)

  • Heavily used in FTIR for band identification

Note: Peaks appear as negative minima in 2nd derivative.

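Parameters:
  • intensities (np.ndarray) – 1-D intensity array.

  • window_length (int, default 15) – Savitzky-Golay window length (must be odd).

  • polyorder (int, default 4) – Polynomial order for Savitzky-Golay filter.
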
Return type:

ndarray
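The sign convention and the removal of a linear baseline can be seen on a synthetic Gaussian band. This sketch uses scipy.signal.savgol_filter with the same defaults as second_derivative, assuming the library function behaves like a plain Savitzky-Golay second derivative.

```python
import numpy as np
from scipy.signal import savgol_filter

wn = np.linspace(400, 4000, 1000)
band = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)   # synthetic band centred at 1500 cm^-1
tilted = band + 0.0002 * wn + 0.1                # same band on a linear baseline

spacing = np.median(np.diff(wn))
d2 = savgol_filter(tilted, window_length=15, polyorder=4, deriv=2, delta=spacing)

peak_idx = int(np.argmax(band))   # position of the band maximum
min_idx = int(np.argmin(d2))      # position of the second-derivative minimum
```

The minimum of d2 falls at the band centre with a negative value, and far from the band the output is essentially zero even though the input still carries the sloping baseline.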

xpectrass.utils.derivatives.gap_derivative(intensities, gap=5, segment=5, delta=1.0, pad_mode='edge')[source]

Norris-Williams gap derivative.

Averages points on either side of a gap, then takes difference. More noise-resistant than point-to-point derivatives.

Parameters:
  • intensities (np.ndarray) – 1-D spectrum.

  • gap (int, default 5) – Gap size (number of points to skip). Will be cast to int if float provided.

  • segment (int, default 5) – Number of points to average on each side. Will be cast to int if float provided.

  • delta (float, default 1.0) – Spacing between consecutive points (e.g., wavenumber spacing in cm⁻¹). The derivative is divided by (gap + segment) * delta to get proper units. For FT-IR with uniform 1 cm⁻¹ spacing, delta=1.0 is correct.

  • pad_mode (str, default 'edge') – Padding mode for edges. Options:

    • ‘edge’: Replicate edge values (default, simple but may create plateaus)

    • ‘constant’: Pad with zeros (pure approach, no artifacts)

    • None: Return unpadded array (length reduced by gap + 2*segment - 1)

Returns:

Gap derivative. If pad_mode is None, array is shorter than input by (gap + 2*segment - 1). Otherwise, same length as input.

Return type:

np.ndarray
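The unpadded computation described above can be sketched in a few lines of NumPy: average segment points on each side of the gap, subtract, and divide by (gap + segment) * delta. The function name gap_derivative_sketch is hypothetical and the code only illustrates the technique; it does not reproduce the library's padding options.

```python
import numpy as np

def gap_derivative_sketch(y, gap=5, segment=5, delta=1.0):
    """Unpadded Norris-Williams-style gap derivative (illustration only)."""
    y = np.asarray(y, dtype=float)
    # means[j] is the average of y[j : j + segment]
    means = np.convolve(y, np.ones(segment) / segment, mode="valid")
    shift = gap + segment                    # offset between the two segment starts
    diff = means[shift:] - means[:-shift]    # right-hand mean minus left-hand mean
    return diff / (shift * delta)

wn = np.linspace(400, 4000, 1000)
spacing = np.median(np.diff(wn))
slope = gap_derivative_sketch(2.0 * wn, gap=5, segment=5, delta=spacing)
```

For a straight line the two segment means sit (gap + segment) points apart, which is why dividing by (gap + segment) * delta recovers the true slope. The output is shorter than the input by gap + 2*segment - 1 points, 14 with the defaults.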

Notes

  • Padding with ‘edge’ mode replicates edge values, which can introduce artificial plateaus at spectrum ends. For pure results, use pad_mode=None or ‘constant’.

  • For FT-IR spectra, edges (<600 cm⁻¹, >4000 cm⁻¹) often have noise, so ‘edge’ padding is usually acceptable.

  • The derivative is divided by (gap + segment) * delta to approximate dI/d(wavenumber). For uniform grids, this gives physically meaningful units.

Examples

>>> import numpy as np
>>> wn = np.linspace(400, 4000, 1000)
>>> y = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
>>>
>>> # With padding (same length as input)
>>> deriv = gap_derivative(y, gap=5, segment=5, pad_mode='edge')
>>> len(deriv) == len(y)  # True
>>>
>>> # Without padding (pure derivative, shorter)
>>> deriv = gap_derivative(y, gap=5, segment=5, pad_mode=None)
>>> len(deriv) == len(y) - 14  # True (lost gap + 2*segment - 1 points)
>>>
>>> # With delta scaling for physical units
>>> delta = np.median(np.diff(wn))
>>> deriv = gap_derivative(y, gap=5, segment=5, delta=delta)
xpectrass.utils.derivatives.derivative_with_smoothing(intensities, order=1, smooth_window=11, deriv_window=15, smooth_polyorder=3, deriv_polyorder=None, delta=1.0, smooth_first=True)[source]

Apply derivative with separate smoothing control.

This function allows independent control over smoothing and derivative computation, useful for very noisy data or when you want to optimize smoothing separately from derivative calculation.

Parameters:
  • intensities (np.ndarray) – 1-D spectrum.

  • order (int, default 1) – Derivative order.

  • smooth_window (int, default 11) – Window length for initial smoothing step (if smooth_first=True). Must be odd and > smooth_polyorder.

  • deriv_window (int, default 15) – Window length for derivative calculation. Must be odd and > deriv_polyorder.

  • smooth_polyorder (int, default 3) – Polynomial order for smoothing step (Savitzky-Golay). Must be less than smooth_window.

  • deriv_polyorder (int, optional) – Polynomial order for derivative step. If None, defaults to order + 1 (minimum required for the derivative order).

  • delta (float, default 1.0) – Spacing between samples (affects derivative scaling). Important for FT-IR: Set to actual wavenumber spacing for physically meaningful derivatives in dI/d(cm⁻¹).

  • smooth_first (bool, default True) – If True, smooth before taking derivative (recommended). If False, differentiate first then smooth (may distort peak shapes and amplify noise before smoothing—use with caution).

Returns:

Derivative spectrum.

Return type:

np.ndarray

Warning

  • Setting smooth_first=False differentiates noisy data before smoothing, which can amplify noise and distort spectral features. Generally not recommended unless you have specific scientific reasons.

  • High smooth_polyorder with small smooth_window may under-smooth data.

Notes

  • For most applications, use spectral_derivative() which combines smoothing and differentiation optimally in a single step.

  • Use this function only when you need separate control over smoothing and derivative parameters (e.g., aggressive pre-smoothing for very noisy data).

  • The two-step approach (smooth → derivative) may introduce edge artifacts at both ends of the spectrum.

Examples

>>> import numpy as np
>>> wn = np.linspace(400, 4000, 1000)
>>> y = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
>>> y_noisy = y + np.random.normal(0, 0.01, len(y))
>>>
>>> # Recommended: smooth first (default)
>>> delta = np.median(np.diff(wn))
>>> deriv = derivative_with_smoothing(
...     y_noisy,
...     order=1,
...     smooth_window=25,  # Heavy smoothing
...     deriv_window=15,
...     delta=delta,
...     smooth_first=True
... )
>>>
>>> # Not recommended: differentiate noisy data first
>>> deriv = derivative_with_smoothing(
...     y_noisy,
...     order=1,
...     smooth_first=False  # Amplifies noise before smoothing
... )
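The smooth-then-differentiate order can be sketched as two consecutive Savitzky-Golay passes. The helper below is a hypothetical stand-in that mirrors the documented defaults, including deriv_polyorder falling back to order + 1; it is an assumption about the approach, not the library's code.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_then_differentiate(y, order=1, smooth_window=11, deriv_window=15,
                              smooth_polyorder=3, deriv_polyorder=None, delta=1.0):
    """Two-step derivative: Savitzky-Golay smoothing, then a Savitzky-Golay derivative."""
    if deriv_polyorder is None:
        deriv_polyorder = order + 1          # lowest order that supports this derivative
    smoothed = savgol_filter(y, smooth_window, smooth_polyorder)
    return savgol_filter(smoothed, deriv_window, deriv_polyorder,
                         deriv=order, delta=delta)

rng = np.random.default_rng(0)               # fixed seed so the example is repeatable
wn = np.linspace(400, 4000, 1000)
spacing = np.median(np.diff(wn))
noisy = np.exp(-0.5 * ((wn - 1500) / 100) ** 2) + rng.normal(0, 0.01, wn.size)

deriv = smooth_then_differentiate(noisy, smooth_window=25, delta=spacing)
check = smooth_then_differentiate(3.0 * wn, delta=spacing)   # straight line, slope 3
```

Both passes preserve low-order polynomials, so on a noise-free straight line the two-step result still returns the exact slope.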
xpectrass.utils.derivatives.derivative_batch(data, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, order=1, window_length=15, polyorder=3, delta=1.0, show_progress=True)[source]

Compute spectral derivatives for multiple spectra (DataFrame or numpy array).

Works with both pandas and polars DataFrames, or numpy arrays. For DataFrames: each row is a sample, numerical columns are wavenumbers. For numpy arrays: shape (n_samples, n_wavenumbers).

Parameters:
  • data (pd.DataFrame | pl.DataFrame | np.ndarray) – Wide-format DataFrame where rows = samples, columns = wavenumbers, OR numpy array of shape (n_samples, n_wavenumbers).

  • label_column (str, default "label") – Name of the label/group column to exclude from derivative computation. Only used for DataFrame inputs.

  • exclude_columns (list[str], optional) – Additional column names to exclude from derivative computation (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns. Only used for DataFrame inputs.

  • wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default, or auto-expands if no columns found within default range.

  • wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default, or auto-expands if no columns found within default range.

  • order (int, default 1) – Derivative order:

    • 1: First derivative (resolves overlapping peaks, removes constant baseline)

    • 2: Second derivative (sharpens peaks, removes linear baseline)

    • 3+: Higher derivatives (increases noise sensitivity)

  • window_length (int, default 15) – Savitzky-Golay filter window length (must be odd). Larger values = more smoothing but less detail.

  • polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter. Must be less than window_length.

  • delta (float, default 1.0) – Spacing between samples (affects derivative scaling). For DataFrame inputs, this parameter is automatically computed from wavenumber spacing and the provided value is ignored. For numpy array inputs, uses the provided delta value.

  • show_progress (bool, default True) – If True, display a progress bar during processing. Only used for DataFrame inputs.

  • sample_id_column (str)

Returns:

Derivative spectra (same type as input).

Return type:

pd.DataFrame | pl.DataFrame | np.ndarray

Examples

>>> # First derivative of DataFrame
>>> df_d1 = derivative_batch(df_wide, order=1)
>>> # Second derivative with larger smoothing window
>>> df_d2 = derivative_batch(df_wide, order=2, window_length=21)
>>> # Third derivative (highly sensitive to noise)
>>> df_d3 = derivative_batch(df_wide, order=3, window_length=25, polyorder=4)
>>> # Numpy array processing (legacy)
>>> spectra_d1 = derivative_batch(spectra_array, order=1)
>>> # Disable progress bar
>>> df_d1 = derivative_batch(df_wide, order=1, show_progress=False)

Notes

  • 1st derivative: Removes constant baseline, enhances spectral differences

  • 2nd derivative: Removes linear baseline, sharpens peaks (peaks appear as negative minima)

  • Higher derivatives amplify noise; increase window_length for smoother results

  • Savitzky-Golay filtering preserves peak shapes better than simple numerical derivatives
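For array input, the batch case amounts to filtering along the wavenumber axis. The sketch below also shows one way the spacing could be derived from wavenumber column names, as the delta parameter describes for DataFrame inputs. The column names and spectra are synthetic, and taking the absolute median spacing is an assumption.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic wide-format layout: column names are wavenumbers stored as strings
column_names = [str(v) for v in np.arange(400.0, 460.0, 2.0)]
wn = np.array([float(name) for name in column_names])

# Spacing taken from the axis itself; abs() also covers descending axes
delta = abs(float(np.median(np.diff(wn))))

# Two synthetic "spectra" with known slopes, shape (n_samples, n_wavenumbers)
spectra = np.vstack([1.0 * wn, 0.5 * wn + 3.0])

# axis=1 filters each row (sample) independently
d1 = savgol_filter(spectra, window_length=15, polyorder=3, deriv=1, delta=delta, axis=1)
```

Each row of d1 holds the derivative of the matching input row, here the constant slopes 1.0 and 0.5.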

xpectrass.utils.derivatives.plot_derivatives(data, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, orders=[0, 1, 2], sample=None, wavenumbers=None, window_length=15, polyorder=3, figsize=(10, 8), invert_x=True)[source]

Plot spectrum and its derivatives for DataFrame or numpy array.

Works with both pandas/polars DataFrames and numpy arrays. For DataFrames: automatically extracts wavenumbers from column names. For numpy arrays: wavenumbers parameter is required.

Parameters:
  • data (pd.DataFrame | pl.DataFrame | np.ndarray) – Wide-format DataFrame (rows=samples, columns=wavenumbers) OR 1-D numpy array of intensities.

  • label_column (str, default "label") – Name of the label/group column to exclude. Only used for DataFrame inputs.

  • exclude_columns (list[str], optional) – Additional column names to exclude (e.g., ‘sample’, ‘id’). Only used for DataFrame inputs.

  • wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default.

  • wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default.

  • orders (list of int, default [0, 1, 2]) – Derivative orders to plot:

    • 0: Original spectrum

    • 1: First derivative

    • 2: Second derivative

    • 3+: Higher derivatives

  • sample (str | int, optional) – For DataFrames: sample name (index) to plot. If None, plots the first sample. For numpy arrays: ignored (plots the provided array).

  • wavenumbers (np.ndarray, optional) – Wavenumber axis. Required only for numpy array input. For DataFrames, automatically extracted from column names.

  • window_length (int, default 15) – Savitzky-Golay window length for derivative computation.

  • polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter.

  • figsize (tuple, default (10, 8)) – Figure size (width, height).

  • invert_x (bool, default True) – If True, invert x-axis (higher wavenumbers on left).

  • sample_id_column (str)

Return type:

None

Examples

>>> # Plot derivatives from DataFrame
>>> plot_derivatives(df_wide, orders=[0, 1, 2], sample="PP225")
>>> # Plot original and 2nd derivative only
>>> plot_derivatives(df_wide, orders=[0, 2], sample="HDPE1")
>>> # Plot from first sample (default)
>>> plot_derivatives(df_wide, orders=[0, 1, 2, 3])
>>> # Plot from numpy array
>>> plot_derivatives(spectrum, wavenumbers=wn_array, orders=[0, 1, 2])
>>> # Custom window for smoother derivatives
>>> plot_derivatives(df_wide, sample="PET1", window_length=25, polyorder=4)

Notes

  • 1st derivative: Removes constant baseline, enhances differences

  • 2nd derivative: Removes linear baseline, sharpens peaks (negative minima)

  • Higher orders increase noise sensitivity; use larger window_length

spectral_derivative

spectral_derivative(
    intensities: np.ndarray,
    order: int = 1,
    window_length: int = 15,
    polyorder: int = 3,
    delta: float = 1.0
) -> np.ndarray

first_derivative / second_derivative

first_derivative(intensities, window_length=15, polyorder=3) -> np.ndarray
second_derivative(intensities, window_length=15, polyorder=4) -> np.ndarray

scatter_correction

Scatter Correction Module for FTIR Spectral Preprocessing

Provides multiplicative scatter correction (MSC), extended MSC (EMSC), and related methods for correcting light scattering effects.

IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py.

Features:
  • Single spectrum correction via scatter_correction()

  • Batch DataFrame processing via apply_scatter_correction()

  • Automatic column detection and sorting by wavenumber

  • Performance optimized for large datasets (vectorized operations)

  • Pandas and Polars DataFrame support

Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:

import logging
logging.getLogger('utils.scatter_correction').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.scatter_correction').setLevel(logging.ERROR)  # Only errors

Available Methods: Run scatter_method_names() to see all available correction methods. Common methods: msc, emsc, snv, snv_detrend

xpectrass.utils.scatter_correction.scatter_correction(intensities, wavenumbers=None, method='msc', reference=None, **kwargs)[source]

Apply scatter correction to a single FTIR spectrum.

Parameters:
  • intensities (array-like) – Raw intensity values (1-D). Absorbance data (AU), not transmittance (%).

  • wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). Ensures API consistency with other preprocessing modules (baseline, denoise, atmospheric). Not used in calculations but validates data integrity.

  • method (str, default "msc") – Correction method:

    • ‘msc’: Multiplicative Scatter Correction

    • ‘emsc’: Extended MSC (includes polynomial baseline terms)

    • ‘snv’: Standard Normal Variate (per-spectrum normalization)

    • ‘snv_detrend’: SNV followed by polynomial detrending

  • reference (np.ndarray, optional) – Reference spectrum for MSC/EMSC. Must have the same length as intensities. If None, MSC and EMSC cannot be applied and a ValueError is raised (use apply_scatter_correction() for batch processing with an automatic reference).

  • **kwargs (method-specific parameters) –

    • emsc: poly_order (default 2)

    • snv_detrend: detrend_order (default 1)

Returns:

Scatter-corrected intensity values. NaN values in input are preserved at their original positions; correction is applied only to finite values.

Return type:

np.ndarray

Raises:

ValueError – If method requires reference spectrum but none provided, or if reference length doesn’t match intensities.

Notes

NaN Handling:
  • If input contains NaN values, they are preserved in output

  • Correction is computed only on finite values

  • If all values are NaN, returns array of NaN

Methods requiring reference (msc, emsc):
  • For single spectrum, reference must be provided explicitly

  • For batch processing, use apply_scatter_correction() which computes mean reference automatically
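The NaN rules above can be illustrated with SNV, which needs no reference spectrum. The sketch below centres and scales only the finite values and leaves NaN entries in place. The name snv_sketch and the use of the sample standard deviation (ddof=1) are assumptions; the library may normalise differently.

```python
import numpy as np

def snv_sketch(intensities):
    """Standard Normal Variate on one spectrum, preserving NaN positions (illustration)."""
    x = np.asarray(intensities, dtype=float)
    out = np.full_like(x, np.nan)          # start from all-NaN output
    finite = np.isfinite(x)
    if finite.sum() < 2:
        return out                         # nothing meaningful to normalise
    mean = x[finite].mean()
    std = x[finite].std(ddof=1)            # assumption: sample standard deviation
    if std > 0:
        out[finite] = (x[finite] - mean) / std
    return out

spectrum = np.array([0.2, 0.4, np.nan, 0.9, 1.1])   # synthetic absorbance values
corrected = snv_sketch(spectrum)
```

The third entry stays NaN, and the remaining values end up with zero mean and unit sample standard deviation.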

xpectrass.utils.scatter_correction.scatter_method_names()[source]

Return list of available scatter correction method names.

Return type:

List[str]

xpectrass.utils.scatter_correction.apply_scatter_correction(data, method='msc', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, reference=None, show_progress=True, **kwargs)[source]

Apply scatter correction to a DataFrame of FTIR spectra (batch processing).

Works with both pandas and polars DataFrames. Each row is a sample, numerical columns are wavenumbers. Applies scatter correction to all samples.

Parameters:
  • data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).

  • method (str, default "msc") – Scatter correction method. Options:

    • ‘msc’: Multiplicative Scatter Correction

    • ‘emsc’: Extended MSC (includes polynomial baseline terms)

    • ‘snv’: Standard Normal Variate (per-spectrum normalization)

    • ‘snv_detrend’: SNV followed by polynomial detrending

  • label_column (str, default "label") – Name of the label/group column to exclude from correction.

  • exclude_columns (list[str], optional) – Additional column names to exclude from correction (e.g., ‘sample’, ‘id’).

  • wn_min (float, optional) – Minimum wavenumber for column detection (default: 200.0 cm⁻¹). Columns with wavenumbers below this value will be excluded.

  • wn_max (float, optional) – Maximum wavenumber for column detection (default: 8000.0 cm⁻¹). Columns with wavenumbers above this value will be excluded.

  • reference (np.ndarray, optional) – Reference spectrum for MSC/EMSC. If None, uses mean of all spectra. Must match the length of spectral columns.

  • show_progress (bool, default True) – If True, display a progress bar during processing.

  • **kwargs (additional parameters) – Method-specific parameters:

    • emsc: poly_order (default 2)

    • snv_detrend: detrend_order (default 1)

  • sample_id_column (str)
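Detecting wavenumber columns from their names, as wn_min and wn_max imply, could work roughly as follows: parse each column name as a number, keep those inside the bounds, and sort them in ascending order. The helper detect_wavenumber_columns is hypothetical and only sketches the idea.

```python
def detect_wavenumber_columns(columns, wn_min=200.0, wn_max=8000.0):
    """Return the column names that parse as wavenumbers in range, sorted ascending."""
    found = []
    for name in columns:
        try:
            value = float(name)
        except (TypeError, ValueError):
            continue                      # metadata such as 'sample' or 'label'
        if wn_min <= value <= wn_max:
            found.append((value, name))
    found.sort(key=lambda pair: pair[0])  # ascending wavenumber order
    return [name for _, name in found]

columns = ["sample", "label", "4000.0", "399.5", "1500", "12000"]
spectral_columns = detect_wavenumber_columns(columns)
```

In this example 'sample' and 'label' fail to parse and '12000' lies above wn_max, so three columns remain, ordered from low to high wavenumber.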

Returns:

pd.DataFrame | pl.DataFrame – Scatter-corrected DataFrame (same type as input) with spectral data corrected and metadata columns preserved. Output columns are sorted by ascending wavenumber for standardization.

NaN Handling

Robustly handles NaN (missing) values in spectral data:
  • NaN values are preserved in output at their original positions

  • Correction is computed only on finite values

  • If an entire spectrum is NaN, it remains as NaN

  • For MSC/EMSC, reference spectrum is computed from finite values only

Performance

Optimized for large datasets using:
  • Robust wavenumber column detection (parses column names, not dtype)

  • Automatic column sorting to ensure monotonic wavenumber order

  • Vectorized numpy array access (no DataFrame.loc overhead)

  • Pre-allocated output arrays (no dynamic list appending)

  • Progress tracking via tqdm

Return type:

pd.DataFrame | pl.DataFrame

Examples

>>> # Apply MSC scatter correction to all samples
>>> df_corrected = apply_scatter_correction(df_wide, method="msc")
>>> # Use EMSC with custom polynomial order
>>> df_corrected = apply_scatter_correction(
...     df_wide,
...     method="emsc",
...     poly_order=3
... )
>>> # Use SNV (no reference needed)
>>> df_corrected = apply_scatter_correction(df_wide, method="snv")
>>> # Works with both pandas and polars
>>> df_pd_corrected = apply_scatter_correction(df_pandas)
>>> df_pl_corrected = apply_scatter_correction(df_polars)
>>> # Disable progress bar for cleaner output
>>> df_corrected = apply_scatter_correction(df_wide, show_progress=False)
xpectrass.utils.scatter_correction.msc_single(spectrum, reference)[source]

Apply MSC to a single spectrum and return coefficients.

Deprecated: Use scatter_correction() with method='msc' instead.

This function is retained for backward compatibility only.

Parameters:
  • spectrum (np.ndarray) – Single spectrum.

  • reference (np.ndarray) – Reference spectrum.

Returns:

  • corrected (np.ndarray) – Corrected spectrum.

  • a (float) – Offset coefficient.

  • b (float) – Scaling coefficient.

Return type:

Tuple[ndarray, float, float]
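The offset a and scaling b come from a straight-line fit of the spectrum against the reference, after which the spectrum is corrected as (spectrum - a) / b. The sketch below uses np.polyfit for the fit and a mean spectrum as reference. The function name msc_single_sketch and the synthetic data are assumptions, and NaN handling is left out for brevity.

```python
import numpy as np

def msc_single_sketch(spectrum, reference):
    """Fit spectrum ~ a + b * reference and return ((spectrum - a) / b, a, b)."""
    b, a = np.polyfit(reference, spectrum, deg=1)   # polyfit returns the slope first
    return (spectrum - a) / b, a, b

wn = np.linspace(400, 4000, 500)
base = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)      # synthetic absorbance band

# The same band with three different offsets and scalings (simulated scatter)
spectra = np.vstack([0.10 + 1.0 * base, 0.30 + 2.0 * base, -0.05 + 0.5 * base])

reference = spectra.mean(axis=0)                    # mean spectrum as the reference
corrected = np.vstack([msc_single_sketch(s, reference)[0] for s in spectra])

_, a, b = msc_single_sketch(spectra[1], base)       # recovers offset 0.3 and scale 2.0
```

Because every synthetic spectrum is an exact linear function of the reference, all corrected rows collapse onto the reference. Real spectra would only approach it.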

scatter_correction

scatter_correction(
    intensities: np.ndarray,
    wavenumbers: np.ndarray = None,
    method: str = "msc",
    reference: np.ndarray = None,
    **kwargs
) -> np.ndarray

Methods: msc, emsc, snv, snv_detrend


region_selection

Region Selection Module for FTIR Spectral Preprocessing

Provides utilities for selecting, excluding, and extracting spectral regions based on wavenumber ranges.

xpectrass.utils.region_selection.get_region_names()[source]

Return list of predefined region names.

Return type:

List[str]

xpectrass.utils.region_selection.get_region_range(name)[source]

Get wavenumber range for a named region.

Parameters:

name (str)

Return type:

Tuple[float, float]

xpectrass.utils.region_selection.select_region(df, regions)[source]

Select spectral regions by wavenumber ranges.

Parameters:
  • df (pl.DataFrame) – Wide-format DataFrame with ‘sample’, ‘label’, and wavenumber columns.

  • regions (tuple, list of tuples, or str) –

    • tuple: (start, end) wavenumber range

    • list of tuples: multiple ranges to include

    • str: predefined region name (e.g., ‘fingerprint’, ‘ch_stretch’)

Returns:

DataFrame with only selected wavenumber columns.

Return type:

pl.DataFrame

Examples

>>> select_region(df, (400, 1500))  # Fingerprint region
>>> select_region(df, 'ch_stretch')  # Named region
>>> select_region(df, [(400, 1500), (2800, 3100)])  # Multiple regions
xpectrass.utils.region_selection.exclude_regions(df, regions)[source]

Exclude spectral regions (opposite of select_region).

Parameters:
  • df (pl.DataFrame) – Wide-format DataFrame.

  • regions (tuple, list of tuples, or str) – Regions to exclude.

Returns:

DataFrame with excluded regions removed.

Return type:

pl.DataFrame

xpectrass.utils.region_selection.exclude_atmospheric(df)[source]

Convenience function to exclude atmospheric interference regions.

Excludes CO2 (2300-2400 cm⁻¹) and H2O (1350-1900, 3550-3900 cm⁻¹).

Parameters:

df (DataFrame)

Return type:

DataFrame
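On plain arrays, excluding these ranges comes down to a boolean mask. The sketch below uses the three ranges quoted above; the helper name drop_ranges and the inclusive bounds are assumptions.

```python
import numpy as np

# CO2 and H2O ranges quoted above, in cm^-1
ATMOSPHERIC_RANGES = [(2300, 2400), (1350, 1900), (3550, 3900)]

def drop_ranges(intensities, wavenumbers, ranges):
    """Remove all points whose wavenumber falls inside any (lo, hi) range."""
    keep = np.ones(wavenumbers.shape, dtype=bool)
    for lo, hi in ranges:
        keep &= ~((wavenumbers >= lo) & (wavenumbers <= hi))   # bounds are inclusive
    return intensities[keep], wavenumbers[keep]

wn = np.arange(400, 4001, 1.0)          # synthetic 1 cm^-1 grid
y = np.ones_like(wn)                    # placeholder intensities
y_kept, wn_kept = drop_ranges(y, wn, ATMOSPHERIC_RANGES)
```

On this 1 cm⁻¹ grid the three ranges cover 1003 of the 3601 points, leaving 2598.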

xpectrass.utils.region_selection.select_region_np(intensities, wavenumbers, start, end)[source]

Select region from 1-D spectrum arrays.

Parameters:
  • intensities (np.ndarray) – Intensity values.

  • wavenumbers (np.ndarray) – Wavenumber values.

  • start (float) – Lower bound of the wavenumber range (cm⁻¹).

  • end (float) – Upper bound of the wavenumber range (cm⁻¹).

Returns:

  • selected_intensities (np.ndarray)

  • selected_wavenumbers (np.ndarray)

Return type:

Tuple[ndarray, ndarray]

xpectrass.utils.region_selection.select_regions_np(intensities, wavenumbers, regions)[source]

Select multiple regions from 1-D spectrum arrays.

Regions are concatenated in the order provided.

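Parameters:
  • intensities (np.ndarray) – Intensity values.

  • wavenumbers (np.ndarray) – Wavenumber values.

  • regions (list of tuples) – (start, end) wavenumber ranges to select.
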
Return type:

Tuple[ndarray, ndarray]
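Concatenation in the order provided means the output follows the order of the regions list, not the order of the wavenumber axis. A minimal NumPy sketch, with a hypothetical function name and inclusive bounds assumed:

```python
import numpy as np

def select_regions_sketch(intensities, wavenumbers, regions):
    """Select several (start, end) ranges and concatenate them in the given order."""
    parts_y, parts_wn = [], []
    for start, end in regions:
        mask = (wavenumbers >= start) & (wavenumbers <= end)
        parts_y.append(intensities[mask])
        parts_wn.append(wavenumbers[mask])
    return np.concatenate(parts_y), np.concatenate(parts_wn)

wn = np.arange(400, 4001, 100.0)        # coarse synthetic axis
y = wn / 1000.0                          # placeholder intensities
y_sel, wn_sel = select_regions_sketch(y, wn, [(2800, 3100), (400, 700)])
```

Since the C-H stretch range is listed first, its points come first in the result and the selected wavenumber axis is no longer monotonic.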

xpectrass.utils.region_selection.analyze_regions(df, regions=None)[source]

Analyze intensity statistics across different spectral regions.

Parameters:
  • df (pl.DataFrame) – Wide-format spectral DataFrame.

  • regions (list of tuples, optional) – Regions to analyze. If None, analyzes predefined plastic regions.

Returns:

Statistics for each region (mean, std, min, max, peak location).

Return type:

pd.DataFrame

xpectrass.utils.region_selection.get_wavenumbers(df)[source]

Extract wavenumber array from DataFrame columns.

Parameters:

df (DataFrame)

Return type:

ndarray

xpectrass.utils.region_selection.get_spectra_matrix(df)[source]

Extract spectra as numpy matrix (n_samples, n_wavenumbers).

Parameters:

df (DataFrame)

Return type:

ndarray

select_region

select_region(
    df: pl.DataFrame,
    regions: Union[str, Tuple, List[Tuple]]
) -> pl.DataFrame

exclude_regions

exclude_regions(df: pl.DataFrame, regions: Union[str, Tuple, List[Tuple]]) -> pl.DataFrame

FTIR_REGIONS

FTIR_REGIONS = {
    'fingerprint': (400, 1500),
    'ch_stretch': (2800, 3100),
    'carbonyl': (1650, 1800),
    # ... and more
}
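A lookup table like this lets the regions argument accept a name, a single tuple, or a list of tuples. The sketch below shows one possible way to normalise those three forms. The dictionary contains only the three entries listed above, and resolve_regions is a hypothetical helper.

```python
# Only the three entries shown above; the real table contains more
REGIONS = {
    "fingerprint": (400, 1500),
    "ch_stretch": (2800, 3100),
    "carbonyl": (1650, 1800),
}

def resolve_regions(regions):
    """Normalise a name, a (start, end) tuple, or a list of tuples to a list of tuples."""
    if isinstance(regions, str):
        return [REGIONS[regions]]       # raises KeyError for unknown names
    if isinstance(regions, tuple):
        return [regions]
    return list(regions)

by_name = resolve_regions("ch_stretch")
by_tuple = resolve_regions((400, 1500))
by_list = resolve_regions([(400, 1500), (2800, 3100)])
```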

file_management

xpectrass.utils.file_management.process_batch_files(files, skiprows=15, separator=',', engine='pl', concat_how='vertical', keep_index=True, index_col=None, show_progress=True)[source]

Import a batch of FT-IR CSVs and concatenate them into one Polars frame.

Parameters:
  • files (iterable of str | Path) – Paths to the spectral CSV files.

  • skiprows (int, default 15) – Number of rows to skip at the start of the file (e.g. metadata).

  • separator (str, default ',') – Delimiter for the CSV file.

  • engine ({'pd', 'pl'}, default 'pl') –

    • ‘pd’ → read via the pandas-based importer, then convert to Polars.

    • ‘pl’ → read directly via the Polars importer (faster).

  • concat_how ({'vertical', 'vertical_relaxed'}, default 'vertical') –

    • ‘vertical’ → schemas must match exactly; raises if not.

    • ‘vertical_relaxed’ → union by column name; missing cols filled with nulls (Polars ≥ 0.20).

  • keep_index (bool, default True) – If True and engine is ‘pd’, include the pandas Index as a column when converting to Polars (include_index=True). Recommended because the pandas importer names the index “sample”.

  • index_col (str, optional) – Column name to use as the row identifier/index. If None, uses default behavior (integer index for pandas, “sample” column for polars). Common values: ‘sample’, ‘sample_name’, etc.

  • show_progress (bool, default True) – Toggle the tqdm progress bar.

Returns:

All spectra stacked row-wise (each row = one sample).

Return type:

pl.DataFrame

xpectrass.utils.file_management.import_data(file_path, engine='pl', skiprows=15, separator=',', index_col=None)[source]

Load a single‐sample CSV of spectral data, set the wavenumber index, transpose so samples are rows, and attach a simple sample label.

Parameters:
  • file_path (str or pathlib.Path) – Path to the CSV file to import.

  • skiprows (int, default 15) – Number of rows to skip at the start of the file (e.g. metadata).

  • separator (str, default ',') – Delimiter for the CSV file.

  • engine (str, default 'pl') – ‘pd’ for pandas or ‘pl’ for polars.

  • index_col (str, optional) – Column name to use as the DataFrame’s row index. If None, uses default integer index. Common values: ‘sample’, ‘sample_name’, etc.

Returns:

Transposed DataFrame (samples × wavenumber) with:
  • Index name = “sample” (pandas) or “sample” column (polars)

  • Column name = wavenumber values, index name = “wavenumber”

  • A “label” column containing the alphabetic prefix of the sample name

Return type:

pd.DataFrame | pl.DataFrame
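The reshaping step can be pictured with the standard library alone: skip the metadata lines, read the data rows, and turn them into one wide row keyed by wavenumber. The two-column layout, the file content, and the helper read_single_spectrum are all assumptions made for this sketch; real instrument exports may differ.

```python
import csv

def read_single_spectrum(text, sample_name, skiprows=15, separator=","):
    """Turn one two-column spectrum file into a single wide row (illustration only)."""
    data_lines = text.splitlines()[skiprows:]        # drop the metadata header
    row = {"sample": sample_name}
    for fields in csv.reader(data_lines, delimiter=separator):
        if len(fields) != 2:
            continue                                  # skip blank or malformed lines
        wavenumber, intensity = fields
        row[float(wavenumber)] = float(intensity)     # wavenumber becomes the column key
    return row

# Synthetic file content: 15 metadata lines followed by three data rows
fake_file = "\n".join(
    [f"metadata line {i}" for i in range(15)]
    + ["4000.0,98.2", "3999.0,98.1", "3998.0,97.9"]
)
row = read_single_spectrum(fake_file, "PP225")
```

Stacking one such row per file gives the samples × wavenumber layout that the batch importer returns.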

xpectrass.utils.file_management.import_data_pd(file_path, skiprows=15, sep=',', index_col=None)[source]

Load a single‐sample CSV of spectral data, set the wavenumber index, transpose so samples are rows, and attach a simple sample label.

Parameters:
  • file_path (str or pathlib.Path) – Path to the CSV file to import.

  • skiprows (int, default 15) – Number of rows to skip at the start of the file (e.g. metadata).

  • sep (str, default ',') – Delimiter for the CSV file.

  • index_col (str, optional) – Column name to use as the DataFrame’s row index after transposing. If None, uses default integer index with name “sample”. Common values: ‘sample’, ‘sample_name’, etc.

Returns:

Transposed DataFrame (samples × wavenumber) with:
  • Index name = index_col (if provided) or “sample” (default)

  • Column names = wavenumber values

  • A “label” column containing the alphabetic prefix of the sample name

Return type:

pd.DataFrame

xpectrass.utils.file_management.import_data_pl(file_path, skiprows=15, sep=',', index_col=None)[source]

Load a single-sample spectral CSV into a Polars DataFrame, reshape it so that each sample is a row with wavenumbers as columns, and attach a simple sample label.

Parameters:
  • file_path (str or pathlib.Path) – Path to the CSV file.

  • skiprows (int, default 15) – Number of lines to skip before reading data (e.g., metadata).

  • sep (str, default ",") – Field delimiter for the CSV.

  • index_col (str, optional) – Column name to use as the row identifier. If provided, this column will be moved to the first position and can be used for setting as index when converting to pandas. If None, “sample” column remains as a regular column. Common values: ‘sample’, ‘sample_name’, etc.

Returns:

A wide DataFrame where:
  • Each row is a sample.

  • If index_col is specified, that column is in the first position.

  • Columns are wavenumbers (as floats or ints).

  • A “label” column holds the alphabetic prefix of the sample name.

Return type:

pl.DataFrame
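Taking the alphabetic prefix of the sample name as the label can be done with a short regular expression. The helper below is a hypothetical sketch of that rule. Returning an empty string when the name has no leading letters is an assumption, as the documentation does not say what happens in that case.

```python
import re

def label_from_sample(sample_name):
    """Return the leading run of letters in a sample name, or '' if there is none."""
    match = re.match(r"[A-Za-z]+", sample_name)
    return match.group(0) if match else ""

labels = [label_from_sample(name) for name in ["PP225", "HDPE1", "PET1"]]
```

With the sample names used in the examples of this reference, the labels come out as 'PP', 'HDPE' and 'PET'.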

process_batch_files

process_batch_files(
    files: Iterable[str],
    skiprows: int = 15,
    separator: str = ',',
    engine: str = "pl",
    show_progress: bool = True
) -> pl.DataFrame

import_data

import_data(file_path: str, engine: str = 'pl', skiprows: int = 15, separator: str = ',', index_col: str = None) -> pd.DataFrame | pl.DataFrame