Utils Module
data_validation
Data Validation Module for FTIR Spectral Preprocessing
Provides comprehensive validation checks for FTIR spectral data to ensure data quality before preprocessing and analysis.
- xpectrass.utils.data_validation.validate_spectra(df, expected_samples_per_class=500, expected_classes=None, wavenumber_range=(399.0, 4000.0), intensity_range=(0.0, 150.0), verbose=True)[source]
Comprehensive validation of FTIR spectral data.
- Parameters:
df (pl.DataFrame) – Wide-format DataFrame with columns: ‘sample’, ‘label’, and wavenumber columns.
expected_samples_per_class (int, default 500) – Expected number of samples per plastic type.
expected_classes (list of str, optional) – Expected class labels. Default: [‘HDPE’, ‘LDPE’, ‘PET’, ‘PP’, ‘PS’, ‘PVC’]
wavenumber_range (tuple, default (399.0, 4000.0)) – Expected (min, max) wavenumber range in cm⁻¹.
intensity_range (tuple, default (0.0, 150.0)) – Valid intensity range for %T values.
verbose (bool, default True) – Print validation report to console.
- Returns:
Validation report with keys:
- ‘valid’: bool - overall pass/fail
- ‘n_samples’: int - total samples
- ‘n_wavenumbers’: int - spectral points
- ‘class_counts’: dict - samples per label
- ‘missing_values’: int - count of NaN/Inf
- ‘out_of_range’: dict - samples with intensities outside range
- ‘wavenumber_check’: dict - actual vs expected range
- ‘duplicates’: list - duplicate sample names
- ‘issues’: list - list of issue descriptions
- Return type:
- xpectrass.utils.data_validation.detect_outlier_spectra(df, method='zscore', threshold=3.0)[source]
Detect outlier spectra based on overall intensity statistics.
- Parameters:
- Returns:
- ‘outlier_samples’: list of sample names flagged as outliers
- ‘outlier_indices’: list of row indices
- ‘statistics’: dict with mean/std/median per sample
- Return type:
- xpectrass.utils.data_validation.check_wavenumber_consistency(file_paths, skiprows=15, tolerance=0.1)[source]
Check if all files have consistent wavenumber grids.
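The grid comparison described above can be sketched as follows. This is only an illustration under stated assumptions: it works on wavenumber arrays that have already been loaded (the real function takes file paths and a skiprows value), and the helper name grids_consistent is hypothetical, not part of xpectrass.

```python
import numpy as np

def grids_consistent(grids, tolerance=0.1):
    """Return (ok, mismatched_indices) for a list of 1-D wavenumber arrays.

    Hypothetical helper: the first grid is taken as the reference.
    """
    reference = np.asarray(grids[0], dtype=float)
    mismatched = []
    for i, grid in enumerate(grids[1:], start=1):
        grid = np.asarray(grid, dtype=float)
        # Flag a grid with a different length, or with any point further
        # than `tolerance` (cm^-1) from the reference grid.
        if grid.shape != reference.shape or np.max(np.abs(grid - reference)) > tolerance:
            mismatched.append(i)
    return (not mismatched, mismatched)
```

Comparing against the first grid keeps the check linear in the number of files; a mismatch in length is reported before any point-wise comparison is attempted.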
validate_spectra
validate_spectra(
df: pl.DataFrame,
expected_samples_per_class: int = 500,
expected_classes: List[str] = None,
wavenumber_range: Tuple[float, float] = (399.0, 4000.0),
intensity_range: Tuple[float, float] = (0.0, 150.0),
verbose: bool = True
) -> Dict[str, Any]
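To make the report keys concrete, here is a minimal sketch of the kind of checks such a validation could run. It is not the xpectrass implementation: it takes plain NumPy arrays rather than a polars DataFrame, the function name sketch_validation_report and the tol parameter are hypothetical, and only a subset of the documented keys is produced.

```python
import numpy as np

def sketch_validation_report(intensities, wavenumbers, labels,
                             wavenumber_range=(399.0, 4000.0),
                             intensity_range=(0.0, 150.0),
                             tol=1.0):
    """Illustrative checks only (hypothetical helper, not the library code)."""
    intensities = np.asarray(intensities, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    issues = []

    # Count NaN/Inf entries across the whole (samples x wavenumbers) matrix.
    finite = np.isfinite(intensities)
    missing = int((~finite).sum())
    if missing:
        issues.append(f"{missing} non-finite intensity values")

    # Rows with any finite value outside the allowed intensity range.
    lo, hi = intensity_range
    bad = finite & ((intensities < lo) | (intensities > hi))
    bad_rows = np.flatnonzero(bad.any(axis=1)).tolist()
    if bad_rows:
        issues.append(f"{len(bad_rows)} spectra outside intensity range")

    # Compare the observed wavenumber span with the expected one.
    wn_ok = (abs(wavenumbers.min() - wavenumber_range[0]) <= tol
             and abs(wavenumbers.max() - wavenumber_range[1]) <= tol)
    if not wn_ok:
        issues.append("wavenumber range differs from expected")

    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    return {
        "valid": not issues,
        "n_samples": intensities.shape[0],
        "n_wavenumbers": intensities.shape[1],
        "class_counts": dict(zip(classes.tolist(), counts.tolist())),
        "missing_values": missing,
        "out_of_range": bad_rows,
        "issues": issues,
    }
```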
detect_outlier_spectra
detect_outlier_spectra(
df: pl.DataFrame,
method: str = "zscore", # 'zscore', 'iqr', 'mad'
threshold: float = 3.0
) -> Dict[str, Any]
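The three method names accepted above correspond to standard outlier rules. The sketch below applies them to a 1-D array of per-spectrum statistics (for example, mean intensity per sample). Which statistic xpectrass actually scores, and how threshold is interpreted for 'iqr', are not stated in this page, so both are assumptions here; flag_outliers is a hypothetical name.

```python
import numpy as np

def flag_outliers(values, method="zscore", threshold=3.0):
    """Return indices of outlying entries in a 1-D array (hypothetical helper)."""
    v = np.asarray(values, dtype=float)
    if method == "zscore":
        # Distance from the mean in units of standard deviation.
        mask = np.abs(v - v.mean()) / v.std() > threshold
    elif method == "iqr":
        q1, q3 = np.percentile(v, [25, 75])
        iqr = q3 - q1
        # Assumption: `threshold` scales the IQR fence (1.5 is the textbook value).
        mask = (v < q1 - threshold * iqr) | (v > q3 + threshold * iqr)
    elif method == "mad":
        med = np.median(v)
        mad = np.median(np.abs(v - med))
        # 1.4826 rescales the MAD to match the standard deviation for normal data.
        mask = np.abs(v - med) / (1.4826 * mad) > threshold
    else:
        raise ValueError(f"unknown method: {method!r}")
    return np.flatnonzero(mask).tolist()
```

The median-based rules ('iqr', 'mad') are less affected by the outliers themselves than the z-score, which matters when a dataset contains several bad spectra.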
baseline
Baseline Correction Module for FTIR Spectral Preprocessing
Provides baseline correction using 50+ algorithms from pybaselines library plus custom windowed filters for FTIR and ToF-SIMS spectra.
IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py
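For reference, the conversion from percent transmittance to absorbance follows A = 2 - log10(%T). The sketch below shows that relation only; the signature of convert_spectra() is not documented in this section, so the helper name and the floor parameter are hypothetical.

```python
import numpy as np

def percent_transmittance_to_absorbance(percent_t, floor=1e-6):
    """A = 2 - log10(%T). `floor` (an assumption) avoids log10(0) at opaque points."""
    t = np.clip(np.asarray(percent_t, dtype=float), floor, None)
    return 2.0 - np.log10(t)
```

With this relation, 100 %T maps to 0 AU, 10 %T to 1 AU, and 1 %T to 2 AU.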
Features:
- Single spectrum correction via baseline_correction()
- Batch DataFrame processing via apply_baseline_correction()
- Automatic column detection and sorting by wavenumber
- Performance optimized for large datasets (vectorized operations)
- Pandas and Polars DataFrame support
- Method evaluation via evaluate_baseline_correction_methods()
Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:
import logging
logging.getLogger('utils.baseline').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.baseline').setLevel(logging.ERROR)  # Only errors
Available Methods: Run baseline_method_names() to see all available correction algorithms. Common methods: airpls, asls, arpls, iarpls, drpls, mor, snip, poly
- xpectrass.utils.baseline.baseline_correction(intensities, wavenumbers=None, method='airpls', window_size=101, poly_order=4, clip_negative=False, return_baseline=False, **kwargs)[source]
Baseline-correct a 1-D FT-IR or ToF-SIMS spectrum with >50 algorithms.
- Parameters:
intensities (array-like) – Raw y-values (%T or absorbance); will be converted to float64.
wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). If provided, passed to pybaselines for correct spacing. If None, assumes uniform spacing with dx=1. For FTIR spectra, should match the length of intensities.
method (str, default "airpls") – Name of the baseline algorithm. All pybaselines methods plus two custom filters (“median_filter”, “adaptive_window”) are accepted.
window_size (int, default 101) – Odd kernel width for the two custom windowed filters.
poly_order (int, default 4) – Polynomial order for the “poly” baseline.
clip_negative (bool, default False) – If True, set negative corrected values to 0 (useful for %T spectra).
return_baseline (bool, default False) – If True, return (corrected, baseline) instead of just corrected.
**kwargs – Extra keyword arguments are forwarded verbatim to the selected pybaselines algorithm (e.g. lam=1e6, p=0.01 for AsLS).
- Returns:
corrected (np.ndarray) – Baseline-subtracted intensities (same dtype & length as input).
baseline (np.ndarray, optional) – Returned only if return_baseline=True.
- Return type:
Notes
- NaN Handling:
If input contains NaN values, they are temporarily removed for baseline estimation
Baseline is computed only on finite values
NaN positions are preserved in output (marked as NaN)
If all values are NaN, returns array of NaN
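The NaN-handling steps listed above can be sketched as follows. A plain polynomial fit stands in for the baseline algorithm purely to keep the example short; the function name is hypothetical and this is not how the library's pybaselines-backed methods estimate the baseline.

```python
import numpy as np

def poly_baseline_nan_safe(intensities, wavenumbers, poly_order=4):
    """Subtract a polynomial baseline while preserving NaN positions (sketch)."""
    y = np.asarray(intensities, dtype=float)
    x = np.asarray(wavenumbers, dtype=float)
    corrected = np.full_like(y, np.nan)
    finite = np.isfinite(y)
    if not finite.any():
        return corrected  # all-NaN input stays all-NaN
    # Scale x to [-1, 1] so the polynomial fit is numerically well behaved.
    xs = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    # Estimate the baseline from finite points only.
    coeffs = np.polyfit(xs[finite], y[finite], poly_order)
    baseline = np.polyval(coeffs, xs[finite])
    # Write results back at finite positions; NaN positions remain NaN.
    corrected[finite] = y[finite] - baseline
    return corrected
```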
- xpectrass.utils.baseline.apply_baseline_correction(data, method='airpls', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, window_size=101, poly_order=4, clip_negative=False, show_progress=True, **kwargs)[source]
Apply baseline correction to a DataFrame of FTIR spectra (batch processing).
Works with both pandas and polars DataFrames. Each row is a sample, numerical columns are wavenumbers. Applies baseline correction to all samples.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).
method (str, default "airpls") – Baseline correction method. All pybaselines methods plus custom filters (“median_filter”, “adaptive_window”, “poly”) are supported. Common methods: “airpls”, “asls”, “arpls”, “iarpls”, “drpls”, “iasls”, “aspls”, “psalsa”, “derpsalsa”, “mpls”, “mor”, “imor”, “amormol”, “snip”
label_column (str, default "label") – Name of the label/group column to exclude from correction.
exclude_columns (list[str], optional) – Additional column names to exclude from correction (e.g., ‘sample’, ‘id’).
wn_min (float, optional) – Minimum wavenumber for column detection (default: 200.0 cm⁻¹). Columns with wavenumbers below this value will be excluded.
wn_max (float, optional) – Maximum wavenumber for column detection (default: 8000.0 cm⁻¹). Columns with wavenumbers above this value will be excluded.
window_size (int, default 101) – Odd kernel width for custom windowed filters (“median_filter”, “adaptive_window”).
poly_order (int, default 4) – Polynomial order for the “poly” baseline method.
clip_negative (bool, default False) – If True, set negative corrected values to 0 (useful for %T spectra).
show_progress (bool, default True) – If True, display a progress bar during processing.
**kwargs (additional parameters) – Extra keyword arguments forwarded to the baseline correction algorithm (e.g., lam=1e6, p=0.01 for AsLS/AirPLS methods).
sample_id_column (str)
- Returns:
pd.DataFrame | pl.DataFrame – Baseline-corrected DataFrame (same type as input) with spectral data corrected and metadata columns preserved. Output columns are sorted by ascending wavenumber for standardization.
NaN Handling:
Robustly handles NaN (missing) values in spectral data:
- NaN values are temporarily removed before baseline estimation
- Baseline is computed only on finite values
- NaN positions are preserved in output
- If an entire spectrum is NaN, it remains as NaN
- Prevents baseline algorithms from failing on sparse/incomplete data
Performance:
Optimized for large datasets using:
- Robust wavenumber column detection (parses column names, not dtype)
- Automatic column sorting to ensure monotonic wavenumber order
- Vectorized numpy array access (no DataFrame.loc overhead)
- Pre-allocated output arrays (no dynamic list appending)
- Progress tracking via tqdm
- Return type:
pd.DataFrame | pl.DataFrame
Warning
Warns if spectral columns are reordered during processing
Warns if wavenumber bounds are auto-expanded beyond defaults (200-8000 cm⁻¹)
Examples
>>> # Apply AirPLS baseline correction to all samples
>>> df_corrected = apply_baseline_correction(df_wide, method="airpls")

>>> # Use AsLS with custom parameters
>>> df_corrected = apply_baseline_correction(
...     df_wide,
...     method="asls",
...     lam=1e6,
...     p=0.01
... )

>>> # Use polynomial baseline
>>> df_corrected = apply_baseline_correction(
...     df_wide,
...     method="poly",
...     poly_order=3
... )

>>> # Works with both pandas and polars
>>> df_pd_corrected = apply_baseline_correction(df_pandas)
>>> df_pl_corrected = apply_baseline_correction(df_polars)

>>> # Disable progress bar for cleaner output
>>> df_corrected = apply_baseline_correction(df_wide, show_progress=False)

>>> # Works correctly even with NaN values in spectra
>>> df_with_nans = df_wide.copy()
>>> df_with_nans.iloc[0, 10:20] = np.nan  # Introduce NaN values
>>> df_corrected = apply_baseline_correction(df_with_nans, method="airpls")
>>> # NaN positions are preserved in output
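The column detection described in the Performance notes (parsing column names rather than relying on dtype, then sorting) might look roughly like this. The helper name is hypothetical, the bounds reuse the documented 200 to 8000 cm⁻¹ defaults, and the auto-expansion behaviour mentioned in the warnings is not reproduced.

```python
def detect_wavenumber_columns(columns, wn_min=200.0, wn_max=8000.0):
    """Return column names that parse as wavenumbers, sorted ascending (sketch)."""
    found = []
    for name in columns:
        try:
            value = float(name)
        except (TypeError, ValueError):
            continue  # metadata column such as 'sample' or 'label'
        if wn_min <= value <= wn_max:
            found.append((value, name))
    # Sort by the numeric value, not by the string, so '1500' precedes '4000.0'.
    return [name for value, name in sorted(found)]
```

Sorting on the parsed value is what guarantees a monotonic wavenumber axis even when the input file stores columns in descending order or as strings.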
- xpectrass.utils.baseline.baseline_method_names()[source]
Return a sorted list of method names that can be passed to baseline_correction(method=…).
The list is generated dynamically from pybaselines.Baseline, skipping the deprecated solver helpers, and then augmented with the two custom windowed filters plus the convenient ‘poly’ alias.
- xpectrass.utils.baseline.plot_baseline_correction_metric_boxes(df, metric_name, figsize=(9, 5), mean_bar_width=0.6, color_boxes=None, color_mean=None, plot_mean_sd=False, save_plot=False, save_path='')[source]
Box-plot of a baseline-quality metric (RFZN or NAR) across methods.
- Parameters:
df (pandas.DataFrame) – Rows = samples, columns = baseline-correction methods.
metric_name (str) – Title for the y-axis and plot.
figsize ((w, h), default (9, 5)) – Size of the figure in inches.
mean_bar_width (float, default 0.6) – Width of the mean ± SD bar overlay (same units as box widths).
color_boxes (str | None) – Matplotlib colour for the boxes. None → Matplotlib default cycle.
color_mean (str | None) – Colour for the mean ± SD bars. None → Matplotlib default cycle.
plot_mean_sd (bool)
save_plot (bool)
save_path (str)
- Return type:
None
- xpectrass.utils.baseline.plot_baseline_correction_metric_boxes_masked(df, metric_name, max_value, figsize=(9, 5), mean_bar_width=0.6, color_boxes=None, color_mean=None, plot_mean_sd=False, save_plot=False, save_path='masked_')[source]
- xpectrass.utils.baseline.evaluate_baseline_correction_methods(data, flat_windows, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, negative_clip=False, diagnostic_peaks=None, baseline_methods=None, n_samples=None, sample_selection='random', random_state=None, n_jobs=-1)[source]
Parallel computation of RFZN, NAR, SNR for every (sample, method) pair.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).
flat_windows (list of tuples) – Wavenumber ranges to use for baseline noise evaluation. Each tuple is (min_wavenumber, max_wavenumber) for regions expected to contain only baseline (no peaks).
label_column (str, default "label") – Name of the label/group column to exclude from evaluation.
exclude_columns (list[str], optional) – Additional column names to exclude from evaluation (e.g., ‘sample’, ‘id’).
wn_min (float, optional) – Minimum wavenumber for column detection (default: 200.0 cm⁻¹).
wn_max (float, optional) – Maximum wavenumber for column detection (default: 8000.0 cm⁻¹).
negative_clip (bool, default False) – If True, clip negative values to 0 during baseline correction.
diagnostic_peaks (list of tuples, optional) – Specific wavenumber ranges for peak detection in SNR calculation. Each tuple is (min_wavenumber, max_wavenumber) for diagnostic peaks. Example: [(2900, 2930), (2840, 2870)] for CH2/CH3 stretch regions. If None, uses global maximum across entire spectrum.
baseline_methods (list of str, optional) – List of baseline correction methods to evaluate. If None (default), evaluates all available methods from baseline_method_names(). Providing a subset significantly speeds up evaluation. Example: [‘als’, ‘asls’, ‘arpls’] to test only ALS variants. Use baseline_method_names() to see all available methods.
n_samples (int, optional) – Number of samples to evaluate. If None, evaluates all samples.
sample_selection (str, default "random") – How to select samples if n_samples < total samples. Options: “random”, “first”, “last”.
random_state (int, optional) – Random seed for reproducible sample selection when using “random”.
n_jobs (int, default -1) – Number of parallel jobs. -1 uses all CPU cores.
sample_id_column (str)
- Returns:
rfzn_tbl (pandas.DataFrame) – Residual Flat-Zone Noise (RFZN) values for each (sample, method) pair. Units: Same as input spectral intensities (e.g., absorbance, %T). Lower values indicate better baseline correction (less residual noise).
nar_tbl (pandas.DataFrame) – Negative Area Ratio (NAR) values for each (sample, method) pair. Units: Unitless ratio in range [0, 1]. Ratio of negative area to total area after baseline correction. Lower values indicate better correction (fewer negative artifacts).
snr_tbl (pandas.DataFrame) – Signal-to-Noise Ratio (SNR) values for each (sample, method) pair. Units: Unitless ratio (peak_height / noise_level). Higher values indicate better correction (stronger signal relative to noise).
Notes
- Metric Interpretations:
RFZN: RMS noise in flat zones. Good methods: < 0.01 (absorbance units)
NAR: Fraction of negative intensity. Good methods: < 0.05 (5%)
SNR: Peak/noise ratio. Good methods: > 10 (depends on sample)
- Example Usage:
>>> flat_windows = [(2500, 2600), (3200, 3500)]  # Baseline-only regions
>>>
>>> # Evaluate all available methods
>>> rfzn, nar, snr = evaluate_baseline_correction_methods(
...     df, flat_windows, diagnostic_peaks=[(2900, 2930)]
... )
>>>
>>> # Evaluate only specific methods (faster)
>>> rfzn, nar, snr = evaluate_baseline_correction_methods(
...     df, flat_windows,
...     baseline_methods=['als', 'asls', 'arpls', 'rubberband']
... )
>>>
>>> # Find best method for each sample
>>> best_methods = rfzn.idxmin(axis=1)  # Method with lowest noise per sample
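The three metrics can be approximated for a single corrected spectrum as below. This follows the one-line descriptions given in the notes (RMS in flat zones, negative area over total area, peak height over noise); the exact formulas used by the library are not shown in this page, so details such as summing instead of integrating, and reusing RFZN as the noise level, are assumptions. All function names are hypothetical.

```python
import numpy as np

def window_mask(wavenumbers, windows):
    """Boolean mask selecting points inside any (min, max) wavenumber window."""
    wn = np.asarray(wavenumbers, dtype=float)
    mask = np.zeros(wn.shape, dtype=bool)
    for lo, hi in windows:
        mask |= (wn >= lo) & (wn <= hi)
    return mask

def rfzn(corrected, wavenumbers, flat_windows):
    """RMS of the corrected signal inside the flat windows."""
    flat = np.asarray(corrected, dtype=float)[window_mask(wavenumbers, flat_windows)]
    return float(np.sqrt(np.mean(flat ** 2)))

def nar(corrected):
    """Negative area divided by total absolute area (uniform spacing assumed)."""
    y = np.asarray(corrected, dtype=float)
    total = np.abs(y).sum()
    return float(np.abs(y[y < 0]).sum() / total) if total > 0 else 0.0

def snr(corrected, wavenumbers, flat_windows, diagnostic_peaks=None):
    """Peak height over flat-zone noise; global maximum if no peaks are given."""
    y = np.asarray(corrected, dtype=float)
    if diagnostic_peaks:
        region = window_mask(wavenumbers, diagnostic_peaks)
    else:
        region = np.ones(y.shape, dtype=bool)
    return float(y[region].max() / rfzn(y, wavenumbers, flat_windows))
```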
- xpectrass.utils.baseline.find_best_baseline_method(rfzn_tbl, nar_tbl, snr_tbl, rfzn_threshold=0.01, nar_threshold=0.05, snr_min=10.0, top_n=5)[source]
Recommend best baseline correction methods based on evaluation metrics.
Analyzes RFZN, NAR, and SNR across all samples to identify methods that consistently perform well. Methods are ranked by a composite score combining all three metrics.
- Parameters:
rfzn_tbl (pd.DataFrame) – RFZN values from evaluate_baseline_correction_methods()
nar_tbl (pd.DataFrame) – NAR values from evaluate_baseline_correction_methods()
snr_tbl (pd.DataFrame) – SNR values from evaluate_baseline_correction_methods()
rfzn_threshold (float, default 0.01) – Maximum acceptable RFZN (lower is better). Default: 0.01 absorbance units.
nar_threshold (float, default 0.05) – Maximum acceptable NAR (lower is better). Default: 0.05 (5%).
snr_min (float, default 10.0) – Minimum acceptable SNR (higher is better). Default: 10.
top_n (int, default 5) – Number of top methods to return.
- Returns:
Ranked methods with columns:
- method: Method name
- median_rfzn: Median RFZN across samples
- median_nar: Median NAR across samples
- median_snr: Median SNR across samples
- pass_rate: Fraction of samples passing all thresholds (0-1)
- composite_score: Weighted score (higher is better)
Sorted by composite_score descending (best methods first).
- Return type:
pd.DataFrame
Notes
- Composite Score Calculation:
Normalizes each metric to [0, 1] range
RFZN: Lower is better (inverted for scoring)
NAR: Lower is better (inverted for scoring)
SNR: Higher is better
Pass rate: Bonus for consistent performance
Composite = (0.3 * RFZN_score) + (0.3 * NAR_score) + (0.3 * SNR_score) + (0.1 * pass_rate)
Example
>>> rfzn, nar, snr = evaluate_baseline_correction_methods(df, flat_windows)
>>> recommendations = find_best_baseline_method(rfzn, nar, snr, top_n=3)
>>> print(recommendations)
   method  median_rfzn  median_nar  median_snr  pass_rate  composite_score
0  airpls       0.0045        0.02        25.3       0.95             0.89
1   arpls       0.0052        0.03        23.1       0.92             0.85
2   drpls       0.0061        0.04        21.7       0.88             0.81
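The composite score formula can be sketched as below. The weights come from the notes above; the use of min-max scaling for the "[0, 1] range" normalization is an assumption, and both function names are hypothetical. Because the scaling depends on which methods are compared, scores from this sketch will generally differ from those printed in the example output.

```python
import numpy as np

def minmax(values):
    """Scale values to [0, 1]; a constant input maps to all ones (assumption)."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.ones_like(v)

def composite_scores(median_rfzn, median_nar, median_snr, pass_rate):
    """Weighted score per method; inputs are per-method medians and pass rates."""
    rfzn_score = 1.0 - minmax(median_rfzn)  # lower RFZN is better, so invert
    nar_score = 1.0 - minmax(median_nar)    # lower NAR is better, so invert
    snr_score = minmax(median_snr)          # higher SNR is better
    return (0.3 * rfzn_score + 0.3 * nar_score + 0.3 * snr_score
            + 0.1 * np.asarray(pass_rate, dtype=float))
```

The four weights sum to 1.0, so a method that is best on every metric and passes all thresholds on every sample scores exactly 1.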
baseline_correction
baseline_correction(
intensities: np.ndarray,
method: str = "airpls",
window_size: int = 101,
poly_order: int = 4,
clip_negative: bool = False,
return_baseline: bool = False,
**kwargs
) -> np.ndarray
baseline_method_names
baseline_method_names() -> List[str]
Returns list of 50+ available baseline correction methods.
denoise
Denoising Module for FTIR Spectral Preprocessing
IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py
Features:
- Single spectrum denoising via denoise()
- Batch DataFrame processing via apply_denoising()
- Automatic column detection and sorting by wavenumber
- Performance optimized for large datasets (vectorized operations)
- Pandas and Polars DataFrame support
- Method evaluation via evaluate_denoising_methods()
- Memory-safe evaluation via evaluate_denoising_methods_safe()
- Composite scoring for method selection via find_best_denoising_method()
Memory Management: For systems with limited RAM or large datasets, use evaluate_denoising_methods_safe() instead of evaluate_denoising_methods(). See MEMORY_MANAGEMENT_GUIDE.md for details.
Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:
import logging
logging.getLogger('utils.denoise').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.denoise').setLevel(logging.ERROR)  # Only errors
Available Methods: Run denoise_method_names() to see all available denoising algorithms. Common methods: savgol, wavelet, moving_average, gaussian, median, whittaker, lowpass
- xpectrass.utils.denoise.denoise(intensities, wavenumbers=None, method='savgol', **kwargs)[source]
Denoise a 1-D FTIR spectrum using various filtering methods.
- Parameters:
intensities (array-like) – Raw intensity values (1-D).
wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). If provided, validates that data is sorted in ascending order and warns if not. Ensures API consistency with other preprocessing modules (baseline, atmospheric).
method (str, default "savgol") – Denoising method. Options:
- ‘savgol’: Savitzky-Golay filter (preserves peak shape)
- ‘wavelet’: Discrete wavelet transform denoising
- ‘moving_average’: Simple moving average
- ‘gaussian’: Gaussian filter
- ‘median’: Median filter (good for spike noise)
- ‘whittaker’: Penalized least squares smoother
- ‘lowpass’: Low-pass Butterworth filter
**kwargs (method-specific parameters) –
- savgol: window_length (15), polyorder (3)
- wavelet: wavelet (‘db4’), level (3), threshold_mode (‘soft’)
- moving_average: window (11)
- gaussian: sigma (2.0)
- median: kernel_size (5)
- whittaker: lam (1e4), d (2)
- lowpass: cutoff (0.1), order (4)
- Returns:
Denoised intensity values. NaN values in input are preserved at their original positions; denoising is applied only to finite values.
- Return type:
np.ndarray
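As an illustration of one of the listed methods, the Whittaker (penalized least squares) smoother with parameters lam and d can be written in a few lines. This is a dense textbook formulation intended for short arrays, not the library's implementation; it does not handle NaN values, and the function name is hypothetical.

```python
import numpy as np

def whittaker_smooth(y, lam=1e4, d=2):
    """Solve (I + lam * D^T D) z = y, with D the d-th order difference matrix."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Rows of D compute d-th order finite differences; shape is (n - d, n).
    D = np.diff(np.eye(n), n=d, axis=0)
    # Larger lam penalizes roughness more strongly, giving a smoother result.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

With d=2 the penalty vanishes for straight lines, so a purely linear input is returned unchanged regardless of lam. For full-length spectra a sparse solver would be the usual choice.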
- xpectrass.utils.denoise.denoise_method_names()[source]
Return list of available denoising method names.
- xpectrass.utils.denoise.apply_denoising(data, method='savgol', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, show_progress=True, **kwargs)[source]
Apply denoising to a DataFrame of FTIR spectra (batch processing).
Works with both pandas and polars DataFrames. Each row is a sample, numerical columns are wavenumbers. Applies denoising to all samples.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).
method (str, default "savgol") – Denoising method. Options:
- ‘savgol’: Savitzky-Golay filter (preserves peak shape)
- ‘wavelet’: Discrete wavelet transform denoising
- ‘moving_average’: Simple moving average
- ‘gaussian’: Gaussian filter
- ‘median’: Median filter (good for spike noise)
- ‘whittaker’: Penalized least squares smoother
- ‘lowpass’: Low-pass Butterworth filter
label_column (str, default "label") – Name of the label/group column to exclude from denoising.
exclude_columns (list[str], optional) – Additional column names to exclude from denoising (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns.
wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default, or auto-expands if no columns found within default range.
wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default, or auto-expands if no columns found within default range.
show_progress (bool, default True) – If True, display a progress bar during processing.
**kwargs (additional parameters) – Method-specific parameters forwarded to the denoising algorithm:
- savgol: window_length (15), polyorder (3)
- wavelet: wavelet (‘db4’), level (3), threshold_mode (‘soft’)
- moving_average: window (11)
- gaussian: sigma (2.0)
- median: kernel_size (5)
- whittaker: lam (1e4), d (2)
- lowpass: cutoff (0.1), order (4)
sample_id_column (str)
- Returns:
Denoised DataFrame (same type as input) with spectral data denoised and metadata columns preserved. Columns are sorted by ascending wavenumber order.
- Return type:
pd.DataFrame | pl.DataFrame
Examples
>>> # Apply Savitzky-Golay denoising to all samples
>>> df_denoised = apply_denoising(df_wide, method="savgol")

>>> # Use wavelet denoising with custom parameters
>>> df_denoised = apply_denoising(
...     df_wide,
...     method="wavelet",
...     wavelet="db4",
...     level=3,
...     threshold_mode="soft"
... )

>>> # Use Gaussian smoothing
>>> df_denoised = apply_denoising(
...     df_wide,
...     method="gaussian",
...     sigma=2.0
... )

>>> # Works with both pandas and polars
>>> df_pd_denoised = apply_denoising(df_pandas)
>>> df_pl_denoised = apply_denoising(df_polars)

>>> # Disable progress bar for cleaner output
>>> df_denoised = apply_denoising(df_wide, show_progress=False)
- xpectrass.utils.denoise.estimate_snr(y_raw, y_denoised, flat_regions=None, wavenumbers=None)[source]
Estimate Signal-to-Noise Ratio improvement (NaN-aware).
- Parameters:
y_raw (np.ndarray) – Original noisy spectrum.
y_denoised (np.ndarray) – Denoised spectrum.
flat_regions (list of tuples, optional) – Regions known to be baseline-only (for noise estimation). Can be either:
- List of (start_idx, end_idx) integer index tuples (legacy)
- List of (wn_min, wn_max) float wavenumber tuples (recommended)
If wavenumbers provided, flat_regions interpreted as wavenumber ranges. If None, uses high-frequency residual estimation.
wavenumbers (np.ndarray, optional) – Wavenumber array. Required if flat_regions specified as wavenumber ranges.
- Returns:
Estimated SNR in dB. Returns np.nan if insufficient finite data.
- Return type:
Notes
Uses NaN-aware statistics to handle missing values in spectra
Wavenumber-based regions (recommended): More robust across different spectral resolutions
Index-based regions (legacy): Faster but resolution-dependent
Examples
>>> # Index-based (legacy)
>>> snr = estimate_snr(y_raw, y_denoised, flat_regions=[(10, 50), (200, 250)])

>>> # Wavenumber-based (recommended)
>>> wn = np.linspace(650, 4000, 1000)
>>> snr = estimate_snr(y_raw, y_denoised, flat_regions=[(2500, 2600), (3200, 3500)], wavenumbers=wn)
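One plausible way to compute a wavenumber-based SNR in dB is sketched below. The signal and noise definitions used here (peak-to-peak range of the denoised spectrum, standard deviation of the removed residual inside flat regions) are assumptions chosen for illustration; the page does not state the formula estimate_snr() uses, and snr_db_sketch is a hypothetical name.

```python
import numpy as np

def snr_db_sketch(y_raw, y_denoised, wavenumbers, flat_regions):
    """NaN-aware SNR estimate in dB from wavenumber-defined flat regions."""
    y_raw = np.asarray(y_raw, dtype=float)
    y_den = np.asarray(y_denoised, dtype=float)
    wn = np.asarray(wavenumbers, dtype=float)
    mask = np.zeros(wn.shape, dtype=bool)
    for lo, hi in flat_regions:
        mask |= (wn >= lo) & (wn <= hi)
    if not mask.any():
        return float("nan")  # no points fall inside the flat regions
    # Noise: spread of the removed component (raw - denoised) in flat regions.
    noise = np.nanstd((y_raw - y_den)[mask])
    # Signal: peak-to-peak range of the denoised spectrum.
    signal = np.nanmax(y_den) - np.nanmin(y_den)
    if not np.isfinite(noise) or noise == 0 or not np.isfinite(signal):
        return float("nan")
    return float(20.0 * np.log10(signal / noise))
```

Selecting regions by wavenumber rather than by index means the same flat_regions list can be reused across instruments with different point spacing.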
- xpectrass.utils.denoise.evaluate_denoising_methods_safe(data, methods=None, **kwargs)[source]
Memory-safe wrapper for evaluate_denoising_methods() with conservative defaults.
This function automatically sets safe defaults to prevent memory issues:
- n_samples=50 (instead of all samples)
- n_jobs=2 (instead of all CPU cores)
- methods=['savgol', 'gaussian', 'median'] if not specified
Use this function for initial exploration, then switch to full evaluate_denoising_methods() with custom parameters if needed.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Spectral DataFrame
methods (list of str, optional) – Denoising methods to test. If None, uses [‘savgol’, ‘gaussian’, ‘median’].
**kwargs (additional parameters) – Forwarded to evaluate_denoising_methods(). Note that n_samples and n_jobs will be overridden to safe defaults unless explicitly provided.
- Returns:
Evaluation results with columns: sample, method, snr_db, smoothness, fidelity, time_ms
- Return type:
pd.DataFrame
Examples
>>> # Safe evaluation (won't cause memory issues)
>>> results = evaluate_denoising_methods_safe(df)
>>> recommendations = find_best_denoising_method(results)

>>> # With custom methods but still safe
>>> results = evaluate_denoising_methods_safe(df, methods=['savgol', 'wavelet'])
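The "safe defaults unless explicitly provided" behaviour is a common wrapper pattern. The sketch below shows only how the arguments could be assembled before being forwarded; with_safe_defaults is a hypothetical helper and does not call into xpectrass.

```python
def with_safe_defaults(methods=None, **kwargs):
    """Build the argument dict a memory-conscious wrapper might forward."""
    if methods is None:
        methods = ["savgol", "gaussian", "median"]
    # setdefault only fills a key when the caller did not pass it explicitly.
    kwargs.setdefault("n_samples", 50)
    kwargs.setdefault("n_jobs", 2)
    return {"methods": methods, **kwargs}
```

Calling with_safe_defaults(n_jobs=4) keeps the caller's n_jobs while still limiting n_samples to 50.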
- xpectrass.utils.denoise.evaluate_denoising_methods(data, methods=None, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, n_samples=None, sample_selection='random', random_state=None, n_jobs=-1)[source]
Compare denoising methods on a subset of spectra.
MEMORY WARNING: Parallel processing can consume significant memory. For systems with <16 GB RAM or large datasets (>1000 samples):
- Set n_jobs=2 (not -1)
- Set n_samples=50 (not None)
- Test with methods=['savgol', 'gaussian'] first
See MEMORY_MANAGEMENT_GUIDE.md for detailed recommendations.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Wide-format spectral DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).
methods (list of str, optional) – Methods to evaluate. If None, evaluates all available methods. Available: ‘savgol’, ‘wavelet’, ‘moving_average’, ‘gaussian’, ‘median’, ‘whittaker’, ‘lowpass’. Recommendation: Start with 2-3 methods to test memory usage.
label_column (str, default "label") – Name of the label/group column to exclude from evaluation.
exclude_columns (list[str], optional) – Additional column names to exclude from evaluation (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns.
wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default.
wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default.
n_samples (int, optional) – Number of samples to evaluate. If None, uses all samples.
sample_selection (str, default "random") – How to select samples: “random”, “first”, or “last”.
random_state (int, optional) – Random seed for reproducibility when sample_selection=”random”.
n_jobs (int, default -1) – Number of parallel jobs. -1 uses all CPU cores.
sample_id_column (str)
- Returns:
Evaluation metrics for each (sample, method) combination with columns:
- sample: Sample identifier
- method: Denoising method name
- snr_db: Signal-to-noise ratio improvement (dB)
- smoothness: Inverse of 2nd derivative variance (higher = smoother)
- fidelity: Correlation with original signal (0-1, higher = better)
- time_ms: Computation time in milliseconds (for performance comparison)
- Return type:
pd.DataFrame
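The smoothness and fidelity columns can be approximated from their one-line descriptions as shown below. The second derivative is taken as a second-order finite difference and fidelity as the Pearson correlation; both are reasonable readings of the text rather than confirmed library formulas, and the function names are hypothetical.

```python
import numpy as np

def smoothness(y_denoised):
    """Inverse variance of the second difference (higher = smoother)."""
    d2 = np.diff(np.asarray(y_denoised, dtype=float), n=2)
    var = np.var(d2)
    return float(1.0 / var) if var > 0 else float("inf")

def fidelity(y_raw, y_denoised):
    """Pearson correlation between the raw and denoised spectra."""
    return float(np.corrcoef(y_raw, y_denoised)[0, 1])
```

The two metrics pull in opposite directions: heavy smoothing raises smoothness but eventually lowers fidelity as real peaks are flattened, which is why the evaluation reports both.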
- xpectrass.utils.denoise.plot_denoising_evaluation(eval_df, metrics=None, figsize=(14, 5), show_mean_sd=True, save_plot=None, save_path=None)[source]
Plot evaluation metrics from evaluate_denoising_methods as box plots.
Creates box plots for each metric (SNR, smoothness, fidelity) across different denoising methods to help select the best method.
- Parameters:
eval_df (pd.DataFrame) – Output from evaluate_denoising_methods() with columns: [‘sample’, ‘method’, ‘snr_db’, ‘smoothness’, ‘fidelity’]
metrics (list of str, optional) – Metrics to plot. If None, plots all three: [‘snr_db’, ‘smoothness’, ‘fidelity’] Available: ‘snr_db’, ‘smoothness’, ‘fidelity’
figsize (tuple, default (14, 5)) – Figure size (width, height)
show_mean_sd (bool, default True) – If True, overlay mean ± SD on box plots
save_path (str, optional) – If provided, save the figure to this path (e.g., ‘denoising_eval.pdf’)
save_plot (bool | None)
- Returns:
Displays matplotlib figure
- Return type:
None
Examples
>>> # Evaluate methods
>>> eval_results = evaluate_denoising_methods(df_wide, n_samples=50)
>>>
>>> # Plot all metrics
>>> plot_denoising_evaluation(eval_results)
>>>
>>> # Plot only SNR and fidelity
>>> plot_denoising_evaluation(eval_results, metrics=['snr_db', 'fidelity'])
>>>
>>> # Save to file
>>> plot_denoising_evaluation(eval_results, save_path='denoise_eval.pdf')
- xpectrass.utils.denoise.plot_denoising_evaluation_summary(eval_df, figsize=(10, 6), save_plot=None, save_path=None)[source]
Create a summary table showing mean ± SD for all metrics across methods.
Displays a colored heatmap-style visualization to quickly identify the best performing denoising methods.
- Parameters:
- Return type:
None
Examples
>>> eval_results = evaluate_denoising_methods(df_wide, n_samples=50)
>>> plot_denoising_evaluation_summary(eval_results)
- xpectrass.utils.denoise.find_best_denoising_method(eval_df, snr_min=10.0, smoothness_min=1000.0, fidelity_min=0.9, time_max_ms=100.0, top_n=5)[source]
Recommend best denoising methods based on evaluation metrics.
Analyzes SNR, smoothness, fidelity, and computation time across all samples to identify methods that consistently perform well. Methods are ranked by a composite score combining all metrics.
- Parameters:
eval_df (pd.DataFrame) – Output from evaluate_denoising_methods() with columns: [‘sample’, ‘method’, ‘snr_db’, ‘smoothness’, ‘fidelity’, ‘time_ms’]
snr_min (float, default 10.0) – Minimum acceptable SNR in dB (higher is better).
smoothness_min (float, default 1e3) – Minimum acceptable smoothness (higher is better).
fidelity_min (float, default 0.9) – Minimum acceptable fidelity correlation (0-1, higher is better).
time_max_ms (float, default 100.0) – Maximum acceptable computation time in milliseconds (lower is better).
top_n (int, default 5) – Number of top methods to return.
- Returns:
Ranked methods with columns:
- method: Method name
- median_snr_db: Median SNR across samples
- median_smoothness: Median smoothness across samples
- median_fidelity: Median fidelity across samples
- median_time_ms: Median computation time across samples
- pass_rate: Fraction of samples passing all thresholds (0-1)
- composite_score: Weighted score (higher is better)
Sorted by composite_score descending (best methods first).
- Return type:
pd.DataFrame
Notes
- Composite Score Calculation:
Normalizes each metric to [0, 1] range
SNR: Higher is better
Smoothness: Higher is better
Fidelity: Higher is better
Time: Lower is better (inverted for scoring)
Pass rate: Bonus for consistent performance
Composite = (0.3 * SNR_score) + (0.25 * smoothness_score) + (0.3 * fidelity_score) + (0.05 * time_score) + (0.1 * pass_rate)
Example
>>> eval_results = evaluate_denoising_methods(df, n_samples=50)
>>> recommendations = find_best_denoising_method(eval_results, top_n=3)
>>> print(recommendations)
     method  median_snr_db  median_smoothness  median_fidelity  median_time_ms  pass_rate  composite_score
0    savgol           18.5              2.5e4            0.985            12.3       0.94             0.87
1   wavelet           16.2              1.8e4            0.972            45.2       0.88             0.82
2  gaussian           15.8              2.1e4            0.968             8.5       0.86             0.80
- xpectrass.utils.denoise.plot_denoising_comparison(y_raw, wavenumbers, methods=None, sample_name='', figsize=(12, 8))[source]
Plot comparison of multiple denoising methods.
denoise
denoise(
intensities: np.ndarray,
method: str = "savgol",
**kwargs
) -> np.ndarray
Methods: savgol, wavelet, moving_average, gaussian, median, whittaker, lowpass
normalization
Normalization Module for FTIR Spectral Preprocessing
IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py.
Provides multiple normalization methods for FTIR spectra including SNV, vector, area, min-max, and peak normalization. Also includes mean centering for PCA/PLS preparation.
- xpectrass.utils.normalization.normalize(intensities, wavenumbers=None, method='snv', **kwargs)[source]
Normalize a 1-D FTIR spectrum.
- Parameters:
intensities (array-like) – Intensity values (1-D).
wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). Required for peak_wavenumber and adaptive_regional methods.
method (str, default "snv") –
Normalization method. Options:
- ‘snv’: Standard Normal Variate (mean=0, std=1 within spectrum)
- ‘vector’: L2 vector normalization (unit length)
- ‘minmax’: Min-Max scaling to [0, 1]
- ‘area’: Area normalization (total area = 1)
- ‘peak’: Normalize by peak intensity
- ‘range’: Normalize by intensity range
- ‘max’: Normalize by maximum value
- ‘detrend’: Polynomial detrending
- ‘snv_detrend’: SNV followed by detrending
Novel methods (1D-compatible):
- ‘robust_snv’: Robust SNV using median/MAD
- ‘curvature_weighted’: Curvature-weighted normalization
- ‘peak_envelope’: Peak envelope normalization
- ‘entropy_weighted’: Entropy-weighted normalization
- ‘pqn’: Probabilistic quotient normalization
- ‘total_variation’: Total variation normalization
- ‘spectral_moments’: Spectral moment normalization
- ‘adaptive_regional’: Adaptive regional normalization (requires wavenumbers)
- ‘derivative_ratio’: Derivative ratio normalization
- ‘signal_to_baseline’: Signal-to-baseline ratio normalization
**kwargs (method-specific parameters) –
- peak: peak_idx (index), peak_wavenumber (float, requires wavenumbers), peak_value (float), use_absolute (bool)
- minmax: feature_range (tuple, default (0, 1))
- detrend: order (int, default 1)
- adaptive_regional: regions (list of tuples), method_per_region (dict)
- Returns:
Normalized intensity values.
- Return type:
np.ndarray
Notes
Methods that require multiple spectra (2D data) are NOT available here:
- ‘mean_center’: Use mean_center() directly on 2D array
- ‘auto_scale’: Use auto_scale() directly on 2D array
- ‘pareto’: Use pareto_scale() directly on 2D array
These methods compute column-wise statistics and should be applied via normalize_df() for batch processing.
PQN (Probabilistic Quotient Normalization):
- For single spectrum WITH reference: Provide the reference kwarg for true PQN. Example: normalize(y, method='pqn', reference=ref_spectrum)
- For single spectrum WITHOUT reference: Falls back to median scaling (NOT true PQN!). A warning will be issued in this case.
- For batch PQN: Use normalize_df() instead (auto-computes reference from dataset).
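As an illustration of what the three most common row-wise options compute, here is a minimal NumPy sketch. The helper names (snv_sketch, vector_sketch, minmax_sketch) are hypothetical and are not part of xpectrass; the use of the population standard deviation (ddof=0) in SNV is an assumption, and flat spectra (zero std or zero range) are not handled.

```python
import numpy as np


def snv_sketch(y):
    """Standard Normal Variate: zero mean and unit std within one spectrum."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()  # ddof=0 is an assumption


def vector_sketch(y):
    """L2 vector normalization: scale the spectrum to unit Euclidean length."""
    y = np.asarray(y, dtype=float)
    return y / np.linalg.norm(y)


def minmax_sketch(y, feature_range=(0.0, 1.0)):
    """Min-max scaling of one spectrum into feature_range."""
    y = np.asarray(y, dtype=float)
    lo, hi = feature_range
    scaled = (y - y.min()) / (y.max() - y.min())
    return lo + scaled * (hi - lo)


spectrum = np.array([0.10, 0.40, 0.90, 0.30, 0.20])  # made-up absorbance values
y_snv = snv_sketch(spectrum)
y_vec = vector_sketch(spectrum)
y_mm = minmax_sketch(spectrum)
```

All three operate on a single spectrum, which is why they are available through normalize() while the column-wise methods are not.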
- xpectrass.utils.normalization.normalize_method_names()[source]
Return list of available 1D normalization method names.
Note: Methods requiring 2D data (mean_center, auto_scale, pareto) are not included. Use normalize_df() for batch processing with those methods.
- xpectrass.utils.normalization.normalize_df(data, method='snv', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, show_progress=True, **kwargs)[source]
Normalize multiple spectra (DataFrame or numpy array).
Works with both pandas and polars DataFrames, or numpy arrays. For DataFrames: each row is a sample, numerical columns are wavenumbers. For numpy arrays: shape (n_samples, n_wavenumbers).
- Parameters:
data (pd.DataFrame | pl.DataFrame | np.ndarray) – Wide-format DataFrame where rows = samples, columns = wavenumbers, OR numpy array of shape (n_samples, n_wavenumbers).
method (str, default "snv") –
Normalization method. Options:
Single-spectrum methods (row-wise, 1D):
- ‘snv’: Standard Normal Variate (mean=0, std=1 within each spectrum)
- ‘vector’: L2 vector normalization (unit length)
- ‘minmax’: Min-Max scaling to [0, 1]
- ‘area’: Area normalization (total area = 1)
- ‘peak’: Normalize by peak intensity
- ‘range’: Normalize by intensity range
- ‘max’: Normalize by maximum value
- ‘detrend’: Polynomial detrending
- ‘snv_detrend’: SNV followed by detrending
- Plus all novel methods (robust_snv, curvature_weighted, etc.)
Multi-spectrum methods (column-wise, 2D - for PCA/PLS prep):
- ‘mean_center’: Column-wise mean centering (mean=0 per wavenumber). Requires ≥2 samples.
- ‘auto_scale’: Column-wise auto-scaling (mean=0, std=1 per wavenumber). Requires ≥2 samples.
- ‘pareto’: Column-wise Pareto scaling (mean=0, scaled by sqrt(std)). Requires ≥2 samples.
Dataset-level methods (require reference from entire dataset):
- ‘pqn’: Probabilistic Quotient Normalization (computes reference spectrum from dataset median/mean, then normalizes each spectrum relative to it). Requires ≥3 samples.
label_column (str, default "label") – Name of the label/group column to exclude from normalization. Only used for DataFrame inputs.
exclude_columns (list[str], optional) – Additional column names to exclude from normalization (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns. Only used for DataFrame inputs.
wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default, or auto-expands if no columns found within default range.
wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default, or auto-expands if no columns found within default range.
show_progress (bool, default True) – If True, display a progress bar during processing. Only used for DataFrame inputs.
**kwargs (method-specific parameters) –
- peak: peak_idx (index) or peak_wavenumber (requires wavenumbers array)
- minmax: feature_range (tuple, default (0, 1))
- detrend: order (int, default 1)
- pqn: reference_type (str, default ‘median’) - ‘median’ or ‘mean’ for computing reference spectrum from dataset
sample_id_column (str)
- Returns:
Normalized data (same type as input).
- Return type:
pd.DataFrame | pl.DataFrame | np.ndarray
- Raises:
ValueError – If using 2D methods (mean_center, auto_scale, pareto) with <2 samples. If using PQN with <3 samples.
Examples
>>> # Normalize DataFrame with SNV
>>> df_norm = normalize_df(df_wide, method="snv")
>>>
>>> # Normalize with vector normalization
>>> df_norm = normalize_df(df_wide, method="vector")
>>>
>>> # Normalize numpy array
>>> spectra_norm = normalize_df(spectra_array, method="snv")
>>>
>>> # Min-max normalization to [0, 1]
>>> df_norm = normalize_df(df_wide, method="minmax", feature_range=(0, 1))
>>>
>>> # PQN with median reference (proper batch PQN)
>>> df_norm = normalize_df(df_wide, method="pqn", reference_type="median")
>>>
>>> # PQN with mean reference
>>> df_norm = normalize_df(df_wide, method="pqn", reference_type="mean")
>>>
>>> # Disable progress bar
>>> df_norm = normalize_df(df_wide, method="snv", show_progress=False)
- xpectrass.utils.normalization.mean_center(spectra, axis=0, return_mean=False)[source]
Mean-center spectra (essential preprocessing for PCA/PLS).
- Parameters:
spectra (np.ndarray, shape (n_samples, n_wavenumbers)) – Matrix of spectra.
axis (int, default 0) – Axis along which to compute mean.
- 0: Column-wise (feature/wavenumber centering) - standard for PCA
- 1: Row-wise (sample centering)
return_mean (bool, default False) – If True, return the mean array for later reconstruction.
- Returns:
centered (np.ndarray) – Mean-centered spectra.
mean (np.ndarray, optional) – Mean values (returned if return_mean=True).
- Return type:
- xpectrass.utils.normalization.auto_scale(spectra, return_params=False)[source]
Auto-scaling (mean centering + unit variance scaling).
Each variable (wavenumber) is scaled to have mean=0 and std=1. Common preprocessing for PCA/PLS when variables have different scales.
- Parameters:
spectra (np.ndarray, shape (n_samples, n_wavenumbers)) – Matrix of spectra.
return_params (bool, default False) – If True, return mean and std for reconstruction.
- Returns:
scaled (np.ndarray) – Auto-scaled spectra.
mean (np.ndarray, optional)
std (np.ndarray, optional)
- Return type:
- xpectrass.utils.normalization.pareto_scale(spectra, return_params=False)[source]
Pareto scaling (mean centering + sqrt(std) scaling).
Less aggressive than auto-scaling; preserves more of the original data structure. Good for spectral data.
- Parameters:
spectra (np.ndarray, shape (n_samples, n_wavenumbers)) – Matrix of spectra.
return_params (bool, default False) – If True, return mean and std for reconstruction.
- Returns:
scaled – Pareto-scaled spectra.
- Return type:
np.ndarray
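The three column-wise scalings (mean_center, auto_scale, pareto_scale) differ only in the divisor applied after centering. The NumPy sketch below shows that relationship on a tiny made-up matrix. It is not the library code, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

# Three hypothetical spectra (rows) measured at three wavenumbers (columns).
spectra = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 14.0, 140.0],
    [3.0, 18.0, 180.0],
])

col_mean = spectra.mean(axis=0)
col_std = spectra.std(axis=0, ddof=1)  # ddof=1 is an assumption

centered = spectra - col_mean                # mean centering
auto_scaled = centered / col_std             # unit variance per wavenumber
pareto_scaled = centered / np.sqrt(col_std)  # milder: divide by sqrt(std)

# Keeping col_mean and col_std allows the original data to be reconstructed,
# which is the purpose of the return_mean / return_params flags.
restored = auto_scaled * col_std + col_mean
```

After Pareto scaling each column keeps a standard deviation of sqrt(std) instead of 1, so intense bands still carry more weight than weak ones, only less so than in the raw data.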
- xpectrass.utils.normalization.detrend(intensities, order=1, wavenumbers=None)[source]
Remove polynomial trend from spectrum.
Often used after SNV to remove residual slope.
- Parameters:
intensities (np.ndarray) – 1-D spectrum.
order (int, default 1) – Polynomial order (1 = linear detrending).
wavenumbers (np.ndarray, optional) – Wavenumber array for physical slope calculation. If provided, polynomial fit uses actual wavenumber values (cm⁻¹). If None, uses array indices (0, 1, 2, …).
- Returns:
Detrended spectrum.
- Return type:
np.ndarray
Notes
With wavenumbers: Fit uses physical x-axis (cm⁻¹), yielding physically meaningful slope coefficients. Important when comparing spectra on different grids or after resampling.
Without wavenumbers: Fit uses indices (0, 1, 2, …), which is grid-dependent. Slope coefficients are in units of intensity/index.
Examples
>>> # Index-based detrending (grid-dependent)
>>> detrended = detrend(spectrum)
>>>
>>> # Physical detrending (grid-independent)
>>> detrended = detrend(spectrum, wavenumbers=wn, order=1)
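Polynomial detrending of this kind can be sketched with np.polyfit and np.polyval: fit a polynomial of the requested order and subtract it. The function name detrend_sketch is hypothetical, and the library's fitting routine may differ.

```python
import numpy as np


def detrend_sketch(y, order=1, x=None):
    """Subtract a least-squares polynomial of the given order from y."""
    y = np.asarray(y, dtype=float)
    if x is None:
        # No wavenumbers given: fall back to array indices 0, 1, 2, ...
        x = np.arange(y.size, dtype=float)
    coeffs = np.polyfit(x, y, order)
    return y - np.polyval(coeffs, x)


wn = np.linspace(400.0, 4000.0, 200)
peak = np.exp(-0.5 * ((wn - 1500.0) / 50.0) ** 2)
sloped = peak + 1e-4 * wn + 0.2                    # peak on a tilted baseline

flat_physical = detrend_sketch(sloped, order=1, x=wn)
flat_index = detrend_sketch(sloped, order=1)
line_only = detrend_sketch(1e-4 * wn + 0.2, order=1, x=wn)
```

On a uniformly spaced grid the index-based and wavenumber-based fits remove the same trend and differ only in the units of the fitted coefficients. The detrended results diverge once the grid is non-uniform.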
- xpectrass.utils.normalization.snv_detrend(intensities, detrend_order=1, wavenumbers=None)[source]
SNV followed by detrending.
Common combined preprocessing for scatter correction.
- Parameters:
intensities (np.ndarray) – 1-D spectrum.
detrend_order (int, default 1) – Polynomial order for detrending (1 = linear).
wavenumbers (np.ndarray, optional) – Wavenumber array for physical slope calculation in detrending. If provided, uses actual wavenumber values (cm⁻¹). If None, uses array indices.
- Returns:
SNV-normalized and detrended spectrum.
- Return type:
np.ndarray
- xpectrass.utils.normalization.normalize_curvature_weighted(y, sigma=3.0, min_weight=0.01)[source]
Normalize with weights proportional to local curvature (2nd derivative).
Physical motivation: In FT-IR, peaks have high curvature while baseline regions are flat. This method emphasizes peak regions during normalization, making it more representative of actual chemical information.
The normalization factor is the curvature-weighted L2 norm.
- Parameters:
- Returns:
Curvature-weighted normalized spectrum.
- Return type:
np.ndarray
Notes
Handles flat spectra (zero curvature) by falling back to uniform weighting.
- xpectrass.utils.normalization.normalize_peak_envelope(y, percentile=95, window_size=50)[source]
Normalize by the upper envelope of the spectrum.
Physical motivation: The upper envelope represents the maximum signal level across the spectrum, accounting for varying peak densities and intensities. This is more representative than single-peak normalization.
Works on absolute values to handle negative/baseline-corrected spectra.
- Parameters:
- Returns:
Envelope-normalized spectrum.
- Return type:
np.ndarray
Notes
Uses absolute values to avoid NaN issues with negative/baseline-corrected spectra.
- xpectrass.utils.normalization.normalize_entropy_weighted(y, n_bins=50, window_size=30, epsilon=1e-10)[source]
Normalize with weights based on local spectral entropy.
Motivation: Regions with high entropy (high variability/information) should contribute more to normalization. Flat baseline regions have low entropy and contribute less.
Local entropy is computed in sliding windows using histogram-based probability estimation.
- xpectrass.utils.normalization.normalize_total_variation(y, order=1)[source]
Normalize by total variation (sum of absolute differences).
Physical motivation: Total variation captures the “roughness” or total signal content independent of baseline offset. It’s related to the first derivative energy and is baseline-invariant.
TV = sum(|y[i+1] - y[i]|) for first order
- Parameters:
y (np.ndarray) – 1-D spectrum.
order (int, default 1) – Order of differences (1 = first derivative, 2 = second derivative).
- Returns:
TV-normalized spectrum.
- Return type:
np.ndarray
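A minimal sketch of the first-order formula, assuming the normalization simply divides the spectrum by its total variation (the source does not spell out the final step). The name total_variation is illustrative only.

```python
import numpy as np


def total_variation(y, order=1):
    """Sum of absolute order-th differences of y."""
    return float(np.sum(np.abs(np.diff(y, n=order))))


y = np.array([0.0, 1.0, 3.0, 2.0, 2.0])
tv = total_variation(y)              # |1| + |2| + |-1| + |0|
y_tv = y / tv                        # assumed normalization step

# Adding a constant offset leaves the total variation unchanged.
tv_offset = total_variation(y + 5.0)
```

The divisor is offset-invariant, but the normalized spectrum still carries any offset present in y, so baseline correction beforehand remains relevant.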
- xpectrass.utils.normalization.normalize_spectral_moments(y, moment_order=2, use_central=True)[source]
Normalize using spectral moments.
Physical motivation: Higher-order moments capture distribution characteristics beyond simple mean/variance. The nth moment emphasizes larger deviations, useful for peak-rich spectra.
- xpectrass.utils.normalization.normalize_adaptive_regional(y, wavenumbers, regions=None, method_per_region=None)[source]
Apply different normalization to different spectral regions.
Physical motivation: Different FT-IR regions have different characteristics:
- 3600-2800 cm⁻¹: O-H, N-H, C-H stretching (often intense)
- 1800-1500 cm⁻¹: C=O, C=C, amide bands
- 1500-400 cm⁻¹: Fingerprint region (complex)
Each region may benefit from different normalization.
- Parameters:
y (np.ndarray) – 1-D spectrum.
wavenumbers (np.ndarray, optional) – Wavenumber axis. Required for this method.
regions (list of tuples, optional) – [(start1, end1), (start2, end2), …] defining regions. Default: standard FT-IR regions.
method_per_region (dict, optional) – {“region_idx”: “method_name”} mapping. Default: SNV for all regions.
- Returns:
Regionally-normalized spectrum.
- Return type:
np.ndarray
- Raises:
ValueError – If wavenumbers is None.
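The idea can be sketched as a loop over wavenumber masks, applying SNV inside each region. This is an illustrative reading, not the library's code: the helper name regional_snv is hypothetical, half-open intervals are an assumption made here so that shared boundaries such as 1500 cm⁻¹ belong to exactly one region, and points outside every region are left untouched.

```python
import numpy as np


def regional_snv(y, wavenumbers, regions):
    """Apply SNV separately inside each (start, end) wavenumber region."""
    y = np.asarray(y, dtype=float)
    wn = np.asarray(wavenumbers, dtype=float)
    out = y.copy()
    for start, end in regions:
        lo, hi = min(start, end), max(start, end)
        mask = (wn >= lo) & (wn < hi)  # half-open: an assumption
        if mask.sum() > 1:
            seg = y[mask]
            out[mask] = (seg - seg.mean()) / seg.std()
    return out


wn = np.linspace(400.0, 4000.0, 361)                 # 10 cm^-1 grid
y = 0.5 + 0.3 * np.sin(wn / 200.0)                   # synthetic spectrum
regions = [(3600, 2800), (1800, 1500), (1500, 400)]  # regions listed above
y_reg = regional_snv(y, wn, regions)
```

Normalizing regions independently introduces discontinuities at region boundaries, which matters if the output is later differentiated or plotted as one continuous trace.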
- xpectrass.utils.normalization.normalize_derivative_ratio(y, sigma=2.0)[source]
Normalize using the ratio of derivative energies.
Physical motivation: The ratio of second to first derivative energy characterizes peak sharpness independent of intensity. This provides baseline-independent normalization.
- Parameters:
y (np.ndarray) – 1-D spectrum.
sigma (float) – Smoothing parameter before derivative computation.
- Returns:
Derivative-ratio normalized spectrum.
- Return type:
np.ndarray
- xpectrass.utils.normalization.normalize_signal_to_baseline(y, baseline_percentile=10, signal_percentile=90)[source]
Normalize by the ratio of signal to baseline levels.
Physical motivation: This separates the “signal” (peaks) from “baseline” (background) and normalizes by their contrast. Useful when baseline levels vary between samples.
- xpectrass.utils.normalization.normalize_robust_snv(y, consistency_correction=True, epsilon=1e-10)[source]
Robust Standard Normal Variate using median and MAD.
Traditional SNV uses mean and std, which are sensitive to:
- Baseline artifacts (shifts the mean)
- Outlier peaks (inflates std)
- Asymmetric intensity distributions
RSNV uses median (robust center) and MAD (robust scale).
Formula: (x - median(x)) / MAD(x)
- Parameters:
y (np.ndarray) – 1-D spectrum.
consistency_correction (bool, default True) – If True, scale MAD by 1.4826 to be consistent with std for normal data.
epsilon (float, default 1e-10) – Small value added to MAD to avoid division by zero and prevent zero vectors (which cause issues with cosine-based methods).
- Returns:
Robustly normalized spectrum.
- Return type:
np.ndarray
Reference
Novel method - combines robust statistics with scatter correction.
Notes
When MAD is very small or zero (flat spectrum), epsilon prevents returning a zero vector, which would cause errors in downstream methods that use cosine similarity (e.g., clustering, PCA with cosine kernel).
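The formula can be sketched directly in NumPy. The function name robust_snv_sketch is hypothetical, and adding epsilon to the (optionally corrected) MAD is an assumption about where the guard is applied.

```python
import numpy as np


def robust_snv_sketch(y, consistency_correction=True, epsilon=1e-10):
    """(y - median(y)) / MAD(y), with an epsilon guard on the divisor."""
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    mad = np.median(np.abs(y - med))
    if consistency_correction:
        mad = mad * 1.4826  # makes MAD comparable to std for normal data
    return (y - med) / (mad + epsilon)


# One large outlier barely moves the median or the MAD.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_rsnv = robust_snv_sketch(y, consistency_correction=False)

# More than half the points equal: MAD is exactly zero, epsilon keeps it finite.
y_degenerate = robust_snv_sketch(np.array([1.0, 1.0, 1.0, 1.0, 5.0]))
```

In the first example the median is 3 and the MAD is 1, so the four ordinary points map to roughly -2, -1, 0 and 1 while the outlier stays far away. Classical SNV would instead compress the ordinary points toward each other.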
- xpectrass.utils.normalization.normalize_pqn(y, reference=None, reference_type='median')[source]
Probabilistic Quotient Normalization (PQN) for FT-IR spectra.
IMPORTANT: True PQN requires a reference spectrum from your dataset. For single-spectrum processing without a reference, this function performs “median scaling” (dividing by median intensity), which is NOT true PQN. Use normalize_df() for proper batch PQN.
Originally from metabolomics (Dieterle et al., 2006), PQN uses median fold-change relative to a reference spectrum, which is robust to varying numbers/intensities of peaks.
Physical motivation: Accounts for dilution effects and path length variations without being dominated by major peaks.
- Parameters:
y (np.ndarray) – 1-D spectrum.
reference (np.ndarray, optional) –
Reference spectrum from your dataset.
- If provided: Performs true PQN using this reference
- If None: Falls back to “median scaling” (divides by median of positive values). This is NOT true PQN!
reference_type (str, default "median") – Currently unused. Reserved for future use in batch processing via normalize_df() where reference can be computed from dataset.
- Returns:
Normalized spectrum.
- Return type:
np.ndarray
Notes
- Single spectrum (reference=None):
Returns: y / median(y[y > 0])
This is “median scaling”, not true PQN.
- With reference spectrum:
Compute quotients: q[i] = y[i] / reference[i] (where both > 0)
Normalization factor: median(q)
Returns: y / median(q)
- For batch processing:
Use normalize_df() with method='pqn' to automatically compute a reference spectrum (median or mean) from your dataset.
References
Dieterle et al. (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Anal Chem 78(13):4281-90.
Examples
>>> # Single spectrum: median scaling (NOT true PQN)
>>> y_scaled = normalize_pqn(spectrum)  # reference=None
>>>
>>> # True PQN: provide reference from dataset
>>> ref = np.median(all_spectra, axis=0)  # median spectrum
>>> y_pqn = normalize_pqn(spectrum, reference=ref)
>>>
>>> # Batch PQN: use normalize_df() instead
>>> df_pqn = normalize_df(df, method='pqn')
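The two branches described in the Notes can be written out as a short NumPy sketch. The name pqn_sketch is hypothetical; the logic follows the steps listed above (quotients where both values are positive, then division by the median quotient) and omits the warning the library is said to issue.

```python
import numpy as np


def pqn_sketch(y, reference=None):
    """PQN against a reference, or median scaling when no reference is given."""
    y = np.asarray(y, dtype=float)
    if reference is None:
        # Fallback: median scaling, which is not true PQN.
        return y / np.median(y[y > 0])
    reference = np.asarray(reference, dtype=float)
    valid = (y > 0) & (reference > 0)
    quotients = y[valid] / reference[valid]
    return y / np.median(quotients)


reference = np.array([1.0, 2.0, 4.0, 2.0, 1.0])  # made-up reference spectrum
diluted = 0.5 * reference                        # same sample at half strength

y_pqn = pqn_sketch(diluted, reference=reference)
y_med = pqn_sketch(diluted)
```

With the reference supplied, the dilution factor of 0.5 is recovered as the median quotient and the original intensities are restored. Without it, the spectrum is only rescaled by its own median.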
- xpectrass.utils.normalization.normalize_novel(y, method='robust_snv', **kwargs)[source]
Apply novel normalization method.
- Parameters:
y (np.ndarray) – 1-D spectrum.
method (str) – One of: ‘robust_snv’, ‘curvature’, ‘envelope’, ‘entropy’, ‘pqn’, ‘total_variation’, ‘moments’, ‘derivative_ratio’, ‘signal_baseline’
**kwargs (method-specific parameters)
- Returns:
Normalized spectrum.
- Return type:
np.ndarray
- xpectrass.utils.normalization.novel_normalize_method_names()[source]
Return list of novel normalization method names.
- Return type:
normalize
normalize(intensities: np.ndarray, method: str = "snv", **kwargs) -> np.ndarray
Methods: snv, vector, minmax, area, peak, range, max
mean_center
mean_center(
spectra: np.ndarray,
axis: int = 0,
return_mean: bool = False
) -> np.ndarray | Tuple[np.ndarray, np.ndarray]
auto_scale
auto_scale(spectra: np.ndarray, return_params: bool = False) -> np.ndarray | Tuple[np.ndarray, np.ndarray, np.ndarray]
atmospheric
Atmospheric Correction Module for FTIR Spectral Preprocessing
Corrects for CO₂ and H₂O vapor interference in FTIR spectra that result from atmospheric absorption during measurement.
IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py.
Default atmospheric regions:
- CO₂: 2300-2400 cm⁻¹ (asymmetric stretch) and 650-690 cm⁻¹ (bending)
- H₂O: 1350-1900 cm⁻¹ (bending) and 3550-3900 cm⁻¹ (stretching)
Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:
import logging
logging.getLogger('utils.atmospheric').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.atmospheric').setLevel(logging.ERROR)  # Only errors
Auto-Detection: Use auto_detect=True in atmospheric_correction() to automatically check for atmospheric interference and receive warnings if detected.
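To make the default ‘interpolate’ method concrete, the sketch below replaces one atmospheric region with a straight line drawn between the nearest points outside it. This is a simplified illustration under stated assumptions: the helper name interpolate_region is hypothetical, wavenumbers are assumed ascending (np.interp requires it), and the library may anchor the line differently, for example on averaged boundary windows.

```python
import numpy as np


def interpolate_region(intensities, wavenumbers, region):
    """Replace values inside region with a linear bridge from outside points."""
    y = np.asarray(intensities, dtype=float).copy()
    wn = np.asarray(wavenumbers, dtype=float)
    lo, hi = region
    inside = (wn >= lo) & (wn <= hi)
    # np.interp needs ascending x values; this is assumed here.
    y[inside] = np.interp(wn[inside], wn[~inside], y[~inside])
    return y


wn = np.linspace(2000.0, 2700.0, 71)                      # 10 cm^-1 grid
baseline = 0.001 * wn                                     # synthetic sample signal
co2 = 0.3 * np.exp(-0.5 * ((wn - 2350.0) / 15.0) ** 2)    # synthetic CO2 band
raw = baseline + co2

corrected = interpolate_region(raw, wn, (2300.0, 2400.0))
```

Any genuine sample absorption inside the bridged region is lost along with the CO₂ band, which is why the ranges are configurable.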
- xpectrass.utils.atmospheric.atmospheric_correction_spectrum(intensities, wavenumbers, method='interpolate', co2_ranges=None, h2o_ranges=None, reference_spectrum=None, **kwargs)[source]
Public wrapper for correcting a single spectrum (numpy arrays).
- xpectrass.utils.atmospheric.identify_atmospheric_features(intensities, wavenumbers, threshold=0.1)[source]
Check for presence of atmospheric interference.
- Parameters:
intensities (ndarray) – Spectral intensity values
wavenumbers (ndarray) – Wavenumber grid (cm⁻¹)
threshold (float) – Sensitivity threshold as a fraction of total spectral variation. Higher values (e.g., 0.2) reduce false positives but may miss weak interference. Lower values (e.g., 0.05) are more sensitive but may flag noise. Default of 0.1 works well for typical FTIR spectra.
- Returns:
Dictionary with ‘co2_detected’, ‘h2o_detected’ (bool), and ‘recommendations’ (list)
- Return type:
- xpectrass.utils.atmospheric.exclude_and_interpolate_spectrum(intensities, wavenumbers, exclude_ranges=None, interpolate_ranges=None, method='interpolate', reference_spectrum=None, **kwargs)[source]
Public wrapper for excluding/interpolating regions on a single spectrum.
- Parameters:
intensities (ndarray) – Spectral intensity values
wavenumbers (ndarray) – Wavenumber grid (cm⁻¹)
exclude_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to physically remove from output
interpolate_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to process using specified method
method (str) – Method for interpolate_ranges (‘interpolate’, ‘spline’, ‘reference’, ‘zero’, ‘exclude’)
reference_spectrum (ndarray | None) – Reference spectrum for ‘reference’ method
**kwargs – Additional arguments for specific methods (e.g., reference_scale)
- Returns:
Tuple of (corrected_intensities, corrected_wavenumbers)
- Return type:
- xpectrass.utils.atmospheric.atmospheric_correction(data, method='interpolate', co2_ranges=None, h2o_ranges=None, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, auto_detect=False, **kwargs)[source]
Apply atmospheric correction to a DataFrame of FTIR spectra.
Works with both pandas and polars DataFrames. Each row is a sample, numerical columns are wavenumbers. Applies correction to all samples.
- NaN Handling:
All correction methods robustly handle NaN values in spectral data:
- ‘interpolate’/‘spline’: Filters NaN from boundary regions; sets atmospheric regions to NaN if insufficient finite boundary data exists
- ‘reference’: Filters NaN before computing scale factor
- ‘zero’: Uses nanmean for baselines; skips regions with all-NaN boundaries
- ‘exclude’: Marks regions as NaN
- Performance:
Optimized for large datasets using:
- Vectorized numpy array access (no DataFrame.loc overhead)
- Pre-allocated output arrays (no list appending)
- Progress tracking via tqdm
For maximum performance on very large datasets (100k+ spectra), consider using atmospheric_correction_spectrum() on extracted numpy arrays directly.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Input DataFrame (pandas or polars)
method (str) – Correction method (‘interpolate’, ‘spline’, ‘reference’, ‘zero’, ‘exclude’)
co2_ranges (List[Tuple[float, float]] | None) – Custom CO₂ regions to correct (default: standard FTIR regions)
h2o_ranges (List[Tuple[float, float]] | None) – Custom H₂O regions to correct (default: standard FTIR regions)
label_column (str) – Name of label/metadata column to preserve
exclude_columns (List[str] | None) – Additional columns to exclude from processing
wn_min (float | None) – Minimum wavenumber for column detection (default: auto-detect)
wn_max (float | None) – Maximum wavenumber for column detection (default: auto-detect)
auto_detect (bool) – If True, automatically check first spectrum for atmospheric interference and warn if detected but no custom ranges provided (default: False)
**kwargs – Additional arguments passed to correction methods
sample_id_column (str)
- Returns:
Corrected DataFrame in same format as input (columns sorted by ascending wavenumber)
- Return type:
pd.DataFrame | pl.DataFrame
Warning
Warns if spectral columns are reordered during processing
Warns if wavenumber bounds are auto-expanded
Warns if auto_detect=True and atmospheric interference detected without custom ranges
- xpectrass.utils.atmospheric.exclude_and_interpolate_regions(data, exclude_ranges=None, interpolate_ranges=None, method='interpolate', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, reference_spectrum=None, **kwargs)[source]
Exclude wavenumber ranges and process interpolate regions for DataFrame of spectra.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Input DataFrame (pandas or polars)
exclude_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to physically remove from output
interpolate_ranges (List[Tuple[float, float]] | None) – Wavenumber ranges to process using specified method
method (str) – Method for interpolate_ranges (‘interpolate’, ‘spline’, ‘reference’, ‘zero’, ‘exclude’)
label_column (str) – Name of label/metadata column to preserve
exclude_columns (List[str] | None) – Additional columns to exclude from processing
wn_min (float | None) – Minimum wavenumber for column detection (default: auto-detect)
wn_max (float | None) – Maximum wavenumber for column detection (default: auto-detect)
reference_spectrum (ndarray | None) – Reference spectrum for ‘reference’ method
**kwargs – Additional arguments for specific methods (e.g., reference_scale)
sample_id_column (str)
- Returns:
Processed DataFrame in same format as input
- Return type:
pd.DataFrame | pl.DataFrame
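The difference between the two range arguments is that exclude_ranges shortens the output while interpolate_ranges keeps its length. The exclusion half can be sketched on a single spectrum as a boolean mask. The helper name drop_ranges is hypothetical, and inclusive range bounds are an assumption.

```python
import numpy as np


def drop_ranges(intensities, wavenumbers, exclude_ranges):
    """Physically remove points whose wavenumber falls in any excluded range."""
    y = np.asarray(intensities, dtype=float)
    wn = np.asarray(wavenumbers, dtype=float)
    keep = np.ones(wn.shape, dtype=bool)
    for lo, hi in exclude_ranges:
        keep &= ~((wn >= lo) & (wn <= hi))  # inclusive bounds: an assumption
    return y[keep], wn[keep]


wn = np.arange(400.0, 4001.0, 100.0)   # 37 points, 100 cm^-1 apart
y = np.linspace(0.0, 1.0, wn.size)     # placeholder intensities

y_kept, wn_kept = drop_ranges(y, wn, [(2300.0, 2400.0)])
```

Because the wavenumber axis changes, the shortened axis has to be returned alongside the intensities. In the DataFrame version the same effect shows up as dropped columns.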
atmospheric_correction
atmospheric_correction(
intensities: np.ndarray,
wavenumbers: np.ndarray,
method: str = "interpolate",
co2_range: Tuple[float, float] = (2300, 2400),
h2o_ranges: List[Tuple[float, float]] = [(1350, 1900), (3550, 3900)],
**kwargs
) -> np.ndarray
Methods: interpolate, spline, reference, zero, exclude
derivatives
Spectral Derivatives Module for FTIR Preprocessing
Compute smoothed spectral derivatives for resolution enhancement and baseline removal.
- xpectrass.utils.derivatives.spectral_derivative(intensities, order=1, window_length=15, polyorder=3, delta=1.0)[source]
Compute smoothed spectral derivative using Savitzky-Golay.
- Parameters:
intensities (np.ndarray) – 1-D intensity array.
order (int, default 1) – Derivative order (1 = first derivative, 2 = second derivative).
window_length (int, default 15) – Savitzky-Golay window length (must be odd).
polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter.
delta (float, default 1.0) – Spacing between samples (affects derivative scaling). Important for FT-IR: Set this to the actual wavenumber spacing (e.g., median of np.diff(wavenumbers)) to get physically meaningful derivatives in units of dI/d(cm⁻¹). Default of 1.0 assumes unit spacing.
- Returns:
Derivative spectrum.
- Return type:
np.ndarray
Notes
1st derivative: Resolves overlapping peaks, removes constant baseline
2nd derivative: Sharpens peaks, removes linear baseline
Higher derivatives increase noise; adjust window_length accordingly
Warning
Input must be 1-D array (will raise ValueError if multi-dimensional)
Large derivative orders may trigger automatic window expansion (logged as warning)
Examples
>>> import numpy as np
>>> wn = np.linspace(400, 4000, 1000)
>>> intensities = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
>>>
>>> # Correct: specify delta for physical units
>>> delta = np.median(np.diff(wn))
>>> deriv = spectral_derivative(intensities, order=1, delta=delta)
>>>
>>> # Warning: default delta=1.0 gives index-based units
>>> deriv = spectral_derivative(intensities, order=1)  # Not recommended for FT-IR
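The effect of delta can be checked independently with SciPy's savgol_filter, which takes the same window_length, polyorder and delta arguments. Whether spectral_derivative() calls this routine internally is an assumption; the sketch only shows how the scaling behaves.

```python
import numpy as np
from scipy.signal import savgol_filter

wn = np.linspace(400.0, 4000.0, 1000)
y = np.exp(-0.5 * ((wn - 1500.0) / 100.0) ** 2)
delta = float(np.median(np.diff(wn)))  # about 3.6 cm^-1 per point

# Index-based derivative: change in intensity per grid point.
d_index = savgol_filter(y, window_length=15, polyorder=3, deriv=1)

# Physical derivative: change in intensity per cm^-1.
d_phys = savgol_filter(y, window_length=15, polyorder=3, deriv=1, delta=delta)

# Analytic first derivative of the Gaussian, for comparison.
analytic = -(wn - 1500.0) / 100.0**2 * y
```

For a first derivative the two results differ by exactly a factor of delta. For order n the factor is delta**n, so second derivatives computed with the default delta=1.0 are off by delta squared.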
- xpectrass.utils.derivatives.first_derivative(intensities, window_length=15, polyorder=3)[source]
Compute first derivative.
Benefits:
- Removes constant baseline offset
- Resolves overlapping bands
- Enhances small spectral differences
- xpectrass.utils.derivatives.second_derivative(intensities, window_length=15, polyorder=4)[source]
Compute second derivative.
Benefits:
- Removes linear baseline
- Sharpens peaks (negative peaks in output)
- Heavily used in FTIR for band identification
Note: Peaks appear as negative minima in 2nd derivative.
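Both properties, the sign flip at peak positions and the removal of a linear baseline, can be seen on a synthetic band. The sketch uses plain finite differences via np.gradient as a stand-in for the Savitzky-Golay filter, so it is not what second_derivative() computes numerically, but the qualitative behaviour is the same.

```python
import numpy as np

wn = np.linspace(400.0, 4000.0, 1000)
band = np.exp(-0.5 * ((wn - 1500.0) / 100.0) ** 2)  # Gaussian band at 1500 cm^-1
y = band + 1e-4 * wn + 0.2                          # plus a linear baseline

# Finite-difference second derivative (stand-in for Savitzky-Golay).
d2 = np.gradient(np.gradient(y, wn), wn)

min_idx = int(np.argmin(d2))
peak_position = wn[min_idx]  # location of the most negative curvature
far_from_band = d2[-50]      # region containing only the linear baseline
```

The most negative value of d2 sits at the band centre, and away from the band the second derivative is numerically zero even though the baseline there is tilted.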
- xpectrass.utils.derivatives.gap_derivative(intensities, gap=5, segment=5, delta=1.0, pad_mode='edge')[source]
Norris-Williams gap derivative.
Averages points on either side of a gap, then takes difference. More noise-resistant than point-to-point derivatives.
- Parameters:
intensities (np.ndarray) – 1-D spectrum.
gap (int, default 5) – Gap size (number of points to skip). Will be cast to int if float provided.
segment (int, default 5) – Number of points to average on each side. Will be cast to int if float provided.
delta (float, default 1.0) – Spacing between consecutive points (e.g., wavenumber spacing in cm⁻¹). The derivative is divided by (gap + segment) * delta to get proper units. For FT-IR with uniform 1 cm⁻¹ spacing, delta=1.0 is correct.
pad_mode (str, default 'edge') – Padding mode for edges. Options:
- ‘edge’: Replicate edge values (default, simple but may create plateaus)
- ‘constant’: Pad with zeros (pure approach, no artifacts)
- None: Return unpadded array (length reduced by gap + 2*segment - 1)
- Returns:
Gap derivative. If pad_mode is None, array is shorter than input by (gap + 2*segment - 1). Otherwise, same length as input.
- Return type:
np.ndarray
Notes
Padding with ‘edge’ mode replicates edge values, which can introduce artificial plateaus at spectrum ends. For pure results, use pad_mode=None or ‘constant’.
For FT-IR spectra, edges (<600 cm⁻¹, >4000 cm⁻¹) often have noise, so ‘edge’ padding is usually acceptable.
The derivative is scaled by (gap + segment) * delta to approximate dI/d(wavenumber). For uniform grids, this gives physically meaningful units.
Examples
>>> import numpy as np
>>> wn = np.linspace(400, 4000, 1000)
>>> y = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
>>>
>>> # With padding (same length as input)
>>> deriv = gap_derivative(y, gap=5, segment=5, pad_mode='edge')
>>> len(deriv) == len(y)  # True
>>>
>>> # Without padding (pure derivative, shorter)
>>> deriv = gap_derivative(y, gap=5, segment=5, pad_mode=None)
>>> len(deriv) == len(y) - 14  # True (lost gap + 2*segment - 1 points)
>>>
>>> # With delta scaling for physical units
>>> delta = np.median(np.diff(wn))
>>> deriv = gap_derivative(y, gap=5, segment=5, delta=delta)
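The unpadded case (pad_mode=None) can be reproduced in a few lines of NumPy, which also shows where the (gap + segment) * delta divisor and the lost gap + 2*segment - 1 points come from. This is a sketch of the technique as described above, with a hypothetical function name. It is not the library's implementation and it omits padding.

```python
import numpy as np


def gap_derivative_sketch(y, gap=5, segment=5, delta=1.0):
    """Unpadded Norris-Williams gap derivative."""
    y = np.asarray(y, dtype=float)
    # Moving average over `segment` points; output length is n - segment + 1.
    seg_mean = np.convolve(y, np.ones(segment) / segment, mode="valid")
    # Centres of the left and right segments are gap + segment points apart.
    shift = gap + segment
    diff = seg_mean[shift:] - seg_mean[:-shift]
    return diff / (shift * delta)


wn = np.linspace(400.0, 4000.0, 1000)
delta = float(np.median(np.diff(wn)))
ramp = 2.0 * wn + 1.0  # straight line with slope 2 per cm^-1

deriv = gap_derivative_sketch(ramp, gap=5, segment=5, delta=delta)
```

On a straight line the result equals the slope everywhere, which confirms that dividing by (gap + segment) * delta yields units of intensity per cm⁻¹.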
- xpectrass.utils.derivatives.derivative_with_smoothing(intensities, order=1, smooth_window=11, deriv_window=15, smooth_polyorder=3, deriv_polyorder=None, delta=1.0, smooth_first=True)[source]
Apply derivative with separate smoothing control.
This function allows independent control over smoothing and derivative computation, useful for very noisy data or when you want to optimize smoothing separately from derivative calculation.
- Parameters:
intensities (np.ndarray) – 1-D spectrum.
order (int, default 1) – Derivative order.
smooth_window (int, default 11) – Window length for initial smoothing step (if smooth_first=True). Must be odd and > smooth_polyorder.
deriv_window (int, default 15) – Window length for derivative calculation. Must be odd and > deriv_polyorder.
smooth_polyorder (int, default 3) – Polynomial order for smoothing step (Savitzky-Golay). Must be less than smooth_window.
deriv_polyorder (int, optional) – Polynomial order for derivative step. If None, defaults to order + 1 (minimum required for the derivative order).
delta (float, default 1.0) – Spacing between samples (affects derivative scaling). Important for FT-IR: Set to actual wavenumber spacing for physically meaningful derivatives in dI/d(cm⁻¹).
smooth_first (bool, default True) – If True, smooth before taking derivative (recommended). If False, differentiate first then smooth (may distort peak shapes and amplify noise before smoothing—use with caution).
- Returns:
Derivative spectrum.
- Return type:
np.ndarray
Warning
Setting smooth_first=False differentiates noisy data before smoothing, which can amplify noise and distort spectral features. Generally not recommended unless you have specific scientific reasons.
High smooth_polyorder with small smooth_window may under-smooth data.
Notes
For most applications, use spectral_derivative() which combines smoothing and differentiation optimally in a single step.
Use this function only when you need separate control over smoothing and derivative parameters (e.g., aggressive pre-smoothing for very noisy data).
The two-step approach (smooth → derivative) may introduce edge artifacts at both ends of the spectrum.
Examples
>>> import numpy as np
>>> wn = np.linspace(400, 4000, 1000)
>>> y = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
>>> y_noisy = y + np.random.normal(0, 0.01, len(y))
>>>
>>> # Recommended: smooth first (default)
>>> delta = np.median(np.diff(wn))
>>> deriv = derivative_with_smoothing(
...     y_noisy,
...     order=1,
...     smooth_window=25,  # Heavy smoothing
...     deriv_window=15,
...     delta=delta,
...     smooth_first=True
... )
>>>
>>> # Not recommended: differentiate noisy data first
>>> deriv = derivative_with_smoothing(
...     y_noisy,
...     order=1,
...     smooth_first=False  # Amplifies noise before smoothing
... )
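To make the smooth-then-differentiate order concrete, here is a minimal sketch built on `scipy.signal.savgol_filter`. It assumes both steps are Savitzky-Golay filters, as the parameter descriptions suggest; the helper name `smooth_then_differentiate` is hypothetical and this is not the library's own code.

```python
import numpy as np
from scipy.signal import savgol_filter


def smooth_then_differentiate(y, order=1, smooth_window=11, deriv_window=15,
                              smooth_polyorder=3, deriv_polyorder=None,
                              delta=1.0):
    """Two-step sketch: Savitzky-Golay smoothing, then a Savitzky-Golay derivative."""
    if deriv_polyorder is None:
        deriv_polyorder = order + 1  # mirrors the documented default
    smoothed = savgol_filter(y, smooth_window, smooth_polyorder)
    return savgol_filter(smoothed, deriv_window, deriv_polyorder,
                         deriv=order, delta=delta)


rng = np.random.default_rng(0)
wn = np.linspace(400, 4000, 1000)
clean = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)
noisy = clean + rng.normal(0, 0.01, wn.size)
delta = np.median(np.diff(wn))  # wavenumber spacing in cm^-1

deriv = smooth_then_differentiate(noisy, smooth_window=25, delta=delta)
# Exact dI/d(cm^-1) of the noise-free Gaussian, for comparison
analytic = -(wn - 1500) / 100 ** 2 * clean
```

Passing the real wavenumber spacing as `delta` is what puts `deriv` on the same scale as `analytic`; with `delta=1.0` the result would be per sample index instead.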
- xpectrass.utils.derivatives.derivative_batch(data, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, order=1, window_length=15, polyorder=3, delta=1.0, show_progress=True)[source]
Compute spectral derivatives for multiple spectra (DataFrame or numpy array).
Works with both pandas and polars DataFrames, or numpy arrays. For DataFrames: each row is a sample, numerical columns are wavenumbers. For numpy arrays: shape (n_samples, n_wavenumbers).
- Parameters:
data (pd.DataFrame | pl.DataFrame | np.ndarray) – Wide-format DataFrame where rows = samples, columns = wavenumbers, OR numpy array of shape (n_samples, n_wavenumbers).
label_column (str, default "label") – Name of the label/group column to exclude from derivative computation. Only used for DataFrame inputs.
exclude_columns (list[str], optional) – Additional column names to exclude from derivative computation (e.g., ‘sample’, ‘id’). If None, automatically excludes non-numeric columns. Only used for DataFrame inputs.
wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default, or auto-expands if no columns found within default range.
wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default, or auto-expands if no columns found within default range.
order (int, default 1) – Derivative order: - 1: First derivative (resolves overlapping peaks, removes constant baseline) - 2: Second derivative (sharpens peaks, removes linear baseline) - 3+: Higher derivatives (increases noise sensitivity)
window_length (int, default 15) – Savitzky-Golay filter window length (must be odd). Larger values = more smoothing but less detail.
polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter. Must be less than window_length.
delta (float, default 1.0) – Spacing between samples (affects derivative scaling). For DataFrame inputs, this parameter is automatically computed from wavenumber spacing and the provided value is ignored. For numpy array inputs, uses the provided delta value.
show_progress (bool, default True) – If True, display a progress bar during processing. Only used for DataFrame inputs.
sample_id_column (str)
- Returns:
Derivative spectra (same type as input).
- Return type:
pd.DataFrame | pl.DataFrame | np.ndarray
Examples
>>> # First derivative of DataFrame
>>> df_d1 = derivative_batch(df_wide, order=1)
>>>
>>> # Second derivative with larger smoothing window
>>> df_d2 = derivative_batch(df_wide, order=2, window_length=21)
>>>
>>> # Third derivative (highly sensitive to noise)
>>> df_d3 = derivative_batch(df_wide, order=3, window_length=25, polyorder=4)
>>>
>>> # Numpy array processing (legacy)
>>> spectra_d1 = derivative_batch(spectra_array, order=1)
>>>
>>> # Disable progress bar
>>> df_d1 = derivative_batch(df_wide, order=1, show_progress=False)
Notes
1st derivative: Removes constant baseline, enhances spectral differences
2nd derivative: Removes linear baseline, sharpens peaks (peaks appear as negative minima)
Higher derivatives amplify noise; increase window_length for smoother results
Savitzky-Golay filtering preserves peak shapes better than simple numerical derivatives
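The baseline statements in these notes can be checked numerically. The sketch below uses `scipy.signal.savgol_filter` directly on a synthetic Gaussian band (the data and the helper `sg` are illustrative assumptions, not part of the module).

```python
import numpy as np
from scipy.signal import savgol_filter

wn = np.linspace(400, 4000, 1000)
delta = wn[1] - wn[0]
peak = np.exp(-0.5 * ((wn - 1500) / 100) ** 2)

offset = peak + 0.3                       # constant baseline added
sloped = peak + 0.3 + 1e-4 * (wn - 400)   # linear baseline added


def sg(y, order):
    """Savitzky-Golay derivative with the documented default window and polyorder."""
    return savgol_filter(y, window_length=15, polyorder=3, deriv=order, delta=delta)


d1_clean, d1_offset = sg(peak, 1), sg(offset, 1)   # first derivative ignores the offset
d2_clean, d2_sloped = sg(peak, 2), sg(sloped, 2)   # second derivative ignores the slope
peak_index = int(np.argmin(d2_clean))              # band centre shows up as a minimum
```

Here `d1_offset` matches `d1_clean` and `d2_sloped` matches `d2_clean` to floating-point precision, and `wn[peak_index]` falls at the band centre near 1500 cm⁻¹.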
- xpectrass.utils.derivatives.plot_derivatives(data, label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, orders=[0, 1, 2], sample=None, wavenumbers=None, window_length=15, polyorder=3, figsize=(10, 8), invert_x=True)[source]
Plot spectrum and its derivatives for DataFrame or numpy array.
Works with both pandas/polars DataFrames and numpy arrays. For DataFrames: automatically extracts wavenumbers from column names. For numpy arrays: wavenumbers parameter is required.
- Parameters:
data (pd.DataFrame | pl.DataFrame | np.ndarray) – Wide-format DataFrame (rows=samples, columns=wavenumbers) OR 1-D numpy array of intensities.
label_column (str, default "label") – Name of the label/group column to exclude. Only used for DataFrame inputs.
exclude_columns (list[str], optional) – Additional column names to exclude (e.g., ‘sample’, ‘id’). Only used for DataFrame inputs.
wn_min (float, optional) – Minimum wavenumber bound (cm⁻¹). If None, uses 200.0 cm⁻¹ as default.
wn_max (float, optional) – Maximum wavenumber bound (cm⁻¹). If None, uses 8000.0 cm⁻¹ as default.
orders (list of int, default [0, 1, 2]) – Derivative orders to plot: - 0: Original spectrum - 1: First derivative - 2: Second derivative - 3+: Higher derivatives
sample (str | int, optional) – For DataFrames: sample name (index) to plot. If None, plots the first sample. For numpy arrays: ignored (plots the provided array).
wavenumbers (np.ndarray, optional) – Wavenumber axis. Required only for numpy array input. For DataFrames, automatically extracted from column names.
window_length (int, default 15) – Savitzky-Golay window length for derivative computation.
polyorder (int, default 3) – Polynomial order for Savitzky-Golay filter.
figsize (tuple, default (10, 8)) – Figure size (width, height).
invert_x (bool, default True) – If True, invert x-axis (higher wavenumbers on left).
sample_id_column (str)
- Return type:
None
Examples
>>> # Plot derivatives from DataFrame
>>> plot_derivatives(df_wide, orders=[0, 1, 2], sample="PP225")
>>>
>>> # Plot original and 2nd derivative only
>>> plot_derivatives(df_wide, orders=[0, 2], sample="HDPE1")
>>>
>>> # Plot from first sample (default)
>>> plot_derivatives(df_wide, orders=[0, 1, 2, 3])
>>>
>>> # Plot from numpy array
>>> plot_derivatives(spectrum, wavenumbers=wn_array, orders=[0, 1, 2])
>>>
>>> # Custom window for smoother derivatives
>>> plot_derivatives(df_wide, sample="PET1", window_length=25, polyorder=4)
Notes
1st derivative: Removes constant baseline, enhances differences
2nd derivative: Removes linear baseline, sharpens peaks (negative minima)
Higher orders increase noise sensitivity; use larger window_length
spectral_derivative
spectral_derivative(
intensities: np.ndarray,
order: int = 1,
window_length: int = 15,
polyorder: int = 3,
delta: float = 1.0
) -> np.ndarray
first_derivative / second_derivative
first_derivative(intensities, window_length=15, polyorder=3) -> np.ndarray
second_derivative(intensities, window_length=15, polyorder=4) -> np.ndarray
scatter_correction
Scatter Correction Module for FTIR Spectral Preprocessing
Provides multiplicative scatter correction (MSC), extended MSC (EMSC), and related methods for correcting light scattering effects.
IMPORTANT: This module expects absorbance data (AU), not transmittance (%). Convert transmittance to absorbance first using convert_spectra() from trans_abs.py
Features: - Single spectrum correction via scatter_correction() - Batch DataFrame processing via apply_scatter_correction() - Automatic column detection and sorting by wavenumber - Performance optimized for large datasets (vectorized operations) - Pandas and Polars DataFrame support
Logging: This module uses Python’s logging module for warnings and informational messages. Configure the logger to control output:
import logging
logging.getLogger('utils.scatter_correction').setLevel(logging.INFO)   # Show all messages
logging.getLogger('utils.scatter_correction').setLevel(logging.ERROR)  # Only errors
Available Methods: Run scatter_method_names() to see all available correction methods. Common methods: msc, emsc, snv, snv_detrend
- xpectrass.utils.scatter_correction.scatter_correction(intensities, wavenumbers=None, method='msc', reference=None, **kwargs)[source]
Apply scatter correction to a single FTIR spectrum.
- Parameters:
intensities (array-like) – Raw intensity values (1-D). Absorbance data (AU), not transmittance (%).
wavenumbers (array-like, optional) – X-axis values (wavenumbers in cm⁻¹). Accepted for API consistency with other preprocessing modules (baseline, denoise, atmospheric). Not used in the correction itself; used only to validate data integrity.
method (str, default "msc") – Correction method: - ‘msc’: Multiplicative Scatter Correction - ‘emsc’: Extended MSC (includes polynomial baseline terms) - ‘snv’: Standard Normal Variate (per-spectrum normalization) - ‘snv_detrend’: SNV followed by polynomial detrending
reference (np.ndarray, optional) – Reference spectrum for MSC/EMSC. Must have the same length as intensities. If None, MSC and EMSC cannot be applied; for batch processing with an automatically computed reference, use apply_scatter_correction().
**kwargs (method-specific parameters) – emsc: poly_order (default 2); snv_detrend: detrend_order (default 1)
- Returns:
Scatter-corrected intensity values. NaN values in input are preserved at their original positions; correction is applied only to finite values.
- Return type:
np.ndarray
- Raises:
ValueError – If method requires reference spectrum but none provided, or if reference length doesn’t match intensities.
Notes
- NaN Handling:
If input contains NaN values, they are preserved in output
Correction is computed only on finite values
If all values are NaN, returns array of NaN
- Methods requiring reference (msc, emsc):
For single spectrum, reference must be provided explicitly
For batch processing, use apply_scatter_correction() which computes mean reference automatically
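As a rough illustration of why MSC needs a reference while SNV does not, the sketch below implements the textbook forms of both with NumPy. The function names are hypothetical, the sample standard deviation (`ddof=1`) in SNV is an assumption, and the module's own implementation may differ in detail.

```python
import numpy as np


def msc_sketch(spectrum, reference):
    """Textbook MSC: fit spectrum ≈ a + b * reference, then undo offset and scale."""
    b, a = np.polyfit(reference, spectrum, deg=1)  # returns slope, intercept
    return (spectrum - a) / b


def snv_sketch(spectrum):
    """Textbook SNV: centre and scale a spectrum by its own mean and std."""
    return (spectrum - spectrum.mean()) / spectrum.std(ddof=1)  # ddof=1 is an assumption


wn = np.linspace(400, 4000, 500)
reference = np.exp(-0.5 * ((wn - 1700) / 80) ** 2)
scattered = 0.2 + 1.5 * reference        # additive offset plus multiplicative scatter

recovered = msc_sketch(scattered, reference)
normalised = snv_sketch(scattered)
```

Because `scattered` is an exact affine transform of `reference`, `recovered` reproduces the reference; SNV only standardises the spectrum against itself, so no reference is involved.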
- xpectrass.utils.scatter_correction.scatter_method_names()[source]
Return list of available scatter correction method names.
- xpectrass.utils.scatter_correction.apply_scatter_correction(data, method='msc', label_column='label', sample_id_column='sample_id', exclude_columns=None, wn_min=None, wn_max=None, reference=None, show_progress=True, **kwargs)[source]
Apply scatter correction to a DataFrame of FTIR spectra (batch processing).
Works with both pandas and polars DataFrames. Each row is a sample, numerical columns are wavenumbers. Applies scatter correction to all samples.
- Parameters:
data (pd.DataFrame | pl.DataFrame) – Wide-format DataFrame where rows = samples, columns = wavenumbers. Should contain numerical columns with spectral data and optional metadata columns (e.g., ‘sample’, ‘label’).
method (str, default "msc") – Scatter correction method. Options: - ‘msc’: Multiplicative Scatter Correction - ‘emsc’: Extended MSC (includes polynomial baseline terms) - ‘snv’: Standard Normal Variate (per-spectrum normalization) - ‘snv_detrend’: SNV followed by polynomial detrending
label_column (str, default "label") – Name of the label/group column to exclude from correction.
exclude_columns (list[str], optional) – Additional column names to exclude from correction (e.g., ‘sample’, ‘id’).
wn_min (float, optional) – Minimum wavenumber for column detection (default: 200.0 cm⁻¹). Columns with wavenumbers below this value will be excluded.
wn_max (float, optional) – Maximum wavenumber for column detection (default: 8000.0 cm⁻¹). Columns with wavenumbers above this value will be excluded.
reference (np.ndarray, optional) – Reference spectrum for MSC/EMSC. If None, uses mean of all spectra. Must match the length of spectral columns.
show_progress (bool, default True) – If True, display a progress bar during processing.
**kwargs (additional parameters) – Method-specific parameters: - emsc: poly_order (default 2) - snv_detrend: detrend_order (default 1)
sample_id_column (str)
- Returns:
pd.DataFrame | pl.DataFrame – Scatter-corrected DataFrame (same type as input) with spectral data corrected and metadata columns preserved. Output columns are sorted by ascending wavenumber for standardization.
NaN Handling
Robustly handles NaN (missing) values in spectral data:
- NaN values are preserved in output at their original positions
- Correction is computed only on finite values
- If an entire spectrum is NaN, it remains as NaN
- For MSC/EMSC, reference spectrum is computed from finite values only
Performance
Optimized for large datasets using:
- Robust wavenumber column detection (parses column names, not dtype)
- Automatic column sorting to ensure monotonic wavenumber order
- Vectorized numpy array access (no DataFrame.loc overhead)
- Pre-allocated output arrays (no dynamic list appending)
- Progress tracking via tqdm
- Return type:
pd.DataFrame | pl.DataFrame
Examples
>>> # Apply MSC scatter correction to all samples
>>> df_corrected = apply_scatter_correction(df_wide, method="msc")
>>>
>>> # Use EMSC with custom polynomial order
>>> df_corrected = apply_scatter_correction(
...     df_wide,
...     method="emsc",
...     poly_order=3
... )
>>>
>>> # Use SNV (no reference needed)
>>> df_corrected = apply_scatter_correction(df_wide, method="snv")
>>>
>>> # Works with both pandas and polars
>>> df_pd_corrected = apply_scatter_correction(df_pandas)
>>> df_pl_corrected = apply_scatter_correction(df_polars)
>>>
>>> # Disable progress bar for cleaner output
>>> df_corrected = apply_scatter_correction(df_wide, show_progress=False)
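The NaN handling described for this function can be pictured with a small NumPy sketch: compute the correction on finite values only, write the results back at the same positions, and build a mean reference that ignores NaN. The helper `snv_nan_safe` is hypothetical and uses SNV purely as an example; it is not the module's implementation.

```python
import numpy as np


def snv_nan_safe(spectrum):
    """SNV computed on finite values only; NaN positions stay NaN (illustrative)."""
    spectrum = np.asarray(spectrum, dtype=float)
    out = np.full_like(spectrum, np.nan)
    finite = np.isfinite(spectrum)
    if finite.sum() < 2:              # all-NaN (or single-point) spectrum: leave as NaN
        return out
    values = spectrum[finite]
    out[finite] = (values - values.mean()) / values.std(ddof=1)
    return out


spectra = np.array([
    [1.0, 2.0, np.nan, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [np.nan, np.nan, np.nan, np.nan],
])

corrected = np.vstack([snv_nan_safe(row) for row in spectra])
# Mean reference from finite values only, as would be needed for MSC/EMSC
reference = np.nanmean(spectra, axis=0)
```

In `corrected`, the missing point in the first row and the entire third row remain NaN, while `reference` is averaged over whichever rows have a finite value in each column.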
- xpectrass.utils.scatter_correction.msc_single(spectrum, reference)[source]
Apply MSC to a single spectrum and return coefficients.
Deprecated: Use scatter_correction() with method='msc' instead.
This function is retained for backward compatibility only.
scatter_correction
scatter_correction(
spectra: np.ndarray, # (n_samples, n_wavenumbers)
method: str = "msc",
reference: np.ndarray = None,
**kwargs
) -> np.ndarray
Methods: msc, emsc, snv, snv_detrend
region_selection
Region Selection Module for FTIR Spectral Preprocessing
Provides utilities for selecting, excluding, and extracting spectral regions based on wavenumber ranges.
- xpectrass.utils.region_selection.get_region_names()[source]
Return list of predefined region names.
- xpectrass.utils.region_selection.get_region_range(name)[source]
Get wavenumber range for a named region.
- xpectrass.utils.region_selection.select_region(df, regions)[source]
Select spectral regions by wavenumber ranges.
- Parameters:
- Returns:
DataFrame with only selected wavenumber columns.
- Return type:
pl.DataFrame
Examples
>>> select_region(df, (400, 1500))  # Fingerprint region
>>> select_region(df, 'ch_stretch')  # Named region
>>> select_region(df, [(400, 1500), (2800, 3100)])  # Multiple regions
- xpectrass.utils.region_selection.exclude_regions(df, regions)[source]
Exclude spectral regions (opposite of select_region).
- xpectrass.utils.region_selection.exclude_atmospheric(df)[source]
Convenience function to exclude atmospheric interference regions.
Excludes CO2 (2300-2400 cm⁻¹) and H2O (1350-1900, 3550-3900 cm⁻¹).
- Parameters:
df (DataFrame)
- Return type:
DataFrame
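For plain arrays, the same exclusion can be sketched with a boolean mask over the wavenumber axis. The ranges below are the ones listed above; the helper name `exclude_ranges_np` and the inclusive treatment of range endpoints are assumptions.

```python
import numpy as np

# CO2 and H2O interference ranges in cm^-1, as listed for exclude_atmospheric()
ATMOSPHERIC = [(2300, 2400), (1350, 1900), (3550, 3900)]


def exclude_ranges_np(intensities, wavenumbers, ranges=ATMOSPHERIC):
    """Drop every point whose wavenumber falls inside any (lo, hi) range (inclusive)."""
    keep = np.ones(wavenumbers.shape, dtype=bool)
    for lo, hi in ranges:
        keep &= ~((wavenumbers >= lo) & (wavenumbers <= hi))
    return intensities[keep], wavenumbers[keep]


wn = np.arange(400.0, 4001.0, 50.0)   # 73 points from 400 to 4000 cm^-1
y = np.ones_like(wn)
y_kept, wn_kept = exclude_ranges_np(y, wn)
```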
- xpectrass.utils.region_selection.select_region_np(intensities, wavenumbers, start, end)[source]
Select region from 1-D spectrum arrays.
- xpectrass.utils.region_selection.select_regions_np(intensities, wavenumbers, regions)[source]
Select multiple regions from 1-D spectrum arrays.
Regions are concatenated in the order provided.
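A minimal sketch of multi-region selection on 1-D arrays, showing that the output follows the order of the `regions` argument rather than ascending wavenumber. The helper name and the inclusive bounds are assumptions, not the module's code.

```python
import numpy as np


def select_regions_sketch(intensities, wavenumbers, regions):
    """Concatenate the points of each (start, end) region in the order given."""
    parts_y, parts_wn = [], []
    for start, end in regions:
        lo, hi = min(start, end), max(start, end)
        mask = (wavenumbers >= lo) & (wavenumbers <= hi)
        parts_y.append(intensities[mask])
        parts_wn.append(wavenumbers[mask])
    return np.concatenate(parts_y), np.concatenate(parts_wn)


wn = np.arange(400.0, 4001.0, 100.0)
y = wn / 1000.0
# C-H stretch region first, then part of the fingerprint region
y_sel, wn_sel = select_regions_sketch(y, wn, [(2800, 3100), (400, 600)])
```

With this input `wn_sel` starts at 2800 cm⁻¹ and ends at 600 cm⁻¹, so downstream code that expects a monotonic axis should sort the result or pass the regions in ascending order.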
- xpectrass.utils.region_selection.analyze_regions(df, regions=None)[source]
Analyze intensity statistics across different spectral regions.
- Parameters:
df (pl.DataFrame) – Wide-format spectral DataFrame.
regions (list of tuples, optional) – Regions to analyze. If None, analyzes predefined plastic regions.
- Returns:
Statistics for each region (mean, std, min, max, peak location).
- Return type:
pd.DataFrame
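The kind of per-region summary described here can be approximated on a NumPy matrix as follows. This is only a sketch: the helper name, the dictionary output, and the definition of peak location (wavenumber of the maximum of the mean spectrum within the region) are assumptions, and the actual function returns a DataFrame.

```python
import numpy as np


def region_stats_sketch(spectra, wavenumbers, regions):
    """spectra: (n_samples, n_wavenumbers); regions: {name: (lo, hi)} in cm^-1."""
    rows = []
    for name, (lo, hi) in regions.items():
        mask = (wavenumbers >= lo) & (wavenumbers <= hi)
        block = spectra[:, mask]
        mean_spectrum = block.mean(axis=0)
        rows.append({
            "region": name,
            "mean": float(block.mean()),
            "std": float(block.std()),
            "min": float(block.min()),
            "max": float(block.max()),
            # Assumed definition: position of the maximum of the mean spectrum
            "peak_wavenumber": float(wavenumbers[mask][np.argmax(mean_spectrum)]),
        })
    return rows


wn = np.arange(400.0, 4001.0, 10.0)
base = (np.exp(-0.5 * ((wn - 1720) / 20) ** 2)
        + 0.5 * np.exp(-0.5 * ((wn - 2920) / 30) ** 2))
spectra = np.vstack([base, 2.0 * base])   # two synthetic samples

stats = region_stats_sketch(
    spectra, wn, {"carbonyl": (1650, 1800), "ch_stretch": (2800, 3100)}
)
```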
- xpectrass.utils.region_selection.get_wavenumbers(df)[source]
Extract wavenumber array from DataFrame columns.
- Parameters:
df (DataFrame)
- Return type:
- xpectrass.utils.region_selection.get_spectra_matrix(df)[source]
Extract spectra as numpy matrix (n_samples, n_wavenumbers).
- Parameters:
df (DataFrame)
- Return type:
select_region
select_region(
df: pl.DataFrame,
regions: Union[str, Tuple, List[Tuple]]
) -> pl.DataFrame
exclude_regions
exclude_regions(df: pl.DataFrame, regions: Union[str, Tuple, List[Tuple]]) -> pl.DataFrame
FTIR_REGIONS
FTIR_REGIONS = {
'fingerprint': (400, 1500),
'ch_stretch': (2800, 3100),
'carbonyl': (1650, 1800),
# ... and more
}
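Since `select_region` and `exclude_regions` accept a name, a tuple, or a list of tuples, some normalisation step has to turn all three into a list of `(start, end)` pairs. One way this could look, using only the three entries shown above (the table name, the helper `resolve_regions`, and support for names inside a list are assumptions):

```python
# Only the entries shown in the documentation; the real table has more.
FTIR_REGIONS_SKETCH = {
    "fingerprint": (400, 1500),
    "ch_stretch": (2800, 3100),
    "carbonyl": (1650, 1800),
}


def resolve_regions(regions, table=FTIR_REGIONS_SKETCH):
    """Normalise a name, a (start, end) tuple, or a list of either to a list of tuples."""
    if isinstance(regions, str):
        return [table[regions]]          # unknown names raise KeyError
    if (isinstance(regions, tuple) and len(regions) == 2
            and all(isinstance(v, (int, float)) for v in regions)):
        return [tuple(regions)]
    resolved = []
    for item in regions:                 # list: resolve each element in order
        resolved.extend(resolve_regions(item, table))
    return resolved


single = resolve_regions("ch_stretch")
explicit = resolve_regions((400, 1500))
mixed = resolve_regions([(400, 1500), "carbonyl"])
```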
file_management
- xpectrass.utils.file_management.process_batch_files(files, skiprows=15, separator=',', engine='pl', concat_how='vertical', keep_index=True, index_col=None, show_progress=True)[source]
Import a batch of FT-IR CSVs and concatenate them into one Polars frame.
- Parameters:
files (iterable of str | Path) – Paths to the spectral CSV files.
skiprows (int, default 15) – Number of rows to skip at the start of the file (e.g. metadata).
separator (str, default ',') – Delimiter for the CSV file.
engine ({'pd', 'pl'}, default 'pl') –
'pd' → read via the pandas-based importer, then convert to Polars.
'pl' → read directly via the Polars importer (faster).
concat_how ({'vertical', 'vertical_relaxed'}, default 'vertical') –
'vertical' → schemas must match exactly; raises if not.
'vertical_relaxed' → union by column name; missing cols filled with nulls (Polars ≥ 0.20).
keep_index (bool, default True) – If True and engine is ‘pd’, include the pandas Index as a column when converting to Polars (include_index=True). Recommended because your importer names the index “sample”.
index_col (str, optional) – Column name to use as the row identifier/index. If None, uses default behavior (integer index for pandas, “sample” column for polars). Common values: ‘sample’, ‘sample_name’, etc.
show_progress (bool, default True) – Toggle the tqdm progress bar.
- Returns:
All spectra stacked row-wise (each row = one sample).
- Return type:
pl.DataFrame
- xpectrass.utils.file_management.import_data(file_path, engine='pl', skiprows=15, separator=',', index_col=None)[source]
Load a single‐sample CSV of spectral data, set the wavenumber index, transpose so samples are rows, and attach a simple sample label.
- Parameters:
file_path (str or pathlib.Path) – Path to the CSV file to import.
skiprows (int, default 15) – Number of rows to skip at the start of the file (e.g. metadata).
separator (str, default ',') – Delimiter for the CSV file.
engine (str, default 'pl') – ‘pd’ for pandas or ‘pl’ for polars.
index_col (str, optional) – Column name to use as the DataFrame’s row index. If None, uses default integer index. Common values: ‘sample’, ‘sample_name’, etc.
- Returns:
Transposed DataFrame (samples × wavenumber) with:
- Index name = “sample” (pandas) or “sample” column (polars)
- Column name = wavenumber values, index name = “wavenumber”
- A “label” column containing the alphabetic prefix of the sample name
- Return type:
pd.DataFrame | pl.DataFrame
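The "alphabetic prefix" rule for the label column can be illustrated with a short standard-library sketch. Deriving the sample name from the file stem is an assumption made here for the example, and both helper names are hypothetical.

```python
import re
from pathlib import Path


def label_from_sample(sample_name):
    """Return the leading run of letters, e.g. 'PP225' -> 'PP'; '' if there is none."""
    match = re.match(r"[A-Za-z]+", sample_name)
    return match.group(0) if match else ""


def sample_and_label(file_path):
    """Assumed convention: the sample name is the file name without its extension."""
    sample = Path(file_path).stem
    return sample, label_from_sample(sample)


pair = sample_and_label("data/PET12.csv")
```

Sample names used elsewhere in this documentation, such as "PP225" and "HDPE1", map to the labels "PP" and "HDPE" under this rule.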
- xpectrass.utils.file_management.import_data_pd(file_path, skiprows=15, sep=',', index_col=None)[source]
Load a single‐sample CSV of spectral data, set the wavenumber index, transpose so samples are rows, and attach a simple sample label.
- Parameters:
file_path (str or pathlib.Path) – Path to the CSV file to import.
skiprows (int, default 15) – Number of rows to skip at the start of the file (e.g. metadata).
sep (str, default ',') – Delimiter for the CSV file.
index_col (str, optional) – Column name to use as the DataFrame’s row index after transposing. If None, uses default integer index with name “sample”. Common values: ‘sample’, ‘sample_name’, etc.
- Returns:
Transposed DataFrame (samples × wavenumber) with:
- Index name = index_col (if provided) or “sample” (default)
- Column names = wavenumber values
- A “label” column containing the alphabetic prefix of the sample name
- Return type:
pd.DataFrame
- xpectrass.utils.file_management.import_data_pl(file_path, skiprows=15, sep=',', index_col=None)[source]
Load a single-sample spectral CSV into a Polars DataFrame, reshape it so that each sample is a row with wavenumbers as columns, and attach a simple sample label.
- Parameters:
file_path (str or pathlib.Path) – Path to the CSV file.
skiprows (int, default 15) – Number of lines to skip before reading data (e.g., metadata).
sep (str, default ",") – Field delimiter for the CSV.
index_col (str, optional) – Column name to use as the row identifier. If provided, this column will be moved to the first position and can be used for setting as index when converting to pandas. If None, “sample” column remains as a regular column. Common values: ‘sample’, ‘sample_name’, etc.
- Returns:
- A wide DataFrame where:
Each row is a sample.
If index_col is specified, that column is in the first position.
Columns are wavenumbers (as floats or ints).
A “label” column holds the alphabetic prefix of the sample name.
- Return type:
pl.DataFrame
process_batch_files
process_batch_files(
files: Iterable[str],
skiprows: int = 15,
separator: str = ',',
engine: str = "pl",
show_progress: bool = True
) -> pl.DataFrame
import_data
import_data(file_path: str, engine: str = 'pl', skiprows: int = 15) -> pl.DataFrame