Data Validation
The data validation module checks that your FTIR spectral data is complete, consistent, and ready for preprocessing.
Overview
Data validation is the first step in any preprocessing workflow. It checks for:
Completeness: All expected samples and classes present
Missing values: NaN or infinite values in intensity data
Intensity ranges: Values within expected bounds
Wavenumber consistency: Matching spectral grids across samples
Duplicates: Duplicate sample names
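To make these checks concrete, here is a minimal, dependency-free sketch of four of them (completeness, missing values, intensity range, and duplicates) on a tiny in-memory dataset. The `records` layout, the sample names, and the variable names are hypothetical and chosen for illustration only; this is not how xpectrass implements them.

```python
import math
from collections import Counter

# Hypothetical toy data: (sample name, label, intensities on a shared grid)
records = [
    ("s1", "PET", [98.2, 75.1, 60.4]),
    ("s2", "PET", [97.9, float("nan"), 61.0]),
    ("s3", "PP", [99.0, 180.5, 70.2]),
    ("s1", "PP", [88.0, 64.3, 59.9]),  # repeated sample name
]
expected_classes = ["PET", "PP", "PS"]
intensity_range = (0.0, 150.0)

# Completeness: which expected classes never appear?
class_counts = Counter(label for _, label, _ in records)
missing_classes = [c for c in expected_classes if c not in class_counts]

# Missing values: count NaN or infinite intensities.
n_missing = sum(not math.isfinite(v) for _, _, values in records for v in values)

# Intensity ranges: samples with any finite value outside the bounds.
lo, hi = intensity_range
out_of_range = [
    name
    for name, _, values in records
    if any(math.isfinite(v) and not lo <= v <= hi for v in values)
]

# Duplicates: sample names that occur more than once.
name_counts = Counter(name for name, _, _ in records)
duplicates = [name for name, n in name_counts.items() if n > 1]

print(missing_classes, n_missing, out_of_range, duplicates)
```

Wavenumber consistency is left out here because it compares grids across files rather than values within one table.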
Functions
validate_spectra
import polars as pl
from xpectrass.data import load_jung_2018
from xpectrass.utils import validate_spectra

# Validation utilities expect columns named 'sample' and 'label'
df_raw = load_jung_2018()
df = pl.from_pandas(
    df_raw.rename(columns={"sample_id": "sample", "type": "label"})
)

report = validate_spectra(
    df,
    expected_samples_per_class=500,
    expected_classes=['HDPE', 'LDPE', 'PET', 'PP', 'PS', 'PVC'],
    wavenumber_range=(399.0, 4000.0),
    intensity_range=(0.0, 150.0),
    verbose=True
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` (first positional argument) | pl.DataFrame | required | Wide-format DataFrame with sample, label, and wavenumber columns |
| `expected_samples_per_class` | int | 500 | Expected number of samples per class |
| `expected_classes` | list | HDPE, LDPE, PET, PP, PS, PVC | Expected class labels |
| `wavenumber_range` | tuple | (399.0, 4000.0) | Expected wavenumber range (cm⁻¹) |
| `intensity_range` | tuple | (0.0, 150.0) | Valid intensity range for percent transmittance (%T) |
| `verbose` | bool | True | Print validation report |
Returns
Dictionary with validation results:
{
    'valid': True,              # Overall pass/fail
    'n_samples': 3000,          # Total samples
    'n_wavenumbers': 3751,      # Spectral points
    'class_counts': {...},      # Samples per label
    'missing_values': 0,        # NaN/Inf count
    'out_of_range': {...},      # Intensity issues
    'wavenumber_check': {...},  # Range info
    'duplicates': [],           # Duplicate names
    'issues': []                # Issue descriptions
}
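A report with this shape can be turned into a short summary before deciding whether to continue. The sketch below uses a hand-written, hypothetical report dictionary rather than real `validate_spectra` output, and `summarize` is an illustrative helper, not part of xpectrass.

```python
# Hypothetical report using a subset of the keys listed above.
report = {
    "valid": False,
    "n_samples": 3000,
    "missing_values": 4,
    "duplicates": ["sample_0017"],
    "issues": ["4 missing values found", "1 duplicate sample name"],
}


def summarize(report):
    """Return a one-line status followed by one line per reported issue."""
    status = "PASSED" if report["valid"] else "FAILED"
    lines = [f"{status}: {report['n_samples']} samples checked"]
    lines.extend(f"  - {issue}" for issue in report["issues"])
    return lines


summary = summarize(report)
print("\n".join(summary))
```

Reading `valid` and `issues` separately is deliberate: a report may list issues and still pass overall, so the sketch does not infer one from the other.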
detect_outlier_spectra
Detect spectra that deviate significantly from the rest of the dataset:
from xpectrass.utils import detect_outlier_spectra

result = detect_outlier_spectra(
    df,
    method='zscore',  # 'zscore', 'iqr', or 'mad'
    threshold=3.0
)

print(f"Found {result['n_outliers']} outlier spectra")
print("Outlier samples:", result['outlier_samples'])
Methods
| Method | Description |
|---|---|
| `zscore` | Flag spectra whose mean intensity is more than `threshold` standard deviations from the global mean |
| `iqr` | Interquartile range method (1.5 × IQR default) |
| `mad` | Median Absolute Deviation (robust to outliers) |
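The three rules can be compared side by side on a list of per-spectrum mean intensities. The sketch below is a standard-library illustration, not the xpectrass implementation: it assumes every method scores the per-spectrum mean (the source states this only for the z-score method), and `flag_outliers`, the 1.4826 scaling constant for MAD, and the sample values are all assumptions made for the example.

```python
import statistics


def flag_outliers(means, method="zscore", threshold=3.0):
    """Return the indices of values flagged as outliers by the chosen rule."""
    if method == "zscore":
        center = statistics.fmean(means)
        spread = statistics.pstdev(means)
        return [i for i, m in enumerate(means)
                if spread > 0 and abs(m - center) / spread > threshold]
    if method == "iqr":
        q1, _, q3 = statistics.quantiles(means, n=4)
        fence = threshold * (q3 - q1)
        return [i for i, m in enumerate(means)
                if m < q1 - fence or m > q3 + fence]
    if method == "mad":
        med = statistics.median(means)
        mad = statistics.median([abs(m - med) for m in means])
        # 1.4826 rescales the MAD to be comparable to a standard deviation
        # for normally distributed data.
        return [i for i, m in enumerate(means)
                if mad > 0 and abs(m - med) / (1.4826 * mad) > threshold]
    raise ValueError(f"unknown method: {method}")


# Hypothetical per-spectrum mean intensities; the last one is anomalous.
means = [50.0, 51.0, 49.0, 50.5, 49.5, 50.2, 49.8, 90.0]

zscore_hits = flag_outliers(means, "zscore", threshold=3.0)
iqr_hits = flag_outliers(means, "iqr", threshold=1.5)
mad_hits = flag_outliers(means, "mad", threshold=3.0)
print(zscore_hits, iqr_hits, mad_hits)
```

In this small example the extreme value inflates the standard deviation enough that the z-score rule at 3.0 flags nothing, while the IQR and MAD rules both flag the last spectrum. That is the practical meaning of "robust to outliers" in the table.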
check_wavenumber_consistency
Verify that all files share the same wavenumber grid, within a tolerance:
from xpectrass.utils import check_wavenumber_consistency

result = check_wavenumber_consistency(
    file_paths=['file1.csv', 'file2.csv', ...],
    skiprows=15,
    tolerance=0.1
)

if result['consistent']:
    print("All files have matching wavenumber grids")
else:
    print("Mismatched files:", result['mismatched_files'])
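One plausible reading of `tolerance` is a maximum absolute difference, in cm⁻¹, between corresponding grid points. The sketch below illustrates that reading with in-memory grids; `grids_match` and the grid values are hypothetical, and the actual comparison rule used by `check_wavenumber_consistency` may differ.

```python
def grids_match(reference, candidate, tolerance=0.1):
    """True if both grids have equal length and agree point by point."""
    if len(reference) != len(candidate):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(reference, candidate))


# Hypothetical wavenumber grids keyed by file name.
grids = {
    "file1.csv": [4000.0, 3999.0, 3998.0],
    "file2.csv": [4000.02, 3999.01, 3998.0],  # small instrument jitter
    "file3.csv": [4000.0, 3998.0, 3996.0],    # different spacing
}

reference = grids["file1.csv"]
mismatched = [name for name, grid in grids.items()
              if not grids_match(reference, grid)]
print(mismatched)
```

Comparing every file against the first one keeps the check linear in the number of files; it assumes the first file is itself a valid reference.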
Example Output
============================================================
FTIR DATA VALIDATION REPORT
============================================================
Total samples: 3000
Wavenumber points: 3751
Classes: ['HDPE', 'LDPE', 'PET', 'PP', 'PS', 'PVC']
Samples per class: {'HDPE': 500, 'LDPE': 500, 'PET': 500, 'PP': 500, 'PS': 500, 'PVC': 500}
Missing values: 0
Out of range: 12
Duplicates: 0
------------------------------------------------------------
ISSUES FOUND:
⚠ 12 samples have intensities outside [0.0, 150.0]
------------------------------------------------------------
VALIDATION STATUS: PASSED ✓
============================================================
Best Practices
Always validate before preprocessing: Catch data issues early
Check for outliers: Remove or investigate anomalous spectra
Verify class balance: Ensure equal representation for classification
Document exclusions: Keep track of any removed samples
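As one way to document exclusions, removed samples can be recorded together with the reason for removal and written to a CSV log. This is a minimal sketch with hypothetical sample names; `exclude_samples` is an illustrative helper, not an xpectrass function.

```python
import csv
import io


def exclude_samples(sample_names, to_remove, reason):
    """Split names into kept samples and a log of exclusions."""
    remove = set(to_remove)
    kept = [s for s in sample_names if s not in remove]
    log = [{"sample": s, "reason": reason} for s in sample_names if s in remove]
    return kept, log


# Hypothetical sample names and a hypothetical outlier to drop.
samples = ["PET_001", "PET_002", "PP_001", "PP_002"]
kept, log = exclude_samples(samples, ["PP_002"], "zscore outlier, threshold=3.0")

# Write the log as CSV; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample", "reason"])
writer.writeheader()
writer.writerows(log)
print(buffer.getvalue())
```

Storing the method and threshold in the reason field makes it possible to reproduce the exclusion later.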