# Data Validation

The data validation module ensures your FTIR spectral data is complete, consistent, and ready for preprocessing.

## Overview

Data validation is the first step in any preprocessing workflow. It checks for:

- **Completeness**: All expected samples and classes present
- **Missing values**: NaN or infinite values in intensity data
- **Intensity ranges**: Values within expected bounds
- **Wavenumber consistency**: Matching spectral grids across samples
- **Duplicates**: Duplicate sample names

## Functions

### validate_spectra

```python
import polars as pl
from xpectrass.data import load_jung_2018
from xpectrass.utils import validate_spectra

# Validation utilities expect columns named 'sample' and 'label'
df_raw = load_jung_2018()
df = pl.from_pandas(
    df_raw.rename(columns={"sample_id": "sample", "type": "label"})
)

report = validate_spectra(
    df,
    expected_samples_per_class=500,
    expected_classes=['HDPE', 'LDPE', 'PET', 'PP', 'PS', 'PVC'],
    wavenumber_range=(399.0, 4000.0),
    intensity_range=(0.0, 150.0),
    verbose=True
)
```

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `df` | pl.DataFrame | required | Wide-format DataFrame with sample, label, and wavenumber columns |
| `expected_samples_per_class` | int | 500 | Expected number of samples per class |
| `expected_classes` | list | HDPE, LDPE, PET, PP, PS, PVC | Expected class labels |
| `wavenumber_range` | tuple | (399.0, 4000.0) | Expected wavenumber range |
| `intensity_range` | tuple | (0.0, 150.0) | Valid intensity range for %T |
| `verbose` | bool | True | Print validation report |

#### Returns

Dictionary with validation results:

```python
{
    'valid': True,              # Overall pass/fail
    'n_samples': 3000,          # Total samples
    'n_wavenumbers': 3751,      # Spectral points
    'class_counts': {...},      # Samples per label
    'missing_values': 0,        # NaN/Inf count
    'out_of_range': {...},      # Intensity issues
    'wavenumber_check': {...},  # Range info
    'duplicates': [],           # Duplicate names
    'issues': []                # Issue descriptions
}
```

### detect_outlier_spectra

Detect spectra that deviate significantly from the dataset:

```python
from xpectrass.utils import detect_outlier_spectra

result = detect_outlier_spectra(
    df,
    method='zscore',  # 'zscore', 'iqr', or 'mad'
    threshold=3.0
)

print(f"Found {result['n_outliers']} outlier spectra")
print("Outlier samples:", result['outlier_samples'])
```

#### Methods

| Method | Description |
|--------|-------------|
| `zscore` | Flag spectra with mean intensity > threshold std from global mean |
| `iqr` | Interquartile range method (1.5 × IQR default) |
| `mad` | Median Absolute Deviation (robust to outliers) |

### check_wavenumber_consistency

Verify all files have consistent wavenumber grids:

```python
from xpectrass.utils import check_wavenumber_consistency

result = check_wavenumber_consistency(
    file_paths=['file1.csv', 'file2.csv', ...],
    skiprows=15,
    tolerance=0.1
)

if result['consistent']:
    print("All files have matching wavenumber grids")
else:
    print("Mismatched files:", result['mismatched_files'])
```

## Example Output

```
============================================================
FTIR DATA VALIDATION REPORT
============================================================
Total samples: 3000
Wavenumber points: 3751
Classes: ['HDPE', 'LDPE', 'PET', 'PP', 'PS', 'PVC']
Samples per class: {'HDPE': 500, 'LDPE': 500, 'PET': 500, 'PP': 500, 'PS': 500, 'PVC': 500}
Missing values: 0
Out of range: 12
Duplicates: 0
------------------------------------------------------------
ISSUES FOUND:
  ⚠ 12 samples have intensities outside [0.0, 150.0]
------------------------------------------------------------
VALIDATION STATUS: PASSED ✓
============================================================
```

## Best Practices

1. **Always validate before preprocessing**: Catch data issues early
2. **Check for outliers**: Remove or investigate anomalous spectra
3. **Verify class balance**: Ensure equal representation for classification
4. **Document exclusions**: Keep track of any removed samples
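
### Sketch: basic validation checks

To make the checks listed in the Overview more concrete, here is a minimal NumPy sketch of three of them: missing values, intensity range, and duplicate names. It is not the xpectrass implementation. The function name `basic_checks_sketch`, its arguments, and the keys of the returned dictionary are assumptions made for illustration only.

```python
from collections import Counter

import numpy as np


def basic_checks_sketch(names, intensities, intensity_range=(0.0, 150.0)):
    """Hypothetical helper: count non-finite values, out-of-range rows, duplicate names."""
    intensities = np.asarray(intensities, dtype=float)
    finite = np.isfinite(intensities)  # False for NaN and +/-Inf
    low, high = intensity_range

    # Only finite values are compared against the bounds, so NaN/Inf
    # are reported as missing values rather than as out-of-range values.
    out_of_bounds = finite & ((intensities < low) | (intensities > high))

    duplicates = sorted(name for name, count in Counter(names).items() if count > 1)

    return {
        "missing_values": int((~finite).sum()),
        "out_of_range_samples": int(out_of_bounds.any(axis=1).sum()),
        "duplicates": duplicates,
    }


# Small made-up example: four spectra with three intensity points each
demo_names = ["s1", "s2", "s2", "s3"]
demo_intensities = [
    [10.0, 20.0, 30.0],
    [np.nan, 50.0, 60.0],   # one NaN
    [70.0, 160.0, 90.0],    # one value above 150
    [-5.0, np.inf, 40.0],   # one value below 0 and one Inf
]
demo_checks = basic_checks_sketch(demo_names, demo_intensities)
```

Counting out-of-range spectra per row, rather than per value, mirrors the "samples have intensities outside" wording in the example report, though the library may count differently.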
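
### Sketch: the three outlier rules

The Methods table for `detect_outlier_spectra` names three rules. The sketch below shows one common way to express each rule on the per-spectrum mean intensity. It is an illustration under assumptions, not the library's code: the function name `flag_outliers_sketch`, the `iqr_factor` argument, and the 1.4826 scaling constant for MAD (the usual consistency factor for normally distributed data) are choices made here.

```python
import numpy as np


def flag_outliers_sketch(spectra, method="zscore", threshold=3.0, iqr_factor=1.5):
    """Hypothetical helper: boolean mask of spectra with an anomalous mean intensity."""
    spectra = np.asarray(spectra, dtype=float)
    means = spectra.mean(axis=1)  # one summary value per spectrum

    if method == "zscore":
        std = means.std()
        if std == 0:
            return np.zeros(means.shape, dtype=bool)
        return np.abs(means - means.mean()) / std > threshold

    if method == "iqr":
        q1, q3 = np.percentile(means, [25, 75])
        iqr = q3 - q1
        return (means < q1 - iqr_factor * iqr) | (means > q3 + iqr_factor * iqr)

    if method == "mad":
        median = np.median(means)
        mad = np.median(np.abs(means - median))
        if mad == 0:
            return np.zeros(means.shape, dtype=bool)
        # 1.4826 rescales MAD so it is comparable to a standard deviation
        return np.abs(means - median) / (1.4826 * mad) > threshold

    raise ValueError(f"unknown method: {method!r}")


# Made-up data: 20 flat spectra with slightly different levels,
# one of which (index 7) is shifted far above the others.
levels = np.linspace(49.5, 50.5, 20)
demo_spectra = levels[:, None] + np.zeros((20, 50))
demo_spectra[7] += 40.0

demo_flags = {
    m: np.flatnonzero(flag_outliers_sketch(demo_spectra, method=m)).tolist()
    for m in ("zscore", "iqr", "mad")
}
```

With small datasets the z-score rule is the weakest of the three, because a single extreme spectrum inflates the standard deviation it is judged against. The median-based rules do not have this problem, which is what "robust to outliers" refers to in the table.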
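
### Sketch: comparing wavenumber grids

The idea behind `check_wavenumber_consistency` can be illustrated on in-memory arrays: pick one grid as the reference and require every other grid to have the same length and to agree point by point within an absolute tolerance. This is a sketch under assumptions, not the library's implementation. The function name `compare_grids_sketch` is hypothetical, file reading is left out, and the result keys simply mirror the `consistent` and `mismatched_files` keys used in the usage example above.

```python
import numpy as np


def compare_grids_sketch(grids, tolerance=0.1):
    """Hypothetical helper: compare named wavenumber grids against the first one."""
    names = list(grids)
    reference = np.asarray(grids[names[0]], dtype=float)
    mismatched = []

    for name in names[1:]:
        grid = np.asarray(grids[name], dtype=float)
        same_length = grid.shape == reference.shape
        # Check the length first; np.allclose would fail on unequal shapes
        if not same_length or not np.allclose(grid, reference, rtol=0.0, atol=tolerance):
            mismatched.append(name)

    return {"consistent": not mismatched, "mismatched_files": mismatched}


# Made-up grids: 400 to 4000 cm^-1 in steps of 1
reference_grid = np.arange(400.0, 4001.0, 1.0)
demo_grids = {
    "a.csv": reference_grid,
    "b.csv": reference_grid + 0.05,  # small offset, within tolerance
    "c.csv": reference_grid[:-10],   # shorter grid
    "d.csv": reference_grid + 0.5,   # offset larger than tolerance
}
demo_result = compare_grids_sketch(demo_grids, tolerance=0.1)
```

Grids flagged this way would typically need interpolation onto a common grid before the spectra can be stacked into one wide-format table.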