Advanced Dataset Settings
Overview
Additional preprocessing options are available at the dataset level to reduce data dimensionality and align reference spectra with dataset wavelengths. These settings affect how data is prepared before model training and can significantly impact performance and interpretability.
Principal Component Analysis (PCA)
PCA reduces the number of input spectral bands while retaining most of the variance in the data.
Perform PCA
- Enables or disables PCA during preprocessing.
- When enabled, the application displays PCA plots of the spectra projected into a two-dimensional space.
Number of latent variables
Sets how many principal components to retain when computing the loading matrix. This matrix is then used to transform the spectra into a lower-dimensional space when PCA is applied as a preprocessing step in Model Creation.
- Recommendation: retain at least 30–40 latent variables when PCA is used for dimensionality reduction during training.
- Note: PCA plots will always show only the first two latent variables, regardless of how many are retained.
Guidance: Use PCA when dealing with high-dimensional or noisy data. Avoid overly small component counts, which can remove key discriminative features.
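As a rough illustration of what the retained latent variables represent, the sketch below computes a loading matrix with NumPy's SVD and projects spectra into the reduced space. The function names (fit_pca_loadings, apply_pca) and the synthetic data are hypothetical and are not the application's internal implementation.

```python
import numpy as np

def fit_pca_loadings(spectra, n_latent):
    """Return (mean, loadings) for spectra of shape (n_pixels, n_bands)."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_latent].T  # shape (n_bands, n_latent)
    return mean, loadings

def apply_pca(spectra, mean, loadings):
    """Project spectra onto the retained latent variables."""
    return (spectra - mean) @ loadings

rng = np.random.default_rng(0)
spectra = rng.normal(size=(500, 200))   # synthetic data: 500 pixels, 200 bands
mean, loadings = fit_pca_loadings(spectra, n_latent=40)
scores = apply_pca(spectra, mean, loadings)   # input to model training
plot_coords = scores[:, :2]                   # only the first two are plotted
print(scores.shape, plot_coords.shape)
```

Here 200 bands are reduced to 40 latent variables for training, while a 2-D plot would use only the first two columns of the scores.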
Reference Spectra
Reference spectra are often used for unmixing or target detection tasks to match measured data against known spectral signatures.
Reference spectra normalization
Controls normalization applied to reference spectra before comparison.
- None: Keeps the spectra as provided.
- L1/L2 or area-based normalization (if available): Standardizes magnitude for fair comparison.
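The normalization options above can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper normalize_spectrum and a trapezoidal-rule definition of "area"; the application's exact definitions may differ.

```python
import numpy as np

def normalize_spectrum(spectrum, wavelengths=None, method="none"):
    """Scale a 1-D spectrum; `method` is one of none, l1, l2, area."""
    spectrum = np.asarray(spectrum, dtype=float)
    if method == "none":
        return spectrum
    if method == "l1":
        scale = np.abs(spectrum).sum()
    elif method == "l2":
        scale = np.sqrt((spectrum ** 2).sum())
    elif method == "area":
        if wavelengths is None:
            raise ValueError("area normalization needs wavelengths")
        # Trapezoidal area under the curve (an assumed definition).
        widths = np.diff(np.asarray(wavelengths, dtype=float))
        scale = np.sum(0.5 * (spectrum[1:] + spectrum[:-1]) * widths)
    else:
        raise ValueError(f"unknown method: {method}")
    # Leave an all-zero spectrum unchanged rather than divide by zero.
    return spectrum / scale if scale != 0 else spectrum

wavelengths = np.array([400.0, 410.0, 420.0])
reference = np.array([1.0, 2.0, 3.0])
l1 = normalize_spectrum(reference, method="l1")                   # sums to 1
l2 = normalize_spectrum(reference, method="l2")                   # unit length
area = normalize_spectrum(reference, wavelengths, method="area")  # unit area
```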
Reference spectra resampling type
Defines how reference spectra are adjusted to align with dataset wavelengths.
- Interpolation: Recommended in most cases for smooth resampling across wavelengths.
- Convolution: Recommended when the spectral resolution of the images differs significantly from that of the reference spectra.
Reference spectra resampling method
Specifies the interpolation function used during resampling.
- Linear: Default method; balances accuracy and smoothness.
- Other options (if available): spline or nearest, depending on sensor and application.
Rationale: Proper normalization and resampling ensure consistent spectral alignment between reference and measured data, improving the accuracy of spectral comparisons.
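To make the two resampling types concrete, here is a hedged sketch. The interpolation variant uses numpy.interp (linear). The convolution variant assumes a Gaussian band response with a single full width at half maximum (fwhm), which is an illustrative assumption and not a documented property of the application. All function names are hypothetical.

```python
import numpy as np

def resample_interpolate(ref_wl, ref_values, target_wl):
    """Linear interpolation of a reference spectrum onto target wavelengths.
    Values outside the reference range are clamped to the end points."""
    return np.interp(target_wl, ref_wl, ref_values)

def resample_convolve(ref_wl, ref_values, target_wl, fwhm):
    """Weighted average of the reference under an assumed Gaussian response."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    # weights[i, j]: response of target band i to reference wavelength j
    diff = ref_wl[None, :] - target_wl[:, None]
    weights = np.exp(-0.5 * (diff / sigma) ** 2)
    return (weights @ ref_values) / weights.sum(axis=1)

ref_wl = np.linspace(400.0, 1000.0, 601)      # reference sampled every 1 nm
ref_values = 0.001 * ref_wl                   # synthetic linear spectrum
dataset_wl = np.linspace(450.0, 950.0, 51)    # dataset bands every 10 nm
by_interp = resample_interpolate(ref_wl, ref_values, dataset_wl)
by_conv = resample_convolve(ref_wl, ref_values, dataset_wl, fwhm=10.0)
```

Interpolation reads the reference at each dataset wavelength, whereas convolution averages the reference over each band's width, which matters when the reference is much finer than the sensor bands.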
Sampling
Controls how pixels are selected and split between training, validation, and testing sets. Proper sampling ensures balanced data coverage and improves model generalization.
Classes to average
Optionally merge multiple pixels into one averaged sample per class for training.
- Use only if pixel data is noisy.
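A minimal sketch of per-class averaging, assuming a hypothetical helper average_selected_classes and synthetic data: pixels of the selected classes are replaced by a single mean spectrum, and all other classes keep their individual pixels.

```python
import numpy as np

def average_selected_classes(spectra, labels, classes_to_average):
    """Collapse each selected class to one mean spectrum."""
    out_spectra, out_labels = [], []
    for cls in np.unique(labels):
        cls_spectra = spectra[labels == cls]
        if int(cls) in classes_to_average:
            # One averaged sample replaces all pixels of this class.
            cls_spectra = cls_spectra.mean(axis=0, keepdims=True)
        out_spectra.append(cls_spectra)
        out_labels.append(np.full(len(cls_spectra), cls))
    return np.vstack(out_spectra), np.concatenate(out_labels)

rng = np.random.default_rng(0)
spectra = rng.normal(size=(50, 8))          # synthetic: 50 pixels, 8 bands
labels = np.array([0] * 30 + [1] * 20)
avg_spectra, avg_labels = average_selected_classes(spectra, labels, {1})
print(avg_spectra.shape)  # (31, 8): 30 pixels of class 0 + 1 mean of class 1
```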
Sampling method
Defines the algorithm used to select representative samples from each class.
- Random: Selects pixels randomly; the simplest approach, but it may leave some classes unevenly represented.
- Interleaved (Default): Selects pixels in sequence (e.g., every nth pixel); useful for spatially distributed data.
- Stratified (beta): Maintains class proportions during splitting; recommended for classification tasks.
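The three sampling methods can be sketched on pixel indices as follows. The function names and the exact selection rules (for example, how the interleaving step is derived) are assumptions for illustration only.

```python
import numpy as np

def sample_random(indices, n, rng):
    """Pick n pixel indices at random, without replacement."""
    n = min(n, len(indices))
    return np.sort(rng.choice(indices, size=n, replace=False))

def sample_interleaved(indices, n):
    """Pick every step-th pixel so the selection spans the whole sequence."""
    n = min(n, len(indices))
    if n == 0:
        return indices[:0]
    step = len(indices) // n
    return indices[::step][:n]

def sample_stratified(labels, fraction, rng):
    """Pick the same fraction of pixels from every class."""
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        k = int(round(fraction * len(cls_idx)))
        chosen.append(rng.choice(cls_idx, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

rng = np.random.default_rng(0)
pixels = np.arange(10)
random_pick = sample_random(pixels, 5, rng)
interleaved_pick = sample_interleaved(pixels, 5)   # [0, 2, 4, 6, 8]
labels = np.array([0] * 80 + [1] * 20)
stratified_pick = sample_stratified(labels, 0.25, rng)
class_counts = np.bincount(labels[stratified_pick])  # 20 and 5: ratio kept
```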
Class split type (Automatic assignment only)
Determines how data is automatically divided between the Training and Validation buckets. Applicable when the Data Assignment Method is set to Automatic.
- Pixel: Randomly splits individual pixels within each class between training and validation. All class labels are represented in both sets.
- Label: Averages spectra per label and splits labels between training and validation. Each label (or subclass) appears in only one of the sets.
Guidance:
- Use Pixel when you want both training and validation buckets to represent all labels.
- Use Label when you have sufficient labelled data and want the higher signal-to-noise ratio that per-label averaging provides.
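The difference between the two split types can be sketched as below. This is a simplified illustration with hypothetical function names; it ignores the grouping of labels into classes and assumes a single train_fraction parameter, neither of which is taken from the application.

```python
import numpy as np

def split_by_pixel(label_ids, train_fraction, rng):
    """Pixel mode: shuffle the pixels of each label and split them."""
    train_parts, val_parts = [], []
    for lab in np.unique(label_ids):
        idx = rng.permutation(np.flatnonzero(label_ids == lab))
        cut = int(round(train_fraction * len(idx)))
        train_parts.append(idx[:cut])
        val_parts.append(idx[cut:])
    return np.concatenate(train_parts), np.concatenate(val_parts)

def split_by_label(spectra, label_ids, train_fraction, rng):
    """Label mode: one mean spectrum per label; whole labels are split."""
    shuffled = rng.permutation(np.unique(label_ids))
    cut = int(round(train_fraction * len(shuffled)))
    means = {int(lab): spectra[label_ids == lab].mean(axis=0)
             for lab in shuffled}
    train = {int(lab): means[int(lab)] for lab in shuffled[:cut]}
    val = {int(lab): means[int(lab)] for lab in shuffled[cut:]}
    return train, val

rng = np.random.default_rng(0)
label_ids = np.repeat(np.arange(4), 25)      # 4 labels, 25 pixels each
spectra = rng.normal(size=(100, 10))         # synthetic spectra
train_idx, val_idx = split_by_pixel(label_ids, 0.75, rng)
train_means, val_means = split_by_label(spectra, label_ids, 0.75, rng)
```

In Pixel mode every label contributes pixels to both buckets; in Label mode each label's averaged spectrum lands in exactly one bucket.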
Per-class limits
Set caps to control dataset size and balance.
- Max training pixels per class
- Max validation pixels per class
- Max testing pixels per class
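A per-class cap might behave like the sketch below. How the application chooses which pixels to keep once a cap is reached is not stated here, so the random selection in this hypothetical cap_per_class helper is an assumption.

```python
import numpy as np

def cap_per_class(labels, max_per_class, rng):
    """Keep at most max_per_class pixel indices for every class."""
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) > max_per_class:
            # Assumed behaviour: drop the excess pixels at random.
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

rng = np.random.default_rng(0)
labels = np.array([0] * 1000 + [1] * 50)     # strongly imbalanced classes
kept = cap_per_class(labels, 200, rng)
counts = np.bincount(labels[kept])
print(counts)  # [200  50]
```

The cap shrinks the large class to 200 pixels and leaves the small class untouched, which reduces both dataset size and imbalance.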
Pixel thresholds
Define thresholds for pixel inclusion based on intensity or quality metrics.
- Min threshold per pixel / Max threshold per pixel
- Use to filter out noisy or invalid pixels.
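As a minimal sketch of threshold-based filtering, the example below keeps pixels whose mean intensity falls within the two thresholds. Using the per-pixel mean as the metric is an assumption; the application may apply the thresholds to a different intensity or quality measure.

```python
import numpy as np

def threshold_mask(spectra, min_threshold, max_threshold):
    """Boolean mask of pixels whose metric lies within [min, max]."""
    metric = spectra.mean(axis=1)  # assumed metric: mean intensity per pixel
    return (metric >= min_threshold) & (metric <= max_threshold)

spectra = np.array([
    [0.01, 0.02],   # too dark, e.g. shadow or dead pixel
    [0.40, 0.60],   # valid
    [1.50, 1.70],   # too bright, e.g. saturated
])
mask = threshold_mask(spectra, min_threshold=0.05, max_threshold=1.0)
filtered = spectra[mask]
print(mask)  # [False  True False]
```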