Finalize Dataset
Overview
Once a dataset has been created and labeled, it must be finalized before it can be used for model training. Finalizing locks the dataset’s configuration and ensures that data, classes, and spectral settings are consistent across all training and validation stages.
This page outlines the recommended process for validating, organizing, and reviewing your dataset before finalization.
Configure Dataset Settings
Before finalizing, confirm that all configuration options are correct and complete.
- Sensor: Select the correct sensor configuration used during data acquisition.
- Wavelengths: Verify that the wavelength range matches the spectral region of interest.
- Classes: Ensure all relevant labeled categories are included.
  - Label only target classes relevant to your task.
  - Add an “Other” or “Background” class to capture non-target spectra that may appear during production.
- Reference Spectra (Optional): Attach reference spectra for each class if using unmixing or target detection models.
- Max Pixels per Class (Optional): Limit the number of pixels per class to control dataset size and maintain balance.
- Tip: Start with a pilot dataset of roughly 50,000 training pixels and 5,000–10,000 validation/test pixels to enable quick iteration and early feedback.
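To illustrate what a Max Pixels per Class limit does, here is a minimal Python sketch. It assumes the limit is applied by random subsampling within each class; the function name `cap_pixels_per_class` and the in-memory label list are hypothetical and are not part of the application.

```python
import random
from collections import defaultdict

def cap_pixels_per_class(labels, max_per_class, seed=0):
    """Return sorted pixel indices, keeping at most max_per_class per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)
    kept = []
    for indices in by_class.values():
        # Classes under the cap are kept whole; larger ones are subsampled.
        if len(indices) > max_per_class:
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)

# Hypothetical per-pixel class labels for a tiny dataset.
labels = ["plastic"] * 8 + ["wood"] * 3 + ["background"] * 5
kept = cap_pixels_per_class(labels, max_per_class=4)
```

With a cap of 4, the two larger classes are reduced to 4 pixels each, while the 3-pixel class is left unchanged.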
Validate Data
Open the Data tab and verify that all data files are compatible with the dataset’s configuration.
A file is compatible when:
- The sensor and wavelengths match (or can be resampled).
- The processing level (reflectance or radiance) matches the dataset.
- Labels are finalized and correctly assigned.
If any file is incompatible, hover over it to see the reason and make the necessary adjustments before proceeding.
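The three compatibility conditions can be sketched as a simple check. This is an illustration only: the `SpectralMeta` fields and the reason strings are hypothetical, and the rule that a file "can be resampled" when its wavelength range covers the dataset's range is an assumption, not the application's documented behavior.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpectralMeta:
    sensor: str
    wavelengths_nm: tuple      # band centers in nanometers
    processing_level: str      # "reflectance" or "radiance"
    labels_finalized: bool = True

def incompatibility_reasons(file_meta, dataset_meta):
    """Return a list of reasons; an empty list means the file is compatible."""
    reasons = []
    same_grid = (file_meta.sensor == dataset_meta.sensor
                 and file_meta.wavelengths_nm == dataset_meta.wavelengths_nm)
    # Assumption: resampling is possible when the file covers the dataset's range.
    covers_range = (min(file_meta.wavelengths_nm) <= min(dataset_meta.wavelengths_nm)
                    and max(file_meta.wavelengths_nm) >= max(dataset_meta.wavelengths_nm))
    if not same_grid and not covers_range:
        reasons.append("sensor/wavelengths differ and cannot be resampled")
    if file_meta.processing_level != dataset_meta.processing_level:
        reasons.append("processing level differs")
    if not file_meta.labels_finalized:
        reasons.append("labels are not finalized")
    return reasons

dataset = SpectralMeta("sensor-A", (900.0, 1000.0, 1100.0), "reflectance")
good = SpectralMeta("sensor-B", (850.0, 950.0, 1050.0, 1150.0), "reflectance")
bad = SpectralMeta("sensor-A", (900.0, 1000.0, 1100.0), "radiance",
                   labels_finalized=False)
```

Here `good` comes from a different sensor but spans the dataset's wavelength range, so it passes under the resampling assumption; `bad` fails on processing level and on label status.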
Organize Training, Validation, and Test Sets
Organize your data into the three buckets used during training.
- Add Data: Drag and drop files into the Training, Validation, and Test boxes.
- Automatic Splitting:
  - Set Data Assignment Method = Automatic and adjust the split ratio slider.
  - The split can be made by pixel (randomly across each class) or by label (entire labels assigned to one bucket).
    - Pixel: All labels appear in both sets.
    - Label: Each label appears in only one set (Training or Validation).
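The difference between the two automatic split modes can be sketched in Python. This is a minimal illustration, not the application's implementation: the function names are hypothetical, and each pixel is assumed to be represented by a `(label_id, class_name)` pair.

```python
import random
from collections import defaultdict

def split_by_pixel(pixels, train_fraction, seed=0):
    """Shuffle the pixels of each class and split them individually."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, (_, cls) in enumerate(pixels):
        by_class[cls].append(i)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return sorted(train), sorted(val)

def split_by_label(pixels, train_fraction, seed=0):
    """Assign whole labels to one bucket, so no label is shared between sets."""
    rng = random.Random(seed)
    labels_by_class = defaultdict(set)
    for label_id, cls in pixels:
        labels_by_class[cls].add(label_id)
    train_labels = set()
    for label_ids in labels_by_class.values():
        ordered = sorted(label_ids)
        rng.shuffle(ordered)
        cut = round(len(ordered) * train_fraction)
        train_labels.update(ordered[:cut])
    train = [i for i, (lid, _) in enumerate(pixels) if lid in train_labels]
    val = [i for i, (lid, _) in enumerate(pixels) if lid not in train_labels]
    return train, val

# Two classes, four labels per class, ten pixels per label (hypothetical data).
pixels = [(f"{cls}{n}", cls) for cls in ("A", "B") for n in range(4) for _ in range(10)]
px_train, px_val = split_by_pixel(pixels, 0.75)
lb_train, lb_val = split_by_label(pixels, 0.75)
```

A pixel split typically places pixels from the same label in both sets, which can make validation scores optimistic when neighboring pixels are highly correlated. A label split keeps each label's pixels together, so validation pixels come from regions the model has not seen.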
Validation and Test Buckets
- Validation Set: Whenever possible, create a dedicated validation set rather than relying only on automatic splits. This allows early detection of overfitting and helps measure generalization.
- Test Set: Choose test data from conditions that mirror production, such as conveyor-belt or on-site scans rather than lab samples. Partial labeling is acceptable at first but should expand over time to include all classes.
Review Dataset Balance and Spectral Consistency
Use the Analyze Dataset feature on the Overview tab to review class balance, spectral separation, and overall consistency. These checks help confirm that your dataset is representative of production data and ready for stable model training.
Average Class Spectral Signatures
Displays the mean spectral curve for each class.
- Distinct curves → strong class separability and reliable labeling.
- Overlapping curves → similar materials, mixed pixels, or labeling inconsistencies.
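For readers who want to reproduce this check outside the application, the sketch below computes per-class mean spectra and compares two of them. The spectral angle is used here as one common way to quantify how distinct two curves are; it is an illustrative choice, not necessarily the measure Analyze Dataset uses, and the function names and sample values are hypothetical.

```python
import math
from collections import defaultdict

def mean_spectra(spectra, labels):
    """Average the spectra of each class band by band."""
    sums = {}
    counts = defaultdict(int)
    for spectrum, cls in zip(spectra, labels):
        if cls not in sums:
            sums[cls] = [0.0] * len(spectrum)
        for band, value in enumerate(spectrum):
            sums[cls][band] += value
        counts[cls] += 1
    return {cls: [v / counts[cls] for v in total] for cls, total in sums.items()}

def spectral_angle(a, b):
    """Angle in radians between two spectra; 0 means identical shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Hypothetical three-band reflectance values for two classes.
spectra = [[0.1, 0.2, 0.3], [0.3, 0.4, 0.5], [0.9, 0.5, 0.1], [0.7, 0.5, 0.3]]
labels = ["a", "a", "b", "b"]
means = mean_spectra(spectra, labels)
angle = spectral_angle(means["a"], means["b"])
```

A small angle between two class means corresponds to overlapping curves; a larger angle corresponds to distinct ones.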
Pixel Count Distribution
Shows the number of pixels per class.
- Balanced distributions reduce bias toward dominant classes.
- Large imbalances can lower accuracy for underrepresented materials.
- Add samples or adjust pixel limits if necessary.
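The same check can be approximated on an exported label list. In this sketch the `class_balance` function and the 10:1 ratio threshold are arbitrary assumptions chosen for illustration; the application does not document a specific threshold.

```python
from collections import Counter

def class_balance(labels, max_ratio=10.0):
    """Count pixels per class and flag classes far smaller than the largest."""
    counts = Counter(labels)
    largest = max(counts.values())
    # A class is flagged when the largest class outnumbers it by more than max_ratio.
    underrepresented = sorted(
        cls for cls, n in counts.items() if largest / n > max_ratio
    )
    return dict(counts), underrepresented

# Hypothetical per-pixel class labels.
labels = ["background"] * 5000 + ["plastic"] * 1200 + ["wood"] * 300
counts, underrepresented = class_balance(labels)
```

Flagged classes are candidates for additional samples, while dominant classes are candidates for a lower pixel limit.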
PCA Plots
Principal Component Analysis (PCA) plots project spectral data into two dimensions to visualize diversity and consistency across Training, Validation, and Test buckets.
How to interpret:
- Each point = one pixel, colored by class.
- Distinct clusters → good spectral separation and high-quality data.
- Overlap → similar spectra or label confusion.
- Consistent distribution across buckets → balanced sampling; major shifts suggest sampling bias or acquisition differences.
- Outliers → potential noise or mislabeled pixels.
Tip: Aim for similar cluster patterns and density across all PCA plots. Consistent shapes indicate your data buckets represent the same spectral distribution.
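A comparable projection can be sketched with NumPy. This example assumes the PCA axes are fitted on the Training bucket and then reused to project the other buckets, so that all plots share the same coordinates; whether the application fits PCA this way is not stated. The function names are hypothetical and the spectra are random placeholder data.

```python
import numpy as np

def fit_pca(spectra, n_components=2):
    """Return the band-wise mean and the top principal axes of the spectra."""
    mean = spectra.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(spectra, mean, components):
    """Project spectra onto previously fitted principal axes."""
    return (spectra - mean) @ components.T

# Placeholder data: rows are pixels, columns are spectral bands.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 10))
val = rng.normal(size=(50, 10))

mean, components = fit_pca(train)
train_2d = project(train, mean, components)
val_2d = project(val, mean, components)
```

Plotting `train_2d` and `val_2d` on the same axes, colored by class, gives a direct view of whether the buckets occupy the same regions.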
Finalize the Dataset
When all settings are correct and the dataset passes the analysis checks, confirm that:
- Spectral signatures are distinct.
- Pixel counts are balanced across classes.
- PCA distributions are consistent across buckets.
Then click Finalize Dataset.
Once finalized, the dataset’s configuration is locked and the dataset is marked as Ready, allowing you to begin model training.