
Dataset Splitting Strategy

SmartML applies a fixed, deterministic dataset splitting strategy to ensure fair and reproducible benchmarking across models.

Splitting behavior differs slightly between classification and regression tasks, but the core principles remain the same.


Overview

  • A single dataset is accepted as input
  • Splitting is handled internally
  • External train/test splits are not supported
  • The same split is used for all models

This guarantees that every model is evaluated under identical conditions.


Default Configuration

SmartML uses the following fixed defaults:

  • Test size: 20%
  • Train size: 80%
  • Random seed: 42
  • Shuffling: Enabled

These values are fixed and not exposed for modification, in order to preserve benchmark consistency.
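
SmartML's internals are not shown in this page, but assuming the split is backed by scikit-learn's `train_test_split` (an assumption, not a documented fact), the fixed defaults correspond to a call like the following sketch. The variable names and sample data are illustrative only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative only: SmartML's fixed defaults expressed as a
# scikit-learn train_test_split call (assumed backend).
TEST_SIZE = 0.20    # 20% held out for testing
RANDOM_SEED = 42    # fixed seed for reproducibility
SHUFFLE = True      # shuffling enabled

X = pd.DataFrame({"feature": range(10)})
y = pd.Series([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_SEED,
    shuffle=SHUFFLE,
)
```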


Classification Splitting

For classification tasks, SmartML attempts to use stratified splitting whenever possible.

Stratification Rules

  • Target values are converted to a pandas Series
  • Class frequencies are analyzed
  • Stratification is enabled only if every class has at least 2 samples

If any class has fewer than 2 samples:

  • Stratification is disabled
  • A standard shuffled split is used instead
  • A warning is logged

This fallback prevents runtime errors while preserving class balance whenever the data allows it.
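
The rule above can be sketched as follows, again assuming a scikit-learn backend. The function name, logger setup, and warning message are illustrative, not SmartML's actual API:

```python
import logging

import pandas as pd
from sklearn.model_selection import train_test_split

logger = logging.getLogger(__name__)


def split_classification(X: pd.DataFrame, y) -> tuple:
    """Illustrative sketch of the stratification rule described above."""
    y = pd.Series(y)                  # target converted to a pandas Series
    class_counts = y.value_counts()   # class frequencies are analyzed

    # Stratify only if every class has at least 2 samples.
    if (class_counts >= 2).all():
        stratify = y
    else:
        stratify = None
        logger.warning(
            "At least one class has fewer than 2 samples; "
            "falling back to a standard shuffled split."
        )

    return train_test_split(
        X, y,
        test_size=0.20,
        random_state=42,
        shuffle=True,
        stratify=stratify,
    )
```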


Class Distribution Logging

When stratification is enabled, SmartML logs:

  • Normalized class distribution in the training set

This allows users to verify that class balance has been preserved during the split.
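
A normalized class distribution of this kind can be obtained with pandas' `value_counts(normalize=True)`. The snippet below is only an illustration of what such a log entry might be based on; the sample target and logging call are not SmartML's own code:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Illustrative: in practice y_train would come from the split itself.
y_train = pd.Series(["a", "a", "b", "a", "b"])

# Normalized class distribution (fractions sum to 1.0).
distribution = y_train.value_counts(normalize=True)
logger.info("Training class distribution:\n%s", distribution)
```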


Regression Splitting

For regression tasks:

  • No stratification is applied
  • A simple shuffled split is used
  • Targets are treated as continuous values

Regression splitting is deterministic given the same random seed.
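
Assuming the same backend, the regression path reduces to a plain shuffled split with no stratification argument. The data below is synthetic and for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative regression split: continuous targets, no stratification.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series(np.random.default_rng(0).normal(size=100))  # continuous target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,   # same fixed seed -> same split on every run
    shuffle=True,
)
```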


Input Validation

Before splitting:

  • Feature data must be a pandas DataFrame
  • Target data must be a pandas Series or NumPy array

Invalid inputs raise explicit errors to prevent silent failures.
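
A validation check matching this description might look like the sketch below. The function name, exception types, and error messages are illustrative assumptions, not SmartML's documented behavior:

```python
import numpy as np
import pandas as pd


def validate_split_inputs(X, y) -> None:
    """Illustrative validation matching the rules described above."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError(
            f"Feature data must be a pandas DataFrame, got {type(X).__name__}."
        )
    if not isinstance(y, (pd.Series, np.ndarray)):
        raise TypeError(
            f"Target data must be a pandas Series or NumPy array, "
            f"got {type(y).__name__}."
        )
```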


Split Information

After splitting, SmartML computes and logs:

  • Total number of samples
  • Number of training samples
  • Number of test samples
  • Train/test ratios

This information is used for transparency and debugging.
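
The logged summary can be reproduced with a few lines. The dictionary layout below is an illustration of the fields listed above, not the exact format SmartML emits:

```python
import logging

import pandas as pd
from sklearn.model_selection import train_test_split

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

X = pd.DataFrame({"feature": range(50)})
y = pd.Series(range(50))

X_train, X_test, _, _ = train_test_split(X, y, test_size=0.20, random_state=42)

# Illustrative split summary, mirroring the fields listed above.
total = len(X)
info = {
    "total_samples": total,
    "train_samples": len(X_train),
    "test_samples": len(X_test),
    "train_ratio": len(X_train) / total,
    "test_ratio": len(X_test) / total,
}
logger.info("Split info: %s", info)
```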


Determinism and Reproducibility

The splitting process is fully deterministic:

  • Fixed random seed
  • Explicit shuffle behavior
  • No dependency on model choice

Given the same dataset, SmartML will always produce the same split.
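
Because the seed and shuffle behavior are fixed, repeating the split selects exactly the same rows. The check below demonstrates this property with the assumed scikit-learn backend and synthetic data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"feature": range(100)})
y = pd.Series(range(100))

# Two independent splits with the same fixed seed...
first, _, _, _ = train_test_split(X, y, test_size=0.20, random_state=42, shuffle=True)
second, _, _, _ = train_test_split(X, y, test_size=0.20, random_state=42, shuffle=True)

# ...select exactly the same rows.
assert first.index.equals(second.index)
```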


Design Rationale

This splitting strategy is designed to:

  • Prevent benchmark manipulation
  • Avoid data leakage
  • Ensure fair model comparison
  • Keep results reproducible across runs and environments

Flexible splitting options are intentionally excluded.