Training and Execution Flow
This document describes how SmartML loads data, prepares inputs, trains models, and evaluates results.
The training flow is linear and deterministic, and it is identical for all models, except where model-specific preprocessing is mathematically required.
Overview
SmartML training consists of the following high-level stages:
- Dataset loading
- Task validation
- Train/test split
- Feature encoding
- Target encoding
- Model selection
- Model training
- Evaluation and timing
- Result aggregation
No step is skipped.
Model-specific behavior is explicit and documented.
Dataset Loading
Datasets can be loaded from:
- Local CSV files
- OpenML datasets (via dataset ID)
After loading:
- The target column is validated
- Optional subsampling may be applied
- Features and target are separated
If the target column is missing, execution fails immediately.
Subsampling Behavior
If a subset size is provided and the dataset is larger than that value:
- A random subset is selected
- Sampling is performed without replacement
- A fixed random seed is used
This is intended for controlled benchmarking on large datasets.
Task Validation
SmartML supports two task types only:
- Classification
- Regression
The task must be specified explicitly.
Invalid task values result in an error.
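A validation check of this kind might look like the following sketch (the function and constant names are illustrative, not SmartML's API):

```python
VALID_TASKS = {"classification", "regression"}

def validate_task(task: str) -> str:
    # The task must be specified explicitly; anything else is an error.
    if task not in VALID_TASKS:
        raise ValueError(
            f"Unsupported task '{task}'; expected one of {sorted(VALID_TASKS)}"
        )
    return task
```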
Train / Test Split
Before training, SmartML performs an internal split:
- Train size: 80%
- Test size: 20%
- Random seed: 42
- Shuffling: Enabled
Classification Tasks
- Stratified splitting is attempted
- Stratification is used only if every class has at least 2 samples
- If stratification is not possible, it is disabled and logged
Regression Tasks
- No stratification is applied
- A standard shuffled split is used
The same split is reused for all models.
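Assuming a scikit-learn backend, the split with its stratification fallback could be sketched like this (the helper name and logging style are assumptions):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

def split(X, y, task: str):
    """80/20 shuffled split with seed 42; stratify classification when possible."""
    stratify = None
    if task == "classification":
        # Stratify only if every class has at least 2 samples
        if min(Counter(y).values()) >= 2:
            stratify = y
        else:
            print("Stratification disabled: a class has fewer than 2 samples")
    return train_test_split(X, y, test_size=0.2, random_state=42,
                            shuffle=True, stratify=stratify)
```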
Feature Encoding
After splitting:
- Feature encoding is fitted only on the training data
- The fitted encoder is applied to both train and test sets
- Encoded features are converted to dense float32 arrays
Feature encoding is fully documented in the Encoding section.
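The fit-on-train-only rule can be illustrated with a one-hot encoder; here scikit-learn's OneHotEncoder stands in for SmartML's actual encoder, which is an assumption:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the training split only, then transform both splits.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_raw = [["red"], ["blue"], ["red"]]
X_test_raw = [["blue"], ["green"]]  # "green" never appears in training

# Dense float32 arrays, as the documentation specifies
X_train = encoder.fit_transform(X_train_raw).toarray().astype(np.float32)
X_test = encoder.transform(X_test_raw).toarray().astype(np.float32)
```

Note that the unseen test category maps to an all-zero row rather than leaking test information back into the encoder.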
Feature Scaling (Model-Specific)
SmartML applies feature scaling only for models where it is mathematically required or standard practice.
Scaling is not applied globally and not forced on all models.
Models That Use Feature Scaling
Scaling is applied internally for:
- Linear models (Linear, Ridge, Lasso, ElasticNet)
- Support Vector Machines (SVC, SVR)
- K-Nearest Neighbors (classification and regression)
These models use standard normalization to ensure:
- Distance-based comparisons are meaningful
- Optimization converges correctly
- Benchmarks reflect correct model behavior
Models That Do Not Use Scaling
Scaling is intentionally not applied to:
- Tree-based models (Random Forest, Extra Trees, LightGBM, XGBoost, CatBoost)
- Deep learning models with internal normalization
- SmartKNN (uses its own internal distance logic)
Applying scaling to these models would either:
- Have no effect
- Degrade performance
- Bias benchmarks unfairly
Key Principle
Scaling is applied only when required for correctness, not for optimization.
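A conditional-scaling dispatch consistent with this principle might look like the sketch below; the model-name set and helper name are illustrative, not SmartML's internals:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative subset of models that require scaling (names are assumptions)
SCALED_MODELS = {"linear", "ridge", "lasso", "elasticnet", "svc", "svr", "knn"}

def maybe_scale(model_name: str, X_train: np.ndarray, X_test: np.ndarray):
    """Apply standard normalization only where mathematically required."""
    if model_name.lower() not in SCALED_MODELS:
        # Tree-based and other scale-invariant models are left untouched
        return X_train, X_test
    scaler = StandardScaler().fit(X_train)  # fit on training data only
    return scaler.transform(X_train), scaler.transform(X_test)
```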
Target Encoding
Target encoding depends on task type.
Classification
- Labels are converted to strings
- Label encoding is applied
- Mapping is learned from training labels only
Regression
- Target values are passed through directly
- Converted to float32
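Both target-encoding paths can be sketched together; the function name is hypothetical, and scikit-learn's LabelEncoder is assumed as the label-encoding mechanism:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

def encode_target(y_train, y_test, task: str):
    """Classification: string labels, mapping learned on train only.
    Regression: values passed through as float32."""
    if task == "classification":
        # Labels are converted to strings before encoding
        le = LabelEncoder().fit([str(v) for v in y_train])
        return (le.transform([str(v) for v in y_train]),
                le.transform([str(v) for v in y_test]))
    return (np.asarray(y_train, dtype=np.float32),
            np.asarray(y_test, dtype=np.float32))
```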
Memory Management
Once encoding is complete:
- Raw DataFrames and Series are explicitly deleted
- Only NumPy arrays are kept for training and evaluation
This reduces memory pressure during multi-model runs.
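The pattern is simply an explicit hand-off from pandas objects to NumPy arrays, along the lines of:

```python
import gc
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "t": [0, 1]})
X = df[["a"]].to_numpy(dtype=np.float32)  # keep only NumPy arrays
y = df["t"].to_numpy()
del df        # drop the raw DataFrame explicitly
gc.collect()  # reclaim memory before the next model runs
```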
Model Selection
Models are selected using the internal model registry.
- If no model list is provided, all compatible models for the task are selected
- Models can be explicitly excluded
- If no models remain, execution fails
Model availability depends on installed dependencies.
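The selection rules above can be mirrored with a small registry lookup; the registry shape and function name here are assumptions, not SmartML's actual API:

```python
def select_models(registry: dict, task: str, include=None, exclude=()):
    """Hypothetical registry lookup mirroring the documented rules."""
    if include is not None:
        names = list(include)
    else:
        # No model list provided: take all models compatible with the task
        names = [name for name, meta in registry.items()
                 if task in meta["tasks"]]
    # Explicit exclusions are applied last
    selected = [n for n in names if n not in set(exclude)]
    if not selected:
        raise RuntimeError("No models remain after filtering")
    return selected
```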
Model Training
For each selected model:
- A fresh model instance is created
- Training is performed using encoded (and scaled, if applicable) training data
- Training time is measured using a high-resolution timer
No model receives hidden optimizations or undocumented preprocessing.
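The per-model loop body reduces to a fresh instance plus a high-resolution timer; `time.perf_counter` is a reasonable choice for the latter, though the exact timer SmartML uses is an assumption:

```python
import time
import numpy as np
from sklearn.linear_model import Ridge

def train_model(model_cls, X, y):
    model = model_cls()              # a fresh instance per run
    start = time.perf_counter()      # high-resolution, monotonic timer
    model.fit(X, y)
    train_time = time.perf_counter() - start
    return model, train_time
```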
Warmup Phase
If a model implements a warmup method:
- Warmup is executed after training
- Warmup uses training data
- Warmup time is not included in training time
This stabilizes inference latency measurements.
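The optional warmup hook could be handled as sketched below; the `warmup` method name is taken from the text, while the helper and its return value are illustrative:

```python
import time

def run_with_warmup(model, X_train):
    """Run warmup after training; its time is tracked separately."""
    if hasattr(model, "warmup"):     # only if the model implements it
        start = time.perf_counter()
        model.warmup(X_train)
        # Warmup time is measured but never added to training time
        return time.perf_counter() - start
    return 0.0
```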
Evaluation
After training, each model is evaluated on the test set.
Evaluation includes:
- Task-specific performance metrics
- Training time
- Inference latency metrics
Evaluation logic is handled by the internal evaluation module.
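A simplified version of such an evaluation step, assuming one headline metric per task (the actual metric set and result keys are assumptions):

```python
import time
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

def evaluate(model, X_test, y_test, task: str) -> dict:
    start = time.perf_counter()
    preds = model.predict(X_test)
    # Per-sample inference latency, in seconds
    latency = (time.perf_counter() - start) / len(X_test)
    metric = (accuracy_score(y_test, preds) if task == "classification"
              else r2_score(y_test, preds))
    return {"metric": metric, "latency_s": latency}
```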
Result Aggregation
For each model, SmartML collects:
- Model name
- All computed metrics
- Timing information
Results are stored in a tabular structure and combined into a single results table.
Output Handling
If an output path is provided:
- The results directory is created if missing
- Results are written to CSV
- No formatting or post-processing is applied
The raw output is intended for further analysis.
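The output step amounts to creating the directory and writing unmodified CSV; a minimal sketch, with the function name assumed:

```python
import os
import pandas as pd

def write_results(results: pd.DataFrame, out_path: str) -> None:
    """Create the results directory if missing; write raw CSV, no formatting."""
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    results.to_csv(out_path, index=False)
```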
Determinism and Reproducibility
Training behavior is deterministic due to:
- Fixed random seed
- Fixed split ratio
- Fixed encoding rules
- Fixed model defaults
- Explicit scaling rules
Given the same dataset and environment, SmartML will produce identical results.
Design Rationale
The training system is designed to:
- Treat all models fairly
- Apply preprocessing only where required
- Prevent benchmark manipulation
- Avoid hidden optimizations
- Favor transparency over flexibility
If custom training logic is required, it must be added explicitly; SmartML is not intended to be extended implicitly.