Benchmarking and Evaluation
SmartML uses a fixed, transparent benchmarking procedure to evaluate predictive quality, training time, and inference latency.
All metrics are computed using deterministic procedures and identical inputs across models.
Overview
For every model run, SmartML measures:
- Task-specific performance metrics
- Training time
- Batch inference time
- Single-sample inference latency
- 95th percentile (P95) single-sample latency
All measurements are performed after training and use the same test split.
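The exact output schema depends on the SmartML version in use, but the record produced for each model run can be pictured roughly as follows (the field names and values below are illustrative placeholders, not the library's actual keys):

```python
# Illustrative shape of a per-model benchmark record.
# Field names and values are placeholders, not SmartML's actual schema.
benchmark_record = {
    "model_name": "random_forest",
    "task": "classification",
    "accuracy": 0.912,                 # task-specific metric (test set)
    "f1_macro": 0.884,                 # task-specific metric (test set)
    "train_time_s": 3.41,              # wall-clock fit() time
    "batch_inference_time_s": 0.052,   # full test-set predict()
    "batch_throughput_sps": 38461.5,   # samples per second
    "single_latency_ms_mean": 0.41,    # mean per-request latency
    "single_latency_ms_p95": 0.63,     # 95th percentile latency
}
```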
Prediction Handling
Before evaluation:
- Model predictions are converted to NumPy arrays
- Multi-dimensional predictions are flattened when required
- For classification, probability outputs are converted to class labels using argmax
This ensures consistent metric computation across different model APIs.
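A minimal sketch of this normalization step, assuming scikit-learn-style outputs (the helper name and signature are ours, not SmartML's internal API):

```python
import numpy as np

def normalize_predictions(y_pred, task="classification"):
    """Coerce raw model outputs into a flat 1-D array of labels or values.

    Illustrative helper mirroring the handling described above; not
    SmartML's internal function.
    """
    y_pred = np.asarray(y_pred)
    if task == "classification" and y_pred.ndim == 2 and y_pred.shape[1] > 1:
        # Probability matrix of shape (n_samples, n_classes) -> class indices.
        y_pred = np.argmax(y_pred, axis=1)
    # Flatten shapes like (n_samples, 1) so metrics see a 1-D array.
    return y_pred.ravel()
```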
Classification Metrics
For classification tasks, SmartML computes:
- Accuracy
- Macro F1 score
Macro F1 is calculated using equal weight for each class, making it suitable for imbalanced datasets.
Both metrics are computed on the test set only.
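For reference, both metrics can be reproduced with scikit-learn; whether SmartML calls these exact functions internally is an assumption on our part:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_test = np.array([0, 1, 2, 1, 0, 2])  # placeholder test labels
y_pred = np.array([0, 1, 1, 1, 0, 2])  # placeholder predictions

accuracy = accuracy_score(y_test, y_pred)
# average="macro" gives every class equal weight regardless of its support,
# which is why it is favoured for imbalanced datasets.
f1_macro = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.3f}  macro F1={f1_macro:.3f}")
```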
Regression Metrics
For regression tasks, SmartML computes:
- R² (coefficient of determination)
- Mean Squared Error (MSE)
Targets and predictions are reshaped to one-dimensional arrays before evaluation.
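As with classification, the equivalent scikit-learn calls look like this (a sketch, assuming sklearn-style metric functions rather than SmartML's exact internals):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_test = np.array([[2.0], [3.5], [5.1], [4.2]])  # placeholder targets, shape (n, 1)
y_pred = np.array([[2.2], [3.3], [4.9], [4.0]])  # placeholder predictions

# Reshape to 1-D before scoring, as described above.
y_test, y_pred = y_test.ravel(), y_pred.ravel()

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R²={r2:.3f}  MSE={mse:.4f}")
```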
Training Time Measurement
Training time is measured as:
- Wall-clock time
- Captured around the model fit() call
- Reported in seconds
Warmup operations, if present, are not included in training time.
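A minimal sketch of the timing pattern, with the timer wrapped tightly around fit(); the use of time.perf_counter() here is our assumption, not a documented implementation detail:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=1_000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1_000)

# Any warmup work would happen before this point, so it is excluded.
start = time.perf_counter()
model.fit(X_train, y_train)
train_time_s = time.perf_counter() - start
print(f"training time: {train_time_s:.3f} s")
```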
Batch Inference Benchmarking
Batch inference time measures how long a model takes to predict on the entire test set.
Procedure:
- A small warmup prediction is executed
- Full test set prediction is timed
- Time is reported in seconds
Derived metrics:
- Number of samples processed
- Batch throughput (samples per second)
Batch benchmarking reflects throughput-oriented workloads.
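The procedure above can be sketched as follows, assuming a scikit-learn-style predict() interface (the function name is ours):

```python
import time

def benchmark_batch_inference(model, X_test):
    """Time a full test-set predict() call after a small warmup.

    Illustrative only; SmartML's internal benchmarking code may differ.
    """
    # Warmup on a single row absorbs first-call overhead (lazy init, caching).
    model.predict(X_test[:1])

    start = time.perf_counter()
    model.predict(X_test)
    batch_time_s = time.perf_counter() - start

    return {"batch_time_s": batch_time_s, "n_samples": len(X_test)}
```

Throughput is derived from these two values as described under Throughput Calculation below.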
Single-Sample Inference Benchmarking
Single-sample latency measures per-request inference cost.
Procedure:
- A warmup prediction is executed
- A fixed number of inference runs are performed
- Each run predicts a single randomly selected sample
- Latency is measured in milliseconds
Default behavior:
- 200 inference runs
- Fixed random seed for reproducibility
Reported metrics:
- Mean single-sample latency (ms)
- 95th percentile latency (P95, ms)
This reflects real-world request-style inference.
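A sketch of the single-sample loop under the defaults listed above; the helper name and the specific seed value are assumptions:

```python
import time
import numpy as np

def benchmark_single_sample(model, X_test, n_runs=200, seed=42):
    """Measure per-request latency over repeated single-row predictions.

    Illustrative only; the 200-run default and fixed seed follow the
    description above, but the seed value itself is an assumption.
    """
    rng = np.random.default_rng(seed)   # fixed seed for reproducible sampling
    model.predict(X_test[:1])           # warmup, excluded from timing

    latencies_ms = []
    for _ in range(n_runs):
        i = rng.integers(len(X_test))
        row = X_test[i:i + 1]           # keep the 2-D shape predict() expects
        start = time.perf_counter()
        model.predict(row)
        latencies_ms.append((time.perf_counter() - start) * 1_000)

    return {
        "latency_ms_mean": float(np.mean(latencies_ms)),
        "latency_ms_p95": float(np.percentile(latencies_ms, 95)),
    }
```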
Warmup Strategy
Warmup predictions are executed before timing:
- To stabilize internal model state
- To reduce first-call overhead
- To improve measurement consistency
Warmup time is excluded from all reported metrics.
Throughput Calculation
Batch throughput is computed as:
- Number of test samples divided by batch inference time
If batch inference time is zero, throughput is reported as infinity.
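Expressed as a small helper (the name is ours, not SmartML's):

```python
def batch_throughput(n_samples: int, batch_time_s: float) -> float:
    """Samples per second; reported as infinity when the timed span is zero."""
    return float("inf") if batch_time_s == 0 else n_samples / batch_time_s
```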
Determinism and Reproducibility
Benchmarking behavior is deterministic due to:
- Fixed random seeds
- Fixed number of runs
- Fixed evaluation procedures
- Identical test data for all models
Given the same environment and dataset, SmartML produces consistent benchmark results.
Design Rationale
The benchmarking system is designed to:
- Measure both accuracy and speed
- Capture realistic inference behavior
- Avoid cherry-picked metrics
- Remain model-agnostic
No model-specific optimizations or shortcuts are applied.