Benchmarking and Evaluation

SmartML uses a fixed and transparent benchmarking system to evaluate model performance and latency.

All metrics are computed using deterministic procedures and identical inputs across models.


Overview

For every model run, SmartML measures:

  • Task-specific performance metrics
  • Training time
  • Batch inference time
  • Single-sample inference latency
  • P95 single-sample latency

All measurements are performed after training and use the same test split.
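Conceptually, each model run therefore produces one record that combines quality and timing results. A minimal sketch of such a record is shown below; the field names and zero placeholders are illustrative assumptions, not SmartML's actual output schema:

```python
# Illustrative placeholder record (field names and zero values are assumptions,
# not SmartML's actual schema); each run collects quality and timing metrics.
benchmark_record = {
    "metrics": {},                    # task-specific performance metrics
    "training_time_s": 0.0,           # wall-clock training time
    "batch_inference_time_s": 0.0,    # full test-set prediction time
    "mean_latency_ms": 0.0,           # mean single-sample latency
    "p95_latency_ms": 0.0,            # 95th-percentile single-sample latency
}
```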


Prediction Handling

Before evaluation:

  • Model predictions are converted to NumPy arrays
  • Multi-dimensional predictions are flattened when required
  • For classification, probability outputs are converted to class labels using argmax

This ensures consistent metric computation across different model APIs.
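A minimal sketch of this normalization step (the function name and the exact rules are illustrative assumptions, not SmartML's internal code):

```python
import numpy as np

def normalize_predictions(preds, task="classification"):
    """Illustrative sketch: normalize raw model outputs for metric computation."""
    preds = np.asarray(preds)                  # convert to a NumPy array
    if task == "classification" and preds.ndim == 2:
        preds = preds.argmax(axis=1)           # probability outputs -> class labels
    return preds.ravel()                       # flatten when required
```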


Classification Metrics

For classification tasks, SmartML computes:

  • Accuracy
  • Macro F1 score

Macro F1 averages the per-class F1 scores with equal weight for each class, so performance on minority classes is not masked by the majority class. This makes it suitable for imbalanced datasets.

Both metrics are computed on the test set only.
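For example, using scikit-learn's metric functions (a sketch of the computation; SmartML's actual calls may differ):

```python
from sklearn.metrics import accuracy_score, f1_score

def classification_metrics(y_true, y_pred):
    """Sketch of the classification metrics computed on the test split."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # average="macro" weights every class equally, regardless of support
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```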


Regression Metrics

For regression tasks, SmartML computes:

  • R² (coefficient of determination)
  • Mean Squared Error (MSE)

Targets and predictions are reshaped to one-dimensional arrays before evaluation.
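For example, again with scikit-learn's metric functions (a sketch; the reshaping mirrors the description above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Sketch of the regression metrics; inputs are flattened to 1-D first."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return {
        "r2": r2_score(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
    }
```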


Training Time Measurement

Training time is measured as:

  • Wall-clock time
  • Captured around the model fit() call
  • Reported in seconds

Warmup operations, if present, are not included in training time.
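The measurement can be sketched as follows, with a scikit-learn estimator and synthetic dataset standing in for any SmartML-managed model (both are illustrative assumptions):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
start = time.perf_counter()                      # wall-clock timer
model.fit(X_train, y_train)                      # only the fit() call is timed
training_time_s = time.perf_counter() - start    # reported in seconds
```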


Batch Inference Benchmarking

Batch inference time measures how long a model takes to predict on the entire test set.

Procedure:

  • A small warmup prediction is executed
  • Full test set prediction is timed
  • Time is reported in seconds

Derived metrics:

  • Number of samples processed
  • Batch throughput (samples per second)

Batch benchmarking reflects throughput-oriented workloads.
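A sketch of the timing procedure, assuming a scikit-learn-style predict() API (the function name is an assumption; throughput is derived afterwards, see Throughput Calculation below):

```python
import time

def benchmark_batch_inference(model, X_test):
    """Sketch: time one predict() call over the entire test set."""
    model.predict(X_test[:1])                    # small warmup prediction, not timed
    start = time.perf_counter()
    model.predict(X_test)                        # timed full-test-set prediction
    batch_time_s = time.perf_counter() - start   # reported in seconds
    return batch_time_s, len(X_test)             # time and number of samples processed
```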


Single-Sample Inference Benchmarking

Single-sample latency measures per-request inference cost.

Procedure:

  • A warmup prediction is executed
  • A fixed number of inference runs are performed
  • Each run predicts a single randomly selected sample
  • Latency is measured in milliseconds

Default behavior:

  • 200 inference runs
  • Fixed random seed for reproducibility

Reported metrics:

  • Mean single-sample latency (ms)
  • 95th percentile latency (P95, ms)

This reflects real-world request-style inference.
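A sketch of the procedure, again assuming a scikit-learn-style predict() API; the 200-run default and fixed seeding follow the description above, but the function name and the specific seed value are assumptions:

```python
import time
import numpy as np

def benchmark_single_sample(model, X_test, n_runs=200, seed=42):
    """Sketch: measure per-request latency over repeated single-sample predictions."""
    rng = np.random.default_rng(seed)                # fixed seed for reproducibility
    model.predict(X_test[:1])                        # warmup, excluded from timing
    latencies_ms = []
    for _ in range(n_runs):
        i = rng.integers(len(X_test))                # randomly selected sample
        start = time.perf_counter()
        model.predict(X_test[i:i + 1])               # single-sample prediction
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_latency_ms": float(np.mean(latencies_ms)),
        "p95_latency_ms": float(np.percentile(latencies_ms, 95)),
    }
```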


Warmup Strategy

Warmup predictions are executed before timing:

  • To stabilize internal model state
  • To reduce first-call overhead
  • To improve measurement consistency

Warmup time is excluded from all reported metrics.


Throughput Calculation

Batch throughput is computed as:

  • Number of test samples divided by batch inference time

If batch inference time is zero, throughput is reported as infinity.
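In code, the calculation amounts to the following (a sketch with an illustrative function name):

```python
def batch_throughput(n_samples, batch_time_s):
    """Samples per second; infinity when the timed duration is exactly zero."""
    return float("inf") if batch_time_s == 0 else n_samples / batch_time_s
```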


Determinism and Reproducibility

Benchmarking behavior is deterministic due to:

  • Fixed random seeds
  • Fixed number of runs
  • Fixed evaluation procedures
  • Identical test data for all models

Given the same environment and dataset, SmartML produces consistent benchmark results.


Design Rationale

The benchmarking system is designed to:

  • Measure both accuracy and speed
  • Capture realistic inference behavior
  • Avoid cherry-picked metrics
  • Remain model-agnostic

No model-specific optimizations or shortcuts are applied.