Regression Benchmarks

This document describes the 10 regression datasets used in our evaluation, along with their characteristics and expected modeling behavior.

All benchmarks are evaluated using the SmartML system with default model configurations and no hyperparameter tuning.

Note: The intent of these benchmarks is system-level comparison (MSE, R², latency, throughput), not leaderboard optimization.
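The four metric families reported throughout this document can be computed as in the following minimal NumPy sketch. The function names are illustrative only, not part of the SmartML API:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE and R^2 as reported in the result tables."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = float(np.mean((y_true - y_pred) ** 2))
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

def batch_throughput(n_samples, batch_seconds):
    """Batch Throughput (samples/s) = Batch Samples / Batch Inference (s)."""
    return n_samples / batch_seconds

def latency_stats(latencies_ms):
    """Single Mean (ms) and Single P95 (ms) over repeated one-sample predictions."""
    lat = np.asarray(latencies_ms, dtype=float)
    return float(lat.mean()), float(np.percentile(lat, 95))
```

Note that R² can go negative (as in several tables below) when a model predicts worse than the mean of the target.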


Benchmark Philosophy

  • All datasets are evaluated under identical preprocessing and evaluation protocols
  • No dataset-specific tuning
  • Single run per model
  • Focus on practical behavior under production-like defaults

Note: Differences in latency across models reflect inherent algorithmic scaling behavior rather than benchmark bias.

Note: Some models (e.g., SVR, KNN) are excluded from certain datasets due to scaling limitations or prohibitively high training/inference time at large sample counts.

This setup reflects real-world constraints where tuning is limited by time, cost, or deployment requirements.
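A single untuned evaluation pass per model can be sketched as follows. This is an illustration only: `TinyLinearModel` and the synthetic data are stand-ins, not the actual SmartML harness or datasets.

```python
import time
import numpy as np

class TinyLinearModel:
    """Minimal least-squares regressor standing in for any benchmarked model."""
    def fit(self, X, y):
        Xb = np.c_[X, np.ones(len(X))]        # append intercept column
        self.w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return self
    def predict(self, X):
        return np.c_[X, np.ones(len(X))] @ self.w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=2000)
X_tr, y_tr, X_te, y_te = X[:1600], y[:1600], X[1600:], y[1600:]

model = TinyLinearModel()                     # default configuration, no tuning

t0 = time.perf_counter()
model.fit(X_tr, y_tr)                         # -> Train Time (s)
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
y_pred = model.predict(X_te)                  # -> Batch Inference (s)
batch_time = time.perf_counter() - t0
throughput = len(X_te) / batch_time           # -> Batch Throughput (samples/s)

mse = float(np.mean((y_te - y_pred) ** 2))

lat_ms = []                                   # -> Single Mean / P95 (ms)
for i in range(100):
    t0 = time.perf_counter()
    model.predict(X_te[i:i + 1])
    lat_ms.append((time.perf_counter() - t0) * 1000.0)
single_mean = float(np.mean(lat_ms))
single_p95 = float(np.percentile(lat_ms, 95))
```

Single-sample latency is measured by timing `predict()` on one row at a time, which is why batch throughput and single-sample latency can rank models very differently.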


Dataset Summary

| # | Dataset | Samples × Features |
|---|---------|--------------------|
| 1 | DIAMONDS | 54K × 10 |
| 2 | SARCOS | 45K × 22 |
| 3 | Wave Energy | 72K × 49 |
| 4 | Buzzinsocialmedia_Twitter | 580K × 78 |
| 5 | California-Environmental | 128K × 19 |
| 6 | Diabetes-130-Hospitals | 102K × 25 |
| 7 | CoverType | 567K × 11 |
| 8 | SGEMM_GPU_kernel_performance | 242K × 15 |
| 9 | DutchTwitterDataset | 451K × 20 |
| 10 | Autos | 1M × 28 |

Dataset Overview

1. DIAMONDS (54K × 10)

  • Samples: 54,000
  • Features: 10 (including carat, cut, color, clarity, and physical dimensions of the diamond)
  • Target: Price of the diamond (continuous variable)
  • Dataset characteristics:
      • Mix of categorical (cut, color, clarity) and numerical features (carat, x, y, z).
      • Moderate feature correlation; strong non-linear relationships.
      • Popular benchmark for regression with tabular data.
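One plausible way to handle the mixed feature types under a shared preprocessing protocol is ordinal encoding of the quality grades. The orderings below follow the public diamonds dataset, but the mapping itself is only an illustration; this document does not specify SmartML's actual encoding.

```python
import numpy as np

# Toy rows mimicking the diamonds schema (carat, cut, color, clarity, price).
rows = [
    {"carat": 0.23, "cut": "Ideal",   "color": "E", "clarity": "SI2",  "price": 326},
    {"carat": 1.01, "cut": "Premium", "color": "G", "clarity": "VS1",  "price": 4235},
    {"carat": 0.70, "cut": "Good",    "color": "D", "clarity": "VVS2", "price": 2757},
]

# Ordinal encodings following the dataset's standard quality orderings.
CUT = {c: i for i, c in enumerate(["Fair", "Good", "Very Good", "Premium", "Ideal"])}
COLOR = {c: i for i, c in enumerate("JIHGFED")}   # J (worst) .. D (best)
CLARITY = {c: i for i, c in enumerate(
    ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"])}

X = np.array([[r["carat"], CUT[r["cut"]], COLOR[r["color"]], CLARITY[r["clarity"]]]
              for r in rows])
y = np.array([r["price"] for r in rows], dtype=float)
```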

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| LightGBM | 0.315 | 0.027 | 10,788 | 405,639 | 0.9817 | 291,703 | 0.978 | 1.205 |
| Random Forest | 10.848 | 0.155 | 10,788 | 69,383 | 0.9810 | 302,595 | 33.50 | 37.18 |
| Extra Trees | 8.436 | 0.189 | 10,788 | 56,983 | 0.9806 | 308,089 | 34.78 | 37.54 |
| CatBoost | 0.668 | 0.0055 | 10,788 | 1,955,742 | 0.9803 | 312,984 | 0.842 | 0.934 |
| XGBoost | 0.361 | 0.0172 | 10,788 | 628,996 | 0.9800 | 318,286 | 0.485 | 0.616 |
| SmartKNN | 7.835 | 0.163 | 10,788 | 66,065 | 0.9764 | 375,274 | 0.143 | 0.163 |
| KNN | 0.025 | 1.322 | 10,788 | 8,161 | 0.9575 | 675,779 | 2.729 | 3.105 |
| Lasso | 0.215 | 0.0031 | 10,788 | 3,506,000 | 0.9190 | 1,288,437 | 0.379 | 0.462 |
| Ridge | 0.036 | 0.0029 | 10,788 | 3,671,000 | 0.9189 | 1,288,693 | 0.327 | 0.382 |
| Linear | 0.069 | 0.0033 | 10,788 | 3,254,000 | 0.9189 | 1,288,705 | 0.335 | 0.437 |
| ElasticNet | 0.054 | 0.0033 | 10,788 | 3,243,000 | 0.8489 | 2,402,448 | 0.359 | 0.437 |
| SVR | 144.856 | 31.827 | 10,788 | 339 | 0.3541 | 10,267,090 | 3.595 | 3.733 |

2. SARCOS (45K × 22)

  • Samples: 45,000
  • Features: 22 (joint positions, velocities, and accelerations of a 7-DOF robot arm)
  • Target: Torques for 7 robot joints (continuous variables)
  • Dataset characteristics:
      • All numerical features; no categorical data.
      • Strong non-linear relationships between joint states and torques.
      • Popular benchmark for high-dimensional regression with real-world physical measurements.
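SARCOS is a multi-output problem (7 torque targets per sample). The document does not state how SmartML handles this; a common approach, sketched here with a least-squares stand-in on synthetic data of the same shape, is to fit all outputs jointly (tree and boosting models typically wrap one regressor per output instead) and report the per-output MSE averaged across joints.

```python
import numpy as np

# Synthetic stand-in with SARCOS-like shapes: 22 joint-state features -> 7 torques.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 22))
Y = X @ rng.normal(size=(22, 7))        # 7 torque targets per sample

# A single least-squares fit covers all 7 outputs at once.
Xb = np.c_[X, np.ones(len(X))]          # append intercept column
coef, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
Y_pred = Xb @ coef
mse_per_joint = np.mean((Y - Y_pred) ** 2, axis=0)   # one MSE per joint, shape (7,)
overall_mse = float(mse_per_joint.mean())
```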

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| Extra Trees | 10.920 | 0.162 | 8,897 | 54,859 | 0.9803 | 8.087 | 34.32 | 36.92 |
| SmartKNN | 7.839 | 0.127 | 8,897 | 69,784 | 0.9786 | 8.800 | 0.143 | 0.166 |
| Random Forest | 47.589 | 0.142 | 8,897 | 62,690 | 0.9740 | 10.675 | 35.09 | 38.41 |
| XGBoost | 0.648 | 0.015 | 8,897 | 612,661 | 0.9740 | 10.681 | 0.461 | 0.526 |
| KNN | 0.010 | 0.870 | 8,897 | 10,224 | 0.9716 | 11.685 | 1.851 | 2.522 |
| LightGBM | 0.455 | 0.028 | 8,897 | 314,392 | 0.9682 | 13.083 | 0.868 | 0.932 |
| CatBoost | 0.723 | 0.0035 | 8,897 | 2,545,808 | 0.9677 | 13.259 | 0.926 | 1.017 |
| Ridge | 0.014 | 0.0017 | 8,897 | 5,192,842 | 0.9239 | 31.277 | 0.317 | 0.342 |
| Linear | 0.019 | 0.0017 | 8,897 | 5,212,028 | 0.9239 | 31.278 | 0.310 | 0.349 |
| Lasso | 0.051 | 0.0019 | 8,897 | 4,793,474 | 0.8946 | 43.329 | 0.482 | 0.975 |
| ElasticNet | 0.016 | 0.0019 | 8,897 | 4,761,156 | 0.7480 | 103.577 | 0.343 | 0.367 |

3. Wave Energy (72K × 49)

  • Samples: 72,000
  • Features: 49 (wave measurements and derived features for energy prediction)
  • Target: Wave energy output (continuous)
  • Dataset characteristics:
      • All numerical features; some highly correlated.
      • Moderate-to-high dimensionality (49 features).
      • Designed to test robust regression under complex feature interactions.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| Ridge | 0.041 | 0.005 | 14,400 | 2,871,421 | 1.0000 | 5,269 | 0.350 | 0.430 |
| Linear | 0.072 | 0.008 | 14,400 | 1,797,951 | 1.0000 | 5,276 | 0.304 | 0.325 |
| Lasso | 0.068 | 0.016 | 14,400 | 909,344 | 1.0000 | 5,259 | 0.594 | 3.399 |
| CatBoost | 1.534 | 0.004 | 14,400 | 4,022,440 | 0.9897 | 1.282e+08 | 0.991 | 1.099 |
| XGBoost | 1.941 | 0.024 | 14,400 | 598,174 | 0.9577 | 5.292e+08 | 0.486 | 0.588 |
| LightGBM | 1.348 | 0.042 | 14,400 | 344,427 | 0.9416 | 7.300e+08 | 0.912 | 0.965 |
| ElasticNet | 0.062 | 0.008 | 14,400 | 1,736,283 | 0.8828 | 1.466e+09 | 0.358 | 0.413 |
| Extra Trees | 42.536 | 0.286 | 14,400 | 50,349 | 0.8476 | 1.906e+09 | 35.632 | 38.030 |
| Random Forest | 188.638 | 0.266 | 14,400 | 54,194 | 0.8364 | 2.046e+09 | 33.863 | 37.039 |
| KNN | 0.036 | 2.836 | 14,400 | 5,078 | 0.7632 | 2.961e+09 | 4.118 | 4.379 |
| SmartKNN | 18.191 | 0.343 | 14,400 | 41,933 | 0.7568 | 3.042e+09 | 0.176 | 0.203 |

4. Buzzinsocialmedia_Twitter (580K × 78)

  • Samples: 580,000
  • Features: 78 (user activity, tweet metadata, engagement metrics, etc.)
  • Target: Engagement score or related continuous metric
  • Dataset characteristics:
      • Large-scale, high-dimensional dataset.
      • Mix of highly correlated features and sparsity in some user activity fields.
      • Designed to test scalable regression models under real-world social media data.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| Lasso | 15.857 | 0.078 | 116,650 | 1,498,953 | 0.9371 | 23,593.96 | 0.360 | 0.440 |
| Ridge | 0.620 | 0.058 | 116,650 | 2,014,685 | 0.9348 | 24,453.78 | 0.321 | 0.348 |
| Linear | 1.456 | 0.057 | 116,650 | 2,058,530 | 0.9302 | 26,175.83 | 0.302 | 0.328 |
| SmartKNN | 253.027 | 7.679 | 116,650 | 15,190 | 0.9256 | 27,894.62 | 0.413 | 0.460 |
| ElasticNet | 30.218 | 0.085 | 116,650 | 1,365,232 | 0.9010 | 37,113.69 | 0.345 | 0.384 |
| LightGBM | 6.379 | 0.175 | 116,650 | 667,612 | 0.8835 | 43,684.85 | 0.940 | 0.976 |
| CatBoost | 17.685 | 0.091 | 116,650 | 1,283,054 | 0.8466 | 57,517.84 | 1.690 | 3.726 |
| XGBoost | 5.451 | 0.176 | 116,650 | 661,987 | 0.8122 | 70,429.23 | 0.461 | 0.509 |

5. California-Environmental-Conditions-Dataset (128K × 19)

  • Samples: 128,000
  • Features: 19 (temperature, humidity, air quality, pollution levels, etc.)
  • Target: Continuous environmental metric (e.g., air quality index or temperature prediction)
  • Dataset characteristics:
      • Medium-scale regression dataset.
      • Mostly numerical features with potential correlations.
      • Includes both slowly varying environmental signals and occasional spikes.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.643 | 0.040 | 25,625 | 641,179 | 0.7275 | 0.01117 | 0.470 | 0.551 |
| LightGBM | 0.562 | 0.033 | 25,625 | 777,359 | 0.6970 | 0.01242 | 0.892 | 1.033 |
| SmartKNN | 11.074 | 0.430 | 25,625 | 59,540 | 0.6941 | 0.01254 | 0.161 | 0.182 |
| CatBoost | 1.150 | 0.006 | 25,625 | 4,445,293 | 0.6864 | 0.01285 | 0.998 | 1.164 |
| Linear | 0.046 | 0.004 | 25,625 | 6,969,827 | 0.4149 | 0.02397 | 0.332 | 0.386 |
| Ridge | 0.038 | 0.003 | 25,625 | 7,881,582 | 0.4149 | 0.02397 | 0.321 | 0.339 |
| Lasso | 0.031 | 0.010 | 25,625 | 2,635,151 | -0.00004 | 0.04098 | 0.627 | 3.439 |
| ElasticNet | 0.034 | 0.016 | 25,625 | 1,636,454 | -0.00004 | 0.04098 | 0.607 | 3.432 |

6. Diabetes-130-Hospitals (102K × 25)

  • Samples: ~102,000 (subset used: 20,354 per batch); the "130" in the name refers to 130 US hospitals, not the sample count
  • Features: 25 clinical and demographic features per patient
  • Target: Continuous outcome (e.g., glucose level, readmission risk, or treatment response)
  • Dataset characteristics:
      • Medium-scale regression dataset in healthcare.
      • Fairness-focused dataset (Fairlearn), potentially including sensitive attributes.
      • Features may be heterogeneous (numerical + categorical) with complex interactions.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.170 | 0.007 | 20,354 | 2,778,491 | 1.0000 | 1.22e-11 | 0.438 | 0.520 |
| SmartKNN | 5.763 | 0.310 | 20,354 | 65,627 | 1.0000 | 0.0000 | 0.171 | 0.204 |
| Ridge | 0.035 | 0.003 | 20,354 | 6,745,938 | 1.0000 | 2.93e-11 | 0.329 | 0.401 |
| Linear | 0.040 | 0.003 | 20,354 | 7,129,392 | 1.0000 | 3.92e-14 | 0.316 | 0.355 |
| LightGBM | 0.505 | 0.049 | 20,354 | 418,504 | 1.0000 | 7.03e-11 | 0.843 | 0.910 |
| CatBoost | 0.677 | 0.005 | 20,354 | 4,153,346 | 0.999993 | 6.55e-07 | 0.964 | 1.106 |
| Lasso | 0.035 | 0.006 | 20,354 | 3,346,010 | -0.000007 | 0.09966 | 0.727 | 1.068 |
| ElasticNet | 0.026 | 0.003 | 20,354 | 1,636,454 | -0.000007 | 0.09966 | 0.550 | 0.804 |

7. CoverType (567K × 11)

  • Samples: 567,000 (subset used: 113,321 per batch)
  • Features: 11 features including elevation, slope, soil type, and hydrological metrics
  • Target: Continuous label representing forest cover type
  • Dataset characteristics:
      • Large-scale regression dataset from UCI Covertype data.
      • Features are a mix of numerical and categorical (encoded as binaries).
      • Commonly used for land-cover prediction; originally a classification task, here cast as a regression problem.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| SmartKNN | 20.413 | 2.802 | 113,321 | 40,447 | 0.8272 | 0.043203 | 0.264 | 0.321 |
| XGBoost | 1.851 | 0.175 | 113,321 | 648,308 | 0.5808 | 0.104801 | 0.472 | 0.514 |
| CatBoost | 3.447 | 0.013 | 113,321 | 8,974,239 | 0.5098 | 0.122548 | 0.950 | 1.052 |
| LightGBM | 1.571 | 0.227 | 113,321 | 498,750 | 0.4799 | 0.130029 | 0.862 | 0.919 |
| Ridge | 0.081 | 0.008 | 113,321 | 14,553,170 | 0.0769 | 0.230768 | 0.317 | 0.374 |
| Linear | 0.111 | 0.009 | 113,321 | 12,867,660 | 0.0767 | 0.230836 | 0.313 | 0.394 |
| Lasso | 0.076 | 0.009 | 113,321 | 12,821,830 | -0.000000 | 0.250000 | 0.367 | 0.442 |
| ElasticNet | 0.072 | 0.008 | 113,321 | 13,337,130 | -0.000000 | 0.250000 | 0.465 | 0.577 |

8. SGEMM_GPU_kernel_performance (242K × 15)

  • Samples: 242,000 (subset used: 48,320 per batch)
  • Features: 15 GPU kernel performance metrics including matrix sizes, block dimensions, and computational throughput
  • Target: Continuous label representing kernel execution time (or performance metric)
  • Dataset characteristics:
      • High-dimensional regression dataset for GPU kernel performance prediction.
      • Features are primarily numerical and represent low-level hardware and kernel parameters.
      • Useful for evaluating models in hardware-aware performance prediction tasks.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| LightGBM | 1.035 | 0.104 | 48,320 | 466,325 | 0.9904 | 1,285.37 | 1.273 | 3.218 |
| XGBoost | 1.029 | 0.076 | 48,320 | 638,878 | 0.9847 | 2,057.10 | 0.496 | 0.565 |
| SmartKNN | 13.823 | 0.840 | 48,320 | 57,494 | 0.9835 | 2,212.46 | 0.194 | 0.236 |
| CatBoost | 1.641 | 0.007 | 48,320 | 7,079,067 | 0.9765 | 3,159.37 | 1.011 | 1.205 |
| Ridge | 0.045 | 0.005 | 48,320 | 9,675,293 | 0.4025 | 80,236.66 | 0.317 | 0.350 |
| Linear | 0.068 | 0.017 | 48,320 | 2,770,231 | 0.4025 | 80,236.67 | 0.535 | 3.343 |
| Lasso | 0.069 | 0.005 | 48,320 | 10,513,860 | 0.4024 | 80,248.38 | 0.371 | 0.458 |
| ElasticNet | 0.054 | 0.005 | 48,320 | 9,482,767 | 0.3399 | 88,649.51 | 0.354 | 0.409 |

9. DutchTwitterDataset (451K × 20)

  • Samples: 451,000 (batch subset: 90,240)
  • Features: 20 linguistic, user, and tweet metadata features
  • Target: Continuous sentiment or engagement score
  • Dataset characteristics:
      • Real-world social media dataset.
      • Mix of numerical and categorical features (encoded).
      • Evaluates models on text-derived numerical regression tasks.

Why this dataset matters:
It tests model scalability and latency at social-media scale, where batch inference speed and single-sample latency both matter.


Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| SmartKNN | 81.724 | 2.593 | 90,240 | 34,805 | 0.9492 | 0.001984 | 0.267 | 0.309 |
| XGBoost | 1.527 | 0.138 | 90,240 | 652,627 | 0.8841 | 0.004526 | 0.498 | 0.614 |
| LightGBM | 1.458 | 0.155 | 90,240 | 580,956 | 0.8662 | 0.005223 | 0.892 | 0.958 |
| CatBoost | 3.292 | 0.019 | 90,240 | 4,634,122 | 0.8628 | 0.005356 | 0.865 | 0.955 |
| Ridge | 0.207 | 0.020 | 90,240 | 4,583,470 | 0.0758 | 0.036085 | 0.324 | 0.374 |
| Linear | 0.349 | 0.042 | 90,240 | 2,152,844 | 0.0758 | 0.036087 | 0.393 | 0.370 |
| Lasso | 0.238 | 0.021 | 90,240 | 4,321,524 | ~0 | 0.039046 | 0.348 | 0.376 |
| ElasticNet | 0.220 | 0.021 | 90,240 | 4,347,271 | ~0 | 0.039046 | 0.344 | 0.386 |

10. Autos (1M × 28)

  • Samples: 1,000,000 (batch subset: 200,000)
  • Features: 28 features including vehicle specs, engine performance, and sales metadata
  • Target: Continuous variable (e.g., car price or fuel efficiency)
  • Dataset characteristics:
      • Large-scale tabular dataset.
      • Mix of numerical and categorical features (mostly encoded).
      • Evaluates both model scalability and single-sample latency for high-volume industrial datasets.

Benchmark Results

| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 8.967 | 0.322 | 200,000 | 621,688 | 0.6532 | 0.5543 | 0.503 | 0.622 |
| CatBoost | 9.067 | 0.076 | 200,000 | 2,626,389 | 0.6413 | 0.5733 | 0.929 | 1.085 |
| LightGBM | 7.726 | 0.544 | 200,000 | 367,857 | 0.6162 | 0.6135 | 1.399 | 3.674 |
| SmartKNN | 548.474 | 12.648 | 200,000 | 15,813 | 0.4902 | 0.8149 | 0.509 | 0.597 |
| Ridge | 5.372 | 0.089 | 200,000 | 2,244,214 | 0.4181 | 0.9301 | 0.320 | 0.366 |
| Linear | 2.270 | 0.099 | 200,000 | 2,027,420 | 0.4181 | 0.9301 | 0.306 | 0.343 |
| ElasticNet | 1.587 | 0.096 | 200,000 | 2,072,844 | 0.1126 | 1.4183 | 0.376 | 0.479 |
| Lasso | 1.399 | 0.092 | 200,000 | 2,183,667 | ~0 | 1.5983 | 0.398 | 0.633 |

Evaluation Note

These benchmarks are intended to evaluate model behavior under identical, untuned conditions, reflecting practical deployment scenarios rather than optimized leaderboard results.