Regression Benchmarks
This document describes the 10 regression datasets used in our evaluation, along with their characteristics and expected modeling behavior.
All benchmarks are evaluated using the SmartML system with default model configurations and no hyperparameter tuning.
Note: The intent of these benchmarks is system-level comparison (MSE, R², latency, throughput), not leaderboard optimization.
Benchmark Philosophy
- All datasets are evaluated under identical preprocessing and evaluation protocols
- No dataset-specific tuning
- Single run per model
- Focus on practical behavior under production-like defaults
Note: Differences in latency across models reflect inherent algorithmic scaling behavior rather than benchmark bias.
Note: Some models (e.g., SVR, KNN) are excluded from certain datasets due to scaling limitations or prohibitively high training/inference time at large sample counts.
This setup reflects real-world constraints where tuning is limited by time, cost, or deployment requirements.
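The protocol above can be sketched as one loop per (dataset, model) pair. The sketch below is a minimal illustration under the stated assumptions (scikit-learn-style estimators with default configurations, a single run, no tuning); `benchmark_model` and the synthetic data are hypothetical, not the actual SmartML harness.

```python
# Minimal sketch of the untuned, single-run evaluation protocol described
# above. Assumes scikit-learn-style estimators; not the actual SmartML code.
import time

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def benchmark_model(model, X, y, test_size=0.2, seed=0):
    """Fit one model with defaults and report the metrics used in the
    tables below: train time, batch inference time, throughput, R², MSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)          # single run, default hyperparameters
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_pred = model.predict(X_te)   # one batch pass over the held-out split
    batch_time = time.perf_counter() - t0

    return {
        "train_time_s": train_time,
        "batch_inference_s": batch_time,
        "batch_samples": len(X_te),
        "batch_throughput": len(X_te) / max(batch_time, 1e-12),
        "r2": r2_score(y_te, y_pred),
        "mse": mean_squared_error(y_te, y_pred),
    }


# Example on synthetic data (illustration only):
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)
result = benchmark_model(Ridge(), X, y)
```

The same `benchmark_model` call is applied unchanged to every model, which is what makes the latency differences attributable to the algorithms rather than to the harness.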
Dataset Summary
| # | Dataset | Samples × Features |
|---|---------|--------------------|
| 1 | DIAMONDS | 54K × 10 |
| 2 | SARCOS | 45K × 22 |
| 3 | Wave Energy | 72K × 49 |
| 4 | Buzzinsocialmedia_Twitter | 580K × 78 |
| 5 | California-Environmental | 128K × 19 |
| 6 | Diabetes-130-Hospitals | 102K × 25 |
| 7 | CoverType | 567K × 11 |
| 8 | SGEMM_GPU_kernel_performance | 242K × 15 |
| 9 | DutchTwitterDataset | 451K × 20 |
| 10 | Autos | 1M × 28 |
Dataset Overview
1. DIAMONDS (54K × 10)
- Samples: 54,000
- Features: 10 (including carat, cut, color, clarity, and physical dimensions of the diamond)
- Target: Price of the diamond (continuous variable)
- Dataset characteristics:
- Mix of categorical (cut, color, clarity) and numerical features (carat, x, y, z).
- Moderate feature correlation; strong non-linear relationships.
- Popular benchmark for regression with tabular data.
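Because DIAMONDS mixes ordinal categories with numeric columns, every model in the comparison needs the same numeric view of the table. A minimal sketch, assuming scikit-learn's `OrdinalEncoder` and the classic diamonds column names; the actual SmartML preprocessing may differ:

```python
# Hypothetical preprocessing sketch for a DIAMONDS-like table: the ordinal
# quality columns (cut, color, clarity) are integer-encoded so that every
# model sees an identical numeric matrix. Column names are assumptions.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "carat": [0.23, 0.31, 1.02],
    "cut": ["Ideal", "Good", "Premium"],
    "color": ["E", "J", "D"],
    "clarity": ["SI2", "VS1", "IF"],
})

categorical = ["cut", "color", "clarity"]
encoder = OrdinalEncoder()  # shared encoding across all benchmarked models
df[categorical] = encoder.fit_transform(df[categorical])
```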
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| LightGBM | 0.315 | 0.027 | 10,788 | 405,639 | 0.9817 | 291,703 | 0.978 | 1.205 |
| Random Forest | 10.848 | 0.155 | 10,788 | 69,383 | 0.9810 | 302,595 | 33.50 | 37.18 |
| Extra Trees | 8.436 | 0.189 | 10,788 | 56,983 | 0.9806 | 308,089 | 34.78 | 37.54 |
| CatBoost | 0.668 | 0.0055 | 10,788 | 1,955,742 | 0.9803 | 312,984 | 0.842 | 0.934 |
| XGBoost | 0.361 | 0.0172 | 10,788 | 628,996 | 0.9800 | 318,286 | 0.485 | 0.616 |
| SmartKNN | 7.835 | 0.163 | 10,788 | 66,065 | 0.9764 | 375,274 | 0.143 | 0.163 |
| KNN | 0.025 | 1.322 | 10,788 | 8,161 | 0.9575 | 675,779 | 2.729 | 3.105 |
| Lasso | 0.215 | 0.0031 | 10,788 | 3,506,000 | 0.9190 | 1,288,437 | 0.379 | 0.462 |
| Ridge | 0.036 | 0.0029 | 10,788 | 3,671,000 | 0.9189 | 1,288,693 | 0.327 | 0.382 |
| Linear | 0.069 | 0.0033 | 10,788 | 3,254,000 | 0.9189 | 1,288,705 | 0.335 | 0.437 |
| ElasticNet | 0.054 | 0.0033 | 10,788 | 3,243,000 | 0.8489 | 2,402,448 | 0.359 | 0.437 |
| SVR | 144.856 | 31.827 | 10,788 | 339 | 0.3541 | 10,267,090 | 3.595 | 3.733 |
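The derived columns in these tables follow directly from the timing columns: batch throughput is batch samples divided by batch inference time, and the single-sample columns summarize repeated one-row predictions. A small illustrative sketch (the reported throughputs, e.g. LightGBM's 405,639, come from higher-precision timings than the rounded values used here):

```python
# Sketch of how the derived latency columns relate to the timing columns.
# Not the exact harness; shown only to make the columns' meaning concrete.
import numpy as np


def batch_throughput(batch_samples: int, batch_inference_s: float) -> float:
    """Batch Throughput (samples/s) = Batch Samples / Batch Inference (s)."""
    return batch_samples / batch_inference_s


def single_sample_stats(latencies_ms: list) -> tuple:
    """Single Mean / Single P95 over repeated one-sample prediction times."""
    arr = np.asarray(latencies_ms, dtype=float)
    return float(arr.mean()), float(np.percentile(arr, 95))


# LightGBM row above: 10,788 samples in a rounded 0.027 s
approx = batch_throughput(10_788, 0.027)  # close to the reported 405,639
```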
2. SARCOS (45K × 22)
- Samples: 45,000
- Features: 22 (joint positions, velocities, and accelerations of a 7-DOF robot arm)
- Target: Torques for 7 robot joints (continuous variables)
- Dataset characteristics:
- All numerical features; no categorical data.
- Strong non-linear relationships between joint states and torques.
- Popular benchmark for high-dimensional regression with real-world physical measurements.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| Extra Trees | 10.920 | 0.162 | 8,897 | 54,859 | 0.9803 | 8.087 | 34.32 | 36.92 |
| SmartKNN | 7.839 | 0.127 | 8,897 | 69,784 | 0.9786 | 8.800 | 0.143 | 0.166 |
| Random Forest | 47.589 | 0.142 | 8,897 | 62,690 | 0.9740 | 10.675 | 35.09 | 38.41 |
| XGBoost | 0.648 | 0.015 | 8,897 | 612,661 | 0.9740 | 10.681 | 0.461 | 0.526 |
| KNN | 0.010 | 0.870 | 8,897 | 10,224 | 0.9716 | 11.685 | 1.851 | 2.522 |
| LightGBM | 0.455 | 0.028 | 8,897 | 314,392 | 0.9682 | 13.083 | 0.868 | 0.932 |
| CatBoost | 0.723 | 0.0035 | 8,897 | 2,545,808 | 0.9677 | 13.259 | 0.926 | 1.017 |
| Ridge | 0.014 | 0.0017 | 8,897 | 5,192,842 | 0.9239 | 31.277 | 0.317 | 0.342 |
| Linear | 0.019 | 0.0017 | 8,897 | 5,212,028 | 0.9239 | 31.278 | 0.310 | 0.349 |
| Lasso | 0.051 | 0.0019 | 8,897 | 4,793,474 | 0.8946 | 43.329 | 0.482 | 0.975 |
| ElasticNet | 0.016 | 0.0019 | 8,897 | 4,761,156 | 0.7480 | 103.577 | 0.343 | 0.367 |
3. Wave Energy (72K × 49)
- Samples: 72,000
- Features: 49 (wave measurements and derived features for energy prediction)
- Target: Wave energy output (continuous)
- Dataset characteristics:
- All numerical features; some highly correlated.
- Moderate-to-high dimensionality (49 features).
- Designed to test robust regression under complex feature interactions.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| Ridge | 0.041 | 0.005 | 14,400 | 2,871,421 | 1.0000 | 5,269 | 0.350 | 0.430 |
| Linear | 0.072 | 0.008 | 14,400 | 1,797,951 | 1.0000 | 5,276 | 0.304 | 0.325 |
| Lasso | 0.068 | 0.016 | 14,400 | 909,344 | 1.0000 | 5,259 | 0.594 | 3.399 |
| CatBoost | 1.534 | 0.004 | 14,400 | 4,022,440 | 0.9897 | 1.282e+08 | 0.991 | 1.099 |
| XGBoost | 1.941 | 0.024 | 14,400 | 598,174 | 0.9577 | 5.292e+08 | 0.486 | 0.588 |
| LightGBM | 1.348 | 0.042 | 14,400 | 344,427 | 0.9416 | 7.300e+08 | 0.912 | 0.965 |
| ElasticNet | 0.062 | 0.008 | 14,400 | 1,736,283 | 0.8828 | 1.466e+09 | 0.358 | 0.413 |
| Extra Trees | 42.536 | 0.286 | 14,400 | 50,349 | 0.8476 | 1.906e+09 | 35.632 | 38.030 |
| Random Forest | 188.638 | 0.266 | 14,400 | 54,194 | 0.8364 | 2.046e+09 | 33.863 | 37.039 |
| KNN | 0.036 | 2.836 | 14,400 | 5,078 | 0.7632 | 2.961e+09 | 4.118 | 4.379 |
| SmartKNN | 18.191 | 0.343 | 14,400 | 41,933 | 0.7568 | 3.042e+09 | 0.176 | 0.203 |
4. Buzzinsocialmedia_Twitter (580K × 78)
- Samples: 580,000
- Features: 78 (user activity, tweet metadata, engagement metrics, etc.)
- Target: Engagement score or related continuous metric
- Dataset characteristics:
- Large-scale, high-dimensional dataset.
- Mix of highly correlated features and sparsity in some user activity fields.
- Designed to test scalable regression models under real-world social media data.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| Lasso | 15.857 | 0.078 | 116,650 | 1,498,953 | 0.9371 | 23,593.96 | 0.360 | 0.440 |
| Ridge | 0.620 | 0.058 | 116,650 | 2,014,685 | 0.9348 | 24,453.78 | 0.321 | 0.348 |
| Linear | 1.456 | 0.057 | 116,650 | 2,058,530 | 0.9302 | 26,175.83 | 0.302 | 0.328 |
| SmartKNN | 253.027 | 7.679 | 116,650 | 15,190 | 0.9256 | 27,894.62 | 0.413 | 0.460 |
| ElasticNet | 30.218 | 0.085 | 116,650 | 1,365,232 | 0.9010 | 37,113.69 | 0.345 | 0.384 |
| LightGBM | 6.379 | 0.175 | 116,650 | 667,612 | 0.8835 | 43,684.85 | 0.940 | 0.976 |
| CatBoost | 17.685 | 0.091 | 116,650 | 1,283,054 | 0.8466 | 57,517.84 | 1.690 | 3.726 |
| XGBoost | 5.451 | 0.176 | 116,650 | 661,987 | 0.8122 | 70,429.23 | 0.461 | 0.509 |
5. California-Environmental-Conditions-Dataset (128K × 19)
- Samples: 128,000
- Features: 19 (temperature, humidity, air quality, pollution levels, etc.)
- Target: Continuous environmental metric (e.g., air quality index or temperature prediction)
- Dataset characteristics:
- Medium-scale regression dataset.
- Mostly numerical features with potential correlations.
- Includes both slowly varying environmental signals and occasional spikes.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| XGBoost | 0.643 | 0.040 | 25,625 | 641,179 | 0.7275 | 0.01117 | 0.470 | 0.551 |
| LightGBM | 0.562 | 0.033 | 25,625 | 777,359 | 0.6970 | 0.01242 | 0.892 | 1.033 |
| SmartKNN | 11.074 | 0.430 | 25,625 | 59,540 | 0.6941 | 0.01254 | 0.161 | 0.182 |
| CatBoost | 1.150 | 0.006 | 25,625 | 4,445,293 | 0.6864 | 0.01285 | 0.998 | 1.164 |
| Linear | 0.046 | 0.004 | 25,625 | 6,969,827 | 0.4149 | 0.02397 | 0.332 | 0.386 |
| Ridge | 0.038 | 0.003 | 25,625 | 7,881,582 | 0.4149 | 0.02397 | 0.321 | 0.339 |
| Lasso | 0.031 | 0.010 | 25,625 | 2,635,151 | -0.00004 | 0.04098 | 0.627 | 3.439 |
| ElasticNet | 0.034 | 0.016 | 25,625 | 1,636,454 | -0.00004 | 0.04098 | 0.607 | 3.432 |
6. Diabetes-130-Hospitals (Fairlearn)
- Samples: 102,000 (subset used: 20,354 per batch); the "130" in the name refers to the number of hospitals
- Features: 25 clinical and demographic features per patient
- Target: Continuous outcome (e.g., glucose level, readmission risk, or treatment response)
- Dataset characteristics:
- Medium-scale regression dataset in healthcare.
- Fairness-focused dataset (Fairlearn), potentially including sensitive attributes.
- Features may be heterogeneous (numerical + categorical) with complex interactions.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| XGBoost | 0.170 | 0.007 | 20,354 | 2,778,491 | 1.0000 | 1.22e-11 | 0.438 | 0.520 |
| SmartKNN | 5.763 | 0.310 | 20,354 | 65,627 | 1.0000 | 0.0000 | 0.171 | 0.204 |
| Ridge | 0.035 | 0.003 | 20,354 | 6,745,938 | 1.0000 | 2.93e-11 | 0.329 | 0.401 |
| Linear | 0.040 | 0.003 | 20,354 | 7,129,392 | 1.0000 | 3.92e-14 | 0.316 | 0.355 |
| LightGBM | 0.505 | 0.049 | 20,354 | 418,504 | 1.0000 | 7.03e-11 | 0.843 | 0.910 |
| CatBoost | 0.677 | 0.005 | 20,354 | 4,153,346 | 0.999993 | 6.55e-07 | 0.964 | 1.106 |
| Lasso | 0.035 | 0.006 | 20,354 | 3,346,010 | -0.000007 | 0.09966 | 0.727 | 1.068 |
| ElasticNet | 0.026 | 0.003 | 20,354 | 1,636,454 | -0.000007 | 0.09966 | 0.550 | 0.804 |
7. CoverType (567K × 11)
- Samples: 567,000 (subset used: 113,321 per batch)
- Features: 11 features including elevation, slope, soil type, and hydrological metrics
- Target: Continuous label representing forest cover type
- Dataset characteristics:
- Large-scale regression dataset derived from the UCI Covertype data.
- Features are a mix of numerical and categorical (encoded as binaries).
- Commonly used for land-cover classification; here the cover-type label is treated as a continuous regression target.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| SmartKNN | 20.413 | 2.802 | 113,321 | 40,447 | 0.8272 | 0.043203 | 0.264 | 0.321 |
| XGBoost | 1.851 | 0.175 | 113,321 | 648,308 | 0.5808 | 0.104801 | 0.472 | 0.514 |
| CatBoost | 3.447 | 0.013 | 113,321 | 8,974,239 | 0.5098 | 0.122548 | 0.950 | 1.052 |
| LightGBM | 1.571 | 0.227 | 113,321 | 498,750 | 0.4799 | 0.130029 | 0.862 | 0.919 |
| Ridge | 0.081 | 0.008 | 113,321 | 14,553,170 | 0.0769 | 0.230768 | 0.317 | 0.374 |
| Linear | 0.111 | 0.009 | 113,321 | 12,867,660 | 0.0767 | 0.230836 | 0.313 | 0.394 |
| Lasso | 0.076 | 0.009 | 113,321 | 12,821,830 | -0.000000 | 0.250000 | 0.367 | 0.442 |
| ElasticNet | 0.072 | 0.008 | 113,321 | 13,337,130 | -0.000000 | 0.250000 | 0.465 | 0.577 |
8. SGEMM_GPU_kernel_performance (242K × 15)
- Samples: 242,000 (subset used: 48,320 per batch)
- Features: 15 GPU kernel performance metrics including matrix sizes, block dimensions, and computational throughput
- Target: Continuous label representing kernel execution time (or performance metric)
- Dataset characteristics:
- High-dimensional regression dataset for GPU kernel performance prediction.
- Features are primarily numerical and represent low-level hardware and kernel parameters.
- Useful for evaluating models in hardware-aware performance prediction tasks.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| LightGBM | 1.035 | 0.104 | 48,320 | 466,325 | 0.9904 | 1,285.37 | 1.273 | 3.218 |
| XGBoost | 1.029 | 0.076 | 48,320 | 638,878 | 0.9847 | 2,057.10 | 0.496 | 0.565 |
| SmartKNN | 13.823 | 0.840 | 48,320 | 57,494 | 0.9835 | 2,212.46 | 0.194 | 0.236 |
| CatBoost | 1.641 | 0.007 | 48,320 | 7,079,067 | 0.9765 | 3,159.37 | 1.011 | 1.205 |
| Ridge | 0.045 | 0.005 | 48,320 | 9,675,293 | 0.4025 | 80,236.66 | 0.317 | 0.350 |
| Linear | 0.068 | 0.017 | 48,320 | 2,770,231 | 0.4025 | 80,236.67 | 0.535 | 3.343 |
| Lasso | 0.069 | 0.005 | 48,320 | 10,513,860 | 0.4024 | 80,248.38 | 0.371 | 0.458 |
| ElasticNet | 0.054 | 0.005 | 48,320 | 9,482,767 | 0.3399 | 88,649.51 | 0.354 | 0.409 |
9. DutchTwitterDataset (451K × 20)
- Samples: 451,000 (batch subset: 90,240)
- Features: 20 linguistic, user, and tweet metadata features
- Target: Continuous sentiment or engagement score
- Dataset characteristics:
- Real-world social media dataset.
- Mix of numerical and categorical features (encoded).
- Evaluates models on text-derived numerical regression tasks.
Why this dataset matters:
It tests model scalability and latency on social media-sized datasets, where batch inference speed and single-sample latency are important.
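Single-sample latency of the kind reported in the Single Mean / Single P95 columns can be collected by timing repeated one-row `predict()` calls. The sketch below is a hypothetical illustration with a Ridge model on synthetic data, not the actual harness:

```python
# Hypothetical sketch of per-request latency measurement: each call
# predicts one row, is timed individually, and the distribution is
# summarized by its mean and 95th percentile.
import time

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X.sum(axis=1)
model = Ridge().fit(X, y)

latencies_ms = []
for i in range(200):
    row = X[i : i + 1]                 # one sample, kept 2-D for predict()
    t0 = time.perf_counter()
    model.predict(row)
    latencies_ms.append((time.perf_counter() - t0) * 1e3)

single_mean = float(np.mean(latencies_ms))
single_p95 = float(np.percentile(latencies_ms, 95))
```

P95 is reported alongside the mean because tail latency, not average latency, usually determines whether a model meets an online serving budget.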
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| SmartKNN | 81.724 | 2.593 | 90,240 | 34,805 | 0.9492 | 0.001984 | 0.267 | 0.309 |
| XGBoost | 1.527 | 0.138 | 90,240 | 652,627 | 0.8841 | 0.004526 | 0.498 | 0.614 |
| LightGBM | 1.458 | 0.155 | 90,240 | 580,956 | 0.8662 | 0.005223 | 0.892 | 0.958 |
| CatBoost | 3.292 | 0.019 | 90,240 | 4,634,122 | 0.8628 | 0.005356 | 0.865 | 0.955 |
| Ridge | 0.207 | 0.020 | 90,240 | 4,583,470 | 0.0758 | 0.036085 | 0.324 | 0.374 |
| Linear | 0.349 | 0.042 | 90,240 | 2,152,844 | 0.0758 | 0.036087 | 0.393 | 0.370 |
| Lasso | 0.238 | 0.021 | 90,240 | 4,321,524 | ~0 | 0.039046 | 0.348 | 0.376 |
| ElasticNet | 0.220 | 0.021 | 90,240 | 4,347,271 | ~0 | 0.039046 | 0.344 | 0.386 |
10. Autos (1M × 28)
- Samples: 1,000,000 (batch subset: 200,000)
- Features: 28 features including vehicle specs, engine performance, and sales metadata
- Target: Continuous variable (e.g., car price or fuel efficiency)
- Dataset characteristics:
- Large-scale tabular dataset.
- Mix of numerical and categorical features (mostly encoded).
- Evaluates both model scalability and single-sample latency for high-volume industrial datasets.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Batch Samples | Batch Throughput (samples/s) | R² | MSE | Single Mean (ms) | Single P95 (ms) |
|-------|----------------|---------------------|---------------|------------------------------|----|-----|------------------|-----------------|
| XGBoost | 8.967 | 0.322 | 200,000 | 621,688 | 0.6532 | 0.5543 | 0.503 | 0.622 |
| CatBoost | 9.067 | 0.076 | 200,000 | 2,626,389 | 0.6413 | 0.5733 | 0.929 | 1.085 |
| LightGBM | 7.726 | 0.544 | 200,000 | 367,857 | 0.6162 | 0.6135 | 1.399 | 3.674 |
| SmartKNN | 548.474 | 12.648 | 200,000 | 15,813 | 0.4902 | 0.8149 | 0.509 | 0.597 |
| Ridge | 5.372 | 0.089 | 200,000 | 2,244,214 | 0.4181 | 0.9301 | 0.320 | 0.366 |
| Linear | 2.270 | 0.099 | 200,000 | 2,027,420 | 0.4181 | 0.9301 | 0.306 | 0.343 |
| ElasticNet | 1.587 | 0.096 | 200,000 | 2,072,844 | 0.1126 | 1.4183 | 0.376 | 0.479 |
| Lasso | 1.399 | 0.092 | 200,000 | 2,183,667 | ~0 | 1.5983 | 0.398 | 0.633 |
Evaluation Note
These benchmarks are intended to evaluate model behavior under identical, untuned conditions, reflecting practical deployment scenarios rather than optimized leaderboard results.