Classification Benchmarks
This document describes the 10 classification datasets used in our evaluation, along with their characteristics and expected modeling behavior.
All benchmarks are evaluated using the SmartML system with default model configurations and no hyperparameter tuning.
Note: The intent of these benchmarks is system-level comparison (accuracy, Macro-F1, latency, throughput), not leaderboard optimization.
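Throughout this document, Macro-F1 is the unweighted mean of per-class F1 scores, so every class counts equally regardless of its frequency:

$$
\text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C}\frac{2\,P_c R_c}{P_c + R_c}
$$

where \(C\) is the number of classes and \(P_c\), \(R_c\) are the precision and recall of class \(c\). Under heavy imbalance a model can score high accuracy while its Macro-F1 stays near 0.5, a pattern several of the tables below exhibit.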
Benchmark Philosophy
- All datasets are evaluated under identical preprocessing and evaluation protocols
- No dataset-specific tuning
- Single run per model
- Focus on practical behavior under production-like defaults
Note: Differences in latency across models reflect inherent algorithmic scaling behavior rather than benchmark bias.
Note: Some models (e.g., SVC, KNN) are excluded from certain datasets due to scaling limitations or prohibitively high training/inference time at large sample counts.
This setup reflects real-world constraints where tuning is limited by time, cost, or deployment requirements.
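For concreteness, the per-model measurement loop looks roughly like the sketch below. This is a minimal illustration assuming scikit-learn-style estimators and NumPy arrays; the function name `benchmark_model` is illustrative, not the exact SmartML internals. The Single Mean and Single P95 columns are measured separately, one request at a time (see the latency sketch under APS Failure).

```python
import time
from sklearn.metrics import accuracy_score, f1_score

def benchmark_model(model, X_train, y_train, X_test, y_test):
    """Train once with defaults, then measure quality and batch-level speed."""
    # Training time: a single fit with default hyperparameters (no tuning).
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    # Batch inference time and throughput over the full test set.
    t0 = time.perf_counter()
    y_pred = model.predict(X_test)
    batch_time = time.perf_counter() - t0

    return {
        "train_time_s": train_time,
        "batch_inference_s": batch_time,
        "throughput": len(X_test) / batch_time,  # samples/s
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }
```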
Dataset Summary
| # | Dataset | Samples × Features |
|---|---|---|
| 1 | Bank Marketing | 45.2K × 17 |
| 2 | Click Prediction (Small Subset) | 40K × 10 |
| 3 | Adult | 49K × 50 |
| 4 | Credit Card | 285K × 31 |
| 5 | APS Failure | 80K × 171 |
| 6 | KDD98 (Subset) | 83K × 478 |
| 7 | CoverType | 550K × 55 |
| 8 | Criteo Uplift (Balanced) | 1.37M × 14 |
| 9 | Poker Hand | 1M × 11 |
| 10 | Santander Customer Satisfaction | 200K × 202 |
Dataset-wise Characteristics & Behavior
1. Bank Marketing (45.2K × 17)
- Type: Binary classification with moderate class imbalance
- Features: Mix of categorical and numerical
- Model behavior: Tree-based models perform well due to non-linear interactions
- Metric notes: Macro-F1 is lower than accuracy due to minority-class difficulty
Why this dataset matters: Classic business classification problem with realistic imbalance and feature heterogeneity.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| LightGBM | 0.395 | 0.025 | 359,261 | 0.9103 | 0.7599 | 1.30 | 1.49 |
| XGBoost | 0.417 | 0.015 | 597,378 | 0.9079 | 0.7512 | 0.58 | 0.88 |
| CatBoost | 1.026 | 0.0088 | 1,024,817 | 0.9074 | 0.7479 | 0.44 | 0.50 |
| Random Forest | 2.306 | 0.078 | 115,325 | 0.9045 | 0.7198 | 32.85 | 38.15 |
| Extra Trees | 1.807 | 0.110 | 82,314 | 0.9028 | 0.7022 | 33.72 | 37.93 |
| Logistic Reg. | 1.925 | 0.0034 | 2,699,455 | 0.9007 | 0.6949 | 0.30 | 0.55 |
| SmartKNN | 4.502 | 0.159 | 57,022 | 0.8951 | 0.7059 | 0.17 | 0.20 |
| KNN | 0.026 | 1.033 | 8,752 | 0.8926 | 0.6876 | 2.88 | 3.26 |
| Naive Bayes | 0.015 | 0.0040 | 2,254,836 | 0.8466 | 0.6839 | 0.18 | 0.26 |
| SVC | 28.044 | 7.121 | 1,270 | 0.8983 | 0.6704 | 1.23 | 1.42 |
2. Click Prediction (Small Subset) (40K × 10)
- Type: Binary classification with skewed click/no-click distribution
- Features: Low-dimensional, sparse signal
- Model behavior: Accuracy is generally high across models, but Macro-F1 exposes class imbalance
Why this dataset matters: Highlights the difference between accuracy and class-balanced metrics in ad-tech style problems.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| LightGBM | 0.275 | 0.0227 | 351,792 | 0.8362 | 0.5170 | 1.44 | 1.86 |
| CatBoost | 0.494 | 0.0057 | 1,390,990 | 0.8342 | 0.5169 | 0.48 | 0.63 |
| XGBoost | 0.428 | 0.0138 | 578,407 | 0.8302 | 0.5347 | 0.65 | 1.00 |
| Random Forest | 5.567 | 0.0912 | 87,587 | 0.8280 | 0.5372 | 35.49 | 38.90 |
| Logistic Reg. | 1.839 | 0.0010 | 7,987,149 | 0.8324 | 0.4616 | 0.30 | 0.44 |
| Extra Trees | 1.746 | 0.1362 | 58,671 | 0.8200 | 0.5287 | 36.03 | 39.43 |
| SmartKNN | 3.527 | 0.0844 | 94,698 | 0.8015 | 0.5263 | 0.15 | 0.18 |
| KNN | 0.069 | 1.669 | 4,788 | 0.8105 | 0.5256 | 2.06 | 2.46 |
| Naive Bayes | 0.010 | 0.0019 | 4,151,311 | 0.8153 | 0.4851 | 0.22 | 0.32 |
| SVC | 61.116 | 5.805 | 1,376 | 0.8324 | 0.4594 | 1.25 | 1.65 |
3. Adult (49K × 50)
- Type: Binary income prediction
- Features: Structured tabular, moderate imbalance
- Model behavior: Most models perform competitively; gains are incremental
Why this dataset matters: Canonical ML benchmark showing trade-offs between simplicity and performance under defaults.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| LightGBM | 0.366 | 0.0308 | 317,301 | 0.8762 | 0.8189 | 1.36 | 1.62 |
| CatBoost | 0.966 | 0.0100 | 975,425 | 0.8746 | 0.8159 | 0.51 | 0.78 |
| XGBoost | 0.397 | 0.0166 | 586,969 | 0.8738 | 0.8166 | 0.59 | 0.85 |
| Random Forest | 2.646 | 0.1002 | 97,528 | 0.8608 | 0.7984 | 32.80 | 37.92 |
| Logistic Reg. | 1.876 | 0.0038 | 2,588,927 | 0.8515 | 0.7795 | 0.29 | 0.39 |
| SVC | 49.779 | 11.021 | 886 | 0.8527 | 0.7766 | 1.58 | 1.70 |
| Extra Trees | 2.313 | 0.1360 | 71,828 | 0.8431 | 0.7745 | 33.23 | 37.76 |
| KNN | 0.033 | 1.3317 | 7,336 | 0.8332 | 0.7585 | 3.15 | 3.79 |
| SmartKNN | 5.992 | 0.1746 | 55,953 | 0.8313 | 0.7576 | 0.18 | 0.22 |
| Naive Bayes | 0.019 | 0.0052 | 1,878,899 | 0.7972 | 0.6493 | 0.17 | 0.18 |
4. Credit Card (285K × 31)
- Type: Highly imbalanced fraud detection task
- Features: Mix of numerical and categorical features, high skew in target
- Model behavior: Accuracy is extremely high for most models due to severe class imbalance; Macro-F1 reveals meaningful differences.
Why this dataset matters: Demonstrates why Macro-F1 and system-level metrics are critical when accuracy alone is misleading in fraud detection.
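To see why, consider a degenerate classifier that flags nothing as fraud. The class ratio below is hypothetical but representative: accuracy stays near 1.0 while Macro-F1 collapses to roughly 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical fraud-style split: 99.8% negatives, 0.2% positives.
y_true = np.array([0] * 99_800 + [1] * 200)
y_pred = np.zeros_like(y_true)  # always predict the majority class

print(accuracy_score(y_true, y_pred))                 # 0.998   -> looks excellent
print(f1_score(y_true, y_pred, average="macro"))      # ~0.4995 -> exposes the failure
```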
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| Extra Trees | 9.886 | 0.2633 | 216,351 | 0.99963 | 0.9425 | 33.87 | 37.96 |
| Random Forest | 143.888 | 0.1988 | 286,510 | 0.99961 | 0.9401 | 29.26 | 36.85 |
| CatBoost | 4.506 | 0.0341 | 1,669,088 | 0.99956 | 0.9330 | 0.48 | 0.52 |
| SmartKNN | 41.255 | 1.5284 | 37,270 | 0.99954 | 0.9285 | 0.24 | 0.28 |
| XGBoost | 2.627 | 0.0759 | 750,563 | 0.99946 | 0.9170 | 0.54 | 0.67 |
| Logistic Reg. | 2.575 | 0.0171 | 3,328,539 | 0.99916 | 0.8619 | 0.39 | 0.50 |
| Naive Bayes | 0.085 | 0.0185 | 3,074,959 | 0.97639 | 0.5489 | 0.17 | 0.19 |
| LightGBM | 2.543 | 0.1132 | 503,029 | 0.99252 | 0.5118 | 1.32 | 1.45 |
5. APS Failure (80K × 171)
- Type: High-dimensional industrial failure dataset
- Features: Sparse and noisy, many weak signals
- Model behavior: Tree ensembles dominate in predictive performance; linear models struggle; latency differences are significant due to dimensionality.
Why this dataset matters: Represents industrial monitoring problems with many weak features and limited signal density, testing both predictive quality and system efficiency.
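The Single Mean and Single P95 columns throughout this document are per-request latencies: predictions on one row at a time, which is where high dimensionality widens the gap between boosted trees and bagged forests. A minimal sketch of how such numbers can be measured, assuming a fitted estimator and a NumPy test matrix (the request count and helper name are illustrative):

```python
import time
import numpy as np

def single_sample_latency(model, X_test, n_requests=1000):
    """Time one-row predictions to estimate per-request mean and P95 latency (ms)."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(X_test), size=n_requests)
    timings = []
    for i in idx:
        row = X_test[i : i + 1]  # keep the 2-D shape predict() expects
        t0 = time.perf_counter()
        model.predict(row)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return np.mean(timings), np.percentile(timings, 95)
```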
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| Extra Trees | 5.773 | 0.1363 | 111,539 | 0.99480 | 0.9192 | 35.72 | 38.98 |
| Random Forest | 45.506 | 0.0916 | 165,935 | 0.99474 | 0.9196 | 31.64 | 37.18 |
| CatBoost | 4.945 | 0.0201 | 755,952 | 0.99454 | 0.9180 | 0.72 | 0.94 |
| LightGBM | 3.444 | 0.0437 | 347,766 | 0.99441 | 0.9161 | 1.59 | 1.84 |
| XGBoost | 4.011 | 0.0272 | 559,446 | 0.99329 | 0.8995 | 0.54 | 0.66 |
| SmartKNN | 30.281 | 1.3649 | 11,136 | 0.99224 | 0.8756 | 0.33 | 0.44 |
| Logistic Reg. | 4.963 | 0.0293 | 518,141 | 0.99026 | 0.8519 | 0.43 | 0.64 |
| Naive Bayes | 0.094 | 0.0296 | 513,504 | 0.96757 | 0.7393 | 0.19 | 0.25 |
6. KDD98 (Subset) (83K × 478)
- Type: Extremely high-dimensional tabular dataset with strong class imbalance
- Features: Many redundant or weakly informative features
- Model behavior: Most models achieve high accuracy due to class skew, but Macro-F1 highlights true minority-class performance; latency differences are significant.
Why this dataset matters: Tests scalability, robustness, and noise tolerance rather than pure predictive power.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| LightGBM | 8.573 | 0.0686 | 240,175 | 0.8609 | 0.5040 | 2.31 | 2.76 |
| CatBoost | 8.325 | 0.0282 | 584,174 | 0.8570 | 0.5109 | 1.03 | 1.21 |
| XGBoost | 8.467 | 0.0406 | 405,320 | 0.8541 | 0.5064 | 0.57 | 0.83 |
| Logistic Reg. | 8.481 | 0.0781 | 210,883 | 0.8520 | 0.5083 | 0.51 | 0.84 |
| SmartKNN | 65.843 | 3.6427 | 4,520 | 0.8596 | 0.4918 | 0.61 | 0.71 |
| Random Forest | 37.802 | 0.2801 | 58,777 | 0.8790 | 0.4752 | 36.46 | 39.09 |
| Extra Trees | 20.899 | 0.3391 | 48,553 | 0.8816 | 0.4736 | 36.56 | 39.00 |
| Naive Bayes | 0.307 | 0.0906 | 181,711 | 0.5937 | 0.4653 | 0.19 | 0.22 |
7. CoverType (550K × 55)
- Type: Large multi-class classification problem
- Features: Relatively balanced classes, mixed numerical/categorical features
- Model behavior: Randomized tree ensembles (Random Forest, Extra Trees) dominate due to spatial and hierarchical structure; linear models perform poorly, and CatBoost's defaults break down on this multi-class task.
Why this dataset matters: Evaluates multi-class scalability and performance consistency at medium-to-large scale.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| Random Forest | 57.827 | 2.168 | 53,589 | 0.9543 | 0.9256 | 36.92 | 39.50 |
| Extra Trees | 44.640 | 3.122 | 37,221 | 0.9526 | 0.9252 | 46.65 | 69.67 |
| SmartKNN | 221.993 | 5.930 | 19,596 | 0.9466 | 0.9049 | 0.38 | 0.45 |
| XGBoost | 31.279 | 1.230 | 94,476 | 0.8682 | 0.8518 | 0.78 | 0.92 |
| LightGBM | 16.242 | 2.112 | 55,032 | 0.8518 | 0.8260 | 1.57 | 1.86 |
| Logistic Reg. | 52.693 | 0.073 | 1,593,008 | 0.7235 | 0.5303 | 0.43 | 0.71 |
| Naive Bayes | 0.289 | 0.246 | 472,183 | 0.5658 | 0.4468 | 0.34 | 0.56 |
| CatBoost | 27.708 | 0.126 | 925,254 | 0.3646 | 0.0763 | 0.62 | 0.89 |
8. Criteo Uplift (Balanced) (1.37M × 14)
- Type: Large-scale binary classification
- Features: Low-dimensional but extremely high sample count
- Model behavior: XGBoost achieves highest predictive quality; system-level metrics (throughput and latency) reveal trade-offs for large-scale scoring.
Why this dataset matters: Stress-tests training time, batch inference throughput, and memory efficiency at scale.
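At 1.37M rows, batch-scoring strategy affects peak memory as much as raw speed. A minimal sketch of chunked inference, assuming a fitted estimator and a NumPy feature matrix; `predict_in_chunks` and the chunk size are illustrative, not part of SmartML:

```python
import numpy as np

def predict_in_chunks(model, X, chunk_size=100_000):
    """Score a large matrix in fixed-size chunks to bound peak memory."""
    preds = [
        model.predict(X[start : start + chunk_size])
        for start in range(0, len(X), chunk_size)
    ]
    return np.concatenate(preds)
```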
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| XGBoost | 24.510 | 1.720 | 158,930 | 0.7201 | 0.6510 | 0.71 | 0.97 |
| LightGBM | 20.227 | 3.283 | 83,248 | 0.7196 | 0.6508 | 1.42 | 1.68 |
| Logistic Reg. | 12.829 | 0.0767 | 3,565,107 | 0.6960 | 0.6177 | 0.38 | 0.50 |
| Random Forest | 158.947 | 8.040 | 33,992 | 0.6833 | 0.6184 | 36.81 | 40.27 |
| Extra Trees | 89.121 | 9.791 | 27,915 | 0.6803 | 0.6150 | 36.24 | 38.75 |
| SmartKNN | 121.743 | 13.685 | 19,971 | 0.6678 | 0.5994 | 0.49 | 0.58 |
| Naive Bayes | 0.338 | 0.1665 | 1,641,407 | 0.3067 | 0.2921 | 0.22 | 0.26 |
| CatBoost | 41.667 | 0.1993 | 1,371,246 | 0.3659 | 0.1339 | 0.40 | 0.48 |
9. Poker Hand (1M × 11)
- Type: Multi-class classification over poker hand categories
- Features: Purely categorical (card suits and ranks); the target depends on feature interactions rather than any single feature
- Model behavior: Tree ensembles lead, with XGBoost well ahead; linear models and Naive Bayes hover near 0.52 accuracy because no individual feature is predictive on its own
Why this dataset matters: Tests whether models can learn interaction-driven structure at large sample counts.
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| XGBoost | 3.927 | 0.322 | 636,386 | 0.8894 | 0.8890 | 0.60 | 0.77 |
| Extra Trees | 66.686 | 5.304 | 38,653 | 0.8627 | 0.8625 | 35.01 | 39.68 |
| Random Forest | 81.772 | 4.768 | 42,997 | 0.8515 | 0.8512 | 36.49 | 52.38 |
| CatBoost | 6.234 | 0.090 | 2,266,050 | 0.8454 | 0.8448 | 0.71 | 0.88 |
| LightGBM | 4.531 | 0.435 | 471,140 | 0.7060 | 0.7034 | 1.82 | 3.22 |
| SmartKNN | 56.156 | 7.558 | 27,126 | 0.6276 | 0.6276 | 0.42 | 0.45 |
| Naive Bayes | 0.150 | 0.032 | 6,360,374 | 0.5234 | 0.5108 | 0.20 | 0.36 |
| Logistic Reg. | 2.437 | 0.023 | 8,829,797 | 0.5175 | 0.4993 | 0.44 | 0.75 |
10. Santander Customer Satisfaction (200K × 202)
- Type: Extremely imbalanced binary classification
- Features: High-dimensional tabular data with sparse signals
- Model behavior: Accuracy and Macro-F1 are identical across all models, consistent with every model collapsing to the majority class under defaults; meaningful differences are purely system-level (latency, throughput, efficiency).
Why this dataset matters: Illustrates why accuracy alone is insufficient and why reporting latency, throughput, and efficiency is critical.
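The identical scores are exactly what a majority-class predictor yields here: with a 0.8995 majority rate, the majority class scores F1 = 2 × 0.8995 / 1.8995 ≈ 0.9470 and the minority class scores 0, giving Macro-F1 ≈ 0.4735. A quick way to detect this kind of collapse, sketched below with an assumed fitted model and scikit-learn's DummyClassifier:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

def beats_majority_baseline(model, X_train, y_train, X_test, y_test):
    """Return True only if the model's Macro-F1 exceeds a majority-class dummy."""
    dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    dummy_f1 = f1_score(y_test, dummy.predict(X_test), average="macro")
    model_f1 = f1_score(y_test, model.predict(X_test), average="macro")
    return model_f1 > dummy_f1
```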
Benchmark Results
| Model | Train Time (s) | Batch Inference (s) | Throughput (samples/s) | Accuracy | Macro-F1 | Single Mean (ms) | Single P95 (ms) |
|---|---|---|---|---|---|---|---|
| CatBoost | 10.556 | 0.0215 | 1,864,126 | 0.8995 | 0.4735 | 0.79 | 0.96 |
| XGBoost | 6.367 | 0.0245 | 1,635,031 | 0.8995 | 0.4735 | 0.55 | 0.73 |
| LightGBM | 7.641 | 0.0690 | 579,707 | 0.8995 | 0.4735 | 1.58 | 1.87 |
| Logistic Reg. | 4.160 | 0.0896 | 446,572 | 0.8995 | 0.4735 | 0.39 | 0.48 |
| Naive Bayes | 0.316 | 0.1179 | 339,184 | 0.8995 | 0.4735 | 0.17 | 0.21 |
| Extra Trees | 16.253 | 0.3030 | 132,018 | 0.8995 | 0.4735 | 26.45 | 35.72 |
| Random Forest | 148.571 | 0.2441 | 163,845 | 0.8995 | 0.4735 | 25.04 | 26.51 |
| SmartKNN | 69.707 | 5.2203 | 7,662 | 0.8995 | 0.4735 | 0.49 | 0.55 |
Key Takeaways Across Datasets
- Accuracy can be misleading under class imbalance
- Macro-F1 provides a more reliable signal of real performance
- Many datasets converge under default settings
- Differences often emerge in latency, throughput, and scalability, not just accuracy
Evaluation Note
These benchmarks are intended to evaluate model behavior under identical, untuned conditions, reflecting practical deployment scenarios rather than optimized leaderboard results.