Encoding Strategy
SmartML applies a deterministic, rule-based encoding strategy for tabular datasets to ensure fair and reproducible benchmarking across models.
The encoding process is fully automatic, but not opaque. Every decision is driven by fixed thresholds and explicit rules.
Overview
Encoding in SmartML consists of two independent parts:
- Feature encoding (numerical and categorical columns)
- Target encoding (classification or regression)
The same encoding logic is applied to all models to guarantee comparability.
Column Type Detection
SmartML inspects the input dataset and splits features into:
- Numerical columns
  - Detected via pandas numeric dtypes
- Categorical columns
  - Detected via `object` and `category` dtypes
No manual column specification is required.
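As a concrete illustration, this split can be reproduced with pandas dtype selection. The helper name and sample data below are illustrative, not part of SmartML's API:

```python
import pandas as pd

def split_columns(X: pd.DataFrame):
    """Split a DataFrame into numerical and categorical column lists by dtype."""
    numerical = X.select_dtypes(include=["number"]).columns.tolist()
    categorical = X.select_dtypes(include=["object", "category"]).columns.tolist()
    return numerical, categorical

X_train = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Berlin", "Paris", "Berlin"],
    "tier": pd.Categorical(["a", "b", "a"]),
})
num_cols, cat_cols = split_columns(X_train)
print(num_cols)  # ['age']
print(cat_cols)  # ['city', 'tier']
```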
Missing Value Handling
Missing values are handled before encoding using fixed strategies:
- Numerical features
  - Imputed using the median
- Categorical features
  - Imputed using the most frequent value
These strategies are chosen for robustness and low variance impact.
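A minimal sketch of these fixed strategies using scikit-learn's SimpleImputer (an assumption about the tooling; the source only names the strategies):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame({
    "income": [50_000, None, 72_000, 61_000],
    "segment": ["a", "b", None, "b"],
})

# Fixed, rule-based imputation: median for numerics, most frequent for categoricals.
numeric_imputer = SimpleImputer(strategy="median")
categorical_imputer = SimpleImputer(strategy="most_frequent")

X_num = numeric_imputer.fit_transform(X_train[["income"]])       # None -> 61000.0 (median)
X_cat = categorical_imputer.fit_transform(X_train[["segment"]])  # None -> 'b'
```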
Categorical Cardinality Analysis
Categorical columns are further divided based on cardinality.
A column is considered:
- Low-cardinality if unique values ≤ 10
- High-cardinality if unique values > 10
This decision is made per column using the training data only.
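The per-column split can be sketched with pandas `nunique`; the threshold constant and column names below are illustrative:

```python
import pandas as pd

CARDINALITY_THRESHOLD = 10  # fixed per-column threshold described above

X_train = pd.DataFrame({
    "color": ["red", "blue", "green"] * 4,            # 3 unique values
    "zip_code": [f"{10000 + i}" for i in range(12)],  # 12 unique values
})

cardinalities = X_train.nunique()  # computed on the training split only
low_card_cols = cardinalities[cardinalities <= CARDINALITY_THRESHOLD].index.tolist()
high_card_cols = cardinalities[cardinalities > CARDINALITY_THRESHOLD].index.tolist()
print(low_card_cols)   # ['color']
print(high_card_cols)  # ['zip_code']
```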
Encoding Decision Rules
SmartML uses a hybrid encoding strategy with strict limits to prevent feature explosion.
One-Hot Encoding (OHE)
One-Hot Encoding is used only if all conditions below are met:
- Total categorical columns ≤ 10
- Column cardinality ≤ 10
- Total unique values across low-cardinality columns ≤ 100
If these conditions are satisfied:
- Low-cardinality columns → One-Hot Encoding
- High-cardinality columns → Target Encoding
If any condition fails, One-Hot Encoding is disabled entirely.
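The gating logic can be summarized as a small predicate. Limits, function, and variable names here are illustrative rather than SmartML's internals:

```python
import pandas as pd

MAX_CATEGORICAL_COLUMNS = 10
MAX_COLUMN_CARDINALITY = 10
MAX_TOTAL_UNIQUE_VALUES = 100

def one_hot_allowed(X: pd.DataFrame, cat_cols) -> bool:
    """Return True only if every OHE condition holds; otherwise OHE is disabled."""
    if len(cat_cols) > MAX_CATEGORICAL_COLUMNS:
        return False
    cardinalities = X[cat_cols].nunique()
    low_card = cardinalities[cardinalities <= MAX_COLUMN_CARDINALITY]
    # The combined number of unique values across low-cardinality columns must
    # stay within budget; columns above the per-column limit are routed to
    # target encoding instead of one-hot encoding.
    return low_card.sum() <= MAX_TOTAL_UNIQUE_VALUES
```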
Target Encoding
Target Encoding is applied when any of the following holds:
- Column cardinality is high
- The OHE limits are exceeded
- The dataset has more categorical columns than the OHE limit allows
Target Encoding uses:
- Smoothing = 1.0
- Minimum samples per leaf = 1
These settings limit overfitting while keeping the encoding deterministic.
If categorical features exist, target values are required during fitting.
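Expressed with the category_encoders package (an assumption; the source does not name the underlying library), the configuration would look roughly like this:

```python
import pandas as pd
from category_encoders import TargetEncoder

X_train = pd.DataFrame({"zip_code": ["10115", "75001", "10115", "69117"]})
y_train = pd.Series([1, 0, 1, 0])

encoder = TargetEncoder(
    cols=["zip_code"],
    smoothing=1.0,        # fixed smoothing
    min_samples_leaf=1,   # fixed minimum samples per leaf
)
# The target is mandatory here, which is why y must be supplied whenever
# categorical features are present.
X_encoded = encoder.fit_transform(X_train, y_train)
```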
Numerical Feature Processing
Numerical columns are processed using:
- Median imputation
- No scaling
- No normalization
This ensures:
- Minimal distortion of feature distributions
- Model-agnostic preprocessing
- Faster execution
Feature Transformer Construction
All transformations are assembled into a single ColumnTransformer:
- Numerical pipeline
- Categorical pipeline(s)
- Dropped remainder columns
- Dense output enforced
Sparse outputs are explicitly disabled to maintain consistency across models.
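A sketch of how such a transformer might be assembled with scikit-learn; the column groups and pipeline names are placeholders, and the target-encoding branch is omitted for brevity:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

num_cols = ["age", "income"]   # assumed output of the column-type detection step
low_card_cols = ["color"]      # high-cardinality columns would get a target-encoding pipeline

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), num_cols),
        ("cat_low", Pipeline(steps=[
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),  # scikit-learn >= 1.2
        ]), low_card_cols),
    ],
    remainder="drop",  # unreferenced columns are dropped
)
# sparse_output=False keeps the one-hot block dense, so the assembled output
# stays a dense array across all models.
```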
Feature Output
After transformation:
- All features are converted to `float32`
- Output is a dense NumPy array
- Feature order is deterministic
This format is compatible with all supported ML and DL models.
Target Variable Encoding
How the target variable is encoded depends on the task type.
Classification Tasks
- Targets are converted to strings
- Label encoding is applied
- Mapping is learned on training labels only
This ensures consistent class indexing.
Regression Tasks
- Targets are passed through directly
- Converted to `float32`
- No further transformation is applied
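A combined sketch of both branches; the task_type flag and helper name are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_target(y_train: pd.Series, task_type: str) -> np.ndarray:
    if task_type == "classification":
        # Stringify, then label-encode; the mapping is learned on training labels only.
        return LabelEncoder().fit_transform(y_train.astype(str))
    # Regression: pass through as float32 with no further transformation.
    return y_train.to_numpy(dtype=np.float32)

print(encode_target(pd.Series(["cat", "dog", "cat"]), "classification"))  # [0 1 0]
print(encode_target(pd.Series([1.5, 2.0, 3.25]), "regression"))           # [1.5  2.  3.25]
```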
Determinism and Reproducibility
The encoding process is:
- Deterministic
- Stateless after fitting
- Independent of model choice
Given the same dataset and random seed, SmartML will always produce identical encoded outputs.
Feature Introspection
SmartML exposes metadata about the encoding process, including:
- Number of numerical features
- Number of categorical features
- Whether OHE was used
- Count of low- and high-cardinality columns
- Column names per category group
This information is used internally for logging and debugging.
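A hypothetical shape of that metadata is shown below; the field names are assumptions, not SmartML's actual attributes:

```python
# Illustrative only: field names are assumed, not taken from SmartML's API.
encoding_info = {
    "n_numerical_features": 3,
    "n_categorical_features": 4,
    "ohe_used": True,
    "n_low_cardinality_columns": 3,
    "n_high_cardinality_columns": 1,
    "numerical_columns": ["age", "income", "tenure"],
    "low_cardinality_columns": ["color", "segment", "region"],
    "high_cardinality_columns": ["zip_code"],
}
```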
Design Rationale
This encoding strategy prioritizes:
- Fairness across models
- Controlled feature dimensionality
- Predictable behavior
- Benchmark stability
Aggressive feature engineering and custom transformations are intentionally excluded.