Encoding Strategy

SmartML applies a deterministic, rule-based encoding strategy for tabular datasets to ensure fair and reproducible benchmarking across models.

The encoding process is fully automatic, but not opaque. Every decision is driven by fixed thresholds and explicit rules.


Overview

Encoding in SmartML consists of two independent parts:

  • Feature encoding (numerical and categorical columns)
  • Target encoding (classification or regression)

The same encoding logic is applied to all models to guarantee comparability.


Column Type Detection

SmartML inspects the input dataset and splits features into:

  • Numerical columns
      • Detected via pandas numeric dtypes

  • Categorical columns
      • Detected via object and category dtypes

No manual column specification is required.
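
The dtype-based split described above can be sketched with pandas alone. This is a minimal illustration, not SmartML's actual code: the helper name split_columns and the sample columns are hypothetical.

```python
import pandas as pd

def split_columns(df: pd.DataFrame):
    """Split feature columns by dtype: numeric vs. object/category."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
    return numerical, categorical

# Hypothetical sample data; "city" is stored with the object dtype explicitly.
df = pd.DataFrame({
    "age": [31, 45, 27],
    "income": [52000.0, 61000.5, 48000.0],
    "city": pd.Series(["Paris", "Lyon", "Paris"], dtype="object"),
    "tier": pd.Categorical(["a", "b", "a"]),
})
numerical, categorical = split_columns(df)
```

Columns with any other dtype (for example datetimes) land in neither group under this sketch.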


Missing Value Handling

Missing values are handled before encoding using fixed strategies:

  • Numerical features
      • Imputed using the median

  • Categorical features
      • Imputed using the most frequent value

These strategies are chosen for robustness: the median is insensitive to outliers, and the most frequent value does not introduce a new category.
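
As a rough sketch of these two strategies, the fill values can be learned from the training data and then reused for any later data. The helper name fit_imputation_values is hypothetical, and learning the values on the training split only is an assumption made here for consistency with the rest of the page.

```python
import numpy as np
import pandas as pd

def fit_imputation_values(train, numerical, categorical):
    """Learn one fill value per column from the training data."""
    fill = {}
    for col in numerical:
        fill[col] = train[col].median()  # median ignores NaN by default
    for col in categorical:
        # mode() returns values in sorted order, so ties resolve deterministically.
        # Assumes the column has at least one non-missing value.
        fill[col] = train[col].mode(dropna=True).iloc[0]
    return fill

# Hypothetical sample data with one missing value per column.
train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": pd.Series(["Paris", None, "Lyon", "Paris"], dtype="object"),
})
fill = fit_imputation_values(train, ["age"], ["city"])
imputed = train.fillna(fill)
```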


Categorical Cardinality Analysis

Categorical columns are further divided based on cardinality.

A column is considered:

  • Low-cardinality if unique values ≤ 10
  • High-cardinality if unique values > 10

This decision is made per column using the training data only.
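
A minimal sketch of the per-column split, using the threshold of 10 stated above. The function name is hypothetical, and excluding missing values from the count is an assumption.

```python
import pandas as pd

CARDINALITY_THRESHOLD = 10  # threshold stated in the text

def split_by_cardinality(train, categorical, threshold=CARDINALITY_THRESHOLD):
    """Classify each categorical column using training data only."""
    low, high = [], []
    for col in categorical:
        n_unique = train[col].nunique(dropna=True)  # assumption: NaN is not counted
        if n_unique <= threshold:
            low.append(col)
        else:
            high.append(col)
    return low, high

# Hypothetical sample data: "color" has 3 unique values, "zip" has 11.
train = pd.DataFrame({
    "color": ["r", "g", "b"] * 4,
    "zip": [f"z{i}" for i in range(11)] + ["z0"],
})
low, high = split_by_cardinality(train, ["color", "zip"])
```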


Encoding Decision Rules

SmartML uses a hybrid encoding strategy with strict limits to prevent feature explosion.

One-Hot Encoding (OHE)

One-Hot Encoding is used only if all conditions below are met:

  • Total categorical columns ≤ 10
  • Column cardinality ≤ 10
  • Total unique values across low-cardinality columns ≤ 100

If these conditions are satisfied:

  • Low-cardinality columns → One-Hot Encoding
  • High-cardinality columns → Target Encoding

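If either dataset-level limit (total categorical columns or total unique values) is exceeded, One-Hot Encoding is disabled entirely and all categorical columns are target-encoded.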
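
One possible reading of these rules, written as a small decision function. It assumes the column-count and total-unique-value limits act as dataset-level switches, while the cardinality limit routes individual columns. The function name, the returned dictionary layout, and the requirement of at least one low-cardinality column are assumptions of this sketch.

```python
MAX_CAT_COLUMNS = 10         # total categorical columns
MAX_CARDINALITY = 10         # unique values per one-hot column
MAX_TOTAL_OHE_VALUES = 100   # unique values summed over low-cardinality columns

def plan_encoding(cardinalities):
    """Map {column name: unique-value count} to an encoding plan."""
    low = [c for c, n in cardinalities.items() if n <= MAX_CARDINALITY]
    high = [c for c, n in cardinalities.items() if n > MAX_CARDINALITY]
    use_ohe = (
        len(low) > 0  # assumption: nothing to one-hot encode means OHE is off
        and len(cardinalities) <= MAX_CAT_COLUMNS
        and sum(cardinalities[c] for c in low) <= MAX_TOTAL_OHE_VALUES
    )
    if use_ohe:
        return {"use_ohe": True, "one_hot": low, "target": high}
    # Dataset-level limit exceeded: every categorical column is target-encoded.
    return {"use_ohe": False, "one_hot": [], "target": list(cardinalities)}

mixed = plan_encoding({"color": 3, "zip": 500})
too_many = plan_encoding({f"c{i}": 2 for i in range(11)})
```

With the thresholds as stated, at most 10 columns of at most 10 values each can never exceed 100 unique values in total, so the third check only matters if the limits are changed independently. The sketch keeps it for fidelity to the listed rules.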


Target Encoding

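Target Encoding is applied to a categorical column when any of the following holds:

  • The column is high-cardinality (more than 10 unique values)
  • The One-Hot Encoding limits are exceeded
  • The dataset has more than 10 categorical columns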

Target Encoding uses:

  • Smoothing = 1.0
  • Minimum samples per leaf = 1

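Smoothing pulls the encoding of rarely seen categories toward the global target mean, which limits overfitting, and the fixed parameters keep the result deterministic.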

If categorical features exist, target values are required during fitting.
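
The page does not say which target-encoding formula or library is used. The sketch below assumes one common formulation: a sigmoid-weighted blend between each category's target mean and the global mean, controlled by the two parameters listed above. The function names are hypothetical, and sending unseen categories to the global mean is also an assumption.

```python
import numpy as np
import pandas as pd

SMOOTHING = 1.0        # value stated in the text
MIN_SAMPLES_LEAF = 1   # value stated in the text

def fit_target_encoding(categories: pd.Series, target: pd.Series):
    """Learn a category -> encoded value mapping (assumed sigmoid blend)."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["count", "mean"])
    # Weight approaches 1 for frequent categories and 0.5 at count == MIN_SAMPLES_LEAF.
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - MIN_SAMPLES_LEAF) / SMOOTHING))
    mapping = prior * (1.0 - weight) + stats["mean"] * weight
    return mapping, prior

def apply_target_encoding(categories: pd.Series, mapping: pd.Series, prior: float):
    """Encode a column; unseen categories fall back to the global mean."""
    return categories.map(mapping).fillna(prior).astype("float32")

# Hypothetical training data for a binary target.
cats = pd.Series(["a", "a", "a", "b"])
y = pd.Series([1.0, 1.0, 0.0, 0.0])
mapping, prior = fit_target_encoding(cats, y)
encoded = apply_target_encoding(pd.Series(["a", "b", "c"]), mapping, prior)
```

Because the mapping is learned from the target, fitting requires target values, which matches the requirement stated above.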


Numerical Feature Processing

Numerical columns are processed using:

  • Median imputation
  • No scaling
  • No normalization

This ensures:

  • Minimal distortion of feature distributions
  • Model-agnostic preprocessing
  • Faster execution

Feature Transformer Construction

All transformations are assembled into a single ColumnTransformer:

  • Numerical pipeline
  • Categorical pipeline(s)
  • Remaining columns are dropped
  • Dense output is enforced

Sparse outputs are explicitly disabled to maintain consistency across models.
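
A minimal scikit-learn sketch of such a transformer, covering only the numerical and one-hot branches; the target-encoding branch is left out to keep the example short. The function name build_transformer, the step names, and the handle_unknown setting are assumptions, not SmartML's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def build_transformer(numerical, low_cardinality):
    """Assemble the numerical and one-hot pipelines into one ColumnTransformer."""
    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
    ])
    one_hot_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # assumed setting
    ])
    return ColumnTransformer(
        transformers=[
            ("num", numeric_pipeline, numerical),
            ("ohe", one_hot_pipeline, low_cardinality),
        ],
        remainder="drop",      # columns not listed above are dropped
        sparse_threshold=0.0,  # always return a dense array
    )

# Hypothetical training data; "id" is not listed and is therefore dropped.
train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": pd.Series(["Paris", np.nan, "Lyon"], dtype="object"),
    "id": [1, 2, 3],
})
transformer = build_transformer(["age"], ["city"])
X = transformer.fit_transform(train).astype(np.float32)
```

Output columns follow the order of the transformers list, so the layout is the same on every run.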


Feature Output

After transformation:

  • All features are converted to float32
  • Output is a dense NumPy array
  • Feature order is deterministic

This format is compatible with all supported ML and DL models.


Target Encoding Logic

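Encoding of the target variable itself (not to be confused with the Target Encoding of categorical features described above) depends on task type.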

Classification Tasks

  • Targets are converted to strings
  • Label encoding is applied
  • Mapping is learned on training labels only

This ensures consistent class indexing.


Regression Tasks

  • Targets are passed through directly
  • Values are cast to float32
  • No scaling or other transformation is applied
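
Both target paths can be sketched as follows. Using scikit-learn's LabelEncoder is an assumption (the page only says label encoding is applied), and the helper names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

def fit_classification_target(y_train):
    """Learn the class -> index mapping from training labels only."""
    encoder = LabelEncoder()
    encoder.fit(np.asarray(y_train).astype(str))  # labels are compared as strings
    return encoder

def encode_classification_target(encoder, y):
    """Apply the learned mapping; labels unseen during fitting raise ValueError."""
    return encoder.transform(np.asarray(y).astype(str))

def encode_regression_target(y):
    """Pass regression targets through, cast to float32."""
    return np.asarray(y, dtype=np.float32)

# Hypothetical labels: note that classes are ordered as strings, not as numbers.
encoder = fit_classification_target([2, 10, 2, 1])
codes = encode_classification_target(encoder, [2, 10, 2, 1])
y_reg = encode_regression_target([1.5, 2])
```

Because labels are converted to strings first, integer classes are indexed in lexicographic order, so "10" sorts before "2" in this example.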

Determinism and Reproducibility

The encoding process is:

  • Deterministic
  • Fixed after fitting (transforming new data does not change any learned state)
  • Independent of model choice

Given the same dataset and random seed, SmartML will always produce identical encoded outputs.


Feature Introspection

SmartML exposes metadata about the encoding process, including:

  • Number of numerical features
  • Number of categorical features
  • Whether OHE was used
  • Count of low- and high-cardinality columns
  • Column names per category group

This information is used internally for logging and debugging.


Design Rationale

This encoding strategy prioritizes:

  • Fairness across models
  • Controlled feature dimensionality
  • Predictable behavior
  • Benchmark stability

Aggressive feature engineering and custom transformations are intentionally excluded.