Categorical variable encoding methods for tabular data: a benchmarking study
- Author(s): Clerici, F; Nobani, N
- Document Type:
Electronic Resource
- Online Access:
https://hdl.handle.net/10281/589521
volume: 22
issue: 1
journal: International Journal of Data Science and Analytics
- Additional Information
- Publisher Information:
Springer Nature, country: CH, 2026
- Abstract:
Machine learning models often require numerical inputs, making the encoding of categorical features a critical step in the data preprocessing pipeline. A wide range of encoding methods, such as the commonly used one-hot encoding, are available, but they may not always be optimal due to increased dimensionality and a lack of sensitivity to the inherent relationships between categories. This paper presents a comprehensive evaluation of 26 categorical encoding techniques, benchmarked across 13 real-world datasets and 7 different machine learning algorithms. Our study categorizes these methods based on predictive task type, model performance, and computational efficiency, offering a taxonomy for selecting encoders. In addition, we illustrate how Safe AI metrics can be applied to encoding pipelines, showing that they provide complementary insights into model robustness and fairness. Finally, we provide a Python tool called EncodeHero that enables researchers and practitioners to (1) extend the results by augmenting the benchmark with their own data and (2) choose the best encoding methodology based on their data and technical constraints.
- Subject Terms:
- Availability:
Open access content.
- Note:
Print
English
- Other Numbers:
ITBAO oai:boa.unimib.it:10281/589521
10.1007/s41060-025-00886-w
1574047117
- Contributing Source:
BICOCCA OPEN ARCH
From OAIster®, provided by the OCLC Cooperative.
- Accession Number:
edsoai.on1574047117