Predictive Modelling in Imbalanced Categorical Data – Synthetic Minority Oversampling Technique

Joe Tanner, University of Limerick

Co-author: Helen Purtill, University of Limerick

Abstract: This paper evaluates the predictive performance of several sampling techniques and supervised learning models for classification tasks in the presence of class imbalance. Specifically, it examines the development and application of the Synthetic Minority Oversampling Technique (SMOTE) for categorical variables in R. Four distinct SMOTE-based models are compared against classical sampling approaches across three classifiers: logistic regression, random forest, and support vector machines. Performance is assessed using precision, recall, and F1 scores to show how effectively each strategy handles imbalanced datasets.
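
For reference, the three performance measures named above have the following standard definitions. This is a sketch under one assumption that the abstract does not state: the minority class is treated as the positive class, with TP, FP, and FN denoting the counts of true positives, false positives, and false negatives for that class.

```latex
% Standard definitions; treating the minority class as "positive" is an assumption.
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align*}
```

None of these measures uses the true negative count, so a model cannot score well on them simply by predicting the majority class for every observation, which is why they are commonly preferred over overall accuracy on imbalanced data.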