Model-Based Sequential Outlier Identification for Linear Cluster-Weighted Models

Ultán Doherty, Trinity College Dublin

Co-authors: Paul D. McNicholas, McMaster University; Arthur White, Trinity College Dublin

Abstract: Cluster-weighted models combine regression and model-based clustering by jointly modelling explanatory variables and a response variable. This allows them to simultaneously perform model-based clustering and clusterwise multiple regression by constructing a mixture model. In this paper, we focus on (Gaussian) linear cluster-weighted models, that is, cluster-weighted models in which the explanatory variables are assumed to follow a multivariate Gaussian distribution and the relationship between the response variable and those explanatory variables is linear with Gaussian errors.
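
For concreteness, the joint density of a Gaussian linear cluster-weighted model is commonly written in the form below. This is the standard textbook formulation, not notation taken from the paper: the symbols G (number of components), \pi_g (mixing proportions), \beta_{0g} and \boldsymbol{\beta}_g (regression intercept and slopes), \sigma_g^2 (error variance), and \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g (mean and covariance of the explanatory variables) are assumed here for illustration.

```latex
p(\mathbf{x}, y)
  = \sum_{g=1}^{G} \pi_g \,
    \phi\!\left(y \mid \beta_{0g} + \boldsymbol{\beta}_g^{\top}\mathbf{x},\; \sigma_g^{2}\right)
    \phi_d\!\left(\mathbf{x} \mid \boldsymbol{\mu}_g,\; \boldsymbol{\Sigma}_g\right),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 .
```

Here \phi denotes the univariate Gaussian density of the response given the explanatory variables, and \phi_d the d-variate Gaussian density of the explanatory variables. Each component therefore contributes both a cluster for \mathbf{x} and its own linear regression of y on \mathbf{x}.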

The presence of outliers in a data set can hinder the ability of linear cluster-weighted models (CWMs) to accurately recover the underlying model of the true data points. We present outlierMBC-LCWM, a method for sequentially identifying outliers while fitting a linear CWM. Our method requires neither a pre-specified number of outliers nor any assumptions about the distribution of the outliers. We apply our method to both simulated and real data sets and demonstrate its ability to correctly identify outliers and accurately model both the cluster structure and the response variable of the true observations.