Discovering patterns in categorical data

Adam Korczynski, SGH Warsaw School of Economics

Abstract: Clustering is well defined for continuous data, where it follows intuitive geometric characteristics. When it comes to categorical data, we have to search for alternatives, especially in view of the various types of scale, namely binary, nominal, and ordinal. These scales are common in research aimed at assessing attitudes and opinions based on questionnaires constructed using the Likert scale. This is typical for studies focused on determining the existence of certain features as well as the intensity of attitudes. In such a framework, data are typically collected for a number of subjects within each object of interest (e.g. countries) and for a larger number of characteristics. The paper describes a technique for identifying similarities between objects characterized by larger numbers of subjects. It is distribution-free, applies to mixed-scale problems, and has a straightforward interpretation. It can be applied to identify groups of objects that are similar in view of a single variable or multiple characteristics. Technically, it follows the idea of creating all possible pairs of objects and verifying similarities through a pre-specified distance measure.
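
The pairwise idea in the last sentence can be illustrated with a minimal sketch for a single Likert-type variable. The abstract does not name the distance measure, so the total variation distance between category shares used here is an assumption, as are the function names, the similarity threshold, and the toy data; none of them come from the paper.

```python
from collections import Counter
from itertools import combinations


def category_shares(responses, categories):
    """Relative frequency of each category among one object's subjects."""
    counts = Counter(responses)
    n = len(responses)
    return [counts[c] / n for c in categories]


def total_variation(p, q):
    """Half the L1 distance between two share vectors (0 = identical, 1 = disjoint).

    Assumed stand-in for the paper's pre-specified distance measure.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def similar_pairs(data, categories, threshold):
    """Form all pairs of objects and keep those whose distance is within the threshold."""
    shares = {name: category_shares(resp, categories) for name, resp in data.items()}
    result = []
    for a, b in combinations(sorted(shares), 2):
        d = total_variation(shares[a], shares[b])
        if d <= threshold:
            result.append((a, b, d))
    return result


# Hypothetical toy data: three objects (e.g. countries), eight subjects each,
# answering one question on a 5-point Likert scale.
data = {
    "A": [1, 2, 2, 3, 3, 3, 4, 5],
    "B": [1, 2, 2, 3, 3, 3, 4, 4],
    "C": [4, 5, 5, 5, 5, 5, 5, 5],
}
categories = [1, 2, 3, 4, 5]

pairs = similar_pairs(data, categories, threshold=0.2)
print(pairs)
```

In this toy example only objects A and B fall within the assumed threshold. Comparing category shares rather than means treats the scale as nominal; an ordinal-aware distance could be substituted in `total_variation` without changing the pairing step.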