An unsupervised record linkage approach using household information to enhance individual matching across different databases

Brendan Murphy, University College Dublin

Abstract: Record linkage involves matching records from multiple sources that refer to the same entity. This process is complicated by typographical errors, inconsistencies, and changes over time in the variables used for matching. Applications of record linkage range from studying historical population mobility to enhancing medical records by integrating diverse databases.

Motivated by studies demonstrating improved matching performance when group information is considered, we propose a model for the joint estimation of individual and household match status, using multinomial and Gaussian distributions to model differences in the matching variables. Our approach not only estimates match status for individuals and households but also yields parameter estimates that provide valuable insight into the relevance of each variable to the matching process.
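To make the modelling idea concrete, the following is a minimal sketch of an unsupervised two-class mixture over record pairs, fitted by EM. It is not the model proposed in the talk: it covers only the individual level, with one categorical (multinomial) agreement variable and one Gaussian difference variable, and it omits the household layer entirely. The function name `fit_pair_mixture`, its arguments, the starting guess, and the conditional-independence assumption between the two variables are all illustrative choices, not taken from the source.

```python
import math

def fit_pair_mixture(levels, diffs, n_levels, n_iter=50):
    """EM for a two-class (match / non-match) mixture over record pairs.

    levels : agreement level per pair, an int in 0..n_levels-1 (categorical part)
    diffs  : numeric difference per pair, e.g. an age gap (Gaussian part)
    Returns (match probability per pair, fitted parameters per class).
    """
    top = n_levels - 1
    # Crude starting guess (an assumption): full agreement suggests a match.
    resp = [0.9 if lv == top else 0.1 for lv in levels]
    for _ in range(n_iter):
        params = []
        for k in (0, 1):  # k = 0: match class, k = 1: non-match class
            w = [r if k == 0 else 1.0 - r for r in resp]
            total = sum(w) + 1e-12  # guard against an empty class
            # M-step: class weight, categorical probabilities, Gaussian mean/variance.
            theta = [1e-6] * n_levels
            for wi, lv in zip(w, levels):
                theta[lv] += wi
            norm = sum(theta)
            theta = [t / norm for t in theta]
            mu = sum(wi * d for wi, d in zip(w, diffs)) / total
            var = sum(wi * (d - mu) ** 2 for wi, d in zip(w, diffs)) / total + 1e-6
            params.append({"weight": total / len(resp), "theta": theta,
                           "mu": mu, "var": var})
        # E-step: posterior probability that each pair is a match.
        resp = []
        for lv, d in zip(levels, diffs):
            logp = [
                math.log(p["weight"]) + math.log(p["theta"][lv])
                - 0.5 * math.log(2 * math.pi * p["var"])
                - 0.5 * (d - p["mu"]) ** 2 / p["var"]
                for p in params
            ]
            m = max(logp)  # subtract the maximum for numerical stability
            e0, e1 = math.exp(logp[0] - m), math.exp(logp[1] - m)
            resp.append(e0 / (e0 + e1))
    return resp, params
```

In this sketch, the fitted `theta` and `var` values play the role the abstract assigns to parameter estimates: a variable whose distribution differs sharply between the two classes carries more weight in the matching decision than one whose distributions are similar.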

We illustrate our methodology using data from the 1901 and 1911 Censuses of Ireland and the Italian Survey of Household Income and Wealth from 2014 and 2016. The results show a significant improvement in precision and recall when household information is included, compared to a direct matching approach that ignores group information.
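For readers unfamiliar with the evaluation measures, here is a short sketch of how precision and recall are typically computed for a set of declared links. The function name `precision_recall` and the representation of a link as a pair of record identifiers are assumptions made for illustration; the abstract does not describe the evaluation code.

```python
def precision_recall(predicted_links, true_links):
    """Precision and recall of declared links against known true links.

    Each link is a hashable pair (id_in_first_file, id_in_second_file).
    """
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    # Precision: share of declared links that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true links that were found.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

With hypothetical identifiers, declaring the links `[(1, 1), (2, 2), (3, 4)]` when the true links are `[(1, 1), (2, 2), (3, 3), (5, 5)]` gives two correct links out of three declared and out of four true, so precision is 2/3 and recall is 1/2.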