George Mason University
CSI/Statistics Colloquium Series
Seminar Announcement


Record Linkage and Datamining

William E. Winkler


U.S. Bureau of the Census


ABSTRACT
Fellegi and Sunter (1969) provided a formal mathematical model for record linkage that has been applied to name and address lists. They proved the optimality of the decision rules and introduced methods for estimating crucial probabilities. The model was rediscovered by computer scientists (Cooper and Maron 1978) at Berkeley and applied to information retrieval. More recent applications have been to Bayesian networks by Lewis (1994) at Bell Labs, Heckerman (1996) at Microsoft Research, and Nigam, McCallum, Thrun and Mitchell (1999) at Carnegie Mellon. This talk provides an overview of record linkage. It shows how versions of the EM algorithm (Winkler 1988, 1989, 1993; Meng and Rubin 1993) can be used to estimate crucial matching probabilities (parameters). Because theoretical models are drastically affected by messy data, it shows how string comparator metrics (Winkler 1990) are used to adjust the likelihoods applied in the decision rules. Error rates can be estimated when training data are available (Belin and Rubin 1995) and, sometimes, when they are unavailable (Winkler 1994, Larsen and Rubin 1999). Sets of assignments can be optimized by an assignment algorithm (Winkler 1994) that uses 1/500 of the storage of the classic Burkard-Derigs algorithm and is particularly effective in re-identification experiments that evaluate the confidentiality of microdata (Winkler 1998, Fuller 1993). The talk concludes by showing how the recent text classification ideas of Nigam et al. (1999) can be extended to terms that are dependent. To make better use of prior knowledge, it also shows how parameters can be restricted to convex regions of the parameter space.
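
For readers unfamiliar with the decision rules mentioned in the abstract, the following is a minimal sketch of a Fellegi-Sunter style match weight under a conditional independence assumption. It is not the speaker's software. The field names, the m- and u-probabilities, and the two thresholds are hypothetical values chosen for illustration; in practice the probabilities would be estimated (for example by the EM algorithm) and the thresholds set from target error rates.

```python
import math

# Hypothetical probabilities of field agreement (assumed for illustration):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_PROBS = {"surname": 0.95, "first_name": 0.90, "zip": 0.85}
U_PROBS = {"surname": 0.01, "first_name": 0.05, "zip": 0.10}


def match_weight(agreement, m_probs=M_PROBS, u_probs=U_PROBS):
    """Sum log2 likelihood ratios over fields, assuming conditional independence."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, upper=6.0, lower=0.0):
    """Three-way decision; the thresholds are assumed, not derived from error rates."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


if __name__ == "__main__":
    pair = {"surname": True, "first_name": False, "zip": False}
    w = match_weight(pair)
    print(round(w, 2), classify(w))
```

Pairs whose weight falls between the two thresholds are held for clerical review, which is the middle outcome of the three-way rule.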


Friday, November 12, 1999
George W. Johnson Center, Assembly Room B
Seminar at 10:45 a.m.
Refreshments at 10:30 a.m.
For the 1999 Fall Seminar Schedule, go to
http://www.science.gmu.edu/statseminars