George Mason University
CSI/Statistics Colloquium Series
Seminar Announcement
Record Linkage and Datamining
William E. Winkler
U.S. Bureau of the Census
ABSTRACT
Fellegi and Sunter (1969) provided a formal mathematical model for
record linkage that has been applied to name and address lists. They
proved the optimality of the decision rules and introduced methods for
estimating crucial probabilities. The model was later rediscovered by
computer scientists (Cooper and Maron 1978) at Berkeley and applied
to information retrieval. More recent applications have been to
Bayesian networks by Lewis (1994) at Bell Labs, Heckerman (1996) at
Microsoft Research, and Nigam, McCallum, Thrun and Mitchell (1999) at
Carnegie Mellon. This talk provides an overview of record linkage.
It shows how versions of the EM algorithm (Winkler 1988, 1989, 1993,
Meng and Rubin 1993) can be used in estimating crucial matching
probabilities (parameters). Because theoretical models are
drastically affected by messy data, it shows how string comparator
metrics (Winkler 1990) are used in adjusting the likelihoods applied
in the decision rules. Error rates can be estimated when training
data are available (Belin and Rubin 1995) and, in some cases, when
they are not (Winkler 1994, Larsen and Rubin 1999). Sets of
assignments can be optimized by an assignment algorithm (Winkler 1994)
that uses 1/500 the storage of the classic Burkard-Derigs algorithm
and is particularly effective in re-identification experiments that
evaluate the confidentiality of microdata (Winkler 1998, Fuller
1993). The talk concludes by showing how the recent text
classification ideas of Nigam et al. (1999) can be extended to terms
that are dependent. To make better use of prior knowledge, it also
shows how parameters can be restricted to convex regions of the
parameter space.
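
As a rough illustration of the kind of decision rule the abstract refers to,
the sketch below scores a record pair by summing per-field log-likelihood
ratios and compares the total with two cutoffs. It is a minimal sketch under
stated assumptions, not the speaker's implementation: the function names, the
three example fields, the m- and u-probabilities, and the thresholds are all
hypothetical, and fields are assumed conditionally independent.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_PROBS = [0.9, 0.8, 0.95]   # e.g. surname, first name, street
U_PROBS = [0.1, 0.2, 0.05]


def match_weight(agreements, m_probs, u_probs):
    """Sum of log2 likelihood ratios over fields, assuming the fields
    agree or disagree independently given match status."""
    total = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        if agree:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, lower, upper):
    """Three-way rule: link, non-link, or hold for clerical review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


if __name__ == "__main__":
    # Agreement patterns for three hypothetical record pairs.
    for pattern in ([True, True, True], [True, True, False], [False, False, False]):
        w = match_weight(pattern, M_PROBS, U_PROBS)
        print(pattern, round(w, 3), classify(w, lower=-3.0, upper=3.0))
```

In practice the m- and u-probabilities would be estimated from the data (for
example with the EM algorithm mentioned above) rather than fixed by hand, and
the cutoffs would be chosen to meet target error rates.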
Friday, November 12, 1999
George W. Johnson Center, Assembly Room B
Seminar at 10:45 a.m.
Refreshments at 10:30 a.m.
For the 1999 Fall Seminar Schedule, go to
http://www.science.gmu.edu/statseminars