George Mason University
AES/SCS Statistics Colloquium Series
Seminar Announcement



Machine Learning Methods for Using and Analyzing Text

William E. Winkler


U.S. Bureau of the Census


ABSTRACT

Textual information consisting of words can be used in areas such as classification of documents into categories, queries in web and library searches, and the record linkage of name and address lists. To use text effectively, the text may need to be cleaned to remove typographical errors, and documents (records) must be given a mathematical representation in a probabilistic model. This talk describes an application of Bayesian networks to classify a collection of Reuters newspaper articles (Lewis 1992) into categories (Nigam, McCallum, Thrun, and Mitchell 2000; Winkler 2000). The generalization involves a method for finding parsimonious interactions between words within classes that is related to the statistical mixture methods of Winkler (1993) and Larsen and Rubin (2001). The results are indirectly compared with the current best-performing methods, such as Support Vector Machines (Vapnik 1995) and Boosting (Schapire and Singer 2000; Friedman, Hastie, and Tibshirani 2000). The theoretical method is also compared to Probabilistic Latent Semantic Indexing (Hofmann 1999), the Information Bottleneck method (Slonim and Tishby 2001), and Hierarchical Mixtures of Experts (Jordan and Jacobs 1994).


Friday, January 25, 2002
George W. Johnson Center, Assembly Room B
Seminar at 10:45 a.m.
Refreshments at 10:30 a.m.
For the 2002 Spring Seminar Schedule, go to
www.science.gmu.edu/statseminars