CSI 991

Seminar in Computational Statistics:

Robust Methods in Statistics

Spring, 2001

Fridays 3:00pm -- 5:00pm, Science & Technology, Room 101C

csutton@gmu.edu
jgentle@gmu.edu
The follow-up seminar is on Wilcox's new book.
Allen has made some comments on Chapters 3 and 4.
We plan to pull together information from many sources and do something of a comparative study of various methods. We hope that by collectively discussing the various procedures, we'll become really comfortable with using them and learn about the strengths and weaknesses of each one. We can also compare and contrast robust methods with classical (least squares, likelihood-based, etc.), nonparametric, and computationally intensive (e.g., bootstrap) methods.

One approach will be to focus on a single problem at a time (e.g., simple regression, two-sample tests, estimation of the mean), and explore many different methods that can be used to address the problem. We plan to develop S-PLUS programs to perform Monte Carlo studies with which to compare the various methods.


General References


Topics

  1. Introduction
    1. Fitting Models to Data
      1. Maximum Likelihood
      2. Minimizing Residuals
      3. Fitting Quantiles
      4. Fitting Moments
      5. Model Selection
    2. Statistical Inference
      1. Parametric Estimates and Tests
      2. More General Inference about the Model
      3. The Central Role of the Normal Distribution
    3. Effects of Contamination and Heavy Tails
      1. More Realistic Data Models
      2. Examples
    4. Influence
    5. Leverage
  2. Robust Methods
    1. Measuring Robustness
      1. Breakdown
      2. Efficiency
    2. Minimizing Residuals
      1. ${\rm L}_p$ Estimators
      2. $M$ Estimators
      3. Trimming Residuals
    3. Fitting Quantiles
    4. Structure in Data
    5. Controlling Both Influence and Leverage
      1. Minimum Volume Ellipsoids
      2. Transformations
    6. Comparisons and Summary of Methods for Fitting
  3. Inference Using Robust Fits
    1. Methods Based on Approximations of Normality
    2. Monte Carlo Methods
    3. Bootstrap Methods
    4. Tests and Confidence Intervals in the One-Sample Case
    5. Inference in the Two-Sample Case
    6. Inference for Linear Regression Models
    7. Inference for Nonlinear Regression
    8. Comparisons and Summary
  4. Assessing Fitted Models
    1. Inspection of Residuals
      1. General Measures of Departures from Normality
      2. Graphical Methods
      3. Identification of Outliers
      4. Tail Weight
    2. Cross Validation
  5. Data Exploration and Adaptive Robust Methods
    1. Model Selection
      1. Variable Selection
      2. Sensitivity Analysis
      3. Cross Validation
    2. Graphical Methods
    3. Projections and Other Transformations
    4. Two-Stage Methods

We do not plan to cover the topics in the order above. Some of the topics at the beginning of the outline are listed to remind us of the basics. We will assume some knowledge of those topics, and we will skip most of them on the first pass, but we may return to them in subsequent discussions.

We plan to focus on a single problem at a time (e.g., simple regression, two-sample tests, estimation of the mean), and explore many different methods that can be used to address the problem. We will develop S-Plus programs to perform Monte Carlo studies with which to compare the various methods.
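As an illustration of the kind of Monte Carlo comparison described above, here is a minimal sketch for the problem of estimating a center. It is written in Python rather than S-Plus, and it is not one of the seminar's programs: the function names, the contaminated-normal model (10% of points drawn with standard deviation 10), the sample size, and the number of replications are all assumptions chosen for the example.

```python
import random
import statistics


def trimmed_mean(values, trim=0.1):
    """Mean after dropping the fraction `trim` of points from each tail."""
    ordered = sorted(values)
    k = int(trim * len(ordered))
    return statistics.fmean(ordered[k:len(ordered) - k])


def contaminated_normal(rng, n, eps=0.1, wide_sd=10.0):
    """Draw n points: N(0, 1) with probability 1 - eps, else N(0, wide_sd^2)."""
    return [rng.gauss(0.0, wide_sd if rng.random() < eps else 1.0)
            for _ in range(n)]


def monte_carlo_mse(n=20, reps=2000, seed=2001):
    """Estimate the mean squared error of each estimator (true center is 0)."""
    rng = random.Random(seed)
    estimators = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "trimmed mean": trimmed_mean,
    }
    sums = {name: 0.0 for name in estimators}
    for _ in range(reps):
        sample = contaminated_normal(rng, n)
        for name, estimate in estimators.items():
            sums[name] += estimate(sample) ** 2
    return {name: total / reps for name, total in sums.items()}


if __name__ == "__main__":
    for name, mse in monte_carlo_mse().items():
        print(f"{name:>12s}: {mse:.4f}")
```

Every estimator sees the same simulated samples, so differences in the estimated mean squared errors reflect the estimators rather than sampling noise between runs.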


Schedule


Software and Datasets

A little test dataset from Birkes and Dodge is
  y   x1   x2 
 37    4   22
 40    6   24
 48    6   18
 44    9   20
 50   11   15
 51   12    9

Data in B&S Table 5.3 (courtesy of Arndt)

The main data-fitting routines are rrega (an alternative version of the standard rreg in S-Plus) and rlmx (an alternative version of rlm from the MASS library of Venables and Ripley).
These routines can be used to fit the model using a variety of methods. We are interested in tests of hypotheses using test statistics obtained by these fits.

We will concentrate first on the simple linear regression case.
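As a baseline for the simple linear regression case, the following Python sketch fits a least-squares line to the little test dataset listed above. Using x1 as the single predictor is an assumption made for the example, and the function name least_squares_line is hypothetical; it is not one of the seminar's S-Plus routines.

```python
import statistics

# Test dataset from Birkes and Dodge, as listed above (x2 is not used here).
y = [37, 40, 48, 44, 50, 51]
x1 = [4, 6, 6, 9, 11, 12]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    xbar = statistics.fmean(x)
    ybar = statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


intercept, slope = least_squares_line(x1, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x1, y)]
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# prints: intercept = 33.16, slope = 1.48
```

The residuals from such a fit are the starting point for the robust fits and test statistics discussed here.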

We have two S-Plus functions, bsimptests and Hsimptests, that compute two Wald test statistics and fsh (the Schrader and Hettmansperger F statistic; see B&D, eq (5.6)) following an M-regression fit.
They both require that x be normalized.

The calling sequences are similar:

   Hsimptests (bhat, resful, resred, sig, kbend=1.345)

   bsimptests (bhat, resful, resred, sig, tunec=4.685)

The returned value is the same in both functions:

   return (zstat, zstata, fsh)

Normally, sig is computed as 1.483*median(abs(resful)).
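The scale estimate and the two tuning constants can be illustrated with a short Python sketch. The factor 1.483 makes the median absolute residual a consistent estimate of the standard deviation under normal errors. The sketch assumes that kbend is the bend point of the Huber psi function and tunec the cutoff of the bisquare psi function, and that both are applied to residuals divided by sig. It does not reproduce Hsimptests or bsimptests; the function names and the residual values are hypothetical.

```python
import statistics


def mad_scale(residuals):
    """Scale estimate 1.483 * median(|r_i|), as in the text."""
    return 1.483 * statistics.median(abs(r) for r in residuals)


def huber_psi(u, kbend=1.345):
    """Huber psi: the identity near zero, clipped at +/- kbend."""
    return max(-kbend, min(kbend, u))


def bisquare_psi(u, tunec=4.685):
    """Bisquare psi: redescends smoothly to 0 at +/- tunec and stays 0 beyond."""
    if abs(u) > tunec:
        return 0.0
    return u * (1.0 - (u / tunec) ** 2) ** 2


# Hypothetical residuals from a full-model fit, with one gross outlier.
resful = [-1.2, 0.4, 0.1, -0.3, 9.0, 0.8]
sig = mad_scale(resful)

for r in resful:
    u = r / sig  # scaled residual
    print(f"r = {r:5.1f}  huber = {huber_psi(u):7.3f}  bisquare = {bisquare_psi(u):7.3f}")
```

For the outlying residual, the Huber function caps its contribution at kbend while the bisquare function gives it no weight at all, which is the main practical difference between the two fits.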

Yaru Li wrote a program implementing a preliminary design for a Monte Carlo study.

On March 30, we wrote a driver, driver.s, for a Monte Carlo study that does 4 different fits (ls, Huber MAD, Huber proposal 2, and bisquare) and, for the latter 3, computes 3 statistics. For 2 of the statistics, we use both t and normal distributions for the test. Thus, there are 16 different tests (see the comments in d1sim.s).

Some more programs:
FMapp computes FM of B&D, eq (5.6), for any number of independent variables.
Here's a sample program to use FMapp.
FMfittest.s (fits model and tests)
A little program for a Monte Carlo experiment for the FM test and the Wald test.

Wilcox has downloadable software associated with his book.
Venables and Ripley have downloadable software associated with their book.