CSI 991
Seminar in Computational Statistics:
Robust Methods in Statistics
Spring, 2001
Fridays 3:00pm -- 5:00pm, Science & Technology, Room 101C
csutton@gmu.edu
jgentle@gmu.edu
Follow-up seminar is
Wilcox's new book
Allen has made some comments on Chapters 3 and 4.
We plan
to pull together information from many sources and do
something of a comparative study of various methods.
We hope that by collectively discussing the
various procedures, we'll become really comfortable with using them and
learn about the strengths and weaknesses of each one. We can also
compare and contrast robust methods with classical (least squares,
likelihood-based, etc.), nonparametric, and computationally
intensive (e.g., bootstrap) methods.
One approach will be to focus on a single problem at a time
(e.g., simple regression, two-sample tests, estimation of the mean), and
explore many different methods that can be used to address the problem.
We plan to develop S-PLUS
programs to perform Monte Carlo studies with which to compare the
various methods.
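A Monte Carlo comparison of this kind can be sketched in a few lines. The example below is written in Python rather than S-PLUS so that it can be run as-is. The contaminated-normal model (10% of the observations drawn with standard deviation 10), the sample size, the number of replications, and all function names are illustrative assumptions, not the design used in the seminar.

```python
import random
import statistics

def trimmed_mean(xs, prop=0.2):
    """Mean after dropping the lowest and highest `prop` fraction of values."""
    s = sorted(xs)
    k = int(prop * len(s))
    core = s[k:len(s) - k]
    return sum(core) / len(core)

def contaminated_normal(n, rng, eps=0.1, wide_sd=10.0):
    """Draw n values: N(0,1) with probability 1-eps, N(0, wide_sd^2) otherwise."""
    return [rng.gauss(0.0, wide_sd if rng.random() < eps else 1.0)
            for _ in range(n)]

def mc_mse(estimators, n=20, reps=2000, seed=1):
    """Monte Carlo mean squared error of each estimator; the true location is 0."""
    rng = random.Random(seed)
    sq_err = {name: 0.0 for name in estimators}
    for _ in range(reps):
        sample = contaminated_normal(n, rng)
        for name, est in estimators.items():
            sq_err[name] += est(sample) ** 2
    return {name: total / reps for name, total in sq_err.items()}

estimators = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "trimmed mean (20%)": trimmed_mean,
}
mse = mc_mse(estimators)
for name, value in mse.items():
    print(f"{name:20s} MSE = {value:.4f}")
```

Every estimator sees the same simulated samples, so differences in the reported mean squared errors reflect the estimators rather than sampling noise between runs.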
General References
- Birkes, David, and Yadolah Dodge (1993),
Alternative Methods of Regression,
John Wiley & Sons, New York.
- Gnanadesikan, R. (1997),
Methods for Statistical Data Analysis of
Multivariate Observations,
second edition, John Wiley & Sons, New York.
- Rousseeuw, Peter J., and Annick M. Leroy (1987),
Robust Regression and Outlier Detection,
John Wiley & Sons, New York.
- Wilcox, Rand R. (1997),
Introduction to Robust Estimation and Hypothesis Testing,
Academic Press, San Diego.
Topics
1. Introduction
   - Fitting Models to Data
   - Maximum Likelihood
   - Minimizing Residuals
   - Fitting Quantiles
   - Fitting Moments
   - Model Selection
   - Statistical Inference
   - Parametric Estimates and Tests
   - More General Inference about the Model
   - The Central Role of the Normal Distribution
   - Effects of Contamination and Heavy Tails
   - More Realistic Data Models
   - Examples
   - Influence
   - Leverage
2. Robust Methods
   - Measuring Robustness
   - Breakdown
   - Efficiency
   - Minimizing Residuals
   - ${\rm L}_p$ Estimators
   - $M$ Estimators
   - Trimming Residuals
   - Fitting Quantiles
   - Structure in Data
   - Controlling Both Influence and Leverage
   - Minimum Volume Ellipsoids
   - Transformations
   - Comparisons and Summary of Methods for Fitting
3. Inference Using Robust Fits
   - Methods Based on Approximations of Normality
   - Monte Carlo Methods
   - Bootstrap Methods
   - Tests and Confidence Intervals in the One-Sample Case
   - Inference in the Two-Sample Case
   - Inference for Linear Regression Models
   - Inference for Nonlinear Regression
   - Comparisons and Summary
4. Assessing Fitted Models
   - Inspection of Residuals
   - General Measures of Departures from Normality
   - Graphical Methods
   - Identification of Outliers
   - Tail Weight
   - Cross Validation
5. Data Exploration and Adaptive Robust Methods
   - Model Selection
   - Variable Selection
   - Sensitivity Analysis
   - Cross Validation
   - Graphical Methods
   - Projections and Other Transformations
   - Two-Stage Methods
We do not plan to cover the topics in the order above. Some of the
topics at the beginning of the outline are listed to remind us of the
basics. We will assume some knowledge of those topics, and we will
skip most of them on the first pass,
but we may return to them in subsequent discussions.
We plan to focus on a single problem at a time
(e.g., simple regression, two-sample tests, estimation of the mean), and
explore many different methods that can be used to address the problem.
We will develop S-Plus
programs to perform Monte Carlo studies with which to compare the
various methods.
Schedule
- January 26
Accurate tests about the slope parameter in simple linear regression
using M-estimators.
(This is Section 3.6 of the outline above, "Inference for Linear
Regression Models".)
  - the rho and psi functions
  - robust regression using M-estimators
  - tests about the slope parameter
  - test for reduced models
    (See Chapter 5 of the book by Birkes and Dodge, in particular the
    approximate F statistic of equation (5.6).)
- February 2
Dallas pointed out that Birkes and Dodge state that the scale estimate
from the full model should be used in fitting the reduced model. To
accommodate this, we wrote an alternate version of rreg, called
rrega (see below),
that has a keyword argument fix.scale, which, if set to a positive
value, defines a scale to be used throughout the fit (rather than a
value that is proportional to the median of the absolute residuals at
each step).
To get the fit for the full model using the M estimator that Birkes
and Dodge recommend, use
    rrful <- rrega(---[full model]---,
                   method=function(u) wt.huber(u,c=1.5))
Then to get the fit for the reduced model, use
    sig <- 1.483*median(abs(rrful$residuals))
    rrared <- rrega(---[reduced model]---,
                    method=function(u) wt.huber(u,c=1.5),
                    fix.scale=sig)
A function
FMapp
for equation (5.6) of Birkes and Dodge is available.
The user must first do the full and the reduced fits and obtain
the respective residual vectors.
These fits can be obtained using the function rrega.
- February 9
The scale to use in the reduced model.
The null distribution of FM.
We wrote a program to use FMapp.
Yaru Li wrote two programs:
FMnew.s and Wtest.s (Wald test).
- February 16
Continue work on the distribution of FM.
Discuss the Wald test. Program it for the simple case (Wsimtest.s).
- February 23
Set up and run some small Monte Carlo experiments with the simple
linear case. (testssim.s)
Begin checking and using programs from Venables and Ripley.
- March 2
Continue Monte Carlo experiments for simple case.
Develop and program the Wald test for the general case.
- March 8 (Thursday)
Review distribution of FM.
Dallas generated 1000 realizations of a null model with 100
observations and normal errors, and did a q-q plot.
Develop plans for Monte Carlo study.
- March 16
Wrote Hsimptests and bsimptests for Wald tests and FM test
following a Huber fit or a bisquare fit.
(See programs below.)
- March 23
Dallas produced a modified version of rlm, called rlmx (see link below).
- March 30
Modified Hsimptests and bsimptests (the links below are to the new
versions).
Fixed a driver program (driver.s) for a MC study.
The MC study currently uses rlmx.
- April 6
- April 13
- April 20
- April 27
- May 18
Identified and fixed a problem in our usage of rlmx. Unless the
default tuning parameters are used, they must be passed to rlmx
as keyword arguments. (This probably means we could have used rlm
all along, and did not need rlmx.)
The full program for the Monte Carlo study is
fullmonte.s
(Note the third argument in the calls to rlmx for the Huber psi
functions.)
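The fixed-scale idea from the February 2 entry can be illustrated with a small iteratively reweighted least squares (IRLS) fit. This is a minimal sketch in Python for the simple linear case only. It is not the code of rrega or rlmx, and the names huber_line, wls_line, mad_scale, and fix_scale are hypothetical. The tuning constant 1.5 and the factor 1.483 are taken from the snippets in that entry. The data are made up for illustration.

```python
import statistics

def huber_weight(u, c=1.5):
    """Huber weight psi(u)/u: 1 inside [-c, c], c/|u| outside."""
    a = abs(u)
    return 1.0 if a <= c else c / a

def mad_scale(residuals):
    """Scale estimate: 1.483 * median(|residual|)."""
    return 1.483 * statistics.median([abs(r) for r in residuals])

def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def huber_line(x, y, c=1.5, fix_scale=None, max_iter=50, tol=1e-8):
    """IRLS Huber fit. A positive fix_scale is used at every step;
    otherwise the scale is re-estimated from the current residuals."""
    a, b = wls_line(x, y, [1.0] * len(x))      # least squares start
    for _ in range(max_iter):
        res = [yi - a - b * xi for xi, yi in zip(x, y)]
        s = fix_scale if fix_scale else mad_scale(res)
        if s <= 0:                             # exact fit: nothing to downweight
            break
        w = [huber_weight(r / s, c) for r in res]
        a_new, b_new = wls_line(x, y, w)
        done = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if done:
            break
    res = [yi - a - b * xi for xi, yi in zip(x, y)]
    return a, b, res

# Made-up data: the line 1 + 2x with small noise and one gross outlier.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[9] += 50.0

a_ls, b_ls = wls_line(x, y, [1.0] * len(x))         # ordinary least squares
a_full, b_full, res_full = huber_line(x, y)         # scale re-estimated each step
sig = mad_scale(res_full)                           # scale from the first fit
a_fix, b_fix, _ = huber_line(x, y, fix_scale=sig)   # scale held fixed
print(b_ls, b_full, b_fix)
```

In the reduced-model setting described above, the second call would fit the reduced model while keeping the scale taken from the full-model residuals. Here it refits the same line only to show the mechanics.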
Software and Datasets
A little test dataset from Birkes and Dodge is
 y  x1  x2
37   4  22
40   6  24
48   6  18
44   9  20
50  11  15
51  12   9
Data in B&D Table 5.3
(courtesy of Arndt)
The main data-fitting routines are
rrega
(alternate version of the standard rreg in S-Plus)
and
rlmx
(alternative version of rlm from
the MASS library of Venables and Ripley).
These routines can be used to fit the model using a variety of
methods. We are interested in tests of hypotheses using test
statistics obtained by these fits.
We will concentrate first on the simple linear regression case.
We have two S-Plus functions,
bsimptests
and
Hsimptests
that will compute two Wald test statistics
and fsh (the Schrader and Hettmansperger F statistic, see B&D, eq (5.6))
following an M-regression fit.
They both require that x be normalized.
The calling sequences are similar:
Hsimptests (bhat, resful, resred, sig, kbend=1.345)
bsimptests (bhat, resful, resred, sig, tunec=4.685)
The returned value is the same in both functions:
return (zstat, zstata, fsh)
Normally, sig is computed as 1.483*median(abs(resful)).
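The factor 1.483 makes the median absolute residual estimate the standard deviation when the errors are normal: it is the reciprocal of the 0.75 quantile of the standard normal distribution, about 1.4826. The short Python check below is only an illustration. The function name mad_sigma and the simulated residuals are assumptions, not part of the seminar's programs.

```python
import random
import statistics

# Reciprocal of the 0.75 quantile of the standard normal distribution.
consistency = 1.0 / statistics.NormalDist().inv_cdf(0.75)

def mad_sigma(residuals, k=1.483):
    """k * median(|residual|), the scale estimate used on this page."""
    return k * statistics.median([abs(r) for r in residuals])

# Simulated residuals with true standard deviation 2.
rng = random.Random(0)
res = [rng.gauss(0.0, 2.0) for _ in range(5000)]

print(round(consistency, 4))   # 1.4826
print(mad_sigma(res))          # close to 2 for normal residuals
```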
Yaru Li wrote
a program
implementing a preliminary design for a Monte Carlo study.
On March 30, we wrote a driver,
driver.s
for a Monte Carlo study that does 4 different fits (least squares, Huber
with MAD scale, Huber proposal 2, and bisquare), and for the latter 3
computes 3 statistics. For 2 of the statistics, we use both t and normal
distributions for the test, which gives 5 tests for each of the 3 robust
fits. Together with, presumably, one test following the least squares
fit, there are 16 different tests (see the comments in d1sim.s).
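The core loop of such a driver can be sketched as follows. This is a skeleton in Python, not a translation of driver.s. The ordinary least squares t statistic stands in for the robust statistics, and the design points, the seed, and the function names are assumptions. The 1000 replications of a null model with 100 observations and normal errors echo the March 8 entry.

```python
import math
import random

def slope_t_stat(x, y):
    """Least squares t statistic for H0: slope = 0 in y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return b / se

def empirical_size(stat, crit, n=100, reps=1000, seed=2001):
    """Fraction of null-model samples in which |stat| exceeds crit."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]      # fixed design on [0, 1]
    rejections = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]   # null model: slope = 0
        if abs(stat(x, y)) > crit:
            rejections += 1
    return rejections / reps

# Two-sided test at nominal level 0.05 using the normal critical value.
size = empirical_size(slope_t_stat, crit=1.96)
print(size)
```

Swapping in a different statistic or critical value changes only the arguments to empirical_size, which is one way to organize many tests over the same simulated data.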
Some more programs:
FMapp
computes FM of B&D, eq (5.6), for any number of independent variables.
Here's
a sample program
to use FMapp.
FMfittest.s
(fits model and tests)
A little program
for a Monte Carlo experiment for the FM test and the Wald test.
Wilcox has
downloadable software
associated with his book.
Venables and Ripley have
downloadable software
associated with their book.