This document was written in LaTeX, and there has been no serious attempt to translate it to html. Maybe when MathML really arrives I'll translate it.
The computationally intensive methods of modern statistics rely heavily on the developments in statistical computing and numerical analysis generally. This book discusses methods of computational statistics; for statistical computing, the reader is referred to a book such as Lange (1999) or the older book by Kennedy and Gentle (1980).
Computational statistics shares two hallmarks with other ``computational'' sciences, such as computational physics, computational biology, and so on. One is a characteristic of the methodology: it is computationally intensive. The other is the nature of the tools of discovery. Tools of the scientific method have generally been logical deduction (theory) and observation (experimentation). The computer, used to explore large numbers of scenarios, constitutes a new type of tool. Use of the computer to simulate alternatives and present the research worker with information about these alternatives is a characteristic of the computational sciences. In some ways, this usage is akin to experimentation. The observations, however, are generated from an assumed model, and those simulated data are used to evaluate and study the model.
Advances in computing hardware and software have changed the nature of the daily work of statisticians. Data analysts and applied statisticians rely on computers for storage of data, analysis of the data, and production of reports describing the analysis. Mathematical statisticians (and even probabilists) use the computer for symbolic manipulations, evaluation of expressions, ad hoc simulations, and production of research reports and papers. Some of the effects on statisticians have been subtle, such as the change from the use of ``critical values'' of test statistics to the use of ``p-values'', whereas others have been more fundamental, such as the use of multivariate and/or nonlinear models instead of univariate linear models, which might formerly have been used as approximations because they were computationally tractable. More recently, computational inference using Monte Carlo methods has been replacing asymptotic approximations. Another major effect that developments in computing have had on the practice of statistics is that many Bayesian methods that were formerly impractical have entered the mainstream of statistical applications.
The ease of computations has given the statistician a new attitude about the nature of statistical research. Experimentation has been put in the toolbox of the mathematical statistician. Ideas can be explored via ``quick and dirty'' computations. Ideas that appear promising after an initial evaluation can be pursued more rigorously.
Larger scale computing systems have also given the statistician a new attitude about the nature of discovery. Science has always moved ahead by finding something that was not being sought. Exploratory methods can be applied to very large datasets. Data mining of massive datasets has enabled statisticians to increase the rate of finding things that are not being sought.
In computational statistics, computation is viewed as an instrument of discovery; the role of the computer is not just to store data, perform computations, and produce graphs and tables, but additionally to suggest to the scientist alternative models and theories. Many alternative graphical displays of a given dataset are usually integral features of computational statistics. Another characteristic of computational statistics is the computational intensity of the methods; even for datasets of medium size, high-performance computers may be required to perform the computations. Large-scale computations can replace asymptotic approximations in statistical inference.
This book describes techniques used in computational statistics, and considers some of the areas of application, such as density estimation and model building, in which computationally intensive methods are useful. The book grew out of a semester course in ``Computational Statistics'' and various courses called ``Topics in Computational Statistics'' that I have offered at George Mason University over the past several years. The book is part of a much larger tome that also covers many topics in numerical analysis; see http://www.science.gmu.edu/~jgentle/cmpstbk/
Many of the topics addressed in this book could easily be (and are) subjects for full-length books. My intent in this book is to describe these methods in a general manner and to emphasize commonalities among them. An example of a basic tool used in a variety of settings in computational statistics is the decomposition of a function so that it has a probability density as a factor. We encounter this technique in Monte Carlo methods, in function estimation, and in projection pursuit.
Most of the statistical methods and applications discussed in this book are computationally intensive, and that is why we consider them to be in the field called computational statistics. As mentioned earlier, however, the attitude with which we embark on a statistical analyses is a hallmark of computational statistics. The computations are often viewed as experiments and the computer is used as a tool of discovery.
I assume that the reader has a background in mathematical statistics at roughly the level of an advanced undergraduate- or beginning graduate-level course in the subject, and, of course, the mathematical prerequisites for such a course, which include advanced calculus, some linear algebra, and the basic notions of optimization. Except for that prerequisite, the text is essentially self-contained.
Part I addresses in a general manner the methods and techniques of computational statistics. The first chapter reviews some basic notions of statistical inference and some of the computational methods. The subject of a statistical analysis is viewed as a {\it data-generating process}. The immediate object of the analysis is a set of data that arose from the process. A wealth of standard statistical tools are available for analyzing the dataset and for making inferences about the process. Important tools in computational statistics involve simulations of the data-generating process. These simulations are used for {\it computational inference}. The standard principles of statistical inference are employed in computational inference. The difference is in the source of the data and how the data are treated.
The second chapter is about Monte Carlo simulation and some of its uses in computational inference, including Monte Carlo tests, in which artificial data are generated according to a hypothesis. Chapters 3 and 4 discuss computational inference using resampling and partitioning of a given dataset. In these methods, a given dataset is used, but the Monte Carlo sampling is employed repeatedly on the data. These methods include randomization tests, jackknife techniques, and bootstrap methods, in which data are generated from the empirical distribution of a given sample, that is, the sample is resampled.
Chapter 5 discusses methods of projecting higher-dimensional data into lower dimensions; Chapter 6 covers some of the general issues in function estimation; and Chapter 7 presents a brief overview of some graphical methods, especially those concerned with multidimensional data. The more complicated the structure of the data and the higher the dimension, the more ingenuity is required for visualization of the data; it is, however, in just those situations that graphics is most important. The orientation of the discussion on graphics is that of computational statistics; the emphasis is on discovery; and the important issues that should be considered in making presentation graphics are not addressed. The tools discussed in Chapter 5 will also be used for clustering and classification, and, in general, for exploring structure in data.
Identification of interesting features, or ``structure'', in data is an important activity in computational statistics. In Part II, I consider the problem of identification of structure and the general problem of estimation of probability densities. In simple cases, or as approximations in more realistic situations, structure may be described in terms of functional relationships among the variables in a dataset.
The most useful and complete description of a random data generating process is the associated probability density, if it exists. Estimation of this special type of function is the topic of Chapters 8 and 9, building on general methods discussed in earlier chapters, especially Chapter 6. If the data follow a parametric distribution, or rather, if we are willing to assume that the data follow a parametric distribution, identification of the probability density is accomplished by estimation of the parameters. Nonparametric density estimation is considered in Chapter 9.
Features of interest in data include clusters of observations and relationships among variables that allow a reduction in the dimension of the data. I discuss methods for identification of structure in Chapter 10, building on some of the basic measures introduced in Chapter 5.
Higher-dimensional data have some surprising and counterintuitive properties, and I discuss some of the interesting characteristics of higher dimensions.
In Chapter 11, I discuss asymetric relationships among variables. For such problems, the objective often is to estimate or predict a response for a given set of explanatory or predictive variables, or to identify the class to which an observation belongs. The approach is to use a given dataset to develop a model or a set of rules that can be applied to new data. Statistical modeling may be computationally intensive because of the number of possible forms considered or because of the recursive partitioning of the data used in selecting a model. In computational statistics, the emphasis is on {\it building} a model rather than just estimating the parameters in the model. Parametric estimation of course plays an important role in building models.
People in various disciplines have contributed to the development of the clustering and classification methods discussed in Chapters 10 and 11. Different terminology is used in different disciplines. Some people, especially in the field that was once called artificial intelligence, attempt to identify some methods---usually only the simpler ones---as ``statistical'', and other methods as something else, including ``machine learning''. I do not understand these distinctions. I take the view that any method of analyzing data is a statistical method. The major objective of statistics is to development knowledge (and maybe wisdom) from data.
As in Chapters 8 and 9, a simple model may be a probability distribution for some variable of interest. If, in addition, the relationship among variables is of interest, a model may contain a systematic component that expresses that relationship approximately and a random component that attempts to account for deviations from the relationship expressed by the systematic component.
I often take the view that a model describes a generation mechanism for data. A better understanding of a model can be assessed by taking this view: use the model to simulate artificial data, and examine the artificial data for conformity to our expectations or to some available real data. In the text and in the exercises of this chapter, I often use a model to generate data. The data are then analyzed using the model. This process, which is characteristic of computational statistics, helps to evaluate the {\it method} of the analysis. It helps us understand the role of the individual components of the model: its functional form, the parameters, and the nature of the stochastic component.
Monte Carlo methods are widely used in the research literature to evaluate properties of statistical methods. Appendix A addresses some of the considerations that apply to this kind of study. It is emphasized that the study uses an {\it experiment}, and the principles of scientific experimentation should be observed. Appendix B describes some of the software and programming issues that may be relevant for conducting a Monte Carlo study. Some parts of these appendices are revised and updated versions of material that originally appeared in Gentle~(1998a).
After this summary of what is in the book, I feel compelled to mention some things that {\it are not} in the book---but which are relevant to computational statistics. I realize that in many places throughout the book, I have skimped on details. When I teach the material, I find myself providing details, or else, preferably, having students work out details. Some important topics such as FFTs and wavelets are only mentioned in this book. Several other topics, perhaps most notably the bootstrap, classification methods, and model-building, are discussed only in an introductory manner. A full treatment of any of these topics would require by itself a longer book than this one. My goal has been to introduce a number of topics and devote an appropriate proportion of pages to each. I have given a number of references for more in-depth study of most of the topics. For most of these topics, I have more extensive class notes, but I felt that their inclusion would result in an unwieldy book.
The exercises contain an important part of the information that is to be conveyed. Many exercises require use of the computer, in some cases to perform routine calculations and in other cases to conduct experiments on simulated data. The exercises range from the trivial or merely mechanical to the very challenging. I have not attempted to indicate which is which. Some of the Monte Carlo studies suggested in the exercises could be the bases for research publications.
When I teach this material, I use more examples, and more extensive examples, than what I have included in the text. Some of my examples form the basis for some of the exercises; but it is important to include discussion of them in the class lectures. Additional examples, datasets, and programs are available through links from the web page for this book.
The text covers more material than can reasonably be included in a one-semester course. A reasonable approach, however, is just to begin at the beginning and proceed sequentially through the book. For students with more background in statistics, Chapter 1 can be skipped. The book can serve as text for two courses if more emphasis is placed on the student projects.
In most classes I teach in computational statistics, I give Exercise A.4 in Appendix A. It is to replicate and extend a Monte Carlo study reported in some recent journal article. Each student picks an article to use. The statistical methods studied in the article must be ones that the student understands, but that is the only requirement as to the area of statistics addressed in the article. I have varied the way in which the project is carried out, but it usually involves more than one student working together. A simple way is for each student to referee another student's first version (due midway through the term) and to provide a report for the student author to use in a revision. Each student is both an author and a referee. In another variation, I have students be coauthors. One student selects the article and designs and performs the Monte Carlo experiment, and another student writes the article, in which the main content is the description and analysis of the Monte Carlo experiment.