Notes on Models for Data
Neil W. Henry     Revised June 2000
Virginia Commonwealth University

One of the most confusing aspects of statistical methodology is the distinction between sample and population. The distinction is supposed to be a clear one, and usually appears that way in Chapter 1 of many, if not all, textbooks. According to the naive presentation, the population is a collection of objects and the sample is the subset of the population about which we have information. Statistical inference consists of scientific methods of generalizing from the sample to the population.

This perspective is, in fact, the classic approach applied in sample surveys of finite human populations. It breaks down, however, in many situations. For example, we may have complete information on all freshmen who enrolled at VCU last year: what then is the proper interpretation of a P-value for a t-statistic that contrasts the male and female freshmen's mean SAT scores? Alternatively, are methods of statistical inference appropriate to evaluate the effect of gender on faculty salaries at VCU? Some scientists say no, since we have data on every person in the population. A long-running controversy over the applicability of inferential statistics occurred in sociology during the 1960s, largely over the proper way to deal with such situations (see Denton Morrison, The Significance Test Controversy).

A totally different perspective is needed when discussing experimental data. The 20 subjects in a prototypical comparative experiment are neither a population nor a sample in the "chapter 1" sense. Instead the measurements of the response variable become the sample and the population is an imagined collection of all the possible measurements that might have occurred. A very attractive approach to interpreting what is going on in experimental data analysis has been recently developed by Donald Rubin, drawing on ideas of Jerzy Neyman, one of the founders of modern mathematical statistics. Unfortunately for you, I am not prepared to discuss that approach directly, although it has influenced what follows.

Any attempt to resolve the controversy involves invocation of the idea of explanatory or predictive models, and the adoption of probabilistic thinking in our explanations. In finite population theory the values of the variables being measured are conceived of as fixed (but unknown until the research is carried out) characteristics of the members of the population. An average value in the sample is simply the aggregation of these fixed values into a single number. A model-based approach takes a more constructivist view, laying down rules and theories of how the values of the measured variables were created. In many ways this approach is related to the early 19th-century view of statistics as the science of measurement error.

Take, for example, the problem of determining the weight of a specific object, using a well-defined instrument (i.e., a particular scale). If the instrument is good enough, different observations, carried out independently of one another, will result in different values. A measurement theory might state that

• MEASUREMENT = TRUTH + ERROR.
The modern statistician's model would specify that
• Y = T + e
where T is a fixed, unknown number and e is a random variable with mean A and fixed, unknown standard deviation. A is called the systematic bias of the measurement process. Independence would be assumed for the different measurement occasions. The model might also specify that e have a normal distribution, but even without that assumption the simple equation describes the measurement process adequately. Under these circumstances the sample mean Ybar of several observations on the same object (that is, with a constant value for T, and independent values of e) could be used to estimate the true value of T. If A is zero we call the measurement process unbiased.
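The measurement model can be made concrete with a small simulation. Here the true weight T, the bias A, and the error spread are invented values chosen purely for illustration (nothing in the notes specifies them); the point is only that the sample mean Ybar of repeated independent measurements settles near T + A.

```python
import random

# Hypothetical values for illustration only; T and A are unknown in practice.
T = 72.0      # "true weight" of the object
A = 0.0       # systematic bias of the measurement process (unbiased here)
SIGMA = 0.5   # standard deviation of the random error e

random.seed(1)

# Ten independent measurements of the same object: Y = T + e,
# with e drawn from a normal distribution with mean A and sd SIGMA.
measurements = [T + random.gauss(A, SIGMA) for _ in range(10)]

# The sample mean Ybar estimates T + A; if A = 0, it estimates T itself.
ybar = sum(measurements) / len(measurements)
print(ybar)
```

Rerunning with a different seed changes Ybar slightly from trial to trial, which is exactly the phenomenon the model is built to explain: repeated observations of the same object do not always agree.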

The statistician calls T a population parameter: she interprets it to be the mean of a hypothetical population consisting of an infinitude of unbiased measurements. That is, indeed, the way I often speak in order to preserve the naive ideas of sample and population. But we might prefer to refer to T simply as the "true weight of the object", and say that the value of an observation is "caused" by the conjunction of the true weight and the "random error". This is an example of model-building to explain a phenomenon, in this case the fact that repeated observations do not always agree. Note that within a constructivist philosophy of science "true weight" is a concept that cannot exist except in the context of a particular measurement system.

Now let's consider the SAT scores of VCU freshmen and the comparison of male and female mean scores. The conventional paradigm insists that

• DATA = MODEL + RESIDUAL
If a datum is a student's SAT score, the model is that part of the score that is attributable to the student's membership in a gender group. While the mathematical equation corresponding to this statement will be identical to the one shown previously, it will not be necessary to think of T as standing for a "true" score in the ordinary sense. But I will continue to think of the observed measurement as being constructed as follows. The student is granted a value of T by virtue of his or her gender, and then an amount is added or subtracted based on the student's other personal qualities, on characteristics of the test instrument, or on some more complicated interaction of these two sets of factors. The residual "e" is considered to be random because of the unpredictability of its value for an otherwise nameless freshman, not because the freshman actually dips into a table of random numbers to achieve his or her SAT score.

Our 1500-odd VCU freshmen's SAT scores constitute a sample, from this perspective, even though the freshmen are a finite bunch of people whose scores have already been recorded. The statistical model in this case might be written in mathematical notation in several ways. For instance,

• Ymale = Tmale + e
• Yfemale = Tfemale + e
are two equations that emphasize the two T parameters of the groups; while the alternative formulation:
• Y = Alpha + Beta*X + e
emphasizes the difference between the groups. Here X is an indicator variable (or "dummy" variable) that indicates femaleness (value X = 1 for females, X = 0 for males). Beta will equal the difference between the female and male values of T, i.e. between the means of an infinite collection of measurements.
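The claim that Beta equals the difference between the female and male values of T can be checked numerically. The scores below are made-up SAT-style numbers, not real VCU data; fitting Y = Alpha + Beta*X by ordinary least squares with a 0/1 indicator reproduces the difference of the two group means exactly.

```python
# Hypothetical scores for illustration; not real VCU freshman data.
male_scores = [480, 520, 510, 495, 505]
female_scores = [500, 530, 515, 525, 510]

y = male_scores + female_scores
x = [0] * len(male_scores) + [1] * len(female_scores)  # X = 1 indicates female

n = len(y)
xbar = sum(x) / n
ybar = sum(y) / n

# Ordinary least-squares slope and intercept for Y = Alpha + Beta*X + e.
beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        / sum((xi - xbar) ** 2 for xi in x))
alpha = ybar - beta * xbar

# With a 0/1 indicator, Beta is exactly the female mean minus the male mean,
# and Alpha is the mean of the X = 0 (male) group.
mean_diff = (sum(female_scores) / len(female_scores)
             - sum(male_scores) / len(male_scores))
print(alpha, beta, mean_diff)
```

The two formulations of the model are therefore interchangeable: the two-equation version reports Tmale and Tfemale directly, while the indicator version reports Tmale (as Alpha) and the contrast Tfemale - Tmale (as Beta).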

Regression models extend this approach by taking more details of the individual case into account. For example, the mean height of children could be modeled as a function of their age. The latter is a numerical variable rather than an indicator of group membership, and the specified model is a straight line. In terminology used by Agresti and Finlay, the expected value of height is a linear function of age:

• E(Height) = Alpha + Beta (Age), or
• DATA = PREDICTION + RESIDUAL.
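The height-on-age model and the DATA = PREDICTION + RESIDUAL decomposition can be sketched with a few invented (age, height) pairs; the numbers are illustrative, not data from any study. Each observed height splits into a fitted value Alpha + Beta*Age plus a residual, and the least-squares residuals sum to zero.

```python
# Made-up (age, height) pairs for children; purely illustrative numbers.
ages = [2, 4, 6, 8, 10]             # years
heights = [86, 102, 115, 128, 139]  # centimeters

n = len(ages)
abar = sum(ages) / n
hbar = sum(heights) / n

# Least-squares fit of E(Height) = Alpha + Beta*(Age).
beta = (sum((a - abar) * (h - hbar) for a, h in zip(ages, heights))
        / sum((a - abar) ** 2 for a in ages))
alpha = hbar - beta * abar

# DATA = PREDICTION + RESIDUAL, computed for each child.
predictions = [alpha + beta * a for a in ages]
residuals = [h - p for h, p in zip(heights, predictions)]
print(alpha, beta, sum(residuals))
```

Nothing in this calculation asserts that age causes height; the fitted line is simply the best linear prediction for these numbers, which is why the interpretive caution that follows matters.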
There are no hard and fast rules about how regression results should be interpreted. Sometimes we speak of using X to predict Y, sometimes say that X explains Y, and at other times try to present a more neutral description. We always need to be cautious, however, in attributing a correlation between two variables to an underlying causal process. The model may represent our interpretation of how data are created, and in this sense may reflect our belief in the existence of a causal relationship. Most of the statistics, the numbers we calculate, do not rely directly on such an interpretation, however. That is why the warning "correlation is not causation" is important to remember.

Posted by Neil W. Henry, March 21, 2001 for Statistics 608 Students
If you have comments or questions, email me at nhenry@vcu.edu