R-square and Standardization in Regression

Neil W. Henry, March 2001

Adjusted R-square

As a summary of some topics that may have been overlooked in class, here are a few interesting facts about R-square related concepts.

R-squared, often called the coefficient of determination, is defined as the ratio of the sum of squares explained by a regression model to the "total" sum of squares around the mean:

R²= 1 - SSE / SST

in the usual ANOVA notation. Most people refer to it as the proportion of variation explained by the model, but sometimes it is called the proportion of variance explained. This is misleading because SST is not the variance of Y. In sample terminology, variances are "mean squares." Thus the estimated variance of Y is MST = SST/(n-1) and the estimated residual or error variance is MSE = SSE/(n-p-1), where p is the number of predictors in the regression equation. We "average" by dividing by degrees of freedom rather than by n in order to make the sample mean squares unbiased estimates of the population variances.

Regression analysis programs also calculate an "adjusted" R-square. The best way to define this quantity is:

R²_adj = 1 - MSE / MST

since this emphasizes its natural relationship to the coefficient of determination.
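A minimal Python sketch of the two definitions may help. The sums of squares, n, and p below are made-up numbers chosen for easy arithmetic, not results from any data set, and the function names are my own.

```python
def r_squared(sse, sst):
    """Coefficient of determination: 1 - SSE/SST."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, p):
    """Adjusted R-squared: 1 - MSE/MST, with mean squares on their degrees of freedom."""
    mse = sse / (n - p - 1)   # estimated residual variance
    mst = sst / (n - 1)       # estimated variance of Y
    return 1 - mse / mst


# Hypothetical model with p = 4 predictors fitted to n = 25 cases.
r2_small = r_squared(40.0, 100.0)
adj_small = adjusted_r_squared(40.0, 100.0, n=25, p=4)

# Hypothetical fifth predictor that lowers SSE only slightly (40.0 -> 39.5).
r2_big = r_squared(39.5, 100.0)
adj_big = adjusted_r_squared(39.5, 100.0, n=25, p=5)

print(r2_small, adj_small)
print(r2_big, adj_big)
```

In this made-up case the extra predictor raises R-squared but lowers the adjusted R-squared, because the small drop in SSE does not pay for the lost degree of freedom.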

While R-squared will never increase when a predictor is dropped from a regression equation, the adjusted R-squared may be larger. Specifically, if the t-ratio for a predictor is less than one, dropping that predictor from the model will increase the adjusted R-squared. Sometimes you will come across an article in which the researcher keeps everything with a t bigger than 1 in the model. The motivation for doing that is to get as large an adjusted R-squared as possible. Note that the one-sided P-value for t = 1 is .16 in large samples, quite large compared to the conventional hypothesis testing standards of .05 or .01.
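The t-ratio rule can be checked with a few lines of algebra. This is a sketch that assumes the standard identity that the squared t-ratio of a single predictor equals the F statistic for dropping it. The subscript (-i), marking the model with predictor i removed, is notation introduced here for illustration.

```latex
\begin{align*}
t_i^2 &= \frac{SSE_{(-i)} - SSE}{MSE}
  \quad\Longrightarrow\quad SSE_{(-i)} = SSE + t_i^2\,MSE, \\
MSE_{(-i)} &= \frac{SSE_{(-i)}}{n-p}
  = \frac{(n-p-1)\,MSE + t_i^2\,MSE}{n-p}
  = MSE\cdot\frac{(n-p-1) + t_i^2}{n-p}.
\end{align*}
```

MST does not change when a predictor is dropped, so the adjusted R-squared, 1 - MSE/MST, rises exactly when the residual mean square falls. The last line shows that this happens when the squared t-ratio is less than 1.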

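Here is the traditional formula for expressing the adjusted R-squared in terms of the ordinary R-squared. It follows from substituting MSE = SSE/(n-p-1) and MST = SST/(n-1) into the definition given earlier. It shows explicitly the "adjustment" process, and also demonstrates that the adjusted R-squared is always smaller than R-squared (whenever p ≥ 1 and R² < 1):

R²_adj = 1 - (1 - R²) (n-1) / (n-p-1)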

Interpreting R as Correlation

In contrast to the conventions described above for regression analysis of non-experimental data, it is not standard practice to report the percentage of variance explained in a designed experiment. R-squared can easily be calculated from any ANOVA table, of course:

R-squared = SS(Between Groups)/SS(Total)

The Greek symbol "Eta-squared" is sometimes used to denote this quantity. Of course the calculation of the coefficients is identical despite the different terminology, as is obvious when the definition is written in terms of the error or residual sum of squares:

R-squared = 1 - SS(Error)/SS(Total)

Note that Eta is reported if you use the Means procedure in SPSS, but not if you use the One-way ANOVA procedure. This (in my opinion) is because the ANOVA procedure was originally written for use by experimentalists while the Means procedure was added later for the convenience of survey researchers. There is a very good reason for not using this coefficient to describe results of a designed experiment. The size of Pearson's r or Eta or multiple correlation R depends on decisions made in planning the experiment, not simply on the phenomenon being studied.

Suppose, for instance, that an experimental intervention really increases response variable Y by 10 points on the average, with a standard deviation of 2 points. Imagine a simple experiment where n subjects get the intervention and a multiple kn do not, and let n be large so I can ignore sampling error. Then it works out that the value of Eta-squared is equal to:

Eta-squared = 25k / (25k + (1 + k)²)

When the treatment and control groups are of equal size (k=1) the Eta-squared is 25/29, and this is its maximal value. If the two groups differ greatly in size, say with k = 10, Eta-squared is smaller, only 25/37.1. The phenomenon is the same, the effect of treatment in average points gained is the same, but the correlation coefficient Eta is not the same.
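The arithmetic behind these values can be sketched in Python. The sketch assumes that the 2-point standard deviation is the within-group standard deviation of Y in both groups and that n is large enough to ignore sampling error. The function name and its parameters are illustrative.

```python
def eta_squared(k, effect=10.0, sd=2.0):
    """Large-sample eta-squared with n treated subjects and k*n controls."""
    p_treated = 1.0 / (1.0 + k)                             # share of cases treated
    between = p_treated * (1.0 - p_treated) * effect ** 2   # variance of the group means
    within = sd ** 2                                        # variance inside each group
    return between / (between + within)


for k in (1, 2, 10):
    print(k, eta_squared(k))
```

The between-group variance is largest when the groups are of equal size, which is why k = 1 gives the maximal Eta-squared even though the treatment effect is the same for every k.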

This example is one in which the independent variable is dichotomous, the classic treatment-control experiment. Experiments can be done with a continuous independent variable, for instance where X is the dosage in a drug study. The experimenter may then assign cases to different X values as she sees fit. If we suppose that there is really a linear relationship between dosage X and outcome Y on the average, with random residuals that have a standard deviation σ, it would be appropriate to do a regression analysis and R-squared would be automatically calculated by the computer program. Again, however, it can be shown that the researcher's decision on what X values to use will affect the value of "the proportion of variation explained by the model." If cases are placed at extreme values (say half the subjects have very low X, half very high) the R-squared will be larger than if they are close together.
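A small Python sketch illustrates the point, under the stated assumption of a truly linear relationship with a constant residual standard deviation. The slope, the residual standard deviation, and the two sets of dosage values are hypothetical, and the calculation is the large-sample R-squared rather than an estimate from simulated data.

```python
from statistics import pvariance


def design_r_squared(x_values, slope, resid_sd):
    """Large-sample R-squared for Y = a + slope*X + e at the chosen X values."""
    explained = slope ** 2 * pvariance(x_values)   # variance of the fitted line
    return explained / (explained + resid_sd ** 2)


slope, resid_sd = 0.5, 2.0                  # hypothetical dose-response parameters

extreme = [0.0] * 10 + [20.0] * 10          # half very low, half very high doses
clustered = [8.0] * 10 + [12.0] * 10        # doses close together

r2_extreme = design_r_squared(extreme, slope, resid_sd)
r2_clustered = design_r_squared(clustered, slope, resid_sd)
print(r2_extreme, r2_clustered)
```

The slope and the residual standard deviation are identical in the two designs. Only the spread of the X values differs, and that alone changes the R-squared.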

This characteristic of the Pearson correlation was known to the ancients. Charles Spearman (the grandfather of 20th century psychometrics), in his 1904 paper on intelligence, described it as the problem of attenuation of the correlation coefficient. Attenuation arises in experiments and in observational studies when the sample is selected from restricted ranges of the independent variable X rather than strictly at random. Adjusting for attenuation is a standard topic in psychological statistics texts. Sociologists are more likely to think of their samples as "representative" of the population on all variables, and therefore pay little attention to the issue. The sample r or Multiple R will not be a good estimate of the corresponding population parameter if the sample is (deliberately or accidentally) biased.

Standardization

The 1981 reader by Peter Marsden (Linear Models in Social Research) contains some useful and readable papers, and his introductory sections deserve to be read (as an unusually perceptive book reviewer noted in the journal Social Forces in 1983). One paper in that collection that has become a standard reference is "Standardization in Causal Analysis" by Kim and Ferree. Standardization, in the social and behavioral sciences, refers to the practice of redefining regression equations in terms of standard deviation units. An ordinary ("raw") regression coefficient b is replaced by b times s(X)/s(Y), where s(Y) is the standard deviation of the dependent variable, Y, and s(X) is the standard deviation of the predictor, X. An equivalent result can be achieved by imagining that all variables in a regression have been rescaled to z-scores by subtracting their respective means and dividing by their standard deviations. This is often referred to as a change of scale or linear transformation of the data.
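The equivalence of the two routes to a standardized coefficient can be checked with a short Python sketch. The data are made up for illustration, and the closed-form two-predictor solution is used only to keep the example free of external libraries.

```python
from statistics import mean, stdev


def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 (with intercept) from centered sums of products."""
    d1 = [v - mean(x1) for v in x1]
    d2 = [v - mean(x2) for v in x2]
    dy = [v - mean(y) for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def zscores(v):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(v), stdev(v)
    return [(a - m) / s for a in v]


# Made-up data on two predictors with very different units.
x1 = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0]
x2 = [30.0, 10.0, 40.0, 20.0, 60.0, 50.0]
y = [3.0, 4.5, 6.0, 8.5, 9.0, 12.5]

# Route 1: rescale each raw coefficient b by s(X)/s(Y).
b1, b2 = two_predictor_slopes(x1, x2, y)
beta1_rescaled = b1 * stdev(x1) / stdev(y)
beta2_rescaled = b2 * stdev(x2) / stdev(y)

# Route 2: convert every variable to z-scores, then regress.
beta1_z, beta2_z = two_predictor_slopes(zscores(x1), zscores(x2), zscores(y))

print(beta1_rescaled, beta1_z)
print(beta2_rescaled, beta2_z)
```

Both routes give the same numbers up to rounding error, which is the sense in which standardization is only a change of scale.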

Changes of scale are trivial in one sense, for they do not affect the underlying reality or the degree of fit of a linear model to data. Choosing to measure distance in meters rather than feet is a matter of taste or convention, not a matter for the theoretical physicist or statistician to worry about. But since such changes affect the values of numbers, they may have an impact on a naive researcher whose goal is to evaluate "the relative importance of different explanatory variables" or "the relative importance of a given variable in two or more different populations" (Marsden, p. 15). While there are an infinite number of ways to change scales of measurement, the standardization technique is the one most often adopted by social and behavioral scientists. The standardized regression coefficients are often called "beta weights" or simply "betas" in some books and are routinely calculated and reported in SPSS.

Agresti and Finlay (p. 416) illustrate standardization in a model in which the subject's "life events" and "socio-economic status" have been used to predict "mental impairment". The respective coefficients are .103 and -.097, indicating that "there is a .1-unit increase in the estimated mean of mental impairment for every 1-unit increase in the life events score, controlling for SES" (p. 392) compared to a decrease of .097 in estimated mean mental impairment when SES increases by one point and life events are held constant. These two "effects" are hard to compare since the two predictors have entirely different units of measurement. After standardizing, the regression coefficients are .43 and -.45, respectively, and A&F conclude that the two coefficients have similar magnitudes: a "standard deviation increase in X₂, controlling for X₁" has about the same effect on mental impairment as "a standard deviation increase in X₁, controlling for X₂", but in the opposite direction.

The attenuation problem also arises in this context, unless the data being used are a simple random sample from the population. If stratified sampling has been used, or if the data are from a designed experiment, the standard deviations of the predictors may not be unbiased estimates of their population analogs. While the unstandardized regression coefficients will usually be good estimates of the population model parameters, the standardized coefficients will not be generalizable and thus are difficult to interpret.

Kim & Ferree argued forcefully that routine use of standardized coefficients to solve the problem of comparing apples and oranges is not justifiable, and that it is possible to evaluate relative importance of predictors only when some legitimate common unit of measurement is available for all predictors. Agresti and Finlay (p. 419) warn against using standardized coefficients when comparing the results of the same regression analysis on different groups. Hubert Blalock, of course, had made the same points many years before (see Chapter 8 of his 1971 reader Causal Models in the Social Sciences, which reproduces his 1967 article).

Despite these warnings, social and behavioral science applications of regression analysis in the period 1960 - 1990 were very likely to use standardized variables. My opinion is that it is only in the last decade that the tide has turned toward analysis that emphasizes measured units and de-emphasizes the goal of comparative effect evaluation.

These issues apply to single-equation regression models, but become even more involved when a multiple equation causal model is being studied. Early converts to Sewall Wright's path analysis methodology saw as their goal the decomposition of X/Y correlations into direct effects, indirect effects, and effects due to common causes. The Pearson correlations among the variables served as the raw data for such analyses and the path coefficients used in the decomposition of effects were standardized regression coefficients. Standardization was taken for granted, not considered a problematic step in the research process. (See Agresti and Finlay, Section 16.2 for an example.)

To summarize, correlations (whether r or R) can be considered as characteristics of a population as well as descriptions of a sample. Non-random samples will not necessarily provide good estimates of these correlations. Under such circumstances standardized regression coefficients, R-squares, and "path coefficients" computed from the sample data in routine ways may not be good estimates of the population phenomena the researcher is seeking to understand. The aforementioned reviewer of Marsden's reader, noting that some of the articles in the book used data from designed experiments or non-simple random samples, pointed out that:

The structural equation literature has always tended to be associated with the analysis of "available" data. Perhaps it is time to stress that the models can be more efficiently tested and estimated if data gathering were designed specifically for those purposes.

Partial Correlation

In his article on standardized coefficients, J. Bring (The American Statistician, August 1994, pp. 209-213) points out that the square of the t value for predictor i is related to the increment in R-squared due to predictor i:

t_i² = (R² - R²_(-i)) (n-p-1) / (1 - R²)

Here R²_(-i) stands for the R-squared with predictor i removed from the equation. This quantity is also the F statistic for testing whether the full model is a significant improvement on the simpler model. (See Agresti and Finlay, p. 404.)

The increment in R-squared is also related to another widely used measure, the partial correlation coefficient between Y and the i-th predictor, controlling for the other variables in the model. The neatest expression I know for the square of this partial correlation is:

(R² - R²_(-i)) / (1 - R²_(-i))

This can be interpreted as the proportion of the remaining unexplained variance that is accounted for by adding predictor i to an existing model.
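Both relationships can be checked numerically. The Python sketch below uses simulated data with hypothetical coefficients and a small hand-written least squares routine (all names are my own). It compares the squared t-ratio computed from the coefficient and its standard error with the value implied by the increment in R-squared, and compares the squared partial correlation computed from residuals with the increment divided by one minus the reduced-model R-squared.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(cols, y):
    """Regress y on the predictor columns plus an intercept.

    Returns (coefficients, residuals, X'X), intercept first.
    """
    rows = [[1.0] + [col[j] for col in cols] for j in range(len(y))]
    q = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(q)] for a in range(q)]
    xty = [sum(r[a] * yj for r, yj in zip(rows, y)) for a in range(q)]
    coef = solve(xtx, xty)
    resid = [yj - sum(c * v for c, v in zip(coef, r)) for r, yj in zip(rows, y)]
    return coef, resid, xtx


def sum_sq(v):
    return sum(a * a for a in v)


def r_squared(cols, y):
    ybar = sum(y) / len(y)
    return 1 - sum_sq(ols(cols, y)[1]) / sum_sq([a - ybar for a in y])


random.seed(1)
n, p = 40, 3
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
# Hypothetical model: strong first predictor, weak second, irrelevant third.
y = [1 + 2 * cols[0][j] + 0.3 * cols[1][j] + random.gauss(0, 1) for j in range(n)]

coef, resid, xtx = ols(cols, y)
mse = sum_sq(resid) / (n - p - 1)
r2_full = r_squared(cols, y)

checks = []
for i in range(p):
    others = cols[:i] + cols[i + 1:]
    r2_reduced = r_squared(others, y)
    # Squared t-ratio from the coefficient and its variance, MSE * [(X'X)^-1]_ii.
    unit = [1.0 if a == i + 1 else 0.0 for a in range(p + 1)]
    t_sq = coef[i + 1] ** 2 / (mse * solve(xtx, unit)[i + 1])
    t_sq_from_r2 = (r2_full - r2_reduced) * (n - p - 1) / (1 - r2_full)
    # Squared partial correlation from residuals of y and of x_i on the others.
    ey = ols(others, y)[1]
    ex = ols(others, cols[i])[1]
    partial_sq = sum(a * b for a, b in zip(ey, ex)) ** 2 / (sum_sq(ey) * sum_sq(ex))
    partial_sq_from_r2 = (r2_full - r2_reduced) / (1 - r2_reduced)
    checks.append((t_sq, t_sq_from_r2, partial_sq, partial_sq_from_r2))
    print(i, t_sq, t_sq_from_r2, partial_sq, partial_sq_from_r2)
```

Within each printed row the first two numbers agree, as do the last two, apart from floating-point rounding.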

Posted by Neil W. Henry, March 21, 2001 for Statistics 608 students
Corrected January 14, 2004.
If you have comments or questions, email me at nhenry@vcu.edu