Copyright © The McGraw-Hill Companies.  All rights reserved.

Basic and Clinical Biostatistics: Chapter 10. Statistical Methods for Multiple Variables

Key Concepts

The choice of statistical methods depends on the research question, the scales on which the variables are measured, and the number of variables to be analyzed.
Many of the advanced statistical procedures can be interpreted as an extension or modification of multiple regression analysis.
Many of the statistical methods used for questions with one independent variable have direct analogies with methods for multiple independent variables.
The term "multivariate" is used when more than one independent variable is analyzed.
Multiple regression is a simple and ideal method to control for confounding variables.
Multiple regression coefficients indicate whether the relationship between the independent and dependent variables is positive or negative.
Dummy, or indicator, coding is used when nominal variables are used in multiple regression.
Regression coefficients indicate the amount of change in the dependent variable for each one-unit change in the X variable, holding other independent variables constant.
Multiple regression measures a linear relationship only.
The Multiple R statistic is the best indicator of how well the model fits the data—how much variance is accounted for.
Several methods can be used to select variables in a multivariate regression.
Polynomial regression can be used when the relationship is curvilinear.
Cross-validation tells us how applicable the model will be if we use it in another sample of subjects.
A good rule of thumb is to have ten times as many subjects as variables.
Analysis of covariance controls for confounding variables; it can be used as part of analysis of variance or in multiple regression.
Logistic regression predicts a nominal outcome; it is the most widely used regression method in medicine.
The regression coefficients in logistic regression can be transformed to give odds ratios.
The Cox model is the multivariate analogue of the Kaplan–Meier curve; it predicts time-dependent outcomes when there are censored observations.
The Cox model is also called the proportional hazard model; it is one of the most important statistical methods in medicine.
Meta-analysis provides a way to combine the results from several studies in a quantitative way and is especially useful when studies have come to opposite conclusions or are based on small samples.
An effect size is a measure of the magnitude of differences between two groups; it is a useful concept in estimating sample sizes.
The Cochrane Collection is a set of very well designed meta-analyses and is available at libraries and online.
Several methods are available when the goal is to classify subjects into groups.
Multivariate analysis of variance, or MANOVA, is analogous to using ANOVA when there are several dependent variables.

Presenting Problems

Presenting Problem 1

In Chapter 8 we examined the study by Jackson and colleagues (2002) who evaluated the relationship between BMI and percent body fat. Please refer to that chapter for more details on the study. We found a significant relationship between these two measures and calculated a correlation coefficient of r = 0.73. These investigators knew, however, that variables other than BMI may also affect the relationship between BMI and percent body fat and developed separate models for men and women. We use their data in this chapter to illustrate two important procedures: multiple regression to control possible confounding variables, and polynomial regression to model the nonlinear relationship we noted in Chapter 8. Data are on the CD-ROM [available only with the book] in a file entitled "Jackson."

Presenting Problem 2

Soderstrom and coinvestigators (1997) wanted to develop a model to identify trauma patients who are likely to have a blood alcohol concentration (BAC) in excess of 50 mg/dL. They evaluated data from a clinical trauma registry and toxicology database at a level I trauma center. Such patients might be candidates for alcohol and drug abuse and dependence treatment and intervention programs.

Data, including BAC, were available on 11,062 patients of whom approximately 71% were male and 65% were white. The mean age was 35 years with a standard deviation of 17 years. Type of injury was classified as unintentional, typically accidental (78.2%), or intentional, including suicide attempts (21.8%). Of these patients, 3180 (28.7%) had alcohol detected in the blood, and 91.2% of those patients had a BAC in excess of 50 mg/dL. Among the patients with a BAC > 50, percentages of men and whites did not differ appreciably from the entire sample; however, the percentage of intentional injuries in this group was higher (28.9%). We use a random sample of data provided by the investigators to illustrate the calculation and interpretation of the logistic model, the statistical method they used to develop their predictive model. Data are in a file called "Soderstrom" on the CD-ROM [available only with the book].

Presenting Problem 3

In the previous chapter we used data from a study by Crook and colleagues (1997) to illustrate the Kaplan–Meier survival analysis method. These investigators studied the correlation between both the pretreatment prostate-specific antigen (PSA) and posttreatment nadir PSA levels in men with localized prostate cancer who were treated using external beam radiation therapy. The Gleason histologic scoring system was used to classify tumors on a scale of 2 to 10. Please refer to Chapter 9 for more details. The investigators wanted to examine factors other than tumor stage that might be associated with treatment failure, and we use observations from their study to describe an application of the Cox proportional hazard model. Data on the patients are given in the file entitled "Crook" on the CD-ROM [available only with the book].

Presenting Problem 4

The use of central venous catheters to administer parenteral nutrition, fluids, or drugs is a common medical practice. Catheter-related bloodstream infections (CR-BSI) are a serious complication estimated to occur in about 200,000 patients each year. Many studies have suggested that impregnation of the catheter with the antiseptic chlorhexidine/silver sulfadiazine reduces bacterial colonization, but only one study has shown a significant reduction in the incidence of bloodstream infections.

It is difficult for physicians to interpret the literature when studies report conflicting results about the benefits of a clinical intervention or practice. As you now know, studies frequently fail to find significance because of low power associated with small sample sizes. Traditionally, conflicting results in medicine are dealt with by reviewing many studies published in the literature and summarizing their strengths and weaknesses in what are commonly called review articles. Veenstra and colleagues (1999) used a more structured method to combine the results of several studies in a statistical manner. They applied meta-analysis to 11 randomized, controlled clinical trials, comparing the incidence of bloodstream infection in impregnated catheters versus nonimpregnated catheters, so that overall conclusions regarding efficacy of the practice could be drawn. The section titled "Meta-Analysis" summarizes the results.

Purpose of the Chapter

The purpose of this chapter is to present a conceptual framework that applies to almost all the statistical procedures discussed so far in this text. We also describe some of the more advanced techniques used in medicine.

A Conceptual Framework

The previous chapters illustrated statistical techniques that are appropriate when the number of observations on each subject in a study is limited. For example, a t test is used when two groups of subjects are studied and the measure of interest is a single numerical variable, such as in Presenting Problem 1 in Chapter 6, which discussed differences in pulse oximetry in patients who did and did not have a pulmonary embolism (Kline et al, 2002). When the outcome of interest is nominal, the chi-square test can be used, as in the Lapidus et al (2002) study of screening for domestic violence in the emergency department (Chapter 6 Presenting Problem 3). Regression analysis is used to predict one numerical measure from another, such as in the study predicting insulin sensitivity in hyperthyroid women (Gonzalo et al, 1996; Chapter 7 Presenting Problem 2).

Alternatively, each of these examples can be viewed conceptually as involving a set of subjects with two observations on each subject: (1) for the t test, one numerical variable, pulse oximetry, and one nominal (or group membership) variable, development of pulmonary embolism; (2) for the chi-square test, two nominal variables, training in domestic violence and screening in the emergency department; (3) for regression, two numerical variables, insulin sensitivity and body mass index. It is advantageous to look at research questions from this perspective because ideas are analogous to situations in which many variables are included in a study.

To practice viewing research questions from a conceptual perspective, let us reconsider Presenting Problem 1 in Chapter 7 by Woeber (2002). The objective was to determine whether differences exist in serum free T4 concentrations in patients who had thyroiditis with normal serum TSH values and were not taking L-T4 replacement, had normal TSH values and were taking L-T4 replacement therapy, or had normal thyroid and serum TSH levels. The research question in this study may be viewed as involving a set of subjects with two observations per subject: one numerical variable, serum free T4 concentrations, and one ordinal (or group membership) variable, thyroid status, with three categories. If only two categories were included for thyroid status, the t test would be used. With more than two groups, however, one-way analysis of variance (ANOVA) is appropriate.

Many problems in medicine have more than two observations per subject because of the complexity involved in studying disease in humans. In fact, many of the presenting problems used in this text have multiple observations, although we chose to simplify the problems by examining only selected variables. One method involving more than two observations per subject has already been discussed: two-way ANOVA. Recall that in Presenting Problem 2 in Chapter 7 insulin sensitivity was examined in overweight and normal weight women with and without hyperthyroid disease (Gonzalo et al, 1996). For this analysis, the investigators classified women according to two nominal variables (weight status and thyroid status, both measured as normal or higher than normal) and one numerical variable, insulin sensitivity. (Although both weight and thyroid level are actually numerical measures, the investigators transformed them into nominal variables by dividing the values into two categories.)

If the term independent variable is used to designate the group membership variables (eg, development of pulmonary embolism or not), or the X variable (eg, blood pressure measured by a finger device), and the term dependent is used to designate the variables whose means are compared (eg, pulse oximetry), or the Y variable (eg, blood pressure measured by the cuff device), the observations can be summarized as in Table 10–1. (For the sake of simplicity, this summary omits ordinal variables; variables measured on an ordinal scale are often treated as if they are nominal.) Data from several of the presenting problems are available on the CD-ROM [available only with the book], and we invite you to replicate the analyses as you go through this chapter.

Table 10–1. Summary of Conceptual Frameworkᵃ for Questions Involving Two Variables.

Independent Variable              Dependent Variable      Method
Nominal (binary)                  Numerical               t testᵃ
Nominal (more than two values)    Numerical               One-way ANOVA
Nominal                           Numerical (censored)    Actuarial methods

ᵃAssuming the necessary assumptions (eg, normality, independence, etc.) are met.

ᵇCorrelation is appropriate when neither variable is designated as independent or dependent.

ANOVA = analysis of variance.

Introduction to Methods for Multiple Variables

Statistical techniques involving multiple variables are used increasingly in medical research, and several of them are illustrated in this chapter. The multiple-regression model, in which several independent variables are used to explain or predict the values of a single numerical response, is presented first, partly because it is a natural extension of the regression model for one independent variable illustrated in Chapter 8. More importantly, however, all the other advanced methods except meta-analysis can be viewed as modifications or extensions of the multiple-regression model. All except meta-analysis involve more than two observations per subject and are concerned with explanation or prediction.

The goal in this chapter is to present the logic of the different methods listed in Table 10–2 and to illustrate how they are used and interpreted in medical research. These methods are generally not mentioned in traditional introductory texts, and most people who take statistics courses do not learn about them until their third or fourth course. These methods are being used more frequently in medicine, however, partly because of the increased involvement of statisticians in medical research and partly because of the availability of complex statistical computer programs. In truth, few of these methods would be used very much in any field were it not for computers because of the time-consuming and complicated computations involved. To read the literature with confidence, especially studies designed to identify prognostic or risk factors, a reasonable acquaintance with the methods described in this chapter is required. Few of the available elementary books discuss multivariate methods. One that is directed toward statisticians is nevertheless quite readable (Chatfield, 1995); Katz (1999) is intended for readers of the medical literature and contains explanations of many of the topics we discuss in this chapter (Dawson, 2000), as do Norman and Streiner (1996).

Table 10–2. Summary of Conceptual Frameworkᵃ for Questions Involving Two or More Independent (Explanatory) Variables.

Independent Variables    Dependent Variable                Method(s)
Nominal and numerical    Nominal (binary)                  Logistic regression
Nominal and numerical    Nominal (2 or more categories)    Logistic regression
                                                           Discriminant analysisᵃ
                                                           Cluster analysis
                                                           Propensity scores
Numerical                Numerical                         Multiple regressionᵃ
Nominal and numerical    Numerical (censored)              Cox proportional hazard model
Confounding factors      Numerical                         ANCOVAᵃ
Confounding factors      Nominal                           Mantel–Haenszel
Numerical only                                             Factor analysis

ᵃCertain assumptions (eg, multivariate normality, independence, etc.) are needed to use these methods.

CART = classification and regression tree; ANOVA = analysis of variance; ANCOVA = analysis of covariance; MANOVA = multivariate analysis of variance; GEE = generalized estimating equations.

Before we examine the advanced methods, however, a comment on terminology is necessary. Some statisticians reserve the term "multivariate" to refer to situations that involve more than one dependent (or response) variable. By this strict definition, multiple regression and most of the other methods discussed in this chapter would not be classified as multivariate techniques. Other statisticians, ourselves included, use the term to refer to methods that examine the simultaneous effect of multiple independent variables. By this definition, all the techniques discussed in this chapter (with the possible exception of some meta-analyses) are classified as multivariate.

Multiple Regression

Review of Regression

Simple linear regression (Chapter 8) is the method of choice when the research question is to predict the value of a response (dependent) variable, denoted Y, from an explanatory (independent) variable X. The regression model is

$$Y = a + bX$$

For simplicity of notation in this chapter we use Y to denote the dependent variable, even though Y', the predicted value, is actually given by this equation. We also use a and b, the sample estimates, instead of the population parameters, $\beta_0$ and $\beta_1$, where a is the intercept and b the regression coefficient. Please refer to Chapter 8 if you'd like to review simple linear regression.

Multiple Regression

The extension of simple regression to two or more independent variables is straightforward. For example, if four independent variables are being studied, the multiple regression model is

$$Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4$$

where X1 is the first independent variable and b1 is the regression coefficient associated with it, X2 is the second independent variable and b2 is the regression coefficient associated with it, and so on. This arithmetic equation is called a linear combination; thus, the response variable Y can be expressed as a (linear) combination of the explanatory variables. Note that a linear combination is really just a weighted sum that gives a single number (or index) after the X's are multiplied by their associated b's and the bX products are added. The formulas for a and b were given in Chapter 8, but we do not give the formulas in multiple regression because they become more complex as the number of independent variables increases; and no one calculates them by hand, in any case.

The dependent variable Y must be a numerical measure. The traditional multiple-regression model calls for the independent variables to be numerical measures as well; however, nominal independent variables may be used, as discussed in the next section. To summarize, the appropriate technique for numerical independent variables and a single numerical dependent variable is the multiple regression model, as indicated in Table 10–2.

Multiple regression can be difficult to interpret, and the results may not be replicable if the independent variables are highly correlated with each other. In the extreme situation, two variables that are perfectly correlated are said to be collinear. When multicollinearity occurs, the variances of the regression coefficients are large so the observed value may be far from the true value. Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity by reducing the size of standard errors. It is hoped that the net effect will be to give more reliable estimates. Another regression technique, principal components regression, is also available, but ridge regression is the more popular of the two methods.
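The effect of ridge regression can be sketched with a small example. The code below is only an illustration under stated assumptions: the data are invented, the penalty value `lam` is arbitrary, and the helper names are hypothetical. It adds a constant to the diagonal of the sums-of-squares matrix before solving for the coefficients of two nearly collinear predictors, which is the basic idea behind ridge regression, and compares the result with ordinary least squares (`lam = 0`).

```python
# Ridge regression sketch for two nearly collinear predictors.
# All data and the penalty value are illustrative assumptions, not study data.

def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam*I) b = X'y for two centered predictors.

    lam = 0 gives the ordinary least-squares coefficients.
    """
    x1, x2, y = center(x1), center(x2), center(y)
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    # Cramer's rule for the 2 x 2 system
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]   # almost a copy of x1
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

ols = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)
print("Least-squares coefficients:", ols)
print("Ridge coefficients:        ", ridge)
```

The ridge coefficients are pulled toward zero relative to the least-squares ones; the price of the added stability is a small bias. Statistical packages usually standardize the predictors and choose the penalty from the data, which this sketch does not attempt.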

Interpreting the Multiple Regression Equation

Jackson and colleagues (2002) (Presenting Problem 1) wanted to study the way in which sex, age, and race affect the relationship between BMI and percent body fat. We provide some basic information on these variables in Table 10–3, which shows that the study included 121 black females, 238 white females, 81 black males, and 215 white males.

Table 10–3. Means and Standard Deviations Broken Down by Gender and Race.

Gender    Race 2    Statistic             Age         BMI        PCTFAT
                    Standard deviation    11.35229    6.14086     8.756
                    Standard deviation    13.79910    4.91353     9.8447
                    Standard deviation    13.03256    5.57608     9.9349
                    Standard deviation    11.97843    4.83454     7.3195
                    Standard deviation    15.06562    4.66455     9.0302
                    Standard deviation    14.30226    4.70670     8.5839
                    Standard deviation    11.60057    5.67188    10.4632
                    Standard deviation    14.43550    4.86781    10.0846
                    Standard deviation    13.64747    5.20919    10.3710

Source: Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796. Table produced with SPSS Inc.; used with permission.

Table 10–4 shows the regression equation to predict percent body fat (see the bold values). Focusing initially on the Regression Equation Section, we see that all the variables are statistically significantly related to percent body fat.

Table 10–4. Multiple Regression Predicting Percent Body Fat.

Multiple Regression Report

Run Summary Section
Dependent variable              PCTFAT      Rows processed              655
Number independent variables    4           Rows filtered out           0
Weight variable                 None        Rows with X's missing       0
R2                              0.8042      Rows with weight missing    0
Adj R2                          0.8030      Rows with Y missing         0
Coefficient of variation        0.1649      Rows used in estimation     655
Mean square error               21.18832    Sum of weights              655.000
Square root of MSE              4.603077    Completion status           Normal completion
Ave Abs Pct Error               19.089

Regression Equation Section
Independent    Regression          Standard       T-Value to test    Prob      Reject H0    Power of
Variable       Coefficient b(i)    Error Sb(i)    H0: B(i) = 0       Level     at 5%?       Test at 5%
Intercept      –8.3748             1.0338         –8.101             0.0000    Yes          1.0000
Age            0.1603              0.0140         11.442             0.0000    Yes          1.0000
BMI            1.3710              0.0372         36.809             0.0000    Yes          1.0000
Race           –0.9161             0.4005         –2.287             0.0225    Yes          0.6283
Sex            –10.2746            0.3638         –28.242            0.0000    Yes          1.0000

Regression Coefficient Section
Independent Variable    Regression Coefficient    Standard Error    Lower 95% C.L.    Upper 95% C.L.    Standardized Coefficient
Note: The T-Value used to calculate these confidence limits was 1.960.

Source: Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796. Analysis produced using NCSS; used with permission.

The first variable is a numerical variable, age, with regression coefficient, b, of 0.1603, indicating that greater age is associated with higher percent body fat. The second variable, BMI, is also numerical; the regression coefficient of 1.3710 indicates that patients with higher BMI also have higher percent body fat, which certainly makes sense.

The third variable, sex, is a binary variable having two values. For regression models it is convenient to code binary variables as 0 and 1; in the Jackson example, females have a 0 code for sex, and males have a 1. This procedure, called dummy or indicator coding, allows investigators to include nominal variables in a regression equation in a straightforward manner. The dummy variables are interpreted as follows: A subject who is male has the code for males, 1, multiplied by the regression coefficient for sex, –10.2746, resulting in 10.2746 points being subtracted from his predicted percent body fat. A female subject has the code 0, so nothing is added or subtracted for sex. The decision of which value is assigned 1 and which is assigned 0 is an arbitrary decision made by the researcher but can be chosen to facilitate interpretations of interest to the researcher.

The final variable is race, also dummy coded, with 0 for black and 1 for white. The regression coefficient is negative and indicates that white patients have 0.9161 subtracted from their percent body fat. The intercept itself is –8.3748, meaning that the predicted percent body fat is reduced by this amount after including all variables in the equation. The regression coefficients can be used to predict percent body fat by multiplying a given patient's value for each independent variable X by the corresponding regression coefficient b and then summing to obtain the predicted percent body fat.
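The prediction step just described can be sketched in a few lines. The coefficients below are copied from the Regression Equation Section of Table 10–4; the function name and the two example subjects are hypothetical and are not taken from the study data.

```python
# Coefficients copied from the Regression Equation Section of Table 10-4.
COEFFICIENTS = {
    "intercept": -8.3748,
    "age": 0.1603,     # years
    "bmi": 1.3710,
    "race": -0.9161,   # dummy code: 0 = black, 1 = white
    "sex": -10.2746,   # dummy code: 0 = female, 1 = male
}

def predict_pct_fat(age, bmi, race, sex):
    """Return predicted percent body fat as a + b1*X1 + b2*X2 + b3*X3 + b4*X4."""
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["age"] * age
            + COEFFICIENTS["bmi"] * bmi
            + COEFFICIENTS["race"] * race
            + COEFFICIENTS["sex"] * sex)

# Hypothetical subject (not from the study): 40-year-old white male, BMI 25.
male = predict_pct_fat(age=40, bmi=25, race=1, sex=1)
# The same hypothetical subject, but female: only the sex dummy changes.
female = predict_pct_fat(age=40, bmi=25, race=1, sex=0)

print(f"Predicted percent body fat, male:   {male:.2f}")
print(f"Predicted percent body fat, female: {female:.2f}")
```

The two predictions differ by exactly the sex coefficient, 10.2746 percentage points, because every other term in the sum is the same for both subjects.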

Regression coefficients are interpreted differently in multiple regression than in simple regression. In simple regression, the regression coefficient b indicates the amount the predicted value of Y changes each time X increases by 1 unit. In multiple regression, a given regression coefficient indicates how much the predicted value of Y changes each time X increases by 1 unit, holding the values of all other variables in the regression equation constant, as though all subjects had the same value on the other variables. For example, predicted percent body fat increases by 0.1603 for each increase of 1 year in a patient's age, assuming all other variables are held constant. This feature of multiple regression makes it an ideal method to control for baseline differences and confounding variables, as we discuss in the section titled "Controlling for Confounding."

It bears repeating that multiple regression measures only the linear relationship between the independent variables and the dependent variable, just as in simple regression. In the Jackson study, the authors examined the scatterplot between BMI and percent body fat, which we have reproduced in Figure 10–1. The figure indicates a curvilinear relationship, and the investigators decided to transform BMI by taking its natural logarithm. They developed four models for females and males separately to examine the cumulative effect of including variables in the regression equation; results are reproduced in Table 10–5. Model I includes only ln BMI and the intercept; model II adds age, model III adds race, and model IV adds the interactions of ln BMI with race and with age. The rationale for including interactions is the same as discussed in Chapter 7, namely that they wanted to know whether the relationship between ln BMI and percent body fat was the same for all levels of race or age.
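To make the transformation and the interaction terms concrete, the following sketch builds the predictor values that a model of the model IV type would use for one subject. The function name and the example subject are hypothetical; the study's fitted coefficients are not reproduced here.

```python
import math

def model_iv_terms(bmi, age, race):
    """Build predictor values for a model with ln BMI, age, race, and the
    two interaction terms (race coded 0 = black, 1 = white)."""
    ln_bmi = math.log(bmi)  # natural logarithm of BMI
    return {
        "ln_bmi": ln_bmi,
        "age": age,
        "race": race,
        "race_x_ln_bmi": race * ln_bmi,  # zero for subjects coded 0
        "age_x_ln_bmi": age * ln_bmi,
    }

# Hypothetical subject: BMI 25, age 40, race coded 1.
terms = model_iv_terms(bmi=25.0, age=40, race=1)
for name, value in terms.items():
    print(f"{name}: {value:.4f}")
```

An interaction term is simply the product of two predictors. Its coefficient lets the slope for ln BMI differ by race or change with age, which is the question the investigators wanted to examine.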

Table 10–5. Results from the Regression Analyses Predicting Percent Body Fat.

Female Models    Male Models
ln BMI           43.05ᵃ
Age              0.14ᵃ
Race × ln BMI    7.48ᵃ
Age × ln BMI     –0.41ᵃ
s.e.e. (% fat)

ᵃP < 0.001.

ᵇKey: race: black = 0 and white = 1.

Source: Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796.

Statistical Tests for the Regression Coefficient

Table 10–6 shows the output from NCSS for model III for female subjects; it contains a number of features to discuss. In the upper half of the table, note the columns headed by t value and probability level. Both the t test and the F test can be used to determine whether a regression coefficient is different from zero, or the t distribution can be used to form confidence intervals for each regression coefficient. Remember that even though the P values are sometimes reported as 0.000, there is always some probability, even if it is very small. Many statisticians believe, and we agree, that it is more accurate to report P < 0.001.
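The t value and the confidence limits follow directly from a coefficient and its standard error. As a small illustration, the sketch below uses the race coefficient reported in Table 10–4 (not the model III values) together with the multiplier of 1.960 that the NCSS output says it used for the 95% limits; the variable names are ours.

```python
# Values for the race coefficient, copied from Table 10-4.
b = -0.9161          # regression coefficient
se = 0.4005          # standard error of the coefficient
t_critical = 1.960   # multiplier NCSS reports using for the 95% limits

t_value = b / se                 # tests H0: B = 0
lower = b - t_critical * se      # lower 95% confidence limit
upper = b + t_critical * se      # upper 95% confidence limit

print(f"t = {t_value:.3f}")
print(f"95% confidence interval: ({lower:.4f}, {upper:.4f})")
```

The computed t value agrees with the –2.287 shown in the table, and the interval lies entirely below zero, which matches the decision to reject the null hypothesis at the 5% level.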

Table 10–6. Regression Analysis of Females, Model III.

Regression Equation Section
Independent Variable    Regression Coefficient b(i)    Standard Error Sb(i)    T-Value to test H0: B(i) = 0    Prob Level    Reject H0 at 5%?    Power of Test at 5%

Regression Coefficient Section
Independent Variable    Regression Coefficient    Standard Error    Lower 95% C.L.    Upper 95% C.L.    Standardized Coefficient
Note: The T-Value used to calculate these confidence limits was 1.960.

Source: Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796. Table produced with NCSS; used with permission.

Standardized Regression Coefficients

Most authors present regression coefficients that can be used with individual subjects to obtain predicted Y values. But the size of the regression coefficients cannot be used to decide which independent variables are the most important, because their size is also related to the scale on which the variables are measured, just as in simple regression. For example, in Jackson and colleagues' study, the variable race was coded 1 if white and 0 if black, and the variable age was coded as the number of years of age at the time of the first data collection. Then, if race and age are equally important in predicting percent body fat, the regression coefficient for race would be much larger than the regression coefficient for age so that the same amount would be added to the prediction of percent body fat. These regression coefficients are sometimes called unstandardized; they cannot be used to draw conclusions about the importance of the variable, but only whether the relationship with the dependent variable Y is positive or negative.ᵃ One way to eliminate the effect of scale is to standardize the regression coefficients. Standardization can be done by subtracting the mean value of X and dividing by the standard deviation before analysis, so that all variables have a mean of 0 and a standard deviation of 1. Then it is possible to compare the magnitudes of the regression coefficients and draw conclusions about which explanatory variables play an important role. It is also possible to calculate the standardized regression coefficients after the regression model has been developed.ᵇ The larger the standardized coefficient, the larger the value of the t statistic. Standardized regression coefficients are often referred to as beta (β) coefficients. The major disadvantage of standardized regression coefficients is that they cannot readily be used to predict outcome values.

The lower half of Table 10–6 contains the standardized regression coefficients in the far right column for the variables used to predict percent body fat in Jackson and colleagues' study. Using the standardized coefficients in Table 10–6, can you determine which variable, age or race, has more influence in predicting percent body fat? If you chose age, you are correct, because the absolute value of its standardized coefficient is larger, 0.1981, compared with –0.0777 for race.

ᵃTechnically it is possible for the regression coefficient and the correlation to have different signs. If so, the variable is called a moderator variable; it affects the relationship between the dependent variable and another independent variable.

ᵇThe standardized coefficient = the unstandardized coefficient multiplied by the standard deviation of the X variable and divided by the standard deviation of the Y variable: $\beta_j = b_j (SD_X / SD_Y)$.
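The conversion from an unstandardized to a standardized coefficient can be checked numerically. The sketch below uses invented data and a hypothetical helper name, and it works with a single X variable because that case gives a convenient check: with one predictor, the standardized coefficient equals the Pearson correlation r.

```python
import statistics

def standardized_coefficient(b, sd_x, sd_y):
    """Apply the formula beta_j = b_j * (SD_X / SD_Y)."""
    return b * sd_x / sd_y

# Illustrative data only (not from the Jackson study).
x = [22.0, 25.0, 27.0, 30.0, 33.0, 35.0]
y = [18.0, 24.0, 23.0, 31.0, 33.0, 38.0]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((c - mean_y) ** 2 for c in y)

b = sxy / sxx                   # unstandardized slope
r = sxy / (sxx * syy) ** 0.5    # Pearson correlation
beta = standardized_coefficient(b, statistics.stdev(x), statistics.stdev(y))

print(f"b = {b:.4f}, beta = {beta:.4f}, r = {r:.4f}")  # beta and r agree
```

In multiple regression the same formula is applied to each coefficient in turn, using the standard deviation of that coefficient's own X variable.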

Multiple R

Multiple R is the multiple-regression analogue of the Pearson product moment correlation coefficient r. It is also called the coefficient of multiple determination, but most authors use the shorter term. As an example, suppose percent body fat is calculated for each person in the study by Jackson and colleagues; then, the correlation between predicted percent body fat and the actual percent body fat is calculated. This correlation is the multiple R. If the multiple R is squared (R2), it measures how much of the variation in the actual percent body fat is accounted for by knowing the information included in the regression equation. The term R2 is interpreted in exactly the same way as r2 in simple correlation and regression, with 0 indicating no variance accounted for and 1.00 indicating 100% of the variance accounted for. Recall that in simple regression, the correlation between the actual value Y of the dependent variable and the predicted value, denoted Y', is the same as the correlation between the dependent variable and the independent variable; that is, rY'Y = rXY. Thus, R and R2 in multiple regression play the same role as r and r2 in simple regression. The statistical test for R and R2, however, uses the F distribution instead of the t distribution.
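The definition of multiple R as the correlation between actual and predicted values can be verified with a small sketch. Everything here is an assumption made for illustration: the data are invented, and the helper functions (`solve`, `fit`, `pearson`) are minimal stand-ins for what a statistical package does. The code fits a two-predictor regression by least squares, correlates Y with the predicted values, and compares the square of that correlation with the proportion of variance accounted for.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(rows, y):
    """Least-squares coefficients [a, b1, b2, ...] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def pearson(u, v):
    """Pearson product moment correlation between two lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / (suu * svv) ** 0.5

# Illustrative data only: (age, BMI) pairs and a made-up outcome.
rows = [(25, 21.0), (32, 24.5), (41, 27.0), (47, 23.0),
        (53, 30.5), (60, 26.0), (36, 33.0), (29, 28.0)]
y = [22.0, 27.5, 33.0, 29.0, 40.0, 35.5, 41.0, 32.0]

coef = fit(rows, y)
predicted = [coef[0] + sum(b * x for b, x in zip(coef[1:], row)) for row in rows]

multiple_r = pearson(y, predicted)
mean_y = sum(y) / len(y)
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
sst = sum((yi - mean_y) ** 2 for yi in y)

print(f"Multiple R = {multiple_r:.4f}")
print(f"R squared  = {multiple_r ** 2:.4f}")
print(f"1 - SSE/SST = {1 - sse / sst:.4f}")  # same quantity as R squared
```

For a least-squares fit that includes an intercept, the squared correlation and the proportion of variance accounted for are the same number, which is why R2 can be read either way.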

The computations are time-consuming, and fortunately, computers do them for us. Jackson and colleagues included R2 in Table 10–5 (although they used lowercase r2); it was 0.81 for model III (and is also shown in the NCSS output in Table 10–4). After ln BMI, age, and race are entered into the regression equation, R2 = 0.81 indicates that more than 80% of the variability in percent body fat is accounted for by knowing patients' BMI, age, and race. Because R2 is less than 1, we know that factors other than those included in the study also play a role in determining a person's percent body fat.

Selecting Variables for Regression Models

The primary purpose of Jackson and colleagues in their study of BMI and percent body fat was explanation; they used multiple regression analysis to learn how specific characteristics confounded the relationship between BMI and percent body fat. They also wanted to know how the characteristics interacted with one another, such as gender and race. Some research questions, however, focus on the prediction of the outcome, such as using the regression equation to predict percent body fat in future subjects.

Deciding on the variables that provide the best prediction is a process sometimes referred to as model building and is exemplified in Table 10–5. Selecting the variables for regression models can be accomplished in several ways. In one approach, all variables are introduced into the regression equation, called the "enter" method in SPSS and used in the multiple regression procedure in NCSS. Then, especially if the purpose is prediction, the variables that do not have significant regression coefficients are eliminated from the equation. The regression equation may be recalculated using only the variables retained because the regression coefficients have different values when some variables are removed from the analysis.

Computer programs also contain routines to select an optimal set of explanatory variables. One such procedure is called forward selection. Forward selection begins with one variable in the regression equation; then, additional variables are added one at a time until all statistically significant variables are included in the equation. The first variable in the regression equation is the X variable that has the highest correlation with the response variable Y. The next X variable considered for the regression equation is the one that increases R2 by the largest amount. If the increment in R2 is statistically significant by the F test, it is included in the regression equation. This step-by-step procedure continues until no X variables remain that produce a significant increase in R2. The values for the regression coefficients are calculated, and the regression equation resulting from this forward selection procedure can be used to predict outcomes for future subjects. The increment in R2 was calculated by Jackson and colleagues; it is shown as r2 in Table 10–5.

A similar backward elimination procedure can also be used; in it, all variables are initially included in the regression equation. The X variable that would reduce R2 by the smallest increment is removed from the equation. If the resulting decrease is not statistically significant, that variable is permanently removed from the equation. Next, the remaining X variables are examined to see which produces the next smallest decrease in R2. This procedure continues until the removal of an X variable from the regression equation causes a significant reduction in R2. That X variable is retained in the equation, and the regression coefficients are calculated.

When features of both the forward selection and the backward elimination procedures are used together, the method is called stepwise regression (stepwise selection). Stepwise selection is commonly used in the medical literature; it begins in the same manner as forward selection. After each addition of a new X variable to the equation, however, all previously entered X variables are checked to see whether they maintain their level of significance. Previously entered X variables are retained in the regression equation only if their removal would cause a significant reduction in R2. The forward versus backward versus stepwise procedures have subtle advantages related to the correlations among the independent variables that cannot be covered in this text. They do not generally produce identical regression equations, but conceptually, all approaches determine a "parsimonious" equation using a subset of explanatory variables.

Some statistical programs examine all possible combinations of predictor variables and determine the one that produces the overall highest R2, such as All Possible Regression in NCSS. We do not recommend this procedure, however, and suggest that a more appealing approach is to build a model in a logical way. Variables are sometimes grouped according to their function, such as all demographic characteristics, and added to the regression equation as a group or block; this process is often called hierarchical regression; see exercise 7 for an example. The advantage of a logical approach to building a regression model is that, in general, the results tend to be more stable and reliable and are more likely to be replicated in similar studies.

Polynomial Regression

Polynomial regression is a special case of multiple regression in which each term in the equation is a power of X. Polynomial regression provides a way to fit a regression model to curvilinear relationships and is an alternative to transforming the data to a linear scale. For example, the following equation can be used to predict a quadratic relationship:

$$Y' = a + b_1 X + b_2 X^2$$

If a linear and quadratic term do not provide an adequate fit, a cubic term, a fourth-power term, and so on, can also be included until an adequate fit is obtained.
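Because each power of X is simply treated as one more independent variable, a quadratic fit can be obtained with ordinary least squares. The following is a minimal sketch using only the Python standard library; the helper names and the data are hypothetical, and a statistical package would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_polynomial(x, y, degree=2):
    """Least squares fit of y on 1, x, ..., x**degree via the normal equations."""
    # Each power of X becomes one column of the design matrix.
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from Y = 1 + 2X + 3X^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1 + 2 * v + 3 * v ** 2 for v in xs]
a, b1, b2 = fit_polynomial(xs, ys, degree=2)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Raising `degree` adds cubic and higher terms in the same way, at the cost of a less stable fit in small samples.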

Jackson and colleagues (2002) used polynomial regression to fit separate curves for men and women, illustrated in Figure 10–1. Two approaches to polynomial regression can be used. The first method calculates squared terms, cubic terms, and so on; these terms are then entered one at a time using multiple regression. Another approach is to use a program that permits curve fitting, such as the regression curve estimation procedure in SPSS. We used the SPSS procedure to fit a quadratic curve of BMI to percent body fat for women. The regression equation was:

A plot produced by SPSS is given in Figure 10–2.

Missing Observations

When studies involve several variables, some observations on some subjects may be missing. Controlling the problem of missing data is easier in studies in which information is collected prospectively; it is much more difficult when information is obtained from already existing records, such as patient charts. Two important factors are the percentage of observations that is missing and whether missing observations are randomly missing or missing because of some causal factor.

For example, suppose a researcher designs a case–control study to examine the effect of leg length inequality on the incidence of loosening of the femoral component after total hip replacement. Cases are patients who developed loosening of the femoral component, and controls are patients who did not. In reviewing the records of routine follow-up, the researcher found that leg length inequality was measured in some patients by using weight-bearing anterior–posterior (AP) hip and lower extremity films, whereas other patients had measurements taken using non-weight-bearing films. The type of film ordered during follow-up may well be related to whether the patient complained of hip pain; patients with symptoms were more likely to have received the weight-bearing films, and patients without symptoms were more likely to have had the routine non-weight-bearing films. A researcher investigating this question must not base the leg length inequality measures on weight-bearing films only, because controls are less likely than cases to have weight-bearing film measures in their records. In this situation, the missing leg length information occurred because of symptoms and not randomly.

The potential for missing observations increases in studies involving multiple variables. Depending on the cause of the missing observations, solutions include dropping subjects who have missing observations from the study, deleting variables that have missing values from the study, or substituting some value for the missing data, such as the mean or a predicted value, called imputing. SPSS has an option to estimate missing data with the mean for that variable calculated with the subjects who had the data. The Data Screening procedure (in Descriptive Statistics) in NCSS provides the option of substituting either the mean or a predicted score. Investigators in this situation should seek advice from a statistician on the best way to handle the problem.

Cross Validation

The statistical procedures for all regression models are based on correlations among the variables, which, in turn, are related to the amount of variation in the variables included in the study. Some of the observed variation in any variable, however, occurs simply by chance; and the same degree of variation does not occur if another sample is selected and the study is replicated. The mathematical procedures for determining the regression equation cannot distinguish between real and chance variation. If the equation is to be used to predict outcomes for future subjects, it should therefore be validated on a second sample, a process called cross validation. The regression equation is used to predict the outcome in the second sample, and the predicted outcomes are compared with the actual outcomes; the correlation between the predicted and actual values indicates how well the model fits. Cross-validating the regression equation gives a realistic evaluation of the usefulness of the prediction it provides.

In medical research we rarely have the luxury of cross-validating the findings on another sample of the same size. Several alternative methods exist. First, researchers can hold out a proportion of the subjects for cross validation, perhaps 20% or 25%. The holdout sample should be randomly selected from the entire sample prior to the original analysis. The predicted outcomes in the holdout sample are compared with the actual outcomes, often using R2 to judge how well the findings cross-validate.
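A holdout sample of this kind can be drawn before any analysis is run. The sketch below uses hypothetical subject identifiers and a 25% holdout; the function name and the fixed seed are illustrative assumptions.

```python
import random

def split_holdout(subject_ids, holdout_fraction=0.25, seed=0):
    """Randomly set aside a fraction of subjects before any model fitting."""
    rng = random.Random(seed)  # fixed seed only so the split is reproducible
    n_holdout = round(len(subject_ids) * holdout_fraction)
    holdout = set(rng.sample(subject_ids, n_holdout))
    training = [s for s in subject_ids if s not in holdout]
    return training, sorted(holdout)

subjects = list(range(1, 101))  # hypothetical subject identifiers
training_ids, holdout_ids = split_holdout(subjects, holdout_fraction=0.25)
print(len(training_ids), len(holdout_ids))
```

The regression equation would then be estimated on the training subjects only and applied unchanged to the holdout subjects.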

Another method is the jackknife in which one observation is left out of the sample, call it x1; regression is performed using the n – 1 observations, and the results are applied to x1. Then this observation is returned to the sample, and another, x2, is held out. This process continues until there is a predicted outcome for each observation in the sample; the predicted and actual outcomes are then compared.
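The leave-one-out procedure just described can be sketched as follows. For brevity the sketch uses simple regression with a single predictor; the data are hypothetical, and the same loop applies when the fitted model has several independent variables.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def leave_one_out_predictions(x, y):
    """Predict each observation from a line fitted to the other n - 1."""
    predictions = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_rest, y_rest)
        predictions.append(intercept + slope * x[i])
    return predictions

# Hypothetical BMI and percent body fat values, for illustration only.
bmi = [19.5, 22.0, 24.3, 26.1, 28.7, 31.2, 33.8]
fat = [21.0, 25.2, 27.9, 31.5, 33.0, 37.4, 39.1]
held_out_predictions = leave_one_out_predictions(bmi, fat)
for actual, predicted in zip(fat, held_out_predictions):
    print(f"actual {actual:5.1f}   predicted {predicted:5.1f}")
```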

The bootstrap method works in a similar manner although the goal is different. The bootstrap can be used with small samples to estimate the standard error and confidence intervals. A small hold-out sample is randomly selected and the statistic of interest calculated. Then the hold-out sample is returned to the original sample, and another hold-out sample is selected. After a fairly large number of samples is analyzed, generally a minimum of 200, standard errors and confidence intervals can be estimated. In essence, the bootstrap method uses the data itself to determine the sampling distribution rather than the central limit theorem discussed in Chapter 4.
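As a rough illustration, the sketch below estimates a standard error and a 95% interval for a mean by repeated resampling. It assumes the most common form of the bootstrap, in which each resample has the same size as the original sample and is drawn with replacement; the data, the number of resamples, and the function name are hypothetical.

```python
import random
import statistics

def bootstrap_mean(data, n_resamples=1000, seed=0):
    """Bootstrap standard error and approximate 95% percentile interval for the mean."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # n draws with replacement
        means.append(statistics.mean(resample))
    means.sort()
    standard_error = statistics.stdev(means)
    # Approximate 2.5th and 97.5th percentiles of the resampled means.
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return standard_error, lower, upper

# Hypothetical measurements, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.4, 6.2, 5.0]
standard_error, lower, upper = bootstrap_mean(sample)
print(f"SE = {standard_error:.3f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```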

Both the jackknife and bootstrap are called resampling methods; they are very computer-intensive and require special software. Kline and colleagues (2002) used a bootstrap method to develop confidence intervals for odds ratios in their study of the use of the D-dimer test in the emergency department.

It is possible to estimate the magnitude of R or R2 in another sample without actually performing the cross validation. This R2 is smaller than the R2 for the original sample, because the mathematical formula used to obtain the estimate removes the chance variation. For this reason, the formula is called a formula for shrinkage. Many computer programs, including NCSS, SPSS, and SAS, provide both R2 for the sample used in the analysis as well as R2 adjusted for shrinkage, often referred to as the adjusted R2. Refer to Table 10–4 where NCSS gives the "Adj R2" in the fifth row of the first column of the computer analysis.
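The text does not give the shrinkage formula itself. A commonly used version of the adjusted R2, assumed here and possibly differing in detail from what a particular program reports, is 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the number of subjects and k the number of independent variables.

```python
def adjusted_r_squared(r_squared, n_subjects, n_predictors):
    """Adjusted R2 under the assumed formula 1 - (1 - R2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r_squared) * (n_subjects - 1) / (n_subjects - n_predictors - 1)

# R2 = 0.81 is the value quoted for model III; the sample sizes are hypothetical.
for n in (20, 50, 500):
    print(n, round(adjusted_r_squared(0.81, n, 3), 4))
```

The adjustment is largest when the sample is small relative to the number of variables and becomes negligible in large samples.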

Sample Size Requirements

The only easy way to determine how large a sample is needed in multiple regression or any multivariate technique is to use a computer program. Some rules of thumb, however, may be used for guidance. A common recommendation by statisticians calls for ten times as many subjects as the number of independent variables. For example, this rule of thumb prescribes a minimum of 60 subjects for a study predicting the outcome from six independent variables. Having a large ratio of subjects to variables decreases problems that may arise because assumptions are not met.

Assumptions about normality in multiple regression are complicated, depending on whether the independent variables are viewed as fixed or random (as in fixed-effects model or random-effects model in ANOVA), and they are beyond the scope of this text. To ensure that estimates of regression coefficients and multiple R and R2 are accurate representatives of actual population values, we suggest that investigators never perform regression without at least five times as many subjects as variables.

A more accurate estimate is found by using a computer power program. We used the PASS power program to find the power of a study using five predictor variables, as in the Jackson study (Table 10–5). We posed the question: How many subjects are needed to test whether a given variable increases R2 by 0.05, given that four variables are already in the regression equation and they collectively provide an R2 of 0.50? The output from the program is shown in Box 10–1. The power table indicates that a sample of 80 gives power of 0.84, assuming an α or P value of 0.05. The accompanying graph shows the power curve for different sample sizes and different values of α. As you can see, the sample of 359 females and 296 males in the study by Jackson and colleagues was more than adequate for the regression model.

Box 10–1. Linear and Quadratic Curves for the Relationship between BMI and Percent Body Fat in Females.

Multiple-Regression Power Analysis
Power | N | Alpha | Beta | Ind. Variables Tested (Cnt, R2) | Ind. Variables Controlled (Cnt, R2)
Summary Statements 
A sample size of 80 achieves 84% power to detect an R-Squared of 0.05000 attributed to 1 independent variable(s) using an F-Test with a significance level (alpha) of 0.05000. The variables tested are adjusted for an additional 4 independent variable(s) with an R-Squared of 0.50000.

Source: Data, used with permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC: The effect of sex, age and race on estimating percentage body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796. Output produced with PASS; used with permission.

Controlling for Confounding

Analysis of Covariance

Analysis of covariance (ANCOVA) is the statistical technique used to control for the influence of a confounding variable. Confounding variables occur most often when subjects cannot be assigned at random to different groups, that is, when the groups of interest already exist. Gonzalo and colleagues (1996) (Chapters 7 and 8) predicted insulin sensitivity from body mass index (BMI); they wanted to control for age of the women and did so by adding age to the regression equation. When BMI alone is used to predict insulin sensitivity (IS) in hyperthyroid women, the regression equation is

$$\text{IS} = 2.336 - 0.077\,\text{BMI}$$

where IS is the insulin sensitivity level. Using this equation, a hyperthyroid woman's insulin sensitivity level is predicted to decrease by 0.077 for each increase of 1 in BMI. For instance, a woman with a BMI of 25 has a predicted insulin sensitivity of 0.411. What would happen, however, if age were also related to insulin sensitivity? A way to control for the possible confounding effect of age is to include that variable in the regression equation. The equation with age included is

$$\text{IS} = 2.291 - 0.068\,\text{BMI} - 0.0045\,\text{Age}$$

Using this equation, a hyperthyroid woman's insulin sensitivity level is predicted to decrease by 0.068 for each increase of 1 in BMI, holding age constant or independent of age. A 30-year-old woman with a BMI of 25 has a predicted insulin sensitivity of 0.456, whereas a 60-year-old woman with the same BMI of 25 has a predicted insulin sensitivity of 0.321.
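The predictions quoted above can be checked with a short calculation. In this sketch the BMI slopes are those stated in the text, whereas the intercepts and the age coefficient are the values implied by the quoted predictions and should be read as assumptions.

```python
def predict_is_bmi_only(bmi):
    # Slope -0.077 is quoted in the text; intercept 2.336 is implied by the
    # quoted prediction of 0.411 at a BMI of 25 (an assumption).
    return 2.336 - 0.077 * bmi

def predict_is_bmi_age(bmi, age):
    # BMI slope -0.068 is quoted in the text; intercept 2.291 and age slope
    # -0.0045 are implied by the quoted predictions at ages 30 and 60.
    return 2.291 - 0.068 * bmi - 0.0045 * age

print(round(predict_is_bmi_only(25), 3))     # BMI alone
print(round(predict_is_bmi_age(25, 30), 3))  # 30-year-old, BMI 25
print(round(predict_is_bmi_age(25, 60), 3))  # 60-year-old, BMI 25
```

Holding BMI at 25, the two age-adjusted predictions differ only through the age term, which is what controlling for age means here.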

A more traditional use of ANCOVA is illustrated by a study of the negative influence of smoking on the cardiovascular system. Investigators wanted to know whether smokers have more ventricular wall motion abnormalities than nonsmokers (Hartz et al, 1984). They might use a t test to determine whether the mean number of wall motion abnormalities differs in these two groups. The investigators know, however, that wall motion abnormalities are also related to the degree of coronary stenosis, and smokers generally have a greater degree of coronary stenosis. Thus, any difference observed in the mean number of wall abnormalities between smokers and nonsmokers may really be a difference in the amount of coronary stenosis between these two groups of patients.

This situation is illustrated in the graph of hypothetical data in Figure 10–3; in the figure, the relationship between occlusion scores and wall motion abnormalities appears to be the same for smokers and nonsmokers. Nonsmokers, however, have both lower occlusion scores and lower numbers of wall motion abnormalities; smokers have higher occlusion scores and higher numbers of wall motion abnormalities. The question is whether the difference in wall motion abnormalities is due to smoking, to occlusion, or to both.

In this study, the investigators must control for the degree of coronary stenosis so that it does not confound (or confuse) the relationship between smoking and wall motion abnormalities. Useful methods to control for confounding variables are analysis of covariance (ANCOVA) and the Mantel–Haenszel chi-square procedure. Table 10–2 specifies ANCOVA when the dependent variable is numerical (eg, wall motion) and the independent measures are grouping variables on a nominal scale (eg, smoking versus nonsmoking), and confounding variables occur (eg, degree of coronary occlusion). If the dependent measure is also nominal, such as whether a patient has survived to a given time, the Mantel–Haenszel chi-square discussed in Chapter 9 can be used to control for the effect of a confounding (nuisance) variable. ANCOVA can be performed by using the methods of ANOVA; however, most medical studies use one of the regression methods discussed in this chapter.

If ANCOVA is used in this example, the occlusion score is called the covariate, and the mean number of wall motion abnormalities in smokers and nonsmokers is said to be adjusted for the occlusion score (or degree of coronary stenosis). Put another way, ANCOVA simulates the Y outcome observed if the value of X is held constant, that is, if all the patients had the same degree of coronary stenosis. This adjustment is achieved by calculating a regression equation to predict mean number of wall motion abnormalities from the covariate, degree of coronary stenosis, and from a dummy variable coded 1 if the subject is a member of the group (ie, a smoker) and 0 otherwise. For example, the regression equation determined for the hypothetical observations in Figure 10–3 is

The equation illustrates that smokers have a larger number of predicted wall motion abnormalities, because 1.28 is added to the equation if the subject is a smoker. The equation can be used to obtain the mean number of wall motion abnormalities in each group, adjusted for degree of coronary stenosis.

If the relationship between coronary stenosis and ventricular motion is ignored, the mean number of wall motion abnormalities, calculated from the observations in Figure 10–3, is 3.33 for smokers and 1.00 for nonsmokers. If, however, ANCOVA is used to control for degree of coronary stenosis, the adjusted mean wall motion is 2.81 for smokers and 1.53 for nonsmokers, a difference of 1.28, represented by the regression coefficient for the dummy variable for smoking. In ANCOVA, the adjusted Y mean for a given group is obtained by (1) finding the difference between the group's mean on the covariate variable X, denoted $\bar{X}_j$, and the grand mean $\bar{X}$; (2) multiplying the difference by the regression coefficient b; and (3) subtracting this product from the unadjusted mean. Thus, for group j, the adjusted mean is

$$\text{Adj. } \bar{Y}_j = \bar{Y}_j - b(\bar{X}_j - \bar{X})$$

(See Exercise 1.)

This result is consistent with our knowledge that coronary stenosis alone has some effect on abnormality of wall motion; the unadjusted means contain this effect as well as any effect from smoking. Controlling for the effect of coronary stenosis therefore results in a smaller difference in number of wall motion abnormalities, a difference related only to smoking.
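The three-step adjustment described above amounts to a one-line calculation. The numbers in this sketch are entirely hypothetical (they are not the values behind Figure 10–3) and are chosen only to show the direction of the adjustment.

```python
def adjusted_mean(group_mean_y, group_mean_x, grand_mean_x, slope):
    """Adjusted Y mean = unadjusted mean - slope * (group X mean - grand X mean)."""
    return group_mean_y - slope * (group_mean_x - grand_mean_x)

# Hypothetical values: the covariate X is an occlusion score, Y the number of
# wall motion abnormalities, and slope the common regression coefficient.
slope = 0.1
grand_mean_x = 40.0
adjusted_smokers = adjusted_mean(6.0, 50.0, grand_mean_x, slope)
adjusted_nonsmokers = adjusted_mean(3.0, 30.0, grand_mean_x, slope)
print(adjusted_smokers, adjusted_nonsmokers)
```

The group with the higher covariate mean is adjusted downward and the group with the lower covariate mean is adjusted upward, so the adjusted difference is smaller than the unadjusted one.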

Using hypothetical data, Figure 10–4 illustrates schematically the way ANCOVA adjusts the mean of the dependent variable if the covariate is important. Using unadjusted means is analogous to using a separate regression line for each group. For example, the mean value of Y for group 1 is found by using the regression line drawn through the group 1 observations to project the mean value $\bar{X}_1$ onto the Y-axis, denoted $\bar{Y}_1$ in Figure 10–4. Similarly, the mean of group 2 is found at $\bar{Y}_2$ by using the regression line to project the mean $\bar{X}_2$ in that group. The Y means in each group adjusted for the covariate (stenosis) are analogous to the projections based on the overall mean value of the covariate; that is, as though the two groups had the same mean value for the covariate. The adjusted means for groups 1 and 2, Adj. $\bar{Y}_1$ and Adj. $\bar{Y}_2$, are illustrated by the dotted line projections of $\bar{X}$ from each separate regression line in Figure 10–4.

ANCOVA assumes that the relationship between the covariate (X variable) and the dependent variable (Y) is the same in both groups, that is, that any relationship between coronary stenosis and wall motion abnormality is the same for smokers and nonsmokers. This assumption is equivalent to requiring that the regression slopes be the same in both groups; geometrically, ANCOVA asks whether a difference exists between the intercepts, assuming the slopes are equal.

ANCOVA is an appropriate statistical method in many situations that occur in medical research. For example, age is a variable that affects almost everything studied in medicine; if preexisting groups in a study have different age distributions, investigators must adjust for age before comparing the groups on other variables, just as Gonzalo and colleagues recognized. The methods illustrated in Chapter 3 to adjust mortality rates for characteristics such as age and birth weight are used when information is available on groups of individuals; when information is available on individuals themselves, ANCOVA is used.

Before leaving this section, we point out some important aspects of ANCOVA. First, although only two groups were included in the example, ANCOVA can be used to adjust for the effect of a confounding variable in more than two groups. In addition, it is possible to adjust for more than one confounding variable in the same study, and the confounding variables may be either nominal or numerical. Thus, it is easy to see why the multiple regression model for analysis of covariance provides an ideal method to incorporate confounding variables.

Finally, ANCOVA can be considered as a special case of the more general question of comparing two regression lines (discussed in Chapter 8). In ANCOVA, we assume that the slopes are equal, and attention is focused on the intercept. We can also perform the more global test of both slope and intercept, however, by using multiple regression. In Presenting Problem 4 in Chapter 8 on insulin sensitivity (Gonzalo et al, 1996), interest focused on comparing the regression lines predicting insulin sensitivity from body mass index (BMI) in women who had normal versus elevated thyroid levels. ANCOVA can be used for this comparison using dummy coding. If we let X be BMI, Y be insulin sensitivity level, and Z be a dummy variable, where Z = 1 if the woman is hyperthyroid and Z = 0 for controls, then the multiple-regression model for testing whether the two regression lines are the same (coincident) is

$$Y = a + b_1 X + b_2 Z + b_3 XZ$$

The regression lines have equal slopes and are parallel when b3 is 0, that is, no interaction between the independent variable X and the group membership variable Z. The regression lines have equal intercepts and equal slopes (are coincident) if both b2 and b3 are 0; thus, the model becomes the simple regression equation Y = a + bX. The statistical test for b2 and b3 is the t test discussed in the section titled, "Statistical Tests for the Regression Coefficient."
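To see why b2 and b3 carry the comparison, it helps to write out the line the model implies for each group: with a 0/1 dummy variable Z, the intercept is a + b2·Z and the slope is b1 + b3·Z. The coefficients in this sketch are hypothetical and serve only to illustrate the algebra.

```python
def group_line(a, b1, b2, b3, z):
    """Intercept and slope implied by Y = a + b1*X + b2*Z + b3*X*Z for dummy Z."""
    return a + b2 * z, b1 + b3 * z

# Hypothetical coefficients, for illustration only.
a, b1, b2, b3 = 2.0, -0.08, 0.5, 0.02

control_intercept, control_slope = group_line(a, b1, b2, b3, z=0)
hyper_intercept, hyper_slope = group_line(a, b1, b2, b3, z=1)
print("controls:    ", control_intercept, control_slope)
print("hyperthyroid:", hyper_intercept, hyper_slope)
```

Setting b3 to 0 makes the two slopes identical (parallel lines); setting both b2 and b3 to 0 makes the two lines coincide.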

Generalized Estimating Equations (GEE)

Many research designs, including both observational studies and clinical trials, concern observations that are clustered or hierarchical. A group of methods has been developed for these special situations. To illustrate, a study to examine the effect of different factors on complication rates following total knee arthroplasty was undertaken in a province of Canada (Kreder et al, 2003). Outcomes included length of hospital stay, inpatient complications, and mortality. Can the researchers examine the outcomes for patients and conclude that any differences are due to the risk factors? The statistical methods we have examined thus far assume that one observation is independent from another. The problem with this study design, however, is that the outcome for patients operated on by the same surgeon may be related to factors other than the surgical method, such as the skill level of the surgeon. In this situation, patients are said to be nested within physicians.

Many other examples come to mind. Comparing the efficacy of medical education curricula is difficult because students are nested within medical schools. Comparing health outcomes for children within a community is complicated by the fact that children are nested within families. Many clinical trials create nested situations, such as when trials are carried out in several medical centers. The issue arises of how to define the unit of analysis—should it be the students or the school? the children or the families? the patients or the medical center?

The group of methods that accommodates these types of research questions includes generalized estimating equations (GEE), multilevel modeling, and the analysis of hierarchically structured data. Most of these methods have been developed within the last decade and statistical software is just now becoming widely available. In addition to some specialized statistical packages, SAS, Stata, and SPSS contain procedures to accommodate hierarchical data. Using these models is more complex than some of the other methods we have discussed, and it is relatively easy to develop a model that is meaningless or misleading. Investigators who have research designs that involve nested subjects should consult a biostatistician for assistance.

Predicting Nominal or Categorical Outcomes

In the regression model discussed in the previous section, the outcome or dependent Y variable is measured on a numerical scale. When the outcome is measured on a nominal scale, other approaches must be used. Table 10–2 indicates that several methods can be used to analyze problems with several independent variables when the dependent variable is nominal. First we discuss logistic regression, a method that is frequently used in the health field. One reason for the popularity of logistic regression is that many outcomes in health are nominal, actually binary, variables—they either occur or do not occur. The second reason is that the regression coefficients obtained in logistic regression can be transformed into odds ratios. So, in essence, logistic regression provides a way to obtain an odds ratio for a given risk factor that controls for, or is adjusted for, confounding variables; in other words, we can do analysis of covariance with logistic regression as well as with multiple linear regression.

Other methods are log-linear analysis and several methods that attempt to classify subjects into groups. These methods appear occasionally in the medical literature, and we provide a brief illustration, primarily so that readers can have an intuitive understanding of their purpose. The classification methods are discussed in the section titled "Methods for Classification."

Logistic Regression

Logistic regression is commonly used when the independent variables include both numerical and nominal measures and the outcome variable is binary (dichotomous). Logistic regression can also be used when the outcome has more than two values (Hosmer and Lemeshow, 2000), but its most frequent use is as in Presenting Problem 2, which illustrates the use of logistic regression to identify trauma patients who are alcohol-positive, a yes-or-no outcome. Soderstrom and his coinvestigators (1997) wanted to develop a model to help emergency department staff identify the patients most likely to have blood alcohol concentrations (BAC) in excess of 50 mg/dL at the time of admission. The logistic model gives the probability that the outcome, such as high BAC, occurs as an exponential function of the independent variables. For example, with three independent variables, the model is

$$P(X) = \frac{1}{1 + \exp[-(b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3)]}$$

where b0 is the intercept, b1, b2, and b3 are the regression coefficients, and exp indicates that the base of the natural logarithm (2.718) is taken to the power shown in parentheses (ie, the antilog). The equation can be derived by specifying the variables to be included in the equation or by using a variable selection method similar to the ones for multiple regression. A chi-square test (instead of the t or F test) is used to determine whether a variable adds significantly to the prediction.

In the study described in Presenting Problem 2, the variables used by the investigators to predict blood alcohol concentrations included the variables listed in Table 10–7. The investigators coded the values of the independent variables as 0 and 1, a method useful both for dummy variables in multiple regression and for variables in logistic regression. This practice makes it easy to interpret the odds ratio. In addition, if a goal is to develop a score, as is the case in the study by Soderstrom and coinvestigators, the coefficient associated with a given variable needs to be included in the score only if the patient has a 1 on that variable. For instance, if patients are more likely to have BAC ≥ 50 mg/dL on weekends, the score associated with day of week is not included if the injury occurs on a weekday.

Table 10–7. Variables, Codes, and Frequencies for Variables.a

Variable                        Code    Frequency
Age
  39 or younger                 0       3514
  40 or older                   1       1534
Time of Day
  6 PM–6 AM                     0       2601
  6 AM–6 PM                     1       2447
Day of week
Injury Type
Blood Alcohol Concentration
  <50 mg/dL                     0       4067
  ≥50 mg/dL                     1       1465

aNot all totals are the same because of missing data on some variables.

Source: Data, used with permission, from Soderstrom CA, Kufera JA, Dischinger PC, Kerns TJ, Murphy JG, Lowenfels A: Predictive model to identify trauma patients with blood alcohol concentrations ≥ 50 mg/dL. J Trauma 1997;42:67–73.

The investigators calculated logistic regression equations for each of four groups: males with intentional injury, males with unintentional injury, females with intentional injury, and females with unintentional injury. The results of the analysis on males who were injured unintentionally are given in Table 10–8.

Table 10–8. Logistic Regression Report for Men with Unintentional Injury.a

Filter:    sex=1; injtype=0
Response:  BAC50

Parameter Estimation Section
Variable    Regression Coefficient    Standard Error    χ2      Probability Level    Last R2
Age ≥40     –0.1198371                0.1082209         1.23    0.268148             0.000466

Odds Ratio Estimation Section
Variable    Regression Coefficient    Standard Error    Odds Ratio    Lower 95% Confidence Limit    Upper 95% Confidence Limit
Age ≥40     –0.119837                 0.108221          0.887065      0.717526                      1.096663

Model Summary Section
Model R2    Model df    Model χ2    Model Probability

Classification Table
Actual              0         1         Total
0
  Row percent       89.66     10.34     100.00
  Column percent    80.88     42.98     74.12
1
  Row percent       60.70     39.30     100.00
  Column percent    19.12     57.02     25.88
Total
  Row percent       82.16     17.84
  Column percent    100.00    100.00
Percent correctly classified = 76.62

aResults from logistic regression for men with unintentional injury.

Source: Data, used with permission, from Soderstrom CA, Kufera JA, Dischinger PC, Kerns TJ, Murphy JG, Lowenfels A: Predictive model to identify trauma patients with blood alcohol concentrations 50 mg/dL. J Trauma 1997;42:67–73. Output produced using NCSS; used with permission.

We need to know which value is coded 1 and which 0 in order to interpret the results. For example, time of day has a negative regression coefficient. The hours of 6 AM–6 PM are coded as 1, so a male coming to the emergency department with unintentional injuries in the daytime is less likely to have BAC ≥50 mg/dL than a male with unintentional injuries at night. The age variable is not significant (P = 0.268). Interpreting the equation for the other variables indicates that males with unintentional injuries who come to the emergency department at night and on weekends and are Caucasian are more likely to have elevated blood alcohol levels.

The logistic equation can be used to find the probability for any given individual. For instance, let us find the probability that a 27-year-old Caucasian man who comes to the emergency department at 2 PM on Thursday has BAC 50 mg/dL. The regression coefficients from Table 10–8 are

and we evaluate it as follows:

Substituting –2.36 in the equation for the probability:

$$P = \frac{1}{1 + e^{-(-2.36)}} = \frac{1}{1 + e^{2.36}} = \frac{1}{1 + 10.59} \approx 0.086$$

Therefore, the chance that this man has a high BAC is less than 1 in 10. See Exercise 3 to determine the likelihood of a high BAC if the same man came to the emergency department on a Saturday night.
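The conversion from the value of the logistic equation to a probability can be sketched in a few lines of Python. This is only an illustration: the function name is an assumption, and the input –2.36 is the value of the linear combination quoted in the text, not something recomputed from Table 10–8.

```python
import math


def logistic_probability(linear_predictor: float) -> float:
    """Convert the value of the logistic equation (the log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# -2.36 is the value quoted in the text for the 27-year-old man
# seen at 2 PM on a Thursday.
p_high_bac = logistic_probability(-2.36)
print(f"P(BAC >= 50 mg/dL) = {p_high_bac:.3f}")  # prints 0.086
```

The same function could be reused for Exercise 3 once the linear combination for a Saturday-night visit has been evaluated.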

One advantage of logistic regression is that it requires no assumptions about the distribution of the independent variables. Another is that the regression coefficient can be interpreted in terms of relative risks in cohort studies or odds ratios in case–control studies. In other words, the relative risk of an elevated BAC in males with unintentional trauma during the day is exp (–1.845) = 0.158. The relative risk for night is the reciprocal, 1/0.158 = 6.33; therefore, males with unintentional injuries who come to the ER at night are more than six times more likely to have BAC ≥50 mg/dL than males coming during the day.

How can readers easily tell which odds ratios are statistically significant? Recall from Chapter 8 that if the 95% confidence interval does not include 1, we can be 95% sure that the factor associated with the odds ratio either is a significant risk or provides a significant level of protection. Do any of the independent variables in Table 10–8 have a 95% confidence interval for the odds ratio that contains 1? Did you already know without looking that it would be age because the age variable is not statistically significant?
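As a check on the age variable, the odds ratio and its 95% confidence limits can be recomputed from the coefficient and standard error reported in Table 10–8. The sketch below assumes the usual large-sample interval, exp(b ± 1.96 × SE); the function name is illustrative, and NCSS may use a slightly more precise multiplier, so the last digits can differ.

```python
import math


def odds_ratio_with_ci(coef: float, se: float, z: float = 1.96):
    """Return (odds ratio, lower limit, upper limit) for a logistic coefficient."""
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


# Coefficient and standard error for age, taken from Table 10-8.
age_or, age_low, age_high = odds_ratio_with_ci(-0.119837, 0.108221)

# An interval that contains 1 means the variable is not statistically significant.
age_ci_contains_one = age_low < 1.0 < age_high
print(f"OR = {age_or:.3f}, 95% CI {age_low:.3f} to {age_high:.3f}")
```

The interval straddles 1, which agrees with the nonsignificant P value for age.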

The overall results from a logistic regression may be tested with Hosmer and Lemeshow's goodness of fit test. The test is based on the chi-square distribution. A P value > 0.05 means that the model's estimates fit the data at an acceptable level.

There is no straightforward statistic to judge the overall logistic model in the way that R² is used in multiple regression. Some statistical programs give R², but it cannot be interpreted as in multiple regression because the predicted and observed outcomes are nominal. Several other statistics are available as well, including Cox and Snell's R² and a modification called Nagelkerke's R², which is generally larger than Cox and Snell's R².

Before leaving the topic of logistic regression, it is worthwhile to inspect the classification table in Table 10–8. This table gives the actual and the predicted number of males with unintentional injuries who had normal versus elevated BAC. The logistic equation tends to underpredict those with elevated concentrations: 470 males are predicted versus the 682 who actually had BAC ≥50 mg/dL. Overall, the prediction using the logistic equation correctly classified 76.62% of these males. Although this sounds rather impressive, it is important to compare this percentage with the baseline: 74.12% of the time we would be correct if we simply predicted a male to have normal BAC. Can you recall an appropriate way to compensate for or take the baseline into consideration? Although computer programs typically do not provide the kappa statistic, discussed in Chapter 5, it provides a way to evaluate the percentage correctly classified (see Exercise 4). Other measures of association are used so rarely in medicine that we did not discuss them in Chapter 8. SPSS provides two nonparametric correlations, the lambda correlation and the tau correlation, that can be interpreted as measures of strength of the relationship between observed and predicted outcomes.
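The kappa calculation for a classification table can be sketched as follows. The cell counts are not printed in the table as reproduced here; the values below are an assumption, reconstructed from the row percentages together with the totals of 682 and 470 quoted in the text, so they should be treated as approximate.

```python
def cohen_kappa(table):
    """Kappa for a square table where table[i][j] counts actual i, predicted j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance: product of the matching marginal proportions.
    expected = sum(
        (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
        for i in range(k)
    )
    return (observed - expected) / (1 - expected)


# Assumed counts. Rows: actual normal, actual elevated BAC.
# Columns: predicted normal, predicted elevated BAC.
RECONSTRUCTED_COUNTS = [[1751, 202], [414, 268]]

percent_correct = 100 * (1751 + 268) / (1751 + 202 + 414 + 268)
kappa = cohen_kappa(RECONSTRUCTED_COUNTS)
print(f"percent correct = {percent_correct:.2f}, kappa = {kappa:.2f}")
```

Because kappa subtracts the agreement expected by chance, it gives a more sober picture than the raw percentage correctly classified.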

Log-Linear Analysis

Psoriasis, a chronic, inflammatory skin disorder characterized by scaling erythematous patches and plaques of skin, has a strong genetic influence—about one third of patients have a positive family history. Stuart and colleagues (2002) conducted a study to determine differences in clinical manifestation between patients with positive and negative family histories of psoriasis and with early-onset versus late-onset disease. This study was used in Exercise 7 in Chapter 7.

The hypothesis was that the variables age at onset (in 10-year categories), onset (early or late), and familial status (sporadic or familial) had no effect on the occurrence of joint complaints. Results from the analysis of age, familial status, and frequency of joint complaints are given in Table 10–9.

Table 10–9. Frequency of Joint Complaints by Familial Status, Stratified by Age at Examination.

| Age at Examination (years) | Joint Complaints (%): Sporadic | Joint Complaints (%): Familial | Relative risk | Pearson χ² |
|---|---|---|---|---|

Source: Reproduced, with permission, from Stuart P, Malick F, Nair RP, Henseler T, Lim HW, Jenisch S, et al: Analysis of phenotypic variation in psoriasis as a function of age at onset and family history. Arch Dermatol Res 2002;294:207–213.

Each independent variable in this research problem is measured on a categorical or nominal scale (age, onset, and familial status), as is the outcome variable (occurrence of joint complaints). If only two variables are being analyzed, the chi-square method introduced in Chapter 6 can be used to determine whether a relationship exists between them; with three or more nominal or categorical variables, a statistical method called log-linear analysis is appropriate. Log-linear analysis is analogous to a regression model in which all the variables, both independent and dependent, are measured on a nominal scale. The technique is called log-linear because it involves using the logarithm of the observed frequencies in the contingency table.

Stuart and colleagues (2002) concluded that joint complaints and familial psoriasis were conditionally independent given age at examination, but that age at examination was not independent of either joint complaints or a family history.
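In standard log-linear notation, a conclusion of this kind corresponds to a model of the following form. The symbols are assumptions introduced here for illustration and are not taken from Stuart and colleagues: m is the expected frequency in a cell of the three-way table, and J, F, and A stand for joint complaints, familial status, and age at examination.

```latex
\log m_{jfa} = \lambda + \lambda^{J}_{j} + \lambda^{F}_{f} + \lambda^{A}_{a}
             + \lambda^{JA}_{ja} + \lambda^{FA}_{fa}
```

The model contains an age-by-joint-complaints term and an age-by-familial-status term but no term linking joint complaints directly with familial status. Leaving out that term is what expresses conditional independence of the two variables given age.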

Log-linear analysis may also be used to analyze multidimensional contingency tables in situations in which no distinction exists between independent and dependent variables, that is, when investigators simply want to examine the relationship among a set of nominal measures. The fact that log-linear analysis does not require a distinction between independent and dependent variables points to a major difference between it and other regression models—namely, that the regression coefficients are not interpreted in log-linear analysis.

Predicting a Censored Outcome: Cox Proportional Hazard Model

In Chapter 9, we found that special methods must be used when an outcome has not yet been observed for all subjects in the study sample. Studies of time-limited outcomes in which there are censored observations, such as survival, naturally fall into this category; investigators usually cannot wait until all patients in the study experience the event before presenting information.

Many times in clinical trials or cohort studies, investigators wish to look at the simultaneous effect of several variables on length of survival. For example, in the study described in Presenting Problem 3, Crook and her colleagues (1997) wanted to evaluate the relationship of pretreatment prostate-specific antigen (PSA) and posttreatment nadir PSA on the failure pattern of radiotherapy for treating localized prostate carcinoma. They categorized failures as biochemical, local, and distant. They analyzed data from a cohort study of 207 patients, but only 68 had a failure due to any cause in the 70 months during which the study was underway. The observations on the remaining 139 patients were therefore censored. The independent variables they examined included the Gleason score, the T classification, whether the patient had received hormonal treatment, the PSA before treatment, and the lowest PSA following treatment.

Table 10–2 indicates that the regression technique developed by Cox (1972) is appropriate when time-dependent censored observations are included. This technique is called the Cox regression, or proportional hazard, model. In essence this model allows the covariates (independent variables) in the regression equation to vary with time. The dependent variable is the survival time of the jth patient, denoted Yj. Both numerical and nominal independent variables may be used in the model.

The Cox regression coefficients can be used to determine the relative risk or odds ratio (introduced in Chapter 3) associated with each independent variable and the outcome variable, adjusted for the effect of all other variables in the equation. Thus, instead of giving adjusted means, as ANCOVA does in regression, the Cox model gives adjusted relative risks. We can also use a variety of methods to select the independent variables that add significantly to the prediction of the outcome, as in multiple regression; however, a chi-square test (instead of the F test) is used to test for significance.

The Cox proportional hazard model involves a complicated exponential equation (Cox, 1972). Although we will not go into detail about the mathematics involved in this model, its use is so common in medicine that an understanding of the process is needed by readers of the literature. Our primary focus is on the application and interpretation of the Cox model.

Understanding the Cox Model

Recall from Chapter 9 that the survival function gives the probability that a person will survive the next interval of time, given that he or she has survived up until that time. The hazard function, also defined in Chapter 9, is in some ways the opposite: it is the probability that a person will die (or that there will be a failure) in the next interval of time, given that he or she has survived until the beginning of the interval. The hazard function plays a key role in the Cox model.

The Cox model examines two pieces of information: the amount of time since the event first happened to a person and the person's observations on the independent variables. Using the Crook example, the amount of time might be 3 years, and the observations would be the patient's Gleason score, T classification, whether he had been treated with hormones, and the two PSA scores (pretreatment and lowest posttreatment). In the Cox model, the length of time is evaluated using the hazard function, and the linear combination of the independent values (like the linear combination we obtain when we use multiple regression) is the exponent of e, the base of the natural logarithm. For example, for the Crook study, the model is written as

$$h(t) = h_0(t)\,\exp\left(b_1\,\text{Gleason score} + b_2\,\text{T classification} + b_3\,\text{hormone treatment} + b_4\,\text{pretreatment PSA} + b_5\,\text{nadir PSA}\right)$$

In words, the model is saying that the probability of dying in the next time interval, given that the patient has lived until this time and has the given values for Gleason score, T classification, and so on, can be found by multiplying the baseline hazard (h0) by e, the base of the natural logarithm, raised to the power of the linear combination of the independent variables. In other words, a given person's probability of dying is influenced by how commonly patients die and the given person's individual characteristics. If we take the antilog of the linear combination, we multiply rather than add the values of the covariates. In this model, the covariates have a multiplicative, or proportional, effect on the probability of dying, hence the term "proportional hazard" model.
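The word "proportional" can be made concrete by comparing two patients, A and B, at the same time t. The notation below is an assumption used for illustration: x with a subscript denotes a patient's value on the ith independent variable, and b is the corresponding regression coefficient.

```latex
\frac{h(t \mid x_A)}{h(t \mid x_B)}
  = \frac{h_0(t)\,\exp\left(\sum_i b_i x_{Ai}\right)}
         {h_0(t)\,\exp\left(\sum_i b_i x_{Bi}\right)}
  = \exp\left(\sum_i b_i \,(x_{Ai} - x_{Bi})\right)
```

The baseline hazard cancels, so the ratio of the two hazards does not depend on t. If the patients differ by one unit on a single variable and are alike on all others, the ratio reduces to exp(b) for that variable.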

An Example of the Cox Model

In the study described in Presenting Problem 3, Crook and her colleagues (1997) used the Cox proportional hazard model to examine the relationship between pretreatment PSA and posttreatment PSA nadir and treatment failure in men with prostate carcinoma following treatment with radiotherapy. Failure was categorized as biochemical, local, or distant. The investigators wanted to control for possible confounding variables, including the Gleason score and the T classification (both measures of severity) and whether the patient received hormones prior to the radiotherapy. The outcome is a censored variable, the amount of time before the treatment fails, so the Cox proportional hazard model is the appropriate statistical method. We use the results of analysis using SPSS, given in Table 10–10, to point out some salient features of the method.

Table 10–10. Results from Cox Proportional Hazard Model Using Both Pretreatment and Posttreatment Variables.

Indicator Parameter Coding

| Variable | Label |
|---|---|
| GSCORE | Recoded Gleason score |
| TUMSTAGE | Tumor stage |

207 Total cases read

Dependent Variable: TIMEANYF

| Events | Censored |
|---|---|
| 68 | 139 (67.1%) |

Beginning block number 0. Initial log likelihood function
–2 Log likelihood: 649.655

Beginning block number 1. Method: Enter
Variable(s) Entered at step number 1.
  GSCORE: Recoded Gleason score
  TUMSTAGE: Tumor stage

Coefficients converged after seven iterations.
–2 Log likelihood: 576.950

| | Chi-square | df | Significance |
|---|---|---|---|
| Overall (score) | 274.737 | 7 | 0.0000 |
| Change (–2LL) from previous block | 72.706 | 7 | 0.0000 |
| Change (–2LL) from previous step | 72.706 | 7 | 0.0000 |

Variables in the equation

| Variable | B | SE | Wald | df | Significance | R |
|---|---|---|---|---|---|---|
| TUMSTAGE | | | 14.3608 | 3 | 0.0025 | 0.1134 |
|   TUMSTAGE (1) | –0.0224 | 0.7077 | 0.0010 | 1 | 0.9747 | 0.0000 |
|   TUMSTAGE (3) | 1.5075 | 0.5548 | 7.3828 | 1 | 0.0066 | 0.0910 |

95% CI for Exp (B)

| Variable | Exp (B) | Lower | Upper |
|---|---|---|---|
| TUMSTAGE (1) | 0.9778 | 0.2443 | 3.9140 |
| TUMSTAGE (2) | 2.2353 | 0.7597 | 6.5770 |
| TUMSTAGE (3) | 4.5156 | 1.5221 | 13.3962 |

df = degrees of freedom; SE = standard error; CI = confidence interval; Wald = statistic used by SPSS to test the significance of each variable's coefficient.

Source: Data, used with permission, from Crook JM, Bahadur YA, Bociek RG, Perry GA, Robertson SJ, Esche BA: Radiotherapy for localized prostate carcinoma. Cancer 1997;79:328–336. Output produced using SPSS 10.0, a registered trademark of SPSS, Inc; used with permission.

Both numerical and nominal variables can be used as independent variables in the Cox model. If the variables are nominal, it is necessary to tell the computer program so they can be properly analyzed. SPSS prints this information. PRERTHOR, pretreatment hormone therapy, is recoded so that 0 = no and 1 = yes. Prior to doing the analysis, we recoded the Gleason score into a variable called GSCORE with two values: 0 for Gleason scores 2–6 and 1 for Gleason scores 7–10. The T classification variable, TUMSTAGE, was recoded by the computer program using dummy variable coding. Note that for four values of TUMSTAGE, only three variables are needed, with the three more advanced stages compared with the lowest stage, T1b–2.
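The dummy variable coding described above, in which a nominal variable with four values becomes three 0/1 variables compared against a reference category, can be sketched as follows. The function name and the category labels are placeholders assumed for illustration; they are not the labels SPSS prints.

```python
def indicator_code(value, levels, reference):
    """Return k-1 indicator (dummy) variables for a k-level nominal variable."""
    return [1 if value == level else 0 for level in levels if level != reference]


# Placeholder labels for four tumor-stage categories; only the pattern matters.
TUMSTAGE_LEVELS = ["lowest stage", "stage (1)", "stage (2)", "stage (3)"]
REFERENCE = "lowest stage"

for stage in TUMSTAGE_LEVELS:
    print(stage, indicator_code(stage, TUMSTAGE_LEVELS, REFERENCE))
```

A patient in the reference category has 0 on all three indicators, so each coefficient compares one of the more advanced stages with the lowest stage.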

Among the 207 men in the study, 68 had experienced a failure by the time the data were analyzed. The authors reported a median follow-up of 36 months with a range of 12 to 70 months. The log likelihood statistic (LL) is used to evaluate the significance of the overall model; smaller values of –2LL indicate that the model fits the data better. The change in –2LL between the initial model, in which no independent variables are included in the equation, and the model after the variables are entered is calculated. In this example, the change is 72.706 (the "Change (–2LL) from previous block" entry in Table 10–10), and it is the basis of the chi-square statistic used to determine the significance of the model. The significance is reported, as often occurs with computer programs, as 0.0000, which should be read as P < 0.0001.
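The arithmetic behind the model chi-square can be sketched directly from the two –2 log likelihood values in Table 10–10. The variable names are illustrative, and the comparison uses the tabulated chi-square critical value for 7 degrees of freedom at the 0.05 level rather than an exact P value.

```python
# Tabulated chi-square critical value for 7 degrees of freedom, alpha = 0.05.
CHI2_CRITICAL_DF7_ALPHA05 = 14.067

neg2ll_initial = 649.655  # block 0: no independent variables in the equation
neg2ll_model = 576.950    # block 1: all variables entered

# The drop in -2LL is the likelihood ratio chi-square for the model.
lr_chi_square = neg2ll_initial - neg2ll_model
model_is_significant = lr_chi_square > CHI2_CRITICAL_DF7_ALPHA05

print(f"chi-square = {lr_chi_square:.3f}, significant: {model_is_significant}")
```

The difference agrees with the value printed by SPSS apart from rounding in the last digit.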

In addition to testing the overall model, it is possible to test each independent variable to see if it adds significantly to the prediction of failure. Were any of the potentially confounding variables significant? The significance of TUMSTAGE requires some explanation. The variable itself is significant, with P = 0.0025. The TUMSTAGE(3) variable (which indicates the patient has T3–4 stage tumor), however, is the one that really matters because it is the only significant stage (P = 0.0066). Note that Gleason score and hormone therapy were not significant. Was either of the PSA values important in predicting failure? It appears that the pretreatment PSA is not significant, but the lowest PSA (NADIRPSA) reached following treatment has a very low P value.

As in logistic regression, the regression coefficients in the Cox model can be interpreted in terms of relative risks or odds ratios (by finding the antilog) if they are based on independent binary variables, such as hormone therapy. For this reason, many researchers divide independent variables into two categories, as we did with Gleason score, even though this practice can be risky if the correct cutpoint is not selected. The T classification variable was recoded as three dummy variables to facilitate interpretation in terms of odds ratios for each stage. The odds ratios are listed under the column titled "Exp (B)" in Table 10–10. Using the T3–4 stage (TUMSTAGE(3)) as an illustration, the antilog of the regression coefficient, 1.5075, is exp (1.5075) = 4.5156. Note that the 95% confidence interval goes from approximately 1.52 to 13.40; because this interval does not contain 1, the odds ratio is statistically significant (consistent with the P value).
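The entries for TUMSTAGE(3) can be checked from its coefficient and standard error. This is a sketch under the usual large-sample assumptions (Wald statistic equal to the squared ratio of B to SE, and limits of exp(B ± 1.96 × SE)); small differences from the SPSS output reflect rounding of B and SE in the printed table.

```python
import math

b, se = 1.5075, 0.5548  # B and SE for TUMSTAGE(3) from Table 10-10

wald = (b / se) ** 2                # Wald chi-square with 1 df
exp_b = math.exp(b)                 # antilog of the coefficient, Exp(B)
ci_low = math.exp(b - 1.96 * se)    # lower 95% limit for Exp(B)
ci_high = math.exp(b + 1.96 * se)   # upper 95% limit for Exp(B)

excludes_one = ci_low > 1.0 or ci_high < 1.0
print(f"Wald = {wald:.2f}, Exp(B) = {exp_b:.2f}, "
      f"95% CI {ci_low:.2f} to {ci_high:.2f}")
```

The interval lies entirely above 1, in line with the significant P value for this stage.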

Crook and colleagues (1997) also computed the Cox model using only the variables known prior to treatment (see Exercise 8).

Importance of the Cox Model

The Cox model is very useful in medicine, and it is easy to see why it is being used with increasing frequency. It is the standard regression method for predicting a time-dependent outcome when some observations are censored, and many health-related outcomes are related to time. If the independent variables are divided into two categories (dichotomized), the exponential of the regression coefficient, exp (b), is the odds ratio, a useful way to interpret the risk associated with any specific factor. In addition, the Cox model provides a method for producing survival curves that are adjusted for confounding variables. The Cox model can be extended to the case of multiple events for a subject, but that topic is beyond our scope. Investigators who have repeated measures in a time-to-survival study are encouraged to consult a statistician.


Meta-Analysis

Meta-analysis is a way to combine results of several independent studies on a specific topic. Meta-analysis is different from the methods discussed in the preceding sections because its purpose is not to identify risk factors or to predict outcomes for individual patients; rather, this technique is applicable to any research question. We briefly introduced meta-analysis in Chapter 2. Because we could not talk about it in detail until the basics of statistical tests (confidence limits, P values, etc) were explained, we include it in this chapter. It is an important technique increasingly used for studies in health and it can be looked on as an extension of multivariate analysis.

The idea of summarizing a set of studies in the medical literature is not new; review articles have long had an important role in helping practicing physicians keep up to date and make sense of the many studies on any given topic. Meta-analysis takes the review article a step further by using statistical procedures to combine the results from different studies. Glass (1977) developed the technique because many research projects are designed to answer similar questions, but they do not always come to similar conclusions. The problem for the practitioner is to determine which study to believe, a problem unfortunately too familiar to readers of medical research reports.

Sacks and colleagues (1987) reviewed meta-analyses of clinical trials and concluded that meta-analysis has four purposes: (1) to increase statistical power by increasing the sample size, (2) to resolve uncertainty when reports do not agree, (3) to improve estimates of effect size, and (4) to answer questions not posed at the beginning of the study. Purpose 3 requires some expansion because the concept of effect size is central to meta-analysis. Cohen (1988) developed this concept and defined effect size as the degree to which the phenomenon is present in the population. An effect size may be thought of as an index of how much difference exists between two groups—generally, a treatment group and a control group. The effect size is based on means if the outcome is numerical, on proportions or odds ratios if the outcome is nominal, or on correlations if the outcome is an association. The effect sizes themselves are statistically combined in meta-analysis.

Veenstra and colleagues (1999) used meta-analysis to evaluate the efficacy of impregnating central venous catheters with an antiseptic. They examined the literature, using manual and computerized searches, for publications containing the words chlorhexidine, antiseptic, and catheter and found 215 studies. Of these, 24 were comparative studies in humans. Nine studies were eliminated because they were not randomized, and another two were excluded based on the criteria for defining catheter colonization and catheter-related bloodstream infection. Ten studies examined both outcomes, two examined only catheter colonization, and one reported only catheter-related bloodstream infection.

Two authors independently read and evaluated each article. They reviewed the sample size, patient population, type of catheter, catheterization site, other interventions, duration of catheterization, reports of adverse events, and several other variables describing the incidence of colonization and catheter-related bloodstream infection. The authors also evaluated the appropriateness of randomization, the extent of blinding, and the description of eligible subjects. Discrepancies between the reviewers were resolved by a third author. Some basic information about the studies evaluated in this meta-analysis is given in Table 10–11.

Table 10–11. Characteristics of Studies Comparing Antiseptic-Impregnated with Control Catheters.

| Study, yᵃ | Number of Catheter Lumens | Patient Population | Catheter Exchangeᵇ | Number of Catheters (Number of Patients), Treatment Group | Number of Catheters (Number of Patients), Control Group | Catheter Duration Mean, d, Treatment Group | Catheter Duration Mean, d, Control Group | Outcome Definition: Catheter Colonizationᶜ | Outcome Definition: Catheter-Related Bloodstream Infectionᵈ |
|---|---|---|---|---|---|---|---|---|---|
| Tennenberg et al, 1997 | 2, 3 | Hospital | No | 137 (137) | 145 (145) | 5.1 | 5.3 | SQ (IV, SC, >15 CFU) | SO (IV, SC, site), CS, NS |
| Maki et al, 1997 | 3 | ICU | Yes | 208 (72) | 195 (86) | 6.0 | 6.0 | SQ (IV, >15 CFU) | SO (>15 CFU, IV, hub, inf)ᵉ |
| van Heerden et al, 1996ᶠ | 3 | ICU | No | 28 (28) | 26 (26) | 6.6 | 6.8 | SQ (IV, >15 CFU) | NR |
| Hannan et al, 1996 | 3 | ICU | NR | 68 (NR) | 60 (NR) | 7 | 8 | SQ (IV, >10³ CFU)ᵍ | SQ (IV, >10³ CFU), NS |
| Bach et al, 1994ᶠ | 3 | ICU | No | 14 (14) | 12 (12) | 7.0 | 7.0 | QN (IV, >10³ CFU) | |
| Bach et al, 1996ᶠ | 2, 3 | Surgical | No | 116 (116) | 117 (117) | 7.7 | 7.7 | QN (IV, >10³ CFU) | |
| Heard et al, 1998ᶠ | 3 | SICU | Yes | 151 (107) | 157 (104) | 8.5 | 9 | SQ (IV, SC, >14 CFU) | SO (IV, SC, >4 CFU) |
| Collin, 1999 | 1, 2, 3 | ED/ICU | Yes | 98 (58) | 139 (61) | 9.0 | 7.3 | SQ (IV, SC, >15 CFU) | SO (IV, SC) |
| Ciresi et al, 1996ᶠ | 3 | TPN | Yes | 124 (92) | 127 (99) | 9.6 | 9.1 | SQ (IV, SC, >15 CFU) | SO (IV, SC) |
| Pemberton et al, 1996 | 3 | TPN | No | 32 (32) | 40 (40) | 10 | 11 | NR | SO (IV), res, NS |
| Ramsay et al, 1994ᵉ | 3 | Hospital | No | 199 (199) | 189 (189) | 10.9 | 10.9 | SQ (IV, SC, >15 CFU) | SO (IV, SC) |
| Trazzera et al, 1995ᵉ | 3 | ICU/BMT | Yes | 123 (99) | 99 (82) | 11.2 | 6.7 | SQ (IV, >15 CFU) | SO (IV, >15 CFU) |
| George et al, 1997 | 3 | Transplant | No | 44 (NR) | 35 (NR) | NR | NR | SQ (IV, >5 CFU) | SO (IV) |

ᵃReaders should refer to the original article for these citations.

ᵇCatheter exchange was performed using a guide wire.

ᶜCatheter segments cultured and criteria for positive culture are given in parentheses.

ᵈCatheter segment or site cultured and criteria for positive culture are given in parentheses.

ᵉOrganism identity was confirmed by restriction-fragment subtyping.

ᶠAdditional information was provided by author (personal communications, Jan 1998–Mar 1998).

ᵍCulture method is reported as semiquantitative; criteria for culture growth suggest quantitative method.

NR = not reported; ICU = intensive care unit; SICU = surgical intensive care unit; TPN = total parenteral nutrition; BMT = bone marrow transplant; ED = emergency department; hospital, hospitalwide or a variety of settings; SQ = semiquantitative culture; QN = quantitative culture; CFU = colony-forming units; IV = intravascular catheter segment; SC = subcutaneous catheter segment; site = catheter insertion site; hub = catheter hub; inf = catheter infusate; SO = same organism isolated from blood and catheter; CS = clinical symptoms of systemic infection; res = resolution of symptoms on catheter removal; and NS = no other sources of infection.

Source: Table 1 from Veenstra DL, Saint S, Saha S, Lumley T, Sullivan SD: Efficacy of antiseptic-impregnated central venous catheters in preventing catheter-related bloodstream infection. JAMA 1999;281:261–267. Copyright © 1999, American Medical Association; used with permission.

The authors of the meta-analysis article calculated the odds ratios and 95% confidence intervals for each study and used a statistical method to determine summary odds ratios over all the studies. These odds ratios and intervals for the outcome of catheter-related bloodstream infection are illustrated in Figure 10–5. This figure illustrates the typical way findings from meta-analysis studies are presented. Generally the results from each study are shown, and the summary or combined results are given at the bottom of the figure. When the summary statistic is the odds ratio, a line representing the value of 1 is drawn to make it easy to see which of the studies have a significant outcome.

From the data in Table 10–11 and Figure 10–5, it appears that only one study (of the 11) reported a statistically significant outcome because only one has a confidence interval that does not contain 1. The entire confidence interval in Maki and associates' study (1997) is less than 1, indicating that these investigators found a protective effect when using the treated catheters. Of interest is the summary odds ratio, which illustrates that by pooling the results from 11 studies, treating the catheters appears to be beneficial. Several of the studies had relatively small sample sizes, however, and the failure to find a significant difference may be due to low power. Using meta-analysis to combine the results from these studies can provide insight on this issue.

A meta-analysis does not simply add the means or proportions across studies to determine an "average" mean or proportion. Although several different methods can be used to combine results, they all use the same principle of determining an effect size in each study and then combining the effect sizes in some manner. The methods for combining the effect sizes include the z approximation for comparing two proportions (Chapter 6), the t test for comparing two means (Chapter 6), the P values for the comparisons, and the odds ratio, as shown in Veenstra and colleagues' study (1999). The values corresponding to the effect size in each study are the numbers combined in the meta-analysis to provide a pooled (overall) P value or confidence interval for the combined studies. The most commonly used method for reporting meta-analyses in the medical literature is the odds ratio with confidence intervals.
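One common way to combine odds ratios, shown here only as an illustration, is fixed-effect inverse-variance pooling of the log odds ratios. It is not necessarily the method used by Veenstra and colleagues, and the 2 × 2 counts below are hypothetical, not data from Table 10–11.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance for one study.

    a, b = events and non-events in the treatment group;
    c, d = events and non-events in the control group.
    """
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool_fixed_effect(tables, z=1.96):
    """Inverse-variance pooled odds ratio with a 95% confidence interval."""
    total_weight = 0.0
    weighted_sum = 0.0
    for a, b, c, d in tables:
        log_or, variance = log_odds_ratio(a, b, c, d)
        weight = 1.0 / variance  # more precise studies count for more
        total_weight += weight
        weighted_sum += weight * log_or
    pooled = weighted_sum / total_weight
    se = math.sqrt(1.0 / total_weight)
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)


# Hypothetical studies: (events, non-events) for treatment, then control.
HYPOTHETICAL_STUDIES = [(5, 95, 10, 90), (3, 47, 6, 44), (8, 192, 15, 185)]

pooled_or, ci_low, ci_high = pool_fixed_effect(HYPOTHETICAL_STUDIES)
print(f"pooled OR = {pooled_or:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

With these made-up counts, the pooled interval excludes 1 even though the interval for the first study alone would not, which is the gain in power that pooling is meant to provide.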

In addition to being potentially useful when published studies reach conflicting conclusions, meta-analysis can help raise issues to be addressed in future clinical trials. The procedure is not, however, without its critics, and readers should be aware of some of the potential problems in its use. To evaluate meta-analysis, LeLorier and associates (1997) compared the results of a series of large randomized, controlled trials with relevant previously published meta-analyses. Their results were mixed: They found that meta-analysis accurately predicted the outcome in only 65% of the studies; however, the difference between the trial results and the meta-analysis results was statistically significant in only 12% of the comparisons. Ioannidis and colleagues (1998) determined that the discrepancies in the conclusions were attributable to different disease risks, different study protocols, varying quality of the studies, and possible publication bias (discussed in a following section). These reports serve as a useful reminder that well-designed clinical trials remain a critical source of information.

Studies designed in dissimilar manners should not be combined. In performing a meta-analysis, investigators should use clear and well-accepted criteria for deciding whether studies should be included in the analysis, and these criteria should be stated in the published meta-analysis.

Most meta-analyses are based on the published literature, and some people believe it is easier to publish studies with statistically significant results than studies that show no difference. This potential problem is called publication bias. Researchers can take at least three important steps to reduce publication bias. First, they can search for unpublished data, typically done by contacting the authors of published articles. Veenstra and his colleagues (1999) did this and contacted the manufacturer of the treated catheters as well but were unable to identify any unpublished data. Second, researchers can perform an analysis to see how sensitive the conclusions are to certain characteristics of the studies. For instance, Veenstra and colleagues assessed sources of heterogeneity or variation among the studies and reported that excluding the studies responsible for the heterogeneity had no substantive effect on the conclusions. Third, investigators can estimate how many studies showing no difference would have to be done but not published to raise the pooled P value above the 0.05 level or produce a confidence interval that includes 1 so that the combined results would no longer be significant. The reader can have more confidence in the conclusions from a meta-analysis that finds a significant effect if a large number of unpublished negative studies would be required to repudiate the overall significance. The increasing use of computerized patient databases may lessen the effect of publication bias in future meta-analyses. Montori and colleagues (2000) provide a review of publication bias for clinicians.
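The third step can be illustrated with Rosenthal's fail-safe N, one well-known version of this calculation. It is swapped in here as an example and is not attributed to Veenstra and colleagues. The sketch assumes each study contributes a standard normal deviate Z, that the unpublished studies average Z = 0, and that significance is judged one-tailed at 0.05 (Z = 1.645). The Z values are hypothetical.

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Rosenthal's fail-safe N.

    Number of unpublished studies averaging Z = 0 that would pull the
    combined Z, sum(Z) / sqrt(k + N), down to z_alpha.
    """
    k = len(z_scores)
    return max(0.0, sum(z_scores) ** 2 / z_alpha ** 2 - k)


# Hypothetical Z values from five published studies.
HYPOTHETICAL_Z = [2.1, 1.3, 0.4, 2.8, 1.7]

n_needed = fail_safe_n(HYPOTHETICAL_Z)
print(f"about {n_needed:.0f} unpublished null studies would be needed")
```

A large fail-safe N relative to the number of published studies supports the conclusion; a small one suggests the pooled result could easily be overturned.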

The Cochrane Collection is a large and growing database of meta-analyses that were done according to specific guidelines. Each meta-analysis contains a description and an assessment of the methods used in the articles that constitute the meta-analysis. Graphs such as Figure 10–5 are produced, and, if appropriate, graphs for subanalyses are presented. For instance, if both cohort studies and clinical trials have been done on a given topic, the Cochrane Collection presents a separate figure for each. The Cochrane Collection is available on CD-ROM or via the Internet for an annual fee. The Cochrane Web site states that: "Cochrane reviews (the principal output of the Collaboration) are published electronically in successive issues of The Cochrane Database of Systematic Reviews. Preparation and maintenance of Cochrane reviews is the responsibility of international collaborative review groups."

No one has argued that meta-analyses should replace clinical trials. Veenstra and his colleagues (1999) conclude that a large trial may be warranted to confirm their findings. Despite their shortcomings, meta-analyses can provide guidance to clinicians when the literature contains several studies with conflicting results, especially when the studies have relatively small sample sizes. Furthermore, based on the increasingly large number of published meta-analyses, it appears that this method is here to stay. As with all types of studies, however, the methods used in a meta-analysis need to be carefully assessed before the results are accepted.

Methods for Classification

Several multivariate methods can be used when the research question is related to classification. When the goal is to classify subjects into groups, discriminant analysis, cluster analysis, and propensity score analysis are appropriate. These methods all involve multiple measurements on each subject, but they have different purposes and are used to answer different research questions.

Discriminant Analysis

Logistic regression is used extensively in the biologic sciences. A related technique, discriminant analysis, although used with less frequency in medicine, is a common technique in the social sciences. It is similar to logistic regression in that it is used to predict a nominal or categorical outcome. It differs from logistic regression, however, in that it assumes that the independent variables follow a multivariate normal distribution, so it must be used with caution if some X variables are nominal.

The procedure involves determining several discriminant functions, which are simply linear combinations of the independent variables that separate or discriminate among the groups defined by the outcome measure as much as possible. The number of discriminant functions needed is determined by a multivariate test statistic called Wilks' lambda. The discriminant functions' coefficients can be standardized and then interpreted in the same manner as in multiple regression to draw conclusions about which variables are important in discriminating among the groups.

Leone and coworkers (2002) wanted to identify characteristics that differentiate among expert adolescent female athletes in four different sports. Body mass, height, girth of the biceps and calf, skinfold measures, measures of aerobic power, and flexibility were among the measures they examined. Sports included were tennis with 15 girls, skating with 46, swimming with 23, and volleyball with 16. Discriminant analysis is useful when investigators want to evaluate several explanatory variables and the goal is to classify subjects into two or more categories or groups, such as that defined by the four sports.

Their analysis revealed three significant discriminant functions. The first function discriminated between skaters and the other three groups; the second reflected differences between volleyball players and swimmers, and the third between swimmers and tennis players. They concluded that adolescent female athletes show physical and biomotor differences that distinguish among them according to their sport.

Although discriminant analysis is most often employed to explain or describe factors that distinguish among groups of interest, the procedure can also be used to classify future subjects. Classification involves determining a separate prediction equation corresponding to each group that gives the probability of belonging to that group, based on the explanatory variables. For classification of a future subject, a prediction is calculated for each group, and the individual is classified as belonging to the group he or she most closely resembles.
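The classification step can be sketched for the simplest case. The example below assumes two explanatory variables, a covariance matrix pooled across groups, and equal prior probabilities; under those assumptions, assigning a subject to the group with the highest posterior probability is the same as choosing the group whose mean is closest in Mahalanobis distance. The height and body mass values are hypothetical and are not taken from Leone and coworkers.

```python
from math import exp


def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def pooled_covariance(groups):
    """Pooled within-group 2x2 covariance matrix."""
    sums = [[0.0, 0.0], [0.0, 0.0]]
    dof = 0
    for rows in groups.values():
        m = mean_vector(rows)
        for r in rows:
            d = (r[0] - m[0], r[1] - m[1])
            for i in range(2):
                for j in range(2):
                    sums[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[sums[i][j] / dof for j in range(2)] for i in range(2)]


def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def classify(x, groups):
    """Return (most likely group, posterior probabilities), equal priors assumed."""
    s_inv = invert_2x2(pooled_covariance(groups))
    dist2 = {}
    for name, rows in groups.items():
        m = mean_vector(rows)
        d = (x[0] - m[0], x[1] - m[1])
        # Squared Mahalanobis distance from x to the group mean
        dist2[name] = sum(d[i] * s_inv[i][j] * d[j]
                          for i in range(2) for j in range(2))
    smallest = min(dist2.values())  # subtract before exp() for numerical stability
    weights = {g: exp(-0.5 * (v - smallest)) for g, v in dist2.items()}
    total = sum(weights.values())
    posterior = {g: w / total for g, w in weights.items()}
    return max(posterior, key=posterior.get), posterior


# Hypothetical (height in cm, body mass in kg) for two groups of athletes
athletes = {
    "skating": [(155, 48), (158, 50), (152, 45), (157, 49), (154, 47)],
    "volleyball": [(172, 62), (175, 66), (170, 60), (174, 64), (171, 63)],
}
group, probs = classify((156, 48), athletes)
print(group, probs)
```

Statistical packages generalize this to any number of variables and groups and allow unequal prior probabilities.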

Factor Analysis

Andrewes and colleagues (2003) wanted to know how scores on the Emotional and Social Dysfunction Questionnaire (ESDQ) can be used to help decide the level of support needed following brain surgery. Similarly, the Medical Outcomes Study Short Form 36 (MOS-SF36) is a questionnaire commonly used to measure patient outcomes (Stewart et al, 1988). In examples such as these, tests with a large number of items are developed, patients or other subjects take the test, and scores on various items are combined to produce scores on the relevant factors.

The MOS-SF36 is probably used more frequently than any other questionnaire to measure functional outcomes and quality of life; it has been used all over the world and in patients with a variety of medical conditions. The questionnaire contains 36 items that are combined to produce a patient profile on eight concepts: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health. The first four concepts are combined to give a measure of physical health, and the last four concepts are combined to give a measure of mental health. The developers used factor analysis to decide how to combine the questions to develop these concepts.

In a research problem in which factor analysis is appropriate, all variables are considered to be independent; in other words, there is no desire to predict one on the basis of others. Conceptually, factor analysis works as follows: First, a large number of people are measured on a set of items; a rule of thumb calls for at least ten times as many subjects as items. The second step involves calculating correlations. To illustrate, suppose 500 patients answered the 36 questions on the MOS-SF36. Factor analysis answers the question of whether some of the items group together in a logical way, such as items that measure the same underlying component of physical activity. If two items measure the same component, they can be expected to have higher correlations with each other than with other items.

In the third step, factor analysis manipulates the correlations among the items to produce linear combinations, similar to a regression equation without the dependent variable. The difference is that each linear combination, called a factor, is determined so that the first one accounts for the most variation among the items, the second factor accounts for the most residual variation after the first factor is taken into consideration, and so forth. Typically, a small number of factors account for enough of the variation among subjects that it is possible to draw inferences about a patient's score on a given factor. For example, it is much more convenient to refer to scores for physical functioning, role-physical, bodily pain, and so on, than to refer to scores on the original 36 items. Thus, the fourth step involves determining how many factors are needed and how they should be interpreted.
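The second and third steps can be sketched in a few lines. The example below computes the correlations among four items and then extracts the single linear combination that accounts for the most variation, using power iteration on the correlation matrix. This is principal component extraction, one common way of deriving initial factors; it is an assumption of this sketch, not a description of how the MOS-SF36 developers proceeded. The item scores are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlations among items; each column holds one item's scores."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    devs = [[v - m for v in c] for c, m in zip(columns, means)]
    ss = [sum(d * d for d in dv) for dv in devs]
    k = len(columns)
    return [[sum(a * b for a, b in zip(devs[i], devs[j])) / sqrt(ss[i] * ss[j])
             for j in range(k)] for i in range(k)]


def first_component(r, iterations=500):
    """Power iteration: largest eigenvalue of r and its unit-length eigenvector."""
    k = len(r)
    v = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(r[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * r[i][j] * v[j] for i in range(k) for j in range(k))
    return eigenvalue, v


# Hypothetical scores of six patients on four items (items A, B, C are related)
items = [
    [1, 2, 3, 4, 5, 5],  # item A
    [2, 2, 3, 4, 4, 5],  # item B
    [1, 3, 3, 5, 4, 5],  # item C
    [3, 1, 4, 2, 5, 2],  # item D
]
r = correlation_matrix(items)
eigenvalue, vector = first_component(r)
loadings = [sqrt(eigenvalue) * x for x in vector]   # item-factor correlations
proportion = eigenvalue / len(items)                # share of total variation
print(round(proportion, 2), [round(x, 2) for x in loadings])
```

Items with large loadings on the same factor are the ones that "group together"; deciding how many factors to keep and how to name them is the fourth step described above.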

Andrewes and colleagues analyzed the ESDQ, a questionnaire designed for brain-damaged populations. They performed a factor analysis of the ratings by the partner or caretaker of 211 patients. They found that the relationships among the questions could be summarized by eight factors, including anger, helplessness, emotional dyscontrol, indifference, inappropriateness, fatigue, maladaptive behavior, and insight. The researchers subsequently used the scores on the factors for a discriminant analysis to differentiate between the brain-damaged patients and a control group with no cerebral dysfunction and found significant discrimination.

Investigators who use factor analysis usually have an idea of what the important factors are, and they design the items accordingly. Many other issues are of concern in factor analysis, such as how to derive the linear combinations, how many factors to retain for interpretation, and how to interpret the factors. Using factor analysis, as well as the other multivariate techniques, requires considerable statistical skill.

Cluster Analysis

A statistical technique similar conceptually to factor analysis is cluster analysis. The difference is that cluster analysis attempts to find similarities among the subjects that were measured instead of among the measures that were made. The object in cluster analysis is to determine a classification or taxonomic scheme that accounts for variance among the subjects. Cluster analysis can also be thought of as similar to discriminant analysis, except that the investigator does not know to which group the subjects belong. As in factor analysis, all variables are considered to be independent variables.

Cluster analysis is frequently used in archeology and paleontology to determine if the existence of similarities in objects implies that they belong to the same taxon. Biologists use this technique to help determine classification keys, such as using leaves or flowers to determine appropriate species. A study by Penzel and colleagues (2003) used cluster analysis to examine the relationships among chromosomal imbalances in thymic epithelial tumors. Journalists and marketing analysts also use cluster analysis, referred to in these fields as Q-type factor analysis, as a way to classify readers and consumers into groups with common characteristics.

Propensity Scores

The propensity score method is an alternative to multiple regression and analysis of covariance. It provides a creative method to control for an entire group of confounding variables. Conceptually, a propensity score is found by using the confounding variables as predictors of the group to which a subject belongs; this step is generally accomplished by using logistic regression. For example, many cohort studies are handicapped by the problem of many confounding variables, such as age, gender, race, comorbidities, and so forth. Once group membership (for example, exposed versus unexposed) is known for the subjects in the cohort, the confounding variables are used to develop a logistic regression equation to predict the group to which a patient belongs. This prediction, based on a combination of the confounding variables, is calculated for all subjects and then used as the confounding variable in subsequent analyses. Developers of the technique maintain it does a better job of controlling for confounding variables (Rubin, 1997). See Katzan and colleagues (2003) for an example of the application of propensity score analysis in a clinical study to determine the effect of pneumonia on mortality in patients with acute stroke.
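A minimal sketch of how the score is computed once a logistic regression has been fitted: the predicted probability of group membership is calculated for each subject from the confounding variables and can then be used as a single covariate or to form strata. The coefficients, the two confounders, and the equal-width strata below are all hypothetical choices made for illustration; they do not come from Rubin (1997) or Katzan and colleagues (2003).

```python
from math import exp

# Hypothetical coefficients from a logistic regression predicting group membership
COEFS = {"intercept": -4.0, "age": 0.05, "diabetes": 0.8}


def propensity_score(age, diabetes):
    """Predicted probability of group membership given the confounders."""
    logit = (COEFS["intercept"]
             + COEFS["age"] * age
             + COEFS["diabetes"] * diabetes)
    return 1.0 / (1.0 + exp(-logit))


def stratum(score, n_strata=5):
    """Place a score between 0 and 1 into one of n equal-width strata (1..n)."""
    return min(int(score * n_strata) + 1, n_strata)


score = propensity_score(age=70, diabetes=1)
print(round(score, 3), stratum(score))
```

Subjects with similar scores are then compared with each other, so that the whole set of confounders is handled through one number.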

Classification and Regression Tree (CART) Analysis

Classification and regression tree (CART) analysis is an approach to analyzing large databases to find significant patterns and relationships among variables. The patterns are then used to develop predictive models for classifying future subjects. As an example, CART was used in a study of 105 patients with stage IV colon or rectal cancer (Dixon et al, 2003). CART identified optimal cut points for carcinoembryonic antigen (CEA) and albumin (ALB) to form four groups of patients: low CEA with high ALB, low CEA with low ALB, high CEA with high ALB, and high CEA with low ALB. A survival analysis (Kaplan–Meier) was then used to compare survival times in these four groups. In another application of CART analysis, researchers were successful in determining the values of semen measurements that discriminate between fertile and infertile men (Guzick et al, 2001). The method requires special software and extensive computing power.
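The idea of an optimal cut point can be sketched for a single variable and a binary outcome. The example below tries every midpoint between adjacent observed values and keeps the one that minimizes the weighted Gini impurity of the two resulting groups, which is one splitting criterion commonly used in CART. The CEA values and outcomes are hypothetical and are not the data of Dixon and colleagues; a full CART analysis repeats this search over all variables and then within each resulting group.

```python
def gini(labels):
    """Gini impurity of a group of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_cut_point(values, labels):
    """Return (cut point, weighted impurity) for the best single split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point exists between identical values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_cut, best_impurity = cut, impurity
    return best_cut, best_impurity


# Hypothetical CEA values and outcomes (1 = died within one year)
cea = [2.0, 3.5, 4.1, 5.0, 20.0, 35.0, 50.0, 80.0]
died = [0, 0, 1, 0, 1, 1, 1, 1]
cut, impurity = best_cut_point(cea, died)
print(cut, impurity)
```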

Multiple Dependent Variables

Multivariate analysis of variance and canonical correlation are similar to each other in that they both involve multiple dependent variables as well as multiple independent variables.

Multivariate Analysis of Variance

Multivariate analysis of variance (MANOVA) conceptually (although not computationally) is a simple extension of the ANOVA designs discussed in Chapter 7 to situations in which two or more dependent variables are included. As with ANOVA, MANOVA is appropriate when the independent variables are nominal or categorical and the outcomes are numerical. If the results from the MANOVA are statistically significant, using the multivariate statistic called Wilks' lambda, follow-up ANOVAs may be done to investigate the individual outcomes.

Weiner and Rudy (2002) wanted to identify nursing home resident and staff attitudes that are barriers to effective pain management. They collected information from nurses, nursing assistants, and residents in seven long-term care facilities. They designed questionnaires to collect beliefs about 12 components of chronic pain management and administered them to these three groups. They wanted to know if there were attitudinal differences among the three groups on the 12 components. If analysis of variance were used in this study, the investigators would need to do 12 different ANOVAs, and the probability of at least one component being significant by chance would be increased. With these multiple dependent variables, they correctly chose to use MANOVA. Results indicated that residents believed that chronic pain does not change, and they were fearful of addiction. The nursing staff believed that many complaints were unheard by busy staff. Note that this study used a nested design (patients and staff within nursing homes) and would be a candidate for GEE or multilevel model analysis.
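The inflation of chance findings across 12 separate ANOVAs can be illustrated with a short calculation. The sketch below assumes the 12 tests are independent and each is done at the 0.05 level; both are simplifying assumptions, because attitude components measured on the same people are likely to be correlated.

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false-positive result among
    n_tests independent tests, each performed at level alpha."""
    return 1 - (1 - alpha) ** n_tests


print(round(familywise_error(12), 3))  # 12 ANOVAs, one per pain-management component
print(round(familywise_error(1), 3))   # a single test, for comparison
```

Under these assumptions the chance of at least one spurious finding is roughly 46% rather than 5%, which is the motivation for beginning with a single multivariate test.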

The motivation for doing MANOVA prior to univariate ANOVA is similar to the reason for performing univariate ANOVA prior to t tests: to avoid performing many significance tests, which increases the likelihood that a chance difference is declared significant. In addition, MANOVA permits the statistician to look at complex relationships among the dependent variables. The results from MANOVA are often difficult to interpret, however, and it is used sparingly in the medical literature.

Canonical Correlation Analysis

Canonical correlation analysis also involves both multiple independent and multiple dependent variables. This method is appropriate when both the independent variables and the outcomes are numerical, and the research question focuses on the relationship between the set of independent variables and the set of dependent variables. For example, suppose researchers wish to examine the overall relationship between indicators of health outcome (physical functioning, mental health, health perceptions, age, gender, etc) measured at the beginning of a study and the set of outcomes (physical functioning, mental health, social contacts, serious symptoms, etc) measured at the end of the study. Canonical correlation analysis forms a linear combination of the independent variables to predict not just a single outcome measure, but a linear combination of outcome measures. The two linear combinations of independent variables and dependent variables, each resulting in a single number (or index), are determined so the correlation between them is as large as possible. The correlation between the pair of linear combinations (or numbers or indices) is called the canonical correlation. Then, as in factor analysis, a second pair of linear combinations is derived from the residual variation after the effect of the first pair is removed, and the third pair from those remaining, and so on. The canonical coefficients in the linear combinations are interpreted in the same manner as regression coefficients in a multiple regression equation, and the canonical correlations as multiple R. Generally, the first two or three pairs of linear combinations account for sufficient variation, and they can be interpreted to gain insights about related factors or dimensions.

The relationship between personality and symptoms of depression was studied in a community-based sample of 804 individuals. Grucza and colleagues (2003) used the Temperament and Character Inventory (TCI) to assess personality and the Center for Epidemiologic Studies Depression scale (CES-D) to measure symptoms of depression. Both of these questionnaires contain multiple scales or factors, and the authors used canonical correlation analysis to learn how the factors on the TCI are related to the factors on the CES-D. They discovered several relationships and concluded that depression symptom severity and patterns are partially explained by personality traits.

Summary of Advanced Methods

The advanced methods presented in this chapter are used in approximately 10–15% of the articles in medical and surgical journals. Unfortunately for readers of the medical literature, these methods are complex and not easy to understand, and they are not always described adequately. As with other complex statistical techniques, investigators should consult with a statistician if an advanced statistical method is planned. Table 10–2 gives a guide to the selection of the appropriate method(s), depending on the number of independent variables and the scale on which they are measured.


Exercises

[Note to AccessLange users: data and software are not available on the website.]

1. Using the following formula, verify the adjusted mean number of ventricular wall motion abnormalities in smokers and nonsmokers from the hypothetical data in the section titled, "Controlling for Confounding." That is,

2. Blood flow through an artery measured as peak systolic velocity (PSV) increases with narrowing of the artery. The well-known relationship between area of the arterial vessels and velocity of blood flow is important in the use of carotid Doppler measurements for grading stenosis of the artery. Alexandrov and collaborators (1997) examined 80 bifurcations in 40 patients and compared the findings from the Doppler technique with two angiographic methods of measuring carotid stenosis (the North American or NASCET [N] method and the common carotid [C or CSI] method). They investigated the fit provided by a linear equation, a quadratic equation, and a cubic equation.

  a. Using data in the file "Alexandrov" on the CD-ROM, produce a scatterplot with PSV on the y-axis and CSI on the x-axis. How do you interpret the scatterplot?
  b. Calculate the correlation between both the N and C methods and PSV. Which is most highly related to PSV?
  c. Perform a multiple regression to predict PSV from CSI using linear and quadratic terms.
  d. Using the regression equation, what is the predicted PSV if the measurement of angiographic stenosis using the CSI method is 60%?

3. Refer to the study by Soderstrom and coinvestigators (1997). Find the probability that a 27-year-old Caucasian man who comes to the emergency department on Saturday night has a BAC ≥ 50 mg/dL.

4. Refer to the study by Soderstrom and coinvestigators (1997). From Table 10–8, find the value of the kappa statistic for the agreement between the predicted and actual number of males with unintentional injuries who have a BAC ≥ 50 mg/dL when they come to the emergency department.

5. Bale and associates (1986) performed a study to consider the physique and anthropometric variables of athletes in relation to their type and amount of training and to examine these variables as potential predictors of distance running performance. Sixty runners were divided into three groups: (1) elite runners with 10-km runs in less than 30 min; (2) good runners with 10-km times between 30 and 35 min, and (3) average runners with 10-km times between 35 and 45 min. Anthropometric data included body density, percentage fat, percentage absolute fat, lean body mass, ponderal index, biceps and calf circumferences, humerus and femur widths, and various skinfold measures. The authors wanted to determine whether the anthropometric variables were able to differentiate between the groups of runners. What is the best method to use for this research question?

6. Ware and collaborators (1987) reported a study of the effects on health for patients in health maintenance organizations (HMO) and for patients in fee-for-service (FFS) plans. Within the FFS group, some patients were randomly assigned to receive free medical care and others shared in the costs. The health status of the adults was evaluated at the beginning and again at the end of the study. In addition, the number of days spent in bed because of poor health was determined periodically throughout the study. These measures, recorded at the beginning of the study—along with information on the participant's age, gender, income, and the system of health care to which he or she was assigned (HMO, free FFS, or pay FFS)—were the independent variables used in the study. The dependent variables were the values of these same 13 measures at the end of the study. The results from a multiple-regression analysis to predict number of bed days are given in Table 10–12.

Table 10–12. Regression Coefficients and t Test Values for Predicting Bed-Days in RAND Study.

Dependent-Variable Equation

Explanatory Variables and Other Measures (X)   Coefficient (b)   t Test
FFS free plan                                  –0.017            –2.17
FFS pay plan                                   –0.014            –2.18
Personal functioning                           –0.0002           –1.35
Mental health                                  –0.00006          0.25
Health perceptions                             –0.002            –5.17
Three-year term                                0.002             0.44
Took physical                                  –0.003            –0.56
Sample size                                    1568
Residual standard error                        0.01

FFS = fee for service.

Source: Reproduced, with permission, from Ware JE, Brook RH, Rogers WH, Keeler EB, Davie AR, Sherbourne CD, et al: Health Outcomes for Adults in Prepaid and Fee-for-Service Systems of Care. (R–3459–HHS.) Santa Monica, CA: The RAND Corporation, 1987, p. 59.

Use the regression equation to predict the number of bed-days during a 30-day period for a 70-year-old woman in the FFS pay plan who has the values on the independent variables shown in Table 10–13 (asterisks [*] designate dummy variables given a value of 1 if yes and 0 if no).

Table 10–13. Values for Prediction Equation.

Variable               Value
Personal functioning   80
Mental health          80
Health perceptions     75
Income                 10 (from a formula used in the RAND study)
Three-year term*       Yes
Took physical*         Yes

*Indicates a dummy variable with 1 = yes and 0 = no.

7. Symptoms of depression in the elderly may be more subtle than in younger patients, but recognizing depression in the elderly is important because it can be treated. Henderson and colleagues in Australia (1997) studied a group of more than 1000 elderly, all age 70 years or older. They examined the outcome of depressive states 3–4 years after initial diagnosis to identify factors associated with persistence of depressive symptoms and to test the hypothesis that depressive symptoms in the elderly are a risk factor for dementia or cognitive decline. They used the Canberra Interview for the Elderly (CIE), which measures depressive symptoms and cognitive performance, and referred to the initial measurement as "wave 1" and the follow-up as "wave 2." The regression equation predicting depression at wave 2 for 595 people who completed the interview on both occasions is given in Table 10–14, and data are in the file on the CD-ROM entitled, "Henderson." The variables have been entered into the regression equation in blocks, an example of hierarchical regression.

Table 10–14. Regression Results for Predicting Depression at Wave 2.

Predictor Variable (a)                        b        Beta (b)   P       R2       R2 Change
Depression Score
  Wave 1                                      0.267    0.231      0.000   0.182    0.182
Psychologic Health
  Neuroticism, wave 1                         0.067    0.077      0.056   0.0237   0.050
  Past history of depression                  0.320    0.136      0.000
Physical Health
  ADL, wave 1                                 –0.154   –0.103     0.033   0.411    0.174
  ADL, wave 2                                 0.275    0.283      0.012
  ADL2, wave 2
  Number of current symptoms, wave 2          0.115    0.117      0.009
  Number of medical conditions, wave 2        0.309    0.226      0.000
  BP, systolic, wave 2                        –0.010   –0.092     0.010
  Global health rating change                 0.284    0.079      0.028
  Sensory impairment change                   –0.045   –0.064     0.073
Social support/inactivity
  Social support: friends, wave 2             –1.650   –0.095     0.015   0.442    0.031
  Social support: visits, wave 2              –1.229   –0.087     0.032
  Activity level, wave 2                      0.061    0.095      0.025
Services (community residents only), wave 2            0.207 (c)

(a) Only those variables are shown that were included in the final model.

(b) Standardized beta value, controlling for all other variables in the regression, except service use. Based on community and institutional residents.

(c) Regression limited to community sample only; coefficients for other variables vary only very slightly from those obtained with regression on the full sample.

ADL = activities of daily living; BP = blood pressure.

Source: Table 3 from the article was modified with the addition of unstandardized regression coefficients; used, with permission, from Henderson AS, Korten AE, Jacomb PA, MacKinnon AJ, Jorm AF, Christensen H, et al: The course of depression in the elderly: A longitudinal community-based study in Australia. Psychol Med 1997;27:119–129.

  a. Based on the regression equation in Table 10–14, what is the relationship between depression score initially and at follow-up?
  b. The regression coefficient for age is –0.014. Is it significant? How would you interpret it?
  c. Once a person's depression score at wave 1 is known, which group of variables accounts for more of the variation in depression at wave 2?
  d. Use the data on the CD-ROM to replicate the analysis.

8. Table 10–15 contains the results from an analysis of the data from Crook and colleagues (1997) using only information known before treatment was given.

Table 10–15. Cox Proportional Hazard Model Using Only Pretreatment Variables.

–2 Log Likelihood   610.312

                      Chi-square   df   Significance
Overall (score)       51.483       6    0.0000
Change (–2LL) from
  Previous block      39.344       6    0.0000
  Previous step       39.344       6    0.0000

Variables in the Equation

Variable       B         SE       Wald      df   Significance   R
TUMSTAGE                          12.1032   3    0.0070         0.0969
TUMSTAGE (1)   –0.0263   0.7075   0.0014    1    0.9703         0.0000
TUMSTAGE (2)   1.0141    0.5419   3.5014    1    0.0613         0.0481
TUMSTAGE (3)   1.4588    0.5535   6.9458    1    0.0084         0.0873

                         95% CI for Exp (B)
Variable       Exp (B)   Lower    Upper
TUMSTAGE (1)   0.9740    0.2434   3.8979
TUMSTAGE (2)   2.7568    0.9530   7.9746
TUMSTAGE (3)   4.3008    1.4534   12.7264

df = degree of freedom; SE = standard error; CI = confidence interval; Wald = statistic used by SPSS to test the significance of variables.

Source: Data, used with permission of the authors and publisher, from Crook JM, Bahadur YA, Bociek RG, Perry GA, Robertson SJ, Esche BA: Radiotherapy for localized prostate carcinoma. Cancer 1997;79:328–336. Output produced using SPSS 10.0, a registered trademark of SPSS, Inc. Used with permission.

  a. Is the overall Cox model significant when based on pretreatment variables only? What level of significance is reported?
  b. Were any of the potentially confounding variables significant?
  c. Confirm the value of the odds ratios associated with the TUMSTAGE(3) variable of T classifications, and interpret the confidence interval.
  d. What are the major differences in this analysis compared with the one that included posttreatment variables as well?
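As a starting point for part c, the relationship between the columns of Table 10–15 can be sketched as follows: Exp (B) is the exponential of the coefficient B, and an approximate 95% confidence interval is obtained by exponentiating B plus or minus 1.96 standard errors. The function name is hypothetical, and the normal approximation with a multiplier of 1.96 is an assumption of this sketch.

```python
from math import exp


def exp_b_with_ci(b, se, z=1.96):
    """Return Exp (B) and its approximate 95% confidence limits."""
    return exp(b), exp(b - z * se), exp(b + z * se)


# B and SE for TUMSTAGE (3), taken from Table 10-15
ratio, lower, upper = exp_b_with_ci(1.4588, 0.5535)
print(round(ratio, 4), round(lower, 4), round(upper, 4))
```

An interval that excludes 1 corresponds to a coefficient that differs significantly from zero at the 0.05 level.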

9. Hindmarsh and Brook (1996) examined the final height of 16 short children who were treated with growth hormone. They studied several variables they thought might predict height in these children, such as the mother's height, the father's height, the child's chronologic and bone age, dose of the growth hormone during the first year, age at the start of therapy, and the peak response to an insulin-induced hypoglycemia test. All anthropometric indices were expressed as standard deviation scores; these scores express height in terms of standard deviations from the mean in a norm group. For example, a height score of –2.00 indicates the child is 2 standard deviations below the mean height for his or her age group.

Data are given in Table 10–16 and in a file entitled "Hindmarsh" on the CD-ROM.

Table 10–16. Case Summaries (a)

          Final Height SDS   Age   Dose   Father's Height   Mother's Height   Height SDS Chronologic Age   Height SDS Bone Age
Total N   16                 16    16     16                16                16                           16

(a) Limited to first 100 cases.

SDS = standard deviation score.

Source: Data, used with permission, from Hindmarsh PC, Brook CGD: Final height of short normal children treated with growth hormone. Lancet 1996;348:13–16. Table produced using SPSS 10.0; used with permission.

  a. Use the data to perform a stepwise regression and interpret the results. We reproduced a portion of the output in Table 10–17.
  b. What variable entered the equation on the first iteration (model 1)? Why do you think it entered first?
  c. What variables are in the equation at the final model? Which of these variables makes the greatest contribution to the prediction of final height?
  d. Why do you think the variable that entered the equation first is not in the final model?
  e. Using the regression equation, what is the predicted height of the first child? How close is this to the child's actual final height (in SDS scores)?

Table 10–17. Results from Stepwise Multiple Regression to Predict Final Height in Standard Deviation Scores (a)

Model   Variable                     B (unstandardized)   SE      Beta (standardized)   t        Significance
1       (Constant)                   –1.055               0.248                         –4.261   0.001
        Father's height              0.302                0.142   0.494                 2.126    0.052
2       (Constant)                   –1.335               0.284                         –4.705   0.000
        Father's height              0.337                0.135   0.552                 2.503    0.026
        Mother's height              –0.343               0.200   –0.378                –1.715   0.110
3       (Constant)                   0.205                0.734                         0.280    0.785
        Father's height              0.211                0.131   0.345                 1.612    0.133
        Mother's height              –0.478               0.185   –0.527                –2.581   0.024
        Height SDS chronologic age   0.820                0.368   0.505                 2.230    0.046
4       (Constant)                   –1.110               0.927                         –1.198   0.256
        Father's height              0.128                0.124   0.210                 1.035    0.323
        Mother's height              –0.559               0.170   –0.617                –3.284   0.007
        Height SDS chronologic age   1.132                0.363   0.697                 3.116    0.010
5       (Constant)                   –1.138               0.929                         –1.225   0.244
        Mother's height              –0.575               0.170   –0.634                –3.381   0.005
        Height SDS chronologic age   1.325                0.313   0.816                 4.229    0.001

(a) Dependent variable: final height SDS.

SE = standard error; SDS = standard deviation scores.

Note: Because the sample size is small, we set probability for variables to enter the regression equation at 0.15 and for variables to be removed at 0.20.

Source: Data, used with permission, from Hindmarsh PC, Brook CGD: Final height of short normal children treated with growth hormone. Lancet 1996;348:13–16. Stepwise regression results produced using SPSS; used with permission.
