Basic and Clinical Biostatistics > Chapter 10. Statistical Methods for Multiple Variables >

Key Concepts

The choice of statistical methods depends on the research
question, the scales on which the variables are measured, and the
number of variables to be analyzed.

Many of the advanced statistical procedures can be interpreted
as an extension or modification of multiple regression analysis.

Many of the statistical methods used for questions with one
independent variable have direct analogies with methods for multiple
independent variables.

The term "multivariate" is used when more
than one independent variable is analyzed.

Multiple regression is a simple and ideal method to control
for confounding variables.

Multiple regression coefficients indicate whether the relationship
between the independent and dependent variables is positive or negative.

Dummy, or indicator, coding is used when nominal variables
are used in multiple regression.

Regression coefficients indicate the amount of change in
the dependent variable for each one-unit change in the X variable,
holding other independent variables constant.

Multiple regression measures a linear relationship only.

The multiple R statistic indicates how well
the model fits the data; its square, R^{2}, measures how much variance
is accounted for.

Several methods can be used to select variables in a multivariate
regression.

Polynomial regression can be used when the relationship is
curvilinear.

Cross-validation tells us how applicable the model will be
if we use it in another sample of subjects.

A good rule of thumb is to have ten times as many subjects
as variables.

Analysis of covariance controls for confounding variables;
it can be used as part of analysis of variance or in multiple regression.

Logistic regression predicts a nominal outcome; it is the
most widely used regression method in medicine.

The regression coefficients in logistic regression can be
transformed to give odds ratios.

The Cox model is the multivariate analogue of the Kaplan–Meier
curve; it predicts time-dependent outcomes when there are censored
observations.

The Cox model is also called the proportional hazard model;
it is one of the most important statistical methods in medicine.

Meta-analysis provides a way to combine the results from several
studies in a quantitative way and is especially useful when studies
have come to opposite conclusions or are based on small samples.

An effect size is a measure of the magnitude of differences
between two groups; it is a useful concept in estimating sample
sizes.

The Cochrane Collaboration maintains a set of very well designed meta-analyses
(systematic reviews), available at libraries and online.

Several methods are available when the goal is to classify
subjects into groups.

Multivariate analysis of variance, or MANOVA, is analogous
to using ANOVA when there are several dependent variables.

Presenting Problems

Presenting Problem
1

In Chapter 8 we examined the study by Jackson and colleagues
(2002) who evaluated the relationship between BMI and percent body
fat. Please refer to that chapter for more details on the study. We
found a significant relationship between these two measures and
calculated a correlation coefficient of r = 0.73. These
investigators knew, however, that variables other than BMI may also affect
the relationship between BMI and percent body fat and developed
separate models for men and women. We use their data in this chapter
to illustrate two important procedures: multiple regression to control
possible confounding variables, and polynomial regression to model
the nonlinear relationship we noted in Chapter 8. Data are on the
CD-ROM [available only with the book] in a file entitled "Jackson."

Presenting Problem
2

Soderstrom and coinvestigators (1997) wanted to develop a model
to identify trauma patients who are likely to have a blood alcohol
concentration (BAC) in excess of 50 mg/dL; such patients might be
candidates for alcohol and drug abuse and dependence treatment and
intervention programs. They evaluated data from a clinical trauma
registry and toxicology database at a level I trauma center.

Data, including BAC, were available on 11,062 patients of whom
approximately 71% were male and 65% were white.
The mean age was 35 years with a standard deviation of 17 years.
Type of injury was classified as unintentional, typically accidental
(78.2%), or intentional, including suicide attempts (21.8%).
Of these patients, 3180 (28.7%) had alcohol detected in
the blood, and 91.2% of those patients had a BAC in excess
of 50 mg/dL. Among the patients with a BAC > 50, percentages
of men and whites did not differ appreciably from the entire sample;
however, the percentage of intentional injuries in this group was
higher (28.9%). We use a random sample of data provided
by the investigators to illustrate the calculation and interpretation
of the logistic model, the statistical method they used to develop
their predictive model. Data are in a file called "Soderstrom" on
the CD-ROM [available only with the book].

Presenting Problem
3

In the previous chapter we used data from a study by Crook and
colleagues (1997) to illustrate the Kaplan–Meier survival
analysis method. These investigators studied both pretreatment
prostate-specific antigen (PSA) and posttreatment nadir PSA levels
in men with localized prostate cancer who were treated using external
beam radiation therapy. The Gleason histologic scoring system was
used to classify tumors on a scale of 2 to 10.
Please refer to Chapter 9 for more details. The investigators
wanted to examine factors other than tumor stage that might be associated
with treatment failure, and we use observations from their study
to describe an application of the Cox proportional hazard model.
Data on the patients are given in the file entitled "Crook" on
the CD-ROM [available only with the book].

Presenting Problem
4

The use of central venous catheters to administer parenteral
nutrition, fluids, or drugs is a common medical practice. Catheter-related
bloodstream infections (CR-BSI) are a serious complication estimated
to occur in about 200,000 patients each year. Many studies have
suggested that impregnation of the catheter with the antiseptic
chlorhexidine/silver sulfadiazine reduces bacterial colonization,
but only one study has shown a significant reduction in the incidence
of bloodstream infections.

It is difficult for physicians to interpret the literature when
studies report conflicting results about the benefits of a clinical
intervention or practice. As you now know, studies frequently fail
to find significance because of low power associated with small
sample sizes. Traditionally, conflicting results in medicine are
dealt with by reviewing many studies published in the literature
and summarizing their strengths and weaknesses in what are commonly
called review articles. Veenstra and colleagues (1999) used a more
structured method to combine the results of several studies in a
statistical manner. They applied meta-analysis to 11 randomized,
controlled clinical trials, comparing the incidence of bloodstream
infection in impregnated catheters versus nonimpregnated catheters,
so that overall conclusions regarding efficacy of the practice could
be drawn. The section titled "Meta-Analysis" summarizes
the results.

Purpose of the
Chapter

The purpose of this chapter is to present a conceptual framework
that applies to almost all the statistical procedures discussed
so far in this text. We also describe some of the more advanced
techniques used in medicine.

A Conceptual
Framework

The previous chapters illustrated statistical techniques that
are appropriate when the number of observations on each subject
in a study is limited. For example, a t test
is used when two groups of subjects are studied and the measure
of interest is a single numerical variable—such as in Presenting
Problem 1 in Chapter 6, which discussed differences in pulse oximetry
in patients who did and did not have a pulmonary embolism (Kline
et al, 2002). When the outcome of interest is nominal, the chi-square
test can be used—such as the Lapidus et al (2002) study
of screening for domestic violence in the emergency department (Chapter 6
Presenting Problem 3). Regression analysis is used to predict
one numerical measure from another, such as in the study predicting
insulin sensitivity in hyperthyroid women (Gonzalo et al, 1996;
Chapter 7 Presenting Problem 2).

Alternatively, each of these examples can be viewed conceptually
as involving a set of subjects with two observations on each subject:
(1) for the t test, one numerical variable,
pulse oximetry, and one nominal (or group membership) variable, development
of pulmonary embolism; (2) for the chi-square test, two nominal
variables, training in domestic violence and screening in the emergency
department; (3) for regression, two numerical variables, insulin
sensitivity and body mass index. It is advantageous to look at research
questions from this perspective because ideas are analogous to situations
in which many variables are included in a study.

To practice viewing research questions from a conceptual perspective,
let us reconsider Presenting Problem 1 in Chapter 7 by Woeber (2002).
The objective was to determine whether differences exist in serum
free T_{4} concentrations among patients who had thyroiditis
with normal serum TSH values and were not taking L-T_{4} replacement,
who had normal TSH values and were taking L-T_{4} replacement
therapy, or who had normal thyroid and serum TSH levels. The research
question in this study may be viewed as involving a set of subjects
with two observations per subject: one numerical variable, serum
free T_{4} concentrations, and one ordinal (or group membership)
variable, thyroid status, with three categories. If only two categories
were included for thyroid status, the t test
would be used. With more than two groups, however, one-way analysis
of variance (ANOVA) is appropriate.

Many problems in medicine have more than two observations per
subject because of the complexity involved in studying disease in
humans. In fact, many of the presenting problems used in this text
have multiple observations, although we chose to simplify the problems
by examining only selected variables. One method involving more
than two observations per subject has already been discussed: two-way
ANOVA. Recall that in Presenting Problem 2 in Chapter 7 insulin
sensitivity was examined in overweight and normal weight women with
and without hyperthyroid disease (Gonzalo et al, 1996). For this
analysis, the investigators classified women according to two nominal
variables (weight status and thyroid status, both measured as normal
or higher than normal) and one numerical variable, insulin sensitivity.
(Although both weight and thyroid level are actually numerical measures,
the investigators transformed them into nominal variables by dividing
the values into two categories.)

If the term independent variable is
used to designate the group membership variables (eg, development
of pulmonary embolism or not), or the X variable
(eg, blood pressure measured by a finger device), and the term dependent is used to designate the
variables whose means are compared (eg, pulse oximetry), or the Y variable (eg, blood pressure measured
by the cuff device), the observations can be summarized as in Table
10–1. (For the sake of simplicity, this summary omits ordinal
variables; variables measured on an ordinal scale are often treated
as if they are nominal.) Data from several of the presenting problems
are available on the CD-ROM [available only with the book], and we invite you to replicate the
analyses as you go through this chapter.

Table 10–1. Summary
of Conceptual Framework^{a} for Questions Involving Two Variables.

Independent Variable             Dependent Variable     Method
Nominal                          Nominal                Chi-square
Nominal (binary)                 Numerical              t test^{a}
Nominal (more than two values)   Numerical              One-way ANOVA^{a}
Nominal                          Numerical (censored)   Actuarial methods
Numerical                        Numerical              Regression^{b}

^{a}Assuming the necessary assumptions (eg, normality,
independence, etc.) are met.

^{b}Correlation is appropriate when neither variable is
designated as independent or dependent.

ANOVA = analysis of variance.

Introduction
to Methods for Multiple Variables

Statistical techniques involving multiple variables are used
increasingly in medical research, and several of them are illustrated
in this chapter. The multiple-regression model, in which several independent
variables are used to explain or predict the values of a single
numerical response, is presented first, partly because it is a natural
extension of the regression model for one independent variable illustrated
in Chapter 8. More importantly, however, all the other advanced
methods except meta-analysis can be viewed as modifications or extensions
of the multiple-regression model. All except meta-analysis involve
more than two observations per subject and are concerned with explanation
or prediction.

The goal in this chapter is to present the logic of the different
methods listed in Table 10–2 and to illustrate how they
are used and interpreted in medical research. These methods are
generally not mentioned in traditional introductory texts, and most
people who take statistics courses do not learn about them until
their third or fourth course. These methods are being used more
frequently in medicine, however, partly because of the increased
involvement of statisticians in medical research and partly because
of the availability of complex statistical computer programs. In
truth, few of these methods would be used very much in any field
were it not for computers because of the time-consuming and complicated
computations involved. To read the literature with confidence, especially
studies designed to identify prognostic or risk factors, a reasonable
acquaintance with the methods described in this chapter is required.
Few of the available elementary books discuss multivariate methods.
One that is directed toward statisticians is nevertheless quite
readable (Chatfield, 1995); Katz (1999) is intended for readers
of the medical literature and contains explanations of many of the topics
we discuss in this chapter (Dawson, 2000), as do Norman and Streiner (1996).

Table 10–2. Summary
of Conceptual Framework^{a} for Questions Involving Two or
More Independent (Explanatory) Variables.

Independent Variables   Dependent Variable               Method(s)
Nominal                 Nominal                          Log-linear
Nominal and numerical   Nominal (binary)                 Logistic regression
Nominal and numerical   Nominal (2 or more categories)   Logistic regression, discriminant
                                                         analysis^{a}, cluster analysis,
                                                         propensity scores, CART
Nominal                 Numerical                        ANOVA^{a}, MANOVA^{a}
Numerical               Numerical                        Multiple regression^{a}
Nominal and numerical   Numerical (censored)             Cox proportional hazard model
Confounding factors     Numerical                        ANCOVA^{a}, MANOVA^{a}, GEE^{a}
Confounding factors     Nominal                          Mantel–Haenszel
Numerical only          (none)                           Factor analysis

^{a}Certain assumptions (eg, multivariate normality,
independence, etc.) are needed to use these methods.

CART = classification and regression tree; ANOVA = analysis
of variance; ANCOVA = analysis of covariance; MANOVA = multivariate
analysis of variance; GEE = generalized estimating equations.

Before we examine the advanced methods, however, a comment on
terminology is necessary. Some statisticians reserve the term "multivariate" to
refer to situations that involve more than one dependent (or response)
variable. By this strict definition, multiple regression and most
of the other methods discussed in this chapter would not be classified
as multivariate techniques. Other statisticians, ourselves included,
use the term to refer to methods that examine the simultaneous effect
of multiple independent variables. By this definition, all the techniques
discussed in this chapter (with the possible exception of some meta-analyses)
are classified as multivariate.

Multiple Regression

Review of Regression

Simple linear regression (Chapter 8) is the method of choice
when the research question is to predict the value of a response
(dependent) variable, denoted Y, from
an explanatory (independent) variable X. The
regression model is

Y = a + bX

For simplicity of notation in this chapter we use Y to denote the dependent variable,
even though Y', the predicted value, is actually
given by this equation. We also use a and b, the sample estimates, instead of
the population parameters β_{0} and β_{1}, where a is the intercept and b the regression
coefficient. Please refer to Chapter 8 if you'd like
to review simple linear regression.

Multiple Regression

The extension of simple regression to two or more independent
variables is straightforward. For example, if four independent variables
are being studied, the multiple regression model is

Y = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4}

where X_{1} is the first independent
variable and b_{1} is the regression
coefficient associated with it, X_{2} is
the second independent variable and b_{2} is
the regression coefficient associated with it, and so on. This arithmetic
equation is called a linear combination; thus,
the response variable Y can be expressed
as a (linear) combination of the explanatory variables. Note that
a linear combination is really just a weighted average that gives
a single number (or index) after the X's
are multiplied by their associated b's
and the bX products are added. The
formulas for a and b were
given in Chapter 8, but we do not give the formulas in multiple
regression because they become more complex as the number of independent
variables increases; and no one calculates them by hand, in any
case.
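The arithmetic of a linear combination is nevertheless easy to verify numerically. The sketch below, using made-up data rather than the Jackson study data, builds Y as a known weighted sum of two X variables and then recovers a and the b's with an ordinary least-squares solver, which is essentially the computation the statistical packages carry out:

```python
import numpy as np

# Hypothetical data for six subjects and two independent variables.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
# Y is built as an exact linear combination, Y = 1.5 + 2.0*X1 - 0.5*X2,
# so the solver should recover these values.
Y = 1.5 + 2.0 * X1 - 0.5 * X2

# Design matrix: a leading column of 1s for the intercept a, then the X's.
A = np.column_stack([np.ones_like(X1), X1, X2])
a, b1, b2 = np.linalg.lstsq(A, Y, rcond=None)[0]
print(round(a, 4), round(b1, 4), round(b2, 4))  # -> 1.5 2.0 -0.5
```

The predicted value for any subject is then the same weighted average described above: a + b1*X1 + b2*X2.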

The dependent variable Y must be
a numerical measure. The traditional multiple-regression model calls
for the independent variables to be numerical measures as well;
however, nominal independent variables may be used, as discussed
in the next section. To summarize, the appropriate technique for
numerical independent variables and a single numerical dependent
variable is the multiple regression model, as indicated in Table
10–2.

Multiple regression can be difficult to interpret, and the results
may not be replicable if the independent variables are highly correlated
with each other. In the extreme situation, two variables that are
perfectly correlated are said to be collinear. When multicollinearity
occurs, the variances of the regression coefficients are large so
the observed value may be far from the true value. Ridge regression
is a technique for analyzing multiple regression data that suffer
from multicollinearity by reducing the size of standard errors.
It is hoped that the net effect will be to give more reliable estimates.
Another regression technique, principal components regression, is
also available, but ridge regression is the more popular of the
two methods.
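As a rough illustration of why ridge regression helps with multicollinearity, the sketch below (synthetic data and an arbitrarily chosen penalty, not any published analysis) applies the ridge estimator (X'X + lambda*I)^(-1) X'y to two nearly collinear predictors; the added lambda*I term keeps the matrix well conditioned where ordinary least squares is unstable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Two nearly collinear predictors (hypothetical data): x2 is almost a copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = x1 + x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

lam = 1.0  # ridge penalty, chosen arbitrarily for this sketch

# Ridge estimate (X'X + lam*I)^(-1) X'y: the penalty shrinks the
# coefficients and stabilizes their variance when X'X is nearly singular.
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS:", b_ols, "ridge:", b_ridge)
```

With near-duplicate predictors the OLS coefficients may swing to large values in opposite directions, while the ridge coefficients stay close to each other and to the true values.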

Interpreting
the Multiple Regression Equation

Jackson and colleagues (2002) (Presenting Problem 1) wanted to
study the way in which sex, age, and race affect the relationship
between BMI and percent body fat. We provide some basic information
on these variables in Table 10–3 and see that the study included
121 black women, 238 white women, 81 black men, and 215 white
men.

Table 10–3. Means
and Standard Deviations Broken Down by Gender and Race.

Gender   Race    Statistic            Age        BMI       PCTFAT
Female   Black   Mean                 32.7770    28.1380   35.997
                 N                    121        121       121
                 Standard deviation   11.35229   6.14086   8.756
         White   Mean                 34.4032    24.8182   29.971
                 N                    238        238       238
                 Standard deviation   13.79910   4.91353   9.8447
         Total   Mean                 33.8551    25.9371   32.002
                 N                    359        359       359
                 Standard deviation   13.03256   5.57608   9.9349
Male     Black   Mean                 34.2526    26.9269   22.944
                 N                    81         81        81
                 Standard deviation   11.97843   4.83454   7.3195
         White   Mean                 36.4834    26.5334   22.963
                 N                    215        215       215
                 Standard deviation   15.06562   4.66455   9.0302
         Total   Mean                 35.8730    26.6411   22.958
                 N                    296        296       296
                 Standard deviation   14.30226   4.70670   8.5839
Total    Black   Mean                 33.3687    27.6524   30.763
                 N                    202        202       202
                 Standard deviation   11.60057   5.67188   10.4632
         White   Mean                 35.3905    25.6322   26.645
                 N                    453        453       453
                 Standard deviation   14.43550   4.86781   10.0846
         Total   Mean                 34.7670    26.2552   27.915
                 N                    655        655       655
                 Standard deviation   13.64747   5.20919   10.3710

Source: Data, used with
permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T,
Leon AS, Rao DC, et al: The effect of sex, age and race on estimating
percentage body fat from body mass index: The Heritage Family Study.
Int J Obes Relat Metab Disord 2002;26:789–796. Table produced
with SPSS Inc.; used with permission.

Table 10–4 shows the regression equation to predict
percent body fat (see the bold values). Focusing initially on
the Regression Equation Section, we
see that all the variables are statistically significantly related
to percent body fat.

Table 10–4. Multiple
Regression Predicting Percent Body Fat.

Multiple Regression Report

Run Summary Section

Parameter                      Value       Parameter                  Value
Dependent variable             PCTFAT      Rows processed             655
Number independent variables   4           Rows filtered out          0
Weight variable                None        Rows with X's missing      0
R^{2}                          0.8042      Rows with weight missing   0
Adj R^{2}                      0.8030      Rows with Y missing        0
Coefficient of variation       0.1649      Rows used in estimation    655
Mean square error              21.18832    Sum of weights             655.000
Square root of MSE             4.603077    Completion status          Normal completion
Ave Abs Pct Error              19.089

Regression Equation Section

Independent   Regression         Standard      T-Value to test   Prob     Reject H_{0}   Power of
Variable      Coefficient b(i)   Error Sb(i)   H_{0}:B(i)=0      Level    at 5%?         Test at 5%
Intercept     –8.3748            1.0338        –8.101            0.0000   Yes            1.0000
Age           0.1603             0.0140        11.442            0.0000   Yes            1.0000
BMI           1.3710             0.0372        36.809            0.0000   Yes            1.0000
Race          –0.9161            0.4005        –2.287            0.0225   Yes            0.6283
Sex           –10.2746           0.3638        –28.242           0.0000   Yes            1.0000

Regression Coefficient Section

Independent   Regression    Standard   Lower 95%   Upper 95%   Standardized
Variable      Coefficient   Error      C.L.        C.L.        Coefficient
Intercept     –8.3748       1.0338     –10.4011    –6.3486     0.0000
Age           0.1603        0.0140     0.1328      0.1877      0.2109
BMI           1.3710        0.0372     1.2980      1.4440      0.6886
Race          –0.9162       0.4005     –1.7011     –0.1311     –0.0408
Sex           –10.2746      0.3638     –10.9876    –9.5616     –0.4934

Note: The T-Value used to calculate these confidence limits
was 1.960.

Source: Data, used with permission, from Jackson AS,
Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al: The
effect of sex, age and race on estimating percentage body fat from
body mass index: The Heritage Family Study. Int J Obes Relat Metab
Disord 2002;26:789–796. Analysis produced using NCSS; used
with permission.

The first variable is a numerical variable, age, with regression
coefficient, b, of 0.1603, indicating
that greater age is associated with higher percent body fat. The
second variable, BMI, is also numerical; the regression coefficient
of 1.3710 indicates that patients with higher BMI also have higher
percent body fat, which certainly makes sense.

The third variable, sex, is a binary variable having two values.
For regression models it is convenient to code binary variables
as 0 and 1; in the Jackson example, females have a 0 code for sex, and
males have a 1. This procedure, called dummy or indicator coding, allows investigators
to include nominal variables in a regression equation in a straightforward manner.
The dummy variables are interpreted as follows: a subject who is male has the code for males, 1, multiplied
by the regression coefficient for sex, –10.2746, resulting in
10.2746 points being subtracted from his predicted percent body fat. The decision
of which value is assigned 1 and which is assigned 0 is an arbitrary
decision made by the researcher but can be chosen to facilitate
interpretations of interest to the researcher.

The final variable is race, also dummy coded, with 0 for black
and 1 for white. The regression coefficient is negative and indicates
that white patients have 0.9161 subtracted from their percent body
fat. The intercept is –8.3748; it is the constant term included
in every prediction once all the variables are in the equation. The
regression coefficients can be used to predict percent body fat by
multiplying a given patient's value for each independent variable X by
the corresponding regression coefficient b, summing
the bX products, and adding the intercept.
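For instance, using the coefficients in Table 10–4 with a hypothetical patient (the age, BMI, and codes below are invented for illustration), the prediction is simply the intercept plus the sum of the bX products:

```python
# Regression coefficients from Table 10-4 (percent body fat model).
intercept = -8.3748
b = {"age": 0.1603, "bmi": 1.3710, "race": -0.9161, "sex": -10.2746}

# Hypothetical patient: a 35-year-old white (race = 1) man (sex = 1) with BMI 27.
x = {"age": 35, "bmi": 27, "race": 1, "sex": 1}

# Predicted percent body fat is the intercept plus the sum of the bX products.
predicted_pct_fat = intercept + sum(b[k] * x[k] for k in b)
print(round(predicted_pct_fat, 2))  # -> 23.06
```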

Regression coefficients are interpreted differently in multiple
regression than in simple regression. In simple regression, the
regression coefficient b indicates
the amount the predicted value of Y changes
each time X increases by 1 unit. In
multiple regression, a given regression coefficient indicates how
much the predicted value of Y changes
each time X increases by 1 unit, holding the values of all other variables
in the regression equation constant—as though all
subjects had the same value on the other variables. For example,
predicted percent body fat increases by 0.1603 for each 1-year
increase in a patient's age, assuming all other variables are held constant.
This feature of multiple regression makes it an ideal method to
control for baseline differences and confounding variables, as we
discuss in the section titled "Controlling for Confounding."

It bears repeating that multiple regression measures only the
linear relationship between the independent variables and the dependent
variable, just as in simple regression. In the Jackson study, the
authors examined the scatterplot between BMI and percent body fat,
which we have reproduced in Figure 10–1. The figure indicates
a curvilinear relationship, and the investigators decided to transform
BMI by taking its natural logarithm. They developed four models
for females and males separately to examine the cumulative effect
of including variables in the regression equation; results are reproduced
in Table 10–5. Model I includes only ln BMI and the intercept;
model II adds age, model III adds race, and model IV adds the interactions
of ln BMI with race and age. The rationale for including interactions
is the same as discussed in Chapter 7, namely that the investigators wanted to
know whether the relationship between ln BMI and percent body fat
was the same for all levels of race or age.
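The log transformation and the interaction terms are simply additional columns in the regression's design matrix. The sketch below (hypothetical values, not the study data) builds the columns a model-IV-style equation would use:

```python
import numpy as np

# Hypothetical subjects: age in years, BMI, and race coded 0 = black, 1 = white.
age = np.array([25.0, 40.0, 55.0])
bmi = np.array([22.0, 28.0, 31.0])
race = np.array([0.0, 1.0, 1.0])

ln_bmi = np.log(bmi)  # natural-log transform of BMI

# Model IV-style design matrix: intercept, ln BMI, age, race, plus the
# race x ln BMI and age x ln BMI interaction columns.
X = np.column_stack([
    np.ones_like(age),
    ln_bmi,
    age,
    race,
    race * ln_bmi,
    age * ln_bmi,
])
print(X.shape)  # -> (3, 6)
```

The regression is then fitted to this matrix exactly as with untransformed variables; a significant coefficient on an interaction column means the ln BMI slope differs by race or age.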

Table 10–5. Results
from the Regression Analyses Predicting Percent Body Fat.

                  Female Models                              Male Models
Variable          I          II         III        IV        I          II         III        IV
Intercept         107.22^{a} 102.01^{a} 97.11^{a}  82.83^{a} 111.13^{a} 103.94^{a} 104.21^{a} 149.24^{a}
ln BMI            43.05^{a}  39.96^{a}  38.67^{a}  34.43^{a} 41.04^{a}  37.31^{a}  37.35^{a}  51.31^{a}
Age                          0.14^{a}   0.15^{a}   0.14^{a}             0.14^{a}   0.14^{a}   1.47^{a}
Race^{b}                                –1.63^{a}  –26.02^{a}                      –0.23
Race x ln BMI                                      7.48^{a}
Age x ln BMI                                       –0.41^{a}
r^{2}             0.78^{a}   0.80       0.81^{a}   0.82^{a}  0.67^{a}   0.72^{a}   0.72^{a}   0.73^{a}
Δr^{2}                       0.01^{a}   0.01^{a}   0.01^{a}             0.05^{a}   0.00       0.01^{a}
s.e.e. (% fat)    4.7        4.4        4.3        4.3       4.9        4.6        4.6        4.5

^{a}P < 0.001.

^{b}Key: race coded black = 0 and white = 1.

Source: Data, used with permission,
from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao
DC, et al: The effect of sex, age and race on estimating percentage
body fat from body mass index: The Heritage Family Study. Int J
Obes Relat Metab Disord 2002;26:789–796.

Figure 10–1.

Plot illustrating the nonlinear relationship between
BMI and percent body fat. (Data, used with permission, from Jackson
AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao DC, et al:
The effect of sex, age, and race on estimating percentage body fat
from body mass index: The Heritage Family Study. Int
J Obes Relat Metab Disord 2002;26:789–796.
Analysis produced using NCSS; used with permission.)

Statistical
Tests for the Regression Coefficient

Table 10–6 shows the output from NCSS for model III
for female subjects; it contains a number of features to discuss.
In the upper half of the table, note the columns headed by t value and probability level. Both
the t test and the F test
can be used to determine whether a regression coefficient is different
from zero, or the t distribution can
be used to form confidence intervals for each regression coefficient.
Remember that even though the P values
are sometimes reported as 0.000, there is always some probability,
even if it is very small. Many statisticians believe, and we agree,
that it is more accurate to report P <
0.001.

Table 10–6. Regression
Analysis of Females, Model III.

Regression Equation Section

Independent   Regression         Standard      T-Value to test   Prob     Reject H_{0}   Power of
Variable      Coefficient b(i)   Error Sb(i)   H_{0}:B(i)=0      Level    at 5%?         Test at 5%
Intercept     –97.1096           4.0314        –24.088           0.0000   Yes            1.0000
Log_BMI       38.6724            1.2684        30.490            0.0000   Yes            1.0000
Age           0.1510             0.0190        7.938             0.0000   Yes            1.0000
Race          –1.6308            0.5125        –3.182            0.0016   Yes            0.8875

Regression Coefficient Section

Independent   Regression    Standard   Lower 95%   Upper 95%   Standardized
Variable      Coefficient   Error      C.L.        C.L.        Coefficient
Intercept     –97.1096      4.0314     –105.0110   –89.2082    0.0000
Log_BMI       38.6724       1.2684     36.1864     41.1584     0.7910
Age           0.1510        0.0190     0.1137      0.1883      0.1981
Race          –1.6308       0.5125     –2.6354     –0.6262     –0.0777

Note: The T-value used to calculate these confidence
limits was 1.960.

Source: Data, used with permission,
from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao
DC, et al: The effect of sex, age and race on estimating percentage
body fat from body mass index: The Heritage Family Study. Int J
Obes Relat Metab Disord 2002;26:789–796. Table produced
with NCSS; used with permission.
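The t values and confidence limits in Table 10–6 can be approximated directly from each printed coefficient and standard error (t = b/SE; 95% CI = b ± 1.960 × SE). Small discrepancies from the table, such as 7.947 here versus the printed 7.938, presumably arise because the table was computed from unrounded values:

```python
# Coefficient and standard error for age in model III for females (Table 10-6).
b_age = 0.1510
se_age = 0.0190

t_age = b_age / se_age          # t statistic for H0: coefficient = 0
lower = b_age - 1.960 * se_age  # lower 95% confidence limit
upper = b_age + 1.960 * se_age  # upper 95% confidence limit
print(t_age, lower, upper)      # about 7.947 and CI of about (0.1138, 0.1882)
```

Because zero lies well outside this interval, the null hypothesis that the age coefficient is zero is rejected, in agreement with the table.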

Standardized
Regression Coefficients

Most authors present regression coefficients that can be used
with individual subjects to obtain predicted Y values.
But the size of the regression coefficients cannot be used to decide
which independent variables are the most important, because their
size is also related to the scale on which the variables are measured,
just as in simple regression. For example, in Jackson and colleagues' study, the
variable race was coded 1 if white and 0 if black, and the variable
age was coded as the number of years of age at the time of the first
data collection. Then, if race and age were equally important in
predicting percent body fat, the regression coefficient for
race would be much larger than the regression coefficient for age
so that the same amount would be added to the prediction of percent
body fat. These regression coefficients are sometimes called unstandardized; they cannot be used
to draw conclusions about the importance of a variable, but only
about whether its relationship with the dependent variable Y is positive or negative.^{a} One
way to eliminate the effect of scale is to standardize the
regression coefficients. Standardization can be done by subtracting
the mean value of X and dividing by
the standard deviation before analysis, so that all variables have
a mean of 0 and a standard deviation of 1. Then it is possible to
compare the magnitudes of the regression coefficients and draw conclusions
about which explanatory variables play an important role. It is
also possible to calculate the standardized regression coefficients
after the regression model has been developed.^{b} The
larger the standardized coefficient, the larger the value of the t statistic. Standardized
regression coefficients are often referred to as beta ()
coefficients. The major disadvantage of standardized regression
coefficients is that they cannot readily be used to predict outcome
values. The lower half of Table 10–6 contains the standardized regression
coefficients in the far right column for the variables used to predict
percent body fat in Jackson and colleagues' study. Using the standardized
coefficients in Table 10–6, can you determine which variable,
age or race, has more influence in predicting percent body fat?
If you chose age, you are correct, because the absolute value of
its standardized coefficient is larger, 0.1981, compared with –0.0777
for race.

^{a}Technically it is possible for the regression coefficient
and the correlation to have different signs. If so, the variable
is called a suppressor variable; it affects the relationship between
the dependent variable and another independent variable.

^{b}The standardized coefficient = the unstandardized
coefficient multiplied by the standard deviation of the X variable and divided by the standard
deviation of the Y variable: β_{j} = b_{j}(SD_{X}/SD_{Y}).
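The conversion in footnote b is simple enough to compute directly. A minimal sketch in Python; the coefficients and standard deviations below are made-up illustrative values, not those from the Jackson study:

```python
# Convert unstandardized regression coefficients to standardized (beta)
# coefficients: beta_j = b_j * (SD of X_j) / (SD of Y).
# All numbers here are made-up illustrative values.

def standardize(b, sd_x, sd_y):
    """Return the standardized coefficient for one predictor."""
    return b * sd_x / sd_y

# Hypothetical model: Y predicted from age (years) and a 0/1 race dummy.
b_age, sd_age = 0.25, 10.0    # unstandardized coefficient and SD of age
b_race, sd_race = 2.0, 0.5    # dummy variables have small SDs
sd_y = 15.0                   # SD of the outcome

beta_age = standardize(b_age, sd_age, sd_y)     # about 0.167
beta_race = standardize(b_race, sd_race, sd_y)  # about 0.067
```

Although the race dummy has the larger unstandardized coefficient, age has the larger beta coefficient once the difference in scales is removed.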

Multiple R

Multiple R is the multiple-regression
analogue of the Pearson product moment correlation coefficient r; its square is called the coefficient
of multiple determination, but most authors use the shorter term. As
an example, suppose percent body fat is calculated for each person
in the study by Jackson and colleagues; then, the correlation between
predicted percent body fat and the actual percent body fat is calculated.
This correlation is the multiple R. If
the multiple R is squared (R^{2}), it measures how much
of the variation in actual percent body fat is accounted for
by knowing the information included in the regression equation.
The term R^{2} is interpreted
in exactly the same way as r^{2} in
simple correlation and regression, with 0 indicating no variance
accounted for and 1.00 indicating 100% of the variance
accounted for. Recall that in simple regression, the correlation between
the actual value Y of the dependent
variable and the predicted value, denoted Y',
is the same as the correlation between the dependent variable and
the independent variable; that is, r_{Y'Y} = r_{XY}. Thus, R and R^{2} in multiple regression
play the same role as r and r^{2} in simple regression. The
statistical test for R and R^{2}, however, uses the F distribution instead of the t distribution.

The computations are time-consuming, and fortunately, computers
do them for us. Jackson and colleagues included R^{2} in
Table 10–5 (although they used lowercase r^{2}); it was 0.81 for model
III (and is also shown in the NCSS output in Table 10–4).
After ln BMI, age, and race are entered into the regression equation, R^{2} = 0.81 indicates
that more than 80% of the variability in percent body fat
is accounted for by knowing patients' BMI, age, and race.
Because R^{2} is less than 1,
we know that factors other than those included in the study also
play a role in determining a person's percent body fat.
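The definition of multiple R as the correlation between predicted and actual outcomes is easy to verify numerically. A sketch in Python with a small fabricated dataset (not the Heritage Family Study data):

```python
import numpy as np

# Multiple R is the correlation between the actual outcome Y and the
# values Y' predicted by the regression; R^2 is its square.
# Fabricated dataset: outcome y depending on predictors x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=50)

# Fit y = a + b1*x1 + b2*x2 by least squares.
X = np.column_stack([np.ones(50), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

multiple_r = np.corrcoef(y, y_pred)[0, 1]   # multiple R
r_squared = multiple_r ** 2                 # variance accounted for
# Equivalently, R^2 = 1 - SS_residual / SS_total:
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
```

The two expressions for R² agree to rounding error, which is why the correlation definition and the "variance accounted for" interpretation are interchangeable.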

Selecting Variables
for Regression Models

The primary purpose of Jackson and colleagues in their study
of BMI and percent body fat was explanation; they used multiple
regression analysis to learn how specific characteristics confounded
the relationship between BMI and percent body fat. They also wanted
to know how the characteristics interacted with one another, such
as gender and race. Some research questions, however, focus on the
prediction of the outcome, such as using the regression equation
to predict percent body fat in future subjects.

Deciding on the variables that provide the best prediction is
a process sometimes referred to as model
building and is exemplified in Table 10–5. Selecting
the variables for regression models can be accomplished in several
ways. In one approach, all variables are introduced into the regression
equation, called the "enter" method in SPSS and
used in the multiple regression procedure in NCSS. Then, especially
if the purpose is prediction, the variables that do not have significant
regression coefficients are eliminated from the equation. The regression
equation may be recalculated using only the variables retained because
the regression coefficients have different values when some variables
are removed from the analysis.

Computer programs also contain routines to select an optimal
set of explanatory variables. One such procedure is called forward selection. Forward selection
begins with one variable in the regression equation; then, additional
variables are added one at a time until all statistically significant
variables are included in the equation. The first variable in the
regression equation is the X variable
that has the highest correlation with the response variable Y. The next X variable
considered for the regression equation is the one that increases R^{2} by the largest amount.
If the increment in R^{2} is
statistically significant by the F test,
it is included in the regression equation. This step-by-step procedure
continues until no X variables remain
that produce a significant increase in R^{2}.
The values for the regression coefficients are calculated, and the
regression equation resulting from this forward selection procedure
can be used to predict outcomes for future subjects. The increment
in R^{2} was calculated by Jackson
and colleagues; it is shown as r^{2} in
Table 10–5.
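The forward selection loop just described can be sketched in a few lines. The version below is simplified: it stops when the R² increment falls below a fixed threshold rather than applying the F test, and the data are fabricated:

```python
import numpy as np

def r2(X_cols, y):
    """R^2 from regressing y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + X_cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(cols, y, min_gain=0.01):
    """Greedy forward selection: at each step add the variable that
    increases R^2 the most; stop when the increase falls below
    min_gain. (A real analysis tests each increment with an F test.)"""
    chosen, current_r2, remaining = [], 0.0, list(range(len(cols)))
    while remaining:
        best_r2, best_j = max(
            (r2([cols[k] for k in chosen + [j]], y), j) for j in remaining)
        if best_r2 - current_r2 < min_gain:
            break
        chosen.append(best_j)
        remaining.remove(best_j)
        current_r2 = best_r2
    return chosen, current_r2

# Fabricated data: y depends on x0 and x2 but not on x1.
rng = np.random.default_rng(1)
cols = [rng.normal(size=80) for _ in range(3)]
y = 3.0 * cols[0] + 1.0 * cols[2] + rng.normal(scale=0.5, size=80)
selected, final_r2 = forward_select(cols, y)
```

With these data the procedure enters x0 first (it has the highest correlation with y), then x2, and leaves the noise variable x1 out.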

A similar backward elimination procedure
can also be used; in it, all variables are initially included in
the regression equation. The X variable
that would reduce R^{2} by the
smallest increment is removed from the equation. If the resulting
decrease is not statistically significant, that variable is permanently
removed from the equation. Next, the remaining X variables
are examined to see which produces the next smallest decrease in R^{2}. This procedure continues
until the removal of an X variable
from the regression equation causes a significant reduction in R^{2}. That X variable
is retained in the equation, and the regression coefficients are
calculated.

When features of both the forward selection and the backward
elimination procedures are used together, the method is called stepwise regression (stepwise selection).
Stepwise selection is commonly used in the medical literature; it
begins in the same manner as forward selection. After each addition
of a new X variable to the equation,
however, all previously entered X variables
are checked to see whether they maintain their level of significance.
Previously entered X variables are
retained in the regression equation only if their removal would
cause a significant reduction in R^{2}.
The forward versus backward versus stepwise procedures have subtle
advantages related to the correlations among the independent variables
that cannot be covered in this text. They do not generally produce
identical regression equations, but conceptually, all approaches
determine a "parsimonious" equation using a subset
of explanatory variables.

Some statistical programs examine all possible combinations of
predictor values and determine the one that produces the overall
highest R^{2}, such as All Possible
Regression in NCSS. We do not recommend this procedure, however,
and suggest that a more appealing approach is to build a model in
a logical way. Variables are sometimes grouped according to their
function, such as all demographic characteristics, and added to the
regression equation as a group or block; this process is often called hierarchical regression; see exercise
7 for an example. The advantage of a logical approach to building
a regression model is that, in general, the results tend to be more
stable and reliable and are more likely to be replicated in similar
studies.

Polynomial Regression

Polynomial regression is a special case of multiple regression
in which each term in the equation is a power of X. Polynomial
regression provides a way to fit a regression model to curvilinear
relationships and is an alternative to transforming the data to
a linear scale. For example, the following equation can be used
to predict a quadratic relationship:

Y′ = a + b_{1}X + b_{2}X^{2}

If linear and quadratic terms do not provide an adequate fit, a
cubic term, a fourth-power term, and so on, can also be included
until an adequate fit is obtained.
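Because each power of X is simply another column in a multiple regression, a quadratic fit requires nothing beyond ordinary least squares. A sketch with fabricated curvilinear data (the range loosely mimics BMI but the values are invented):

```python
import numpy as np

# Polynomial regression: fit Y' = a + b1*X + b2*X^2 by treating X and
# X^2 as two predictors in an ordinary multiple regression.
rng = np.random.default_rng(2)
x = rng.uniform(18, 40, size=60)                 # fabricated BMI-like range
y = 5 + 2.0 * x - 0.03 * x**2 + rng.normal(scale=0.5, size=60)

X = np.column_stack([np.ones_like(x), x, x**2])  # intercept, linear, quadratic
(a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x_new):
    """Predicted Y from the fitted quadratic."""
    return a + b1 * x_new + b2 * x_new**2
```

A negative b2, as recovered here, produces the concave-downward curve typical of relationships that rise and then level off.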

Jackson and colleagues (2002) used polynomial regression to fit
separate curves for men and women, illustrated in Figure 10–1.
Two approaches to polynomial regression can be used. The first method
calculates squared terms, cubic terms, and so on; these terms are
then entered one at a time using multiple regression. Another approach
is to use a program that permits curve fitting, such as the regression
curve estimation procedure in SPSS. We used the SPSS procedure to
fit a quadratic curve of BMI to percent body fat for women. The
regression equation was:

A plot produced by SPSS is given in Figure 10–2.

Figure 10–2.

Linear and quadratic curves for the relationship between
BMI and percent body fat in females. (Data, used with permission,
from Jackson AS, Stanforth PR, Gagnon J, Rankinen T, Leon AS, Rao
DC, et al: The effect of sex, age, and race on estimating percentage
body fat from body mass index: The Heritage Family Study. Int J Obes Relat Metab Disord 2002;26:789–796. Table produced
with SPSS Inc.; used with permission.)

Missing Observations

When studies involve several variables, some observations on
some subjects may be missing. Controlling the problem of missing
data is easier in studies in which information is collected prospectively;
it is much more difficult when information is obtained from already
existing records, such as patient charts. Two important factors
are the percentage of observations that are missing and whether
observations are missing at random or because of some causal
factor.

For example, suppose a researcher designs a case–control
study to examine the effect of leg length inequality on the incidence
of loosening of the femoral component after total hip replacement.
Cases are patients who developed loosening of the femoral component,
and controls are patients who did not. In reviewing the records
of routine follow-up, the researcher found that leg length inequality
was measured in some patients by using weight-bearing anterior–posterior
(AP) hip and lower extremity films, whereas other patients had measurements
taken using non-weight-bearing films. The type of film ordered during
follow-up may well be related to whether the patient complained
of hip pain; patients with symptoms were more likely to have received
the weight-bearing films, and patients without symptoms were more
likely to have had the routine non-weight-bearing films. A researcher
investigating this question must not base the leg length inequality
measures on weight-bearing films only, because controls are less
likely than cases to have weight-bearing film measures in their
records. In this situation, the missing leg length information occurred
because of symptoms and not randomly.

The potential for missing observations increases in studies involving
multiple variables. Depending on the cause of the missing observations,
solutions include dropping subjects who have missing observations
from the study, deleting variables that have missing values from
the study, or substituting some value for the missing data, such
as the mean or a predicted value, a process called imputation. SPSS has an option
to estimate missing data with the mean for that variable, calculated
from the subjects who have the data. The Data Screening procedure
(in Descriptive Statistics) in NCSS provides the option of substituting
either the mean or a predicted score. Investigators in this situation
should seek advice from a statistician on the best way to handle
the problem.
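Mean imputation, the simplest of the substitution options just described, can be sketched as follows (fabricated data; as noted above, a statistician should advise on whether any substitution is appropriate for a given study):

```python
# Mean imputation: replace each missing value (None) with the mean of
# the observed values for that variable. Fabricated data.

def impute_mean(values):
    """Return a copy of values with None replaced by the observed mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [34, None, 41, 29, None, 36]
completed = impute_mean(ages)   # the two missing ages become 35.0
```

A drawback worth noting: substituting the mean shrinks the variable's variance, which can bias correlations and regression coefficients when many values are missing.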

Cross Validation

The statistical procedures for all regression models are based
on correlations among the variables, which, in turn, are related
to the amount of variation in the variables included in the study.
Some of the observed variation in any variable, however, occurs
simply by chance; and the same degree of variation does not occur
if another sample is selected and the study is replicated. The mathematical
procedures for determining the regression equation cannot distinguish
between real and chance variation. If the equation is to be used
to predict outcomes for future subjects, it should therefore be
validated on a second sample, a process called cross
validation. The regression equation is used to predict the
outcome in the second sample, and the predicted outcomes are compared
with the actual outcomes; the correlation between the predicted
and actual values indicates how well the model fits. Cross-validating
the regression equation gives a realistic evaluation of the usefulness
of the prediction it provides.

In medical research we rarely have the luxury of cross-validating
the findings on another sample of the same size. Several alternative
methods exist. First, researchers can hold out a proportion of the
subjects for cross validation, perhaps 20% or 25%.
The holdout sample should be randomly selected from the entire sample
prior to the original analysis. The predicted outcomes in the holdout
sample are compared with the actual outcomes, often using R^{2} to judge how well the findings
cross-validate.
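The holdout approach can be sketched as follows; the 80/20 split and the data are illustrative assumptions:

```python
import numpy as np

# Holdout cross validation: fit the regression on ~80% of subjects and
# judge it by R^2 between predicted and actual outcomes in the held-out
# 20%. Fabricated data for illustration.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * x[:, 0] - 0.4 * x[:, 1] + rng.normal(scale=0.3, size=n)

holdout = rng.choice(n, size=n // 5, replace=False)   # random 20% held out
train = np.setdiff1d(np.arange(n), holdout)           # remaining 80%

X = np.column_stack([np.ones(n), x])                  # intercept + predictors
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

y_pred = X[holdout] @ coef                            # predict held-out subjects
r2_holdout = np.corrcoef(y[holdout], y_pred)[0, 1] ** 2
```

Randomly selecting the holdout sample before any analysis, as the text prescribes, is what makes the cross-validated R² an honest estimate.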

Another method is the jackknife, in
which one observation is left out of the sample, call it x_{1}; regression is performed
using the n – 1 observations,
and the results are applied to x_{1}.
Then this observation is returned to the sample, and another, x_{2}, is held out. This process
continues until there is a predicted outcome for each observation
in the sample; the predicted and actual outcomes are then compared.
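A minimal sketch of the jackknife for a simple regression, using fabricated data:

```python
import numpy as np

# Jackknife validation: leave each subject out in turn, fit the
# regression on the remaining n - 1 subjects, and predict the omitted
# subject. Fabricated data for illustration.
rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 2.0 + 1.2 * x + rng.normal(scale=0.4, size=n)
X = np.column_stack([np.ones(n), x])

loo_pred = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                  # leave observation i out
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_pred[i] = X[i] @ coef                 # predict the omitted subject

r_jack = np.corrcoef(y, loo_pred)[0, 1]       # predicted vs actual
```

Each subject's prediction comes from a model that never saw that subject, so r_jack is slightly lower, and more realistic, than the within-sample R.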

The bootstrap method works in a
similar manner, although the goal is different. The bootstrap can
be used with small samples to estimate standard errors and confidence
intervals. A sample of the same size as the original is drawn at random,
with replacement, from the observed data, and the statistic of interest
is calculated; this resampling is then repeated. After a fairly large
number of samples is analyzed, generally a
minimum of 200, standard errors and confidence intervals can be estimated.
In essence, the bootstrap method uses the data itself to determine
the sampling distribution rather than the central limit theorem
discussed in Chapter 4.
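A sketch of the bootstrap in its usual resampling-with-replacement form, estimating the standard error and a 95% percentile confidence interval for a mean (fabricated data):

```python
import numpy as np

# Bootstrap: draw many resamples of the same size as the original data,
# with replacement; the spread of the statistic across resamples gives
# its standard error and a percentile confidence interval.
rng = np.random.default_rng(5)
sample = rng.normal(loc=10.0, scale=2.0, size=30)   # fabricated data

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)                            # 1000 resamples
])

se_boot = boot_means.std(ddof=1)                    # bootstrap SE of the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile CI
```

As the text notes, the sampling distribution here comes from the data themselves rather than from the central limit theorem.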

Both the jackknife and bootstrap are called resampling methods;
they are very computer-intensive and require special software. Kline
and colleagues (2002) used a bootstrap method to develop confidence
intervals for odds ratios in their study of the use of the D-dimer test in the emergency department.

It is possible to estimate the magnitude of R or R^{2} in another sample without
actually performing the cross validation. This R^{2} is
smaller than the R^{2} for the
original sample, because the mathematical formula used to obtain
the estimate removes the chance variation. For this reason, the
formula is called a formula for shrinkage. Many
computer programs, including NCSS, SPSS, and SAS, provide both R^{2} for the sample used in
the analysis as well as R^{2} adjusted
for shrinkage, often referred to as the adjusted R^{2}.
Refer to Table 10–4 where NCSS gives the "Adj
R2" in the fifth row of the first column of the computer
analysis.
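A common form of the shrinkage adjustment, and a reasonable guess at the one behind the "Adj R2" output, is 1 – (1 – R²)(n – 1)/(n – k – 1), where k is the number of predictors. A sketch, using the study's R² of 0.81 with three predictors purely for illustration:

```python
# Adjusted R^2 ("shrinkage"): penalize R^2 for the number of predictors
# k relative to the sample size n, using the common formula
# 1 - (1 - R^2)(n - 1)/(n - k - 1).

def adjusted_r2(r2, n, k):
    """Return R^2 adjusted for shrinkage."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R^2 = 0.81 with 3 predictors (ln BMI, age, race):
large_n = adjusted_r2(0.81, 655, 3)   # 655 subjects: shrinks very little
small_n = adjusted_r2(0.81, 30, 3)    # 30 subjects: noticeably smaller
```

The comparison shows why the subjects-to-variables ratio matters: with a large sample the adjusted R² is nearly identical to R², but with few subjects the apparent fit shrinks appreciably.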

Sample Size
Requirements

The only easy way to determine how large a sample is needed in
multiple regression or any multivariate technique is to use a computer
program. Some rules of thumb, however, may be used for guidance.
A common recommendation by statisticians calls for ten times as
many subjects as the number of independent variables. For example,
this rule of thumb prescribes a minimum of 60 subjects for a study
predicting the outcome from six independent variables. Having a
large ratio of subjects to variables decreases problems that may
arise because assumptions are not met.

Assumptions about normality in multiple regression are complicated,
depending on whether the independent variables are viewed as fixed
or random (as in fixed-effects model or random-effects model in
ANOVA), and they are beyond the scope of this text. To ensure that
estimates of regression coefficients and multiple R and R^{2} are
accurate representatives of actual population values, we suggest
that investigators never perform regression without at least five
times as many subjects as variables.

A more accurate estimate is found by using a computer power program.
We used the PASS power program to find the power of a study using
five predictor variables, as in the Jackson study (Table 10–5).
We posed the question: How many subjects are needed to test whether
a given variable increases R^{2} by
0.05, given that four variables are already in the regression equation
and they collectively provide an R^{2} of
0.50? The output from the program is shown in Box 10–1.
The power table indicates that a sample of 80 gives power of 0.84,
assuming an α or P value
of 0.05. The accompanying graph shows the power curve for different
sample sizes and different values of α. As you
can see, the sample of 359 females and 296 males in the study by
Jackson and colleagues was more than adequate for the regression
model.

Box 10–1. Linear
and Quadratic Curves for the Relationship between BMI and Percent
Body Fat in Females.


Multiple-Regression Power Analysis

                                      Ind. Variables    Ind. Variables
                                          Tested          Controlled
Power       N     Alpha     Beta      Cnt    R^{2}      Cnt    R^{2}
0.53528    40    0.05000   0.46472     1    0.05000      4    0.50000
0.63511    50    0.05000   0.36489     1    0.05000      4    0.50000
0.71765    60    0.05000   0.28235     1    0.05000      4    0.50000
0.78431    70    0.05000   0.21569     1    0.05000      4    0.50000
0.83709    80    0.05000   0.16291     1    0.05000      4    0.50000
0.87818    90    0.05000   0.12182     1    0.05000      4    0.50000
0.90973   100    0.05000   0.09027     1    0.05000      4    0.50000
0.93366   110    0.05000   0.06634     1    0.05000      4    0.50000
0.95160   120    0.05000   0.04840     1    0.05000      4    0.50000

Summary Statements

A sample size of
80 achieves 84% power to detect an R-Squared of 0.05000
attributed to 1 independent variable(s) using an F-Test with a significance
level (alpha) of 0.05000. The variables tested are adjusted for an
additional 4 independent variable(s) with an R-Squared of 0.50000.

Source: Data, used with
permission, from Jackson AS, Stanforth PR, Gagnon J, Rankinen T,
Leon AS, Rao DC: The effect of sex, age and race on estimating percentage
body fat from body mass index: The Heritage Family Study. Int J
Obes Relat Metab Disord 2002;26:789–796. Output produced with
PASS; used with permission.

Controlling
for Confounding

Analysis of
Covariance

Analysis of covariance (ANCOVA) is
the statistical technique used to control for the influence of a
confounding variable. Confounding variables occur
most often when subjects cannot be assigned at random to different
groups, that is, when the groups of interest already exist. Gonzalo
and colleagues (1996) (Chapters 7 and 8) predicted insulin sensitivity
from body mass index (BMI); they wanted to control for age of the
women and did so by adding age to the regression equation. When
BMI alone is used to predict insulin sensitivity (IS) in hyperthyroid
women, the regression equation is

IS′ = 2.336 – 0.077(BMI)

where IS is the insulin sensitivity level. Using this equation,
a hyperthyroid woman's insulin sensitivity level is predicted
to decrease by 0.077 for each increase of 1 in BMI. For instance,
a woman with a BMI of 25 has a predicted insulin sensitivity of
0.411. What would happen, however, if age were also related to insulin
sensitivity? A way to control for the possible confounding effect
of age is to include that variable in the regression equation. The
equation with age included is

IS′ = 2.291 – 0.068(BMI) – 0.0045(Age)

Using this equation, a hyperthyroid woman's insulin
sensitivity level is predicted to decrease by 0.068 for each increase
of 1 in BMI, holding age constant or independent of age. A 30-year-old woman
with a BMI of 25 has a predicted insulin sensitivity of 0.456, whereas
a 60-year-old woman with the same BMI of 25 has a predicted insulin
sensitivity of 0.321.

A more traditional use of ANCOVA is illustrated by a study of
the negative influence of smoking on the cardiovascular system.
Investigators wanted to know whether smokers have more ventricular
wall motion abnormalities than nonsmokers (Hartz et al, 1984). They
might use a t test to determine whether
the mean number of wall motion abnormalities differs in these two groups.
The investigators know, however, that wall motion abnormalities
are also related to the degree of coronary stenosis, and smokers
generally have a greater degree of coronary stenosis. Thus, any
difference observed in the mean number of wall abnormalities between
smokers and nonsmokers may really be a difference in the amount
of coronary stenosis between these two groups of patients.

This situation is illustrated in the graph of hypothetical data
in Figure 10–3; in the figure, the relationship between
occlusion scores and wall motion abnormalities appears to be the
same for smokers and nonsmokers. Nonsmokers, however, have both
lower occlusion scores and lower numbers of wall motion abnormalities;
smokers have higher occlusion scores and higher numbers of wall
motion abnormalities. The question is whether the difference in
wall motion abnormalities is due to smoking, to occlusion, or to
both.

Figure 10–3.

Relationship between degree of coronary stenosis and
ventricular wall motion abnormalities in smokers and nonsmokers
(hypothetical data).

In this study, the investigators must control for the degree
of coronary stenosis so that it does not confound (or confuse) the
relationship between smoking and wall motion abnormalities. Useful methods
to control for confounding variables are analysis of covariance
(ANCOVA) and the Mantel–Haenszel chi-square procedure.
Table 10–2 indicates that ANCOVA is appropriate when the dependent variable is
numerical (eg, wall motion), the independent measures are grouping
variables on a nominal scale (eg, smoking versus nonsmoking), and
confounding variables are present (eg, degree of coronary occlusion).
If the dependent measure is also nominal, such as whether a patient
has survived to a given time, the Mantel–Haenszel chi-square
discussed in Chapter 9 can be used to control for the effect of
a confounding (nuisance) variable. ANCOVA can be performed by using
the methods of ANOVA; however, most medical studies use one of the
regression methods discussed in this chapter.

If ANCOVA is used in this example, the occlusion score is called
the covariate, and the mean number
of wall motion abnormalities in smokers and nonsmokers is said to
be adjusted for the occlusion score
(or degree of coronary stenosis). Put another way, ANCOVA simulates
the Y outcome observed if the value
of X is held constant, that is, if
all the patients had the same degree of coronary stenosis. This
adjustment is achieved by calculating a regression equation to predict
mean number of wall motion abnormalities from the covariate, degree
of coronary stenosis, and from a dummy variable coded 1 if the subject
is a member of the group (ie, a smoker) and 0 otherwise. For example,
the regression equation determined for the hypothetical observations
in Figure 10–3 is

The equation illustrates that smokers have a larger number of
predicted wall motion abnormalities, because 1.28 is added to the
equation if the subject is a smoker. The equation can be used to obtain
the mean number of wall motion abnormalities in each group, adjusted
for degree of coronary stenosis.

If the relationship between coronary stenosis and ventricular
motion is ignored, the mean number of wall motion abnormalities,
calculated from the observations in Figure 10–3, is 3.33
for smokers and 1.00 for nonsmokers. If, however, ANCOVA is used
to control for degree of coronary stenosis, the adjusted mean wall
motion is 2.81 for smokers and 1.53 for nonsmokers, a difference of
1.28, represented by the regression coefficient for the dummy variable
for smoking. In ANCOVA, the adjusted Y mean
for a given group is obtained by (1) finding the difference between
the group's mean on the covariate variable X, denoted X̄_{j}, and the grand mean X̄; (2)
multiplying the difference by the regression coefficient b; and (3)
subtracting this product from the unadjusted mean. Thus, for group j, the adjusted mean is

Adj. Ȳ_{j} = Ȳ_{j} – b(X̄_{j} – X̄)

(See Exercise 1.)
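The three steps just described can be written directly. The slope and covariate means below are made-up values in the spirit of the smoking example, not the numbers behind Figure 10–3:

```python
# ANCOVA-adjusted group mean: (1) difference between the group's
# covariate mean and the grand covariate mean, (2) multiplied by the
# regression coefficient, (3) subtracted from the unadjusted mean.

def adjusted_mean(y_mean, b, x_mean_group, x_mean_grand):
    """Return the group mean on Y adjusted for the covariate X."""
    return y_mean - b * (x_mean_group - x_mean_grand)

# Made-up values: slope of 0.5 abnormalities per unit of occlusion
# score; smokers average 1 unit above the grand covariate mean,
# nonsmokers 1 unit below.
smokers_adj = adjusted_mean(3.33, 0.5, x_mean_group=6.0, x_mean_grand=5.0)
nonsmokers_adj = adjusted_mean(1.00, 0.5, x_mean_group=4.0, x_mean_grand=5.0)
```

Adjustment pulls the group means toward each other, so the remaining difference is the part attributable to group membership rather than to the covariate.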

This result is consistent with our knowledge that coronary stenosis
alone has some effect on abnormality of wall motion; the unadjusted
means contain this effect as well as any effect from smoking. Controlling
for the effect of coronary stenosis therefore results in a smaller
difference in number of wall motion abnormalities, a difference
related only to smoking.

Using hypothetical data, Figure 10–4 illustrates schematically
the way ANCOVA adjusts the mean of the dependent variable if the
covariate is important. Using unadjusted means is analogous to using
a separate regression line for each group. For example, the mean
value of Y for group 1 is found by
using the regression line drawn through the group 1 observations
to project the mean value X̄_{1} onto
the Y-axis, denoted Ȳ_{1} in
Figure 10–4. Similarly, the mean of group 2 is found at Ȳ_{2} by using the
regression line to project the mean X̄_{2} in
that group. The Y means in each group adjusted for the covariate (stenosis)
are analogous to the projections based on the overall mean value
of the covariate; that is, as though the two groups had the same
mean value for the covariate. The adjusted means for groups 1 and
2, Adj. Ȳ_{1} and Adj. Ȳ_{2}, are illustrated
by the dotted-line projections of the grand mean X̄ from
each separate regression line in Figure 10–4.

Figure 10–4.

Illustration of means adjusted using analysis of covariance.

ANCOVA assumes that the relationship between the covariate (X variable) and the dependent variable
(Y) is the same in both groups, that
is, that any relationship between coronary stenosis and wall motion
abnormality is the same for smokers and nonsmokers. This assumption
is equivalent to requiring that the regression slopes be the same
in both groups; geometrically, ANCOVA asks whether a difference
exists between the intercepts, assuming the slopes are equal.

ANCOVA is an appropriate statistical method in many situations
that occur in medical research. For example, age is a variable that
affects almost everything studied in medicine; if preexisting groups
in a study have different age distributions, investigators must
adjust for age before comparing the groups on other variables, just
as Gonzalo and colleagues recognized. The methods illustrated in
Chapter 3 to adjust mortality rates for characteristics such as
age and birth weight are used when information is available on groups
of individuals; when information is available on individuals themselves,
ANCOVA is used.

Before leaving this section, we point out some important aspects
of ANCOVA. First, although only two groups were included in the
example, ANCOVA can be used to adjust for the effect of a confounding
variable in more than two groups. In addition, it is possible to
adjust for more than one confounding variable in the same study,
and the confounding variables may be either nominal or numerical.
Thus, it is easy to see why the multiple regression model for analysis
of covariance provides an ideal method to incorporate confounding
variables.

Finally, ANCOVA can be considered as a special case of the more
general question of comparing two regression lines (discussed in
Chapter 8). In ANCOVA, we assume that the slopes are equal, and
attention is focused on the intercept. We can also perform the more
global test of both slope and intercept, however, by using multiple
regression. In Presenting Problem 4 in Chapter 8 on insulin sensitivity
(Gonzalo et al, 1996), interest focused on comparing the regression
lines predicting insulin activity from body mass index (BMI) in
women who had normal versus elevated thyroid levels. ANCOVA can
be used for this comparison using dummy coding. If we let X be BMI, Y be
insulin sensitivity level, and Z be
a dummy variable, where Z = 1
if the woman is hyperthyroid and Z = 0
for controls, then the multiple-regression model for testing whether
the two regression lines are the same (coincident) is

Y′ = a + b_{1}X + b_{2}Z + b_{3}XZ

The regression lines have equal slopes and are parallel when b_{3} is 0, that is, when there is no interaction
between the independent variable X and
the group membership variable Z. The
regression lines have equal intercepts and equal slopes (are coincident)
if both b_{2} and b_{3} are 0; thus, the model
becomes the simple regression equation Y = a + bX.
The statistical test for b_{2} and b_{3} is the t test
discussed in the section titled, "Statistical Tests for
the Regression Coefficient."

Generalized
Estimating Equations (GEE)

Many research designs, including both observational studies and
clinical trials, concern observations that are clustered or hierarchical.
A group of methods has been developed for these special situations.
To illustrate, a study to examine the effect of different factors
on complication rates following total knee arthroplasty was undertaken
in a province of Canada (Kreder et al, 2003). Outcomes included
length of hospital stay, inpatient complications, and mortality.
Can the researchers examine the outcomes for patients and conclude
that any differences are due to the risk factors? The statistical methods
we have examined thus far assume that one observation is independent
from another. The problem with this study design, however, is that
the outcome for patients operated on by the same surgeon may be
related to factors other than the surgical method, such as the skill
level of the surgeon. In this situation, patients are said to be
nested within physicians.

Many other examples come to mind. Comparing the efficacy of medical
education curricula is difficult because students are nested within
medical schools. Comparing health outcomes for children within a
community is complicated by the fact that children are nested within
families. Many clinical trials create nested situations, such as
when trials are carried out in several medical centers. The issue
arises of how to define the unit of analysis—should it
be the students or the school? the children or the families? the
patients or the medical center?

The group of methods that accommodates these types of research
questions includes generalized estimating equations (GEE), multilevel
modeling, and the analysis of hierarchically structured data. Most
of these methods have been developed within the last decade and
statistical software is just now becoming widely available. In addition
to some specialized statistical packages, SAS, Stata, and SPSS contain
procedures to accommodate hierarchical data. Using these models
is more complex than some of the other methods we have discussed,
and it is relatively easy to develop a model that is meaningless
or misleading. Investigators who have research designs that involve
nested subjects should consult a biostatistician for assistance.

Predicting Nominal
or Categorical Outcomes

In the regression model discussed in the previous section, the
outcome or dependent Y variable is
measured on a numerical scale. When the outcome is measured on a
nominal scale, other approaches must be used. Table 10–2
indicates that several methods can be used to analyze problems with
several independent variables when the dependent variable is nominal.
First we discuss logistic regression, a method that is frequently
used in the health field. One reason for the popularity of logistic
regression is that many outcomes in health are nominal, actually
binary, variables—they either occur or do not occur. The
second reason is that the regression coefficients obtained in logistic
regression can be transformed into odds ratios. So, in essence,
logistic regression provides a way to obtain an odds ratio for a
given risk factor that controls for, or is adjusted for, confounding
variables; in other words, we can do analysis of covariance with
logistic regression as well as with multiple linear regression.

Other methods are log-linear analysis and several methods that
attempt to classify subjects into groups. These methods appear occasionally
in the medical literature, and we provide a brief illustration,
primarily so that readers can have an intuitive understanding of
their purpose. The classification methods are discussed in the section
titled "Methods for Classification."

Logistic Regression

Logistic regression is commonly
used when the independent variables include both numerical and nominal
measures and the outcome variable is binary (dichotomous). Logistic
regression can also be used when the outcome has more than two values
(Hosmer and Lemeshow, 2000), but its most frequent use is as in
Presenting Problem 2, which illustrates the use of logistic regression
to identify trauma patients who are alcohol-positive, a yes-or-no
outcome. Soderstrom and his coinvestigators (1997) wanted to develop
a model to help emergency department staff identify the patients
most likely to have blood alcohol concentrations (BAC) in excess
of 50 mg/dL at the time of admission. The logistic model
gives the probability that the outcome, such as high BAC, occurs
as an exponential function of the independent variables. For example,
with three independent variables, the model is

P(Y) = 1 / {1 + exp[–(b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3})]}

where b_{0} is the intercept; b_{1}, b_{2},
and b_{3} are the regression
coefficients; and exp indicates that the base of the natural logarithm
(2.718) is taken to the power shown in parentheses (ie, the antilog).
The equation can be derived by specifying the variables to be included
in the equation or by using a variable selection method similar
to the ones for multiple regression. A chi-square test (instead
of the t or F test)
is used to determine whether a variable adds significantly to the
prediction.

In the study described in Presenting Problem 2, the variables
used by the investigators to predict blood alcohol concentrations
included the variables listed in Table 10–7. The investigators
coded the values of the independent variables as 0 and 1, a method
useful both for dummy variables in multiple regression and for variables
in logistic regression. This practice makes it easy to interpret the
odds ratio. In addition, if a goal is to develop a score, as is
the case in the study by Soderstrom and coinvestigators, the coefficient
associated with a given variable needs to be included in the score
only if the patient has a 1 on that variable. For instance, if patients
are more likely to have BAC ≥50 mg/dL
on weekends, the score associated with day of week is not included
if the injury occurs on a weekday.

Table 10–7. Variables, Codes, and Frequencies for Variables.^{a}

Variable                       Category           Value   Frequency
Age                            39 or younger      0       3514
                               40 or older        1       1534
Time of day                    6 PM–6 AM          0       2601
                               6 AM–6 PM          1       2447
Day of week                    Monday–Thursday    0       2642
                               Friday–Sunday      1       2406
Sex                            Female             0       1457
                               Male               1       3591
Race                           Non-Caucasian      0       1758
                               Caucasian          1       3290
Injury type                    Unintentional      0       3966
                               Intentional        1       1082
Blood alcohol concentration    <50 mg/dL          0       4067
                               ≥50 mg/dL          1       1465

^{a}Not all totals are the same because of missing
data on some variables.

Source: Data, used with permission,
from Soderstrom CA, Kufera JA, Dischinger PC, Kerns TJ, Murphy JG,
Lowenfels A: Predictive model to identify trauma patients with blood
alcohol concentrations ≥50 mg/dL. J Trauma 1997;42:67–73.

The investigators calculated logistic regression equations for
each of four groups: males with intentional injury, males with unintentional
injury, females with intentional injury, and females with unintentional
injury. The results of the analysis on males who were injured unintentionally are
given in Table 10–8.

Table 10–8. Logistic Regression Report for Men with Unintentional Injury.^{a}

Filter: sex = 1; injtype = 0
Response: BAC50

Parameter Estimation Section

Variable    Regression Coefficient   Standard Error   χ^{2}     Probability Level   Last R^{2}
Intercept   –0.7960357               0.1188189        44.88     0.000000            0.016780
Daytime     –1.8445640               0.1062133        301.60    0.000000            0.102879
Weekday     0.6622602                0.0975930        46.05     0.000000            0.017208
Race        0.2780667                0.1125357        6.11      0.013477            0.002316
Age ≥40     –0.1198371               0.1082209        1.23      0.268148            0.000466

Odds Ratio Estimation Section

Variable    Regression Coefficient   Standard Error   Odds Ratio   Lower 95% Confidence Limit   Upper 95% Confidence Limit
Intercept   –0.796036                0.118819
Daytime     –1.844564                0.106213         0.158094     0.128383                     0.194682
Weekday     0.662260                 0.097593         1.939170     1.601565                     2.347942
Race        0.278067                 0.112536         1.320574     1.059186                     1.646468
Age ≥40     –0.119837                0.108221         0.887065     0.717526                     1.096663

Model Summary Section

Model R^{2}   Model df   Model χ^{2}   Model Probability
0.141881      4          434.84        0.000000

Classification Table

                              Predicted
Actual                   0          1          Total
0     Count              1751.00    202.00     1953.00
      Row percent        89.66      10.34      100.00
      Column percent     80.88      42.98      74.12
1     Count              414.00     268.00     682.00
      Row percent        60.70      39.30      100.00
      Column percent     19.12      57.02      25.88
Total Count              2165.00    470.00     2635.00
      Row percent        82.16      17.84
      Column percent     100.00     100.00

Percent correctly classified = 76.62

^{a}Results from logistic regression for men with
unintentional injury.

Source: Data, used with permission,
from Soderstrom CA, Kufera JA, Dischinger PC, Kerns TJ, Murphy JG,
Lowenfels A: Predictive model to identify trauma patients with blood
alcohol concentrations ≥50 mg/dL. J Trauma 1997;42:67–73.
Output produced using NCSS; used with permission.

We need to know which value is coded 1 and which 0 in order to
interpret the results. For example, time of day has a negative regression
coefficient. The hours of 6 AM–6 PM are coded as 1, so a male coming
to the emergency department with unintentional injuries in the daytime
is less likely to have BAC ≥50 mg/dL than a male with unintentional
injuries at night. The age variable is not significant (P = 0.268).
Interpreting the equation for the other variables indicates
that males with unintentional injuries who come to the emergency
department at night and on weekends and are Caucasian are more likely
to have elevated blood alcohol levels.

The logistic equation can be used to find the probability for
any given individual. For instance, let us find the probability
that a 27-year-old Caucasian man who comes to the emergency department at
2 PM on Thursday has BAC ≥50 mg/dL. Using the regression coefficients
from Table 10–8, the linear combination is

–0.796 – 1.845(Daytime) + 0.662(Weekday) + 0.278(Race) – 0.120(Age ≥40)

and we evaluate it as follows: Daytime = 1 (2 PM), Weekday = 0
(Thursday), Race = 1 (Caucasian), and Age ≥40 = 0 (age 27), so

–0.796 – 1.845 + 0 + 0.278 – 0 = –2.36

Substituting –2.36 in the equation for the probability:

P = 1/[1 + exp(2.36)] = 1/(1 + 10.6) ≈ 0.09

Therefore, the chance that this man has a high BAC is less than
1 in 10. See Exercise 3 to determine the likelihood of a high BAC
if the same man came to the emergency department on a Saturday night.
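This arithmetic is easy to check in a few lines of code. The sketch below (coefficient values rounded from Table 10–8; the function name is ours) computes the predicted probability for this patient:

```python
import math

def logistic_probability(z):
    """Convert the linear combination z into a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

# Regression coefficients for men with unintentional injury (Table 10-8, rounded)
intercept, daytime, weekday, race, age40 = -0.796, -1.845, 0.662, 0.278, -0.120

# 27-year-old Caucasian man at 2 PM (Daytime = 1) on a Thursday (Weekday = 0)
z = intercept + daytime * 1 + weekday * 0 + race * 1 + age40 * 0
p = logistic_probability(z)
print(round(z, 2), round(p, 2))  # prints: -2.36 0.09
```

Changing `weekday` from 0 to 1 reproduces the Saturday-night scenario posed in Exercise 3.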

One advantage of logistic regression is that it requires no assumptions
about the distribution of the independent variables. Another is
that the regression coefficient can be interpreted in terms of relative
risks in cohort studies or odds ratios in case–control
studies. In other words, the relative risk of an elevated BAC in
males with unintentional trauma during the day is exp (–1.845) = 0.158. The
relative risk for night is the reciprocal, 1/0.158 = 6.33;
therefore, males with unintentional injuries who come to the ER
at night are more than six times more likely to have BAC ≥50
mg/dL than males coming during the day.
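The odds ratios and confidence limits reported in Table 10–8 can be reproduced from the coefficients and standard errors. A minimal sketch, assuming the usual normal-approximation interval exp(b ± 1.96 × SE):

```python
import math

def odds_ratio_ci(b, se, z_crit=1.96):
    """Odds ratio exp(b) with its 95% confidence interval exp(b +/- z_crit*se)."""
    return math.exp(b), math.exp(b - z_crit * se), math.exp(b + z_crit * se)

# Daytime coefficient and standard error (Table 10-8)
or_day, lo, hi = odds_ratio_ci(-1.844564, 0.106213)
print(round(or_day, 3), round(lo, 3), round(hi, 3))  # prints: 0.158 0.128 0.195
print(round(1 / or_day, 2))  # reciprocal = nighttime risk, prints: 6.33
```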

How can readers easily tell which odds ratios are statistically
significant? Recall from Chapter 8 that if the 95% confidence
interval does not include 1, we can be 95% sure that the
factor associated with the odds ratio either is a significant risk
or provides a significant level of protection. Do any of the independent
variables in Table 10–8 have a 95% confidence
interval for the odds ratio that contains 1? Did you already know
without looking that it would be age because the age variable is
not statistically significant?

The overall results from a logistic regression may be tested
with Hosmer and Lemeshow's goodness
of fit test. The test is based on the chi-square distribution.
A P value greater than 0.05 means
that the model's estimates fit the data at an acceptable
level.

There is no straightforward statistic for judging the overall logistic
model in the way that R^{2} is used in multiple
regression. Some statistical programs give R^{2},
but it cannot be interpreted as in multiple regression because the
predicted and observed outcomes are nominal. Several other statistics
are available as well, including Cox and Snell's R^{2} and a modification called
Nagelkerke's R^{2},
which is generally larger than Cox and Snell's R^{2}.

Before leaving the topic of logistic regression, it is worthwhile
to inspect the classification table in Table 10–8. This
table gives the actual and the predicted number of males with unintentional
injuries who had normal versus elevated BAC. The logistic equation
tends to underpredict those with elevated concentrations: 470 males
are predicted versus the 682 who actually had BAC ≥50
mg/dL. Overall, the prediction using the logistic equation
correctly classified 76.62% of these males. Although this
sounds rather impressive, it is important to compare this percentage
with the baseline: 74.12% of the time we would be correct
if we simply predicted every male to have normal BAC. Can you recall
an appropriate way to take this baseline into consideration?
Although computer programs typically do not provide the kappa statistic,
discussed in Chapter 5, it provides a way to evaluate the percentage
correctly classified (see Exercise 4). Other measures of association
are used so rarely in medicine that we did not discuss them in Chapter
8. SPSS provides two nonparametric correlations, the lambda correlation
and the tau correlation, that can be interpreted as measures of
strength of the relationship between observed and predicted outcomes.
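As a sketch of how kappa adjusts the percent correctly classified for chance agreement (counts taken from the classification table in Table 10–8; kappa itself is defined in Chapter 5):

```python
# Classification counts from Table 10-8: rows = actual, columns = predicted
table = [[1751, 202],   # actual 0 (BAC < 50 mg/dL)
         [414, 268]]    # actual 1 (BAC >= 50 mg/dL)

n = sum(sum(row) for row in table)                  # 2635 men
observed = (table[0][0] + table[1][1]) / n          # proportion correctly classified

# Chance agreement: product of the matching row and column marginal proportions
row_tot = [sum(row) for row in table]
col_tot = [table[0][j] + table[1][j] for j in range(2)]
expected = sum(row_tot[i] * col_tot[i] for i in range(2)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(round(observed, 4), round(kappa, 2))  # prints: 0.7662 0.32
```

The modest kappa shows that much of the 76.62% accuracy would be achieved by chance alone, given the 74.12% baseline.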

Log-Linear Analysis

Psoriasis, a chronic, inflammatory skin disorder characterized
by scaling erythematous patches and plaques of skin, has a strong
genetic influence—about one third of patients have a positive family
history. Stuart and colleagues (2002) conducted a study to determine
differences in clinical manifestation between patients with positive
and negative family histories of psoriasis and with early-onset
versus late-onset disease. This study was used in Exercise 7 in
Chapter 7.

The hypothesis was that the variables age at onset (in 10-year
categories), onset (early or late), and familial status (sporadic
or familial) had no effect on the occurrence of joint complaints. Results
from the analysis of age, familial status, and frequency of joint
complaints are given in Table 10–9.

Table 10–9. Frequency of Joint Complaints by Familial Status, Stratified by Age at Examination.

                              Joint Complaints (%)
Age at Examination (years)    Sporadic   Familial   Relative Risk   Pearson χ^{2}   P-value
0–20                          0.0        0.0        —               —               —
21–30                         14.6       19.4       1.41            0.32            0.57
31–40                         26.6       40.9       1.91            3.33            0.068
41–50                         20.5       26.8       1.42            0.43            0.51
51–60                         47.1       47.3       1.01            0.00024         0.99
>60                           28.6       45.3       2.07            0.72            0.40
Total                         19.7       34.0       2.09            13.32           0.00026

Source: Reproduced, with
permission, from Stuart P, Malick F, Nair RP, Henseler T, Lim HW,
Jenisch S, et al: Analysis of phenotypic variation in psoriasis
as a function of age at onset and family history. Arch Dermatol
Res 2002;294:207–213.

Each independent variable in this research problem is measured
on a categorical or nominal scale (age, onset, and familial status),
as is the outcome variable (occurrence of joint complaints). If only
two variables are being analyzed, the chi-square method introduced
in Chapter 6 can be used to determine whether a relationship exists
between them; with three or more nominal or categorical variables,
a statistical method called log-linear analysis is appropriate. Log-linear analysis is analogous to
a regression model in which all the variables, both independent
and dependent, are measured on a nominal scale. The technique is
called log-linear because it involves using the logarithm of the
observed frequencies in the contingency table.
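A minimal sketch of the idea, using a hypothetical 2 × 2 × 2 table (the counts are ours, not the study's) and the log-linear model of mutual independence; the likelihood-ratio statistic G² is computed from the logs of the observed and expected frequencies:

```python
import math
from itertools import product

# Hypothetical 2x2x2 table of counts, indexed (familial, early_onset, joint);
# the numbers are illustrative only, not the study's data
counts = dict(zip(product(range(2), repeat=3), [40, 10, 30, 20, 25, 15, 20, 40]))

n = sum(counts.values())
# One-way marginal totals for each of the three variables
marg = [[sum(c for key, c in counts.items() if key[v] == lvl) for lvl in range(2)]
        for v in range(3)]

# Expected counts under mutual independence: E = n * p_i * p_j * p_k,
# so log E is a linear (additive) function of the marginal terms
g2 = 0.0
for (i, j, k), obs in counts.items():
    e = n * (marg[0][i] / n) * (marg[1][j] / n) * (marg[2][k] / n)
    g2 += 2 * obs * math.log(obs / e)  # likelihood-ratio chi-square G^2

print(round(g2, 1))  # compared with a chi-square distribution, df = 4 here
```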

Stuart and colleagues (2002) concluded that joint complaints
and familial psoriasis were conditionally independent given age
at examination, but that age at examination was not independent
of either joint complaints or a family history.

Log-linear analysis may also be used to analyze multidimensional
contingency tables in situations in which no distinction exists
between independent and dependent variables, that is, when investigators
simply want to examine the relationship among a set of nominal measures.
The fact that log-linear analysis does not require a distinction
between independent and dependent variables points to a major difference
between it and other regression models—namely, that the
regression coefficients are not interpreted in log-linear analysis.

Predicting a
Censored Outcome: Cox Proportional Hazard Model

In Chapter 9, we found that special methods must be used when
an outcome has not yet been observed for all subjects in the study
sample. Studies of time-limited outcomes in which there are censored
observations, such as survival, naturally fall into this category;
investigators usually cannot wait until all patients in the study
experience the event before presenting information.

Many times in clinical trials or cohort studies, investigators
wish to look at the simultaneous effect of several variables on
length of survival. For example, in the study described in Presenting Problem
3, Crook and her colleagues (1997) wanted to evaluate the relationship
of pretreatment prostate-specific antigen (PSA) and posttreatment
nadir PSA on the failure pattern of radiotherapy for treating localized
prostate carcinoma. They categorized failures as biochemical, local,
and distant. They analyzed data from a cohort study of 207 patients,
but only 68 had a failure due to any cause in the 70 months during
which the study was underway. These 68 observations on failure were
therefore censored. The independent variables they examined included
the Gleason score, the T classification, whether the patient had
received hormonal treatment, the PSA before treatment, and the lowest
PSA following treatment.

Table 10–2 indicates that the regression technique developed
by Cox (1972) is appropriate when time-dependent censored observations
are included. This technique is called the Cox
regression, or proportional hazard,
model. In essence this model allows the covariates (independent
variables) in the regression equation to vary with time. The dependent
variable is the survival time of the jth
patient, denoted Y_{j}. Both
numerical and nominal independent variables may be used in the model.

The Cox regression coefficients can be used to determine the
relative risk or odds ratio (introduced in Chapter 3) associated
with each independent variable and the outcome variable, adjusted for
the effect of all other variables in the equation. Thus, instead
of giving adjusted means, as ANCOVA does in regression, the Cox
model gives adjusted relative risks. We can also use a variety of
methods to select the independent variables that add significantly
to the prediction of the outcome, as in multiple regression; however,
a chi-square test (instead of the F test)
is used to test for significance.

The Cox proportional hazard model involves a complicated exponential
equation (Cox, 1972). Although we will not go into detail about
the mathematics involved in this model, its use is so common in
medicine that an understanding of the process is needed by readers
of the literature. Our primary focus is on the application and interpretation
of the Cox model.

Understanding
the Cox Model

Recall from Chapter 9 that the survival function gives the probability
that a person will survive the next interval of time, given that
he or she has survived up until that time. The hazard function, also
defined in Chapter 9, is in some ways the opposite: it is the probability
that a person will die (or that there will be a failure) in the
next interval of time, given that he or she has survived until the
beginning of the interval. The hazard function plays a key role
in the Cox model.

The Cox model examines two pieces of information: the amount
of time since the event first happened to a person and the person's
observations on the independent variables. Using the Crook example,
the amount of time might be 3 years, and the observations would
be the patient's Gleason score, T classification, whether
he had been treated with hormones, and the two PSA scores (pretreatment
and lowest posttreatment). In the Cox model, the length of time
is evaluated using the hazard function, and the linear combination
of the independent values (like the linear combination we obtain
when we use multiple regression) is the exponent of the natural
logarithm, e. For example, for the
Crook study, the model is written as

h(t) = h_{0}(t) × exp(b_{1}·Gleason + b_{2}·Tstage + b_{3}·Hormones + b_{4}·PrePSA + b_{5}·NadirPSA)

In words, the model says that the probability of dying in
the next time interval, given that the patient has lived until this
time and has the given values for Gleason score, T classification,
and so on, can be found by multiplying the baseline hazard (h_{0})
by e, the base of the natural logarithm, raised
to the power of the linear combination of the independent variables.
In other words, a given person's probability of dying is
influenced by how commonly patients die and by that person's
individual characteristics. If we take the antilog of the linear
combination, we multiply rather than add the values of the covariates.
In this model, the covariates have a multiplicative, or proportional,
effect on the probability of dying—thus, the term "proportional
hazard" model.
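A rough numerical illustration of this proportional form (coefficients from Table 10–10; the patient profile is hypothetical, and only the T3–4 indicator of the three tumor-stage dummy variables is shown):

```python
import math

# Cox model: h(t) = h0(t) * exp(b1*x1 + ... + bk*xk); coefficients from Table 10-10
coef = {"GSCORE": 0.4420, "TUMSTAGE3": 1.5075, "PRERTHOR": -0.1348,
        "PRERXPSA": 0.0040, "NADIRPSA": 0.0769}

# Hypothetical patient: Gleason 7-10, stage T3-4, no pretreatment hormones,
# pretreatment PSA 20 ng/mL, posttreatment nadir PSA 2 ng/mL
x = {"GSCORE": 1, "TUMSTAGE3": 1, "PRERTHOR": 0, "PRERXPSA": 20, "NADIRPSA": 2}

linear = sum(coef[k] * x[k] for k in coef)
hazard_ratio = math.exp(linear)  # hazard relative to the baseline h0(t)
print(round(hazard_ratio, 1))    # prints: 8.9
```

Each covariate multiplies the baseline hazard by its own factor exp(b·x), which is exactly the "proportional" behavior described above.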

An Example of
the Cox Model

In the study described in Presenting Problem 3, Crook and her
colleagues (1997) used the Cox proportional hazard model to examine
the relationship between pretreatment PSA and posttreatment PSA
nadir and treatment failure in men with prostate carcinoma following
treatment with radiotherapy. Failure was categorized as biochemical,
local, or distant. The investigators wanted to control for possible
confounding variables, including the Gleason score, the T classification,
both measures of severity, and whether the patient received hormones
prior to the radiotherapy. The outcome is a censored variable, the
amount of time before the treatment fails, so the Cox proportional
hazard model is the appropriate statistical method. We use the results
of analysis using SPSS, given in Table 10–10, to point
out some salient features of the method.

Table 10–10. Results from Cox Proportional Hazard Model Using Both Pretreatment and Posttreatment Variables.

Indicator Parameter Coding

Variable                          Value    Frequency   (1)     (2)     (3)
PRERTHOR                          1.00     44          0.000
                                  2.00     163         1.000
GSCORE (recoded Gleason score)    2–6      168         0.000
                                  7–10     39          1.000
TUMSTAGE (tumor stage)            T1b–2    34          0.000   0.000   0.000
                                  T2a      34          1.000   0.000   0.000
                                  T2b–c    79          0.000   1.000   0.000
                                  T3–4     60          0.000   0.000   1.000

207 total cases read
Dependent variable: TIMEANYF
Events: 68     Censored: 139 (67.1%)

Beginning block number 0. Initial log likelihood function:
–2 Log likelihood = 649.655

Beginning block number 1. Method: Enter
Variable(s) entered at step number 1: GSCORE (recoded Gleason score),
TUMSTAGE (tumor stage), PRERTHOR, PRERXPSA, NADIRPSA
Coefficients converged after seven iterations.
–2 Log likelihood = 576.950

                           χ^{2}     df   Significance
Overall (score)            274.737   7    0.0000
Change (–2LL) from
  Previous block           72.706    7    0.0000
  Previous step            72.706    7    0.0000

Variables in the Equation

Variable      B         SE       Wald      df   Significance   R
GSCORE        0.4420    0.2843   2.4172    1    0.1200         0.0253
TUMSTAGE                         14.3608   3    0.0025         0.1134
TUMSTAGE(1)   –0.0224   0.7077   0.0010    1    0.9747         0.0000
TUMSTAGE(2)   0.8044    0.5506   2.1342    1    0.1440         0.0144
TUMSTAGE(3)   1.5075    0.5548   7.3828    1    0.0066         0.0910
PRERTHOR      –0.1348   0.3168   0.1811    1    0.6704         0.0000
PRERXPSA      0.0040    0.0029   1.8907    1    0.1691         0.0000
NADIRPSA      0.0769    0.0115   44.7491   1    0.0000         0.2565

                        95% CI for Exp(B)
Variable      Exp(B)    Lower     Upper
GSCORE        1.5558    0.8912    2.7161
TUMSTAGE(1)   0.9778    0.2443    3.9140
TUMSTAGE(2)   2.2353    0.7597    6.5770
TUMSTAGE(3)   4.5156    1.5221    13.3962
PRERTHOR      0.8739    0.4697    1.6260
PRERXPSA      1.0040    0.9983    1.0098
NADIRPSA      1.0799    1.0559    1.1045

df = degrees of freedom; SE = standard
error; CI = confidence interval; Wald = statistic
used by SPSS to test the significance of individual variables.

Source: Data, used with permission,
from Crook JM, Bahadur YA, Bociek RG, Perry GA, Robertson SJ, Esche
BA: Radiotherapy for localized prostate carcinoma. Cancer 1997;79:328–336.
Output produced using SPSS 10.0, a registered trademark of SPSS,
Inc; used with permission.

Both numerical and nominal variables can be used as independent
variables in the Cox model. If the variables are nominal, it is
necessary to tell the computer program so they can be properly analyzed.
SPSS prints this information. PRERTHOR, pretreatment hormone therapy,
is recoded so that 0 = no and 1 = yes. Prior to
doing the analysis, we recoded the Gleason score into a variable called
GSCORE with two values: 0 for Gleason scores 2–6 and 1
for Gleason scores 7–10. The T classification variable,
TUMSTAGE, was recoded by the computer program using dummy variable
coding. Note that for four values of TUMSTAGE, only three variables
are needed, with the three more advanced stages compared with the
lowest stage, T1b–2.

Among the 207 men in the study, 68 had experienced a failure
by the time the data were analyzed. The authors reported a median
follow-up of 36 months with a range of 12 to 70 months. The log likelihood
statistic (LL) is used to evaluate the significance of the overall
model; smaller values indicate that the data fit the model better.
The difference between the log likelihood of the initial model,
in which no independent variables are included in the equation,
and the log likelihood after the variables are entered is calculated.
In this example, the change is 72.706 (shown in Table 10–10),
and it is the basis of the chi-square statistic used to determine
the significance of the model. The significance is reported, as
often occurs with computer programs, as 0.0000.

In addition to testing the overall model, it is possible to test
each independent variable to see if it adds significantly to the
prediction of failure. Were any of the potentially confounding variables significant?
The significance of TUMSTAGE requires some explanation. The variable
itself is significant, with P = 0.0025
(see Table 10–10). The TUMSTAGE(3) variable (which indicates
the patient has T3–4 stage tumor), however, is the one
that really matters because it is the only significant stage (P = 0.0066). Note that Gleason
score and hormone therapy were not significant. Was either of the PSA
values important in predicting failure? It appears that the pretreatment
PSA is not significant, but the lowest PSA (NADIRPSA) reached following
treatment has a very low P value.
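Because NADIRPSA is a numerical variable, its Exp(B) of about 1.08 is the relative risk per 1-unit (1 ng/mL) change in nadir PSA. A sketch of rescaling the coefficient to a 10-unit change, which is often more clinically interpretable (the 10-unit choice is ours):

```python
import math

# NADIRPSA coefficient from Table 10-10 (per 1 ng/mL change in nadir PSA)
b_nadir = 0.0769
per_unit = math.exp(b_nadir)      # matches the Exp(B) column
per_ten = math.exp(10 * b_nadir)  # relative risk per 10 ng/mL increase
print(round(per_unit, 4), round(per_ten, 2))  # prints: 1.0799 2.16
```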

As in logistic regression, the regression coefficients in the
Cox model can be interpreted in terms of relative risks or odds
ratios (by finding the antilog) if they are based on independent
binary variables, such as hormone therapy. For this reason, many
researchers divide independent variables into two categories, as
we did with Gleason score, even though this practice can be risky
if the correct cutpoint is not selected. The T classification variable
was recoded as three dummy variables to facilitate interpretation
in terms of odds ratios for each stage. The odds ratios are listed
under the column titled "Exp (B)" in Table 10–10.
Using the T3–4 stage (TUMSTAGE(3)) as an illustration,
the antilog of the regression coefficient, 1.5075, is exp (1.5075) = 4.5156.
Note that the 95% confidence interval goes from approximately
1.52 to 13.40; because this interval does not contain 1, the odds
ratio is statistically significant (consistent with the P value).

Crook and colleagues (1997) also computed the Cox model using
only the variables known prior to treatment (see Exercise 8).

Importance of
the Cox Model

The Cox model is very useful in medicine, and it is easy to see
why it is being used with increasing frequency. It provides the
only valid method of predicting a time-dependent outcome, and many
health-related outcomes are related to time. If the independent
variables are divided into two categories (dichotomized), the exponential
of the regression coefficient, exp (b),
is the odds ratio, a useful way to interpret the risk associated
with any specific factor. In addition, the Cox model provides a
method for producing survival curves that are adjusted for confounding
variables. The Cox model can be extended to the case of multiple
events for a subject, but that topic is beyond our scope. Investigators
who have repeated measures in a time-to-survival study are encouraged
to consult a statistician.

Meta-Analysis

Meta-analysis is a way to combine
results of several independent studies on a specific topic. Meta-analysis
differs from the methods discussed in the preceding sections
because its purpose is not to identify risk factors or to predict
outcomes for individual patients; rather, the technique is applicable to
any research question. We briefly introduced meta-analysis in Chapter 2.
Because we could not discuss it in detail until the basics
of statistical tests (confidence limits, P values,
etc) were explained, we include it in this chapter. It is an important
technique, increasingly used for studies in health, and it can be
viewed as an extension of multivariate analysis.

The idea of summarizing a set of studies in the medical literature
is not new; review articles have long had an important role in helping
practicing physicians keep up to date and make sense of the many
studies on any given topic. Meta-analysis takes the review article
a step further by using statistical procedures to combine the results
from different studies. Glass (1977) developed the technique because
many research projects are designed to answer similar questions,
but they do not always come to similar conclusions. The problem
for the practitioner is to determine which study to believe, a problem
unfortunately too familiar to readers of medical research reports.

Sacks and colleagues (1987) reviewed meta-analyses of clinical
trials and concluded that meta-analysis has four purposes: (1) to
increase statistical power by increasing the sample size, (2) to resolve
uncertainty when reports do not agree, (3) to improve estimates
of effect size, and (4) to answer questions not posed at the beginning
of the study. Purpose 3 requires some expansion because the concept
of effect size is central to meta-analysis. Cohen (1988) developed
this concept and defined effect size as
the degree to which the phenomenon
is present in the population. An effect size may be thought of as
an index of how much difference exists between two groups—generally,
a treatment group and a control group. The effect size is based
on means if the outcome is numerical, on proportions or odds ratios
if the outcome is nominal, or on correlations if the outcome is
an association. The effect sizes themselves are statistically combined
in meta-analysis.
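As a sketch of the most common effect size for a numerical outcome, Cohen's d (the difference in means divided by the pooled standard deviation; the summary statistics below are hypothetical):

```python
import math

# Hypothetical summary statistics for a treatment and a control group
mean_t, sd_t, n_t = 12.0, 4.0, 50
mean_c, sd_c, n_c = 10.0, 5.0, 50

# Pooled standard deviation weights each group's variance by its df
pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                      / (n_t + n_c - 2))
d = (mean_t - mean_c) / pooled_sd  # Cohen's d effect size
print(round(d, 2))  # prints: 0.44
```

By Cohen's conventions, a d of about 0.44 would be a small-to-medium effect.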

Veenstra and colleagues (1999) used meta-analysis to evaluate
the efficacy of impregnating central venous catheters with an antiseptic.
They examined the literature, using manual and computerized searches,
for publications containing the words chlorhexidine,
antiseptic, and catheter and found
215 studies. Of these, 24 were comparative studies in humans. Nine
studies were eliminated because they were not randomized, and another
two were excluded based on the criteria for defining catheter colonization
and catheter-related bloodstream infection. Ten studies examined both
outcomes, two examined only catheter colonization, and one reported
only catheter-related bloodstream infection.

Two authors independently read and evaluated each article. They
reviewed the sample size, patient population, type of catheter,
catheterization site, other interventions, duration of catheterization,
reports of adverse events, and several other variables describing
the incidence of colonization and catheter-related bloodstream infection.
The authors also evaluated the appropriateness of randomization,
the extent of blinding, and the description of eligible subjects.
Discrepancies between the reviewers were resolved by a third author.
Some basic information about the studies evaluated in this meta-analysis
is given in Table 10–11.

Table 10–11. Characteristics of Studies Comparing Antiseptic-Impregnated with Control Catheters.

Study, y^{a} | Catheter Lumens | Patient Population | Catheter Exchange^{b} | Catheters (Patients), Treatment / Control | Mean Catheter Duration (d), Treatment / Control | Catheter Colonization^{c} | Catheter-Related Bloodstream Infection^{d}
Tennenberg et al, 1997 | 2, 3 | Hospital | No | 137 (137) / 145 (145) | 5.1 / 5.3 | SQ (IV, SC, >15 CFU) | SO (IV, SC, site), CS, NS
Maki et al, 1997 | 3 | ICU | Yes | 208 (72) / 195 (86) | 6.0 / 6.0 | SQ (IV, >15 CFU) | SO (>15 CFU, IV, hub, inf)^{e}
van Heerden et al, 1996^{f} | 3 | ICU | No | 28 (28) / 26 (26) | 6.6 / 6.8 | SQ (IV, >15 CFU) | NR
Hannan et al, 1996 | 3 | ICU | NR | 68 (NR) / 60 (NR) | 7 / 8 | SQ (IV, >10^{3} CFU)^{g} | SQ (IV, >10^{3} CFU), NS
Bach et al, 1994^{f} | 3 | ICU | No | 14 (14) / 12 (12) | 7.0 / 7.0 | QN (IV, >10^{3} CFU) | NR
Bach et al, 1996^{f} | 2, 3 | Surgical | No | 116 (116) / 117 (117) | 7.7 / 7.7 | QN (IV, >10^{3} CFU) | SO (IV)
Heard et al, 1998^{f} | 3 | SICU | Yes | 151 (107) / 157 (104) | 8.5 / 9 | SQ (IV, SC, >14 CFU) | SO (IV, SC, >4 CFU)
Collin, 1999 | 1, 2, 3 | ED/ICU | Yes | 98 (58) / 139 (61) | 9.0 / 7.3 | SQ (IV, SC, >15 CFU) | SO (IV, SC)
Ciresi et al, 1996^{f} | 3 | TPN | Yes | 124 (92) / 127 (99) | 9.6 / 9.1 | SQ (IV, SC, >15 CFU) | SO (IV, SC)
Pemberton et al, 1996 | 3 | TPN | No | 32 (32) / 40 (40) | 10 / 11 | NR | SO (IV), res, NS
Ramsay et al, 1994^{e} | 3 | Hospital | No | 199 (199) / 189 (189) | 10.9 / 10.9 | SQ (IV, SC, >15 CFU) | SO (IV, SC)
Trazzera et al, 1995^{e} | 3 | ICU/BMT | Yes | 123 (99) / 99 (82) | 11.2 / 6.7 | SQ (IV, >15 CFU) | SO (IV, >15 CFU)
George et al, 1997 | 3 | Transplant | No | 44 (NR) / 35 (NR) | NR / NR | SQ (IV, >5 CFU) | SO (IV)

^{a}Readers should refer to the original article
for these citations.

^{b}Catheter exchange was performed using a guide wire.

^{c}Catheter segments cultured and criteria for positive
culture are given in parentheses.

^{d}Catheter segment or site cultured and criteria for
positive culture are given in parentheses.

^{e}Organism identity was confirmed by restriction-fragment
subtyping.

^{f}Additional information was provided by author (personal
communications, Jan 1998–Mar 1998).

^{g}Culture method is reported as semiquantitative; criteria
for culture growth suggest quantitative method.

NR = not reported; ICU = intensive care unit;
SICU = surgical intensive care unit; TPN = total parenteral
nutrition; BMT = bone marrow transplant; ED = emergency
department; hospital = hospitalwide or a variety of settings; SQ = semiquantitative
culture; QN = quantitative culture; CFU = colony-forming
units; IV = intravascular catheter segment; SC = subcutaneous
catheter segment; site = catheter insertion site; hub = catheter
hub; inf = catheter infusate; SO = same organism
isolated from blood and catheter; CS = clinical symptoms
of systemic infection; res = resolution of symptoms on
catheter removal; and NS = no other sources of infection.

The authors of the meta-analysis article calculated the odds
ratios and 95% confidence intervals for each study and
used a statistical method to determine summary odds ratios over
all the studies. These odds ratios and intervals for the outcome
of catheter-related bloodstream infection are shown in Figure 10–5, which illustrates the typical way findings
from meta-analyses are presented. Generally the results
from each study are shown, and the summary or combined results are
given at the bottom of the figure. When the summary statistic is
the odds ratio, a line representing the value of 1 is drawn to make
it easy to see which of the studies have a significant outcome.

Figure 10–5.

Analysis of catheter-related bloodstream infection in
trials comparing chlorhexidine/silver sulfadiazine-impregnated
central venous catheters with nonimpregnated catheters. The diamond
indicates odds ratio (OR) and 95% confidence interval (CI).
Studies are ordered by increasing mean duration of catheterization
in the treatment group. The size of the squares is inversely proportional to
the variance of the studies. (Reproduced, with permission, from
Veenstra DL, Saint S, Saha S, Lumley T, Sullivan SD: Efficacy of
antiseptic-impregnated central venous catheters in preventing catheter-related
bloodstream infection. JAMA 1999;281:261–267. Copyright 1999,
American Medical Association.)

From the data in Table 10–11 and Figure 10–5,
it appears that only one study (of the 11) reported a statistically
significant outcome because only one has a confidence interval that
does not contain 1. The entire confidence interval in Maki and associates' study
(1997) is less than 1, indicating that these investigators found
a protective effect when using the treated catheters. Of interest
is the summary odds ratio, which illustrates that by pooling the
results from 11 studies, treating the catheters appears to be beneficial.
Several of the studies had relatively small sample sizes, however,
and the failure to find a significant difference may be due to low
power. Using meta-analysis to combine the results from these studies
can provide insight into this issue.

A meta-analysis does not simply add the means or proportions
across studies to determine an "average" mean
or proportion. Although several different methods can be used to
combine results, they all use the same principle of determining
an effect size in each study and then combining the effect sizes
in some manner. The methods for combining the effect sizes include
the z approximation for comparing two proportions (Chapter 6), the t test
for comparing two means (Chapter 6), the P values
for the comparisons, and the odds ratio, as shown in Veenstra and
colleagues' study (1999). The values corresponding to the
effect size in each study are the numbers combined in the meta-analysis
to provide a pooled (overall) P value
or confidence interval for the combined studies. The most commonly
used method for reporting meta-analyses in the medical literature
is the odds ratio with confidence intervals.
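The inverse-variance approach to pooling odds ratios can be sketched in a few lines of Python. The 2 × 2 counts below are hypothetical (they are not the counts from the studies in Table 10–11); each study's log odds ratio is weighted by the reciprocal of its variance, mirroring the most common fixed-effect method:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI for a 2x2 table (Woolf/logit method).
    a = events in treated, b = non-events in treated,
    c = events in control, d = non-events in control."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

def pooled_odds_ratio(tables, z=1.96):
    """Inverse-variance (fixed-effect) pooling of log odds ratios."""
    num = den = 0.0
    for a, b, c, d in tables:
        log_or = math.log((a * d) / (b * c))
        var = 1/a + 1/b + 1/c + 1/d
        num += log_or / var          # weight = 1 / variance
        den += 1 / var
    pooled = num / den
    se = math.sqrt(1 / den)
    return (math.exp(pooled),
            math.exp(pooled - z * se),
            math.exp(pooled + z * se))

# Hypothetical counts (infections, no infections) for three studies
studies = [(2, 96, 9, 130), (5, 119, 8, 119), (3, 120, 7, 92)]
for s in studies:
    print("OR = %.2f (95%% CI %.2f-%.2f)" % odds_ratio_ci(*s))
print("Pooled OR = %.2f (95%% CI %.2f-%.2f)" % pooled_odds_ratio(studies))
```

Random-effects methods (eg, DerSimonian–Laird) extend this scheme by adding a between-study variance component to each weight.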

In addition to being potentially useful when published studies
reach conflicting conclusions, meta-analysis can help raise issues
to be addressed in future clinical trials. The procedure is not, however,
without its critics, and readers should be aware of some of the
potential problems in its use. To evaluate meta-analysis, LeLorier
and associates (1997) compared the results of a series of large
randomized, controlled trials with relevant previously published
meta-analyses. Their results were mixed: They found that meta-analysis
accurately predicted the outcome in only 65% of the studies;
however, the difference between the trial results and the meta-analysis
results was statistically significant in only 12% of the
comparisons. Ioannidis and colleagues (1998) determined that the
discrepancies in the conclusions were attributable to different
disease risks, different study protocols, varying quality of the
studies, and possible publication bias (discussed in a following
section). These reports serve as a useful reminder that well-designed
clinical trials remain a critical source of information.

Studies designed in dissimilar manners should not be combined.
In performing a meta-analysis, investigators should use clear and
well-accepted criteria for deciding whether studies should be included
in the analysis, and these criteria should be stated in the published
meta-analysis.

Most meta-analyses are based on the published literature, and
some people believe it is easier to publish studies with statistically
significant results than studies that show no difference. This potential problem is
called publication bias. Researchers can take at least three important
steps to reduce publication bias. First, they can search for unpublished
data, typically done by contacting the authors of published articles.
Veenstra and his colleagues (1999) did this and contacted the manufacturer
of the treated catheters as well but were unable to identify any
unpublished data. Second, researchers can perform an analysis to
see how sensitive the conclusions are to certain characteristics
of the studies. For instance, Veenstra and colleagues assessed sources
of heterogeneity, or variation, among the studies and reported that
excluding the studies responsible for the heterogeneity had no substantive effect on the conclusions. Third,
investigators can estimate how many studies showing no difference
would have to be done but not published to raise the pooled P value above the 0.05 level or produce
a confidence interval that includes 1 so that the combined results
would no longer be significant. The reader can have more confidence
in the conclusions from a meta-analysis that finds a significant
effect if a large number of unpublished negative studies would be
required to repudiate the overall significance. The increasing use
of computerized patient databases may lessen the effect of publication
bias in future meta-analyses. Montori and colleagues (2000) provide
a review of publication bias for clinicians.
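The third step is often carried out with Rosenthal's fail-safe N, which estimates how many unpublished studies averaging z = 0 would be needed to raise the pooled one-tailed P value above 0.05. A minimal sketch, using hypothetical z statistics:

```python
import math

def fail_safe_n(z_scores, z_alpha=1.645):
    """Rosenthal's fail-safe N: the number of unpublished null studies
    (mean z = 0) needed to raise the pooled one-tailed P above 0.05.
    Derived from the Stouffer combined z: sum(z) / sqrt(k + N) = z_alpha."""
    k = len(z_scores)
    total = sum(z_scores)
    n = (total ** 2) / (z_alpha ** 2) - k
    return max(0, math.ceil(n))

# Hypothetical z statistics from five published studies
zs = [2.1, 1.3, 2.6, 0.8, 1.9]
print(fail_safe_n(zs))
```

The larger the fail-safe N relative to the number of published studies, the more confidence one can have that publication bias does not explain the pooled result.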

The Cochrane Collaboration maintains a large and growing database of meta-analyses
that are done according to specific guidelines. Each meta-analysis
contains a description and an assessment of the methods used in
the articles that constitute the meta-analysis. Graphs such as Figure
10–5 are produced, and, if appropriate, graphs for subanalyses
are presented. For instance, if both cohort studies and clinical
trials have been done on a given topic, a separate figure is presented
for each. The database is available on CD-ROM or via the Internet
for an annual fee. The Cochrane Web site states: "Cochrane reviews (the principal
output of the Collaboration) are published electronically in successive
issues of The Cochrane Database of Systematic Reviews. Preparation
and maintenance of Cochrane reviews is the responsibility of international
collaborative review groups."

No one has argued that meta-analyses should replace clinical
trials. Veenstra and his colleagues (1999) conclude that a large
trial may be warranted to confirm their findings. Despite their
shortcomings, meta-analyses can provide guidance to clinicians when
the literature contains several studies with conflicting results,
especially when the studies have relatively small sample sizes. Furthermore,
based on the increasingly large number of published meta-analyses,
it appears that this method is here to stay. As with all types of
studies, however, the methods used in a meta-analysis need to be
carefully assessed before the results are accepted.

Methods for
Classification

Several multivariate methods can be used when the research question
is related to classification. When the goal is to classify subjects
into groups, discriminant analysis, cluster analysis, and propensity
score analysis are appropriate. These methods all involve multiple
measurements on each subject, but they have different purposes and
are used to answer different research questions.

Discriminant
Analysis

Logistic regression is used extensively in the biologic sciences.
A related technique, discriminant analysis, although
used with less frequency in medicine, is a common technique in the
social sciences. It is similar to logistic regression in that it
is used to predict a nominal or categorical outcome. It differs
from logistic regression, however, in that it assumes that the independent
variables follow a multivariate normal distribution, so it must
be used with caution if some X variables
are nominal.

The procedure involves determining several discriminant functions,
which are simply linear combinations of the independent variables
that separate or discriminate among the groups defined by the outcome
measure as much as possible. The number of discriminant functions
needed is determined by a multivariate test statistic called Wilks' lambda.
The discriminant functions' coefficients can be standardized
and then interpreted in the same manner as in multiple regression
to draw conclusions about which variables are important in discriminating
among the groups.
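For two groups, the discriminant function can be computed directly from the group means and the pooled within-group covariance matrix. The following sketch uses simulated data (the variables and group means are illustrative assumptions, not values from any study) to derive Fisher's discriminant coefficients and classify a new subject:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two groups of 40 subjects, three variables each
# (eg, height, body mass, skinfold thickness)
g1 = rng.normal([170, 60, 25], 5, size=(40, 3))
g2 = rng.normal([165, 55, 30], 5, size=(40, 3))

# Pooled within-group covariance matrix
s1 = np.cov(g1, rowvar=False)
s2 = np.cov(g2, rowvar=False)
sp = ((len(g1) - 1) * s1 + (len(g2) - 1) * s2) / (len(g1) + len(g2) - 2)

# Fisher's discriminant coefficients: the linear combination that
# maximizes group separation relative to within-group variation
w = np.linalg.solve(sp, g1.mean(axis=0) - g2.mean(axis=0))

# Classify a new subject by comparing the discriminant score with the
# midpoint of the two group-mean scores
midpoint = w @ (g1.mean(axis=0) + g2.mean(axis=0)) / 2
new_subject = np.array([168, 58, 27])
group = 1 if w @ new_subject > midpoint else 2
print("coefficients:", np.round(w, 3), "-> group", group)
```

With more than two groups, additional discriminant functions are obtained from an eigenanalysis, which is what statistical packages do internally.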

Leone and coworkers (2002) wanted to identify characteristics
that differentiate among expert adolescent female athletes in four
different sports. Body mass, height, girth of the biceps and calf, skinfold
measures, measures of aerobic power, and flexibility were among
the measures they examined. Sports included were tennis with 15
girls, skating with 46, swimming with 23, and volleyball with 16.
Discriminant analysis is useful when investigators want to evaluate
several explanatory variables and the goal is to classify subjects
into two or more categories or groups, such as that defined by the
four sports.

Their analysis revealed three significant discriminant functions.
The first function discriminated between skaters and the other three
groups; the second reflected differences between volleyball players
and swimmers, and the third between swimmers and tennis players.
They concluded that adolescent female athletes show physical and
biomotor differences that distinguish among them according to their
sport.

Although discriminant analysis is most often employed to explain
or describe factors that distinguish among groups of interest, the
procedure can also be used to classify future subjects. Classification
involves determining a separate prediction equation corresponding
to each group that gives the probability of belonging to that group,
based on the explanatory variables. For classification of a future
subject, a prediction is calculated for each group, and the individual
is classified as belonging to the group he or she most closely resembles.

Factor Analysis

Andrewes and colleagues (2003) wanted to know how scores on the
Emotional and Social Dysfunction Questionnaire (ESDQ) can be used
to help decide the level of support needed following brain surgery.
Similarly, the Medical Outcomes Study Short Form 36 (MOS-SF36) is
a questionnaire commonly used to measure patient outcomes (Stewart
et al, 1988). In examples such as these, tests with a large number
of items are developed, patients or other subjects take the test,
and scores on various items are combined to produce scores on the
relevant factors.

The MOS-SF36 is probably used more frequently than any other
questionnaire to measure functional outcomes and quality of life;
it has been used all over the world and in patients with a variety
of medical conditions. The questionnaire contains 36 items that
are combined to produce a patient profile on eight concepts: physical
functioning, role-physical, bodily pain, general health, vitality,
social functioning, role-emotional, and mental health. The first
four concepts are combined to give a measure of physical health,
and the last four concepts are combined to give a measure of mental
health. The developers used factor analysis to
decide how to combine the questions to develop these concepts.

In a research problem in which factor analysis is appropriate,
all variables are considered to be independent; in other words,
there is no desire to predict one on the basis of others. Conceptually, factor
analysis works as follows: First, a large number of people are measured
on a set of items; a rule of thumb calls for at least ten times
as many subjects as items. The second step involves calculating
correlations. To illustrate, suppose 500 patients answered the 36
questions on the MOS-SF36. Factor analysis answers the question
of whether some of the items group together in a logical way, such
as items that measure the same underlying component of physical
activity. If two items measure the same component, they can be expected
to have higher correlations with each other than with other items.

In the third step, factor analysis manipulates the correlations
among the items to produce linear combinations, similar to a regression
equation without the dependent variable. The difference is that
each linear combination, called a factor, is
determined so that the first one accounts for the most variation
among the items, the second factor accounts for the most residual
variation after the first factor is taken into consideration, and so
forth. Typically, a small number of factors account for enough of
the variation among subjects that it is possible to draw inferences
about a patient's score on a given factor. For example,
it is much more convenient to refer to scores for physical functioning,
role-physical, bodily pain, and so on, than to refer to scores on
the original 36 items. Thus, the fourth step involves determining how
many factors are needed and how they should be interpreted.
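These four steps can be sketched numerically. The illustration below uses the principal-component method of extraction (eigenvectors of the correlation matrix), the simplest of the several extraction methods; the "items" are simulated responses driven by two hypothetical underlying traits:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # rule of thumb: at least ten times as many subjects as items

# Step 1: hypothetical responses; items 1-3 reflect one latent trait,
# items 4-6 another, plus item-specific noise
physical = rng.normal(size=n)
mental = rng.normal(size=n)
items = np.column_stack(
    [physical + rng.normal(scale=0.6, size=n) for _ in range(3)] +
    [mental + rng.normal(scale=0.6, size=n) for _ in range(3)])

# Step 2: correlations among the items
r = np.corrcoef(items, rowvar=False)

# Step 3: eigendecomposition; each eigenvector defines a linear
# combination (factor), ordered by the variance it accounts for
eigvals, eigvecs = np.linalg.eigh(r)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: how many factors? Examine the proportion of variance each
# factor accounts for; here the first two dominate by construction
proportion = eigvals / eigvals.sum()
print(np.round(proportion, 2))
```

Full factor-analysis programs go on to rotate the retained factors to make their interpretation easier.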

Andrewes and colleagues analyzed the ESDQ, a questionnaire designed
for brain-damaged populations. They performed a factor analysis
of the ratings by the partner or caretaker of 211 patients. They
found that the relationships among the questions could be summarized
by eight factors, including anger, helplessness, emotional dyscontrol,
indifference, inappropriateness, fatigue, maladaptive behavior,
and insight. The researchers subsequently used the scores on the
factors for a discriminant analysis to differentiate between the
brain-damaged patients and a control group with no cerebral dysfunction
and found significant discrimination.

Investigators who use factor analysis usually have an idea of
what the important factors are, and they design the items accordingly.
Many other issues are of concern in factor analysis, such as how
to derive the linear combinations, how many factors to retain for
interpretation, and how to interpret the factors. Using factor analysis,
as well as the other multivariate techniques, requires considerable
statistical skill.

Cluster Analysis

A statistical technique similar conceptually to factor analysis
is cluster analysis. The difference
is that cluster analysis attempts to find similarities among the
subjects that were measured instead of among the measures that were
made. The object in cluster analysis is to determine a classification
or taxonomic scheme that accounts for variance among the subjects. Cluster
analysis can also be thought of as similar to discriminant analysis,
except that the investigator does not know to which group the subjects
belong. As in factor analysis, all variables are considered to be
independent variables.
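A minimal sketch of one common clustering algorithm, k-means, illustrates the idea: subjects are grouped purely by the similarity of their measurements, with no outcome variable involved. The data and the starting rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements on two variables for 60 subjects whose
# true group membership is unknown to the analyst
pts = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),
                 rng.normal([3, 3], 0.5, (30, 2))])

def kmeans(x, k, iters=20):
    """Minimal k-means clustering with a farthest-point start."""
    centres = [x[0]]
    for _ in range(k - 1):  # start each new centre far from the others
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(x[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):
        # assign every subject to the nearest cluster centre
        labels = np.argmin(((x[:, None] - centres) ** 2).sum(axis=2), axis=1)
        # move each centre to the mean of the subjects assigned to it
        centres = np.array([x[labels == j].mean(axis=0)
                            if np.any(labels == j) else centres[j]
                            for j in range(k)])
    return labels, centres

labels, centres = kmeans(pts, k=2)
print(np.round(centres, 2))
```

Hierarchical (agglomerative) clustering, which repeatedly merges the two most similar subjects or groups, is the other approach commonly seen in taxonomic work.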

Cluster analysis is frequently used in archeology and paleontology
to determine if the existence of similarities in objects implies
that they belong to the same taxon. Biologists use this technique
to help determine classification keys, such as using leaves or flowers
to determine appropriate species. A study by Penzel and colleagues
(2003) used cluster analysis to examine the relationships among
chromosomal imbalances in thymic epithelial tumors. Journalists
and marketing analysts also use cluster analysis, referred to in
these fields as Q-type factor analysis, as a way to classify readers
and consumers into groups with common characteristics.

Propensity Scores

The propensity score method is an alternative to multiple regression
and analysis of covariance. It provides a creative method to control
for an entire group of confounding variables. Conceptually, a propensity
score is found by using the confounding variables as predictors
of the group to which a subject belongs; this step is generally
accomplished by using logistic regression. For example, many cohort
studies are handicapped by the problem of many confounding variables,
such as age, gender, race, comorbidities, and so forth. Once group
membership (eg, exposed or unexposed) is known for the subjects in the cohort, the confounding
variables are used to develop a logistic regression equation to
predict the group to which a patient belongs. This predicted probability,
based on a combination of the confounding variables, is calculated
for all subjects and then used as a single confounding variable in subsequent
analyses.
job of controlling for confounding variables (Rubin, 1997). See
Katzan and colleagues (2003) for an example of the application of
propensity score analysis in a clinical study to determine the effect
of pneumonia on mortality in patients with acute stroke.
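Conceptually, the computation looks like the following sketch: group membership is regressed on the confounder(s) by logistic regression (here fitted with a few Newton–Raphson steps on simulated data), and each subject's predicted probability becomes the propensity score:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Hypothetical cohort: older patients are more likely to be in the
# exposed group, so age confounds any exposure-outcome comparison
age = rng.normal(65, 10, n)
exposed = (rng.random(n) < 1 / (1 + np.exp(-(age - 65) / 5))).astype(float)

# Regress group membership on the confounder(s): logistic regression
# fitted by Newton-Raphson iterations
X = np.column_stack([np.ones(n), (age - age.mean()) / age.std()])
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))            # predicted probabilities
    grad = X.T @ (exposed - p)                 # score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])  # information matrix
    beta += np.linalg.solve(hess, grad)

propensity = 1 / (1 + np.exp(-X @ beta))       # one score per subject

# The single score can now stand in for all the confounders, eg, by
# stratifying subjects into propensity-score quintiles
quintile = np.digitize(propensity, np.quantile(propensity, [0.2, 0.4, 0.6, 0.8]))
print(np.round(beta, 2))
```

Subjects in the exposed and unexposed groups can then be compared within quintiles, or the score can be entered as a covariate, so that all the original confounders are controlled through one variable.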

Classification
and Regression Tree (CART) Analysis

Classification and regression tree (CART)
analysis is an approach to analyzing large databases to find
significant patterns and relationships among variables. The patterns
are then used to develop predictive models for classifying future
subjects. As an example, CART was used in a study of 105 patients
with stage IV colon or rectal cancer (Dixon et al, 2003). CART identified
optimal cut points for carcinoembryonic antigen (CEA) and albumin
(ALB) to form four groups of patients: low CEA with high ALB, low
CEA with low ALB, high CEA with high ALB, and high CEA with low
ALB. A survival analysis (Kaplan–Meier) was then used to
compare survival times in these four groups. In another application
of CART analysis, researchers were successful in determining the
values of semen measurements that discriminate between fertile and
infertile men (Guzick et al, 2001). The method requires special
software and extensive computing power.
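The core of CART's search for an optimal cut point on a single variable can be sketched in a few lines: every candidate threshold is tried, and the one producing the purest pair of child nodes (lowest weighted Gini impurity) is kept. The marker values and outcomes below are hypothetical:

```python
def gini(groups):
    """Weighted Gini impurity of a candidate split into child nodes."""
    total = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        p = sum(g) / len(g)          # proportion of events in this node
        score += (len(g) / total) * 2 * p * (1 - p)
    return score

def best_cut_point(values, outcomes):
    """Try every candidate threshold on one variable, as CART does,
    and return the cut point giving the purest two child nodes."""
    best = (None, float("inf"))
    for cut in sorted(set(values)):
        left = [o for v, o in zip(values, outcomes) if v <= cut]
        right = [o for v, o in zip(values, outcomes) if v > cut]
        g = gini([left, right])
        if g < best[1]:
            best = (cut, g)
    return best

# Hypothetical data: a marker level and a binary outcome (1 = event)
marker = [1.2, 1.8, 2.0, 2.4, 5.1, 5.9, 6.3, 7.0]
event  = [0,   0,   0,   1,   1,   1,   1,   1]
cut, impurity = best_cut_point(marker, event)
print(cut, round(impurity, 3))
```

A full CART program applies this search recursively to every variable, growing a tree of splits and then pruning it back to avoid overfitting.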

Multiple Dependent
Variables

Multivariate analysis of variance and canonical correlation are
similar to each other in that they both involve multiple
dependent variables as well as multiple independent variables.

Multivariate
Analysis of Variance

Multivariate analysis of variance (MANOVA)
conceptually (although not computationally) is a simple extension
of the ANOVA designs discussed in Chapter 7 to situations in which
two or more dependent variables are included. As with ANOVA, MANOVA
is appropriate when the independent variables are nominal or categorical and
the outcomes are numerical. If the results from the MANOVA are statistically
significant, using the multivariate statistic called Wilks' lambda,
follow-up ANOVAs may be done to investigate the individual outcomes.
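Wilks' lambda itself is simple to compute: it is the ratio of the determinant of the within-group sums-of-squares-and-cross-products matrix to the determinant of the total matrix, with values near 1 indicating no group separation and values near 0 indicating strong separation. A sketch with simulated scores for three groups on two outcome measures:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical scores of three groups (25 subjects each) on two
# numerical outcome measures
groups = [rng.normal([10, 20], 2, (25, 2)),
          rng.normal([12, 20], 2, (25, 2)),
          rng.normal([14, 23], 2, (25, 2))]

all_data = np.vstack(groups)
grand_mean = all_data.mean(axis=0)

# Within-group (W) and total (T) sums-of-squares-and-cross-products
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
T = (all_data - grand_mean).T @ (all_data - grand_mean)

# Wilks' lambda: proportion of generalized variance NOT explained
# by group differences
wilks = np.linalg.det(W) / np.linalg.det(T)
print(round(wilks, 3))
```

Statistical packages convert lambda to an approximate F statistic to obtain the P value reported in a MANOVA table.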

Weiner and Rudy (2002) wanted to identify nursing home resident
and staff attitudes that are barriers to effective pain management.
They collected information from nurses, nursing assistants, and
residents in seven long-term care facilities. They designed questionnaires
to collect beliefs about 12 components of chronic pain management
and administered them to these three groups. They wanted to know
if there were attitudinal differences among the three groups on
the 12 components. If analysis of variance were used in this study,
12 separate ANOVAs would be needed, and the probability that
any one component would be significant by chance alone would be increased. With
these multiple dependent variables, they correctly chose to use
MANOVA. Results indicated that residents believed that chronic pain
does not change, and they were fearful of addiction. The nursing staff
believed that many complaints were unheard by busy staff. Note that
this study used a nested design (patients and staff within nursing
homes) and would be a candidate for GEE or multilevel model analysis.

The motivation for doing MANOVA prior to univariate ANOVA is
similar to the reason for performing univariate ANOVA prior to t tests: to avoid performing many significance
tests, which increases the likelihood that a chance difference will be
declared significant. In addition, MANOVA permits the statistician
to look at complex relationships among the dependent variables.
The results from MANOVA are often difficult to interpret, however,
and it is used sparingly in the medical literature.

Canonical Correlation
Analysis

Canonical correlation analysis also
involves both multiple independent and multiple dependent variables.
This method is appropriate when both the
independent variables and the outcomes are numerical, and the research
question focuses on the relationship between the set of independent
variables and the set of dependent variables. For example, suppose
researchers wish to examine the overall relationship between indicators
of health outcome (physical functioning, mental health, health perceptions,
age, gender, etc) measured at the beginning of a study and the set
of outcomes (physical functioning, mental health, social contacts,
serious symptoms, etc) measured at the end of the study. Canonical
correlation analysis forms a linear combination of the independent
variables to predict not just a single outcome measure, but a linear
combination of outcome measures. The two linear combinations of independent
variables and dependent variables, each resulting in a single number
(or index), are determined so the correlation between them is as
large as possible. The correlation between the pair of linear combinations
(or numbers or indices) is called the canonical
correlation. Then, as in factor analysis, a second pair of
linear combinations is derived from the residual variation after
the effect of the first pair is removed, and the third pair from
those remaining, and so on. The canonical coefficients in the linear
combinations are interpreted in the same manner as regression coefficients
in a multiple regression equation, and the canonical correlations
as multiple R. Generally, the first
two or three pairs of linear combinations account for sufficient
variation, and they can be interpreted to gain insights about related
factors or dimensions.
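Computationally, the canonical correlations can be obtained as the singular values of the cross-covariance between the two sets of variables after each set has been whitened (transformed to have identity covariance). A sketch with simulated baseline and outcome measures sharing one hypothetical underlying dimension:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300

# Hypothetical baseline measures (X) and end-of-study outcomes (Y)
# that share one underlying dimension of "overall health"
health = rng.normal(size=n)
X = np.column_stack([health + rng.normal(size=n) for _ in range(3)])
Y = np.column_stack([health + rng.normal(size=n) for _ in range(2)])

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

def whiten(M):
    """Transform centered data to have identity covariance."""
    cov = M.T @ M / (len(M) - 1)
    vals, vecs = np.linalg.eigh(cov)
    return M @ vecs @ np.diag(vals ** -0.5) @ vecs.T

# Singular values of the whitened cross-covariance matrix are the
# canonical correlations, largest first
Sxy = whiten(Xc).T @ whiten(Yc) / (n - 1)
canonical_corrs = np.linalg.svd(Sxy, compute_uv=False)
print(np.round(canonical_corrs, 2))
```

The singular vectors, mapped back through the whitening transforms, give the canonical coefficients for each set, interpreted like regression coefficients.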

The relationship between personality and symptoms of depression
was studied in a community-based sample of 804 individuals. Grucza
and colleagues (2003) used the Temperament and Character Inventory
(TCI) to assess personality and the Center for Epidemiologic Studies
Depression scale (CES-D) to measure symptoms of depression. Both
of these questionnaires contain multiple scales or factors, and
the authors used canonical correlation analysis to learn how the
factors on the TCI are related to the factors on the CES-D. They
discovered several relationships and concluded that depression symptom
severity and patterns are partially explained by personality traits.

Summary of Advanced
Methods

The advanced methods presented in this chapter are used in approximately
10–15% of the articles in medical and surgical
journals. Unfortunately for readers of the medical literature, these
methods are complex and not easy to understand, and they are not
always described adequately. As with other complex statistical techniques,
investigators should consult with a statistician if an advanced
statistical method is planned. Table 10–2 gives a guide
to the selection of the appropriate method(s), depending on the
number of independent variables and the scale on which they are measured.

Exercises

[Note to AccessLange users: data and software are not available on the website.]

1. Using the following formula, verify the adjusted mean number
of ventricular wall motion abnormalities in smokers and nonsmokers
from the hypothetical data in the section titled, "Controlling
for Confounding." That is,

2. Blood flow through an artery measured as peak systolic velocity
(PSV) increases with narrowing of the artery. The well-known relationship
between area of the arterial vessels and velocity of blood flow
is important in the use of carotid Doppler measurements for grading
stenosis of the artery. Alexandrov and collaborators (1997) examined
80 bifurcations in 40 patients and compared the findings from the
Doppler technique with two angiographic methods of measuring carotid
stenosis (the North American or NASCET [N] method
and the common carotid [C or CSI] method). They
investigated the fit provided by a linear equation, a quadratic
equation, and a cubic equation.

a. Using data in the file "Alexandrov" on
the CD-ROM, produce a scatterplot with PSV on the y-axis and CSI
on the x-axis. How do you interpret the scatterplot?

b. Calculate the correlation between both the N and C methods
and PSV. Which is most highly related to PSV?

c. Perform a multiple regression to predict PSV from CSI using
linear and quadratic terms.

d. Using the regression equation, what is the predicted PSV
if the measurement of angiographic stenosis using the CSI method
is 60%?

3. Refer to the study by Soderstrom and coinvestigators (1997).
Find the probability that a 27-year-old Caucasian man who comes
to the emergency department on Saturday night has a BAC ≥ 50
mg/dL.

4. Refer to the study by Soderstrom and coinvestigators (1997).
From Table 10–8, find the value of the kappa statistic
for the agreement between the predicted and actual number of males
with unintentional injuries who have a BAC ≥ 50
mg/dL when they come to the emergency department.

5. Bale and associates (1986) performed a study to consider the
physique and anthropometric variables of athletes in relation to
their type and amount of training and to examine these variables as
potential predictors of distance running performance. Sixty runners
were divided into three groups: (1) elite runners with 10-km runs
in less than 30 min; (2) good runners with 10-km times between 30
and 35 min; and (3) average runners with 10-km times between 35
and 45 min. Anthropometric data included body density, percentage
fat, percentage absolute fat, lean body mass, ponderal index, biceps
and calf circumferences, humerus and femur widths, and various skinfold
measures. The authors wanted to determine whether the anthropometric
variables were able to differentiate between the groups of runners.
What is the best method to use for this research question?

6. Ware and collaborators (1987) reported a study of the effects
on health for patients in health maintenance organizations (HMO)
and for patients in fee-for-service (FFS) plans. Within the FFS group,
some patients were randomly assigned to receive free medical care
and others shared in the costs. The health status of the adults
was evaluated at the beginning and again at the end of the study.
In addition, the number of days spent in bed because of poor health
was determined periodically throughout the study. These measures,
recorded at the beginning of the study—along with information
on the participant's age, gender, income, and the system
of health care to which he or she was assigned (HMO, free FFS, or
pay FFS)—were the independent variables used in the study.
The dependent variables were the values of these same 13 measures
at the end of the study. The results from a multiple-regression
analysis to predict number of bed days are given in Table 10–12.

Table 10–12. Regression
Coefficients and t Test Values for Predicting Bed-Days in RAND Study.


Dependent-Variable Equation

Explanatory Variables and Other Measures (X) | Coefficient (b) | t Test
Intercept | 0.613 | 22.36
FFS free plan | –0.017 | –2.17
FFS pay plan | –0.014 | –2.18
Personal functioning | –0.0002 | –1.35
Mental health | –0.00006 | 0.25
Health perceptions | –0.002 | –5.17
Age | –0.0001 | –0.54
Male | –0.026 | –4.58
Income | –0.021 | –1.65
Three-year term | 0.002 | 0.44
Took physical | –0.003 | –0.56
Bed-day00 | 0.105 | 6.15
Sample size | 1568
R^{2} | 0.12
Residual standard error | 0.01

FFS = fee for service.

Source: Reproduced, with permission,
from Ware JE, Brook RH, Rogers WH, Keeler EB, Davie AR, Sherbourne
CD, et al: Health Outcomes for Adults in Prepaid and Fee-for-Service
Systems of Care. (R–3459–HHS.) Santa Monica, CA:
The RAND Corporation, 1987, p. 59.

Use the regression equation to predict the number of bed-days during
a 30-day period for a 70-year-old woman in the FFS pay plan who
has the values on the independent variables shown in Table 10–13
(asterisks [*] designate dummy variables
given a value of 1 if yes and 0 if no).

Table 10–13. Values
for Prediction Equation.


Variable | Value
Personal functioning | 80
Mental health | 80
Health perceptions | 75
Age | 70
Income | 10 (from a formula used in the RAND study)
Three-year term* | Yes
Took physical* | Yes
Bed-day00 | 14

*Indicates a dummy variable with 1 = yes
and 0 = no.

7. Symptoms of depression in the elderly may be more subtle than
in younger patients, but recognizing depression in the elderly is
important because it can be treated. Henderson and colleagues in
Australia (1997) studied a group of more than 1000 elderly, all
age 70 years or older. They examined the outcome of depressive states
3–4 years after initial diagnosis to identify factors associated
with persistence of depressive symptoms and to test the hypothesis
that depressive symptoms in the elderly are a risk factor for dementia
or cognitive decline. They used the Canberra Interview for the Elderly
(CIE), which measures depressive symptoms and cognitive performance,
and referred to the initial measurement as "wave 1" and
the follow-up as "wave 2." The regression equation
predicting depression at wave 2 for 595 people who completed the
interview on both occasions is given in Table 10–14, and
data are in the file on the CD-ROM entitled, "Henderson." The
variables have been entered into the regression equation in blocks,
an example of hierarchical regression.

Table 10–14. Regression
Results for Predicting Depression at Wave 2.


Predictor Variable^{a} | b | Beta^{b} | P | R^{2} | R^{2} Change

Depression Score
  Wave 1 | 0.267 | 0.231 | 0.000 | 0.182 | 0.182

Sociodemographic
  Age | –0.014 | –0.024 | 0.538 | 0.187 | 0.005
  Sex | 0.165 | 0.034 | 0.370

Psychologic Health
  Neuroticism, wave 1 | 0.067 | 0.077 | 0.056 | 0.237 | 0.050
  Past history of depression | 0.320 | 0.136 | 0.000

Physical Health
  ADL, wave 1 | –0.154 | –0.103 | 0.033 | 0.411 | 0.174
  ADL, wave 2 | 0.275 | 0.283 | 0.012
  ADL^{2}, wave 2 | –0.013 | –0.150 | 0.076
  Number of current symptoms, wave 2 | 0.115 | 0.117 | 0.009
  Number of medical conditions, wave 2 | 0.309 | 0.226 | 0.000
  BP, systolic, wave 2 | –0.010 | –0.092 | 0.010
  Global health rating change | 0.284 | 0.079 | 0.028
  Sensory impairment change | –0.045 | –0.064 | 0.073

Social support/inactivity
  Social support—friends, wave 2 | –1.650 | –0.095 | 0.015 | 0.442 | 0.031
  Social support—visits, wave 2 | –1.229 | –0.087 | 0.032
  Activity level, wave 2 | 0.061 | 0.095 | 0.025

Services (community residents only), wave 2 | 0.207^{c} | 0.135^{c} | 0.001^{c} | 0.438^{c} | 0.015^{c}

^{a}Only those variables are shown that were included
in the final model.

^{b}Standardized beta value, controlling for all other
variables in the regression, except service use. Based on community
and institutional residents.

^{c}Regression limited to community sample only; coefficients
for other variables vary only very slightly from those obtained
with regression on the full sample.

ADL = activities of daily living; BP = blood pressure.

Source: Table 3 from the article
was modified with the addition of unstandardized regression coefficients; used,
with permission, from Henderson AS, Korten AE, Jacomb PA, MacKinnon
AJ, Jorm AF, Christensen H, et al: The course of depression in the
elderly: A longitudinal community-based study in Australia. Psychol
Med 1997;27:119–129.

a. Based on the regression equation in Table 10–14,
what is the relationship between depression score initially and
at follow-up?

b. The regression coefficient for age is –0.014.
Is it significant? How would you interpret it?

c. Once a person's depression score at wave 1 is
known, which group of variables accounts for more of the variation
in depression at wave 2?

d. Use the data on the CD-ROM to replicate the analysis.
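The block-entry ("hierarchical") strategy behind Table 10–14 can be sketched in a few lines: fit the nested models in the order the blocks were entered and report the increase in R^{2} as each block joins the equation. The sketch below uses simulated stand-in variables, not the Henderson data, so the numbers it prints are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 595  # sample size reported for the Henderson analysis

# Simulated stand-ins for a few of the predictors (illustrative only)
dep_w1 = rng.normal(size=n)                       # depression score, wave 1
age = rng.normal(70.0, 5.0, size=n)               # age in years
sex = rng.integers(0, 2, size=n).astype(float)    # 0/1 indicator
dep_w2 = 0.27 * dep_w1 + rng.normal(size=n)       # outcome: depression, wave 2

def r_squared(predictors, y):
    """OLS R-squared, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Block 1: wave-1 depression alone
r2_block1 = r_squared([dep_w1], dep_w2)
# Block 2: add the sociodemographic variables
r2_block2 = r_squared([dep_w1, age, sex], dep_w2)

print(f"R2 after block 1: {r2_block1:.3f}")
print(f"R2 change for block 2: {r2_block2 - r2_block1:.3f}")
```

Because the models are nested, R^{2} can only stay the same or increase as a block enters; the "R^{2} Change" column in Table 10–14 reports exactly this increment for each block.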

8. Table 10–15 contains the results from an analysis
of the data from Crook and colleagues (1997) using only information
known before treatment was given.

Table 10–15. Cox
Proportional Hazard Model Using Only Pretreatment Variables.

–2 Log Likelihood: 610.312

                               χ^{2}     df   Significance
Overall (score)                51.483     6   0.0000
Change (–2LL) from
  Previous block               39.344     6   0.0000
  Previous step                39.344     6   0.0000

Variables in the Equation

Variable         B         SE        Wald       df   Significance   R
GSCORE           0.2999    0.2818     1.1321     1   0.2873         0.0000
TUMSTAGE                              12.1032    3   0.0070         0.0969
TUMSTAGE (1)    –0.0263    0.7075     0.0014     1   0.9703         0.0000
TUMSTAGE (2)     1.0141    0.5419     3.5014     1   0.0613         0.0481
TUMSTAGE (3)     1.4588    0.5535     6.9458     1   0.0084         0.0873
PRERTHOR         0.1332    0.3262     0.1668     1   0.6830         0.0000
PRERXPSA         0.0080    0.0027     9.0391     1   0.0026         0.1041

                            95% CI for Exp (B)
Variable         Exp (B)    Lower      Upper
GSCORE           1.3497     0.7768     2.3450
TUMSTAGE (1)     0.9740     0.2434     3.8979
TUMSTAGE (2)     2.7568     0.9530     7.9746
TUMSTAGE (3)     4.3008     1.4534    12.7264
PRERTHOR         1.1425     0.6028     2.1654
PRERXPSA         1.0080     1.0028     1.0133

df = degree of freedom; SE = standard
error; CI = confidence interval; Wald = statistic
used by SPSS to test the significance of variables.

Source: Data, used with permission
of the authors and publisher, from Crook JM, Bahadur YA, Bociek RG,
Perry GA, Robertson SJ, Esche BA: Radiotherapy for localized prostate
carcinoma. Cancer 1997;79:328–336. Output produced using
SPSS 10.0, a registered trademark of SPSS, Inc. Used with permission.

a. Is the overall Cox model significant when based on
pretreatment variables only? What level of significance is reported?

b. Were any of the potentially confounding variables significant?

c. Confirm the value of the odds ratio associated with the
TUMSTAGE(3) variable of T classifications, and interpret the confidence
interval.

d. What are the major differences in this analysis compared with
the one that included posttreatment variables as well?
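Question c can be checked by hand: the ratio reported as Exp(B) is simply e raised to the regression coefficient B, and the 95% confidence limits come from exp(B ± 1.96 × SE). A short Python check using the TUMSTAGE(3) values from Table 10–15:

```python
import math

# Coefficient and standard error for TUMSTAGE(3) from Table 10-15
b, se = 1.4588, 0.5535

hazard_ratio = math.exp(b)        # Exp(B), reported as 4.3008
lower = math.exp(b - 1.96 * se)   # lower 95% limit, reported as 1.4534
upper = math.exp(b + 1.96 * se)   # upper 95% limit, reported as 12.7264

print(f"Exp(B) = {hazard_ratio:.4f}, 95% CI ({lower:.4f}, {upper:.4f})")
```

Because the interval excludes 1, patients with this T classification have a significantly higher risk than those in the reference category.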

9. Hindmarsh and Brook (1996) examined the final height of 16
short children who were treated with growth hormone. They studied
several variables they thought might predict height in these children,
such as the mother's height, the father's height,
the child's chronologic and bone age, dose of the growth
hormone during the first year, age at the start of therapy, and
the peak response to an insulin-induced hypoglycemia test. All anthropometric
indices were expressed as standard deviation scores; these scores
express height in terms of standard deviations from the mean in
a norm group. For example, a height score of –2.00 indicates
the child is 2 standard deviations below the mean height for his
or her age group.

Data are given in Table 10–16 and in a file entitled "Hindmarsh" on
the CD-ROM.

Table 10–16. Case
Summaries.^{a}

Case     Final        Age      Dose     Father's   Mother's   Height SDS        Height SDS
         Height SDS                     Height     Height     Chronologic Age   Bone Age
1        –2.18         6.652   20.00    –2.14      –2.20      –3.07             –1.75
2        –2.15        10.383   16.00    –2.78      –0.15      –2.20             –2.43
3        –1.65        10.565   16.00    –1.91      –1.87      –2.62             –0.90
4        –1.18        10.104   15.00     0.84      –1.63      –2.43             –2.06
5        –1.31        11.145   14.00    –1.80      –0.70      –1.92             –1.49
6        –1.35         9.682   14.00     0.09       0.38      –1.38             –0.54
7        –1.18         9.863   16.00    –0.26      –1.70      –2.08             –1.32
8        –2.51         9.463   15.00    –3.62      –0.75      –2.36             –0.45
9        –1.61         7.704   19.00    –2.11      –1.25      –1.98              1.23
10       –2.15         5.858   25.00     1.31       0.23      –2.43             –1.95
11        0.80         5.153   22.00    –0.14      –1.83      –2.36             –0.39
12       –0.20         6.986   21.00    –0.14      –1.83      –2.09             –0.98
13        0.20         8.967   17.00    –0.14      –1.83      –1.38             –0.19
14       –0.71         6.970   21.00     0.50       1.63      –1.09              0.49
15       –1.71         6.515   13.00     0.84      –0.10      –1.98             –0.18
16       –2.32         7.548   21.00    –2.89      –0.20      –3.30             –2.25
Total N  16           16       16       16         16         16                16

^{a}Limited to first 100 cases.

SDS = standard deviation score.

Source: Data, used with permission,
from Hindmarsh PC, Brook CGD: Final height of short normal children
treated with growth hormone. Lancet 1996;348:13–16. Table
produced using SPSS 10.0; used with permission.

a. Use the data to perform a stepwise regression and interpret
the results. We reproduced a portion of the output in Table 10–17.

b. What variable entered the equation on the first iteration
(model 1)? Why do you think it entered first?

c. What variables are in the equation at the final model?
Which of these variables makes the greatest contribution to the
prediction of final height?

d. Why do you think the variable that entered the equation
first is not in the final model?

e. Using the regression equation, what is the predicted height
of the first child? How close is this to the child's actual
final height (in SDS scores)?
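For question e, the final (model 5) equation from Table 10–17 can be applied to child 1's values in Table 10–16: mother's height –2.20, height SDS for chronologic age –3.07, and dose 20.00. A short Python sketch of the arithmetic:

```python
# Model 5 coefficients from Table 10-17 (final step of the stepwise run)
intercept = -1.138
b_mother, b_htsds_ca, b_dose = -0.575, 1.325, 0.121

# Child 1's values from Table 10-16
mother_height = -2.20
height_sds_ca = -3.07
dose = 20.00
actual_final_sds = -2.18

predicted = (intercept
             + b_mother * mother_height
             + b_htsds_ca * height_sds_ca
             + b_dose * dose)

print(f"predicted = {predicted:.3f}, actual = {actual_final_sds}")
```

The predicted score is about –1.52, compared with the child's actual final height of –2.18 SDS, so the equation overpredicts this child's final height by roughly two-thirds of a standard deviation.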

Table 10–17. Results
from Stepwise Multiple Regression to Predict Final Height in Standard
Deviation Scores.^{a}

                                    Unstandardized           Standardized
                                    Coefficients             Coefficients
Model                               B         SE             Beta       t         Significance
1   (Constant)                     –1.055     0.248                    –4.261     0.001
    Father's height                 0.302     0.142          0.494      2.126     0.052
2   (Constant)                     –1.335     0.284                    –4.705     0.000
    Father's height                 0.337     0.135          0.552      2.503     0.026
    Mother's height                –0.343     0.200         –0.378     –1.715     0.110
3   (Constant)                      0.205     0.734                     0.280     0.785
    Father's height                 0.211     0.131          0.345      1.612     0.133
    Mother's height                –0.478     0.185         –0.527     –2.581     0.024
    Height SDS chronologic age      0.820     0.368          0.505      2.230     0.046
4   (Constant)                     –1.110     0.927                    –1.198     0.256
    Father's height                 0.128     0.124          0.210      1.035     0.323
    Mother's height                –0.559     0.170         –0.617     –3.284     0.007
    Height SDS chronologic age      1.132     0.363          0.697      3.116     0.010
    Dose                            0.104     0.052          0.385      2.009     0.070
5   (Constant)                     –1.138     0.929                    –1.225     0.244
    Mother's height                –0.575     0.170         –0.634     –3.381     0.005
    Height SDS chronologic age      1.325     0.313          0.816      4.229     0.001
    Dose                            0.121     0.049          0.451      2.487     0.029

^{a}Dependent variable: final height SDS.

SE = standard error; SDS = standard deviation
scores.

Note: Because the sample size is small, we set probability for
variables to enter the regression equation at 0.15 and for variables
to be removed at 0.20.

Source: Data, used with permission,
from Hindmarsh PC, Brook CGD: Final height of short normal children
treated with growth hormone. Lancet 1996;348:13–16. Stepwise
regression results produced using SPSS; used with permission.