Basic and Clinical Biostatistics > Chapter 8. Research Questions About Relationships among Variables >

Key Concepts

Correlation and regression are statistical methods to
examine the linear relationship between two numerical variables
measured on the same subjects. Correlation describes a relationship,
and regression describes both a relationship and predicts an outcome.

Correlation coefficients range from –1 to +1,
with both extremes indicating a perfect relationship between two variables. A
correlation equal to 0 indicates no relationship.

Scatterplots provide a visual display of the relationship
between two numerical variables and are recommended to check for
a linear relationship and extreme values.

The coefficient of determination, or r^{2}, is simply
the squared correlation; it is the preferred statistic to describe
the strength of the relationship between two numerical variables.

The t test can be used to test the hypothesis that the population
correlation is zero.

The Fisher z transformation is used to form confidence intervals
for the correlation or to test any hypotheses about the value of
the correlation.

The Fisher z transformation can also be used to form confidence
intervals for the difference between correlations in two independent
groups.

It is possible to test whether the correlation between one
variable and a second is the same as the correlation between a third
variable and a second variable.

When one or both of the variables in correlation is skewed,
the Spearman rho nonparametric correlation is advised.

Linear regression is called linear because it measures only
straight-line relationships.

The least squares method is the one used in almost all regression
examples in medicine. With one independent and one dependent variable,
the regression equation can be given as a straight line.

The standard error of the estimate is a statistic that can
be used to test hypotheses or form confidence intervals about both
the intercept and the regression coefficient (slope).

One important use of regression is to be able to predict outcomes
in a future group of subjects.

When predicting outcomes, the confidence limits are called
confidence bands about the regression line. The most accurate predictions
are for outcomes close to the mean of the independent variable X,
and they become less precise as the outcome departs from the mean.

It is possible to test whether the regression line is the
same (ie, has the same slope and intercept) in two different groups.

A residual is the difference between the actual and the predicted
outcome; looking at the distribution of residuals helps statisticians
decide if the linear regression model is the best approach to analyzing
the data.

Regression toward the mean can result in a treatment or procedure
appearing to be of value when it has had no actual effect; having
a control group helps to guard against this problem.

Correlation and regression should not be used unless observations
are independent; it is not appropriate to include multiple measurements
of the same subjects.

Mixing two populations can also cause the correlation and
regression coefficients to be larger than they should be.

The use of correlation versus regression should be dictated
by the purpose of the research—whether it is to establish
a relationship or to predict an outcome.

The regression model can be extended to accommodate two or
more independent variables; this model is called multiple regression.

Determining the needed sample size for correlation and regression
is not difficult using one of the power analysis statistical programs.

Presenting Problems

Presenting Problem
1

In the United States, according to World Health Organization
(WHO) standards, 42% of men and 28% of women are
overweight, and an additional 21% of men and 28% of
women are obese. Body mass index (BMI) has become the measure to
define standards of overweight and obesity. The WHO defines overweight
as a BMI between 25 and 29.9 kg/m^{2} and obesity
as a BMI greater than or equal to 30 kg/m^{2}. Jackson
and colleagues (2002) point out that the use of BMI as a single
standard for obesity for all adults has been recommended because
it is assumed to be independent of variables such as age, sex, ethnicity,
and physical activity. Their goal was to examine this assumption
by evaluating the effects of sex, age, and race on the relation
between BMI and measured percent fat. They studied 665 black and
white men and women who ranged in age between 17 years and 65 years.
Each participant was carefully measured for height and weight to
calculate BMI and body density. Relative body fat (%fat)
was estimated from body density using previously published equations.
The independent variables examined were BMI, sex, age, and race.
We examine these data to learn whether a relationship exists, and
if so, whether it is linear or not. Data are on the CD [available only with the book] in a folder entitled "Jackson."

Presenting Problem
2

Hypertension, defined as systolic pressure greater than 140 mm
Hg or diastolic pressure greater than 90 mm Hg, is present in 20–30% of
the U.S. population. Recognition and treatment of hypertension has
significantly reduced the morbidity and mortality associated with
the complications of hypertension. A number of finger blood pressure
devices are marketed for home use by patients as an easy and convenient
way for them to monitor their own blood pressure.

How correct are these finger blood pressure devices? Nesselroad
and colleagues (1996) studied these devices to determine their accuracy.
They measured blood pressure in 100 consecutive patients presenting
to a family practice office who consented to participate. After
being seated for 5 min, blood pressure was measured in each patient
using a standard blood pressure cuff of appropriate size and with
each of three automated finger blood pressure devices. The data
were analyzed by calculating the correlation coefficient between
the value obtained with the blood pressure cuff and the three finger
devices and by calculating the percentage of measurements with each
automated device that fell within the ± 4 mm Hg margin of error
of the blood pressure cuff.

We use the data to illustrate correlation and scatterplots. We
also illustrate a test of hypothesis about two dependent or related
correlation coefficients. Data are given in the section titled, "Spearman's
Rho," and on the CD-ROM [available only with the book] in a folder called "Nesselroad."

Presenting Problem
3

Symptoms of forgetfulness and loss of concentration can be a
result of natural aging and are often aggravated by fatigue, illness,
depression, visual or hearing loss, or certain medications. Hodgson and
Cutler (1997) wished to examine the consequences of anticipatory
dementia, a phenomenon characterized as the fear that normal and
age-associated memory change may be the harbinger of Alzheimer's
disease.

They studied 25 men and women having a living parent with a probable
diagnosis of Alzheimer's disease, a condition in which
genetic factors are known to be important. A control group of 25 men
and women who did not have a parent with dementia was selected for
comparison. A directed interview and questionnaire were used to
measure concern about developing Alzheimer's disease and
to assess subjective memory functioning. Four measures of each individual's
sense of well-being were used in the areas of depression, psychiatric
symptomatology, life satisfaction, and subjective health status.
We use this study to illustrate biserial correlation and show its
concordance with the t test. Observations from the study are given
in files on the CD-ROM [available only with the book] entitled "Hodgson."

Presenting Problem
4

The study of hyperthyroid women by Gonzalo and coinvestigators
(1996) was a presenting problem in Chapter 7. Recall that the study
reported the effect of excess body weight in hyperthyroid patients
on glucose tolerance, insulin secretion, and insulin sensitivity.
The study included 14 hyperthyroid women, 6 of whom were overweight,
and 19 volunteers with normal thyroid levels of similar ages and
weight. The investigators in this study also examined the relationship
between insulin sensitivity and body mass index for hyperthyroid
and control women. (See Figure 3 in the Gonzalo article.) We revisit
this study to calculate and compare two regression lines. Original observations
are given in Chapter 7, Table 7–8.

An Overview
of Correlation & Regression

In Chapter 3 we introduced methods to describe the association
or relationship between two variables. In this chapter we review
these concepts and extend the idea to predicting the value of one characteristic
from the other. We also present the statistical procedures used
to test whether a relationship between two characteristics is significant.
Two probability distributions introduced previously, the t distribution and the chi-square distribution, can be used for statistical tests in correlation and regression.
As a result, you will be pleased to learn that much of the material
in this chapter will be familiar to you.

When the goal is merely to establish a relationship (or association)
between two measures, as in these studies, the correlation coefficient
(introduced in Chapter 3) is the statistic most often used. Recall
that correlation is a measure of the linear relationship
between two variables measured on a numerical scale.

In addition to establishing a relationship, investigators sometimes
want to predict an outcome, dependent, or response, variable from an independent, or explanatory,
variable. Generally, the explanatory characteristic is the
one that occurs first or is easier or less costly to measure. The
statistical method of linear regression is
used; this technique involves determining an equation for predicting
the value of the outcome from values of the explanatory variable.
One of the major differences between correlation and regression
is the purpose of the analysis—whether it is merely to
describe a relationship or to predict a value. Several important
similarities exist as well, including the direct relationship between
the correlation coefficient and the regression coefficient. Many
of the same assumptions are required for correlation and regression,
and both measure the extent of a linear relationship between the
two characteristics.

Correlation

Figure 8–1 illustrates several hypothetical scatterplots of data to demonstrate
the relationship between the size of the correlation coefficient r and the shape of the scatterplot.
When the correlation is near zero, as in Figure 8–1E, the
pattern of plotted points is somewhat circular. When the degree
of relationship is small, the pattern is more like an oval, as in
Figures 8–1D and 8–1B. As the value of the correlation
gets closer to either +1 or –1, as in Figure 8–1C,
the plot has a long, narrow shape; at +1 and –1,
the observations fall directly on a line, as for r = +1.0
in Figure 8–1A.

The scatterplot in Figure 8–1F illustrates a situation
in which a strong but nonlinear relationship exists. For example,
with temperatures less than 10–15°C, a cold nerve fiber
discharges few impulses; as the temperature increases, so do numbers
of impulses per second until the temperature reaches about 25°C.
As the temperature increases beyond 25°C, the numbers of impulses
per second decrease once again, until they cease at 40–45°C.
The correlation coefficient, however, measures only a linear relationship,
and it has a value close to zero in this situation.

One of the reasons for producing scatterplots of data as part
of the initial analysis is to identify nonlinear relationships when
they occur. Otherwise, if researchers calculate the correlation
coefficient without examining the data, they can miss a strong,
but nonlinear, relationship, such as the one between temperature
and number of cold nerve fiber impulses.

Calculating
the Correlation Coefficient

We use the study by Jackson and colleagues (2002) to extend our
understanding of correlation. We assume that anyone interested in
actually calculating the correlation coefficient will
use a computer program, as we do in this chapter. If you are interested
in a detailed illustration of the calculations, refer to Chapter 3, in
the section titled, "Describing the Relationship between
Two Characteristics," and the study by Hébert
and colleagues (1997).

Recall that the formula for the Pearson product moment correlation
coefficient, symbolized by r, is

r = Σ(X – X̄)(Y – Ȳ) / √[Σ(X – X̄)² Σ(Y – Ȳ)²]

where X stands for the independent
variable and Y for the outcome variable.

A highly recommended first step in looking at the relationship
between two numerical characteristics is to examine the relationship
graphically. Figure 8–2 is a scatterplot of the data, with
body mass index (BMI) on the X-axis
and percent body fat on the Y-axis.
We see from Figure 8–2 that a positive relationship exists
between these two characteristics: Small values for BMI are associated
with small values for percent body fat. The question of interest
is whether the observed relationship is statistically significant.
(A large number of duplicate or overlapping data points occur in
this plot because the sample size is so large.)

Figure 8–2.

Scatterplot of body mass index and percent body fat.
(Data, used with permission, from Jackson A, Stanforth PR, Gagnon
J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and
race on estimating percentage body fat from body mass index: The
Heritage Family Study. Int J Obes Relat
Metab Disord 2002;26:789–796.
Plot produced with NCSS; used with permission.)

The extent of the relationship can be found by calculating the
correlation coefficient. Using a statistical program, the correlation
between BMI and percent body fat is 0.73, indicating a strong relationship
between these two measures. Use the CD-ROM [available only with the book] to confirm our calculations.
Also, see Chapter 3, in the section titled, "Describing
the Relationship between Two Characteristics," for a review
of the properties of the correlation coefficient.
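The chapter relies on statistical programs such as NCSS for this computation. For readers who prefer to see the arithmetic, here is a minimal Python sketch of the definitional formula; the BMI and percent-fat values are made up for illustration and are not the Jackson data:

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation from the definitional formula."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical BMI and %fat values (illustrative only)
bmi = [19.2, 21.5, 23.8, 26.1, 28.4, 31.0, 33.7]
pct_fat = [12.0, 15.5, 18.1, 24.0, 27.2, 30.8, 36.5]
r = pearson_r(bmi, pct_fat)
```

For the actual study data, a statistical program gives r = 0.73, as reported in the text.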

Interpreting
the Size of r

The size of the correlation required for statistical significance
is, of course, related to the sample size. With a very large sample
of subjects, such as 2000, even small correlations, such as 0.06,
are significant. A better way to interpret the size of the correlation
is to consider what it tells us about the strength of the relationship.

The Coefficient
of Determination

The correlation coefficient can be squared to form the statistic
called the coefficient of determination. For
the subjects in the study by Jackson, the coefficient of determination
is (0.73)^{2}, or 0.53. This means that 53% of the
variation in the values for one of the measures, such as percent
body fat, may be accounted for by knowing the BMI. This concept
is demonstrated by the Venn diagrams in Figure 8–3. For
the left diagram, r^{2} = 0.25;
so 25% of the variation in A is
accounted for by knowing B (or vice
versa). The middle diagram illustrates r^{2} = 0.50,
similar to the value we observed, and the diagram on the right
represents r^{2} = 0.80.

Figure 8–3.

Illustration of r^{2}, proportion of explained
variance.
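The arithmetic behind the coefficient of determination is a single squaring step; a one-line sketch:

```python
def coefficient_of_determination(r):
    """Proportion of variation in one variable accounted for by the other."""
    return r ** 2

r_squared = coefficient_of_determination(0.73)   # 0.5329, about 53%
```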

The coefficient of determination tells us how strong the relationship
really is. In the health literature, confidence limits or results
of a statistical test for significance of the correlation coefficient are
also commonly presented.

The t Test for
Correlation

The symbol for the correlation coefficient in the population
(the population parameter) is ρ (the lowercase Greek
letter rho). In a random sample, ρ is estimated
by r. If several random samples of
the same size are selected from a given population and the correlation
coefficient r is calculated for each,
we expect r to vary from one sample
to another but to follow some sort of distribution about the population value
ρ. Unfortunately, the sampling distribution of the correlation
does not behave as nicely as the sampling distribution of the mean,
which is normally distributed for large samples.

Part of the problem is a ceiling effect when the correlation
approaches either –1 or +1. If the value of the
population parameter is, say, 0.8, the sample values can exceed
0.8 only up to 1.0, but they can be less than 0.8 all the way to –1.0.
The maximum value of 1.0 acts like a ceiling, keeping the sample
values from varying as much above 0.8 as below it, and the result
is a skewed distribution. When the
population parameter is hypothesized to be zero, however, the ceiling
effects are equal, and the sample values are approximately distributed
according to the t distribution, which
can be used to test the hypothesis that the true value of the population
parameter ρ is equal to zero. The following mathematical expression
involving the correlation coefficient, often called the t ratio, has been found to have a t distribution with n – 2 degrees of freedom:

t = r √(n – 2) / √(1 – r²)

Let us use this t ratio to test
whether the observed value of r = 0.73
is sufficient evidence with 655 observations to conclude that the
true population value of the correlation is different
from zero.

Step 1: H_{0}: No relationship
exists between BMI and percent body fat; or, the true correlation
is zero: ρ = 0.

H_{1}: A relationship does exist between BMI and percent
body fat; or, the true correlation is not zero: ρ ≠ 0.

Step 2: Because the null hypothesis
is a test of whether ρ is zero, the t ratio may be used when the assumptions
for correlation (see the section titled, "Assumptions in Correlation")
are met.

Step 3: Let us use α = 0.01
for this example.

Step 4: The degrees of freedom
are n – 2 = 655 – 2 = 653.
The value of a t distribution with
653 degrees of freedom that divides the area into the central 99% and
the upper and lower 1% is approximately 2.617 (using the
value for 120 df in Table A–3).
We therefore reject the null hypothesis of zero correlation if (the
absolute value of) the observed value of t is
greater than 2.617.

Step 5: The calculation is

t = 0.73 √653 / √(1 – 0.5329) = 27.29

Step 6: The observed value of the t ratio with 653 degrees of freedom
is 27.29, far greater than 2.617. The null hypothesis of zero correlation
is therefore rejected, and we conclude that the relationship between
BMI and percent body fat is large enough to conclude that these
two variables are associated.
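The six steps above reduce to a single line of arithmetic. A short Python sketch (the critical value 2.617 comes from the text's Table A–3 lookup for α = 0.01):

```python
import math

def t_ratio(r, n):
    """t statistic for testing H0: rho = 0; has n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_ratio(0.73, 655)      # about 27.3, as in Step 5
reject = abs(t) > 2.617     # critical value for alpha = 0.01, ~653 df
```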

Fisher's
z Transformation to Test the Correlation

Investigators generally want to know whether ρ = 0,
and this test can easily be done with computer programs. Occasionally,
however, interest lies in whether the correlation is equal to a
specific value other than zero. For example, consider a diagnostic
test that gives accurate numerical values but is invasive and somewhat
risky for the patient. If someone develops an alternative testing
procedure, it is important to show that the new procedure is as
accurate as the test in current use. The approach is to select a
sample of patients and perform both the current test and the new
procedure on each patient and then calculate the correlation coefficient
between the two testing procedures.

Either a test of hypothesis can be performed to show that the
correlation is greater than a given value, or a confidence interval
about the observed correlation can be calculated. In either case,
we use a procedure called Fisher's z transformation to test any
null hypothesis about the correlation as well as to form confidence
intervals.

To use Fisher's z transformation, we first transform the correlation
and then use the standard normal (z)
distribution. We need to transform the correlation because, as we
mentioned earlier, the distribution of sample values of the correlation
is skewed when ρ ≠ 0. Although
this method is a bit complicated, it is actually more flexible than
the t test, because it permits us to
test any null hypothesis, not simply that the correlation is zero. Fisher's z transformation was proposed by the
same statistician (Ronald Fisher) who developed Fisher's exact
test for 2 x 2 contingency tables (discussed
in Chapter 6).

Fisher's z transformation
is

z(r) = ½ ln[(1 + r)/(1 – r)]

where ln represents the natural logarithm. Table A–6
gives the z transformation for different
values of r, so we do not actually
need to use the formula. With moderate-sized samples, this transformation follows
a normal distribution, and the following expression for the z test can be used:

z = [z(r) – z(ρ)] √(n – 3)

To illustrate Fisher's z transformation
for testing the significance of ρ, we evaluate
the relationship between BMI and percent body fat (Jackson et al,
2002). The observed correlation between these two measures was 0.73.
Jackson and his colleagues may have expected a sizable correlation
between these two measures; let us suppose they want to know whether
the correlation is significantly greater than 0.65. A one-tailed
test of the null hypothesis that ρ ≤ 0.65,
which they hope to reject, may be carried out as follows.

Step 1: H_{0}: The relationship
between BMI and percent body fat is 0.65 or less; or, the true correlation ρ ≤ 0.65.

H_{1}: The relationship between BMI and percent body fat
is greater than 0.65; or, the true correlation ρ > 0.65.

Step 2: Fisher's z transformation may be used with the
correlation coefficient to test any hypothesis.

Step 3: Let us again use α = 0.01
for this example.

Step 4: The alternative hypothesis
specifies a one-tailed test. The value of the z distribution
that divides the area into the lower 99% and the upper
1% is approximately 2.326 (from Table A–2). We
therefore reject the null hypothesis that the correlation is 0.65
if the observed value of z is > 2.326.

Step 5: The first step is to find
the transformed values for r = 0.73
and ρ = 0.65 from Table A–6;
these values are 0.929 and 0.775, respectively. Then, the calculation
for the z test is

z = (0.929 – 0.775) √(655 – 3) = (0.154)(25.53) = 3.93

Step 6: The observed value of the z statistic, 3.93, exceeds 2.326. The
null hypothesis that the correlation is 0.65 or less is rejected, and
the investigators can be assured that the relationship between BMI
and body fat is greater than 0.65.
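The steps above can be sketched in code using the transformation formula given earlier, ½ ln[(1 + r)/(1 – r)]; computed values differ trivially from the text's table lookups:

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_statistic(r, rho0, n):
    """z test of H0: rho = rho0, using Fisher-transformed values."""
    return (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)

z = z_statistic(0.73, 0.65, 655)   # about 3.9, exceeding the 2.326 critical value
```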

Confidence Interval
for the Correlation

A major advantage of Fisher's z transformation
is that confidence intervals can be
formed. The transformed value of the correlation is used to calculate
confidence limits in the usual manner, and then they are transformed
back to values corresponding to the correlation coefficient.

To illustrate, we calculate a 95% confidence interval
for the correlation coefficient 0.73 in Jackson and colleagues (2002).
We use Fisher's z transformation
of 0.73 = 0.929 and the z distribution
in Table A–2 to find the critical value for 95%.
The confidence interval is

0.929 ± 1.96 (1/√(655 – 3)) = 0.929 ± 0.077, or 0.852 to 1.006

Transforming the limits 0.852 and 1.006 back to correlations
using Table A–6 in reverse gives approximately r = 0.69 and r = 0.77 (using conservative
values). Therefore, we are 95% confident that the true
value of the correlation in the population is contained within this
interval. Note that 0.65 is not in this interval, which is consistent
with our conclusion that the observed correlation of 0.73 is different
from 0.65.
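The same limits can be computed without tables, because the hyperbolic tangent is the exact inverse of Fisher's z transformation. A minimal sketch:

```python
import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def correlation_ci(r, n, z_crit=1.96):
    """95% CI for a correlation via Fisher's z; tanh inverts the transform."""
    center = fisher_z(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)

low, high = correlation_ci(0.73, 655)   # roughly 0.69 to 0.76
```

The text's conservative table lookup gives the slightly wider interval 0.69 to 0.77.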

Surprisingly, computer programs do not always contain routines
for finding confidence limits for a correlation. We have included
a Microsoft Excel program in the Calculations folder on the CD-ROM [available only with the book] that calculates
the 95% CI for a correlation.

Assumptions
in Correlation

The assumptions needed to draw valid conclusions about the correlation
coefficient are that the sample was randomly selected and the two
variables, X and Y, vary
together in a joint distribution that is normally distributed, called
the bivariate normal distribution. Just because each variable is
normally distributed when examined separately, however, does not
guarantee that, jointly, they have a bivariate normal distribution.
Some guidance is available: If either of the two variables is not normally distributed, Pearson's
product moment correlation coefficient is not the
most appropriate method. Instead, either one or both of the variables
may be transformed so that they more closely follow a normal distribution,
as discussed in Chapter 5, or the Spearman rank correlation may
be calculated. This topic is discussed in the section titled, "Other
Measures of Correlation."

Comparing Two
Correlation Coefficients

On occasion, investigators want to know if a difference exists
between two correlation coefficients. Here are two specific instances:
(1) comparing the correlations between the same two variables that
have been measured in two independent groups of subjects and (2)
comparing two correlations that involve a variable in common in
the same group of individuals. These situations are not extremely
common and not always contained in statistical programs. We designed Microsoft
Excel programs; see the folder "Calculations" on
the CD-ROM [available only with the book].

Comparing Correlations
in Two Independent Groups

Fisher's z transformation
can be used to test hypotheses or form confidence intervals about
the difference between the correlations between the same two variables
in two independent groups; such correlations are called independent correlations. For example,
Gonzalo and colleagues (1996) in Presenting Problem 4 wanted to
compare the correlation between BMI and insulin sensitivity in the
14 hyperthyroid women (r = –0.775)
with the correlation between BMI and insulin sensitivity in the
19 control women (r = –0.456).
See Figure 8–4.

Figure 8–4.

Scatterplot of BMI and insulin sensitivity. (Data, used
with permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ,
Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin
sensitivity and glucose effectiveness in normal and overweight hyperthyroid
women. Clin Endocrinol (Oxf) 1996;45:689–697. Output produced
using NCSS; used with permission.)

In this situation, the transformed value for the second group replaces z(ρ) in the numerator
of the z test shown in the previous
section, and 1/(n – 3)
is found for each group and added before taking the square root
in the denominator. The test statistic is

z = [z(r1) – z(r2)] / √[1/(n1 – 3) + 1/(n2 – 3)]

To illustrate, the values of z from
Fisher's z transformation
tables (A–6) for –0.775 and –0.456 are
approximately 1.033 and 0.492 (with interpolation), respectively.
Note that Fisher's z transformation
is the same, regardless of whether the correlation is positive or
negative. Using these values, we obtain

z = (1.033 – 0.492) / √(1/11 + 1/16) = 0.541/0.392 = 1.38

Assuming we choose the traditional significance level of 0.05,
the value of the test statistic, 1.38, is less than the critical
value, 1.96, so we do not reject the null hypothesis of equal correlations. We
decide that the evidence is insufficient to conclude that the relationship
between BMI and insulin sensitivity is different for hyperthyroid
women from that for controls. What is a possible explanation for
the lack of statistical significance? It is possible that there
is no difference in the relationships between these two variables
in the population. When sample sizes are small, however, as they
are in this study, it is always advisable to keep in mind that the
study may have low power.
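The comparison above can be sketched as follows, summing 1/(n – 3) over the two groups as the formula requires:

```python
import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_independent_r(r1, n1, r2, n2):
    """z test for equality of correlations in two independent groups."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Hyperthyroid women (n = 14) vs controls (n = 19)
z = compare_independent_r(-0.775, 14, -0.456, 19)   # |z| about 1.38
```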

Comparing Correlations
with Variables in Common in the Same Group

The second situation occurs when the research question involves
correlations that contain the same variable (also called dependent
correlations). For example, a very natural question for Nesselroad
and colleagues (1996) was whether one of the finger devices was
more highly correlated with the blood pressure cuff—considered
to be the gold standard—than the other two. If so, this would
be a product they might wish to recommend for patients to use at
home. To illustrate, we compare the diastolic reading with device
1 and the cuff (r_{XY} = 0.32)
to the diastolic reading with device 2 and the cuff (r_{XZ} = 0.45).

There are several formulas for testing the difference between
two dependent correlations. We present the simplest one, developed
by Hotelling (1940) and described by Glass and Stanley (1970) on
pages 310–311 of their book. We will show the calculations
for this example but, as always, suggest that you use a computer
program. The formula follows the t distribution
with n – 3 degrees of freedom;
it looks rather forbidding and requires the calculation of several
correlations:

t = (r_{XY} – r_{XZ}) √[(n – 3)(1 + r_{YZ})] / √[2(1 – r_{XY}² – r_{XZ}² – r_{YZ}² + 2 r_{XY} r_{XZ} r_{YZ})]

We designate the cuff reading as X, device
1 as Y, and device 2 as Z. We therefore want to compare r_{XY} with r_{XZ}.
Both correlations involve the X, or
cuff, reading, so these correlations are dependent. To use the formula,
we also need to calculate the correlation between device 1 and device
2, which is r_{YZ} = 0.54.
Table 8–1 shows the correlations needed for this formula.

Table 8–1. Correlation
Matrix of Diastolic Blood Pressures in All 100 Subjects.

Pearson Correlations Section

                     Cuff        Device 1    Device 2    Device 3
                     Diastolic   Diastolic   Diastolic   Diastolic
Cuff Diastolic       1.0000      0.3209^{a}  0.4450      0.3592
                     0.0000      0.0011      0.0000      0.0002
                     100         100         100         100
Device 1 Diastolic   0.3210      1.0000      0.5364      0.5392
                     0.0011      0.0000      0.0000      0.0000
                     100         100         100         100
Device 2 Diastolic   0.4450      0.5364      1.0000      0.5629
                     0.0000      0.0000      0.0000      0.0000
                     100         100         100         100
Device 3 Diastolic   0.3592      0.5392      0.5629      1.0000
                     0.0002      0.0000      0.0000      0.0000
                     100         100         100         100

Each cell shows the correlation coefficient, its P value, and the number of subjects.

^{a}Bolded values are needed for comparing two
dependent correlations.

Source: Data, used with permission,
from Nesselroad JM, Flacco VA, Phillips DM, Kruse J: Accuracy of automated
finger blood pressure devices. Fam Med 1996;28:189–192.
Output produced using NCSS; used with permission.

The calculations are

t = (0.3209 – 0.4450) √[(100 – 3)(1 + 0.5364)] / √[2(1 – 0.3209² – 0.4450² – 0.5364² + 2(0.3209)(0.4450)(0.5364))]

You know by now that the difference between these two correlations
is not statistically significant because the observed value of t is –1.50, and |–1.50| = 1.50
is less than the critical value of t with
97 degrees of freedom, 1.99. This conclusion is consistent with that
of Nesselroad and his colleagues, who recommended that
patients be cautioned that the finger blood pressure devices may
not perform as marketed.
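A sketch of Hotelling's (1940) test for two dependent correlations that share variable X; because of rounding in the hand calculation, the computed t may differ slightly from the value quoted in the text:

```python
import math

def hotelling_t(r_xy, r_xz, r_yz, n):
    """Hotelling's t (n - 3 df) comparing r_XY with r_XZ, which share X."""
    det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
    return (r_xy - r_xz) * math.sqrt((n - 3) * (1 + r_yz) / (2 * det))

# Cuff vs device 1 (r_XY) and cuff vs device 2 (r_XZ), with r_YZ between devices
t = hotelling_t(0.3209, 0.4450, 0.5364, 100)
```

Since |t| is well below the critical value of 1.99, the difference between the two dependent correlations is not statistically significant, consistent with the text's conclusion.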

We designed a Microsoft Excel program for these calculations
as well. It is included on the CD-ROM [available only with the book] in a folder called "Calculations" and
is entitled "z for 2 dept r's."

Other Measures
of Correlation

Several other measures of correlation are often found in the
medical literature. Spearman's rho, the rank correlation
introduced in Chapter 3, is used with ordinal data or in situations
in which the numerical variables are not normally distributed. When
a research question involves one numerical and one nominal variable,
a correlation called the point–biserial correlation is
used. With nominal data, the risk ratio, or kappa (κ),
discussed in Chapter 5, can be used.

Spearman's
Rho

Recall that the value of the correlation coefficient is markedly
influenced by extreme values and thus does not provide a good description
of the relationship between two variables when their distributions
are skewed or contain outlying values. For example, consider the
relationships among the various finger devices and the standard
cuff device for measuring blood pressure from Presenting Problem
2. To illustrate, we use the first 25 subjects from this study,
listed in Table 8–2 (see the file entitled "Nesselroad25" on the CD-ROM [available only with the book]).

Table 8–2. Data
on Diastolic Blood Pressure for the First 25 Subjects.

Subject   Cuff        Device 1    Device 2    Device 3
          Diastolic   Diastolic   Diastolic   Diastolic
 1         80          58          51          38
 2         65          79          61          47
 3         70          66          61          50
 4         80          93          75          53
 5         60          75          76          54
 6         82          71          75          56
 7         70          58          60          58
 8         70          73          74          58
 9         60          72          67          59
10         70          88          70          60
11         48          70          88          60
12        100         114          82          62
13         70          74          56          64
14         70          75          79          67
15         70          89          62          69
16         60          95          75          70
17         80          87          89          72
18         80          57          74          73
19         90          69          90          73
20         80          60          85          75
21         70          72          75          77
22         85          85          61          79
23        100         102          99          89
24         70         113          83          94
25         90         127         108          99

Source: Data, used with
permission, from Nesselroad JM, Flacco VA, Phillips DM, Kruse J:
Accuracy of automated finger blood pressure devices. Fam Med 1996;28:189–192.
Output produced using NCSS; used with permission.

It is difficult to tell if the observations are normally distributed
without looking at graphs of the data. Some statistical programs
have routines to plot values against a normal distribution to help researchers
decide whether a nonparametric procedure should be used. A normal
probability plot for the cuff diastolic measurement is given in
Figure 8–5. Use the CD-ROM [available only with the book] to produce similar plots for
the finger device measurements.

Figure 8–5.

Diastolic blood pressure using cuff readings in 25 subjects.
(Data, used with permission, from Nesselroad JM, Flacco VA, Phillips
DM, Kruse J: Accuracy of automated finger blood pressure devices. Fam Med 1996;28:189–192.
Output produced using NCSS; used with permission.)

When the observations are plotted on a graph, as in Figure 8–5,
it appears that the data are not unduly skewed. This conclusion
is consistent with the tests given for the normality of a distribution
by NCSS. In the normal probability plot, if observations fall within
the curved lines, the data can be assumed to be normally distributed.
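The idea behind the normal probability plot can also be checked numerically: the correlation between the sorted observations and the corresponding normal quantiles should be close to 1 for normally distributed data. The sketch below applies this check to the cuff readings in Table 8–2; the Blom-type plotting positions are one common convention, not necessarily the one NCSS uses.

```python
from statistics import NormalDist, mean

# Diastolic cuff readings for the 25 subjects in Table 8-2
cuff = [80, 65, 70, 80, 60, 82, 70, 70, 60, 70, 48, 100, 70,
        70, 70, 60, 80, 80, 90, 80, 70, 85, 100, 70, 90]

def normal_plot_correlation(data):
    """Correlate sorted data with theoretical normal quantiles,
    a numeric analogue of the normal probability plot."""
    n = len(data)
    xs = sorted(data)
    # Blom-type plotting positions for the theoretical quantiles
    qs = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, mq = mean(xs), mean(qs)
    num = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    den = (sum((x - mx) ** 2 for x in xs) * sum((q - mq) ** 2 for q in qs)) ** 0.5
    return num / den

r = normal_plot_correlation(cuff)
# A correlation near 1 is consistent with normality
```

A correlation well below 1 (relative to published critical values for this statistic) would suggest the data depart from a normal distribution.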

As we indicated in Chapter 3, a simple method for dealing with
the problem of extreme observations in correlation is to rank order
the data and then recalculate the correlation on ranks to obtain the
nonparametric correlation called Spearman's
rho, or rank correlation. To
illustrate this procedure, we continue to use data on the first
25 subjects in the study on blood pressure devices (Presenting Problem
2), even though the distribution of the values does not require
this procedure. Let us focus on the correlation between the cuff
and device 2, which we learned was 0.45 in the section titled, "Comparing
Correlations with Variables in Common in the Same Group."

Table 8–3 illustrates the ranks of the diastolic readings
on the first 25 subjects. Note that each variable is ranked separately;
when ties occur, the average of the ranks of the tied values is
used.

Table 8–3. Rank
Order of the Diastolic Blood Pressure for the First 25 Subjects.

Row   Cuff Diastolic   Device 1 Diastolic   Device 2 Diastolic   Device 3 Diastolic
1     17.0             2.5                  1.0                  1.0
2     5.0              15.0                 5.0                  2.0
3     10.0             5.0                  5.0                  3.0
4     17.0             20.0                 13.5                 4.0
5     3.0              13.5                 16.0                 5.0
6     20.0             8.0                  13.5                 6.0
7     10.0             2.5                  3.0                  7.5
8     10.0             11.0                 10.5                 7.5
9     3.0              9.5                  8.0                  9.0
10    10.0             18.0                 9.0                  10.5
11    1.0              7.0                  21.0                 10.5
12    24.5             24.0                 18.0                 12.0
13    10.0             12.0                 2.0                  13.0
14    10.0             13.5                 17.0                 14.0
15    10.0             19.0                 7.0                  15.0
16    3.0              21.0                 13.5                 16.0
17    17.0             17.0                 22.0                 17.0
18    17.0             1.0                  10.5                 18.5
19    22.5             6.0                  23.0                 18.5
20    17.0             4.0                  20.0                 20.0
21    10.0             9.5                  13.5                 21.0
22    21.0             16.0                 5.0                  22.0
23    24.5             22.0                 24.0                 23.0
24    10.0             23.0                 19.0                 24.0
25    22.5             25.0                 25.0                 25.0

Source: Data, used with
permission, from Nesselroad JM, Flacco VA, Phillips DM, Kruse J:
Accuracy of automated finger blood pressure devices. Fam Med 1996;28:189–192.
Output produced using NCSS; used with permission.

The ranks of the variables are used in the equation for the correlation
coefficient, and the resulting calculation gives Spearman's
rank correlation (r_{S}), also
called Spearman's rho:

r_{S} = Σ(R_{X} – R̄_{X})(R_{Y} – R̄_{Y}) / √[Σ(R_{X} – R̄_{X})^{2} Σ(R_{Y} – R̄_{Y})^{2}]

where R_{X} is the rank of
the X variable, R_{Y} is
the rank of the Y variable, and R̄_{X} and R̄_{Y} are
the mean ranks for the X and Y variables, respectively. The rank
correlation r_{S} may also be
calculated by using other formulas, but this approximate procedure
is quite good (Conover and Iman, 1981).

Calculating r_{S} for the
ranked observations in Table 8–3 gives r_{S} = 0.33.

The value of r_{S} is smaller
than the value of Pearson's correlation; this may occur
when the bivariate distribution of the two variables is not normal.
The t test, as illustrated for the
Pearson correlation, can be used to determine whether the Spearman rank
correlation is significantly different from zero. For example, the
following procedure tests whether the value of Spearman's
rho in the population, symbolized ρ_{S} (Greek
letter rho with subscript S denoting Spearman), differs from zero.

Step 1: H_{0}: The population
value of Spearman's rho is zero; that is, ρ_{S} = 0.

H_{1}: The population value of Spearman's rho
is not zero; that is, ρ_{S} ≠ 0.

Step 2: Because the null hypothesis
is a test of whether ρ_{S} is zero, the t ratio may be used.

Step 3: Let us use α = 0.05
for this example.

Step 4: The degrees of freedom
are n – 2 = 25 – 2 = 23.
The value of the t distribution with
23 degrees of freedom that divides the area into the central 95% and
the upper and lower 2½% is 2.069 (Table A–3),
so we will reject the null hypothesis if (the absolute value of)
the observed value of t is greater
than 2.069.

Step 5: The calculation is

t = r_{S}√(n – 2) / √(1 – r_{S}^{2}) = 0.33√(25 – 2) / √(1 – 0.33^{2}) = 1.677
Step 6: The observed value of the t ratio with 23 degrees of freedom
is 1.677, less than 2.069, so we do not reject the null hypothesis and
conclude there is insufficient evidence that a nonparametric correlation
exists between the diastolic pressure measurements made by the cuff
and finger device 2.

Of course, if investigators want to test only whether Spearman's
rho is greater than zero—that there is a significantly
positive relationship—they can use a one-tailed test. For
a one-tailed test with α = 0.05 and 23
degrees of freedom, the critical value is 1.714, and the conclusion
is the same.

It is easy to demonstrate that calculating the correlation
on ranked data gives approximately the same result as the
Spearman rho calculated the traditional way. We just used the Pearson
formula on ranks and found that Spearman's rho for the
sample of 25 subjects was 0.33 between the cuff measurement of diastolic
pressure and finger device 2. Use the CD-ROM [available only with the book], and calculate Spearman's
rho on the original data. You should also find 0.33 using the traditional methods
of calculation.
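The rank-then-correlate procedure can be sketched in a few lines of code. This is a minimal illustration using the cuff and device 2 readings from Table 8–2; the tie-handling (average ranks) follows the rule described above.

```python
from math import sqrt

# Cuff and finger device 2 diastolic readings (Table 8-2)
cuff = [80, 65, 70, 80, 60, 82, 70, 70, 60, 70, 48, 100, 70,
        70, 70, 60, 80, 80, 90, 80, 70, 85, 100, 70, 90]
dev2 = [51, 61, 61, 75, 76, 75, 60, 74, 67, 70, 88, 82, 56,
        79, 62, 75, 89, 74, 90, 85, 75, 61, 99, 83, 108]

def ranks(values):
    """Rank the values, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Spearman's rho = Pearson correlation computed on the ranks
rho = pearson(ranks(cuff), ranks(dev2))          # about 0.33
n = len(cuff)
t = rho * sqrt(n - 2) / sqrt(1 - rho ** 2)       # about 1.68, below the critical 2.069
```

The computed rank correlation reproduces the 0.33 quoted in the text, and the t ratio agrees with the hypothesis test worked through in Steps 1–6 above.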

To summarize, Spearman's rho is appropriate when investigators
want to measure the relationship between: (1) two ordinal variables,
or (2) two numerical variables when one or both are not normally
distributed and investigators choose not to use a data transformation
(such as taking the logarithm). Spearman's rank correlation
is especially appropriate when outlying values occur among the observations.

Confidence Interval
for the Odds Ratio & the Relative Risk

Chapter 3 introduced the relative risk (or
risk ratio) and the odds ratio as
measures of relationship between two nominal characteristics. Developed
by epidemiologists, these statistics are used for studies examining
risks that may result in disease. To discuss the odds ratio, recall
the study discussed in Chapter 3 by Ballard and colleagues (1998)
that examined the use of antenatal thyrotropin-releasing hormone
(TRH). Data from this study were given in Chapter 3, Table 3–21.
We calculated the odds ratio as 1.1, meaning that the odds of
developing respiratory distress syndrome in the TRH group are 1.1
times those in the placebo group. This finding is the
opposite of what the investigators expected to find, and it is important
to learn if the increased risk is statistically significant.

Significance can be determined in several ways. For instance,
to test the significance of the relationship between treatment (TRH
versus placebo) and the development of respiratory distress syndrome,
investigators may use the chi-square test discussed in Chapter 6.
The chi-square test for this example is left as an exercise (see
Exercise 2). An alternative chi-square test, based on the natural
logarithm of the odds ratio, is also available, and it results in
values close to the chi-square test illustrated in Chapter 6 (Fleiss,
1999).

More often, articles in the medical literature use confidence
intervals for risk ratios or odds ratios. Ballard and colleagues
reported a 95% confidence interval for the odds ratio as
(0.8 to 1.5). Let us see how they found this confidence interval.

Finding confidence intervals for odds ratios is a bit more complicated
than usual because these ratios are not normally distributed, so
calculations require finding natural logarithms and antilogarithms.
The formula for a 95% confidence interval for the odds
ratio is

exp[ln(OR) ± 1.96 √(1/a + 1/b + 1/c + 1/d)]

where exp denotes the exponential function, or antilogarithm,
of the natural logarithm, ln, and a, b,
c, d are the cells in a 2 x 2 table
(see Table 6–9 in Chapter 6). The confidence interval for
the odds ratio for risk of respiratory distress syndrome in infants
who were given TRH from Table 3–21 is approximately 0.8 to 1.5,
the interval reported by the investigators.
This interval contains the value of the true odds ratio with
95% confidence. If the odds are the same in each group,
the value of the odds ratio is approximately 1, indicating similar
risks in each group. Because the interval contains 1, we may be
95% confident that the odds ratio may in fact be 1;
that is, insufficient evidence exists to conclude that the risk
of respiratory distress increases in infants who received TRH. By
the same logic, we cannot conclude that the treatment has a protective effect. Of course,
90% or 99% confidence intervals can be formed
by using 1.645 or 2.575 instead of 1.96 in the preceding equation.
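The odds ratio calculation can be sketched as follows. The cell counts here are made up for illustration; they are not the Table 3–21 data.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and its 95% CI from a 2 x 2 table with cells
    a, b, c, d laid out as in Table 6-9."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # SE of ln(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# Hypothetical cell counts for illustration only
or_, lo, hi = odds_ratio_ci(10, 20, 5, 40)
```

Passing z = 1.645 or z = 2.575 instead of 1.96 gives the 90% or 99% interval, as noted in the text.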

To illustrate the confidence interval for the relative risk,
we refer to the physicians' health study (Steering Committee
of the Physicians' Health Study Research Group, 1989) summarized
in Chapter 3 and Table 3–19. Recall that the relative risk
for an MI in physicians taking aspirin was 0.581. The 95% confidence
interval for the true value of the relative risk also involves logarithms:

exp[ln(RR) ± 1.96 √((1/a – 1/(a + b)) + (1/c – 1/(c + d)))]
Again, the values for a, b, c, d are
the cells in the 2 x 2 table illustrated
in Table 6–9. Although it is possible to include a continuity
correction for the relative risk or odds ratio, it is not commonly
done. Substituting values from Table 3–19, the 95% confidence
interval for a relative risk of 0.581 is

The 95% confidence interval does not contain
1, so the evidence indicates that the use of aspirin resulted in
a reduced risk for MI. For a detailed and insightful discussion
of the odds ratio and its advantages and disadvantages, see Feinstein
(1985, Chapter 20) and Fleiss (1999, Chapter 5); for a discussion
of the odds ratio and the risk ratio, see Greenberg and coworkers
(2002, Chapters 8 and 9).
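The relative risk interval follows the same exp-of-logarithm pattern, with a different standard error. As before, the cell counts below are hypothetical, not the aspirin-study data.

```python
from math import exp, log, sqrt

def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk and its 95% CI from a 2 x 2 table with cells
    a, b, c, d laid out as in Table 6-9 (risk a/(a+b) vs c/(c+d))."""
    rr = (a / (a + b)) / (c / (c + d))
    se = sqrt((1 / a - 1 / (a + b)) + (1 / c - 1 / (c + d)))   # SE of ln(RR)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Hypothetical cell counts for illustration only
rr, lo, hi = relative_risk_ci(10, 20, 5, 40)
```

As with the odds ratio, an interval that excludes 1 indicates a statistically significant increase or decrease in risk.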

The folder containing Microsoft Excel equations on the CD-ROM
[available only with the book] describes two routines for finding the 95% confidence limits;
they are called "CI for OR" and "CI for
RR." You may find these routines helpful if you wish to
find 95% confidence limits for odds ratios or relative
risks for published studies that contain the summary data for these
statistics.

Measuring Relationships
in Other Situations

We have discussed how to measure and test the significance of
relationships by using Pearson's product moment correlation
coefficient, Spearman's nonparametric procedure based on
ranks, and risk or odds ratios. Not all situations are covered by
these procedures, however, such as when one variable is measured
on a nominal scale and the other is numerical but has been classified
into categories, when one variable is nominal and the other is ordinal,
or when both are ordinal but only a few categories occur. In these
cases, a contingency table is formed and the chi-square test is
used, as illustrated in Chapters 6 and 7.

On other occasions, the numerical variable is not collapsed into
categories. For example, Hodgson and Cutler (1997) studied 25 subjects
who had a living parent with Alzheimer's disease and a matched
group who had no family history of dementia. Subjects answered questions
about their concern of developing Alzheimer's disease and
completed a questionnaire designed to evaluate their concerns about
memory, the Memory Assessment Index (MAI). Data are given in Table
8–4.

Table 8–4. Data
on 50 Subjects in the Study on Anticipatory Dementia.

Sample^{a}   Sex   Concerned^{a}   MAI^{b}   Life Satisf^{a}   Health Status^{a}
1            F     1               6         0                 1
2            F     1               8         1                 1
1            F     0               0         1                 1
1            F     0               2         0                 1
1            F     1               4         0                 1
1            F     1               10        0                 0
1            M     0               3         1                 1
1            F     1               12        0                 0
1            F     1               8         0                 1
1            F     1               9         1                 1
1            F     1               8         0                 1
1            F     0               2         1                 1
1            F     1               6         1                 1
1            M     0               2         1                 1
1            M     0               2         1                 0
1            M     1               5         1                 1
1            M     0               3         0                 0
2            F     0               3         0                 1
2            F     0               0         1                 1
1            M     1               5         0                 1
2            F     0               1         1                 0
2            F     0               2         0                 1
1            F     1               7         0                 1
1            M     1               5         0                 1
1            F     1               7         0                 0
1            F     1               9         0                 1
2            F     1               10        0                 0
2            F     0               3         1                 1
2            F     1               5         0                 1
2            F     0               3         1                 1
2            F     1               9         0                 0
1            F     1               4         1                 1
1            F     1               8         1                 1
2            F     1               4         1                 1
2            F     0               2         0                 1
2            F     1               9         0                 0
2            F     0               3         1                 1
2            M     1               8         1                 1
2            F     0               3         1                 1
2            F     0               4         0                 1
2            M     1               7         0                 0
2            M     0               2         0                 1
2            F     0               0         1                 1
2            M     0               2         1                 1
1            F     0               5         0                 0
2            F     0               2         1                 1
1            F     1               8         0                 1
2            M     1               6         1                 1
2            F     0               2         0                 0
2            M     1               4         0                 1

^{a}Sample: 1=Alzheimer, 2=Control.

Concerned: 0=No, 1=Yes.

Life satisfaction: 0=Not satisfied, 1=Satisfied.

Health status: 0=Excellent, 1=Not excellent.

^{b}MAI (Memory Assessment Index) on scale of 0=No
memory problems to 12=Negative perceptions of memory and
very concerned about developing dementia.

Source: Data, used with permission,
from Hodgson LG, Cutler SJ: Anticipatory dementia and well-being. Am
J Alzheimer's Dis 1997;12:62–66. Output produced
using NCSS; used with permission.

The investigators were interested in the relationship between
life satisfaction and performance on the MAI. Life satisfaction
was measured as yes or no, and the MAI was measured on a scale from 0 = no
memory problems to 12 = negative perceptions of memory
and concern about developing dementia. When one variable is binary
and the other is numerical, it is possible to evaluate the relationship
using a special correlation, called the point–biserial
correlation. If the binary variable is coded as 0 and 1, the Pearson
correlation procedure can be used to find the point–biserial
correlation. Box 8–1A gives the results of the correlation
procedure using life satisfaction and MAI. The correlation is –0.37,
and the P value is 0.008633.

Box 8–1. Correlation
and t Test for Life Satisfaction and Anticipatory Dementia as Measured
by MAI.

A. Correlation Matrix (each cell shows the correlation, the P value, and n)

                         Anticipatory Dementia    Life Satisfaction
Anticipatory Dementia    1.000000                 –0.367601
                         0.000000                 0.008633
                         50.000000                50.000000
Life Satisfaction        –0.367601                1.000000
                         0.008633                 0.000000
                         50.000000                50.000000

B. t Test

            Count   Mean       Standard Deviation
LIFESAT=0   27      5.851852   2.931312
LIFESAT=1   23      3.652174   2.70704

Alternative Hypothesis   t Value   Probability Level   Decision (5%)   Power (α = 0.05)
Difference <> 0          2.7386    0.008633            Reject H_{0}    0.765296

C. Box Plot

Source: Data, used with
permission, from Hodgson LG, Cutler SJ: Anticipatory dementia and
well-being. Am J Alzheimer's Dis 1997;12:62–66.
Output produced using NCSS; used with permission.

Did you wonder why a t test was
not used to see if a difference existed in mean MAI for those who
were satisfied with their life versus those who were not satisfied?
If so, you are right on target because a t test
is another way to look at the research question. It simply depends
on whether interest focuses on a relationship or a difference. What
do you think the results of a t test
would show? The output from the NCSS t test
procedure is given in Box 8–1B. Of special interest is
the P value (0.008633); it is the same
as for the correlation. This illustrates an important principle:
The point–biserial correlation between a binary variable
and a numerical variable has the same level of significance as does
a t test in which the groups are defined
by the binary variable.
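This equivalence is easy to verify with the data in Table 8–4. The sketch below codes life satisfaction as 0/1, computes the Pearson correlation with the MAI score (which is the point–biserial correlation), and converts it to the corresponding t ratio.

```python
from math import sqrt

# MAI scores and life-satisfaction codes (0/1) for the 50 subjects in Table 8-4
mai = [6, 8, 0, 2, 4, 10, 3, 12, 8, 9, 8, 2, 6, 2, 2, 5, 3, 3, 0, 5,
       1, 2, 7, 5, 7, 9, 10, 3, 5, 3, 9, 4, 8, 4, 2, 9, 3, 8, 3, 4,
       7, 2, 0, 2, 5, 2, 8, 6, 2, 4]
satisf = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0,
          1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,
          0, 0, 1, 1, 0, 1, 0, 1, 0, 0]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Point-biserial correlation = Pearson correlation with a 0/1 variable
r = pearson(mai, satisf)                      # about -0.37
n = len(mai)
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)        # matches the two-group t test
```

The resulting correlation and t ratio reproduce the NCSS output in Box 8–1 (r = –0.37, t = –2.74 with the sign reflecting which group is coded 1), confirming that the two analyses are the same test.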

The point–biserial correlation is often used by test
developers to help evaluate the questions on the test. For example,
the National Board of Medical Examiners determines the point–biserial correlation
between whether examinees get an item correct (a binary variable)
and the examinee's score on the entire exam (a numerical
variable). A positive point–biserial indicates that examinees who
answer the question correctly tend to score high on the exam as
a whole, whereas examinees missing the question tend to score low
generally. Similarly, a negative point–biserial correlation indicates
that examinees who answer the question correctly tend to score low
on the exam—certainly not a desirable situation. It may
be that the question is tricky or poorly worded because the better
examinees are more likely to miss the question; you can see why
this statistic is useful for test developers.

Linear Regression

Remember that when the goal is to predict the value of one characteristic
from knowledge of another, the statistical method used is regression analysis. This method is
also called linear regression, simple linear regression, or least
squares regression. A brief review of the history of these terms
is interesting and sheds some light on the nature of regression
analysis.

The concepts of correlation and regression were developed by
Sir Francis Galton, a cousin of Charles Darwin, who studied both
mathematics and medicine in the mid-19th century (Walker, 1931).
Galton was interested in heredity and wanted to understand why a
population remains more or less the same over many generations with
the "average" offspring resembling their parents;
that is, why successive generations do not become more diverse.
By growing sweet peas and observing the average size of seeds from
parent plants of different sizes, he discovered regression, which
he termed the "tendency of the ideal mean filial type to
depart from the parental type, reverting to what may be roughly
and perhaps fairly described as the average ancestral type." This phenomenon
is more typically known as regression toward the mean. The term "correlation" was used
by Galton in his work on inheritance in terms of the "co-relation" between
such characteristics as heights of fathers and sons. The mathematician
Karl Pearson went on to work out the theory of correlation and regression,
and the correlation coefficient is named after him for this reason.

The term linear regression refers
to the fact that correlation and regression measure only a straight-line,
or linear, relationship between two variables. The term "simple
regression" means that only one explanatory (independent)
variable is used to predict an outcome. In multiple
regression, more than one independent variable is included
in the prediction equation.

Least squares regression describes the mathematical method for
obtaining the regression equation. The important thing to remember
is that when the term "regression" is used alone,
it generally means linear regression based on the least squares
method. The concept behind least squares regression is described
in the next section and its application is discussed in the section
after that.

Least Squares
Method

Several times previously in this text, we mentioned the linear
nature of the pattern of points in a scatterplot. For example, in
Figure 8–2, a straight line can be drawn through the points
representing the values of BMI and percent body fat to indicate
the direction of the relationship. The least
squares method is a way to determine the equation of the line
that provides a good fit to the points.

To illustrate the method, consider the straight line in Figure 8–6. Elementary geometry can be used to determine the equation
for any straight line. If the point where the line crosses, or intercepts, the Y-axis
is denoted by a and the slope of the line by b, then the equation is

Y = a + bX
Figure 8–6.

Geometric interpretation of a regression line.

The slope of the line measures the amount Y changes
each time X changes by 1 unit. If the
slope is positive, Y increases as X increases; if the slope is negative, Y decreases as X increases.
In the regression model, the slope in the population
is generally symbolized by β_{1}, called
the regression coefficient; and β_{0} denotes
the intercept of the regression line;
that is, β_{1} and β_{0} are
the population parameters in regression. In most applications, the
points do not fall exactly along a straight line. For this reason,
the regression model contains an error term,
e, which is the distance the actual values of Y depart from the regression line.
Putting all this together, the regression equation is given by

Y = β_{0} + β_{1}X + e

When the regression equation is used to describe the relationship
in the sample, it is often written as

Y' = a + bX

where Y' is the predicted value of Y for a given X.

For a given value of X, say X*, the predicted value Y*' is found by extending a
vertical line from X* to the regression line and then a horizontal line to the Y-axis,
as in Figure 8–7. The difference between the actual value
Y* and the predicted value, e* = Y* – Y*',
can be used to judge how well the line fits the data points. The
least squares method determines the line that minimizes the sum
of the squared vertical differences between the actual and predicted
values of the Y variable; ie, β_{0} and β_{1} are
determined so that Σ(Y – Y')^{2} is
minimized. The formulas for β_{0} and β_{1} are
found,^{a} and in terms of the sample estimates b and a, these
formulas are

b = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)^{2}

a = Ȳ – bX̄

^{a}The procedure for finding β_{0} and β_{1} involves
the use of differential calculus. The partial derivatives of the
sum of squared deviations are found with respect to β_{0} and β_{1};
the two resulting equations are set equal to zero to locate the
minimum values; and these two equations in two unknowns, β_{0} and β_{1},
are solved simultaneously to obtain the formulas for β_{0} and β_{1}.
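The least squares formulas translate directly into code. The sketch below fits a line to a small made-up data set; any data set lying exactly on a line should return that line's intercept and slope.

```python
def least_squares(x, y):
    """Slope b and intercept a minimizing the sum of squared
    vertical deviations sum((y - y_predicted)**2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Made-up data set for illustration: points lying exactly on y = 2x
a, b = least_squares([1, 2, 3, 4], [2, 4, 6, 8])   # a = 0.0, b = 2.0
```

With real data the points scatter around the line, and the fitted a and b are the values that make the squared vertical deviations as small as possible.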

Figure 8–7.

Least squares regression line.

Calculating
the Regression Equation

In the study described in Presenting Problem 4, the investigators
wanted to predict insulin sensitivity from BMI in a group of women.
Original observations were given in Chapter 7, Table 7–8. For
now we ignore the different groups of women and examine the entire
sample regardless of thyroid and weight levels.

Before calculating the regression equation for these data, let
us create a scatterplot and practice "guesstimating" the
value of the correlation coefficient from the plot (although it
is difficult to estimate the size of r accurately
when the sample size is small). Figure 8–8 is a scatterplot
with BMI score as the explanatory X variable
and insulin sensitivity as the response Y variable.
How large do you think the correlation is?

Figure 8–8.

Scatterplot of observations on body mass index and insulin
sensitivity. (Data, used with permission, from Gonzalo MA, Grant
C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose
tolerance, insulin secretion, insulin sensitivity and glucose effectiveness
in normal and overweight hyperthyroid women. Clin
Endocrinol (Oxf) 1996;45:689–697.
Output produced using NCSS; used with permission.)

If we knew the correlation between BMI and insulin sensitivity,
we could use it to calculate the regression equation. Because we
do not, we assume the needed terms (the means of both variables
and the sums of squares and cross-products) have already been calculated.
Substituting them in the least squares formulas gives b = –0.0433
and a = 1.5817.
In this example, the insulin sensitivity scores are said to be
regressed on BMI scores, and the regression equation is written
as Y' = 1.5817 – 0.0433X, where Y' is
the predicted insulin sensitivity score, and X is
the BMI.

Figure 8–9 illustrates the regression line drawn through
the observations. The regression equation has a positive intercept
of +1.58, so that theoretically a patient with zero BMI
would have an insulin sensitivity of 1.58, even though, in the present
example, a zero BMI is not possible. The slope of –0.043
indicates that each time a woman's BMI increases by 1,
her predicted insulin sensitivity decreases by approximately 0.043.
For example, as the BMI increases from 20 to 30, insulin sensitivity
decreases from about 0.73 to about 0.3. Whether the relationship
between BMI and insulin sensitivity is significant is discussed
in the next section.
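The fitted equation can be used directly for prediction. This sketch evaluates the regression equation from the text at two BMI values to reproduce the interpretation of the slope.

```python
def predicted_insulin_sensitivity(bmi):
    """Regression equation Y' = 1.5817 - 0.0433 * X from the text."""
    return 1.5817 - 0.0433 * bmi

# Each 1-unit increase in BMI lowers predicted sensitivity by about 0.043;
# a 10-unit increase (BMI 20 to 30) lowers it by about 0.43
low, high = predicted_insulin_sensitivity(20), predicted_insulin_sensitivity(30)
```

The predicted values at BMI 20 and 30 are roughly 0.7 and 0.3, matching the drop described in the text.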

Figure 8–9.

Regression of observations on body mass index and insulin
sensitivity. (Data, used with permission, from Gonzalo MA, Grant
C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose
tolerance, insulin secretion, insulin sensitivity and glucose effectiveness
in normal and overweight hyperthyroid women. Clin
Endocrinol (Oxf) 1996;45:689–697.
Output produced using NCSS; used with permission.)

Assumptions
& Inferences in Regression

In the previous section, we worked with a sample of observations
instead of the population of observations. Just as the sample mean X̄ is
an estimate of the population mean μ, the regression line determined from the
formulas for a and b in
the previous section is an estimate of the regression equation for
the underlying population.

As in Chapters 6 and 7, in which we used statistical tests to
determine how likely it was that the observed differences between
two means occurred by chance, in regression analysis we must perform
statistical tests to determine the likelihood of any observed relationship
between X and Y variables.
Again, the question can be approached in two ways: using hypothesis
tests or forming confidence intervals. Before discussing these approaches,
however, we briefly discuss the assumptions required in regression
analysis.

If we are to use a regression equation, the observations must
have certain properties. Thus, for each value of the X variable, the Y variable
is assumed to have a normal distribution, and the mean of the distribution
is assumed to be the predicted value, Y'.
In addition, no matter the value of the X variable,
the standard deviation of Y is assumed
to be the same. These assumptions are rather like imagining a large
number of individual normal distributions of the Y variable,
all of the same size, one for each value of X. The
assumption of this equal variation in the Y's
across the entire range of the X's
is called homogeneity, or homoscedasticity. It is analogous
to the assumption of equal variances (homogeneous variances) in
the t test for independent groups,
as discussed in Chapter 6.

The straight-line, or linear, assumption requires that the mean
values of Y corresponding to various
values of X fall on a straight line.
The values of Y are assumed to be independent
of one another. This assumption is not met when repeated measurements
are made on the same subjects; that is, a subject's measure
at one time is not independent from the measure of that same subject
at another time. Finally, as with other statistical procedures,
we assume the observations constitute a random sample from the population
of interest.

Regression is a robust procedure and may be used in many situations
in which the assumptions are not met, as long as the measurements
are fairly reliable and the correct regression model is used. (Other
regression models are discussed in Chapter 10.) Meeting the regression
assumptions generally causes fewer problems in experiments or clinical
trials than in observational studies because reliability of the
measurements tends to be greater in experimental studies. Special
procedures can be used when the assumptions are seriously violated,
however; and as in ANOVA, researchers should seek a statistician's
advice before using regression if questions arise about its applicability.

The Standard
Error of the Estimate

Regression lines, like other statistics, can vary. After all,
the regression equation computed for any one sample of observations
is only an estimate of the true population regression equation.
If other samples are chosen from the population and a regression
equation is calculated for each sample, these equations will vary
from one sample to another with respect to both their slopes and their
intercepts. An estimate of this variation is symbolized S_{Y·X} (read s of Y given X) and is called the standard error
of regression, or the standard error of
the estimate. It is based on the squared deviations of the
predicted Y's
from the actual Y's
and is found as follows:

S_{Y·X} = √[Σ(Y – Y')^{2} / (n – 2)]

The computation of this formula is quite tedious; and although
more user-friendly computational forms exist, we assume that you
will use a computer program to calculate the standard error of the estimate.
In testing both the slope and the intercept, a t test
can be used, and the standard error of the estimate is part of the
formula. It is also used in determining confidence limits. To present
these formulas and the logic involved in testing the slope and the
intercept, we illustrate the test of hypothesis for the intercept
and the calculation of a confidence interval for the slope, using
the BMI–insulin sensitivity regression equation.

Inference about
the Intercept

To test the hypothesis that the intercept departs significantly
from zero, we use the following procedure:

Step 1: H_{0}: β_{0} = 0
(The intercept is zero)

H_{1}: β_{0} ≠ 0
(The intercept is not zero)

Step 2: Because the null hypothesis
is a test of whether the intercept is zero, the t ratio
may be used if the assumptions are met. The t ratio
uses the standard error of the estimate to calculate the standard
error of the intercept (the denominator of the t ratio):

t = a / [S_{Y·X} √(1/n + X̄^{2}/Σ(X – X̄)^{2})]

Step 3: Let us use α equal
to 0.05.

Step 4: The degrees of freedom
are n – 2 = 33 – 2 = 31.
The value of the t distribution with
31 degrees of freedom that divides the area into the central 95% and
the combined upper and lower 5% is approximately 2.040
(from Table A–3). We therefore reject the null hypothesis
of a zero intercept if (the absolute value of) the observed value
of t is greater than 2.040.

Step 5: The calculation follows;
we used a spreadsheet (Microsoft Excel) to calculate S_{Y·X} = 0.256
and Σ(X – X̄)^{2} = 468.015. Substituting these values in the t ratio gives t = 5.30.

Step 6: The absolute value of the
observed t ratio is 5.30, which is
greater than 2.040. The null hypothesis of a zero intercept is therefore rejected.
We conclude that the evidence is sufficient to show that the intercept
is significantly different from zero for the regression of insulin
sensitivity on BMI.

As you know by now, it is also possible to form confidence limits
for the intercept using the observed value and adding or subtracting
the critical value from the t distribution
multiplied by the standard error of the intercept.

Inferences about
the Regression Coefficient

Instead of illustrating the hypothesis test for the population
regression coefficient, let us find a 95% confidence interval
for β_{1}. The interval is given by

b ± t_{n–2} × (standard error of b)

Because the interval excludes zero, we can be 95% confident
that the regression coefficient is not zero but that it is between –0.0674
and –0.0192 or between about –0.07 and –0.02.
Because the regression coefficient is significantly less than zero,
can the correlation coefficient be equal to zero (see Exercise 3)?
The relationship between b and r illustrated earlier, together with
Exercise 3, should convince you of the equivalence of the results obtained when
testing the significance of the correlation and the regression coefficient.
In fact, authors in the medical literature often perform a regression
analysis and then report the P values
to indicate a significant correlation coefficient.

The output from the SPSS regression program is given in Table
8–5. The program produces the value of t and
the associated P value, as well as
95% confidence limits. Do the results agree with those
we found earlier? To become familiar with using regression, we suggest
you replicate these results using the CD-ROM [available only with the book].
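The inferences in Table 8–5 can be reproduced from the rounded values it reports; small discrepancies from the text (for example, 5.29 here versus 5.294 in the table) come from rounding the coefficients and standard errors.

```python
# Rounded values reported in Table 8-5
a, se_a = 1.582, 0.299        # intercept and its standard error
b, se_b = -0.043, 0.012       # slope (regression coefficient) and its SE
t_crit = 2.040                # t with 31 degrees of freedom, alpha = 0.05

t_intercept = a / se_a                               # about 5.29
ci_slope = (b - t_crit * se_b, b + t_crit * se_b)    # about (-0.067, -0.019)
# The interval excludes zero, so the slope differs significantly from zero
```

Both results agree with the SPSS output: the intercept's t ratio is well above the critical value, and the slope's confidence interval lies entirely below zero.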

Table 8–5. Computer
Output of Regression of Insulin Sensitivity on Body Mass Index.

Coefficients^{a}

                     Unstandardized Coefficients   Standardized Coefficients                            95% Confidence Interval for B
Model 1              B         Std. Error          Beta                        t        Significance    Lower Bound    Upper Bound
1  (Constant)        1.582     0.299                                           5.294    0.000           0.972          2.191
   Body mass index   –0.043    0.012               –0.548                      –3.652   0.001           –0.067         –0.019

^{a}Dependent variable: insulin sensitivity.

Source: Data, used with permission,
from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo
JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity
and glucose effectiveness in normal and overweight hyperthyroid
women. Clin Endocrinol 1996;45:689–697. Output produced
using SPSS, a registered trademark of SPSS, Inc; used with permission.

Predicting with
the Regression Equation: Individual and Mean Values

One of the important reasons for obtaining a regression equation
is to predict future values for a group of subjects (or for individual
subjects). For example, a clinician may want to predict insulin sensitivity
from BMI for a group of women with newly diagnosed diabetes. Or
the clinician may wish to predict the sensitivity for a particular
woman. In either case, the variability associated with the regression
line must be reflected in the prediction. The 95% confidence
interval for a predicted mean Y in
a group of subjects is

Y' ± t(S_{Y·X}) √[1/n + (X – X̄)^{2}/Σ(X – X̄)^{2}]

The 95% confidence interval for predicting a single observation is

Y' ± t(S_{Y·X}) √[1 + 1/n + (X – X̄)^{2}/Σ(X – X̄)^{2}]
Comparing these two formulas, we see that the confidence interval
predicting a single observation is wider than the interval for the
mean of a group of individuals; 1 is added to the standard error term
for the individual case. This result makes sense, because for a
given value of X, the variation in
the scores of individuals is greater than that in the mean scores
of groups of individuals. Note also that the numerator of the third
term in the standard error is the squared deviation of X from its mean, X̄.
The size of the standard error therefore depends on how close the
observation is to the mean; the closer X is
to its mean, the more accurate is the prediction of Y. For values of X quite
far from the mean, the variability in predicting the Y score is considerable. You can appreciate
why it is difficult for economists and others who wish to predict
future events to be very accurate!
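These two interval formulas can be sketched numerically, assuming the quantities reported earlier (S_{Y·X} = 0.256, n = 33, mean BMI = 24.921, Σ(X – X̄)² = 468.015) and a fitted line of roughly Y' = 1.582 – 0.0433X, where the slope is an assumption taken as the midpoint of the reported confidence interval:

```python
import math

# Sketch of the prediction intervals for a mean and for an individual,
# using values reported in the text; the slope -0.0433 is an assumption
# (midpoint of the reported 95% CI for the regression coefficient).
s_yx, n, x_bar, ss_x = 0.256, 33, 24.921, 468.015
a, b, t_crit = 1.582, -0.0433, 2.040  # t for 31 degrees of freedom

def intervals(x):
    y_hat = a + b * x
    core = 1 / n + (x - x_bar) ** 2 / ss_x
    se_mean = s_yx * math.sqrt(core)      # SE for predicting a group mean
    se_ind = s_yx * math.sqrt(1 + core)   # SE for predicting an individual
    return (y_hat,
            (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean),
            (y_hat - t_crit * se_ind, y_hat + t_crit * se_ind))

# BMI = 18.1, the first row of Table 8-6
y_hat, ci_mean, ci_ind = intervals(18.1)
print(f"Predicted: {y_hat:.3f}")
print(f"Mean CI: {ci_mean[0]:.3f} to {ci_mean[1]:.3f}")
print(f"Individual CI: {ci_ind[0]:.3f} to {ci_ind[1]:.3f}")
```

The results reproduce the first row of Table 8–6 (predicted 0.798, mean interval 0.610 to 0.986, individual interval roughly 0.242 to 1.354), and the extra 1 under the square root is what makes the individual interval wider.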

Table 8–6 gives 95% confidence intervals associated
with predicted mean insulin sensitivity levels and predicted insulin
sensitivity levels for an individual corresponding to several different BMI
values (and for the mean BMI in this sample of 33 women). Several
insights about regression analysis can be gained by examining this
table. First, note the differences in magnitude between the standard
errors associated with the predicted mean insulin sensitivity and
those associated with individual insulin sensitivity levels: The
standard errors are much larger when we predict individual values
than when we predict the mean value. In fact, the standard error
for individuals is always larger than the standard error for means
because of the additional 1 in the formula. Also note that the standard
errors take on their smallest values when the observation of interest
is the mean (BMI of 24.921 in our example). As the observation departs
in either direction from the mean, the standard errors and confidence
intervals become increasingly larger, reflecting the squared difference
between the observation and the mean. If the confidence intervals
are plotted as confidence bands about
the regression line, they are closest to the line at the mean of X and curve away from it in both directions
on each side of X̄. Figure 8–10 shows the graph of the confidence bands.

Figure 8–10.

Regression of observations on body mass index and insulin
sensitivity with confidence bands (heavy lines for means, light
lines for individuals). (Data, used with permission, from Gonzalo MA,
Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et al:
Glucose tolerance, insulin secretion, insulin sensitivity and glucose
effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996;45:689–697. Output produced
using NCSS; used with permission.)

Table 8–6 illustrates another interesting feature of
the regression equation. When the mean of X is
used in the regression equation, the predicted Y' is
the mean of Y. The regression line
therefore goes through the mean of X and
the mean of Y.

Table 8–6. 95% Confidence
Intervals for Predicted Mean Insulin Sensitivity Levels and Predicted
Individual Insulin Sensitivity Levels.


                                           Predicting Means                    Predicting Individuals
BMI       Insulin Sensitivity   Predicted   SE^{a}   Confidence Interval      SE^{b}   Confidence Interval
18.100    0.970                 0.798       0.092    0.610 to 0.986           0.273    0.242 to 1.354
23.600    0.880                 0.560       0.047    0.463 to 0.656           0.261    0.028 to 1.092
24.000    0.660                 0.543       0.046    0.449 to 0.636           0.261    0.011 to 1.074
20.400    0.520                 0.698       0.070    0.556 to 0.841           0.266    0.156 to 1.241
21.500    0.380                 0.651       0.060    0.528 to 0.774           0.263    0.113 to 1.188
24.921    0.503                 0.503       0.044    0.413 to 0.593           0.260    –0.027 to 1.033

^{a}Standard error for means.

^{b}Standard error for individuals.

BMI = body mass index.

Now we can see why confidence bands about the regression line
are curved. The error in the intercept means that the true regression
line can be either above or below the line calculated for the sample
observations, although it maintains the same orientation (slope).
The error in measuring the slope therefore means that the true regression
line can rotate about the point (X̄, Ȳ)
to a certain degree. The combination of these two errors results
in the concave confidence bands illustrated
in Figure 8–10. Sometimes journal articles have regression
lines with confidence bands that are parallel rather than curved.
These confidence bands are incorrect, although they may correspond
to standard errors or to confidence intervals at their narrowest
distance from the regression line.

Comparing Two
Regression Lines

Sometimes investigators wish to compare two regression lines
to see whether they are the same. For example, the investigators
in Presenting Problem 4 were particularly interested in the relationship
between BMI and insulin sensitivity in women who were hyperthyroid
versus those whose thyroid levels were normal. The investigators
determined separate regression lines for these two groups of women
and reported them in Figure 3 of their article. We reproduced their
regression lines in Figure 8–11.

Figure 8–11.

Separate regression lines for hyperthyroid (squares)
and control (circles) women. (Data, used with permission, from Gonzalo
MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo JL, et
al: Glucose tolerance, insulin secretion, insulin sensitivity and
glucose effectiveness in normal and overweight hyperthyroid women. Clin Endocrinol (Oxf) 1996;45:689–697. Output produced
using NCSS; used with permission.)

As you might guess, researchers are often interested in comparing
regression lines to learn whether the relationships are the same
in different groups of subjects. When we compare two regression
lines, four situations can occur, as illustrated in Figure 8–12.
In Figure 8–12A, the slopes of the regression lines are
the same, but the intercepts differ. This situation occurs, for instance,
in blood pressure measurements regressed on age in men and women;
that is, the relationship between blood pressure and age is similar
for men and women (equal slopes), but men tend to have higher blood
pressure levels at all ages than women (higher intercept for men).

Figure 8–12.

Illustration of ways regression lines can differ. A:
Equal slopes and different intercepts. B: Equal intercepts and different
slopes. C: Different slopes and different intercepts. D: Equal slopes
and equal intercepts.

In Figure 8–12B, the intercepts are equal, but the slopes
differ. This pattern may describe, say, the regression of platelet
count on number of days following bone marrow transplantation in
two groups of patients: those for whom adjuvant therapy results
in remission of the underlying disease and those for whom the disease
remains active. In other words, prior to and immediately after transplantation,
the platelet count is similar for both groups (equal intercepts),
but at some time after transplantation, the platelet count remains
steady for patients in remission and begins to decrease for patients
not in remission (more negative slope for patients with active disease).

In Figure 8–12C, both the intercepts and the slopes
of the regression lines differ. The investigators in Presenting
Problem 4 reported a steeper decline in the slope of insulin sensitivity
as the BMI increased in the hyperthyroid women than in the control
group.^{b} Although they did not specifically address any
difference in intercepts, the relationship between BMI and insulin
sensitivity resembles the situation in Figure 8–12C.

^{b}Gonzalo and colleagues presented regression equations
after adjusting for age. We briefly discuss this procedure in the
next section under Multiple Regression.

If no differences exist in the relationships between the predictor
and outcome variables, the regression lines are similar to Figure 8–12D, in which the lines are coincident: Both intercepts
and slopes are equal. This situation occurs in many situations in
medicine and is considered to be the expected pattern (the null
hypothesis) until it is shown not to apply by testing hypotheses
or forming confidence limits for the intercept or the slope (or
both).

From the four situations illustrated in Figure 8–12,
we can see that three statistical questions need to be asked:

1. Are the slopes equal?

2. Are the intercepts equal?

3. Are both the slopes and the intercepts equal?

Statistical tests based on the t distribution
can be used to answer the first two questions; these tests are illustrated
in Kleinbaum and associates (1997). The authors point out, however,
that the preferred approach is to use regression models for more
than one independent variable—a procedure called multiple
regression—to answer these questions. The procedure consists
of pooling observations from both samples of subjects (eg, observations
on both hyperthyroid and control women) and computing one regression
line for the combined data. Other regression coefficients indicate
whether it matters to which group the observations belong. The simplest
model is then selected. Because the regression lines were statistically
different, Gonzalo and colleagues reported two separate regression equations.

Use of Correlation
& Regression

Some of the characteristics of correlation and regression have
been noted throughout the discussions in this chapter, and we recapitulate
them here as well as mention other features. An important point
to reemphasize is that correlation and regression describe only linear relationships. If correlation
coefficients or regression equations are calculated blindly, without
examining plots of the data, investigators can miss very strong,
but nonlinear relationships.

Analysis of
Residuals

A procedure useful in evaluating the fit of the regression equation
is the analysis of residuals (Pedhazur, 1997). We calculated residuals when we found the difference
between the actual value of Y and the
predicted value of Y',
or Y – Y',
although we did not use the term. A residual is the part of Y that is not predicted by X (the part left over, or the residual).
The residual values on the Y-axis are
plotted against the X values on the X-axis. The mean of the residuals is
zero, and, because the slope has been subtracted in the process of
calculating the residuals, the correlation between them and the X values should be zero.

Stated another way, if the regression model provides a good fit
to the data, as in Figure 8–13A, the values of the residuals
are not related to the values of X. A
plot of the residuals and the X values
in this situation should resemble a scatter of points corresponding
to Figure 8–13B in which no correlation exists between
the residuals and the values of X. If,
in contrast, a curvilinear relationship occurs between Y and X, such
as in Figure 8–13C, the residuals are negative for both
small values and large values of X, because
the corresponding values of Y fall
below a regression line drawn through the data. They are positive,
however, for midsized values of X because
the corresponding values of Y fall
above the regression line. In this case, instead of obtaining a
random scatter, we get a plot like the curve in Figure 8–13D,
with the values of the residuals being related to the values of X. Other patterns can be used by statisticians
to help diagnose problems, such as a lack of equal variances or
various types of nonlinearity.
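A minimal sketch of a residual analysis on simulated data (not the study data; all numbers below are invented) shows the two properties just stated: after a least-squares fit, the residuals average zero and are uncorrelated with X.

```python
import random

# Simulate a linear relationship with noise, fit Y = a + bX by least
# squares, then check the residuals.
random.seed(1)
x = [random.uniform(18, 32) for _ in range(33)]
y = [1.6 - 0.04 * xi + random.gauss(0, 0.25) for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_res = sum(residuals) / n

def corr(u, v):
    # Pearson correlation between two lists
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = (sum((ui - mu) ** 2 for ui in u)
           * sum((vi - mv) ** 2 for vi in v)) ** 0.5
    return num / den

print(f"mean residual = {mean_res:.10f}")                  # essentially zero
print(f"corr(residuals, X) = {corr(residuals, x):.10f}")   # essentially zero
```

If the true relationship were curvilinear instead, a plot of these residuals against X would show the systematic pattern of Figure 8–13D rather than random scatter.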

Figure 8–13.

Illustration of analysis of residuals. A: Linear relationship
between X and Y. B: Residuals versus values of X for relation in
part A. C: Curvilinear relationship between X and Y. D: Residuals
versus values of X for relation in part C.

Use the CD-ROM [available only with the book] and the regression program to produce a graph
of residuals for the data in Presenting Problem 4. Which of the
four situations in Figure 8–13 is most likely? See Exercise
8.

Dealing with
Nonlinear Observations

Several alternative actions can be taken if serious problems
arise with nonlinearity of data. As we discussed previously, a transformation may make the relationship
linear, and regular regression methods can then be used on the transformed
data. Another possibility, especially for a curve, is to fit a straight
line to one part of the curve and a second straight line to another
part of the curve, a procedure called piecewise
linear regression. In this situation, one regression equation
is used with all values of X less than
a given value, and the second equation is used with all values of X greater than the given value. A third
strategy, also useful for curves, is to perform polynomial regression;
this technique is discussed in Chapter 10. Finally, more complex
approaches called nonlinear regression may be used (Snedecor and
Cochran, 1989).

Regression Toward
the Mean

The phenomenon called regression toward
the mean often occurs in applied research and may go unrecognized.
A good illustration of regression toward the mean occurred in the
MRFIT study (Multiple Risk Factor Intervention Trial Research Group;
Gotto, 1982), which was designed to evaluate the effect of diet
and exercise on blood pressure in men with mild hypertension. To
be eligible to participate in the study, men had to have a diastolic
blood pressure of at least 90 mm Hg. The eligible subjects
were then assigned to either the treatment arm of the study, consisting
of programs to encourage appropriate diet and exercise, or the control
arm, consisting of typical care. This study has been called a landmark
trial and was reprinted in 1997 in the Journal
of the American Medical Association. See Exercise 13.

To illustrate the concept of regression toward the mean, we
consider the hypothetical data in Table 8–7 for diastolic
blood pressure in 12 men. If these men were being screened for the
MRFIT study, only subjects 7 through 12 would be accepted; subjects
1 through 6 would not be eligible because their baseline diastolic
pressure is < 90 mm Hg. Suppose all subjects
had another blood pressure measurement some time later. Because
a person's blood pressure varies considerably from one
reading to another, about half the men can be expected to have higher
blood pressures and about half to have lower blood pressures, owing
to random variation. Regression toward the mean tells us that those
men who had lower pressures on the first reading are more likely
to have higher pressures on the second reading. Similarly, men who
had a diastolic blood pressure of 90 mm Hg or higher on the first
reading are more likely to have lower pressures on the second reading.
If the entire sample of men is remeasured, the increases and decreases
tend to cancel each other. If, however, only a subset of the subjects
is examined again, for example, the men with initial diastolic pressures > 90,
the blood pressures will appear to have dropped, when in fact they
have not.
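The selection effect can be sketched with a small simulation (all numbers invented, not MRFIT data): each subject has a stable "true" pressure, each reading adds independent measurement noise, and only those whose first reading is at least 90 mm Hg are followed up.

```python
import random

# Simulate regression toward the mean: selecting on a noisy first
# reading makes the selected group's second reading look lower on
# average, even though nothing has changed.
random.seed(42)
true_bp = [random.gauss(88, 6) for _ in range(10_000)]
first = [t + random.gauss(0, 5) for t in true_bp]    # baseline reading
second = [t + random.gauss(0, 5) for t in true_bp]   # repeat reading

selected = [(f, s) for f, s in zip(first, second) if f >= 90]
mean_first = sum(f for f, _ in selected) / len(selected)
mean_second = sum(s for _, s in selected) / len(selected)
print(f"selected on first reading >= 90: "
      f"mean first = {mean_first:.1f}, mean second = {mean_second:.1f}")
```

The apparent "drop" in the second reading is exactly the artifact a control group guards against.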

Table 8–7. Hypothetical
Data on Diastolic Blood Pressure to Illustrate Regression Toward
the Mean.


Subject    Baseline    Repeat
1          78          80
2          80          81
3          82          82
4          84          86
5          86          85
6          88          90
7          90          88
8          92          91
9          94          95
10         96          95
11         98          97
12         100         98

Regression toward the mean can result in a treatment or procedure
appearing to be of value when it has had no actual effect; the use
of a control group helps to guard against this effect. The investigators
in the MRFIT study were aware of the problem of regression toward
the mean and discussed precautions they took to reduce its effect.

Common Errors
in Regression

One error in regression analysis occurs when multiple observations
on the same subject are treated as though they were independent.
For example, consider ten patients who have their weight and skinfold
measurements recorded prior to beginning a low-calorie diet. We
may reasonably expect a moderately positive relationship between
weight and skinfold thickness. Now suppose that the same ten patients
are weighed and measured again after 6 weeks on the diet. If all 20
pairs of weight and skinfold measurements are treated as though
they were independent, several problems occur. First, the sample
size will appear to be 20 instead of 10, and we are more likely
to conclude significance. Second, because the relationship between
weight and skinfold thickness in the same person is somewhat stable
across minor shifts in weight, using both before and after diet
observations has the same effect as using duplicate measures, and
this results in a correlation larger than it should be.

The magnitude of the correlation can also be erroneously increased
by combining two different groups. For example, consider the relationship
between height and weight. Suppose the heights and weights of ten
men and ten women are recorded, and the correlation between height
and weight is calculated for the combined samples. Figure 8–14
illustrates how the scatterplot might look and indicates the problem
that results from combining men and women in one sample. The relationship
between height and weight appears to be more significant in the
combined sample than it is when measured in men and women separately.
Much of the apparent significance results because men tend both
to weigh more and to be taller than women. Inappropriate conclusions
may result from mixing two different populations—a rather
common error to watch for in the medical literature.
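A sketch of this error with simulated data (all numbers invented): within each group, height and weight are generated independently, so the true within-group correlation is zero, yet pooling the groups manufactures a strong correlation because men are taller and heavier on average.

```python
import random

random.seed(0)

def corr(u, v):
    # Pearson correlation between two lists
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = (sum((ui - mu) ** 2 for ui in u)
           * sum((vi - mv) ** 2 for vi in v)) ** 0.5
    return num / den

# Heights (cm) and weights (kg) drawn independently within each group
men_h = [random.gauss(178, 4) for _ in range(200)]
men_w = [random.gauss(80, 6) for _ in range(200)]
women_h = [random.gauss(162, 4) for _ in range(200)]
women_w = [random.gauss(62, 6) for _ in range(200)]

r_men = corr(men_h, men_w)
r_women = corr(women_h, women_w)
r_all = corr(men_h + women_h, men_w + women_w)
print(f"men: {r_men:.2f}  women: {r_women:.2f}  combined: {r_all:.2f}")
```

The combined correlation is large even though neither group shows any real relationship, which is the pattern of Figure 8–14.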

Figure 8–14.

Hypothetical data illustrating spurious correlation.

Comparing Correlation
& Regression

Correlation and regression have some similarities and some differences.
First, correlation is scale-independent, but regression is not;
that is, the correlation between two characteristics, such as height
and weight, is the same whether height is measured in centimeters
or inches and weight in kilograms or pounds. The regression equation
predicting weight from height, however, depends on which scales
are being used; that is, predicting weight measured in kilograms
from height measured in centimeters gives different values for a and b than
if predicting weight in pounds from height in inches.

An important consequence of scale independence in correlation
is that the correlation between X and Y is the same as the correlation between Y' and Y. They are equal because the regression
equation itself, Y' = a + bX, is a simple rescaling of the X variable; that is, each value of X is multiplied by a constant value b and then the constant a is added. The fact that the correlation
between the original variables X and Y is equal to the correlation between Y and Y' provides
a useful alternative for testing the significance of the regression,
as we will see in Chapter 10. Finally, the slope of the regression
line has the same sign (+ or –) as the correlation
coefficient (see Exercise 10). If the correlation is zero, the regression
line is horizontal with a slope of zero. Thus, the formulas for
the correlation coefficient and the regression coefficient are closely related.
If r has already been calculated, it
can be multiplied by the ratio of the standard deviation of Y to the standard deviation of X, SD_{Y}/SD_{X} to
obtain b (see Exercise 9). Thus,

b = r (SD_{Y}/SD_{X})

Similarly, if the regression coefficient is known, r can be found by

r = b (SD_{X}/SD_{Y})
Multiple Regression

Multiple regression analysis is a straightforward generalization
of simple regression for applications in which two or more independent
(explanatory) variables are used to predict an outcome. For example,
in the study described in Presenting Problem 4, the investigators
wanted to predict a woman's insulin sensitivity level based
on her BMI. They also wanted to control for the age of the woman,
however. The results from two analyses are given in Table 8–8.
First, regression was done using the BMI to predict insulin sensitivity
among hyperthyroid women; using the coefficients in Table 8–8, the resulting equation was

Insulin sensitivity = 2.336 – 0.077(BMI)

Table 8–8. Regression
Equations for Hyperthyroid Women Using BMI versus BMI and Age As
Predictor Variables.


Regression Equation Section (BMI as the only predictor)

Independent Variable    Regression Coefficient    Standard Error    t Value (H_{0}: B=0)    Probability Level    Decision (5%)
Intercept               2.336                     0.462             5.054                   0.0003               Reject H_{0}
BMI                     –0.077                    1.807E–02         –4.248                  0.0011               Reject H_{0}
R^{2}                   0.601

Regression Equation Section (BMI and Age as predictors)

Independent Variable    Regression Coefficient    Standard Error    t Value (H_{0}: B=0)    Probability Level    Decision (5%)
Intercept               2.2905                    0.461             4.973                   0.0004               Reject H_{0}
Age                     –4.463E–03                4.103E–03         –1.088                  0.3000               Accept H_{0}
BMI                     –6.782E–02                1.972E–02         –3.439                  0.0055               Reject H_{0}
R^{2}                   0.639

BMI = body mass index.

Source: Data, used with permission,
from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI, Herrera-Pombo
JL, et al: Glucose tolerance, insulin secretion, insulin sensitivity
and glucose effectiveness in normal and overweight hyperthyroid
women. Clin Endocrinol 1996;45:689–697. Output produced
using NCSS; used with permission.

Next, the regression was repeated using both BMI and age as independent
variables. Using the coefficients in Table 8–8, the results were

Insulin sensitivity = 2.2905 – 0.00446(Age) – 0.0678(BMI)

As you can see, the addition of the age variable has relatively
little effect; in fact, the P value
for age is 0.30, indicating that age is not significantly associated
with insulin sensitivity in this group of hyperthyroid women.

As an additional point, note that R^{2} (called R-squared) is 0.601 for the first regression
equation in Table 8–8. R^{2} is
interpreted in the same manner as the coefficient of determination, r^{2}, discussed in the section
titled, "Interpreting the Size of r." This
topic, along with multiple regression and other statistical methods
based on regression, is discussed in detail in Chapter 10.
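A minimal sketch of a two-predictor least-squares fit on simulated data (not the study data; all coefficients and noise levels below are invented for illustration):

```python
import numpy as np

# Multiple regression by least squares: predict an outcome from two
# independent variables (a BMI-like and an age-like predictor).
rng = np.random.default_rng(7)
n = 33
bmi = rng.uniform(18, 32, n)
age = rng.uniform(25, 60, n)
y = 2.3 - 0.07 * bmi - 0.004 * age + rng.normal(0, 0.15, n)

X = np.column_stack([np.ones(n), bmi, age])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print("intercept, b_bmi, b_age =", np.round(coef, 3))
print("R-squared =", round(r2, 3))
```

R² here plays the same role as r² in simple regression: the proportion of variation in the outcome accounted for by the set of predictors.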

Sample Sizes
for Correlation & Regression

As with other statistical procedures, it is important to have
an adequate number of subjects in any study that involves correlation
or regression. Complex formulas are required to estimate sample sizes
for these procedures, but fortunately we can use statistical power
programs to do the calculations.

Suppose that Jackson and colleagues (2002) wanted to know what
sample size would be necessary to produce a confidence interval
for the correlation of BMI and percent body fat that would be within ± 0.10
from an expected correlation coefficient of 0.75. In other words,
how many subjects are needed for a 95% confidence interval
from 0.65 to 0.85, assuming they observe a correlation of 0.75 (recall
they actually found a correlation of 0.73)? We used the nQuery Advisor
program to illustrate the sample size needed in this situation;
the output is given in Figure 8–15. A sample of 102 patients
would be necessary. nQuery produces only a one-sided interval, so
we used 97.5% to obtain a 95% two-sided interval.
We could have used the upper limit of 0.85 instead of the lower limit
0.65 (line 3 of the nQuery table). Do you think the sample size
would be the same? Try it and see.
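This sample size can be sketched with the Fisher z transformation described earlier; 1.96 is the two-sided 95% critical value (the text's 97.5% one-sided setting), and the binding constraint is the lower limit of 0.65:

```python
import math

# Sketch: find n so that the 95% CI around an observed r = 0.75
# reaches down to 0.65, using the Fisher z transformation (the CI has
# half-width 1.96 / sqrt(n - 3) on the z scale).
r, limit, z_crit = 0.75, 0.65, 1.96

dist = math.atanh(r) - math.atanh(limit)   # distance on the z scale
n = math.ceil((z_crit / dist) ** 2 + 3)
print(f"required sample size: {n}")
```

The result, 102 subjects, agrees with the nQuery output described in the text; using the upper limit of 0.85 instead gives a smaller n, because 0.85 is farther from 0.75 on the z scale.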

Figure 8–15.

Illustration of sample size program nQuery Advisor.
(Data, used with permission, from Jackson A, Stanforth PR, Gagnon
J, Rankinen T, Leon AS, Rao DC, et al: The effect of sex, age and
race on estimating percentage body fat from body mass index: The
Heritage Family Study. Int J Obes Relat
Metab Disord 2002;26:789–796.
Figure produced using nQuery Advisor; used with permission.)

To illustrate the power analysis for regression, consider the
regression equation to predict insulin sensitivity from BMI (Gonzalo
et al, 1996). Recall that we found that a 95% confidence
interval for the regression coefficient was between –0.0674
and –0.0192 in the entire sample of 33 women. Suppose Gonzalo
and colleagues wanted to know how many women would be needed for the
regression. The power program PASS finds the sample size by estimating
the number needed to obtain a given value for R^{2} (or r^{2} when only one independent
variable is used). We assume they want the correlation between the actual
insulin sensitivity and the predicted sensitivity to be at least
0.50, producing an r^{2} of 0.25.
The setup and output from the PASS program are given in Figures
8–16 and 8–17. From Figure 8–17, we see
that a sample size of about 26 is needed in each group for which
a regression equation is to be determined.

Figure 8–16.

Illustration of setup for using the PASS sample size
program for multiple regression using the data on insulin sensitivity.
(Data, used with permission, from Gonzalo MA, Grant C, Moreno I, Garcia
FJ, Suarez AI, Herrera-Pombo JL, et al: Glucose tolerance, insulin
secretion, insulin sensitivity and glucose effectiveness in normal
and overweight hyperthyroid women. Clin
Endocrinol (Oxf) 1996;45:689–697.
Output produced using PASS; used with permission.)

Figure 8–17.

Illustration of the PASS sample size program for multiple
regression using the data on insulin sensitivity. (Data, used with
permission, from Gonzalo MA, Grant C, Moreno I, Garcia FJ, Suarez AI,
Herrera-Pombo JL, et al: Glucose tolerance, insulin secretion, insulin
sensitivity and glucose effectiveness in normal and overweight hyperthyroid
women. Clin Endocrinol (Oxf) 1996;45:689–697. Output produced
using PASS; used with permission.)

Summary

Four presenting problems were used in this chapter to illustrate
the application of correlation and regression in medical studies.
The findings from the study described in Presenting Problem 1 demonstrate
the relationship between BMI and percent body fat, a correlation
equal to 0.73. The authors reported that the relationship was nonlinear,
which can be seen in Figure 8–2. Several factors other
than BMI affected the relationship. The authors concluded that BMI
is only a moderate predictor of percent body fat, and it is important
to consider age and gender when defining the prevalence of obesity
with BMI for populations of American men and women.

In Presenting Problem 2, Nesselroad and colleagues (1996) evaluated
three automated finger blood pressure devices marketed as being
accurate devices for monitoring blood pressure. We examined the
relationship among these devices and the standard method using a
blood pressure cuff. The observed correlations were quite low, ranging
from 0.32 to 0.45. We compared these two correlation coefficients
and concluded that no statistical difference exists between them.
Nesselroad also reported that the automated finger device measurements
were outside of the ±4 mm Hg range obtained with
the standard blood pressure cuff 75–81% of the
time. These researchers appropriately concluded that people who
want to monitor their blood pressure cannot trust these devices
to be accurate.

Hodgson and Cutler (1997) reported results from their study of
people's fears that normal age-associated memory change
is a precursor of dementia. We examined the relationship between memory
scores and whether people reported they were satisfied with their
life. We demonstrated that the conclusions from computing the biserial
correlation (the correlation between a numerical and a binary measure)
and performing a t test are the same.
Other results showed that the sense of well-being in these individuals
is related to anticipatory dementia. Those with higher levels of
anticipatory dementia are more depressed, have more psychiatric
symptoms, have lower life satisfaction, and describe their health
as being poorer than individuals not concerned about memory loss
and Alzheimer's disease. Furthermore, women in the study
demonstrated a relationship between anticipatory dementia and well-being that
was not observed in men.

Data from Gonzalo and colleagues (1996) was used to illustrate
regression, specifically the relationship between insulin sensitivity
and BMI for hyperthyroid and control women. We found separate regression
lines for hyperthyroid and for control women and observed that the
relationships between insulin sensitivity and BMI are different
in these two groups of women. The investigators also reported that
overall glucose tolerance was not affected by hyperthyroidism in
normal weight women.

The flowcharts for Appendix C summarize the methods for measuring
an association between two characteristics measured on the same
subjects. Flowchart C–4 indicates how the methods depend
on the scale of measurement for the variables, and flowchart C–5
shows applicable methods for testing differences in correlations
and in regression lines.

Exercises

[Note to AccessLange users: data and software are not available on the website.]

1. The extent to which stool energy losses are normalized in
cystic fibrosis patients receiving pancreatic enzyme replacement
therapy prompted a study by Murphy and colleagues (1991). They determined
the amount of energy within the stools of 20 healthy children and
20 patients with cystic fibrosis who were comparatively asymptomatic
while taking capsules of pancreatin, an enzyme replacement. Weighed
food intake was recorded daily for 7 days for all study participants. Over
the final 3 days of the study, all stools were collected. Measures
of lipid content, total nitrogen content, bacterial content, and
total energy content of the stools were recorded. Data for the cystic
fibrosis children are given in Table 8–9 and on the CD-ROM
in a folder entitled "Murphy."

Table 8–9. Observations
on Stool Lipid and Stool Energy Losses in Children with Cystic Fibrosis.


Subject    Fecal Lipid (g/day)    Fecal Energy (MJ/day)
1          10.0                   2.1
2          11.0                   1.1
3          9.9                    1.1
4          9.8                    0.9
5          15.5                   0.7
6          5.0                    0.4
7          10.7                   1.0
8          13.0                   1.5
9          13.8                   1.2
10         16.7                   1.4
11         3.2                    1.0
12         4.0                    0.5
13         6.0                    0.9
14         8.9                    0.8
15         9.1                    0.6
16         4.1                    0.5
17         17.0                   1.2
18         22.2                   1.1
19         2.9                    0.9
20         5.0                    1.0

Source: Modified and reproduced,
with permission, from the table and Figure 3 in Murphy JL, Wooton SA,
Bond SA, Jackson AA: Energy content of stools in normal healthy
controls and patients with cystic fibrosis. Arch Dis Child 1991;66:495–500.

a. Find and interpret the correlation between stool lipid
and stool energy.

b. Figure 8–18 is from the study by Murphy. What
is the authors' purpose in displaying this graph? What
can be interpreted about the relationship between fecal lipid and
fecal energy for control patients? How does that relationship compare
with the relationship in patients with cystic fibrosis?

2. a. Perform a chi-square test of the significance of the
relationship between TRH and placebo and the subsequent development
of respiratory distress syndrome using the data in Chapter 3, Table
3–21.

b. Determine 95% confidence limits for the relative
risk of 2.3 for the risk of death within 28 days of delivery among
infants not at risk using the data in Table 3–20. What
is your conclusion?

Figure 8–18.

Stool lipid versus stool energy losses for the control
subjects and cystic fibrosis patients. (Reproduced, with permission,
from Figure 3 in Murphy JL, Wootton SA, Bond SA, Jackson AA: Energy
content of stools in normal healthy controls and patients with cystic
fibrosis. Arch Dis Child 1991;66:495–500.)

3. Calculate the correlation between BMI and insulin sensitivity
for the entire sample of 33 women, using the results in the section
titled, "Calculating the Regression Equation," for
b. The standard deviation of BMI is 3.82 and of insulin sensitivity
is 0.030.
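The conversion in this exercise rests on the identity r = b(SD of X / SD of Y). A minimal sketch, with a hypothetical slope b standing in for the value from the "Calculating the Regression Equation" section:

```python
# Recover the correlation from a regression slope and the two standard
# deviations, using r = b * (SD_X / SD_Y).
sd_bmi = 3.82        # SD of BMI, given in the exercise
sd_insulin = 0.030   # SD of insulin sensitivity, given in the exercise
b = -0.004           # hypothetical slope; substitute b from the text

r = b * sd_bmi / sd_insulin
print(f"r = {r:.3f}")
```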

4. Goldsmith and colleagues (1985) examined 35 patients with
hemophilia to determine whether a relationship exists between impaired
cell-mediated immunity and the amount of factor concentrate used.
In one of their studies, the ratio of OKT4 (helper T cells) to OKT8
(suppressor/cytotoxic T cells) was formed, and the logarithm
of this ratio was regressed on the logarithm of lifetime concentrate
use (Figure 8–19).

a. Why is the logarithm scale used for both variables?

b. Interpret the correlation.

c. What do the confidence bands mean?

Figure 8–19.

Regression of logarithm of OKT4:OKT8 on logarithm of
factor concentrate use. (Reproduced, with permission, from Goldsmith
JM, Kalish SB, Green D, Chmiel JS, Wallemark CB, Phair JP: Sequential
clinical and immunologic abnormalities in hemophiliacs. Arch Intern Med 1985;145:431–434.)

5. Helmrich and coworkers (1987) conducted a study to assess
the risk of deep vein thrombosis and pulmonary embolism in relation
to the use of oral contraceptives. They were especially interested
in the risk associated with low dosage (<50 μg
estrogen) and confined their study to women under the age of 50
years. They administered standard questionnaires to women admitted
to the hospital for deep vein thrombosis or pulmonary embolism as
well as to a control set of women admitted for trauma and upper
respiratory infections to determine their history and current use
of oral contraceptives. Twenty of the 61 cases and 121 of the 1278
controls had used oral contraceptives in the previous month.

a. What research design was used in this study?

b. Find 95% confidence limits for the odds ratio
for these data.

c. The authors reported an age-adjusted odds ratio of 8.1
with 95% confidence limits of 3.7 and 18. Interpret these
results.
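For part b, the crude odds ratio and its confidence limits follow from the 2 × 2 counts given in the problem; a sketch using the log-odds (Woolf) method for the standard error:

```python
# 95% CI for the odds ratio from the Helmrich data, log-odds (Woolf) method.
import math

a, b = 20, 41      # cases: used oral contraceptives in past month / did not
c, d = 121, 1157   # controls: used / did not (1278 - 121 = 1157)

or_hat = (a * d) / (b * c)
se_ln = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se_ln)
hi = math.exp(math.log(or_hat) + 1.96 * se_ln)
print(f"OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Comparing this crude interval with the age-adjusted interval reported by the authors (part c) shows the effect of adjustment.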

6.
Presenting Problem 2 in Chapter 3, by Hébert and colleagues
(1997), measured disability and functional changes in 655 residents
of a community in Quebec, Canada. The Functional Autonomy Measurement
System (SMAF), a 29-item rating scale measuring functional disability
in five areas, was a major instrument used in the study. We used
observations on mental ability for women 85 years or older at baseline
and 2 years later to illustrate the correlation coefficient in Chapter 3
and found it to be 0.58. Use the data on the CD-ROM and select
or filter those subjects with sex = 0 and age ≥ 85;
51 subjects should remain for the analysis.

a. Form a 95% confidence interval for this correlation.

b. Calculate the sample size needed to produce a confidence
interval for the correlation of the mental ability scores at times
1 and 3 that would be within ±0.10 from the observed
correlation coefficient. In other words, how many subjects are needed
for a 95% confidence interval from 0.48 to 0.68 around
the correlation of 0.58 found in their study?
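Both parts use the Fisher z transformation described in the chapter; a sketch that forms the interval for part a and then searches for the sample size in part b:

```python
# Part a: 95% CI for r = 0.58 with n = 51, via the Fisher z transformation.
# Part b: smallest n whose 95% CI stays within (0.48, 0.68).
import math

r, n = 0.58, 51
z = math.atanh(r)                 # Fisher z transformation of r
se = 1 / math.sqrt(n - 3)         # standard error on the z scale
lo = math.tanh(z - 1.96 * se)     # transform limits back to the r scale
hi = math.tanh(z + 1.96 * se)
print(f"95% CI for the correlation: ({lo:.2f}, {hi:.2f})")

target_lo, target_hi = 0.48, 0.68
n_needed = next(m for m in range(4, 10000)
                if math.tanh(z - 1.96 / math.sqrt(m - 3)) >= target_lo
                and math.tanh(z + 1.96 / math.sqrt(m - 3)) <= target_hi)
print(f"n needed: {n_needed}")
```

Note that the interval is not symmetric about 0.58 on the r scale, which is why the search checks both limits.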

7. The graphs in Figure 8–20 were published in the study
by Einarsson and associates (1985).

a. Which graph exhibits the strongest relationship with
age?

b. Which variable would be best predicted from a patient's
age?

c. Do the relationships between the variables and age appear
to be the same for men and women; that is, is it appropriate to
combine the observations for men and women in the same figure?

Figure 8–20.

Scatterplots and regression lines for relation between
age and hepatic secretion of cholesterol, total bile acid synthesis,
and size of cholic acid pool for women (circles) and men (squares). (Reproduced,
with permission, from Einarsson K, Nilsell K, Leijd B, Angelin B:
Influence of age on secretion of cholesterol and synthesis of bile
acids by the liver. N Engl J Med 1985;313:277–282.)

8. Use the CD-ROM regression program to produce a graph of residuals
for the data from Gonzalo and coworkers (1996). Which of the four
situations in Figure 8–13 is most likely?

9. Explain why the mean of the predicted values, Y′,
is equal to the mean of the observed values, Ȳ.
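The identity in Exercise 9 can be demonstrated numerically: the least squares line passes through the point of means, so the fitted values average to the mean of Y. A sketch with illustrative data:

```python
# Fit a least squares line, then verify mean(Y') equals mean(Y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # illustrative data
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)      # least squares slope
a = my - b * mx                          # intercept: line passes through (mx, my)
y_pred = [a + b * xi for xi in x]
print(abs(sum(y_pred) / n - my) < 1e-9)  # True: mean of Y' equals mean of Y
```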

10. Develop an intuitive argument to explain why the sign of
the correlation coefficient and the sign of the slope of the regression
line are the same.

11. Use the data from the "Bossi" file (Presenting
Problem in Chapter 3) to form a 2 × 2
contingency table for the frequencies of hematuria (hematur) and
whether patients had RBC units > 5 (gt5rbc). The odds ratio is 1.90.
Form 95% confidence limits for the odds ratio and compare
them to those calculated by the statistical program. What is the
conclusion?

12. Group Exercise. The causes
and pathogenesis of steroid-responsive nephrotic syndrome (also
known as minimal-change disease) are unknown. Levinsky and colleagues
(1978) postulated that this disease might have an immunologic basis
because it may be associated with atopy, recent immunizations, or
a recent upper respiratory infection. It is also responsive to corticosteroid
treatment. They analyzed the serum from children with steroid-responsive
nephrotic syndrome for the presence of IgG-containing immune complexes
and the complement-binding properties (C1q-binding) of these complexes.
For purposes of comparison, they also studied these two variables
in patients with systemic lupus erythematosus. You will need to
consult the published article for details of the study; a graph
from the study is reproduced in Figure 8–21.

a. What were the study's basic research questions?

b. What was the study design? Is it the best one for the study's
purposes?

c. What was the rationale in defining the kinds of patients
to be studied? How were subjects obtained?

d. Interpret the correlations for the two sets of patients
in Figure 8–21. What conclusions do you draw about the
relationships between C1q-binding and IgG complexes in patients
with systemic lupus erythematosus? In patients with steroid-responsive
nephrotic syndrome?

e. Discuss the use of the parallel lines surrounding the regression
line; do they refer to means or individuals? (Hint: The standard
error of regression is 11.95 and Σ(X – X̄)²
is 21,429.37.)
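For part e, the width of the bands settles the question: bands for the mean of Y at a given X use SE = s(Y·X) × sqrt(1/n + (X – X̄)²/Σ(X – X̄)²), while bands for individuals add 1 under the square root and are much wider. A sketch using the hint's values, with a hypothetical n and deviation X – X̄ (neither is given here; consult the article):

```python
# Compare standard errors for predicting the mean of Y vs. an individual Y.
import math

s_yx = 11.95       # standard error of regression, from the hint
ss_x = 21429.37    # sum of squared deviations of X, from the hint
n = 30             # hypothetical sample size
dev = 50.0         # hypothetical deviation of X from its mean

se_mean = s_yx * math.sqrt(1/n + dev**2 / ss_x)
se_indiv = s_yx * math.sqrt(1 + 1/n + dev**2 / ss_x)
print(f"SE for mean: {se_mean:.2f}; SE for individual: {se_indiv:.2f}")
```

Narrow bands close to the regression line indicate limits for the mean; bands wide enough to enclose most of the points indicate limits for individuals.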

f. Do you think the regression lines for the two sets of patients
will differ?

g. Would the results from this study generalize? If so, to
what patient populations, and what cautions should be taken? If
not, what features of the study limit its generalizability?

Figure 8–21.

Scatterplot of C1q-binding complexes and IgG complexes
in patients with systemic lupus erythematosus (SLE; circles) and
steroid-responsive nephrotic syndrome (SRNS; squares), illustrating the
possibility of differences in regression lines for SLE and SRNS
patients. (Reproduced, with permission, from Levinsky RJ, Malleson
PN, Barratt TM, Soothill JF: Circulating immune complexes in steroid-responsive
nephrotic syndrome. N Engl J Med 1978;298:126–129.)

13. Group Exercise. The MRFIT
study (Multiple Risk Factor Intervention Trial Research Group, 1982)
has been called a landmark trial; it was the first large-scale clinical
trial, and it is rare to have a study that follows, for a number of
years, the more than 300,000 men who were screened for the trial. The Journal
of the American Medical Association reprinted this article in 1997.
In addition, the journal published a comment in the Landmark Perspective
section by Gotto (1997). Obtain a copy of both articles.

a. What research design was used in the study?

b. Discuss the eligibility criteria. Are these criteria still
relevant today?

c. What were the treatment arms? Are these treatments still
relevant today?

d. What statistical methods were used? Were they appropriate?
One method, the Kaplan–Meier product-limit method, is discussed
in Chapter 9.

e. Refer to Figure 1 in the original study. What do the lines
in the figure indicate?

f. Examine the distribution of deaths given in the article's
Table 4. What statistical method is relevant to analyzing these
results?

g. The perspective by Gotto discusses the issue of power in the
MRFIT study. How was the power of the study affected by the initial
assumptions made in the study design?