Plus Various & Sundry other issues

**Purpose of the Procedure and Statistics**

The purpose of **Pearson's Correlation
Coefficient** is to **indicate a linear
relationship** between two measurement variables. This means that if
you have two sets of scores, you want to know: **Does one
score predict another?** For example:

**Do your SAT scores predict your GPA?**

Or why bother to take the SATs?

**Does stress predict how well you will do on an exam or
other task?**

Might be good to know for people who have stressful jobs. Let's yell at our computer programmers more - then we'll get some good bug-free code.

Doctors might want to know this, parents also.

**In all these cases, you want to know if one score is
high - Is the other high? If one is low - is the other
low?**

"Hey, Dad, I have Lots o' Dough for College! $$$$"

"Sorry, Dad - Crappy SATs!" "Well, Son - Flipping Burgers For You!"

That's why you take the SATs - it's supposed to be predictive. Is it always
the case? Not really - we'll show you how to evaluate such
relationships!

**Here's a chart - then we will explain the
concepts!**

**1. Plot a Scattergram**

You want to make a simple graph that shows whether there is a pattern to the two sets
of scores. **Let's look at SAT and GPA.**

So you arrange your **data in columns -
Like This:**

- You **pick** one column for **X** (for the X axis). **X** is usually the score **used to predict (SAT).**
- You **pick** one column for **Y**. **Y** is usually the score you **want to predict (GPA)**.

- **Draw a graph** (really - we use computers now!)
- **Draw an X and Y set of axes**
- **Plot** the **points**
- Each **SAT and GPA** pair makes an **(X,Y) pair**

If SAT and GPA are related, we would
expect:

Note: People above the mean of the GPA distribution are usually above the
mean of the SAT distribution, and people below one mean are usually below the other. **The
scattergram indicates a correlation. The cloud of points looks like an ellipse, and the major axis of the ellipse
(the line) would be a good one to use to predict GPA from SAT**.

Let's say we made a **scattergram of Height versus
GPA**. There's no relation there that we know of. So for any Height,
people can have good or bad GPAs. **For any Height, our best
guess would be the mean of the GPAs**, hence the horizontal line. I'd
guess 2.0 for your GPA - whether you are tall or short.

So in one case (GPA and SAT) we have a positive correlation, and in the other example (GPA and Height) we have zero correlation. What about the values? We haven't calculated them yet - I just ballparked them.

Let's look at **GPA and amount of drinking**
(we survey students and measure the number of drinks per day vs. GPA - this is
made-up data).

**The more you drink, the worse your GPA. This is a
negative correlation**.

**1. Think about Z scores**

How do you know if you are doing well in a distribution as compared to another distribution? If you have a high Z score on the SATs and it predicts your GPA - you should have a high Z score on the GPAs - You would be in the top of each distribution.

If your GPA stinks, I would expect your SATs to be not so hot. You would be below the mean on both and have negative z scores.
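If you want to see the arithmetic, here is a minimal Python sketch of turning raw scores into Z scores. It assumes the usual definition, Z = (score - mean) / standard deviation, with the standard deviation computed by dividing by N. The SAT numbers and the function name `z_scores` are made up for illustration.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Convert raw scores to Z scores: (score - mean) / standard deviation."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD (divides by N), an assumption here
    return [(s - m) / sd for s in scores]

# Made-up SAT scores for five students (illustrative only).
sat = [400, 500, 600, 700, 800]
sat_z = z_scores(sat)

for raw, z in zip(sat, sat_z):
    # Scores below the mean get negative Z scores, scores above get positive ones.
    print(raw, round(z, 2))
```

Someone sitting right at the mean gets Z = 0, and the Z scores of a whole distribution always average out to zero.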

**2. The Formula**

The correlation coefficient is calculated based on the following formula that uses your Z scores:

**r = Σ(Zx × Zy) / N**

This means:

- **Calculate everybody's Z score.**
- **Multiply your Zx by your Zy.**
- **Add up these products for everyone.**
- **Divide by the number of people or observations.**
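The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SAT and GPA values are made up, the names `z_scores` and `pearson_r` are hypothetical, and the standard deviation divides by N so that it matches the "divide by the number of people" step.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Z score = (score - mean) / population standard deviation."""
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

def pearson_r(x, y):
    """Average of the Zx * Zy products over all people."""
    zx, zy = z_scores(x), z_scores(y)            # step 1: everybody's Z scores
    products = [a * b for a, b in zip(zx, zy)]   # step 2: multiply Zx by Zy
    return sum(products) / len(products)         # steps 3 and 4: add up, divide by N

# Made-up data for five students (illustrative only).
sat = [400, 500, 600, 700, 800]
gpa = [2.0, 2.4, 3.1, 3.0, 3.8]

r = pearson_r(sat, gpa)
print(round(r, 2))  # a strong positive correlation for this made-up sample
```

In real work you would let a statistics package do this, but the hand-rolled version shows there is nothing mysterious inside.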

So if we **multiply your Z scores together, sum all
these products** and get a positive sum (**positive** scores for X multiplied by **positive scores** for Y, and **negative** scores for X multiplied by **negative** scores for Y), we have a **positive correlation**.

If you have a **negative relationship**, you
will have a sum **of positive Zxs times negative Zys and
vice versa.** Add these up for a negative sum.

If you have **no correlation, then you get equal numbers
of positive Zx times negative Zy, positive Zx times positive Zy, negative Zx
times negative Zy, and negative Zx times positive Zy. Add these up and you get
zero**. The graphic below explains it all.
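To make the sign bookkeeping concrete, here is a small Python sketch that counts how many Zx times Zy products come out positive and how many come out negative. The two data sets are made up (one with no relationship, one negative), and `tally_products` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def z_scores(scores):
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

def tally_products(x, y):
    """Return (count of positive products, count of negative products, r)."""
    products = [zx * zy for zx, zy in zip(z_scores(x), z_scores(y))]
    positives = sum(1 for p in products if p > 0)
    negatives = sum(1 for p in products if p < 0)
    return positives, negatives, sum(products) / len(products)

# Made-up data: height in inches vs. GPA, no relationship.
height = [60, 64, 68, 72]
gpa = [3.0, 2.0, 2.0, 3.0]
print(tally_products(height, gpa))   # positives and negatives cancel out

# Made-up data: drinks per day vs. GPA, a negative relationship.
drinks = [0, 1, 2, 3]
gpa_drinkers = [3.6, 3.1, 2.7, 2.0]
print(tally_products(drinks, gpa_drinkers))  # every product is negative
```

In the height example the positive and negative products balance, so the sum (and r) is zero. In the drinking example every product is negative, so r is negative.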

**3. The Values and Limits of Pearson's Correlation
Coefficient**

**Pearson's correlation coefficient (or r) can range from -1
to +1. No other value is possible.** A value of zero (0.0) indicates
either that the variables are not related or that the relationship is more complex or nonlinear.
**Values close to -1 or +1 indicate strong
predictive relationships.** The sign indicates the direction of the
relationship (or its slope). **Negative correlations would
come from relationships with negative slope. Positive would represent a positive
slope**.

**Significance: You need to
test the value you get for significance to see if it is not chance**.
It is possible to get a line by chance. For example, there is no
relationship between Height and GPA. But in your sample, you choose by sheer luck
smart short kids and dumb tall kids. Look to the left. The significance test
should tell you if this is happening.
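The text does not say which significance test to use. One common choice (an assumption here) is the t test for a correlation, t = r × sqrt((N - 2) / (1 - r^{2})), with N - 2 degrees of freedom. A minimal Python sketch, with the hypothetical name `t_for_r` and illustrative numbers:

```python
import math

def t_for_r(r, n):
    """t statistic for testing whether a correlation r from n pairs differs from zero.

    Assumes -1 < r < 1 and n > 2 (r of exactly +1 or -1 would divide by zero).
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# The same r = 0.5 with different sample sizes (illustrative numbers).
for n in (5, 27, 102):
    print(n, round(t_for_r(0.5, n), 2))
```

The same r gives a bigger t as the sample grows, which is why a correlation from a handful of lucky kids is much less convincing than the same value from a large sample. You then compare t against a t table (or let software report the p value).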

Nuance: Why is the maximum r = 1?

Consider a **correlation coefficient calculated between
your height in inches and your height in feet.** This is an incredibly stupid thing
to correlate (see our graph below). **Obviously the
relationship will be perfect**. All the points are on the line. There
is no spread to the scattergram. Thus your **Zx will equal
your Zy.** That's because you stand in the same relative location in
the height distribution no matter whether it is in feet or inches. Such a
relationship gives you an r = 1.0 in the simple derivation
above.
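You can check this with a quick Python sketch. The heights are made up and `z_scores` is a hypothetical helper. When Zx equals Zy, each product is just a squared Z score, and squared Z scores average to 1 (when the standard deviation divides by N), so r = 1.

```python
from statistics import mean, pstdev

def z_scores(scores):
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

# Made-up heights for five people, in inches and in feet.
inches = [60, 64, 66, 70, 75]
feet = [h / 12 for h in inches]

zx = z_scores(inches)
zy = z_scores(feet)

# Changing units does not change where you stand in the distribution.
same_z = all(abs(a - b) < 1e-9 for a, b in zip(zx, zy))

# r is the average of the Zx * Zy products, here the average of Z squared.
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)
print(same_z, round(r, 6))
```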

**4. Calculate r^{2}**

**Squaring the correlation coefficient results in what
is called the coefficient of determination or proportion of explained
variance.** For example, if r = 0.6, the proportion of explained
variance = 0.36. If r = -0.7, r^{2} = 0.49. Note these could be multiplied by 100
to produce the percent explained variance (36% and 49% respectively).

*What does this mean?*

It is assumed that someone's score (the one we want to predict) is made up of an explained component and unexplained component:

Thus the total variance, which summarizes how everyone varies from the mean, is made up of the predicted deviations from the mean and the unpredicted deviations. So:

[Total Variance = Explained Variance + Unexplained Variance].

**And r^{2}
equals:**

[Explained Variance / Total Variance].
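Here is a minimal Python sketch that checks this split numerically. It assumes the predictions come from the least-squares best-fit straight line, uses made-up SAT and GPA values, and divides every variance by N. All the variable names are illustrative.

```python
from statistics import mean, pstdev

# Made-up data for five students (illustrative only).
sat = [400, 500, 600, 700, 800]
gpa = [2.0, 2.4, 3.1, 3.0, 3.8]
n = len(gpa)

mx, my = mean(sat), mean(gpa)
sdx, sdy = pstdev(sat), pstdev(gpa)

# r as the average product of Z scores.
r = sum(((x - mx) / sdx) * ((y - my) / sdy) for x, y in zip(sat, gpa)) / n

# Best-fit line, then the predicted GPA for each student.
slope = r * sdy / sdx
intercept = my - slope * mx
predicted = [slope * x + intercept for x in sat]

total = sum((y - my) ** 2 for y in gpa) / n                        # total variance
explained = sum((p - my) ** 2 for p in predicted) / n              # predicted deviations
unexplained = sum((y - p) ** 2 for y, p in zip(gpa, predicted)) / n  # leftover deviations

print(round(total, 4), round(explained + unexplained, 4))
print(round(explained / total, 4), round(r ** 2, 4))
```

The two numbers on each printed line should agree: the pieces add up to the total, and the explained share is r^{2}.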

Thus:

**5. The Regression Line (*a Big and Nifty Idea*)**

Now that we know we have a relationship and SAT and GPA are related, we will in the future just want to get someone's SAT and predict their GPA.

To do this, we use the regression line, where Y' is the Y score predicted from the X score. Slope and intercept are calculated, and we want: **Y' = slope(X) + intercept**

This is the best-fit line and it’s the one through the ellipse. It means that
when you have a scattergram, you are going to calculate the equation of the line
that zaps right through the middle of the cloud. You need the **slope** and the **intercept**.

It has been determined that the **slope =
r(SDy/SDx)**

and the

**Intercept = Mean of Y - slope(Mean of X).**

Most calculators and computer programs will do this for you.
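If you are curious what the calculator is doing, here is a minimal Python sketch. It assumes the standard least-squares line, with slope = r(SDy/SDx) and intercept = Mean of Y - slope(Mean of X). The data are made up and `regression_line` is a hypothetical name.

```python
from statistics import mean, pstdev

def regression_line(x, y):
    """Return (slope, intercept) of the best-fit line for predicting y from x."""
    mx, my = mean(x), mean(y)
    sdx, sdy = pstdev(x), pstdev(y)
    # r as the average product of Z scores.
    r = sum(((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)) / len(x)
    slope = r * sdy / sdx
    intercept = my - slope * mx   # forces the line through (Mean of X, Mean of Y)
    return slope, intercept

# Made-up data for five students (illustrative only).
sat = [400, 500, 600, 700, 800]
gpa = [2.0, 2.4, 3.1, 3.0, 3.8]

slope, intercept = regression_line(sat, gpa)

# Predict the GPA of a new student with a (made-up) SAT of 650.
predicted_gpa = slope * 650 + intercept
print(round(slope, 4), round(intercept, 2), round(predicted_gpa, 2))
```

Notice that the line always passes through the point (Mean of X, Mean of Y): plug the mean SAT in and you get the mean GPA back.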

*Fancy Nuances:*

- Calculate the **Standard errors of estimate** for your prediction, slope, and intercept.
- **Tests for significance** between correlations, slopes and intercepts of different data sets.
- **Binomial Effect Size Displays** instead of r^{2} to see how useful your r is.
- **Multiple Regression** - why just have one predictor X - you can have more. Why not predict GPA from SAT and your number of siblings so: **GPA' = Constant_{0} + Constant_{1}(SAT) + Constant_{2}(Siblings)?**
- **Wacky Scattergrams**, so you can't fit a line. You may have a relationship but Pearson's correlation coefficient wouldn't do the job. Here's two examples:

There are ways to deal with this. There are other even fancier wacky things that can go wrong. Worry about it when you become more advanced.
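As one illustration of a wacky scattergram (made-up numbers, not taken from the missing figures), here is a Python sketch of a perfect U-shaped relationship. Y is completely determined by X, yet Pearson's r comes out at zero because the relationship is not a straight line. The helper name `pearson_r` is hypothetical.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Average product of Z scores."""
    mx, my = mean(x), mean(y)
    sdx, sdy = pstdev(x), pstdev(y)
    return sum(((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)) / len(x)

# A perfect U shape: y is exactly x squared.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(round(r, 6))  # no linear relationship, even though y depends entirely on x
```

That is why you should always look at the scattergram before trusting r.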

**Pearson's Correlation Coefficient:**

- Tells you if there is a **linear relationship between two variables**.
- Tells you how **good** the relationship is by seeing if **r is close to 1 or -1** or r^{2} is close to 1.0.
- Tells you if the **relationship is positive or negative** by whether r is positive or negative.
- Gives you an equation for a straight line so you can predict one score from another.

Every time you take a standardized test - someone is doing this to you.