The Pearson Product Moment Correlation Coefficient (Pearson's r)
Plus Various & Sundry other issues

Why is this Important?

Purpose of the Procedure and Statistics

The purpose of Pearson's Correlation Coefficient is to indicate a linear relationship between two measurement variables. This means that if you have two sets of scores, you want to know: Does one score predict another? For example:

Do your SAT scores predict your GPA?

Or why bother to take the SATs?

Does stress predict how well you will do on an exam or other task?

Might be good to know for people who have stressful jobs. Let's yell at our computer programmers more - then we'll get some good bug-free code.

Does a baby's birth weight predict how many colds it will have in infancy?

Doctors might want to know this, parents also.

In all these cases, you want to know: if one score is high, is the other high? If one is low, is the other low?

Hey, Dad, I have high SATs!
Well, Son - Lots o' Dough for College!
$$$$

Sorry, Dad - Crappy SATs!
Well, Son - Flipping Burgers For You!

That's why you take the SATs - it's supposed to be predictive. Is it always the case? Not really - we'll show you how to evaluate such relationships!

The Big Ideas or the Things You Should Know & How to Do Them.

Here's a chart - then we will explain the concepts!

1. Plot a Scattergram

You want to make a simple graph that shows if there is a pattern to the two sets of scores. Let's look at SAT and GPA.



So you arrange your data in columns - Like This:



If SAT and GPA are related, we would expect:

Note: People above the mean of the GPA distribution are usually above the mean of the SAT distribution and vice versa. The scattergram indicates a correlation. It looks like an ellipse, and the major axis of the ellipse (the line) would be a good one to use to predict GPA from SAT.

Let's say we made a scattergram of Height versus GPA. There's no relation there that we know of. So for any Height, people can have good or bad GPAs. For any Height, our best guess would be the mean of the GPAs, hence the horizontal line. I'd guess 2.0 for your GPA - whether you are tall or short.

So in one case (GPA and SAT) we have a positive correlation, and in the other example (GPA and Height) we have zero correlation. What about the values? We haven't calculated them yet - I just ballparked them.

Let's look at GPA and amount of drinking (we survey students and measure the number of drinks per day vs. GPA - this is made-up data).

The more you drink, the worse your GPA. This is a negative correlation.




Big Idea

If you get a tight, elongated, ellipse-like scattergram, you have a good predictive relationship and correlation!

Calculating the Pearson's Product Moment Correlation Coefficient (r) and Related Things.

1. Think about Z scores

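How do you know if you are doing well in one distribution as compared to another distribution? A Z score is your score minus the mean, divided by the standard deviation, so it tells you where you stand relative to everyone else. If you have a high Z score on the SATs and SAT predicts GPA, you should also have a high Z score on GPA - you would be in the top of each distribution.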

If your GPA stinks, I would expect your SATs to be not so hot. You would be below the mean on both and have negative z scores.
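Here's a minimal sketch in Python of turning raw scores into Z scores. The five SAT and GPA values are made up for illustration, and using the standard deviation with N in the denominator (`pstdev`) is an assumption, not something these notes specify.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to Z scores: (score - mean) / standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # assumption: SD computed with N in the denominator
    return [(v - m) / sd for v in values]

# Made-up scores for five students (illustrative only).
sat = [400, 500, 600, 700, 800]
gpa = [1.8, 2.4, 2.9, 3.1, 3.8]

z_sat = z_scores(sat)
z_gpa = z_scores(gpa)

# Students below the mean get negative Z scores, students above get positive ones.
for s, g, zs, zg in zip(sat, gpa, z_sat, z_gpa):
    print(f"SAT {s}: Z = {zs:+.2f}    GPA {g}: Z = {zg:+.2f}")
```

In this made-up sample, the student with the lowest SAT also has the lowest GPA, so both Z scores are negative; the top student is positive on both.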

2. The Formula

The correlation coefficient is calculated based on the following formula that uses your Z scores:
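In symbols, a standard way to write the Z-score version of the formula is the one below. It assumes the Z scores are computed with the standard deviation that has N in the denominator; if the N - 1 standard deviation is used instead, the sum is divided by N - 1.

```latex
r = \frac{\sum_{i=1}^{N} Z_{x_i} Z_{y_i}}{N},
\qquad
Z_{x_i} = \frac{X_i - \bar{X}}{\mathrm{SD}_x},
\qquad
Z_{y_i} = \frac{Y_i - \bar{Y}}{\mathrm{SD}_y}
```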

 

This means:

So if we multiply your Z scores together and sum all these pairs, and we get a positive sum (positive scores for X multiplied by positive scores for Y, and negative scores for X multiplied by negative scores for Y), we have a positive correlation.

If you have a negative relationship, you will have the sum of positive Zxs times negative Zys and vice versa. Add these up for a negative sum.

If you have no correlation, then you get equal numbers of positive Zx times negative Zy, positive Zx times positive Zy, negative Zx times negative Zy, and negative Zx times positive Zy. Add these up and you get zero. The graphic below explains it all.
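The multiply-and-sum idea can be sketched in a few lines of Python. The data are made up, the function names are just for illustration, and dividing by N (with the N-denominator standard deviation) is an assumption about which version of the formula is meant.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z score = (score - mean) / SD, with SD using N in the denominator."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

def pearson_r(xs, ys):
    """Multiply paired Z scores, sum the products, divide by N."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Made-up data: high SAT goes with high GPA.
sat = [400, 500, 600, 700, 800]
gpa = [1.8, 2.4, 2.9, 3.1, 3.8]
r_pos = pearson_r(sat, gpa)

# Made-up data: more drinks per day goes with lower GPA.
drinks = [0, 1, 2, 3, 4]
gpa_drinkers = [3.8, 3.1, 2.9, 2.4, 1.8]
r_neg = pearson_r(drinks, gpa_drinkers)

print(f"SAT vs GPA:    r = {r_pos:+.2f}")   # positive products dominate
print(f"Drinks vs GPA: r = {r_neg:+.2f}")   # negative products dominate
```

With these numbers the first r comes out near +0.99 and the second near -0.99, because almost every Z product has the same sign.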





3. The Values and Limits of the Pearson's Correlation Coefficient

Pearson's correlation coefficient (or r) can range from -1 to +1. No value outside this range is possible. A value of zero (0.0) indicates that the variables are not linearly related: they may be unrelated, or they may have a more complex or nonlinear relationship. Values close to -1 or +1 indicate strong predictive relationships. The sign indicates the direction of the relationship (its slope). Negative correlations come from relationships with a negative slope. Positive correlations come from relationships with a positive slope.

Significance: You need to test the value you get for significance to see whether it could have come about by chance. It is possible to get a line purely by chance. For example, there is no relationship between Height and GPA. But in your sample, you choose by sheer luck smart short kids and dumb tall kids. Look to the left. The significance test should tell you if this is happening.
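One common way to run such a test (not necessarily the one intended in these notes) is to convert r into a t value with N - 2 degrees of freedom and compare it with a critical value from a t table. The sketch below assumes a two-tailed test at the .05 level; the sample sizes and the r of 0.6 are made up.

```python
import math

def t_statistic(r, n):
    """t value for testing whether a sample r differs from zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The same r = 0.6 from a small and a larger sample (made-up numbers).
t_small = t_statistic(0.6, 10)   # df = 8
t_large = t_statistic(0.6, 30)   # df = 28

# Two-tailed .05 critical values taken from a standard t table.
CRITICAL_DF8 = 2.306
CRITICAL_DF28 = 2.048

print(f"n = 10: t = {t_small:.2f}, significant: {abs(t_small) > CRITICAL_DF8}")
print(f"n = 30: t = {t_large:.2f}, significant: {abs(t_large) > CRITICAL_DF28}")
```

The same r can fail the test with 10 people and pass it with 30, which is the point: with a small sample, a lucky draw can produce a decent-looking line.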

 





Nuance: Why is the maximum r = 1?

Consider a correlation coefficient calculated between your height in inches and your height in feet. This is an incredibly stupid thing to correlate (see our graph below). Obviously the relationship will be perfect. All the points are on the line. There is no spread to the scattergram. Thus your Zx will equal your Zy. That's because you stand in the same relative location in the height distribution no matter whether it is in feet or inches. Such a relationship gives you an r = 1.0 in the simple derivation above.
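The inches-versus-feet case is easy to check numerically. This is a small sketch with made-up heights; the helper functions are illustrative and use the N-denominator standard deviation as an assumption.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z score = (score - mean) / SD, with SD using N in the denominator."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

def pearson_r(xs, ys):
    """Sum of paired Z-score products divided by N."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Made-up heights in inches, and the same heights converted to feet.
heights_in = [60, 64, 66, 70, 75]
heights_ft = [h / 12 for h in heights_in]

z_in = z_scores(heights_in)
z_ft = z_scores(heights_ft)
r = pearson_r(heights_in, heights_ft)

# Each person's Z score is the same in both units (up to rounding error),
# so every product is Z squared and the average of Z squared is 1.
print([round(z, 3) for z in z_in])
print([round(z, 3) for z in z_ft])
print(round(r, 6))
```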






4. Calculate r²

Squaring the correlation coefficient results in what is called the coefficient of determination or proportion of explained variance. For example, if r = 0.6, the proportion of explained variance = 0.36. If r = -0.7, r² = 0.49. Note these could be multiplied by 100 to produce the percent explained variance (36% and 49% respectively).

What does this mean?

It is assumed that someone's score (the one we want to predict) is made up of an explained component and an unexplained component:

Your score = Explained (from the prediction) + Unexplained

Thus the total variance which summarizes how everyone varies from the mean is made up of the predicted deviations from the mean and the unpredicted deviations. So:

Total Variance = Explained (from the prediction) + Unexplained

And r² equals:

[Explained Variance / Total Variance].

Thus:

r² = Proportion of Total Variance that is Explained (from the prediction)
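The split into explained and unexplained variance can be checked with a small Python sketch. The scores are made up, the predictions come from the usual least-squares line (slope from r and the two standard deviations, intercept chosen so the line passes through both means), and all variances use N in the denominator; those choices are assumptions for illustration.

```python
from statistics import mean, pstdev, pvariance

def pearson_r(xs, ys):
    """Sum of paired Z-score products divided by N."""
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)

# Made-up scores for five students.
sat = [400, 500, 600, 700, 800]
gpa = [1.8, 2.4, 2.9, 3.1, 3.8]

r = pearson_r(sat, gpa)
slope = r * pstdev(gpa) / pstdev(sat)
intercept = mean(gpa) - slope * mean(sat)

predicted = [slope * x + intercept for x in sat]      # explained part
residuals = [y - p for y, p in zip(gpa, predicted)]   # unexplained part

total = pvariance(gpa)
explained = pvariance(predicted)
unexplained = pvariance(residuals)

print(f"total = {total:.4f}, explained + unexplained = {explained + unexplained:.4f}")
print(f"explained / total = {explained / total:.4f}, r squared = {r ** 2:.4f}")
```

Both printed lines should show matching pairs of numbers: the two pieces add up to the total, and the explained share equals r².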



5. The Regression Line (a Big and Nifty Idea):

Y' = Slope (X) + Intercept

Now that we know we have a relationship and SAT and GPA are related, we will in the future just want to get someone's SAT and predict their GPA.

To do this, we use the regression line where Y' is the Y score predicted from the X score. Slope and intercept are calculated and we want

GPA' = Slope (SAT) + Intercept

This is the best-fit line and it's the one through the ellipse. It means that when you have a scattergram, you are going to calculate the equation of the line that zaps right through the middle of the cloud. You need the slope and the intercept.

It has been determined that the slope = r(SDy/SDx)

and the

Intercept = Mean of Y - Slope(Mean of X).

Most calculators and computer programs will do this for you.
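If you want to see what those programs are doing, here is a minimal Python sketch. The SAT and GPA numbers are made up, the function names are illustrative, and the intercept is computed as Mean of Y minus Slope times Mean of X, which makes the line pass through the point (Mean of X, Mean of Y).

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Sum of paired Z-score products divided by N."""
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)

def regression_line(xs, ys):
    """Return (slope, intercept) for predicting Y from X."""
    slope = pearson_r(xs, ys) * pstdev(ys) / pstdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

# Made-up scores for five students.
sat = [400, 500, 600, 700, 800]
gpa = [1.8, 2.4, 2.9, 3.1, 3.8]

slope, intercept = regression_line(sat, gpa)

# Predict the GPA of a new student with an SAT of 650.
predicted_gpa = slope * 650 + intercept

print(f"GPA' = {slope:.4f} (SAT) + {intercept:.2f}")
print(f"Predicted GPA for SAT 650: {predicted_gpa:.2f}")
```

For this sample the slope works out to 0.0047 GPA points per SAT point, so the student with a 650 is predicted to land a little above 3.0.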

Fancy Nuances:

GPA' = Constant0 + Constant1(SAT) + Constant2(Siblings)?

Thing to Watch Out For:

Wacky Scattergrams, so you can't fit a line. You may have a relationship but Pearson's correlation coefficient wouldn't do the job. Here are two examples:




There are ways to deal with this. There are other even fancier wacky things that can go wrong. Worry about it when you become more advanced.
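To see how a real relationship can hide from Pearson's r, here is a small made-up example in Python: Y is determined exactly by X (Y is X squared), but the pattern is a U shape rather than a line. This is an illustration chosen for this sketch, not necessarily one of the two examples pictured above.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Sum of paired Z-score products divided by N."""
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)

# A perfect but U-shaped relationship: Y = X squared.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curved = pearson_r(x, y)

# On the left side of the U the Z products are negative, on the right side
# they are positive, and the two halves cancel out.
print(f"r for the U-shaped data: {abs(r_curved):.3f}")
```

The r comes out at essentially zero even though knowing X tells you Y exactly, which is why you plot the scattergram first.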

Bottom Line:

Pearson's Correlation Coefficient:

Tells you if there is a linear relationship between two variables

Tells you how good the relationship is by seeing if r is close to 1 or -1 or r² is close to 1.0.

Tells you if the relationship is positive or negative by whether r is positive or negative.

Gives you an equation for a straight line so you can predict one score from another.

Every time you take a standardized test - someone is doing this to you.