LESSON 6: Scatter plots, curve fitting and correlation

FOCUS QUESTION: How can I determine whether two variables are related?

This lesson shows you how to determine whether two variables are related by calculating their correlation. This lesson also introduces scatter plots and illustrates how to create a linear model to capture the relationship between two variables.

In this lesson you will:
  • Compute the correlation between two data sets.
  • Compare two data sets by plotting them against each other in a scatter plot.
  • Add a linear fit line to a scatter plot using the plottools.
  • Compute the linear fit line directly.
  • Compute the error between linear predictions and actual data.
  • Create computed title strings based on specifics of the data.

Darwin's finches, drawn by John Gould


DATA FOR THIS LESSON

File Description
DaphneIsland.txt
  • The file contains family beak size data for Darwin's ground finches on Daphne Island, one family per row:
    • The first column contains the chick beak size in mm
    • The second column contains the mother's beak size in mm
    • The third column contains the father's beak size in mm.
  • The data was extracted from a data set distributed with the case study Natural Selection and Darwin's Finches by Martin Wikelski available on the web at http://wps.prenhall.com/esm_freeman_evol_3/0,7017,749374-,00.html.
  • The original data is summarized in the article: "The classical case of character release: Darwin's finches (Geospiza) on Isla Daphne Major, Galápagos" by P. T. Boag and P. R. Grant that appeared in Biological Journal of the Linnean Society 22:243-277 (1984). See http://en.wikipedia.org/wiki/Peter_and_Rosemary_Grant for additional information on the work of Peter and Rosemary Grant.

SETUP FOR LESSON 6


EXAMPLE 1: Load the Darwin Finch beak size parentage data

Create a new cell in which you type and execute:

    beaks = load('DaphneIsland.txt');

You should see a beaks variable in your Workspace.
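
Before drawing the array, it can help to confirm its shape. The sketch below is not part of the lesson data: it uses a small made-up array named demo (hypothetical values, not taken from DaphneIsland.txt) that is laid out like beaks, with one row per family and one column each for child, mother, and father.

```matlab
% Hypothetical stand-in for beaks: 4 families x 3 columns (values are made up)
demo = [ 9.1  8.9  9.4;
        10.0  9.8 10.3;
         8.7  9.0  8.5;
         9.6  9.5  9.9];

[nFamilies, nColumns] = size(demo);   % Number of rows and number of columns
fprintf('%d families, %d columns\n', nFamilies, nColumns);
```

Calling size(beaks) in the same way reports the dimensions of the real data.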

In the space below, draw a picture of the beaks array and label its rows and columns.  


EXAMPLE 2: Define variables for child, mother, father, and parent beak sizes

Create a new cell in which you type and execute:

    child = beaks(:, 1);                % Size of offspring beaks
    mother = beaks(:, 2);               % Size of maternal beaks
    father = beaks(:, 3);               % Size of paternal beaks
    parent = mean(beaks(:, 2:3), 2);    % Average the parent beak sizes

You should see four additional variables in your Workspace Browser: child, mother, father, and parent.
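
The second argument of mean selects the dimension to average over, which is why parent has one value per family. The following minimal sketch uses a made-up array (demo is a hypothetical name and the values are not finch data) to contrast averaging across each row with averaging down each column.

```matlab
% Made-up array: columns are child, mother, father (hypothetical values)
demo = [ 9.0  8.0 10.0;
        10.0  9.0  9.5];

rowMeans = mean(demo(:, 2:3), 2);   % Average across columns: one value per row
colMeans = mean(demo(:, 2:3), 1);   % Average down rows: one value per column

disp(rowMeans);                     % Column vector with 2 elements
disp(colMeans);                     % Row vector with 2 elements
```

With dimension 2, each family's mother and father values are averaged together, which is the behavior EXAMPLE 2 relies on.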


EXAMPLE 3: Calculate the beak size correlations

Create a new cell in which you type and execute:

    fprintf('Daphne Island ground finch beak size correlations:\n');
    fprintf('   mother-child:\t %g\n', corr(mother, child));

You should see the following output in your Command Window:

Daphne Island ground finch beak size correlations:
   mother-child:	 0.75621
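
To see what the correlation measures, the sketch below computes a Pearson correlation by hand on made-up column vectors (x and y are hypothetical, not finch data) and compares the result with corrcoef from base MATLAB. It assumes that corr is called with its default settings, which compute the Pearson correlation.

```matlab
% Made-up column vectors (not finch data)
x = [1; 2; 3; 4; 5];
y = [2; 4; 5; 4; 5];

dx = x - mean(x);                 % Deviations of x from its mean
dy = y - mean(y);                 % Deviations of y from its mean
r = sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2));   % Pearson correlation

R = corrcoef(x, y);               % 2-by-2 correlation matrix
fprintf('by hand: %g, corrcoef: %g\n', r, R(1, 2));
```

The two values should agree to within rounding error (about 0.77 for these vectors).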


Create a new cell here (a cell begins with %%). Write a MATLAB statement to print the correlation between the father's beak size and the child's beak size. Also write a MATLAB statement to print the correlation between the parents' average beak size and the child's beak size.


EXAMPLE 4: Plot the average parent beak size against the child beak size

Create a new cell in which you type and execute:

    pcCor = corr(parent, child);               % Calculate parent-child correlation
                                               % Create a title string
    tString = ['Daphne Island ground finches (corr = ' num2str(pcCor) ')'];
    figure('Name', tString)                    % Put this title on the window
    plot(parent, child, 'ko')                  % Plot a scatter plot
    xlabel('Mean parent beak size (mm)');      % Label the x-axis
    ylabel('Child beak size (mm)');            % Label the y-axis
    title(tString);                            % Put this title on the graph

You should see a Figure Window containing a scatter plot of child beak size against mean parent beak size, with the correlation shown in the title.
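
EXAMPLE 4 builds its computed title by concatenating strings with num2str. As an alternative sketch, sprintf can build the same kind of title while letting you choose the number of decimal places. The value of r below is hypothetical and is used only for illustration.

```matlab
r = 0.72345;    % Hypothetical correlation value, for illustration only

% %.2f formats r with exactly two digits after the decimal point
tString = sprintf('Daphne Island ground finches (corr = %.2f)', r);
disp(tString);
```

The resulting string can be passed to title or to figure('Name', ...) exactly as in EXAMPLE 4.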


EXAMPLE 5: Edit the figure of EXAMPLE 4


EXAMPLE 6: Display the plot in a new Figure Window

Create a new cell in which you type and execute:

    open BeakSize.fig;

You should see a Figure Window containing an edited plot with a fit line:


EXAMPLE 7: Calculate and print the best fit lines for beak parentage

Create a new cell in which you type and execute:

    fprintf('Best fit for beak size of Daphne Island ground finches:\n');

    mPoly = polyfit(mother, child, 1);   % Linear fit of mother vs child
    fprintf('   mother-child:\t y = %g*x + %g\n', mPoly(1), mPoly(2));

You should see the variable mPoly in the Workspace Browser.

You should also see the following output in your Command Window:

Best fit for beak size of Daphne Island ground finches:
   mother-child:	 y = 0.658622*x + 2.66147
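
A quick way to check your understanding of polyfit is to fit points whose line you already know. The sketch below uses made-up points (not finch data) that lie exactly on y = 2*x + 1, so the fit should return a slope close to 2 and an intercept close to 1.

```matlab
% Made-up points that lie exactly on the line y = 2*x + 1
x = [1; 2; 3; 4];
y = 2 * x + 1;

p = polyfit(x, y, 1);     % p(1) is the slope, p(2) is the intercept
fprintf('y = %g*x + %g\n', p(1), p(2));
```

With real data such as the beak sizes, the points scatter around the line, so the fit line passes near the points rather than through them.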

In the space below: Enter your definitions in this cell and execute the cell to create these variables.


EXAMPLE 8: Evaluate each of the best fit lines to find predicted values

Create a new cell in which you type and execute:

    mPred = polyval(mPoly, mother);   % Child beak sizes predicted by the mother-child fit

You should see the variable mPred in the Workspace Browser.
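
For a linear fit, polyval simply applies the slope and intercept to each input value. The sketch below uses made-up coefficients (p and xNew are hypothetical names, not lesson variables) to show that polyval gives the same result as writing the line equation out by hand.

```matlab
p = [2 1];                        % Made-up coefficients of y = 2*x + 1
xNew = [0; 1; 2];                 % Points at which to evaluate the line

yPred = polyval(p, xNew);         % Evaluate the polynomial at each point
yByHand = p(1) * xNew + p(2);     % The same line written out explicitly

disp([yPred yByHand]);            % The two columns are identical
```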

In the space below: Enter your definitions in this cell and execute the cell to create these variables.


EXAMPLE 9: Find the difference between actual child beak sizes and predictions

Create a new cell in which you type and execute:

    mErrors = child - mPred;     % Actual - predicted by mother's size

You should see the variable mErrors in the Workspace Browser.

In the space below: Enter your definitions in this cell and execute the cell to create these variables.


EXAMPLE 10: Find the root mean squared error (RMS) between actual and predicted sizes

Create a new cell in which you type and execute:

    mRMS = sqrt( mean(mErrors .* mErrors) );   % Square the errors, average, then take the root

You should see the variable mRMS in the Workspace Browser.
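
Squaring the errors before averaging keeps positive and negative errors from canceling. The sketch below uses a made-up error vector (errs is hypothetical, not computed from the finch data) whose plain average is zero even though every prediction is off.

```matlab
errs = [1; -1; 2; -2];               % Hypothetical prediction errors

meanErr = mean(errs);                % Positive and negative errors cancel
rmsErr = sqrt(mean(errs .^ 2));      % Square, average, then take the root

fprintf('mean error = %g, RMS error = %g\n', meanErr, rmsErr);
```

Here the mean error is 0 while the RMS error is about 1.58, which better reflects the typical size of an error.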

In the space below: Enter your definitions in this cell and execute the cell to create these variables.


EXAMPLE 11: Output the root mean squared error (RMS)

Create a new cell in which you type and execute:

    fprintf('Root mean squared error (RMS) for Best fit lines:\n');
    fprintf('   mother-child:\t %g\n', mRMS);

You should see the following output in your Command Window:

Root mean squared error (RMS) for Best fit lines:
   mother-child:	 0.484323


SUMMARY OF SYNTAX

MATLAB syntax Description
rho = corr(x, y)
  • Calculates the correlation between the column vectors x and y, which must have the same number of elements.
  • The variable rho contains a single value between -1 and 1 indicating how closely the values of x and y follow a straight-line relationship.
  • If the correlation is close to 1, the values of x and y go up and down together. If the correlation is close to -1, the values of x and y go in opposite directions (if one goes up, the other tends to go down). A correlation close to 0 indicates that x and y are not linearly related.
p = polyfit(x, y, n)
  • Calculates the coefficients of the polynomial of degree n that best fits y as a function of x. This polynomial minimizes the sum of the squared errors (and therefore the RMS error) and is called the least-squares fit.
  • The coefficients appear in the vector p in order of decreasing power, so p(1) is the coefficient of the highest-degree term.
Y = polyval(p, X)
  • Evaluates the polynomial whose coefficients are in the vector p at the points contained in the array X. The array Y holds the results of this evaluation and has the same size as X.
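
As a caution when reading correlation values, a correlation near 0 rules out only a straight-line relationship. The sketch below uses made-up data and corrcoef from base MATLAB (used here in place of corr) in which y is completely determined by x, yet the correlation is essentially zero.

```matlab
x = (-2:2)';          % Made-up values symmetric about zero
y = x .^ 2;           % y depends entirely on x, but not along a straight line

R = corrcoef(x, y);   % 2-by-2 correlation matrix; R(1, 2) is the x-y correlation
fprintf('correlation = %g\n', R(1, 2));
```

A scatter plot of y against x would show the curved relationship that the correlation value misses, which is one reason to plot the data as well as compute the correlation.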


This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The drawing was done by John Gould before 1882 and was cataloged in Darwin's finches or Galapagos finches. Darwin, 1845. Journal of researches into the natural history and geology of the countries visited during the voyage of H.M.S. Beagle round the world, under the Command of Capt. Fitz Roy, R.N. 2d edition. The copyright has expired on this image.