LESSON 10: Histograms

FOCUS QUESTION: How can I understand and compare the distributions of two data sets?

This lesson demonstrates different ways of combining and displaying histograms.

In this lesson you will:
  • Calculate and display a histogram.
  • Use bar charts, line graphs and stair plots to display distributions.
  • Convert binned counts to a cumulative probability distribution.
  • Observe the characteristics of common distributions.
  • Compare histograms and bar charts.
Photo of Darwin's finch Geospiza fortis


DATA FOR THIS LESSON

Files:
  • DaphneIslandBeaks.txt
  • SantaCruzIslandBeaks.txt

Description:
  • The data set consists of measurements of beak sizes (in mm) of one species of Darwin's ground finch (Geospiza fortis) taken at Daphne Island and at Santa Cruz Island in the Galápagos by Peter and Rosemary Grant.
  • The populations of the two islands differ, although the islands are less than 10 km apart.
  • The data was extracted from a data set distributed with the case study Natural Selection and Darwin's Finches by Martin Wikelski available on the web at http://wps.prenhall.com/esm_freeman_evol_3/0,8018,8412374-,00.html.
  • The original data is summarized in the article: "The classical case of character release: Darwin's finches (Geospiza) on Isla Daphne Major, Galápagos" by P. T. Boag and P. R. Grant that appeared in Biological Journal of the Linnean Society 22:243-287 (1984).

See http://en.wikipedia.org/wiki/Peter_and_Rosemary_Grant for additional information on the work of Peter and Rosemary Grant.

SETUP FOR LESSON 10


EXAMPLE 1: Load the Daphne Island and Santa Cruz Island beak size data

Create a new cell in which you type and execute:

    Daphne = load('DaphneIslandBeaks.txt');
    SantaCruz = load('SantaCruzIslandBeaks.txt');

You should see the variables Daphne and SantaCruz in your Workspace Browser.
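
The load function reads a plain-text file of numbers into a numeric array. The sketch below illustrates this with a small temporary file rather than the finch data. It assumes one measurement per line, which may not match the layout of the lesson files; the names vals, fname and loaded are made up for this illustration.

```matlab
vals = [10.1; 9.8; 11.2];            % Made-up numbers, one per line
fname = [tempname '.txt'];           % Hypothetical temporary file name
save(fname, 'vals', '-ascii');       % Write the numbers as plain text
loaded = load(fname);                % Read them back into a numeric array
delete(fname);                       % Clean up the temporary file
disp(size(loaded))                   % Number of rows and columns read
```

If the finch files are laid out the same way, size(Daphne) reports one row per bird measured.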

In the space below: Enter your definitions in this cell and execute the cell to create these variables.


EXAMPLE 2: Display a histogram of the Daphne Island beak size data

Create a new cell in which you type and execute:

    nDaphne = length(Daphne);         % Find number of Daphne Island finches
    titleDaphne = ['Daphne Island ground finches (n=' num2str(nDaphne) ')'];
    figure('Name', titleDaphne);     % Create a titled figure window
    hist(Daphne)                     % Calculate and plot the histogram
    xlabel('Beak size in mm');
    ylabel('Number of birds');
    title(titleDaphne);              % Use same title for plot as window

You should see the variables nDaphne and titleDaphne in your Workspace Browser.

You should also see a labeled and titled histogram plot.

In the space below: write MATLAB code to plot a histogram of the Santa Cruz Island finch beak sizes. Enter your definitions in this cell and execute the cell to create the plot.


EXAMPLE 3: Calculate and display a histogram using a bar chart, line graph and stair plot

Create a new cell in which you type and execute:

    [n, xout] = hist(Daphne);      % Calculate histogram but don't display
    figure
    hold on
    bar(xout, n);                  % Plot histogram using a bar chart
    plot(xout, n, '-ok')           % Plot histogram using a black line graph
    stairs(xout, n, 'r')           % Plot histogram using a red stair plot
    hold off
    xlabel('Beak size in mm');
    ylabel('Number of birds');
    title(titleDaphne);
    datacursormode on              % Turn the data cursor on for exploration

You should see the variables n and xout in your Workspace Browser.

You should also see a labeled and titled plot that overlays the bar chart, line graph and stair plot.
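
The two outputs of hist can be checked on a small vector. The sketch below uses a made-up vector x (not the finch data) and four bins. The n output holds the number of values in each bin and xout holds the bin centers, so the counts add up to the number of data values.

```matlab
x = [1 2 2 3 3 3 4 4 4 4];       % Made-up data, not from the lesson files
[n, xout] = hist(x, 4);          % Four equally spaced bins from min(x) to max(x)
binWidth = xout(2) - xout(1);    % Spacing between adjacent bin centers
disp(n)                          % Counts per bin
disp(xout)                       % Bin centers
disp(sum(n) == length(x))        % Every value falls in exactly one bin
```

Because xout contains bin centers, bar(xout, n) centers each bar on its bin, while plot and stairs place their points and steps at those same center positions.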


EXAMPLE 4: Explore the data cursor feature


EXAMPLE 5: Use different bin sizes for Daphne Island histograms

Create a new cell in which you type and execute:

    figure
    subplot(3, 1, 1)
    hist(Daphne, 10)      % Create and plot a 10-bin histogram
    title(titleDaphne)    % Put title over topmost graph
    legend('10 bins')
    ylabel('Birds')
    subplot(3, 1, 2)
    hist(Daphne, 25)      % Create and plot a 25-bin histogram
    legend('25 bins')
    ylabel('Birds')
    subplot(3, 1, 3)
    hist(Daphne, 100)     % Create and plot a 100-bin histogram
    legend('100 bins')
    xlabel('Beak size in mm')
    ylabel('Birds')

You should see a subplot with three axes aligned vertically:
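
When hist is given a number of bins, it divides the range from the smallest to the largest value into that many equal-width bins. The sketch below shows the effect on bin width using made-up data (not the finch measurements); the variable names are chosen only for this illustration.

```matlab
x = [0 2.5 5 7.5 10];                 % Made-up data spanning 0 to 10
[n10, c10] = hist(x, 10);             % 10 bins
[n25, c25] = hist(x, 25);             % 25 bins
width10 = c10(2) - c10(1);            % (10 - 0)/10 = 1
width25 = c25(2) - c25(1);            % (10 - 0)/25 = 0.4
fprintf('10 bins: width %.2f, total count %d\n', width10, sum(n10));
fprintf('25 bins: width %.2f, total count %d\n', width25, sum(n25));
```

Narrower bins typically hold fewer values each, which is one reason a histogram with many bins tends to look more jagged than one with few.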


EXAMPLE 6: Compare beak distributions of Daphne and Santa Cruz Islands

Create a new cell in which you type and execute:

    nSantaCruz = length(SantaCruz);    % Find number of Santa Cruz Island finches
    figure
    subplot(1, 2, 1)
    hist(Daphne)                       % Histogram of Daphne Island finches
    title(['Daphne (n=' num2str(nDaphne) ')'])
    xlabel('Beak size (mm)')
    ylabel('Number of birds')          % Only use one y label for both axes
    subplot(1, 2, 2)
    hist(SantaCruz)                    % Histogram of Santa Cruz Island finches
    title(['Santa Cruz (n=' num2str(nSantaCruz) ')'])
    xlabel('Beak size (mm)')

You should see the variable nSantaCruz in your Workspace Browser.

You should also see a subplot with two side-by-side axes:


EXAMPLE 7: Calculate histograms with explicit bin positions

Create a new cell in which you type and execute:

    % Calculate bin positions explicitly to encompass the range of both data sets
    minBeak = min([min(Daphne), min(SantaCruz)]);
    maxBeak = max([max(Daphne), max(SantaCruz)]);
    xBins = minBeak:0.2:maxBeak;          % Bin positions will be 0.2 apart
    % Calculate the histograms based on these bins
    [nD, xD] = hist(Daphne, xBins);       % Histogram of Daphne Island
    [nS, xS] = hist(SantaCruz, xBins);    % Histogram of Santa Cruz Island

You should see the variables minBeak, maxBeak, xBins, nD, xD, nS and xS in your Workspace Browser.
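
When the second argument of hist is a vector, its elements are treated as bin centers, and each data value is counted in the bin with the nearest center. Values beyond the outermost centers go into the first or last bin. The sketch below illustrates this with made-up data and centers (not the finch measurements).

```matlab
x = [0.9 1.0 1.1 1.9 2.0 3.0 3.4];  % Made-up data, not from the lesson files
centers = 1:3;                       % Bin centers at 1, 2 and 3
[n, xout] = hist(x, centers);        % Count values nearest to each center
disp(n)                              % Counts per bin
disp(isequal(xout(:), centers(:)))   % hist returns the centers it was given
```

Using the same vector of centers for two data sets, as the example above does with xBins, means the counts can be compared bin by bin.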


EXAMPLE 8: Use overlapping stair plots to compare scaled distributions

Create a new cell in which you type and execute:

    legendString = {['Daphne (n=' num2str(nDaphne) ')'], ...
                    ['Santa Cruz (n=' num2str(nSantaCruz) ')']};
    figure
    hold on
    stairs(xD, nD/sum(nD), 'k');          % Daphne in black
    stairs(xS, nS/sum(nS), 'r');          % Santa Cruz in red
    xlabel('Beak size in mm');
    ylabel('Fraction of birds with this beak size');
    legend(legendString,  'Location', 'NorthWest');
    title('Scaled comparison of Daphne and Santa Cruz finches')
    hold off

You should see the variable legendString in your Workspace Browser.

You should see a labeled plot with overlapping stair graphs:
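
Dividing the counts by their total converts them to fractions, so samples of different sizes can be compared on the same vertical scale. The sketch below uses hypothetical counts (not the finch data) to show that two samples with the same shape but different sizes give the same fractions.

```matlab
nSmall = [2 6 2];                 % Hypothetical counts from a sample of 10
nLarge = [20 60 20];              % Hypothetical counts from a sample of 100
fracSmall = nSmall/sum(nSmall);   % Fractions: 0.2 0.6 0.2
fracLarge = nLarge/sum(nLarge);   % Same fractions despite 10 times the sample
disp([fracSmall; fracLarge])
disp([sum(fracSmall), sum(fracLarge)])   % Each set of fractions sums to 1
```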


EXAMPLE 9: Calculate the cumulative distributions for both data sets

Create a new cell in which you type and execute:

    cumD = cumsum(nD)/sum(nD);       % Cumulative distribution of Daphne Island
    cumS = cumsum(nS)/sum(nS);       % Cumulative distribution of Santa Cruz Island

You should see the variables cumD and cumS in your Workspace Browser.
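
The cumulative calculation can be followed by hand on a short vector. The sketch below uses hypothetical counts and bin centers (not the finch data). Each element of cumFrac is the fraction of values in that bin or any bin to its left, so the values never decrease and the last one is 1. As one possible use, the first bin where the cumulative fraction reaches 0.5 gives a rough location for the median.

```matlab
n = [1 2 3 4];                    % Hypothetical bin counts
xout = [1.375 2.125 2.875 3.625]; % Hypothetical bin centers
cumFrac = cumsum(n)/sum(n);       % Running total of counts as a fraction
disp(cumFrac)                     % 0.1 0.3 0.6 1.0
medianBin = xout(find(cumFrac >= 0.5, 1));   % First bin reaching one half
disp(medianBin)                   % 2.875
```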


EXAMPLE 10: Plot the cumulative distributions for both data sets

Create a new cell in which you type and execute:

    figure
    hold on
    plot(xD, cumD, 'k');
    plot(xS, cumS, 'r');
    title('Beak size distributions at two islands in the Galápagos')
    xlabel('Beak size (mm)')
    ylabel('Fraction of birds with beak size less than x')
    legend(legendString, 'Location', 'NorthWest');  % Reuse legend from last example
    hold off

You should see a labeled plot with two overlapping line graphs:


EXAMPLE 11: Generate "random" numbers from three common probability distributions

Create a new cell in which you type and execute:

    yNormal = random('norm', 0, 1, [10000, 1]);   % Normal with zero mean and unit sd
    yUniform = random('unif', -1, 1, [10000, 1]); % Uniform in the interval [-1, 1]
    yExponential = random('exp', 1, [10000, 1]);  % Exponential with mean 1

You should see the variables yNormal, yUniform and yExponential in your Workspace Browser.
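
The random function belongs to the Statistics Toolbox. If that toolbox is not available, similar samples can be drawn with the base functions randn and rand. The sketch below is a substitute technique, not the lesson's method; the variable names ending in Alt are introduced here only for illustration. The exponential sample uses the inverse transform method, in which -log(u) for u uniform on (0, 1) is exponential with mean 1.

```matlab
nSamples = 10000;
yNormalAlt = randn(nSamples, 1);              % Normal, mean 0 and sd 1
yUniformAlt = -1 + 2*rand(nSamples, 1);       % Uniform on the interval (-1, 1)
yExponentialAlt = -log(rand(nSamples, 1));    % Exponential with mean 1
fprintf('Sample means: %.3f %.3f %.3f\n', ...
    mean(yNormalAlt), mean(yUniformAlt), mean(yExponentialAlt));
```

The sample means should be close to 0, 0 and 1, although they differ slightly from run to run because the numbers are random.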


EXAMPLE 12: Display the histograms of the generated distributions

Create a new cell in which you type and execute:

    figure
    subplot(3,1,1)
    hist(yNormal,50)
    title('Normal distribution (mean = 0, sd = 1)')

    subplot(3,1,2)
    hist(yUniform,50)
    title('Uniform distribution (on interval [-1, 1])')

    subplot(3,1,3)
    hist(yExponential,50)
    title('Exponential distribution (mean = 1)')

You should see a subplot with three axes aligned vertically:


SUMMARY OF SYNTAX

MATLAB syntax
    Description

cumsum(A)
    Calculates the cumulative sum along the first non-singleton dimension of the array A.

cumsum(A, dim)
    Calculates the cumulative sum of the array A along dimension dim.

hist(x)
    Creates a histogram plot of the values in the vector x.

[n, xout] = hist(x)
    Calculates the histogram of the vector x, but does not plot anything. The n variable contains the counts and the xout variable contains bin locations.

random(distName, parameters, [n, m])
    Generates an n x m array of numbers randomly selected from the specified probability distribution. The parameters item represents the values of the parameters needed to define the particular probability distribution. For example, a normal distribution is specified by its mean and standard deviation. On the other hand, the exponential distribution is specified only by its mean.

stairs(Y)
    Plots stair-step graphs of the columns of the array Y against the positive integers.

stairs(X, Y)
    Plots stair-step graphs of the columns of the array Y against the columns of the array X.
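
The phrase "first non-singleton dimension" can be illustrated with a small matrix and a row vector. The arrays below are made up for this sketch.

```matlab
A = [1 2; 3 4];
downColumns = cumsum(A);       % Along dimension 1: [1 2; 4 6]
acrossRows = cumsum(A, 2);     % Along dimension 2: [1 3; 3 7]
rowVector = cumsum([1 2 3]);   % First non-singleton dimension is 2: [1 3 6]
disp(downColumns)
disp(acrossRows)
disp(rowVector)
```

For a row vector such as the counts returned by hist, cumsum therefore accumulates along the vector without needing a dim argument.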


_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The image, Medium Ground Finch Geospiza fortis, Santa Cruz, Galapagos, was taken by Mark Putney. The original source is http://www.flickr.com/photos/putneymark/13516124843/in/set-72157601810082531/. The image is available under common license at http://commons.wikimedia.org/wiki/File:Geospiza_fortis.jpg._