LESSON 10: Histograms questions

FOCUS QUESTION: How can I understand and compare the distributions of two data sets?


EXAMPLE 1: Load the Daphne Island and Santa Cruz Island beak size data

    Daphne = load('DaphneIslandBeaks.txt');
    SantaCruz = load('SantaCruzIslandBeaks.txt');

Questions Answers
Why was the data for the two islands put in separate files? The two data sets are not the same size, so they cannot simply be stored as two columns of one array. You could pad the shorter data set with a missing-value designator such as NaN to make the sizes match, but the measurements are independent. Arranging the two data sets as side-by-side columns in the same file might wrongly suggest that values in the same row are paired.
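
As a minimal sketch of the padding idea mentioned above, the following uses two short made-up vectors (not the actual beak data) and pads the shorter one with NaN so that both fit in one two-column array. All variable names here are hypothetical.

    a = [9.1; 9.8; 10.4; 11.0; 9.5];     % made-up measurements, first group
    b = [10.2; 11.5; 12.1];              % made-up measurements, second group
    nRows = max(length(a), length(b));   % rows needed to hold the longer set
    padded = NaN(nRows, 2);              % start with every entry marked missing
    padded(1:length(a), 1) = a;          % first column holds a
    padded(1:length(b), 2) = b;          % second column holds b, rest stays NaN

The last two rows of the second column remain NaN, and each row now pairs values that have nothing to do with each other, which is why separate files are the cleaner choice.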


EXAMPLE 2: Display a histogram of the Daphne Island beak size data

    nDaphne = length(Daphne);         % Find number of Daphne Island finches
    titleDaphne = ['Daphne Island ground finches (n=' num2str(nDaphne) ')'];
    figure('Name', titleDaphne);     % Create a titled figure window
    hist(Daphne)                     % Calculate and plot the histogram
    xlabel('Beak size in mm');
    ylabel('Number of birds');
    title(titleDaphne);              % Use same title for plot as window

Questions Answers
What is a histogram? A histogram is a frequency table, that is, a table listing how many times each value (or range of values) appears in a data set.
What is a frequency table? A frequency table records how many times each data value occurs in the data set. If the data set only has a small number of values, we keep a count for each possible value. For data sets that contain real numbers or have a large number of possible discrete values, we use a binned frequency table.
What does the hist function do? The hist function computes a binned frequency table or histogram for the data. Since the example did not have output arguments, the resulting frequency table is plotted as a bar chart rather than returned as an array.
What is a binned frequency table? A binned frequency table divides the possible data values into subranges called bins and counts how many values fall into each bin.
How many bins does hist use? By default, the hist function uses 10 equal-sized bins that span the range of the data. (You may also explicitly specify the bins as in later examples.)
Does a histogram always have to be displayed as a bar chart? No. The bar chart is a common visual representation of a histogram but not the only useful one.
Does the hist function always display a figure? No. If you use the output arguments, as shown in the next example, the hist function does not produce a figure.
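
The difference between an exact frequency table and a binned one can be sketched on a small made-up vector (not the beak data). The choice of three bins is arbitrary, and the values were picked so that none sits on a bin boundary.

    x = [1 2 2 3 3 3 5 5 6 10];                  % made-up data values
    vals = unique(x);                            % distinct values, sorted
    counts = arrayfun(@(v) sum(x == v), vals);   % exact count for each value
    [n, centers] = hist(x, 3);                   % binned: 3 equal-width bins over [1, 10]
    disp([vals; counts])                         % exact frequency table
    disp([centers; n])                           % bin centers and bin counts

The bins are 3 units wide with centers 2.5, 5.5 and 8.5, and they hold 6, 3 and 1 values respectively. Both tables account for all 10 values.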


EXAMPLE 3: Calculate and display a histogram using a bar chart, line graph and stair plot

    [n, xout] = hist(Daphne);      % Calculate histogram but don't display
    figure
    hold on
    bar(xout, n);                  % Plot histogram using a bar chart
    plot(xout, n, '-ok')           % Plot histogram using a black line graph
    stairs(xout, n, 'r')           % Plot histogram using a red stair plot
    hold off
    xlabel('Beak size in mm');
    ylabel('Number of birds');
    title(titleDaphne);
    datacursormode on              % Turn the data cursor on for exploration

Questions Answers
How does [n, xout] = hist(Daphne) differ from the hist(Daphne) of EXAMPLE 2? When you use output arguments with the hist function, MATLAB does not draw a figure. Rather, the hist function returns the frequency counts and the centers of the bins.
Did I need to assign the result of hist to variables? Yes, if you want to get the values in the frequency table rather than just see a plot. Use the form with output arguments when you want to do your own display or if you want to compute something else from the frequency table.
When would I need the bin positions and counts from a histogram? This example illustrates using these values to display the histogram in three different ways. You might also want to do further computations on these values or compute a cumulative probability distribution.
What does '-ok' mean in plot? The '-ok' is shorthand for black (k) circular markers (o) that are connected by a solid line (-).
Are there other shortcuts for specifying plot characteristics? Yes. Several more shortcuts appear in this lesson. See LineSpec in the MATLAB help for a complete list.
What is the difference between plot and stairs? The plot function connects each consecutive (x, y) pair with a straight line. The stairs function connects each consecutive (x, y) pair with a staircase. MATLAB draws a horizontal line between the two x values at the level of the first y value. At the second x value, MATLAB draws a vertical line between the two y values to form a stair.
What is the data cursor? The data cursor feature allows you to read the coordinates of points on a graph by selecting them with the mouse.
What does datacursormode on do? This command turns on the data cursor for the current figure.
Can I turn on the data cursor from the plottools? Yes. Use Tools->Data Cursor on the Figure Window menubar. Alternatively, use the data cursor icon on the Figure Window toolbar.
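
To see concretely how stairs differs from plot, one option is to ask stairs for its coordinates instead of a drawing. With two output arguments it returns the staircase vertices rather than plotting them. The bin centers and counts below are made up for illustration, and the variable names are hypothetical.

    xout = [1 2 3];                 % made-up bin centers
    n = [4 6 5];                    % made-up counts
    [xs, ys] = stairs(xout, n);     % staircase vertices, nothing is drawn
    disp([xs(:) ys(:)])             % rows: (1,4) (2,4) (2,6) (3,6) (3,5)
    [~, iMax] = max(n);             % index of the fullest bin
    modalCenter = xout(iMax);       % center of the fullest bin

Passing xs and ys to plot would reproduce the stair plot. The last two lines show one further computation that needs the returned counts: locating the fullest bin.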


EXAMPLE 4: Explore the data cursor feature


EXAMPLE 5: Use different bin sizes for Daphne Island histograms

    figure
    subplot(3, 1, 1)
    hist(Daphne, 10)      % Create and plot a 10-bin histogram
    title(titleDaphne)    % Put title over topmost graph
    legend('10 bins')
    ylabel('Birds')
    subplot(3, 1, 2)
    hist(Daphne, 25)      % Create and plot a 25-bin histogram
    legend('25 bins')
    ylabel('Birds')
    subplot(3, 1, 3)
    hist(Daphne, 100)     % Create and plot a 100-bin histogram
    legend('100 bins')
    xlabel('Beak size in mm')
    ylabel('Birds')

Questions Answers
What does the 10 represent in the first call to hist? The 10 specifies the number of bins to use in the frequency table. The default number of bins is 10, so the first call to hist behaves the same as hist(Daphne). The second call to hist uses 25 bins. Notice that the bars on the corresponding graph are thinner because more of them must fit in the same area.
Should I always use a large number of bins for a histogram? Choosing the right bin size is sometimes a tricky trade-off. If you choose too few bins, the poor resolution may hide interesting features. If you choose too many bins, some bins will be sparsely occupied and the histogram may take on a jagged appearance. You may also miss essential features. It is usually good to experiment with the bin size to see what the trade-offs are.
How does MATLAB determine the positions of the bins? When the second argument is a number of bins, hist divides the range between the smallest and largest data values into that many equal-width bins. Your data can't contain +Inf or -Inf.
What happens if I move the xlabel statement after the first hist? The xlabel function adds an x-axis label to the current axes, so the top histogram's x-axis will be labeled instead of the bottom one.
What happens if I move the xlabel statement directly after the first subplot? The xlabel function adds an x-axis label to the current axes, which were just created by subplot. However, the hist call that follows redraws those axes from scratch, so the label is lost.
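
The bin-count trade-off can also be checked numerically. This sketch uses a deterministic made-up data set (values of sin, not the beak data) and reports, for each number of bins, the total count, the tallest bar, and the number of empty bins.

    x = sin(1:500);                       % made-up data in [-1, 1]
    binCounts = [10 25 100];              % same bin numbers as the example
    for k = binCounts
        n = hist(x, k);                   % counts only, no figure is drawn
        fprintf('%3d bins: total = %d, tallest = %d, empty = %d\n', ...
                k, sum(n), max(n), sum(n == 0));
    end

Every histogram accounts for all 500 values. What changes is how finely those values are spread across the bars.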


EXAMPLE 6: Compare beak distributions of Daphne and Santa Cruz Islands

    nSantaCruz = length(SantaCruz);    % Find number of Santa Cruz Island finches
    figure
    subplot(1, 2, 1)
    hist(Daphne)                       % Histogram of Daphne Island finches
    title(['Daphne (n=' num2str(nDaphne) ')'])
    xlabel('Beak size (mm)')
    ylabel('Number of birds')          % Only use one y label for both axes
    subplot(1, 2, 2)
    hist(SantaCruz)                    % Histogram of Santa Cruz Island finches
    title(['Santa Cruz (n=' num2str(nSantaCruz) ')'])
    xlabel('Beak size (mm)')

Questions Answers
Why are the vertical scales of the two histograms different? The counts depend on how many values each data set has, and these data sets are of different sizes.
Why are the horizontal scales of the two histograms different? The horizontal scales depend on the maximum and minimum values in the data set.
Can I still compare the distributions? These histograms do not allow very effective comparison of the data. A more effective comparison would use the same bins and scale the data to be fractions of the data set rather than actual counts.


EXAMPLE 7: Calculate histograms with explicit bin positions

Calculate bin positions explicitly so that they span the range of both data sets

    minBeak = min([min(Daphne), min(SantaCruz)]);
    maxBeak = max([max(Daphne), max(SantaCruz)]);
    xBins = minBeak:0.2:maxBeak;          % Bin positions will be 0.2 apart
    % Calculate the histograms based on these bins
    [nD, xD] = hist(Daphne, xBins);       % Histogram of Daphne Island
    [nS, xS] = hist(SantaCruz, xBins);    % Histogram of Santa Cruz Island

Questions Answers
Why was the min of the min needed to find the minimum beak size? The first pair of min functions finds the minimum values of the Daphne and Santa Cruz data individually. The square brackets combine these values into a two-element vector. We need to apply another min to find the overall minimum.
What is the first element in minBeak:0.2:maxBeak? The first element is the value of minBeak.
How far apart are the elements of minBeak:0.2:maxBeak? The elements are 0.2 apart. Successive elements are integral multiples of 0.2 plus the value of the first element.
Is maxBeak always the last element of minBeak:0.2:maxBeak? No, maxBeak is only the last element if it differs from minBeak by an integral multiple of 0.2. Otherwise, the sequence stops at the largest value of the form minBeak plus a multiple of 0.2 that doesn't exceed maxBeak.
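
A small sketch with made-up limits (not the actual beak extremes) shows the case where the upper limit is not reached, and that hist still counts values lying beyond the last bin center. The names lo, hi and lastValue are hypothetical.

    lo = 8;  hi = 8.5;                  % made-up limits, 0.5 apart
    v = lo:0.2:hi;                      % 8.0, 8.2, 8.4 (8.6 would exceed hi)
    lastValue = v(end);                 % approximately 8.4, not 8.5
    n = hist([8.45 8.5], v);            % both values land in the last bin

When hist is given bin centers, its first and last bins are open-ended, so no data is dropped even though the last center falls short of the upper limit.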


EXAMPLE 8: Use overlapping stair plots to compare scaled distributions

    legendString = {['Daphne (n=' num2str(nDaphne) ')'], ...
                    ['Santa Cruz (n=' num2str(nSantaCruz) ')']};
    figure
    hold on
    stairs(xD, nD/sum(nD), 'k');          % Daphne in black
    stairs(xS, nS/sum(nS), 'r');          % SC in red
    xlabel('Beak size in mm');
    ylabel('Fraction of birds with this beak size');
    legend(legendString,  'Location', 'NorthWest');
    title('Scaled comparison of Daphne and Santa Cruz finches')
    hold off

Questions Answers
Why was nD divided by sum(nD)? Since the data sets do not have the same number of elements, a comparison of the raw counts is not meaningful. Dividing each count by the total number of elements gives fractions, which are comparable.
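
A tiny numerical sketch of why fractions are comparable when counts are not: the two made-up count vectors below come from samples of different sizes but have the same shape. The variable names are hypothetical.

    nSmall = [2 6 2];                    % made-up counts, 10 values in total
    nLarge = [10 30 10];                 % made-up counts, 50 values in total
    fracSmall = nSmall / sum(nSmall);    % 0.2 0.6 0.2
    fracLarge = nLarge / sum(nLarge);    % 0.2 0.6 0.2
    sameShape = max(abs(fracSmall - fracLarge)) < 1e-12;   % true

Plotted as raw counts, the second histogram would tower over the first. Plotted as fractions, the two curves coincide.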


EXAMPLE 9: Calculate the cumulative distributions for both data sets

    cumD = cumsum(nD)/sum(nD);       % Cumulative distribution of Daphne Island
    cumS = cumsum(nS)/sum(nS);       % Cumulative distribution of Santa Cruz Island

Questions Answers
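What does cumsum do? The cumsum function calculates the partial (running) sums of a vector: element k of the result is the sum of the first k elements of the input.
What are partial sums used for? Partial sums appear in calculus, for example in the definition of an infinite series. Cumulative sums of the histogram counts, divided by the total count, also approximate the cumulative probability distribution for the data.
What do the values of cumD represent? Each value in cumD is the fraction of data set values that fall in the bins up to and including the bin centered at the corresponding value in xD. This is approximately the fraction of values less than or equal to that value of xD. (See EXAMPLE 7.)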
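
On a small made-up count vector, the cumulative calculation looks like this (same operations as the example, different data):

    n = [6 3 1];                  % made-up bin counts, 10 values in total
    partial = cumsum(n);          % running totals: 6 9 10
    cumFrac = partial / sum(n);   % fractions: 0.6 0.9 1.0

The last element of a cumulative distribution computed this way is always 1, because the final partial sum equals the total count.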


EXAMPLE 10: Plot the cumulative distributions for both data sets

    figure
    hold on
    plot(xD, cumD, 'k');
    plot(xS, cumS, 'r');
    title('Beak size distributions at two islands in the Galápagos')
    xlabel('Beak size (mm)')
    ylabel('Fraction of birds with beak size less than x')
    legend(legendString, 'Location', 'NorthWest');  % Reuse legend from last example
    hold off


EXAMPLE 11: Generate "random" numbers from three common probability distributions

    yNormal = random('norm', 0, 1, [10000, 1]);      % Normal with zero mean and unit sd
    yUniform = random('unif', -1, 1, [10000, 1]);    % Uniform on the interval [-1, 1]
    yExponential = random('exp', 1, [10000, 1]);     % Exponential with mean 1

Questions Answers
Why are the values returned by random called pseudo-random? The sequence of values produced by random is generated by a deterministic formula, so it is completely predictable once you know the algorithm and its starting state.
Won't I always generate the same values each time I call random? Although the sequence of values is predictable, you can pick a different place to start (the seed), which gives the appearance of unpredictability.
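
The effect of the seed can be sketched with the rng function, assuming a MATLAB release that provides it (R2011a or later, so newer than this lesson). The sketch uses the built-in rand rather than random so that it does not depend on the Statistics Toolbox. The seed value 1 is arbitrary.

    rng(1);                         % start the generator from seed 1
    a = rand(1, 5);                 % five pseudo-random values
    rng(1);                         % restart from the same seed
    b = rand(1, 5);                 % the same five values again
    sameSequence = isequal(a, b);   % true: same seed, same sequence
    rng('shuffle');                 % seed from the current time instead

Fixing the seed is useful when a result must be reproducible. Seeding from the clock gives a different sequence on each run.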


EXAMPLE 12: Display the histograms of the generated distributions

    figure
    subplot(3,1,1)
    hist(yNormal,50)
    title('Normal distribution (mean = 0, sd = 1)')

    subplot(3,1,2)
    hist(yUniform,50)
    title('Uniform distribution (on interval [-1, 1])')

    subplot(3,1,3)
    hist(yExponential,50)
    title('Exponential distribution (mean = 1)')


_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The image, Medium Ground Finch Geospiza fortis, Santa Cruz, Galápagos, was taken by Mark Putney. The original source is http://www.flickr.com/photos/putneymark/13516124843/in/set-72157601810082531/. The image is available under common license at http://commons.wikimedia.org/wiki/File:Geospiza_fortis.jpg._