LESSON 11: Box plots

FOCUS QUESTION: How can I compare the distributions for data sets that have outliers?

In this lesson you will:
  • Use box plots to compare distributions from different data sets.
  • Explore different options for drawing box plots.
  • Use groupings of labeled data.
  • Learn about the median as an indicator of central tendency and the interquartile range (IQR) as an indicator of spread.
[Photo: Iris versicolor]


DATA FOR THIS LESSON

File Description
fisheriris
  • This data set contains the famous Fisher iris data set. The data set consists of measurements of 150 flower samples, 50 from each of three species of flowers: Iris setosa, Iris versicolor, and Iris virginica. The measurements are in cm.
  • Four features were measured for each sample:
    • The length of the flower sepal
    • The width of the flower sepal
    • The length of the flower petal
    • The width of the flower petal
  • All 150 samples from the Fisher iris data are stored in a single table called meas:
    • The four columns correspond to the four types of measurements: sepal length, sepal width, petal length and petal width, respectively.
    • The first 50 rows contain data for Iris setosa
    • The second 50 rows contain data for Iris versicolor
    • The third 50 rows contain data for Iris virginica.
  • The species information is kept in a separate cell array of strings called species, with one entry for each row of meas.
  • The data is sometimes referred to as Anderson's Iris data in honor of Edgar Anderson, the biologist who collected the data. See http://en.wikipedia.org/wiki/Iris_flower_data_set for additional information.

Note: This data set ships with MATLAB's Statistics Toolbox (now the Statistics and Machine Learning Toolbox), which also provides the boxplot function, so you don't have to download it separately.

NYCDiseases.mat
  • The data set contains the monthly totals of new cases of measles, mumps, and chicken pox in New York City during the years 1931-1971.
  • The data is organized into the following variables:
    • measles - an array containing the monthly cases of measles
    • mumps - an array containing the monthly cases of mumps
    • chickenPox - an array containing the monthly cases of chicken pox
    • years - a vector containing the years 1931 through 1971
  • The data was extracted from the Hipel-McLeod Time Series Datasets Collection, available at http://www.stats.uwo.ca/faculty/aim/epubs/mhsets/readme-mhsets.html.
  • The data originally appeared in: Yorke, J.A. and London, W.P. (1973). "Recurrent Outbreaks of Measles, Chickenpox and Mumps", American Journal of Epidemiology, Vol. 98, pp. 469.

DaphneIslandBeaks.txt
SantaCruzIslandBeaks.txt
  • The data set consists of measurements of beak sizes (in mm) of one species of Darwin's ground finch (Geospiza fortis) taken at Daphne Island and at Santa Cruz Island in the Galápagos by Peter and Rosemary Grant.
  • The populations of the two islands differ, although the islands are less than 10 km apart.
  • The data was extracted from a data set distributed with the case study Natural Selection and Darwin's Finches by Martin Wikelski available on the web at http://wps.prenhall.com/esm_freeman_evol_3/0,8018,8412374-,00.html.
  • The original data is summarized in the article: "The classical case of character release: Darwin's finches (Geospiza) on Isla Daphne Major, Galápagos" by P. T. Boag and P. R. Grant that appeared in Biological Journal of the Linnean Society 22:243-287 (1984).

See http://en.wikipedia.org/wiki/Peter_and_Rosemary_Grant for additional information on the work of Peter and Rosemary Grant.

SETUP FOR LESSON 11


EXAMPLE 1: Load the Fisher iris data (comes with MATLAB)

Create a new cell in which you type and execute:

   load fisheriris;

You should see the following 2 variables in your Workspace Browser: meas and species.
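
To check how the loaded data is organized, a cell along the following lines may help. It is a minimal sketch that assumes the toolbox sample data is installed; the names dataSize, speciesNames and blockStarts are made up for this illustration.

   load fisheriris;                       % Creates meas (numeric) and species (cell array)
   dataSize = size(meas)                  % 150 rows (samples) by 4 columns (features)
   speciesNames = unique(species)         % The distinct species names, sorted alphabetically
   blockStarts = species([1, 51, 101])    % Species of the first row in each block of 50

The last line shows which species occupies each block of 50 rows in meas.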


EXAMPLE 2: Compare the distributions of sepal and petal lengths using box plots

Create a new cell in which you type and execute:

   lengths = meas(:, [1, 3]);   % Define a variable for sepal and petal lengths
   figure
   boxplot(lengths, 'Label', {'Sepal', 'Petal'})  % Show boxplots of lengths
   ylabel('Length in cm')
   title('Comparison of sepal and petal lengths for Fisher iris data')

You should see the following variable in your Workspace Browser: lengths.

You should also see a Figure Window with a labeled box plot.
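
The parts of a box can also be computed numerically. The sketch below uses a small made-up sample (not the iris data) and assumes the default boxplot setting, in which points lying more than 1.5 times the IQR beyond the box edges are drawn as outliers. The quantile function comes from the Statistics Toolbox, and the quartiles that boxplot computes internally may differ in small details. All variable names are illustrative.

   sample = [1 2 3 4 5 6 7 8 30];          % Made-up data with one extreme value
   med = median(sample)                    % Line inside the box
   q = quantile(sample, [0.25, 0.75])      % Bottom and top edges of the box
   iqrValue = q(2) - q(1)                  % Interquartile range: the height of the box
   lowFence = q(1) - 1.5*iqrValue;         % Points below this value count as outliers
   highFence = q(2) + 1.5*iqrValue;        % Points above this value count as outliers
   outliers = sample(sample < lowFence | sample > highFence)

Here the value 30 is flagged as an outlier, yet the median stays at 5, while the mean of the same sample is pulled up to about 7.3. This is why the median and IQR are useful summaries for data sets that have outliers.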


Create a new cell right here (beginning of a cell starts with %%). Write MATLAB code to make a boxplot of the sepal widths. (This will be a plot consisting of a single box.)


EXAMPLE 3: Draw a box plot of the sepal lengths by species

Create a new cell in which you type and execute:

   sepalLens = meas(:, 1);        % Define a variable for the sepal length
   figure
   boxplot(sepalLens, species)    % The species vector specifies the group
   ylabel('Sepal length in cm')
   title('Comparison of three species in the Fisher iris data')

You should see the following variable in your Workspace Browser: sepalLens.

You should also see a Figure Window with a labeled box plot.
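
A grouping variable assigns each data value to one of the boxes. To see the same grouping numerically, the following sketch computes one median per group for a tiny made-up data set. The names values, labels, groupNames, groupIndex and groupMedians are illustrative only.

   values = [5.1; 4.9; 7.0; 6.4; 6.3; 5.8];        % Made-up measurements
   labels = {'setosa'; 'setosa'; 'versicolor'; ...
       'versicolor'; 'virginica'; 'virginica'};    % One label per measurement
   [groupNames, ~, groupIndex] = unique(labels);   % groupIndex(k) is the group number of values(k)
   groupMedians = accumarray(groupIndex(:), values, [], @median)

Each entry of groupMedians corresponds to the name in the same position of groupNames, just as each box corresponds to one group label.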


Create a new cell right here (beginning of a cell starts with %%). Write MATLAB code to make a boxplot of the sepal width separated by species. (You should have three boxplots.)


EXAMPLE 4: Draw a notched box plot of the sepal widths

Create a new cell in which you type and execute:

   sepalWidths = meas(:, 2);       % Define a variable for the sepal widths
   figure
   boxplot(sepalWidths, species, 'notch', 'on')
   ylabel('Sepal width in cm')
   title('Comparison of three species in the Fisher iris data')

You should see the following variable in your Workspace Browser: sepalWidths.

You should also see a Figure Window with a labeled box plot.
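
According to the MATLAB documentation for boxplot, a notch extends from q2 - 1.57(q3 - q1)/sqrt(n) to q2 + 1.57(q3 - q1)/sqrt(n), where q2 is the median, q1 and q3 are the quartiles, and n is the number of values. The sketch below evaluates that formula for a made-up sample. It assumes the Statistics Toolbox quantile function, and the variable names are illustrative.

   sample = 1:9;                                % Made-up data
   n = numel(sample);                           % Number of values in the sample
   q = quantile(sample, [0.25, 0.5, 0.75]);     % First quartile, median, third quartile
   halfWidth = 1.57*(q(3) - q(1))/sqrt(n);      % Half the height of the notch
   notch = [q(2) - halfWidth, q(2) + halfWidth] % Lower and upper ends of the notch

Because the half-width is divided by sqrt(n), larger data sets give narrower notches.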


EXAMPLE 5: Load the data about New York contagious diseases

Create a new cell in which you type and execute:

   load NYCDiseases.mat;    % Load the disease data

You should see 4 new variables in the Workspace Browser: measles, mumps, chickenPox, and years.


EXAMPLE 6: Draw a horizontal box plot of the monthly counts of scaled measles data

Create a new cell in which you type and execute:

   monthLabels = {'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', ...
       'Aug', 'Sep', 'Oct', 'Nov', 'Dec'};
   figure
   boxplot(measles./1000, monthLabels, 'orientation', 'horizontal')
   xlabel('Measles cases in thousands');
   ylabel('Month')
   title('Measles cases in NYC 1931-1971')

You should see the following variable in your Workspace Browser: monthLabels.

You should also see a Figure Window with a labeled box plot.


EXAMPLE 7: Draw a compact box plot of the monthly counts of scaled measles data

Create a new cell in which you type and execute:

   figure
   boxplot(measles./1000, monthLabels, 'plotstyle', 'compact')
   xlabel('Month')
   ylabel('Measles cases in thousands');
   title('Measles cases in NYC 1931-1971')

You should see a Figure Window with a labeled box plot.


EXAMPLE 8: Draw a labeled box plot of the monthly counts of all diseases

Create a new cell in which you type and execute:

   diseases = [measles(:), mumps(:), chickenPox(:)]./1000;
   figure
   boxplot(diseases, 'labels', {'measles', 'mumps', 'chickenpox'});
   ylabel('Monthly cases in thousands')
   title('Contagious childhood disease in NYC 1931-1971');

You should see the following variable in your Workspace Browser: diseases.

You should see a Figure Window with a labeled box plot.
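
The expression measles(:) reshapes an array into a single column, taking the elements column by column. A small made-up example shows the effect. A and B are stand-ins for the disease arrays, and columnA and combined are illustrative names.

   A = [1 2; 3 4];               % Stand-in for one disease array
   B = [10 20; 30 40];           % Stand-in for another disease array
   columnA = A(:)                % Column vector [1; 3; 2; 4]
   combined = [A(:), B(:)]       % 4-by-2 array with one column per data set

Because boxplot draws one box per column, an array like combined would produce two boxes.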


EXAMPLE 9: Load the Daphne Island and Santa Cruz Island beak size data

Create a new cell in which you type and execute:

    Daphne = load('DaphneIslandBeaks.txt');
    SantaCruz = load('SantaCruzIslandBeaks.txt');

You should see the following 2 variables in your Workspace Browser: Daphne and SantaCruz.


EXAMPLE 10: Create a labeled vector of beak sizes for plotting

Create a new cell in which you type and execute:

   beakSizes = [Daphne; SantaCruz];
   islands = [repmat('  Daphne  ', size(Daphne)); repmat('Santa Cruz', size(SantaCruz))];

You should see the following 2 variables in your Workspace Browser: beakSizes and islands.
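
The spaces around Daphne are needed because every row of a character array must contain the same number of characters. The sketch below repeats the construction on made-up column vectors (a and b are stand-ins for the beak data) and also shows a cell array alternative that needs no padding. All names are illustrative.

   a = [9.1; 9.4; 8.8];          % Stand-in for Daphne (column vector)
   b = [10.2; 10.9];             % Stand-in for SantaCruz (column vector)
   charLabels = [repmat('  Daphne  ', size(a)); repmat('Santa Cruz', size(b))];
   labelSize = size(charLabels)  % 5 rows of 10 characters each
   cellLabels = [repmat({'Daphne'}, size(a)); repmat({'Santa Cruz'}, size(b))]

Either form has one label per data value and so could serve as a grouping variable, provided the data are column vectors as assumed here.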


EXAMPLE 11: Create a box plot of unequal length data sets using labeled data

Create a new cell in which you type and execute:

   figure
   boxplot(beakSizes, islands, 'notch', 'on')
   ylabel('Beak size in mm')
   title('Geospiza fortis from nearby islands in the Galápagos');

You should see a Figure Window with a labeled box plot.


EXAMPLE 12*: Set the box widths to reflect the data set sizes

Create a new cell in which you type and execute:

   boxWidths = 0.75*[length(Daphne)./length(beakSizes), length(SantaCruz)./length(beakSizes)];
   figure
   boxplot(beakSizes, islands, 'widths', boxWidths)
   ylabel('Beak size in mm')
   title('Geospiza fortis from nearby islands in the Galápagos')

You should see the following variable in your Workspace Browser: boxWidths.

You should see a Figure Window with a labeled box plot.
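
The width calculation takes each data set's share of the total number of values and scales it by 0.75, so that even a box holding all of the data would stay narrower than the spacing between adjacent boxes. With made-up sizes of 30 and 10 values (nFirst and nSecond are illustrative names), the arithmetic works out as follows.

   nFirst = 30;                                        % Made-up size of the first data set
   nSecond = 10;                                       % Made-up size of the second data set
   fractions = [nFirst, nSecond]./(nFirst + nSecond)   % Shares of the total: [0.75 0.25]
   widths = 0.75*fractions                             % Box widths: [0.5625 0.1875]

The first box is drawn three times as wide as the second, matching the 3-to-1 ratio of the data set sizes.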


SUMMARY OF SYNTAX

MATLAB syntax Description
boxplot(X) creates a box plot of the values in the array X. Each column of X is treated as a distinct data set and gets its own box. The boxplot function has a large number of optional parameters. We used the following options:
  • a cell array of labels (EXAMPLE 2) to label the boxes. The cell array must have an element for each column.
  • a vector of labels (EXAMPLE 3) to sort the data vector into groups. The data and the labels vectors must be the same length. Each group is plotted as its own box.
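  • the 'notch', 'on' option (EXAMPLE 4) to create boxes that have V-shaped notches. The notches mark a comparison interval around the median: when the notches of two boxes do not overlap, the two medians differ at about the 5% significance level.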
  • the 'orientation', 'horizontal' (EXAMPLE 6) to draw the boxes horizontally.
  • the 'plotstyle', 'compact' (EXAMPLE 7) to draw the boxes using a compact drawing style. The compact option shows less detail but is useful for graphs with a lot of boxes.
  • the 'widths' option (EXAMPLE 12) to draw the boxes of different widths based on a specified property. In EXAMPLE 12 we made the widths proportional to the number of items in the respective data sets.
repmat(X, n, m) creates a new array by tiling the array X in a pattern with n rows and m columns.
repmat(X, size(A)) creates a new array by tiling the array X in a pattern whose size is the same size as the array A (i.e., the pattern has the same number of rows and columns as A does).


_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The photo was taken by Danielle Langlois in July 2005 and is available under public license at http://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg._