LESSON 14: Tests for normality

FOCUS QUESTION: How can I determine whether the data is normally distributed?

Many statistical tests assume the data was drawn or sampled at random from a particular distribution such as the normal distribution (bell curve). This lesson examines four ways to determine whether the data matches a particular distribution such as the normal distribution or the exponential distribution.

In this lesson you will:
  • Fit the data's histogram to a normal or other distribution.
  • Compare the data's distribution to known distributions using a quantile-quantile plot.
  • Create and interpret a normplot or other probability plot.
  • Apply the Lilliefors test for sample normality.
  • Learn the terminology of null and alternative hypotheses.
Photograph of Iris setosa


DATA FOR THIS LESSON

File Description
fisheriris
  • This is the famous Fisher iris data set. It consists of measurements of 50 flower samples from each of three species of flowers (150 samples in all): Iris setosa, Iris versicolor, and Iris virginica. The measurements are in cm.
  • Four features were measured for each sample:
    • The length of the flower sepal
    • The width of the flower sepal
    • The length of the flower petal
    • The width of the flower petal
  • All 150 samples from the Fisher iris data are stored in a single 150 x 4 numeric array called meas:
    • The four columns correspond to the four types of measurements: sepal length, sepal width, petal length and petal width, respectively.
    • The first 50 rows contain data for Iris setosa.
    • The second 50 rows contain data for Iris versicolor.
    • The third 50 rows contain data for Iris virginica.
  • The species information is kept in a separate cell array of strings called species, with one entry for each row of meas.
  • The data is sometimes referred to as Anderson's Iris data in honor of Edgar Anderson, the biologist who collected the data. See http://en.wikipedia.org/wiki/Iris_flower_data_set for additional information.

Note: This dataset ships with the MATLAB Statistics Toolbox, so you don't have to download it separately.
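
The layout described above can be checked directly. The following is a minimal sketch, assuming fisheriris is on your MATLAB path and that your release supports the 'stable' option of unique. The names speciesNames and rows are chosen for this illustration only.

```matlab
load fisheriris                              % Provides meas and species
fprintf('meas is %d x %d\n', size(meas, 1), size(meas, 2));

% List the species in the order in which they first appear in the rows
speciesNames = unique(species, 'stable');
for k = 1:numel(speciesNames)
    rows = find(strcmp(species, speciesNames{k}));   % Row indices for this species
    fprintf('%-12s rows %3d to %3d\n', speciesNames{k}, rows(1), rows(end));
end
```

The printed row ranges show which block of 50 rows belongs to which species, which matters later when the data is reshaped by species.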

NYCDiseases.mat
  • The data set contains the monthly totals of new cases of measles, mumps, and chicken pox in New York City for the years 1931-1971.
  • The data is organized into the following variables:
    • measles - an array containing the monthly cases of measles
    • mumps - an array containing the monthly cases of mumps
    • chickenPox - an array containing the monthly cases of chicken pox
    • years - a vector containing the years 1931 through 1971
  • The data was extracted from the Hipel-McLeod Time Series Datasets Collection, available at http://www.stats.uwo.ca/faculty/aim/epubs/mhsets/readme-mhsets.html.
  • The data originally appeared in: Yorke, J.A. and London, W.P. (1973). "Recurrent Outbreaks of Measles, Chickenpox and Mumps", American Journal of Epidemiology, Vol. 98, pp. 469.

SETUP FOR LESSON 14

SUGGESTED READING: Wikipedia has a reasonably readable discussion of the concept of the null hypothesis at <http://en.wikipedia.org/wiki/Null_hypothesis>.


EXAMPLE 1: Load Fisher iris data

Create a new cell in which you type and execute:

   load fisheriris;

You should see the following 2 variables in your Workspace Browser: meas and species.


EXAMPLE 2: Fit histogram of Iris setosa sepal lengths to a normal distribution

Create a new cell in which you type and execute:

   sLenSetosa = meas(strcmp(species, 'setosa'), 1);  % Sepal lengths of Iris setosa
   figure
   histfit(sLenSetosa)
   xlabel('Sepal length (cm)');
   ylabel('Number of specimens')
   title('Iris Setosa sepal length histogram compared to normal distribution');

You should see the following variable in your Workspace Browser: sLenSetosa.

You should also see a histogram overlaid with the curve of a normal distribution fitted to the data.
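
The curve that histfit draws is a normal density whose mean and standard deviation are estimated from the data. The following is a minimal sketch of those estimates for the sepal lengths, assuming the Statistics Toolbox function normfit is available. The names muHat and sigmaHat are chosen for this illustration.

```matlab
load fisheriris
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa
[muHat, sigmaHat] = normfit(sLenSetosa);           % Estimated mean and standard deviation
fprintf('Fitted normal: mean = %.3f, std = %.3f\n', muHat, sigmaHat);

% The estimates should agree with the ordinary sample statistics
fprintf('Sample values: mean = %.3f, std = %.3f\n', mean(sLenSetosa), std(sLenSetosa));
```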


Create a new cell right here (beginning of a cell starts with %%). Write MATLAB code to do a histogram fit to a normal distribution of the entire collection of measurements (rather than just the sepal lengths).


EXAMPLE 3: Quantile-quantile plots comparing sepal length and sepal width

Create a new cell in which you type and execute:

    sWidSetosa = meas(strcmp(species, 'setosa'), 2);  % Sepal widths of Iris setosa
    figure
    qqplot(sLenSetosa, sWidSetosa)
    xlabel('Sepal length quantiles')
    ylabel('Sepal width quantiles')
    title('Comparison of distributions for sepal length and sepal width for Iris setosa')

You should see the following variable in your Workspace Browser: sWidSetosa.

You should also see a quantile-quantile plot in a separate Figure Window.
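
When the two samples have the same number of points, a quantile-quantile plot amounts to plotting the sorted values of one sample against the sorted values of the other. The sketch below draws that by hand as a rough cross-check of the idea. It is not a description of how qqplot is implemented, and it omits the reference line that qqplot adds. The names isSetosa, sLenSorted, and sWidSorted are chosen for this illustration.

```matlab
load fisheriris
isSetosa = strcmp(species, 'setosa');
sLenSorted = sort(meas(isSetosa, 1));     % Sorted sepal lengths (50 values)
sWidSorted = sort(meas(isSetosa, 2));     % Sorted sepal widths (50 values)
figure
plot(sLenSorted, sWidSorted, '+')         % k-th smallest length vs k-th smallest width
xlabel('Sepal length quantiles')
ylabel('Sepal width quantiles')
title('Hand-made quantile-quantile plot for Iris setosa')
```

If the points fall close to a straight line, the two distributions have a similar shape, even when their means and spreads differ.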


EXAMPLE 4: Comparison of Iris setosa sepal length to normal distribution

Create a new cell in which you type and execute:

    stdNorm = normrnd(0, 1, [400, 1]);
    figure
    qqplot(stdNorm, sLenSetosa);
    title('Comparison of Iris setosa sepal lengths to a normal distribution');
    xlabel('Standard normal distribution quantiles')
    ylabel('Sepal length quantiles')

You should see the following variable in your Workspace Browser: stdNorm.

You should also see a quantile-quantile plot in a new Figure Window.
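
Because stdNorm is a random sample, the plot in Example 4 changes slightly on every run. The sketch below shows two variations. Seeding the random number generator makes the run repeatable, and calling qqplot with a single input compares the data to the theoretical quantiles of a normal distribution, so no random sample is needed. It assumes the rng function is available in your release. The seed value 0 is arbitrary.

```matlab
load fisheriris
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa

rng(0);                                   % Arbitrary seed so the random sample is repeatable
stdNorm = normrnd(0, 1, [400, 1]);
figure
qqplot(stdNorm, sLenSetosa);              % Data versus a random normal sample
title('Sepal lengths versus a random normal sample');

figure
qqplot(sLenSetosa);                       % Data versus theoretical normal quantiles
title('Sepal lengths versus theoretical normal quantiles');
```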


EXAMPLE 5: Normplot of the sepal lengths for Iris setosa

Create a new cell in which you type and execute:

   figure
   normplot(sLenSetosa)              % Probability plot for normal distribution
   xlabel('Sepal length (cm)');
   ylabel('Cumulative probability');
   title('Iris Setosa sepal length normplot');

You should see a labeled probability plot (normplot) in a new Figure Window.


EXAMPLE 6: Display the sepal length distributions of all three species using a normplot

Create a new cell in which you type and execute:

   irisSpecies = {'Setosa', 'Versicolor', 'Virginica'}; % Legend entries in the row order of meas
   sLens = reshape(meas(:, 1), 50, 3);        % Make sepal lengths 50 x 3
   figure
   normplot(sLens);                           % Each column is plotted separately
   xlabel('Sepal length (cm)');
   legend(irisSpecies, 'Location', 'SE');
   title('Normplot comparing three iris species')

You should see the following 2 variables in your Workspace Browser: irisSpecies and sLens.

You should also see a labeled normplot in a new Figure Window.
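
The reshape call in Example 6 relies on the rows of meas being grouped in blocks of 50, one block per species. The following sketch builds the same 50 x 3 array by selecting rows by species name, then checks that it matches the reshape result. The names speciesOrder, sLensByName, and sameAsReshape are chosen for this illustration.

```matlab
load fisheriris
speciesOrder = {'setosa', 'versicolor', 'virginica'};   % Order of the row blocks in meas
sLensByName = zeros(50, 3);
for k = 1:3
    % Column k holds the sepal lengths of the k-th species
    sLensByName(:, k) = meas(strcmp(species, speciesOrder{k}), 1);
end
sameAsReshape = isequal(sLensByName, reshape(meas(:, 1), 50, 3));
fprintf('Matches reshape result: %d\n', sameAsReshape);
```

Selecting by name is a little longer, but it keeps working if the rows are ever shuffled, and it makes the column order of the legend explicit.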


Create a new cell right here (beginning of a cell starts with %%). Write MATLAB code to create two figures. The first figure should contain the normplot of meas as a four-column array. The second figure should contain the normplot of the linear representation of meas. Compare the two figures.


EXAMPLE 7: Test for normality (the Lilliefors test)

Create a new cell in which you type and execute:

   fprintf('Testing Iris setosa sepal length: ');
   if lillietest(sLenSetosa) == 1
       fprintf(' likely to be non-normal\n');
   else
       fprintf(' no evidence of non-normality\n');
   end;

You should see the following output in your Command Window:

Testing Iris setosa sepal length:  no evidence of non-normality
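
lillietest can also return a p-value as a second output, which shows how close the decision was rather than just which way it went. The following is a minimal sketch. According to the MATLAB documentation, the p-value is read from a table that covers only a limited range, and lillietest issues a warning when the value falls outside that range.

```matlab
load fisheriris
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa
[h, p] = lillietest(sLenSetosa);     % h is the decision at the default 0.05 level
fprintf('h = %d, p = %.3f\n', h, p);
```

A value of p below the significance level corresponds to h being 1.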


EXAMPLE 8: Lilliefors test (standard statistical terminology)

Create a new cell in which you type and execute:

   fprintf('Null hypothesis: Iris setosa sepal length is normally distributed\n');
   fprintf('Alternative hypothesis: Iris setosa sepal length does not follow a normal distribution\n');
   if lillietest(sLenSetosa) == 1
       fprintf('\tReject null hypothesis in favor of alternative at 0.05 significance level\n');
   else
       fprintf('\tNo evidence for rejecting null hypothesis at the 0.05 significance level\n');
   end;

You should see the following output in your Command Window:

Null hypothesis: Iris setosa sepal length is normally distributed
Alternative hypothesis: Iris setosa sepal length does not follow a normal distribution
	No evidence for rejecting null hypothesis at the 0.05 significance level


EXAMPLE 9: Use the Lilliefors test at different significance levels

Create a new cell in which you type and execute:

   fprintf('\nTesting whether Iris setosa sepal lengths are normally distributed:\n');
   sigLevel = [0.1, 0.05, 0.01, 0.005];   % Test at different significance levels
   for k = 1:length(sigLevel)
       if lillietest(sLenSetosa, sigLevel(k)) == 1
           fprintf('  Evidence');
       else
           fprintf('  No evidence');
       end;
       fprintf(' of non-normality at %g significance level\n', sigLevel(k));
   end;

You should see the following 2 variables in your Workspace Browser: sigLevel and k.

You should also see the following output in the Command Window:

Testing whether Iris setosa sepal lengths are normally distributed:
  Evidence of non-normality at 0.1 significance level
  No evidence of non-normality at 0.05 significance level
  No evidence of non-normality at 0.01 significance level
  No evidence of non-normality at 0.005 significance level
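
The significance level is the rate at which the test rejects the null hypothesis when the null hypothesis is actually true. The simulation sketched below draws samples that really are normal and counts how often lillietest rejects at each level. The rejection rates should come out near the significance levels, though not exactly equal to them. The seed, the number of trials, and the sample size of 50 are arbitrary choices for this illustration. Warnings are switched off temporarily because lillietest may warn about p-values outside its table.

```matlab
rng(1);                                   % Arbitrary seed for repeatability
sigLevel = [0.1, 0.05, 0.01];             % Significance levels to try
nTrials = 400;                            % Number of simulated samples (arbitrary)
nReject = zeros(size(sigLevel));          % Rejection counts, one per level
warnState = warning('off', 'all');        % Save the warning settings and silence warnings
for t = 1:nTrials
    x = normrnd(0, 1, [50, 1]);           % A sample that really is normal
    for k = 1:length(sigLevel)
        nReject(k) = nReject(k) + lillietest(x, sigLevel(k));
    end
end
warning(warnState);                       % Restore the previous warning settings
for k = 1:length(sigLevel)
    fprintf('  Level %g: rejected %.1f%% of normal samples\n', ...
        sigLevel(k), 100 * nReject(k) / nTrials);
end
```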


EXAMPLE 10: Load the data about New York contagious diseases

Create a new cell in which you type and execute:

   load NYCDiseases.mat;    % Load the disease data

You should see 5 variables in the Workspace Browser. The measles, mumps, and chickenPox variables contain monthly totals of the new cases for the years 1931 through 1971.


EXAMPLE 11: Fit histogram of NYC measles cases/month to a normal distribution

Create a new cell in which you type and execute:

   figure
   histfit(measles(:)/1000);
   xlabel('Cases/month (in thousands)');
   ylabel('Number of months')
   title('Measles cases in NYC 1931-1971 (normal fit)')

You should see a histogram overlaid with the curve of a normal distribution fitted to the data.


EXAMPLE 12: Fit histogram of NYC measles cases/month to an exponential distribution

Create a new cell in which you type and execute:

   figure
   histfit(measles(:)/1000, [], 'exponential');
   xlabel('Cases/month (in thousands)');
   ylabel('Number of months')
   title('Measles cases in NYC 1931-1971 (exponential fit)')

You should see a histogram overlaid with the curve of an exponential distribution fitted to the data.


EXAMPLE 13: Use a probability plot for exponential distribution

Create a new cell in which you type and execute:

   figure
   probplot('exponential', measles(:)./1000);
   xlabel('Cases/month (in thousands)');
   ylabel('Cumulative probability')
   title('Measles cases in NYC 1931-1971 (exponential plot)')

You should see a Figure Window with a labeled probability plot:


EXAMPLE 14: Test whether monthly measles counts follow an exponential distribution using the Lilliefors test

Create a new cell in which you type and execute:

    fprintf(['\nLilliefors test of monthly measles cases ' ...
        'at the .05 significance level:\n']);
    if lillietest(measles(:), 0.05, 'exp') == 1
        fprintf('   not likely to be exponential\n');
    else
        fprintf('   could be exponential\n');
    end;

You should see the following output in the Command Window:

Lilliefors test of monthly measles cases at the .05 significance level:
   not likely to be exponential
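
To see how the two versions of the test behave on data whose distribution is known, you can apply them to synthetic data. The sketch below generates exponential data with exprnd and tests it against both null hypotheses, using the same positional call as Example 14. Newer MATLAB releases document these options as name-value arguments instead. The seed, the mean of 3, and the sample size of 500 are arbitrary choices for this illustration.

```matlab
rng(2);                                        % Arbitrary seed for repeatability
x = exprnd(3, [500, 1]);                       % Synthetic exponential data with mean 3
warnState = warning('off', 'all');             % Suppress out-of-table p-value warnings
hNormal = lillietest(x);                       % Null hypothesis: x is normal
hExp = lillietest(x, 0.05, 'exp');             % Null hypothesis: x is exponential
warning(warnState);                            % Restore the previous warning settings
fprintf('Reject normal: %d, reject exponential: %d\n', hNormal, hExp);
```

For data that truly is exponential, the test for normality should reject. The test for the exponential distribution will still reject in roughly 5% of such samples purely by chance.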


SUMMARY OF SYNTAX

MATLAB syntax Description
cdfplot(x) plots the (empirical) cumulative probability distribution for the data in the vector x.
Y = ceil(X) returns an array Y that is the same size as the array X. Each element of Y is the smallest integer that is greater than or equal to the corresponding element of X. For example, ceil(3.5) is 4, while ceil(-2) is -2.
Y = floor(X) returns an array Y that is the same size as the array X. Each element of Y is the largest integer that is less than or equal to the corresponding element of X. For example, floor(3.5) is 3, while floor(-2.3) is -3.
histfit(x) plots a histogram of the vector x and overplots a fitted normal distribution for comparison. The histfit function uses the square root of the number of data points in x for the number of bins.
histfit(x, [], 'exponential') plots a histogram of the vector x and overplots a fitted exponential distribution for comparison. The histfit function uses the square root of the number of data points in x for the number of bins.
h = lillietest(x) performs the Lilliefors test for normality on the vector x. If h is 1, we reject the null hypothesis that x is a sample from a normal distribution at the 0.05 significance level. In other words, if x really came from a normal distribution, a result this extreme would occur less than 5% of the time. On the other hand, if h is 0, we don't have any reason to believe that x is not normal, but we haven't proved that it is normal.
h = lillietest(x, sig) performs the Lilliefors test for normality on the vector x. If h is 1, we reject the null hypothesis that x is a sample from the normal distribution at the sig significance level. On the other hand if h is zero, we don't have any reason to believe that x is not normal, but we haven't proved that it is normal. Common values of sig are 0.1, 0.05, 0.01, and 0.001. The default significance level is 0.05.
h = lillietest(x, sig, 'exp') performs the Lilliefors test for the vector x. If h is 1, we reject the null hypothesis that x is a sample from the exponential distribution at the sig significance level. On the other hand if h is zero, we don't have any reason to believe that x is not exponential, but we haven't proved that it is exponential.
y = normcdf(x, theMean, theSTD) returns a vector of the values of the cumulative probability distribution for a normal distribution at the points in the vector x. The normal distribution has mean given by theMean and standard deviation given by theSTD.
normplot(X) plots the quantiles of the distributions in the columns of the array X against the quantiles of the normal distribution. If a result is linear, the distribution in the corresponding column of X has the characteristics of the normal distribution.
probplot('exponential', X) plots the quantiles of the distributions in the columns of the array X against the quantiles of the exponential distribution. If a result is linear, the distribution in the corresponding column of X has the characteristics of the exponential distribution.
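
A few of the simpler entries in the table can be checked at the command line. The following is a minimal sketch. The bin count shown for histfit assumes that the square root is rounded up, which is an assumption made for this illustration. The names nBins and p are also illustrative.

```matlab
disp(ceil([3.5, -2, -2.3]))      % Displays 4  -2  -2
disp(floor([3.5, -2, -2.3]))     % Displays 3  -2  -3

nBins = ceil(sqrt(50));          % Square root of 50 data points, rounded up
fprintf('Bins for 50 data points: %d\n', nBins);

p = normcdf(1.96, 0, 1);         % Standard normal probability below 1.96
fprintf('normcdf(1.96, 0, 1) = %.4f\n', p);
```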


_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 07-Mar-2011. Please contact krobbins@cs.utsa.edu with comments or suggestions. The photo of Iris setosa was taken in the Kiritappu Wetland in Hokkaido, Japan, in July 2006 by Miya.m. (See http://commons.wikimedia.org/wiki/File:Iris_setosa01.jpg.)_