LESSON 14 QUESTIONS: Tests for normality

FOCUS QUESTION: How can I determine whether the data is normally distributed?


EXAMPLE 1: Load Fisher iris data

   load fisheriris;


EXAMPLE 2: Fit histogram of Iris setosa sepal lengths to a normal distribution

   sLenSetosa = meas(strcmp(species, 'setosa'), 1);  % Sepal lengths of Iris setosa
   figure
   histfit(sLenSetosa)
   xlabel('Sepal length (cm)');
   ylabel('Number of specimens')
   title('Iris Setosa sepal length histogram compared to normal distribution');

Questions Answers
What does the histfit function do? The histfit function plots the histogram of the data and overplots a fitted normal distribution.
What is the purpose of overlaying a normal distribution curve on the data histogram? The overlay gives you a good way of ascertaining visually whether your data is likely to be sampled from a normal distribution.
Why would I care whether or not my data is normally distributed? Most statistical tests make assumptions about the way the data is distributed. For example, the 95% confidence intervals based on the standard error of the mean (SEM) assume the data is a random sample from a normal distribution.
How many bins does histfit use? By default the histfit function uses the square root of the number of data points for the number of bins.
Can I choose the number of bins that histfit uses? Yes, the histfit function takes an optional second argument specifying the number of bins.
Can I overlay other types of distributions? Yes, the histfit function takes a third argument specifying the distribution to fit. MATLAB supports a number of common distributions, such as the exponential distribution.
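The optional bin-count and distribution arguments described above can be sketched as follows. This is a minimal illustration, assuming the Statistics and Machine Learning Toolbox is available. The variable nBinsDefault, the choice of 10 bins, and the lognormal overlay are illustrative choices and are not part of the original lesson.

```matlab
load fisheriris;                                   % Provides meas and species
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa

% Default bin count: square root of the number of data points, rounded up
nBinsDefault = ceil(sqrt(numel(sLenSetosa)));
fprintf('Default number of bins for %d points: %d\n', ...
    numel(sLenSetosa), nBinsDefault);

figure
histfit(sLenSetosa, 10);                  % Second argument: use 10 bins
title('histfit with 10 bins (normal overlay)');

figure
histfit(sLenSetosa, [], 'lognormal');     % Third argument: overlay a lognormal fit
title('histfit with default bins (lognormal overlay)');
```

Passing [ ] as the second argument keeps the default bin count while still letting you choose the distribution.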


EXAMPLE 3: Quantile-quantile plots comparing sepal length and sepal width

    sWidSetosa = meas(strcmp(species, 'setosa'), 2);  % Sepal widths of Iris setosa
    figure
    qqplot(sLenSetosa, sWidSetosa)
    xlabel('Sepal length quantiles')
    ylabel('Sepal width quantiles')
    title('Comparison of distributions for sepal length and sepal width for Iris setosa')

Questions Answers
What does the qqplot(x, y) function do? The qqplot function plots the quantiles of vector x against the quantiles of vector y.
What are quantiles? Quantiles are the values that cut a distribution at equally spaced cumulative probabilities on the interval [0, 1]. When you divide the probability range into 4 equal pieces, you have 4-quantiles, which are also known as quartiles. If you divide the probability range into 100 equal pieces, you have 100-quantiles, or percentiles.
How does MATLAB determine how many quantiles to use? MATLAB uses the quantiles of the shorter of x or y.
Do x and y need to be the same length? No, as illustrated in EXAMPLE 4.
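What does the dotted line represent? The dotted line represents a line joining the first and third quartiles (i.e., the 25th and 75th percentiles) of the two samples. The line is extended beyond the quartiles to help you judge how close the points are to linear.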
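The relationship between quantiles, quartiles, and percentiles can be checked numerically. The following is a minimal sketch using the quantile and prctile functions. The variable names quartiles and sameQuartiles are illustrative and do not appear in the original lesson.

```matlab
load fisheriris;
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa

% Quartiles requested as cumulative probabilities
quartiles = quantile(sLenSetosa, [0.25, 0.50, 0.75]);

% The same cut points requested as percentages
sameQuartiles = prctile(sLenSetosa, [25, 50, 75]);

fprintf('Quartiles of setosa sepal length: %g  %g  %g\n', quartiles);
fprintf('Largest difference between the two calls: %g\n', ...
    max(abs(quartiles(:) - sameQuartiles(:))));
```

The middle quartile is the median, so quantile(x, 0.5) and median(x) should agree for this sample.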


EXAMPLE 4: Comparison of Iris setosa sepal length to normal distribution

    stdNorm = normrnd(0, 1, [400, 1]);
    figure
    qqplot(stdNorm, sLenSetosa);
    title('Comparison of Iris setosa sepal lengths to a normal distribution');
    xlabel('Standard normal distribution quantiles')
    ylabel('Sepal length quantiles')

Questions Answers
If the points of a quantile-quantile plot fall along a straight line, what can I conclude? You can conclude that the two vectors have similar distributions. This doesn't mean they have the same values, but rather that they are spread out in a similar fashion: the distributions have the same shape, although they may differ in center and scale.
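To see why a straight line indicates the same shape rather than the same values, consider a sample and a shifted, rescaled copy of it. The sketch below uses synthetic data. The variable shifted, the seed, and the constants 5 and 2 are arbitrary choices made for illustration.

```matlab
rng(0);                                  % Fix the seed so the sketch is repeatable
stdNorm = normrnd(0, 1, [400, 1]);       % Standard normal sample
shifted = 5 + 2*stdNorm;                 % Same shape, different center and scale

figure
qqplot(stdNorm, shifted);                % Points fall along a straight line
xlabel('Standard normal quantiles');
ylabel('Shifted and rescaled quantiles');
title('Samples with the same shape give a linear quantile-quantile plot');

% For equal-length samples, the sorted values serve as matched quantiles
R = corrcoef(sort(stdNorm), sort(shifted));
fprintf('Correlation of matched quantiles: %.6f\n', R(1, 2));
```

The values of the two vectors are quite different, yet the matched quantiles are perfectly correlated because one vector is a linear function of the other.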


EXAMPLE 5: Normplot of the sepal lengths for Iris setosa

   figure
   normplot(sLenSetosa)              % Probability plot for normal distribution
   xlabel('Sepal length (cm)');
   ylabel('Cumulative probability');
   title('Iris Setosa sepal length normplot');

Questions Answers
How do I interpret the graph resulting from a call to normplot(x)? The horizontal axis of the plot corresponds to the values in the vector x. The vertical axis corresponds to the quantiles of data in x. These quantiles are plotted on a special scale (normal probability scale) so that if the values are normally distributed, they fall along a straight line. (Notice the spacing of the tick marks. This technique is similar to that used for logarithmic scales.)
If a sample appears to be normally distributed, does that always imply that the underlying population is normally distributed? You can never be 100% certain about population inferences based on measurements of samples. However as the samples become larger, the estimates usually become more reliable, provided the samples are truly selected at random.
Why are the tick marks on the y-axis unevenly spaced? The y-axis uses unevenly spaced tick marks so that normally distributed data appears to fall on a straight line.
What do the straight lines on each plot represent? These lines are not the best fit lines for the data. Instead, each straight line connects the point representing the 25th percentile pair and the point representing the 75th percentile pair.
Why are portions of these straight lines dashed? The portions of each line lying outside the interquartile range (IQR) are dashed.
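The normal probability scale can be imitated by hand, which may make the uneven tick spacing easier to understand. The sketch below is not the normplot implementation. It assumes plotting positions of (i - 0.5)/n for the sorted data, and the names probs, zScores, and tickProbs are illustrative.

```matlab
load fisheriris;
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa

n = numel(sLenSetosa);
sortedLens = sort(sLenSetosa);
probs = ((1:n)' - 0.5) / n;        % Assumed cumulative probability of each point
zScores = norminv(probs);          % Position of each probability on a normal scale

figure
plot(sortedLens, zScores, '+');    % Close to a line if the data is normal

% Label the vertical axis with probabilities rather than z-scores
tickProbs = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99];
tickLabels = arrayfun(@num2str, tickProbs, 'UniformOutput', false);
set(gca, 'YTick', norminv(tickProbs), 'YTickLabel', tickLabels);
xlabel('Sepal length');
ylabel('Cumulative probability');
title('Hand-built normal probability plot');
```

The ticks for 0.01 and 0.05 are far apart while those for 0.25 and 0.50 are close together, because norminv stretches the tails of the probability range.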


EXAMPLE 6: Display the sepal length distributions of all three species using a normplot

   irisSpecies = {'Setosa', 'Versicolor', 'Virginica'}; % Same order as the rows of meas
   sLens = reshape(meas(:, 1), 50, 3);        % Make sepal lengths 50 x 3
   figure
   normplot(sLens);                           % Each column is plotted separately
   xlabel('Sepal length (cm)');
   legend(irisSpecies, 'Location', 'SE');
   title('Normplot comparing three iris species')

Questions Answers
How do I interpret the graph resulting from a call to normplot(X)? Each column of the array X corresponds to one curve on the plot. The horizontal axis of the plot corresponds to the values in X. The vertical axis corresponds to the quantiles. The tick marks are spaced so that normally distributed data falls along a straight line.
What can I conclude from this plot? All three graphs are close to linear, but they don't overlap. It is likely that the sepal lengths of all three species are normally distributed, but the normal distributions for the three species have different means and standard deviations.
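The differences among the three species can also be summarized numerically. The following sketch relies on the fisheriris data storing 50 consecutive rows per species. The names speciesOrder, mu, and sigma are illustrative.

```matlab
load fisheriris;
speciesOrder = unique(species, 'stable');   % Species in order of first appearance
sLens = reshape(meas(:, 1), 50, 3);         % One column of sepal lengths per species

mu = mean(sLens);                           % Column means
sigma = std(sLens);                         % Column standard deviations
for k = 1:3
    fprintf('%-12s mean = %.3f, std = %.3f\n', speciesOrder{k}, mu(k), sigma(k));
end
```

Listing speciesOrder is a useful check that the columns produced by reshape match the labels you intend to use in a legend.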


EXAMPLE 7: Test for normality (the Lilliefors test)

   fprintf('Testing Iris setosa sepal length: ');
   if lillietest(sLenSetosa) == 1
       fprintf(' likely to be non-normal\n');
   else
       fprintf(' no evidence of non-normality\n');
   end;
Testing Iris setosa sepal length:  no evidence of non-normality

Questions Answers
What is the purpose of lillietest? The MATLAB lillietest function provides a computational test of whether a sample is likely to represent a population that is not normal.
What does a lillietest return value of 1 mean? A lillietest return value of 1 means that the test rejects the hypothesis of normality at the chosen significance level (5% by default). This is evidence that the population corresponding to the sample is not normally distributed.
What does a lillietest return value of 0 mean? A lillietest return value of 0 indicates that there is no evidence of population non-normality. You can't necessarily conclude that the population is normal.
Can the lillietest test be stated in terms of hypothesis testing? Yes, the null hypothesis for lillietest is that the population corresponding to the sample is normally distributed. The alternative hypothesis is that the population corresponding to the sample is not normally distributed.
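Besides the 0 or 1 decision, lillietest can return additional outputs that show how the decision was reached. The sketch below requests the p-value, the test statistic, and the critical value. The output names h, p, kstat, and critval are just variable names chosen here.

```matlab
load fisheriris;
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa

% h is the decision, p the p-value, kstat the test statistic,
% and critval the cutoff that kstat must exceed to reject
[h, p, kstat, critval] = lillietest(sLenSetosa);

fprintf('Decision h = %d (1 means reject normality)\n', h);
fprintf('p-value = %.4f\n', p);
fprintf('Test statistic = %.4f, critical value = %.4f\n', kstat, critval);
```

The null hypothesis is rejected when the test statistic exceeds the critical value, which corresponds to a p-value below the significance level.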


EXAMPLE 8: Lilliefors test (standard statistical terminology)

   fprintf('Null hypothesis: Iris setosa sepal length is normally distributed\n');
   fprintf('Alternative hypothesis: Iris setosa sepal length does not follow a normal distribution\n');
   if lillietest(sLenSetosa) == 1
       fprintf('\tReject null hypothesis in favor of alternative at 0.05 significance level\n');
   else
       fprintf('\tNo evidence for rejecting null hypothesis at the 0.05 significance level\n');
   end;
Null hypothesis: Iris setosa sepal length is normally distributed
Alternative hypothesis: Iris setosa sepal length does not follow a normal distribution
	No evidence for rejecting null hypothesis at the 0.05 significance level

Questions Answers
What is the significance level for the lillietest test? The default significance level for lillietest is 5%, corresponding to an alpha of 0.05.
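What does the 5% significance level mean? If the underlying distribution really is normal, the test has only a 5% chance of rejecting the null hypothesis by mistake. Equivalently, a rejection means that a sample departing this far from normality would occur less than 5% of the time under a normal distribution.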
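One way to build intuition for the significance level is to apply the test repeatedly to samples that really are normal and count how often it rejects. The simulation below is a rough sketch. The seed, the number of trials, and the names nTrials, rejections, and rejectionRate are arbitrary choices made for illustration.

```matlab
rng(1);                          % Fix the seed so the sketch is repeatable
nTrials = 2000;                  % Number of simulated samples
n = 50;                          % Same size as the Iris setosa sample
rejections = false(nTrials, 1);

for k = 1:nTrials
    sample = normrnd(0, 1, [n, 1]);           % The null hypothesis is true here
    rejections(k) = lillietest(sample) == 1;  % Default 0.05 significance level
end

rejectionRate = mean(rejections);
fprintf('Rejected normality in %.1f%% of truly normal samples\n', ...
    100*rejectionRate);
```

The rejection rate should come out near 5%, since each rejection here is a false alarm.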


EXAMPLE 9: Use the Lilliefors test at different significance levels

   fprintf('\nTesting whether Iris setosa sepal lengths are normally distributed:\n');
   sigLevel = [0.1, 0.05, 0.01, 0.005];   % Test at different significance levels
   for k = 1:length(sigLevel)
       if lillietest(sLenSetosa, sigLevel(k)) == 1
           fprintf('  Evidence');
       else
           fprintf('  No evidence');
       end;
       fprintf(' of non-normality at %g significance level\n', sigLevel(k));
   end;
Testing whether Iris setosa sepal lengths are normally distributed:
  Evidence of non-normality at 0.1 significance level
  No evidence of non-normality at 0.05 significance level
  No evidence of non-normality at 0.01 significance level
  No evidence of non-normality at 0.005 significance level

Questions Answers
What does a sigLevel of 0.01 mean? The second argument of lillietest specifies the significance level of the test for normality. At this level, a return value of 1 from lillietest indicates evidence that the distribution is not normal. More specifically, the result indicates there is less than a 1% probability of observing a sample that departs this far from normality if the population is normally distributed.
Why wasn't sigLevel called alpha? You can use any variable you want as the argument of lillietest corresponding to alpha, provided that its value is a valid significance level between 0 and 1 (the built-in table of critical values covers levels from 0.001 to 0.50). Functions map the arguments to internal values by the argument position in the call, not by the name of the argument.
What is alpha for this example? The second argument of the lillietest function represents the alpha or level of significance. Any value or variable passed as the second argument will have the role of alpha for the test.
What does a lower alpha mean? A lower level of significance (alpha) specifies a more strict test for statistical significance. In other words, don't reject the null hypothesis without stronger evidence of non-normality.
Is a lower value of alpha more conservative? For most hypothesis testing, a lower value of alpha means a more conservative conclusion. However, here the outcome you hope for is no evidence of non-normality. A more conservative approach therefore uses a higher value of alpha, which makes departures from normality easier to detect.
What levels of significance are typically used? In biology, an alpha of 0.05 (5% significance level) seems to be the most common. Sometimes an alpha of 0.1 (10% significance level) is used, particularly when indirect measurements of quantities are made. Occasionally, an alpha of 0.01 (1% significance level) is used. Because of the great variability across biological specimens, lower values of alpha such as 0.001 (0.1% significance level) are rarely used.
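The effect of alpha can also be seen by comparing a single p-value against several significance levels. The sketch below assumes the positional call lillietest(x, alpha) used in this lesson. The variable decisions is illustrative.

```matlab
load fisheriris;
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa

sigLevel = [0.1, 0.05, 0.01, 0.005];     % Listed from least to most strict
[~, p] = lillietest(sLenSetosa);         % The p-value does not depend on alpha
decisions = p < sigLevel;                % Reject wherever p falls below alpha

for k = 1:length(sigLevel)
    hk = lillietest(sLenSetosa, sigLevel(k));
    fprintf('alpha = %g: p < alpha is %d, lillietest returns %d\n', ...
        sigLevel(k), decisions(k), hk);
end
```

Because the same p-value is compared with each alpha, lowering alpha can turn a rejection into a non-rejection but never the reverse.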


EXAMPLE 10: Load the data about New York contagious diseases

   load NYCDiseases.mat;    % Load the disease data


EXAMPLE 11: Fit histogram of NYC measles cases/month to a normal distribution

   figure
   histfit(measles(:)/1000);
   xlabel('Cases/month (in thousands)');
   ylabel('Number of months')
   title('Measles cases in NYC 1931-1971 (normal fit)')

Questions Answers
Why does the overlay appear not to follow the data? By default, histfit overlays a normal distribution. This poor correspondence indicates that the measles data is not normally distributed. In fact it looks more like an exponential distribution.
Could I overlay other distributions? Yes, histfit allows you to specify other distributions. You can always overlay any distribution if you can calculate its histogram and fit its parameters to your data.


EXAMPLE 12: Fit histogram of NYC measles cases/month to an exponential distribution

   figure
   histfit(measles(:)/1000, [], 'exponential');
   xlabel('Cases/month (in thousands)');
   ylabel('Number of months')
   title('Measles cases in NYC 1931-1971 (exponential fit)')

Questions Answers
What does the 'exponential' argument of the histfit function do? The third argument of histfit specifies the distribution to overlay, in this case an exponential distribution.
How can I fit an exponential distribution to my data? The exponential distribution is completely specified by its mean value. The simplest way to fit an exponential distribution is simply to estimate the population mean (e.g., find the sample mean).
What does the [ ] argument mean? The second argument of histfit specifies the number of bins to use for the histogram. The [ ] designates the default value, square root of the number of data points.
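The idea that an exponential fit amounts to estimating the mean can be checked with the expfit function. Because the measles data file is not assumed to be available here, this sketch uses synthetic data. The variable waitTimes, the seed, and the true mean of 3 are invented for illustration.

```matlab
rng(2);                                  % Fix the seed so the sketch is repeatable
waitTimes = exprnd(3, [500, 1]);         % Synthetic exponential data with mean 3

muHat = expfit(waitTimes);               % Estimate of the mean parameter
fprintf('expfit estimate = %.4f, sample mean = %.4f\n', muHat, mean(waitTimes));

figure
histfit(waitTimes, [], 'exponential');   % Default bins, exponential overlay
xlabel('Synthetic value');
ylabel('Count');
title('Exponential fit to synthetic data');
```

The estimate returned by expfit matches the sample mean, which is why finding the sample mean is enough to fit this distribution.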


EXAMPLE 13: Use a probability plot for exponential distribution

   figure
   probplot('exponential', measles(:)./1000);
   xlabel('Cases/month (in thousands)');
   ylabel('Cumulative probability')
   title('Measles cases in NYC 1931-1971 (exponential plot)')

Questions Answers
What is a probplot? The probplot is a generalization of the normplot. In this example, the spacing of the y-axis tick marks is chosen so that if the data follows an exponential distribution, the graph appears to be linear.
What can I conclude about the distribution of measles cases from this graph? The measles cases have some characteristics of the exponential distribution. However, the distribution appears to have a much longer tail (very large values are more probable) than the exponential distribution. The long tail shows up as data points that deviate to the right of the reference line.
What do the two rightmost data points on the graph correspond to? These two points correspond to the March and April monthly measles counts of the 1941 epidemic.


EXAMPLE 14: Test whether monthly measles counts follow an exponential distribution using the Lilliefors test

    fprintf(['\nLilliefors test of monthly measles cases ' ...
        'at the .05 significance level:\n']);
    if lillietest(measles(:), 0.05, 'exp') == 1
        fprintf('   not likely to be exponential\n');
    else
        fprintf('   could be exponential\n');
    end;
Lilliefors test of monthly measles cases at the .05 significance level:
   not likely to be exponential

Questions Answers
What is the lillietest function testing for in this case? The third argument specifies the distribution type to test for. This example tests for an exponential distribution.
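To see how the exponential version of the test behaves, it can be applied to synthetic samples whose distributions are known. The sketch below reuses the positional call from this example. The samples expSample and bellSample, the seed, and their parameters are invented for illustration.

```matlab
rng(3);                                    % Fix the seed so the sketch is repeatable
expSample = exprnd(2, [200, 1]);           % Drawn from an exponential distribution
bellSample = normrnd(10, 1, [200, 1]);     % Bell shaped, clearly not exponential

hExp = lillietest(expSample, 0.05, 'exp');
hBell = lillietest(bellSample, 0.05, 'exp');

fprintf('Exponential sample: lillietest returns %d\n', hExp);
fprintf('Bell-shaped sample: lillietest returns %d\n', hBell);
```

The bell-shaped sample should be rejected. The exponential sample will usually not be rejected, although about 5% of truly exponential samples are rejected by chance at this significance level.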


_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 07-Mar-2011. Please contact krobbins@cs.utsa.edu with comments or suggestions. The photo is of Iris setosa in the Kiritappu Wetland at Hokkaido, Japan, taken in July 2006 by Miya.m. (See http://commons.wikimedia.org/wiki/File:Iris_setosa01.jpg.)_