LESSON 14 QUESTIONS: Tests for normality
FOCUS QUESTION: How can I determine whether the data is normally distributed?
Contents
- EXAMPLE 1: Load Fisher iris data
- EXAMPLE 2: Fit histogram of Iris setosa sepal lengths to a normal distribution
- EXAMPLE 3: Quantile-quantile plots comparing sepal length and sepal width
- EXAMPLE 4: Comparison of Iris setosa sepal length to normal distribution
- EXAMPLE 5: Normplot of the sepal lengths for Iris setosa
- EXAMPLE 6: Display the sepal length distributions of all three species using a normplot
- EXAMPLE 7: Test for normality (the Lilliefors test)
- EXAMPLE 8: Lilliefors test (standard statistical terminology)
- EXAMPLE 9: Use the Lilliefors test at different significance levels
- EXAMPLE 10: Load the data about New York contagious diseases
- EXAMPLE 11: Fit histogram of NYC measles cases/month to a normal distribution
- EXAMPLE 12: Fit histogram of NYC measles cases/month to an exponential distribution
- EXAMPLE 13: Use a probability plot for exponential distribution
- EXAMPLE 14: Test monthly measles counts for an exponential distribution using the Lilliefors test
EXAMPLE 1: Load Fisher iris data
load fisheriris;
EXAMPLE 2: Fit histogram of Iris setosa sepal lengths to a normal distribution
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Sepal lengths of Iris setosa
figure
histfit(sLenSetosa)
xlabel('Sepal length (cm)');
ylabel('Number of specimens')
title('Iris Setosa sepal length histogram compared to normal distribution');
| Questions | Answers |
| What does the histfit function do? | The histfit function plots the histogram of the data and overplots a fitted normal distribution. |
| What is the purpose of overlaying a normal distribution curve on the data histogram? | The overlay gives you a good way of ascertaining visually whether your data is likely to be sampled from a normal distribution. |
| Why would I care whether or not my data is normally distributed? | Most statistical tests make assumptions about the way the data is distributed. For example the 95% confidence intervals based on SEM assume the data is a random sample from a normal distribution. |
| How many bins does histfit use? | By default the histfit function uses the square root of the number of data points for the number of bins. |
| Can I choose the number of bins that histfit uses? | Yes, the histfit function takes an optional second argument specifying the number of bins. |
| Can I overlay other types of distributions? | Yes, the histfit function takes a third argument specifying the distribution to fit. MATLAB supports a number of common distributions, such as the exponential distribution. |
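The default bin count and the parameters of the fitted normal curve can be reproduced by hand. The following is a minimal sketch, assuming that histfit rounds the square root up and fits the curve from the sample mean and standard deviation. The vector x holds a few illustrative values rather than the full data set.

```matlab
% Illustrative sample (assumed values, not the full iris data set)
x = [5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9];

% Default number of bins: square root of the number of points,
% rounded up here (the rounding direction is an assumption)
nBins = ceil(sqrt(numel(x)));

% Parameters of the normal curve that is overlaid on the histogram
mu = mean(x);        % estimated mean
sigma = std(x);      % estimated standard deviation

fprintf('bins = %d, mean = %.2f, std = %.3f\n', nBins, mu, sigma);
```

With the Statistics Toolbox available, histfit(x, nBins) and histfit(x, nBins, 'exponential') use the second and third arguments described in the table.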
EXAMPLE 3: Quantile-quantile plots comparing sepal length and sepal width
sWidSetosa = meas(strcmp(species, 'setosa'), 2);   % Sepal widths of Iris setosa
figure
qqplot(sLenSetosa, sWidSetosa)
xlabel('Sepal length quantiles')
ylabel('Sepal width quantiles')
title('Comparison of distributions for sepal length and sepal width for Iris setosa')
| Questions | Answers |
| What does the qqplot(x, y) function do? | The qqplot function plots the quantiles of vector x against the quantiles of vector y. |
| What are quantiles? | Quantiles are the values that divide a distribution at equally spaced cumulative probabilities on the interval [0, 1]. When you divide the probability range into 4 equal pieces, you have 4-quantiles, which are also known as quartiles. If you divide the probability range into 100 equal pieces, you have 100-quantiles, or percentiles. |
| How does MATLAB determine how many quantiles to use? | MATLAB uses the quantiles of the shorter of x or y. |
| Do x and y need to be the same length? | No, as illustrated in EXAMPLE 4. |
| What does the dotted line represent? | The dotted line represents a line joining the first and third quartiles (i.e., the 25th and 75th percentiles). |
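Quartiles can be computed by hand from sorted data. This is a minimal sketch that assigns the cumulative probability (i - 0.5)/n to the i-th sorted value and interpolates between the points. That convention is an assumption here. Other conventions exist, and the values in x are illustrative.

```matlab
% Illustrative data (assumed values)
x = [8 2 6 4];

xs = sort(x);                        % order the data
n = numel(xs);
p = ((1:n) - 0.5) / n;               % cumulative probability assigned to each sorted value

% Quartiles: interpolate the sorted data at probabilities 0.25, 0.5, 0.75
q = [0.25 0.5 0.75];
quartiles = interp1(p, xs, q);

disp(quartiles);
```

A quantile-quantile plot pairs up values obtained this way from two samples and plots one set against the other.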
EXAMPLE 4: Comparison of Iris setosa sepal length to normal distribution
stdNorm = normrnd(0, 1, [400, 1]);
figure
qqplot(stdNorm, sLenSetosa);
title('Comparison of Iris setosa sepal lengths to a normal distribution');
xlabel('Standard normal distribution quantiles')
ylabel('Sepal length quantiles')
| Questions | Answers |
| If the points of a quantile-quantile plot fall along a straight line, what can I conclude? | You can conclude that the two vectors have distributions of similar shape. This doesn't mean they have the same values, but rather that they are spread out in a similar fashion. |
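The idea that a straight line means "same shape, different values" can be checked numerically. This sketch uses illustrative values and builds a second sample by shifting and scaling the first, so the two have the same shape by construction. It then confirms that their matching quantiles lie on a line.

```matlab
% Illustrative sample (assumed values)
x = [4.4 5.1 4.9 5.4 4.6 5.0 4.7];

% A second sample with the same shape but a different center and spread
y = 10 + 3 * x;

% With equal sample sizes, the sorted values are the matching quantiles
qx = sort(x);
qy = sort(y);

% Fit a straight line qy = slope*qx + intercept and measure the misfit
coeffs = polyfit(qx, qy, 1);
maxMisfit = max(abs(polyval(coeffs, qx) - qy));

fprintf('slope = %.2f, intercept = %.2f, max misfit = %.2g\n', ...
        coeffs(1), coeffs(2), maxMisfit);
```

The slope and intercept recover the scale and shift, which is why a straight quantile-quantile plot does not require the two samples to have the same mean or spread.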
EXAMPLE 5: Normplot of the sepal lengths for Iris setosa
figure
normplot(sLenSetosa)                 % Probability plot for normal distribution
xlabel('Sepal length (cm)');
ylabel('Cumulative probability');
title('Iris Setosa sepal length normplot');
| Questions | Answers |
| How do I interpret the graph resulting from a call to normplot(x)? | The horizontal axis of the plot corresponds to the values in the vector x. The vertical axis corresponds to the quantiles of data in x. These quantiles are plotted on a special scale (normal probability scale) so that if the values are normally distributed, they fall along a straight line. (Notice the spacing of the tick marks. This technique is similar to that used for logarithmic scales.) |
| If a sample appears to be normally distributed, does that always imply that the underlying population is normally distributed? | You can never be 100% certain about population inferences based on measurements of samples. However as the samples become larger, the estimates usually become more reliable, provided the samples are truly selected at random. |
| Why are the tick marks on the y-axis unevenly spaced? | The y-axis uses unevenly spaced tick marks so that normally distributed data appears to fall on a straight line. |
| What do the straight lines on each plot represent? | These lines are not the best fit lines for the data. Instead, each straight line connects the point representing the 25th percentile pair and the point representing the 75th percentile pair. |
| Why are portions of these straight lines dashed? | The portions of each line lying outside the interquartile range (IQR) are dashed. |
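The uneven tick spacing comes from converting each cumulative probability to the corresponding standard normal quantile. The following sketch shows that conversion using erfcinv from base MATLAB. The list of probabilities is an illustrative choice, not necessarily the set of ticks normplot draws.

```matlab
% Cumulative probabilities at which tick marks might be drawn (assumed values)
p = [0.01 0.05 0.25 0.50 0.75 0.95 0.99];

% Standard normal quantile for each probability
% (the inverse normal CDF, written with erfcinv from base MATLAB)
z = -sqrt(2) * erfcinv(2 * p);

% Positions are symmetric about 0 and stretch out toward the tails
disp([p' z']);
```

Plotting sorted data against these z positions is what makes a normally distributed sample appear as a straight line.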
EXAMPLE 6: Display the sepal length distributions of all three species using a normplot
irisSpecies = {'Setosa', 'Versicolor', 'Virginica'};   % Legend in data order (rows 1-50, 51-100, 101-150)
sLens = reshape(meas(:, 1), 50, 3); % Make sepal lengths 50 x 3
figure
normplot(sLens); % Each column is plotted separately
xlabel('Sepal length (cm)');
legend(irisSpecies, 'Location', 'SE');
title('Normplot comparing three iris species')
| Questions | Answers |
| How do I interpret the graph resulting from a call to normplot(X)? | Each column of the array X corresponds to one curve on the plot. The horizontal axis of the plot corresponds to the values in X. The vertical axis corresponds to the quantiles. The tick marks are spaced so that normally distributed data falls along a straight line. |
| What can I conclude from this plot? | All three graphs are close to linear, but they don't overlap. It is likely that the sepal lengths from all three species are normally distributed, but that the respective normal distributions for these species are different. |
EXAMPLE 7: Test for normality (the Lilliefors test)
fprintf('Testing Iris setosa sepal length: ');
if lillietest(sLenSetosa) == 1
    fprintf(' likely to be non-normal\n');
else
    fprintf(' no evidence of non-normality\n');
end;
Testing Iris setosa sepal length: no evidence of non-normality
| Questions | Answers |
| What is the purpose of lillietest? | The MATLAB lillietest function provides a computational test of whether a sample is likely to represent a population that is not normal. |
| What does a lillietest return value of 1 mean? | A lillietest return value of 1 is a strong indication that the population corresponding to the sample is not normally distributed. |
| What does a lillietest return value of 0 mean? | A lillietest return value of 0 indicates that there is no evidence of population non-normality. You can't necessarily conclude that the population is normal. |
| Can the lillietest test be stated in terms of hypothesis testing? | Yes, the null hypothesis for lillietest is that the population corresponding to the sample is normally distributed. The alternative hypothesis is that the population corresponding to the sample is not normally distributed. |
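The quantity behind the Lilliefors test is the largest gap between the sample's empirical cumulative distribution and a normal distribution whose mean and standard deviation are estimated from the same sample. The following is a minimal sketch of that statistic in base MATLAB, using illustrative values. It is not a reproduction of lillietest's internal code, and the step that turns the statistic into a decision (comparison against critical values) is omitted.

```matlab
% Illustrative sample (assumed values)
x = [5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9];

n = numel(x);
xs = sort(x(:));                          % sorted column vector

% Standardize using the mean and standard deviation estimated from the sample
z = (xs - mean(xs)) / std(xs);

% Normal CDF at each standardized value (base MATLAB, no toolbox needed)
F = 0.5 * erfc(-z / sqrt(2));

% Largest gap between the empirical CDF (a step function) and the normal CDF
gapAbove = max((1:n)' / n - F);           % just after each step
gapBelow = max(F - (0:n-1)' / n);         % just before each step
D = max(gapAbove, gapBelow);

fprintf('Lilliefors statistic D = %.4f\n', D);
```

A larger D means the sample departs further from normality. The lillietest function can also return the test statistic and a p-value as additional outputs, for example [h, p, kstat] = lillietest(x).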
EXAMPLE 8: Lilliefors test (standard statistical terminology)
fprintf('Null hypothesis: Iris setosa sepal length is normally distributed\n');
fprintf('Alternative hypothesis: Iris setosa sepal length does not follow a normal distribution\n');
if lillietest(sLenSetosa) == 1
    fprintf('\tReject null hypothesis in favor of alternative at 0.05 significance level\n');
else
    fprintf('\tNo evidence for rejecting null hypothesis at the 0.05 significance level\n');
end;
Null hypothesis: Iris setosa sepal length is normally distributed
Alternative hypothesis: Iris setosa sepal length does not follow a normal distribution
    No evidence for rejecting null hypothesis at the 0.05 significance level
| Questions | Answers |
| What is the significance level for the lillietest test? | The default significance level for lillietest is 5%, corresponding to an alpha of 0.05. |
| What does the 5% significance level mean? | At this significance level, the test rejects normality only when a sample this far from normal would occur less than 5% of the time if the underlying distribution were normally distributed. |
EXAMPLE 9: Use the Lilliefors test at different significance levels
fprintf('\nTesting whether Iris setosa sepal lengths are normally distributed:\n');
sigLevel = [0.1, 0.05, 0.01, 0.005];   % Test at different significance levels
for k = 1:length(sigLevel)
    if lillietest(sLenSetosa, sigLevel(k)) == 1
        fprintf('  Evidence');
    else
        fprintf('  No evidence');
    end;
    fprintf(' of non-normality at %g significance level\n', sigLevel(k));
end;
Testing whether Iris setosa sepal lengths are normally distributed:
  Evidence of non-normality at 0.1 significance level
  No evidence of non-normality at 0.05 significance level
  No evidence of non-normality at 0.01 significance level
  No evidence of non-normality at 0.005 significance level
| Questions | Answers |
| What does a sigLevel of 0.01 mean? | The second argument of lillietest specifies the level of statistical significance of the test for normality. At a sigLevel of 0.01, a return value of 1 from lillietest indicates the distribution is not normal. More specifically, the result indicates there is less than a 1% probability of observing such a sample from a normally distributed population. |
| Why wasn't sigLevel called alpha? | You can use any variable you want as the argument of lillietest corresponding to alpha, provided that its value is between 0 and 1. Functions map the arguments to internal values by the argument position in the call, not by the name of the argument. |
| What is alpha for this example? | The second argument of the lillietest function represents the alpha or level of significance. Any value or variable passed as the second argument will have the role of alpha for the test. |
| What does a lower alpha mean? | A lower level of significance (alpha) specifies a stricter test for statistical significance. In other words, the test doesn't reject the null hypothesis without stronger evidence of non-normality. |
| Is a lower value of alpha more conservative? | For most hypothesis testing, a lower value of alpha means a more conservative conclusion. However, here you do not want to find evidence of non-normality. A more conservative approach uses a higher value of alpha. |
| What levels of significance are typically used? | In biology, an alpha of 0.05 (5% significance level) seems to be the most common. Sometimes an alpha of 0.1 (10% significance level) is used, particularly when indirect measurements of quantities are made. Occasionally, an alpha of 0.01 (1% significance level) is used. Because of the great variability across biological specimens, lower values of alpha such as 0.001 (0.1% significance level) are rarely used. |
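The pattern in the EXAMPLE 9 output follows from comparing a single p-value against each alpha. The sketch below uses a hypothetical p-value of 0.07, chosen only because it lies between 0.05 and 0.1 as that output implies. The actual p-value for the sepal lengths is not given in this lesson, although lillietest can return one as a second output, as in [h, p] = lillietest(x).

```matlab
% Hypothetical p-value, chosen only to be consistent with the EXAMPLE 9 output
p = 0.07;

sigLevel = [0.1, 0.05, 0.01, 0.005];     % Same levels as in the example

% The null hypothesis is rejected when p falls below alpha
reject = p < sigLevel;

for k = 1:length(sigLevel)
    if reject(k)
        fprintf('  Evidence');
    else
        fprintf('  No evidence');
    end
    fprintf(' of non-normality at %g significance level\n', sigLevel(k));
end
```

Lowering alpha can only turn a rejection into a non-rejection, never the reverse, which is why the "Evidence" lines always come first when the levels are listed in decreasing order.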
EXAMPLE 10: Load the data about New York contagious diseases
load NYCDiseases.mat; % Load the disease data
EXAMPLE 11: Fit histogram of NYC measles cases/month to a normal distribution
figure
histfit(measles(:)/1000);
xlabel('Cases/month (in thousands)');
ylabel('Number of months')
title('Measles cases in NYC 1931-1971(normal fit)')
| Questions | Answers |
| Why does the overlay appear not to follow the data? | By default, histfit overlays a normal distribution. This poor correspondence indicates that the measles data is not normally distributed. In fact it looks more like an exponential distribution. |
| Could I overlay other distributions? | Yes, histfit allows you to specify other distributions. You can always overlay any distribution if you can calculate its histogram and fit its parameters to your data. |
EXAMPLE 12: Fit histogram of NYC measles cases/month to an exponential distribution
figure
histfit(measles(:)/1000, [], 'exponential');
xlabel('Cases/month (in thousands)');
ylabel('Number of months')
title('Measles cases in NYC 1931-1971(exponential fit)')
| Questions | Answers |
| What does the 'exponential' argument of the histfit function do? | The third argument of histfit specifies the distribution to overlay, in this case an exponential distribution. |
| How can I fit an exponential distribution to my data? | The exponential distribution is completely specified by its mean value. The simplest way to fit an exponential distribution is simply to estimate the population mean (e.g., find the sample mean). |
| What does the [ ] argument mean? | The second argument of histfit specifies the number of bins to use for the histogram. The [ ] designates the default value, the square root of the number of data points. |
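Fitting an exponential distribution by its sample mean takes only a few lines. The following is a minimal sketch with illustrative values, not the NYC measles data. It estimates the mean and evaluates the fitted density exp(-t/mu)/mu on a grid.

```matlab
% Illustrative monthly counts in thousands (assumed values, not the NYC data)
x = [0.2 0.5 1.1 0.3 2.4 0.9 4.0 0.6];

% The fitted exponential distribution has mean equal to the sample mean
mu = mean(x);

% Density of the fitted distribution on a grid of values
t = linspace(0, max(x), 50);
fitPdf = exp(-t / mu) / mu;

fprintf('Estimated mean = %.3f\n', mu);
```

The density is largest at zero and decays steadily, which is the shape the exponential overlay takes in the histogram.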
EXAMPLE 13: Use a probability plot for exponential distribution
figure
probplot('exponential', measles(:)./1000);
xlabel('Cases/month (in thousands)');
ylabel('Cumulative probability')
title('Measles cases in NYC 1931-1971(exponential plot)')
| Questions | Answers |
| What is a probplot? | The probplot is a generalization of the normplot. In this example, the spacing of the y-axis tick marks is chosen so that if the data follows an exponential distribution, the graph appears to be linear. |
| What can I conclude about the distribution of measles cases from this graph? | The measles cases have some characteristics of the exponential distribution. However, the distribution appears to have a much longer tail (very large values are more probable) than the exponential distribution. The long tail shows up as data points that deviate to the right of the reference line. |
| What do the two rightmost data points on the graph correspond to? | These two points correspond to the March and April monthly measles counts of the 1941 epidemic. |
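The straight-line behavior on an exponential probability scale can be sketched without the toolbox. The example below assumes the vertical position of a point with cumulative probability p is -log(1 - p), which is the standard transformation for an exponential distribution. Whether probplot uses exactly these plotting positions is an assumption. The "sample" consists of exact exponential quantiles for an assumed mean, so it is ideal data rather than real data.

```matlab
mu = 2;                                  % Assumed mean of the exponential distribution
n = 20;
p = ((1:n) - 0.5) / n;                   % Cumulative probabilities of the plotted points

% Ideal exponential sample: the exact quantiles at those probabilities
x = -mu * log(1 - p);

% Exponential probability scale: transformed vertical position of each point
y = -log(1 - p);

% On this scale the points fall on a straight line through the origin
slope = x(:) \ y(:);                     % least-squares slope of y against x
maxMisfit = max(abs(y - slope * x));

fprintf('slope = %.3f (1/mu = %.3f), max misfit = %.2g\n', slope, 1/mu, maxMisfit);
```

Data with a heavier tail than the exponential distribution would have its largest values fall to the right of this line, as the measles counts do.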
EXAMPLE 14: Test monthly measles counts for an exponential distribution using the Lilliefors test
fprintf(['\nLilliefors test of monthly measles cases ' ...
         'at the .05 significance level:\n']);
if lillietest(measles(:), 0.05, 'exp') == 1
    fprintf('   not likely to be exponential\n');
else
    fprintf('   could be exponential\n');
end;
Lilliefors test of monthly measles cases at the .05 significance level:
   not likely to be exponential
| Questions | Answers |
| What is the lillietest function testing for in this case? | The third argument specifies the distribution type to test for. This example tests for an exponential distribution. |
_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 07-Mar-2011. Please contact krobbins@cs.utsa.edu with comments or suggestions. The photo is of Iris setosa in the Kiritappu Wetland at Hokkaido, Japan, taken July 2006 by Miya.m. (See http://commons.wikimedia.org/wiki/File:Iris_setosa01.jpg.)_