LESSON 15: Hypothesis testing
FOCUS QUESTION: How can I tell whether the experimental group is different from the control group?
Contents
- DATA FOR THIS LESSON
- SETUP FOR LESSON 15
- EXAMPLE 1: Load the Fisher iris data and extract sepal lengths of Iris setosa
- EXAMPLE 2: Test Iris setosa mean sepal length using a one sample t-test
- EXAMPLE 3: Apply one sample t-test (standard statistical terminology)
- EXAMPLE 4: Test for inequality using the one sample t-test
- EXAMPLE 5: Look at the p-value and the confidence interval
- EXAMPLE 6: Load the Daphne Island and Santa Cruz Island beak size data
- EXAMPLE 7: Are Daphne finch beak sizes different from those of Santa Cruz finches?
- EXAMPLE 8: Look at the p-value and confidence interval for the two-sample t-test
- EXAMPLE 9: Load the consolidated sleep diary data
- EXAMPLE 10: Calculate wake-up hours, separated by gender
- EXAMPLE 11: Do men get up earlier than women on average?
- EXAMPLE 12: Compare average wake-up time of instructor to that of a random student
- EXAMPLE 13: Output IDs of the subjects whose average wake-up is similar to the instructor's
- SUMMARY OF SYNTAX
DATA FOR THIS LESSON
| File | Description |
| fisheriris | Note: This dataset comes with the MATLAB distribution so you don't have to download it separately. |
| DaphneIslandBeaks.txt, SantaCruzIslandBeaks.txt | See http://en.wikipedia.org/wiki/Peter_and_Rosemary_Grant for additional information on the work of Peter and Rosemary Grant. |
| diaries.mat | |
SETUP FOR LESSON 15
- Set the Current Directory to Z:\working\MATLAB\Lesson15. (You will need to make a new directory for Lesson15.)
- Download the data files DaphneIslandBeaks.txt and SantaCruzIslandBeaks.txt to your Lesson15 directory.
- Download diaries.mat from Blackboard or copy it from Lesson 8.
- Create a new script called Lesson15Script.m. (Use File->New->Blank M-File from the main MATLAB menubar.) You will enter each of the examples in a new cell in this script.
SUGGESTED READING: Wikipedia has a discussion of hypothesis testing found at <http://en.wikipedia.org/wiki/Statistical_hypothesis_testing>.
SUGGESTED READING: Wikipedia also has a discussion of the concept of the null hypothesis which is somewhat readable. The discussion can be found at <http://en.wikipedia.org/wiki/Null_hypothesis>.
SUGGESTED READING: Wikipedia discusses the meaning of the p-value and the frequent misunderstandings in interpreting it. The discussion can be found at <http://en.wikipedia.org/wiki/Pvalue>.
EXAMPLE 1: Load the Fisher iris data and extract sepal lengths of Iris setosa
Create a new cell in which you type and execute:
load fisheriris;
sLenSetosa = meas(strcmp(species, 'setosa'), 1);  % Sepal lengths of Iris setosa
You should see the following variables in your Workspace Browser:
- meas - an array in which each column corresponds to a particular type of measurement and each row holds the 4 measurements for a particular specimen. The different species are combined into a single array.
- species - a cell column vector containing the species designation for the specimen given in the corresponding row of meas. Possible values are 'setosa', 'versicolor', and 'virginica'.
- sLenSetosa - vector with the sepal lengths for Iris setosa
EXAMPLE 2: Test Iris setosa mean sepal length using a one sample t-test
Create a new cell in which you type and execute:
fprintf('Estimated population mean is %g\n', mean(sLenSetosa));
testMean = 5.936;  % Could the true mean sepal length be this value?
fprintf('True population mean of Iris setosa sepal length ');
if ttest(sLenSetosa, testMean) == 1
    fprintf('is likely to be different from %g\n', testMean);
else
    fprintf('could be %g\n', testMean);
end;
You should see the following variable in your Workspace Browser:
- testMean - the value of the specified mean against which we are testing
You should also see the following output in the Command Window:
Estimated population mean is 5.006
True population mean of Iris setosa sepal length is likely to be different from 5.936
EXAMPLE 3: Apply one sample t-test (standard statistical terminology)
Create a new cell in which you type and execute:
testMean = 5.936;  % Could the true mean sepal length be this value?
fprintf(['\nNull hypothesis: The true mean Iris setosa sepal length ' ...
         'is %g\n'], testMean);
fprintf(['Alt hypothesis: The true mean Iris setosa sepal length ' ...
         'is different from %g\n'], testMean);
if ttest(sLenSetosa, testMean) == 1
    fprintf('\tReject null hypothesis in favor of alternative ');
else
    fprintf('\tCannot reject the null hypothesis ');
end;
fprintf('at the 0.05 significance level\n');
You should see the following variable in your Workspace Browser:
- testMean - the value of the specified mean against which we are testing
You should also see the following output in the Command Window:
Null hypothesis: The true mean Iris setosa sepal length is 5.936
Alt hypothesis: The true mean Iris setosa sepal length is different from 5.936
        Reject null hypothesis in favor of alternative at the 0.05 significance level
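To see what drives the decision that ttest reports, it can help to compute the test statistic by hand. The following is a minimal sketch, not part of the original lesson; the names stdErr, tStat, pByHand, and hByHand are illustrative. It assumes the Statistics Toolbox is available (for fisheriris, tcdf, and ttest).

```matlab
load fisheriris;                                   % Ships with the Statistics Toolbox
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Same extraction as EXAMPLE 1
testMean = 5.936;                                  % Hypothesized mean from the lesson

n = numel(sLenSetosa);                             % Sample size
stdErr = std(sLenSetosa)/sqrt(n);                  % Standard error of the sample mean
tStat = (mean(sLenSetosa) - testMean)/stdErr;      % Distance from testMean in standard errors
pByHand = 2*tcdf(-abs(tStat), n - 1);              % Two-sided p-value with n-1 degrees of freedom
hByHand = double(pByHand <= 0.05);                 % Reject the null when p is at most 0.05

[h, p] = ttest(sLenSetosa, testMean);              % Built-in test for comparison
fprintf('t = %g, p by hand = %g, p from ttest = %g\n', tStat, pByHand, p);
```

The sample mean sits many standard errors below testMean, which is why the p-value is tiny and the null hypothesis is rejected.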
EXAMPLE 4: Test for inequality using the one sample t-test
Create a new cell in which you type and execute:
testMean = 5.936;  % Could the true mean sepal length be this value?
fprintf('\nTrue population mean of Iris setosa sepal length ');
if ttest(sLenSetosa, testMean, 0.05, 'left') == 1
    fprintf('is likely to be less than ');
else
    fprintf('could be greater than or equal to ');
end;
fprintf('%g\n', testMean);
You should see the following variable in your Workspace Browser:
- testMean - the value of the specified mean against which we are testing
You should also see the following output in the Command Window:
True population mean of Iris setosa sepal length is likely to be less than 5.936
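The relationship between the 'left', 'right', and default (two-sided) tests can be checked directly from their p-values. This is a small sketch under the assumption that the positional syntax ttest(X, m, alpha, tail) used above is available; the names pBoth, pLeft, and pRight are illustrative.

```matlab
load fisheriris;                                           % Ships with the Statistics Toolbox
sLenSetosa = meas(strcmp(species, 'setosa'), 1);           % Same extraction as EXAMPLE 1
testMean = 5.936;                                          % Hypothesized mean from the lesson

[~, pBoth]  = ttest(sLenSetosa, testMean);                 % Alternative: mean differs from testMean
[~, pLeft]  = ttest(sLenSetosa, testMean, 0.05, 'left');   % Alternative: mean is less than testMean
[~, pRight] = ttest(sLenSetosa, testMean, 0.05, 'right');  % Alternative: mean is greater than testMean

fprintf('two-sided p = %g\n', pBoth);
fprintf('left p = %g, right p = %g\n', pLeft, pRight);
```

The two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them. Here the sample mean is below testMean, so only the 'left' test gives a small p-value.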
EXAMPLE 5: Look at the p-value and the confidence interval
Create a new cell in which you type and execute:
testMean = 5.936;  % Could the true mean sepal length be this value?
fprintf( ...
    '\nIs the true population mean of Iris setosa different from %g?\n', ...
    testMean);
[h, p, ci] = ttest(sLenSetosa, testMean);
fprintf('\t hypothesis = %g\n', h);  % 1 = reject null hypothesis, 0 = cannot reject
fprintf('\t pvalue = %g\n', p);      % Smaller value = stronger evidence against the null
fprintf('\t 95%% confidence interval for population mean: [%g, %g]\n', ci);
You should see the following variable in your Workspace Browser:
- testMean - the value of the specified mean against which we are testing
You should also see the following output in the Command Window:
Is the true population mean of Iris setosa different from 5.936?
         hypothesis = 1
         pvalue = 6.6085e-24
         95% confidence interval for population mean: [4.90582, 5.10618]
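The confidence interval can also be rebuilt from the sample mean and its standard error. The sketch below is an illustration rather than part of the original lesson; tCrit and ciByHand are made-up names, and tinv is assumed to be available from the Statistics Toolbox.

```matlab
load fisheriris;                                   % Ships with the Statistics Toolbox
sLenSetosa = meas(strcmp(species, 'setosa'), 1);   % Same extraction as EXAMPLE 1
alpha = 0.05;                                      % 1 - alpha = 95% confidence

n = numel(sLenSetosa);                             % Sample size
stdErr = std(sLenSetosa)/sqrt(n);                  % Standard error of the sample mean
tCrit = tinv(1 - alpha/2, n - 1);                  % Critical t value for a two-sided interval
ciByHand = mean(sLenSetosa) + [-1; 1]*tCrit*stdErr;  % Mean plus or minus the margin of error

[~, ~, ci] = ttest(sLenSetosa, 5.936);             % Built-in interval for comparison
fprintf('By hand: [%g, %g]\n', ciByHand);
fprintf('ttest:   [%g, %g]\n', ci);
```

The interval depends only on the sample, not on the tested mean. Because 5.936 lies outside it, the two-sided test at the 0.05 level rejects the null hypothesis, which agrees with hypothesis = 1 above.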
EXAMPLE 6: Load the Daphne Island and Santa Cruz Island beak size data
Create a new cell in which you type and execute:
Daphne = load('DaphneIslandBeaks.txt');
SantaCruz = load('SantaCruzIslandBeaks.txt');
You should see the following 2 variables in your Workspace Browser:
- Daphne - a column vector with beak sizes of the Daphne Island finches
- SantaCruz - a column vector with the beak sizes of the Santa Cruz Island finches
EXAMPLE 7: Are Daphne finch beak sizes different from those of Santa Cruz finches?
Create a new cell in which you type and execute:
fprintf(['\nNull hypothesis: The true mean beak sizes of Daphne and ' ...
         'Santa Cruz finches are equal\n']);
fprintf(['Alt hypothesis: The true mean beak size of Daphne finches ' ...
         'is different from the true mean beak size of Santa Cruz finches\n']);
if ttest2(Daphne, SantaCruz) == 1
    fprintf('\tReject null hypothesis in favor of alternative ');
else
    fprintf('\tCannot reject the null hypothesis ');
end;
fprintf('at the 0.05 significance level\n');
You should see the following output in the Command Window:
Null hypothesis: The true mean beak sizes of Daphne and Santa Cruz finches are equal
Alt hypothesis: The true mean beak size of Daphne finches is different from the true mean beak size of Santa Cruz finches
        Reject null hypothesis in favor of alternative at the 0.05 significance level
EXAMPLE 8: Look at the p-value and confidence interval for the two-sample t-test
Create a new cell in which you type and execute:
fprintf( ...
    '\nIs the true population mean of Daphne finches different from SantaCruz?\n');
[h, p, ci] = ttest2(Daphne, SantaCruz);
fprintf('\t hypothesis = %g\n', h);  % 1 = reject null hypothesis, 0 = cannot reject
fprintf('\t pvalue = %g\n', p);      % Smaller value = stronger evidence against the null
fprintf('\t 95%% confidence interval for difference of population means: ');
fprintf('[%g, %g]\n', ci);
You should see the following variables in your Workspace Browser:
- h - a value indicating whether to reject (1) or not reject (0) the null hypothesis
- p - (the p-value) gives the probability of seeing a difference at least as large as the observed one by chance if the null hypothesis were really true
- ci - confidence interval for the difference of the two population means
You should also see the following output in the Command Window:
Is the true population mean of Daphne finches different from SantaCruz?
         hypothesis = 1
         pvalue = 2.77109e-11
         95% confidence interval for difference of population means: [-1.4464, -0.795025]
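The two-sample test works the same way as the one-sample test, except that the standard error combines both samples. The sketch below uses small made-up vectors (not the Grant beak data) so that it runs without the data files. It assumes the default behavior of ttest2, which pools the two sample variances; the names x, y, pooledVar, tStat, and pByHand are illustrative.

```matlab
x = [9.1; 8.7; 9.4; 8.9; 9.0; 8.5];                % Hypothetical sample 1 (made-up values)
y = [10.2; 10.8; 9.9; 10.5; 11.0; 10.1; 10.4];     % Hypothetical sample 2 (made-up values)

nx = numel(x);
ny = numel(y);
df = nx + ny - 2;                                  % Degrees of freedom for the pooled test
pooledVar = ((nx - 1)*var(x) + (ny - 1)*var(y))/df;  % Pooled estimate of the common variance
stdErr = sqrt(pooledVar*(1/nx + 1/ny));            % Standard error of mean(x) - mean(y)
tStat = (mean(x) - mean(y))/stdErr;                % Difference in means, in standard errors
pByHand = 2*tcdf(-abs(tStat), df);                 % Two-sided p-value

[h, p, ci] = ttest2(x, y);                         % Built-in test for comparison
fprintf('p by hand = %g, p from ttest2 = %g\n', pByHand, p);
fprintf('95%% confidence interval for mean(x) - mean(y): [%g, %g]\n', ci);
```

The confidence interval is for the first mean minus the second. An interval that lies entirely below zero, as in the Daphne and Santa Cruz output above, indicates that the first population mean is likely to be smaller than the second.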
EXAMPLE 9: Load the consolidated sleep diary data
Create a new cell in which you type and execute:
load diaries.mat; % Load the sleep diaries
You should see 8 variables in the Workspace Browser:
- bedTimes - an array with the bedtimes of individual students in the columns
- dayCaffeine - a logical array with columns indicating daytime caffeine use for individual students
- gender - a vector of strings containing 'male' or 'female' designations for each student
- nightCaffeine - a logical array with columns indicating caffeine use after 6 pm for individual students
- section - vector containing section numbers of the individual students
- toSleepMinutes - an array with the number of minutes to fall asleep each night for the individual students
- useAlarm - a logical array with indications of alarm use for individual students in the columns.
- wakeTimes - an array with the wake times of individual students in the columns.
EXAMPLE 10: Calculate wake-up hours, separated by gender
Create a new cell in which you type and execute:
wakeupHours = (wakeTimes - floor(wakeTimes))*24;  % Get fractional part of wakeTimes
men = strcmp('male', gender);                     % 1's where gender is 'male'
mensWHours = wakeupHours(:, men);                 % Pick columns corresponding to men
women = ~men;                                     % 1's where gender is 'female'
womensWHours = wakeupHours(:, women);             % Pick columns corresponding to women
numSubjects = size(wakeupHours, 2);               % Also need number of subjects
You should see the following variables in your Workspace Browser:
- wakeupHours - array containing the hour of wake-up for each diary entry
- men - vector with ones corresponding to subjects who were male
- mensWHours - array with the wake-up hours of the men in the cohort
- women - vector with ones corresponding to the subjects who were female
- womensWHours - array with the wake-up hours of the women in the cohort
- numSubjects - number of subjects in the cohort
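The conversion from wake times to hours can be tried on a tiny example. This sketch assumes that wakeTimes holds MATLAB serial date numbers (whole part = day, fractional part = time of day), which is what the fractional-part calculation above suggests. The dates and the two-subject gender list are made up for illustration.

```matlab
% Two hypothetical subjects (columns) with two diary entries each (rows)
wakeTimes = [datenum(2010, 10, 4, 6, 30, 0), datenum(2010, 10, 4, 8, 15, 0); ...
             datenum(2010, 10, 5, 7,  0, 0), datenum(2010, 10, 5, 9, 45, 0)];
gender = {'male', 'female'};                       % One designation per column

wakeupHours = (wakeTimes - floor(wakeTimes))*24;   % Fraction of a day converted to hours
men = strcmp('male', gender);                      % Logical vector: true for male subjects
mensWHours = wakeupHours(:, men);                  % Keep only the columns where men is true
womensWHours = wakeupHours(:, ~men);               % Keep the remaining columns

disp(wakeupHours);                                 % Approximately [6.5 8.25; 7 9.75]
```

A wake time of 6:30 becomes 6.5 hours, and the logical vector men selects whole columns, so each subject's diary entries stay together.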
EXAMPLE 11: Do men get up earlier than women on average?
Create a new cell in which you type and execute:
fprintf('\nThe average wake-up time for men ');
[h, p, ci] = ttest2(mensWHours(:), womensWHours(:));
if h == 1
    fprintf('is likely to be different from that of women\n');
else
    fprintf('could be the same as that of women\n');
end;
fprintf('\t hypothesis = %g\n', h);  % 1 = reject null hypothesis, 0 = cannot reject
fprintf('\t pvalue = %g\n', p);
fprintf('\t 95%% confidence interval for difference = [%g, %g]\n', ci);
You should see the following variables in your Workspace Browser:
- h - a value indicating whether to reject (1) or not reject (0) the null hypothesis
- p - (the p-value) gives the probability of seeing a difference at least as large as the observed one by chance if the null hypothesis were really true
- ci - confidence interval for the difference of the two population means
You should also see the following output in the Command Window:
The average wake-up time for men is likely to be different from that of women
         hypothesis = 1
         pvalue = 0.0114848
         95% confidence interval for difference = [-0.421104, -0.0533088]
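The test above is two-sided, so it only says that the averages differ. To ask specifically whether the first group gets up earlier, a one-sided 'left' test could be used instead. The sketch below uses made-up wake-up hours rather than the diary data, and the names menHours and womenHours are illustrative.

```matlab
menHours   = [6.5; 7.0; 6.75; 7.25; 6.25; 7.0];          % Hypothetical wake-up hours (made up)
womenHours = [7.5; 7.75; 8.0; 7.25; 8.25; 7.5; 7.75];    % Hypothetical wake-up hours (made up)

% 'left' tests the alternative that mean(menHours) is less than mean(womenHours)
[hLeft, pLeft] = ttest2(menHours, womenHours, 0.05, 'left');
% 'right' tests the opposite alternative
[hRight, pRight] = ttest2(menHours, womenHours, 0.05, 'right');

fprintf('left:  h = %g, p = %g\n', hLeft, pLeft);
fprintf('right: h = %g, p = %g\n', hRight, pRight);
```

With these made-up values the 'left' test rejects the null hypothesis and the 'right' test does not, because the first sample's mean is well below the second's.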
EXAMPLE 12: Compare average wake-up time of instructor to that of a random student
Create a new cell in which you type and execute:
randStudent = randi(numSubjects - 1, 1, 1) + 1;  % Pick a random student
fprintf('\nThe average wake-up time of the instructor ');
if ttest2(wakeupHours(:, 1), wakeupHours(:, randStudent)) == 1
    fprintf('is likely to be different than ');
else
    fprintf('could be the same as ');
end;
fprintf('the average wake-up time for subject %d\n', randStudent);
You should see the following variable in your Workspace Browser:
- randStudent - the number of a randomly chosen subject in the cohort
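You should also see output like the following in the Command Window (the subject number and the conclusion vary from run to run because the student is chosen at random):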
The average wake-up time of the instructor is likely to be different than the average wake-up time for subject 86
EXAMPLE 13: Output IDs of the subjects whose average wake-up is similar to the instructor's
Create a new cell in which you type and execute:
averWakeup = mean(wakeupHours);
fprintf(['\nThe following students have average wake-up times ' ...
         'indistinguishable from instructor''s (%g):\n'], averWakeup(1));
for k = 2:numSubjects
    if ttest2(wakeupHours(:, 1), wakeupHours(:, k)) == 0
        fprintf('\t Subject %g''s average wake-up time = %g\n', k, averWakeup(k));
    end;
end;
You should see the following variables in your Workspace Browser:
- averWakeup - average wakeup times of members of the cohort
- k - index variable that ranges over the students in the cohort
You should also see the following output in the Command Window:
The following students have average wake-up times indistinguishable from instructor's (6.38685):
         Subject 6's average wake-up time = 6.44752
         Subject 10's average wake-up time = 6.26992
         Subject 13's average wake-up time = 6.95582
         Subject 32's average wake-up time = 5.87253
         Subject 37's average wake-up time = 6.963
         Subject 47's average wake-up time = 7.18662
         Subject 50's average wake-up time = 7.03739
         Subject 56's average wake-up time = 6.71099
         Subject 60's average wake-up time = 7.12994
         Subject 71's average wake-up time = 7.87514
         Subject 75's average wake-up time = 6.92933
         Subject 79's average wake-up time = 7.06183
         Subject 81's average wake-up time = 7.9929
         Subject 82's average wake-up time = 7.30281
         Subject 91's average wake-up time = 5.50243
         Subject 100's average wake-up time = 6.38685
         Subject 102's average wake-up time = 6.14901
         Subject 111's average wake-up time = 7.35553
         Subject 114's average wake-up time = 6.30092
         Subject 129's average wake-up time = 6.63787
         Subject 133's average wake-up time = 7.05643
         Subject 136's average wake-up time = 7.09694
         Subject 138's average wake-up time = 6.90769
         Subject 140's average wake-up time = 5.68481
SUMMARY OF SYNTAX
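| MATLAB syntax | Description |
| Y = abs(X) | The array Y is the same size as the array X and contains the absolute values of the corresponding elements of X. |
| h = ttest(X, m) | Performs a one-sample Student's t-test to determine whether the true mean of the population represented by the sample in the vector X is different from m. If h is 1, the null hypothesis (the true mean is m) is rejected at the 0.05 significance level. If h is 0, the null hypothesis cannot be rejected at that level. |
| [h, p, ci] = ttest(X, m) | Performs a one-sample Student's t-test to determine whether the true mean of the population represented by the sample in the vector X is different from m. The variable p is the p-value. The variable ci is the 95% confidence interval for the true mean. |
| [h, p, ci] = ttest(X, m, alpha) | Performs a one-sample Student's t-test at significance level alpha. The variable p is the p-value. The variable ci is the 100*(1 - alpha)% confidence interval for the true mean. |
| [h, p, ci] = ttest(X, m, alpha, 'left') | Performs a one-sided one-sample Student's t-test at significance level alpha. If h is 1, the true mean is likely to be less than m. The variable p is the p-value. The variable ci is the corresponding one-sided confidence interval. |
| [h, p, ci] = ttest(X, m, alpha, 'right') | Performs a one-sided one-sample Student's t-test at significance level alpha. If h is 1, the true mean is likely to be greater than m. The variable p is the p-value. The variable ci is the corresponding one-sided confidence interval. |
| h = ttest2(X, Y) | Performs a two-sample Student's t-test to determine whether the true means of the populations represented by the samples X and Y are different. If h is 1, the null hypothesis (the true means are equal) is rejected at the 0.05 significance level. If h is 0, the null hypothesis cannot be rejected at that level. |
| [h, p, ci] = ttest2(X, Y) | Performs a two-sample Student's t-test to determine whether the true means of the populations represented by the samples X and Y are different. The variable p is the p-value. The variable ci is the 95% confidence interval for the difference of the true means (X minus Y). |
| [h, p, ci] = ttest2(X, Y, alpha) | Performs a two-sample Student's t-test at significance level alpha. The variable p is the p-value. The variable ci is the 100*(1 - alpha)% confidence interval for the difference of the true means. |
| [h, p, ci] = ttest2(X, Y, alpha, 'left') | Performs a one-sided two-sample Student's t-test at significance level alpha. If h is 1, the true mean of the population represented by X is likely to be less than that of Y. The variable p is the p-value. The variable ci is the corresponding one-sided confidence interval. |
| [h, p, ci] = ttest2(X, Y, alpha, 'right') | Performs a one-sided two-sample Student's t-test at significance level alpha. If h is 1, the true mean of the population represented by X is likely to be greater than that of Y. The variable p is the p-value. The variable ci is the corresponding one-sided confidence interval. |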
_This lesson was written by Kay A. Robbins of the University of Texas at San Antonio and last modified on 31-Dec-2010. Please contact krobbins@cs.utsa.edu with comments or suggestions. The photo is of Sir Ronald Fisher, founder of modern statistics and namesake of the Fisher Iris dataset. (See http://en.wikipedia.org/wiki/File:R._A._Fischer.jpg.)_
