Lab 2-3: T-tests and Tests for Change in the Variance#
# import libraries we'll need
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
T-Test for small sample sizes (n<30)#
We have instantaneous monthly observations of dissolved organic carbon (DOC) in two streams over the course of one water year (October-September). Use a two-sample, two-sided, t-test to determine:
Using data for all 12 months, with what confidence can we say that the annual mean DOC concentrations are different between the two streams?
Compare the two streams again, but this time perform two tests, one for the first 6 months of the water year (October-March), and a second test for the last 6 months (April-September).
Can we say that the DOC concentrations between the two streams are different in the first half and/or second half of the water year? With what level of confidence could we say that they are different?
wy_month_labels = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
wy_month_numbers = np.arange(12)+1
# DOC for the first stream, mg/L
doc_1 = [65.3, 98.4, 113.1, 120.5, 105.3, 100.3, 92.3, 97.5, 88.2, 89.5, 72.1, 61.9]
# DOC for the second stream, mg/L
doc_2 = [62.0, 50.7, 30.9, 52.5, 98.7, 95.8, 99.3, 110.2, 104.9, 96.4, 82.5, 75.5]
# Note that you need to enter the code to calculate the t-test yourself, based on the lecture notes or book
Tests for a Change in the Variance (or in the Standard Deviation)#
Test for statistical significance of a change in the variance (or standard deviation). Note that the standard deviation does not benefit from the Central Limit Theorem.
We present two tests for a change in the variance, but please note that both of these are for normally-distributed data, which is rare in environmental data. In Section 5.6, the Helsel et al. explain the dangers of using these for data that are not normally-distributed and recommend several nonparametric tests.
We present a single sample test, which relies on the Chi Squared distribution, and the two sample test, which relies on the F distribution.
How to pick which of these to use? Similar to what we explored in looking for changes in the mean, the single sample test should be used when we are testing whether the variance of a sample of data differs from a known variance. The two sample test should be used when we are testing whether the variance of two samples of data are different from each other. We compare these below.
Single sample test for a normally-distributed population#
Even though it is not strictly true, assume for the moment that the sample data are derived from a normally distributed population. Use a single sample test (with rejection region based on the Chi Squared distribution). Assume that the sample standard deviation from the 1929-1974 data is close to the true population standard deviation of the earlier data set. Test that the more recent sample is different from this.
Use \({t} = \frac{(n-1)s^2}{\sigma^2}\) with n-1 degrees of freedom.
# Read the excel file
skykomish_data_file = '../data/Skykomish_peak_flow_12134500_WY1929_2023.xlsx'
skykomish_data = pd.read_excel(skykomish_data_file)
# Preview our data
skykomish_data.head(3)
/opt/hostedtoolcache/Python/3.7.17/x64/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py:329: UserWarning: Unknown extension is not supported and will be removed
warn(msg)
date of peak | water year | peak value (cfs) | gage_ht (feet) | |
---|---|---|---|---|
0 | 1928-10-09 | 1929 | 18800 | 10.55 |
1 | 1930-02-05 | 1930 | 15800 | 10.44 |
2 | 1931-01-28 | 1931 | 35100 | 14.08 |
# Divide the data into the early period (before 1977) and late period (after and including 1977).
skykomish_before = skykomish_data[ skykomish_data['water year'] < 1977 ]
skykomish_after = skykomish_data[ skykomish_data['water year'] >= 1977 ]
# first calculate the test statistic
sd1 = skykomish_before['peak value (cfs)'].std() #we pretend this is the "true population standard deviation)
sd2 = skykomish_after['peak value (cfs)'].std()
m = len(skykomish_after['peak value (cfs)'])
t = (m-1)*sd2**2/sd1**2
print(t)
62.95286530852466
Now, we know from the lecture notes that this test statistic is a chi-squared distributed with n-1 degrees of freedom. Let’s choose that we want 95% confidence that there is a change, and therefore alpha = 0.05. In this example we are just going to test for an increase in the standard deviation (we are doing a one-sided test). We can look up our critical value in a chi-squared distribution table using our degrees of freedom and chosen alpha.
How can we look this up in python?
stats.chi2.ppf?
alpha = 0.05
vals = stats.chi2.ppf(1-alpha, m-1)
print(vals)
61.65623337627955
Our t statistic is larger than the cut-off value from the chi-squared distribution, so we determine that yes, with 95% confidence, a change has occurred.
Two sample test for normally-distributed populations#
Here, we use the same data as above, but now we treat each section as a sample from a population. Even though it is not strictly true, assume for the moment that the sample data are derived from a normally distributed population.
Use \({test stat} = \frac{{s_2}^2}{{s_1}^2}\) with m-1 and n-1 degrees of freedom.
# first calculate the test statistic
sd1 = skykomish_before['peak value (cfs)'].std() #now we are treating this as a sample standard deviation.
sd2 = skykomish_after['peak value (cfs)'].std()
m = len(skykomish_after['peak value (cfs)'])
n = len(skykomish_before['peak value (cfs)'])
teststat = sd2**2/sd1**2
print(teststat)
print(m)
print(n)
1.398952562411659
46
48
alpha = 0.05
# Note for the f test statistic, it is important which number count goes into the numerator vs the denominator -- this is a distribution of the ratio
# for the stats.f.ppf function, you need to enter the size of the numerator first and the size of the denominator second.
vals2 = stats.f.ppf(1-alpha, m-1, n-1)
print(vals2)
1.6301958440576045
Our value is not larger than this, so again, we cannot say that a change has occurred.
Note that we have done a two-sample, one-sided test here. If we do not hypothesize that the variance has increased, but instead just want to conduct a two-sided test on whether the variance has changed, we should do a two-sided test. We do this below.
#two-sided test.
alpha = 0.05/2
vals21 = stats.f.ppf(alpha, m-1, n-1)
print(vals21)
vals22 = stats.f.ppf(1-alpha, m-1, n-1)
print(vals22)
0.5558760048539069
1.7922085167648096
Our value is between these two numbers, so we cannot say that a change has occurred. Note that when we presume that there is uncertainty in both variances, with both periods as random samples from true distributions, we are not able to detect a change.