{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 2-3: More Hypothesis Testing\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import libraries we'll need\n", "import pandas as pd\n", "import numpy as np\n", "import scipy.stats as stats\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## T-Test for small sample sizes (n<30)\n", "\n", "We have instantaneous monthly observations of dissolved organic carbon (DOC) in two streams over the course of one water year (October-September). Use a two-sample, two-sided, t-test to determine:\n", "\n", "1. Using data for all 12 months, with what confidence can we say that the annual mean DOC concentrations are different between the two streams?\n", "2. Compare the two streams again, but this time perform two tests, one for the first 6 months of the water year (October-March), and a second test for the last 6 months (April-September).\n", "3. Can we say that the DOC concentrations between the two streams are different in the first half and/or second half of the water year? 
With what level of confidence could we say that they are different?\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "wy_month_labels = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']\n", "wy_month_numbers = np.arange(12)+1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# DOC for the first stream, mg/L\n", "doc_1 = [65.3, 98.4, 113.1, 120.5, 105.3, 100.3, 92.3, 97.5, 88.2, 89.5, 72.1, 61.9]\n", "# DOC for the second stream, mg/L\n", "doc_2 = [62.0, 50.7, 30.9, 52.5, 98.7, 95.8, 99.3, 110.2, 104.9, 96.4, 82.5, 75.5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Chi-Squared Test for a Change in the Standard Deviation\n", "Test for statistical significance of a change in the standard deviation.\n", "Note that the standard deviation does not benefit from the Central Limit Theorem.\n", "Even though it is not strictly true, assume for the moment that the\n", "sample data are derived from a normally distributed population. Use a\n", "single sample test (with rejection region based on the Chi Squared\n", "distribution). Assume that the sample standard deviation from the\n", "1929-1974 data is close to the true population standard deviation of the\n", "earlier data set. Test that the more recent sample is different from this.\n", "\n", "Use ${t} = \\frac{(n-1)s^2}{\\sigma^2}$ with n-1 degrees of freedom." 
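, "

The statistic above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab solution: the helper name `chi2_variance_statistic` and the five sample values are hypothetical and are not taken from the Skykomish record. It uses `ddof=1` so that the sample variance matches the default of the pandas `.std()` method.

```python
import numpy as np

def chi2_variance_statistic(sample, sigma):
    # Hypothetical helper: returns (n-1) * s^2 / sigma^2 and its degrees of freedom.
    # sigma is the standard deviation we assume to be the true population value.
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s2 = sample.var(ddof=1)  # sample variance with n-1 in the denominator
    return (n - 1) * s2 / sigma**2, n - 1

# Synthetic numbers for illustration only (not the Skykomish data)
recent = [12.0, 15.0, 9.0, 20.0, 14.0]
stat, dof = chi2_variance_statistic(recent, sigma=3.0)
print(stat, dof)
```

Under the normality assumption stated above, the returned statistic would then be compared against a chi-squared distribution with `dof` degrees of freedom.
"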
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/openpyxl/worksheet/_read_only.py:79: UserWarning: Unknown extension is not supported and will be removed\n", " for idx, row in parser.parse():\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date of peakwater yearpeak value (cfs)gage_ht (feet)
01928-10-0919291880010.55
11930-02-0519301580010.44
21931-01-2819313510014.08
\n", "
" ], "text/plain": [ " date of peak water year peak value (cfs) gage_ht (feet)\n", "0 1928-10-09 1929 18800 10.55\n", "1 1930-02-05 1930 15800 10.44\n", "2 1931-01-28 1931 35100 14.08" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the excel file\n", "skykomish_data_file = '../data/Skykomish_peak_flow_12134500_skykomish_river_near_gold_bar.xlsx'\n", "skykomish_data = pd.read_excel(skykomish_data_file)\n", "# Preview our data\n", "skykomish_data.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Divide the data into the early period (before 1975) and late period (after and including 1975).\n", "skykomish_before = skykomish_data[ skykomish_data['water year'] < 1975 ] \n", "skykomish_after = skykomish_data[ skykomish_data['water year'] >= 1975 ] " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "65.86640081438826\n" ] } ], "source": [ "# first calculate the test statistic\n", "sd1 = skykomish_before['peak value (cfs)'].std() #we pretend this is the \"true population standard deviation)\n", "sd2 = skykomish_after['peak value (cfs)'].std()\n", "m = len(skykomish_after['peak value (cfs)'])\n", "t = (m-1)*sd2**2/sd1**2\n", "print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we know from the lecture notes that this test statistic is a chi-squared distributed with n-1 degrees of freedom. Let's choose that we want 95% confidence that there is a change, and therefore alpha = 0.05. In this example we are just going to test for an increase in the standard deviation (we are doing a one-sided test). We can look up our critical value in a chi-squared distribution table using our degrees of freedom and chosen alpha.\n", "\n", "How can we look this up in python?" 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mSignature:\u001b[0m \u001b[0mstats\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchi2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mppf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mq\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m\n", "Percent point function (inverse of `cdf`) at q of the given RV.\n", "\n", "Parameters\n", "----------\n", "q : array_like\n", " lower tail probability\n", "arg1, arg2, arg3,... : array_like\n", " The shape parameter(s) for the distribution (see docstring of the\n", " instance object for more information)\n", "loc : array_like, optional\n", " location parameter (default=0)\n", "scale : array_like, optional\n", " scale parameter (default=1)\n", "\n", "Returns\n", "-------\n", "x : array_like\n", " quantile corresponding to the lower tail probability q.\n", "\u001b[0;31mFile:\u001b[0m /opt/conda/lib/python3.10/site-packages/scipy/stats/_distn_infrastructure.py\n", "\u001b[0;31mType:\u001b[0m method" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "stats.chi2.ppf?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alpha = 0.05\n", "# ppf takes a lower-tail probability, so a one-sided test for an increase\n", "# uses the (1 - alpha) quantile as its critical value\n", "vals = stats.chi2.ppf(1 - alpha, m-1)\n", "print(vals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Our test statistic `t` is larger than the critical value from the chi-squared distribution, so we reject the null hypothesis and conclude, with 95% confidence, that the standard deviation has increased.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }