Graphical Data Analysis#

When you get a set of measurements, ask yourself:

  • What do you want to learn from this data?

  • What is your hypothesis, and what would the data look like if it supports (or does not support) that hypothesis?

  • Plot your data! (and always label your plots clearly)

Reporting of Numbers#

  • Keep track of units, and always report units with your numbers!

    • Make sure to check metadata about how the measurements were made

  • Significant figures

    • From our snow depth example last week:

      • Should I report a snow depth value of 20.3521 cm?

      • Should I report a snow depth value of 2035 mm?

      • Should I report a snow depth value of 20.0000 cm?

    • Consider the certainty with which you know a value, and don’t include precision beyond that

    • Note on rounding errors: let the computer carry full precision through intermediate calculations, and round to significant figures only for the final result you report
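To illustrate the rounding rule, here is a minimal sketch (the snow-depth values below are made up for illustration):

```python
import numpy as np

# hypothetical snow depth measurements in mm (illustrative values, not real data)
depths_mm = np.array([203.521, 198.476, 201.303])

# keep full precision for the intermediate calculation...
mean_depth_cm = depths_mm.mean() / 10

# ...and round only the final value that you report
reported_depth_cm = round(mean_depth_cm, 1)
print(reported_depth_cm)  # 20.1
```

Rounding each measurement first and then averaging can give a slightly different (and less accurate) answer, which is why rounding should wait until the end.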


To start, we will import some python packages:

# numpy has a lot of math and statistics functions we'll need to use
import numpy as np

# pandas gives us a way to work with and plot tabular datasets easily (called "dataframes")
import pandas as pd

# we'll use matplotlib for plotting here (it works behind the scenes in pandas)
import matplotlib.pyplot as plt 

# tell jupyter to make our plots "inline" in the notebook
%matplotlib inline 

Why are you plotting?#

You have an application in mind for your data. That application should inform your choice of analysis technique and what you want to plot and visualize.

Open our file using the pandas read_csv function.

# Use pandas.read_csv() function to open this file.
# This stores the data in a "Data Frame"
my_data = pd.read_csv('my_data.csv')
# look at the first few rows of data with the .head() method
my_data.head()
     time         tair_max   tair_min    cumulative_precip
0    1920-12-31   20.455167  -9.901765   102.502512
1    1921-12-31   20.119887  -10.364254  97.108113
2    1922-12-31   19.872675  -10.313181  97.166797
3    1923-12-31   20.449070  -11.359639  97.902843
4    1924-12-31   20.449110  -10.046539  99.329978
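If you don’t have my_data.csv on hand, a small stand-in DataFrame with the same columns (using the first few rows shown above) lets you try the plotting commands that follow; parsing the time column as dates also helps the timeseries plot behave well:

```python
import pandas as pd

# a stand-in for my_data.csv with the same columns (first three rows shown above)
my_data = pd.DataFrame({
    'time': pd.to_datetime(['1920-12-31', '1921-12-31', '1922-12-31']),
    'tair_max': [20.455167, 20.119887, 19.872675],
    'tair_min': [-9.901765, -10.364254, -10.313181],
    'cumulative_precip': [102.502512, 97.108113, 97.166797],
})
print(my_data.head())
```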

Scatterplots#

  • If we’re looking for relationships between variables within our data, try making scatterplots.

  • Later this quarter we’ll get into statistical tests for correlation, where we’ll use scatterplots to visualize our data.

  • Remember that correlation ≠ causation!

my_data.plot.scatter(x='tair_max', y='tair_min')
plt.title('Maximum vs Minimum\nAir Temperature');
[Figure: scatterplot of maximum vs. minimum annual air temperature]

Timeseries plots#

  • If we are interested in how some random variable changes over time.

  • Similarly, if we have a spatial dimension and are interested in how a variable changes along some distance, we could make a spatial plot.

my_data.plot(x='time', y='tair_max')
plt.ylabel(r'Maximum Annual Air Temperature ($\degree C$)')
plt.title('Maximum Annual Air Temperature Timeseries');
[Figure: timeseries plot of maximum annual air temperature]

Histogram plots#

my_data['tair_max'].plot.hist(bins=10)
plt.xlabel(r'Maximum Annual Air Temperature ($\degree C$)')
plt.title('Maximum Annual Air Temperature Histogram');
[Figure: histogram of maximum annual air temperature]
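Note that the number of bins changes the apparent shape of a histogram. np.histogram computes the counts behind the plot; a sketch with synthetic data (the normal sample below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=20, scale=1.5, size=100)  # synthetic "temperatures"

# the same data binned two ways: the totals match, but the shapes look different
counts_coarse, _ = np.histogram(sample, bins=5)
counts_fine, _ = np.histogram(sample, bins=30)
print(counts_coarse.sum(), counts_fine.sum())  # both 100
```

Too few bins can hide structure in the data, while too many can make it look noisy; try a few values before settling on one.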

Boxplots#

  • A boxplot (sometimes called “box-and-whisker” plots) can also help visualize a distribution, especially when we want to compare multiple data sets side by side.

  • The box usually represents the interquartile range (IQR) (between the 25th and 75th percentiles)

  • Symbols (lines, circles, etc.) within the box can represent the sample mean and/or median

  • Vertical line “whiskers” can represent the full range (minimum to maximum) or another percentile range (such as 2nd and 98th percentiles)

  • Data points beyond the “whiskers” are “outliers”

  • What each symbol represents can vary, so be sure to check the documentation! See the documentation for making boxplots with pandas, and boxplots with matplotlib.

my_data.boxplot(column=['tair_min','tair_max'], grid=False)
plt.ylabel(r'Air Temperature ($\degree C$)')
plt.title('Min/Max Annual Air Temperature Boxplots');
[Figure: side-by-side boxplots of minimum and maximum annual air temperature]
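The quartiles that define the box can be checked directly with np.percentile; a minimal example with made-up values:

```python
import numpy as np

data = np.arange(1.0, 11.0)  # the values 1 through 10, for illustration

# 25th/50th/75th percentiles: the box edges and the median line of a boxplot
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(q1, median, q3, iqr)  # 3.25 5.5 7.75 4.5
```

Computing these by hand for a small dataset is a good way to confirm what your plotting library’s boxes and whiskers actually show.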

Let’s look at a different set of data:#

from scipy.stats import linregress

# Plot the same set of points three different ways to show how plots can be manipulated to trick us!
fig, [plotA, plotB, plotC] = plt.subplots(ncols=3, nrows=1, figsize=(15,5), tight_layout=True)

# The underlying data is a linear relationship, but with a lot of random noise added
# There is a trend in the data, but it is hard to detect
x = np.linspace(0,20,21)
y = x + 15*np.random.randn(21)

# Be careful! Depending on the axes limits we choose, we can make the data look very different
plotA.scatter(x,y)
# Adding a regression line can sometimes be misleading (suggesting there's a trend even if there isn't)
m, b, _, _, _ = linregress(x, y)
# Just because I've plotted a linear regression here, doesn't mean that it's statistically significant!
plotA.plot(x, m*x + b, color='red')
plotA.set_xlim((-1,21)); plotA.set_ylim((-50,50))
plotA.set_xlabel('X'); plotA.set_ylabel('Y')
plotA.set_title('A', fontsize=25, fontweight='bold')

# We can make the data look very different just by changing the axes limits
# This can be misleading, be careful!
plotB.scatter(x,y)
plotB.set_xlim((-1,21)); plotB.set_ylim((-150,150))
plotB.set_xlabel('X'); plotB.set_ylabel('Y')
plotB.set_title('B', fontsize=25, fontweight='bold')

# We can make the data look very different just by changing the axes limits
# This can be misleading, be careful!
plotC.scatter(x,y)
plotC.set_xlim((-50,50)); plotC.set_ylim((-50,50))
plotC.set_xlabel('X'); plotC.set_ylabel('Y')
plotC.set_title('C', fontsize=25, fontweight='bold')

fig.suptitle('Is there a trend in any of these plots?', fontsize=20, fontweight='bold', y=1.05);
[Figure: three scatterplots (A, B, C) of the same data with different axes limits]

Ethics in graphical analysis#

Be careful!

  • Others may try to manipulate plots and statistics to convince us of something

  • We can end up tricking ourselves with “wishful thinking” and “confirmation bias” if we are not careful

  • This is why we have statistical tests: they are our attempt at objective measures of “is this a real trend?”

  • Don’t draw a trendline through data when there isn’t a statistically significant trend!
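One way to guard against drawing a trendline through noise is to check the regression’s p-value before plotting it. A sketch using scipy.stats.linregress, with synthetic data like the three-panel example above (the 0.05 threshold is a common but arbitrary convention):

```python
import numpy as np
from scipy.stats import linregress

# noisy linear trend, similar to the three-panel example above
rng = np.random.default_rng(42)
x = np.linspace(0, 20, 21)
y = x + 15 * rng.standard_normal(21)

result = linregress(x, y)

# only draw the trendline if the slope is statistically significant
if result.pvalue < 0.05:
    print(f'significant trend: slope = {result.slope:.2f}')
else:
    print('no statistically significant trend; do not draw a trendline')
```

We will cover what the p-value means and how to interpret it later in the quarter; for now, the point is that the decision to draw a line should come from a test, not from eyeballing the scatter.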