import warnings
'ignore') warnings.filterwarnings(
Lab 02
Warm-up: Exploratory Data Analysis
- Where possible, certainly for any variable of particular interest-examine exploratory visualizations.
Visualization
- The objective of visualization is to reveal hidden information through simple charts and diagrams.
- It is instrumental in identifying patterns and relationships among groups of variables.
- Visualization techniques depend on the type of variables.
Tools for Displaying Single Variables
Histograms
- Histograms are the most common graphical tool to represent continuous data.
- A histogram divides the data into bins, counts the number of data points falling into each bin, and shows the bins on the horizontal axis and the frequency on the vertical axis.
- The bin width (sometimes called class width) has an impact on the shape of the histogram.
#import the data set in the .csv file into your your current session
import pandas as pd
= pd.read_csv("https://www.statlearning.com/s/College.csv", index_col = 0) college_df
#https://matplotlib.org/stable/tutorials/introductory/pyplot.html
import matplotlib.pyplot as plt
#For college_df, produce some histograms for a few of the quantitative variables.
#change numbers of bins
= 50
n_bins #the frequency of number applications decreases as magnitude increases
'Apps'], bins = n_bins, color = 'red', alpha = 0.5, edgecolor = 'black')
plt.hist(college_df['The histogram of number of applications received')
plt.title( plt.show()
Subplotting with matplotlib
Using the College data set, let’s create a 2x2 subplot with the following variables:
- Apps
- perc.alumni
- S.F.Ratio
- Expend
where each subplot is a histogram.
#create subplots with matplotlib
#https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=subplots
import matplotlib.pyplot as plt
#Create a figure and a set of subplots
#returns a figure and array of axes
= plt.subplots(2, 2, figsize = (16,8)) #figsize: width, height, respectively.
figure, axs #axs object is a 2 by 2 numpy array
#type(axs)
#axs
import matplotlib.pyplot as plt
#Create a figure and a set of subplots
#returns a figure and array of axes
= plt.subplots(2, 2, figsize = (16,8)) #figsize: width, height, respectively.
figure, axs #fill-in the subplots
= 50
n_bins ##A question: Why not plt hist??
#the frequency of number applications decreases as magnitude increases
0,0].hist(college_df['Apps'], bins = n_bins, color = 'red', alpha = 0.5, edgecolor = 'black')
axs[0,0].set_title('The histogram of number of applications received')
axs[0,1].hist(college_df['Accept'], bins = n_bins, color = 'green', alpha = 0.5, edgecolor = 'black')
axs[0,1].set_title('The histogram of number of applications accepted')
axs[#the distribution of estimated book costs is symmetric. there a few expensive books used at colleges.
1,0].hist(college_df['Books'], bins = n_bins, color = 'orange', alpha = 0.5, edgecolor = 'black')
axs[1,0].set_title('The histogram of estimated book costs')
axs[1,1].hist(college_df['Expend'], bins = n_bins, alpha = 0.5, edgecolor = 'black')
axs[1,1].set_title('The histogram of instructional expenditure per student')
axs[#plt.show() turns off printing the data array etc
plt.show(figure) #you can change the number of bins, as you wish
#could not give y axis title for each subplot
Density Plots
- A density plot is a smoothed, continuous version of a histogram estimated from the data.
- A point is drawn at the top of every individual rectangular bin and all of these points are then connected together to make a single smooth density estimation. (Try to make a connection between Rieman sums and its limit as the number of bins goes to infinity (or width of the bins approaches to zero) and histograms).
- The height of the curve is adjusted so that the total area under the curve integrates to one.
Histograms, Density Plots with seaborn
#https://seaborn.pydata.org/generated/seaborn.displot.html
#no submodule
import seaborn as sns
#draw a histogram with seaborn, kind="hist" is default
= college_df.Expend, kind = "hist")
sns.displot(x #draw a denstiy plot with seaborn with kernel density
= college_df.Expend, kind = "kde")
sns.displot(x #overlay histogram with a density plot
= college_df.Expend, kde = True); sns.displot(x
Tools for Displaying a Single Variable with repect to a Categorical Variable
#Each type of plot can be drawn separately for subsets of data using **hue mapping**
#Investigate how expenditure is changing with respect to school type
= college_df, x = "Expend", hue = "Private", kind = "hist")
sns.displot(data = college_df, x = "Expend", hue = "Private", kind = "kde");
sns.displot(data #as you can see instructional expenditure is higher in private schools compared to non-private ones
#alternatively you can prefer stacked plots
= college_df, x = "Expend", hue = "Private", kind = "hist", multiple = "stack"); sns.displot(data
#faceting, map the expenditure variable into private variable with col argument
= college_df, x = "Expend", col = "Private", kind = "hist")
sns.displot(data = college_df, x = "Expend", col = "Private", kind = "kde"); sns.displot(data
Boxplots
- Boxplots are used to describe the shape of data distribution.
- We can also use the boxplots to describe the shape of data distribution with respect to the levels of a categorical variable.
- It also enables us to identify outliers where an observation is considered as an outlier if it is either less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, with IQR = Q3 - Q1.
- This rule is conservative and often too many points are identified as outliers. Hence sometimes only those points outside of [Q1 - 3 IQR, Q3 + 3 IQR] are only identified as outliers.
Using the College data set, we will create a new categorical variable, called Elite
, by binning the Top10perc
variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
#attach a new column to the college df
import pandas as pd
= pd.read_csv("https://www.statlearning.com/s/College.csv", index_col = 0)
college_df
#this should be much more smarter.
#i tried boolean indexing below, but it worked first, but,
#it did not work later.
'Elite'] = "No"
college_df[>50] = "Yes"
college_df.Elite[college_df.Top10perc
#alternatively, a bit longer way to add a new column to a data frame
#elite = []
#for i in college_df['Top10perc']:
# if i > 50 : elite.append('Yes')
# else: elite.append('No')
#college_df['Elite'] = elite
college_df.head()
Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | Elite | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 | No |
Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 | No |
Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 | No |
Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 | Yes |
Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 | No |
#https://seaborn.pydata.org/generated/seaborn.boxplot.html
#compare the graduation rates with respect to school's elite status
= college_df, x = "Elite", y = "Grad.Rate");
sns.boxplot(data #vertical box-plot
#as you can see in the elite schools, the graduate rate is higher compared to non-elite ones
#we can see a few outliers in the non-elite schools. while graduation rate is
#very high in some non-elite schools, graduation rate is very low in some non-elite schools
#this is an example that column names should not involve .
#college_df.Grad.Rate.max()#will not work
#hmmm
"Grad.Rate"].max() college_df[
118
#produce side-by-side boxplots of Outstate versus Private.
#horizontal box-plot
= college_df, x = "Outstate", y = "Private");
sns.boxplot(data #as you can see in private schools, out-state tution fees are also very high.
#if you born in Arizona, but, want to study in a private school, such as in Boston,
#you are more likely to pay for higher tution rates.
Tools for Displaying Relationships Between Two Variables
Scatterplot
- The most conventional way to display relationships between two variables is a scatterplot.
- It shows the direction and strength of association between two variables.
#https://seaborn.pydata.org/generated/seaborn.scatterplot.html
#get a scatter plot of out of state tution fee and graduation rate
= college_df, x = "Outstate", y = "Grad.Rate");
sns.scatterplot(data #As tuition fee increases, the high graduation rate increases
#how graph above chages with respect to school type?
= college_df, x = 'Outstate', y = "Grad.Rate", hue = 'Private'); sns.scatterplot(data
Tools for Displaying More Than Two Variables
Scatterplot Matrix
- Displaying more than two variables on a single scatterplot is not possible.
- A scatterplot matrix is one possible visualization of three or more continuous variables taken two at a time.
#https://seaborn.pydata.org/generated/seaborn.pairplot.html
#let's get a pairwise matrix plot of some variables.
#diag_kind = None, otherwise histogram appears on the diagonals.
vars = ['Outstate', 'Grad.Rate', 'Personal'], diag_kind = None, hue = 'Private'); sns.pairplot(college_df,
import numpy as np
= college_df[['Outstate', 'Grad.Rate', 'Personal']]
df df.corr()
Outstate | Grad.Rate | Personal | |
---|---|---|---|
Outstate | 1.000000 | 0.571290 | -0.299087 |
Grad.Rate | 0.571290 | 1.000000 | -0.269344 |
Personal | -0.299087 | -0.269344 | 1.000000 |
import session_info
session_info.show()
Click to view session information
----- matplotlib 3.4.3 numpy 1.21.2 pandas 1.3.3 seaborn 0.11.2 session_info 1.0.0 -----
Click to view modules imported as dependencies
PIL 8.3.2 anyio NA appnope 0.1.2 attr 21.2.0 babel 2.9.1 backcall 0.2.0 beta_ufunc NA binom_ufunc NA brotli 1.0.9 certifi 2021.05.30 cffi 1.14.6 chardet 4.0.0 charset_normalizer 2.0.0 colorama 0.4.4 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.4.1 decorator 5.1.0 defusedxml 0.7.1 entrypoints 0.3 google NA idna 3.1 ipykernel 6.4.1 ipython_genutils 0.2.0 ipywidgets 7.6.5 jedi 0.18.0 jinja2 3.0.1 json5 NA jsonschema 3.2.0 jupyter_server 1.11.0 jupyterlab_server 2.8.1 kiwisolver 1.3.2 markupsafe 2.0.1 matplotlib_inline NA mpl_toolkits NA nbclassic NA nbformat 5.1.3 nbinom_ufunc NA packaging 21.3 parso 0.8.2 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prometheus_client NA prompt_toolkit 3.0.20 ptyprocess 0.7.0 pvectorc NA pydev_ipython NA pydevconsole NA pydevd 2.4.1 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.10.0 pyparsing 2.4.7 pyrsistent NA pytz 2021.1 requests 2.26.0 scipy 1.7.1 send2trash NA six 1.16.0 sniffio 1.2.0 socks 1.7.1 statsmodels 0.13.2 storemagic NA terminado 0.12.1 tornado 6.1 traitlets 5.1.0 urllib3 1.26.7 wcwidth 0.2.5 websocket 0.57.0 zmq 22.3.0
----- IPython 7.27.0 jupyter_client 7.0.3 jupyter_core 4.8.1 jupyterlab 3.1.12 notebook 6.4.4 ----- Python 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:59:00) [Clang 11.1.0 ] macOS-10.15.7-x86_64-i386-64bit ----- Session information updated at 2022-09-20 00:03