Lab 02

Warm-up: Exploratory Data Analysis

Author

Gül İnan

Published

September 20, 2022

Where possible, and certainly for any variable of particular interest, examine exploratory visualizations.

Visualization

The objective of visualization is to reveal hidden information through simple charts and diagrams.
It is instrumental in identifying patterns and relationships among groups of variables.
Visualization techniques depend on the type of variables.
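As a rough illustration of the last point, the sketch below separates quantitative columns from categorical ones before a plot is chosen. The small data frame `toy_df` is a hypothetical stand-in typed in by hand (three rows copied from the College data), not the full data set.

```python
import pandas as pd

# toy frame: three rows typed in by hand, only to illustrate the idea
toy_df = pd.DataFrame({
    "Private": ["Yes", "Yes", "Yes"],
    "Apps": [1660, 2186, 1428],
    "S.F.Ratio": [18.1, 12.2, 12.9],
})

# numeric columns are candidates for histograms, density plots and scatterplots
quantitative = toy_df.select_dtypes(include="number").columns.tolist()
# non-numeric columns are candidates for grouping (hue, facets, boxplot categories)
categorical = toy_df.select_dtypes(exclude="number").columns.tolist()

print(quantitative)
print(categorical)
```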

Tools for Displaying Single Variables

Histograms

Histograms are the most common graphical tool to represent continuous data.
A histogram divides the data into bins, counts the number of data points falling into each bin, and shows the bins on the horizontal axis and the frequency on the vertical axis.
The bin width (sometimes called class width) has an impact on the shape of the histogram.
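To make the effect of the bin width concrete, here is a minimal sketch that bins the same values with different numbers of bins and reports the resulting bin width. The data are synthetic (drawn from an exponential distribution purely as an assumption to get a right-skewed shape), not the College data.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic right-skewed values, used only for illustration
values = rng.exponential(scale=3000, size=500)

for n_bins in (5, 20, 50):
    # np.histogram returns the count per bin and the bin edges
    counts, edges = np.histogram(values, bins=n_bins)
    width = edges[1] - edges[0]
    print(f"{n_bins:>2} bins, width {width:8.1f}, tallest bin holds {counts.max()} points")
```

Fewer bins give wide bars that hide detail, while many bins give narrow bars that can look noisy. Whatever the number of bins, the counts always add up to the number of observations.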

import warnings 
warnings.filterwarnings('ignore')

#import the data set in the .csv file into your current session
import pandas as pd
college_df = pd.read_csv("https://www.statlearning.com/s/College.csv", index_col = 0)

#https://matplotlib.org/stable/tutorials/introductory/pyplot.html
import matplotlib.pyplot as plt
#For college_df, produce some histograms for a few of the quantitative variables.
#change the number of bins to see how the shape of the histogram changes
n_bins = 50
#most colleges receive few applications; the frequency drops as the number of applications grows
plt.hist(college_df['Apps'], bins = n_bins, color = 'red', alpha = 0.5, edgecolor = 'black')
plt.title('The histogram of number of applications received')
plt.show()

Subplotting with matplotlib

Using the College data set, let’s create a 2x2 subplot with the following variables:

Apps
Accept
Books
Expend

where each subplot is a histogram.

#create subplots with matplotlib
#https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=subplots
import matplotlib.pyplot as plt
#Create a figure and a set of subplots
#returns a figure and array of axes
figure, axs = plt.subplots(2, 2, figsize = (16,8)) #figsize: width, height, respectively.
#axs object is a 2 by 2 numpy array
#type(axs)
#axs

import matplotlib.pyplot as plt
#Create a figure and a set of subplots
#returns a figure and array of axes
figure, axs = plt.subplots(2, 2, figsize = (16,8)) #figsize: width, height, respectively.
#fill-in the subplots
n_bins = 50
#A question: why do we call axs[i,j].hist() here instead of plt.hist()?
#most colleges receive few applications; the frequency drops as the number of applications grows
axs[0,0].hist(college_df['Apps'], bins = n_bins, color = 'red', alpha = 0.5, edgecolor = 'black')
axs[0,0].set_title('The histogram of number of applications received')
axs[0,1].hist(college_df['Accept'], bins = n_bins, color = 'green', alpha = 0.5, edgecolor = 'black')
axs[0,1].set_title('The histogram of number of applications accepted')
#the distribution of estimated book costs is roughly symmetric, apart from a few colleges with very expensive books.
axs[1,0].hist(college_df['Books'], bins = n_bins, color = 'orange', alpha = 0.5, edgecolor = 'black')
axs[1,0].set_title('The histogram of estimated book costs')
axs[1,1].hist(college_df['Expend'], bins = n_bins, alpha = 0.5, edgecolor = 'black')
axs[1,1].set_title('The histogram of instructional expenditure per student')
plt.show() #plt.show() takes no figure argument; it also suppresses the printed array output of hist()
#you can change the number of bins as you wish
#each subplot can be given its own y-axis label with axs[i,j].set_ylabel()
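One possible way to label every panel is to loop over the flattened array of axes. The sketch below is an illustration under stated assumptions: the four samples are synthetic stand-ins named `variable 1` to `variable 4` (hypothetical names, not College columns), and the `Agg` backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-ins for four quantitative columns (not the College data)
samples = {f"variable {k}": rng.normal(size=200) for k in range(1, 5)}

figure, axs = plt.subplots(2, 2, figsize=(8, 6))
# axs.flat walks over the 2 x 2 array of Axes row by row
for ax, (name, values) in zip(axs.flat, samples.items()):
    ax.hist(values, bins=20, alpha=0.5, edgecolor="black")
    ax.set_title(f"Histogram of {name}")
    ax.set_xlabel(name)
    ax.set_ylabel("Frequency")  # y-axis label for this panel only
figure.tight_layout()  # keep titles and labels from overlapping
```

In a notebook, drop the `matplotlib.use("Agg")` line and finish with `plt.show()`.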

Density Plots

A density plot is a smoothed, continuous version of a histogram estimated from the data.
Conceptually, a point is drawn at the top of each rectangular bin and these points are connected to form a single smooth density estimate. (Try to connect this idea to Riemann sums and their limit as the number of bins goes to infinity, or equivalently as the bin width approaches zero.)
The height of the curve is scaled so that the total area under the curve equals one.
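The area-equals-one property can be checked numerically. The following is a minimal sketch of a Gaussian kernel density estimate written directly in NumPy. The sample is synthetic and the bandwidth uses a normal-reference rule of thumb; both are assumptions made for illustration, and seaborn's own defaults may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=200)  # synthetic data

# rule-of-thumb bandwidth (an assumption; other choices are possible)
h = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)

# evaluate the estimate on a grid that extends well past the data
grid = np.linspace(sample.min() - 5 * h, sample.max() + 5 * h, 2000)

# the estimate is the average of Gaussian bumps centered at each observation
z = (grid[:, None] - sample[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

# trapezoidal approximation of the area under the curve
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

A smaller `h` gives a wigglier curve and a larger `h` a smoother one, which parallels the role of the bin width in a histogram.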

Histograms, Density Plots with seaborn

#https://seaborn.pydata.org/generated/seaborn.displot.html
#displot is available directly from the top-level seaborn namespace
import seaborn as sns
#draw a histogram with seaborn, kind="hist" is default
sns.displot(x = college_df.Expend, kind = "hist") 
#draw a density plot with seaborn using kernel density estimation
sns.displot(x = college_df.Expend, kind = "kde")
#overlay histogram with a density plot 
sns.displot(x = college_df.Expend, kde = True);

Tools for Displaying a Single Variable with Respect to a Categorical Variable

#Each type of plot can be drawn separately for subsets of the data using hue mapping
#Investigate how expenditure changes with respect to school type
sns.displot(data = college_df, x = "Expend", hue = "Private", kind = "hist") 
sns.displot(data = college_df, x = "Expend", hue = "Private", kind = "kde");
#as you can see, instructional expenditure tends to be higher in private schools than in non-private ones

#alternatively, you may prefer stacked histograms
sns.displot(data = college_df, x = "Expend", hue = "Private", kind = "hist", multiple = "stack");

#faceting: draw one panel of expenditure per level of Private with the col argument
sns.displot(data = college_df, x = "Expend", col = "Private", kind = "hist") 
sns.displot(data = college_df, x = "Expend", col = "Private", kind = "kde");

Boxplots

Boxplots are used to describe the shape of a data distribution.
We can also use boxplots to compare the shape of the distribution across the levels of a categorical variable.
They also enable us to identify outliers: an observation is considered an outlier if it is either less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where IQR = Q3 - Q1.
This rule often flags too many points as outliers. Hence, sometimes only the points outside [Q1 - 3 IQR, Q3 + 3 IQR] are identified as outliers.
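The two outlier rules can be compared on a small example. The twelve values below are made up for illustration, and the quartiles use NumPy's default (linear interpolation) percentile method, which is one of several conventions.

```python
import numpy as np

# hypothetical data, chosen so that the two rules disagree
data = np.array([2, 4, 4, 5, 5, 6, 6, 7, 8, 9, 15, 40])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

def outside(values, k):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

flagged_1_5 = outside(data, 1.5)  # the usual boxplot rule
flagged_3 = outside(data, 3.0)    # the stricter rule

print(q1, q3, iqr)           # 4.75 8.25 3.5
print(flagged_1_5.tolist())  # [15, 40]
print(flagged_3.tolist())    # [40]
```

Here the value 15 is flagged by the 1.5 IQR rule but not by the 3 IQR rule.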

Using the College data set, we will create a new categorical variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

#attach a new column to the college df

import pandas as pd
college_df = pd.read_csv("https://www.statlearning.com/s/College.csv", index_col = 0)

#label every college "No" first, then switch to "Yes" where Top10perc exceeds 50
#use .loc with a boolean mask; chained indexing such as
#college_df.Elite[mask] = "Yes" may fail to modify the data frame
college_df['Elite'] = "No"
college_df.loc[college_df['Top10perc'] > 50, 'Elite'] = "Yes"

#alternatively, a bit longer way to add a new column to a data frame
#elite = []
#for i in college_df['Top10perc']:
#        if i > 50 :    elite.append('Yes')
#        else:          elite.append('No')
        
#college_df['Elite'] = elite 
college_df.head()

	Private	Apps	Accept	Enroll	Top10perc	Top25perc	F.Undergrad	P.Undergrad	Outstate	Room.Board	Books	Personal	PhD	Terminal	S.F.Ratio	perc.alumni	Expend	Grad.Rate	Elite
Abilene Christian University	Yes	1660	1232	721	23	52	2885	537	7440	3300	450	2200	70	78	18.1	12	7041	60	No
Adelphi University	Yes	2186	1924	512	16	29	2683	1227	12280	6450	750	1500	29	30	12.2	16	10527	56	No
Adrian College	Yes	1428	1097	336	22	50	1036	99	11250	3750	400	1165	53	66	12.9	30	8735	54	No
Agnes Scott College	Yes	417	349	137	60	89	510	63	12960	5450	450	875	92	97	7.7	37	19016	59	Yes
Alaska Pacific University	Yes	193	146	55	16	44	249	869	7560	4120	800	1500	76	72	11.9	2	10922	15	No
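As a hedged alternative, the same two-level binning can be sketched in one step with `numpy.where`. The frame `toy_df` below is a hypothetical stand-in holding only the five Top10perc values shown in the output above.

```python
import numpy as np
import pandas as pd

# Top10perc values of the first five colleges, typed in by hand
toy_df = pd.DataFrame({"Top10perc": [23, 16, 22, 60, 16]})

# "Yes" where the condition holds, "No" everywhere else
toy_df["Elite"] = np.where(toy_df["Top10perc"] > 50, "Yes", "No")
print(toy_df)
```

Only the fourth row exceeds 50, so it is the only one labelled "Yes", which agrees with the Elite column in the output above.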

#https://seaborn.pydata.org/generated/seaborn.boxplot.html
#compare the graduation rates with respect to school's elite status
sns.boxplot(data = college_df, x = "Elite", y = "Grad.Rate");
#vertical box-plot
#as you can see, graduation rates tend to be higher in elite schools than in non-elite ones
#there are a few outliers among the non-elite schools: some have a very high
#graduation rate, while others have a very low one

#this shows why column names should not contain a dot (.):
#college_df.Grad.Rate.max() will not work, because pandas looks for an attribute named Grad
#use bracket indexing instead
college_df["Grad.Rate"].max()
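If attribute-style access is preferred, one option is to rename the columns first. The sketch below uses a hypothetical one-column frame `toy_df` with the five Grad.Rate values from the `head()` output above; replacing the dot with an underscore is just one possible naming convention.

```python
import pandas as pd

# Grad.Rate values of the first five colleges, typed in by hand
toy_df = pd.DataFrame({"Grad.Rate": [60, 56, 54, 59, 15]})

# bracket indexing works with any column name
print(toy_df["Grad.Rate"].max())

# after replacing dots with underscores, attribute access works too
renamed = toy_df.rename(columns=lambda name: name.replace(".", "_"))
print(renamed.Grad_Rate.max())
```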

#produce side-by-side boxplots of Outstate versus Private.
#horizontal box-plot
sns.boxplot(data = college_df, x = "Outstate", y = "Private");
#as you can see, out-of-state tuition tends to be much higher in private schools.
#if you were born in Arizona but want to study at a private school, say in Boston,
#you are likely to pay a higher tuition fee.

Tools for Displaying Relationships Between Two Variables

Scatterplot

The most conventional way to display relationships between two variables is a scatterplot.
It shows the direction and strength of association between two variables.
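Direction and strength can also be summarized numerically with the Pearson correlation coefficient, whose sign gives the direction and whose magnitude gives the strength of a linear association. The five (x, y) pairs below are made up for illustration.

```python
import numpy as np

# hypothetical paired observations
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# np.corrcoef returns the 2 x 2 correlation matrix; the off-diagonal entry is r
r_pos = np.corrcoef(x, y)[0, 1]
# flipping the sign of y reverses the direction but keeps the strength
r_neg = np.corrcoef(x, -y)[0, 1]

print(round(r_pos, 3))  # 0.775
print(round(r_neg, 3))  # -0.775
```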

#https://seaborn.pydata.org/generated/seaborn.scatterplot.html
#get a scatter plot of out-of-state tuition fee and graduation rate
sns.scatterplot(data = college_df, x = "Outstate", y = "Grad.Rate");
#As the out-of-state tuition fee increases, the graduation rate tends to increase

#how does the graph above change with respect to school type?
sns.scatterplot(data = college_df, x = 'Outstate', y = "Grad.Rate", hue = 'Private');

Tools for Displaying More Than Two Variables

Scatterplot Matrix

A single scatterplot can place only two continuous variables on its axes.
A scatterplot matrix is one possible visualization of three or more continuous variables taken two at a time.
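"Taken two at a time" means one panel per unordered pair of variables. A small sketch with the standard library lists the pairs for three of the College variables:

```python
from itertools import combinations

variables = ["Outstate", "Grad.Rate", "Personal"]

# every unordered pair of distinct variables
pairs = list(combinations(variables, 2))
for x_name, y_name in pairs:
    print(f"{y_name} versus {x_name}")

print(len(pairs))  # 3 pairs for 3 variables, in general p * (p - 1) / 2
```

In a full scatterplot matrix each pair appears twice, once above and once below the diagonal, with the axes swapped.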

#https://seaborn.pydata.org/generated/seaborn.pairplot.html
#let's get a pairwise matrix plot of some variables.
#diag_kind = None leaves the diagonal empty; otherwise a histogram or density plot appears there.
sns.pairplot(college_df, vars = ['Outstate', 'Grad.Rate', 'Personal'], diag_kind = None, hue = 'Private');

#pairwise correlations of the same three variables
df = college_df[['Outstate', 'Grad.Rate', 'Personal']]
df.corr()

	Outstate	Grad.Rate	Personal
Outstate	1.000000	0.571290	-0.299087
Grad.Rate	0.571290	1.000000	-0.269344
Personal	-0.299087	-0.269344	1.000000

import session_info
session_info.show()

-----
matplotlib          3.4.3
numpy               1.21.2
pandas              1.3.3
seaborn             0.11.2
session_info        1.0.0
-----

Modules imported as dependencies

PIL                         8.3.2
anyio                       NA
appnope                     0.1.2
attr                        21.2.0
babel                       2.9.1
backcall                    0.2.0
beta_ufunc                  NA
binom_ufunc                 NA
brotli                      1.0.9
certifi                     2021.05.30
cffi                        1.14.6
chardet                     4.0.0
charset_normalizer          2.0.0
colorama                    0.4.4
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.4.1
decorator                   5.1.0
defusedxml                  0.7.1
entrypoints                 0.3
google                      NA
idna                        3.1
ipykernel                   6.4.1
ipython_genutils            0.2.0
ipywidgets                  7.6.5
jedi                        0.18.0
jinja2                      3.0.1
json5                       NA
jsonschema                  3.2.0
jupyter_server              1.11.0
jupyterlab_server           2.8.1
kiwisolver                  1.3.2
markupsafe                  2.0.1
matplotlib_inline           NA
mpl_toolkits                NA
nbclassic                   NA
nbformat                    5.1.3
nbinom_ufunc                NA
packaging                   21.3
parso                       0.8.2
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prometheus_client           NA
prompt_toolkit              3.0.20
ptyprocess                  0.7.0
pvectorc                    NA
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.4.1
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.10.0
pyparsing                   2.4.7
pyrsistent                  NA
pytz                        2021.1
requests                    2.26.0
scipy                       1.7.1
send2trash                  NA
six                         1.16.0
sniffio                     1.2.0
socks                       1.7.1
statsmodels                 0.13.2
storemagic                  NA
terminado                   0.12.1
tornado                     6.1
traitlets                   5.1.0
urllib3                     1.26.7
wcwidth                     0.2.5
websocket                   0.57.0
zmq                         22.3.0

-----
IPython             7.27.0
jupyter_client      7.0.3
jupyter_core        4.8.1
jupyterlab          3.1.12
notebook            6.4.4
-----
Python 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:59:00) [Clang 11.1.0 ]
macOS-10.15.7-x86_64-i386-64bit
-----
Session information updated at 2022-09-20 00:03