import warnings
'ignore') warnings.filterwarnings(
Lab 01
Multiple Linear Regression Analysis
Revisit- Advertising Data Example
In a study the association between advertising and sales of a particular product is being investigated. The advertising data set consists of sales
of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV
, radio
, and newspaper
.
#import required libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
#import the data set in the .csv file into your your current session
= pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col = 0)
advertise_df #index_col = column(s) to use as the row labels
advertise_df.head()
TV | radio | newspaper | sales | |
---|---|---|---|---|
1 | 230.1 | 37.8 | 69.2 | 22.1 |
2 | 44.5 | 39.3 | 45.1 | 10.4 |
3 | 17.2 | 45.9 | 69.3 | 9.3 |
4 | 151.5 | 41.3 | 58.5 | 18.5 |
5 | 180.8 | 10.8 | 58.4 | 12.9 |
# We will fit three individual SLR and one MLR
# Arrange the Y and X matrices for each case, respectively.
# Define the common response
= advertise_df.sales
Y # Arrange the first design matrix by including the intercept and TV variable
= sm.add_constant(advertise_df.TV)
X_tv #X_design_one
# Arrange the second design matrix by including the intercept and radio variable
= sm.add_constant(advertise_df.radio)
X_radio #X_design_two
# Arrange the third design matrix by including the intercept and newspaper variable
= sm.add_constant(advertise_df.newspaper)
X_news #X_design_three
# Arrange the matrix for MLR by including the intercept and TV, radio and newspaper variables
= advertise_df[["TV", "radio", "newspaper"]]
X = sm.add_constant(X) X_full
Simple Linear Regression Analysis
Last week, we have investigated the relationship between sales and TV advertising.
#https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
#class statsmodels.regression.linear_model.OLS
#https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit.html
#OLS.fit() method
#Describe model
= sm.OLS(Y, X_tv) #OLS class takes the data
model_tv #Fit model and return results object
= model_tv.fit()
results_tv #alternatively:sm.OLS(Y, X_tv).fit()
#https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html#statsmodels.regression.linear_model.RegressionResults
#fit method returns another class named RegressionResults
#hence results_tv object has its own class that is regression results
#of course, regression results class has its own fit method
#Based on the results, get predictions
= results_tv.predict(X_tv)
predictions_tv #of course, regression results class has its summary method
#Get the results summary
= results_tv.summary()
print_model_tv print(print_model_tv)
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.612
Model: OLS Adj. R-squared: 0.610
Method: Least Squares F-statistic: 312.1
Date: Mon, 10 Oct 2022 Prob (F-statistic): 1.47e-42
Time: 20:23:21 Log-Likelihood: -519.05
No. Observations: 200 AIC: 1042.
Df Residuals: 198 BIC: 1049.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 7.0326 0.458 15.360 0.000 6.130 7.935
TV 0.0475 0.003 17.668 0.000 0.042 0.053
==============================================================================
Omnibus: 0.531 Durbin-Watson: 1.935
Prob(Omnibus): 0.767 Jarque-Bera (JB): 0.669
Skew: -0.089 Prob(JB): 0.716
Kurtosis: 2.779 Cond. No. 338.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#as example, evaluate the loglik at OLS estimates. Yes, we can get the value in the tablo above.
#model_tv.loglike(params=[7.0326,0.0475])
#another example, rsquared is a property for results class. get the rsquare of the model
#results_tv.rsquared
#results_tv.mse_resid
#results_tv.aic
We can also investigate the relationship between sales and radio. The results in the following table shows that a $1,000 increase in spending on radio advertising is associated with an increase in sales of around 203 units.
#Describe model
= sm.OLS(Y, X_radio)
model_radio #Fit model and return results object
= model_radio.fit()
results_radio #Based on the results, get predictions
= results_radio.predict(X_radio)
predictions_radio #Get the results summary
= results_radio.summary()
print_model_radio print(print_model_radio)
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.332
Model: OLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 98.42
Date: Mon, 10 Oct 2022 Prob (F-statistic): 4.35e-19
Time: 20:23:21 Log-Likelihood: -573.34
No. Observations: 200 AIC: 1151.
Df Residuals: 198 BIC: 1157.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 9.3116 0.563 16.542 0.000 8.202 10.422
radio 0.2025 0.020 9.921 0.000 0.162 0.243
==============================================================================
Omnibus: 19.358 Durbin-Watson: 1.946
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.910
Skew: -0.764 Prob(JB): 1.75e-05
Kurtosis: 3.544 Cond. No. 51.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can also further investigate the relationship between sales and newspaper. The results in the following table shows that a $1,000 increase in spending on newspaper advertising is associated with an increase in sales of around 55 units.
#Describe model
= sm.OLS(Y, X_news)
model_news #Fit model and return results object
= model_news.fit()
results_news #Based on the results, get predictions
= results_news.predict(X_news)
predictions_news #Get the results summary
= results_news.summary()
print_model_news print(print_model_news)
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.052
Model: OLS Adj. R-squared: 0.047
Method: Least Squares F-statistic: 10.89
Date: Mon, 10 Oct 2022 Prob (F-statistic): 0.00115
Time: 20:23:21 Log-Likelihood: -608.34
No. Observations: 200 AIC: 1221.
Df Residuals: 198 BIC: 1227.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 12.3514 0.621 19.876 0.000 11.126 13.577
newspaper 0.0547 0.017 3.300 0.001 0.022 0.087
==============================================================================
Omnibus: 6.231 Durbin-Watson: 1.983
Prob(Omnibus): 0.044 Jarque-Bera (JB): 5.483
Skew: 0.330 Prob(JB): 0.0645
Kurtosis: 2.527 Cond. No. 64.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Multiple Linear Regression Analysis
- One problem associated with fitting seperate simple linear regression model for each predictor is that it is not clear that how to make a single prediction of sales given the three advertising media budgets.
- On the other hand, each of the three regression equations ignores the other two media forming estimates for the regression coefficients.
- A better approach would be to fit a multiple linear regression to sales with three predictors such as:
\[sales_i = \beta_0 + \beta_1 *TV_i + \beta_2 * radio_i + \beta_3 *newspaper_i + \epsilon_i \quad\] for \(i=1,2,\ldots,200\), where \(\epsilon_i \sim N(0,\sigma^2)\).
#Describe model
= sm.OLS(Y, X_full)
model_full #Fit model and return results object
= model_full.fit()
results_full #Based on the results, get predictions
= results_full.predict(X_full)
predictions_full #Get the results summary
= results_full.summary()
print_model_full print(print_model_full)
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Mon, 10 Oct 2022 Prob (F-statistic): 1.58e-96
Time: 20:23:21 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
radio 0.1885 0.009 21.893 0.000 0.172 0.206
newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation of the output
The results above shows us the multiple regression coeffcient estimates when TV, radio, and newspaper advertising budgets are used to predict product sales.
As an example, the interpreation of \(\hat{\beta}_2=0.1885\) is follows: for a given amount of TV and newspaper advertising, spending an additional $ $1,000 $ on radio advertising is associated with approximately 189 units of additional sales.
The coefficient estimates \(\hat{\beta}_1\) and \(\hat{\beta}_3\) can be interpreted in a similar way.
The eye-catching point is that the coefficient of newspaper in SLR is 0.0547 (P-value close to 0), whereas the coefficient of newspaper in MLR is -0.0010 (P-value = 0.860).
This is a very good example that the coefficient of a variable in a SLR when other predictors are omitted from the model can be quite different in a MLR when other predictors included into the model !!!
One more time: In SLR, the slope term for newspaper represents the average increase in product sales associated with a $ $1,000 $ increase in newspaper advertising, ignoring TV and radio. In MLR, the slope term for newspaper represents the average increase in product sales associated with increasing newspaper spending by $ $1,000$ while holding TV and radio fixed.
The correlation between newspaper and radio is around 0.35. The correlation matrix result may imply that: in markets where we spend more on radio our sales will tend to be higher, we also tend to spend more on newspaper advertising in those same markets. Hence, in SLR which only examines sales versus newspaper, we will observe that higher values of newspaper tend to associated with higher values of sales, even though newspaper advertising is not directly associated with sales as MLR shows. So newspaper advertising is a surrogate for radio advertising: newspaper gets “credit” for the association between radio on sales.
advertise_df.corr()
TV | radio | newspaper | sales | |
---|---|---|---|---|
TV | 1.000000 | 0.054809 | 0.056648 | 0.782224 |
radio | 0.054809 | 1.000000 | 0.354104 | 0.576223 |
newspaper | 0.056648 | 0.354104 | 1.000000 | 0.228299 |
sales | 0.782224 | 0.576223 | 0.228299 | 1.000000 |
- A side note: in the results of MLR, we can see that loglikelihood is -386.18. Once we know the loglikelihood we can also calculate AIC and BIC values (they are already available in the output above).
- If we compare the AIC, BIC values of all four models above, we can see that MLR is better than the others (the lower AIC, BIC, the better the model is).
= -386.18
loglik = -2*loglik + 2*4 # four is the number of regression parameters
AIC = -2*loglik + 4* np.log(200) BIC
#model_full.loglike(params=[2.9389,0.0458,0.1885,-0.0010])
#results_full.aic
#results_full.bic
On the other hand, let’s test whether we should keep \(X_2\) (radio) and \(X_3\) (newspaper) in the full model or not. So, our reduced model is the one with \(X_1\) (TV) only.
\[\begin{eqnarray} H_{0}:\beta_2=\beta_3=0 \\ \nonumber \end{eqnarray}\]
Let’s carry out an F test over full model.
0:2] #f_value, p value results_full.compare_f_test(results_tv)[
(272.0406768057634, 2.8294869157010127e-57)
P-value is close to zero, indicating that no at least one of the variables \(X_2\) (radio) and \(X_3\) (newspaper) should be in the model.
Check it with out formula since there is no example on the web:
\[\begin{equation} F=\frac{(RSS_0-RSS)/q}{RSS/(n-p-1)} \sim F_{q,(n-p-1)}. \nonumber \end{equation}\]
=sum(results_tv.resid**2)
RSSo=sum(results_full.resid**2)
RSS= (RSSo-RSS)/2 #q=2
num =RSS/(200-4) #p=3, n=200
denom = num /denom
F F
272.04067680576344
Reference
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. New York: Springer.
import session_info
session_info.show()
Click to view session information
----- numpy 1.21.2 pandas 1.3.3 session_info 1.0.0 statsmodels 0.13.2 -----
Click to view modules imported as dependencies
anyio NA appnope 0.1.2 attr 21.2.0 babel 2.9.1 backcall 0.2.0 beta_ufunc NA binom_ufunc NA brotli 1.0.9 certifi 2021.05.30 chardet 4.0.0 charset_normalizer 2.0.0 colorama 0.4.4 cython_runtime NA dateutil 2.8.2 debugpy 1.4.1 decorator 5.1.0 entrypoints 0.3 google NA idna 3.1 ipykernel 6.4.1 ipython_genutils 0.2.0 jedi 0.18.0 jinja2 3.0.1 joblib 1.1.0 json5 NA jsonschema 3.2.0 jupyter_server 1.11.0 jupyterlab_server 2.8.1 lz4 4.0.0 markupsafe 2.0.1 mpl_toolkits NA nbclassic NA nbformat 5.1.3 nbinom_ufunc NA packaging 21.3 parso 0.8.2 patsy 0.5.2 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prometheus_client NA prompt_toolkit 3.0.20 ptyprocess 0.7.0 pvectorc NA pydev_ipython NA pydevconsole NA pydevd 2.4.1 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.10.0 pyrsistent NA pytz 2021.1 requests 2.26.0 scipy 1.7.1 send2trash NA six 1.16.0 sniffio 1.2.0 socks 1.7.1 storemagic NA terminado 0.12.1 tornado 6.1 traitlets 5.1.0 typing_extensions NA uritemplate 4.1.1 urllib3 1.26.7 wcwidth 0.2.5 websocket 0.57.0 zmq 22.3.0
----- IPython 7.27.0 jupyter_client 7.0.3 jupyter_core 4.8.1 jupyterlab 3.1.12 notebook 6.4.4 ----- Python 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:59:00) [Clang 11.1.0 ] macOS-10.15.7-x86_64-i386-64bit ----- Session information updated at 2022-10-10 20:23