```
import warnings
warnings.filterwarnings('ignore')
```

# Lab 01

Multiple Linear Regression Analysis

## Revisit- Advertising Data Example

In this study, the association between advertising and sales of a particular product is investigated. The advertising data set consists of the `sales` of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: `TV`, `radio`, and `newspaper`.

```
#import required libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

```
# import the data set from the .csv file into your current session
# index_col = column(s) to use as the row labels
advertise_df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
advertise_df.head()
```

|   | TV    | radio | newspaper | sales |
|---|-------|-------|-----------|-------|
| 1 | 230.1 | 37.8  | 69.2      | 22.1  |
| 2 | 44.5  | 39.3  | 45.1      | 10.4  |
| 3 | 17.2  | 45.9  | 69.3      | 9.3   |
| 4 | 151.5 | 41.3  | 58.5      | 18.5  |
| 5 | 180.8 | 10.8  | 58.4      | 12.9  |

```
# We will fit three individual SLR models and one MLR model.
# Arrange the Y and X matrices for each case, respectively.

# Define the common response
Y = advertise_df.sales

# Arrange the first design matrix by including the intercept and the TV variable
X_tv = sm.add_constant(advertise_df.TV)

# Arrange the second design matrix by including the intercept and the radio variable
X_radio = sm.add_constant(advertise_df.radio)

# Arrange the third design matrix by including the intercept and the newspaper variable
X_news = sm.add_constant(advertise_df.newspaper)

# Arrange the matrix for MLR by including the intercept and the TV, radio and newspaper variables
X = advertise_df[["TV", "radio", "newspaper"]]
X_full = sm.add_constant(X)
```

## Simple Linear Regression Analysis

Last week, we investigated the **relationship between sales and TV advertising**.

```
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
# class statsmodels.regression.linear_model.OLS
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit.html
# OLS.fit() method

# Describe the model: the OLS class takes the data
model_tv = sm.OLS(Y, X_tv)

# Fit the model and return a results object
results_tv = model_tv.fit()  # alternatively: sm.OLS(Y, X_tv).fit()
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html
# fit() returns an object of another class, named RegressionResults,
# hence results_tv has its own class with its own methods

# Based on the results, get predictions
predictions_tv = results_tv.predict(X_tv)

# The RegressionResults class also has a summary method
print_model_tv = results_tv.summary()
print(print_model_tv)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.612
Model: OLS Adj. R-squared: 0.610
Method: Least Squares F-statistic: 312.1
Date: Mon, 10 Oct 2022 Prob (F-statistic): 1.47e-42
Time: 20:23:21 Log-Likelihood: -519.05
No. Observations: 200 AIC: 1042.
Df Residuals: 198 BIC: 1049.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 7.0326 0.458 15.360 0.000 6.130 7.935
TV 0.0475 0.003 17.668 0.000 0.042 0.053
==============================================================================
Omnibus: 0.531 Durbin-Watson: 1.935
Prob(Omnibus): 0.767 Jarque-Bera (JB): 0.669
Skew: -0.089 Prob(JB): 0.716
Kurtosis: 2.779 Cond. No. 338.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

```
# As an example, evaluate the log-likelihood at the OLS estimates.
# This reproduces the Log-Likelihood value in the table above.
# model_tv.loglike(params=[7.0326, 0.0475])
```

```
# Another example: rsquared is a property of the results class.
# Get the R-squared, residual mean square, and AIC of the model:
# results_tv.rsquared
# results_tv.mse_resid
# results_tv.aic
```

We can also investigate the **relationship between sales and radio**. The results in the following table show that a \$1,000 increase in spending on radio advertising is associated with an increase in sales of around 203 units.

```
# Describe the model
model_radio = sm.OLS(Y, X_radio)

# Fit the model and return a results object
results_radio = model_radio.fit()

# Based on the results, get predictions
predictions_radio = results_radio.predict(X_radio)

# Get the results summary
print_model_radio = results_radio.summary()
print(print_model_radio)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.332
Model: OLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 98.42
Date: Mon, 10 Oct 2022 Prob (F-statistic): 4.35e-19
Time: 20:23:21 Log-Likelihood: -573.34
No. Observations: 200 AIC: 1151.
Df Residuals: 198 BIC: 1157.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 9.3116 0.563 16.542 0.000 8.202 10.422
radio 0.2025 0.020 9.921 0.000 0.162 0.243
==============================================================================
Omnibus: 19.358 Durbin-Watson: 1.946
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.910
Skew: -0.764 Prob(JB): 1.75e-05
Kurtosis: 3.544 Cond. No. 51.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

We can further investigate the **relationship between sales and newspaper**. The results in the following table show that a \$1,000 increase in spending on newspaper advertising is associated with an increase in sales of around 55 units.

```
# Describe the model
model_news = sm.OLS(Y, X_news)

# Fit the model and return a results object
results_news = model_news.fit()

# Based on the results, get predictions
predictions_news = results_news.predict(X_news)

# Get the results summary
print_model_news = results_news.summary()
print(print_model_news)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.052
Model: OLS Adj. R-squared: 0.047
Method: Least Squares F-statistic: 10.89
Date: Mon, 10 Oct 2022 Prob (F-statistic): 0.00115
Time: 20:23:21 Log-Likelihood: -608.34
No. Observations: 200 AIC: 1221.
Df Residuals: 198 BIC: 1227.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 12.3514 0.621 19.876 0.000 11.126 13.577
newspaper 0.0547 0.017 3.300 0.001 0.022 0.087
==============================================================================
Omnibus: 6.231 Durbin-Watson: 1.983
Prob(Omnibus): 0.044 Jarque-Bera (JB): 5.483
Skew: 0.330 Prob(JB): 0.0645
Kurtosis: 2.527 Cond. No. 64.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

## Multiple Linear Regression Analysis

- One problem associated with fitting a separate simple linear regression model for each predictor is that it is not clear how to **make a single prediction of sales given the three advertising media budgets**.
- On the other hand, each of the three regression equations **ignores the other two media** when forming estimates for the regression coefficients.
- A better approach is to fit a multiple linear regression of sales on the three predictors:

\[sales_i = \beta_0 + \beta_1 *TV_i + \beta_2 * radio_i + \beta_3 *newspaper_i + \epsilon_i \quad\] for \(i=1,2,\ldots,200\), where \(\epsilon_i \sim N(0,\sigma^2)\).

```
# Describe the model
model_full = sm.OLS(Y, X_full)

# Fit the model and return a results object
results_full = model_full.fit()

# Based on the results, get predictions
predictions_full = results_full.predict(X_full)

# Get the results summary
print_model_full = results_full.summary()
print(print_model_full)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Mon, 10 Oct 2022 Prob (F-statistic): 1.58e-96
Time: 20:23:21 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
radio 0.1885 0.009 21.893 0.000 0.172 0.206
newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

### Interpretation of the output

The results above show the multiple regression coefficient estimates when the TV, radio, and newspaper advertising budgets are used to predict product sales.

As an example, the interpretation of \(\hat{\beta}_2=0.1885\) is as follows: for a given amount of TV and newspaper advertising, spending an additional \$1,000 on radio advertising is associated with approximately 189 units of additional sales.

The coefficient estimates \(\hat{\beta}_1\) and \(\hat{\beta}_3\) can be interpreted in a similar way.

The **eye-catching point** is that the coefficient of newspaper in the SLR is 0.0547 (p-value close to 0), whereas the coefficient of newspaper in the MLR is -0.0010 (p-value = 0.860). This is a **very good example** of how the coefficient of a variable in an **SLR, where the other predictors are omitted from the model,** can be quite different from its coefficient in an **MLR, where the other predictors are included in the model**! One more time: in the SLR, the slope term for newspaper represents the average increase in product sales associated with a \$1,000 increase in newspaper advertising, ignoring TV and radio. In the MLR, the slope term for newspaper represents the average increase in product sales associated with increasing newspaper spending by \$1,000 while holding TV and radio fixed.

The correlation between newspaper and radio is around 0.35. The correlation matrix may imply the following: in markets where we spend more on radio, sales tend to be higher, and we also tend to spend more on newspaper advertising in those same markets. Hence, in an SLR that only examines sales versus newspaper, we observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales, as the MLR shows. So newspaper advertising is a surrogate for radio advertising: newspaper gets “credit” for the association between radio and sales.

```
advertise_df.corr()
```

|           | TV       | radio    | newspaper | sales    |
|-----------|----------|----------|-----------|----------|
| TV        | 1.000000 | 0.054809 | 0.056648  | 0.782224 |
| radio     | 0.054809 | 1.000000 | 0.354104  | 0.576223 |
| newspaper | 0.056648 | 0.354104 | 1.000000  | 0.228299 |
| sales     | 0.782224 | 0.576223 | 0.228299  | 1.000000 |

- A side note: in the MLR results we can see that the **log-likelihood** is -386.18. Once we know the log-likelihood, we can also calculate the AIC and BIC values (they are already available in the output above).
- If we compare the AIC and BIC values of all four models above, we can see that the **MLR is better** than the others (the lower the AIC and BIC, the better the model).

```
loglik = -386.18
AIC = -2*loglik + 2*4            # four is the number of regression parameters
BIC = -2*loglik + 4*np.log(200)  # n = 200
```

```
#model_full.loglike(params=[2.9389,0.0458,0.1885,-0.0010])
#results_full.aic
#results_full.bic
```

On the other hand, let’s test whether we should keep \(X_2\) (radio) and \(X_3\) (newspaper) in the full model or not. So, our reduced model is the one with \(X_1\) (TV) only.

\[H_{0}: \beta_2 = \beta_3 = 0\]

Let’s carry out an F test comparing the reduced model (TV only) against the full model.

```
results_full.compare_f_test(results_tv)[0:2]  # f_value, p_value
```

`(272.0406768057634, 2.8294869157010127e-57)`

The p-value is close to zero, indicating that at least one of the variables \(X_2\) (radio) and \(X_3\) (newspaper) should be in the model.

Let’s check it with our own formula:

\[\begin{equation} F=\frac{(RSS_0-RSS)/q}{RSS/(n-p-1)} \sim F_{q,(n-p-1)}. \nonumber \end{equation}\]

```
RSSo = sum(results_tv.resid**2)    # RSS of the reduced model
RSS = sum(results_full.resid**2)   # RSS of the full model
num = (RSSo - RSS)/2               # q = 2
denom = RSS/(200 - 4)              # n = 200, p = 3
F = num/denom
F
```

`272.04067680576344`
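The p-value reported by `compare_f_test` can be recovered from this F statistic with the F distribution’s survival function (a sketch using `scipy.stats`):

```python
from scipy import stats

F = 272.04067680576344
q, n, p = 2, 200, 3  # restrictions, sample size, predictors in the full model
p_value = stats.f.sf(F, q, n - p - 1)
print(p_value)  # matches the compare_f_test p-value above
```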

## Reference

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. New York: Springer.

```
import session_info
session_info.show()
```

```
-----
numpy               1.21.2
pandas              1.3.3
session_info        1.0.0
statsmodels         0.13.2
-----
```


```
-----
IPython             7.27.0
jupyter_client      7.0.3
jupyter_core        4.8.1
jupyterlab          3.1.12
notebook            6.4.4
-----
Python 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:59:00) [Clang 11.1.0]
macOS-10.15.7-x86_64-i386-64bit
-----
Session information updated at 2022-10-10 20:23
```