```
import warnings
warnings.filterwarnings('ignore')
```

# Lab 01

Multiple Linear Regression Analysis

## Revisit- Advertising Data Example

In this study, the association between advertising and sales of a particular product is investigated. The advertising data set consists of the `sales` of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: `TV`, `radio`, and `newspaper`.

```
#import required libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
```

```
# import the data set from the .csv file into your current session
# index_col = column(s) to use as the row labels
advertise_df = pd.read_csv("https://www.statlearning.com/s/Advertising.csv", index_col=0)
advertise_df.head()
```

|   | TV    | radio | newspaper | sales |
|---|-------|-------|-----------|-------|
| 1 | 230.1 | 37.8  | 69.2      | 22.1  |
| 2 | 44.5  | 39.3  | 45.1      | 10.4  |
| 3 | 17.2  | 45.9  | 69.3      | 9.3   |
| 4 | 151.5 | 41.3  | 58.5      | 18.5  |
| 5 | 180.8 | 10.8  | 58.4      | 12.9  |

```
# We will fit three individual SLR models and one MLR model.
# Arrange the Y and X matrices for each case, respectively.

# Define the common response
Y = advertise_df.sales

# Arrange the first design matrix by including the intercept and the TV variable
X_tv = sm.add_constant(advertise_df.TV)

# Arrange the second design matrix by including the intercept and the radio variable
X_radio = sm.add_constant(advertise_df.radio)

# Arrange the third design matrix by including the intercept and the newspaper variable
X_news = sm.add_constant(advertise_df.newspaper)

# Arrange the matrix for MLR by including the intercept and the TV, radio and newspaper variables
X = advertise_df[["TV", "radio", "newspaper"]]
X_full = sm.add_constant(X)
```

## Simple Linear Regression Analysis

Last week, we investigated the **relationship between sales and TV advertising**.

```
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
# class statsmodels.regression.linear_model.OLS
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.fit.html
# OLS.fit() method

# Describe the model: the OLS class takes the data
model_tv = sm.OLS(Y, X_tv)

# Fit the model and return a results object
results_tv = model_tv.fit()  # alternatively: sm.OLS(Y, X_tv).fit()
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html
# fit() returns an object of another class, named RegressionResults,
# hence results_tv has its own class with its own methods

# Based on the results, get predictions
predictions_tv = results_tv.predict(X_tv)

# The RegressionResults class also has a summary method
print_model_tv = results_tv.summary()
print(print_model_tv)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.612
Model: OLS Adj. R-squared: 0.610
Method: Least Squares F-statistic: 312.1
Date: Mon, 10 Oct 2022 Prob (F-statistic): 1.47e-42
Time: 20:23:21 Log-Likelihood: -519.05
No. Observations: 200 AIC: 1042.
Df Residuals: 198 BIC: 1049.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 7.0326 0.458 15.360 0.000 6.130 7.935
TV 0.0475 0.003 17.668 0.000 0.042 0.053
==============================================================================
Omnibus: 0.531 Durbin-Watson: 1.935
Prob(Omnibus): 0.767 Jarque-Bera (JB): 0.669
Skew: -0.089 Prob(JB): 0.716
Kurtosis: 2.779 Cond. No. 338.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

```
# As an example, evaluate the log-likelihood at the OLS estimates.
# This reproduces the Log-Likelihood value in the table above.
# model_tv.loglike(params=[7.0326, 0.0475])
```

```
# Another example: rsquared is a property of the results class.
# Get the R-squared, residual mean square, and AIC of the model:
# results_tv.rsquared
# results_tv.mse_resid
# results_tv.aic
```

We can also investigate the **relationship between sales and radio**. The results in the following table show that a \$1,000 increase in spending on radio advertising is associated with an increase in sales of around 203 units.

```
# Describe the model
model_radio = sm.OLS(Y, X_radio)

# Fit the model and return a results object
results_radio = model_radio.fit()

# Based on the results, get predictions
predictions_radio = results_radio.predict(X_radio)

# Get the results summary
print_model_radio = results_radio.summary()
print(print_model_radio)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.332
Model: OLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 98.42
Date: Mon, 10 Oct 2022 Prob (F-statistic): 4.35e-19
Time: 20:23:21 Log-Likelihood: -573.34
No. Observations: 200 AIC: 1151.
Df Residuals: 198 BIC: 1157.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 9.3116 0.563 16.542 0.000 8.202 10.422
radio 0.2025 0.020 9.921 0.000 0.162 0.243
==============================================================================
Omnibus: 19.358 Durbin-Watson: 1.946
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.910
Skew: -0.764 Prob(JB): 1.75e-05
Kurtosis: 3.544 Cond. No. 51.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

We can further investigate the **relationship between sales and newspaper**. The results in the following table show that a \$1,000 increase in spending on newspaper advertising is associated with an increase in sales of around 55 units.

```
# Describe the model
model_news = sm.OLS(Y, X_news)

# Fit the model and return a results object
results_news = model_news.fit()

# Based on the results, get predictions
predictions_news = results_news.predict(X_news)

# Get the results summary
print_model_news = results_news.summary()
print(print_model_news)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.052
Model: OLS Adj. R-squared: 0.047
Method: Least Squares F-statistic: 10.89
Date: Mon, 10 Oct 2022 Prob (F-statistic): 0.00115
Time: 20:23:21 Log-Likelihood: -608.34
No. Observations: 200 AIC: 1221.
Df Residuals: 198 BIC: 1227.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 12.3514 0.621 19.876 0.000 11.126 13.577
newspaper 0.0547 0.017 3.300 0.001 0.022 0.087
==============================================================================
Omnibus: 6.231 Durbin-Watson: 1.983
Prob(Omnibus): 0.044 Jarque-Bera (JB): 5.483
Skew: 0.330 Prob(JB): 0.0645
Kurtosis: 2.527 Cond. No. 64.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

## Multiple Linear Regression Analysis

- One problem associated with fitting a separate simple linear regression model for each predictor is that it is not clear how to **make a single prediction of sales given the three advertising media budgets**.
- On the other hand, each of the three regression equations **ignores the other two media** when forming estimates for the regression coefficients.
- A better approach is to fit a multiple linear regression of sales on the three predictors:

\[sales_i = \beta_0 + \beta_1 *TV_i + \beta_2 * radio_i + \beta_3 *newspaper_i + \epsilon_i \quad\] for \(i=1,2,\ldots,200\), where \(\epsilon_i \sim N(0,\sigma^2)\).

```
# Describe the model
model_full = sm.OLS(Y, X_full)

# Fit the model and return a results object
results_full = model_full.fit()

# Based on the results, get predictions
predictions_full = results_full.predict(X_full)

# Get the results summary
print_model_full = results_full.summary()
print(print_model_full)
```

```
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Mon, 10 Oct 2022 Prob (F-statistic): 1.58e-96
Time: 20:23:21 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
radio 0.1885 0.009 21.893 0.000 0.172 0.206
newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
==============================================================================
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

### Interpretation of the output

The results above show the multiple regression coefficient estimates when the TV, radio, and newspaper advertising budgets are used to predict product sales.

As an example, the interpretation of \(\hat{\beta}_2=0.1885\) is as follows: for a given amount of TV and newspaper advertising, spending an additional \$1,000 on radio advertising is associated with approximately 189 units of additional sales.

The coefficient estimates \(\hat{\beta}_1\) and \(\hat{\beta}_3\) can be interpreted in a similar way.

The **eye-catching point** is that the coefficient of newspaper in the SLR is 0.0547 (p-value close to 0), whereas the coefficient of newspaper in the MLR is -0.0010 (p-value = 0.860). This is a **very good example** of how the coefficient of a variable in an **SLR, where the other predictors are omitted from the model,** can be quite different from its coefficient in an **MLR, where the other predictors are included in the model**! One more time: in the SLR, the slope term for newspaper represents the average increase in product sales associated with a \$1,000 increase in newspaper advertising, ignoring TV and radio. In the MLR, the slope term for newspaper represents the average increase in product sales associated with increasing newspaper spending by \$1,000 while holding TV and radio fixed.

The correlation between newspaper and radio is around 0.35. The correlation matrix may imply the following: in markets where we spend more on radio, sales tend to be higher, and we also tend to spend more on newspaper advertising in those same markets. Hence, in an SLR that only examines sales versus newspaper, we observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising is not directly associated with sales, as the MLR shows. So newspaper advertising is a surrogate for radio advertising: newspaper gets “credit” for the association between radio and sales.

```
advertise_df.corr()
```

|           | TV       | radio    | newspaper | sales    |
|-----------|----------|----------|-----------|----------|
| TV        | 1.000000 | 0.054809 | 0.056648  | 0.782224 |
| radio     | 0.054809 | 1.000000 | 0.354104  | 0.576223 |
| newspaper | 0.056648 | 0.354104 | 1.000000  | 0.228299 |
| sales     | 0.782224 | 0.576223 | 0.228299  | 1.000000 |

- A side note: in the MLR results we can see that the **log-likelihood** is -386.18. Once we know the log-likelihood, we can also calculate the AIC and BIC values (they are already available in the output above).
- If we compare the AIC and BIC values of all four models above, we can see that the **MLR is better** than the others (the lower the AIC and BIC, the better the model).

```
loglik = -386.18
AIC = -2*loglik + 2*4            # four is the number of regression parameters
BIC = -2*loglik + 4*np.log(200)  # n = 200
```

```
#model_full.loglike(params=[2.9389,0.0458,0.1885,-0.0010])
#results_full.aic
#results_full.bic
```

On the other hand, let’s test whether we should keep \(X_2\) (radio) and \(X_3\) (newspaper) in the full model or not. So, our reduced model is the one with \(X_1\) (TV) only.

\[H_{0}: \beta_2 = \beta_3 = 0\]

Let’s carry out an F test comparing the reduced model (TV only) against the full model.

```
results_full.compare_f_test(results_tv)[0:2]  # f_value, p_value
```

`(272.0406768057634, 2.8294869157010127e-57)`

The p-value is close to zero, indicating that at least one of the variables \(X_2\) (radio) and \(X_3\) (newspaper) should be in the model.

Let’s check it with our own formula:

\[\begin{equation} F=\frac{(RSS_0-RSS)/q}{RSS/(n-p-1)} \sim F_{q,(n-p-1)}. \nonumber \end{equation}\]

```
RSSo = sum(results_tv.resid**2)    # RSS of the reduced model
RSS = sum(results_full.resid**2)   # RSS of the full model
num = (RSSo - RSS)/2               # q = 2
denom = RSS/(200 - 4)              # n = 200, p = 3
F = num/denom
F
```

`272.04067680576344`
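The p-value reported by `compare_f_test` can be recovered from this F statistic with the F distribution’s survival function (a sketch using `scipy.stats`):

```python
from scipy import stats

F = 272.04067680576344
q, n, p = 2, 200, 3  # restrictions, sample size, predictors in the full model
p_value = stats.f.sf(F, q, n - p - 1)
print(p_value)  # matches the compare_f_test p-value above
```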

## Reference

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. New York: Springer.

```
import session_info
session_info.show()
```

```
-----
numpy               1.21.2
pandas              1.3.3
session_info        1.0.0
statsmodels         0.13.2
-----
```


```
-----
IPython             7.27.0
jupyter_client      7.0.3
jupyter_core        4.8.1
jupyterlab          3.1.12
notebook            6.4.4
-----
Python 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:59:00) [Clang 11.1.0]
macOS-10.15.7-x86_64-i386-64bit
-----
Session information updated at 2022-10-10 20:23
```