Table of Contents
Introduction
Phase 1 Summary
A brief yet complete and accurate summary of the work conducted for your Phase 1 report and how it relates to your Phase 2 report.
Important Phase 2 Note: Please do NOT include your Phase 1 report or its contents with your Phase 2 submission. You may, however, make changes to your Phase 1 tasks if you need to; in that case, include ONLY these changes in your Phase 2 report, with an explanation of why they were made.
Report Overview
A complete and accurate overview of the contents of your Phase 2 report. Clarification: A Table of Contents is not a report overview.
Overview of Methodology
A detailed, complete, and accurate overview of your statistical modelling methodology (which is multiple linear regression). More specifically, in this subsection, you will provide a summary of your "Statistical Modelling" section below.
Statistical Modelling
(Statistical Modelling Section: Details of assumptions check, model selection, plots of residuals, and technical analysis of regression results.)
NOTE: The second half of this regression case study ("Statistical Modeling and Performance Evaluation" Section) will be very helpful for this section.
Full Model Overview
Overview of your full model, including the variables and terms you are using in your model.
Module Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
df = pd.read_csv('Phase2_Group???.csv')
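The feature selection code later in this template expects three names to exist already: `data_encoded`, `formula_string_encoded`, and `model_full_fitted`. The template does not show how they are created, so the following is only a minimal sketch of one possible way to produce them. The data frame `demo` and its columns (`x1`, `x2`, `group`, `y`) are hypothetical stand-ins for your own cleaned data, and the sketch assumes every encoded column name is a valid Python identifier (no spaces or special characters), which patsy formulas require.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data; replace with your own cleaned data frame.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
demo["y"] = 1.0 + 2.0 * demo["x1"] + rng.normal(scale=0.5, size=n)

# One-hot encode categorical columns; dropping the first level avoids
# perfect collinearity with the intercept.
data_encoded = pd.get_dummies(demo, drop_first=True, dtype=int)

# Build the formula from every column except the response.
predictors = [col for col in data_encoded.columns if col != "y"]
formula_string_encoded = "y ~ " + " + ".join(predictors)

# Fit the full model with ordinary least squares.
model_full_fitted = smf.ols(formula=formula_string_encoded, data=data_encoded).fit()

print(formula_string_encoded)
print(model_full_fitted.summary())
```

Interaction terms, if you need them, can be added to the formula string with the `:` operator (for example `x1:x2`).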
Full Model Diagnostic Checks
You need to check whether there are indications of violations of the regression assumptions for the full model.
Feature Selection
You can use the code below to perform backward feature selection using p-values (credit).
## create the patsy model description from formula
patsy_description = patsy.ModelDesc.from_formula(formula_string_encoded)

# initialize feature-selected fit to full model
linreg_fit = model_full_fitted

# do backwards elimination using p-values
p_val_cutoff = 0.05

## WARNING 1: The code below assumes that the Intercept term is present in the model.
## WARNING 2: It will work only with main effects and two-way interactions, if any.

print('\nPerforming backwards feature selection using p-values:')

while True:

    # uncomment the line below if you would like to see the regression summary
    # in each step:
    ### print(linreg_fit.summary())

    # find the term with the largest p-value (excluding the Intercept)
    pval_series = linreg_fit.pvalues.drop(labels='Intercept')
    pval_series = pval_series.sort_values(ascending=False)
    term = pval_series.index[0]
    # use positional access: integer keys on a label-indexed Series are deprecated
    pval = pval_series.iloc[0]
    # stop once every remaining term is significant at the cutoff
    if (pval < p_val_cutoff):
        break
    term_components = term.split(':')
    print(f'\nRemoving term "{term}" with p-value {pval:.4}')
    if (len(term_components) == 1):  ## this is a main effect term
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0])]))
    else:  ## this is an interaction term
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0]),
                                                          patsy.EvalFactor(term_components[1])]))

    # refit the model without the removed term
    linreg_fit = smf.ols(formula=patsy_description, data=data_encoded).fit()

###
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula=patsy_description, data=data_encoded).fit()
###

#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")
Reduced Model Overview
Overview of your reduced model, including the variables and terms you are using in your model.
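When describing the reduced model, it may help to show how it compares with the full model. The sketch below is one possible approach, not something the template requires: it uses a partial F-test (`compare_f_test`) together with adjusted R-squared and AIC. The data and the two formulas are hypothetical; in your report the two fits would be your own full and reduced models, and the reduced model must be nested in the full one for the F-test to be meaningful.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data: only x1 truly affects y; x2 and x3 are noise.
rng = np.random.default_rng(2)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
demo["y"] = 1.0 + 2.0 * demo["x1"] + rng.normal(size=200)

full = smf.ols("y ~ x1 + x2 + x3", data=demo).fit()
reduced = smf.ols("y ~ x1", data=demo).fit()  # nested in the full model

# Partial F-test. H0: the terms dropped from the full model all have zero
# coefficients. A large p-value gives no evidence against the reduced model.
f_stat, p_value, df_diff = full.compare_f_test(reduced)

# Side-by-side summary of the two fits.
comparison = pd.DataFrame(
    {
        "n_terms": [len(full.params), len(reduced.params)],
        "adj_r_squared": [full.rsquared_adj, reduced.rsquared_adj],
        "aic": [full.aic, reduced.aic],
    },
    index=["full", "reduced"],
)

print(comparison)
print(f"Partial F-test: F = {f_stat:.3f}, p-value = {p_value:.4f}, "
      f"terms dropped = {int(df_diff)}")
```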
Reduced Model Diagnostic Checks
You need to check whether there are indications of violations of the regression assumptions for the reduced model.
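Formal tests can complement a visual inspection of the residuals. The following sketch, which assumes hypothetical synthetic data in place of your own reduced model, runs three commonly used checks: the Breusch-Pagan test for constant variance, the Durbin-Watson statistic for first-order autocorrelation, and the Shapiro-Wilk test for normality. Treat the results as supporting evidence rather than a verdict, since with large samples these tests can flag departures that are too small to matter in practice.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Hypothetical stand-in data and model; use your own reduced model instead.
rng = np.random.default_rng(3)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["y"] = 0.5 + 1.2 * demo["x1"] + rng.normal(size=200)
fit = smf.ols("y ~ x1 + x2", data=demo).fit()

# Constant variance: a small p-value suggests heteroscedasticity.
bp_stat, bp_pvalue, _, _ = het_breuschpagan(fit.resid, fit.model.exog)

# Independence: values near 2 suggest little first-order autocorrelation.
dw = durbin_watson(fit.resid)

# Normality: a small p-value suggests the residuals are not normal.
sw_stat, sw_pvalue = stats.shapiro(fit.resid)

print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")
print(f"Durbin-Watson statistic: {dw:.3f}")
print(f"Shapiro-Wilk p-value: {sw_pvalue:.4f}")
```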
Critique & Limitations
Critique & Limitations of your approach: strengths and weaknesses in detail.
Summary & Conclusions
Project Summary
A comprehensive summary of your entire project (both Phase 1 and Phase 2). That is, what exactly did you do in your project? (Example: "I first cleaned the data in such and such ways, and then I applied multiple linear regression techniques in such and such ways.")
Summary of Findings
A comprehensive summary of your findings. That is, what exactly did you find about your particular problem?
Conclusions
Your detailed conclusions as they relate to your goals and objectives.