# Regression Case Study: Predicting Melbourne House Prices (Phase 2)

# Predicting Melbourne House Prices

## Introduction

### Phase 1 Summary

A brief yet complete and accurate summary of the work conducted for your Phase 1 report and how it relates to your Phase 2 report.

Important Phase 2 Note: Please do NOT include your Phase 1 report or its contents with your Phase 2 submission. You can, however, make changes to your Phase 1 tasks if you need to; in that case, include ONLY those changes in your Phase 2 report, with a brief explanation of each.

### Report Overview

A complete and accurate overview of the contents of your Phase 2 report. Clarification: A Table of Contents is not a report overview.

### Overview of Methodology

A detailed, complete, and accurate overview of your statistical modelling methodology (which is multiple linear regression). More specifically, in this subsection, you will provide a summary of your "Statistical Modelling" section below.
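
For reference, a multiple linear regression model with $p$ predictors is usually written as follows. The symbols are generic textbook notation (an assumption of this sketch), not names taken from the dataset:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n
$$

Here $y_i$ is the response (house price) for observation $i$, $x_{ij}$ is the value of predictor $j$, and $\varepsilon_i$ is the error term. The diagnostic checks in the "Statistical Modelling" section probe the assumptions this form encodes: linearity in the coefficients, independent errors, constant error variance, and normally distributed errors.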

## Statistical Modelling

(Statistical Modelling Section: Details of assumptions check, model selection, plots of residuals, and technical analysis of regression results.)

NOTE: The second half of this regression case study ("Statistical Modeling and Performance Evaluation" Section) will be very helpful for this section.

### Full Model Overview

Overview of your full model, including the variables and terms you are using in your model.

#### Module Imports

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
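
With the imports in place, one way to build and fit the full model is sketched below. The names `formula_string_encoded`, `data_encoded`, and `model_full_fitted` are the ones the feature-selection code in this template expects. The data here is synthetic and the column names (`rooms`, `distance`, `type_unit`, `price`) are hypothetical placeholders for your own encoded Phase 1 data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the cleaned, one-hot-encoded Phase 1 data.
rng = np.random.default_rng(seed=0)
n = 200
data_encoded = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=n),
    "distance": rng.uniform(1, 40, size=n),
    "type_unit": rng.integers(0, 2, size=n),
})
# Synthetic response: a linear signal plus Gaussian noise (illustration only).
data_encoded["price"] = (
    300
    + 150 * data_encoded["rooms"]
    - 8 * data_encoded["distance"]
    - 90 * data_encoded["type_unit"]
    + rng.normal(0, 50, size=n)
)

# Build the formula string from every column except the response.
predictors = [col for col in data_encoded.columns if col != "price"]
formula_string_encoded = "price ~ " + " + ".join(predictors)

# Fit the full model by ordinary least squares.
model_full = smf.ols(formula=formula_string_encoded, data=data_encoded)
model_full_fitted = model_full.fit()

print(formula_string_encoded)
print(f"R-squared: {model_full_fitted.rsquared:.4f}")
print(f"Adjusted R-squared: {model_full_fitted.rsquared_adj:.4f}")
```

Building the formula from the column list keeps the formula and the encoded data frame in sync when columns are added or dropped.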



### Full Model Diagnostic Checks

You need to check whether there are indications of violations of the regression assumptions for the full model.
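
As a complement to visual checks, a few numeric diagnostics can be collected from a fitted model. The helper `diagnostic_summary` and the synthetic `demo` data below are hypothetical and only illustrate the statsmodels calls. The choice of tests (Breusch-Pagan, Jarque-Bera, Durbin-Watson, variance inflation factors) is one reasonable selection, not a requirement of this template.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson, jarque_bera


def diagnostic_summary(fitted):
    """Return numeric assumption checks for a fitted statsmodels OLS result."""
    resid = fitted.resid
    exog = fitted.model.exog
    names = fitted.model.exog_names

    # Breusch-Pagan: the null hypothesis is constant error variance.
    _, bp_pvalue, _, _ = het_breuschpagan(resid, exog)
    # Jarque-Bera: the null hypothesis is normally distributed residuals.
    _, jb_pvalue, _, _ = jarque_bera(resid)
    # Variance inflation factor per design column (intercept dropped if present).
    vif = pd.Series(
        [variance_inflation_factor(exog, i) for i in range(exog.shape[1])],
        index=names,
    ).drop(labels="Intercept", errors="ignore")

    return {
        "breusch_pagan_pvalue": bp_pvalue,
        "jarque_bera_pvalue": jb_pvalue,
        "durbin_watson": durbin_watson(resid),
        "vif": vif,
    }


# Synthetic demonstration data (illustration only).
rng = np.random.default_rng(seed=0)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["y"] = 1.0 + 2.0 * demo["x1"] - 1.0 * demo["x2"] + rng.normal(size=200)
demo_fit = smf.ols("y ~ x1 + x2", data=demo).fit()

diagnostics = diagnostic_summary(demo_fit)
for name, value in diagnostics.items():
    print(f"{name}:\n{value}\n")
```

A small Breusch-Pagan or Jarque-Bera p-value is evidence against constant variance or normality, respectively. A Durbin-Watson statistic near 2 suggests little first-order autocorrelation. Large variance inflation factors point to multicollinearity; thresholds of 5 or 10 are commonly cited rules of thumb.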

### Feature Selection

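You can use the code below to perform backward feature selection using p-values (credit). At each step it removes the term with the largest p-value and refits the model, stopping once every remaining term has a p-value below the cutoff.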

In [ ]:
# NOTE: formula_string_encoded, data_encoded, and model_full_fitted are used
# below, so they must already be defined earlier in your notebook.

# create the patsy model description from the formula
patsy_description = patsy.ModelDesc.from_formula(formula_string_encoded)

# initialize the feature-selected fit to the full model
linreg_fit = model_full_fitted

# do backwards elimination using p-values
p_val_cutoff = 0.05

## WARNING 1: The code below assumes that the Intercept term is present in the model.
## WARNING 2: It will work only with main effects and two-way interactions, if any.

print('\nPerforming backwards feature selection using p-values:')

while True:

    # uncomment the line below if you would like to see the regression summary
    # in each step:
    ### print(linreg_fit.summary())

    pval_series = linreg_fit.pvalues.drop(labels='Intercept')
    if pval_series.empty:
        # every non-intercept term has been removed; nothing left to test
        break
    pval_series = pval_series.sort_values(ascending=False)
    term = pval_series.index[0]
    # use positional access; pval_series[0] is deprecated on a label index
    pval = pval_series.iloc[0]
    if pval < p_val_cutoff:
        break
    term_components = term.split(':')
    print(f'\nRemoving term "{term}" with p-value {pval:.4}')
    if len(term_components) == 1:
        # this is a main effect term
        patsy_description.rhs_termlist.remove(
            patsy.Term([patsy.EvalFactor(term_components[0])]))
    else:
        # this is a two-way interaction term
        patsy_description.rhs_termlist.remove(
            patsy.Term([patsy.EvalFactor(term_components[0]),
                        patsy.EvalFactor(term_components[1])]))

    # refit without the removed term
    linreg_fit = smf.ols(formula=patsy_description, data=data_encoded).fit()

###
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula=patsy_description, data=data_encoded).fit()
###

#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")


### Reduced Model Overview

Overview of your reduced model, including the variables and terms you are using in your model.
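
When describing the reduced model, it can help to set it beside the full model. The sketch below compares two nested OLS fits on synthetic data using adjusted R-squared, AIC, and a partial F-test. The data frame `demo` and the variable names `x1`, `x2`, `noise_var` are hypothetical; in your report the two fits would be your own full and reduced models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data in which noise_var has no real effect on y (illustration only).
rng = np.random.default_rng(seed=1)
n = 300
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise_var": rng.normal(size=n),
})
demo["y"] = 2.0 + 1.5 * demo["x1"] - 0.8 * demo["x2"] + rng.normal(size=n)

full_fit = smf.ols("y ~ x1 + x2 + noise_var", data=demo).fit()
reduced_fit = smf.ols("y ~ x1 + x2", data=demo).fit()

# Side-by-side summary of the two nested models.
comparison = pd.DataFrame(
    {
        "n_terms": [len(full_fit.params), len(reduced_fit.params)],
        "adj_r_squared": [full_fit.rsquared_adj, reduced_fit.rsquared_adj],
        "aic": [full_fit.aic, reduced_fit.aic],
    },
    index=["full", "reduced"],
)
print(comparison.round(4))

# Partial F-test: the null hypothesis is that the dropped terms have zero coefficients.
f_value, p_value, df_diff = full_fit.compare_f_test(reduced_fit)
print(f"F = {f_value:.4f}, p-value = {p_value:.4f}, terms dropped = {int(df_diff)}")
```

A large p-value from the partial F-test indicates that the dropped terms do not significantly improve the fit, which supports reporting the reduced model.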

### Reduced Model Diagnostic Checks

You need to check whether there are indications of violations of the regression assumptions for the reduced model.

## Critique & Limitations

Critique & Limitations of your approach: strengths and weaknesses in detail.

## Summary & Conclusions

### Project Summary

A comprehensive summary of your entire project (both Phase 1 and Phase 2). That is, what exactly did you do in your project? (Example: "I first cleaned the data in such and such ways. I then applied multiple linear regression techniques in such and such ways.")

### Summary of Findings

A comprehensive summary of your findings. That is, what exactly did you find about your particular problem?

### Conclusions