(Credit: A. C. & I. P.)
DISCLAIMER: Please do not assume this report is 100% accurate nor it fully satisfies all the project requirements for full credit.
Introduction ¶
Dataset Source¶
The Melbourne House Prices dataset used in study was sourced from Kaggle (Becker, 2021). This dataset contains prices of houses sold in Melbourne between January 2016 and September 2017.
Dataset Details¶
The dataset is about the housing market in Melbourne and contains information about the house sale price, location, and the brokering real estate agency. Additional features included in this dataset are the date of sale, year built, number of rooms & bathrooms, distance to city center, land size, suburb, and the number of properties within the suburb. These features seem to be sufficient for an attempt for predictive modeling of Melbourne house prices as a regression problem.
This dataset has a total of 20 features (excluding the street address of houses) and 13580 observations. Houses with no price information have already been removed from the dataset.
Dataset Retrieval
- We read in the dataset from our GitHub repository and load the modules we will use throughout this report.
- We display 10 randomly sampled rows from this dataset.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import io
import requests
pd.set_option('display.max_columns', None)
###
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn")
###
# name of the dataset to be imported from our GitHub account
df_name = 'melbourne_house_prices.csv'
df_url = 'https://raw.githubusercontent.com/akmand/datasets/master/' + df_name
url_content = requests.get(df_url, verify=False).content
melb_df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
melb_df.sample(10, random_state=999)
Date | Suburb | Rooms | Type | Method | SellerG | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount | Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8563 | 12/11/2016 | Doncaster | 4 | h | S | Ray | 13.9 | 3108 | 4 | 2 | 2.0 | 806 | 170.0 | 1975.0 | Manningham | -37.79390 | 145.13920 | Eastern Metropolitan | 9028 | 1468000 |
3552 | 24/06/2017 | Reservoir | 1 | u | PI | Stockdale | 12.0 | 3073 | 1 | 1 | 1.0 | 0 | 60.0 | 1980.0 | Darebin | -37.71408 | 145.00441 | Northern Metropolitan | 21650 | 270000 |
11403 | 26/07/2016 | Moorabbin | 4 | h | S | Buxton | 14.6 | 3189 | 4 | 1 | 2.0 | 592 | NaN | NaN | Kingston | -37.94070 | 145.04820 | Southern Metropolitan | 2555 | 921000 |
8475 | 19/11/2016 | Windsor | 2 | u | SP | Jellis | 5.1 | 3181 | 2 | 2 | 1.0 | 0 | 111.0 | 2007.0 | Port Phillip | -37.85630 | 144.99160 | Southern Metropolitan | 4380 | 1200000 |
9442 | 8/10/2016 | Parkville | 1 | u | S | Jellis | 2.6 | 3052 | 1 | 1 | 1.0 | 0 | NaN | NaN | Melbourne | -37.78640 | 144.94090 | Northern Metropolitan | 2309 | 352500 |
12848 | 14/05/2016 | Elwood | 3 | t | S | Chisholm | 7.7 | 3184 | 3 | 1 | 1.0 | 0 | NaN | NaN | Port Phillip | -37.88080 | 144.98930 | Southern Metropolitan | 8989 | 940000 |
3034 | 1/07/2017 | Bentleigh | 3 | h | SP | hockingstuart | 11.4 | 3204 | 3 | 1 | 1.0 | 401 | 114.0 | 1970.0 | Glen Eira | -37.92999 | 145.04932 | Southern Metropolitan | 6795 | 1045000 |
4320 | 3/06/2017 | South Morang | 4 | h | S | Harcourts | 20.5 | 3752 | 4 | 3 | 2.0 | 624 | 410.0 | 2005.0 | Whittlesea | -37.63485 | 145.07272 | Northern Metropolitan | 7969 | 915000 |
5879 | 8/04/2017 | Altona | 3 | h | SP | hockingstuart | 13.8 | 3018 | 3 | 1 | 2.0 | 577 | NaN | NaN | Hobsons Bay | -37.86400 | 144.81610 | Western Metropolitan | 5301 | 840000 |
8112 | 27/11/2016 | Port Melbourne | 2 | u | S | Chisholm | 3.8 | 3207 | 2 | 1 | 1.0 | 0 | 100.0 | 2000.0 | Port Phillip | -37.84240 | 144.94490 | Southern Metropolitan | 8648 | 735000 |
Dataset Features¶
The features in our dataset are described in the table below. These descriptions are taken from the Kaggle data source.
from tabulate import tabulate
table = [['Name','Data Type','Units','Description'],
['Suburb','Nominal categorical','NA','Suburb of house sold'],
['Rooms','Numeric','NA','Number of rooms'],
['Type','Nominal categorical','NA','br - bedroom(s); h - house, cottage, villa, semi, terrace; \
\nu - unit, duplex; t - townhouse; dev site - development site; \no res - other residential.'],
['Price','Numeric','AUD','Price of house sold'],
['Method','Nominal categorical','NA','S - property sold; SP - property sold prior; \
\nPI - property passed in; PN - sold prior not disclosed; SN - \
\nsold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn \
\nprior to auction; SA - sold after auction; SS - sold after auction \
\nprice not disclosed. N/A - price or highest bid not available.'],
['SellerG','Nominal categorical','NA','Real estate agent'],
['Date','Date','dd/mm/yyyy','Date sold'],
['Distance','Numeric','KM','Distance from Central Business District'],
['Bathroom','Numeric','NA','Number of bathrooms'],
['Car','Numeric','NA','Number of carspots'],
['Landsize','Numeric','Square Meter','Land size'],
['Building area','Numeric','Square Meter','Building size'],
['Year built','Date','year','Year the house was built'],
['Council area','Nominal categorical','NA','Governing council for the area'],
['Lattitude','Numeric','Degree','Lattitude'],
['Longtitude','Numeric','Degree','Longtitude'],
['Region name','Nominal categorical','NA','General Region (West, North West, etc)'],
['Property count','Numeric','NA','Number of properties that exist in the suburb']]
print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))
╒════════════════╤═════════════════════╤══════════════╤═══════════════════════════════════════════════════════════════════════════════╕ │ Name │ Data Type │ Units │ Description │ ╞════════════════╪═════════════════════╪══════════════╪═══════════════════════════════════════════════════════════════════════════════╡ │ Suburb │ Nominal categorical │ NA │ Suburb of house sold │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Rooms │ Numeric │ NA │ Number of rooms │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Type │ Nominal categorical │ NA │ br - bedroom(s); h - house, cottage, villa, semi, terrace; │ │ │ │ │ u - unit, duplex; t - townhouse; dev site - development site; │ │ │ │ │ o res - other residential. │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Price │ Numeric │ AUD │ Price of house sold │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Method │ Nominal categorical │ NA │ S - property sold; SP - property sold prior; │ │ │ │ │ PI - property passed in; PN - sold prior not disclosed; SN - │ │ │ │ │ sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn │ │ │ │ │ prior to auction; SA - sold after auction; SS - sold after auction │ │ │ │ │ price not disclosed. N/A - price or highest bid not available. │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ SellerG │ Nominal categorical │ NA │ Real estate agent │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Date │ Date │ dd/mm/yyyy │ Date sold │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Distance │ Numeric │ KM │ Distance from Central Business District │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Bathroom │ Numeric │ NA │ Number of bathrooms │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Car │ Numeric │ NA │ Number of carspots │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Landsize │ Numeric │ Square Meter │ Land size │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Building area │ Numeric │ Square Meter │ Building size │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Year built │ Date │ year │ Year the house was built │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Council area │ Nominal categorical │ NA │ Governing council for the area │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Lattitude │ Numeric │ Degree │ Lattitude │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Longtitude │ Numeric │ Degree │ Longtitude │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Region name │ Nominal categorical │ NA │ General Region (West, North West, etc) │ ├────────────────┼─────────────────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ Property count │ Numeric │ NA │ Number of properties that exist in the suburb │ ╘════════════════╧═════════════════════╧══════════════╧═══════════════════════════════════════════════════════════════════════════════╛
Target Feature¶
For this project, the target feature in this dataset will be the house price in Australian dollars. That is, the price of Melbourne houses will be predicted based on the explanatory/ descriptive variables.
Goals and Objectives¶
Melbourne has a very active housing market as demand for housing throughout the city is usually well above the supply. For this reason, a model that can accurately predict a house's selling price in Melbourne would have many real-world uses. For instance, real estate agents can provide better service to their customers using this predictive model. Likewise, banks lending out money to home buyers can better estimate the financial aspects of this home loan. Perhaps more importantly as potential home buyers, we as individuals can better figure out if we are being ripped off or we are getting a good deal provided that our predictive model is a reliable one.
Thus, the main objective of this project is two-fold: (1) predict the price of a house sold in Melbourne based on publicly available features of the house, and (2) which features seem to be the best predictors of the house sale price. A secondary objective is to perform some exploratory data analysis by basic descriptive statistics & data visualisation plots to gain some insight into the patterns and relationships existing in the data subsequent to some data cleaning & preprocessing, which is the subject of this Phase 1 report.
At this point, we make the important assumption that rows in our dataset are not correlated. That is, we assume that house price observations are independent of one another in this dataset. Of course, this is not a very realistic assumption, however, this assumption allows us to circumvent time series aspects of the underlying dynamics of house prices and also to resort to rather classical predictive models such as multiple linear regression.
Data Cleaning and Preprocessing¶
In this section, we describe the data cleaning and preprocessing steps undertaken for this project.
Data Cleaning Steps¶
- Drop irrelevant features in our dataset
- Check and rename/ modify some column names
- Check for missing values
- Remove all the rows with missing values
- Random sampling of the dataset for 5000 rows
Let's first display all the columns in our dataset.
melb_df.columns
Index(['Date', 'Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount', 'Price'], dtype='object')
Postcode
is very similar to the Suburb
feature, thus it is considered redundant for the model and removed. Bedroom2
is just the number of bedrooms based on other source, which is also the same as Rooms
feature, thus it is also considered redundant and removed.
#drop irrelevant/repeated columns/features
melb_df = melb_df.drop(columns=["Postcode", "Bedroom2"])
As this is not a time series project, the Date
feature is transformed into Months
and Years
features for the model. After transforming, the original Date
feature is removed.
# Convert Date feature from string to Pandas' datatime dtype
melb_df['Date'] = pd.to_datetime(melb_df['Date'], dayfirst=True)
###
# Create Year and Month features
melb_df['year_sold'] = melb_df['Date'].dt.year
melb_df['month_sold'] = melb_df['Date'].dt.month
###
# Remove Date as we will not need it anymore
melb_df = melb_df.drop(columns=["Date"])
Some of the columns are not labelled properly, which will be problematic when modelling. We will make all column names lower case and replace any spaces with underscores for consistency. We will also remove any white spaces before & after column names.
# make column names lower case and also remove any white spaces
# before & after the column names using the strip() function
melb_df.columns = melb_df.columns.str.lower().str.strip()
columns_mapping = {
'sellerg': 'real_estate',
'landsize': 'land_size',
'buildingarea': 'building_area',
'yearbuilt': 'year_built',
'councilarea': 'council',
'regionname': 'region',
'propertycount': 'property_count'
}
# rename columns
melb_df = melb_df.rename(columns = columns_mapping)
melb_df.sample(5, random_state=999)
suburb | rooms | type | method | real_estate | distance | bathroom | car | land_size | building_area | year_built | council | lattitude | longtitude | region | property_count | price | year_sold | month_sold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8563 | Doncaster | 4 | h | S | Ray | 13.9 | 2 | 2.0 | 806 | 170.0 | 1975.0 | Manningham | -37.79390 | 145.13920 | Eastern Metropolitan | 9028 | 1468000 | 2016 | 11 |
3552 | Reservoir | 1 | u | PI | Stockdale | 12.0 | 1 | 1.0 | 0 | 60.0 | 1980.0 | Darebin | -37.71408 | 145.00441 | Northern Metropolitan | 21650 | 270000 | 2017 | 6 |
11403 | Moorabbin | 4 | h | S | Buxton | 14.6 | 1 | 2.0 | 592 | NaN | NaN | Kingston | -37.94070 | 145.04820 | Southern Metropolitan | 2555 | 921000 | 2016 | 7 |
8475 | Windsor | 2 | u | SP | Jellis | 5.1 | 2 | 1.0 | 0 | 111.0 | 2007.0 | Port Phillip | -37.85630 | 144.99160 | Southern Metropolitan | 4380 | 1200000 | 2016 | 11 |
9442 | Parkville | 1 | u | S | Jellis | 2.6 | 1 | 1.0 | 0 | NaN | NaN | Melbourne | -37.78640 | 144.94090 | Northern Metropolitan | 2309 | 352500 | 2016 | 10 |
Next we check the data types and observe that they match the intended data types, thus no change is needed here.
# Check for data types
print(f"Shape of the dataset = {melb_df.shape} \n")
print(f"Data types are below where 'object' indicates a string type: ")
print(melb_df.dtypes)
Shape of the dataset = (13580, 19) Data types are below where 'object' indicates a string type: suburb object rooms int64 type object method object real_estate object distance float64 bathroom int64 car float64 land_size int64 building_area float64 year_built float64 council object lattitude float64 longtitude float64 region object property_count int64 price int64 year_sold int64 month_sold int64 dtype: object
The unique values for all columns with categorical data types are displayed to check for any white spaces and other data quality issues. It turns out that the data is already clean, and no futher data cleaning steps are necessary here.
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of categorical features</b>'))
melb_df.describe(include='object').T
count | unique | top | freq | |
---|---|---|---|---|
suburb | 13580 | 314 | Reservoir | 359 |
type | 13580 | 3 | h | 9449 |
method | 13580 | 5 | S | 9022 |
real_estate | 13580 | 268 | Nelson | 1565 |
council | 12211 | 33 | Moreland | 1163 |
region | 13580 | 8 | Southern Metropolitan | 4695 |
# To see all unique values for categorical data types
categoricalColumns = melb_df.columns[melb_df.dtypes==object].tolist()
for col in categoricalColumns:
print('Unique values for ' + col)
print(melb_df[col].unique())
print('')
Unique values for suburb ['Aberfeldie' 'Airport West' 'Albanvale' 'Albert Park' 'Albion' 'Altona Meadows' 'Altona North' 'Armadale' 'Ascot Vale' 'Ashburton' 'Balwyn' 'Balwyn North' 'Bayswater' 'Bayswater North' 'Beaumaris' 'Bentleigh' 'Bentleigh East' 'Berwick' 'Black Rock' 'Blackburn' 'Blackburn North' 'Blackburn South' 'Bonbeach' 'Box Hill' 'Brighton East' 'Broadmeadows' 'Brooklyn' 'Brunswick' 'Brunswick East' 'Brunswick West' 'Bulleen' 'Bundoora' 'Burnside Heights' 'Burwood East' 'Camberwell' 'Carnegie' 'Caroline Springs' 'Cheltenham' 'Clayton South' 'Coburg' 'Craigieburn' 'Croydon' 'Deer Park' 'Delahey' 'Derrimut' 'Diamond Creek' 'Doncaster' 'Doncaster East' 'Donvale' 'Doreen' 'Edithvale' 'Epping' 'Essendon' 'Fawkner' 'Ferntree Gully' 'Fitzroy' 'Fitzroy North' 'Flemington' 'Forest Hill' 'Frankston' 'Frankston South' 'Gladstone Park' 'Glen Huntly' 'Glen Iris' 'Glen Waverley' 'Glenroy' 'Greensborough' 'Hampton' 'Hampton East' 'Hawthorn East' 'Heidelberg' 'Heidelberg Heights' 'Highett' 'Hoppers Crossing' 'Huntingdale' 'Ivanhoe East' 'Jacana' 'Kealba' 'Keilor Downs' 'Keilor East' 'Kew' 'Kew East' 'Keysborough' 'Lalor' 'Malvern' 'Maribyrnong' 'Meadow Heights' 'Melton' 'Melton West' 'Mentone' 'Mill Park' 'Mitcham' 'Moonee Ponds' 'Mooroolbark' 'Mount Waverley' 'Mulgrave' 'Murrumbeena' 'New Gisborne' 'Newport' 'Niddrie' 'North Melbourne' 'Northcote' 'Notting Hill' 'Oak Park' 'Oakleigh' 'Oakleigh East' 'Oakleigh South' 'Ormond' 'Plumpton' 'Port Melbourne' 'Prahran' 'Preston' 'Reservoir' 'Richmond' 'Ringwood' 'Ringwood North' 'Rosanna' 'Roxburgh Park' 'Sandringham' 'Seddon' 'South Melbourne' 'South Morang' 'Springvale' 'St Albans' 'St Helena' 'St Kilda' 'Strathmore' 'Sunbury' 'Sunshine North' 'Sunshine West' 'Surrey Hills' 'Taylors Lakes' 'Templestowe' 'Templestowe Lower' 'The Basin' 'Thomastown' 'Vermont' 'Vermont South' 'Wantirna' 'Wantirna South' 'Werribee' 'West Footscray' 'Westmeadows' 'Wheelers Hill' 'Williamstown' 'Wollert' 'Yarraville' 'Altona' 'Aspendale' 'Aspendale Gardens' 'Avondale Heights' 'Boronia' 'Braybrook' 'Brighton' 'Burwood' 'Carrum Downs' 'Chelsea' 'Clayton' 'Clifton Hill' 'Coburg North' 'Collingwood' 'Dandenong North' 'Dingley Village' 'Doveton' 'Eltham' 'Elwood' 'Essendon North' 'Fairfield' 'Gisborne' 'Greenvale' 'Hallam' 'Hawthorn' 'Heidelberg West' 'Hillside' 'Keilor' 'Kingsville' 'Mont Albert' 'Montmorency' 'Moorabbin' 'Mordialloc' 'Narre Warren' 'Parkdale' 'Pascoe Vale' 'Point Cook' 'Princes Hill' 'Ringwood East' 'Scoresby' 'South Yarra' 'Sunshine' 'Thornbury' 'Toorak' 'Viewbank' 'Wallan' 'Watsonia' 'Watsonia North' 'West Melbourne' 'Williams Landing' 'Williamstown North' 'Windsor' 'Alphington' 'Attwood' 'Canterbury' 'Carlton' 'Carlton North' 'Cremorne' 'Eltham North' 'Footscray' 'Hadfield' 'Heathmont' 'Hurstbridge' 'Keilor Lodge' 'Maidstone' 'Malvern East' 'Melton South' 'Mernda' 'Middle Park' 'North Warrandyte' 'Seaford' 'Waterways' 'Balaclava' 'Briar Hill' 'Burnside' 'Caulfield East' 'Coolaroo' 'Croydon Hills' 'Croydon North' 'Dallas' 'Dandenong' 'Gowanbrae' 'Springvale South' 'Tullamarine' 'Warrandyte' 'Bellfield' 'Caulfield South' 'Chadstone' 'Chelsea Heights' 'Cranbourne' 'Frankston North' 'Hampton Park' 'Ivanhoe' 'Keilor Park' 'Kilsyth' 'Kings Park' 'Kooyong' 'Langwarrin' 'McKinnon' 'Monbulk' 'Nunawading' 'Spotswood' 'Tarneit' 'Abbotsford' 'Ardeer' 'Ashwood' 'Campbellfield' 'Clarinda' 'Eaglemont' 'Elsternwick' 'Kensington' 'Mount Evelyn' 'Strathmore Heights' 'Truganina' 'Wyndham Vale' 'Yallambie' 'Bullengarook' 'Cairnlea' 'Carrum' 'Caulfield' 'Caulfield North' 'Chirnside Park' 'Deepdene' 'Diggers Rest' 'East Melbourne' 'Hughesdale' 'Kurunjang' 'Lower Plenty' 'Rowville' 'Southbank' 'Sydenham' 'Travancore' 'Wonga Park' 'Beaconsfield' 'Brookfield' 'Cranbourne North' 'Emerald' 'Endeavour Hills' 'Knoxfield' 'Melbourne' 'Noble Park' 'Ripponlea' 'Seabrook' 'Seaholme' 'South Kingsville' 'Whittlesea' 'Pakenham' 'Taylors Hill' 'Yarra Glen' 'Essendon West' 'Kingsbury' 'Officer' 'Bacchus Marsh' 'Montrose' 'Parkville' 'Healesville' 'Rockbank' 'Skye' 'Beaconsfield Upper' 'Riddells Creek' 'Sandhurst' 'Silvan' 'Burnley' 'Croydon South' 'Gardenvale' 'Upwey' 'Docklands'] Unique values for type ['h' 't' 'u'] Unique values for method ['S' 'VB' 'SP' 'PI' 'SA'] Unique values for real_estate ['Rendina' 'Brad' 'Harcourts' 'Bells' 'Cayzer' 'YPA' 'Barlow' 'Sweeney' 'hockingstuart' 'Kay' 'Nelson' 'Buxton' 'Jellis' 'Marshall' 'iTRAK' 'McGrath' 'Obrien' 'Woodards' 'Barry' 'Ray' 'Fletchers' 'Philip' 'Chambers' 'Luxton' 'Noel' 'HAR' "O'Brien" 'C21' 'One' 'Purplebricks' 'Stockdale' 'Morrison' 'ASL' 'The' 'Hodges' 'Love' 'Eview' 'Darren' 'Triwest' 'FN' 'Miles' 'Kelly' 'Area' 'Melbourne' 'SN' 'Raine' 'Williams' 'Frank' 'RT' 'Weston' 'RW' 'Carter' 'Sell' 'iSell' 'Buckingham' 'Considine' 'Jas' 'Village' 'Greg' 'Lindellas' 'Douglas' 'Nick' 'Darras' 'Peter' 'Parkes' 'REMAX' 'Roger' 'Follett' 'Prof.' 'Reliance' 'LLC' 'LJ' 'Biggin' 'Collins' 'buyMyplace' 'Le' 'Leeburn' 'Garvey' 'Joseph' 'Chisholm' 'S&L' 'Walshe' 'Harrington' 'Mason' 'Schroeder' "Abercromby's" 'Charlton' 'Daniel' 'Gardiner' 'L' 'Gary' 'Domain' 'McDonald' 'Hoskins' 'McLennan' 'Del' 'Millership' 'Alexkarbon' 'Meadows' 'Tim' 'Max' 'Jason' 'McNaughton' 'PRD' 'Bullen' 'RE' 'Rexhepi' 'William' 'Ace' 'Ryder' 'Point' 'Red' 'Edward' 'Boutique' 'TRUE' 'M.J' 'GL' 'VICProp' 'Keatings' 'JRW' 'Leading' 'Caine' 'Burnham' 'Appleby' 'Castran' 'Scott' 'Moonee' 'Peake' 'Thomson' 'Nicholson' 'LITTLE' 'Kaye' 'Nicholls' 'Bowman' 'J' 'R&H' 'Redina' 'Langwell' 'MICM' 'Hall' 'Gunn&Co' 'JMRE' 'Ham' 'Thomas' 'Mandy' 'Haughton' 'Whiting' 'PRDNationwide' 'Win' 'Westside' 'Better' 'Iconek' 'Aquire' 'Galldon' 'Only' "D'Aprano" 'U' 'Dingle' 'Ross' 'Dixon' 'Wilson' 'Ristic' 'Ash' 'Upper' 'Pagan' 'Xynergy' 'Bekdon' 'Metro' 'Lucas' 'Rounds' 'Mitchell' 'Weda' 'Hunter' 'Leyton' 'W.B.' 'Conquest' 'Trimson' 'Victory' 'Smart' 'Mindacom' 'Beller' 'Asset' '@Realty' 'Jim' 'Real' 'iProperty' 'Hayeswinckle' 'Nardella' 'Prime' 'Community' 'Elite' 'Holland' 'Free' 'Veitch' 'Oriental' "O'Donoghues" 'Morleys' "Sotheby's" 'WHITEFOX' 'White' 'Century' 'Rodney' 'Craig' 'Rosin' 'Reed' 'Anderson' 'Hooper' 'Matthew' 'Leased' 'Maddison' "Tiernan's" 'hockingstuart/Barry' 'David' 'Pride' 'Johnston' 'Prowse' 'Ascend' 'Christopher' 'Fletchers/One' 'Paul' 'Walsh' 'New' 'North' 'Naison' 'Reach' 'Blue' 'Ken' 'Bayside' 'Re' 'Vic' 'Direct' 'Geoff' 'Buxton/Find' 'First' 'Owen' 'Grantham' 'Crane' 'Sweeney/Advantage' 'Compton' 'Brace' 'Allens' 'VICPROP' 'Zahn' 'Assisi' 'Wood' 'Besser' 'Oak' 'Australian' 'CASTRAN' 'Buxton/Advantage' "Private/Tiernan's" 'Nguyen' 'Joe' 'Homes' 'Property' 'Luxe' 'Weast' 'Coventry' 'Changing' 'Raine&Horne' 'hockingstuart/Advantage' 'Steveway' 'Propertyau' 'Calder' 'Parkinson' 'Clairmont' 'hockingstuart/Village' 'iOne'] Unique values for council [nan 'Yarra' 'Moonee Valley' 'Brimbank' 'Darebin' 'Hobsons Bay' 'Stonnington' 'Boroondara' 'Kingston' 'Port Phillip' 'Knox' 'Bayside' 'Banyule' 'Glen Eira' 'Casey' 'Whitehorse' 'Hume' 'Moreland' 'Manningham' 'Macedon Ranges' 'Whittlesea' 'Melton' 'Frankston' 'Monash' 'Yarra Ranges' 'Maroondah' 'Greater Dandenong' 'Nillumbik' 'Melbourne' 'Maribyrnong' 'Wyndham' 'Cardinia' 'Moorabool' 'Unavailable'] Unique values for region ['Western Metropolitan' 'Southern Metropolitan' 'Eastern Metropolitan' 'Eastern Victoria' 'South-Eastern Metropolitan' 'Northern Metropolitan' 'Northern Victoria' 'Western Victoria']
The summary statistics are generated for all the numerical features. There does not seem to be any outliers in the data.
from IPython.display import display, HTML
display(HTML('<b>Table 2: Summary of numerical features</b>'))
melb_df.describe(include=['int64','float64']).T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
rooms | 13580.0 | 2.937997e+00 | 0.955748 | 1.00000 | 2.000000 | 3.000000 | 3.000000e+00 | 1.000000e+01 |
distance | 13580.0 | 1.013778e+01 | 5.868725 | 0.00000 | 6.100000 | 9.200000 | 1.300000e+01 | 4.810000e+01 |
bathroom | 13580.0 | 1.534242e+00 | 0.691712 | 0.00000 | 1.000000 | 1.000000 | 2.000000e+00 | 8.000000e+00 |
car | 13518.0 | 1.610075e+00 | 0.962634 | 0.00000 | 1.000000 | 2.000000 | 2.000000e+00 | 1.000000e+01 |
land_size | 13580.0 | 5.584161e+02 | 3990.669241 | 0.00000 | 177.000000 | 440.000000 | 6.510000e+02 | 4.330140e+05 |
building_area | 7130.0 | 1.519676e+02 | 541.014538 | 0.00000 | 93.000000 | 126.000000 | 1.740000e+02 | 4.451500e+04 |
year_built | 8205.0 | 1.964684e+03 | 37.273762 | 1196.00000 | 1940.000000 | 1970.000000 | 1.999000e+03 | 2.018000e+03 |
lattitude | 13580.0 | -3.780920e+01 | 0.079260 | -38.18255 | -37.856822 | -37.802355 | -3.775640e+01 | -3.740853e+01 |
longtitude | 13580.0 | 1.449952e+02 | 0.103916 | 144.43181 | 144.929600 | 145.000100 | 1.450583e+02 | 1.455264e+02 |
property_count | 13580.0 | 7.454417e+03 | 4378.581772 | 249.00000 | 4380.000000 | 6555.000000 | 1.033100e+04 | 2.165000e+04 |
price | 13580.0 | 1.075684e+06 | 639310.724296 | 85000.00000 | 650000.000000 | 903000.000000 | 1.330000e+06 | 9.000000e+06 |
year_sold | 13580.0 | 2.016533e+03 | 0.498899 | 2016.00000 | 2016.000000 | 2017.000000 | 2.017000e+03 | 2.017000e+03 |
month_sold | 13580.0 | 7.052283e+00 | 2.536034 | 1.00000 | 5.000000 | 7.000000 | 9.000000e+00 | 1.200000e+01 |
Missing values are checked by displaying the number of missing values in every column. We observe that car
, building_area
, year_built
and council
features have missing values. We decide to drop these observations for simplicity.
# Count missing values in each column
print(f"\nNumber of missing values for each column/ feature:")
print(melb_df.isnull().sum())
Number of missing values for each column/ feature: suburb 0 rooms 0 type 0 method 0 real_estate 0 distance 0 bathroom 0 car 62 land_size 0 building_area 6450 year_built 5375 council 1369 lattitude 0 longtitude 0 region 0 property_count 0 price 0 year_sold 0 month_sold 0 dtype: int64
# Drop all rows with missing values/NaN
melb_df = melb_df.dropna()
melb_df.shape
(6196, 19)
Random Sampling¶
As the data has more than 5000 rows, random sampling is done to get only 5000 rows out of the remaining 6196 rows for ease of computation. At the end, we display 5 random rows from our cleaned data.
melb_df = melb_df.sample(n=5000, random_state=999)
melb_df.shape
melb_df.sample(5, random_state=999)
suburb | rooms | type | method | real_estate | distance | bathroom | car | land_size | building_area | year_built | council | lattitude | longtitude | region | property_count | price | year_sold | month_sold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3616 | Windsor | 2 | h | S | hockingstuart | 4.6 | 1 | 1.0 | 136 | 58.0 | 1892.0 | Stonnington | -37.85542 | 144.99571 | Southern Metropolitan | 4380 | 957500 | 2017 | 6 |
10409 | South Yarra | 2 | u | SP | Jellis | 3.3 | 1 | 1.0 | 0 | 86.0 | 2003.0 | Stonnington | -37.83930 | 144.98860 | Southern Metropolitan | 14887 | 722000 | 2016 | 9 |
6777 | St Kilda | 3 | h | VB | Wilson | 6.1 | 2 | 2.0 | 176 | 192.0 | 1880.0 | Port Phillip | -37.86410 | 144.97390 | Southern Metropolitan | 13240 | 2000000 | 2017 | 3 |
7304 | Brighton East | 5 | h | S | hockingstuart | 10.7 | 5 | 3.0 | 1561 | 272.0 | 1969.0 | Bayside | -37.91020 | 145.02450 | Southern Metropolitan | 6938 | 3207000 | 2016 | 12 |
2627 | Point Cook | 4 | h | S | hockingstuart | 14.7 | 2 | 2.0 | 448 | 190.0 | 2012.0 | Wyndham | -37.89140 | 144.72014 | Western Metropolitan | 15542 | 698000 | 2017 | 7 |
Data Exploration and Visualisation¶
Our dataset is now considered to be clean and we are ready to start visualising and explore each of the features.
Univariate Visualisation¶
Bar Chart of Region Name¶
We count the region name to see which region has the highest count in Melbourne Housing in a descending order. As we can see in Figure 1, the Southern Metropolitan has the highest number of house sold compared to the other regions in Melbourne.
plt.figure(figsize = (20,8))
fig = sns.countplot(x = 'region', data = melb_df, palette = 'magma',
order = melb_df['region'].value_counts().index)
fig = plt.title('Figure 1: Region Count for Melbourne Housing', fontsize = 15)
plt.show()
Bar Chart of Method¶
From Figure 2, we can see that the method "Property Sold" ("S") is the most common method used in selling house based on the dataset.
plt.figure(figsize = (15,8))
fig = sns.countplot(x = 'method', data = melb_df, palette = None,
order = melb_df['method'].value_counts().index)
fig = plt.title('Figure 2: Method Count in Melbourne Housing', fontsize = 15)
plt.show()
Boxplot & Histogram of Price¶
We can see in Figures 3A and 3B that the distribution of price is clearly right-skewed and has a huge range, which indicates the price variable will probably need a log transformation in the second phase of the project.
# Boxplot of Price
plt.figure(figsize = (15,8))
sns.boxplot(melb_df['price']).set_title('Figure 3A: Box Plot of Price', fontsize = 15)
plt.show();
# Boxplot of Price
plt.figure(figsize = (15,8))
sns.distplot(melb_df['price'], kde=True, bins=40).set_title('Figure 3B: Histogram of Price', fontsize = 15)
plt.show();
Two-Variable Visualisation¶
Scatterplot of price and distance from CBD¶
Figure 4 shows that there is a correlation between the distance from CBD and price. As the distance increases, the price tends to be lower.
plt.figure(figsize = (15,8))
plt.scatter(melb_df['distance'], melb_df['price'], alpha = 0.3)
plt.title('Figure 4: Scatterplot of Price and Distance from CBD', fontsize = 15)
plt.xlabel('Distance from CBD')
plt.ylabel('Price')
plt.show();
Boxplot of Price by House Type¶
From figure 5, we can see that H type (house, cottage, villa, semi, terrace) of house have overall a higher price compared to other types. Furthermore, U (unit or duplex) house has a lower price compared to other types.
plt.figure(figsize = (15,8))
sns.boxplot(melb_df['type'], melb_df['price']);
plt.title('Figure 5: Boxplot of Price by House Type', fontsize = 15)
plt.show();
Boxplot of Price by Number of Rooms¶
In figure 6, the price tends to increase as the number of rooms increases.
plt.figure(figsize = (15,8))
sns.boxplot(melb_df['rooms'], melb_df['price']);
plt.title('Figure 6: Boxplot of Price by Number of Rooms', fontsize = 15)
plt.show();
Three-Variable Visualisation¶
Boxplot of Price broken down by Method and House Type¶
We can clearly see that in figure 7, the comparison between house type and price in different method is somewhat comparable. There is no significant difference.
plt.figure(figsize = (15,8))
sns.boxplot(melb_df['method'], melb_df['price'],
hue = melb_df['type'])
plt.title('Figure 7: Boxplot of Price broken down by Method and House Type', fontsize = 15)
plt.show();