HDB Resale Prices

Visualising and Understanding the factors of HDB resale price trend.

(This analysis was last updated on 04/10/2021.) ***

Purpose of Exploratory Data Analysis (EDA)

  1. Find the main factors of price changes in HDB resale transactions.
  2. Model the price changes into an app for people to use. ***

    All Datasets are sourced and downloaded from data.gov.sg

  3. Resale Flat Prices from 1990s till now (Managed by HDB Singapore) : https://data.gov.sg/dataset/resale-flat-prices

(Only data from the last 2 years will be analysed) ***

# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
plt.style.use('ggplot')
sns.set_style('whitegrid')
%matplotlib inline

from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import shap

from tqdm import tqdm
import time
import requests
from bs4 import BeautifulSoup
from onemapsg import OneMapClient
import creds # import credentials for email and pw later
import pickle
# Complie all resale price data into one dataframe
df = []
folder_path = os.path.join(os.getcwd(),'dataset/resale-flat-prices')
for filename in os.listdir( folder_path ):
    file_path = os.path.join(folder_path, filename)
    if filename.endswith(".csv"): 
        # print(file_path)
        df = df + [pd.read_csv(file_path)]
df = pd.concat(df,axis=0).reset_index(drop=True)
df.to_csv("dataset/full_resale_prices_1990to2021.csv",index=None)

# Filter last 2 years data
df.loc[:,"month"] = pd.to_datetime(df.month)
df = df.rename({"month":"trans_date"},axis=1)
twoyears = pd.Timestamp.today() - pd.Timedelta(days=365*2)
df = df.loc[df.trans_date >= twoyears,:]
df = df.sort_values('trans_date').reset_index(drop=True)
df.head()
trans_date town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease
0 2019-11-01 ANG MO KIO 2 ROOM 314 ANG MO KIO AVE 3 07 TO 09 44.0 Improved 1978 195000.0 57 years 02 months
1 2019-11-01 QUEENSTOWN 3 ROOM 85 C'WEALTH CL 01 TO 03 58.0 Standard 1967 240000.0 46 years 02 months
2 2019-11-01 QUEENSTOWN 3 ROOM 85 C'WEALTH CL 01 TO 03 58.0 Standard 1967 265000.0 46 years 02 months
3 2019-11-01 QUEENSTOWN 3 ROOM 112 C'WEALTH CRES 01 TO 03 67.0 Standard 1969 250000.0 48 years 02 months
4 2019-11-01 QUEENSTOWN 3 ROOM 50 C'WEALTH DR 13 TO 15 63.0 Model A 2015 540000.0 94 years 04 months
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46944 entries, 0 to 46943
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   trans_date           46944 non-null  datetime64[ns]
 1   town                 46944 non-null  object        
 2   flat_type            46944 non-null  object        
 3   block                46944 non-null  object        
 4   street_name          46944 non-null  object        
 5   storey_range         46944 non-null  object        
 6   floor_area_sqm       46944 non-null  float64       
 7   flat_model           46944 non-null  object        
 8   lease_commence_date  46944 non-null  int64         
 9   resale_price         46944 non-null  float64       
 10  remaining_lease      46944 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 3.9+ MB

Data Clean-up (remaining_lease)

The data format of the “remaining_lease” column is mainly in the “## years ## months” format. This format is non-numerical and will not accurate in providing information in how the remaining lease of the HDB affect it’s resale price. Data in this column will be converted into a float value in terms of years.

# Convert remaining_lease to a numerical value
def convert_to_ym(row):
    lease = row["remaining_lease"]
    try:
        if isinstance(lease,str):                              # Format : "## years ## months"
            if len(lease) == 18:
                years = int(lease[0:2])
                months = int(lease[10:12])
                return years + months/12
            elif (len(lease) == 17) & ('months' in lease):     # Format : "## years 0 months"
                return lease[0:2]
            elif (len(lease) == 17):                           # Format : "## years ## month"
                years = int(lease[0:2])
                months = int(lease[10])
                return years + months/12
            elif len(lease) == 8:                              # Format : "## years"
                return lease[0:2]
        elif isinstance(lease, int):                          # Format : "##"
            return lease
    except:
        print("Error")
        print(lease)
        print(type(lease))

df.lease_commence_date = pd.to_datetime(df.lease_commence_date,
                                                format='%Y')
df.loc[:,"lease_commence_year"] =  df.lease_commence_date.dt.year
df.loc[:, "remaining_lease"] = df.apply(convert_to_ym,axis=1)
df.head()
trans_date town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease lease_commence_year
0 2019-11-01 ANG MO KIO 2 ROOM 314 ANG MO KIO AVE 3 07 TO 09 44.0 Improved 1978-01-01 195000.0 57.166667 1978
1 2019-11-01 QUEENSTOWN 3 ROOM 85 C'WEALTH CL 01 TO 03 58.0 Standard 1967-01-01 240000.0 46.166667 1967
2 2019-11-01 QUEENSTOWN 3 ROOM 85 C'WEALTH CL 01 TO 03 58.0 Standard 1967-01-01 265000.0 46.166667 1967
3 2019-11-01 QUEENSTOWN 3 ROOM 112 C'WEALTH CRES 01 TO 03 67.0 Standard 1969-01-01 250000.0 48.166667 1969
4 2019-11-01 QUEENSTOWN 3 ROOM 50 C'WEALTH DR 13 TO 15 63.0 Model A 2015-01-01 540000.0 94.333333 2015

Dtype Conversion and Prelimary Data Cleaning

Analyse unique entries in each individual columns and convert to categorical data if <= 100 categories.

  • Inconsistent labels such as “MULTI-GENERATION” and “MULTI GENERATION” will be standardized.
  • Resale price will be converted into in terms of thousands for easier vizualisation and understanding.
df.dtypes
trans_date             datetime64[ns]
town                           object
flat_type                      object
block                          object
street_name                    object
storey_range                   object
floor_area_sqm                float64
flat_model                     object
lease_commence_date    datetime64[ns]
resale_price                  float64
remaining_lease                object
lease_commence_year             int64
dtype: object
df.nunique()
trans_date               23
town                     26
flat_type                 7
block                  2526
street_name             551
storey_range             17
floor_area_sqm          156
flat_model               20
lease_commence_date      54
resale_price           2082
remaining_lease         585
lease_commence_year      54
dtype: int64
# Convert columns to categories dtypes
df.trans_date = df.trans_date.dt.date
df.town = df.town.astype('category')  
df.block = df.block.astype('category')
df.street_name = df.street_name.str.lower().astype('category')
df.storey_range = df.storey_range.astype('category')  
df.flat_model = df.flat_model.str.lower().astype('category')
df.remaining_lease = df.remaining_lease.astype('float')

# Convert flat_type to Category
print(df.flat_type.value_counts())
df.loc[df.flat_type=="MULTI GENERATION","flat_type"] = "MULTI-GENERATION"
df.flat_type = df.flat_type.astype('category')

# Convert Resale Price to in terms of 'thousands'
df.resale_price = df.resale_price / 1000
4 ROOM              19675
5 ROOM              12238
3 ROOM              10538
EXECUTIVE            3698
2 ROOM                756
MULTI-GENERATION       20
1 ROOM                 19
Name: flat_type, dtype: int64
# Rearrange columns
df = df.loc[:, [ 'trans_date','lease_commence_date', 'remaining_lease',
                'town', 'street_name', 'block', 'storey_range',
                'flat_model', 'flat_type', 'floor_area_sqm', 'resale_price' ] ]
df.head()
trans_date lease_commence_date remaining_lease town street_name block storey_range flat_model flat_type floor_area_sqm resale_price
0 2019-11-01 1978-01-01 57.166667 ANG MO KIO ang mo kio ave 3 314 07 TO 09 improved 2 ROOM 44.0 195.0
1 2019-11-01 1967-01-01 46.166667 QUEENSTOWN c'wealth cl 85 01 TO 03 standard 3 ROOM 58.0 240.0
2 2019-11-01 1967-01-01 46.166667 QUEENSTOWN c'wealth cl 85 01 TO 03 standard 3 ROOM 58.0 265.0
3 2019-11-01 1969-01-01 48.166667 QUEENSTOWN c'wealth cres 112 01 TO 03 standard 3 ROOM 67.0 250.0
4 2019-11-01 2015-01-01 94.333333 QUEENSTOWN c'wealth dr 50 13 TO 15 model a 3 ROOM 63.0 540.0

Trend visualization

Part 1: Overall Trend Analysis (Price vs Transaction Date)

Since time-series analysis is not going to be the main focus of the this analysis, we need to ensure that the HDB resale prices do not major fluctuations (Seasonality). Otherwise, there may be a need to account for them.

fig, ax = plt.subplots(figsize=(18,5))

sns.countplot(x="trans_date",data=df,
              ax=ax, color='#3F5D7D')
ax2 = ax.twinx()
sns.pointplot(x="trans_date", y="resale_price",
              data=df,
              ax=ax2, color="#D83A56")

fig.patch.set_facecolor('xkcd:white')
ax.spines[["top","right"]].set_visible(False)  
ax2.grid(False)
              
ax.set_title("Trend of Total Number of HDB Sales Transaction and Mean Prices each Month from Oct 2019 to Sept 2021 (last 2 years)",
            fontsize=15, fontweight=10,
            pad=8)
ax.set_xlabel("Transaction Month",
            fontsize=10, fontweight=15,
            labelpad=6)
ax.set_ylabel("Number of Transactions",
            fontsize=10, fontweight=15,
            labelpad=10)
ax2.set_ylabel("Mean Resale Price in SGD (thousands)",
            fontsize=10, fontweight=15,
            labelpad=10)
fig.legend(labels=['Number of Transaction','Mean Resale Price'],
            loc="center", bbox_to_anchor=(0.18, 0.83))
ax.axes.set_xticklabels([ f"{date.year}-{date.month}" for date in df.trans_date.unique()])

# Label Countplots (top of bar)
for p in ax.patches:
    height = p.get_height()
    ax.text(x = p.get_x()+(p.get_width()/2),
            y = height+50,
            s = "{:.0f}".format(height),
            ha = "center")
plt.xticks(rotation=10)
fig.show()

svg

Some observations for the figure above:

  • Overall increasing trend for the last 2 years, but not linearly.
  • No obvious seasonality
  • Resale prices does not seem to be affected by the number of transaction (Demand and Supply).
    • However, this is only a guage as these dataset only shows the transactions and not the number of interested buyers or unsuccessful sellers.
    • Low number of transaction in April and May 2020 is due to circuit breaker in Singapore, where there were minimum interactions.

Part 2: Trend of HDB Resale Flat Prices with respect to Time of Transaction for each Flat Type.

(Price vs Transaction Date, by Flat Type)

Flat type(i.e Number of Rooms,etc) is an important factor in every home buyer’s choice. A further exploration into individual mean HDB resale prices for each flat types in Singapore provides a deeper understanding to how the choice of flat types may affect the resales prices.

fig, ax = plt.subplots(figsize=(12,7))

sns.lineplot(x="trans_date", y="resale_price",
             data=df, hue="flat_type", ax=ax)

ax.spines[["left","right"]].set_visible(False)  
fig.patch.set_facecolor('xkcd:white')

ax.set_title("Mean Monthly HDB Resale Price for each Flat Type from Oct 2019 to Sept 2021 (last 2 years)",
            fontsize=15, fontweight=30,
            pad=8)
ax.set_xlabel("Transaction Month",
            fontsize=10, fontweight=15,
            labelpad=6)
ax.set_ylabel("Mean Resale Price in SGD (thousands)",
            fontsize=10, fontweight=15,
            labelpad=10)

fig.show()

svg

Some observations for the figure above:

  • A linear trend can be see most all flat types except the “Multi-Generation”.
    • This attributes to the overall linear increase in resale prices.
  • Non-linearity in the “Multi-Generation” flat types can be see as the main cause of fluctuation in overall trend in resale prices.

Hence, with the exception of the “Multi-Generation” flat type, both time of transaction(trans_date) and flat type will be useful in predicting the prices of the flat type.

Part 3: HDB Resale Flat Prices vs Flat Size

Similarly, choice of floor area is an important factor for a home buyer.

fig, ax = plt.subplots(figsize=(12,7))

sns.scatterplot(x="floor_area_sqm", y="resale_price", hue="flat_type",
                data=df, ax=ax)
sns.regplot(x="floor_area_sqm", y="resale_price", color='black',
                data=df, ax=ax, scatter=False)

fig.patch.set_facecolor('xkcd:white')
ax.spines[["left","right"]].set_visible(False)

ax.set_title("HDB Resale Price vs Floor Area from Oct 2019 to Sept 2021 (last 2 years)",
            fontsize=15, fontweight=30,
            pad=8)
ax.set_xlabel("Floor Area (metre square)",
            fontsize=10, fontweight=15,
            labelpad=6)
ax.set_ylabel("Mean Resale Price in SGD (thousands)",
            fontsize=10, fontweight=15,
            labelpad=10)

fig.show()

svg

Some observations for the figure above:

  • Floor area increases with increasing room size, with executive being the biggest.
  • Resale price tends to increase with floor area.
  • However, floor size is may not be an major factor due to the large range of possible resale prices(vertical range) it can lie after 50m^2.

Part 4: HDB Resale Flat Prices vs Storey Range

Choice of which floor to live in may play a part in resale price of the flat. Usually, resale prices will be higher on the higher floors due to a better view and less noise pollution from the surrounding ground floor.

fig, ax = plt.subplots(figsize=(12,7))
fig.patch.set_facecolor('xkcd:white')
data = df.groupby(["flat_type","storey_range"]).mean().reset_index()

sns.scatterplot(x="storey_range", y="resale_price", marker='o', s=50,
             data=data,hue="flat_type", ax=ax)
sns.lineplot(x="storey_range", y="resale_price",
             data=data,hue="flat_type", ax=ax)

ax.spines[["left","right"]].set_visible(False)  
ax.set_title("Mean Monthly HDB Resale Price for each Flat Type from Oct 2019 to Sept 2021 (last 2 years)",
            fontsize=15, fontweight=30,
            pad=8)
ax.set_xlabel("Storey Range",
            fontsize=10, fontweight=15,
            labelpad=6)
ax.set_ylabel("Mean Resale Price in SGD (thousands)",
            fontsize=10, fontweight=15,
            labelpad=10)
plt.xticks(rotation=45)
fig.show()

svg

Observations:

  • An overall linear trend can be seen in each flat type, except 1 room flats.
  • A flatter trend can seen in 1 and 2 room flats due to the government’s effort to keep prices low for seniors and low-income families.

(https://www.straitstimes.com/singapore/housing/prime-areas-to-be-kept-diverse-and-inclusive-with-range-of-hdb-flat-types-for)

Machine Learning Part 1

In this section, we will creating machine learning models. We will not be using “block” and “street_name” columns of our dataset because prediction using possibly new values that are not in the current dataset will cause errors and inaccuracies in the future. (i.e New blocks in new locations will be built.)

*Take note that we need to convert the transaction date(trans_date) to numeric values to ensure that we can take in future values.

(Since we are not approaching this question as a time series problem)

df.loc[:,"month_offset"] = pd.to_datetime(df.trans_date)
print(df.iloc[-1,-1])
df.loc[:,"month_offset"] = round((df.month_offset - df.iloc[-1,-1]).dt.days/30)
df.head()
2021-09-01 00:00:00
trans_date lease_commence_date remaining_lease town street_name block storey_range flat_model flat_type floor_area_sqm resale_price month_offset
0 2019-11-01 1978-01-01 57.166667 ANG MO KIO ang mo kio ave 3 314 07 TO 09 improved 2 ROOM 44.0 195.0 -22.0
1 2019-11-01 1967-01-01 46.166667 QUEENSTOWN c'wealth cl 85 01 TO 03 standard 3 ROOM 58.0 240.0 -22.0
2 2019-11-01 1967-01-01 46.166667 QUEENSTOWN c'wealth cl 85 01 TO 03 standard 3 ROOM 58.0 265.0 -22.0
3 2019-11-01 1969-01-01 48.166667 QUEENSTOWN c'wealth cres 112 01 TO 03 standard 3 ROOM 67.0 250.0 -22.0
4 2019-11-01 2015-01-01 94.333333 QUEENSTOWN c'wealth dr 50 13 TO 15 model a 3 ROOM 63.0 540.0 -22.0
fig = plt.figure(figsize=(7,5))
fig.patch.set_facecolor('xkcd:white')
sns.heatmap(df.corr(),cmap='icefire_r')
plt.show()

svg

# Basic Testing Pipeline
def ML_pipeline(data,model,num_vars,cat_vars,tree=False):
    data_pipeline = ColumnTransformer([
                    ('numerical', StandardScaler(), num_vars),
                    ('categorical', OrdinalEncoder(), cat_vars)
                    ])
    X_processed = data_pipeline.fit_transform(data)
    y = data.loc[:,'resale_price']

    X_train, X_test, y_train, y_test = train_test_split(X_processed, y,
                                                        test_size=.2,random_state=123)
    
    # Cross Validation Results
    scores = cross_validate(model, X_train, y_train,
                            scoring=["r2","neg_root_mean_squared_error"],
                            cv=5, n_jobs=-1)
    print("5CV R2 Score: {}".format( np.mean(scores['test_r2'])) )
    print("5CV RSME Score: {}".format(-1*
            np.mean(scores['test_neg_root_mean_squared_error'])) )

    # Test Results
    model.fit(X_train,y_train)
    print("Test R2 Score: {}".format(model.score(X_test,y_test)))
    print("Test RMSE Score: {}\n".format( 
         mean_squared_error(y_test,model.predict(X_test),squared=False)))

    if tree:
        try:
            print("Preparing SHAP analysis.")
            explainer = shap.TreeExplainer(model)
    
            shap_values = explainer.shap_values(X_test[:100])
            Xcolumns = num_vars + cat_vars
            fig = plt.figure(figsize=(12,7))
            shap.summary_plot(shap_values, X_test[:100],
                                Xcolumns, plot_type="bar",show=False)
            fig.patch.set_facecolor('xkcd:white')
            fig.show()
        except:
            print("Failed. No SHAP analysis. Only for Tree models.")

    return model
# Choice of Features
num_vars = ['floor_area_sqm','remaining_lease',"month_offset"]
cat_vars = ['town', 'storey_range',
            'flat_model', 'flat_type']
print("Testing w/ RF Regressor Model")
ML_pipeline(df, RandomForestRegressor(random_state=123),
            num_vars,cat_vars)
Testing w/ RF Regressor Model
5CV R2 Score: 0.9329152628966997
5CV RSME Score: 41.20370616053473
Test R2 Score: 0.9396357589822842
Test RMSE Score: 39.94865790119532





RandomForestRegressor(random_state=123)
print("Testing w/ Linear Regression Model")
ML_pipeline(df, Ridge(random_state=123),
            num_vars,cat_vars)
Testing w/ Linear Regression Model
5CV R2 Score: 0.6142070751050608
5CV RSME Score: 98.8148633031839
Test R2 Score: 0.628435989737439
Test RMSE Score: 99.11266230614957





Ridge(random_state=123)
print("Testing w/ XGBR Regressor Model")
ML_pipeline(df, xgb.XGBRegressor(),
            num_vars,cat_vars,True)
Testing w/ XGBR Regressor Model
5CV R2 Score: 0.9440758653507496
5CV RSME Score: 37.61817266649467
Test R2 Score: 0.9503124626248961
Test RMSE Score: 36.243994650021854

Preparing SHAP analysis.




XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

svg

Observation:

  • The XGboost model provides the best results among 3 models in terms of highest R^2 value and lowest RSME for both cross validation and test set result.
  • According to SHAP, Remaining_lease, town and flat type are the most important factors in determining resale price.
  • Flat_model is the least important factor as it is a subset of the flat types.

Feature Engineering

After browsing articles, property market sites and social media, 3 additional features are found to be potential features to improve our model.

1. Nearby MRTs (within 1km radius)
2. Nearby Schools (within 1km radius)
3. Nearby Primary Schools (within 1km and 2km radius)
    - Within 1km and 2km radius gives higher priority during balloting
    ( https://www.moe.gov.sg/primary/p1-registration/distance/ )
4. Nearby Malls (within 1km radius)

Assuming a straight line walking route, it takes about 12mins to walk 1km according to the calculation below. In order to determine the actual land distance, we can use geocoordinates from Singapore’s onemap API to determine lat and long distance and convert it using a multiplier with respect to Singapore location.

# Start onemap client
Client = OneMapClient(creds.email,creds.password)

def countdown(count,start):
    # Sleep counter abides onemap's rules (max 250 query/s)
    count += 1
    if count >= 250:
        end = time.time()
        time.sleep(end-start)
        start = time.time()
        count = 0
    return count,start

# Distance Multiplier Calculation
latlong_dist = np.sqrt(0.1**2 + 0.1**2) # 0.1 lat and 0.1 long difference
actual_dist_m = 15.723 * 1000 # According to https://latlongdata.com/distance-calculator/
latlong_multiplier = actual_dist_m/latlong_dist

# Assuming average walking speed as 1.4m/s 
# (according to https://www.healthline.com/health/exercise-fitness/average-walking-speed#average-speed-by-age)
kmpermin = (1.4/1000) * 60
print(f"Expected time for 1km walking distance: {1/kmpermin}")
Expected time for 1km walking distance: 11.904761904761903

Preparing MRT dataset

Dataset can be downloaded directly from the Singapore data gov website at https://data.gov.sg/dataset/train-station-chinese-names or it can be accessed via API below.

# Import Dataset via API
url = 'https://data.gov.sg/api/action/datastore_search?resource_id=65c1093b-0c34-41cf-8f0e-f11318766298&limit=1000'
response = requests.get(url)
print(f"Response Code : {response.status_code}")
mrt_df = pd.json_normalize(response.json()['result']['records'])
mrt_df = mrt_df.loc[:,["mrt_station_english","mrt_line_english"]]
mrt_df.columns = ["station","line"]
mrt_df = mrt_df.loc[mrt_df.line.str.contains("Line"),:] # remove LRTs
mrt_df.head()
Response Code : 200
station line
0 Jurong East North South Line
1 Bukit Batok North South Line
2 Bukit Gombak North South Line
3 Choa Chu Kang North South Line
4 Yew Tee North South Line
# Adding Longitude and Lattitude Data
mrt_df.loc[:,"lat"] = ""
mrt_df.loc[:,"long"] = ""
start = time.time()
count = 0
for i,row in tqdm(mrt_df.iterrows(), total=mrt_df.shape[0]):
    station_name = row["station"] + " MRT"
    try:
        search = Client.search(station_name,page_num=1)["results"][0]
        mrt_df.loc[i,"lat"] = search["LATITUDE"]
        mrt_df.loc[i,"long"] = search["LONGITUDE"]
        count,start = countdown(count,start)
    except:
        continue
mrt_df.loc[mrt_df.lat.isna(),:] # Checking for any entries with no geo location
100%|██████████| 141/141 [00:16<00:00,  8.39it/s]
station line lat long
# mrt_df = pd.read_csv("dataset/mrt_geo.csv") # If no access to onemap
mrt_df.head()
station line lat long
0 Jurong East North South Line 1.333153 103.742286
1 Bukit Batok North South Line 1.349033 103.749566
2 Bukit Gombak North South Line 1.358612 103.751791
3 Choa Chu Kang North South Line 1.385363 103.744371
4 Yew Tee North South Line 1.397535 103.747405
mrt_df = mrt_df.dropna(axis=0)
mrt_df.lat = mrt_df.lat.astype(np.float)
mrt_df.long = mrt_df.long.astype(np.float)
mrt_df.to_csv("dataset/mrt_geo.csv",index=False)

Preparing Schools Dataset

Dataset can be downloaded directly from the Singapore data gov website at https://data.gov.sg/dataset/school-directory-and-information?view_id=ba7c477d-a077-4303-96a1-ac1d4f25b190&resource_id=ede26d32-01af-4228-b1ed-f05c45a1d8ee or it can be accessed via API below.

# Import Dataset via API
url = 'https://data.gov.sg/api/action/datastore_search?resource_id=ede26d32-01af-4228-b1ed-f05c45a1d8ee&limit=1000'
response = requests.get(url)
print(f"Response Code : {response.status_code}")
school_df = pd.json_normalize(response.json()['result']['records'])
school_df = school_df.loc[:,["school_name","postal_code","mainlevel_code"]]
school_df.head()
Response Code : 200
school_name postal_code mainlevel_code
0 ADMIRALTY PRIMARY SCHOOL 738907 PRIMARY
1 ADMIRALTY SECONDARY SCHOOL 737916 SECONDARY
2 AHMAD IBRAHIM PRIMARY SCHOOL 768643 PRIMARY
3 AHMAD IBRAHIM SECONDARY SCHOOL 768928 SECONDARY
4 AI TONG SCHOOL 579646 PRIMARY
# Adding Longitude and Lattitude Data
school_df.loc[:,"lat"] = ""
school_df.loc[:,"long"] = ""
start = time.time()
count = 0
for i,row in tqdm(school_df.iterrows(), total=school_df.shape[0]):
    postal_code = row["postal_code"]
    if len(str(postal_code))<6:
        postal_code = (6-len(str(postal_code)))*"0" + str(postal_code)

    try:
        search = Client.search(postal_code,page_num=1)["results"][0]
        school_df.loc[i,"lat"] = search["LATITUDE"]
        school_df.loc[i,"long"] = search["LONGITUDE"]
        count,start = countdown(count,start)
    except:
        continue
school_df.loc[school_df.lat.isna(),:]
100%|██████████| 346/346 [01:12<00:00,  4.79it/s]
school_name postal_code mainlevel_code lat long
school_df = pd.read_csv("dataset/schools_geo.csv") # If no access to onemap
school_df.head()
school_name postal_code mainlevel_code lat long
0 ADMIRALTY PRIMARY SCHOOL 738907 PRIMARY 1.442635 103.800040
1 ADMIRALTY SECONDARY SCHOOL 737916 SECONDARY 1.445891 103.802398
2 AHMAD IBRAHIM PRIMARY SCHOOL 768643 PRIMARY 1.433414 103.832750
3 AHMAD IBRAHIM SECONDARY SCHOOL 768928 SECONDARY 1.436060 103.829719
4 AI TONG SCHOOL 579646 PRIMARY 1.360583 103.833020
school_df = school_df.dropna(axis=0)
school_df.lat = school_df.lat.astype(np.float)
school_df.long = school_df.long.astype(np.float)
school_df.to_csv("dataset/schools_geo.csv",index=False)

Preparing Malls Dataset

The list of malls are found by webscraping the wikipedia page due to no such central dataset in any government source.

url = "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
response = requests.get(url=url)
print(f"Response Code : {response.status_code}")
soup = BeautifulSoup(response.content, 'html.parser')
all_region = soup.find(id="mw-content-text")
mall_list = []
for region in all_region:
    items = all_region.find_all("ul")
    for item in items[:8]:
        for mall in item.select("li"):
            mall_list.append(str(mall.string))
mall_list = sorted(set(mall_list))
mall_df = pd.DataFrame(mall_list,columns=["mall"])
mall_df.head()
Response Code : 200
mall
0 100 AM
1 313@Somerset
2 321 Clementi
3 888 Plaza
4 AMK Hub
mall_df = pd.DataFrame(mall_list,columns=["mall"])
mall_df.loc[:,"lat"] = ""
mall_df.loc[:,"long"] = ""
start = time.time()
count = 0
for i,row in tqdm(mall_df.iterrows(), total=mall_df.shape[0]):
    mall = row["mall"]
    try:
        search = Client.search(mall,page_num=1)["results"][0]
        mall_df.loc[i,"lat"] = search["LATITUDE"]
        mall_df.loc[i,"long"] = search["LONGITUDE"]
        count,start = countdown(count,start)
    except:
        continue
mall_df.loc[mall_df.lat.isna(),:]
100%|██████████| 170/170 [00:19<00:00,  8.94it/s]
mall lat long

Take note that some of the malls are not found using onemap’s api. Therefore, lat and long are inputted manually using google map.

mall_df = pd.read_csv("dataset/malls_geo.csv") # after manual input 
mall_df.head()
mall lat long
0 100 AM 1.274588 103.843471
1 313@Somerset 1.301385 103.837684
2 321 Clementi 1.312002 103.764987
3 888 Plaza 1.437123 103.795314
4 AMK Hub 1.369389 103.848478
mall_df = mall_df.dropna(axis=0)
mall_df.lat = mall_df.lat.astype(np.float)
mall_df.long = mall_df.long.astype(np.float)
mall_df.to_csv("dataset/malls_geo.csv",index=False)

Adding Geocoordinates to the Original Dataset

The algorithm will attempt several naming convention of each entry to find the the most exact geocoordinates using onemap’s api.

df.loc[:,"lat"] = ""
df.loc[:,"long"] = ""
count = 0
start = time.time()
for i,row in tqdm(df2.iterrows(), total=df2.shape[0]):
    if row["lat"] == "":
        block = row["block"]
        street = row["street_name"]
        location = block + " " + street # "123 Charming Street Ave 2"
        index_ = df2.loc[(df2.street_name==street) &
                        (df2.block==block),:].index

        try:
            search = Client.search(location,page_num=1)["results"][0]
            df.iloc[index_,-2] = search["LATITUDE"]
            df.iloc[index_,-1] = search["LONGITUDE"]
            count,start = countdown(count,start)
        except:
            try:
                location = street  # "Charming Street Ave 2"
                search = Client.search(location,page_num=1)["results"][0]
                df.iloc[index_,-2] = search["LATITUDE"]
                df.iloc[index_,-1] = search["LONGITUDE"]
                count,start = countdown(count,start)
            except:
                try:
                    town = row["town"]
                    location = block + " " + town  # "123 Hougang"
                    search = Client.search(location,page_num=1)["results"][0]
                    df.iloc[index_,-2] = search["LATITUDE"]
                    df.iloc[index_,-1] = search["LONGITUDE"]
                    count,start = countdown(count,start)
                except:
                    print(location)
                    continue
df.loc[df.lat.isna(),:]
# Second round of iteration is required as sometimes the onemap api do not return a geolocation despite it being found before.
for i,row in df.loc[df.lat.isna(),:].iterrows():
    street_name = row["street_name"]
    block = row["block"]

    # Check if there is other rows that gotten a result
    search_df = df.loc[(df.street_name==street_name) &
                         (df.block==block),:]
    if len(search_df)>1: # Yes, there is others
        for j,row2 in search_df.iterrows():
            if ~np.isnan(row2["lat"]):
                df.loc[i,["lat","long"]] = df.loc[j,["lat","long"]]
                break
    elif ~block[-1].isdigit(): # check if last char of block is a str
        # check for other results
        search_df = df.loc[(df.street_name==street_name) &
                            (df.block.str.contains(block[:-1])),:]
        if len(search_df)>1: # Yes, there is others
            for j,row2 in search_df.iterrows():
                if ~np.isnan(row2["lat"]):
                    df.loc[i,["lat","long"]] = df.loc[j,["lat","long"]]
                    break
        else: # no results
            try:
                location = street_name + " " + block[:-1] # find directly from onemap
                search = Client.search(location,page_num=1)["results"][0]
                df.loc[i,"lat"] = search["LATITUDE"]
                df.loc[i,"long"] = search["LONGITUDE"]
            except:
                print("Not found")

    if np.isnan(df.loc[i,"lat"]):
        print(f"{street_name} at {block}")
df.loc[df.lat.isna(),:]
# df = pd.read_csv("dataset/df_geo.csv")
df.to_csv("dataset/df_geo.csv",index=False)
df.head()
trans_date lease_commence_date remaining_lease town street_name block storey_range flat_model flat_type floor_area_sqm resale_price month_offset lat long
0 2019-10-01 1979-01-01 58.666667 ANG MO KIO ANG MO KIO AVE 10 406 10 TO 12 improved 2 ROOM 44.0 223.000 1.362005 103.853880 -23.0
1 2019-10-01 1989-01-01 68.083333 WOODLANDS WOODLANDS ST 31 312 04 TO 06 simplified 4 ROOM 84.0 295.000 1.430688 103.775652 -23.0
2 2019-10-01 2000-01-01 79.750000 JURONG WEST JURONG WEST ST 64 668A 07 TO 09 model a2 4 ROOM 90.0 413.888 1.341831 103.702881 -23.0
3 2019-10-01 2000-01-01 79.750000 JURONG WEST JURONG WEST ST 64 668D 01 TO 03 model a2 4 ROOM 85.0 324.000 1.341258 103.703001 -23.0
4 2019-10-01 2001-01-01 80.083333 JURONG WEST JURONG WEST ST 64 660B 16 TO 18 model a 4 ROOM 91.0 428.000 1.336206 103.704177 -23.0

Adding Features into Main Dataset

New Features:

1. Nearby MRTs (within 1km radius)
2. Nearby Schools (within 1km radius)
3. Nearby Primary Schools (within 1km and 2km radius)
    - Within 1km and 2km radius gives higher priority during balloting
    ( https://www.moe.gov.sg/primary/p1-registration/distance/ )
4. Nearby Malls (within 1km radius)
def find_nearby(row,data,multiplier,dist=1000,rules=""):
    start = row[["lat","long"]].values
    destination = data.loc[:,["lat","long"]]
    sumofsqs = ((start-destination)**2).sum(axis=1) # pythagoras theorem
    latlong_dist = np.sqrt(sumofsqs) 
    actual_dist = latlong_dist*multiplier # convert to actual distance
    if rules=="":
        counts = sum(actual_dist<dist)
    else:
        counts = sum((actual_dist<dist) & (data.loc[:,rules[0]] == rules[1]))
    return counts
 df.loc[:,["mrt_1km","sch_1km","pri_1km","pri_2km","mall_1km"]] = ""
for i,row in tqdm(df.iterrows(), total=df.shape[0]):
    df.loc[i,"mrt_1km"] = find_nearby(row,mrt_df,latlong_multiplier)
    df.loc[i,"sch_1km"] = find_nearby(row,school_df,latlong_multiplier)
    df.loc[i,"pri_1km"] = find_nearby(row,school_df,latlong_multiplier,
                                        1000, ("mainlevel_code","PRIMARY"))
    df.loc[i,"pri_2km"] = find_nearby(row,school_df,latlong_multiplier,
                                        2000, ("mainlevel_code","PRIMARY"))
    df.loc[i,"mall_1km"] = find_nearby(row,mall_df,latlong_multiplier)
100%|██████████| 49129/49129 [08:30<00:00, 96.25it/s]
df.head()
trans_date lease_commence_date remaining_lease town street_name block storey_range flat_model flat_type floor_area_sqm resale_price month_offset lat long mrt_1km sch_1km pri_1km pri_2km mall_1km
0 2019-10-01 1979-01-01 58.666667 ANG MO KIO ANG MO KIO AVE 10 406 10 TO 12 improved 2 ROOM 44.0 223.000 1.362005 103.853880 -23.0 0 0 0 0 0
1 2019-10-01 1989-01-01 68.083333 WOODLANDS WOODLANDS ST 31 312 04 TO 06 simplified 4 ROOM 84.0 295.000 1.430688 103.775652 -23.0 0 0 0 0 0
2 2019-10-01 2000-01-01 79.750000 JURONG WEST JURONG WEST ST 64 668A 07 TO 09 model a2 4 ROOM 90.0 413.888 1.341831 103.702881 -23.0 0 0 0 0 0
3 2019-10-01 2000-01-01 79.750000 JURONG WEST JURONG WEST ST 64 668D 01 TO 03 model a2 4 ROOM 85.0 324.000 1.341258 103.703001 -23.0 0 0 0 0 0
4 2019-10-01 2001-01-01 80.083333 JURONG WEST JURONG WEST ST 64 660B 16 TO 18 model a 4 ROOM 91.0 428.000 1.336206 103.704177 -23.0 0 0 0 0 0

Machine Learning Part 2

# Choice of Features
num_vars = ['floor_area_sqm','remaining_lease',"month_offset",
            "pri_1km","pri_2km","sch_1km",
            "mrt_1km","mall_1km"]

cat_vars = ['town', 'storey_range',
            'flat_model', 'flat_type']
print("Testing w/ RF Regressor Model")
ML_pipeline(df, RandomForestRegressor(random_state=123),
            num_vars,cat_vars,"")
Testing w/ RF Regressor Model
5CV R2 Score: 0.9273124978347971
5CV RSME Score: 42.90953615991486
Test R2 Score: 0.9334845131188934
Test RMSE Score: 41.45773122598622





RandomForestRegressor(random_state=123)
print("Testing w/ XGBR Regressor Model")
ML_pipeline(df, xgb.XGBRegressor(),
            num_vars,cat_vars,True)
Testing w/ XGBR Regressor Model
5CV R2 Score: 0.9308986797886275
5CV RSME Score: 41.83816577619727
Test R2 Score: 0.935174594598615
Test RMSE Score: 40.92764588104071

Preparing SHAP analysis.




XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

svg

Previous best model(XGBoost):

  • 5CV R2 Score: 0.9440758653507496
  • 5CV RSME Score: 37.61817266649467
  • Test R2 Score: 0.9503124626248961
  • Test RMSE Score: 36.243994650021854

Observations:

  • Both models with additional new features engineered provided no improvement to the previous models.
  • Low SHAP value for all 6 new features indicates that MRTs, Schools and Malls in close proximity does not affect the resale price of HDBs
# Saving Model into a pickle
num_vars = ['floor_area_sqm','remaining_lease',"month_offset"]
cat_vars = ['town', 'storey_range',
            'flat_model', 'flat_type']
transform = ColumnTransformer([
                    ('numerical', StandardScaler(), num_vars),
                    ('categorical', OrdinalEncoder(), cat_vars)
                    ])
pipeline = Pipeline(steps=[("Transforming",transform),
                            ("Fitting",xgb.XGBRegressor())],
                            verbose=True)

pipeline.fit(df,df.resale_price)
filename = 'pipeline.pkl'
with open(filename, 'wb') as pickle_file:
    pickle.dump(pipeline, pickle_file)
pickle_file.close()

# Saving final df
df.to_csv("dataset/df_final.csv",index=False)

[Pipeline] ...... (step 1 of 2) Processing Transforming, total=   0.0s
[Pipeline] ........... (step 2 of 2) Processing Fitting, total=   1.8s

Conclusion and Future Improvements

Discovery

Using the original dataset and engineered features, we discovered the major factors that affect HDB resales price such as the location, the flat type and the remaining lease of the flat. Features such as proximity to schools, malls and MRTS are heavily marketed in social media and property markets. As these features do not greatly affect the price of HDBs, they remain as a attractive selling point. The next time you buy a HDB resale flat in Singapore, make sure you have these features because it will mean you that you getting the most bang for the buck!

Technique Used:

  • Data Wrangling
  • API retrieval (onemap and data.gov.sg)
  • Webscaping (wikipedia)
  • Machine Learning Modelling
  • Feature Engineering (via contextual knowledge)

Possible Future Improvements

  • Further Optimisation for Machine Models
  • Time Series Analysis
  • Optimising and Finding more possible Features

Predictive program can be found at https://share.streamlit.io/liankeat/resaleflatsinsg/main.