House King County

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Getting started

This tutorial is adapted from the corresponding MLR3 tutorial.

Loading and preparing the data

using MLJ
using PrettyPrinting
import DataFrames: DataFrame, select!, Not, describe
import Statistics
using Dates
using PyPlot
using UrlDownload


df = DataFrame(urldownload("https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/kc_housing.csv", true))
describe(df)
21×7 DataFrame
 Row │ variable       mean        min              median     max              nmissing  eltype
     │ Symbol         Union…      Any              Union…     Any              Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────
   1 │ id             4.5803e9    1000102          3.90493e9  9900000190              0  Int64
   2 │ date                       20140502T000000             20150527T000000         0  String15
   3 │ price          5.40088e5   75000.0          450000.0   7.7e6                   0  Float64
   4 │ bedrooms       3.37084     0                3.0        33                      0  Int64
   5 │ bathrooms      2.11476     0.0              2.25       8.0                     0  Float64
   6 │ sqft_living    2079.9      290              1910.0     13540                   0  Int64
   7 │ sqft_lot       15107.0     520              7618.0     1651359                 0  Int64
   8 │ floors         1.49431     1.0              1.5        3.5                     0  Float64
   9 │ waterfront     0.00754176  0                0.0        1                       0  Int64
  10 │ view           0.234303    0                0.0        4                       0  Int64
  11 │ condition      3.40943     1                3.0        5                       0  Int64
  12 │ grade          7.65687     1                7.0        13                      0  Int64
  13 │ sqft_above     1788.39     290              1560.0     9410                    0  Int64
  14 │ sqft_basement  291.509     0                0.0        4820                    0  Int64
  15 │ yr_built       1971.01     1900             1975.0     2015                    0  Int64
  16 │ yr_renovated   84.4023     0                0.0        2015                    0  Int64
  17 │ zipcode        98077.9     98001            98065.0    98199                   0  Int64
  18 │ lat            47.5601     47.1559          47.5718    47.7776                 0  Float64
  19 │ long           -122.214    -122.519         -122.23    -121.315                0  Float64
  20 │ sqft_living15  1986.55     399              1840.0     6210                    0  Int64
  21 │ sqft_lot15     12768.5     651              7620.0     871200                  0  Int64

We drop the id and date columns, which are not useful for modelling the price:

select!(df, Not([:id, :date]))
schema(df)
┌───────────────┬────────────┬─────────┐
│ names         │ scitypes   │ types   │
├───────────────┼────────────┼─────────┤
│ price         │ Continuous │ Float64 │
│ bedrooms      │ Count      │ Int64   │
│ bathrooms     │ Continuous │ Float64 │
│ sqft_living   │ Count      │ Int64   │
│ sqft_lot      │ Count      │ Int64   │
│ floors        │ Continuous │ Float64 │
│ waterfront    │ Count      │ Int64   │
│ view          │ Count      │ Int64   │
│ condition     │ Count      │ Int64   │
│ grade         │ Count      │ Int64   │
│ sqft_above    │ Count      │ Int64   │
│ sqft_basement │ Count      │ Int64   │
│ yr_built      │ Count      │ Int64   │
│ yr_renovated  │ Count      │ Int64   │
│ zipcode       │ Count      │ Int64   │
│ lat           │ Continuous │ Float64 │
│ long          │ Continuous │ Float64 │
│ sqft_living15 │ Count      │ Int64   │
│ sqft_lot15    │ Count      │ Int64   │
└───────────────┴────────────┴─────────┘

Afterwards, we convert the zip code to an unordered factor (Multiclass). We also create two binary features, isrenovated and has_basement, derived from yr_renovated and sqft_basement:

coerce!(df, :zipcode => Multiclass)
df.isrenovated  = @. !iszero(df.yr_renovated)
df.has_basement = @. !iszero(df.sqft_basement)
schema(df)
┌───────────────┬────────────────┬─────────────────────────────────┐
│ names         │ scitypes       │ types                           │
├───────────────┼────────────────┼─────────────────────────────────┤
│ price         │ Continuous     │ Float64                         │
│ bedrooms      │ Count          │ Int64                           │
│ bathrooms     │ Continuous     │ Float64                         │
│ sqft_living   │ Count          │ Int64                           │
│ sqft_lot      │ Count          │ Int64                           │
│ floors        │ Continuous     │ Float64                         │
│ waterfront    │ Count          │ Int64                           │
│ view          │ Count          │ Int64                           │
│ condition     │ Count          │ Int64                           │
│ grade         │ Count          │ Int64                           │
│ sqft_above    │ Count          │ Int64                           │
│ sqft_basement │ Count          │ Int64                           │
│ yr_built      │ Count          │ Int64                           │
│ yr_renovated  │ Count          │ Int64                           │
│ zipcode       │ Multiclass{70} │ CategoricalValue{Int64, UInt32} │
│ lat           │ Continuous     │ Float64                         │
│ long          │ Continuous     │ Float64                         │
│ sqft_living15 │ Count          │ Int64                           │
│ sqft_lot15    │ Count          │ Int64                           │
│ isrenovated   │ Count          │ Bool                            │
│ has_basement  │ Count          │ Bool                            │
└───────────────┴────────────────┴─────────────────────────────────┘
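The @. macro used above broadcasts every call in the expression, so !iszero is applied elementwise. A tiny self-contained sketch of the pattern, on toy data rather than the tutorial's DataFrame:

```julia
# `@.` turns `!iszero(x)` into an elementwise operation over the vector.
yr_renovated = [0, 1991, 0, 2004]
isrenovated  = @. !iszero(yr_renovated)   # false, true, false, true
```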

These newly created variables should be treated as OrderedFactor:

coerce!(df, :isrenovated => OrderedFactor, :has_basement => OrderedFactor);

The feature waterfront is currently encoded as an integer (0/1), but it's really just a boolean:

unique(df.waterfront)
2-element Vector{Int64}:
 0
 1

So let's recode it:

df.waterfront = @. !iszero(df.waterfront)
coerce!(df, :waterfront => OrderedFactor);

For a number of the remaining features treated as Count, there are only a few unique values; in such cases it makes more sense to recode them as OrderedFactor. This can be done with autotype:

coerce!(df, autotype(df, :few_to_finite))
schema(df)
┌───────────────┬───────────────────┬───────────────────────────────────┐
│ names         │ scitypes          │ types                             │
├───────────────┼───────────────────┼───────────────────────────────────┤
│ price         │ Continuous        │ Float64                           │
│ bedrooms      │ OrderedFactor{13} │ CategoricalValue{Int64, UInt32}   │
│ bathrooms     │ OrderedFactor{30} │ CategoricalValue{Float64, UInt32} │
│ sqft_living   │ Count             │ Int64                             │
│ sqft_lot      │ Count             │ Int64                             │
│ floors        │ OrderedFactor{6}  │ CategoricalValue{Float64, UInt32} │
│ waterfront    │ OrderedFactor{2}  │ CategoricalValue{Bool, UInt32}    │
│ view          │ OrderedFactor{5}  │ CategoricalValue{Int64, UInt32}   │
│ condition     │ OrderedFactor{5}  │ CategoricalValue{Int64, UInt32}   │
│ grade         │ OrderedFactor{12} │ CategoricalValue{Int64, UInt32}   │
│ sqft_above    │ Count             │ Int64                             │
│ sqft_basement │ Count             │ Int64                             │
│ yr_built      │ Count             │ Int64                             │
│ yr_renovated  │ OrderedFactor{70} │ CategoricalValue{Int64, UInt32}   │
│ zipcode       │ Multiclass{70}    │ CategoricalValue{Int64, UInt32}   │
│ lat           │ Continuous        │ Float64                           │
│ long          │ Continuous        │ Float64                           │
│ sqft_living15 │ Count             │ Int64                             │
│ sqft_lot15    │ Count             │ Int64                             │
│ isrenovated   │ OrderedFactor{2}  │ CategoricalValue{Bool, UInt32}    │
│ has_basement  │ OrderedFactor{2}  │ CategoricalValue{Bool, UInt32}    │
└───────────────┴───────────────────┴───────────────────────────────────┘

Let's also rescale the column price to be in 1000s of dollars:

df.price = df.price ./ 1000;

For simplicity let's just drop a few additional columns that don't seem to matter much:

select!(df, Not([:yr_renovated, :sqft_basement, :zipcode]));

Basic data visualisation

Let's plot a basic histogram of the prices to get an idea for the distribution:

plt.figure(figsize=(8,6))
plt.hist(df.price, color = "blue", edgecolor = "white", bins=50,
         density=true, alpha=0.5)
plt.xlabel("Price", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
Histogram of the prices

Let's see if there's a difference between renovated and unrenovated flats:

plt.figure(figsize=(8,6))
plt.hist(df.price[df.isrenovated .== true], color="blue", density=true,
        edgecolor="white", bins=50, label="renovated", alpha=0.5)
plt.hist(df.price[df.isrenovated .== false], color="red", density=true,
        edgecolor="white", bins=50, label="unrenovated", alpha=0.5)
plt.xlabel("Price", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.legend(fontsize=12)

Histogram of the prices depending on renovation

We can observe that renovated flats seem to achieve higher sales values, and this might thus be a relevant feature.

Likewise, this could be done to verify that condition, waterfront etc are important features.
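One quick way to back up such an impression numerically is to compare group medians. A minimal sketch on toy data (the prices and renovation flags here are made up for illustration; the same pattern applies to df.price and df.isrenovated):

```julia
using Statistics

# Hypothetical prices (in 1000s) and a renovation flag.
price       = [450.0, 620.0, 380.0, 710.0, 500.0, 840.0]
isrenovated = [false, true, false, true, false, true]

med_reno   = median(price[isrenovated])     # 710.0
med_unreno = median(price[.!isrenovated])   # 450.0
```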

Fitting a first model

DTR = @load DecisionTreeRegressor pkg=DecisionTree

y, X = unpack(df, ==(:price))
train, test = partition(collect(eachindex(y)), 0.7, shuffle=true, rng=5)
tree = machine(DTR(), X, y)

fit!(tree, rows=train);
import MLJDecisionTreeInterface ✔
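The partition call above splits the indices 70/30 after shuffling. A minimal base-Julia equivalent of that split (illustrative only; MLJ's implementation may differ in detail):

```julia
using Random

# Shuffle indices reproducibly, then cut at 70% of the data.
rng = MersenneTwister(5)
idx = shuffle(rng, 1:100)
cut = floor(Int, 0.7 * length(idx))
train, test = idx[1:cut], idx[cut+1:end]
```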

Let's see how it does

rms(y[test], MLJ.predict(tree, rows=test))
179.71280372976895
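The rms measure is the root-mean-squared error; as a sanity check, it can be computed by hand:

```julia
using Statistics

# Root-mean-squared error: sqrt of the mean squared residual.
rms_manual(y, ŷ) = sqrt(mean((y .- ŷ) .^ 2))

rms_manual([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # ≈ 1.1547
```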

Let's try to do better.

Random forest model

We might be able to improve upon the RMSE using more powerful learners.

RFR = @load RandomForestRegressor pkg=ScikitLearn
import MLJScikitLearnInterface ✔
MLJScikitLearnInterface.RandomForestRegressor

That model only accepts input in the form of Count and so we have to coerce all Finite types into Count:

coerce!(X, Finite => Count);

Now we can fit

rf_mdl = RFR()
rf = machine(rf_mdl, X, y)
fit!(rf, rows=train)

rms(y[test], MLJ.predict(rf, rows=test))
135.53923427824247

A bit better but it would be best to check this a bit more carefully:

res = evaluate(rf_mdl, X, y, resampling=CV(shuffle=true),
               measure=rms, verbosity=0)
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬────────────────────────────────────────────┐
│ measure                │ measurement │ operation │ per_fold                                   │
├────────────────────────┼─────────────┼───────────┼────────────────────────────────────────────┤
│ RootMeanSquaredError() │ 135.0       │ predict   │ [150.0, 142.0, 131.0, 125.0, 121.0, 140.0] │
└────────────────────────┴─────────────┴───────────┴────────────────────────────────────────────┘
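The six per_fold values come from CV's default of six folds. How such a resampling scheme carves the indices into folds can be sketched as follows (illustrative only; MLJ's CV also shuffles and handles edge cases differently):

```julia
# Split 1:n into `nfolds` contiguous folds, spreading the remainder
# over the first folds so sizes differ by at most one.
function kfold_indices(n, nfolds)
    sizes = fill(div(n, nfolds), nfolds)
    for i in 1:rem(n, nfolds)
        sizes[i] += 1
    end
    stops  = cumsum(sizes)
    starts = [1; stops[1:end-1] .+ 1]
    [collect(a:b) for (a, b) in zip(starts, stops)]
end

folds = kfold_indices(10, 3)   # fold sizes 4, 3, 3
```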

GBM

Let's try a different kind of model: Gradient Boosted Decision Trees from the XGBoost package, and we'll try to tune it too.

XGBR = @load XGBoostRegressor
import MLJXGBoostInterface ✔
MLJXGBoostInterface.XGBoostRegressor

It expects a Table(Continuous) input so we need to coerce X again:

coerce!(X, Count => Continuous)

xgb  = XGBR()
xgbm = machine(xgb, X, y)
fit!(xgbm, rows=train)

rms(y[test], MLJ.predict(xgbm, rows=test))
137.35137858698184

Let's try to tune it, first we define ranges for a number of useful parameters:

r1 = range(xgb, :max_depth, lower=3, upper=10)
r2 = range(xgb, :num_round, lower=1, upper=25);

And now we tune. We use a fairly coarse resolution because the grid is the product of the two ranges: with resolution 7, that's already 7^2 = 49 models to fit...

tm = TunedModel(model=xgb, tuning=Grid(resolution=7),
                resampling=CV(rng=11), ranges=[r1,r2],
                measure=rms)
mtm = machine(tm, X, y)
fit!(mtm, rows=train)

rms(y[test], MLJ.predict(mtm, rows=test))
145.8097421015522
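The size of the grid searched above can be confirmed by enumerating it by hand. The point spacing here is an assumption for illustration; MLJ's Grid may place points differently, in particular snapping integer ranges to unique integers:

```julia
# Seven evenly spaced candidates per hyperparameter range,
# rounded to integers for max_depth and num_round.
depths = round.(Int, range(3, 10; length=7))
rounds = round.(Int, range(1, 25; length=7))
grid   = collect(Iterators.product(depths, rounds))
length(grid)  # 49
```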