House King County
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
This tutorial is adapted from the corresponding MLR3 tutorial.
using MLJ
using PrettyPrinting
import DataFrames: DataFrame, select!, Not, describe
import Statistics
using Dates
using UrlDownload
df = DataFrame(urldownload("https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/kc_housing.csv", true))
describe(df)
21×7 DataFrame
 Row │ variable       mean        min              median     max              nmissing  eltype
     │ Symbol         Union…      Any              Union…     Any              Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────
   1 │ id             4.5803e9    1000102          3.90493e9  9900000190       0         Int64
   2 │ date                       20140502T000000             20150527T000000  0         String15
   3 │ price          5.40088e5   75000.0          450000.0   7.7e6            0         Float64
   4 │ bedrooms       3.37084     0                3.0        33               0         Int64
   5 │ bathrooms      2.11476     0.0              2.25       8.0              0         Float64
   6 │ sqft_living    2079.9      290              1910.0     13540            0         Int64
   7 │ sqft_lot       15107.0     520              7618.0     1651359          0         Int64
   8 │ floors         1.49431     1.0              1.5        3.5              0         Float64
   9 │ waterfront     0.00754176  0                0.0        1                0         Int64
  10 │ view           0.234303    0                0.0        4                0         Int64
  11 │ condition      3.40943     1                3.0        5                0         Int64
  12 │ grade          7.65687     1                7.0        13               0         Int64
  13 │ sqft_above     1788.39     290              1560.0     9410             0         Int64
  14 │ sqft_basement  291.509     0                0.0        4820             0         Int64
  15 │ yr_built       1971.01     1900             1975.0     2015             0         Int64
  16 │ yr_renovated   84.4023     0                0.0        2015             0         Int64
  17 │ zipcode        98077.9     98001            98065.0    98199            0         Int64
  18 │ lat            47.5601     47.1559          47.5718    47.7776          0         Float64
  19 │ long           -122.214    -122.519         -122.23    -121.315         0         Float64
  20 │ sqft_living15  1986.55     399              1840.0     6210             0         Int64
  21 │ sqft_lot15     12768.5     651              7620.0     871200           0         Int64
We drop the unrelated columns:
select!(df, Not([:id, :date]))
schema(df)
┌───────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├───────────────┼────────────┼─────────┤
│ price │ Continuous │ Float64 │
│ bedrooms │ Count │ Int64 │
│ bathrooms │ Continuous │ Float64 │
│ sqft_living │ Count │ Int64 │
│ sqft_lot │ Count │ Int64 │
│ floors │ Continuous │ Float64 │
│ waterfront │ Count │ Int64 │
│ view │ Count │ Int64 │
│ condition │ Count │ Int64 │
│ grade │ Count │ Int64 │
│ sqft_above │ Count │ Int64 │
│ sqft_basement │ Count │ Int64 │
│ yr_built │ Count │ Int64 │
│ yr_renovated │ Count │ Int64 │
│ zipcode │ Count │ Int64 │
│ lat │ Continuous │ Float64 │
│ long │ Continuous │ Float64 │
│ sqft_living15 │ Count │ Int64 │
│ sqft_lot15 │ Count │ Int64 │
└───────────────┴────────────┴─────────┘
Afterwards, we convert the zip code to an unordered factor (Multiclass). We also create two binary features, isrenovated and has_basement, derived from yr_renovated and sqft_basement:
coerce!(df, :zipcode => Multiclass)
df.isrenovated = @. !iszero(df.yr_renovated)
df.has_basement = @. !iszero(df.sqft_basement)
schema(df)
┌───────────────┬────────────────┬─────────────────────────────────┐
│ names │ scitypes │ types │
├───────────────┼────────────────┼─────────────────────────────────┤
│ price │ Continuous │ Float64 │
│ bedrooms │ Count │ Int64 │
│ bathrooms │ Continuous │ Float64 │
│ sqft_living │ Count │ Int64 │
│ sqft_lot │ Count │ Int64 │
│ floors │ Continuous │ Float64 │
│ waterfront │ Count │ Int64 │
│ view │ Count │ Int64 │
│ condition │ Count │ Int64 │
│ grade │ Count │ Int64 │
│ sqft_above │ Count │ Int64 │
│ sqft_basement │ Count │ Int64 │
│ yr_built │ Count │ Int64 │
│ yr_renovated │ Count │ Int64 │
│ zipcode │ Multiclass{70} │ CategoricalValue{Int64, UInt32} │
│ lat │ Continuous │ Float64 │
│ long │ Continuous │ Float64 │
│ sqft_living15 │ Count │ Int64 │
│ sqft_lot15 │ Count │ Int64 │
│ isrenovated │ Count │ Bool │
│ has_basement │ Count │ Bool │
└───────────────┴────────────────┴─────────────────────────────────┘
These created variables should be treated as OrderedFactor:
coerce!(df, :isrenovated => OrderedFactor, :has_basement => OrderedFactor);
The feature waterfront is currently encoded as an integer, but it's really just a boolean:
unique(df.waterfront)
2-element Vector{Int64}:
0
1
So let's recode it, as we did for the other indicator columns:
df.waterfront = @. !iszero(df.waterfront)
coerce!(df, :waterfront => OrderedFactor);
A number of the remaining features are treated as Count but have only a few unique values; in such cases it can make more sense to recode them as OrderedFactor, which can be done with autotype:
coerce!(df, autotype(df, :few_to_finite))
schema(df)
┌───────────────┬───────────────────┬───────────────────────────────────┐
│ names │ scitypes │ types │
├───────────────┼───────────────────┼───────────────────────────────────┤
│ price │ Continuous │ Float64 │
│ bedrooms │ OrderedFactor{13} │ CategoricalValue{Int64, UInt32} │
│ bathrooms │ OrderedFactor{30} │ CategoricalValue{Float64, UInt32} │
│ sqft_living │ Count │ Int64 │
│ sqft_lot │ Count │ Int64 │
│ floors │ OrderedFactor{6} │ CategoricalValue{Float64, UInt32} │
│ waterfront │ OrderedFactor{2} │ CategoricalValue{Bool, UInt32} │
│ view │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
│ condition │ OrderedFactor{5} │ CategoricalValue{Int64, UInt32} │
│ grade │ OrderedFactor{12} │ CategoricalValue{Int64, UInt32} │
│ sqft_above │ Count │ Int64 │
│ sqft_basement │ Count │ Int64 │
│ yr_built │ Count │ Int64 │
│ yr_renovated │ OrderedFactor{70} │ CategoricalValue{Int64, UInt32} │
│ zipcode │ Multiclass{70} │ CategoricalValue{Int64, UInt32} │
│ lat │ Continuous │ Float64 │
│ long │ Continuous │ Float64 │
│ sqft_living15 │ Count │ Int64 │
│ sqft_lot15 │ Count │ Int64 │
│ isrenovated │ OrderedFactor{2} │ CategoricalValue{Bool, UInt32} │
│ has_basement │ OrderedFactor{2} │ CategoricalValue{Bool, UInt32} │
└───────────────┴───────────────────┴───────────────────────────────────┘
Let's also rescale the column price to be in 1000s of dollars:
df.price = df.price ./ 1000;
For simplicity let's just drop a few additional columns that don't seem to matter much:
select!(df, Not([:yr_renovated, :sqft_basement, :zipcode]));
Let's plot a basic histogram of the prices to get an idea for the distribution:
using Plots
histogram(df.price, color = "blue", normalize=:pdf, bins=50, alpha=0.5, legend=false)
xlabel!("Price")
ylabel!("Frequency")
Let's see if there's a difference between renovated and unrenovated flats:
histogram(df.price[df.isrenovated .== true], color = "blue", normalize=:pdf, bins=50, alpha=0.5, label="renovated")
histogram!(df.price[df.isrenovated .== false], color = "red", normalize=:pdf, bins=50, alpha=0.5, label="unrenovated")
xlabel!("Price")
ylabel!("Frequency")
We can observe that renovated flats seem to achieve higher sale prices, so isrenovated might be a relevant feature. Likewise, similar plots could be used to verify that condition, waterfront, etc. are important features, as sketched below.
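For instance, a quick sketch for waterfront, reusing the plotting pattern from above (hypothetical code, not part of the original tutorial):
histogram(df.price[df.waterfront .== true], color = "blue", normalize=:pdf, bins=50, alpha=0.5, label="waterfront")
histogram!(df.price[df.waterfront .== false], color = "red", normalize=:pdf, bins=50, alpha=0.5, label="no waterfront")
xlabel!("Price")
ylabel!("Frequency")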
DTR = @load DecisionTreeRegressor pkg=DecisionTree
import MLJDecisionTreeInterface ✔
y, X = unpack(df, ==(:price))
train, test = partition(collect(eachindex(y)), 0.7, shuffle=true, rng=5)
tree = machine(DTR(), X, y)
fit!(tree, rows=train);
Let's see how it does:
rms(y[test], MLJ.predict(tree, rows=test))
179.64505965923362
Let's try to do better.
We might be able to improve upon the RMSE using more powerful learners.
RFR = @load RandomForestRegressor pkg=MLJScikitLearnInterface
import MLJScikitLearnInterface ✔
MLJScikitLearnInterface.RandomForestRegressor
That model only accepts input in the form of Count and so we have to coerce all Finite types into Count:
coerce!(X, Finite => Count);
Now we can fit the model:
rf_mdl = RFR()
rf = machine(rf_mdl, X, y)
fit!(rf, rows=train)
rms(y[test], MLJ.predict(rf, rows=test))
136.90049515798702
A bit better, but it would be best to check this more carefully:
res = evaluate(rf_mdl, X, y, resampling=CV(shuffle=true),
               measure=rms, verbosity=0)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌────────────────────────┬───────────┬─────────────┐
│ measure │ operation │ measurement │
├────────────────────────┼───────────┼─────────────┤
│ RootMeanSquaredError() │ predict │ 136.0 │
└────────────────────────┴───────────┴─────────────┘
┌────────────────────────────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├────────────────────────────────────────────┼─────────┤
│ [135.0, 132.0, 130.0, 138.0, 135.0, 145.0] │ 4.37 │
└────────────────────────────────────────────┴─────────┘
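The aggregated measurement and the per-fold scores shown in the tables are also available as fields of the returned PerformanceEvaluation object:
res.measurement[1]  # aggregated RMSE across the folds
res.per_fold[1]     # the per-fold RMSE values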
Let's try a different kind of model: gradient boosted decision trees from the XGBoost package, and we'll try to tune it too.
XGBR = @load XGBoostRegressor
import MLJXGBoostInterface ✔
MLJXGBoostInterface.XGBoostRegressor
It expects a Table(Continuous) input, so we need to coerce X again:
coerce!(X, Count => Continuous)
xgb = XGBR()
xgbm = machine(xgb, X, y)
fit!(xgbm, rows=train)
rms(y[test], MLJ.predict(xgbm, rows=test))
136.7734795543891
Let's try to tune it. First we define ranges for two useful parameters:
r1 = range(xgb, :max_depth, lower=3, upper=10)
r2 = range(xgb, :num_round, lower=1, upper=25);
And now we tune. We use a fairly coarse resolution because grids grow quickly with the number of ranges: even with just these two, a resolution of 7 already means 7^2 = 49 models to fit...
tm = TunedModel(model=xgb, tuning=Grid(resolution=7),
resampling=CV(rng=11), ranges=[r1,r2],
measure=rms)
mtm = machine(tm, X, y)
fit!(mtm, rows=train)
rms(y[test], MLJ.predict(mtm, rows=test))
138.20745278225192
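On this test set the tuned model does not improve on the default XGBoost fit. To see which hyperparameters the grid search selected, we can query the tuned machine; a minimal sketch using fitted_params (the variable name best_xgb is just for illustration):
best_xgb = fitted_params(mtm).best_model
@show best_xgb.max_depth best_xgb.num_round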