SMOTE-Tomek for Ethereum Fraud Detection
import Pkg;
Pkg.add(["Random", "CSV", "DataFrames", "MLJ", "Imbalance", "MLJBalancing",
"ScientificTypes", "CategoricalArrays", "Impute", "StatsBase", "Plots", "Measures", "HTTP"])
using Imbalance
using MLJBalancing
using CSV
using DataFrames
using ScientificTypes
using CategoricalArrays
using MLJ
using Plots
using Random
using Impute
using HTTP: download
Loading Data
In this example, we will consider the Ethereum Fraud Detection Dataset found on Kaggle, where the objective is to predict whether an Ethereum transaction is fraudulent or not (the FLAG column) given some features about the transaction.
CSV gives us the ability to easily read the dataset after it's downloaded as follows.
download("https://raw.githubusercontent.com/JuliaAI/Imbalance.jl/dev/docs/src/examples/fraud_detection/transactions.csv", "./")
df = CSV.read("./transactions.csv", DataFrame)
first(df, 5) |> pretty
There are plenty of useless columns that we can get rid of, such as Column1, Index and, probably, Address. We also have to get rid of the categorical features because SMOTE won't be able to deal with them, and doing so leaves us with more options for the model.
df = df[:,
Not([
:Column1,
:Index,
:Address,
Symbol(" ERC20 most sent token type"),
Symbol(" ERC20_most_rec_token_type"),
]),
]
first(df, 5) |> pretty
If you scroll through the printed data frame, you will find that some columns have Missing in their element type, meaning that they may contain missing values. We will use linear interpolation, last observation carried forward, and next observation carried backward to fill in the missing values. This will then allow us to call disallowmissing!(df) to return a data frame where Missing is not an element type for any column.
# linear interpolation, then last observation carried forward, then next observation carried backward
df = Impute.interp(df) |> Impute.locf() |> Impute.nocb(); disallowmissing!(df)
first(df, 5) |> pretty
Coercing Data
Let's look at the schema first
ScientificTypes.schema(df)
┌──────────────────────────────────────────────────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├──────────────────────────────────────────────────────┼────────────┼─────────┤
│ FLAG │ Count │ Int64 │
│ Avg min between sent tnx │ Continuous │ Float64 │
│ Avg min between received tnx │ Continuous │ Float64 │
│ Time Diff between first and last (Mins) │ Continuous │ Float64 │
│ Sent tnx │ Count │ Int64 │
│ Received Tnx │ Count │ Int64 │
│ Number of Created Contracts │ Count │ Int64 │
│ Unique Received From Addresses │ Count │ Int64 │
│ Unique Sent To Addresses │ Count │ Int64 │
│ min value received │ Continuous │ Float64 │
│ max value received │ Continuous │ Float64 │
│ avg val received │ Continuous │ Float64 │
│ min val sent │ Continuous │ Float64 │
│ max val sent │ Continuous │ Float64 │
│ avg val sent │ Continuous │ Float64 │
│ min value sent to contract │ Continuous │ Float64 │
│ ⋮ │ ⋮ │ ⋮ │
└──────────────────────────────────────────────────────┴────────────┴─────────┘
30 rows omitted
The FLAG target should definitely be Multiclass; the rest seem fine.
df = coerce(df, :FLAG =>Multiclass)
ScientificTypes.schema(df)
┌──────────────────────────────────────────────────────┬───────────────┬────────
│ names │ scitypes │ types ⋯
├──────────────────────────────────────────────────────┼───────────────┼────────
│ FLAG │ Multiclass{2} │ Categ ⋯
│ Avg min between sent tnx │ Continuous │ Float ⋯
│ Avg min between received tnx │ Continuous │ Float ⋯
│ Time Diff between first and last (Mins) │ Continuous │ Float ⋯
│ Sent tnx │ Count │ Int64 ⋯
│ Received Tnx │ Count │ Int64 ⋯
│ Number of Created Contracts │ Count │ Int64 ⋯
│ Unique Received From Addresses │ Count │ Int64 ⋯
│ Unique Sent To Addresses │ Count │ Int64 ⋯
│ min value received │ Continuous │ Float ⋯
│ max value received │ Continuous │ Float ⋯
│ avg val received │ Continuous │ Float ⋯
│ min val sent │ Continuous │ Float ⋯
│ max val sent │ Continuous │ Float ⋯
│ avg val sent │ Continuous │ Float ⋯
│ min value sent to contract │ Continuous │ Float ⋯
│ ⋮ │ ⋮ │ ⋱
└──────────────────────────────────────────────────────┴───────────────┴────────
1 column and 30 rows omitted
Unpacking and Splitting Data
Both MLJ and the pure functional interface of Imbalance assume that the observations table X and the target vector y are separate. We can accomplish that by using unpack from MLJ.
y, X = unpack(df, ==(:FLAG); rng=123);
first(X, 5) |> pretty
Splitting the data into train and test portions is also easy using MLJ's partition function.
(X_train, X_test), (y_train, y_test) = partition(
(X, y),
0.8,
multi = true,
shuffle = true,
stratify = y,
rng = Random.Xoshiro(41)
)
Resampling
Before deciding to oversample, let's see how severe the class imbalance problem is, if it exists at all. Ideally, you may also want to check whether the classification model is robust to this problem.
checkbalance(y) # comes from Imbalance
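If you prefer to inspect the class distribution programmatically rather than reading the checkbalance printout, a minimal sketch along the following lines should also work (it only relies on countmap from StatsBase, which was installed above):
using StatsBase
counts = countmap(y)                              # class label => number of observations
total = sum(values(counts))
for (class, count) in sort(collect(counts); by=last, rev=true)
    println(class, " => ", count, " (", round(100 * count / total; digits=1), "%)")
end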
This signals a potential class imbalance problem. Let's consider using SMOTE-Tomek to resample this data. The SMOTE-Tomek algorithm is nothing but SMOTE followed by TomekUndersampler. We can wrap these in a pipeline, along with a classification model for predictions, using BalancedModel from MLJBalancing. Let's go for a RandomForestClassifier from DecisionTree.jl for the model.
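Before building the pipeline, it may help to see the two resampling steps spelled out on the training data using Imbalance's pure functional interface. The following is a minimal sketch, assuming the smote and tomek_undersample functions accept the same ratios, min_ratios and force_min_ratios keywords as the MLJ model structs constructed below:
# SMOTE needs continuous features, so coerce the Count columns for this illustration
Xcont = coerce(X_train, Count => Continuous)
# oversample the minority class (1) to roughly half the size of the majority class
Xover, yover = smote(Xcont, y_train; ratios=Dict(1 => 0.5), rng=Random.Xoshiro(42))
# then remove Tomek links, mirroring the TomekUndersampler settings used in the pipeline below
Xres, yres = tomek_undersample(Xover, yover; min_ratios=Dict(0 => 1.3), force_min_ratios=true)
checkbalance(yres)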
import Pkg; Pkg.add("DecisionTree")
Construct the Resampling & Classification Models
oversampler = Imbalance.MLJ.SMOTE(ratios=Dict(1=>0.5), rng=Random.Xoshiro(42))
undersampler = Imbalance.MLJ.TomekUndersampler(min_ratios=Dict(0=>1.3), force_min_ratios=true)
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree
model = RandomForestClassifier(n_trees=2, rng=Random.Xoshiro(42))
RandomForestClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = -1,
n_trees = 2,
sampling_fraction = 0.7,
feature_importance = :impurity,
rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1))
Form the Pipeline using BalancedModel
balanced_model = BalancedModel(model=model, balancer1=oversampler, balancer2=undersampler)
BalancedModelProbabilistic(
model = RandomForestClassifier(
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = -1,
n_trees = 2,
sampling_fraction = 0.7,
feature_importance = :impurity,
rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1)),
balancer1 = SMOTE(
k = 5,
ratios = Dict(1 => 0.5),
rng = Xoshiro(0xa379de7eeeb2a4e8, 0x953dccb6b532b3af, 0xf597b8ff8cfd652a, 0xccd7337c571680d1),
try_preserve_type = true),
balancer2 = TomekUndersampler(
min_ratios = Dict(0 => 1.3),
force_min_ratios = true,
rng = TaskLocalRNG(),
try_preserve_type = true))
Now we can treat balanced_model like any MLJ model.
Fit the BalancedModel
# wrap the model and the data in a machine
mach_over = machine(balanced_model, X_train, y_train)
# fit the machine
fit!(mach_over, verbosity=0)
trained Machine; does not cache data
model: BalancedModelProbabilistic(model = RandomForestClassifier(max_depth = -1, …), …)
args:
1: Source @967 ⏎ Table{Union{AbstractVector{Continuous}, AbstractVector{Count}}}
2: Source @913 ⏎ AbstractVector{Multiclass{2}}
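Since the machine is trained, we could also, as a quick sanity check, score it on the held-out test split created earlier. A minimal sketch (assuming, as is usual for MLJ measures, that balanced_accuracy can be called directly on predictions and ground truth):
# hard-label predictions on the unseen test portion
y_pred = predict_mode(mach_over, X_test)
balanced_accuracy(y_pred, y_test)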
Validate the BalancedModel
cv=CV(nfolds=10)
evaluate!(mach_over, resampling=cv, measure=balanced_accuracy)
PerformanceEvaluation object with these fields:
model, measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows, resampling, repeats
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure │ operation │ measurement │ 1.96*SE │ per_fold ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy( │ predict_mode │ 0.93 │ 0.00757 │ [0.927, 0.936, ⋯
│ adjusted = false) │ │ │ │ ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
1 column omitted
Compare with RandomForestClassifier Only
To see if this represents any form of improvement, let's fit and validate the original model by itself.
# wrap the plain model and the data in a machine
mach = machine(model, X_train, y_train, scitype_check_level=0)
fit!(mach)
evaluate!(mach, resampling=cv, measure=balanced_accuracy)
PerformanceEvaluation object with these fields:
model, measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows, resampling, repeats
Extract:
┌─────────────────────┬──────────────┬─────────────┬─────────┬──────────────────
│ measure │ operation │ measurement │ 1.96*SE │ per_fold ⋯
├─────────────────────┼──────────────┼─────────────┼─────────┼──────────────────
│ BalancedAccuracy( │ predict_mode │ 0.908 │ 0.00932 │ [0.903, 0.898, ⋯
│ adjusted = false) │ │ │ │ ⋯
└─────────────────────┴──────────────┴─────────────┴─────────┴──────────────────
1 column omitted
Assuming normally distributed scores, the 95% confidence interval was 90.8±0.9% before resampling, and after resampling it became 93±0.7%, which corresponds to a small improvement in balanced accuracy.
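For reference, those intervals come straight from the measurement and 1.96*SE columns of the two evaluation reports above; the arithmetic is simply:
# interval = measurement ± 1.96*SE, read off the two reports
before = (0.908 - 0.00932, 0.908 + 0.00932)   # ≈ (0.899, 0.917)
after  = (0.93  - 0.00757, 0.93  + 0.00757)   # ≈ (0.922, 0.938)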