Credit Card Fraud
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
Classification of fraudulent and non-fraudulent credit card transactions (imbalanced data). By Kristian Bjarnason. The original script can be found here
Editor's note. To reduce training times, we have reduced the original number of data observations. To reinstate the full dataset (about 285k observations), change reduction=0.05 to reduction=1. The data is highly imbalanced, and this is ignored when training some models. Some other changes to Bjarnason's original notebook are noted at the end.
using Dates, Statistics, LinearAlgebra, Random # standard libraries
using MLJ, Plots, DataFrames, UrlDownload
using CSV # needed for `urldownload` to work
import StatsBase # needed for `countmap`
Adjusting fontsize in plotting:
Plots.scalefontsizes(0.85)
Divide the sample into two equal sub-samples. Keep the proportion of frauds the same in each sub-sample (246 frauds in each). Use one sub-sample to estimate (train) your models and the second one to evaluate the out-of-sample performance of each model.
Importing the data:
table = urldownload(
"https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv",
);
data = DataFrame(table)
first(data, 4)
4×31 DataFrame
Row │ Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.0 -1.35981 -0.0727812 2.53635 1.37816 -0.338321 0.462388 0.239599 0.0986979 0.363787 0.0907942 -0.5516 -0.617801 -0.99139 -0.311169 1.46818 -0.470401 0.207971 0.0257906 0.403993 0.251412 -0.0183068 0.277838 -0.110474 0.0669281 0.128539 -0.189115 0.133558 -0.0210531 149.62 0
2 │ 0.0 1.19186 0.266151 0.16648 0.448154 0.0600176 -0.0823608 -0.078803 0.0851017 -0.255425 -0.166974 1.61273 1.06524 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.0690831 -0.225775 -0.638672 0.101288 -0.339846 0.16717 0.125895 -0.0089831 0.0147242 2.69 0
3 │ 1.0 -1.35835 -1.34016 1.77321 0.37978 -0.503198 1.8005 0.791461 0.247676 -1.51465 0.207643 0.624501 0.0660837 0.717293 -0.165946 2.34586 -2.89008 1.10997 -0.121359 -2.26186 0.52498 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.0553528 -0.0597518 378.66 0
4 │ 1.0 -0.966272 -0.185226 1.79299 -0.863291 -0.0103089 1.2472 0.237609 0.377436 -1.38702 -0.0549519 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.05965 -0.684093 1.96578 -1.23262 -0.208038 -0.1083 0.0052736 -0.190321 -1.17558 0.647376 -0.221929 0.0627228 0.0614576 123.5 0
Inspecting the scientific types of variables contained in the DataFrame:
schema(data)
┌────────┬────────────┬─────────┐
│ names │ scitypes │ types │
├────────┼────────────┼─────────┤
│ Time │ Continuous │ Float64 │
│ V1 │ Continuous │ Float64 │
│ V2 │ Continuous │ Float64 │
│ V3 │ Continuous │ Float64 │
│ V4 │ Continuous │ Float64 │
│ V5 │ Continuous │ Float64 │
│ V6 │ Continuous │ Float64 │
│ V7 │ Continuous │ Float64 │
│ V8 │ Continuous │ Float64 │
│ V9 │ Continuous │ Float64 │
│ V10 │ Continuous │ Float64 │
│ V11 │ Continuous │ Float64 │
│ V12 │ Continuous │ Float64 │
│ V13 │ Continuous │ Float64 │
│ V14 │ Continuous │ Float64 │
│ V15 │ Continuous │ Float64 │
│ V16 │ Continuous │ Float64 │
│ V17 │ Continuous │ Float64 │
│ V18 │ Continuous │ Float64 │
│ V19 │ Continuous │ Float64 │
│ V20 │ Continuous │ Float64 │
│ V21 │ Continuous │ Float64 │
│ V22 │ Continuous │ Float64 │
│ V23 │ Continuous │ Float64 │
│ V24 │ Continuous │ Float64 │
│ V25 │ Continuous │ Float64 │
│ V26 │ Continuous │ Float64 │
│ V27 │ Continuous │ Float64 │
│ V28 │ Continuous │ Float64 │
│ Amount │ Continuous │ Float64 │
│ Class │ Count │ Int64 │
└────────┴────────────┴─────────┘
The Time column is not relevant to our analysis, so we drop it:
select!(data, Not(:Time));
And the target variable, Class, should not be interpreted by our algorithms as a Count variable. We'll view it as an ordered factor (i.e., binary data with an intrinsic positive class, corresponding here to 1, the second in the lexicographic ordering).
coerce!(data, :Class => OrderedFactor);
We can check by calling schema
again, or like this:
scitype(data.Class)
AbstractVector{OrderedFactor{2}} (alias for AbstractArray{ScientificTypesBase.OrderedFactor{2}, 1})
levels(data.Class) # second element is `positive` class
2-element Vector{Int64}:
0
1
Let's get a summary of the remaining data.
describe(data)
30×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
1 │ V1 1.17516e-15 -56.4075 0.0181088 2.45493 0 Float64
2 │ V2 3.38497e-16 -72.7157 0.0654856 22.0577 0 Float64
3 │ V3 -1.43063e-15 -48.3256 0.179846 9.38256 0 Float64
4 │ V4 2.09485e-15 -5.68317 -0.0198465 16.8753 0 Float64
5 │ V5 1.02188e-15 -113.743 -0.0543358 34.8017 0 Float64
6 │ V6 1.4945e-15 -26.1605 -0.274187 73.3016 0 Float64
7 │ V7 -5.74807e-16 -43.5572 0.0401031 120.589 0 Float64
8 │ V8 1.21348e-16 -73.2167 0.022358 20.0072 0 Float64
9 │ V9 -2.42058e-15 -13.4341 -0.0514287 15.595 0 Float64
10 │ V10 2.23536e-15 -24.5883 -0.0929174 23.7451 0 Float64
11 │ V11 1.69887e-15 -4.79747 -0.0327574 12.0189 0 Float64
12 │ V12 -1.21987e-15 -18.6837 0.140033 7.84839 0 Float64
13 │ V13 8.36663e-16 -5.79188 -0.0135681 7.12688 0 Float64
14 │ V14 1.21348e-15 -19.2143 0.0506013 10.5268 0 Float64
15 │ V15 4.87947e-15 -4.49894 0.0480715 8.87774 0 Float64
16 │ V16 1.43542e-15 -14.1299 0.0664133 17.3151 0 Float64
17 │ V17 -3.73625e-16 -25.1628 -0.0656758 9.25353 0 Float64
18 │ V18 9.70785e-16 -9.49875 -0.00363631 5.04107 0 Float64
19 │ V19 1.03785e-15 -7.21353 0.00373482 5.59197 0 Float64
20 │ V20 6.38674e-16 -54.4977 -0.0624811 39.4209 0 Float64
21 │ V21 1.62862e-16 -34.8304 -0.0294502 27.2028 0 Float64
22 │ V22 -3.44884e-16 -10.9331 0.00678194 10.5031 0 Float64
23 │ V23 2.61857e-16 -44.8077 -0.0111929 22.5284 0 Float64
24 │ V24 4.47391e-15 -2.83663 0.0409761 4.58455 0 Float64
25 │ V25 5.1094e-16 -10.2954 0.0165935 7.51959 0 Float64
26 │ V26 1.6845e-15 -2.60455 -0.0521391 3.51735 0 Float64
27 │ V27 -3.6634e-16 -22.5657 0.00134215 31.6122 0 Float64
28 │ V28 -1.22146e-16 -15.4301 0.0112438 33.8478 0 Float64
29 │ Amount 88.3496 0.0 22.0 25691.2 0 Float64
30 │ Class 0 1 0 CategoricalValue{Int64, UInt32}
Note that the Amount variable spans a wide range of values. To compress this range, we take logs. Since some values are 0, we first add 1e-6 to each value. We replace the column in place using the ! row selector:
data[!,:Amount] = log.(data[!,:Amount] .+ 1e-6);
histogram(data.Amount)
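The effect of the 1e-6 offset can be checked on a few toy values. The sketch below is illustrative only: the amounts vector is made up, and log1p is shown as a commonly used alternative, not as what this tutorial does.

```julia
# Toy amounts (hypothetical values, not taken from the dataset)
amounts = [0.0, 1.0, 22.0, 25691.2]

# The tutorial's transform: zero amounts map to log(1e-6), far below the rest
shifted = log.(amounts .+ 1e-6)

# A common alternative (an assumption here, not the tutorial's choice):
# log1p(x) = log(1 + x) maps zero to zero
alt = log1p.(amounts)

println(round.(shifted; digits=2))
println(round.(alt; digits=2))
```

With the offset, zero amounts land near -13.8, so they would appear as an isolated bar at the far left of a histogram of the transformed column.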
Next we unpack the dataframe, creating a separate frame X for input features (predictors) and a vector y for the target variable. Because of class imbalance, we make the partition stratified, and we also discard some observations to reduce training times. Change the next line to reduction = 1 to keep all the data:
reduction = 0.05
frac_train = 0.8*reduction
frac_test = 0.2*reduction
y, X = unpack(data, ==(:Class))
(Xtrain, Xtest, _), (ytrain, ytest, _) =
partition((X, y), frac_train, frac_test; stratify=y, multi=true, rng=111);
StatsBase.countmap(ytrain)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Int64} with 2 entries:
0 => 11373
1 => 20
StatsBase.countmap(ytest)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Int64} with 2 entries:
0 => 2843
1 => 5
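To make the idea of a stratified partition concrete, here is a minimal sketch in plain Julia. It is not MLJ's implementation: the helper stratified_split and the synthetic labels are hypothetical, and it only handles a two-way split.

```julia
using Random

# Hypothetical helper: split indices class by class, so that each part
# keeps (approximately) the class proportions of the full sample.
function stratified_split(y, frac; rng=Random.default_rng())
    first_part, second_part = Int[], Int[]
    for level in unique(y)
        idx = shuffle(rng, findall(==(level), y))
        n_first = round(Int, frac * length(idx))
        append!(first_part, idx[1:n_first])
        append!(second_part, idx[n_first+1:end])
    end
    return first_part, second_part
end

# Synthetic labels with 1% positives (made up for illustration)
y_toy = vcat(fill(0, 990), fill(1, 10))
train, test = stratified_split(y_toy, 0.8; rng=Xoshiro(111))

println("positives in train: ", count(==(1), y_toy[train]))
println("positives in test:  ", count(==(1), y_toy[test]))
```

Without stratification, a random split of such rare positives could easily leave the test part with very few or even no fraud cases.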
We will estimate three different models:
logit
support vector machines
neural network.
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels
model_logit = LogisticClassifier(lambda=1.0)
mach = machine(model_logit, Xtrain, ytrain) |> fit!
import MLJLinearModels ✔
trained Machine; caches model-specific representations of data
model: LogisticClassifier(lambda = 1.0, …)
args:
1: Source @030 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @408 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
Predictions
LogisticClassifier
is a probabilistic predictor, i.e. for each observation in the sample it attaches a probability to each of the possible values of the target. To recover a deterministic output, we use predict_mode
instead of predict
:
yhat_logit = predict_mode(mach, Xtest);
first(yhat_logit, 4)
# How does this model perform?
confusion_matrix(yhat_logit, ytest)
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ 0 │ 1 │
├─────────┼──────┼──────┤
│ 0 │ 2843 │ 5 │
├─────────┼──────┼──────┤
│ 1 │ 0 │ 0 │
└─────────┴──────┴──────┘
To plot a receiver operating characteristic (ROC) curve, we need the probabilistic predictions:
yhat = predict(mach, Xtest);
yhat[1:3]
3-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, Int64, UInt32, Float64}:
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.998, 1=>0.00178)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.998, 1=>0.00174)
UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.998, 1=>0.00185)
false_positive_rates, true_positive_rates, thresholds =
roc_curve(yhat, ytest)
plot(false_positive_rates, true_positive_rates)
plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black, label=:none)
xlabel!("false positive rate")
ylabel!("true positive rate")
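For readers who want to see what lies behind a ROC curve, the following sketch computes the points by hand in plain Julia. The function roc_points and the toy scores are hypothetical; MLJ's roc_curve may order or pad its outputs differently.

```julia
# scores: predicted probability of the positive class
# labels: true if the observation is actually positive
function roc_points(scores, labels)
    thresholds = sort(unique(scores); rev=true)
    n_pos = count(labels)
    n_neg = length(labels) - n_pos
    fpr, tpr = Float64[], Float64[]
    for t in thresholds
        predicted = scores .>= t               # classify as positive at threshold t
        push!(tpr, count(predicted .& labels) / n_pos)
        push!(fpr, count(predicted .& .!labels) / n_neg)
    end
    return fpr, tpr, thresholds
end

# Toy data (made up for illustration)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [true, false, true, false, false, false]
fpr, tpr, thresholds = roc_points(scores, labels)
```

Each threshold yields one (false positive rate, true positive rate) pair; lowering the threshold moves along the curve towards the top-right corner.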
misclassification_rate(yhat_logit, ytest)
0.0017556179775280898
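It helps to compare this number with a trivial baseline. The short calculation below uses only the test-set class counts reported earlier and computes the error of a classifier that always predicts class 0.

```julia
# Test-set class counts reported by countmap(ytest) above
n_negative, n_positive = 2843, 5

# Error rate of a trivial classifier that always predicts class 0
baseline_error = n_positive / (n_negative + n_positive)
println(baseline_error)
```

The baseline error coincides with the misclassification rate just reported, which is consistent with the confusion matrix: every test observation was assigned to class 0.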
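The misclassification rate looks low, but the confusion matrix shows that the model predicts class 0 for every test observation, so none of the 5 frauds is detected; the low rate simply reflects the class imbalance. Let's see if we can do better with a little tuning.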
Still LogisticClassifier but implementing hyperparameter tuning.
r = range(model_logit, :lambda, lower=1e-6, upper=100, scale=:log)
self_tuning_logit_model = TunedModel(
model_logit,
tuning = Grid(resolution=10),
resampling = CV(nfolds=3),
range = r,
measure = misclassification_rate,
)
mach = machine(self_tuning_logit_model, Xtrain, ytrain) |> fit!
trained Machine; does not cache data
model: ProbabilisticTunedModel(model = LogisticClassifier(lambda = 1.0, …), …)
args:
1: Source @299 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
2: Source @426 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
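To get a feel for which lambda values a log-scaled grid of resolution 10 covers, one can generate evenly spaced exponents by hand. This is a sketch of the idea only; the exact values MLJ's Grid strategy evaluates are not confirmed here and may differ.

```julia
lower, upper, resolution = 1e-6, 100.0, 10

# Evenly spaced in log10, hence a constant ratio between neighbouring values
grid = 10 .^ range(log10(lower), log10(upper); length=resolution)

foreach(println, grid)
```

A log scale spends as many grid points between 1e-6 and 1e-5 as between 10 and 100, which suits a regularization strength whose plausible values span several orders of magnitude.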
Predictions
yhat_logit_tuned = predict_mode(mach, Xtest);
Let's take a look at the misclassification_rate and compare it with the one calculated for the untuned model.
@show misclassification_rate(yhat_logit_tuned, ytest)
misclassification_rate(yhat_logit_tuned, ytest) = 0.001053370786516854
This is lower, although the difference may not be statistically significant.
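A rough way to gauge the uncertainty is a binomial standard error for the untuned error rate. The sketch below relies on a normal approximation and ignores the fact that both models were scored on the same test set, so treat it as a back-of-the-envelope check rather than a formal test.

```julia
n = 2848                              # test-set size (2843 + 5)
p_untuned = 0.0017556179775280898     # reported above
p_tuned   = 0.001053370786516854      # reported above

# Binomial standard error of the untuned error rate
se = sqrt(p_untuned * (1 - p_untuned) / n)
difference = p_untuned - p_tuned

println("standard error: ", se)
println("difference:     ", difference)
```

The difference is smaller than one standard error, which is what one would expect when the two rates differ by only two misclassified observations.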
Initial SVM classification with cost = 1.0:
SVC = @load SVC pkg = LIBSVM
import MLJLIBSVMInterface ✔
MLJLIBSVMInterface.SVC
To fit the SVM, we declare a pipeline which comprises both a standardizer and the model. Training is substantially longer than for the preceding linear model (over 10 minutes):
model_svm = Standardizer() |> SVC()
mach = machine(model_svm, Xtrain, ytrain) |> fit!
yhat_svm = predict(mach, Xtest)
confusion_matrix(yhat_svm, ytest)
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ 0 │ 1 │
├─────────┼──────┼──────┤
│ 0 │ 2843 │ 4 │
├─────────┼──────┼──────┤
│ 1 │ 0 │ 1 │
└─────────┴──────┴──────┘
@show misclassification_rate(yhat_svm, ytest)
misclassification_rate(yhat_svm, ytest) = 0.0014044943820224719
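The standardizer step in the pipeline above rescales each feature to zero mean and unit standard deviation, using statistics learned from the training data. A minimal sketch with made-up matrices (rows are observations) looks like this:

```julia
using Statistics

# Toy data (hypothetical): two features on very different scales
Xtrain_toy = [1.0 100.0; 2.0 200.0; 3.0 300.0]
Xtest_toy  = [2.0 400.0]

# Column means and standard deviations, learned from the training rows only
μ = mean(Xtrain_toy; dims=1)
σ = std(Xtrain_toy; dims=1)

# The same shift and scale are applied to training and test data
Ztrain = (Xtrain_toy .- μ) ./ σ
Ztest  = (Xtest_toy .- μ) ./ σ
```

Kernel methods such as an SVM with a radial basis kernel depend on distances between observations, so features on larger scales would otherwise dominate.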
Tuned SVM
r = range(model_svm, :(svc.cost), lower=0.1, upper=3.5, scale=:linear)
self_tuning_svm_model = TunedModel(
model_svm,
resampling = CV(nfolds=3),
tuning = Grid(resolution=6),
range = r,
measure = misclassification_rate,
)
mach = machine(self_tuning_svm_model, Xtrain, ytrain) |> fit!
fitted_params(mach).best_model
DeterministicPipeline(
standardizer = Standardizer(
features = Symbol[],
ignore = false,
ordered_factor = false,
count = false),
svc = SVC(
kernel = LIBSVM.Kernel.RadialBasis,
gamma = 0.0,
cost = 3.5,
cachesize = 200.0,
degree = 3,
coef0 = 0.0,
tolerance = 0.001,
shrinking = true),
cache = true)
plot(mach)
yhat_svm_tuned = predict(mach, Xtest)
confusion_matrix(yhat_svm_tuned, ytest)
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ 0 │ 1 │
├─────────┼──────┼──────┤
│ 0 │ 2843 │ 3 │
├─────────┼──────┼──────┤
│ 1 │ 0 │ 2 │
└─────────┴──────┴──────┘
misclassification_rate(yhat_svm_tuned, ytest)
0.001053370786516854
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux
import MLJFlux ✔
MLJFlux.NeuralNetworkClassifier
We assume familiarity with the building blocks of Flux models. In MLJFlux, a builder is essentially a rule for creating a Flux chain, once the data has been inspected for size. See the MLJFlux documentation for further details. We do not specify the softmax "finalizer" because MLJFlux classifiers add that under the hood.
import MLJFlux.@builder
using Flux
builder = @builder Chain(
Dense(n_in, 16, relu),
Dropout(0.1; rng=rng),
Dense(16, n_out),
)
GenericBuilder(apply = #1)
In the @builder macro call, n_in
, n_out
, and rng
are replaced with the actual number of input features found in the data, the actual number of output classes, and the rng
specified in the model hyperparameters (see below).
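As a rough picture of what the resulting chain computes at prediction time (when dropout is inactive), here is a plain-Julia sketch of the forward pass. It does not use Flux; the weights are random stand-ins for trained parameters, and the helper functions relu and softmax are defined locally for illustration.

```julia
using Random

relu(x) = max(x, zero(x))
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

n_in, n_out = 29, 2          # V1..V28 plus Amount; two classes
rng = Xoshiro(123)

# Randomly initialised weights stand in for trained parameters
W1, b1 = 0.1 .* randn(rng, 16, n_in), zeros(16)
W2, b2 = 0.1 .* randn(rng, n_out, 16), zeros(n_out)

x = randn(rng, n_in)                  # one synthetic observation
hidden = relu.(W1 * x .+ b1)          # corresponds to Dense(n_in, 16, relu)
logits = W2 * hidden .+ b2            # corresponds to Dense(16, n_out)
probs = softmax(logits)               # the finaliser step
```

The final vector holds one probability per class, which is the form a probabilistic classifier needs for its predictions.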
We are now ready to specify the MLJFlux model. If you are running with a GPU, you can try adding the option acceleration=CUDALibs().
rng = Xoshiro(123)
model = NeuralNetworkClassifier(
; builder,
loss=(yhat, y)->Flux.tversky_loss(yhat, y, β=0.9), # combines precision and recall
batch_size = round(Int, reduction*2048),
epochs=50,
rng,
)
NeuralNetworkClassifier(
builder = GenericBuilder(
apply = var"#1#2"()),
finaliser = NNlib.softmax,
optimiser = Flux.Optimise.Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}()),
loss = var"#3#4"(),
epochs = 50,
batch_size = 102,
lambda = 0.0,
alpha = 0.0,
rng = Random.Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80),
optimiser_changes_trigger_retraining = false,
acceleration = ComputationalResources.CPU1{Nothing}(nothing))
Although we have not paid attention to it so far (and probably should have) there is substantial class imbalance for our target:
StatsBase.countmap(y)
Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Int64} with 2 entries:
0 => 284315
1 => 492
We will address this by wrapping our model in a SMOTE oversampling strategy, using MLJ's BalancedModel wrapper. Here are options for oversampling:
models("oversampler")
7-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = BorderlineSMOTE1, package_name = Imbalance, ... )
(name = ROSE, package_name = Imbalance, ... )
(name = RandomOversampler, package_name = Imbalance, ... )
(name = RandomWalkOversampler, package_name = Imbalance, ... )
(name = SMOTE, package_name = Imbalance, ... )
(name = SMOTEN, package_name = Imbalance, ... )
(name = SMOTENC, package_name = Imbalance, ... )
We'll use SMOTE:
SMOTE = @load SMOTE pkg=Imbalance
balanced_model = BalancedModel(model, oversampler=SMOTE())
import Imbalance ✔
BalancedModelProbabilistic(
model = NeuralNetworkClassifier(
builder = GenericBuilder(apply = #1),
finaliser = NNlib.softmax,
optimiser = Flux.Optimise.Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}()),
loss = var"#3#4"(),
epochs = 50,
batch_size = 102,
lambda = 0.0,
alpha = 0.0,
rng = Random.Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80),
optimiser_changes_trigger_retraining = false,
acceleration = ComputationalResources.CPU1{Nothing}(nothing)),
oversampler = SMOTE(
k = 5,
ratios = 1.0,
rng = Random.TaskLocalRNG(),
try_preserve_type = true))
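The core idea of SMOTE is to create synthetic minority observations by interpolating between a minority point and one of its k nearest minority neighbours. The sketch below illustrates that idea in plain Julia; it is not the Imbalance.jl implementation, the function smote_sketch is hypothetical, and it stores one observation per column and assumes no duplicate points.

```julia
using Random, LinearAlgebra

# X_min: matrix with one minority observation per column
function smote_sketch(X_min, n_new; k=5, rng=Random.default_rng())
    n = size(X_min, 2)
    synthetic = similar(X_min, size(X_min, 1), n_new)
    for j in 1:n_new
        i = rand(rng, 1:n)                                  # pick a minority point
        dists = [norm(X_min[:, i] - X_min[:, m]) for m in 1:n]
        neighbours = sortperm(dists)[2:min(k + 1, n)]       # skip the point itself
        m = rand(rng, neighbours)                           # pick one neighbour
        λ = rand(rng)                                       # interpolation weight in [0, 1)
        synthetic[:, j] = X_min[:, i] + λ * (X_min[:, m] - X_min[:, i])
    end
    return synthetic
end

# Toy minority points in the unit square (made up for illustration)
X_min = [0.0 1.0 0.0 1.0 0.5;
         0.0 0.0 1.0 1.0 0.5]
S = smote_sketch(X_min, 10; k=3, rng=Xoshiro(1))
```

Because each synthetic point lies on a segment between two existing minority points, the new observations stay within the region already occupied by the minority class.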
Our final model adds standardization as a pre-processor:
model_nn = Standardizer() |> balanced_model
ProbabilisticPipeline(
standardizer = Standardizer(
features = Symbol[],
ignore = false,
ordered_factor = false,
count = false),
balanced_model_probabilistic = BalancedModelProbabilistic(
model = NeuralNetworkClassifier(builder = GenericBuilder(apply = #1), …),
oversampler = SMOTE(k = 5, …)),
cache = true)
mach = machine(model_nn, Xtrain, ytrain) |> fit!
yhat_nn = predict_mode(mach, Xtest);
confusion_matrix(yhat_nn, ytest)
┌─────────────┐
│Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│ 0 │ 1 │
├─────────┼──────┼──────┤
│ 0 │ 2843 │ 2 │
├─────────┼──────┼──────┤
│ 1 │ 0 │ 3 │
└─────────┴──────┴──────┘
misclassification_rate(yhat_nn, ytest)
0.0007022471910112359
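Since frauds are so rare, recall (the share of actual frauds that are detected) is arguably more informative than the misclassification rate. The snippet below recomputes it from the confusion matrices shown in this tutorial; the tuned logit model is omitted because its confusion matrix is not shown.

```julia
# (false negatives, true positives) read from the confusion matrices above
results = [
    "logit (untuned)" => (fn = 5, tp = 0),
    "SVM (untuned)"   => (fn = 4, tp = 1),
    "SVM (tuned)"     => (fn = 3, tp = 2),
    "neural network"  => (fn = 2, tp = 3),
]

# recall = true positives / (true positives + false negatives)
recalls = [name => r.tp / (r.tp + r.fn) for (name, r) in results]

foreach(p -> println(rpad(p.first, 16), " recall = ", p.second), recalls)
```

With only 5 frauds in the reduced test set, neighbouring models differ by a single observation, so these comparisons should be read with caution.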
In the original notebook the train-test-validation split was not stratified.
The original raw Flux model has been replaced with an MLJFlux model, for a common interface.
MLJ's BalancedModel wrapper has been used to correct for class imbalance in the MLJFlux model, using the SMOTE algorithm. Originally, naive oversampling was applied in a separate pre-processing step.
In tuning, the metric used for the objective function is always misclassification_rate, for consistency.