Credit Card Fraud

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Data Preparation
Estimation of models
Editorial notes


Classification of fraudulent and non-fraudulent credit card transactions (imbalanced data), by Kristian Bjarnason. The original script can be found here

Editor's note. To reduce training times, we have reduced the original number of data observations. To reinstate the full dataset (about 285k observations), change reduction=0.05 to reduction=1. The data is highly imbalanced, and this is ignored when training some models. Some other changes to Bjarnason's original notebook are noted at the end.

using Dates, Statistics, LinearAlgebra, Random # standard libraries
using MLJ, Plots, DataFrames, UrlDownload
using CSV # needed for `urldownload` to work
import StatsBase # needed for `countmap`

Adjusting fontsize in plotting:

Plots.scalefontsizes(0.85)

Divide the sample into two equal sub-samples. Keep the proportion of frauds the same in each sub-sample (246 frauds in each). Use one sub-sample to estimate (train) your models and the second one to evaluate the out-of-sample performance of each model. Note that the code below departs from this instruction: it draws a stratified 80/20 train/test split from a reduced sample.

Importing the data:

table = urldownload(
"https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv",
);
data = DataFrame(table)
first(data, 4)

4×31 DataFrame
 Row │ Time     V1         V2          V3       V4         V5          V6          V7         V8         V9         V10         V11        V12         V13        V14        V15        V16        V17        V18         V19        V20         V21         V22         V23        V24         V25        V26        V27         V28         Amount   Class
     │ Float64  Float64    Float64     Float64  Float64    Float64     Float64     Float64    Float64    Float64    Float64     Float64    Float64     Float64    Float64    Float64    Float64    Float64    Float64     Float64    Float64     Float64     Float64     Float64    Float64     Float64    Float64    Float64     Float64     Float64  Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     0.0  -1.35981   -0.0727812  2.53635   1.37816   -0.338321    0.462388    0.239599  0.0986979   0.363787   0.0907942  -0.5516    -0.617801   -0.99139   -0.311169   1.46818   -0.470401   0.207971   0.0257906   0.403993   0.251412   -0.0183068   0.277838   -0.110474   0.0669281   0.128539  -0.189115   0.133558   -0.0210531   149.62      0
   2 │     0.0   1.19186    0.266151   0.16648   0.448154   0.0600176  -0.0823608  -0.078803  0.0851017  -0.255425  -0.166974    1.61273    1.06524     0.489095  -0.143772   0.635558   0.463917  -0.114805  -0.183361   -0.145783  -0.0690831  -0.225775   -0.638672    0.101288  -0.339846    0.16717    0.125895  -0.0089831   0.0147242     2.69      0
   3 │     1.0  -1.35835   -1.34016    1.77321   0.37978   -0.503198    1.8005      0.791461  0.247676   -1.51465    0.207643    0.624501   0.0660837   0.717293  -0.165946   2.34586   -2.89008    1.10997   -0.121359   -2.26186    0.52498     0.247998    0.771679    0.909412  -0.689281   -0.327642  -0.139097  -0.0553528  -0.0597518   378.66      0
   4 │     1.0  -0.966272  -0.185226   1.79299  -0.863291  -0.0103089   1.2472      0.237609  0.377436   -1.38702   -0.0549519  -0.226487   0.178228    0.507757  -0.287924  -0.631418  -1.05965   -0.684093   1.96578    -1.23262   -0.208038   -0.1083      0.0052736  -0.190321  -1.17558     0.647376  -0.221929   0.0627228   0.0614576   123.5       0

Inspecting the scientific types of variables contained in the DataFrame:

schema(data)

┌────────┬────────────┬─────────┐
│ names  │ scitypes   │ types   │
├────────┼────────────┼─────────┤
│ Time   │ Continuous │ Float64 │
│ V1     │ Continuous │ Float64 │
│ V2     │ Continuous │ Float64 │
│ V3     │ Continuous │ Float64 │
│ V4     │ Continuous │ Float64 │
│ V5     │ Continuous │ Float64 │
│ V6     │ Continuous │ Float64 │
│ V7     │ Continuous │ Float64 │
│ V8     │ Continuous │ Float64 │
│ V9     │ Continuous │ Float64 │
│ V10    │ Continuous │ Float64 │
│ V11    │ Continuous │ Float64 │
│ V12    │ Continuous │ Float64 │
│ V13    │ Continuous │ Float64 │
│ V14    │ Continuous │ Float64 │
│ V15    │ Continuous │ Float64 │
│ V16    │ Continuous │ Float64 │
│ V17    │ Continuous │ Float64 │
│ V18    │ Continuous │ Float64 │
│ V19    │ Continuous │ Float64 │
│ V20    │ Continuous │ Float64 │
│ V21    │ Continuous │ Float64 │
│ V22    │ Continuous │ Float64 │
│ V23    │ Continuous │ Float64 │
│ V24    │ Continuous │ Float64 │
│ V25    │ Continuous │ Float64 │
│ V26    │ Continuous │ Float64 │
│ V27    │ Continuous │ Float64 │
│ V28    │ Continuous │ Float64 │
│ Amount │ Continuous │ Float64 │
│ Class  │ Count      │ Int64   │
└────────┴────────────┴─────────┘

The Time column is not relevant to our analysis, so we drop it:

select!(data, Not(:Time));

The target variable, Class, should not be interpreted by our algorithms as a Count variable. We'll view it as an ordered factor (i.e., binary data with an intrinsic positive class, corresponding here to 1, the second in the lexicographic ordering).

coerce!(data, :Class => OrderedFactor);

We can check by calling schema again, or like this:

scitype(data.Class)

AbstractVector{OrderedFactor{2}} (alias for AbstractArray{ScientificTypesBase.OrderedFactor{2}, 1})

levels(data.Class) # second element is `positive` class

2-element Vector{Int64}:
 0
 1

Let's get a summary of the remaining data.

describe(data)

30×7 DataFrame
 Row │ variable  mean          min       median       max      nmissing  eltype
     │ Symbol    Union…        Any       Union…       Any      Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ V1        1.17516e-15   -56.4075  0.0181088    2.45493         0  Float64
   2 │ V2        3.38497e-16   -72.7157  0.0654856    22.0577         0  Float64
   3 │ V3        -1.43063e-15  -48.3256  0.179846     9.38256         0  Float64
   4 │ V4        2.09485e-15   -5.68317  -0.0198465   16.8753         0  Float64
   5 │ V5        1.02188e-15   -113.743  -0.0543358   34.8017         0  Float64
   6 │ V6        1.4945e-15    -26.1605  -0.274187    73.3016         0  Float64
   7 │ V7        -5.74807e-16  -43.5572  0.0401031    120.589         0  Float64
   8 │ V8        1.21348e-16   -73.2167  0.022358     20.0072         0  Float64
   9 │ V9        -2.42058e-15  -13.4341  -0.0514287   15.595          0  Float64
  10 │ V10       2.23536e-15   -24.5883  -0.0929174   23.7451         0  Float64
  11 │ V11       1.69887e-15   -4.79747  -0.0327574   12.0189         0  Float64
  12 │ V12       -1.21987e-15  -18.6837  0.140033     7.84839         0  Float64
  13 │ V13       8.36663e-16   -5.79188  -0.0135681   7.12688         0  Float64
  14 │ V14       1.21348e-15   -19.2143  0.0506013    10.5268         0  Float64
  15 │ V15       4.87947e-15   -4.49894  0.0480715    8.87774         0  Float64
  16 │ V16       1.43542e-15   -14.1299  0.0664133    17.3151         0  Float64
  17 │ V17       -3.73625e-16  -25.1628  -0.0656758   9.25353         0  Float64
  18 │ V18       9.70785e-16   -9.49875  -0.00363631  5.04107         0  Float64
  19 │ V19       1.03785e-15   -7.21353  0.00373482   5.59197         0  Float64
  20 │ V20       6.38674e-16   -54.4977  -0.0624811   39.4209         0  Float64
  21 │ V21       1.62862e-16   -34.8304  -0.0294502   27.2028         0  Float64
  22 │ V22       -3.44884e-16  -10.9331  0.00678194   10.5031         0  Float64
  23 │ V23       2.61857e-16   -44.8077  -0.0111929   22.5284         0  Float64
  24 │ V24       4.47391e-15   -2.83663  0.0409761    4.58455         0  Float64
  25 │ V25       5.1094e-16    -10.2954  0.0165935    7.51959         0  Float64
  26 │ V26       1.6845e-15    -2.60455  -0.0521391   3.51735         0  Float64
  27 │ V27       -3.6634e-16   -22.5657  0.00134215   31.6122         0  Float64
  28 │ V28       -1.22146e-16  -15.4301  0.0112438    33.8478         0  Float64
  29 │ Amount    88.3496       0.0       22.0         25691.2         0  Float64
  30 │ Class                   0                      1               0  CategoricalValue{Int64, UInt32}

Note that the Amount variable spans a wide range of values. To reduce variation in the data, we take logs. Since some values are 0, we first add 1e-6 to each value. We transform in place using '!':

data[!,:Amount] = log.(data[!,:Amount] .+ 1e-6);
histogram(data.Amount)
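To see what the offset does, here is a minimal sketch that applies the same transformation to the minimum, median and maximum of Amount reported in the summary above. It uses only base Julia, and the variable names are illustrative.

```julia
# Minimum, median and maximum of Amount, taken from the `describe(data)` output
amounts = [0.0, 22.0, 25691.2]

# Same transformation as in the tutorial: shift by 1e-6, then take logs
logged = log.(amounts .+ 1e-6)

# Zero amounts map to log(1e-6), roughly -13.8, well below the other values
println(round.(logged, digits=2))
```

Because the offset is tiny, zero amounts end up far from the rest of the distribution. A larger offset, such as 1, would pull them closer, though that is not what this tutorial uses.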

Next we unpack the dataframe, creating a separate frame X for the input features (predictors) and a vector y for the target variable. Because of class imbalance, we make the partition stratified, and we also discard some observations to reduce training times. Change the next line to reduction = 1 to keep all the data:

reduction = 0.05
frac_train = 0.8*reduction
frac_test = 0.2*reduction

y, X = unpack(data, ==(:Class))
(Xtrain, Xtest, _), (ytrain, ytest, _) =
    partition((X, y), frac_train, frac_test; stratify=y, multi=true, rng=111);

StatsBase.countmap(ytrain)

Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Int64} with 2 entries:
  0 => 11373
  1 => 20

StatsBase.countmap(ytest)

Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Int64} with 2 entries:
  0 => 2843
  1 => 5
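The idea behind a stratified partition can be sketched in a few lines of base Julia. This is not MLJ's implementation of partition; stratified_split is a hypothetical helper that shuffles the indices of each class separately and takes the requested fraction from each, so that both subsets keep roughly the original class proportions.

```julia
using Random

# Hypothetical helper: split indices into train/test, class by class
function stratified_split(y, frac_train, frac_test; rng=Random.default_rng())
    train, test = Int[], Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))   # indices of this class, shuffled
        ntrain = round(Int, frac_train * length(idx))
        ntest = round(Int, frac_test * length(idx))
        append!(train, idx[1:ntrain])
        append!(test, idx[ntrain+1:ntrain+ntest])
    end
    return train, test
end

# Made-up target with the same class counts as the full dataset
y_demo = vcat(zeros(Int, 284315), ones(Int, 492))
train, test = stratified_split(y_demo, 0.8 * 0.05, 0.2 * 0.05; rng=Xoshiro(111))

println((count(==(1), y_demo[train]), count(==(1), y_demo[test])))
```

With these fractions the sketch selects 20 frauds for training and 5 for testing, the same counts as in the output above, although the particular observations chosen will differ from those picked by MLJ.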


We will estimate three different models:

logit
support vector machines
neural network.

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels
model_logit = LogisticClassifier(lambda=1.0)
mach = machine(model_logit, Xtrain, ytrain) |> fit!

import MLJLinearModels ✔
trained Machine; caches model-specific representations of data
  model: LogisticClassifier(lambda = 1.0, …)
  args: 
    1:	Source @030 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @408 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}

Predictions

LogisticClassifier is a probabilistic predictor, i.e. for each observation in the sample it attaches a probability to each of the possible values of the target. To recover a deterministic output, we use predict_mode instead of predict:

yhat_logit = predict_mode(mach, Xtest);
first(yhat_logit, 4)

# How does this model perform?

confusion_matrix(yhat_logit, ytest)

          ┌─────────────┐
          │Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│  0   │  1   │
├─────────┼──────┼──────┤
│    0    │ 2843 │  5   │
├─────────┼──────┼──────┤
│    1    │  0   │  0   │
└─────────┴──────┴──────┘

To plot a receiver operating characteristic (ROC) curve, we need the probabilistic predictions:

yhat = predict(mach, Xtest);
yhat[1:3]

3-element CategoricalDistributions.UnivariateFiniteVector{ScientificTypesBase.OrderedFactor{2}, Int64, UInt32, Float64}:
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.998, 1=>0.00178)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.998, 1=>0.00174)
 UnivariateFinite{ScientificTypesBase.OrderedFactor{2}}(0=>0.998, 1=>0.00185)

false_positive_rates, true_positive_rates, thresholds =
    roc_curve(yhat, ytest)
plot(false_positive_rates, true_positive_rates)
plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black, label=:none)
xlabel!("false positive rate")
ylabel!("true positive rate")
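For readers who want to see how the points on such a curve arise, here is a minimal sketch in base Julia. It is not MLJ's roc_curve; roc_points is a hypothetical helper, and the scores and labels are made up. Each distinct score is used as a threshold, and the false and true positive rates are computed for the rule "predict positive if the score is at least the threshold".

```julia
# Hypothetical helper: ROC points from positive-class scores and boolean labels
function roc_points(scores, is_positive)
    negatives = .!is_positive
    npos = count(is_positive)
    nneg = count(negatives)
    thresholds = sort(unique(scores); rev=true)
    fpr = [count((scores .>= t) .& negatives) / nneg for t in thresholds]
    tpr = [count((scores .>= t) .& is_positive) / npos for t in thresholds]
    return fpr, tpr, thresholds
end

scores = [0.9, 0.8, 0.35, 0.1]        # made-up probabilities of the positive class
labels = [true, false, true, false]   # made-up ground truth

fpr, tpr, thresholds = roc_points(scores, labels)
println(collect(zip(fpr, tpr)))
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the curve.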

misclassification_rate(yhat_logit, ytest)

0.0017556179775280898

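The misclassification rate looks low, but the confusion matrix shows that the model predicts no frauds at all: with data this imbalanced, the rate is dominated by the majority class. Let's see if we can do better with a little tuning.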
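As a sanity check, the misclassification rate can be recomputed from the confusion matrix counts and compared with a trivial rule that always predicts class 0. The sketch below uses only base Julia; the variable names are illustrative.

```julia
# Counts copied from the confusion matrix above
tn, fn = 2843, 5    # predicted 0: ground truth 0 and ground truth 1
fp, tp = 0, 0       # predicted 1: ground truth 0 and ground truth 1

total = tn + fn + fp + tp
misclassification = (fn + fp) / total

# A rule that always predicts class 0 misclassifies every true fraud
baseline = (fn + tp) / total

println((misclassification, baseline))
```

The two numbers coincide here, so on this test set the untuned logit model does no better than always predicting "not fraud".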


We continue with LogisticClassifier, but now add hyperparameter tuning.

r = range(model_logit, :lambda, lower=1e-6, upper=100, scale=:log)

self_tuning_logit_model = TunedModel(
    model_logit,
    tuning = Grid(resolution=10),
    resampling = CV(nfolds=3),
    range = r,
    measure = misclassification_rate,
)

mach = machine(self_tuning_logit_model, Xtrain, ytrain) |> fit!

trained Machine; does not cache data
  model: ProbabilisticTunedModel(model = LogisticClassifier(lambda = 1.0, …), …)
  args: 
    1:	Source @299 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @426 ⏎ AbstractVector{ScientificTypesBase.OrderedFactor{2}}
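To get a feel for the search space, the following sketch lists ten values of lambda spaced evenly on a log scale between the bounds of the range above. This is an illustration in base Julia of what a log-scaled grid looks like, not a reproduction of MLJ's Grid internals, whose exact candidate values may differ.

```julia
lower, upper, resolution = 1e-6, 100, 10

# Evenly spaced exponents between log10(lower) and log10(upper)
exponents = range(log10(lower), log10(upper); length=resolution)
candidates = exp10.(exponents)

println(candidates)
```

On a log scale each candidate is a constant multiple of the previous one, which suits a regularization strength that may plausibly vary over several orders of magnitude.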

Predictions

yhat_logit_tuned = predict_mode(mach, Xtest);

Let's take a look at the misclassification_rate and compare it with the one calculated for the untuned model.

@show misclassification_rate(yhat_logit_tuned, ytest)

misclassification_rate(yhat_logit_tuned, ytest) = 0.001053370786516854

This is lower than the rate for the untuned model (0.00176), although the gap corresponds to just two test observations and may not be statistically significant.


Initial SVM classification with cost = 1.0:

SVC = @load SVC pkg=LIBSVM

import MLJLIBSVMInterface ✔
MLJLIBSVMInterface.SVC

To fit the SVM, we declare a pipeline which comprises both a standardizer and the model. Training is substantially longer than for the preceding linear model (over 10 minutes):

model_svm = Standardizer() |> SVC()
mach = machine(model_svm, Xtrain, ytrain) |> fit!
yhat_svm = predict(mach, Xtest)
confusion_matrix(yhat_svm, ytest)

          ┌─────────────┐
          │Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│  0   │  1   │
├─────────┼──────┼──────┤
│    0    │ 2843 │  4   │
├─────────┼──────┼──────┤
│    1    │  0   │  1   │
└─────────┴──────┴──────┘

@show misclassification_rate(yhat_svm, ytest)

misclassification_rate(yhat_svm, ytest) = 0.0014044943820224719
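As an aside on the Standardizer step in the pipeline above: standardization rescales a feature to zero mean and unit standard deviation, which matters for distance-based methods such as an SVM with a radial basis kernel. A minimal sketch on a made-up feature vector, using only the Statistics standard library (this illustrates the transformation itself, not MLJ's Standardizer code):

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up feature values

# Subtract the mean and divide by the standard deviation
z = (x .- mean(x)) ./ std(x)

println((mean(z), std(z)))
```

In a pipeline, the mean and standard deviation are learned from the training data and then reused to transform the test data.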

Tuned SVM

r = range(model_svm, :(svc.cost), lower=0.1, upper=3.5, scale=:linear)
self_tuning_svm_model = TunedModel(
    model_svm,
    resampling = CV(nfolds=3),
    tuning = Grid(resolution=6),
    range = r,
    measure = misclassification_rate,
)
mach = machine(self_tuning_svm_model, Xtrain, ytrain) |> fit!

fitted_params(mach).best_model

DeterministicPipeline(
  standardizer = Standardizer(
        features = Symbol[], 
        ignore = false, 
        ordered_factor = false, 
        count = false), 
  svc = SVC(
        kernel = LIBSVM.Kernel.RadialBasis, 
        gamma = 0.0, 
        cost = 3.5, 
        cachesize = 200.0, 
        degree = 3, 
        coef0 = 0.0, 
        tolerance = 0.001, 
        shrinking = true), 
  cache = true)

plot(mach)

yhat_svm_tuned = predict(mach, Xtest)
confusion_matrix(yhat_svm_tuned, ytest)

          ┌─────────────┐
          │Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│  0   │  1   │
├─────────┼──────┼──────┤
│    0    │ 2843 │  3   │
├─────────┼──────┼──────┤
│    1    │  0   │  2   │
└─────────┴──────┴──────┘

misclassification_rate(yhat_svm_tuned, ytest)

0.001053370786516854


NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

import MLJFlux ✔
MLJFlux.NeuralNetworkClassifier

We assume familiarity with the building blocks of Flux models. In MLJFlux, a builder is essentially a rule for creating a Flux chain, once the data has been inspected for size. See the MLJFlux documentation for further details. We do not specify the softmax "finaliser" because MLJFlux classifiers add that under the hood.

import MLJFlux.@builder
using Flux

builder = @builder Chain(
    Dense(n_in, 16, relu),
    Dropout(0.1; rng=rng),
    Dense(16, n_out),
)

GenericBuilder(apply = #1)

In the @builder macro call, n_in, n_out, and rng are replaced with the actual number of input features found in the data, the actual number of output classes, and the rng specified in the model hyperparameters (see below).
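To make the shapes concrete, here is a sketch in base Julia of the forward pass that such a chain computes for one observation, assuming n_in = 29 (V1 to V28 plus Amount) and n_out = 2. It does not use Flux: relu and softmax are defined by hand, the weights are random stand-ins for trained parameters, and dropout is left out because it is only active during training.

```julia
using Random

relu(x) = max(zero(x), x)

function softmax(v)
    e = exp.(v .- maximum(v))   # subtract the maximum for numerical stability
    return e ./ sum(e)
end

rng = Xoshiro(123)
n_in, n_out = 29, 2             # assumed: 29 input features, 2 classes

# Randomly initialised weights stand in for trained parameters
W1, b1 = randn(rng, 16, n_in), zeros(16)
W2, b2 = randn(rng, n_out, 16), zeros(n_out)

x = randn(rng, n_in)                   # one made-up observation
hidden = relu.(W1 * x .+ b1)           # plays the role of Dense(n_in, 16, relu)
probs = softmax(W2 * hidden .+ b2)     # Dense(16, n_out) followed by softmax

println(probs)
```

The output is a vector of two non-negative numbers summing to one, which is the kind of per-class probability that a probabilistic classifier returns.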

We are now ready to specify the MLJFlux model. If you are running on a GPU, you can try adding the option acceleration=CUDALibs().

rng = Xoshiro(123)
model = NeuralNetworkClassifier(
    ; builder,
    loss=(yhat, y)->Flux.tversky_loss(yhat, y, β=0.9), # combines precision and recall
    batch_size = round(Int, reduction*2048),
    epochs=50,
    rng,
)

NeuralNetworkClassifier(
  builder = GenericBuilder(
        apply = var"#1#2"()), 
  finaliser = NNlib.softmax, 
  optimiser = Flux.Optimise.Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}()), 
  loss = var"#3#4"(), 
  epochs = 50, 
  batch_size = 102, 
  lambda = 0.0, 
  alpha = 0.0, 
  rng = Random.Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80), 
  optimiser_changes_trigger_retraining = false, 
  acceleration = ComputationalResources.CPU1{Nothing}(nothing))

Although we have not paid attention to it so far (and probably should have), there is substantial class imbalance in our target:

StatsBase.countmap(y)

Dict{CategoricalArrays.CategoricalValue{Int64, UInt32}, Int64} with 2 entries:
  0 => 284315
  1 => 492

We will address this by wrapping our model in a SMOTE oversampling strategy, using MLJ's BalancedModel wrapper. Here are the options for oversampling:

models("oversampler")

7-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :constructor, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = BorderlineSMOTE1, package_name = Imbalance, ... )
 (name = ROSE, package_name = Imbalance, ... )
 (name = RandomOversampler, package_name = Imbalance, ... )
 (name = RandomWalkOversampler, package_name = Imbalance, ... )
 (name = SMOTE, package_name = Imbalance, ... )
 (name = SMOTEN, package_name = Imbalance, ... )
 (name = SMOTENC, package_name = Imbalance, ... )

We'll use SMOTE:

SMOTE = @load SMOTE pkg=Imbalance
balanced_model = BalancedModel(model, oversampler=SMOTE())

import Imbalance ✔
BalancedModelProbabilistic(
  model = NeuralNetworkClassifier(
        builder = GenericBuilder(apply = #1), 
        finaliser = NNlib.softmax, 
        optimiser = Flux.Optimise.Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}()), 
        loss = var"#3#4"(), 
        epochs = 50, 
        batch_size = 102, 
        lambda = 0.0, 
        alpha = 0.0, 
        rng = Random.Xoshiro(0xfefa8d41b8f5dca5, 0xf80cc98e147960c1, 0x20e2ccc17662fc1d, 0xea7a7dcb2e787c01, 0xf4e85a418b9c4f80), 
        optimiser_changes_trigger_retraining = false, 
        acceleration = ComputationalResources.CPU1{Nothing}(nothing)), 
  oversampler = SMOTE(
        k = 5, 
        ratios = 1.0, 
        rng = Random.TaskLocalRNG(), 
        try_preserve_type = true))
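The core idea of SMOTE is to create synthetic minority observations by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea in base Julia on made-up data; synthetic_point is a hypothetical helper and is not the implementation in Imbalance.jl.

```julia
using Random, LinearAlgebra

# Hypothetical helper: one synthetic point near column `i` of `minority`
function synthetic_point(rng, minority::Matrix{Float64}, i::Int; k::Int=5)
    x = minority[:, i]
    dists = [norm(minority[:, j] - x) for j in axes(minority, 2)]
    neighbours = sortperm(dists)[2:min(k + 1, end)]   # skip the point itself
    nb = minority[:, rand(rng, neighbours)]           # pick one neighbour at random
    return x + rand(rng) * (nb - x)                   # random point on the segment
end

# Three made-up minority observations, stored as columns
minority = [0.0 1.0 0.0;
            0.0 0.0 1.0]

s = synthetic_point(Xoshiro(1), minority, 1; k=2)
println(s)
```

Repeating this until the classes are balanced (ratios = 1.0 in the output above) gives the oversampled training set on which the wrapped model is trained.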

Our final model adds standardization as a pre-processor:

model_nn = Standardizer() |> balanced_model

ProbabilisticPipeline(
  standardizer = Standardizer(
        features = Symbol[], 
        ignore = false, 
        ordered_factor = false, 
        count = false), 
  balanced_model_probabilistic = BalancedModelProbabilistic(
        model = NeuralNetworkClassifier(builder = GenericBuilder(apply = #1), …), 
        oversampler = SMOTE(k = 5, …)), 
  cache = true)

mach = machine(model_nn, Xtrain, ytrain) |> fit!
yhat_nn = predict_mode(mach, Xtest);
confusion_matrix(yhat_nn, ytest)

          ┌─────────────┐
          │Ground Truth │
┌─────────┼──────┬──────┤
│Predicted│  0   │  1   │
├─────────┼──────┼──────┤
│    0    │ 2843 │  2   │
├─────────┼──────┼──────┤
│    1    │  0   │  3   │
└─────────┴──────┴──────┘

misclassification_rate(yhat_nn, ytest)

0.0007022471910112359
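Since the misclassification rate is dominated by the majority class, it can help to compare the models on recall, the share of true frauds that are detected. The sketch below recomputes recall from the confusion matrices shown above (the tuned logit model is omitted because no confusion matrix was printed for it). It uses only base Julia, and the names are illustrative.

```julia
# False negatives and true positives read off the confusion matrices above
results = [
    "logit (untuned)" => (fn=5, tp=0),
    "SVM (untuned)"   => (fn=4, tp=1),
    "SVM (tuned)"     => (fn=3, tp=2),
    "neural network"  => (fn=2, tp=3),
]

# Recall: detected frauds divided by all true frauds
recall(c) = c.tp / (c.tp + c.fn)

for (name, c) in results
    println(rpad(name, 16), " recall = ", recall(c))
end

recalls = [recall(c) for (_, c) in results]
```

With only 5 frauds in the test set, each detected fraud moves recall by 0.2, so these comparisons are indicative at best.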
