Breast Cancer Wisconsin (Diagnostic)

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Introduction

This tutorial covers the concepts of iterative model selection on the popular "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI archives. The tutorial also covers basic data preprocessing and usage of MLJ Scientific Types.

Loading the relevant packages

For a guide to package installation in Julia, please refer to this link, taken directly from the Julia v1 documentation.

using UrlDownload
using DataFrames
using PrettyPrinting
using PyPlot
using MLJ

Initializing a global random seed, which we'll use throughout the code to keep our results reproducible.

RANDOM_SEED = 42;
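
MLJ functions that accept an rng keyword take either an integer seed, like the one above, or an AbstractRNG object. If you prefer an explicit RNG, here is a minimal equivalent sketch using the Random standard library:

using Random
rng = MersenneTwister(RANDOM_SEED)  # can be passed wherever rng=RANDOM_SEED appears below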

Downloading and loading the data

Using the UrlDownload.jl package, we can fetch the data from the link below with the following commands.

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data";
feature_names = ["ID", "Class", "mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness", "mean compactness", "mean concavity", "mean concave points", "mean symmetry", "mean fractal dimension", "radius error", "texture error", "perimeter error", "area error", "smoothness error", "compactness error", "concavity error", "concave points error", "symmetry error", "fractal dimension error", "worst radius", "worst texture", "worst perimeter", "worst area", "worst smoothness", "worst compactness", "worst concavity", "worst concave points", "worst symmetry", "worst fractal dimension"]
data = urldownload(url, true, format = :CSV, header = feature_names);
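
If you'd rather avoid the extra dependency, here is a roughly equivalent sketch using the Downloads standard library together with CSV.jl (assuming CSV.jl is installed; the data file has no header row, so we supply our own column names):

using Downloads, CSV
local_path = Downloads.download(url)                           # fetch the raw file to a temporary path
data = CSV.read(local_path, DataFrame; header=feature_names);  # parse it with our column names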

Exploring the obtained data

Inspecting the class variable

figure(figsize=(8, 6))
hist(data.Class)
xlabel("Classes")
ylabel("Number of samples")
[Figure: Distribution of target classes]
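
To get exact class counts rather than reading them off the histogram, here is a quick check (assuming data.Class holds the raw "B"/"M" labels, as downloaded above):

println("Malignant (M): ", count(==("M"), data.Class))
println("Benign (B):    ", count(==("B"), data.Class))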

Inspecting the feature set

df = DataFrame(data)[:, 2:end];  # drop the ID column, which carries no predictive information

Printing the first 10 rows to get a visual idea of the kind of data we're dealing with.

pprint(first(df, 10))
10×31 DataFrame
 Row │ Class    mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension  radius error  texture error  perimeter error  area error  smoothness error  compactness error  concavity error  concave points error  symmetry error  fractal dimension error  worst radius  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
     │ String1  Float64      Float64       Float64         Float64    Float64          Float64           Float64         Float64              Float64        Float64                 Float64       Float64        Float64          Float64     Float64           Float64            Float64          Float64               Float64         Float64                  Float64       Float64        Float64          Float64     Float64           Float64            Float64          Float64               Float64         Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ M              17.99         10.38          122.8      1001.0          0.1184            0.2776          0.3001               0.1471          0.2419                 0.07871        1.095          0.9053            8.589      153.4           0.006399            0.04904          0.05373               0.01587         0.03003                 0.006193         25.38          17.33           184.6       2019.0            0.1622             0.6656           0.7119                0.2654          0.4601                  0.1189
   2 │ M              20.57         17.77          132.9      1326.0          0.08474           0.07864         0.0869               0.07017         0.1812                 0.05667        0.5435         0.7339            3.398       74.08          0.005225            0.01308          0.0186                0.0134          0.01389                 0.003532         24.99          23.41           158.8       1956.0            0.1238             0.1866           0.2416                0.186           0.275                   0.08902
   3 │ M              19.69         21.25          130.0      1203.0          0.1096            0.1599          0.1974               0.1279          0.2069                 0.05999        0.7456         0.7869            4.585       94.03          0.00615             0.04006          0.03832               0.02058         0.0225                  0.004571         23.57          25.53           152.5       1709.0            0.1444             0.4245           0.4504                0.243           0.3613                  0.08758
   4 │ M              11.42         20.38           77.58      386.1          0.1425            0.2839          0.2414               0.1052          0.2597                 0.09744        0.4956         1.156             3.445       27.23          0.00911             0.07458          0.05661               0.01867         0.05963                 0.009208         14.91          26.5             98.87       567.7            0.2098             0.8663           0.6869                0.2575          0.6638                  0.173
   5 │ M              20.29         14.34          135.1      1297.0          0.1003            0.1328          0.198                0.1043          0.1809                 0.05883        0.7572         0.7813            5.438       94.44          0.01149             0.02461          0.05688               0.01885         0.01756                 0.005115         22.54          16.67           152.2       1575.0            0.1374             0.205            0.4                   0.1625          0.2364                  0.07678
   6 │ M              12.45         15.7            82.57      477.1          0.1278            0.17            0.1578               0.08089         0.2087                 0.07613        0.3345         0.8902            2.217       27.19          0.00751             0.03345          0.03672               0.01137         0.02165                 0.005082         15.47          23.75           103.4        741.6            0.1791             0.5249           0.5355                0.1741          0.3985                  0.1244
   7 │ M              18.25         19.98          119.6      1040.0          0.09463           0.109           0.1127               0.074           0.1794                 0.05742        0.4467         0.7732            3.18        53.91          0.004314            0.01382          0.02254               0.01039         0.01369                 0.002179         22.88          27.66           153.2       1606.0            0.1442             0.2576           0.3784                0.1932          0.3063                  0.08368
   8 │ M              13.71         20.83           90.2       577.9          0.1189            0.1645          0.09366              0.05985         0.2196                 0.07451        0.5835         1.377             3.856       50.96          0.008805            0.03029          0.02488               0.01448         0.01486                 0.005412         17.06          28.14           110.6        897.0            0.1654             0.3682           0.2678                0.1556          0.3196                  0.1151
   9 │ M              13.0          21.82           87.5       519.8          0.1273            0.1932          0.1859               0.09353         0.235                  0.07389        0.3063         1.002             2.406       24.32          0.005731            0.03502          0.03553               0.01226         0.02143                 0.003749         15.49          30.73           106.2        739.3            0.1703             0.5401           0.539                 0.206           0.4378                  0.1072
  10 │ M              12.46         24.04           83.97      475.9          0.1186            0.2396          0.2273               0.08543         0.203                  0.08243        0.2976         1.599             2.039       23.94          0.007149            0.07217          0.07743               0.01432         0.01789                 0.01008          15.09          40.68            97.65       711.4            0.1853             1.058            1.105                 0.221           0.4366                  0.2075

To check the statistical attributes of each individual feature, we can use the describe() method.

pprint(describe(df))
31×7 DataFrame
 Row │ variable                 mean        min        median    max      nmissing  eltype
     │ Symbol                   Union…      Any        Union…    Any      Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────────────────
   1 │ Class                                B                    M               0  String1
   2 │ mean radius              14.1273     6.981      13.37     28.11           0  Float64
   3 │ mean texture             19.2896     9.71       18.84     39.28           0  Float64
   4 │ mean perimeter           91.969      43.79      86.24     188.5           0  Float64
   5 │ mean area                654.889     143.5      551.1     2501.0          0  Float64
   6 │ mean smoothness          0.0963603   0.05263    0.09587   0.1634          0  Float64
   7 │ mean compactness         0.104341    0.01938    0.09263   0.3454          0  Float64
   8 │ mean concavity           0.0887993   0.0        0.06154   0.4268          0  Float64
   9 │ mean concave points      0.0489191   0.0        0.0335    0.2012          0  Float64
  10 │ mean symmetry            0.181162    0.106      0.1792    0.304           0  Float64
  11 │ mean fractal dimension   0.0627976   0.04996    0.06154   0.09744         0  Float64
  12 │ radius error             0.405172    0.1115     0.3242    2.873           0  Float64
  13 │ texture error            1.21685     0.3602     1.108     4.885           0  Float64
  14 │ perimeter error          2.86606     0.757      2.287     21.98           0  Float64
  15 │ area error               40.3371     6.802      24.53     542.2           0  Float64
  16 │ smoothness error         0.00704098  0.001713   0.00638   0.03113         0  Float64
  17 │ compactness error        0.0254781   0.002252   0.02045   0.1354          0  Float64
  18 │ concavity error          0.0318937   0.0        0.02589   0.396           0  Float64
  19 │ concave points error     0.0117961   0.0        0.01093   0.05279         0  Float64
  20 │ symmetry error           0.0205423   0.007882   0.01873   0.07895         0  Float64
  21 │ fractal dimension error  0.0037949   0.0008948  0.003187  0.02984         0  Float64
  22 │ worst radius             16.2692     7.93       14.97     36.04           0  Float64
  23 │ worst texture            25.6772     12.02      25.41     49.54           0  Float64
  24 │ worst perimeter          107.261     50.41      97.66     251.2           0  Float64
  25 │ worst area               880.583     185.2      686.5     4254.0          0  Float64
  26 │ worst smoothness         0.132369    0.07117    0.1313    0.2226          0  Float64
  27 │ worst compactness        0.254265    0.02729    0.2119    1.058           0  Float64
  28 │ worst concavity          0.272188    0.0        0.2267    1.252           0  Float64
  29 │ worst concave points     0.114606    0.0        0.09993   0.291           0  Float64
  30 │ worst symmetry           0.290076    0.1565     0.2822    0.6638          0  Float64
  31 │ worst fractal dimension  0.0839458   0.05504    0.08004   0.2075          0  Float64

As we can see, the features have widely different ranges and quantiles. This can cause trouble for optimization techniques and might cause convergence issues. We can use a feature scaling technique like Standardizer() to handle this.

But first, let's handle the scientific types of all the features. We can use the schema() method from the MLJ.jl package to do this.

pprint(schema(df))
ScientificTypes.Schema{(:Class, Symbol("mean radius"), Symbol("mean texture"), Symbol("mean perimeter"), Symbol("mean area"), Symbol("mean smoothness"), Symbol("mean compactness"), Symbol("mean concavity"), Symbol("mean concave points"), Symbol("mean symmetry"), Symbol("mean fractal dimension"), Symbol("radius error"), Symbol("texture error"), Symbol("perimeter error"), Symbol("area error"), Symbol("smoothness error"), Symbol("compactness error"), Symbol("concavity error"), Symbol("concave points error"), Symbol("symmetry error"), Symbol("fractal dimension error"), Symbol("worst radius"), Symbol("worst texture"), Symbol("worst perimeter"), Symbol("worst area"), Symbol("worst smoothness"), Symbol("worst compactness"), Symbol("worst concavity"), Symbol("worst concave points"), Symbol("worst symmetry"), Symbol("worst fractal dimension")), Tuple{ScientificTypesBase.Textual, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous, ScientificTypesBase.Continuous}, Tuple{InlineStrings.String1, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64, Float64}}(nothing, nothing, nothing)

As the target variable is 'Textual' in nature, we'll have to change it to a more appropriate scientific type. Using the coerce() method, let's change it to an OrderedFactor.

coerce!(df, :Class => OrderedFactor{2});
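
As an optional sanity check, the target column should now have the OrderedFactor scientific type with two levels; a quick sketch:

scitype(df.Class)  # expect AbstractVector{OrderedFactor{2}}
unique(df.Class)   # the two class labels, "B" and "M"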

Unpacking the values

Now that our data is fully processed, we can separate the target variable 'y' from the feature set 'X' using the unpack() method.

y, X = unpack(df, ==(:Class), name -> true, rng=RANDOM_SEED);

Standardizing the feature set

Now that our feature set is separated from the target variable, we can use the Standardizer() workflow to standardize our feature set 'X'.

transformer_instance = Standardizer()                # transformer that rescales each feature to zero mean, unit variance
transformer_model = machine(transformer_instance, X) # bind the transformer to the feature set
fit!(transformer_model)                              # learn each feature's mean and standard deviation
X = MLJ.transform(transformer_model, X);             # apply the learned transformation
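
As a quick sanity check, each standardized column should now have mean approximately 0 and standard deviation approximately 1; a sketch assuming X is still a DataFrame after the transform:

using Statistics
col = X[!, Symbol("mean radius")]
round(mean(col), digits=6), round(std(col), digits=6)  # expect roughly (0.0, 1.0)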

Train-test split

After feature scaling, our data is ready to put into a Machine Learning model for classification! Using 80% of data for training, we can perform a train-test split using the partition() method.

train, test = partition(eachindex(y), 0.8, shuffle=true, rng=RANDOM_SEED);
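
Note that the classes are imbalanced (357 benign vs. 212 malignant), so a purely random split can end up with slightly skewed class proportions. Recent versions of partition() support a stratify keyword that preserves the class ratio in both subsets; a sketch, assuming such a version is installed:

train, test = partition(eachindex(y), 0.8, shuffle=true, rng=RANDOM_SEED, stratify=y);  # keeps the B/M ratio equal across splits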

Model compatibility

Now that we have separate training and testing sets, let's see which models are compatible with our data!

for m in models(matching(X, y))
    println("Model name = ", m.name, ", Prediction type = ", m.prediction_type, ", Package name = ", m.package_name)
end
Model name = AdaBoostClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = AdaBoostStumpClassifier, Prediction type = probabilistic, Package name = DecisionTree
Model name = BaggingClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = BayesianLDA, Prediction type = probabilistic, Package name = MultivariateStats
Model name = BayesianLDA, Prediction type = probabilistic, Package name = ScikitLearn
Model name = BayesianQDA, Prediction type = probabilistic, Package name = ScikitLearn
Model name = BayesianSubspaceLDA, Prediction type = probabilistic, Package name = MultivariateStats
Model name = ConstantClassifier, Prediction type = probabilistic, Package name = MLJModels
Model name = DSADDetector, Prediction type = unknown, Package name = OutlierDetectionNetworks
Model name = DecisionTreeClassifier, Prediction type = probabilistic, Package name = BetaML
Model name = DecisionTreeClassifier, Prediction type = probabilistic, Package name = DecisionTree
Model name = DeterministicConstantClassifier, Prediction type = deterministic, Package name = MLJModels
Model name = DummyClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = ESADDetector, Prediction type = unknown, Package name = OutlierDetectionNetworks
Model name = EvoTreeClassifier, Prediction type = probabilistic, Package name = EvoTrees
Model name = ExtraTreesClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = GaussianNBClassifier, Prediction type = probabilistic, Package name = NaiveBayes
Model name = GaussianNBClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = GaussianProcessClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = GradientBoostingClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = KNNClassifier, Prediction type = probabilistic, Package name = NearestNeighborModels
Model name = KNeighborsClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = KernelPerceptronClassifier, Prediction type = probabilistic, Package name = BetaML
Model name = LDA, Prediction type = probabilistic, Package name = MultivariateStats
Model name = LGBMClassifier, Prediction type = probabilistic, Package name = LightGBM
Model name = LinearBinaryClassifier, Prediction type = probabilistic, Package name = GLM
Model name = LinearSVC, Prediction type = deterministic, Package name = LIBSVM
Model name = LogisticCVClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = LogisticClassifier, Prediction type = probabilistic, Package name = MLJLinearModels
Model name = LogisticClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = MultinomialClassifier, Prediction type = probabilistic, Package name = MLJLinearModels
Model name = NeuralNetworkClassifier, Prediction type = probabilistic, Package name = MLJFlux
Model name = NuSVC, Prediction type = deterministic, Package name = LIBSVM
Model name = PassiveAggressiveClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = PegasosClassifier, Prediction type = probabilistic, Package name = BetaML
Model name = PerceptronClassifier, Prediction type = probabilistic, Package name = BetaML
Model name = PerceptronClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = ProbabilisticSGDClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = RandomForestClassifier, Prediction type = probabilistic, Package name = BetaML
Model name = RandomForestClassifier, Prediction type = probabilistic, Package name = DecisionTree
Model name = RandomForestClassifier, Prediction type = probabilistic, Package name = ScikitLearn
Model name = RidgeCVClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = RidgeClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = SGDClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = SVC, Prediction type = deterministic, Package name = LIBSVM
Model name = SVMClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = SVMLinearClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = SVMNuClassifier, Prediction type = deterministic, Package name = ScikitLearn
Model name = SubspaceLDA, Prediction type = probabilistic, Package name = MultivariateStats
Model name = XGBoostClassifier, Prediction type = probabilistic, Package name = XGBoost
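
To dig into any single entry of this list, MLJ can look up a model's metadata by name; a small sketch, assuming the info lookup by model name is available in your MLJ version:

info("AdaBoostClassifier", pkg="ScikitLearn")  # returns the model's traits (input scitypes, hyperparameters, etc.)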

Analyzing the performance of different models

That's a lot of models for our data! To narrow them down, let's analyze the performance of the probabilistic classifiers from the ScikitLearn package.

Creating various empty vectors for our analysis

  • model_names captures the names of the models being iterated

  • loss_acc captures the values of model accuracy on the test set

  • loss_ce captures the values of the cross-entropy loss on the test set

  • loss_f1 captures the values of the F1-score on the test set

model_names = Vector{String}();
loss_acc = [];
loss_ce = [];
loss_f1 = [];

Collecting data for analysis

figure(figsize=(8, 6))
for m in models(matching(X, y))
    if m.prediction_type == :probabilistic && m.package_name == "ScikitLearn" && m.name != "LogisticCVClassifier"
        #Excluding LogisticCVClassifier, as we can infer similar baseline results from the LogisticClassifier

        #Capturing the model and loading it using the @load utility
        model_name=m.name
        package_name=m.package_name
        eval(:(clf = @load $model_name pkg=$package_name verbosity=1))

        #Fitting the captured model onto the training set
        clf_machine = machine(clf(), X, y)
        fit!(clf_machine, rows=train)

        #Getting the predictions onto the test set
        y_pred = MLJ.predict(clf_machine, rows=test);

        #Plotting the ROC-AUC curve for each model being iterated
        fprs, tprs, thresholds = roc(y_pred, y[test])
        plot(fprs, tprs, label=model_name);

        #Obtaining different evaluation metrics
        ce_loss = mean(cross_entropy(y_pred, y[test]))
        acc = accuracy(mode.(y_pred), y[test])
        f1_score = f1score(mode.(y_pred), y[test])

        #Adding the different obtained values of the evaluation metrics to the respective vectors
        push!(model_names, m.name)
        append!(loss_acc, acc)
        append!(loss_ce, ce_loss)
        append!(loss_f1, f1_score)
    end
end

#Adding labels and legend to the ROC-AUC curve
xlabel("False Positive Rate")
ylabel("True Positive Rate")
legend(loc="best", fontsize="xx-small")
title("ROC curve")
import MLJScikitLearnInterface ✔
(this log line repeats once for each of the 13 ScikitLearn models loaded)
[Figure: ROC curves for the evaluated classifiers]

Analyzing models

Let's collect the data in the form of a DataFrame for a more precise analysis.

model_info = DataFrame(ModelName=model_names, Accuracy=loss_acc, CrossEntropyLoss=loss_ce, F1Score=loss_f1);

Now, let's sort the data on the basis of the cross-entropy loss.

pprint(sort!(model_info,[:CrossEntropyLoss]));
13×4 DataFrame
 Row │ ModelName                   Accuracy  CrossEntropyLoss  F1Score
     │ String                      Any       Any               Any
─────┼──────────────────────────────────────────────────────────────────
   1 │ LogisticClassifier          0.973684  0.13142           0.962025
   2 │ BayesianLDA                 0.95614   0.145701          0.935065
   3 │ ExtraTreesClassifier        0.947368  0.15706           0.923077
   4 │ RandomForestClassifier      0.938596  0.171584          0.911392
   5 │ GradientBoostingClassifier  0.938596  0.236792          0.909091
   6 │ AdaBoostClassifier          0.95614   0.3495            0.936709
   7 │ KNeighborsClassifier        0.938596  0.432447          0.911392
   8 │ BaggingClassifier           0.929825  0.473429          0.9
   9 │ ProbabilisticSGDClassifier  0.938596  0.501896          0.915663
  10 │ GaussianProcessClassifier   0.938596  0.597197          0.909091
  11 │ GaussianNBClassifier        0.903509  0.929409          0.864198
  12 │ BayesianQDA                 0.903509  1.1303            0.857143
  13 │ DummyClassifier             0.552632  16.1248           0.385542

It seems like a simple LogisticClassifier works really well with this dataset!
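
Before settling on this winner, it's worth checking that the result holds up under cross-validation rather than a single train-test split. A minimal sketch, assuming a recent MLJ version that automatically picks the appropriate prediction operation for each measure:

LogisticClf = @load LogisticClassifier pkg=ScikitLearn verbosity=0
evaluate(LogisticClf(), X, y,
         resampling=CV(nfolds=5, shuffle=true, rng=RANDOM_SEED),
         measures=[cross_entropy, accuracy, f1score])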

Conclusion

This tutorial covered iterative model selection on the Breast Cancer Wisconsin (Diagnostic) dataset. We only analyzed the ScikitLearn models to keep the content focused, but the same workflow can be applied to any compatible model in the MLJ family.