Breast Cancer Wisconsin (Diagnostic)
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
This tutorial covers programmatic model selection on the popular "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI archives. The tutorial also covers basic data preprocessing and usage of MLJ Scientific Types.
using UrlDownload
using DataFrames
using MLJ
using StatsBase
using StableRNGs # for an RNG stable across julia versions
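Using the package UrlDownload.jl, we can fetch the data from the given link with the commands below. Because we pass the column names through the header keyword, the resulting table uses the feature names listed here.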
url = "";
feature_names = ["ID", "Class", "mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness", "mean compactness", "mean concavity", "mean concave points", "mean symmetry", "mean fractal dimension", "radius error", "texture error", "perimeter error", "area error", "smoothness error", "compactness error", "concavity error", "concave points error", "symmetry error", "fractal dimension error", "worst radius", "worst texture", "worst perimeter", "worst area", "worst smoothness", "worst compactness", "worst concavity", "worst concave points", "worst symmetry", "worst fractal dimension"]
data = urldownload(url, true, format = :CSV, header = feature_names);
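using Plots
# Bar chart of how many samples fall in each class (countmap comes from StatsBase)
bar(countmap(data.Class), legend=false)
ylabel!("Number of samples")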
df = DataFrame(data)[:, 2:end];
Printing the first 10 rows gives a visual idea of the type of data we're dealing with:
first(df, 10)
10×31 DataFrame
Row │ Class mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
│ String1 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 │ M 17.99 10.38 122.8 1001.0 0.1184 0.2776 0.3001 0.1471 0.2419 0.07871 1.095 0.9053 8.589 153.4 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.1189
2 │ M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.0186 0.0134 0.01389 0.003532 24.99 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.186 0.275 0.08902
3 │ M 19.69 21.25 130.0 1203.0 0.1096 0.1599 0.1974 0.1279 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.00615 0.04006 0.03832 0.02058 0.0225 0.004571 23.57 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.243 0.3613 0.08758
4 │ M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414 0.1052 0.2597 0.09744 0.4956 1.156 3.445 27.23 0.00911 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.5 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.173
5 │ M 20.29 14.34 135.1 1297.0 0.1003 0.1328 0.198 0.1043 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.01149 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.2 1575.0 0.1374 0.205 0.4 0.1625 0.2364 0.07678
6 │ M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578 0.08089 0.2087 0.07613 0.3345 0.8902 2.217 27.19 0.00751 0.03345 0.03672 0.01137 0.02165 0.005082 15.47 23.75 103.4 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.1244
7 │ M 18.25 19.98 119.6 1040.0 0.09463 0.109 0.1127 0.074 0.1794 0.05742 0.4467 0.7732 3.18 53.91 0.004314 0.01382 0.02254 0.01039 0.01369 0.002179 22.88 27.66 153.2 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368
8 │ M 13.71 20.83 90.2 577.9 0.1189 0.1645 0.09366 0.05985 0.2196 0.07451 0.5835 1.377 3.856 50.96 0.008805 0.03029 0.02488 0.01448 0.01486 0.005412 17.06 28.14 110.6 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.1151
9 │ M 13.0 21.82 87.5 519.8 0.1273 0.1932 0.1859 0.09353 0.235 0.07389 0.3063 1.002 2.406 24.32 0.005731 0.03502 0.03553 0.01226 0.02143 0.003749 15.49 30.73 106.2 739.3 0.1703 0.5401 0.539 0.206 0.4378 0.1072
10 │ M 12.46 24.04 83.97 475.9 0.1186 0.2396 0.2273 0.08543 0.203 0.08243 0.2976 1.599 2.039 23.94 0.007149 0.07217 0.07743 0.01432 0.01789 0.01008 15.09 40.68 97.65 711.4 0.1853 1.058 1.105 0.221 0.4366 0.2075
To check the statistical attributes of each individual feature, we can use the describe() method:
describe(df)
31×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
1 │ Class B M 0 String1
2 │ mean radius 14.1273 6.981 13.37 28.11 0 Float64
3 │ mean texture 19.2896 9.71 18.84 39.28 0 Float64
4 │ mean perimeter 91.969 43.79 86.24 188.5 0 Float64
5 │ mean area 654.889 143.5 551.1 2501.0 0 Float64
6 │ mean smoothness 0.0963603 0.05263 0.09587 0.1634 0 Float64
7 │ mean compactness 0.104341 0.01938 0.09263 0.3454 0 Float64
8 │ mean concavity 0.0887993 0.0 0.06154 0.4268 0 Float64
9 │ mean concave points 0.0489191 0.0 0.0335 0.2012 0 Float64
10 │ mean symmetry 0.181162 0.106 0.1792 0.304 0 Float64
11 │ mean fractal dimension 0.0627976 0.04996 0.06154 0.09744 0 Float64
12 │ radius error 0.405172 0.1115 0.3242 2.873 0 Float64
13 │ texture error 1.21685 0.3602 1.108 4.885 0 Float64
14 │ perimeter error 2.86606 0.757 2.287 21.98 0 Float64
15 │ area error 40.3371 6.802 24.53 542.2 0 Float64
16 │ smoothness error 0.00704098 0.001713 0.00638 0.03113 0 Float64
17 │ compactness error 0.0254781 0.002252 0.02045 0.1354 0 Float64
18 │ concavity error 0.0318937 0.0 0.02589 0.396 0 Float64
19 │ concave points error 0.0117961 0.0 0.01093 0.05279 0 Float64
20 │ symmetry error 0.0205423 0.007882 0.01873 0.07895 0 Float64
21 │ fractal dimension error 0.0037949 0.0008948 0.003187 0.02984 0 Float64
22 │ worst radius 16.2692 7.93 14.97 36.04 0 Float64
23 │ worst texture 25.6772 12.02 25.41 49.54 0 Float64
24 │ worst perimeter 107.261 50.41 97.66 251.2 0 Float64
25 │ worst area 880.583 185.2 686.5 4254.0 0 Float64
26 │ worst smoothness 0.132369 0.07117 0.1313 0.2226 0 Float64
27 │ worst compactness 0.254265 0.02729 0.2119 1.058 0 Float64
28 │ worst concavity 0.272188 0.0 0.2267 1.252 0 Float64
29 │ worst concave points 0.114606 0.0 0.09993 0.291 0 Float64
30 │ worst symmetry 0.290076 0.1565 0.2822 0.6638 0 Float64
31 │ worst fractal dimension 0.0839458 0.05504 0.08004 0.2075 0 Float64
As we can see, the features differ widely in range and quantiles: mean area, for example, spans 143.5 to 2501.0, while mean smoothness never exceeds 0.1634. Differences of this size can cause trouble for optimization-based learners and may lead to convergence issues. A feature scaling technique like Standardizer() handles this.
But first, let's handle the scientific types of all the features. We can use the schema() method from the MLJ.jl package to inspect them:
schema(df)
│ names │ scitypes │ types │
│ Class │ Textual │ String1 │
│ mean radius │ Continuous │ Float64 │
│ mean texture │ Continuous │ Float64 │
│ mean perimeter │ Continuous │ Float64 │
│ mean area │ Continuous │ Float64 │
│ mean smoothness │ Continuous │ Float64 │
│ mean compactness │ Continuous │ Float64 │
│ mean concavity │ Continuous │ Float64 │
│ mean concave points │ Continuous │ Float64 │
│ mean symmetry │ Continuous │ Float64 │
│ mean fractal dimension │ Continuous │ Float64 │
│ radius error │ Continuous │ Float64 │
│ texture error │ Continuous │ Float64 │
│ perimeter error │ Continuous │ Float64 │
│ area error │ Continuous │ Float64 │
│ smoothness error │ Continuous │ Float64 │
│ compactness error │ Continuous │ Float64 │
│ concavity error │ Continuous │ Float64 │
│ concave points error │ Continuous │ Float64 │
│ symmetry error │ Continuous │ Float64 │
│ fractal dimension error │ Continuous │ Float64 │
│ worst radius │ Continuous │ Float64 │
│ worst texture │ Continuous │ Float64 │
│ worst perimeter │ Continuous │ Float64 │
│ worst area │ Continuous │ Float64 │
│ worst smoothness │ Continuous │ Float64 │
│ worst compactness │ Continuous │ Float64 │
│ worst concavity │ Continuous │ Float64 │
│ worst concave points │ Continuous │ Float64 │
│ worst symmetry │ Continuous │ Float64 │
│ worst fractal dimension │ Continuous │ Float64 │
As Textual is a scitype reserved for text data "with sentiment", we need to coerce the scitype to the more appropriate OrderedFactor:
coerce!(df, :Class => OrderedFactor{2});
scitype(df.Class)
AbstractVector{OrderedFactor{2}} (alias for AbstractArray{ScientificTypesBase.OrderedFactor{2}, 1})
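To see what the coercion does to the class labels, here is a minimal sketch on a toy column. The `toy` table is hypothetical and only stands in for `df`. The sketch assumes `levels` is available after `using MLJ` (it originates in CategoricalArrays). Levels of a coerced string column are sorted, so "B" should come before "M", which makes "M" (malignant) the second, "positive" class.

```julia
using MLJ
using DataFrames

# Hypothetical toy column standing in for df.Class
toy = DataFrame(Class = ["M", "B", "B", "M"])

# Strings are interpreted as Textual until we say otherwise
@show scitype(toy.Class)

# Coerce to an ordered two-class factor, as done for df above
coerce!(toy, :Class => OrderedFactor)
@show scitype(toy.Class)

# The level order decides which class measures treat as "positive"
@show levels(toy.Class)
```

If a different ordering were needed, `levels!` could be used to rearrange the levels before training.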
Now that our data is fully processed, we can separate the target variable 'y' from the feature set 'X' using the unpack() method.
rng = StableRNG(123)
y, X = unpack(df, ==(:Class); rng);
We'll be using 80% of the data for training, and can perform a train-test split using the partition method:
train, test = partition(eachindex(y), 0.8; rng)
([281, 534, 44, 524, 554, 470, 50, 199, 295, 513, 569, 156, 176, 30, 404, 185, 307, 92, 373, 403, 9, 333, 210, 488, 79, 539, 561, 151, 366, 492, 178, 221, 209, 261, 43, 10, 475, 472, 352, 336, 407, 111, 275, 411, 90, 486, 390, 334, 549, 7, 421, 229, 415, 16, 257, 274, 160, 192, 510, 474, 310, 86, 114, 428, 72, 532, 317, 33, 558, 183, 36, 59, 489, 288, 551, 507, 170, 38, 144, 23, 31, 135, 456, 358, 252, 424, 223, 77, 203, 357, 158, 482, 224, 487, 303, 304, 434, 14, 251, 550, 149, 408, 268, 253, 244, 39, 519, 351, 491, 493, 538, 239, 112, 218, 238, 222, 546, 473, 356, 448, 517, 37, 213, 102, 41, 227, 168, 560, 47, 266, 327, 406, 480, 452, 143, 80, 361, 234, 469, 109, 173, 506, 365, 396, 541, 55, 246, 164, 372, 540, 495, 413, 61, 207, 374, 27, 189, 545, 457, 12, 188, 154, 468, 446, 471, 264, 494, 343, 236, 548, 335, 4, 350, 412, 103, 249, 430, 20, 69, 348, 186, 116, 65, 159, 146, 232, 128, 522, 313, 150, 233, 96, 113, 504, 405, 57, 445, 523, 435, 179, 191, 293, 202, 371, 329, 320, 544, 402, 139, 3, 119, 214, 215, 410, 130, 278, 325, 265, 153, 120, 375, 171, 53, 88, 70, 339, 376, 379, 19, 100, 105, 512, 364, 22, 535, 453, 437, 337, 349, 67, 99, 508, 83, 117, 305, 323, 226, 526, 399, 341, 204, 71, 565, 377, 414, 419, 163, 289, 467, 131, 297, 8, 206, 463, 441, 294, 398, 177, 477, 52, 18, 93, 260, 431, 145, 443, 250, 84, 87, 483, 400, 520, 94, 211, 568, 108, 290, 383, 490, 511, 107, 444, 422, 433, 389, 152, 362, 353, 49, 360, 240, 529, 311, 552, 543, 368, 462, 157, 432, 78, 66, 454, 393, 563, 104, 429, 201, 369, 259, 24, 367, 97, 40, 499, 449, 387, 76, 219, 296, 417, 17, 248, 292, 527, 32, 241, 308, 300, 409, 58, 394, 465, 95, 395, 81, 392, 62, 440, 380, 98, 167, 15, 442, 316, 148, 284, 500, 484, 243, 271, 386, 136, 194, 29, 514, 359, 230, 450, 122, 25, 464, 279, 556, 129, 416, 89, 542, 235, 322, 537, 285, 458, 60, 332, 459, 321, 461, 547, 85, 553, 426, 42, 237, 283, 401, 91, 138, 200, 231, 272, 439, 478, 326, 11, 518, 181, 174, 255, 505, 515, 509, 273, 280, 496, 460, 
331, 344, 466, 567, 64, 438, 262, 299, 34, 75, 54, 263, 267, 205, 277, 328, 63, 525, 342, 388, 82, 220, 397, 225, 502, 169, 172, 126, 133, 115, 370, 291, 45, 74, 298, 503, 1, 338, 346, 256], [56, 536, 557, 182, 455, 35, 198, 481, 309, 282, 6, 124, 423, 347, 141, 562, 101, 498, 212, 118, 28, 533, 516, 48, 134, 391, 193, 132, 564, 197, 180, 187, 378, 385, 155, 276, 286, 476, 319, 190, 381, 479, 306, 217, 345, 137, 501, 287, 161, 354, 142, 247, 555, 208, 127, 330, 228, 110, 340, 147, 302, 485, 254, 270, 245, 447, 26, 427, 318, 162, 5, 363, 315, 216, 528, 314, 530, 21, 68, 436, 51, 242, 121, 2, 165, 195, 196, 497, 269, 566, 73, 451, 382, 559, 13, 312, 301, 324, 521, 531, 384, 420, 355, 418, 125, 123, 106, 258, 166, 46, 140, 425, 184, 175])
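As a small illustration of how partition behaves, the sketch below splits the indices 1:10 instead of the real data. The variable names are illustrative only. Without an rng the original order is kept; passing an rng shuffles the indices before splitting, which is what happens in the call above.

```julia
using MLJ
using StableRNGs

# Without an rng, the first 80% of indices go to train, the rest to test
train_demo, test_demo = partition(1:10, 0.8)
@show train_demo test_demo

# With an rng, indices are shuffled first, then split 80/20
rng_demo = StableRNG(123)
train_shuffled, test_shuffled = partition(1:10, 0.8; rng = rng_demo)
@show train_shuffled test_shuffled
```

Either way, every index lands in exactly one of the two sets.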
Now that our feature set is separated from the target variable, we can use the Standardizer() workflow to standardize our feature set X. The machine is fitted on the training rows only, so that no information from the test rows leaks into the scaling.
transformer_instance = Standardizer()
transformer_model = machine(transformer_instance, X[train,:])
fit!(transformer_model)
X = MLJ.transform(transformer_model, X);
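To make the scaling step concrete, here is a minimal sketch of the underlying idea (z-scoring) using only the Statistics standard library. It is an illustration, not the Standardizer implementation. The values are the first five "mean area" entries from the table printed earlier, and `train_rows` is a hypothetical stand-in for the training indices.

```julia
using Statistics

# First five "mean area" values from the table shown earlier
mean_area = [1001.0, 1326.0, 1203.0, 386.1, 1297.0]

# Hypothetical training rows: statistics are learned from these only
train_rows = 1:4
μ = mean(mean_area[train_rows])
σ = std(mean_area[train_rows])

# Apply the training statistics to every row, including unseen ones
z = (mean_area .- μ) ./ σ

println("scaled values: ", round.(z, digits = 2))
println("train mean: ", round(mean(z[train_rows]), digits = 10))
println("train std:  ", round(std(z[train_rows]), digits = 10))
```

The training rows end up with mean 0 and standard deviation 1, while the held-out row is scaled with the same constants and so need not match those moments exactly.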
With feature scaling complete, we are ready to compare the performance of various machine learning models for classification.
Now that we have separate training and testing sets, let's see the models compatible with our data!
models(matching(X, y))
55-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
(name = AdaBoostClassifier, package_name = MLJScikitLearnInterface, ... )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
(name = BaggingClassifier, package_name = MLJScikitLearnInterface, ... )
(name = BayesianLDA, package_name = MLJScikitLearnInterface, ... )
(name = BayesianLDA, package_name = MultivariateStats, ... )
(name = BayesianQDA, package_name = MLJScikitLearnInterface, ... )
(name = BayesianSubspaceLDA, package_name = MultivariateStats, ... )
(name = CatBoostClassifier, package_name = CatBoost, ... )
(name = ConstantClassifier, package_name = MLJModels, ... )
(name = DecisionTreeClassifier, package_name = BetaML, ... )
(name = DecisionTreeClassifier, package_name = DecisionTree, ... )
(name = DeterministicConstantClassifier, package_name = MLJModels, ... )
(name = DummyClassifier, package_name = MLJScikitLearnInterface, ... )
(name = EvoTreeClassifier, package_name = EvoTrees, ... )
(name = ExtraTreesClassifier, package_name = MLJScikitLearnInterface, ... )
(name = GaussianNBClassifier, package_name = MLJScikitLearnInterface, ... )
(name = GaussianNBClassifier, package_name = NaiveBayes, ... )
(name = GaussianProcessClassifier, package_name = MLJScikitLearnInterface, ... )
(name = GradientBoostingClassifier, package_name = MLJScikitLearnInterface, ... )
(name = HistGradientBoostingClassifier, package_name = MLJScikitLearnInterface, ... )
(name = KNNClassifier, package_name = NearestNeighborModels, ... )
(name = KNeighborsClassifier, package_name = MLJScikitLearnInterface, ... )
(name = KernelPerceptronClassifier, package_name = BetaML, ... )
(name = LDA, package_name = MultivariateStats, ... )
(name = LGBMClassifier, package_name = LightGBM, ... )
(name = LinearBinaryClassifier, package_name = GLM, ... )
(name = LinearSVC, package_name = LIBSVM, ... )
(name = LogisticCVClassifier, package_name = MLJScikitLearnInterface, ... )
(name = LogisticClassifier, package_name = MLJLinearModels, ... )
(name = LogisticClassifier, package_name = MLJScikitLearnInterface, ... )
(name = MultinomialClassifier, package_name = MLJLinearModels, ... )
(name = NeuralNetworkClassifier, package_name = BetaML, ... )
(name = NeuralNetworkClassifier, package_name = MLJFlux, ... )
(name = NuSVC, package_name = LIBSVM, ... )
(name = PassiveAggressiveClassifier, package_name = MLJScikitLearnInterface, ... )
(name = PegasosClassifier, package_name = BetaML, ... )
(name = PerceptronClassifier, package_name = BetaML, ... )
(name = PerceptronClassifier, package_name = MLJScikitLearnInterface, ... )
(name = ProbabilisticNuSVC, package_name = LIBSVM, ... )
(name = ProbabilisticSGDClassifier, package_name = MLJScikitLearnInterface, ... )
(name = ProbabilisticSVC, package_name = LIBSVM, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )
(name = RandomForestClassifier, package_name = DecisionTree, ... )
(name = RandomForestClassifier, package_name = MLJScikitLearnInterface, ... )
(name = RidgeCVClassifier, package_name = MLJScikitLearnInterface, ... )
(name = RidgeClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SGDClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SVC, package_name = LIBSVM, ... )
(name = SVMClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SVMLinearClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SVMNuClassifier, package_name = MLJScikitLearnInterface, ... )
(name = StableForestClassifier, package_name = SIRUS, ... )
(name = StableRulesClassifier, package_name = SIRUS, ... )
(name = SubspaceLDA, package_name = MultivariateStats, ... )
(name = XGBoostClassifier, package_name = XGBoost, ... )
That's a lot of models for our data! To narrow it down, we'll analyze the performance of probabilistic predictors with pure Julia implementations. The results are collected in four vectors:
model_names: captures the names of the models being evaluated
accuracies: values of the model accuracy on the test set
log_losses: values of the log loss (cross entropy) on the test set
f1_scores: captures the values of F1-Score on the test set
model_names = Vector{String}()
accuracies = []
log_losses = []
f1_scores = []
# Keep only probabilistic, pure-Julia models (excluding the SIRUS package)
models_to_evaluate = models(matching(X, y)) do m
    m.prediction_type==:probabilistic && m.is_pure_julia &&
        m.package_name != "SIRUS"
end
p = plot(legendfontsize=7, title="ROC Curve")
# Diagonal reference line for a classifier that guesses at random
plot!([0, 1], [0, 1], linewidth=2, linestyle=:dash, color=:black)
for m in models_to_evaluate
    model = m.name
    pkg = m.package_name
    model_name = "$model ($pkg)"
    @info "Evaluating $model_name. "
    # Load the model type, then train on the training rows only
    eval(:(clf = @load $model pkg=$pkg verbosity=0))
    clf_machine = machine(clf(), X, y)
    fit!(clf_machine, rows=train, verbosity=0)
    y_pred = MLJ.predict(clf_machine, rows=test);
    fprs, tprs, thresholds = roc_curve(y_pred, y[test])
    plot!(p, fprs, tprs, label=model_name)
    # Record the test-set metrics for this model
    push!(model_names, model_name)
    push!(accuracies, accuracy(mode.(y_pred), y[test]))
    push!(log_losses, log_loss(y_pred, y[test]))
    push!(f1_scores, f1score(mode.(y_pred), y[test]))
end
#Adding labels and legend to the ROC-AUC curve
xlabel!("False Positive Rate (positive=malignant)")
ylabel!("True Positive Rate")
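Let's collect the data in the form of a DataFrame for a more precise analysis:
model_comparison = DataFrame(ModelName=model_names, Accuracy=accuracies, LogLoss=log_losses, F1Score=f1_scores);
Finally, let's sort the data on the basis of the log loss: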
sort!(model_comparison, [:LogLoss])
21×4 DataFrame
Row │ ModelName Accuracy LogLoss F1Score
│ String Any Any Any
1 │ NeuralNetworkClassifier (BetaML) 0.982456 0.0815565 0.97619
2 │ NeuralNetworkClassifier (MLJFlux) 0.964912 0.0841788 0.953488
3 │ RandomForestClassifier (Decision… 0.95614 0.108848 0.942529
4 │ RandomForestClassifier (BetaML) 0.964912 0.111642 0.953488
5 │ EvoTreeClassifier (EvoTrees) 0.95614 0.127662 0.941176
6 │ BayesianLDA (MultivariateStats) 0.929825 0.166699 0.9
7 │ BayesianSubspaceLDA (Multivariat… 0.929825 0.166709 0.9
8 │ SubspaceLDA (MultivariateStats) 0.938596 0.209371 0.91358
9 │ AdaBoostStumpClassifier (Decisio… 0.947368 0.275107 0.926829
10 │ KernelPerceptronClassifier (Beta… 0.894737 0.418525 0.863636
11 │ KNNClassifier (NearestNeighborMo… 0.95614 0.430947 0.942529
12 │ PegasosClassifier (BetaML) 0.912281 0.498056 0.891304
13 │ ConstantClassifier (MLJModels) 0.622807 0.662744 0.0
14 │ LDA (MultivariateStats) 0.938596 0.677149 0.91358
15 │ GaussianNBClassifier (NaiveBayes) 0.929825 0.898701 0.906977
16 │ PerceptronClassifier (BetaML) 0.947368 1.36307 0.926829
17 │ MultinomialClassifier (MLJLinear… 0.938596 1.57496 0.915663
18 │ DecisionTreeClassifier (BetaML) 0.95614 1.58694 0.941176
19 │ LogisticClassifier (MLJLinearMod… 0.938596 2.20713 0.915663
20 │ LinearBinaryClassifier (GLM) 0.912281 2.88829 0.878049
21 │ DecisionTreeClassifier (Decision… 0.903509 3.4779 0.873563
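The table shows models with similar accuracy but very different log loss. A minimal sketch of why this can happen is given below: log loss averages the negative log of the probability assigned to the true class, so a confident wrong prediction is penalized far more than a hesitant one. The function `manual_log_loss`, its clamping tolerance, and the two hypothetical classifiers are assumptions for illustration, not MLJ's implementation. The counts (114 test rows, 11 errors) correspond to an accuracy of 0.903509.

```julia
using Statistics

# Mean negative log-probability of the true class; probabilities are
# clamped away from 0 and 1 (the tolerance here is an assumption)
manual_log_loss(p_true; tol = eps()) =
    -mean(log.(clamp.(p_true, tol, 1 - tol)))

n_test, n_wrong = 114, 11

# Hypothetical "hard" classifier: probability 1 or 0 on the true class
hard = vcat(fill(0.0, n_wrong), fill(1.0, n_test - n_wrong))

# Hypothetical "soft" classifier: same errors, but less extreme probabilities
soft = vcat(fill(0.1, n_wrong), fill(0.9, n_test - n_wrong))

println("hard predictions: ", round(manual_log_loss(hard), digits = 4))
println("soft predictions: ", round(manual_log_loss(soft), digits = 4))
```

Both hypothetical classifiers make the same 11 mistakes, yet the hard one scores roughly ten times worse. This is one reading consistent with the last row of the table, not a confirmed explanation of it.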