MLJ for Data Scientists in Two Hours

An end-to-end example using the Telco Churn dataset

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

An application of the MLJ toolbox to the Telco Customer Churn dataset, aimed at practicing data scientists new to MLJ (Machine Learning in Julia). This tutorial does not cover exploratory data analysis.

MLJ is a general-purpose machine learning toolbox (i.e., not just for deep learning).

For other MLJ learning resources see the Learning MLJ section of the manual.

Topics covered: grabbing and preparing a dataset, basic fit/predict workflow, constructing a pipeline that includes data pre-processing, estimating performance metrics, ROC curves, confusion matrices, feature importance, basic feature selection, controlling iterative models, hyper-parameter optimization (tuning).

Prerequisites for this tutorial. Previous experience building, evaluating, and optimizing machine learning models using scikit-learn, caret, MLR, Weka, or a similar tool. No previous experience with MLJ is assumed. Only fairly basic familiarity with Julia is required. The tutorial uses DataFrames.jl, but only in a minimal way (this cheatsheet may help).

Time. Between two and three hours, first time through.