Working with Tasks

Warning. The task API described here is likely to change soon, with the notion of a task no longer being bound to any particular data set.

In MLJ a task is a synthesis of three elements: data, an interpretation of that data, and a learning objective. Once one has a task one is ready to choose learning models.

Scientific types and the interpretation of data

Generally the columns of a table, such as a DataFrame, represent real quantities. However, the nature of a quantity is not always clear from the representation. For example, we might count phone calls using the UInt32 type but also use UInt32 to represent a categorical feature, such as the species of conifers. MLJ mitigates such ambiguity with the use of scientific types. See Getting Started for details.

Explicitly specifying scientific types during the construction of an MLJ task is the user's opportunity to articulate how the supplied data should be interpreted.
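To make the ambiguity concrete, here is a minimal sketch in plain Julia. The types AsCount and AsMulticlass and the function summarize are hypothetical stand-ins invented for this illustration; they are not MLJ's scientific types, and the data values are made up.

```julia
using Statistics  # standard library, for `mean`

# Hypothetical interpretation tags; these are NOT MLJ's scientific types.
abstract type Interpretation end
struct AsCount <: Interpretation end
struct AsMulticlass <: Interpretation end

# Two columns with the same machine representation but different meaning.
phone_calls  = UInt32[3, 0, 7, 2]   # a genuine count
species_code = UInt32[1, 3, 3, 2]   # an arbitrary label for a conifer species

# The sensible summary depends on the interpretation, not on the element type.
summarize(v, ::AsCount) = mean(v)            # averaging counts is meaningful
function summarize(v, ::AsMulticlass)        # for labels, tally the frequencies
    freq = Dict{eltype(v),Int}()
    for x in v
        freq[x] = get(freq, x, 0) + 1
    end
    return freq
end

average_calls  = summarize(phone_calls, AsCount())
species_counts = summarize(species_code, AsMulticlass())
```

Both vectors have element type UInt32, so nothing in the representation tells the two cases apart; only the explicitly supplied interpretation does.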

Learning objectives

In MLJ specifying a learning objective means specifying: (i) whether learning is supervised or not; (ii) whether, in the supervised case, predictions are to be probabilistic or deterministic; and (iii) what part of the data is relevant and what role each part is to play.
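As a rough illustration of how these three choices fit together with the data, consider the following sketch. ToySupervisedTask and split_xy are hypothetical names invented here; this is not how MLJ implements tasks, only one way the idea could be expressed in plain Julia using a named tuple of columns as the table.

```julia
# Hypothetical toy type; NOT MLJ's task implementation.
struct ToySupervisedTask{T<:NamedTuple}
    data::T                  # column table: column name => vector
    target::Symbol           # the column to be predicted
    ignore::Vector{Symbol}   # columns playing no role
    is_probabilistic::Bool   # probabilistic or deterministic predictions
end

# Split the data into input features X and target y according to the objective.
function split_xy(task::ToySupervisedTask)
    dropped = (task.target, task.ignore...)
    keep = Tuple(n for n in keys(task.data) if !(n in dropped))
    X = NamedTuple{keep}(task.data)   # select only the remaining columns
    y = task.data[task.target]
    return X, y
end

# Made-up rows, for illustration only.
data = (Sex = ["Male", "Male"], Entry = [782.0, 1020.0],
        Exit = [909.0, 1128.0], Time = [127, 108])
toy_task = ToySupervisedTask(data, :Exit, [:Time], true)
X, y = split_xy(toy_task)
```

Here X retains only the Sex and Entry columns, while y is the Exit column.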

Sample usage

Load a built-in task:

using MLJ
using CSV
task = load_iris()

SupervisedTask @ 1…62

Extract input and target:

X, y = task()
X[1:3, :]

3 rows × 4 columns

	sepal_length	sepal_width	petal_length	petal_width
	Float64	Float64	Float64	Float64
1	5.1	3.5	1.4	0.2
2	4.9	3.0	1.4	0.2
3	4.7	3.2	1.3	0.2

Now, starting with some tabular data...

using RDatasets
df = dataset("boot", "channing");
first(df, 4)

4 rows × 5 columns

	Sex	Entry	Exit	Time	Cens
	Categorical…	Int32	Int32	Int32	Int32
1	Male	782	909	127	1
2	Male	1020	1128	108	1
3	Male	856	969	113	1
4	Male	915	957	42	1

...we can check MLJ's interpretation of that data:

schema(df)

(names = (:Sex, :Entry, :Exit, :Time, :Cens),
 types = (CategoricalArrays.CategoricalString{UInt8}, Int32, Int32, Int32, Int32),
 scitypes = (Multiclass{2}, Count, Count, Count, Count),
 nrows = 462,)

Next, construct a task by wrapping the data in a learning objective and coercing the data into a form MLJ will correctly interpret. (The middle three fields of df refer to ages, in months; the last is a flag.)

task = supervised(data=df,
                  target=:Exit,
                  ignore=:Time,
                  is_probabilistic=true,
                  types=Dict(:Entry=>Continuous,
                             :Exit=>Continuous,
                             :Cens=>Multiclass))
schema(task.X)

(names = (:Sex, :Entry, :Cens),
 types = (CategoricalArrays.CategoricalString{UInt8}, Float64, CategoricalArrays.CategoricalValue{Int32,UInt8}),
 scitypes = (Multiclass{2}, Continuous, Multiclass{2}),
 nrows = 462,)

Shuffle the rows of a task:

task = load_iris()
using Random
rng = MersenneTwister(1234)
shuffle!(rng, task) # rng is optional
task[1:4].y

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "virginica"
 "virginica"
 "setosa"   
 "virginica"
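Shuffling a task must keep each input row paired with its target. The sketch below shows one way this could be realised with a single random permutation applied to both; it uses only the Random standard library, the data is made up, and it is not a description of MLJ's own shuffle! method.

```julia
using Random

# Made-up inputs and targets; row i of X belongs with y[i].
X = (x = [1.0, 2.0, 3.0, 4.0],)
y = ["a", "b", "c", "d"]

rng = MersenneTwister(1234)
perm = randperm(rng, length(y))          # one permutation of the row indices

X_shuffled = map(col -> col[perm], X)    # apply it to every column of X
y_shuffled = y[perm]                     # and the same permutation to y
```

Because the same permutation is used for every column and for the target, the pairing between inputs and targets is preserved.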

Counting and selecting rows of a task:

nrows(task)

task[1:2].y

2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "virginica"
 "virginica"

Listing the models available to complete a task:

models(task)

Dict{Any,Any} with 4 entries:
  "MLJ"          => Any["DeterministicConstantClassifier"]
  "DecisionTree" => Any["DecisionTreeRegressor"]
  "ScikitLearn"  => Any["SVMNuClassifier", "SVMClassifier", "SVMLClassifier"]
  "LIBSVM"       => Any["LinearSVC", "NuSVC", "SVC"]

Binding a model to a task and evaluating performance:

@load DecisionTreeClassifier
mach = machine(DecisionTreeClassifier(), task)
evaluate!(mach, operation=predict_mode, resampling=Holdout(), measure=misclassification_rate, verbosity=0)

(measure = MLJ.MisclassificationRate[misclassification_rate],
 measurement = [0.0888889],
 per_fold = Array{Float64,1}[[0.0888889]],
 per_observation = Missing[missing],)
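To interpret the measurement, it may help to see a misclassification rate computed by hand. The function toy_misclassification_rate below is a hypothetical helper, assuming the rate is simply the fraction of predictions that differ from the true labels; the labels are made up for illustration.

```julia
using Statistics  # standard library, for `mean`

# Assumed definition: fraction of predictions that disagree with the truth.
toy_misclassification_rate(yhat, y) = mean(yhat .!= y)

y    = ["setosa", "virginica", "versicolor", "virginica"]   # true labels
yhat = ["setosa", "virginica", "virginica",  "virginica"]   # predicted labels

rate = toy_misclassification_rate(yhat, y)   # one error in four predictions
```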

API Reference

MLJ.supervised — Function.

task = supervised(data=nothing,
                  types=Dict(),
                  target=nothing,
                  ignore=Symbol[],
                  is_probabilistic=false,
                  verbosity=1)

Construct a supervised learning task with input features X and target y, where: y is the column vector from data named target, if this is a single symbol, or a vector of tuples, if target is a vector; X consists of all remaining columns of data not named in ignore, and is a table unless it has only one column, in which case it is a vector.

The data types of elements in a column of data named as a key of the dictionary types are coerced to have a scientific type given by the corresponding value. Possible values are Continuous, Multiclass, OrderedFactor and Count. So, for example, types=Dict(:x1=>Count) means elements of the column of data named :x1 will be coerced into integers (whose scitypes are always Count).
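The effect of the types dictionary can be sketched in plain Julia as follows. The functions toy_coerce and coerce_columns are hypothetical, and symbols such as :continuous stand in for MLJ's scientific types; the actual coercion rules used by MLJ may differ.

```julia
# Hypothetical coercion rules keyed by symbols; NOT MLJ's `coerce`.
toy_coerce(v, kind::Symbol) =
    kind == :continuous ? float.(v) :        # e.g. Int32 -> Float64
    kind == :count      ? round.(Int, v) :   # e.g. Float64 -> Int
    throw(ArgumentError("unsupported kind: $kind"))

# Coerce only the columns named as keys of `types`; leave the rest untouched.
function coerce_columns(data::NamedTuple, types::Dict{Symbol,Symbol})
    cols = map(keys(data)) do name
        haskey(types, name) ? toy_coerce(data[name], types[name]) : data[name]
    end
    return NamedTuple{keys(data)}(cols)
end

# Made-up rows, for illustration only.
data = (Entry = Int32[782, 1020], Time = Int32[127, 108])
coerced = coerce_columns(data, Dict(:Entry => :continuous))
```

Only the Entry column changes element type here; Time is not a key of the dictionary and is passed through unchanged.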

task = supervised(X, y;
                  is_probabilistic=false,
                  verbosity=1)

A more customizable constructor, this returns a supervised learning task with input features X and target y, where: X is a table or vector (univariate inputs), while y must be a vector whose elements are scalars, or tuples of scalars (of constant length for ordinary multivariate predictions, and of variable length for sequence prediction). Table rows must correspond to patterns and columns to features. Type coercion is not available for this constructor (but see also coerce).
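For a multivariate target, a vector of constant-length tuples can be assembled from separate columns with zip. The column names and values below are made up for illustration.

```julia
# Two made-up target columns of equal length.
exit_age = [909.0, 1128.0, 969.0]
time     = [127.0, 108.0, 113.0]

# One tuple per row (pattern): the form of y described above for
# ordinary multivariate predictions.
y = collect(zip(exit_age, time))
```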

X, y = task()

Returns the input X and target y of the task, also available as task.X and task.y.

MLJ.unsupervised — Function.

task = unsupervised(data=nothing, types=Dict(), ignore=Symbol[], verbosity=1)

Construct an unsupervised learning task with given input data, which should be a table or, in the case of univariate inputs, a single vector.

Rows of data must correspond to patterns and columns to features. Columns in data whose names appear in ignore are ignored.

X = task()

Return the input data in a form suitable for use in models.

MLJ.models — Function.

models(W)

A vector of all models referenced by a node, each model appearing exactly once.

models()

List all models as a dictionary indexed on package name. Models available for immediate use appear under the key "MLJ".

models(conditional)

Restrict results to (model, package) pairs (m, p) satisfying conditional(info(m, pkg=p)) == true.

models(task::MLJTask)

List all models matching the specified task.

Example

To retrieve all probabilistic classifiers:

models(x -> x[:is_supervised] && x[:is_probabilistic]==true)
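The filtering behaviour can be mimicked in plain Julia as a sketch. The registry, its entries and the function toy_models are all hypothetical and made up for illustration; MLJ's actual model registry and the fields returned by info are not reproduced here.

```julia
# Hypothetical registry: (model, package) => metadata. Entries are made up.
registry = Dict(
    ("ToyClassifier", "PkgA") => Dict(:is_supervised => true,  :is_probabilistic => true),
    ("ToyRegressor",  "PkgA") => Dict(:is_supervised => true,  :is_probabilistic => false),
    ("ToyScaler",     "PkgB") => Dict(:is_supervised => false, :is_probabilistic => false),
)

# Keep the pairs whose metadata satisfies `conditional`, grouped by package name.
function toy_models(conditional)
    result = Dict{String,Vector{String}}()
    for ((model, pkg), info) in registry
        conditional(info) && push!(get!(result, pkg, String[]), model)
    end
    return result
end

probabilistic = toy_models(x -> x[:is_supervised] && x[:is_probabilistic])
```

Under these assumptions, probabilistic contains the single key "PkgA" with the one model "ToyClassifier".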