Working with Tasks
Warning. The task API may be deprecated in the future.
In MLJ a task is a synthesis of three elements: data, an interpretation of that data, and a learning objective. Once one has a task one is ready to choose learning models.
Scientific types and the interpretation of data
Generally the columns of a table, such as a DataFrame, represent real quantities. However, the nature of a quantity is not always clear from the representation. For example, we might count phone calls using the UInt32 type but also use UInt32 to represent a categorical feature, such as the species of conifers. MLJ mitigates such ambiguity with the use of scientific types. See Getting Started for details.
Explicitly specifying scientific types during the construction of an MLJ task is the user's opportunity to articulate how the supplied data should be interpreted.
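The ambiguity described above can be illustrated with a minimal sketch. It assumes the ScientificTypes.jl package (which supplies MLJ's scientific types) is installed and that elscitype and coerce behave as in its recent releases; the variable names are illustrative only.

```julia
using ScientificTypes

# Two vectors with the same machine type but different intended meanings.
calls   = UInt32[3, 0, 7, 2]   # number of phone calls: a genuine count
species = UInt32[1, 2, 1, 3]   # conifer species codes: categorical labels

# Without further information, integer elements are interpreted as counts.
default_calls   = elscitype(calls)
default_species = elscitype(species)

# Coercion states the intended interpretation of the species codes.
species_mc      = coerce(species, Multiclass)
coerced_species = elscitype(species_mc)

println(default_calls, "  ", default_species, "  ", coerced_species)
```

The machine type is the same in both cases; only the coercion records that the second vector holds labels rather than quantities.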
Learning objectives
In MLJ, specifying a learning objective means specifying: (i) whether learning is supervised or not; (ii) whether, in the supervised case, predictions are to be probabilistic or deterministic; and (iii) what part of the data is relevant and what role each part is to play.
Sample usage
Given some data,
using RDatasets
df = dataset("boot", "channing");
first(df, 4)
| Row | Sex | Entry | Exit | Time | Cens |
|---|---|---|---|---|---|
| | Categorical… | Int32 | Int32 | Int32 | Int32 |
| 1 | Male | 782 | 909 | 127 | 1 |
| 2 | Male | 1020 | 1128 | 108 | 1 |
| 3 | Male | 856 | 969 | 113 | 1 |
| 4 | Male | 915 | 957 | 42 | 1 |
we can check MLJ's interpretation of that data:
schema(df)
(names = (:Sex, :Entry, :Exit, :Time, :Cens),
types = (CategoricalArrays.CategoricalString{UInt8}, Int32, Int32, Int32, Int32),
scitypes = (Multiclass{2}, Count, Count, Count, Count),
nrows = 462,)
We construct a task by wrapping the data in a learning objective and coercing the data into a form MLJ will correctly interpret (the middle three fields of df refer to ages, in months; the last is a flag):
task = supervised(data=df,
                  target=:Exit,
                  ignore=:Time,
                  is_probabilistic=true,
                  types=Dict(:Entry=>Continuous,
                             :Exit=>Continuous,
                             :Cens=>Multiclass))
schema(task.X)
(names = (:Sex, :Entry, :Cens),
types = (CategoricalArrays.CategoricalString{UInt8}, Float64, CategoricalArrays.CategoricalValue{Int32,UInt8}),
scitypes = (Multiclass{2}, Continuous, Multiclass{2}),
nrows = 462,)
Row selection and shuffling:
task[1:3].y
3-element Array{Float64,1}:
909.0
1128.0
969.0
using Random
rng = MersenneTwister(1234)
shuffle!(rng, task) # rng is optional
task[1:3].y
3-element Array{Float64,1}:
1015.0
976.0
1044.0
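As a rough illustration of what shuffling a task involves, the sketch below applies one seeded permutation to both inputs and target so that rows stay aligned. It uses only Julia's standard library; X and y are hypothetical stand-ins for a task's fields, not the MLJ task type, and this is not a description of how shuffle! is implemented.

```julia
using Random

# Hypothetical stand-ins for a task's inputs (a column table) and target.
X = (Entry = [782.0, 1020.0, 856.0, 915.0],
     Sex   = ["Male", "Male", "Male", "Male"])
y = [909.0, 1128.0, 969.0, 957.0]

rng  = MersenneTwister(1234)
perm = randperm(rng, length(y))        # one permutation shared by X and y

X_shuffled = map(col -> col[perm], X)  # reorder every column of the table
y_shuffled = y[perm]                   # reorder the target the same way

# Selecting the first three rows after shuffling, analogous to task[1:3].y
first_three = y_shuffled[1:3]
```

Sharing a single permutation is what keeps each target value paired with its original input row.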
Counting rows:
nrows(task)
462
Listing the models available to complete a task:
models(task)
1-element Array{NamedTuple,1}:
(name = ConstantRegressor, package_name = MLJModels, ... )
Choosing a model and evaluating on the task:
julia> using RDatasets
julia> iris = dataset("datasets", "iris"); # a DataFrame
julia> task = supervised(data=iris, target=:Species, is_probabilistic=true)
┌ Info:
│ is_probabilistic = true
│ input_scitype = Table{AbstractArray{Continuous,1}}
└ target_scitype = AbstractArray{Multiclass{3},1}
SupervisedTask @ 1…00
julia> tree = @load DecisionTreeClassifier verbosity=1
import MLJModels ✔
import DecisionTree ✔
import MLJModels.DecisionTree_.DecisionTreeClassifier ✔
MLJModels.DecisionTree_.DecisionTreeClassifier(pruning_purity = 1.0,
max_depth = -1,
min_samples_leaf = 1,
min_samples_split = 2,
min_purity_increase = 0.0,
n_subfeatures = 0,
display_depth = 5,
post_prune = false,
merge_purity_threshold = 0.9,
pdf_smoothing = 0.05,) @ 1…55
julia> mach = machine(tree, task)
Machine{DecisionTreeClassifier} @ 3…12
julia> evaluate!(mach, operation=predict_mode,
resampling=Holdout(), measure=misclassification_rate, verbosity=0)
(measure = MLJBase.MisclassificationRate[misclassification_rate],
measurement = [0.222222],
per_fold = Array{Float64,1}[[0.222222]],
per_observation = Missing[missing],)
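The measurement reported above is a misclassification rate computed on held-out rows. A minimal standard-library sketch of that calculation follows. The labels and predictions are made up for illustration and are not output of the decision tree, and the training fraction of 0.7 is an assumption of this sketch, not a statement about the defaults of Holdout.

```julia
using Statistics

# Hypothetical ground truth and predictions (not produced by the tree above).
y    = ["setosa", "versicolor", "virginica", "setosa", "versicolor",
        "virginica", "setosa", "versicolor", "virginica", "setosa"]
yhat = ["setosa", "versicolor", "versicolor", "setosa", "versicolor",
        "virginica", "setosa", "virginica", "virginica", "setosa"]

# Holdout: an initial fraction of rows is used for training, the rest for testing.
fraction_train = 0.7                        # assumed value for this sketch
n_train = round(Int, fraction_train * length(y))
test    = (n_train + 1):length(y)

# Misclassification rate: the proportion of test predictions that are wrong.
rate = mean(yhat[test] .!= y[test])
```

Here the test rows are 8 to 10, one of the three predictions is wrong, and so the rate is one third.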
API Reference
MLJ.supervised
Function.
task = supervised(data=nothing,
                  types=Dict(),
                  target=nothing,
                  ignore=Symbol[],
                  is_probabilistic=false,
                  verbosity=1)
Construct a supervised learning task with input features X and target y, where: y is the column vector from data named target, if this is a single symbol, or a vector of tuples, if target is a vector; X consists of all remaining columns of data not named in ignore, and is a table unless it has only one column, in which case it is a vector.
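The way data is divided into X and y can be sketched in plain Julia. The following is an illustration of the rule just stated for a single-symbol target, not the constructor's actual implementation; the column table and its values are hypothetical.

```julia
# A column table: a named tuple of equal-length vectors (hypothetical data).
data = (Sex   = ["Male", "Female"],
        Entry = [782, 1020],
        Exit  = [909, 1128],
        Time  = [127, 108])

target = :Exit
ignore = [:Time]

# y is the column of data named by target.
y = getproperty(data, target)

# X keeps every column that is neither the target nor ignored.
keep = Tuple(n for n in keys(data) if n != target && !(n in ignore))

# X is a table unless only one column remains, in which case it is a vector.
X = length(keep) == 1 ? data[keep[1]] : NamedTuple{keep}(data)
```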
The data types of elements in a column of data named as a key of the dictionary types are coerced to have a scientific type given by the corresponding value. Possible values are Continuous, Multiclass, OrderedFactor and Count. So, for example, types=Dict(:x1=>Count) means elements of the column of data named :x1 will be coerced into integers (whose scitypes are always Count).
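The effect of such a coercion on a single column can be tried outside the task constructor. The sketch below assumes the ScientificTypes.jl package is installed and that its coerce function accepts a table followed by column=>scitype pairs, as in recent releases; the table X and its values are hypothetical.

```julia
using ScientificTypes

# Hypothetical column table; :x1 holds whole numbers stored as floats.
X = (x1 = [3.0, 0.0, 7.0], x2 = [1.5, 2.5, 3.5])

before = elscitype(X.x1)        # interpretation before coercion

Xc = coerce(X, :x1 => Count)    # request that :x1 be treated as a count

after = elscitype(Xc.x1)        # interpretation after coercion
```

Only the named column is changed; x2 keeps its original element type.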
task = supervised(X, y;
is_probabilistic=false,
verbosity=1)
A more customizable constructor, this returns a supervised learning task with input features X and target y, where: X is a table or vector (univariate inputs), while y must be a vector whose elements are scalars, or tuples of scalars (of constant length for ordinary multivariate predictions, and of variable length for sequence prediction). Table rows must correspond to patterns and columns to features. Type coercion is not available for this constructor (but see also coerce).
X, y = task()
Returns the input X and target y of the task, also available as task.X and task.y.
MLJ.unsupervised
Function.
task = unsupervised(data=nothing, types=Dict(), ignore=Symbol[], verbosity=1)
Construct an unsupervised learning task with given input data, which should be a table or, in the case of univariate inputs, a single vector.
The data types of elements in a column of data named as a key of the dictionary types are coerced to have a scientific type given by the corresponding value. Possible values are Continuous, Multiclass, OrderedFactor and Count. So, for example, types=Dict(:x1=>Count) means elements of the column of data named :x1 will be coerced into integers (whose scitypes are always Count).
Rows of data must correspond to patterns and columns to features. Columns in data whose names appear in ignore are ignored.
X = task()
Return the input data in a form to be used in models.
See also scitype, scitype_union.
MLJModels.models
Function.
models(W)
A vector of all models referenced by a node, each model appearing exactly once.
models()
List all models in the MLJ registry. Here and below model means the registry metadata entry for a genuine model type (a proxy for types whose defining code may not be loaded).
models(conditions...)
List all models satisfying the specified conditions. A condition is any Bool-valued function on models.
Excluded from the listings are the built-in model wrappers EnsembleModel, TunedModel, and IteratedModel.
Example
If
task(model) = model.is_supervised && model.is_probabilistic
then models(task) lists all supervised models making probabilistic predictions.
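The filtering behaviour can be mimicked in plain Julia. In the sketch below, registry and matching are hypothetical stand-ins for the MLJ model registry and for models(conditions...), and the metadata entries are reduced to three illustrative fields.

```julia
# Hypothetical registry entries; real entries carry many more fields.
registry = [
    (name = "ConstantRegressor",      is_supervised = true,  is_probabilistic = true),
    (name = "DecisionTreeClassifier", is_supervised = true,  is_probabilistic = true),
    (name = "RidgeRegressor",         is_supervised = true,  is_probabilistic = false),
    (name = "KMeans",                 is_supervised = false, is_probabilistic = false),
]

# A condition is any Bool-valued function of a registry entry.
task(model) = model.is_supervised && model.is_probabilistic

# Hypothetical stand-in for models(conditions...): keep the entries
# for which every supplied condition returns true.
matching(conditions...) = filter(m -> all(c -> c(m), conditions), registry)

selected = [m.name for m in matching(task)]
```

With no conditions supplied, matching() returns every entry, which mirrors the behaviour described for models().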
See also: localmodels.
MLJModels.localmodels
Function.
localmodels(; modl=Main)
localmodels(conditions...; modl=Main)
List all models whose names are in the namespace of the specified module modl, additionally solving the task, or meeting the conditions, if specified. Here a condition is a Bool-valued function on models.
See also models.