Working with Tasks
In MLJ a task is a synthesis of three elements: data, an interpretation of that data, and a learning objective. Once one has a task, one is ready to choose learning models.
Scientific types and the interpretation of data
Generally the columns of a table, such as a DataFrame, represent real quantities. However, the nature of a quantity is not always clear from its representation. For example, we might count phone calls using the UInt32 type, but also use UInt32 to represent a categorical feature, such as the species of conifers. MLJ mitigates such ambiguity by: (i) distinguishing between the machine and scientific type of scalar data; (ii) disallowing the representation of multiple scientific types by the same machine type during learning; and (iii) establishing a convention for what scientific types a given machine type may represent (see the table at the end of Getting Started).
Explicitly specifying scientific types during the construction of an MLJ task is the user's opportunity to articulate how the supplied data should be interpreted.
WIP: At present scitypes cannot be specified and the user must manually coerce data before task construction.
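For example, a column of integer codes intended as categorical data can be wrapped with categorical (from CategoricalArrays, which MLJ uses), and counts intended as continuous quantities converted to Float64, before the task is constructed. A minimal sketch, with illustrative column names:
using CategoricalArrays
raw = (calls = UInt32[3, 0, 7],           # phone-call counts stored as UInt32
       species_code = UInt32[1, 2, 1])    # conifer species stored as UInt32 codes
coerced = (calls = Float64.(raw.calls),               # to be interpreted as Continuous
           species = categorical(raw.species_code))   # to be interpreted as Multiclass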
Learning objectives
In MLJ, specifying a learning objective means specifying: (i) whether learning is supervised or not; (ii) whether, in the supervised case, predictions are to be probabilistic or deterministic; and (iii) which parts of the data are relevant, and what role each part is to play.
Sample usage
Load a built-in task:
using MLJ
task = load_iris()
SupervisedTask @ 5…09
Extract input and target:
X, y = task()
X[1:3, :]
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 |
Supposing we have some new data, say
coltable = (height = Float64[183, 145, 160, 78, 182, 76],
gender = categorical([:m, :f, :f, :f, :m, :m]),
weight = Float64[92, 67, 62, 25, 80, 31],
age = Float64[53, 12, 60, 5, 31, 7],
overall_health = categorical([1, 2, 1, 3, 3, 1], ordered=true))
(height = [183.0, 145.0, 160.0, 78.0, 182.0, 76.0],
gender = CategoricalArrays.CategoricalValue{Symbol,UInt32}[:m, :f, :f, :f, :m, :m],
weight = [92.0, 67.0, 62.0, 25.0, 80.0, 31.0],
age = [53.0, 12.0, 60.0, 5.0, 31.0, 7.0],
overall_health = CategoricalArrays.CategoricalValue{Int64,UInt32}[1, 2, 1, 3, 3, 1],)
we can check MLJ's default interpretation of data with scitypes. For the iris input features X extracted above, for example:
scitypes(X)
(sepal_length = Continuous,
sepal_width = Continuous,
petal_length = Continuous,
petal_width = Continuous,)
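Applying the same conventions to coltable, one would expect an interpretation along the following lines (a sketch based on the conventions in Getting Started, not output captured from a session):
scitypes(coltable)
# expected interpretation (illustrative):
# (height = Continuous,
#  gender = Multiclass{2},
#  weight = Continuous,
#  age = Continuous,
#  overall_health = OrderedFactor{3},)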
And construct an associated task for the new data:
task = SupervisedTask(data=coltable, target=:overall_health, ignore=:gender, is_probabilistic=true)
SupervisedTask @ 2…07
WIP: In the near future users will be able to override the default interpretation of the data.
To list models matching a task:
models(task)
Dict{String,Any} with 4 entries:
"MLJ" => Any["ConstantClassifier"]
"DecisionTree" => Any["DecisionTreeClassifier"]
"NaiveBayes" => Any["GaussianNBClassifier"]
"XGBoost" => Any["XGBoostClassifier"]
Row selection for a task:
nrows(task)
6
task[1:2].y
2-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
Shuffle the rows of a task:
task = load_iris()
using Random
rng = MersenneTwister(1234)
shuffle!(rng, task) # rng is optional
task[1:4].y
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"virginica"
"virginica"
"setosa"
"virginica"
Binding a model to a task and evaluating performance:
@load DecisionTreeClassifier
mach = machine(DecisionTreeClassifier(target_type=String), task)
evaluate!(mach, operation=predict_mode, resampling=Holdout(), measure=misclassification_rate, verbosity=0)
0.08888888888888889
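Other resampling strategies can be substituted for Holdout(); for example, assuming a CV cross-validation strategy with an nfolds keyword is available, as in current MLJ:
evaluate!(mach, operation=predict_mode, resampling=CV(nfolds=3), measure=misclassification_rate, verbosity=0)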
API Reference
MLJBase.UnsupervisedTask — Type.
task = UnsupervisedTask(data=nothing, ignore=Symbol[], verbosity=1)
Construct an unsupervised learning task with given input data, which should be a table or, in the case of univariate inputs, a single vector. Rows of data must correspond to patterns and columns to features. Columns in data whose names appear in ignore are ignored.
X = task()
Returns the input data in a form to be used in models.
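A minimal usage sketch, reusing the coltable constructed earlier in this section (the choice of ignored column is illustrative):
task = UnsupervisedTask(data=coltable, ignore=[:overall_health])
X = task()   # the remaining columns, as a table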
MLJBase.SupervisedTask — Type.
task = SupervisedTask(data=nothing, is_probabilistic=false, target=nothing, ignore=Symbol[], verbosity=1)
Construct a supervised learning task with input features X and target y, where: y is the column vector from data named target, if this is a single symbol, or a vector of tuples, if target is a vector; X consists of all remaining columns of data not named in ignore, and is a table unless it has only one column, in which case it is a vector.
task = SupervisedTask(X, y; is_probabilistic=false, input_is_multivariate=true, verbosity=1)
A more customizable constructor, this returns a supervised learning task with input features X and target y, where: X must be a table or a vector, according to whether it is multivariate or univariate, while y must be a vector whose elements are scalars, or tuples of scalars (of constant length for ordinary multivariate predictions, and of variable length for sequence prediction). Table rows must correspond to patterns and columns to features.
X, y = task()
Returns the input X and target y of the task, also available as task.X and task.y.
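A minimal sketch of this second constructor, again reusing columns of the coltable defined earlier (the choice of features is illustrative):
X = (height = coltable.height, weight = coltable.weight, age = coltable.age)
y = coltable.overall_health
task = SupervisedTask(X, y; is_probabilistic=true)
X, y = task()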
MLJ.models — Method.
models(; show_dotted=false)
List all models as a dictionary indexed on package name. Models available for immediate use appear under the key "MLJ".
By declaring show_dotted=true, models not in the top level of the current namespace - which require dots to call, such as MLJ.DeterministicConstantModel - are also included.
models(task; show_dotted=false)
List all models matching the specified task.
See also: localmodels
MLJ.localmodels — Method.
localmodels()
List all models available for immediate use. Equivalent to models()["MLJ"].
localmodels(task)
List all such models additionally matching the specified task. Equivalent to models(task)["MLJ"].