Data interpretation: Scientific Types
When analysing data, it is important to distinguish between how the data is encoded (e.g. Int) and how the data should be interpreted (e.g. a class label, a count, ...).
How the data is encoded will be referred to as the machine type whereas how the data should be interpreted will be referred to as the scientific type (or scitype).
In some cases the interpretation is unambiguous: a vector of floating point values, for instance, should usually be interpreted as a continuous feature (e.g. weights, speeds, temperatures, ...).
In many other cases, however, there may be ambiguities; we list a few examples below:
- A vector of Int, e.g. [1, 2, ...], which should be interpreted as categorical labels,
- A vector of Int, e.g. [1, 2, ...], which should be interpreted as count data,
- A vector of String, e.g. ["High", "Low", "High", ...], which should be interpreted as ordered categorical labels,
- A vector of String, e.g. ["John", "Maria", ...], which should not be interpreted as informative data,
- A vector of floating points, e.g. [1.5, 1.5, -2.3, -2.3], which should be interpreted as categorical data (e.g. the few possible values of some setting), etc.
The package ScientificTypesBase.jl defines a bare-bones type hierarchy which can be used to indicate how a particular feature should be interpreted; in particular:
Found
├─ Known
│  ├─ Textual
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  └─ Infinite
│     ├─ Continuous
│     └─ Count
└─ Unknown
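The hierarchy can be checked directly with Julia's subtype operator. This is a minimal sketch, assuming all of these type names are available after `using ScientificTypes` (which, as far as we know, re-exports them):

```julia
using ScientificTypes

# Each line checks one parent/child edge of the tree above.
@show Known <: Found
@show Unknown <: Found
@show Textual <: Known
@show Finite <: Known
@show Infinite <: Known
@show Multiclass <: Finite
@show OrderedFactor <: Finite
@show Continuous <: Infinite
@show Count <: Infinite
```

Each check should print `true` if the names are exported as assumed.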
A scientific type convention is a specific implementation indicating how machine types can be related to scientific types. It may also provide helper functions to convert data to a given scitype.
The convention used in MLJ is implemented in ScientificTypes.jl. This is what we will use throughout; you never need to use ScientificTypesBase.jl directly unless you intend to implement your own scientific type convention.
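To make the distinction between machine type and scitype concrete, here is a small sketch using the default convention. The three vectors are made-up illustrations, not data from this tutorial:

```julia
using ScientificTypes

ints   = [1, 2, 3]
floats = [1.5, 1.5, -2.3, -2.3]
people = ["John", "Maria"]

# Machine type: how the elements are stored.
eltype(ints)         # Int64 on a 64-bit machine
eltype(floats)       # Float64

# Scientific type: how the default convention interprets them.
elscitype(ints)      # Count
elscitype(floats)    # Continuous
elscitype(people)    # Textual
```

Note that the convention only provides a default: the floating point vector above is read as Continuous even though, as discussed earlier, it might be better treated as categorical.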
The schema function
using RDatasets
using ScientificTypes
boston = dataset("MASS", "Boston")
sch = schema(boston)
┌─────────┬────────────┬─────────┐
│ names   │ scitypes   │ types   │
├─────────┼────────────┼─────────┤
│ Crim    │ Continuous │ Float64 │
│ Zn      │ Continuous │ Float64 │
│ Indus   │ Continuous │ Float64 │
│ Chas    │ Count      │ Int64   │
│ NOx     │ Continuous │ Float64 │
│ Rm      │ Continuous │ Float64 │
│ Age     │ Continuous │ Float64 │
│ Dis     │ Continuous │ Float64 │
│ Rad     │ Count      │ Int64   │
│ Tax     │ Count      │ Int64   │
│ PTRatio │ Continuous │ Float64 │
│ Black   │ Continuous │ Float64 │
│ LStat   │ Continuous │ Float64 │
│ MedV    │ Continuous │ Float64 │
└─────────┴────────────┴─────────┘
In this case, most of the variables have a (machine) type Float64 and their default interpretation is Continuous. The columns :Chas, :Rad and :Tax have a (machine) type Int64 and their default interpretation is Count.
While the interpretation as Continuous is usually fine, the interpretation as Count needs a bit more attention. For instance note that:
unique(boston.Chas)
2-element Vector{Int64}:
 0
 1
so even though it's got a machine type of Int64 and consequently a default interpretation of Count, it would be more appropriate to interpret it as an OrderedFactor.
In order to re-specify the scitype(s) of feature(s) in a dataset, you can use the coerce function and specify pairs of variable name and scientific type:
boston2 = coerce(boston, :Chas => OrderedFactor);
The effect of this is to convert the :Chas column to an ordered categorical vector:
eltype(boston2.Chas)
CategoricalArrays.CategoricalValue{Int64, UInt32}
corresponding to the OrderedFactor scitype:
elscitype(boston2.Chas)
ScientificTypesBase.OrderedFactor{2}
You can also specify multiple pairs in one shot with coerce:
boston3 = coerce(boston, :Chas => OrderedFactor, :Rad => OrderedFactor);
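The same pattern works on any Tables.jl-compatible table. As a minimal sketch that does not need RDatasets, the toy table below is a named tuple of vectors with hypothetical column names:

```julia
using ScientificTypes

# Hypothetical toy table: a NamedTuple of equal-length vectors.
X = (flag   = [0, 1, 1, 0],
     grade  = [1, 3, 2, 3],
     height = [1.7, 1.8, 1.6, 1.9])

# Re-specify two columns at once; `height` is left untouched.
Xc = coerce(X, :flag => Multiclass, :grade => OrderedFactor)

schema(Xc).scitypes   # (Multiclass{2}, OrderedFactor{3}, Continuous)
```

The type parameter in `Multiclass{2}` and `OrderedFactor{3}` is the number of distinct levels found in the column.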
If a feature in your dataset has String elements, then the default scitype is Textual; you can either choose to drop such columns or to coerce them to categorical:
feature = ["AA", "BB", "AA", "AA", "BB"]
elscitype(feature)
ScientificTypesBase.Textual
which you can coerce:
feature2 = coerce(feature, Multiclass)
elscitype(feature2)
ScientificTypesBase.Multiclass{2}
In some cases you will want to reinterpret all features currently interpreted as some scitype S1 into some other scitype S2. An example is if some features are currently interpreted as Count because their original type was Int but you want to consider all such as Continuous:
data = select(boston, [:Rad, :Tax])
schema(data)
┌───────┬──────────┬───────┐
│ names │ scitypes │ types │
├───────┼──────────┼───────┤
│ Rad   │ Count    │ Int64 │
│ Tax   │ Count    │ Int64 │
└───────┴──────────┴───────┘
Let's coerce from Count to Continuous:
data2 = coerce(data, Count => Continuous)
schema(data2)
┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ Rad   │ Continuous │ Float64 │
│ Tax   │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
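A scitype-to-scitype coercion only touches the columns that currently have the source scitype. The following sketch illustrates this on a hypothetical toy table with two integer columns and one floating point column:

```julia
using ScientificTypes

# Hypothetical toy table: two Count columns and one Continuous column.
X = (n = [1, 2, 3],
     k = [10, 20, 30],
     x = [0.1, 0.2, 0.3])

# Every column whose scitype is Count becomes Continuous.
Xc = coerce(X, Count => Continuous)

schema(Xc).scitypes   # (Continuous, Continuous, Continuous)
eltype(Xc.n)          # Float64: the machine type changes as well
```

As in the schema above, the coercion also changes the machine type, here from Int64 to Float64.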
A last useful tool is autotype, which allows you to specify rules to define the interpretation of features automatically. You can code your own rules, but there are three useful ones that are pre-coded:
- the :few_to_finite rule, which checks how many unique entries are present in a vector and, if there are "few", suggests a categorical type,
- the :discrete_to_continuous rule, which converts Integer or Count to Continuous,
- the :string_to_multiclass rule, which returns Multiclass for any string-like column.
For instance:
boston3 = coerce(boston, autotype(boston, :few_to_finite))
schema(boston3)
┌─────────┬───────────────────┬───────────────────────────────────┐
│ names   │ scitypes          │ types                             │
├─────────┼───────────────────┼───────────────────────────────────┤
│ Crim    │ Continuous        │ Float64                           │
│ Zn      │ OrderedFactor{26} │ CategoricalValue{Float64, UInt32} │
│ Indus   │ Continuous        │ Float64                           │
│ Chas    │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ NOx     │ Continuous        │ Float64                           │
│ Rm      │ Continuous        │ Float64                           │
│ Age     │ Continuous        │ Float64                           │
│ Dis     │ Continuous        │ Float64                           │
│ Rad     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Tax     │ Count             │ Int64                             │
│ PTRatio │ OrderedFactor{46} │ CategoricalValue{Float64, UInt32} │
│ Black   │ Continuous        │ Float64                           │
│ LStat   │ Continuous        │ Float64                           │
│ MedV    │ Continuous        │ Float64                           │
└─────────┴───────────────────┴───────────────────────────────────┘
You can also specify multiple rules; see the docs for more information.
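As a rough sketch of combining rules, the example below passes two of the pre-coded rules through the `rules` keyword of `autotype`. The toy table and its column names are hypothetical, and the keyword form is our reading of the ScientificTypes.jl documentation, so check it against the version you have installed:

```julia
using ScientificTypes

# Hypothetical toy table with an integer, a string and a float column.
X = (n = [1, 2, 3, 4],
     s = ["a", "b", "a", "b"],
     x = [0.5, 1.5, 2.5, 3.5])

# Rules are applied in the order given; the result maps column names
# to suggested scitypes.
suggested = autotype(X; rules = (:discrete_to_continuous, :string_to_multiclass))

# Apply the suggestions and inspect the resulting schema.
Xc = coerce(X, suggested)
schema(Xc)
```

Under these assumptions, the integer column should come out as Continuous and the string column as Multiclass, while the floating point column is left as it is.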