Data interpretation: Scientific Types
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions. If you have questions or suggestions about this tutorial, please open an issue here.
When analysing data, it is important to distinguish between
how the data is encoded (e.g. Int), and
how the data should be interpreted (e.g. a class label, a count, ...).
How the data is encoded will be referred to as the machine type, whereas how the data should be interpreted will be referred to as the scientific type (or scitype).
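As a quick illustration of this distinction, the sketch below compares the machine type and the default scitype of two small vectors. It assumes ScientificTypes.jl is installed; the variable names are made up for the example.

```julia
using ScientificTypes

ints   = [1, 2, 3]
floats = [1.5, 2.0, 3.7]

# Machine type: how the elements are stored.
eltype(ints)       # Int64 on a 64-bit machine
eltype(floats)     # Float64

# Scientific type: how the default convention interprets the elements.
elscitype(ints)    # Count
elscitype(floats)  # Continuous
```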
In some cases this is unambiguous: a vector of floating point values, for instance, should usually be interpreted as a continuous feature (e.g. weights, speeds, temperatures, ...).
In many other cases, however, there may be ambiguities; we list a few examples below:
A vector of Int, e.g. [1, 2, ...], which should be interpreted as categorical labels,
A vector of Int, e.g. [1, 2, ...], which should be interpreted as count data,
A vector of String, e.g. ["High", "Low", "High", ...], which should be interpreted as ordered categorical labels,
A vector of String, e.g. ["John", "Maria", ...], which should not be interpreted as informative data,
A vector of floating point values, e.g. [1.5, 1.5, -2.3, -2.3], which should be interpreted as categorical data (e.g. the few possible values of some setting), etc.
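The ambiguous cases listed above can be sketched with the coerce function from ScientificTypes.jl, which is assumed to be installed. The vectors and variable names below are illustrative only.

```julia
using ScientificTypes

# The same integers can be read as unordered labels or as counts.
labels     = [1, 2, 2, 1, 3]
as_classes = coerce(labels, Multiclass)
as_counts  = coerce(labels, Count)        # Count is already the default for integers

# Strings read as ordered categories. By default the levels are ordered
# alphabetically, so a meaningful order may need to be set separately.
grades     = ["High", "Low", "High", "Low"]
as_ordered = coerce(grades, OrderedFactor)

# Floating point values read as a handful of discrete settings.
settings    = [1.5, 1.5, -2.3, -2.3]
as_settings = coerce(settings, Multiclass)

elscitype(as_classes)   # Multiclass{3}
elscitype(as_counts)    # Count
elscitype(as_ordered)   # OrderedFactor{2}
elscitype(as_settings)  # Multiclass{2}
```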
The package ScientificTypesBase.jl defines a barebone type hierarchy which can be used to indicate how a particular feature should be interpreted; in particular:
Found
├─ Known
│  ├─ Textual
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  └─ Infinite
│     ├─ Continuous
│     └─ Count
└─ Unknown
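The hierarchy can be checked directly with Julia's subtype operator. This is a minimal sketch, assuming that the type names shown in the tree are all exported when ScientificTypes.jl is loaded.

```julia
using ScientificTypes

# Each relation mirrors one edge (or path) of the tree above.
Multiclass    <: Finite     # true
OrderedFactor <: Finite     # true
Continuous    <: Infinite   # true
Count         <: Infinite   # true
Finite        <: Known      # true
Textual       <: Known      # true
Unknown       <: Found      # true

# The scitype of a value sits somewhere in this tree.
scitype(3)   <: Infinite    # true, since scitype(3) is Count
scitype(3.0) <: Infinite    # true, since scitype(3.0) is Continuous
```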
A scientific type convention is a specific implementation indicating how machine types can be related to scientific types. It may also provide helper functions to convert data to a given scitype.
The convention used in MLJ is implemented in ScientificTypes.jl. This is what we will use throughout; you never need to use ScientificTypesBase.jl directly unless you intend to implement your own scientific type convention.
The schema function
using RDatasets
using ScientificTypes
boston = dataset("MASS", "Boston")
sch = schema(boston)
┌─────────┬────────────┬─────────┐
│ names   │ scitypes   │ types   │
├─────────┼────────────┼─────────┤
│ Crim    │ Continuous │ Float64 │
│ Zn      │ Continuous │ Float64 │
│ Indus   │ Continuous │ Float64 │
│ Chas    │ Count      │ Int64   │
│ NOx     │ Continuous │ Float64 │
│ Rm      │ Continuous │ Float64 │
│ Age     │ Continuous │ Float64 │
│ Dis     │ Continuous │ Float64 │
│ Rad     │ Count      │ Int64   │
│ Tax     │ Count      │ Int64   │
│ PTRatio │ Continuous │ Float64 │
│ Black   │ Continuous │ Float64 │
│ LStat   │ Continuous │ Float64 │
│ MedV    │ Continuous │ Float64 │
└─────────┴────────────┴─────────┘
In this case, most of the variables have a (machine) type Float64 and their default interpretation is Continuous. Three variables, :Chas, :Rad and :Tax, have a (machine) type Int64 and their default interpretation is Count.
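The object returned by schema can also be queried programmatically. The sketch below uses a small made-up column table (a named tuple of vectors) rather than the Boston data, and assumes the names, scitypes and types properties of the schema object.

```julia
using ScientificTypes

# Hypothetical table with one integer, one float and one string column.
X = (age = [23, 45, 31], height = [1.72, 1.80, 1.65], name = ["Ann", "Bob", "Cy"])

sch = schema(X)
sch.names      # column names
sch.scitypes   # default interpretation of each column
sch.types      # machine type of each column

# Names of the columns whose default interpretation is Count.
count_cols = [n for (n, s) in zip(sch.names, sch.scitypes) if s <: Count]
```

Listing the Count columns this way is a convenient first step before deciding which of them deserve a different interpretation.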
While the interpretation as Continuous is usually fine, the interpretation as Count needs a bit more attention. For instance, note that:
unique(boston.Chas)
2-element Vector{Int64}:
0
1
so even though it has a machine type of Int64, and consequently a default interpretation of Count, it would be more appropriate to interpret it as an OrderedFactor, since it only takes the values 0 and 1.
To re-specify the scitype(s) of feature(s) in a dataset, you can use the coerce function and specify pairs of variable name and scientific type:
boston2 = coerce(boston, :Chas => OrderedFactor);
The effect of this is to convert the :Chas column to an ordered categorical vector:
eltype(boston2.Chas)
CategoricalArrays.CategoricalValue{Int64, UInt32}
corresponding to the OrderedFactor scitype:
elscitype(boston2.Chas)
ScientificTypesBase.OrderedFactor{2}
You can also specify multiple pairs in a single call to coerce:
boston3 = coerce(boston, :Chas => OrderedFactor, :Rad => OrderedFactor);
If a feature in your dataset has String elements, then the default scitype is Textual; you can either drop such columns or coerce them to categorical:
feature = ["AA", "BB", "AA", "AA", "BB"]
elscitype(feature)
ScientificTypesBase.Textual
which you can coerce:
feature2 = coerce(feature, Multiclass)
elscitype(feature2)
ScientificTypesBase.Multiclass{2}
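Both options (coercing or dropping) can be sketched on a small table. The table below is made up, and dropping a column by rebuilding the named tuple is just one possible approach; with a DataFrame you would typically use its own column selection tools instead.

```julia
using ScientificTypes

# Hypothetical table: `id` is uninformative, `group` is a genuine category.
X = (id = ["John", "Maria", "Ali"], group = ["AA", "BB", "AA"], score = [0.3, 1.2, 0.7])

# Find the columns whose default interpretation is Textual.
sch = schema(X)
textual_cols = [n for (n, s) in zip(sch.names, sch.scitypes) if s <: Textual]

# Option 1: coerce an informative string column to a categorical scitype.
X_coerced = coerce(X, :group => Multiclass)
elscitype(X_coerced.group)   # Multiclass{2}

# Option 2: drop an uninformative string column by keeping all other names.
keep = Tuple(n for n in keys(X) if n != :id)
X_dropped = NamedTuple{keep}(X)
```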
In some cases you will want to reinterpret all features currently interpreted as some scitype S1 as some other scitype S2. For example, some features may currently be interpreted as Count because their original type was Int, while you want to treat all of them as Continuous:
data = select(boston, [:Rad, :Tax])
schema(data)
┌───────┬──────────┬───────┐
│ names │ scitypes │ types │
├───────┼──────────┼───────┤
│ Rad   │ Count    │ Int64 │
│ Tax   │ Count    │ Int64 │
└───────┴──────────┴───────┘
Let's coerce from Count to Continuous:
data2 = coerce(data, Count => Continuous)
schema(data2)
┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ Rad   │ Continuous │ Float64 │
│ Tax   │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
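A scitype-to-scitype coercion only touches the columns that currently have the source scitype. The following sketch, on a made-up table with mixed columns, illustrates that the other columns are left as they are.

```julia
using ScientificTypes

# Hypothetical table with a Count, a Continuous and a Textual column.
X = (n = [1, 2, 3], x = [0.1, 0.2, 0.3], s = ["a", "b", "a"])

X2 = coerce(X, Count => Continuous)

eltype(X2.n)       # Float64: the integer column was converted
elscitype(X2.x)    # Continuous: unchanged
elscitype(X2.s)    # Textual: unchanged
```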
A last useful tool is autotype, which allows you to specify rules to define the interpretation of features automatically. You can code your own rules, but there are three useful ones that are pre-coded:
the :few_to_finite rule, which checks how many unique entries are present in a vector and, if there are "few", suggests a categorical type,
the :discrete_to_continuous rule, which converts Integer or Count to Continuous,
the :string_to_multiclass rule, which returns Multiclass for any string-like column.
For instance:
boston3 = coerce(boston, autotype(boston, :few_to_finite))
schema(boston3)
┌─────────┬───────────────────┬───────────────────────────────────┐
│ names   │ scitypes          │ types                             │
├─────────┼───────────────────┼───────────────────────────────────┤
│ Crim    │ Continuous        │ Float64                           │
│ Zn      │ OrderedFactor{26} │ CategoricalValue{Float64, UInt32} │
│ Indus   │ Continuous        │ Float64                           │
│ Chas    │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ NOx     │ Continuous        │ Float64                           │
│ Rm      │ Continuous        │ Float64                           │
│ Age     │ Continuous        │ Float64                           │
│ Dis     │ Continuous        │ Float64                           │
│ Rad     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Tax     │ Count             │ Int64                             │
│ PTRatio │ OrderedFactor{46} │ CategoricalValue{Float64, UInt32} │
│ Black   │ Continuous        │ Float64                           │
│ LStat   │ Continuous        │ Float64                           │
│ MedV    │ Continuous        │ Float64                           │
└─────────┴───────────────────┴───────────────────────────────────┘
You can also specify multiple rules; see the docs for more information.
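As a rough sketch of combining rules, the example below passes two of the pre-coded rules through the rules keyword of autotype. The table is made up, and the exact form of the returned suggestions (a dictionary from column name to suggested scitype, listing only the columns that would change) is an assumption based on the default behaviour.

```julia
using ScientificTypes

# Hypothetical table with an integer, a string and a float column.
X = (n = [1, 2, 3, 4, 5, 6],
     s = ["a", "b", "a", "c", "b", "a"],
     x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5])

# Apply two rules, one after the other.
suggestions = autotype(X; rules = (:discrete_to_continuous, :string_to_multiclass))

# The suggestions can be passed straight to coerce.
X2 = coerce(X, suggestions)
schema(X2)
```

Inspecting the suggestions before calling coerce is a cheap way to catch a rule that reinterprets a column you wanted to leave alone.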