Data interpretation: Scientific Types

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

Machine type vs Scientific Type
Tips and tricks
1. Type to Type coercion
2. Autotype

The package ScientificTypesBase.jl defines a barebone type hierarchy which can be used to indicate how a particular feature should be interpreted; in particular:

Found
├─ Known
│  ├─ Textual
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  └─ Infinite
│     ├─ Continuous
│     └─ Count
└─ Unknown

A scientific type convention is a specific implementation indicating how machine types can be related to scientific types. It may also provide helper functions to convert data to a given scitype.

The convention used in MLJ is implemented in ScientificTypes.jl. This is what we will use throughout; you never need to use ScientificTypesBase.jl directly unless you intend to implement your own scientific type convention.
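As a quick check of the hierarchy above, the sketch below queries a few subtype relations and the default scitype of some scalar values. It assumes the default convention shipped with ScientificTypes.jl and that the hierarchy types are re-exported by that package; the values passed to scitype are arbitrary.

```julia
using ScientificTypes

# Subtype relations taken from the hierarchy shown above
is_count_infinite      = Count <: Infinite
is_continuous_infinite = Continuous <: Infinite
is_multiclass_finite   = Multiclass{2} <: Finite

# Default interpretation of a few machine types
s_int    = scitype(1)       # an integer is read as Count
s_float  = scitype(1.0)     # a float is read as Continuous
s_string = scitype("abc")   # a string is read as Textual

@show is_count_infinite is_continuous_infinite is_multiclass_finite
@show s_int s_float s_string
```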


The schema function

using RDatasets
using ScientificTypes

boston = dataset("MASS", "Boston")
sch = schema(boston)

┌─────────┬────────────┬─────────┐
│ names   │ scitypes   │ types   │
├─────────┼────────────┼─────────┤
│ Crim    │ Continuous │ Float64 │
│ Zn      │ Continuous │ Float64 │
│ Indus   │ Continuous │ Float64 │
│ Chas    │ Count      │ Int64   │
│ NOx     │ Continuous │ Float64 │
│ Rm      │ Continuous │ Float64 │
│ Age     │ Continuous │ Float64 │
│ Dis     │ Continuous │ Float64 │
│ Rad     │ Count      │ Int64   │
│ Tax     │ Count      │ Int64   │
│ PTRatio │ Continuous │ Float64 │
│ Black   │ Continuous │ Float64 │
│ LStat   │ Continuous │ Float64 │
│ MedV    │ Continuous │ Float64 │
└─────────┴────────────┴─────────┘

In this case, most of the variables have the (machine) type Float64 and their default interpretation is Continuous. The variables :Chas, :Rad and :Tax have the (machine) type Int64 and their default interpretation is Count.

While the interpretation as Continuous is usually fine, the interpretation as Count needs a bit more attention. For instance, note that:
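The object returned by schema can also be queried programmatically, which helps when a table has many columns. This is a minimal sketch, assuming the names, scitypes and types properties exposed by recent versions of ScientificTypes.jl; the table toy is made up for illustration and is not part of the Boston data.

```julia
using ScientificTypes

# A small column table (a NamedTuple of vectors), invented for this example
toy = (a = [1, 2, 3], b = [1.0, 2.0, 3.0], c = ["x", "y", "z"])
toy_sch = schema(toy)

# Collect the names of the columns whose default interpretation is Count
count_cols = [name for (name, st) in zip(toy_sch.names, toy_sch.scitypes) if st == Count]

@show toy_sch.names toy_sch.scitypes toy_sch.types
@show count_cols
```

Columns found this way are the ones worth inspecting by hand before deciding whether Count is the right interpretation.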

unique(boston.Chas)

2-element Vector{Int64}:
 0
 1

so even though it's got a machine type of Int64 and consequently a default interpretation of Count, it would be more appropriate to interpret it as an OrderedFactor.


In order to re-specify the scitype(s) of feature(s) in a dataset, you can use the coerce function and specify pairs of variable name and scientific type:

boston2 = coerce(boston, :Chas => OrderedFactor);

the effect of this is to convert the :Chas column to an ordered categorical vector:

eltype(boston2.Chas)

CategoricalArrays.CategoricalValue{Int64, UInt32}

corresponding to the OrderedFactor scitype:

elscitype(boston2.Chas)

ScientificTypesBase.OrderedFactor{2}

You can also specify multiple pairs in one shot with coerce:

boston3 = coerce(boston, :Chas => OrderedFactor, :Rad => OrderedFactor);
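The same pattern can be tried without downloading a dataset. The sketch below uses a made-up column table toy (its column names and values are assumptions for illustration only) and checks that coerce returns a new table while leaving the original untouched.

```julia
using ScientificTypes

# Invented data: two integer columns and one float column
toy = (chas = [0, 1, 0, 1], rad = [1, 4, 24, 4], crim = [0.1, 0.2, 0.3, 0.4])

# Reinterpret both integer columns as ordered factors in one call
toy2 = coerce(toy, :chas => OrderedFactor, :rad => OrderedFactor)

before = schema(toy).scitypes    # scitypes of the original table
after  = schema(toy2).scitypes   # scitypes of the coerced copy

@show before after
```

The type parameter of OrderedFactor in the result records the number of distinct levels found in each column.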


If a feature in your dataset has String elements, then the default scitype is Textual; you can either choose to drop such columns or to coerce them to categorical:

feature = ["AA", "BB", "AA", "AA", "BB"]
elscitype(feature)

ScientificTypesBase.Textual

which you can coerce:

feature2 = coerce(feature, Multiclass)
elscitype(feature2)

ScientificTypesBase.Multiclass{2}


In some cases you will want to reinterpret all features currently interpreted as some scitype S1 as some other scitype S2. For example, some features may be interpreted as Count only because their original type was Int, and you may want to treat all of them as Continuous:

data = select(boston, [:Rad, :Tax])
schema(data)

┌───────┬──────────┬───────┐
│ names │ scitypes │ types │
├───────┼──────────┼───────┤
│ Rad   │ Count    │ Int64 │
│ Tax   │ Count    │ Int64 │
└───────┴──────────┴───────┘

let's coerce from Count to Continuous:

data2 = coerce(data, Count => Continuous)
schema(data2)

┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ Rad   │ Continuous │ Float64 │
│ Tax   │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
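Type-to-type coercion is not limited to Count and Continuous. As a hedged sketch on a made-up table (toy and its columns are assumptions for illustration), the same call form can turn every Textual column into Multiclass; the two coercions are applied one after the other here to keep each step easy to inspect.

```julia
using ScientificTypes

# Invented data: an integer column, a string column and a float column
toy = (n = [1, 2, 3], label = ["a", "b", "a"], x = [0.5, 1.5, 2.5])

# Step 1: every Count column becomes Continuous
toy_cont = coerce(toy, Count => Continuous)

# Step 2: every Textual column becomes Multiclass
toy_final = coerce(toy_cont, Textual => Multiclass)

final_scitypes = schema(toy_final).scitypes
@show final_scitypes
```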


A last useful tool is autotype, which lets you specify rules that determine the interpretation of features automatically. You can code your own rules, but three useful ones are pre-coded:

- the :few_to_finite rule, which checks how many unique entries are present in a vector and, if there are "few", suggests a categorical type,
- the :discrete_to_continuous rule, which converts Integer or Count to Continuous,
- the :string_to_multiclass rule, which returns Multiclass for any string-like column.

For instance:

boston3 = coerce(boston, autotype(boston, :few_to_finite))
schema(boston3)

┌─────────┬───────────────────┬───────────────────────────────────┐
│ names   │ scitypes          │ types                             │
├─────────┼───────────────────┼───────────────────────────────────┤
│ Crim    │ Continuous        │ Float64                           │
│ Zn      │ OrderedFactor{26} │ CategoricalValue{Float64, UInt32} │
│ Indus   │ Continuous        │ Float64                           │
│ Chas    │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ NOx     │ Continuous        │ Float64                           │
│ Rm      │ Continuous        │ Float64                           │
│ Age     │ Continuous        │ Float64                           │
│ Dis     │ Continuous        │ Float64                           │
│ Rad     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Tax     │ Count             │ Int64                             │
│ PTRatio │ OrderedFactor{46} │ CategoricalValue{Float64, UInt32} │
│ Black   │ Continuous        │ Float64                           │
│ LStat   │ Continuous        │ Float64                           │
│ MedV    │ Continuous        │ Float64                           │
└─────────┴───────────────────┴───────────────────────────────────┘

You can also specify multiple rules; see the docs for more information.
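As a minimal sketch of combining rules, the example below passes two of the pre-coded rules through the rules keyword of autotype. It assumes that autotype accepts a tuple of rule names this way and returns a dictionary of suggested scitypes for the columns it would change; the table toy is made up for illustration.

```julia
using ScientificTypes

# Invented data: an integer column, a string column and a float column
toy = (n = [1, 2, 3, 4], label = ["a", "b", "a", "b"], x = [0.1, 0.2, 0.3, 0.4])

# Ask for suggestions using two rules at once
suggestions = autotype(toy; rules = (:discrete_to_continuous, :string_to_multiclass))

# Apply the suggestions and inspect the result
toy_auto = coerce(toy, suggestions)
auto_scitypes = schema(toy_auto).scitypes

@show suggestions
@show auto_scitypes
```

Inspecting suggestions before calling coerce is a cheap way to check that the rules did what you expected.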
