ScientificTypes.jl

A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.

The package distinguishes between the machine type of data (the Julia type the data is encoded as) and its scientific type (how the data should be interpreted).

As a motivating example, the data might contain a column corresponding to a number of transactions. The machine type in that case could be an Int, whereas the scientific type would be a Count.

The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a Float64 whereas the meaning should still be a Count.

Features

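The package ScientificTypes provides the scitype, coerce and autotype methods described below, together with the following hierarchy of scientific types: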

Found
├─ Known
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  ├─ Infinite
│  │  ├─ Continuous
│  │  └─ Count
│  ├─ Image
│  │  ├─ ColorImage
│  │  └─ GrayImage
│  └─ Table
└─ Unknown

Getting started

The package is registered and can be installed via the package manager with add ScientificTypes.

To get the scientific type of a Julia object according to the convention in use, call scitype:

scitype(3.14)
Continuous

For a vector, you can use scitype or scitype_union (which gives the union of the scitypes of the elements):

scitype([1,2,3,missing])
AbstractArray{Union{Missing, Count},1}
scitype_union([1,2,3,missing])
Union{Missing, Count}

Type coercion work-flow for tabular data

The standard workflow involves the following two steps:

  1. inspect the schema of the data and the scitypes in particular
  2. provide pairs or a dictionary with column names and scitypes for any changes you may want and coerce the data to those scitypes

using ScientificTypes, DataFrames, Tables
X = DataFrame(
     name=["Siri", "Robo", "Alexa", "Cortana"],
     height=[152, missing, 148, 163],
     rating=[1, 5, 2, 1])
schema(X)
_.table = 
┌─────────┬───────────────────────┬───────────────────────┐
│ _.names │ _.types               │ _.scitypes            │
├─────────┼───────────────────────┼───────────────────────┤
│ name    │ String                │ Textual               │
│ height  │ Union{Missing, Int64} │ Union{Missing, Count} │
│ rating  │ Int64                 │ Count                 │
└─────────┴───────────────────────┴───────────────────────┘
_.nrows = 4

To inspect only the scitypes:

schema(X).scitypes
(Textual, Union{Missing, Count}, Count)

In this case, however, you may want to map the names to Multiclass, the height to Continuous and the ratings to OrderedFactor. To do so:

Xfixed = coerce(X, :name=>Multiclass,
                   :height=>Continuous,
                   :rating=>OrderedFactor)
schema(Xfixed).scitypes
(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})

Note that, because it encountered missing values in height, coerce set the scitype of that column to Union{Missing,Continuous}.

One can also make a replacement based on an existing scientific type, instead of a feature name:

X  = (x = [1, 2, 3],
      y = rand(3),
      z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous)
schema(Xfixed).scitypes
(Continuous, Continuous, Continuous)

Finally, there is a coerce! method that performs in-place coercion, provided the data structure allows it (at the moment only DataFrames.DataFrame is supported).
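As a minimal sketch of in-place coercion, the example below assumes that DataFrames is loaded and that coerce! accepts the same column => scitype pairs as coerce. That second point is an assumption drawn from the description above, not from a signature shown in this document, and the data is made up.

```julia
using ScientificTypes, DataFrames

# Made-up data for illustration
df = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
               rating=[1, 5, 2, 1])

# Mutates df rather than returning a modified copy
# (assumes coerce! takes the same pairs as coerce).
coerce!(df, :name => Multiclass, :rating => OrderedFactor)

# Inspect the result
schema(df).scitypes
```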

Notes

Special note on binary data

ScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an OrderedFactor{2} scitype, while data with no such class (e.g., gender) should be assigned a Multiclass{2} scitype. In the former case we recommend that the "true" class come after "false" in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, Finite{2} covers both cases of binary data.
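The recommendation above might be applied as in the following sketch. The pass/fail data is made up, and levels and levels! are taken from CategoricalArrays rather than from this package.

```julia
using ScientificTypes, CategoricalArrays

# Made-up product test results; "pass" plays the role of the "true" class
outcome = ["pass", "fail", "pass", "pass"]

# Binary data with an intrinsic "true" class: use OrderedFactor
y = coerce(outcome, OrderedFactor)

# Inspect the ordering; the "true" class should come last
levels(y)

# If it does not, reorder explicitly (levels! is from CategoricalArrays)
levels!(y, ["fail", "pass"])

scitype(y[1])
```

For binary data without such a class, the same call with Multiclass in place of OrderedFactor would apply.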

Detailed usage examples

using ScientificTypes
# activate a convention
ScientificTypes.set_convention(MLJ) # redundant as it's the default

scitype((2.718, 42))
Tuple{Continuous,Count}

Let's try with categorical valued objects:

using CategoricalArrays
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
OrderedFactor{3}

and

scitype_union(v)
Union{Missing, OrderedFactor{3}}

You could coerce this to Multiclass:

w = coerce(v, Multiclass)
scitype_union(w)
Union{Missing, Multiclass{3}}

Working with tables

using Tables
data = (x1=rand(10), x2=rand(10), x3=collect(1:10))
scitype(data)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}}}

You can also use schema:

schema(data)
_.table = 
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Int64   │ Count      │
└─────────┴─────────┴────────────┘
_.nrows = 10

The <: operator can be used for type checks:

scitype(data) <: Table(Continuous)
false
scitype(data) <: Table(Infinite)
true

You can also specify multiple types directly:

data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
true

The scientific type of tuples, arrays and tables

Under any convention, the scitype of a tuple is a Tuple type parameterized by scientific types:

scitype((1, 4.5))
Tuple{Count,Continuous}

Similarly, the scitype of an AbstractArray is AbstractArray{U} where U is the union of the element scitypes:

scitype([1.3, 4.5, missing])
AbstractArray{Union{Missing, Continuous},1}
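To make the rule concrete, here is a toy sketch in plain Julia of how a union of element scitypes can be assembled. Every name prefixed with Toy or toy_ is hypothetical and exists only for this illustration; the package's actual types and methods are defined differently.

```julia
# Toy stand-ins for illustration only; NOT the package's definitions.
abstract type ToyFound end
struct ToyContinuous <: ToyFound end
struct ToyCount <: ToyFound end

# Element-level rules mirroring the examples above
toy_scitype(::AbstractFloat) = ToyContinuous
toy_scitype(::Integer) = ToyCount
toy_scitype(::Missing) = Missing

# U: the union of the scitypes of all elements
toy_scitype_union(A) =
    reduce((S, T) -> Union{S, T}, (toy_scitype(x) for x in A); init=Union{})

# Array-level rule: wrap U in an AbstractArray type of the same dimension
toy_scitype(A::AbstractArray{<:Any,N}) where {N} =
    AbstractArray{toy_scitype_union(A), N}

toy_scitype([1.3, 4.5, missing])
```

Note that this sketch visits every element, which is exactly the cost that becomes noticeable for large arrays.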

Performance note: Computing type unions over large arrays is expensive and, depending on the convention's implementation and the array eltype, computing the scitype can be slow. (In the MLJ convention this is mitigated with the help of the ScientificTypes.Scitype method, of which other conventions could make use. Do ?ScientificTypes.Scitype for details.) An eltype Any will always be slow and you may want to consider replacing an array A with broadcast(identity, A) to collapse the eltype and speed up the computation.
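The eltype-collapsing trick mentioned in the note uses only Base Julia. A small sketch with made-up values:

```julia
A = Any[1.0, 2.5, missing]      # eltype is Any
B = broadcast(identity, A)      # same values, narrower eltype

eltype(A)   # Any
eltype(B)   # Union{Missing, Float64}
```

The values are unchanged; only the container's element type is narrowed, which gives scitype more type information to work with.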

Provided the Tables.jl package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:

using CategoricalArrays, Tables
X = (x1=rand(10),
     x2=rand(10),
     x3=categorical(rand("abc", 10)),
     x4=categorical(rand("01", 10)))
scitype(X)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{3},1}, AbstractArray{Multiclass{2},1}}}

Specifically, if X has columns c1, ..., cn, then, by definition,

scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}}

With this definition, common type checks can be performed with tables. For instance, you could check that each column of X has an element scitype that is either Continuous or Finite:

scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
true

A built-in Table constructor provides a shorthand for the right-hand side:

scitype(X) <: Table(Continuous, Finite)
true

Note that Table(Continuous,Finite) is a type union and not a Table instance.
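One possible use of such a check is to validate input at the top of a function. The helper check_features below is hypothetical (it is not part of the package) and the data is made up.

```julia
using ScientificTypes, Tables

# Hypothetical helper, not part of the package: accept only tables whose
# columns all have Continuous or Finite element scitypes.
function check_features(X)
    scitype(X) <: Table(Continuous, Finite) ||
        throw(ArgumentError("expected a table with only Continuous or Finite columns"))
    return X
end

good = (x1 = rand(5), x2 = rand(5))
bad  = (x1 = rand(5), x2 = collect(1:5))   # x2 has element scitype Count

check_features(good)    # returns good
# check_features(bad)   # would throw an ArgumentError
```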

The MLJ convention

The table below summarizes the MLJ convention for representing scientific types:

| Type T | scitype(x) for x::T | package required |
|--------|---------------------|------------------|
| Missing | Missing | |
| AbstractFloat | Continuous | |
| Integer | Count | |
| CategoricalValue | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays |
| CategoricalString | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays |
| CategoricalValue | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays |
| CategoricalString | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays |
| AbstractArray{<:Gray,2} | GrayImage{W,H} where (W, H) = size(x) | ColorTypes |
| AbstractArray{<:AbstractRGB,2} | ColorImage{W,H} where (W, H) = size(x) | ColorTypes |
| any table type T supported by Tables.jl | Table{K} where K=Union{column_scitypes...} | Tables |

Here nlevels(x) = length(levels(x.pool)).

Automatic type conversion for tabular data

The autotype function applies specific rules to guess appropriate scientific types for the data. Such rules are typically more constraining than those implied by the active convention. When autotype is used, a dictionary mapping column names to suggested types is returned; if none of the specified rules applies to a column, the ambient convention is used as a fallback.

The function is called as:

autotype(X)

If the keyword only_changes is set to true, then only the columns for which the suggested type differs from that provided by the convention are returned.

autotype(X; only_changes=true)

To specify which rules are to be applied, use the rules keyword with a tuple of symbols referring to specific rules. The default rule is :few_to_finite, which applies a heuristic to columns with relatively few distinct values; such columns are then encoded with an appropriate Finite type. Note that the order in which the rules are specified matters: rules are applied in that order.

autotype(X; rules=(:few_to_finite,))

Finally, you can also use the following shorthands:

autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))

Available rules

| Rule symbol | scitype suggestion |
|-------------|--------------------|
| :few_to_finite | an appropriate Finite subtype for columns with few distinct values |
| :discrete_to_continuous | if not Finite, then Continuous for any Count or Integer scitypes/types |
| :string_to_multiclass | Multiclass for any string-like column |

Autotype can be used in conjunction with coerce:

X_coerced = coerce(X, autotype(X))

Examples

By default, autotype only applies the :few_to_finite rule:

n = 50
X = (a = rand("abc", n),         # 3 values, not number        --> Multiclass
     b = rand([1,2,3,4], n),     # 4 values, number            --> OrderedFactor
     c = rand([true,false], n),  # 2 values, number            --> OrderedFactor
     d = randn(n),               # many values                 --> unchanged
     e = rand(collect(1:n), n))  # many values                 --> unchanged
autotype(X, only_changes=true)
Dict{Symbol,Type{#s19} where #s19<:Union{Missing, Found}} with 3 entries:
  :a => Multiclass
  :b => OrderedFactor
  :c => OrderedFactor

Rules can also be combined. For example, we could first apply the :discrete_to_continuous rule, followed by the :few_to_finite rule. The first rule applies to b and e, but the subsequent application of the second rule means we get the same result as before, except for e (which will be Continuous):

autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
Dict{Symbol,Type{#s19} where #s19<:Union{Missing, Found}} with 4 entries:
  :a => Multiclass
  :b => OrderedFactor
  :e => Continuous
  :c => OrderedFactor

One should check and possibly modify the returned dictionary before passing to coerce.
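A minimal sketch of that review step, using made-up data. The choice to drop the suggestion for column c is purely illustrative.

```julia
using ScientificTypes, Tables

# Made-up data for illustration
X = (a = rand("abc", 50),
     c = rand([true, false], 50))

suggestions = autotype(X; only_changes=true)

# Suppose we prefer to leave column c as it is: drop its suggestion
# (delete! does nothing if the key is absent).
delete!(suggestions, :c)

# Coerce using only the suggestions we kept
X_coerced = coerce(X, suggestions)
schema(X_coerced).scitypes
```

Because the suggestions are an ordinary dictionary, entries can equally be replaced with a different scitype before the call to coerce.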