ScientificTypes.jl
A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.
The package distinguishes between machine type and scientific type:
- the machine type is the Julia type the data is currently encoded as (for instance, Float64)
- the scientific type is a type defined by this package which encapsulates how the data should be interpreted in the rest of the code (for instance, Continuous or Multiclass)
As a motivating example, the data might contain a column corresponding to a number of transactions. The machine type in that case could be an Int, whereas the scientific type would be a Count.
The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a Float64 whereas the meaning should still be a Count.
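The Float64-encoded count from the previous example can be sketched as follows. This is a minimal illustration that assumes the default MLJ convention is active and that coerce accepts Count as a target for a vector of whole-number floats; the data and variable names are made up for the example.

```julia
using ScientificTypes

# Transaction counts that happen to be stored as floats (illustrative data).
transactions = [3.0, 5.0, 2.0, 7.0]

# Under the MLJ convention a Float64 vector is interpreted as Continuous,
# which is not the intended meaning here.
scitype_union(transactions)

# Assumption: coerce supports Count as a target for a plain vector.
counts = coerce(transactions, Count)

# After coercion the intended interpretation is carried by the data itself.
scitype_union(counts)
```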
Features
The package ScientificTypes provides:
- A hierarchy of new Julia types representing scientific data types for use in method dispatch (e.g., for trait values). Instances of the types play no role:
Found
├─ Known
│ ├─ Finite
│ │ ├─ Multiclass
│ │ └─ OrderedFactor
│ ├─ Infinite
│ │ ├─ Continuous
│ │ └─ Count
│ ├─ Image
│ │ ├─ ColorImage
│ │ └─ GrayImage
│ └─ Table
└─ Unknown
- A single method scitype for articulating a convention about what scientific type each Julia object can represent. For example, one might declare scitype(::AbstractFloat) = Continuous.
- A default convention called MLJ, based on dependencies CategoricalArrays, ColorTypes, and Tables, which includes a convenience method coerce for performing scientific type coercion on AbstractVectors and columns of tabular data (any table implementing the Tables.jl interface).
- A schema method for tabular data, based on the optional Tables dependency, for inspecting the machine and scientific types of tabular data, in addition to column names and number of rows.
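The idea of expressing a convention through a single dispatching method can be illustrated without loading the package. The sketch below is a toy model only: every name prefixed with Toy or toy_ is hypothetical and is not how ScientificTypes itself is implemented.

```julia
# Toy stand-ins for a small part of the type hierarchy (hypothetical names).
# They are abstract because only the types matter, never their instances.
abstract type ToyFound end
abstract type ToyUnknown <: ToyFound end
abstract type ToyInfinite <: ToyFound end
abstract type ToyContinuous <: ToyInfinite end
abstract type ToyCount <: ToyInfinite end

# A "convention" is a set of methods mapping machine types to these types.
toy_scitype(::Any) = ToyUnknown              # fallback for anything unrecognised
toy_scitype(::AbstractFloat) = ToyContinuous
toy_scitype(::Integer) = ToyCount

toy_scitype(3.14)    # ToyContinuous
toy_scitype(42)      # ToyCount
toy_scitype("abc")   # ToyUnknown

# The returned types can then be used for checks in downstream code.
toy_scitype(42) <: ToyInfinite   # true
```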
Getting started
The package is registered and can be installed via the package manager with add ScientificTypes.
To get the scientific type of a Julia object according to the convention in use, call scitype:
scitype(3.14)
Continuous
For a vector, you can use scitype or scitype_union (which will give you a scitype corresponding to the elements):
scitype([1,2,3,missing])
AbstractArray{Union{Missing, Count},1}
scitype_union([1,2,3,missing])
Union{Missing, Count}
Type coercion workflow for tabular data
The standard workflow involves the following two steps:
- inspect the schema of the data and the scitypes in particular
- provide pairs or a dictionary with column names and scitypes for any changes you may want and coerce the data to those scitypes
using ScientificTypes, DataFrames, Tables
X = DataFrame(
name=["Siri", "Robo", "Alexa", "Cortana"],
height=[152, missing, 148, 163],
rating=[1, 5, 2, 1])
schema(X)
_.table =
┌─────────┬───────────────────────┬───────────────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼───────────────────────┼───────────────────────┤
│ name │ String │ Textual │
│ height │ Union{Missing, Int64} │ Union{Missing, Count} │
│ rating │ Int64 │ Count │
└─────────┴───────────────────────┴───────────────────────┘
_.nrows = 4
inspecting the scitypes:
schema(X).scitypes
(Textual, Union{Missing, Count}, Count)
but in this case you may want to map the names to Multiclass, the height to Continuous and the ratings to OrderedFactor; to do so:
Xfixed = coerce(X, :name=>Multiclass,
:height=>Continuous,
:rating=>OrderedFactor)
schema(Xfixed).scitypes
(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})
Note that, as it encountered missing values in height, it coerced the type to Union{Missing,Continuous}.
One can also make a replacement based on existing scientific type, instead of feature name:
X = (x = [1, 2, 3],
y = rand(3),
z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous)
schema(Xfixed).scitypes
(Continuous, Continuous, Continuous)
Finally, there is a coerce! method that does in-place coercion provided the data structure allows it (at the moment only DataFrames.DataFrame is supported).
Notes
- We regard the built-in Julia type Missing as a scientific type. The new scientific types introduced in the current package are rooted in the abstract type Found (see tree above) and we export the alias Scientific = Union{Missing, Found}.
- Finite{N}, Multiclass{N} and OrderedFactor{N} are all parametrised by the number of levels N. We export the alias Binary = Finite{2}.
- Image{W,H}, GrayImage{W,H} and ColorImage{W,H} are all parametrised by the image width and height dimensions, (W, H).
- The function scitype has the fallback value Unknown.
- Since Tables is an optional dependency, the scitype of a Tables.jl supported table is Unknown unless Tables has been imported.
- Developers can define their own conventions using the code in src/conventions/mlj/ as a template. The active convention is controlled by the value of ScientificTypes.CONVENTION[1].
Special note on binary data
ScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an OrderedFactor{2} scitype, while data with no such class (e.g., gender) should be assigned a Multiclass{2} scitype. In the former case we recommend that the "true" class come after "false" in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, Finite{2} covers both cases of binary data.
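The relationship between the two kinds of binary data and the exported aliases can be checked directly with subtype tests. The sketch below relies only on the hierarchy and aliases described in the notes above; describe_binary is a hypothetical helper introduced for illustration and is not part of the package.

```julia
using ScientificTypes

# Binary is an alias for Finite{2}, which sits above both two-class scitypes.
Binary == Finite{2}
Multiclass{2} <: Binary
OrderedFactor{2} <: Binary

# Missing counts as a scientific type through the Scientific alias.
Missing <: Scientific

# Dispatch can still tell the two binary cases apart when that matters.
# `describe_binary` is a hypothetical helper, not a package function.
describe_binary(::Type{<:OrderedFactor{2}}) = "binary with an intrinsic \"true\" class"
describe_binary(::Type{<:Multiclass{2}}) = "binary with no intrinsic \"true\" class"

describe_binary(OrderedFactor{2})
describe_binary(Multiclass{2})
```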
Detailed usage examples
using ScientificTypes
# activate a convention
ScientificTypes.set_convention(MLJ) # redundant as it's the default
scitype((2.718, 42))
Let's try with categorical valued objects:
using CategoricalArrays
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
OrderedFactor{3}
and
scitype_union(v)
Union{Missing, OrderedFactor{3}}
you could coerce this to Multiclass:
w = coerce(v, Multiclass)
scitype_union(w)
Union{Missing, Multiclass{3}}
Working with tables
using Tables
data = (x1=rand(10), x2=rand(10), x3=collect(1:10))
scitype(data)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}}}
you can also use schema:
schema(data)
_.table =
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Int64 │ Count │
└─────────┴─────────┴────────────┘
_.nrows = 10
and use <: for type checks:
scitype(data) <: Table(Continuous)
false
scitype(data) <: Table(Infinite)
true
or specify multiple types directly:
data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
true
The scientific type of tuples, arrays and tables
Under any convention, the scitype of a tuple is a Tuple type parameterized by scientific types:
scitype((1, 4.5))
Tuple{Count,Continuous}
Similarly, the scitype of an AbstractArray is AbstractArray{U} where U is the union of the element scitypes:
scitype([1.3, 4.5, missing])
AbstractArray{Union{Missing, Continuous},1}
Performance note: Computing type unions over large arrays is expensive and, depending on the convention's implementation and the array eltype, computing the scitype can be slow. (In the MLJ convention this is mitigated with the help of the ScientificTypes.Scitype method, of which other conventions could make use. Do ?ScientificTypes.Scitype for details.) An eltype Any will always be slow and you may want to consider replacing an array A with broadcast(identity, A) to collapse the eltype and speed up the computation.
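The eltype-collapsing trick mentioned in the performance note uses only Base Julia, so it can be tried without the package. A minimal sketch with made-up data:

```julia
# An array whose eltype is Any even though every element is a Float64.
A = Any[1.3, 4.5, 2.2]
eltype(A)    # Any

# Broadcasting identity rebuilds the array with the narrowest eltype
# that fits the elements actually present.
B = broadcast(identity, A)
eltype(B)    # Float64

# The values themselves are unchanged.
B == A       # true
```

Note that broadcast returns a new array rather than modifying A, so this costs one extra pass over the data.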
Provided the Tables.jl package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:
using CategoricalArrays, Tables
X = (x1=rand(10),
x2=rand(10),
x3=categorical(rand("abc", 10)),
x4=categorical(rand("01", 10)))
scitype(X)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{3},1}, AbstractArray{Multiclass{2},1}}}
Specifically, if X has columns c1, ..., cn, then, by definition,
scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}}
With this definition, common type checks can be performed with tables. For instance, you could check that each column of X has an element scitype that is either Continuous or Finite:
scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
true
A built-in Table constructor provides a shorthand for the right-hand side:
scitype(X) <: Table(Continuous, Finite)
true
Note that Table(Continuous,Finite) is a type union and not a Table instance.
The MLJ convention
The table below summarizes the MLJ convention for representing scientific types:
Type T | scitype(x) for x::T | package required |
---|---|---|
Missing | Missing | |
AbstractFloat | Continuous | |
Integer | Count | |
CategoricalValue | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays |
CategoricalString | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays |
CategoricalValue | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays |
CategoricalString | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays |
AbstractArray{<:Gray,2} | GrayImage{W,H} where (W, H) = size(x) | ColorTypes |
AbstractArray{<:AbstractRGB,2} | ColorImage{W,H} where (W, H) = size(x) | ColorTypes |
any table type T supported by Tables.jl | Table{K} where K=Union{column_scitypes...} | Tables |
Here nlevels(x) = length(levels(x.pool)).
Automatic type conversion for tabular data
The autotype function applies specific rules to guess appropriate scientific types for the data. Such rules are typically more constraining than the ones implied by the active convention. When autotype is used, a dictionary mapping each column name to a suggested type is returned; if none of the specified rules applies, the ambient convention is used as a fallback.
The function is called as:
autotype(X)
If the keyword only_changes is set to true, then only the column names for which the suggested type differs from that provided by the convention are returned.
autotype(X; only_changes=true)
To specify which rules are to be applied, use the rules keyword and specify a tuple of symbols referring to specific rules. The default rule is :few_to_finite, which applies a heuristic to columns with relatively few distinct values and encodes them with an appropriate Finite type. Note that the order in which the rules are specified matters: rules are applied in that order.
autotype(X; rules=(:few_to_finite,))
Finally, you can also use the following shorthands:
autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))
Available rules
Rule symbol | scitype suggestion |
---|---|
:few_to_finite | an appropriate Finite subtype for columns with few distinct values |
:discrete_to_continuous | if not Finite, then Continuous for any Count or Integer scitypes/types |
:string_to_multiclass | Multiclass for any string-like column |
autotype can be used in conjunction with coerce:
X_coerced = coerce(X, autotype(X))
Examples
By default, autotype only applies the :few_to_finite rule:
n = 50
X = (a = rand("abc", n), # 3 values, not number --> Multiclass
b = rand([1,2,3,4], n), # 4 values, number --> OrderedFactor
c = rand([true,false], n), # 2 values, number --> OrderedFactor
d = randn(n), # many values --> unchanged
e = rand(collect(1:n), n)) # many values --> unchanged
autotype(X, only_changes=true)
Dict{Symbol,Type{#s19} where #s19<:Union{Missing, Found}} with 3 entries:
:a => Multiclass
:b => OrderedFactor
:c => OrderedFactor
For example, we could first apply the :discrete_to_continuous rule, followed by the :few_to_finite rule. The first rule will apply to b and e, but the subsequent application of the second rule means we get the same result as before except for e (which will be Continuous):
autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
Dict{Symbol,Type{#s19} where #s19<:Union{Missing, Found}} with 4 entries:
:a => Multiclass
:b => OrderedFactor
:e => Continuous
:c => OrderedFactor
One should check and possibly modify the returned dictionary before passing it to coerce.
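The review step can be sketched as follows. This is a minimal illustration that assumes the returned dictionary can be edited with the usual Base functions and passed to coerce as shown earlier; the data, and the decision to leave column b untouched, are invented for the example.

```julia
using ScientificTypes, Tables

n = 50
X = (a = rand("abc", n),
     b = rand([1, 2, 3, 4], n),
     d = randn(n))

# Ask only for the columns whose suggested scitype differs from the convention.
suggestions = autotype(X; only_changes=true)

# Review: suppose b really is a count, so its suggestion is discarded.
# delete! does nothing if the key is absent.
delete!(suggestions, :b)

# Coerce with the edited dictionary and inspect the outcome.
X_coerced = coerce(X, suggestions)
schema(X_coerced).scitypes
```

Editing the dictionary rather than the rule set keeps the override local to one column, which may be simpler than reordering rules when only a few suggestions are unwanted.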