ScientificTypes.jl
This package makes a distinction between the machine type and the scientific type of a Julia object:
- The machine type refers to the Julia type being used to represent the object (for instance, Float64).
- The scientific type is one of the types defined in ScientificTypesBase.jl reflecting how the object should be interpreted (for instance, Continuous or Multiclass).
A scientific type convention is an assignment of a scientific type to every Julia object, articulated by overloading the scitype method. The DefaultConvention convention is the convention used in various Julia ecosystems.
This package additionally defines tools for type coercion (the coerce method) and scientific type "guessing" (the autotype method).
Developers interested in implementing a different convention will instead import ScientificTypesBase.jl, following the documentation there, possibly using this repo as a template.
Type hierarchy
The supported scientific types have the following hierarchy:
Finite{N}
├─ Multiclass{N}
└─ OrderedFactor{N}
Infinite
├─ Continuous
└─ Count
Image{W,H}
├─ ColorImage{W,H}
└─ GrayImage{W,H}
ScientificTimeType
├─ ScientificDate
├─ ScientificTime
└─ ScientificDateTime
Sampleable{Ω}
└─ Density{Ω}
Annotated{S}
AnnotationFor{S}
Multiset{S}
Table{K}
Textual
ManifoldPoint{MT}
Unknown
Additionally, we regard the Julia native types Missing and Nothing as scientific types as well.
Getting started
This documentation focuses on properties of the scitype method specific to the default convention. The scitype method satisfies certain universal properties, with respect to its operation on tuples, arrays and tables, set out in the ScientificTypesBase.jl readme, but only implicitly described here.
To get the scientific type of a Julia object defined by the default convention, call scitype:
julia> using ScientificTypes
julia> scitype(3.14)
Continuous
For a vector, you can use scitype or elscitype (which will give you a scitype corresponding to the elements):
julia> scitype([1,2,3,missing])
AbstractVector{Union{Missing, Count}} (alias for AbstractArray{Union{Missing, Count}, 1})
julia> elscitype([1,2,3,missing])
Union{Missing, Count}
Occasionally, you may want to find the union of all scitypes of elements of an arbitrary iterable, which you can do with scitype_union:
julia> scitype_union((ifelse(isodd(i), i, missing) for i in 1:5))
Union{Missing, Count}
Note that calling scitype_union on a large array, for example, is typically much slower than calling scitype or elscitype.
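As a small sketch of how the three functions relate (the variable names are illustrative only), the following applies each of them to the same vector. For an array like this one, elscitype and scitype_union should agree; scitype_union simply gets there by visiting every element.

```julia
using ScientificTypes

v = [1, 2, 3, missing]

# scitype describes the whole container ...
container = scitype(v)

# ... while elscitype describes its elements.
elements = elscitype(v)

# scitype_union visits every element and takes the union of their scitypes.
# It accepts any iterable, at the cost of a full pass over the data.
union_of_elements = scitype_union(v)

@show container elements union_of_elements
```

For plain arrays, elscitype is usually the better choice; scitype_union is mainly useful for iterables that are not arrays, such as generators.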
Summary of the default convention
The table below summarizes the default convention for representing scientific types:
Type T | scitype(x) for x::T | package/module required |
---|---|---|
Missing | Missing | |
Nothing | Nothing | |
AbstractFloat | Continuous | |
Integer | Count | |
String | Textual | |
CategoricalValue | Multiclass{N} where N = nlevels(x) , provided x.pool.ordered == false | CategoricalArrays.jl |
CategoricalString | Multiclass{N} where N = nlevels(x) , provided x.pool.ordered == false | CategoricalArrays.jl |
CategoricalValue | OrderedFactor{N} where N = nlevels(x) , provided x.pool.ordered == true | CategoricalArrays.jl |
CategoricalString | OrderedFactor{N} where N = nlevels(x) , provided x.pool.ordered == true | CategoricalArrays.jl |
Date | ScientificDate | Dates |
Time | ScientificTime | Dates |
DateTime | ScientificDateTime | Dates |
Distributions.Sampleable{F,S} | Sampleable{Ω} where Ω is scitype of sample space, according to {F,S} | |
Distributions.Distribution{F,S} | Density{Ω} where Ω is scitype of sample space, according to {F,S} | |
AbstractArray{<:Gray,2} | GrayImage{W,H} where (W, H) = size(x) | ColorTypes.jl |
AbstractArray{<:AbstractRGB,2} | ColorImage{W,H} where (W, H) = size(x) | ColorTypes.jl |
PersistenceDiagram | PersistenceDiagram | PersistenceDiagramsBase |
(*) any table type T | Table{K} where K=Union{column_scitypes...} | Tables.jl |
† CorpusLoaders.TaggedWord | Annotated{Textual} | CorpusLoaders.jl |
† CorpusLoaders.Document{AbstractVector{Q}} | Annotated{AbstractVector{Scitype(Q)}} | CorpusLoaders.jl |
† AbstractDict{<:AbstractString,<:Integer} | Multiset{Textual} | |
† AbstractDict{<:TaggedWord,<:Integer} | Multiset{Annotated{Textual}} | CorpusLoaders.jl |
(*) More precisely, any object X for which Tables.istable(X) == true will have scitype(X) = Table{K}, where K is the union of the column scitypes, with the following exceptions: abstract dictionaries with AbstractString keys, and abstract vectors of abstract dictionaries with AbstractString keys, are not considered tables by ScientificTypes.jl. Prior to Tables.jl 1.8, one had Tables.istable(X) == false for these objects, but in releases 1.8 and 1.10 this behaviour changed. These changes were breaking for ScientificTypes.jl, which has accordingly enforced the old behaviour, as far as scitype is concerned.
† Experimental and subject to change in a new minor or patch release.
Here nlevels(x) = length(levels(x.pool)).
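A few rows of the table can be checked directly at the REPL. This is only a sketch; the literal values are arbitrary, and the comments restate what the table above says each call should return.

```julia
using ScientificTypes
using Dates   # standard library; provides Date, Time and DateTime

@show scitype(missing)                      # Missing
@show scitype(nothing)                      # Nothing
@show scitype(2.5)                          # Continuous
@show scitype(7)                            # Count
@show scitype("seven")                      # Textual
@show scitype(Date(2024, 1, 31))            # ScientificDate
@show scitype(Time(12, 30))                 # ScientificTime
@show scitype(DateTime(2024, 1, 31, 12))    # ScientificDateTime
```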
Notes
- We regard the built-in Julia types Missing and Nothing as scientific types.
- Finite{N}, Multiclass{N} and OrderedFactor{N} are all parameterized by the number of levels N. We export the alias Binary = Finite{2}.
- Image{W,H}, GrayImage{W,H} and ColorImage{W,H} are all parameterized by the image width and height dimensions, (W, H).
- Sampleable{K} and Density{K} <: Sampleable{K} are parameterized by the sample space scitype.
- On objects for which the default convention has nothing to say, the scitype function returns Unknown.
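The relationships listed in these notes are ordinary Julia subtype relations, so they can be checked with <: as in the sketch below. The Char example relies on the default convention assigning Unknown to characters.

```julia
using ScientificTypes

# Binary is an alias, so both finite subtypes with two levels fall under it.
@show Binary === Finite{2}
@show Multiclass{2} <: Binary
@show OrderedFactor{2} <: Binary
@show Multiclass{3} <: Binary    # false: three levels, not two

# Density is a subtype of Sampleable with the same sample space parameter.
@show Density{Continuous} <: Sampleable{Continuous}

# The default convention has nothing to say about characters.
@show scitype('x')               # Unknown
```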
Special note on binary data
ScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an OrderedFactor{2} scitype, while data with no such class (e.g., gender) should be assigned a Multiclass{2} scitype. In the OrderedFactor{2} case we adopt the convention that the "true" class comes after the "false" class in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, Finite{2} covers both cases of binary data.
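One way to follow this convention is sketched below, assuming CategoricalArrays.jl is installed. The data and the variable names outcome and group are made up for illustration; the levels keyword fixes the ordering so that "fail" (the "false" class) precedes "pass".

```julia
using ScientificTypes
using CategoricalArrays

# Pass/fail has an intrinsic "true" class, so order the levels "fail" < "pass".
outcome = categorical(["pass", "fail", "pass", "pass"];
                      ordered=true, levels=["fail", "pass"])

# No intrinsic "true" class: leave the factor unordered.
group = categorical(["a", "b", "b", "a"])

@show elscitype(outcome)             # OrderedFactor{2}
@show levels(outcome)                # "fail" precedes "pass"
@show elscitype(group)               # Multiclass{2}
@show elscitype(outcome) <: Binary   # true
@show elscitype(group) <: Binary     # true
```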
Type coercion for tabular data
A common two-step workflow is:
1. Inspect the schema of some table, and the column scitypes in particular.
2. Provide pairs of column names and scitypes (or a dictionary) that change the column machine types to reflect the desired scientific interpretation (scitype).
using DataFrames, Tables
X = DataFrame(
name=["Siri", "Robo", "Alexa", "Cortana"],
height=[152, missing, 148, 163],
rating=[1, 5, 2, 1])
schema(X)
┌────────┬───────────────────────┬───────────────────────┐
│ names  │ scitypes              │ types                 │
├────────┼───────────────────────┼───────────────────────┤
│ name   │ Textual               │ String                │
│ height │ Union{Missing, Count} │ Union{Missing, Int64} │
│ rating │ Count                 │ Int64                 │
└────────┴───────────────────────┴───────────────────────┘
In some further analysis of the data in X, a more likely interpretation is that :name is Multiclass, :height is Continuous, and :rating is an OrderedFactor. These types can be corrected with coerce:
Xfixed = coerce(X, :name=>Multiclass,
:height=>Continuous,
:rating=>OrderedFactor)
schema(Xfixed).scitypes
(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})
Note that because missing values were encountered in height, an "imperfect" type coercion to Union{Missing,Continuous} has been performed, and a warning issued. To avoid the warning, coerce to Union{Missing,Continuous} instead.
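A minimal sketch of the warning-free form, applied to a single column rather than the whole table (the vector below simply repeats the height data from the example):

```julia
using ScientificTypes

height = [152, missing, 148, 163]

# Naming the Union explicitly states that missing values are expected.
h = coerce(height, Union{Missing, Continuous})

@show eltype(h)       # Union{Missing, Float64}
@show elscitype(h)    # Union{Missing, Continuous}
```

The same target type can be used in a column pair, as in :height => Union{Missing,Continuous}.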
"Global" replacements based on existing scientific types are also possible, and can be mixed with the name-based replacements:
X = (x = [1, 2, 3],
y = ['A', 'B', 'A'],
z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous, :y=>OrderedFactor)
schema(Xfixed).scitypes
(Continuous, OrderedFactor{2}, Continuous)
Finally, there is a coerce! method that does in-place coercion, provided the data structure supports it.
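As an illustration, the sketch below assumes DataFrames.jl is installed and that a DataFrame is one of the structures supporting in-place coercion; the table contents are invented for the example.

```julia
using ScientificTypes
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "a"])

# Mutates df: the coerced columns replace the originals in the same table.
coerce!(df, :x => Continuous, :y => Multiclass)

@show schema(df).scitypes
```

Unlike coerce, no new table needs to be bound to a variable; df itself now carries the new column types.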
Type coercion for image data
To have a scientific type of Image, a Julia object must be a two-dimensional array whose element type is a subtype of Gray or AbstractRGB (color types from the ColorTypes.jl package). Models typically expect collections of images to be vectors of such two-dimensional arrays. Implementations of coerce allow the conversion of some common image formats into one of these. The eltype in these other formats can be any subtype of Real, which includes the FixedPoint type from the FixedPointNumbers.jl package.
Coercing a single image
Coercing a gray image, represented as a Real matrix (W x H format):
img = rand(10, 10)
coerce(img, GrayImage) |> scitype
GrayImage{10, 10}
Coercing a color image, represented as a Real 3-D array (W x H x C format):
img = rand(10, 10, 3)
coerce(img, ColorImage) |> scitype
ColorImage{10, 10}
Coercing collections of images
Coercing a collection of gray images, represented as a Real 3-D array (W x H x N format):
imgs = rand(10, 10, 3)
coerce(imgs, GrayImage) |> scitype
AbstractVector{GrayImage{10, 10}} (alias for AbstractArray{GrayImage{10, 10}, 1})
Coercing a collection of gray images, represented as a Real 4-D array (W x H x {1} x N format):
imgs = rand(10, 10, 1, 3)
coerce(imgs, GrayImage) |> scitype
AbstractVector{GrayImage{10, 10}} (alias for AbstractArray{GrayImage{10, 10}, 1})
Coercing a collection of color images, represented as a Real 4-D array (W x H x C x N format):
imgs = rand(10, 10, 3, 5)
coerce(imgs, ColorImage) |> scitype
AbstractVector{ColorImage{10, 10}} (alias for AbstractArray{ColorImage{10, 10}, 1})
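To see what "a vector of two-dimensional arrays" means in practice, the following sketch inspects the result of coercing a random batch (the name batch and the array sizes are arbitrary choices for illustration):

```julia
using ScientificTypes

# Five 10 x 10 color images in W x H x C x N layout, filled with random data.
batch = rand(10, 10, 3, 5)
images = coerce(batch, ColorImage)

@show length(images)            # one element per image
@show size(first(images))       # each element is a W x H matrix
@show scitype(first(images))    # ColorImage{10, 10}
```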
Detailed usage examples
Basics
using CategoricalArrays
scitype((2.718, 42))
Tuple{Continuous, Count}
In the default convention, to construct arrays with categorical scientific element type one needs to use CategoricalArrays:
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
OrderedFactor{3}
elscitype(v)
Union{Missing, OrderedFactor{3}}
Coercing to Multiclass:
w = coerce(v, Union{Missing,Multiclass})
elscitype(w)
Union{Missing, Multiclass{3}}
Working with tables
While schema is convenient for inspecting the column scitypes of a table, there is also a scitype for the tables themselves:
data = (x1=rand(10), x2=rand(10))
schema(data)
┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ x1    │ Continuous │ Float64 │
│ x2    │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
scitype(data)
Table{AbstractVector{Continuous}}
Similarly, any table (see (*) above for the definition) has scitype Table{K}, where K is the union of the scitypes of its columns.
Table scitypes are useful for dispatch and type checks, as shown here, with the help of a constructor for Table scitypes provided by ScientificTypes.jl:
Table(Continuous, Count)
Table{<:Union{AbstractArray{<:Continuous},AbstractArray{<:Count}}}
scitype(data) <: Table(Continuous)
true
scitype(data) <: Table(Infinite)
true
data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
true
Note that Table(Continuous,Finite) is a type union and not a Table instance.
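A type check of this kind might be wrapped in a small input validator, as in the sketch below. The function check_continuous and the tables good and bad are hypothetical names introduced for the example, not part of the package.

```julia
using ScientificTypes

# Hypothetical helper: accept only tables whose columns are all Continuous.
function check_continuous(X)
    scitype(X) <: Table(Continuous) ||
        throw(ArgumentError("expected a table with Continuous columns"))
    return true
end

good = (x1 = rand(5), x2 = rand(5))
bad  = (x1 = rand(5), n = [1, 2, 3, 4, 5])

@show check_continuous(good)              # true
@show scitype(bad) <: Table(Continuous)   # false: column n has scitype Count
@show scitype(bad) <: Table(Infinite)     # true: Continuous and Count are both Infinite
```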
Tuples and arrays
The behavior of scitype on tuples is as you would expect:
scitype((1, 4.5))
Tuple{Count, Continuous}
For performance reasons, the behavior of scitype on arrays has some wrinkles, in the case of missing values:
The scitype of an array. The scitype of an AbstractArray, A, is always AbstractArray{U} where U is the union of the scitypes of the elements of A, with one exception: If typeof(A) <: AbstractArray{Union{Missing,T}} for some T different from Any, then the scitype of A is AbstractArray{Union{Missing, U}}, where U is the union over all non-missing elements, even if A has no missing elements.
julia> v = [1.3, 4.5, missing]
julia> scitype(v)
AbstractArray{Union{Missing, Continuous},1}
julia> scitype(v[1:2])
AbstractArray{Union{Missing, Continuous},1}
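If the Missing in the scitype is unwanted, one option is to narrow the machine type first. The sketch below does this with plain Base functions (skipmissing and collect), which is just one possible approach:

```julia
using ScientificTypes

v = [1.3, 4.5, missing]
w = v[1:2]                      # no missing values, but the eltype still allows them

@show eltype(w)                 # Union{Missing, Float64}
@show scitype(w)                # still mentions Missing

# Rebuilding the vector from its non-missing elements narrows the machine type.
tight = collect(skipmissing(w))

@show eltype(tight)             # Float64
@show scitype(tight)            # AbstractVector{Continuous}
```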
Automatic type conversion
The autotype function uses specific rules to guess appropriate scientific types for tabular data. Such rules would typically be more constraining than the ones implied by the active convention. When autotype is used, it returns a dictionary of suggested types, keyed on column name; if none of the specified rules applies to a column, the ambient convention is used as a "fallback".
The function is called as:
autotype(X)
If the keyword only_changes is set to true, then only the column names for which the suggested type is different from that provided by the convention are returned.
autotype(X; only_changes=true)
To specify which rules are to be applied, use the rules keyword and specify a tuple of symbols referring to specific rules. The default rule is :few_to_finite, which applies a heuristic to columns that have relatively few distinct values; these columns are then encoded with an appropriate Finite type. Note that the order in which the rules are specified matters: rules are applied in that order.
autotype(X; rules=(:few_to_finite,))
Finally, you can also use the following shorthands:
autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))
Available rules
Rule symbol | scitype suggestion |
---|---|
:few_to_finite | an appropriate Finite subtype for columns with few distinct values |
:discrete_to_continuous | if not Finite , then Continuous for any Count or Integer scitypes/types |
:string_to_multiclass | Multiclass for any string-like column |
autotype can be used in conjunction with coerce:
X_coerced = coerce(X, autotype(X))
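A slightly more cautious version of this pattern keeps the suggestions in a variable so they can be reviewed before use. The table below is a made-up example: column a has 3 distinct values over 60 rows, and column d has 60 distinct values, so only a is expected to be picked up by the default :few_to_finite rule.

```julia
using ScientificTypes

# Hypothetical table: 60 rows, few distinct values in a, many in d.
X = (a = repeat(['x', 'y', 'z'], 20),
     d = collect(range(0.0, 1.0; length=60)))

suggestions = autotype(X)    # default rule is :few_to_finite

# Inspect (and, if needed, edit) the suggestions before applying them.
@show suggestions

X_coerced = coerce(X, suggestions)
@show schema(X_coerced).scitypes
```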
Examples
By default, autotype applies only the :few_to_finite rule:
n = 50
X = (a = rand("abc", n), # 3 values, not number --> Multiclass
b = rand([1,2,3,4], n), # 4 values, number --> OrderedFactor
c = rand([true,false], n), # 2 values, number --> OrderedFactor
d = randn(n), # many values --> unchanged
e = rand(collect(1:n), n)) # many values --> unchanged
autotype(X, only_changes=true)
Dict{Symbol, Type} with 3 entries:
  :a => Multiclass
  :b => OrderedFactor
  :c => OrderedFactor
For example, we could first apply the :discrete_to_continuous rule, followed by the :few_to_finite rule. The first rule will apply to b and e, but the subsequent application of the second rule means we get the same result as before, apart from e (which will be Continuous):
autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
Dict{Symbol, Type} with 4 entries:
  :a => Multiclass
  :b => OrderedFactor
  :e => Continuous
  :c => OrderedFactor
One should check and possibly modify the returned dictionary before passing it to coerce.
API reference
ScientificTypes.scitype — Function
scitype(X)
The scientific type (interpretation) of X, as distinct from its machine type. Atomic scientific types (Continuous, Multiclass, etc) are mostly abstract types defined in the package ScientificTypesBase.jl. Scientific types do not ordinarily have instances.
Examples
julia> scitype(3.14)
Continuous
julia> scitype([1, 2, missing])
AbstractVector{Union{Missing, Count}}
julia> scitype((5, "beige"))
Tuple{Count, Textual}
julia> using CategoricalArrays
julia> table = (gender = categorical(['M', 'M', 'F', 'M', 'F']),
ndevices = [1, 3, 2, 3, 2])
julia> scitype(table)
Table{Union{AbstractVector{Count}, AbstractVector{Multiclass{2}}}}
Column scitypes of a table can also be inspected with schema.
The behavior of scitype is detailed in the ScientificTypes documentation. Key features of the default behavior are:
- AbstractFloat has scitype Continuous <: Infinite.
- Any Integer has scitype Count <: Infinite.
- Any CategoricalValue x has scitype Multiclass <: Finite or OrderedFactor <: Finite, depending on the value of isordered(x).
- Strings and Chars do not have scitype Multiclass or OrderedFactor; they have scitypes Textual and Unknown respectively.
- The scientific types of nothing and missing are Nothing and Missing, Julia types that are also regarded as scientific.
Third party packages may extend the behavior of scitype: Objects previously having Unknown scitype may no longer do so.
ScientificTypes.coerce — Function
coerce(A, S)
Return a new version of the array A whose scientific element type is S.
julia> v = coerce([3, 7, 5], Continuous)
3-element Vector{Float64}:
3.0
7.0
5.0
julia> scitype(v)
AbstractVector{Continuous}
coerce(X, specs...; tight=false, verbosity=1)
Given a table X, return a copy of X, ensuring that the element scitypes of the columns match the new specification, specs. There are three valid specifications:
(i) one or more column_name=>Scitype pairs:
coerce(X, col1=>Scitype1, col2=>Scitype2, ... ; verbosity=1)
(ii) one or more OldScitype=>NewScitype pairs (OldScitype covering both the OldScitype and Union{Missing,OldScitype} cases):
coerce(X, OldScitype1=>NewScitype1, OldScitype2=>NewScitype2, ... ; verbosity=1)
(iii) a dictionary of scientific types keyed on column names:
coerce(X, d::AbstractDict{<:ColKey, <:Type}; verbosity=1)
where ColKey = Union{Symbol,AbstractString}.
Examples
Specifying column_name=>Scitype pairs:
using CategoricalArrays, DataFrames, Tables
X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
height=[152, missing, 148, 163],
rating=[1, 5, 2, 1])
Xc = coerce(X, :name=>Multiclass, :height=>Continuous, :rating=>OrderedFactor)
schema(Xc).scitypes # (Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})
Specifying OldScitype=>NewScitype pairs:
X = (x = [1, 2, 3],
y = rand(3),
z = [10, 20, 30])
Xc = coerce(X, Count=>Continuous)
schema(Xc).scitypes # (Continuous, Continuous, Continuous)
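The third specification form, a dictionary keyed on column names, is not illustrated above. A minimal sketch, reusing the same kind of table (the name specs is arbitrary):

```julia
using ScientificTypes

X = (x = [1, 2, 3],
     y = rand(3),
     z = [10, 20, 30])

# Keys are column names; columns not mentioned (here y) are left untouched.
specs = Dict(:x => Continuous, :z => OrderedFactor)
Xc = coerce(X, specs)

@show schema(Xc).scitypes    # (Continuous, Continuous, OrderedFactor{3})
```

This is the same form that autotype returns, which is why its output can be passed straight to coerce.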
coerce(image::AbstractArray{<:Real, N}, I)
Given an array called image representing one or more images, return a transformed version of the data so as to enforce an appropriate scientific interpretation I:
single or collection ? | N | I | scitype of result |
---|---|---|---|
single | 2 | GrayImage | GrayImage{W,H} |
single | 3 | ColorImage | ColorImage{W,H} |
collection | 3 | GrayImage | AbstractVector{<:GrayImage} |
collection | 4 (W x H x {1} x N) | GrayImage | AbstractVector{<:GrayImage} |
collection | 4 | ColorImage | AbstractVector{<:ColorImage} |
imgs = rand(10, 10, 3, 5)
v = coerce(imgs, ColorImage)
julia> typeof(v)
Vector{Matrix{ColorTypes.RGB{Float64}}}
julia> scitype(v)
AbstractVector{ColorImage{10, 10}}
ScientificTypes.autotype — Function
autotype(X; kw...)
Return a dictionary of suggested scitypes for each column of X, a table or an array, based on rules.
Kwargs
- only_changes=true: if true, return only a dictionary of the names for which applying autotype differs from just using the ambient convention. When coercing with autotype, only_changes should be true.
- rules=(:few_to_finite,): the set of rules to apply.