# ScientificTypes.jl

This package makes a distinction between **machine type** and **scientific type** of a Julia object:

- The *machine type* refers to the Julia type being used to represent the object (for instance, `Float64`).
- The *scientific type* is one of the types defined in ScientificTypesBase.jl reflecting how the object should be *interpreted* (for instance, `Continuous` or `Multiclass`).
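As a quick illustration of this distinction (a minimal sketch, assuming ScientificTypes.jl is installed and the default convention is in use), `typeof` reports the machine type while `scitype` reports the scientific type:

```julia
using ScientificTypes

x = 3.14
typeof(x)    # machine type: Float64
scitype(x)   # scientific type: Continuous

n = 42
typeof(n)    # machine type: Int64 on a 64-bit system
scitype(n)   # scientific type: Count
```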

A *scientific type convention* is an assignment of a scientific type to every Julia object, articulated by overloading the `scitype` method. The `DefaultConvention` convention is the convention used in various Julia ecosystems.

This package additionally defines tools for type coercion (the `coerce` method) and scientific type "guessing" (the `autotype` method).

Developers interested in implementing a different convention will instead import ScientificTypesBase.jl, following the documentation there, possibly using this repo as a template.

## Type hierarchy

The supported scientific types have the following hierarchy:

```
Finite{N}
├─ Multiclass{N}
└─ OrderedFactor{N}
Infinite
├─ Continuous
└─ Count
Image{W,H}
├─ ColorImage{W,H}
└─ GrayImage{W,H}
ScientificTimeType
├─ ScientificDate
├─ ScientificTime
└─ ScientificDateTime
Sampleable{Ω}
└─ Density{Ω}
Annotated{S}
AnnotationFor{S}
Multiset{S}
Table{K}
Textual
ManifoldPoint{MT}
Unknown
```

Additionally, we regard the Julia native types `Missing` and `Nothing` as scientific types as well.
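The tree can be checked directly with Julia's subtype operator. The following is a small sketch, assuming only that the scientific types are in scope via `using ScientificTypes`:

```julia
using ScientificTypes

# Each child in the tree is a subtype of its parent.
Multiclass{3} <: Finite{3}       # true
OrderedFactor{3} <: Finite{3}    # true
Continuous <: Infinite           # true
Count <: Infinite                # true

# Siblings are unrelated to each other.
Continuous <: Count              # false

# The type parameter has to match as well.
Multiclass{3} <: Finite{2}       # false
```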

## Getting started

This documentation focuses on properties of the `scitype` method specific to the default convention. The `scitype` method satisfies certain universal properties, with respect to its operation on tuples, arrays and tables, set out in the ScientificTypes readme, but only implicitly described here.

To get the scientific type of a Julia object defined by the default convention, call `scitype`:

```
julia> using ScientificTypes
julia> scitype(3.14)
Continuous
```

For a vector, you can use `scitype` or `elscitype` (which will give you a scitype corresponding to the elements):

```
julia> scitype([1,2,3,missing])
AbstractVector{Union{Missing, Count}} (alias for AbstractArray{Union{Missing, Count}, 1})
```

```
julia> elscitype([1,2,3,missing])
Union{Missing, Count}
```

Occasionally, you may want to find the union of all scitypes of elements of an arbitrary iterable, which you can do with `scitype_union`:

```
julia> scitype_union((ifelse(isodd(i), i, missing) for i in 1:5))
Union{Missing, Count}
```

Note that calling `scitype_union` on a large array, for example, is typically much slower than calling `scitype` or `elscitype`.

## Summary of the default convention

The table below summarizes the default convention for representing scientific types:

Type `T` | `scitype(x)` for `x::T` | package/module required |
---|---|---|
`Missing` | `Missing` | |
`Nothing` | `Nothing` | |
`AbstractFloat` | `Continuous` | |
`Integer` | `Count` | |
`String` | `Textual` | |
`CategoricalValue` | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false` | CategoricalArrays.jl |
`CategoricalString` | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false` | CategoricalArrays.jl |
`CategoricalValue` | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true` | CategoricalArrays.jl |
`CategoricalString` | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true` | CategoricalArrays.jl |
`Date` | `ScientificDate` | Dates |
`Time` | `ScientificTime` | Dates |
`DateTime` | `ScientificDateTime` | Dates |
`Distributions.Sampleable{F,S}` | `Sampleable{Ω}` where `Ω` is scitype of sample space, according to `{F,S}` | |
`Distributions.Distribution{F,S}` | `Density{Ω}` where `Ω` is scitype of sample space, according to `{F,S}` | |
`AbstractArray{<:Gray,2}` | `GrayImage{W,H}` where `(W, H) = size(x)` | ColorTypes.jl |
`AbstractArray{<:AbstractRGB,2}` | `ColorImage{W,H}` where `(W, H) = size(x)` | ColorTypes.jl |
`PersistenceDiagram` | `PersistenceDiagram` | PersistenceDiagramsBase |
any table type `T` supported by Tables.jl | `Table{K}` where `K=Union{column_scitypes...}` | Tables.jl |
† `CorpusLoaders.TaggedWord` | `Annotated{Textual}` | CorpusLoaders.jl |
† `CorpusLoaders.Document{AbstractVector{Q}}` | `Annotated{AbstractVector{Scitype(Q)}}` | CorpusLoaders.jl |
† `AbstractDict{<:AbstractString,<:Integer}` | `Multiset{Textual}` | |
† `AbstractDict{<:TaggedWord,<:Integer}` | `Multiset{Annotated{Textual}}` | CorpusLoaders.jl |

† *Experimental* and subject to change in a new minor or patch release.

Here `nlevels(x) = length(levels(x.pool))`.
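A few rows of the table can be spot-checked as follows. This is a sketch only; it assumes CategoricalArrays.jl is installed alongside ScientificTypes.jl, and the sample values are made up for illustration:

```julia
using ScientificTypes, CategoricalArrays, Dates

scitype(missing)                        # Missing
scitype("some text")                    # Textual
scitype(Date(2024, 1, 15))              # ScientificDate
scitype(DateTime(2024, 1, 15, 9, 30))   # ScientificDateTime

# Categorical values: the scitype depends on whether the pool is ordered,
# and N is the number of levels in the pool (3 here).
unordered = categorical(["low", "mid", "high", "low"])
ordered   = categorical(["low", "mid", "high", "low"], ordered=true)
scitype(unordered[1])   # Multiclass{3}
scitype(ordered[1])     # OrderedFactor{3}
```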

## Notes

- We regard the built-in Julia types `Missing` and `Nothing` as scientific types.
- `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all parameterized by the number of levels `N`. We export the alias `Binary = Finite{2}`.
- `Image{W,H}`, `GrayImage{W,H}` and `ColorImage{W,H}` are all parameterized by the image width and height dimensions, `(W, H)`.
- `Sampleable{K}` and `Density{K} <: Sampleable{K}` are parameterized by the sample space scitype.
- On objects for which the default convention has nothing to say, the `scitype` function returns `Unknown`.
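Two of these notes are easy to verify interactively. The sketch below assumes that complex numbers and symbols are not covered by the default convention, which is why they are used as examples of `Unknown`:

```julia
using ScientificTypes

# The exported alias `Binary` is just `Finite{2}`.
Binary == Finite{2}            # true
Multiclass{2} <: Binary        # true
OrderedFactor{2} <: Binary     # true

# Objects the default convention says nothing about fall back to `Unknown`.
scitype(1 + 2im)               # Unknown
scitype(:a_symbol)             # Unknown
```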

### Special note on binary data

ScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an `OrderedFactor{2}` scitype, while data with no such class (e.g., gender) should be assigned a `Multiclass{2}` scitype. In the `OrderedFactor{2}` case we adopt the convention that the "true" class comes *after* the "false" class in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, `Finite{2}` covers both cases of binary data.
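A short sketch of the ordering convention, using made-up pass/fail data and assuming CategoricalArrays.jl is available for `levels`. Here the default (sorted) level order already puts the "true" class last:

```julia
using ScientificTypes, CategoricalArrays

# Hypothetical product-test outcomes; "pass" is the intrinsic "true" class.
outcome = coerce(["fail", "pass", "pass", "fail"], OrderedFactor)

elscitype(outcome)             # OrderedFactor{2}
levels(outcome)                # ["fail", "pass"]: "pass" comes after "fail"
outcome[1] < outcome[2]        # true, since "fail" < "pass" in the ordering

# Both binary cases are covered by Finite{2}, also known as Binary.
elscitype(outcome) <: Binary   # true
```

If the sorted order happened to put the "true" class first, the levels would need reordering (for instance with `levels!` from CategoricalArrays.jl) before fitting a model that relies on this convention.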

## Type coercion for tabular data

A common two-step workflow is:

1. Inspect the `schema` of some table, and the column `scitypes` in particular.
2. Provide pairs of column names and scitypes (or a dictionary) that change the column machine types to reflect the desired scientific interpretation (scitype).

```
using ScientificTypes, DataFrames, Tables
X = DataFrame(
    name=["Siri", "Robo", "Alexa", "Cortana"],
    height=[152, missing, 148, 163],
    rating=[1, 5, 2, 1])
schema(X)
```

```
┌────────┬───────────────────────┬───────────────────────┐
│ names  │ scitypes              │ types                 │
├────────┼───────────────────────┼───────────────────────┤
│ name   │ Textual               │ String                │
│ height │ Union{Missing, Count} │ Union{Missing, Int64} │
│ rating │ Count                 │ Int64                 │
└────────┴───────────────────────┴───────────────────────┘
```

In some further analysis of the data in `X`, a more likely interpretation is that `:name` is `Multiclass`, the `:height` is `Continuous`, and the `:rating` an `OrderedFactor`. Correcting the types with `coerce`:

```
Xfixed = coerce(X, :name=>Multiclass,
                :height=>Continuous,
                :rating=>OrderedFactor)
schema(Xfixed).scitypes
```

(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})

Note that because missing values were encountered in `height`, an "imperfect" type coercion to `Union{Missing,Continuous}` has been performed, and a warning issued. To avoid the warning, coerce to `Union{Missing,Continuous}` instead.
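For instance, requesting the union type explicitly might look like the following sketch, which uses a named tuple of vectors as a stand-in for the data frame above:

```julia
using ScientificTypes

# A column table (named tuple of vectors) with a missing height.
X = (height = [152, missing, 148, 163],)

# Asking for the Union explicitly states the intended result up front.
Xc = coerce(X, :height => Union{Missing,Continuous})
schema(Xc).scitypes   # (Union{Missing, Continuous},)
```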

"Global" replacements based on existing scientific types are also possible, and can be mixed with the name-based replacements:

```
X = (x = [1, 2, 3],
     y = ['A', 'B', 'A'],
     z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous, :y=>OrderedFactor)
schema(Xfixed).scitypes
```

(Continuous, OrderedFactor{2}, Continuous)

Finally, there is a `coerce!` method that does in-place coercion provided the data structure supports it.

## Type coercion for image data

To have a scientific type of `Image`, a Julia object must be a two-dimensional array whose element type is a subtype of `Gray` or `AbstractRGB` (color types from the ColorTypes.jl package). Models typically expect *collections* of images to be vectors of such two-dimensional arrays. Implementations of `coerce` allow the conversion of some common image formats into one of these. The eltype in these other formats can be any subtype of `Real`, which includes the `FixedPoint` type from the FixedPointNumbers.jl package.

### Coercing a single image

Coercing a **gray** image, represented as a `Real` matrix (W x H format):

```
img = rand(10, 10)
coerce(img, GrayImage) |> scitype
```

GrayImage{10, 10}

Coercing a **color** image, represented as a `Real` 3-D array (W x H x C format):

```
img = rand(10, 10, 3)
coerce(img, ColorImage) |> scitype
```

ColorImage{10, 10}

### Coercing collections of images

Coercing a **collection** of **gray** images, represented as a `Real` 3-D array (W x H x N format):

```
imgs = rand(10, 10, 3)
coerce(imgs, GrayImage) |> scitype
```

AbstractVector{GrayImage{10, 10}} (alias for AbstractArray{GrayImage{10, 10}, 1})

Coercing a **collection** of **gray** images, represented as a `Real` 4-D array (W x H x {1} x N format):

```
imgs = rand(10, 10, 1, 3)
coerce(imgs, GrayImage) |> scitype
```

AbstractVector{GrayImage{10, 10}} (alias for AbstractArray{GrayImage{10, 10}, 1})

Coercing a **collection** of **color** images, represented as a `Real` 4-D array (W x H x C x N format):

```
imgs = rand(10, 10, 3, 5)
coerce(imgs, ColorImage) |> scitype
```

AbstractVector{ColorImage{10, 10}} (alias for AbstractArray{ColorImage{10, 10}, 1})

## Detailed usage examples

### Basics

```
using CategoricalArrays
scitype((2.718, 42))
```

Tuple{Continuous, Count}

In the default convention, to construct arrays with categorical scientific element type one needs to use `CategoricalArrays`:

```
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
```

OrderedFactor{3}

`elscitype(v)`

Union{Missing, OrderedFactor{3}}

Coercing to `Multiclass`:

```
w = coerce(v, Union{Missing,Multiclass})
elscitype(w)
```

Union{Missing, Multiclass{3}}

### Working with tables

While `schema` is convenient for inspecting the column scitypes of a table, there is also a scitype for the tables themselves:

```
data = (x1=rand(10), x2=rand(10))
schema(data)
```

```
┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ x1    │ Continuous │ Float64 │
│ x2    │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
```

`scitype(data)`

Table{AbstractVector{Continuous}}

Similarly, any table implementing the Tables interface has scitype `Table{K}`, where `K` is the union of the scitypes of its columns.

Table scitypes are useful for dispatch and type checks, as shown here, with the help of a constructor for `Table` scitypes provided by ScientificTypes.jl:

`Table(Continuous, Count)`

`Table{<:Union{AbstractArray{<:Continuous},AbstractArray{<:Count}}}`

`scitype(data) <: Table(Continuous)`

true

`scitype(data) <: Table(Infinite)`

true

```
data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
```

true

Note that `Table(Continuous,Finite)` is a *type* union and not a `Table` *instance*.
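As a sketch of such a type check, the hypothetical helper `is_numeric_table` below (not part of the package) accepts only tables whose columns all have `Infinite` element scitype:

```julia
using ScientificTypes

# Hypothetical helper: true only if every column is Continuous or Count.
is_numeric_table(X) = scitype(X) <: Table(Infinite)

is_numeric_table((x = rand(5), y = collect(1:5)))                 # true
is_numeric_table((x = rand(5), y = ["a", "b", "c", "d", "e"]))    # false
```

The same test can guard the entry point of a function that only makes sense for numeric data, before any computation is attempted.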

### Tuples and arrays

The behavior of `scitype` on tuples is as you would expect:

`scitype((1, 4.5))`

Tuple{Count, Continuous}

For performance reasons, the behavior of `scitype` on arrays has some wrinkles, in the case of missing values:

**The scitype of an array.** The scitype of an `AbstractArray`, `A`, is always `AbstractArray{U}` where `U` is the union of the scitypes of the elements of `A`, with one exception: If `typeof(A) <: AbstractArray{Union{Missing,T}}` for some `T` different from `Any`, then the scitype of `A` is `AbstractArray{Union{Missing, U}}`, where `U` is the union over all non-missing elements, **even if `A` has no missing elements.**

```
julia> v = [1.3, 4.5, missing]
julia> scitype(v)
AbstractArray{Union{Missing, Continuous},1}
```

```
julia> scitype(v[1:2])
AbstractArray{Union{Missing, Continuous},1}
```
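If the `Missing` part is unwanted, one option is to narrow the machine element type first. The sketch below uses plain Julia broadcasting (`identity.(w)`) for this, which is a general Julia idiom rather than anything specific to this package:

```julia
using ScientificTypes

v = [1.3, 4.5, missing]     # eltype is Union{Missing, Float64}
w = v[1:2]                  # slicing keeps the Union eltype
scitype(w)                  # AbstractVector{Union{Missing, Continuous}}

# Rebuild the vector with the narrowest eltype that fits its elements;
# this gives a Vector{Float64} because no element is actually missing.
tight = identity.(w)
scitype(tight)              # AbstractVector{Continuous}
```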

## Automatic type conversion

The `autotype` function applies specific rules to guess appropriate scientific types for *tabular* data. Such rules would typically be more constraining than the ones implied by the active convention. When `autotype` is used, a dictionary of suggested types is returned for each column in the data; if none of the specified rules applies, the ambient convention is used as a "fallback".

The function is called as:

`autotype(X)`

If the keyword `only_changes` is set to `true`, then only the column names for which the suggested type is different from that provided by the convention are returned.

`autotype(X; only_changes=true)`

To specify which rules are to be applied, use the `rules` keyword and specify a tuple of symbols referring to specific rules. The default rule is `:few_to_finite`, which applies a heuristic to columns that have relatively few distinct values; these columns are then encoded with an appropriate `Finite` type. Note that the order in which the rules are specified matters: rules are applied in that order.

`autotype(X; rules=(:few_to_finite,))`

Finally, you can also use the following shorthands:

```
autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))
```

### Available rules

Rule symbol | scitype suggestion |
---|---|
`:few_to_finite` | an appropriate `Finite` subtype for columns with few distinct values |
`:discrete_to_continuous` | if not `Finite`, then `Continuous` for any `Count` or `Integer` scitypes/types |
`:string_to_multiclass` | `Multiclass` for any string-like column |

Autotype can be used in conjunction with `coerce`:

`X_coerced = coerce(X, autotype(X))`

### Examples

By default, `autotype` only applies the `:few_to_finite` rule:

```
n = 50
X = (a = rand("abc", n),         # 3 values, not number        --> Multiclass
     b = rand([1,2,3,4], n),     # 4 values, number            --> OrderedFactor
     c = rand([true,false], n),  # 2 values, number but only 2 --> OrderedFactor
     d = randn(n),               # many values                 --> unchanged
     e = rand(collect(1:n), n))  # many values                 --> unchanged
autotype(X, only_changes=true)
```

`Dict{Symbol, Type}` with 3 entries: `:a => Multiclass`, `:b => OrderedFactor`, `:c => OrderedFactor`

For example, we could first apply the `:discrete_to_continuous` rule, followed by the `:few_to_finite` rule. The first rule will apply to `b` and `e`, but the subsequent application of the second rule means we get the same result as before, apart from `e` (which will be `Continuous`):

`autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))`

`Dict{Symbol, Type}` with 4 entries: `:a => Multiclass`, `:b => OrderedFactor`, `:e => Continuous`, `:c => OrderedFactor`

One should check and possibly modify the returned dictionary before passing it to `coerce`.
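A sketch of that check-then-coerce workflow, on a small deterministic table invented for illustration (the column names `grade` and `score` are hypothetical):

```julia
using ScientificTypes

X = (grade = repeat(['a', 'b', 'c'], 10),   # 3 distinct values in 30 rows
     score = collect(1.0:30.0))             # 30 distinct floats

# Dictionary of suggestions from the default rule.
suggested = autotype(X)

# Suppose we know the grades are ordered: override the suggestion by hand.
suggested[:grade] = OrderedFactor

Xc = coerce(X, suggested)
schema(Xc).scitypes
```

Editing the dictionary keeps the automatic suggestions for the columns you have not reviewed while letting domain knowledge win where it matters.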

## API reference

`ScientificTypes.scitype` — Function

The scientific type (interpretation) of `X`, as distinct from its machine type, as specified by the active convention.

**Examples**

```
julia> scitype(3.14)
Continuous

julia> scitype([1, 2, missing])
AbstractVector{Union{Missing, Count}}

julia> scitype((5, "beige"))
Tuple{Count, Textual}

julia> using CategoricalArrays

julia> X = (gender = categorical(['M', 'M', 'F', 'M', 'F']),
            ndevices = [1, 3, 2, 3, 2]);

julia> scitype(X)
Table{Union{AbstractVector{Count}, AbstractVector{Multiclass{2}}}}
```

`ScientificTypes.coerce` — Function

`coerce(A, S)`

Return a new version of the array `A` whose scientific element type is `S`.

```
julia> v = coerce([3, 7, 5], Continuous)
3-element Vector{Float64}:
3.0
7.0
5.0
julia> scitype(v)
AbstractVector{Continuous}
```

`coerce(X, specs...; tight=false, verbosity=1)`

Given a table `X`, return a copy of `X`, ensuring that the element scitypes of the columns match the new specification, `specs`. There are three valid specifications:

(i) one or more `column_name=>Scitype` pairs:

`coerce(X, col1=>Scitype1, col2=>Scitype2, ... ; verbosity=1)`

(ii) one or more `OldScitype=>NewScitype` pairs (`OldScitype` covering both the `OldScitype` and `Union{Missing,OldScitype}` cases):

`coerce(X, OldScitype1=>NewScitype1, OldScitype2=>NewScitype2, ... ; verbosity=1)`

(iii) a dictionary of scientific types keyed on column names:

`coerce(X, d::AbstractDict{<:ColKey, <:Type}; verbosity=1)`

where `ColKey = Union{Symbol,AbstractString}`.

**Examples**

Specifying `column_name=>Scitype` pairs:

```
using CategoricalArrays, DataFrames, Tables
X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
              height=[152, missing, 148, 163],
              rating=[1, 5, 2, 1])
Xc = coerce(X, :name=>Multiclass, :height=>Continuous, :rating=>OrderedFactor)
schema(Xc).scitypes # (Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})
```

Specifying `OldScitype=>NewScitype` pairs:

```
X = (x = [1, 2, 3],
     y = rand(3),
     z = [10, 20, 30])
Xc = coerce(X, Count=>Continuous)
schema(Xc).scitypes # (Continuous, Continuous, Continuous)
```
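The dictionary form, specification (iii), has no example in this docstring. The following is a minimal sketch on an invented two-column table; the explicit `Dict{Symbol,Type}` constructor is used here only to make the key and value types obvious:

```julia
using ScientificTypes

X = (x = [1, 2, 3],
     y = ["a", "b", "a"])

# Same effect as coerce(X, :x => Continuous, :y => Multiclass).
spec = Dict{Symbol,Type}(:x => Continuous, :y => Multiclass)
Xc = coerce(X, spec)
schema(Xc).scitypes   # (Continuous, Multiclass{2})
```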

`coerce(image::AbstractArray{<:Real, N}, I)`

Given an array called `image` representing one or more images, return a transformed version of the data so as to enforce an appropriate scientific interpretation `I`:

single or collection? | N | I | `scitype` of result |
---|---|---|---|
single | 2 | `GrayImage` | `GrayImage{W,H}` |
single | 3 | `ColorImage` | `ColorImage{W,H}` |
collection | 3 | `GrayImage` | `AbstractVector{<:GrayImage}` |
collection | 4 (W x H x {1} x C) | `GrayImage` | `AbstractVector{<:GrayImage}` |
collection | 4 | `ColorImage` | `AbstractVector{<:ColorImage}` |

```
julia> imgs = rand(10, 10, 3, 5);

julia> v = coerce(imgs, ColorImage);

julia> typeof(v)
Vector{Matrix{ColorTypes.RGB{Float64}}}

julia> scitype(v)
AbstractVector{ColorImage{10, 10}}
```

`ScientificTypes.autotype` — Function

`autotype(X; kw...)`

Return a dictionary of suggested scitypes for each column of `X`, a table or an array, based on rules.

**Kwargs**

- `only_changes=true`: if true, return only a dictionary of the names for which applying autotype differs from just using the ambient convention. When coercing with autotype, `only_changes` should be true.
- `rules=(:few_to_finite,)`: the set of rules to apply.