CountTransformer

A model type for constructing a count transformer, based on MLJText.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

CountTransformer = @load CountTransformer pkg=MLJText

Do model = CountTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CountTransformer(max_doc_freq=...).

The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of term counts.

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

Here:

  • X is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following:

    • A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
    • A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
    • A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.
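
To make the relationship between the input forms concrete, here is a minimal sketch in plain Julia that turns a token vector into the two dictionary forms listed above. The helper names bag_of_words and bag_of_bigrams are hypothetical and are not part of MLJText; only Base Julia is assumed.

```julia
# Hypothetical helper (not part of MLJText): count occurrences of each token.
function bag_of_words(tokens::AbstractVector{<:AbstractString})
    bag = Dict{String,Int}()
    for t in tokens
        bag[t] = get(bag, t, 0) + 1
    end
    return bag
end

# Hypothetical helper: count adjacent token pairs, keyed on tuples of strings
# (the "plain ngram" form described above, here with N = 2).
function bag_of_bigrams(tokens::AbstractVector{<:AbstractString})
    bag = Dict{Tuple{String,String},Int}()
    for i in 1:length(tokens)-1
        key = (String(tokens[i]), String(tokens[i+1]))
        bag[key] = get(bag, key, 0) + 1
    end
    return bag
end

tokens = ["I", "like", "Sam", ".", "Sam", "is", "nice", "."]
words = bag_of_words(tokens)      # dictionary indexed on strings
bigrams = bag_of_bigrams(tokens)  # dictionary indexed on tuples of strings
```

Either dictionary, collected into a vector with one entry per document, matches the shape of X described above.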

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • max_doc_freq=1.0: Restricts the vocabulary that the transformer will consider. Terms that occur in a fraction of documents greater than max_doc_freq are excluded. For example, if max_doc_freq is set to 0.9, terms that appear in more than 90% of the documents are removed.
  • min_doc_freq=0.0: Restricts the vocabulary that the transformer will consider. Terms that occur in a fraction of documents less than min_doc_freq are excluded. For example, a value of 0.01 means that only terms appearing in at least 1% of the documents are included.
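
The effect of the two thresholds can be sketched in plain Julia as follows. This is an illustration of the semantics described above, not MLJText's implementation: the helper filter_vocab is hypothetical, and treating both bounds as inclusive is an assumption based on the wording of the hyper-parameter descriptions.

```julia
# Hypothetical helper (not part of MLJText): keep the terms whose document
# frequency, as a fraction of all documents, lies in [min_doc_freq, max_doc_freq].
# Inclusive bounds are an assumption; MLJText may treat the boundaries differently.
function filter_vocab(docs::AbstractVector{<:AbstractVector{<:AbstractString}};
                      min_doc_freq=0.0, max_doc_freq=1.0)
    n = length(docs)
    doc_counts = Dict{String,Int}()
    for doc in docs
        for term in unique(doc)   # count each term at most once per document
            doc_counts[term] = get(doc_counts, term, 0) + 1
        end
    end
    kept = [t for (t, c) in doc_counts if min_doc_freq <= c / n <= max_doc_freq]
    return sort(kept)
end

docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"],
        ["the", "bird"]]

common_removed = filter_vocab(docs; max_doc_freq=0.9)  # drops "the", present in every document
rare_removed = filter_vocab(docs; min_doc_freq=0.5)    # keeps terms in at least half the documents
```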

Operations

  • transform(mach, Xnew): Based on the vocabulary learned in training, return the matrix of counts for Xnew, a vector of the same form as X above. The matrix has size (n, p), where n = length(Xnew) and p is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
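
As a rough illustration of what the returned matrix contains, the sketch below counts vocabulary terms per document in plain Julia. The helper count_matrix is hypothetical and uses a dense integer matrix for readability; the concrete matrix type returned by transform may differ.

```julia
# Hypothetical helper (not MLJText's implementation): build an (n, p) matrix
# whose (i, j) entry is the number of times vocab[j] occurs in docs[i].
function count_matrix(vocab::AbstractVector{<:AbstractString},
                      docs::AbstractVector{<:AbstractVector{<:AbstractString}})
    index = Dict(term => j for (j, term) in enumerate(vocab))
    counts = zeros(Int, length(docs), length(vocab))
    for (i, doc) in enumerate(docs)
        for token in doc
            j = get(index, token, 0)
            j == 0 && continue    # token outside the vocabulary contributes nothing
            counts[i, j] += 1
        end
    end
    return counts
end

vocab = ["Sam", "is", "nice"]
Xnew = [["Sam", "is", "nice", "Sam"],
        ["Max", "is", "here"]]     # "Max" and "here" are not in the vocabulary

counts = count_matrix(vocab, Xnew)
```

Here counts has size (2, 3): one row per document in Xnew and one column per vocabulary term, with out-of-vocabulary tokens leaving no trace in the result.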

Fitted parameters

The fields of fitted_params(mach) are:

  • vocab: A vector containing the strings used in the transformer's vocabulary.

Examples

CountTransformer accepts a variety of inputs. The example below transforms tokenized documents:

using MLJ
import TextAnalysis

CountTransformer = @load CountTransformer pkg=MLJText

docs = ["Hi my name is Sam.", "How are you today?"]
count_transformer = CountTransformer()

julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
 ["Hi", "my", "name", "is", "Sam", "."]
 ["How", "are", "you", "today", "?"]

mach = machine(count_transformer, tokenized_docs)
fit!(mach)

fitted_params(mach)

count_mat = transform(mach, tokenized_docs)

Alternatively, one can provide documents pre-parsed as ngram counts:

using MLJ
import TextAnalysis

CountTransformer = @load CountTransformer pkg=MLJText

docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)

julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
  "is"      => 1
  "my"      => 1
  "name"    => 1
  "."       => 1
  "Hi"      => 1
  "Sam"     => 1
  "my name" => 1
  "Hi my"   => 1
  "name is" => 1
  "Sam ."   => 1
  "is Sam"  => 1

count_transformer = CountTransformer()
mach = machine(count_transformer, ngram_docs)
fit!(mach)
fitted_params(mach)

count_mat = transform(mach, ngram_docs)

See also TfidfTransformer, BM25Transformer