# CountTransformer

A model type for constructing a count transformer, based on MLJText.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

```julia
CountTransformer = @load CountTransformer pkg=MLJText
```

Do `model = CountTransformer()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `CountTransformer(max_doc_freq=...)`.
The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of term counts.
## Training data

In MLJ or MLJBase, bind an instance `model` to data with

```julia
mach = machine(model, X)
```

Here:

- `X` is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following:

  - A vector of abstract strings (tokens), e.g., `["I", "like", "Sam", ".", "Sam", "is", "nice", "."]` (scitype `AbstractVector{Textual}`)

  - A dictionary of counts, indexed on abstract strings, e.g., `Dict("I"=>1, "Sam"=>2, "Sam is"=>1)` (scitype `Multiset{Textual}`)

  - A dictionary of counts, indexed on plain ngrams, e.g., `Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1)` (scitype `Multiset{<:NTuple{N,Textual} where N}`); here a plain ngram is a tuple of abstract strings.
Train the machine using `fit!(mach, rows=...)`.
## Hyper-parameters

- `max_doc_freq=1.0`: Restricts the vocabulary that the transformer will consider. Terms that occur in `> max_doc_freq` documents will not be considered by the transformer. For example, if `max_doc_freq` is set to 0.9, terms that are in more than 90% of the documents will be removed.

- `min_doc_freq=0.0`: Restricts the vocabulary that the transformer will consider. Terms that occur in `< min_doc_freq` documents will not be considered by the transformer. A value of 0.01 means that only terms appearing in at least 1% of the documents will be included.
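The effect of the two thresholds can be sketched in plain Julia (this illustrates the idea only; it is not MLJText's internal implementation):

```julia
# Toy corpus of tokenized documents.
docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"]]

n = length(docs)

# Document frequency: the fraction of documents each term appears in.
counts = Dict{String,Int}()
for doc in docs, term in unique(doc)
    counts[term] = get(counts, term, 0) + 1
end
doc_freq = Dict(t => c / n for (t, c) in counts)

# Keep only terms with min_doc_freq <= df <= max_doc_freq.
min_doc_freq, max_doc_freq = 0.5, 0.9
vocab = sort([t for (t, df) in doc_freq if min_doc_freq <= df <= max_doc_freq])
# "the" (df = 1.0) exceeds max_doc_freq; "sat" and "dog" (df ≈ 0.33) fall
# below min_doc_freq; only "cat" and "ran" (df ≈ 0.67) survive.
```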
## Operations

- `transform(mach, Xnew)`: Based on the vocabulary learned in training, return the matrix of counts for `Xnew`, a vector of the same form as `X` above. The matrix has size `(n, p)`, where `n = length(Xnew)` and `p` is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
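The shape and zero-count behaviour described above can be sketched in plain Julia (a simplified illustration, not MLJText's actual code):

```julia
vocab = ["cat", "ran"]                        # vocabulary learned at fit time
term_index = Dict(t => j for (j, t) in enumerate(vocab))

Xnew = [["cat", "cat", "flew"],               # "flew" is out-of-vocabulary
        ["ran", "cat"]]

counts = zeros(Int, length(Xnew), length(vocab))    # size (n, p)
for (i, doc) in enumerate(Xnew), term in doc
    j = get(term_index, term, 0)
    j == 0 && continue                        # unseen tokens score zero
    counts[i, j] += 1
end
# counts == [2 0; 1 1]
```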
## Fitted parameters

The fields of `fitted_params(mach)` are:

- `vocab`: A vector containing the strings used in the transformer's vocabulary.
## Examples

`CountTransformer` accepts a variety of inputs. The example below transforms tokenized documents:
```julia
using MLJ
import TextAnalysis

CountTransformer = @load CountTransformer pkg=MLJText

docs = ["Hi my name is Sam.", "How are you today?"]
count_transformer = CountTransformer()
```

```julia-repl
julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
 ["Hi", "my", "name", "is", "Sam", "."]
 ["How", "are", "you", "today", "?"]
```

```julia
mach = machine(count_transformer, tokenized_docs)
fit!(mach)

fitted_params(mach)

count_mat = transform(mach, tokenized_docs)
```

Alternatively, one can provide documents pre-parsed as ngram counts:
```julia
using MLJ
import TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)
```

```julia-repl
julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
  "is"      => 1
  "my"      => 1
  "name"    => 1
  "."       => 1
  "Hi"      => 1
  "Sam"     => 1
  "my name" => 1
  "Hi my"   => 1
  "name is" => 1
  "Sam ."   => 1
  "is Sam"  => 1
```

```julia
count_transformer = CountTransformer()
mach = machine(count_transformer, ngram_docs)
MLJ.fit!(mach)

fitted_params(mach)

count_mat = transform(mach, ngram_docs)
```

See also `TfidfTransformer`, `BM25Transformer`.