CountTransformer
A model type for constructing a count transformer, based on MLJText.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
CountTransformer = @load CountTransformer pkg=MLJText
Do model = CountTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in CountTransformer(max_doc_freq=...).
The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of term counts.
Training data
In MLJ or MLJBase, bind an instance model to data with
mach = machine(model, X)
Here:
X is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following:

- A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
- A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
- A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.
Train the machine using fit!(mach, rows=...).
Hyper-parameters
max_doc_freq=1.0: Restricts the vocabulary that the transformer will consider. Terms that occur in > max_doc_freq documents will not be considered by the transformer. For example, if max_doc_freq is set to 0.9, terms appearing in more than 90% of the documents will be removed.

min_doc_freq=0.0: Restricts the vocabulary that the transformer will consider. Terms that occur in < min_doc_freq documents will not be considered by the transformer. A value of 0.01 means that only terms appearing in at least 1% of the documents will be included.
Operations
transform(mach, Xnew): Based on the vocabulary learned in training, return the matrix of counts for Xnew, a vector of the same form as X above. The matrix has size (n, p), where n = length(Xnew) and p is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
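A short sketch of the size claim above, assuming the tokenized-document input form described under "Training data":

```julia
using MLJ
import TextAnalysis
CountTransformer = @load CountTransformer pkg=MLJText

docs = TextAnalysis.tokenize.(["Hi my name is Sam.", "How are you today?"])
mach = fit!(machine(CountTransformer(), docs))

Xnew = TextAnalysis.tokenize.(["Sam is new here."])
counts = transform(mach, Xnew)
# size(counts) == (length(Xnew), length(fitted_params(mach).vocab));
# "new" and "here" are out-of-vocabulary, so they contribute no counts.
```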
Fitted parameters
The fields of fitted_params(mach)
are:
vocab: A vector containing the strings used in the transformer's vocabulary.
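A minimal sketch of inspecting this field after training, assuming the tokenized input form from the "Training data" section:

```julia
using MLJ
import TextAnalysis
CountTransformer = @load CountTransformer pkg=MLJText

docs = TextAnalysis.tokenize.(["Hi my name is Sam.", "How are you today?"])
mach = fit!(machine(CountTransformer(), docs))

# The learned vocabulary, one string per term:
fitted_params(mach).vocab
```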
Examples
CountTransformer accepts a variety of inputs. The example below transforms tokenized documents:
using MLJ
import TextAnalysis
CountTransformer = @load CountTransformer pkg=MLJText
docs = ["Hi my name is Sam.", "How are you today?"]
count_transformer = CountTransformer()
julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
["Hi", "my", "name", "is", "Sam", "."]
["How", "are", "you", "today", "?"]
mach = machine(count_transformer, tokenized_docs)
fit!(mach)
fitted_params(mach)
count_mat = transform(mach, tokenized_docs)
Alternatively, one can provide documents pre-parsed as ngram counts:
using MLJ
import TextAnalysis
docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)
julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
"is" => 1
"my" => 1
"name" => 1
"." => 1
"Hi" => 1
"Sam" => 1
"my name" => 1
"Hi my" => 1
"name is" => 1
"Sam ." => 1
"is Sam" => 1
count_transformer = CountTransformer()
mach = machine(count_transformer, ngram_docs)
MLJ.fit!(mach)
fitted_params(mach)
count_mat = transform(mach, ngram_docs)
See also TfidfTransformer, BM25Transformer.