TfidfTransformer
A model type for constructing a TF-IDF transformer, based on MLJText.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
TfidfTransformer = @load TfidfTransformer pkg=MLJText
Do model = TfidfTransformer()
to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TfidfTransformer(max_doc_freq=...).
The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of TF-IDF scores. Here "TF" means term frequency while "IDF" means inverse document frequency (defined below). The TF-IDF score is the product of the two. This is a common term-weighting scheme in information retrieval that has also found good use in document classification. The goal of using TF-IDF instead of the raw frequency of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus, and that are hence empirically less informative than features occurring in a small fraction of the training corpus.
In textbooks and implementations there is variation in the definition of IDF. Here two IDF definitions are available. The default, smoothed option provides the IDF for a term t as log((1 + n)/(1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents in which t appears. Setting smooth_idf = false provides an IDF of log(n/df(t)) + 1.
Training data
In MLJ or MLJBase, bind an instance model
to data with
mach = machine(model, X)
Here:
X
is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following:
- A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
- A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
- A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.
Train the machine using fit!(mach, rows=...)
.
Hyper-parameters
max_doc_freq=1.0
: Restricts the vocabulary that the transformer will consider. Terms that occur in more than max_doc_freq of the documents will not be considered by the transformer. For example, if max_doc_freq is set to 0.9, terms appearing in more than 90% of the documents will be removed.
min_doc_freq=0.0
: Restricts the vocabulary that the transformer will consider. Terms that occur in fewer than min_doc_freq of the documents will not be considered by the transformer. A value of 0.01 means that only terms appearing in at least 1% of the documents will be included.
smooth_idf=true
: Controls which definition of IDF to use (see above).
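The document-frequency pruning can be sketched in plain Julia (an illustration with hypothetical variable names, not MLJText's internals):

```julia
# Three toy "documents", already tokenized:
docs = [["the", "cat"], ["the", "dog"], ["a", "dog"]]
n = length(docs)

# document frequency of each term, as a fraction of all documents
df = Dict{String,Float64}()
for doc in docs, t in unique(doc)
    df[t] = get(df, t, 0.0) + 1 / n
end

# keep only terms with min_doc_freq <= df <= max_doc_freq
max_doc_freq, min_doc_freq = 0.5, 0.0
vocab = sort([t for (t, f) in df if min_doc_freq <= f <= max_doc_freq])
# "the" and "dog" each appear in 2/3 of the documents (> 0.5), so they
# are pruned; vocab == ["a", "cat"]
```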
Operations
transform(mach, Xnew)
: Based on the vocabulary and IDF learned in training, return the matrix of TF-IDF scores for Xnew, a vector of the same form as X above. The matrix has size (n, p), where n = length(Xnew) and p is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
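Scoring a single new document can be sketched in plain Julia (hypothetical names and made-up IDF values, using a simple length-normalized TF convention; MLJText's exact normalization may differ):

```julia
vocab = ["cat", "dog"]              # vocabulary learned at fit! time
idf   = [1.9, 1.4]                  # corresponding learned IDF values (made up)

newdoc = ["dog", "dog", "parrot"]   # "parrot" was never seen in training

counts = [count(==(t), newdoc) for t in vocab]   # term counts: [0, 2]
tf     = counts ./ length(newdoc)                # length-normalized TF
row    = tf .* idf                  # one row of the (n, p) TF-IDF matrix

# "cat" scores zero, and the out-of-vocabulary token "parrot" is simply
# ignored: only the p = 2 learned terms get columns.
```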
Fitted parameters
The fields of fitted_params(mach)
are:
vocab
: A vector containing the strings used in the transformer's vocabulary.
idf_vector
: The transformer's calculated IDF vector.
Examples
TfidfTransformer
accepts a variety of inputs. The example below transforms tokenized documents:
using MLJ
import TextAnalysis
TfidfTransformer = @load TfidfTransformer pkg=MLJText
docs = ["Hi my name is Sam.", "How are you today?"]
tfidf_transformer = TfidfTransformer()
julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
["Hi", "my", "name", "is", "Sam", "."]
["How", "are", "you", "today", "?"]
mach = machine(tfidf_transformer, tokenized_docs)
fit!(mach)
fitted_params(mach)
tfidf_mat = transform(mach, tokenized_docs)
Alternatively, one can provide documents pre-parsed as ngrams counts:
using MLJ
import TextAnalysis
docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)
julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
"is" => 1
"my" => 1
"name" => 1
"." => 1
"Hi" => 1
"Sam" => 1
"my name" => 1
"Hi my" => 1
"name is" => 1
"Sam ." => 1
"is Sam" => 1
tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, ngram_docs)
MLJ.fit!(mach)
fitted_params(mach)
tfidf_mat = transform(mach, ngram_docs)
See also CountTransformer, BM25Transformer.