TfidfTransformer

A model type for constructing a TF-IDF transformer, based on MLJText.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

TfidfTransformer = @load TfidfTransformer pkg=MLJText

Do model = TfidfTransformer() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in TfidfTransformer(max_doc_freq=...).

The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of TF-IDF scores. Here "TF" means term frequency while "IDF" means inverse document frequency (defined below). The TF-IDF score is the product of the two. This is a common term weighting scheme in information retrieval that has also proved useful in document classification. The goal of using TF-IDF instead of the raw frequency of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus, and that are hence empirically less informative than features occurring in a small fraction of the training corpus.

In textbooks and implementations there is variation in the definition of IDF. Here two IDF definitions are available. The default, smoothed option provides the IDF for a term t as log((1 + n)/(1 + df(t))) + 1, where n is the total number of documents and df(t) the number of documents in which t appears. Setting smooth_idf = false provides an IDF of log(n/df(t)) + 1.
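The two definitions can be sketched in plain Julia on a toy corpus (the helper names `df`, `idf_smooth` and `idf_plain` are for illustration only and are not part of the MLJText API):

```julia
# Toy corpus: each document is a vector of tokens.
docs = [["cat", "dog"], ["cat", "bird"], ["cat"]]
n = length(docs)

df(t) = count(d -> t in d, docs)                # document frequency of term t
idf_smooth(t) = log((1 + n)/(1 + df(t))) + 1    # default (smooth_idf = true)
idf_plain(t)  = log(n/df(t)) + 1                # smooth_idf = false

idf_smooth("cat")   # 1.0, since "cat" appears in every document
```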

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X)

Here:

  • X is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following:

    • A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
    • A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
    • A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.

Train the machine using fit!(mach, rows=...).
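For concreteness, here is one document expressed in each of the three accepted forms (a hand-built sketch; the dictionaries are ordinary Julia `Dict`s):

```julia
# The same document in the three accepted element types:
tokens = ["Sam", "is", "nice", "Sam"]                 # vector of tokens
bag    = Dict("Sam" => 2, "is" => 1, "nice" => 1)     # bag of words (counts)
ngrams = Dict(("Sam",) => 2, ("is",) => 1,
              ("Sam", "is") => 1)                     # bag of plain ngrams
```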

Hyper-parameters

  • max_doc_freq=1.0: Restricts the vocabulary that the transformer will consider. Terms that occur in more than max_doc_freq (as a fraction of) documents will not be considered by the transformer. For example, if max_doc_freq is set to 0.9, terms that appear in more than 90% of the documents will be removed.
  • min_doc_freq=0.0: Restricts the vocabulary that the transformer will consider. Terms that occur in fewer than min_doc_freq (as a fraction of) documents will not be considered by the transformer. A value of 0.01 means that only terms appearing in at least 1% of the documents will be included.
  • smooth_idf=true: Controls which definition of IDF is used (see above).
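For example, to drop very common and very rare terms and use the unsmoothed IDF definition (the threshold values here are illustrative, and TfidfTransformer is assumed to have been loaded as shown above):

```julia
# Keep only terms appearing in at most 90% and at least 1% of documents:
tfidf_transformer = TfidfTransformer(
    max_doc_freq = 0.9,
    min_doc_freq = 0.01,
    smooth_idf   = false,
)
```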

Operations

  • transform(mach, Xnew): Based on the vocabulary and IDF learned in training, return the matrix of TF-IDF scores for Xnew, a vector of the same form as X above. The matrix has size (n, p), where n = length(Xnew) and p is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
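Schematically, each row of the returned matrix pairs a document's term frequencies with the learned IDF vector. The hand-rolled sketch below uses one common TF normalization (count divided by document length); the exact normalization used internally may differ:

```julia
vocab = ["cat", "dog"]                   # learned vocabulary
idf   = [1.0, 2.1]                       # learned IDF vector
doc   = ["cat", "cat", "fish"]           # "fish" is out of vocabulary
tf(t) = count(==(t), doc) / length(doc)  # one common TF normalization
row   = [tf(vocab[j]) * idf[j] for j in eachindex(vocab)]
# "fish" contributes nothing; the "dog" entry is zero.
```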

Fitted parameters

The fields of fitted_params(mach) are:

  • vocab: A vector containing the strings used in the transformer's vocabulary.
  • idf_vector: The transformer's calculated IDF vector.

Examples

TfidfTransformer accepts a variety of inputs. The example below transforms tokenized documents:

using MLJ
import TextAnalysis

TfidfTransformer = @load TfidfTransformer pkg=MLJText

docs = ["Hi my name is Sam.", "How are you today?"]
tfidf_transformer = TfidfTransformer()

julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
 ["Hi", "my", "name", "is", "Sam", "."]
 ["How", "are", "you", "today", "?"]

mach = machine(tfidf_transformer, tokenized_docs)
fit!(mach)

fitted_params(mach)

tfidf_mat = transform(mach, tokenized_docs)

Alternatively, one can provide documents pre-parsed as ngrams counts:

using MLJ
import TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)

julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
  "is"      => 1
  "my"      => 1
  "name"    => 1
  "."       => 1
  "Hi"      => 1
  "Sam"     => 1
  "my name" => 1
  "Hi my"   => 1
  "name is" => 1
  "Sam ."   => 1
  "is Sam"  => 1

tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, ngram_docs)
MLJ.fit!(mach)
fitted_params(mach)

tfidf_mat = transform(mach, ngram_docs)

See also CountTransformer, BM25Transformer