BM25Transformer
A model type for constructing a BM25 transformer, based on MLJText.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
BM25Transformer = @load BM25Transformer pkg=MLJText
Do model = BM25Transformer()
to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in BM25Transformer(max_doc_freq=...).
The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of Okapi BM25 document-word statistics. The BM25 scoring function uses both term frequency (TF) and inverse document frequency (IDF, defined below), as in TfidfTransformer
, but additionally adjusts for the probability that a user will consider a search result relevant, based on the terms in the search query and those in each document.
In textbooks and implementations there is variation in the definition of IDF. Here two IDF definitions are available. The default, smoothed option provides the IDF for a term t as log((1 + n)/(1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents in which t appears. Setting smooth_idf = false provides an IDF of log(n/df(t)) + 1.
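For concreteness, here is a minimal sketch of the two definitions applied to a toy corpus (illustrative only, not MLJText's internal code):

docs = [["Sam", "is", "nice"], ["Sam", "likes", "Sam"], ["hello"]]
n = length(docs)                       # total number of documents
df(t) = count(doc -> t in doc, docs)   # number of documents containing term t

idf_smooth(t) = log((1 + n) / (1 + df(t))) + 1   # default (smooth_idf = true)
idf_plain(t)  = log(n / df(t)) + 1               # smooth_idf = false

idf_smooth("Sam")   # ≈ 1.29
idf_plain("Sam")    # ≈ 1.41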
References:
- http://ethen8181.github.io/machine-learning/search/bm25_intro.html
- https://en.wikipedia.org/wiki/Okapi_BM25
- https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
Training data
In MLJ or MLJBase, bind an instance model
to data with
mach = machine(model, X)
Here:
X is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following (see the sketch after this list):
- A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
- A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
- A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.
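For illustration, here is one example of each of the three accepted forms, written out as Julia values (taken from the list above):

tokens        = ["I", "like", "Sam", ".", "Sam", "is", "nice", "."]        # AbstractVector{Textual}
bag_of_words  = Dict("I" => 1, "Sam" => 2, "Sam is" => 1)                  # Multiset{Textual}
bag_of_ngrams = Dict(("I",) => 1, ("Sam",) => 2, ("I", "Sam") => 1)        # Multiset{<:NTuple{N,Textual} where N}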
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- max_doc_freq=1.0: Restricts the vocabulary that the transformer will consider. Terms that occur in > max_doc_freq documents will not be considered by the transformer. For example, if max_doc_freq is set to 0.9, terms that are in more than 90% of the documents will be removed.
- min_doc_freq=0.0: Restricts the vocabulary that the transformer will consider. Terms that occur in < min_doc_freq documents will not be considered by the transformer. A value of 0.01 means that only terms appearing in at least 1% of the documents will be included.
- κ=2: The term frequency saturation characteristic. Higher values represent slower saturation; saturation here means the degree to which a term occurring extra times adds to the overall score (see the sketch after this list).
- β=0.75: Amplifies the particular document length compared to the average length. The bigger β is, the more document length is amplified in terms of the overall score. The value is restricted to lie between 0 and 1.
- smooth_idf=true: Controls which definition of IDF to use (see above).
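To make the roles of κ and β concrete, here is the standard Okapi BM25 weight for a single term in a single document (a sketch of how these hyper-parameters enter the score; the package's exact implementation may differ):

# tf: term frequency in the document; idf: inverse document frequency;
# doclen: document length; mean_doclen: mean document length in the corpus
function bm25_weight(tf, idf, doclen, mean_doclen; κ = 2, β = 0.75)
    idf * tf * (κ + 1) / (tf + κ * (1 - β + β * doclen / mean_doclen))
end

bm25_weight(1, 1.4, 10, 8)   # ≈ 1.24
bm25_weight(3, 1.4, 10, 8)   # ≈ 2.34 — sublinear growth in tf (saturation)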
Operations
- transform(mach, Xnew): Based on the vocabulary, IDF, and mean word counts learned in training, return the matrix of BM25 scores for Xnew, a vector of the same form as X above. The matrix has size (n, p), where n = length(Xnew) and p is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
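For example, assuming mach has been trained on tokenized documents as in the Training data section, scoring a new document containing an unseen token (here "Bob", a hypothetical example) might look like:

Xnew = [["Sam", "is", "Bob"]]
scores = transform(mach, Xnew)   # 1 × p matrix of BM25 scores; "Bob" has no
                                 # vocabulary column, so it contributes nothing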
Fitted parameters
The fields of fitted_params(mach)
are:
- vocab: A vector containing the strings used in the transformer's vocabulary.
- idf_vector: The transformer's calculated IDF vector.
- mean_words_in_docs: The mean number of words in each document.
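For example, assuming a trained machine mach as above, these fields can be inspected directly:

fp = fitted_params(mach)
fp.vocab                 # the learned vocabulary (vector of strings)
fp.idf_vector            # IDF value for each vocabulary entry
fp.mean_words_in_docs    # mean document length observed in training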
Examples
BM25Transformer
accepts a variety of inputs. The example below transforms tokenized documents:
using MLJ
import TextAnalysis
BM25Transformer = @load BM25Transformer pkg=MLJText
docs = ["Hi my name is Sam.", "How are you today?"]
bm25_transformer = BM25Transformer()
julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
["Hi", "my", "name", "is", "Sam", "."]
["How", "are", "you", "today", "?"]
mach = machine(bm25_transformer, tokenized_docs)
fit!(mach)
fitted_params(mach)
tfidf_mat = transform(mach, tokenized_docs)
Alternatively, one can provide documents pre-parsed as ngram counts:
using MLJ
import TextAnalysis
docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)
julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
"is" => 1
"my" => 1
"name" => 1
"." => 1
"Hi" => 1
"Sam" => 1
"my name" => 1
"Hi my" => 1
"name is" => 1
"Sam ." => 1
"is Sam" => 1
bm25_transformer = BM25Transformer()
mach = machine(bm25_transformer, ngram_docs)
MLJ.fit!(mach)
fitted_params(mach)
tfidf_mat = transform(mach, ngram_docs)
See also TfidfTransformer, CountTransformer.