BM25Transformer
A model type for constructing a BM25 transformer, based on MLJText.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
BM25Transformer = @load BM25Transformer pkg=MLJText
Do model = BM25Transformer()
to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in BM25Transformer(max_doc_freq=...).
The transformer converts a collection of documents, tokenized or pre-parsed as bags of words/ngrams, to a matrix of Okapi BM25 document-word statistics. The BM25 scoring function uses both term frequency (TF) and inverse document frequency (IDF, defined below), as in TfidfTransformer
, but additionally adjusts for the probability that a user will consider a search result relevant, based on the terms in the search query and those in each document.
In textbooks and implementations there is variation in the definition of IDF. Here two IDF definitions are available. The default, smoothed option provides the IDF for a term t as log((1 + n)/(1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents in which t appears. Setting smooth_idf = false provides an IDF of log(n/df(t)) + 1.
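For concreteness, here is a minimal sketch of the two definitions applied to a toy corpus (illustrative only, not MLJText's internal code):

docs = [["Sam", "is", "nice"], ["Sam", "likes", "Sam"], ["hello"]]
n = length(docs)                       # total number of documents
df(t) = count(doc -> t in doc, docs)   # number of documents containing term t

idf_smooth(t) = log((1 + n) / (1 + df(t))) + 1   # default (smooth_idf = true)
idf_plain(t)  = log(n / df(t)) + 1               # smooth_idf = false

idf_smooth("Sam")   # ≈ 1.29
idf_plain("Sam")    # ≈ 1.41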
References:
- http://ethen8181.github.io/machine-learning/search/bm25_intro.html
- https://en.wikipedia.org/wiki/Okapi_BM25
- https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
Training data
In MLJ or MLJBase, bind an instance model
to data with
mach = machine(model, X)
Here:
X is any vector whose elements are either tokenized documents or bags of words/ngrams. Specifically, each element is one of the following (see the sketch after this list):
- A vector of abstract strings (tokens), e.g., ["I", "like", "Sam", ".", "Sam", "is", "nice", "."] (scitype AbstractVector{Textual})
- A dictionary of counts, indexed on abstract strings, e.g., Dict("I"=>1, "Sam"=>2, "Sam is"=>1) (scitype Multiset{Textual})
- A dictionary of counts, indexed on plain ngrams, e.g., Dict(("I",)=>1, ("Sam",)=>2, ("I", "Sam")=>1) (scitype Multiset{<:NTuple{N,Textual} where N}); here a plain ngram is a tuple of abstract strings.
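For illustration, here is one example of each of the three accepted forms, written out as Julia values (taken from the list above):

tokens        = ["I", "like", "Sam", ".", "Sam", "is", "nice", "."]        # AbstractVector{Textual}
bag_of_words  = Dict("I" => 1, "Sam" => 2, "Sam is" => 1)                  # Multiset{Textual}
bag_of_ngrams = Dict(("I",) => 1, ("Sam",) => 2, ("I", "Sam") => 1)        # Multiset{<:NTuple{N,Textual} where N}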
Train the machine using fit!(mach, rows=...).
Hyper-parameters
- max_doc_freq=1.0: Restricts the vocabulary that the transformer will consider. Terms that occur in > max_doc_freq documents will not be considered by the transformer. For example, if max_doc_freq is set to 0.9, terms that are in more than 90% of the documents will be removed.
- min_doc_freq=0.0: Restricts the vocabulary that the transformer will consider. Terms that occur in < min_doc_freq documents will not be considered by the transformer. A value of 0.01 means that only terms appearing in at least 1% of the documents will be included.
- κ=2: The term frequency saturation characteristic. Higher values represent slower saturation; saturation here means the degree to which a term occurring extra times adds to the overall score (see the sketch after this list).
- β=0.75: Amplifies the particular document length compared to the average length. The bigger β is, the more document length is amplified in terms of the overall score. The value is restricted to lie between 0 and 1.
- smooth_idf=true: Controls which definition of IDF to use (see above).
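To make the roles of κ and β concrete, here is the standard Okapi BM25 weight for a single term in a single document (a sketch of how these hyper-parameters enter the score; the package's exact implementation may differ):

# tf: term frequency in the document; idf: inverse document frequency;
# doclen: document length; mean_doclen: mean document length in the corpus
function bm25_weight(tf, idf, doclen, mean_doclen; κ = 2, β = 0.75)
    idf * tf * (κ + 1) / (tf + κ * (1 - β + β * doclen / mean_doclen))
end

bm25_weight(1, 1.4, 10, 8)   # ≈ 1.24
bm25_weight(3, 1.4, 10, 8)   # ≈ 2.34 — sublinear growth in tf (saturation)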
Operations
- transform(mach, Xnew): Based on the vocabulary, IDF, and mean word counts learned in training, return the matrix of BM25 scores for Xnew, a vector of the same form as X above. The matrix has size (n, p), where n = length(Xnew) and p is the size of the vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
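For example, assuming mach has been trained on tokenized documents as in the Training data section, scoring a new document containing an unseen token (here "Bob", a hypothetical example) might look like:

Xnew = [["Sam", "is", "Bob"]]
scores = transform(mach, Xnew)   # 1 × p matrix of BM25 scores; "Bob" has no
                                 # vocabulary column, so it contributes nothing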
Fitted parameters
The fields of fitted_params(mach)
are:
- vocab: A vector containing the strings used in the transformer's vocabulary.
- idf_vector: The transformer's calculated IDF vector.
- mean_words_in_docs: The mean number of words in each document.
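For example, assuming a trained machine mach as above, these fields can be inspected directly:

fp = fitted_params(mach)
fp.vocab                 # the learned vocabulary (vector of strings)
fp.idf_vector            # IDF value for each vocabulary entry
fp.mean_words_in_docs    # mean document length observed in training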
Examples
BM25Transformer
accepts a variety of inputs. The example below transforms tokenized documents:
using MLJ
import TextAnalysis
BM25Transformer = @load BM25Transformer pkg=MLJText
docs = ["Hi my name is Sam.", "How are you today?"]
bm25_transformer = BM25Transformer()
julia> tokenized_docs = TextAnalysis.tokenize.(docs)
2-element Vector{Vector{String}}:
["Hi", "my", "name", "is", "Sam", "."]
["How", "are", "you", "today", "?"]
mach = machine(bm25_transformer, tokenized_docs)
fit!(mach)
fitted_params(mach)
tfidf_mat = transform(mach, tokenized_docs)
Alternatively, one can provide documents pre-parsed as ngram counts:
using MLJ
import TextAnalysis
docs = ["Hi my name is Sam.", "How are you today?"]
corpus = TextAnalysis.Corpus(TextAnalysis.NGramDocument.(docs, 1, 2))
ngram_docs = TextAnalysis.ngrams.(corpus)
julia> ngram_docs[1]
Dict{AbstractString, Int64} with 11 entries:
"is" => 1
"my" => 1
"name" => 1
"." => 1
"Hi" => 1
"Sam" => 1
"my name" => 1
"Hi my" => 1
"name is" => 1
"Sam ." => 1
"is Sam" => 1
bm25_transformer = BM25Transformer()
mach = machine(bm25_transformer, ngram_docs)
MLJ.fit!(mach)
fitted_params(mach)
tfidf_mat = transform(mach, ngram_docs)
See also TfidfTransformer, CountTransformer.