In my previous post I showed how to build a simple semantic search engine using pre-trained transformer models.

In this post I'll show how to build a simple search engine using Python and the BM25 ranking algorithm. BM25 is a bag-of-words ranking function designed for information retrieval. It is an enhanced version of the term frequency–inverse document frequency (tf-idf) method.

The basic idea behind tf-idf is that it looks at the frequency of a term in the document (the more, the better) and at the inverse document frequency (terms that are common across the whole collection carry less weight). One drawback of standard tf-idf with length-normalized term frequency is that a long document containing a term the same number of times as a short one is scored as less relevant, purely because of its length.
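To make the idea concrete, here's a minimal tf-idf sketch on a toy corpus. This is just an illustration using length-normalized term frequency and a log-scaled idf, not part of the search engine we build below:

import math

toy_docs = [["cat", "sat", "mat"], ["cat", "cat", "hat"], ["dog", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)            # how often the term appears, normalized by document length
    df = sum(1 for d in docs if term in d)     # how many documents contain the term
    idf = math.log(len(docs) / df)             # rarer terms get a higher weight
    return tf * idf

print(tf_idf("cat", toy_docs[0], toy_docs))    # "cat" appears in 2 of 3 documents, so its weight is modest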

BM25 makes a couple of enhancements to traditional tf-idf. First, BM25 adds term frequency saturation, limiting the influence of term frequency on the score: intuitively, the more frequent a term already is, the less impact each additional occurrence has. Second, BM25 adds document length weighting, which makes sure that document length doesn't have as dramatic a negative impact on the relevance score as in tf-idf.
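Both enhancements show up in the per-term weight of the widely used Okapi BM25 variant. Here's a minimal sketch, again just an illustration: the full BM25 score multiplies this weight by an idf term and sums over the query terms, and the values k1 = 1.5 and b = 0.75 are common defaults rather than anything specific to this demo:

def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # Length normalization: documents longer than the average are penalized,
    # but only partially (controlled by b), unlike a hard division by length
    norm = 1 - b + b * doc_len / avg_doc_len
    # Saturation: the weight approaches k1 + 1 as tf grows, so each
    # additional occurrence adds less than the previous one
    return tf * (k1 + 1) / (tf + k1 * norm)

for tf in [1, 2, 5, 20]:
    print(tf, round(bm25_term_weight(tf, doc_len=100, avg_doc_len=100), 3))

For an average-length document this prints weights of 1.0, 1.429, 1.923 and 2.326: twenty occurrences of a term score barely more than five.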

Building a BM25-based search engine with Python is easy. There's a great library, rank-bm25, that implements the scoring algorithm. To demonstrate it, I'm going to build a simple search engine that retrieves article abstracts from arXiv and finds the most relevant articles for a given search term. So, let's get started.

Let's first install and import the needed Python libraries.

!pip install rank_bm25 nltk numpy feedparser
from rank_bm25 import BM25Okapi
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import feedparser
import time

We will use NLTK to tokenize the documents and to identify stopwords, so we need to download the relevant data files used by NLTK.

nltk.download('stopwords')
nltk.download('punkt')

Next we retrieve the documents from arXiv. For this we use the arXiv API. We download 10,000 abstracts from arXiv's Computation and Language (cs.CL) category. The arXiv API returns an Atom feed, which we can parse with the feedparser library. This gives us easy access to the article titles, abstracts, and links to the full texts, as well as many other useful attributes. Retrieving and parsing the 10,000 entries will take a couple of minutes.

num_docs = 10000
url = f"http://export.arxiv.org/api/query?search_query=cat:cs.CL&start=0&max_results={num_docs}"
feed = feedparser.parse(url)
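Grabbing all 10,000 entries in a single request keeps the demo short, but the arXiv API documentation asks clients to fetch large result sets in smaller slices with a pause between requests. If the single big request gives you trouble, a paged version along these lines should be friendlier (the page size and delay here are my own choices):

entries = []
page_size = 2000
for start in range(0, num_docs, page_size):
    page_url = (f"http://export.arxiv.org/api/query?search_query=cat:cs.CL"
                f"&start={start}&max_results={page_size}")
    entries.extend(feedparser.parse(page_url).entries)
    time.sleep(3)  # be polite to the API between requests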

Next we remove any special characters, tokenize the article abstracts, and remove any stopwords.

stop_words = set(stopwords.words('english'))  # a set makes membership checks fast
tokenized_docs = []
for entry in feed.entries:
    # Article abstracts are stored in entries[i].summary
    doc = str(entry.summary).lower()
    # Put spaces around special characters so they become separate tokens
    doc = re.sub(r"([\w].)([\~\!\@\#\$\%\^\&\*\(\)\-\+\[\]\{\}\/\"\'\:\;])([\s\w].)", "\\1 \\2 \\3", doc)
    # Collapse runs of whitespace into single spaces
    doc = re.sub(r"\s+", " ", doc)
    # Keep only alphabetic tokens that are not stopwords
    doc = [token for token in word_tokenize(doc) if token not in stop_words and token.isalpha()]
    tokenized_docs.append(doc)
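As a quick sanity check, we can peek at the first few tokens of the first abstract (the exact output depends on which articles the API happens to return):

print(tokenized_docs[0][:10])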

Now we can define the scoring function using the rank-bm25 library. This function takes the tokenized article abstracts (tokenized_docs) and the tokenized search term. It also pairs each score with its document index so that we can later link the scores back to the correct articles.

def get_bm25_scores(tokenized_docs, query):
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query)
    # Pair each score with its document index, then sort by score, highest first
    scores = [(i, v) for i, v in enumerate(scores)]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores
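Note that this function rebuilds the BM25 index on every call, which is fine for a one-off demo but wasteful if you run many queries against the same corpus. In that case you would build the index once and reuse it, along these lines:

bm25 = BM25Okapi(tokenized_docs)  # build the index once
nli_scores = bm25.get_scores(["language", "inference"])
mt_scores = bm25.get_scores(["machine", "translation"])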

Now let's define our search term and the number of results we want to retrieve. In this demo we search for articles about Natural Language Inference and retrieve the top 5 articles.

num_results = 5
search_term = "Natural Language Inference"

Next we tokenize the search term and call the ranking function we defined earlier. For simplicity the query is just lowercased and split on whitespace.

query = search_term.lower().split(' ')

t0 = time.time()
scores = get_bm25_scores(tokenized_docs, query)
t1 = time.time()
query_time = t1 - t0

Once we have the scores we can use the indices to identify the correct entries in the original data.

top = np.array(scores)[:num_results]
idx = top[:, 0].astype(int)
final_scores = top[:, 1]
documents = [feed.entries[i] for i in idx]
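If you only need the matching documents and not the scores, rank-bm25 also provides a convenience method, get_top_n, which returns the top n documents directly. It returns whatever documents you pass in (here the tokenized abstracts) without scores or indices, which is why we sort the scores manually above:

bm25 = BM25Okapi(tokenized_docs)
top_docs = bm25.get_top_n(query, tokenized_docs, n=num_results)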

We now have the top 5 article entries together with their scores, retrieved from the 10,000 entries in the cs.CL category. Let's print out the titles, links to the full texts, and the summaries.

print(f"Searched {str(num_docs)} documents in {str(round(query_time, 3))} seconds\n")
# Article titles are stored in entries[i].title
# Article abstracts are stored in entries[i].summary
# Article links are stored in entries[i].link
for doc, score in zip(documents, final_scores):
    print(f"{doc.title}\n{doc.link}\n\n{doc.summary}\n[Score: {str(round(score, 4))}]\n\n")
Searched 10000 documents in 0.434 seconds

Stochastic Answer Networks for Natural Language Inference
http://arxiv.org/abs/1804.07888v2

We propose a stochastic answer network (SAN) to explore multi-step inference
strategies in Natural Language Inference. Rather than directly predicting the
results given the inputs, the model maintains a state and iteratively refines
its predictions. Our experiments show that SAN achieves the state-of-the-art
results on three benchmarks: Stanford Natural Language Inference (SNLI)
dataset, MultiGenre Natural Language Inference (MultiNLI) dataset and Quora
Question Pairs dataset.
[Score: 8.8786]


Attention Boosted Sequential Inference Model
http://arxiv.org/abs/1812.01840v2

Attention mechanism has been proven effective on natural language processing.
This paper proposes an attention boosted natural language inference model named
aESIM by adding word attention and adaptive direction-oriented attention
mechanisms to the traditional Bi-LSTM layer of natural language inference
models, e.g. ESIM. This makes the inference model aESIM has the ability to
effectively learn the representation of words and model the local subsentential
inference between pairs of premise and hypothesis. The empirical studies on the
SNLI, MultiNLI and Quora benchmarks manifest that aESIM is superior to the
original ESIM model.
[Score: 8.591]


A Neural Architecture Mimicking Humans End-to-End for Natural Language
  Inference
http://arxiv.org/abs/1611.04741v2

In this work we use the recent advances in representation learning to propose
a neural architecture for the problem of natural language inference. Our
approach is aligned to mimic how a human does the natural language inference
process given two statements. The model uses variants of Long Short Term Memory
(LSTM), attention mechanism and composable neural networks, to carry out the
task. Each part of our model can be mapped to a clear functionality humans do
for carrying out the overall task of natural language inference. The model is
end-to-end differentiable enabling training by stochastic gradient descent. On
Stanford Natural Language Inference(SNLI) dataset, the proposed model achieves
better accuracy numbers than all published models in literature.
[Score: 8.5675]


Neural Natural Language Inference Models Enhanced with External
  Knowledge
http://arxiv.org/abs/1711.04289v3

Modeling natural language inference is a very challenging task. With the
availability of large annotated data, it has recently become feasible to train
complex models such as neural-network-based inference models, which have shown
to achieve the state-of-the-art performance. Although there exist relatively
large annotated data, can machines learn all knowledge needed to perform
natural language inference (NLI) from these data? If not, how can
neural-network-based NLI models benefit from external knowledge and how to
build NLI models to leverage it? In this paper, we enrich the state-of-the-art
neural natural language inference models with external knowledge. We
demonstrate that the proposed models improve neural NLI models to achieve the
state-of-the-art performance on the SNLI and MultiNLI datasets.
[Score: 8.2145]


Baselines and test data for cross-lingual inference
http://arxiv.org/abs/1704.05347v2

The recent years have seen a revival of interest in textual entailment,
sparked by i) the emergence of powerful deep neural network learners for
natural language processing and ii) the timely development of large-scale
evaluation datasets such as SNLI. Recast as natural language inference, the
problem now amounts to detecting the relation between pairs of statements: they
either contradict or entail one another, or they are mutually neutral. Current
research in natural language inference is effectively exclusive to English. In
this paper, we propose to advance the research in SNLI-style natural language
inference toward multilingual evaluation. To that end, we provide test data for
four major languages: Arabic, French, Spanish, and Russian. We experiment with
a set of baselines. Our systems are based on cross-lingual word embeddings and
machine translation. While our best system scores an average accuracy of just
over 75%, we focus largely on enabling further research in multilingual
inference.
[Score: 8.1544]


As you can see, the BM25 algorithm works pretty well, and the rank-bm25 Python library is able to rank 10,000 article abstracts in just 0.434 seconds. In this demo the most time-consuming part was retrieving and parsing the 10,000 entries with feedparser. In real-life applications you would index the data up front to allow faster retrieval.

Hope you enjoyed this demo. Feel free to contact me if you have any questions.