Simple Bag-of-Words Search Engine in Python
A demo of how to build a search engine using Python and the BM25 algorithm
In my previous post I showed how to build a simple semantic search engine using pre-trained transformer models.
In this post I'll show how to build a simple search engine using Python and the BM25 ranking algorithm. BM25 is a bag-of-words ranking function designed for information retrieval. It is an enhanced version of the term frequency–inverse document frequency (tf-idf) method.
The basic idea behind tf-idf is to combine the frequency of a term in the document (the more occurrences, the better) with its inverse document frequency (terms that appear in many documents are less informative). One drawback of standard tf-idf is that a long document containing a term the same number of times as a short one ends up looking less relevant.
BM25 makes a couple of enhancements to traditional tf-idf. First, BM25 adds term frequency saturation, limiting the influence of term frequency on the score: intuitively, the more frequent a term is, the less impact each additional occurrence has. Second, BM25 adds document length weighting, which ensures that document length doesn't have as dramatic a negative impact on the relevance score as it does in tf-idf.
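To make these two enhancements concrete, here is a rough sketch of the per-term Okapi BM25 weight with the commonly used defaults k1 = 1.5 and b = 0.75. This is the textbook formulation, not the exact code inside rank-bm25, and the helper name and parameters are just for illustration.
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    # Inverse document frequency: terms that occur in few documents get a higher weight
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # Document length normalization: occurrences in long documents count for less
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturation: the weight grows with tf but flattens out
    return idf * tf * (k1 + 1) / (tf + norm)

# The weight grows sub-linearly with term frequency (saturation)
for tf in [1, 2, 5, 10, 50]:
    print(tf, round(bm25_term_weight(tf, doc_len=100, avg_doc_len=100, n_docs=10000, doc_freq=50), 3))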
Building a BM25-based search engine with Python is easy. There's a great library, rank-bm25, which implements the scoring algorithm. To demonstrate the use of this library I'm going to build a simple search engine that retrieves article abstracts from arXiv and searches for the most relevant articles based on a given search term. So, let's get started.
Let's first install and import the needed Python libraries.
!pip install rank_bm25 nltk numpy feedparser
from rank_bm25 import BM25Okapi
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import feedparser
import time
We will use NLTK to tokenize the documents and to identify stopwords, so we need to download the relevant data files used by NLTK.
nltk.download('stopwords')
nltk.download('punkt')
Next we retrieve the documents from arXiv. For this we use the arXiv API. We download 10,000 abstracts from arXiv's Computation and Language (cs.CL) category. The arXiv API returns an Atom feed, which we can parse using the feedparser library. This gives us easy access to the article titles, abstracts and links to the full texts, as well as many other useful attributes. Parsing the 10,000 entries will take a couple of minutes.
num_docs = 10000
url = f"http://export.arxiv.org/api/query?search_query=cat:cs.CL&start=0&max_results={str(num_docs)}"
feed = feedparser.parse(url)
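Before preprocessing, it can be worth a quick sanity check that the feed actually parsed and contains the expected number of entries (illustrative only; the exact output depends on what arXiv returns):
# Quick sanity check of the parsed feed
print(f"Retrieved {len(feed.entries)} entries")
print(feed.entries[0].title)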
Next we remove special characters, tokenize the article abstracts and remove stopwords.
stop_words = list(stopwords.words('english'))
tokenized_docs = []
for doc in feed.entries:
    # Article abstracts are stored in entries[i].summary
    doc = str(doc.summary).lower()
    # Add spaces around special characters so they become separate tokens
    doc = re.sub(r"([\w].)([\~\!\@\#\$\%\^\&\*\(\)\-\+\[\]\{\}\/\"\'\:\;])([\s\w].)", "\\1 \\2 \\3", doc)
    # Collapse repeated whitespace
    doc = re.sub(r"\s+", " ", doc)
    # Keep only alphabetic tokens that are not stopwords
    doc = [token for token in word_tokenize(doc) if token not in stop_words and token.isalpha()]
    tokenized_docs.append(doc)
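To see what the preprocessing produces, you can inspect the first few tokens of one abstract:
# Inspect the first few tokens of the first tokenized abstract
print(tokenized_docs[0][:15])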
Now we can define the scoring function using the rank-bm25 library. This function takes the tokenized article abstracts (docs) and the tokenized search term. It also stores the index of each score so that we can later link the scores back to the correct articles.
def get_bm25_scores(tokenized_docs, query):
    # Build the BM25 index from the tokenized documents
    bm25 = BM25Okapi(tokenized_docs)
    # Score every document against the query
    scores = bm25.get_scores(query)
    # Keep each document's index alongside its score so we can map back to the articles
    scores = [(i, v) for i, v in enumerate(scores)]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores
Now let's define our search term and the number of results we want to retrieve. In this demo we will search for articles about Natural Language Inference and retrieve the top 5 articles.
num_results = 5
search_term = "Natural Language Inference"
Next we tokenize the search term and call the ranking function we defined earlier.
query = search_term.lower().split(' ')
t0 = time.time()
scores = get_bm25_scores(tokenized_docs, query)
t1 = time.time()
query_time = t1-t0
Once we have the scores we can use the indices to identify the correct entries in the original data.
idx = np.array(scores)[:num_results, 0].astype(int)
final_scores = np.array(scores)[:num_results, 1]
documents = [feed.entries[i] for i in idx]
We now have the top 5 article entries together with their scores, retrieved from the 10,000 entries in the cs.CL category. Let's print out the titles, links to the full texts and the summaries.
print(f"Searched {str(num_docs)} documents in {str(round(query_time, 3))} seconds\n")
# Article titles are stored in entries[i].title
# Article abstracts are stored in entries[i].summary
# Article links are stored in entries[i].link
for doc, score in zip(documents, final_scores):
print(f"{doc.title}\n{doc.link}\n\n{doc.summary}\n[Score: {str(round(score, 4))}]\n\n")
As you can see, the BM25 algorithm works pretty well and the rank-bm25 Python library is able to rank 10,000 article abstracts in just 0.434 seconds. In this demo the most time-consuming part was retrieving and parsing the 10,000 entries using feedparser. In real-life applications you would index the data in advance to allow faster retrieval.
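For example, even with rank-bm25 itself, a simple step in that direction is to build the BM25Okapi index once and reuse it for every query, rather than re-indexing inside the scoring function. Here is a minimal sketch reusing the tokenized_docs and feed objects from above (the helper name and the example query are just for illustration):
# Build the index once; this is the expensive part
bm25_index = BM25Okapi(tokenized_docs)

def search(bm25_index, query, num_results=5):
    # Only scoring happens per query, so repeated searches stay fast
    tokens = query.lower().split(' ')
    scores = bm25_index.get_scores(tokens)
    top_idx = np.argsort(scores)[::-1][:num_results]
    return [(feed.entries[i], scores[i]) for i in top_idx]

for entry, score in search(bm25_index, "machine translation"):
    print(f"{entry.title} [Score: {round(score, 4)}]")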
Hope you enjoyed this demo. Feel free to contact me if you have any questions.
- Twitter: @AarneTalman
- Website: talman.io