In the previous two posts I've explored the HuggingFace Transformers library and demonstrated:

  1. how to train a classification model for NLI

  2. how to use an NLI model to rank news articles based on a keyword

In this post I will show how to use a pre-trained sentence encoder model to create a simple semantic search engine for website content. The search engine will be "semantic" in the sense that it will try to find sentences from the website whose vector representation "matches" the vector representation of the search term.

We will use a pre-trained transformer model to encode all the sentences of a website as well as the search term, and then calculate the cosine similarity between each encoded sentence and the search term. We will then rank the sentences based on that similarity.
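
To make the ranking idea concrete before we touch any real data, here is a minimal sketch with made-up toy vectors (not real sentence embeddings) showing how cosine similarity orders candidates against a query vector:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy example: three 4-dimensional "sentence" vectors and one "query" vector
sentence_vectors = np.array([[0.9, 0.1, 0.0, 0.0],
                             [0.0, 1.0, 0.0, 0.1],
                             [0.7, 0.2, 0.1, 0.0]])
query_vector = np.array([[1.0, 0.0, 0.0, 0.0]])

# One similarity score per sentence; higher means a closer match
scores = cosine_similarity(query_vector, sentence_vectors)[0]

# Sentence indices ordered from best to worst match
ranking = scores.argsort()[::-1]
print(scores, ranking)

In the real demo below, the only difference is that the vectors come from a pre-trained sentence encoder instead of being hand-written.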

Let's get started. As before, we will first install the libraries we need for this demo. We are using a library called Sentence Transformers, which provides pre-trained transformer models specifically for computing sentence-level vector representations. We also need scikit-learn, which we will use to calculate the cosine similarities between sentence vectors, as well as lxml and BeautifulSoup for parsing the website HTML.

!pip install sentence-transformers scikit-learn lxml beautifulsoup4

Once we have installed the libraries, we import them together with some other useful libraries we will use in the demo.

from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
from IPython.display import Markdown, display

As in the previous post, we define some helper functions for cleaning up the text from the website HTML, as well as a function to print our output in Markdown format.

def tag_visible(element):
    # Skip HTML comments
    if isinstance(element, Comment):
        return False
    # Keep only text that sits directly inside a <p> tag
    if element.parent.name in ['p']:
        return True
    return False

def text_from_html(html):
    # Parse the raw HTML of the response and collect all text nodes
    soup = BeautifulSoup(html.content, 'lxml')
    texts = soup.findAll(text=True)
    # Keep only the text nodes we consider visible content
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

def printmd(string):
    display(Markdown(string))
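
To see what the extraction helper actually keeps, here is a small, purely illustrative check with a fake response object and a made-up HTML snippet (both exist only for this example): only text sitting directly inside a <p> tag survives.

from types import SimpleNamespace

# Fake "response" with a .content attribute, mimicking what requests.get() returns
fake_response = SimpleNamespace(
    content=b"<html><p>Deep learning is a family of methods.</p>"
            b"<div>Navigation menu text we do not want.</div></html>")

print(text_from_html(fake_response))
# Prints: Deep learning is a family of methods.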

Now we can define the website we are interested in, together with the search term and the number of search results we want to see. In this demo we are interested in finding the top five sentences from the Deep Learning Wikipedia article that are about Natural Language Processing.

website = 'https://en.wikipedia.org/wiki/Deep_learning'
search_term = 'natural language processing'
num_results = 5

We then retrieve the website, extract the text, and split it into a list of sentences (here simply splitting on periods).

html = requests.get(website)
src_text = text_from_html(html)
# Crude sentence splitting on periods
input_text = src_text.split('.')
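
Before encoding anything, it can be useful to sanity-check the extraction. A quick look at how many sentence candidates we got (the exact numbers will change as the Wikipedia article is edited):

# Quick sanity check of the extracted text; counts will vary over time
print(f'Extracted {len(input_text)} sentence candidates')
print(input_text[:3])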

For our model we use the roberta-base-nli-stsb-mean-tokens model, which according to the Sentence Transformers GitHub page performs strongly on the Semantic Textual Similarity (STS) benchmark.

model = SentenceTransformer('roberta-base-nli-stsb-mean-tokens')

Next we encode the list of sentences retrieved from the website, as well as the search term, using the Sentence Transformer model.

encoded_text = np.array(model.encode(input_text))
encoded_query = np.array(model.encode([search_term]))
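
At this point every sentence, and the query, is just a dense vector. A quick shape check (the 768 dimensions mentioned below are what I would expect from a RoBERTa-base based encoder, so treat that number as an assumption):

# One row per sentence for the text, a single row for the query;
# for a RoBERTa-base based model each vector should be 768-dimensional
print(encoded_text.shape, encoded_query.shape)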

We then compute the cosine similarity between the encoded search term and each encoded sentence, and rank the sentences accordingly.

# Cosine similarity between the query and every sentence
results = cosine_similarity(encoded_query, encoded_text)[0]
# Indices of the highest-scoring sentences, best match first
top_idx = results.argsort()[-num_results:][::-1]
scores = results[top_idx]
sentences = [input_text[idx] for idx in top_idx]

Finally, we can print the list of five sentences that best match the search term.

print('*'*30 + ' Start of output ' + '*'*30)
printmd('**Search Results:**')
for sentence, score in zip(sentences, scores):
    printmd(f'* {sentence} (score: {score:<.4f})')

print('*'*30 + ' End of output ' + '*'*30)
****************************** Start of output ******************************

Search Results:

  • Neural networks have been used for implementing language models since the early 2000s (score: 0.6076)
  • Word embedding, such as , can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset; the position is represented as a point in a (score: 0.5070)
  • LSTM helped to improve machine translation and language modeling (score: 0.4982)
  • A deep neural network (DNN) is an (ANN) with multiple layers between the input and output layers (score: 0.4969)
  • An ANN is based on a collection of connected units called , (analogous to biological neurons in a ) (score: 0.4947)
****************************** End of output ******************************

As you can see, the output is fairly good, and at least the top three sentences are highly relevant to our search. (The gaps in some of the sentences are an artifact of our simple text extraction, which drops text nested inside links and other tags within paragraphs.)

In this demo we have seen that with just a few lines of code you can create a simple search engine that does what it is supposed to do: it finds sentences from the source website/document that match our keyword. The same idea can be used in a wide variety of use cases.

Hope you enjoyed this demo. Feel free to contact me if you have any questions.