In my previous post I showed how to fine-tune a pre-trained transformer model for the natural language inference (NLI) classification task. In this post I take a fine-tuned NLI model and use it to classify and rank articles in a news feed.

The idea is simple: we take an NLI model that has been trained on the MultiNLI task and pass it an excerpt of the source text together with a search term we are interested in. The model then checks whether the source text entails the search term and returns a score. We can use these scores to rank the articles. We could use the model we trained in the previous demo, but luckily the people at Hugging Face have made our lives much easier by releasing a pipeline for zero-shot classification which uses a pre-trained NLI model.
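To make the entailment idea concrete, here is a minimal sketch of what happens to a search term before and after it reaches the NLI model. It assumes the pipeline's default hypothesis template ("This example is {}.") and its single-label behaviour, where the score is a softmax over the contradiction and entailment logits only. The helper names and the logit values are made up for illustration and are not part of the transformers API.

```python
import math

# Assumed default template of the zero-shot pipeline
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pair(source_text, label, template=HYPOTHESIS_TEMPLATE):
    """Return the (premise, hypothesis) pair that is fed to the NLI model."""
    return source_text, template.format(label)


def entailment_score(contradiction_logit, entailment_logit):
    """Softmax over contradiction vs. entailment; the neutral logit is ignored."""
    m = max(contradiction_logit, entailment_logit)  # subtract the max for numerical stability
    e_c = math.exp(contradiction_logit - m)
    e_e = math.exp(entailment_logit - m)
    return e_e / (e_c + e_e)


premise, hypothesis = build_nli_pair(
    "Amazon SageMaker now trains models faster.", "machine learning")
print(hypothesis)  # This example is machine learning.

# Hypothetical logits, only to show how a score between 0 and 1 comes out
print(round(entailment_score(-2.0, 3.0), 3))  # 0.993
```

The higher the entailment logit is relative to the contradiction logit, the closer the score gets to 1, which is what lets us rank articles by it.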

So let's get started! First we need to install the required libraries and import them. In addition to the PyTorch and transformers libraries we are also installing and importing some libraries we need for retrieving and processing the news feeds and the articles.

!pip install transformers torch lxml beautifulsoup4 feedparser requests tqdm
from transformers import pipeline, logging
import torch
import sys
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
import re
import feedparser
import time
from operator import itemgetter
from IPython.display import Markdown, display
from tqdm import tqdm
logging.set_verbosity_error()

Next we define some functions we need when we process the HTML content retrieved from the websites. The text_from_html function uses BeautifulSoup to keep only the text inside paragraph tags, which filters out unwanted content like HTML tags, comments, etc. We also define a function we will use to print out the results in markdown format.

def tag_visible(element):
    # Skip HTML comments first, then keep only text that sits directly inside a <p> tag
    if isinstance(element, Comment):
        return False
    return element.parent.name == 'p'


def text_from_html(response):
    # response is the requests.Response object returned by requests.get()
    soup = BeautifulSoup(response.content, 'lxml')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


def printmd(string):
    display(Markdown(string))

We need to define our classifier function that takes the source text and the search term and returns the result. The model we are using has a limit of 1024 tokens for the input text, so to keep things simple we only pass it the first 1024 characters of each article. Note that this will significantly impact the results as we might be cutting out some important information. However, for demonstration purposes we will not care about this. You could of course split the text into smaller chunks, classify those chunks separately and then combine the results at the end.

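# Load the zero-shot pipeline once and reuse it; use the GPU when one is available
zero_shot_pipeline = pipeline("zero-shot-classification",
                              device=0 if torch.cuda.is_available() else -1)


def classifier(source_text, search_term):
    src_text = source_text[:1024]  # keep only the first 1024 characters
    return zero_shot_pipeline(src_text, search_term)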
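The chunking idea mentioned earlier could look roughly like the sketch below. It is not what this demo does: the function names are hypothetical, the scoring function is injected so the example runs without a model, and taking the maximum chunk score is just one possible way to combine results (averaging would be another).

```python
def split_into_chunks(text, chunk_size=1024):
    """Split text on whitespace into chunks of at most chunk_size characters.

    A single word longer than chunk_size becomes its own (oversized) chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def best_chunk_score(text, score_fn, chunk_size=1024):
    """Score every chunk with score_fn and keep the highest score (0.0 for empty text)."""
    return max((score_fn(chunk) for chunk in split_into_chunks(text, chunk_size)),
               default=0.0)


def fake_score(chunk):
    # Stand-in for a real model call, used only so the sketch runs on its own
    return 0.9 if "learning" in chunk else 0.1


print(split_into_chunks("aa bb cc dd", chunk_size=5))  # ['aa bb', 'cc dd']
print(best_chunk_score("cats and dogs and machine learning", fake_score, chunk_size=12))
```

To use the real model, you could pass `lambda chunk: classifier(chunk, search_term)["scores"][0]` as `score_fn`, at the cost of one model call per chunk.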

Now that we have defined our classifier function, we have to define the news feed we want to retrieve the articles from. For this demo I want to find out which posts Amazon Web Services (AWS) has published about machine learning in the past 7 days, so I'm using the AWS blog as the source and "machine learning" as the search term / classification label. We also define the number of articles we want to display. Let's say we want to see the top 4 articles about machine learning.

feed = "https://aws.amazon.com/blogs/aws/feed/"
search_term = "machine learning"
days = 7
number_of_articles = 4

Next, we retrieve the news feed using the feedparser library. We then retrieve the HTML source for all the articles from the feed that have been published in the last 7 days. We use the text_from_html function to extract the text from the HTML source and call the classifier function. Finally, we save the classification score and other relevant information for each article.

import calendar

newsfeed = feedparser.parse(feed)
articles = []
# published_parsed is a UTC struct_time, so convert it with calendar.timegm rather than time.mktime
cutoff = time.time() - 86400 * days
entries = [entry for entry in newsfeed.entries if calendar.timegm(entry.published_parsed) > cutoff]
for entry in tqdm(entries):
    html = requests.get(entry.link)
    src_text = text_from_html(html)

    # This is where we call our classifier function using the source text and the search term
    classification = classifier(src_text, search_term)

    article = dict()
    article["title"] = entry.title
    article["link"] = entry.link
    article["src_text"] = src_text
    article["published"] = entry.published_parsed
    article["relevancy"] = classification["scores"][0]
    articles.append(article)
100%|██████████| 19/19 [07:39<00:00, 24.16s/it]
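
The `["scores"][0]` lookup in the loop relies on the shape of the dictionary the zero-shot pipeline returns: the input text under "sequence", plus parallel "labels" and "scores" lists ordered from highest to lowest score. With a single search term there is only one entry, so index 0 is its score. The sketch below uses a hand-written dictionary with invented values to show how you could read a score by label name if you passed several labels; `score_for_label` is a hypothetical helper.

```python
# Hand-written example of the result shape; the text and numbers are invented
example_result = {
    "sequence": "Amazon SageMaker simplifies training deep learning models.",
    "labels": ["machine learning", "databases", "networking"],
    "scores": [0.97, 0.02, 0.01],
}


def score_for_label(result, label):
    """Return the score of one label, independent of its position in the result."""
    return dict(zip(result["labels"], result["scores"]))[label]


print(example_result["scores"][0])                   # score of the top-ranked label
print(score_for_label(example_result, "databases"))  # score of a specific label
```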

Now that we have a list of classified articles we can sort them using the classification scores.

sorted_articles = sorted(articles, key=itemgetter("relevancy"), reverse=True)

Before we display the results, I'm defining another useful function that utilises the transformers summarization pipeline. We will use this function to create a short summary of each article on our list.

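# Load the summarization pipeline once and reuse it for every article
summarization = pipeline("summarization")


def summarise(source_text):
    src_text = source_text[:1024]
    summary_text = summarization(src_text, min_length=100)[0]['summary_text']
    # Remove the stray whitespace the model leaves before punctuation
    summary_text = re.sub(r'\s([?.!",](?:\s|$))', r'\1', summary_text)
    return summary_text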
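
The `re.sub` call in `summarise` is easy to overlook, so here it is in isolation. It removes a whitespace character that sits directly before a question mark, full stop, exclamation mark, double quote or comma, provided that mark is itself followed by whitespace or the end of the text. The wrapper name `tidy_punctuation` and the sample sentences are made up for this sketch.

```python
import re


def tidy_punctuation(text):
    """Drop whitespace before ? . ! " or , when the mark is followed by whitespace or end of text."""
    return re.sub(r'\s([?.!",](?:\s|$))', r'\1', text)


print(tidy_punctuation("Training is slow . It takes hours ."))
# Training is slow. It takes hours.
```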

Finally, we summarise the texts of our top 4 articles and print the results in ranked order.

print('*'*20 + ' Start of output ' + '*'*20)
for article in sorted_articles[:number_of_articles]:
    summary = summarise(article["src_text"])
    printmd("**{}**<br>{}<br>{}<br>**Search term:** {} | **Score:** {:6.3f}<br><br>".format(
        article["title"],
        article["link"],
        summary,
        search_term,
        100 * article["relevancy"]))
print('*'*20 + ' End of output ' + '*'*20)
******************** Start of output ********************

New – Managed Data Parallelism in Amazon SageMaker Simplifies Training on Large Datasets
https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/
Machine learning (ML) practitioners working on large distributed training jobs have to face increasingly long training times. Long training times are a severe bottleneck for ML projects, hurting productivity and slowing down innovation. SageMaker Data Parallelism (SDP) library now helps ML teams reduce distributed training time and cost, thanks to the SageMaker data parallelism library. It takes over 6 hours to train advanced object detection models such as Mask RCNN and Faster RCNN on the publicly available COCO dataset.
Search term: machine learning | Score: 99.549

Amazon SageMaker Simplifies Training Deep Learning Models With Billions of Parameters
https://aws.amazon.com/blogs/aws/amazon-sagemaker-simplifies-training-deep-learning-models-with-billions-of-parameters/
Deep learning (DL) has taken the world by storm in the last 10 years. Based on neural networks, DL algorithms have an extraordinary ability to extract information patterns hidden in vast amounts of unstructured data. DL has quickly achieved impressive results on a variety of complex human-like tasks, especially on computer vision and natural language processing. Today, I'm extremely happy to announce that simplifies the training of very large deep learning models that were previously difficult to train due to hardware limitations. In order to tackle ever more complex tasks, DL researchers are designing increasingly sophisticated models.
Search term: machine learning | Score: 99.120

New – Amazon SageMaker Pipelines Brings DevOps Capabilities to your Machine Learning Projects
https://aws.amazon.com/blogs/aws/amazon-sagemaker-pipelines-brings-devops-to-machine-learning-projects/
Machine learning (ML) is intrinsically experimental and unpredictable in nature. You spend days or weeks exploring and processing data in many different ways. Then, you experiment with different algorithms and parameters, training and optimizing lots of models in search of highest accuracy. Finally? Not quite, as you’ll certainly iterate again and again, either to try out new ideas, or simply to retrain your models on new data. Today, I’m extremely happy to announce a new capability of that makes it easy for data scientists and engineers to build, automate, and scale end to end machine learning pipelines.
Search term: machine learning | Score: 98.130

Amazon SageMaker JumpStart Simplifies Access to Pre-built Models and Machine Learning Solutions
https://aws.amazon.com/blogs/aws/amazon-sagemaker-jumpstart-simplifies-access-to-prebuilt-models-and-machine-learning-models/
Machine learning (ML) has proven to be a valuable technique in improving and automating business processes. Working with these models requires skills and experience that only a subset of scientists and developers have. In order to simplify the model building process, the ML community has created model zoos, that is to say, collections of models built with popular open source librarie. Today, a capability of that accelerates your machine learning workflows with one-click access to popular model collections (also known as “model zoos”)
Search term: machine learning | Score: 98.004

******************** End of output ********************

There we have it: a working article ranker using a pre-trained NLI model. Super easy and fun! There are plenty of use cases where these models and pipelines can be used to create useful applications.

Hope you enjoyed this demo. Feel free to contact me if you have any questions.