How I Built an NLP-Based Recommendation System in 24 Hours


TL;DR
I built an NLP-based recommendation system that recommends ML papers and explains the reasoning behind each recommendation using OpenAI’s GPT API.

In this article, I will share my experience of building an NLP-based recommendation system from scratch in just 24 hours using a variety of tools and techniques, including Next.js, OpenAI’s ChatGPT, BERT, SentenceTransformers, SPECTER, FastAPI, and metadata scraped from ArXiv covering the last six months of ML papers.

Tech stack of the system

Whenever I read a paper, I find myself constantly seeking recommendations (blame TikTok and YouTube 🤔). This inspired me to create a recommendation system for academic papers.

Initially, I searched for a dataset specifically related to machine learning papers. While I found some datasets, they didn’t suit my needs, so I began scraping metadata from ArXiv instead. I thought this might be helpful to others, so I deployed the scraper here, and the code is here.

After obtaining the dataset containing metadata, I explored various methods for computing embeddings for each paper:

1. TF-IDF

TF-IDF is a numerical statistic that measures how important a word is to a document in a corpus.
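For instance, here is a minimal sketch of computing TF-IDF vectors over paper abstracts with scikit-learn; `abstracts` is assumed to be a list of abstract strings from the scraped metadata, and the parameters are illustrative rather than the exact configuration I used:

```python
# TF-IDF vectors over paper abstracts (sketch; `abstracts` assumed from the metadata)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
tfidf_matrix = vectorizer.fit_transform(abstracts)  # sparse (n_papers, vocab_size)
```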

2. Sentence-Transformers

SentenceTransformers are a set of pre-trained models that can be used for various NLP tasks such as semantic similarity, sentence classification, and clustering.
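A minimal sketch with the sentence-transformers library; "all-MiniLM-L6-v2" is a common general-purpose choice, not necessarily the model the app ships with:

```python
# Dense sentence embeddings for each abstract (sketch)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(abstracts, convert_to_numpy=True)  # (n_papers, 384)
```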

3. SPECTER from Allen AI (BERT-based)

SPECTER is a pre-trained neural network developed by Allen Institute for AI that generates high-quality document embeddings for natural language processing tasks such as information retrieval and text classification.
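Here is roughly how SPECTER embeddings can be computed with Hugging Face Transformers, following the model card’s "title [SEP] abstract" input format; `titles` and `abstracts` are assumed to come from the scraped ArXiv metadata:

```python
# SPECTER document embeddings via Hugging Face Transformers (sketch)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# SPECTER expects "title [SEP] abstract" per paper
papers = [t + tokenizer.sep_token + a for t, a in zip(titles, abstracts)]
inputs = tokenizer(papers, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)
embeddings = output.last_hidden_state[:, 0, :]  # [CLS] vector per paper
```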

Of the three, SPECTER gave the best accuracy, as it was trained on similar academic data. For the similarity measure, I used simple cosine similarity. Then a question occurred to me: whenever users see a recommendation, they’ll be curious about the reason behind it. So I integrated OpenAI’s GPT API to explain the reasoning behind each suggestion.
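Concretely, the recommendation step reduces to a nearest-neighbor lookup under cosine similarity. A minimal sketch, assuming `embeddings` is the (n_papers, dim) matrix from the previous step:

```python
# Rank papers by cosine similarity to a query paper (sketch)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query_index: int, embeddings: np.ndarray, top_k: int = 5) -> list[int]:
    sims = cosine_similarity(embeddings[query_index:query_index + 1], embeddings)[0]
    sims[query_index] = -1.0  # never recommend the query paper itself
    return np.argsort(sims)[::-1][:top_k].tolist()
```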

I deployed the backend app using FastAPI and Deta Space, and developed a simple user interface with Next.js that retrieves data from the backend server. The frontend is currently deployed on Vercel.
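To give a flavor of how the pieces fit together, here is a hypothetical FastAPI endpoint combining the recommender with a GPT-generated explanation; the route, prompt, and `papers` metadata structure are illustrative, not the app’s actual code:

```python
# Hypothetical backend endpoint: recommendations plus a GPT explanation (sketch)
from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@app.get("/recommendations/{paper_id}")
def get_recommendations(paper_id: int, top_k: int = 5):
    rec_ids = recommend(paper_id, embeddings, top_k)  # from the sketch above
    prompt = (
        f"A reader liked the paper '{papers[paper_id]['title']}'. "
        "Briefly explain why each of these papers is a relevant follow-up: "
        + ", ".join(papers[i]["title"] for i in rec_ids)
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"recommendations": rec_ids,
            "explanation": completion.choices[0].message.content}
```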

App deployed on Vercel

In the future, I plan to integrate a monitoring dashboard to track performance and implement a continuous training pipeline for daily data scraping. Additionally, I’m considering adding an interactive chat feature for discussing papers.

Building this app was both challenging and rewarding, and I welcome any ideas or feedback here.

References:

https://fastapi.tiangolo.com/