TripAdvisor Recommendation Challenge - Beating BM25

Overview

This project was developed as part of the ESILV A5 Machine Learning for NLP course. The objective is to improve recommendation quality for places on TripAdvisor based on review text, surpassing the BM25 model using advanced Natural Language Processing techniques.

Authors

Sarujan Denson
Joyce Lapilus

Problem Statement

The challenge is to recommend places on TripAdvisor that are most similar based on review content. The task involves using textual similarity methods to predict ratings and rank places effectively. Our aim is to explore both lexical and semantic models, evaluate their performance, and achieve better accuracy than the BM25 baseline.

Models Explored

BM25 Model: A traditional lexical retrieval model using term frequency and inverse document frequency.
TF-IDF Model: A custom implementation that uses term weighting to calculate similarity.
Embedding-Based Model: Utilizes SentenceTransformer to obtain dense semantic embeddings for review content.
Hybrid Model: A combination of TF-IDF for initial candidate retrieval, followed by embedding-based re-ranking for refined results.

Implementation Details

Embedding-Based Model: Uses SentenceTransformer (all-MiniLM-L6-v2) to create high-dimensional vector embeddings and compute cosine similarity.
Hybrid Model: Combines TF-IDF and embedding similarity to balance lexical and semantic matching.

Evaluation Metrics

Mean Squared Error (MSE): Measures the difference between predicted and actual ratings. A lower MSE indicates a more accurate model.
Normalized Discounted Cumulative Gain (NDCG): Ranks model performance, where a score closer to 1 indicates better ranking alignment with ground truth.

Results

MSE Performance:
- Hybrid Model: 22.13% (Best Performance)
- Embedding-Based Model: 29.76%
- BM25 Model: 29.91%
- TF-IDF Model: 31.34%
NDCG Performance:
- Hybrid Model: 0.9949 (Highest Score)
- BM25 Model: 0.9933
- TF-IDF Model: 0.9933
- Embedding-Based Model: 0.9917

The Hybrid Model outperformed all others, achieving the lowest MSE and the highest NDCG score, demonstrating the effectiveness of combining lexical and semantic similarity approaches.

File Structure

report/ESILV_A5_ML_for_NLP_Project_1_Report.pdf: The detailed project report.
notebooks/recommendation_model.ipynb: The Jupyter notebook containing code for data processing, model training, and evaluation.

How to Run

Clone the repository:

git clone https://github.com/atinyshrimp/TripAdvisor-Recommendation-Challenge.git
cd TripAdvisor-Recommendation-Challenge

Install dependencies:
```
pip install -r requirements.txt
```

Open the Jupyter notebook:

jupyter notebook notebooks/recommendation_model.ipynb

Download NLTK data packages (stopwords, punkt, and wordnet):
```
python -m nltk.downloader punkt stopwords wordnet
```
Follow the steps in the notebook to train and evaluate the models.

Dataset Source

The dataset used in this project is publicly available on Kaggle:

Dataset: TripAdvisor Hotel Reviews
Author(s)/Contributor(s): Joakim Arvidsson

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
notebooks		notebooks
report		report
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TripAdvisor Recommendation Challenge - Beating BM25

Overview

Authors

Problem Statement

Models Explored

Implementation Details

Evaluation Metrics

Results

File Structure

How to Run

Dataset Source

About

Languages

atinyshrimp/TripAdvisor-Recommendation-ML-NLP

Folders and files

Latest commit

History

Repository files navigation

TripAdvisor Recommendation Challenge - Beating BM25

Overview

Authors

Problem Statement

Models Explored

Implementation Details

Evaluation Metrics

Results

File Structure

How to Run

Dataset Source

About

Topics

Resources

Stars

Watchers

Forks

Languages