This project was developed as part of the ESILV A5 Machine Learning for NLP course. The objective is to improve recommendation quality for places on TripAdvisor based on review text, surpassing the BM25 model using advanced Natural Language Processing techniques.
- Sarujan Denson
- Joyce Lapilus
The challenge is to recommend places on TripAdvisor that are most similar based on review content. The task involves using textual similarity methods to predict ratings and rank places effectively. Our aim is to explore both lexical and semantic models, evaluate their performance, and achieve better accuracy than the BM25 baseline.
- BM25 Model: A traditional lexical retrieval model using term frequency and inverse document frequency.
- TF-IDF Model: A custom implementation that uses term weighting to calculate similarity.
- Embedding-Based Model: Utilizes SentenceTransformer to obtain dense semantic embeddings for review content.
- Hybrid Model: A combination of TF-IDF for initial candidate retrieval, followed by embedding-based re-ranking for refined results.
- Embedding-Based Model: Uses
SentenceTransformer
(all-MiniLM-L6-v2) to create high-dimensional vector embeddings and compute cosine similarity. - Hybrid Model: Combines TF-IDF and embedding similarity to balance lexical and semantic matching.
- Mean Squared Error (MSE): Measures the difference between predicted and actual ratings. A lower MSE indicates a more accurate model.
- Normalized Discounted Cumulative Gain (NDCG): Ranks model performance, where a score closer to 1 indicates better ranking alignment with ground truth.
- MSE Performance:
- Hybrid Model: 22.13% (Best Performance)
- Embedding-Based Model: 29.76%
- BM25 Model: 29.91%
- TF-IDF Model: 31.34%
- NDCG Performance:
- Hybrid Model: 0.9949 (Highest Score)
- BM25 Model: 0.9933
- TF-IDF Model: 0.9933
- Embedding-Based Model: 0.9917
The Hybrid Model outperformed all others, achieving the lowest MSE and the highest NDCG score, demonstrating the effectiveness of combining lexical and semantic similarity approaches.
report/ESILV_A5_ML_for_NLP_Project_1_Report.pdf
: The detailed project report.notebooks/recommendation_model.ipynb
: The Jupyter notebook containing code for data processing, model training, and evaluation.
- Clone the repository:
git clone https://github.com/atinyshrimp/TripAdvisor-Recommendation-Challenge.git cd TripAdvisor-Recommendation-Challenge
- Install dependencies:
pip install -r requirements.txt
- Open the Jupyter notebook:
jupyter notebook notebooks/recommendation_model.ipynb
- Download NLTK data packages (stopwords, punkt, and wordnet):
python -m nltk.downloader punkt stopwords wordnet
- Follow the steps in the notebook to train and evaluate the models.
The dataset used in this project is publicly available on Kaggle:
- Dataset: TripAdvisor Hotel Reviews
- Author(s)/Contributor(s): Joakim Arvidsson