Skip to content

Jl-wei/mini-bar

Repository files navigation

Mini-BAR

DOI

Mini-BAR is a tool for the mining of bilingual app reviews. image

If you find our work useful, please cite our paper:

@INPROCEEDINGS{Wei2023ICTAI,
  author={Wei, Jialiang and Courbis, Anne-Lise and Lambolais, Thomas and Xu, Binbin and Bernard, Pierre Louis and Dray, Gérard},
  booktitle={2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI)}, 
  title={Zero-shot Bilingual App Reviews Mining with Large Language Models}, 
  year={2023},
  pages={898-904},
  doi={10.1109/ICTAI59109.2023.00135},
  arxiv={arXiv:2311.03058}
}

Dataset

Classification

App Total Feature request Problem report Irrelevant
Garmin Connect (en) 2000 223 579 1231
Garmin Connect (fr) 2000 217 772 1051
Huawei Health (en) 2000 415 876 764
Huawei Health (fr) 2000 387 842 817
Samsung Health (en) 2000 528 500 990
Samsung Health (fr) 2000 496 492 1047

The sum of each category does not equal the total of reviews, as some reviews have been assigned to more than one label.

Clustering

Garmin Connect Huawei Health Samsung Health
#clusters in feature request 89 74 69
#clusters($size\geq5$) in feature request 7 9 11
#clusters in problem report 45 44 41
#clusters($size\geq5$) in problem report 10 13 12

Installation

Create a new conda env

conda create --name mini-bar python=3.11

Activate the conda env

conda activate mini-bar

Install poetry (https://python-poetry.org/docs/)

Install dependencies

poetry install

Copy your OpenAI key (https://platform.openai.com/account/api-keys) to the environment variable OPENAI_API_KEY

export OPENAI_API_KEY='your openai key'

Experiments

Classification

Pre-trained models and Machine Learning models

Change the current working directory

cd classification

Train and test deep learning based models. nohup is used to keep the script running even after exiting the shell

nohup python dl_train.py &

Train and test machine learning based models

nohup python ml_train_test.py &

The precision, recall and f1 are saved in the log file in classification/lightning_logs

Large language models

Test with large language models

nohup python llm.py &

Calculate precision, recall and f1

python llm_analyse.py --csv_path csv_file_path --model "model name (chatgpt or guanaco)"

Clustering

Change the current working directory by

cd clustering

Perform text embedding and dimension reduction for the data in dataset/for_clustering/labelled, the results will be saved in dataset/for_clustering/embedded

python embed_script.py

Cluster the text embedding stored in dataset/for_clustering/embedded

python main.py

Evaluate the clustering results in clustering/output/multi-hdbscan

python evaluate.py --name "multi-hdbscan" --length 1 --scale 10

Summarization

Summarize the reviews

python summarizer.py

Usage

Change the current working directory

cd tool

Perform analysis on a csv file

python mini_bar.py --file csv_file_path

Generate the report

python report.py