This is the repository for the experiments of the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study." The study compares various NLP-based models for detecting code smells to a baseline model, highlighting their strengths and weaknesses.
This research was conducted as part of a CIFRE doctoral program (Convention Industrielle de Formation par la Recherche) in collaboration with Adservio, Université Paris-Saclay, and ISEP.
First, create a conda environment and install the dependencies:
conda create -n mlcqenv python=3.10
conda activate mlcqenv
conda install --file requirements.txt
To recreate the JSON file containing the code snippets at the paths specified in the MLCQ dataset:
- First, set up a GitHub token to communicate with the API; see GitHub's documentation on personal access tokens for more information on setting your token.
- Export your acquired token as an environment variable:
export GITHUB_TOKEN=<your_github_token>
- Run the DataExtractor script:
python DataExtractor.py
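As a rough illustration of how a script can pick up the exported token, the sketch below reads GITHUB_TOKEN from the environment and builds request headers for the GitHub REST API. The helper name build_github_headers is hypothetical and is not taken from DataExtractor.py, whose actual implementation may differ.

```python
import os


def build_github_headers(env=None):
    """Return HTTP headers for authenticated GitHub REST API requests.

    Hypothetical helper for illustration; not part of DataExtractor.py.
    """
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        # Fail early with a clear message rather than falling back to
        # unauthenticated requests, which are heavily rate limited.
        raise RuntimeError(
            "GITHUB_TOKEN is not set; export it before running the extractor."
        )
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Checking for the token up front means a missing export shows up immediately instead of partway through a long extraction run.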
The baseline is J48, a decision-tree algorithm widely considered state of the art for code smell detection (see 1 and 2).
This model uses code metrics as features, so they must be computed first. We use Designite for this; install it following the official repo.
- Run python baseline/MetricsExtractor.py to prepare the code snippets as .java files.
- Run python baseline/DesigniteRun.py to execute Designite on the .java files, producing a DesigniteOutput file.
- Run baseline/DatasetCreator.py to prepare the final dataset to feed to the model.
- Finally, run train.py to train and test the model.
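The baseline steps above have to run in order, since each one consumes the output of the previous one. A minimal sketch of a driver for them follows, assuming it is launched from the repository root and that the script paths are exactly as listed above. The names BASELINE_STEPS, build_commands, and run_pipeline are hypothetical and do not exist in the repository.

```python
import subprocess
import sys

# Script paths as listed in the steps above; their location relative to the
# working directory is an assumption.
BASELINE_STEPS = [
    "baseline/MetricsExtractor.py",
    "baseline/DesigniteRun.py",
    "baseline/DatasetCreator.py",
    "train.py",
]


def build_commands(python=sys.executable, steps=BASELINE_STEPS):
    """Return one command (as an argument list) per pipeline step, in order."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in sequence, stopping at the first failure."""
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError when a
        # script exits with a non-zero status, so later steps are skipped.
        runner(cmd, check=True)
```

Calling run_pipeline(build_commands()) would execute the four scripts one after another; stopping at the first failure avoids training on a stale or partial dataset.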
Tip: You can speed up the Designite processing by specifying the number of workers when using MetricsExtractor.py. This divides the dataset into batches, enabling parallel processing for faster execution.
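The batching idea in the tip can be sketched as follows. This illustrates the general technique only and is not the code in MetricsExtractor.py: split_into_batches, process_batch, run_parallel, and the workers parameter are hypothetical names, and the real script's option for the worker count may be spelled differently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, workers):
    """Split items into at most `workers` contiguous batches of near-equal size."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    size, extra = divmod(len(items), workers)
    batches, start = [], 0
    for i in range(workers):
        # The first `extra` batches take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            batches.append(items[start:end])
        start = end
    return batches


def process_batch(batch):
    """Placeholder for the per-batch work (for example, one Designite run)."""
    return len(batch)


def run_parallel(items, workers):
    """Process all batches concurrently and return per-batch results in order."""
    batches = split_into_batches(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_batch, batches))
```

Threads are used here on the assumption that each batch mostly waits on an external process; for CPU-bound Python work, a process pool would be the more usual choice.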
Several models are available, each with different components. To train the final BiLSTM with attention model, run:
python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001 --hidden_dim 512 --num_layers 2
To run the CodeBERT model, run:
python bert.py
All results are stored in their corresponding log files.
Authors:
- Djamel Mesbah (djamel.mesbah@adservio.fr / djamel.mesbah@universite-paris-saclay.fr)
- Nour El Madhoun (nour.el-madhoun@isep.fr)
- Hani Chalouati (hani.chalouati@adservio.fr)
- Khaldoun Al Agha (alagha@lri.fr)
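The BiLSTM training command above takes five hyperparameter flags. A command line of that shape could be parsed as in the sketch below. The flag names come from the command itself; the types and defaults are assumptions (the defaults simply mirror the values shown) and may not match bilstm_attn_train.py.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the BiLSTM training command."""
    parser = argparse.ArgumentParser(
        description="Train the BiLSTM with attention model (illustrative sketch)."
    )
    # Defaults mirror the example command; the real script may differ.
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--hidden_dim", type=int, default=512)
    parser.add_argument("--num_layers", type=int, default=2)
    return parser
```

With such a parser, build_parser().parse_args(["--batch_size", "32"]) would override only the batch size and keep the remaining values at their defaults.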
This work relies on:
- The MLCQ dataset
- The Designite tool
- The CodeBERT pretrained model from Hugging Face