The Benchmark repository is where we start measuring success. It focuses on testing and evaluating Large Language Models (LLMs) to ensure they perform well at detecting code clones. This repository lays the groundwork for the CodeMatch project, helping us identify the best-performing model for the job.
The benchmark component is critical for evaluating LLMs on their ability to detect code clones. It replicates the system's key functionalities at scale, ensuring that each model is rigorously tested. The benchmark process involves:
- Generating embeddings for code snippets.
- Running two evaluation methods to measure similarity and retrieval accuracy:
  - Direct Clone Comparison Test: measures similarity between pairs of code snippets.
  - Global Clone Search: identifies the closest original code snippets from a database for given clones.
- Producing metrics and results in structured CSVs to analyze model performance.
By focusing on these elements, the benchmark helps determine the best model to integrate into the CodeMatch system.
- Finding/Creating Datasets: The first step involves gathering or creating datasets tailored for evaluating LLMs. These datasets consist of original code snippets and their clones, covering various programming languages and clone types (e.g., exact, renamed, semantic).
- Developing a Benchmark Mechanism: This step focuses on evaluating multiple LLMs using the created datasets to determine their ability to detect code clones. It involves a detailed scoring mechanism based on similarity metrics.
- Training/Fine-Tuning the Chosen LLM: Although part of the workflow, this step was executed using prompt engineering rather than custom fine-tuning, leveraging the selected LLM's capabilities without additional training.
Out of these three steps, only the first two are implemented in this repository:
- Step 1: Dataset Creation - additional information can be found in the `data` folder in this repository.
- Step 2: Benchmark Development - this is what we delve into here.
a. Evaluation Datasets: Two evaluation datasets generated in Step 1 (documented in the `data` folder) are used along with the selected LLM. This setup replicates the final system's workflow to benchmark its performance.
b-c. Embedding Original Code: Each original code snippet in the `original_code_benchmark` dataset is embedded using the selected LLM. The embeddings are stored in a dedicated collection within the vector database (Qdrant).
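As a rough illustration of steps b-c, the sketch below embeds a snippet with a Hugging Face model and upserts the vector into a Qdrant collection. The model name, collection name, vector size, and mean-pooling choice are illustrative assumptions; the repository's actual logic lives in `core/utils.py` and `core/database.py`.

```python
# Illustrative only: embed code snippets and store them in Qdrant.
# Model, collection name, and pooling strategy are assumptions, not the repo's exact choices.
import torch
from transformers import AutoTokenizer, AutoModel
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> list[float]:
    """Mean-pool the last hidden state into one vector per snippet."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().tolist()

client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="original_code",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# (id, code) pairs taken from the original_code_benchmark dataset.
snippets = [(1, "def add(a, b):\n    return a + b")]
client.upsert(
    collection_name="original_code",
    points=[PointStruct(id=i, vector=embed(code), payload={"code": code}) for i, code in snippets],
)
```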
d. Running the System for Evaluation: With the embeddings of the original code stored, the system is "run" using the selected LLM model to assess its performance against the test dataset.
Before moving to Steps e and f, note that these represent the two evaluation methods, each producing two outputs:
- Scores: The system's raw output for the given method.
- Evaluation: Performance metrics based on specific criteria for the method.
These files are stored in the `output` folder, with separate subfolders created for each LLM model tested.
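For illustration, the per-model output layout described above could be produced along these lines; the helper and file names are hypothetical, only the one-subfolder-per-model structure follows the description.

```python
# Hypothetical sketch: write scores and evaluation CSVs under output/<model>/.
from pathlib import Path
import pandas as pd

def save_results(model_name: str, method: str, scores: pd.DataFrame, evaluation: pd.DataFrame) -> None:
    out_dir = Path("output") / model_name
    out_dir.mkdir(parents=True, exist_ok=True)
    scores.to_csv(out_dir / f"{method}_scores.csv", index=False)          # raw system output
    evaluation.to_csv(out_dir / f"{method}_evaluation.csv", index=False)  # derived metrics
```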
e. Direct Clone Comparison: This method directly compares each test code embedding to its corresponding original embedding in the vector database. It evaluates the system's ability to accurately identify clone relationships within known pairs, generating similarity scores and evaluations.
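A minimal sketch of the direct-comparison idea, assuming originals are stored in a Qdrant collection keyed by id and compared with cosine similarity (collection and helper names are illustrative; `core/metrics.py` holds the real implementation):

```python
# Illustrative direct clone comparison: compare a clone's embedding with the
# stored embedding of its known original.
import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def direct_comparison_score(clone_vector, original_id: int) -> float:
    # Fetch the stored embedding of the clone's known original by its id.
    records = client.retrieve(
        collection_name="original_code",
        ids=[original_id],
        with_vectors=True,
    )
    return cosine_similarity(clone_vector, records[0].vector)
```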
f. Global Clone Search: In this method, each test code embedding is searched against the entire vector database. This approach assesses the system's ability to retrieve the most similar code snippets globally. Similarity scores and corresponding evaluations are recorded to analyze performance.
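And a matching sketch of the global-search idea, querying the whole collection with the clone's embedding and keeping the top 5 hits (same illustrative collection name as above):

```python
# Illustrative global clone search: retrieve the 5 most similar originals.
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

def global_search(clone_vector, top_k: int = 5):
    hits = client.search(
        collection_name="original_code",
        query_vector=clone_vector,
        limit=top_k,
    )
    # Each hit exposes the matched point's id and its similarity score.
    return [(hit.id, hit.score) for hit in hits]
```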
The following models have been benchmarked in this project:
- `codebert-base` - Hugging Face
- `graphcodebert-base` - Hugging Face
- `codet5-base` - Hugging Face
- `Qwen2.5-Coder-0.5B` - Hugging Face
- `Qwen2.5-Coder-0.5B-pe` (prompt-engineered version)
To choose the desired model, go to `config.py` and uncomment the corresponding model name.
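For illustration, the model-selection block in `config.py` presumably looks something like the sketch below; the exact variable names and model identifiers may differ from the repository's.

```python
# Hypothetical sketch of the model-selection block in config.py.
# Uncomment exactly one model to benchmark it:
# MODEL_NAME = "microsoft/codebert-base"
# MODEL_NAME = "microsoft/graphcodebert-base"
# MODEL_NAME = "Salesforce/codet5-base"
MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B"
```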
The benchmark reports the following retrieval metrics:
- Top 5 Hit Rate (a.k.a. Hit@5 or Top-5 Accuracy) - the percentage of times the original code snippet of the clone appears among the top 5 results retrieved by the system.
- First Hit Rate (a.k.a. Hit@1 or Top-1 Accuracy) - the percentage of times the original code snippet of the clone is ranked first among the top 5 results retrieved by the system.
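As a sketch of how these two rates can be computed from ranked retrieval results (the data structures and names here are illustrative, not the repository's actual ones):

```python
# Illustrative Hit@1 / Hit@5 computation. `results` maps each clone id to the
# ranked list of original ids returned by the global clone search (best first);
# `expected` maps each clone id to the id of its true original.

def hit_rates(results: dict[int, list[int]], expected: dict[int, int]) -> tuple[float, float]:
    hit_at_1 = hit_at_5 = 0
    for clone_id, ranked_ids in results.items():
        target = expected[clone_id]
        if ranked_ids and ranked_ids[0] == target:
            hit_at_1 += 1
        if target in ranked_ids[:5]:
            hit_at_5 += 1
    n = len(results)
    return hit_at_1 / n, hit_at_5 / n

# Example: one clone ranked first, the other found at position 3.
first_hit, top5_hit = hit_rates({1: [10, 11, 12], 2: [20, 21, 22]}, {1: 10, 2: 22})
print(first_hit, top5_hit)  # 0.5 1.0
```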
The repository is organized into the following key folders:
- `EDA-and-visualizations/`:
  - Contains exploratory data analysis (EDA) of the Rosetta dataset.
  - Includes evaluation of the output CSVs generated by running each model.
- `core/`: Core modules for managing database interactions, logging, utility functions, and metrics evaluation.
- `data/`: Contains dataset files and scripts for dataset creation.
- `logs/`: Stores logs for debugging and tracking benchmark execution.
- `output/`: Stores evaluation CSVs and results generated for each model.
- `scripts/`: Contains scripts for various tasks.
- Logging:
  - Logs are added throughout the project to capture progress, errors, and important metadata.
  - Logs are stored in a dedicated `logs` folder for detailed debugging and tracking.
- Multiprocessing:
  - Utilized to run both evaluation methods in parallel, reducing runtime (see the sketch after this list).
- Local Model Management:
  - Automatically downloads and caches the selected model locally if not already present, ensuring smooth execution without manual intervention.
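A minimal sketch of the multiprocessing approach mentioned above, assuming each evaluation method is wrapped in its own function (the function bodies here are placeholders):

```python
# Illustrative: run the two evaluation methods in separate processes.
from multiprocessing import Process

def run_direct_clone_comparison():
    print("running direct clone comparison...")  # placeholder body

def run_global_clone_search():
    print("running global clone search...")      # placeholder body

if __name__ == "__main__":
    jobs = [
        Process(target=run_direct_clone_comparison),
        Process(target=run_global_clone_search),
    ]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
```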
The project is designed with modularity in mind, separating core functionalities into distinct modules:
- `config.py`:
  - Handles configuration settings such as file paths, available models, and Qdrant settings.
  - Dynamically determines whether the project is running locally or in Docker to set the appropriate Qdrant host (see the sketch after this list).
- `core/utils.py`:
  - Provides utility functions for tasks like starting the Qdrant server, loading models, and generating embeddings.
- `core/database.py`:
  - Manages interactions with the Qdrant vector database, including initializing the client, creating collections, and populating the database with embeddings and metadata.
- `core/metrics.py`:
  - Implements the two evaluation methods: direct clone comparison and global clone search, calculating similarity scores and generating results.
- `core/mylogger.py`:
  - Configures logging for the benchmark, including custom filters and JSON formatting.
  - Provides setup functions for single- and multi-process workflows.
- `main.py`:
  - Serves as the entry point for the benchmark.
  - Coordinates the workflow, integrating functionalities from all modules to execute the benchmark process.
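For illustration, the local-vs-Docker host selection mentioned for `config.py` could be done roughly as below; the detection mechanism and variable names are assumptions rather than the repository's exact code.

```python
# Hypothetical sketch of picking the Qdrant host depending on the runtime.
import os

def running_in_docker() -> bool:
    # /.dockerenv is present inside Docker containers; an env-var override is
    # another common convention. Both checks are illustrative.
    return os.path.exists("/.dockerenv") or os.environ.get("RUNNING_IN_DOCKER") == "1"

# Inside the benchmark's Docker network the Qdrant container is reachable by its
# container name ("qdrant"); when running locally it is exposed on localhost.
QDRANT_HOST = "qdrant" if running_in_docker() else "localhost"
QDRANT_PORT = 6333
```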
- Docker: Installed (if running with Docker).
- Python: Version 3.9 or above (if running locally).
- Hardware: Ensure sufficient RAM and CPU to handle LLM embeddings and Qdrant operations effectively.
- Check the `requirements.txt` file for a list of required Python dependencies.
- Build the Docker Image:
  `docker build -t benchmark-image:v1 .`
- Setup and Start Qdrant:
  - Pull the Qdrant Docker image:
    `docker pull qdrant/qdrant`
  - Create a Docker network and start Qdrant:
    - `docker network create benchmark-network`
    - `docker run --network benchmark-network --name qdrant -d -p 6333:6333 -p 6334:6334 qdrant/qdrant`
- Run the Benchmark Container:
  `docker run --network benchmark-network -it --rm benchmark-image:v1`
- Clone the Repository:
  - `git clone https://github.com/codematch-llm/benchmark.git`
  - `cd benchmark`
- Install Dependencies:
  `pip install -r requirements.txt`
- Download and Start Qdrant:
  - Pull the Qdrant Docker image:
    `docker pull qdrant/qdrant`
  - Start the Qdrant container:
    `docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant`
- Run the Benchmark:
  `python main.py`
- Editing Code:
  - Clone the repository and edit files locally in your favorite editor (e.g., VS Code).
  - Rebuild the Docker image after making changes:
    `docker build -t benchmark-image:v1 .`
- Logs:
  - Logs for the benchmark run are printed to the console and stored in the `logs` folder for analysis.
- Generated CSVs:
  - Evaluation results are saved as CSV files in a dedicated `output` folder. Each model has a subfolder, and within it, results for the two evaluation methods are stored separately.