This project addresses the shortcomings of existing transformer models in terms of memory usage, prediction latency, and model size. It builds on two lighter BERT variants:
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.
- DistilBERT: A distilled version of BERT: smaller, faster, cheaper, and lighter.
The goal is to optimize space and memory usage with minimal impact on prediction accuracy.
The project uses the DBpedia ontology classification dataset, which consists of 14 non-overlapping classes selected from DBpedia 2014. From each class, 40,000 training samples and 5,000 testing samples are randomly chosen, resulting in a training dataset of 560,000 and a testing dataset of 70,000 samples. The dataset includes three columns:
- Title: The title of the document.
- Content: The body of the document.
- Label: One of 14 possible topics.
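For reference, the dataset can also be pulled directly with the Hugging Face `datasets` library. The snippet below is a minimal sketch for inspecting it (the column names follow the `dbpedia_14` dataset card):

```python
from datasets import load_dataset

# Download the DBpedia 14 ontology classification dataset from the Hugging Face hub
dataset = load_dataset("dbpedia_14")

# Convert the train and test splits to pandas DataFrames for inspection
train_df = dataset["train"].to_pandas()   # 560,000 rows
test_df = dataset["test"].to_pandas()     # 70,000 rows

print(train_df.columns.tolist())          # ['label', 'title', 'content']
print(train_df["label"].nunique())        # 14 classes
```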
The project aims to build two classification models, ALBERT and DistilBERT, for the DBpedia ontology dataset.
- Language: Python
- Libraries: `datasets`, `numpy`, `pandas`, `matplotlib`, `ktrain`, `transformers`, `tensorflow`, `sklearn`
- Install the required libraries.
- Load the 'DBpedia' dataset.
- Load train and test data.
- Data pre-processing:
- Assign column names to the dataset.
- Append and save the dataset.
- Drop redundant columns.
- Add a text length column for visualization.
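A rough sketch of these pre-processing steps with pandas is shown below. The exact column handling in `Engine.py` / `ML_Pipeline` may differ, and the file paths and column order are assumptions based on the `input` folder contents:

```python
import pandas as pd

# Load the raw CSVs and assign column names (assumed order: label, title, content)
cols = ["label", "title", "content"]
train_df = pd.read_csv("input/dbpedia_14_train.csv", names=cols)
test_df = pd.read_csv("input/dbpedia_14_test.csv", names=cols)

# Append train and test and save a combined copy for later reuse
full_df = pd.concat([train_df, test_df], ignore_index=True)
full_df.to_csv("input/dbpedia_14_full.csv", index=False)

# Drop the redundant 'title' column and keep only the text body and label
train_df = train_df.drop(columns=["title"])

# Add a text-length column used only for visualization
train_df["text_length"] = train_df["content"].str.len()
```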
- Perform data visualization:
- Histogram plots.
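For example, a histogram of the text-length column added above can be plotted with matplotlib:

```python
import matplotlib.pyplot as plt

# Distribution of document lengths across the training set
plt.hist(train_df["text_length"], bins=50)
plt.xlabel("Text length (characters)")
plt.ylabel("Number of documents")
plt.title("Distribution of document lengths in DBpedia 14")
plt.show()
```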
- ALBERT model:
- Check for hardware and RAM availability.
- Import necessary libraries.
- Data interpretations.
- Create an ALBERT model instance.
- Split the train and validation data.
- Perform data pre-processing.
- Compile the ALBERT model into a ktrain learner object.
- Fine-tune the ALBERT model on the dataset.
- Check performance on validation data.
- Save the ALBERT model.
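The ALBERT steps above map closely onto ktrain's `text.Transformer` API. The sketch below is illustrative only: the checkpoint name (`albert-base-v2`), `maxlen`, batch size, learning rate, epochs, and output path are assumptions rather than the project's exact settings, and `train_df` is reused from the pre-processing sketch above:

```python
import ktrain
from ktrain import text
from sklearn.model_selection import train_test_split

# Split the training data into train and validation sets
x_train, x_val, y_train, y_val = train_test_split(
    train_df["content"].values, train_df["label"].values,
    test_size=0.1, random_state=42)

# Create an ALBERT model instance (assumed checkpoint: albert-base-v2)
t = text.Transformer("albert-base-v2", maxlen=128,
                     class_names=[str(c) for c in sorted(train_df["label"].unique())])

# Pre-process the text into the format the transformer expects
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)

# Wrap the classifier in a ktrain learner object
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)

# Fine-tune, check validation performance, and save the model
learner.fit_onecycle(lr=3e-5, epochs=1)
learner.validate(class_names=t.get_classes())
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.save("output/albert_model")
```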
- DistilBERT model:
- Check for hardware and RAM availability.
- Import necessary libraries.
- Data interpretations.
- Create a DistilBERT model instance.
- Split the train and validation data.
- Perform data pre-processing.
- Compile the DistilBERT model into a ktrain learner object.
- Fine-tune the DistilBERT model on the dataset.
- Check performance on validation data.
- Save the DistilBERT model.
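The DistilBERT workflow is identical apart from the checkpoint name. A condensed sketch, assuming the `distilbert-base-uncased` checkpoint and the same splits and settings as the ALBERT sketch above:

```python
import ktrain
from ktrain import text

# Same pipeline as the ALBERT sketch, with only the checkpoint swapped
t = text.Transformer("distilbert-base-uncased", maxlen=128,
                     class_names=[str(c) for c in range(14)])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_val, y_val)
learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=16)
learner.fit_onecycle(lr=3e-5, epochs=1)
learner.validate(class_names=t.get_classes())
ktrain.get_predictor(learner.model, preproc=t).save("output/distilbert_model")
```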
- Create a BERT model using the DBpedia dataset for comparative study.
- Follow the above steps for creating a BERT model on the 'Emotion' dataset.
- Follow the above steps for creating an ALBERT model on the 'Emotion' dataset.
- Follow the above steps for creating a DistilBERT model on the 'Emotion' dataset.
- Save all the generated models.
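The steps on the 'Emotion' dataset mirror the DBpedia pipeline above; only the data loading changes. A short sketch, assuming the Hugging Face `emotion` dataset with its `text` and `label` columns:

```python
from datasets import load_dataset

# The 'emotion' dataset provides 'text' and 'label' columns across 6 classes
emotion = load_dataset("emotion")
emotion_train = emotion["train"].to_pandas()
emotion_test = emotion["test"].to_pandas()

print(emotion_train["label"].nunique())  # 6 classes
```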
- Input: Contains the data for analysis, including CSV files and a tar.gz archive:
  - dbpedia_14_test.csv
  - dbpedia_14_train.csv
  - dbpedia_csv.tar.gz
- Src: The most important folder, with modularized code for all the project steps. It includes:
  - Engine.py
  - ML_Pipeline: A folder with functions split into different Python files, appropriately named. These functions are called within Engine.py.
- Output: Contains the ALBERT and DistilBERT models trained on this data. These models can be easily loaded and used for future applications without retraining.
  - Note: These models are built on a subset of the data. To obtain models for the entire dataset, run Engine.py using the complete data for training.
- Lib: A reference folder with the original IPython notebooks.
- Install the required packages listed in the `requirements.txt` file:
  - For Anaconda:
    conda create --name <yourenvname>
    conda activate <yourenvname>
    pip install -r requirements.txt
  - For Python interpreter:
    pip install -r requirements.txt
- The repository is modularized into individual sections performing specific tasks.
- Navigate to the `src` folder. Under `src`, you'll find:
  - ML_Pipeline: Contains modules with function declarations for specific machine learning tasks.
  - engine.py: The core of the project where all function calls are made.
- Run or debug the `engine.py` file, and all necessary steps will be executed automatically based on the logic.
- Input datasets are stored in the `input` folder.
- Predictions and models are stored in the `output` folder.
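Once training has finished, the saved models in the `output` folder can be reloaded and used for prediction without retraining. A minimal sketch with ktrain (the folder name `output/albert_model` is an assumption matching the ALBERT sketch above):

```python
import ktrain

# Reload a previously saved predictor and classify a new document
predictor = ktrain.load_predictor("output/albert_model")
print(predictor.predict("Mount Kilimanjaro is a dormant volcano in Tanzania."))
```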