This project aims to classify sentences into predefined categories using a combination of token-level, character-level, and positional embeddings. The workflow includes data preprocessing, model building, training, evaluation, and prediction.
Data Preprocessing:
- Load and preprocess data from the PubMed RCT dataset.
- Convert text data into token-level and character-level sequences.
- One-hot encode line numbers and total lines.
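The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: it assumes the raw PubMed RCT text layout (an ID line starting with `###`, then one `LABEL<TAB>sentence` line per sentence, then a blank line), and the helper names `parse_abstracts`, `split_chars`, and `one_hot` are hypothetical. Here `total_lines` is taken to be the number of sentences in the abstract, which is also an assumption.

```python
def parse_abstracts(lines):
    """Turn raw PubMed RCT lines into one record per sentence."""
    samples, current = [], []

    def flush():
        # Emit every sentence of the abstract collected so far.
        for number, (label, text) in enumerate(current):
            samples.append({
                "target": label,
                "text": text.lower(),
                "line_number": number,         # zero-based position in the abstract
                "total_lines": len(current),   # assumed: sentence count of the abstract
            })
        current.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###") or not line:  # ID line or blank line: abstract boundary
            flush()
        else:
            label, text = line.split("\t", 1)
            current.append((label, text))
    flush()  # handle a file that does not end with a blank line
    return samples


def split_chars(text):
    """Space-separate characters so a vectorizer can treat each one as a token."""
    return " ".join(list(text))


def one_hot(index, depth):
    """One-hot encode an integer; out-of-range indices give an all-zero vector."""
    return [1.0 if i == index else 0.0 for i in range(depth)]
```

In a TensorFlow pipeline the last helper would typically be replaced by `tf.one_hot`, which likewise returns all zeros for indices outside the chosen depth.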
Model Building:
- Create token-level and character-level embedding models.
- Create models for line number and total lines features.
- Combine embeddings using tf.keras.layers.Concatenate.
- Build a tribrid embedding model using tf.keras.Model.
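One possible shape for the tribrid model is sketched below with the Keras functional API. Everything specific here is an assumption made for illustration: the function name `build_tribrid_model`, the vocabulary sizes, sequence lengths, one-hot depths, and layer widths, and the use of integer-encoded inputs with plain `Embedding` layers. The project's own branches (for example a pretrained sentence encoder or a text vectorization layer) may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_tribrid_model(token_vocab=1000, char_vocab=30, token_len=55,
                        char_len=290, line_depth=15, total_depth=20,
                        num_classes=5):
    # Token-level branch: integer token IDs -> embedding -> pooled vector.
    token_in = layers.Input(shape=(token_len,), dtype="int32", name="token_ids")
    x = layers.Embedding(token_vocab, 64)(token_in)
    x = layers.GlobalAveragePooling1D()(x)
    token_out = layers.Dense(64, activation="relu")(x)

    # Character-level branch: integer char IDs -> embedding -> bidirectional LSTM.
    char_in = layers.Input(shape=(char_len,), dtype="int32", name="char_ids")
    y = layers.Embedding(char_vocab, 16)(char_in)
    char_out = layers.Bidirectional(layers.LSTM(16))(y)

    # Positional branches: one-hot line number and one-hot total lines.
    line_in = layers.Input(shape=(line_depth,), name="line_number")
    line_out = layers.Dense(16, activation="relu")(line_in)
    total_in = layers.Input(shape=(total_depth,), name="total_lines")
    total_out = layers.Dense(16, activation="relu")(total_in)

    # Merge token and character features, then add the positional features.
    text = layers.Concatenate(name="token_char")([token_out, char_out])
    text = layers.Dense(64, activation="relu")(text)
    text = layers.Dropout(0.5)(text)
    merged = layers.Concatenate(name="tribrid")([line_out, total_out, text])

    outputs = layers.Dense(num_classes, activation="softmax")(merged)
    return tf.keras.Model(inputs=[line_in, total_in, token_in, char_in],
                          outputs=outputs, name="tribrid_model")
```

Calling `build_tribrid_model().summary()` prints the four inputs and the two concatenation points, which is a quick way to check that the branches are wired as intended.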
Training:
- Compile the model with CategoricalCrossentropy loss and Adam optimizer.
- Train the model on the training dataset with validation.
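The compile and fit steps might look roughly like the following. To stay self-contained, this sketch trains a tiny stand-in classifier on random data rather than the tribrid model; the data shapes, the single epoch, and the default Adam settings are assumptions, not the project's configuration.

```python
import numpy as np
import tensorflow as tf

num_classes = 5
rng = np.random.default_rng(0)

# Random stand-in features and one-hot labels (placeholders for the real dataset).
x_train = rng.random((64, 8)).astype("float32")
y_train = np.eye(num_classes, dtype="float32")[rng.integers(0, num_classes, 64)]
x_val = rng.random((16, 8)).astype("float32")
y_val = np.eye(num_classes, dtype="float32")[rng.integers(0, num_classes, 16)]

# Tiny stand-in model with a softmax output over the five classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# CategoricalCrossentropy expects one-hot targets, which matches the labels above.
model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"],
)

# Passing validation_data makes Keras report val_loss and val_accuracy each epoch.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=1, verbose=0)
```

The returned `history.history` dictionary holds the per-epoch training and validation metrics, which can be plotted to spot overfitting.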
Evaluation:
- Evaluate the model on the test dataset.
- Calculate accuracy, precision, recall, and F1 score.
- Display confusion matrix and analyze misclassifications.
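To make the evaluation metrics concrete, here is a dependency-free sketch that builds a confusion matrix and derives accuracy, precision, recall, and F1 from it. The function names are hypothetical, and the macro averaging (an unweighted mean over classes) is an assumption; the project may use a library routine or a different averaging scheme such as weighting by class frequency.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def summarize(matrix):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Off-diagonal cells of the matrix show which label pairs the model confuses, which is a natural starting point for analyzing misclassifications.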
Prediction:
- Load the trained model and make predictions on new abstracts.
- Visualize predicted labels for each sentence in the abstract.
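Once the model has produced class probabilities for each sentence, turning them into a readable view is straightforward. The sketch below assumes the abstract has already been split into sentences and that the probabilities come as one row per sentence; the helper names and the text-only rendering are illustrative, and the class order shown is the alphabetical order of the five PubMed RCT labels, which may not match the saved label encoder.

```python
CLASS_NAMES = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def label_sentences(sentences, probabilities, class_names=CLASS_NAMES):
    """Pair each sentence with the class that has the highest probability."""
    labeled = []
    for sentence, probs in zip(sentences, probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        labeled.append((class_names[best], sentence))
    return labeled


def render(labeled):
    """Format one 'LABEL: sentence' line per sentence, in abstract order."""
    return "\n".join(f"{label}: {sentence}" for label, sentence in labeled)
```

For example, `print(render(label_sentences(sentences, probabilities)))` would list the abstract sentence by sentence with its predicted role in front.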
- char_vectorizer.pkl: Saved character vectorizer.
- final_model.keras: Trained model.
- label_encoder.pkl: Saved label encoder.
- Clone the repository:
git clone https://github.com/davydantoniuk/skimlit-nlp-project.git
- Navigate to the project directory:
cd skimlit-nlp-project
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
Windows:
venv\Scripts\activate
macOS/Linux:
source venv/bin/activate
- Install the required packages:
pip install -r requirements.txt
- Run the application:
python app/app.py