This repository demonstrates a complete pipeline for text clustering using Sentence-Transformers (SBERT). The project includes text preprocessing, generation of sentence embeddings, and clustering with K-Means and DBSCAN algorithms. It also features visualization techniques for interpreting clustering results and analyzing example texts from each cluster.
- Text Preprocessing:
  - Removal of punctuation and numbers
  - Conversion of text to lowercase
  - Removal of stop words
  - Normalization using stemming or lemmatization
- Embedding Generation:
  - Uses the `paraphrase-MiniLM-L6-v2` model from Sentence-Transformers to create embeddings
- Clustering:
  - K-Means:
    - Fixed number of clusters or determination of the optimal number of clusters
  - DBSCAN:
    - Automatic tuning of the `eps` and `min_samples` parameters
- Visualization:
  - PCA for dimensionality reduction
  - Scatter plots for cluster visualization
- Cluster Analysis:
  - Display of sample texts from each cluster
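
The text preprocessing steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not the code in `scripts/`: the `preprocess` and `naive_stem` names, the tiny stop-word set, and the suffix-stripping rule are all assumptions made for the example. A real run would more likely use the stop-word list and a stemmer or lemmatizer from `nltk`.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch);
# nltk's English stop-word list would be the fuller choice.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def naive_stem(token):
    """Crude suffix stripping that stands in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stop words, then stem."""
    text = text.lower()
    # Keep only ASCII letters and whitespace; this also drops accented letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)
```

Calling `preprocess` on each row of the text column would give the cleaned strings that are later passed to the embedding model.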
Make sure you have the following Python packages installed:
pip install sentence-transformers pandas scikit-learn matplotlib seaborn nltk
- Load Data: Place your dataset files in the `data/` directory.
- Preprocess Data: Apply text cleaning and normalization to prepare data for embedding generation.
- Generate Embeddings: Convert the preprocessed text into embeddings using the SBERT model.
- Save Embeddings: Store the generated embeddings for use in clustering.
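
The embedding and saving steps above might look roughly like the sketch below. The model name comes from this README and `SentenceTransformer(...).encode(...)` is the library's standard entry point, but the helper names `load_encoder` and `embed_and_save` and the output path are hypothetical and chosen only for illustration. The encoder is passed in as an argument so the saving logic can be exercised without loading the model.

```python
from pathlib import Path

import numpy as np

MODEL_NAME = "paraphrase-MiniLM-L6-v2"  # model named in this README


def load_encoder(model_name=MODEL_NAME):
    """Load the SBERT model (fetched on first use if it is not cached locally)."""
    # Imported lazily so the saving helper below works without the library loaded.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embed_and_save(texts, out_path, encoder):
    """Encode texts into a 2-D array and store it as a .npy file."""
    embeddings = np.asarray(encoder.encode(list(texts), show_progress_bar=False))
    out_path = Path(out_path)
    # Create the target directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(out_path, embeddings)
    return embeddings
```

For example, `embed_and_save(cleaned_texts, "results/embeddings.npy", load_encoder())` would write one embedding row per input text; the file name here is an assumption, and `np.load` reads the array back for clustering.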
- K-Means Clustering:
  - Perform clustering with a specified number of clusters.
  - Optionally, find the optimal number of clusters based on silhouette scores.
- DBSCAN Clustering:
  - Perform clustering with automatically tuned `eps` and `min_samples` parameters.
- Visualize Clusters:
  - Use PCA to reduce dimensionality and plot the clusters.
- Review Results:
  - Analyze clustering results and review example texts from each cluster.
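
The clustering and visualization steps above can be sketched with scikit-learn as follows. The function names, the candidate range for `k`, and the quantile rule used to pick `eps` are assumptions made for this example. The README does not say how the project tunes `eps` and `min_samples`, so the k-nearest-neighbour distance heuristic shown here is a common substitute rather than the project's own method.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def best_kmeans(embeddings, k_values=range(2, 11), seed=0):
    """Fit K-Means for each k and keep the result with the highest silhouette score."""
    best = None
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best  # (k, silhouette score, labels)


def dbscan_with_heuristic_eps(embeddings, min_samples=5, quantile=0.9):
    """Run DBSCAN with eps taken from the neighbour-distance distribution."""
    nn = NearestNeighbors(n_neighbors=min_samples).fit(embeddings)
    # Querying the training points returns each point itself at distance 0, so the
    # last column is the radius that encloses min_samples points including itself.
    distances, _ = nn.kneighbors(embeddings)
    eps = float(np.quantile(distances[:, -1], quantile))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    return eps, labels  # label -1 marks noise points


def project_2d(embeddings, seed=0):
    """Reduce embeddings to two PCA components for a scatter plot."""
    return PCA(n_components=2, random_state=seed).fit_transform(embeddings)
```

The two columns returned by `project_2d` can be passed to a matplotlib or seaborn scatter plot, colored by the cluster labels. To review results, grouping the original texts by label and printing a few from each group gives the sample texts per cluster.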
- `data/`: Contains input datasets (`train.parquet`, `test.parquet`).
- `results/`: Directory for output files and visualizations.
- `scripts/`: Contains Python scripts for preprocessing, embedding generation, clustering, and visualization.
- `README.md`: Project documentation.
This project is licensed under the MIT License. See the LICENSE file for details.