Reproducibility and Dimensionality Reduction Issues with BERTopic Using High-Dimensional Embeddings #2244
-
Dear Author, hello! I encountered two main issues while using the BERTopic model with my dataset: one concerns reproducibility, and the other concerns dimensionality reduction after embedding the text. The specific questions are as follows:

**1. Reproducibility Issue**

When I use the stella_en_v5 embedding model (currently ranked 4th on the MTEB leaderboard as of December 12, 2024), the number of topics extracted during the training phase changes every time I run the code, even though I have already ensured that UMAP's `random_state` is fixed (see `random_state=42` in the code below).

I also noticed a minor reproducibility issue in the topic representations: when using ChatGPT to assist with topic naming, the topic labels exhibit slight variations across runs. Is this caused by the inherent randomness of ChatGPT's responses?

**2. Dimensionality Reduction Issue**

The embedding dimension of the all-MiniLM-L6-v2 model is only 384, whereas the stella_en_v5 model has an impressive 8192 dimensions (other top-ranked models on the MTEB leaderboard also have embedding dimensions around 4000). In this case, does it mean that when using UMAP for dimensionality reduction, I should increase the relevant parameters (e.g., `n_components`) accordingly?

Below is the code I used for testing:

```python
import pandas as pd
df = pd.read_csv('/home/ubuntu/my_data.csv', usecols=['NewsContent'])
docs = df['NewsContent'].tolist()
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
# trust_remote_code=True and .cuda() are used because the official model page loads the model this way.
# However, even if I remove them and simplify the call to
# embedding_model = SentenceTransformer("dunzhang/stella_en_1.5B_v5"), the reproducibility issue still occurs.
embeddings = embedding_model.encode(docs, show_progress_bar=True)
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech
keybert_model = KeyBERTInspired()
pos_model = PartOfSpeech("en_core_web_sm")
mmr_model = MaximalMarginalRelevance(diversity=0.3)
prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
client = openai.OpenAI(api_key="sk-…", base_url='https://…')
openai_model = OpenAI(client, model="gpt-4o", exponential_backoff=True, chat=True, prompt=prompt)
representation_model = {"KeyBERT": keybert_model, "OpenAI": openai_model, "MMR": mmr_model, "POS": pos_model}
from bertopic import BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    verbose=True,
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs, embeddings)
```

I would greatly appreciate your insights into these questions. Thank you for your time!
-
It shouldn't be related to BERTopic, since all that step does is create embeddings. You could test creating the embeddings yourself with sentence-transformers and check whether they change when you run the encoding multiple times.
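For reference, a minimal sketch of that check, assuming the same `docs` list and model as in the snippet above; if the two runs differ, the non-determinism comes from the embedding step rather than from BERTopic:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Encode the same documents twice with the same model and compare the results.
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
emb_1 = model.encode(docs, show_progress_bar=True)
emb_2 = model.encode(docs, show_progress_bar=True)

# If this prints False, the embeddings themselves are not deterministic
# (e.g., due to non-deterministic GPU kernels), and UMAP's random_state cannot fix that.
print(np.allclose(emb_1, emb_2, atol=1e-6))
```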
That is related to ChatGPT, which indeed has some randomness (the same applies to any LLM). You could check whether there are parameters that might help with this, such as temperature or perhaps some greedy selection of tokens.
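As a sketch of that idea, assuming BERTopic's `OpenAI` representation forwards extra arguments to the chat completion call via its `generator_kwargs` parameter: setting `temperature=0` and a fixed `seed` should reduce, though not fully eliminate, the run-to-run variation in topic labels.

```python
from bertopic.representation import OpenAI

# Forward decoding parameters to the chat completion call to make the
# generated topic labels as deterministic as the API allows.
openai_model = OpenAI(
    client,
    model="gpt-4o",
    chat=True,
    prompt=prompt,
    exponential_backoff=True,
    generator_kwargs={"temperature": 0, "seed": 42},  # near-greedy decoding + fixed seed
)
```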
You could, but increasing those parameters too much will likely hurt how well the clustering algorithm can handle the data. The main reason for reducing the dimensionality is so the cluster model can handle the data better (due to the curse of dimensionality). I personally wouldn't increase that value by much. That said, it is definitely worth a try to see what happens when you change those parameters.
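If you do want to experiment, one possible sketch is to refit with a few `n_components` settings and compare how many topics come out; it reuses the precomputed `embeddings` and the models from the question, and the values tried are arbitrary examples, not recommendations:

```python
from umap import UMAP
from bertopic import BERTopic

# Try a few reduced dimensionalities and compare the resulting number of topics.
for n_components in (5, 10, 25):
    umap_model = UMAP(
        n_neighbors=15,
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        verbose=False,
    )
    topics, _ = topic_model.fit_transform(docs, embeddings)
    n_topics = len(set(topics)) - (1 if -1 in topics else 0)  # exclude the -1 outlier topic
    print(f"n_components={n_components} -> {n_topics} topics")
```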