Reproducibility and Dimensionality Reduction Issues with BERTopic Using High-Dimensional Embeddings #2244
-
Dear Author, hello! I encountered two main issues while using the BERTopic model with my dataset: one concerns reproducibility, and the other concerns dimensionality reduction after embedding the text. The specific questions are as follows:

**1. Reproducibility Issue**

When I use the stella_en_v5 embedding model (currently ranked 4th on the MTEB leaderboard as of December 12, 2024), the number of topics extracted during the training phase changes every time I run the code, even though I have already ensured that UMAP's `random_state` is fixed (see `random_state=42` in the code below).

I also noticed a minor reproducibility issue in the topic representations: when using ChatGPT to assist with topic naming, the topic labels exhibit slight variations across runs. Is this caused by the inherent randomness of ChatGPT's responses?

**2. Dimensionality Reduction Issue**

The embedding dimension of the all-MiniLM-L6-v2 model is only 384, whereas the stella_en_v5 model has an impressive 8192 dimensions (other top-ranked models on the MTEB leaderboard also have embedding dimensions around 4000). In this case, does it mean that when using UMAP for dimensionality reduction, I should increase the relevant parameters (e.g., `n_components`) accordingly?

Below is the code I used for testing:

```python
import pandas as pd
df = pd.read_csv('/home/ubuntu/my_data.csv', usecols=['NewsContent'])
docs = df['NewsContent'].tolist()
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
# trust_remote_code=True and .cuda() are used because the official model page loads the model this way.
# However, even if I remove them and simplify the call to
# embedding_model = SentenceTransformer("dunzhang/stella_en_1.5B_v5"), the reproducibility issue still occurs.
embeddings = embedding_model.encode(docs, show_progress_bar=True)
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech
keybert_model = KeyBERTInspired()
pos_model = PartOfSpeech("en_core_web_sm")
mmr_model = MaximalMarginalRelevance(diversity=0.3)
prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
client = openai.OpenAI(api_key="sk-…", base_url='https://…')
openai_model = OpenAI(client, model="gpt-4o", exponential_backoff=True, chat=True, prompt=prompt)
representation_model = {"KeyBERT": keybert_model, "OpenAI": openai_model, "MMR": mmr_model, "POS": pos_model}
from bertopic import BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    verbose=True,
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs, embeddings)
```

I would greatly appreciate your insights into these questions. Thank you for your time!
-
It shouldn't be related to BERTopic, since all that step does is create embeddings. You could test creating the embeddings yourself with sentence-transformers and check whether they change when you run the encoding multiple times.
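For reference, a minimal sketch of that check, assuming the same `docs` list and model as in the snippet above; if the two runs differ, the non-determinism comes from the embedding step rather than from BERTopic:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Encode the same documents twice with the same model and compare the results.
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
emb_1 = model.encode(docs, show_progress_bar=True)
emb_2 = model.encode(docs, show_progress_bar=True)

# If this prints False, the embeddings themselves are not deterministic
# (e.g., due to non-deterministic GPU kernels), and UMAP's random_state cannot fix that.
print(np.allclose(emb_1, emb_2, atol=1e-6))
```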
That is related to ChatGPT, which indeed has some randomness (the same applies to any LLM). You could check whether there are parameters that might help with this, such as temperature or perhaps some greedy selection of tokens.
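As a sketch of that idea, assuming BERTopic's `OpenAI` representation forwards extra arguments to the chat completion call via its `generator_kwargs` parameter: setting `temperature=0` and a fixed `seed` should reduce, though not fully eliminate, the run-to-run variation in topic labels.

```python
from bertopic.representation import OpenAI

# Forward decoding parameters to the chat completion call to make the
# generated topic labels as deterministic as the API allows.
openai_model = OpenAI(
    client,
    model="gpt-4o",
    chat=True,
    prompt=prompt,
    exponential_backoff=True,
    generator_kwargs={"temperature": 0, "seed": 42},  # near-greedy decoding + fixed seed
)
```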
You could, but increasing those parameters too much will likely hurt how well the clustering algorithm can handle the data. The main reason for reducing the dimensionality is so the cluster model can handle the data better (due to the curse of dimensionality). I personally wouldn't increase that value by much. That said, it is definitely worth a try to see what happens when you change those parameters.
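If you do want to experiment, one possible sketch is to refit with a few `n_components` settings and compare how many topics come out; it reuses the precomputed `embeddings` and the models from the question, and the values tried are arbitrary examples, not recommendations:

```python
from umap import UMAP
from bertopic import BERTopic

# Try a few reduced dimensionalities and compare the resulting number of topics.
for n_components in (5, 10, 25):
    umap_model = UMAP(
        n_neighbors=15,
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        verbose=False,
    )
    topics, _ = topic_model.fit_transform(docs, embeddings)
    n_topics = len(set(topics)) - (1 if -1 in topics else 0)  # exclude the -1 outlier topic
    print(f"n_components={n_components} -> {n_topics} topics")
```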