Clustering model updates #74

Merged
merged 14 commits into from
Nov 28, 2024

26 changes: 26 additions & 0 deletions README.md
@@ -20,6 +20,32 @@

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.

### New in version 0.10.0

You can interactively explore clusters using `datamapplot` directly in Turftopic!
You will first have to install `datamapplot` for this to work.

```python
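from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer

# `corpus` is assumed to be a list of document strings defined earlier
model = ClusteringTopicModel(feature_importance="centroid")
model.fit(corpus)

# Generate human-readable topic names with an OpenAI model
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# Build the interactive figure and export it as a standalone HTML file
fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig  # in a Jupyter notebook, this displays the figure inline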
```
> If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.

<figure>
<img src="docs/images/cluster_datamapplot.png" width="70%" style="margin-left: auto;margin-right: auto;">
<figcaption>Interactive figure to explore cluster structure in a clustering topic model.</figcaption>
</figure>

### New in version 0.9.0

#### Dynamic S³ 🧭
35 changes: 34 additions & 1 deletion docs/basics.md
@@ -282,7 +282,40 @@ model.print_topics()

### Visualization

Turftopic does not come with built-in visualization utilities, but [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation, is fully compatible with Turftopic models.
#### Datamapplot *(clustering models only)*

You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
You will first have to install `datamapplot` for this to work:

```bash
pip install turftopic[datamapplot]
```

```python
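from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer

# `corpus` is assumed to be a list of document strings defined earlier
model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

# Generate human-readable topic names with an OpenAI model
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# Build the interactive figure and export it as a standalone HTML file
fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig  # in a Jupyter notebook, this displays the figure inline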
```

!!! info
    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
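
If the same script has to work both inside and outside a notebook, the choice between a trailing `fig` and `fig.show()` can be made at runtime. The sketch below is only an illustration: `running_in_notebook` and `display_figure` are hypothetical helpers that are not part of Turftopic, and checking for the `ZMQInteractiveShell` class name is a common heuristic rather than an official API. The only thing assumed about `fig` is that it has the `show()` method mentioned above.

```python
def running_in_notebook() -> bool:
    """Best-effort check for a Jupyter kernel; False if IPython is not installed."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False
    shell = get_ipython()
    # Jupyter kernels use ZMQInteractiveShell; plain terminals and scripts do not.
    return shell is not None and type(shell).__name__ == "ZMQInteractiveShell"


def display_figure(fig):
    """Return `fig` for inline rendering in a notebook, otherwise call fig.show()."""
    if running_in_notebook():
        return fig
    fig.show()
    return None
```

In a notebook, `display_figure(fig)` as the last expression of a cell renders the figure inline; in a plain script, it falls back to `fig.show()`.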

<figure>
<iframe src="../images/cluster_datamapplot.html" title="Cluster visualization" style="height:800px;width:800px;padding:0px;border:none;"></iframe>
<figcaption> Interactive figure to explore cluster structure in a clustering topic model. </figcaption>
</figure>


#### topicwizard

Turftopic integrates with [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic visualization.

```bash
pip install topic-wizard
```
50 changes: 44 additions & 6 deletions docs/clustering.md
@@ -18,29 +18,36 @@ that the other libraries boast.
```python
from sklearn.manifold import TSNE
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(clustering=TSNE())
model = ClusteringTopicModel(dimensionality_reduction=TSNE())
```

It is common practice to reduce the dimensionality of the embeddings before clustering them.
This mitigates the curse of dimensionality, which affects many clustering models.
Dimensionality reduction by default is done with scikit-learn's **TSNE** implementation in Turftopic,
Dimensionality reduction by default is done with [**TSNE**](https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne) in Turftopic,
but users are free to specify the model that will be used for dimensionality reduction.
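
To make the curse of dimensionality concrete, here is a small standard-library sketch. It is not Turftopic code, and `distance_contrast` is a hypothetical helper. It draws uniformly random points and measures how much farther the farthest point is from a query than the nearest one. As the number of dimensions grows, this contrast shrinks, which makes it harder for distance-based clustering to tell neighbours from non-neighbours.

```python
import math
import random


def distance_contrast(n_dims: int, n_points: int = 200, seed: int = 0) -> float:
    """Relative gap between the farthest and the nearest point from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# The contrast is large in 2 dimensions and small in embedding-sized spaces.
for n_dims in (2, 32, 768):
    print(n_dims, round(distance_contrast(n_dims), 2))
```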

!!! tip "Use openTSNE for better performance!"
    By default, a scikit-learn implementation is used, but if you have the [openTSNE](https://github.com/pavlin-policar/openTSNE) package installed on your system, Turftopic will automatically use it.
    You can potentially speed up your clustering topic models by multiple orders of magnitude.

    ```bash
    pip install turftopic[opentsne]
    ```
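
To see which implementation is likely to be picked up in a given environment, you can probe for the packages without importing them. This is a sketch under the assumption that the automatic switch depends only on whether `openTSNE` is importable; `available_tsne_backend` is a hypothetical helper, not a Turftopic function.

```python
import importlib.util


def available_tsne_backend() -> str:
    """Report which TSNE package is importable, without actually importing it."""
    if importlib.util.find_spec("openTSNE") is not None:
        return "openTSNE"
    if importlib.util.find_spec("sklearn") is not None:
        return "scikit-learn"
    return "none"


print(available_tsne_backend())
```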

??? note "What reduction model should I choose?"
    The impact of the choice of dimensionality reduction method is not well understood and has not yet been explored in the literature.
    Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
    Top2Vec and BERTopic both use [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html), which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).

### Clustering

```python
from sklearn.cluster import OPTICS
from sklearn.cluster import HDBSCAN
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(clustering=OPTICS())
model = ClusteringTopicModel(clustering=HDBSCAN())
```

After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses **OPTICS** as its default.
Turftopic uses [**HDBSCAN**](https://scikit-learn.org/stable/modules/clustering.html#hdbscan) as its default.

??? note "What clustering model should I choose?"
    Some clustering models are capable of discovering the number of clusters in the data (HDBSCAN, DBSCAN, OPTICS, etc.).
@@ -174,6 +181,37 @@ To reset topics to the original clustering, use the `reset_topics()` method:
model.reset_topics()
```

### Visualization

You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
You will first have to install `datamapplot` for this to work:

```bash
pip install turftopic[datamapplot]
```

```python
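from turftopic import ClusteringTopicModel
from turftopic.namers import OpenAITopicNamer

# `corpus` is assumed to be a list of document strings defined earlier
model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)

# Generate human-readable topic names with an OpenAI model
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# Build the interactive figure and export it as a standalone HTML file
fig = model.plot_clusters_datamapplot()
fig.save("clusters_visualization.html")
fig  # in a Jupyter notebook, this displays the figure inline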
```

!!! info
    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.

<figure>
<iframe src="../images/cluster_datamapplot.html" title="Cluster visualization" style="height:800px;width:800px;padding:0px;border:none;"></iframe>
<figcaption> Interactive figure to explore cluster structure in a clustering topic model. </figcaption>
</figure>


### Manual Topic Merging

You can also manually merge topics using the `join_topics()` method.