Hierarchical Topic Modeling (Divisive) #56

Merged
merged 22 commits into from
Jul 31, 2024
Changes from 20 commits
79e31e4
Added implementation of multiplicative update weighted NMF
x-tabdeveloping Jul 23, 2024
ccb8c22
Added basic implementation of topic hierarchies
x-tabdeveloping Jul 23, 2024
4128595
Added hierarchy to KeyNMF
x-tabdeveloping Jul 23, 2024
8fbf0d7
Added pretty printing and repr to hierarchy
x-tabdeveloping Jul 23, 2024
2edb49f
Added path info to hierarchy, made pretty printing less confusing
x-tabdeveloping Jul 23, 2024
b7704de
Restructured hierarchy to allow for dividing single topics
x-tabdeveloping Jul 24, 2024
7445666
Made hierarchy printing compact
x-tabdeveloping Jul 24, 2024
9d47e0c
Added tree plotting utility to hierarchies
x-tabdeveloping Jul 24, 2024
f7c6b4d
Added normalization to subtopic components
x-tabdeveloping Jul 24, 2024
54b4585
Added docs for hierarchical modeling
x-tabdeveloping Jul 24, 2024
996070c
Added Hierarchical page to mkdocs.yml
x-tabdeveloping Jul 30, 2024
6c678bd
Made KeyNMF online fitting more error tolerant and safer
x-tabdeveloping Jul 30, 2024
b37200e
Merge branch 'main' into hierarchical
x-tabdeveloping Jul 30, 2024
1d615ac
joined prepare_topic_data and _export_table tests
x-tabdeveloping Jul 30, 2024
87262d7
Fixed prepare_topic_data for KeyNMF
x-tabdeveloping Jul 30, 2024
24c2046
FASTopic is no longer an optional import
x-tabdeveloping Jul 30, 2024
3efb559
Added docs page for FASTopic
x-tabdeveloping Jul 30, 2024
74d3139
Version bump
x-tabdeveloping Jul 30, 2024
e2486ab
Updated readme
x-tabdeveloping Jul 30, 2024
29649e3
Added FASTopic to documentation index page
x-tabdeveloping Jul 30, 2024
1cda1df
Added test for hierarchical
x-tabdeveloping Jul 31, 2024
9a48eaa
Update README.md
x-tabdeveloping Jul 31, 2024
56 changes: 30 additions & 26 deletions README.md
Expand Up @@ -9,53 +9,56 @@
- Semantic Signal Separation - S³ 🧭
- KeyNMF 🔑 (paper in progress ⏳)
- GMM :gem: (paper soon)
- Implementations of existing transformer-based topic models
- Implementations of other transformer-based topic models
- Clustering Topic Models: BERTopic and Top2Vec
- Autoencoding Topic Models: CombinedTM and ZeroShotTM
- FASTopic :zap:
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.

### New in version 0.4.0
### New in version 0.5.0

#### Online KeyNMF
#### Hierarchical KeyNMF

You can now online fit and finetune KeyNMF as you wish!
You can now subdivide topics in KeyNMF at will.

```python
from itertools import batched
from turftopic import KeyNMF

model = KeyNMF(10, top_n=5)

corpus = ["some string", "etc", ...]
for batch in batched(corpus, 200):
batch = list(batch)
model.partial_fit(batch)
model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.hierarchy.divide_children(n_subtopics=3)
print(model.hierarchy)
```

#### $S^3$ Concept Compasses
<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

You can now produce a compass of concepts along two semantic axes using $S^3$.

<table>
<tr>
<td>

```python
model = SemanticSignalSeparation(10).fit(corpus)
fig = model.concept_compass(topic_x=1, topic_y=4)
fig.show()
```
#### FASTopic *(Experimental)*

</td>
<td><img src="./docs/images/arxiv_ml_compass.png" width="350" style="margin-left: auto;margin-right: auto;"></td>
</tr>
</table>
You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic.

```python
from turftopic import FASTopic

model = FASTopic(10).fit(corpus)
model.print_topics()
```

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
Expand Down Expand Up @@ -180,6 +183,7 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to

## References
- Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
- Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
- Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
- Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
- Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
Expand Down
15 changes: 15 additions & 0 deletions docs/FASTopic.md
@@ -0,0 +1,15 @@
# FASTopic

FASTopic is a neural topic model based on Dual Semantic-relation Reconstruction.

> Turftopic contains an implementation repurposed for our API, but the implementation is mostly from the [original FASTopic package](https://github.com/BobXWu/FASTopic).

:warning: This part of the documentation is still under construction :warning:

## References

Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.

## API Reference

::: turftopic.models.fastopic.FASTopic
41 changes: 41 additions & 0 deletions docs/KeyNMF.md
Expand Up @@ -309,6 +309,47 @@ for batch in batched(zip(corpus, timestamps)):
model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
```

## Hierarchical Topic Modeling

When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.

This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
In other words:

1. Decompose keyword matrix $M \approx WH$
2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$
    1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
    2. Perform multiplicative updates until convergence. <br>
        $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
        $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
4. To sufficiently differentiate the subcomponents from each other, a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$:
    1. $\mathring{H}_{ij} = \mathring{H}_{ij} \cdot \ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$
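
The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not Turftopic's actual implementation: the function names `weighted_nmf` and `pseudo_ctfidf`, the fixed iteration count used in place of a convergence check, and the small `eps` added to the denominators are all assumptions made for the example.

```python
import numpy as np


def weighted_nmf(M, w, n_components, n_iter=200, eps=1e-10, seed=0):
    """Hypothetical sketch of document-weighted NMF with multiplicative updates.

    M: (n_docs, n_terms) nonnegative keyword matrix.
    w: (n_docs,) nonnegative document weights (e.g. a column of the parent W).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = M.shape
    # Step 3.1: random nonnegative initialisation.
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    # Scaling each document row by its weight corresponds to M^T ⊙ w above.
    Mw = M * w[:, None]
    for _ in range(n_iter):
        # Step 3.2, written for W and H rather than their transposes.
        W *= (Mw @ H.T) / ((W * w[:, None]) @ (H @ H.T) + eps)
        H *= (W.T @ Mw) / (W.T @ (W * w[:, None]) @ H + eps)
    return W, H


def pseudo_ctfidf(H):
    """Step 4: down-weight terms that score highly across all subcomponents."""
    A = H.mean()
    return H * np.log(1 + A / (1 + H.sum(axis=0, keepdims=True)))
```

In this sketch, `w` would be the $j$th column of the parent model's $W$, so documents that score high on the parent topic dominate the fit of the subcomponents.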

To create a hierarchical model, you can use the `hierarchy` property of the model.

```python
# This divides each of the topics in the model into 3 subtopics.
model.hierarchy.divide_children(n_subtopics=3)
print(model.hierarchy)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

## Considerations

### Strengths
Expand Down
2 changes: 1 addition & 1 deletion docs/dynamic.md
Expand Up @@ -77,7 +77,7 @@ model.plot_topics_over_time(top_k=5)
<figcaption>Topics over time on a Figure</figcaption>
</figure>

## Interface
## API reference

All dynamic topic models have a `temporal_components_` attribute, which contains the topic-term matrices for each time slice, along with a `temporal_importance_` attribute, which contains the importance of each topic in each time slice.
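
As a rough illustration of how such attributes could be inspected, the sketch below uses small stand-in arrays instead of a fitted model. The array shapes, `(n_time_slices, n_topics, n_vocab)` for the components and `(n_time_slices, n_topics)` for the importances, are assumptions, as are the helper name `top_terms` and the toy vocabulary.

```python
import numpy as np

# Toy stand-ins for a fitted model's vocabulary and temporal attributes.
vocab = np.array(["cpu", "gpu", "god", "faith", "disk"])
# Assumed shape: (n_time_slices, n_topics, n_vocab)
temporal_components = np.array(
    [
        [[0.9, 0.1, 0.0, 0.0, 0.5], [0.0, 0.0, 0.8, 0.6, 0.1]],
        [[0.2, 0.9, 0.0, 0.0, 0.4], [0.0, 0.1, 0.3, 0.9, 0.0]],
    ]
)
# Assumed shape: (n_time_slices, n_topics)
temporal_importance = np.array([[0.7, 0.3], [0.4, 0.6]])


def top_terms(components, vocab, top_k=2):
    """Return the top_k highest-weighted terms for every (slice, topic) pair."""
    order = np.argsort(-components, axis=-1)[..., :top_k]
    return vocab[order]


terms = top_terms(temporal_components, vocab)
for t, (slice_terms, importance) in enumerate(zip(terms, temporal_importance)):
    for topic, (words, imp) in enumerate(zip(slice_terms, importance)):
        print(f"slice {t}, topic {topic} (importance {imp:.1f}): {', '.join(words)}")
```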

Expand Down
152 changes: 152 additions & 0 deletions docs/hierarchical.md
@@ -0,0 +1,152 @@
# Hierarchical Topic Modeling

> Note: Hierarchical topic modeling in Turftopic is still in its early stages; you can expect more visualization utilities, tools and models in the future :sparkles:

You might expect some topics in your corpus to belong to a hierarchy of topics.
Some models in Turftopic (currently only [KeyNMF](KeyNMF.md)) allow you to investigate hierarchical relations and build a taxonomy of topics in a corpus.

## Divisive Hierarchical Modeling

Currently, Turftopic, in contrast with other topic modeling libraries, only allows for hierarchical modeling in a divisive context.
This means that topics can be divided into subtopics in a **top-down** manner.
[KeyNMF](KeyNMF.md) does not discover a topic hierarchy automatically,
but you can manually instruct the model to find subtopics in larger topics.

As a demonstration, let's load a corpus that we know to have hierarchical themes.

```python
from sklearn.datasets import fetch_20newsgroups

corpus = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    categories=[
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "talk.religion.misc",
        "alt.atheism",
    ],
).data
```

In this case, we have two base themes: **computers** and **religion**.
Let us fit a KeyNMF model with two topics to see if the model finds these.

```python
from turftopic import KeyNMF

model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | windows, dos, os, disk, card, drivers, file, pc, files, microsoft |
| 1 | atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs |

The results confirm our intuition. Topic 0 seems to revolve around IT, while Topic 1 revolves around atheism and religion.
We can already suspect, however, that more granular topics could be discovered in this corpus.
For instance, Topic 0 contains terms related to operating systems, like *windows* and *dos*, but also hardware components, like *disk* and *card*.

We can access the current hierarchy of topics through the model's `hierarchy` property.

```python
print(model.hierarchy)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
</tt>
</div>

There isn't much to see yet: the model contains a flat hierarchy of the two topics we discovered, and we are at the root level.
We can dissect these topics by adding a level to the hierarchy.

Let us add 3 subtopics to each topic on the root level.

```python
model.hierarchy.divide_children(n_subtopics=3)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

As you can see, the model managed to identify meaningful subtopics of the two larger topics we found earlier.
Topic 0 got divided into a topic mostly concerned with DOS and Windows, a topic on operating systems in general, and one about hardware,
while Topic 1 got divided into a topic about newsgroups, one about atheism, and one about morality and Christianity.

You can also easily access nodes of the hierarchy by indexing it:
```python
model.hierarchy[0]
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
└── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
</tt>
</div>

You can also divide individual topics into a number of subtopics by using the `divide()` method.
Let us divide Topic 0.0 into 5 subtopics.

```python
model.hierarchy[0][0].divide(5)
model.hierarchy
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ │ ├── <b style="color: green">0.0.1</b>: file, files, ftp, bmp, program, windows, shareware, directory, bitmap, zip <br>
│ │ ├── <b style="color: green">0.0.2</b>: os, windows, unix, microsoft, crash, apps, crashes, nt, pc, operating <br>
│ │ ├── <b style="color: green">0.0.3</b>: disk, disks, floppy, drive, drives, scsi, boot, hd, norton, ide <br>
│ │ ├── <b style="color: green">0.0.4</b>: dos, modem, command, ms, emm386, serial, commands, 386, drivers, batch <br>
│ │ └── <b style="color: green">0.0.5</b>: printer, print, printing, fonts, font, postscript, hp, printers, output, driver <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

## Visualization
You can visualize hierarchies in Turftopic by using the `plot_tree()` method of a topic hierarchy.
The plot is interactive and you can zoom in or hover on individual topics to get an overview of the most important words.

```python
model.hierarchy.plot_tree()
```

<figure>
<img src="../images/hierarchy_tree.png" width="90%" style="margin-left: auto;margin-right: auto;">
<figcaption>Tree plot of the hierarchy.</figcaption>
</figure>


## API reference

::: turftopic.hierarchical.TopicNode



Binary file added docs/images/hierarchy_tree.png
4 changes: 3 additions & 1 deletion docs/index.md
Expand Up @@ -23,14 +23,16 @@ pip install turftopic[pyro-ppl]
You can use most transformer-based topic models in Turftopic, these include:

- [Semantic Signal Separation - $S^3$](s3.md) :compass:
- [KeyNMF](KeyNMF.md) :key:
- [KeyNMF](KeyNMF.md) :key:
- [Gaussian Mixture Models (GMM)](gmm.md)
- [Clustering Topic Models](clustering.md):
- [BERTopic](clustering.md#bertopic_and_top2vec)
- [Top2Vec](clustering.md#bertopic_and_top2vec)
- [Auto-encoding Topic Models](ctm.md):
- CombinedTM
- ZeroShotTM
- [FASTopic](fastopic.md) :zap:



## Basic Usage
Expand Down
2 changes: 2 additions & 0 deletions mkdocs.yml
Expand Up @@ -7,6 +7,7 @@ nav:
- Using Turftopic: basics.md
- Dynamic Topic Modeling: dynamic.md
- Online Topic Modeling: online.md
- Hierarchical Topic Modeling: hierarchical.md
- Model Persistence: persistence.md
- Models:
- Model Overview: model_overview.md
Expand All @@ -15,6 +16,7 @@ nav:
- GMM: GMM.md
- Clustering Models: clustering.md
- Autoencoding Models: ctm.md
- FASTopic: fastopic.md
- Encoders: encoders.md
theme:
name: material
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Expand Up @@ -6,7 +6,7 @@ line-length=79

[tool.poetry]
name = "turftopic"
version = "0.4.5"
version = "0.5.0"
description = "Topic modeling with contextual representations from sentence transformers."
authors = ["Márton Kardos <power.up1163@gmail.com>"]
license = "MIT"