Refining docs + added finetuning and modifying models page
x-tabdeveloping committed Oct 23, 2024
1 parent 6c0f251 commit 6ecdd44
Showing 5 changed files with 213 additions and 10 deletions.
4 changes: 2 additions & 2 deletions docs/GMM.md
@@ -1,4 +1,4 @@
# GMM
# GMM (Gaussian Mixture Model)

GMM is a generative probabilistic model over the contextual embeddings.
The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
@@ -33,7 +33,7 @@ A document-topic-matrix ($T$) is built from the likelihoods of each component gi
??? info "Click to see formula"
- For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$

### 3. Soft c-TF-IDF
### Soft c-TF-IDF

Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
an extension of __c-TF-IDF__ that can be used with continuous labels.
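
To give a rough feel for the computation, here is a minimal NumPy sketch of a Soft c-TF-IDF-style weighting (an illustration under simplifying assumptions, not necessarily the exact formula Turftopic implements):

```python
import numpy as np

def soft_ctfidf(doc_topic: np.ndarray, doc_term: np.ndarray) -> np.ndarray:
    """Sketch of a soft class-based TF-IDF weighting.

    doc_topic: (n_docs, n_topics) component likelihoods,
    doc_term: (n_docs, n_terms) term counts.
    """
    # Soft term frequencies: weight each document's term counts by its topic likelihoods
    tf = doc_topic.T @ doc_term
    tf = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)
    # IDF-like reweighting based on how common each term is overall
    term_freq = doc_term.sum(axis=0)
    avg_doc_length = doc_term.sum(axis=1).mean()
    idf = np.log(1 + avg_doc_length / np.maximum(term_freq, 1e-12))
    return tf * idf  # (n_topics, n_terms) term importances
```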
195 changes: 195 additions & 0 deletions docs/finetuning.md
@@ -0,0 +1,195 @@
# Modifying and finetuning models

Some models in Turftopic can be flexibly modified after being fitted.
This allows users to adapt pretrained topic models to their specific use cases.

## Naming/renaming topics

Topics can be freely renamed in all topic models.
This is useful when interpreting models, as it lets you attach human-readable labels to topics you have already examined.

```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10).fit(corpus)

# you can specify a dict mapping IDs to names
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
# or a list of topic names
model.rename_topics([f"Topic {i}" for i in range(10)])
```
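
If you want to check that the new labels took effect, you can inspect them afterwards (a quick check, assuming the renamed labels are reflected in `model.topic_names`, which is used elsewhere in these docs):

```python
# After the list-based rename above, the stored names reflect the new labels
print(model.topic_names[:2])
# ['Topic 0', 'Topic 1']
```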

## Changing the number of topics

Several models let you change the number of topics after they have been fitted.

### Refitting $S^3$ with a different number of topics

$S^3$ models store all the information needed to refit them with a different number of topics, number of iterations, or random seed.
This process is very fast and lets you explore the semantics of a corpus at multiple levels of detail.
Moreover, any model you load from a third party can be refitted at will.

```python
from turftopic import load_model

model = load_model("hf_user/some_s3_model")

print(type(model))
# turftopic.models.decomp.SemanticSignalSeparation

print(len(model.topic_names))
# 10

model.refit(n_components=20, random_seed=42)
print(len(model.topic_names))
# 20
```
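
Since refitting is this cheap, you can sweep over several topic counts and compare the resulting topic descriptions (a small sketch using the calls shown above):

```python
# Explore the same corpus at several levels of detail by refitting in a loop
for n_topics in (5, 10, 20):
    model.refit(n_components=n_topics, random_seed=42)
    print(f"--- {n_topics} topics ---")
    model.print_topics()
```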

### Merging topics in clustering models

Clustering models are very flexible in this regard, as they allow you to merge clusters after fitting.

#### Manual topic merging

You can merge topics manually in a clustering model by using the `join_topics()` method:

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)

# This will join topics 0, 5 and 4 into topic 0
model.join_topics([0, 5, 4])
```

#### Hierarchical merging

You can also merge clusters automatically into a desired number of topics.
This can be done with the `reduce_topics()` method:

!!! info
For more info on topic merging methods, check out [this page](clustering.md)

```python
model = ClusteringTopicModel().fit(corpus)
model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
```
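
After reduction, the merged topics can be relabelled and inspected like any others, using `rename_topics()` and `print_topics()` shown elsewhere on this page (the labels below are purely hypothetical):

```python
# Relabel two of the merged topics (hypothetical labels) and inspect the result
model.rename_topics({0: "Technology", 1: "Sports"})
model.print_topics()
```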

## Finetuning models on a new corpus

Currently, only KeyNMF can be finetuned on a new corpus.
You can do this by using the `partial_fit()` method on texts the model hasn't seen before:

```python
from turftopic import load_model

model = load_model("pretrained_keynmf_model")

print(type(model))
# turftopic.models.keynmf.KeyNMF

new_corpus: list[str] = [...]
# Finetune the model to the new corpus
model.partial_fit(new_corpus)

model.to_disk("finetuned_model/")
```
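
You can then load the finetuned model back the same way (a sketch assuming `load_model()` accepts the local directory written by `to_disk()` as well as Hub repository IDs):

```python
from turftopic import load_model

# Load the finetuned model back from the directory written above
finetuned = load_model("finetuned_model/")
finetuned.print_topics()
```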


## Re-estimating word importance

Both $S^3$ and clustering models come with multiple ways of estimating the importance of words for topics.
Since both models estimate these scores post-hoc, they can be recalculated without fitting a new model or refitting an old one.
This lets you experiment with different feature importance measures for the same model (same underlying clusters or axes).

Here's an example with $S^3$:
```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
0 │ hypocrisy, hypocritical, fallacy, debated, skeptics │ xfree86, emulator, codes, 9600, cd300 │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
1 │ spectrometer, dblspace, statistically, nutritional, makefile │ uh, um, yeah, hm, oh │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
2 │ bullpen, goaltenders, pitchers, goaltender, pitching │ intel, nsa, spying, encrypt, terrorism │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
3 │ espionage, wiretapping, cia, fbi, wiretaps │ agnosticism, agnostic, upgrading, affordable, cheaper │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
4 │ affordable, dealers, warrants, handguns, dealership │ semitic, theologians, judaism, persecuted, pagan │
└──────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────┘


model.estimate_components("angular")
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
0 │ hypocritical, debated, hypotheses, misconceptions, fallacy │ diagnostics, win31, modems, cd300, gd3004 │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
1 │ spectrometer, dblspace, statistically, makefile, nutritional │ ye, sub, naked, experiences, uh │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
2 │ bullpen, puckett, hitters, clemens, jenks │ encryption, encrypt, intel, cryptosystem, cryptosystems │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
3 │ journalists, cdc, chlorine, npr, briefing │ values, ratios, upgrading, calculations, inherit │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
4 │ handguns, warrants, warranty, reliability, handgun │ nutritional, metabolism, deuteronomy, pathology, hormone │
└──────────┴──────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

```

And one with clustering models:

!!! info
Remember, these are the same underlying clusters, just described in two different ways. For further details, check out [this page](clustering.md)

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=5, feature_importance="soft-c-tf-idf").fit(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-1 │ like, just, don, use, does, know, time, good, people, edu │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
0 │ people, said, god, president, mr, think, going, say, did, myers │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
1 │ max, g9v, b8f, a86, pl, 00, 145, 1d9, dos, 34u
├──────────┼────────────────────────────────────────────────────────────────────────────┤
2 │ msg, cancer, food, battery, water, candida, medical, vitamin, yeast, diet │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
3 │ 25, 55, pit, det, pts, la, bos, 03, 10, 11
├──────────┼────────────────────────────────────────────────────────────────────────────┤
4 │ insurance, car, dog, radar, health, bike, helmet, private, detector, speed │
└──────────┴────────────────────────────────────────────────────────────────────────────┘


model.estimate_components(feature_importance="centroid")
model.print_topics()

┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-1 │ documented, concerns, dubious, obsolete, concern, alternative, et4000, complaints, cx, discussed │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
0 │ persecutions, persecution, condemning, condemnation, fundamentalists, persecuted, fundamentalism, │
│ │ theology, advocating, fundamentalist │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
1 │ xfree86, pcx, emulation, microsoft, hardware, emulator, x11r5, netware, workstations, chipset │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
2 │ contamination, fungal, precautions, harmful, poisoning, chemicals, treatments, toxicity, dangers, │
│ │ prevention │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
3 │ nhl, bullpen, goaltenders, standings, sabres, canucks, braves, mlb, flyers, playoffs │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
4 │ automotive, vehicle, vehicles, speeding, automobile, automobiles, driving, motorcycling, │
│ │ motorcycles, highways │
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

```


8 changes: 4 additions & 4 deletions docs/online.md
@@ -54,18 +54,18 @@ for epoch in range(5):
You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
This will morph the model's topics to the corpus at hand.

In this example I will load a pretrained KeyNMF model from disk. (see [Persistance](persistance.md))
In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))

```python
import joblib
from turftopic import load_model

model = joblib.load("pretrained_keynmf.joblib")
model = load_model("pretrained_keynmf_model")

new_corpus: list[str] = [...]
# Finetune the model to the new corpus
model.partial_fit(new_corpus)

joblib.dump(model, "finetuned_keynmf.joblib")
model.to_disk("finetuned_model/")
```

## Precomputed Embeddings
