Refining docs + added finetuning and modifying models page
x-tabdeveloping committed Oct 23, 2024
1 parent 6c0f251 commit 6ecdd44
Showing 5 changed files with 213 additions and 10 deletions.
4 changes: 2 additions & 2 deletions docs/GMM.md
@@ -1,4 +1,4 @@
# GMM
# GMM (Gaussian Mixture Model)

GMM is a generative probabilistic model over the contextual embeddings.
The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
@@ -33,7 +33,7 @@ A document-topic-matrix ($T$) is built from the likelihoods of each component gi
??? info "Click to see formula"
- For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$

### 3. Soft c-TF-IDF
### Soft c-TF-IDF

Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
an extension of __c-TF-IDF__ that can be used with continuous labels.
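
To give a rough feel for the computation, here is a minimal NumPy sketch of a Soft c-TF-IDF-style weighting (an illustration under simplifying assumptions, not necessarily the exact formula Turftopic implements):

```python
import numpy as np

def soft_ctfidf(doc_topic: np.ndarray, doc_term: np.ndarray) -> np.ndarray:
    """Sketch of a soft class-based TF-IDF weighting.

    doc_topic: (n_docs, n_topics) component likelihoods,
    doc_term: (n_docs, n_terms) term counts.
    """
    # Soft term frequencies: weight each document's term counts by its topic likelihoods
    tf = doc_topic.T @ doc_term
    tf = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)
    # IDF-like reweighting based on how common each term is overall
    term_freq = doc_term.sum(axis=0)
    avg_doc_length = doc_term.sum(axis=1).mean()
    idf = np.log(1 + avg_doc_length / np.maximum(term_freq, 1e-12))
    return tf * idf  # (n_topics, n_terms) term importances
```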
195 changes: 195 additions & 0 deletions docs/finetuning.md
@@ -0,0 +1,195 @@
# Modifying and finetuning models

Some models in Turftopic can be flexibly modified after being fitted.
This allows users to adapt pretrained topic models to their specific use cases.

## Naming/renaming topics

Topics can be freely renamed in all topic models.
This is useful when interpreting models, as it lets you attach human-readable labels to topics you have already examined.

```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10).fit(corpus)

# you can specify a dict mapping IDs to names
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
# or a list of topic names
model.rename_topics([f"Topic {i}" for i in range(10)])
```
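
If you want to check that the new labels took effect, you can inspect them afterwards (a quick check, assuming the renamed labels are reflected in `model.topic_names`, which is used elsewhere in these docs):

```python
# After the list-based rename above, the stored names reflect the new labels
print(model.topic_names[:2])
# ['Topic 0', 'Topic 1']
```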

## Changing the number of topics

Several models let you change the number of topics after they have been fitted.

### Refitting $S^3$ with a different number of topics

$S^3$ models store all the information needed to refit them with a different number of topics, number of iterations, or random seed.
This process is very fast and lets you explore the semantics of a corpus at multiple levels of detail.
Moreover, any model you load from a third party can be refitted at will.

```python
from turftopic import load_model

model = load_model("hf_user/some_s3_model")

print(type(model))
# turftopic.models.decomp.SemanticSignalSeparation

print(len(model.topic_names))
# 10

model.refit(n_components=20, random_seed=42)
print(len(model.topic_names))
# 20
```
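
Since refitting is this cheap, you can sweep over several topic counts and compare the resulting topic descriptions (a small sketch using the calls shown above):

```python
# Explore the same corpus at several levels of detail by refitting in a loop
for n_topics in (5, 10, 20):
    model.refit(n_components=n_topics, random_seed=42)
    print(f"--- {n_topics} topics ---")
    model.print_topics()
```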

### Merging topics in clustering models

Clustering models are very flexible in this regard, as they allow you to merge clusters after fitting.

#### Manual topic merging

You can merge topics manually in a clustering model by using the `join_topics()` method:

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel().fit(corpus)

# This will join topics 0, 5 and 4 into topic 0
model.join_topics([0, 5, 4])
```

#### Hierarchical merging

You can also merge clusters automatically into a desired number of topics.
This can be done with the `reduce_topics()` method:

!!! info
For more info on topic merging methods, check out [this page](clustering.md)

```python
model = ClusteringTopicModel().fit(corpus)
model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
```
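
After reduction, the merged topics can be relabelled and inspected like any others, using `rename_topics()` and `print_topics()` shown elsewhere on this page (the labels below are purely hypothetical):

```python
# Relabel two of the merged topics (hypothetical labels) and inspect the result
model.rename_topics({0: "Technology", 1: "Sports"})
model.print_topics()
```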

## Finetuning models on a new corpus

Currently, only KeyNMF can be finetuned on a new corpus.
You can do this by using the `partial_fit()` method on texts the model hasn't seen before:

```python
from turftopic import load_model

model = load_model("pretrained_keynmf_model")

print(type(model))
# turftopic.models.keynmf.KeyNMF

new_corpus: list[str] = [...]
# Finetune the model to the new corpus
model.partial_fit(new_corpus)

model.to_disk("finetuned_model/")
```
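
You can then load the finetuned model back the same way (a sketch assuming `load_model()` accepts the local directory written by `to_disk()` as well as Hub repository IDs):

```python
from turftopic import load_model

# Load the finetuned model back from the directory written above
finetuned = load_model("finetuned_model/")
finetuned.print_topics()
```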


## Re-estimating word importance

Both $S^3$ and clustering models come with multiple ways of estimating the importance of words for topics.
Since both models estimate these scores post-hoc, they can be recalculated without fitting a new model or refitting an old one.
This lets you experiment with different feature importance measures for the same model (same underlying clusters or axes).

Here's an example with $S^3$:
```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
0 │ hypocrisy, hypocritical, fallacy, debated, skeptics │ xfree86, emulator, codes, 9600, cd300 │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
1 │ spectrometer, dblspace, statistically, nutritional, makefile │ uh, um, yeah, hm, oh │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
2 │ bullpen, goaltenders, pitchers, goaltender, pitching │ intel, nsa, spying, encrypt, terrorism │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
3 │ espionage, wiretapping, cia, fbi, wiretaps │ agnosticism, agnostic, upgrading, affordable, cheaper │
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
4 │ affordable, dealers, warrants, handguns, dealership │ semitic, theologians, judaism, persecuted, pagan │
└──────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────┘


model.estimate_components("angular")
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
0 │ hypocritical, debated, hypotheses, misconceptions, fallacy │ diagnostics, win31, modems, cd300, gd3004 │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
1 │ spectrometer, dblspace, statistically, makefile, nutritional │ ye, sub, naked, experiences, uh │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
2 │ bullpen, puckett, hitters, clemens, jenks │ encryption, encrypt, intel, cryptosystem, cryptosystems │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
3 │ journalists, cdc, chlorine, npr, briefing │ values, ratios, upgrading, calculations, inherit │
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
4 │ handguns, warrants, warranty, reliability, handgun │ nutritional, metabolism, deuteronomy, pathology, hormone │
└──────────┴──────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

```

And one with clustering models:

!!! info
Remember, these are the same underlying clusters, just described in two different ways. For further details, check out [this page](clustering.md)

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=5, feature_importance="soft-c-tf-idf").fit(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-1 │ like, just, don, use, does, know, time, good, people, edu │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
0 │ people, said, god, president, mr, think, going, say, did, myers │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
1 │ max, g9v, b8f, a86, pl, 00, 145, 1d9, dos, 34u
├──────────┼────────────────────────────────────────────────────────────────────────────┤
2 │ msg, cancer, food, battery, water, candida, medical, vitamin, yeast, diet │
├──────────┼────────────────────────────────────────────────────────────────────────────┤
3 │ 25, 55, pit, det, pts, la, bos, 03, 10, 11
├──────────┼────────────────────────────────────────────────────────────────────────────┤
4 │ insurance, car, dog, radar, health, bike, helmet, private, detector, speed │
└──────────┴────────────────────────────────────────────────────────────────────────────┘


model.estimate_components(feature_importance="centroid")
model.print_topics()

┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-1 │ documented, concerns, dubious, obsolete, concern, alternative, et4000, complaints, cx, discussed │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
0 │ persecutions, persecution, condemning, condemnation, fundamentalists, persecuted, fundamentalism, │
│ │ theology, advocating, fundamentalist │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
1 │ xfree86, pcx, emulation, microsoft, hardware, emulator, x11r5, netware, workstations, chipset │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
2 │ contamination, fungal, precautions, harmful, poisoning, chemicals, treatments, toxicity, dangers, │
│ │ prevention │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
3 │ nhl, bullpen, goaltenders, standings, sabres, canucks, braves, mlb, flyers, playoffs │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
4 │ automotive, vehicle, vehicles, speeding, automobile, automobiles, driving, motorcycling, │
│ │ motorcycles, highways │
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

```


8 changes: 4 additions & 4 deletions docs/online.md
@@ -54,18 +54,18 @@ for epoch in range(5):
You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
This will morph the model's topics to the corpus at hand.

In this example I will load a pretrained KeyNMF model from disk. (see [Persistance](persistance.md))
In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))

```python
import joblib
from turftopic import load_model

model = joblib.load("pretrained_keynmf.joblib")
model = load_model("pretrained_keynmf_model")

new_corpus: list[str] = [...]
# Finetune the model to the new corpus
model.partial_fit(new_corpus)

joblib.dump(model, "finetuned_keynmf.joblib")
model.to_disk("finetuned_model/")
```

## Precomputed Embeddings
