
Neural topic modeling methods #59

Open
soroush-ziaeinejad opened this issue Dec 2, 2022 · 2 comments
@soroush-ziaeinejad
Contributor

Hi @Lillliant,

This issue page is for your second task on SEERa. As you know, SEERa has a layered structure, and its second layer is tml (the topic modeling layer). SEERa currently provides three topic modeling methods: lda, gsdmm, and btm. Since none of them is a neural model (which would likely outperform our current methods), we decided to add at least one neural topic modeling method to SEERa. As a first step, please do some research on current neural topic modeling methods and report your findings. Then we can discuss them and decide which ones should be added to SEERa. Here are two links for you to take a look at as well:
https://github.com/MaartenGr/BERTopic
https://github.com/MilaNLProc/contextualized-topic-models

Please let us know if you have any questions regarding this task.

@hosseinfani, please feel free to add your comments on this task.

@Lillliant
Member

Based on my search, neural topic models are usually based on these methods:

  • VAE (variational autoencoders): one of the most popular types of NTM that I could find on GitHub. If my interpretation is correct, VAEs are useful when dealing with large-scale datasets. In particular, many implementations also use transformer-based NLP models (like BERT) to better capture semantics and context. Some examples include:
    • OCTIS: a framework of topic models, including NTMs (NeuralLDA, CTM, and ProdLDA). These models can also be optimized using OCTIS.
    • BERTopic: supports use cases like dynamic topic modeling (how topics evolve over time) and topics per class (how topics are represented in each group/category of data).
    • CTM: includes two types of models - ZeroShotTM (good for multilingual datasets with words that weren't seen during training) and CombinedTM (produces better topic coherence). However, both are less efficient on text with many distinct words.
    • AVITM: a PyTorch implementation of AVITM, which is also used in ProdLDA. The highlighted advantages are ease of use, fast training, and improved topic coherence compared to LDA.
  • GNN (Graph Neural Networks)
  • WAE (Wasserstein auto-encoders)
  • GAN (generative adversarial networks)
    • Neural Topic Models: provides PyTorch implementations of NTMs that use VAE, GAN, and WAE architectures. However, modifications may be needed to ensure that these models support English.
    • Compared to VAE models with NLP, I was not able to find as many implementations using GNN/WAE/GAN.
  • NADE (Neural Autoregressive Density Estimation)
    • iDocNADEe: aims to improve short-text classification using techniques similar to those in BERT.
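To make the VAE family above a bit more concrete, here is a minimal, library-free sketch (all numbers are hypothetical) of the two steps at the core of VAE-based NTMs such as ProdLDA: the reparameterization trick that keeps sampling differentiable, and the softmax mapping that turns the latent sample into per-document topic proportions. This is an illustration, not any library's actual implementation:

```python
import math
import random

random.seed(0)

def reparameterize(mu, log_var):
    """VAE reparameterization trick: z = mu + sigma * eps, where
    eps ~ N(0, 1). Sampling stays differentiable w.r.t. mu and log_var."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def softmax(z):
    """Map the latent sample onto the simplex, giving topic proportions
    (the ProdLDA-style decoder then mixes topic-word distributions)."""
    zmax = max(z)
    exps = [math.exp(v - zmax) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output for one document with 4 topics.
mu = [0.1, -0.2, 0.4, 0.0]
log_var = [-1.0, -1.0, -1.0, -1.0]

theta = softmax(reparameterize(mu, log_var))
print(theta)  # topic proportions for the document; sums to 1
```

In a real model, `mu` and `log_var` come from an encoder network over the document's bag-of-words, and the KL term of the ELBO regularizes them toward the prior.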

Many of the models I've found also include performance benchmarks (e.g. this paper examines performance for BERTopic and CTM). It seems that CTM is better in topic diversity, but BERTopic is better in topic coherence. However, I'm not sure if I understand the benchmarks well enough to make an accurate judgment.
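For reference, one of the benchmark quantities mentioned above, topic diversity, has a simple common definition: the fraction of unique words among the top words of all topics. A minimal sketch with toy word lists (the topics and words are made up for illustration):

```python
def topic_diversity(topics):
    """Fraction of unique words among all topics' top words.
    1.0 means no overlap between topics; values near 0 mean
    the topics are largely redundant."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# Toy top-3 words for three topics (hypothetical).
topics = [
    ["game", "team", "score"],
    ["market", "stock", "price"],
    ["game", "player", "score"],  # overlaps with the first topic
]

print(topic_diversity(topics))  # 7 unique words / 9 total = 0.777...
```

Topic coherence metrics (e.g. NPMI) are more involved, since they depend on word co-occurrence statistics in a reference corpus.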

I tried to understand the different terms by reading this paper. Please let me know if something's inaccurate or unclear!

@hosseinfani
Member

@Lillliant Thank you.

@soroush-ziaeinejad @Hamedloghmani @farinamhz @ZahraTaherikhonakdar @yogeswarl @karan96
This is an example of a short but nice literature review, done in 7 days by an undergrad student who is approaching final exams with 3-4 courses! Do you have anything like this for your research domain after months/years of work?!
