
Neural topic modeling methods #59

Open
soroush-ziaeinejad opened this issue Dec 2, 2022 · 2 comments
@soroush-ziaeinejad
Contributor

Hi @Lillliant,

This issue page is for your second task on SEERa. As you know, SEERa has a layered structure, and its second layer is tml (the topic modeling layer). SEERa currently provides three topic modeling methods: lda, gsdmm, and btm. Since none of them is a neural model (which would likely outperform our current methods), we decided to add at least one neural topic modeling method to SEERa. As a first step, please do some research on current neural topic modeling methods and report your findings. Then we can discuss them and decide which ones should be added to SEERa. Here are two links for you to take a look at as well:
https://github.com/MaartenGr/BERTopic
https://github.com/MilaNLProc/contextualized-topic-models

Please let us know if you have any questions regarding this task.

@hosseinfani, please feel free to add your comments on this task.

@Lillliant
Member

Based on my search, neural topic models are usually based on these methods:

  • VAE (variational autoencoders): one of the most popular types of NTM that I could find on GitHub. If my interpretation is correct, VAEs are useful when dealing with large-scale datasets. In particular, many implementations also use transformer-based NLP models (like BERT) to better capture semantics and context. Some examples include:
    • OCTIS: a framework of topic models, including NTMs (NeuralLDA, CTM, and ProdLDA). These models can also be optimized using OCTIS.
    • BERTopic: supports use cases like dynamic topic modeling (how topics evolve over time) and topics per class (how topics are represented in each group/category of data).
    • CTM: includes two types of models - ZeroShotTM (good for multilingual datasets with words that weren't seen during training) and CombinedTM (produces better topic coherence). However, both are less efficient on text with many distinct words.
    • AVITM: a PyTorch implementation of AVITM, which is also used in ProdLDA. The highlighted advantages are ease of use, fast training, and improved topic coherence compared to LDA.
  • GNN (Graph Neural Networks)
  • WAE (Wasserstein auto-encoders)
  • GAN (generative adversarial networks)
    • Neural Topic Models: provides PyTorch implementations of NTMs that use VAE, GAN, and WAE architectures. However, modifications may be needed to ensure that these models support English.
    • Compared to VAE models with NLP, I was not able to find as many implementations using GNN/WAE/GAN.
  • NADE (Neural Autoregressive Density Estimation)
    • iDocNADEe: aims to improve short-text classification using techniques similar to those in BERT.
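To make the VAE family above a bit more concrete, here is a minimal, library-free sketch (all numbers are hypothetical) of the two steps at the core of VAE-based NTMs such as ProdLDA: the reparameterization trick that keeps sampling differentiable, and the softmax mapping that turns the latent sample into per-document topic proportions. This is an illustration, not any library's actual implementation:

```python
import math
import random

random.seed(0)

def reparameterize(mu, log_var):
    """VAE reparameterization trick: z = mu + sigma * eps, where
    eps ~ N(0, 1). Sampling stays differentiable w.r.t. mu and log_var."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def softmax(z):
    """Map the latent sample onto the simplex, giving topic proportions
    (the ProdLDA-style decoder then mixes topic-word distributions)."""
    zmax = max(z)
    exps = [math.exp(v - zmax) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output for one document with 4 topics.
mu = [0.1, -0.2, 0.4, 0.0]
log_var = [-1.0, -1.0, -1.0, -1.0]

theta = softmax(reparameterize(mu, log_var))
print(theta)  # topic proportions for the document; sums to 1
```

In a real model, `mu` and `log_var` come from an encoder network over the document's bag-of-words, and the KL term of the ELBO regularizes them toward the prior.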

Many of the models I've found also include performance benchmarks (e.g. this paper examines performance for BERTopic and CTM). It seems that CTM is better in topic diversity, but BERTopic is better in topic coherence. However, I'm not sure if I understand the benchmarks well enough to make an accurate judgment.
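For reference, one of the benchmark quantities mentioned above, topic diversity, has a simple common definition: the fraction of unique words among the top words of all topics. A minimal sketch with toy word lists (the topics and words are made up for illustration):

```python
def topic_diversity(topics):
    """Fraction of unique words among all topics' top words.
    1.0 means no overlap between topics; values near 0 mean
    the topics are largely redundant."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# Toy top-3 words for three topics (hypothetical).
topics = [
    ["game", "team", "score"],
    ["market", "stock", "price"],
    ["game", "player", "score"],  # overlaps with the first topic
]

print(topic_diversity(topics))  # 7 unique words / 9 total = 0.777...
```

Topic coherence metrics (e.g. NPMI) are more involved, since they depend on word co-occurrence statistics in a reference corpus.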

I tried to understand the different terms by reading this paper. Please let me know if something's inaccurate or unclear!

@hosseinfani
Member

@Lillliant Thank you.

@soroush-ziaeinejad @Hamedloghmani @farinamhz @ZahraTaherikhonakdar @yogeswarl @karan96
This is an example of a short but nice literature review, done in 7 days by an undergrad student who is approaching final exams with 3-4 courses! Do you have anything like this for your research domain after months/years of work?!
