This page aims to be a common wiki for all NLP literature and tools available for the Gujarati language.
For dataset papers, refer to the Data section.
- Patel, Chirag, and Karthik Gali. "Part-of-speech tagging for Gujarati using conditional random fields." Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages. 2008. [Paper, Cite]
- Suba, Kartik, Dipti Jiandani, and Pushpak Bhattacharyya. "Hybrid inflectional stemmer and rule-based derivational stemmer for Gujarati." Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP). 2011. [Paper, Cite]
- Patel, Pratikkumar, Kashyap Popat, and Pushpak Bhattacharyya. "Hybrid stemmer for Gujarati." Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing. 2010. [Paper, Cite]
- Bhensdadia, C. K., Brijesh Bhatt, and Pushpak Bhattacharyya. "Introduction to Gujarati WordNet." Third National Workshop on IndoWordNet Proceedings. Vol. 494. 2010. [Paper, Browser]
- Gujarati Grammar Wikipedia Page
- MuRIL: Khanuja, Simran, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam et al. "MuRIL: Multilingual Representations for Indian Languages." arXiv preprint arXiv:2103.10730 (2021). [Paper, Model, Cite]
- Indic-BERT: Kakwani, Divyanshu, Anoop Kunchukuttan, Satish Golla, N. C. Gokul, Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4948-4961. 2020. [Paper, Documentation, Model, Cite]
- Indic-FT: Kakwani, Divyanshu, Anoop Kunchukuttan, Satish Golla, N. C. Gokul, Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4948-4961. 2020. [Paper, Documentation, Model, Cite]
- IndicBART: Dabre, Raj, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, and Pratyush Kumar. "IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages." arXiv preprint arXiv:2109.02903 (2021). [Paper, Documentation, Model, Cite]
- GujTB: Mayank Jobanputra, Maitrey Mehta, and Çağrı Çöltekin. 2024. A Universal Dependencies Treebank for Gujarati. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 56–62, Torino, Italia. ELRA and ICCL. [Paper]
- Dakshina (Size: 200k sentences, Year: 2020, Source: Wikipedia) [Data, Repo, Cite]
- AI4Bharat-IndicNLP (Size: 7.8M sentences, Year: 2020, Source: News Articles) [Data, Paper, Cite]
- Gujarati Wikipedia Articles (Size: 31k articles, Year: 2020) [Data]
- Gujarati News Articles (Size: 6.5k articles, Year: 2020) [Data]
- Ramesh, Gowtham, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo et al. "Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages." arXiv preprint arXiv:2104.05596 (2021) (Size: 3M sentences (En-Gu) and 15M sentences (In-Gu), Year: 2021, Source: News, Video subtitles, etc.). [Paper, In-In Data, En-In Data, Cite]
- Shah, Parth, and Vishvajit Bakrola. "Neural Machine Translation System of Indic Languages-An Attention based Approach." In 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), pp. 1-5. IEEE, 2019. (Size: 65k sentences, Year: 2019, Source: MSCOCO captions) [Paper, Data, Cite]
- AI4Bharat-IndicNLP Article Topic Classification (Size: 680 articles, Year: 2020, Source: News Articles, Classes: 3) [Data, Paper, Cite]
- INLTK Headline Classification Corpus (Size: 6587 headlines, Year: 2020, Classes: 3) [Data, Repo]
- Baxi, Jatayu, and Brijesh Bhatt. "Morpheme Boundary Detection & Grammatical Feature Prediction for Gujarati: Dataset & Model." arXiv preprint arXiv:2112.09860 (2021). [Paper, Data, Cite]
- Part of Speech Tagset: Gujarati (also contains grammatical features; Year: 2009) [Document]