FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts
FarSSiBERT is a monolingual large language model based on Google's BERT architecture. It is pre-trained on a large corpus of informal Persian short texts with various writing styles, including more than 104M tweets on diverse subjects.
Paper presenting FarSSiBERT:
It includes a Python library for measuring the semantic similarity of Persian short texts. The library provides:
- Text cleaning.
- A tokenizer tailored to informal Persian short texts.
- Word and sentence embeddings based on transformers (a rough embedding sketch follows this list).
- Semantic similarity measurement.
- A pre-trained BERT model for all downstream tasks, especially informal texts.
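As an informal illustration of the embedding feature above, the sketch below loads the pre-trained model through Hugging Face transformers and mean-pools the token embeddings into a sentence vector. The hub ID "username/FarSSiBERT" and the mean-pooling step are illustrative assumptions, not the library's documented API; the packaged SSMeasurement class shown further down is the supported interface.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "username/FarSSiBERT"  # hypothetical hub ID; replace with the actual checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def sentence_embedding(text: str) -> torch.Tensor:
    # Tokenize the informal Persian text and encode it with the transformer.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings into a single sentence vector (one common choice).
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

print(sentence_embedding("متن اول").shape)  # "متن اول" = "first text"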
- Download and install the FarSSiBERT Python package.
- Import and use it as shown below:
from FarSSiBERT.SSMeasurement import SSMeasurement

text1 = "متن اول"  # "first text"
text2 = "متن دوم"  # "second text"

new_instance = SSMeasurement(text1, text2)
label = new_instance.get_similarity_label()        # similarity label for the pair
similarity = new_instance.get_cosine_similarity()  # cosine similarity score
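The score returned by get_cosine_similarity() is presumably the cosine similarity between the two transformer sentence embeddings, in line with the features listed above. A minimal sketch of that step, reusing the hypothetical sentence_embedding helper from the earlier snippet:

import torch.nn.functional as F

emb1 = sentence_embedding("متن اول")  # "first text"
emb2 = sentence_embedding("متن دوم")  # "second text"

# Cosine similarity in [-1, 1]; values closer to 1 indicate closer meaning.
score = F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
print(score)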
Requirements:
- python>=3.7
- transformers==4.30.2
- torch==1.13.0
- scikit-learn==0.21.3
- numpy~=1.21.6