- Abstract
- Paper and Slides
- Requirements
- TLDRHQ: Data and Text Pre-processing
- Extreme Extractive Text Summarization
- Topic Modeling
- Status
- Contact
- License
- Contributing
Reddit is a social news aggregation and discussion website where users post content (such as links, text posts, images, and videos) on a wide range of topics in domain-specific boards called "communities" or "subreddits". This project implements text summarization and topic modeling pipelines on the textual content of Redditors' posts. The dataset used to achieve these goals is the Reddit-based TL;DR summarization dataset TLDRHQ, which contains approximately 1.7 million Reddit posts (submissions as well as comments) obtained through scraping. Each instance in the dataset includes a Reddit post and its TL;DR ("Too Long; Didn't Read"), an extremely short summary of the post's content that users are encouraged to leave at the end of a post. While Reddit usage has grown, the practice of writing TL;DRs has not kept pace. In this context, a system (such as a bot) capable of automatically generating the TL;DR of a post could improve Reddit's usability. However, the high abstractiveness, heterogeneity and noisiness of posts make the summarization task challenging. In this work a supervised extreme extractive summarization model is developed. Despite its lower complexity, results show that its performance is not far from that of the state-of-the-art BertSumExt. Moreover, topic modeling of the posts can help identify the hidden topics of a post and evaluate whether it was published in the right subreddit. In this project the LSA and LDA techniques are used. On this dataset LSA outperformed LDA and identified 20 well-defined topics, providing the respective document-topic and topic-term matrices.
- python 3.10.7
- contractions==0.1.73
- gensim==4.3.0
- ipython==8.8.0
- matplotlib==3.6.3
- nltk==3.8.1
- num2words==0.5.12
- numpy==1.22.1
- pandas==1.3.5
- seaborn==0.12.2
- simplemma==0.9.0
- spacy==3.4.4
- swifter==1.3.4
- textblob==0.17.1
- tqdm==4.64.1
- scikit-learn==1.2.0
- rouge-score==0.1.2
- imbalanced-learn==0.10.1
- wordcloud==1.8.2.2
- pyLDAvis==3.3.1
First of all, create three empty folders: `./Dataset_TLDRHQ`, `./ProcessedData` and `./Dataset_splitted`.
Download the dataset from the official Google Drive folder and extract it into `./Dataset_TLDRHQ`, resulting in a folder tree like this:
project_folder
└───Dataset_TLDRHQ
├───dataset-m0
├───dataset-m1
├───dataset-m2
├───dataset-m2021
├───dataset-m3
├───dataset-m4
└───dataset-m6
Run the `0_cleaning.py` script, which performs data cleaning (removing duplicates), splits the dataset into training, validation and test sets (chunking the training set into several files so that it is easier to manage), and saves the splits as `.json` files in `./Dataset_splitted`. You get a directory tree like this:
project_folder
└───Dataset_splitted
├───test.json
├───train_1.json
├───train_10.json
├───train_2.json
├───train_3.json
├───train_4.json
├───train_5.json
├───train_6.json
├───train_7.json
├───train_8.json
├───train_9.json
└───val.json
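A minimal sketch of what this step might look like (the input path, column names and split ratios are assumptions, not the exact contents of `0_cleaning.py`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sketch: load the raw TLDRHQ records, drop duplicate posts,
# split into train/val/test, and save the training set in 10 manageable chunks.
df = pd.read_json("Dataset_TLDRHQ/dataset-m1/train.jsonl", lines=True)  # assumed path/format
df = df.drop_duplicates(subset=["document"])                            # assumed column name

train, test = train_test_split(df, test_size=0.1, random_state=42)
train, val = train_test_split(train, test_size=0.1, random_state=42)

chunk_size = len(train) // 10 + 1
for i in range(10):
    train.iloc[i * chunk_size:(i + 1) * chunk_size].to_json(f"Dataset_splitted/train_{i + 1}.json")
val.to_json("Dataset_splitted/val.json")
test.to_json("Dataset_splitted/test.json")
```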
Run the `0_normalizing.py` script, which performs sentence splitting, text normalization, tokenization, stop-word removal, lemmatization and POS tagging on the `document` field containing the Reddit posts, and then saves the result split into several `.json` files in `./ProcessedData`. You get a directory tree like this:
project_folder
└───ProcessedData
├───test.json
├───train_1.json
├───train_10.json
├───train_2.json
├───train_3.json
├───train_4.json
├───train_5.json
├───train_6.json
├───train_7.json
├───train_8.json
├───train_9.json
└───val.json
The text normalization operations performed include, in order:
- Sentence splitting
- HTML tag and entity removal
- Extra whitespace removal
- URL removal
- Emoji removal
- User age processing (e.g. "25m" becomes "25 male")
- Number processing
- Control character removal
- Case folding
- Repeated character processing (e.g. "reallllly" becomes "really")
- Fixing and expanding English contractions
- Special character and punctuation removal
- Tokenization (uni-grams)
- Stop-word and 1-character token removal
- Lemmatization and POS tagging
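A minimal sketch of a few of these steps (the exact order and implementations in `0_normalizing.py` may differ; the spaCy model name is an assumption):

```python
import re
import contractions
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed spaCy model

def normalize(text: str) -> list[tuple[str, str]]:
    """Illustrative normalization: returns (lemma, POS) pairs for content tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = contractions.fix(text)                 # expand English contractions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # reallllly -> really
    text = text.lower()                           # case folding
    doc = nlp(text)
    return [(tok.lemma_, tok.pos_)
            for tok in doc
            if not tok.is_stop and tok.is_alpha and len(tok.text) > 1]
```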
Run the notebook `1_preprocessing4summarization.ipynb` in order to:
- remove documents without a summary
- remove documents with a single sentence
- split the training dataset into smaller chunks

You get a directory tree like this:
project_folder
└───Processed Data For Summarization
├───test_0.json
├───test_1.json
├───test_2.json
├───train_1_0.json
├───train_1_1.json
├───train_1_2.json
├───train_2_0.json
├───train_2_1.json
├───train_2_2.json
├─── ...
├───train_8_0.json
├───train_8_1.json
├───train_8_2.json
├───train_9_0.json
├───val_0.json
├───val_1.json
└───val_2.json
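A minimal sketch of this filtering (the `summary` and `document` column names are assumptions; `document` is taken to be a list of sentences):

```python
import pandas as pd

# Hypothetical sketch of the filtering in 1_preprocessing4summarization.ipynb.
df = pd.read_json("ProcessedData/train_1.json")
df = df[df["summary"].fillna("").str.strip().astype(bool)]  # drop documents without a summary
df = df[df["document"].apply(len) > 1]                      # drop documents with a single sentence
```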
Run `1_featureMatrixGeneration.py` to obtain the feature matrices (sentences × features). You get a directory tree like this:
project_folder
└───Feature Matrices
├───test_0.csv
├───test_1.csv
├───test_2.csv
├───train_1_0.csv
├───train_1_1.csv
├───train_1_2.csv
├───train_2_0.csv
├───train_2_1.csv
├───train_2_2.csv
├─── ...
├───train_8_0.csv
├───train_8_1.csv
├───train_8_2.csv
├───train_9_0.csv
├───val_0.csv
├───val_1.csv
└───val_2.csv
Run the notebook `1_featureMatrixGeneration2.ipynb` to join the train, val and test datasets. You get a directory tree like this:
project_folder
└───Feature Matrices
├───test_0.csv
├───test_1.csv
├───test_2.csv
├───train_1_0.csv
├───train_1_1.csv
├───train_1_2.csv
├───train_2_0.csv
├───train_2_1.csv
├───train_2_2.csv
├─── ...
├───train_8_0.csv
├───train_8_1.csv
├───train_8_2.csv
├───train_9_0.csv
├───val_0.csv
├───val_1.csv
├───val_2.csv
├───test.csv
├───train.csv
└───val.csv
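A possible way to join the chunked CSVs into one file per split (a sketch, not necessarily the notebook's exact code):

```python
import glob
import pandas as pd

# Concatenate the chunked feature matrices into one CSV per split.
for split in ["train", "val", "test"]:
    parts = sorted(glob.glob(f"Feature Matrices/{split}_*.csv"))
    pd.concat((pd.read_csv(p) for p in parts), ignore_index=True) \
      .to_csv(f"Feature Matrices/{split}.csv", index=False)
```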
The features generated at this step are the following (a sketch of how some of them might be computed follows the list):
- sentence_relative_positions
- sentence_similarity_score_1_gram
- word_in_sentence_relative
- NOUN_tag_ratio
- VERB_tag_ratio
- ADJ_tag_ratio
- ADV_tag_ratio
- TF_ISF
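The sketch below is illustrative only and is not the exact code of `1_featureMatrixGeneration.py`; it assumes each document arrives as tokenized sentences plus a parallel list of POS tags:

```python
from collections import Counter
import math

def sentence_features(sentences, pos_tags):
    """sentences: list of token lists; pos_tags: parallel list of POS-tag lists."""
    n = len(sentences)
    sent_freq = Counter(tok for sent in sentences for tok in set(sent))  # sentence frequency per term
    longest = max(len(s) for s in sentences)
    rows = []
    for i, (toks, tags) in enumerate(zip(sentences, pos_tags)):
        tf = Counter(toks)
        # TF-ISF: term frequency weighted by inverse sentence frequency, normalized by sentence length
        tf_isf = sum(tf[t] * math.log(n / sent_freq[t]) for t in tf) / max(len(toks), 1)
        rows.append({
            "sentence_relative_positions": i / n,
            "word_in_sentence_relative": len(toks) / longest,
            "NOUN_tag_ratio": tags.count("NOUN") / max(len(tags), 1),
            "VERB_tag_ratio": tags.count("VERB") / max(len(tags), 1),
            "ADJ_tag_ratio": tags.count("ADJ") / max(len(tags), 1),
            "ADV_tag_ratio": tags.count("ADV") / max(len(tags), 1),
            "TF_ISF": tf_isf,
        })
    return rows
```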
Run the notebook `1_featureMatrixUndersampling.ipynb` in order to perform CUR undersampling on both the training and validation datasets. You get a directory tree like this:
project_folder
└───Undersampled Data
├───trainAndValMinorityClass.csv
└───trainAndValMajorityClassUndersampled.csv
The majority and minority classes are saved separately because CUR undersampling is applied only to the majority class.
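The notebook's exact CUR implementation is not shown here; the following is only a rough sketch of one common interpretation, selecting majority-class rows by SVD leverage scores (all names and parameters are assumptions):

```python
import numpy as np
import pandas as pd

def cur_row_undersample(X: pd.DataFrame, n_keep: int, k: int = 5, seed: int = 42) -> pd.DataFrame:
    """Keep n_keep majority-class rows, sampled according to their SVD leverage scores."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(X.to_numpy(dtype=float), full_matrices=False)
    leverage = (U[:, :k] ** 2).sum(axis=1)        # row leverage scores from the top-k left singular vectors
    probs = leverage / leverage.sum()
    idx = rng.choice(len(X), size=n_keep, replace=False, p=probs)
    return X.iloc[idx]
```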
Run the notebook `1_featureMatrixAnalysis.ipynb` to perform ENN (Edited Nearest Neighbours) undersampling. You get a directory tree like this:
project_folder
└───Undersampled Data
├───trainAndValUndersampledENN3.csv
├───trainAndValMinorityClass.csv
└───trainAndValMajorityClassUndersampled.csv
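ENN is available in imbalanced-learn; a minimal sketch using the files produced above (the `label` column name and `n_neighbors=3`, matching the `ENN3` suffix, are assumptions):

```python
import pandas as pd
from imblearn.under_sampling import EditedNearestNeighbours

majority = pd.read_csv("Undersampled Data/trainAndValMajorityClassUndersampled.csv")
minority = pd.read_csv("Undersampled Data/trainAndValMinorityClass.csv")
data = pd.concat([majority, minority], ignore_index=True)

X, y = data.drop(columns=["label"]), data["label"]   # "label" column name is assumed
enn = EditedNearestNeighbours(n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)
X_res.assign(label=y_res.to_numpy()).to_csv("Undersampled Data/trainAndValUndersampledENN3.csv", index=False)
```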
Run the notebook `1_featureMatrixAnalysis.ipynb` to perform a RandomizedSearchCV over the following models:
- RandomForestClassifier
- LogisticRegression
- HistGradientBoostingClassifier
with a few possible parameter configurations (a sketch of this search is given after the metrics list below).
Then, evaluate the resulting best model on the test set with respect to:
- ROC curve
- Recall
- Precision
- Accuracy
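A minimal sketch of the model selection step (the parameter grid, scoring metric and variable names are illustrative, not the notebook's exact configuration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Example for one of the three candidate models; LogisticRegression and
# HistGradientBoostingClassifier would be searched analogously.
param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20],
    "class_weight": [None, "balanced"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,
    scoring="recall",   # assumed scoring metric
    cv=3,
    n_jobs=-1,
)
search.fit(X_res, y_res)   # features/labels from the undersampling step (placeholder names)
best_model = search.best_estimator_
```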
Run notebook 1_featureMatrixAnalysis.ipynb
to perform MMR and obtain an extractive summary for each document in the test set.
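Maximal Marginal Relevance (MMR) re-ranks candidate sentences by balancing relevance against redundancy with respect to the sentences already selected; a small sketch of the idea (the lambda value and scoring function are assumptions):

```python
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentence_vectors, relevance_scores, k=3, lam=0.7):
    """Greedy MMR: pick k sentences maximizing lam*relevance - (1-lam)*max similarity to picks."""
    sim = cosine_similarity(sentence_vectors)
    selected, candidates = [], list(range(len(relevance_scores)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max(sim[i, j] for j in selected) if selected else 0.0
            return lam * relevance_scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Here `sentence_vectors` could be, for example, TF-IDF vectors of the sentences, and `relevance_scores` the classifier's predicted probability that each sentence belongs to the summary.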
Run the notebook `1_featureMatrixAnalysis.ipynb` to measure summary quality by means of the following metrics (a small sketch using rouge-score follows the list):
- Rouge1
- Rouge2
- RougeL
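A minimal sketch of the ROUGE computation with the rouge-score package:

```python
from rouge_score import rouge_scorer

# reference: the post's human-written TL;DR; prediction: the generated extractive summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)
```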
Run the notebook `2_preprocessing4topic_modeling.ipynb` to process the data and keep only the fields needed for topic modeling. The output is saved here:
project_folder
└───processed_dataset
└───test.json
Run the notebook `2_topic_modeling.ipynb`, which performs LDA (with a grid search over the hyper-parameters) and LSA. The notebook saves 9 CSV files, 3 for LSA and 6 for LDA (for the UMass and CV coherence measures), each set containing the document-topic matrix, the topic-term matrix and a table with topic insights.
project_folder
└───Results_topic_modeling
├───lda_doc_topic.csv
├───lda_doc_topic_CV.csv
├───lda_top_terms.csv
├───lda_top_terms_CV.csv
├───lda_topic_term.csv
├───lda_topic_term_cv.csv
├───lsa_doc_topic.csv
├───lsa_top_terms.csv
└───lsa_topic_term.csv
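A minimal sketch of the gensim part of this step (20 topics matches the result reported in the abstract, but the other hyper-parameters are illustrative and `texts` is a placeholder for the tokenized, lemmatized documents from the preprocessing notebook):

```python
from gensim import corpora
from gensim.models import LdaModel, LsiModel
from gensim.models.coherencemodel import CoherenceModel

# texts: list of tokenized, lemmatized documents (placeholder from the preprocessing step).
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lsa = LsiModel(corpus, id2word=dictionary, num_topics=20)
lda = LdaModel(corpus, id2word=dictionary, num_topics=20, passes=5, random_state=42)

# Compare models with the CV coherence measure (UMass is analogous with coherence="u_mass").
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("LDA c_v coherence:", cv.get_coherence())
```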
The notebook also saves images showing the number of words per document and a word cloud in:
project_folder
└───Images
It also saves the hyper-parameter grid-search results for the UMass and CV coherence measures in:
project_folder
└───Hyperparameters
├───tuning.csv
└───tuning_CV.csv
Giorgio Carbone - feel free to contact me!
This project is licensed under the terms of the MIT license. You can check out the full license here.
- Fork it (https://github.com/giocoal/reddit-tldr-summarizer-and-topic-modeling.git)
- Create your feature branch (`git checkout -b feature/fooBar`)
- Commit your changes (`git commit -am 'Add some fooBar'`)
- Push to the branch (`git push origin feature/fooBar`)
- Create a new Pull Request