Manual Team Airplane Mode DBL 2023/2024
Contents
Overview
Prerequisites: make sure you have the following libraries installed before running the code
Step 1: Set up development environment
Step 2: Unzip the data file
Step 3: Install MongoDB Community Server and MongoDB Shell
Step 4: Download the file
Step 5: Run Files
Data cleaning
Business Idea
  Step 1: Check Names
  Step 2: Configure Dashboard
  Step 3: Run File
Sentiment analysis
  Step 1: Check Names
  Step 2: Configure Dashboard
  Step 3: Run File
Conversation Mining
Sentiment Evolution
  Step 1: Check Names
  Step 2: Configure Dashboard
  Step 3: Run File
Appendices
  If BERTopic has issues
  Creating random sample of Tweets
Overview
This manual provides a comprehensive guide to setting up and running a script for mining conversation data from tweets. The script automates several tasks, including filtering replies, fetching starting tweets, identifying user conversation starters, constructing conversation trees, and storing the processed data in MongoDB collections for further analysis. Additionally, it offers various functions designed to process and clean tweet data stored in files, enabling efficient reading, removal of unnecessary information, and consolidation into a unified format. The manual also includes functionality to load cleaned tweet data into MongoDB, ensuring removal of duplicates and inconsistent data. Beyond data mining and cleaning, the business idea leverages this processed data to develop topic models using BERTopic, facilitating the extraction of meaningful conversation topics. Sentiment analysis is integrated to assess the emotional tone of tweets, generating sentiment scores (positive, neutral, negative) using VADER sentiment analysis tools. Furthermore, the manual guides users through analyzing sentiment evolution over time, plotting sentiment changes within conversations to identify shifts in user sentiment and engagement dynamics.
Prerequisites: make sure you have the following libraries installed before running the code
• 'pymongo' (install using 'pip install pymongo' in your terminal)
• 'treelib' (install using 'pip install treelib' in your terminal)
• 'tqdm' (install using 'pip install tqdm' in your terminal)
• 'matplotlib' (install using 'pip install matplotlib' in your terminal)
• 'nltk' (install using 'pip install nltk' in your terminal)
• 'click' (install using 'pip install click' in your terminal)
• 'bertopic' (install using 'pip install bertopic' in your terminal)
• 'sklearn' (install using 'pip install scikit-learn' in your terminal)
• 'Utility_functions' (not available on PyPI; it is obtained from your project's repository or source)
• 'rein' (not available on PyPI; it is obtained from your project's repository or source)
• 'bson' (installed automatically together with 'pymongo'; do not run 'pip install bson' separately, as the standalone PyPI package is incompatible with PyMongo's bundled bson module)
• 'gc' (part of the Python standard library, does not require installation)
• 'json' (part of the Python standard library, does not require installation)
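For convenience, all the PyPI packages above can be installed with a single command ('Utility_functions' and 'rein' still come from your project's repository):

    pip install pymongo treelib tqdm matplotlib nltk click bertopic scikit-learn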
Step 1: Set up development environment
Make sure to download and install VSCode and the correct version of Python.
• Visual Studio Code
  - Download and install VSCode from the official website.
  - Follow the installation instructions provided on the website.
• Python 3.11.9
  - Download and install Python version 3.11.9 from the Python Downloads page.
  - Follow the installation instructions provided on the website.
Step 2: Unzip the data file
Before running the code, you need to unzip a file named data.zip. This file should contain 567 JSON files, each containing tweets.
• Place the data.zip file in your working directory.
• Unzip the data.zip file into a folder called 'data' containing the JSON files.
  - On Windows: right-click the data.zip file, select 'Extract All...', then follow the prompts.
  - On Mac: double-click the data.zip file to unzip it.
  - On Linux: use the command 'unzip data.zip' in the terminal.
Remark: The data folder containing all the data files of the different tweets should not be stored in a nested structure.
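Alternatively, you can unzip the archive from Python with the standard-library zipfile module; a minimal sketch, assuming data.zip sits in your working directory:

    import zipfile

    # Extract data.zip into a folder called 'data'.
    with zipfile.ZipFile("data.zip") as archive:
        archive.extractall("data")
    # Check afterwards that the 567 .json files sit directly inside 'data'
    # rather than in a nested subfolder, since the later scripts expect a flat layout.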
Step 3: Install MongoDB Community Server and MongoDB Shell
Make sure to install MongoDB and the MongoDB Shell in the correct versions.
• Download and install MongoDB:
  - Visit the MongoDB download center and download MongoDB Community Server (MongoDB Community Server Download Page).
  - Choose your operating system and download version 7.0.11.
  - Follow the installation instructions provided on the website.
• Download and install MongoDB Shell:
  - Visit the MongoDB Compass download page and download MongoDB Compass (GUI) (MongoDB Compass Download Page).
  - Choose your operating system and download the installer for version 2.2.9.
  - Follow the installation instructions to complete the setup.
• Create a new database in your MongoDB:
  - Open MongoDB and create a new database.
  - Name the database "DBL" and the collection "cleaned_data". Use exactly these names; otherwise the later scripts will not be able to find the data.
Step 4: Download the file
To run the business idea, you need the JSON file 'topic_share' before you can run the file 'Topic_share_dashboard.py'. The file should be placed in the same directory as the other folders ('conversation_mining', 'Business_idea', 'data_cleaning', 'Sentiment', etc.). You can download the file using this link: topic_share.json
Step 5: Run Files
Find the 'Run and Debug' tab in the left side bar (left image) and click the green play button (right image) in the top left to execute the code.
[Images: 'Run and Debug' tab icon; green play button]

Data cleaning
Task description
We have thoroughly cleaned the tweet data by removing duplicates, filtering out inconsistencies, and ensuring all tweets are in English. This process involved reading tweets from files, removing unnecessary information, and consolidating the cleaned data into a single file for analysis.
Go to VSCode, then follow the path: 'DBL-DATA-CHALLENGE' -> 'data_cleaning' -> 'Cleaning_dashboard.py' -> 'Cleaning_dashboard_step_2' and run the files:
• Cleaning_dashboard.py (with the correct path name)
• Cleaning_dashboard_step_2.py (with the correct path name)
After running the first dashboard, you will get a new JSON file called 'cleaned_data'. Afterwards, create a new database in your MongoDB called 'DBL', create a new collection called 'cleaned_data', and import the cleaned_data.json file into the new collection. You can do this by clicking "Import Data" after navigating to your collection and then selecting the cleaned_data.json file. Importing should take about 20 minutes. After it has finished, run the 'Cleaning_dashboard_step_2' file.
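If the Compass import is slow or inconvenient, the file can also be loaded from Python. A minimal sketch, assuming cleaned_data.json contains one JSON document per line (adjust the parsing if your file is a single JSON array):

    import json
    from pymongo import MongoClient

    collection = MongoClient("mongodb://localhost:27017/")["DBL"]["cleaned_data"]

    # Insert the tweets in batches to keep memory use low.
    batch = []
    with open("cleaned_data.json", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                batch.append(json.loads(line))
            if len(batch) == 10000:
                collection.insert_many(batch)
                batch = []
    if batch:
        collection.insert_many(batch)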
If the dashboard does not work, run the following files manually in this order:
1. data_cleans.py (in folder 'data_cleaning'). After running data_cleans.py, you will get a new .json file called 'cleaned_data'. Afterwards, create a new database in your MongoDB, create a new collection called 'cleaned_data', and import the cleaned_data.json file into the new collection by clicking "Import Data" after navigating to your collection and selecting the file. Importing should take about 20 minutes.
2. remove_duplicates.py (in folder 'data_cleaning')
3. remove_inconsistencies.py (in folder 'data_cleaning')
Functions:
1. 'make_tweet_list': reads the data from the file located at the specified path and returns a list of tweets.
2. 'file_paths_list': generates a list containing the paths to all files in the specified data folder.
3. 'remove_variables': cleans a given tweet by removing unnecessary variables.
4. 'check_language': checks whether a tweet is in English.
5. 'check_delete': checks whether the tweet is deleted or not.
6. 'clean_all_files': cleans all data files in the specified folder and consolidates the cleaned tweets into a single file.
7. 'make_collection': creates a MongoDB collection and loads the cleaned tweet data into it. Export the 'cleaned_data' file manually into your database called 'DBL'.
8. 'remove_duplicates': removes duplicate tweets from the MongoDB collection and consolidates unique tweets into a new collection called 'removed_duplicates' in your database (see the sketch after this list).
9. 'remove_inconsistencies': filters out inconsistent data from the MongoDB collection and stores consistent data in a new collection called 'no_inconsistency' in your database.
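As an illustration of the deduplication step (item 8 above), a minimal pymongo sketch; the 'id' field name and the batching are assumptions about the project's schema, not its exact implementation:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017/")["DBL"]

    # Copy each tweet into 'removed_duplicates' once, keyed on the tweet id.
    seen = set()
    unique = []
    for doc in db["cleaned_data"].find():
        tweet_id = doc.get("id")  # assumed unique tweet identifier
        if tweet_id not in seen:
            seen.add(tweet_id)
            doc.pop("_id", None)  # let MongoDB assign fresh _id values
            unique.append(doc)
        if len(unique) >= 10000:
            db["removed_duplicates"].insert_many(unique)
            unique = []
    if unique:
        db["removed_duplicates"].insert_many(unique)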
After you have finished running these files, you should have collections called 'cleaned_data', 'removed_duplicates' and 'no_inconsistency' in your database. This completes the data cleaning process.
Business Idea
Task description
We are going to make a topic model which produces topics based on the given data. This data is cleaned (no duplicates and no inconsistencies).
Step 1: Check Names
Make sure your MongoDB database is named 'DBL' and you have a collection named 'cleaned_data'. The following code will return errors if the names don't match.
Step 2: Configure Dashboard
Go to VSCode, then follow the path: 'DBL-DATA-CHALLENGE' -> 'Business_idea' -> 'Topic_share_dashboard.py'.
Change 'DBL' to whatever you named your MongoDB database if you didn't follow the recommended naming convention in Step 3. Change 'topics' to whatever you named the topic-analysis collection produced by the business-idea stage if you didn't follow the recommended naming convention.
Step 3: Run File
Find the 'Run and Debug' tab in the left side bar (left image) and click the green play button (right image) in the top left to execute the code.
Remark: This might take a very long time. If it takes more than a day to run, you can skip this part and go straight to sentiment analysis.
[Images: 'Run and Debug' tab icon; green play button]
Functions:
1. 'make_topic_file': creates the collection 'topic_analysis'.
2. 'get_topic': the topic model.
3. 'add_topics': topic_share.json (from 'topic_share').
4. 'tweets_without_topic': topic assignment.
A minimal illustration of the topic-modelling idea follows below.
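Since the function descriptions above are abbreviated, here is a minimal BERTopic sketch of the underlying idea: fit a topic model on tweet texts and obtain one topic per tweet. The 'text' field name and the collection used are assumptions; the project's own pipeline may differ.

    from bertopic import BERTopic
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    # Pull the tweet texts; 'text' is an assumption about the schema.
    docs = [d["text"] for d in client["DBL"]["no_inconsistency"].find().limit(50000)]

    # Fit BERTopic; it returns one topic id per tweet (-1 marks outliers).
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info().head())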
By running this code, you will get a new collection in your MongoDB called ‘topics’. This collection contains the same fields as the ‘no_inconsistency’ collection, with an added field called ‘topic’ that contains the topic of the tweet. This approach not only facilitates understanding the prevalent topics within your data but also aids in deriving meaningful business insights from social media interactions. If the ‘Topic_share_dashboard.py’ did not run properly, you should continue with the collection called ‘no_inconsistency’.
Sentiment analysis
Task description
The code looks at the content of the tweet, i.e. the text that was sent, and calculates an overall sentiment score for the tweet based on each of the words it contains. Ultimately, a normalized compound score is generated for every tweet in the dataset.
After having cleaned the data in the previous section, we can now let VADER compute, for each tweet individually, a distribution of positive, neutral and negative scores together with the compound score.
After following these steps, a dictionary in the following format will be outputted: {'neg': w, 'neu': x, 'pos': y, 'compound': z}, where w, x, y, z are floating point numbers.
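As a quick sanity check of this output format, you can run VADER on a single tweet; a minimal sketch using NLTK's bundled VADER implementation (the example sentence is arbitrary):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

    sia = SentimentIntensityAnalyzer()
    # polarity_scores returns {'neg': w, 'neu': x, 'pos': y, 'compound': z}
    print(sia.polarity_scores("The flight was delayed, but the crew was great!"))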
Step 1: Check Names
Make sure your MongoDB database is named 'DBL' and you have a collection named 'topics' (if Topic_share_dashboard.py worked). Otherwise you can use the 'no_inconsistency' collection for the rest of the manual.
Step 2: Configure Dashboard
Go to VSCode, then follow the path: 'DBL-DATA-CHALLENGE' -> 'Sentiment' -> 'Sentiment_analysis_dashboard.py'. In lines 7 and 8, make the following edits if necessary.
Line 7: db = client['DBL'] ## Use the DBL database
Line 8: collection = db['topics'] ## Choose a collection of tweets (if Topic_share_dashboard.py worked), otherwise:
Line 8: collection = db['no_inconsistency'] ## Choose a collection of tweets
Change 'DBL' to whatever you named your MongoDB database if you didn't follow the recommended naming convention in Step 3. Change 'topics' to 'no_inconsistency' if Topic_share_dashboard.py did not work.
Remark: If you don't have the 'topics' collection because running 'Topic_share_dashboard.py' took too long, change line 8 to: collection = db['no_inconsistency']. This way you still obtain the sentiment scores of the tweets.
Step 3: Run File
Find the 'Run and Debug' tab in the left side bar (left image) and click the green play button (right image) in the top left to execute the code.
Functions:
1. 'update_VADER': updates the VADER values.
2. 'analyze_sentiment'
3. 'get_full_text': checks whether the tweet text has been shortened and, if so, attempts to obtain the full text and ignore the shortened one.
4. 'add_entire_document'
5. 'add_sentiment_variables': creates a new collection, fills it with the documents from the old collection, and adds the sentiment variables to all documents in the new collection (see the sketch after this list).
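As a rough illustration of 'add_sentiment_variables', a hedged pymongo/VADER sketch; the 'text' field name is an assumption, and the added field names mirror the field list mentioned below:

    from pymongo import MongoClient
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    db = MongoClient("mongodb://localhost:27017/")["DBL"]
    sia = SentimentIntensityAnalyzer()  # requires the vader_lexicon download shown earlier

    # Copy each document into 'sentiment_included', adding the VADER scores.
    for doc in db["topics"].find():
        scores = sia.polarity_scores(doc.get("text", ""))
        doc.pop("_id", None)  # let MongoDB assign a fresh _id
        doc["compount_sentiment"] = scores["compound"]
        doc["negativity"] = scores["neg"]
        doc["neutrality"] = scores["neu"]
        doc["positivity"] = scores["pos"]
        db["sentiment_included"].insert_one(doc)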
After running the code, you should have a collection called 'sentiment_included' in your MongoDB database. This collection contains all the fields of the 'no_inconsistency' collection with some added fields ('compount_sentiment', 'negativity', 'neutrality', 'positivity', 'truncated_error').
[Images: 'Run and Debug' tab icon; green play button]

Conversation Mining
Task description
This part provides a comprehensive guide to setting up and running a script for mining conversation data from tweets. The script automates several tasks, including filtering replies, fetching starting tweets, identifying user conversation starters, constructing conversation trees, and storing the processed data in MongoDB collections for further analysis.
Go to VSCode, then follow the path: 'DBL-DATA-CHALLENGE' -> 'conversation_mining' -> 'dashboard convo.py' and run the file:
• dashboard convo.py (in folder 'conversation_mining')
Remark: An error message will pop up once. This is because the output of 'user_trees.py' is too big to store everything at once in a collection. You can skip the error with the steps below:
Step 1: Run the dashboard with Run and Debug.
Step 2: Comment out the lines with files that have already run by highlighting them and pressing 'Ctrl + /'.
Step 3: Press the continue button on the horizontal Run and Debug toolbar.
Because of the size, the dashboard also does not run the files after user_trees.py. After this file has finished running, comment out the lines that have already finished running. The pictures below illustrate this.
Before:
After:
If the dashboard does not work, run the following files manually in this order (all in folder 'conversation_mining'):
1. filter_replies.py
2. collect_starting_conversation.py
3. user_starting_convo.py
4. airline_starting_convo.py
5. user_trees.py
6. airline_trees.py
7. timeframe_user_trees.py
8. timeframe_airline_trees.py
9. tweet_order_user.py
10. tweet_order_airline.py
Functions:
1. 'filter_replies': extracts replies from the MongoDB collection of cleaned tweets and stores them in a new collection called 'replies'.
2. 'collect_starting_conversations': extracts tweets that are not replies and stores them in a new collection called 'starting_tweets'.
3. 'user_convo_starters': identifies tweets from users (excluding airline accounts) that start conversations and stores them in a new collection called 'user_convo_starters'.
4. 'airline_convo_starters': identifies tweets from airline accounts that start conversations and stores them in a new collection called 'airline_convo_starters'.
5. 'user_trees': constructs conversation trees from the starting tweets of user conversations and stores them in a new collection called 'user_trees' in your database (see the sketch after this list).
6. 'airline_trees': constructs conversation trees from the starting tweets of airline conversations and stores them in a new collection called 'airline_trees' in your database.
7. 'timeframe_user_trees': filters conversation trees from user tweets to include only tweets within a 24-hour timeframe from their parent tweet and stores them in a new collection called 'timevertical_trees_user' in your database.
8. 'timeframe_airline_trees': filters conversation trees to include only tweets within a 24-hour timeframe from their parent tweet and stores them in a new collection called 'timevertical_trees_airline' in your database.
9. 'tweet_order_user': validates conversation order in the conversation trees and stores valid user trees in a new collection called 'valid_trees_user' in your database.
10. 'tweet_order_airline': validates conversation order in the conversation trees and stores valid airline trees in a new collection called 'valid_trees_airline' in your database.
11. 'merge_valid_trees': consolidates documents from two separate collections ('valid_trees_user' and 'valid_trees_airline') into a single collection.
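To make the tree-building step concrete, here is a minimal sketch using treelib and pymongo. The field names 'id' and 'in_reply_to_status_id' follow Twitter's tweet schema, but the exact fields and collection layout used by user_trees.py are assumptions here.

    from pymongo import MongoClient
    from treelib import Tree

    db = MongoClient("mongodb://localhost:27017/")["DBL"]

    def build_conversation_tree(root_tweet, replies_by_parent):
        """Grow a treelib Tree from a starting tweet by following reply links."""
        tree = Tree()
        tree.create_node(tag=root_tweet["id"], identifier=root_tweet["id"])
        frontier = [root_tweet["id"]]
        while frontier:
            parent_id = frontier.pop()
            for reply in replies_by_parent.get(parent_id, []):
                tree.create_node(tag=reply["id"], identifier=reply["id"],
                                 parent=parent_id)
                frontier.append(reply["id"])
        return tree

    # Index the replies by the tweet they answer, then build one tree per starter.
    replies_by_parent = {}
    for reply in db["replies"].find():
        replies_by_parent.setdefault(reply["in_reply_to_status_id"], []).append(reply)

    trees = [build_conversation_tree(starter, replies_by_parent)
             for starter in db["user_convo_starters"].find()]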
After you have finished running all the code, you should have many collections in your database. The three most important ones are 'valid_trees_airline', 'valid_trees_user' and 'valid_trees_merged'; these hold the final conversations that match our definition. This concludes the conversation mining part.
Sentiment Evolution
Task description
We can now collate all previous stages to finally understand the change in sentiment between tweets. The cleaned data is put through VADER in the sentiment analysis, and that function is later called in the sentiment evolution after the conversations have been structured. With the order clear, sentiment scores are generated for the tweets, so it can be determined whether sentiment evolves between preceding and succeeding tweets. Following the steps below completes the final stage of this DBL.
Step 1: Check Names
Make sure your MongoDB database is named 'DBL' and you have a collection named 'valid_trees_merged'. The following code will return errors if the names don't match.
Step 2: Configure Dashboard
Go to VSCode, then follow the path: 'DBL-DATA-CHALLENGE' -> 'Sentiment' -> 'Sentiment_evolution_dashboard.py'. In lines 6 and 7, make the following edits if necessary.
Line 6: db = client['DBL'] ## Use the DBL database
Line 7: collection = db['valid_trees_merged'] ## Choose a collection of tweets
Change 'DBL' to whatever you named your MongoDB database if you didn't follow the recommended naming convention. Change 'valid_trees_merged' to whatever you named your conversation mining collection if you didn't follow the recommended naming convention.
Step 3: Run File
Find the 'Run and Debug' tab in the left side bar (left image) and click the green play button (right image) in the top left to execute the code.
[Images: 'Run and Debug' tab icon; green play button]
Functions:
1. 'Get_reply_by_index'
2. 'Get_convo'
3. 'extract_compounds_from_convo_vars': returns the compound scores from all tweets in the conversation of a given tree.
4. 'Get_evolutions': returns both all of the evolutions and non-evolutions (a sketch of the evolution idea follows this list).
5. 'count_evolution_types': returns the counts as a dictionary.
6. 'is_airline_userID'
7. 'get_tree_docs': returns the tree documents, filtered by topic if specified; if no topic is specified the list will contain all documents.
8. 'get_evolution_stats': returns the evolution statistics as a dictionary.
9. 'plot_evos'
10. 'plot_evo_non_evo'
11. 'plot_inc_dec'
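Since the descriptions above are abbreviated, here is a hedged sketch of the core evolution idea: given the chronological compound scores of one conversation, classify whether sentiment increased, decreased, or stayed flat. The helper name and the 0.05 threshold (VADER's usual cut-off for a clearly positive compound score) are assumptions, not the project's exact implementation.

    def classify_evolution(compounds):
        """Classify sentiment evolution from the chronological list of
        VADER compound scores in one conversation."""
        if len(compounds) < 2:
            return "no_evolution"
        change = compounds[-1] - compounds[0]
        if change > 0.05:
            return "increase"
        if change < -0.05:
            return "decrease"
        return "no_evolution"

    # Example: a conversation that starts angry and ends satisfied.
    print(classify_evolution([-0.72, -0.30, 0.15, 0.64]))  # -> 'increase'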
Appendices
If BERTopic has issues
1. Press 'Windows + R' to open the Run dialog.
2. Type 'regedit' and press Enter to open the Registry Editor.
3. Navigate to the following path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
4. Find the entry named 'LongPathsEnabled'.
5. Double-click on 'LongPathsEnabled' and set its value to '1'.
6. Click OK and close the Registry Editor.
To undo this:
1. Press 'Windows + R' to open the Run dialog.
2. Type 'regedit' and press Enter to open the Registry Editor.
3. Navigate to the following path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
4. Find the entry named 'LongPathsEnabled'.
5. Double-click on 'LongPathsEnabled' and set its value to '0'.
6. Click OK and close the Registry Editor.
Creating random sample of Tweets
Step 1: Open the file "Random_sample.py" by following the path: "DBL-Data-Challenge" -> "Sentiment" -> "Random_sample.py"
Step 2: Scroll down to the end of the code and find the following line:
sample = select_random_lines(500, "PATH TO CLEANED DATA FILE", count_lines("PATH TO CLEANED DATA FILE"))
• Change the integer '500' to another integer if you want a different number of samples.
• Replace the placeholder string with the path to your cleaned data file. For example, if the file is located at C:\Users\USERNAME\OneDrive - TU Eindhoven\Documents\JBG030 - DBL Data Challenge\DBL-Data-Challenge\data, replace the string with that path.
Step 3: Run the Random_sample.py file by clicking the play button in the top right. It should output a random sample of tweets that is easy for a human interpreter to read, in this format: Tweet X: {tweet text ...}
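For reference, the two helpers could look roughly like this; a hedged sketch that assumes the cleaned data file holds one JSON tweet per line (the shipped Random_sample.py may differ):

    import json
    import random

    def count_lines(path):
        """Count the number of lines (tweets) in the cleaned data file."""
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    def select_random_lines(n, path, total):
        """Return n randomly chosen tweets, assuming one JSON tweet per line."""
        chosen = set(random.sample(range(total), n))
        sample = []
        with open(path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i in chosen:
                    sample.append(json.loads(line))
        return sample

    sample = select_random_lines(500, "PATH TO CLEANED DATA FILE",
                                 count_lines("PATH TO CLEANED DATA FILE"))
    # Print in the manual's output format; 'text' is an assumed field name.
    for i, tweet in enumerate(sample, start=1):
        print(f"Tweet {i}: {tweet.get('text', '')}")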