This project was an assignment for DSC540 (Machine Learning for Data Science) at GCU, focused on building an efficient classification model with the Naïve Bayes algorithm. The task was to perform sentiment analysis on a dataset of tweets about ChatGPT collected over a one-month period and classify each tweet as positive, neutral, or negative.
Check out the full report here.
To perform sentiment analysis using a Naïve Bayes algorithm, complete the following:
- Access the resources related to sentiment analysis, located in the topic Resources. Note: There are about 50 datasets that are suitable for use in a sentiment analysis task. For this part of the exercise, you must choose one of these datasets, provided it includes at least 10,000 instances.
- Ensure that the datasets are suitable for classification using this method.
- You may search for data in other repositories, such as Data.gov, Kaggle, or scikit-learn.
For your selected dataset, build a classification model as follows:
- Explain the dataset and the type of information you wish to gain by applying a classification method.
- Explain the Naïve Bayes algorithm and how you will be using it in your analysis (list the steps, the intuition behind the mathematical representation, and address its assumptions).
- Import the necessary libraries, then read the dataset into a data frame and perform initial statistical exploration.
- Clean the data and address unusual phenomena (e.g., normalization, feature scaling, outliers); use illustrative diagrams and plots and explain them.
- Formulate two questions that can be answered by applying a classification method using Naïve Bayes.
- Choose one of the Naïve Bayes variants (Gaussian Naïve Bayes, Multinomial Naïve Bayes, or Bernoulli Naïve Bayes) and explain your reasoning.
- Split the data into dependent and independent variables (or features and labels).
- Vectorize the text into numbers.
- Train the Naïve Bayes classifier on the training set.
- Make classification predictions.
- Interpret the results in the context of the questions you asked.
- Validate your model using a confusion matrix, accuracy score, ROC-AUC curves, and k-fold cross validation. Then, explain the results.
- Include all mathematical formulas used and graphs representing the final outcomes.
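
The core steps in the list above (vectorize, train, predict, validate) can be sketched in plain Python. This is a minimal illustration, not the code from the report: it assumes a Multinomial Naïve Bayes model with Laplace smoothing (`alpha=1.0`), a whitespace tokenizer, and a handful of made-up example tweets. The function names (`tokenize`, `train_multinomial_nb`, `predict`, `confusion_matrix`, `accuracy`) are hypothetical helpers written for this sketch. Each class is scored as log P(class) plus the sum of log P(word | class) over the words in the tweet, and the highest score wins.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple vectorizer)."""
    return text.lower().split()


def train_multinomial_nb(texts, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(labels)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: tokens seen in class c plus alpha per vocab word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, text):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


def confusion_matrix(y_true, y_pred, classes):
    """Nested dict: matrix[true_label][predicted_label] -> count."""
    matrix = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example tweets (assumption: not taken from the actual dataset).
train_texts = [
    "chatgpt is amazing and helpful",
    "i love using chatgpt",
    "chatgpt is terrible and wrong",
    "i hate this useless bot",
    "chatgpt released an update today",
    "the model was announced today",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

test_texts = ["i love chatgpt", "this bot is useless", "update announced today"]
test_labels = ["positive", "negative", "neutral"]

model = train_multinomial_nb(train_texts, train_labels)
predictions = [predict(model, t) for t in test_texts]
classes = ["positive", "neutral", "negative"]

print(predictions)
print(confusion_matrix(test_labels, predictions, classes))
print(accuracy(test_labels, predictions))
```

The sketch leaves out k-fold cross-validation and ROC-AUC, and a real analysis on 10,000 or more tweets would more likely use a library implementation; scikit-learn's `CountVectorizer` and `MultinomialNB` cover the same vectorize and train steps.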
Prepare a comprehensive technical report as a markdown document or Jupyter notebook, including all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) problem statement, b) algorithm of the solution, c) analysis of the findings, and d) references.
While APA style is not required for the body of this assignment, solid academic writing is expected, and documentation of sources should be presented using APA formatting guidelines, which can be found in the APA Style Guide, located in the Student Success Center.
This assignment uses a rubric. Review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.
You are not required to submit this assignment to LopesWrite.