This repository contains work done as part of a hackathon in the fall of 2020, cosponsored by Save the Children and the UVA School of Data Science. It is a small project focused on mining text from news sources, building a pipeline that produces usable text data, and performing introductory topic modelling on the result. This README explains the file structure of the repository.
In the project, we have three notebooks: NYT, Pipeline, and Analysis. Their contents are as follows:
In the NYT.ipynb notebook, we use the New York Times API to search its articles and build a corpus of relevant documents.
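As a rough illustration of this kind of search, here is a minimal sketch against the NYT Article Search endpoint. It is not the notebook's code: the function names (`build_search_url`, `extract_docs`, `search_articles`) are hypothetical, the query is whatever search term you choose, and you need your own API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public endpoint of the NYT Article Search API (v2).
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build the request URL for one page of search results."""
    params = {"q": query, "page": page, "api-key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_docs(payload):
    """Pull headline, URL, and abstract out of a decoded JSON response."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": doc.get("headline", {}).get("main", ""),
            "url": doc.get("web_url", ""),
            "abstract": doc.get("abstract", ""),
        }
        for doc in docs
    ]


def search_articles(query, api_key, page=0):
    """Fetch one page of results (requires network access and a valid key)."""
    with urlopen(build_search_url(query, api_key, page)) as response:
        return extract_docs(json.load(response))
```

Calling `search_articles("child welfare", "YOUR_KEY")` would return a list of small dictionaries, one per article on the first results page. Building a larger corpus would mean looping over `page`, subject to the API's rate limits.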
In the Pipeline.ipynb notebook, we query a free online news API to build another corpus, then extract and clean the text of the relevant news stories.
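To give a sense of what such cleaning can look like, the sketch below strips markup, lowercases the text, and drops a few stopwords. The specific steps, the `clean_text` name, and the short stopword list are assumptions for illustration; the notebook's actual cleaning operations may differ.

```python
import html
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z]+")


def clean_text(raw):
    """Return lowercase word tokens with markup and stopwords removed."""
    text = TAG_RE.sub(" ", raw)   # drop HTML tags
    text = html.unescape(text)    # decode entities such as &amp;
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


print(clean_text("<p>The children &amp; the schools of Yemen</p>"))
# prints ['children', 'schools', 'yemen']
```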
In the Analysis.ipynb notebook, we use the cleaned text data to perform Latent Dirichlet Allocation (LDA) topic modelling on each corpus. This notebook also generates a handful of .html files, which are interactive visual representations of the topic models.
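For readers new to LDA, the following dependency-free toy sketch shows what the model estimates: a word distribution for each topic and topic counts for each document. It uses a simple collapsed Gibbs sampler, which is a stand-in chosen for brevity and not necessarily the method or library used in the notebook. The `fit_lda` name and all hyperparameter defaults are hypothetical.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed word distribution for each topic.
    topics = [
        {
            w: (topic_word[k][word_id[w]] + beta)
            / (topic_total[k] + vocab_size * beta)
            for w in vocab
        }
        for k in range(num_topics)
    ]
    return topics, doc_topic
```

Given tokenized documents such as `[["school", "teacher"], ["flood", "storm"]]`, `fit_lda(docs, num_topics=2)` returns one word-probability dictionary per topic plus per-document topic counts. Sorting a topic's dictionary by probability gives its top words, which is the kind of information an interactive topic-model view typically presents.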
This project is licensed under the terms of the MIT license.