News Articles Text Classification and Clustering

This repository contains text classification and clustering experiments on news articles. It also generates a word cloud for each article category.

The input consists of 2225 documents from a news site, corresponding to stories in five topical areas from 2004-2005.

Document Categories

  • Business
  • Entertainment
  • Politics
  • Sport
  • Tech

The first line of each document is the title, and the rest is the content of the article.
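As a minimal sketch of that layout, a raw document can be split into title and content at the first line break. The names `Article` and `parse_article` are hypothetical and not taken from the repository's code.

```python
from typing import NamedTuple


class Article(NamedTuple):
    """Hypothetical container for one parsed document."""
    title: str
    content: str


def parse_article(raw_text: str) -> Article:
    """Split raw document text into title (first line) and content (the rest)."""
    title, _, content = raw_text.strip().partition("\n")
    return Article(title=title.strip(), content=content.strip())


# Made-up sample text, only to show the split.
sample = "Example headline\n\nFirst paragraph.\nSecond paragraph."
article = parse_article(sample)
print(article.title)
print(article.content)
```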

The whole procedure consists of:

  1. Create a data set of all documents
  2. Text pre-processing
    1. Remove special characters and convert to lower case
    2. Remove stop words
    3. Lemmatization
    4. Stemming
    5. Tokenization
  3. Generate Word Clouds
  4. Vectorization
  5. Classification and Clustering
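The cleaning part of step 2 can be sketched with the standard library alone. This is an illustration under assumptions, not the repository's implementation: the stop-word list is a tiny made-up sample, tokenization happens before stop-word removal for convenience, and lemmatization and stemming are left out because they normally rely on an NLP library.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text: str) -> List[str]:
    """Lower-case, replace non-letters with spaces, tokenize, drop stop words."""
    text = text.lower()
    # Special characters and digits become spaces so words stay separated.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = preprocess("The markets rallied in 2005, and the FTSE rose!")
print(tokens)  # ['markets', 'rallied', 'ftse', 'rose']
```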

I also implemented a KNN classifier using a max heap, but it was too slow for this data set.
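For illustration, here is one way a heap-based KNN classifier can look. It is a sketch under assumptions, not the repository's code: `knn_predict` and `euclidean` are hypothetical names, vectors are plain lists, and Euclidean distance is assumed. Python's `heapq` is a min-heap, so distances are negated to get max-heap behaviour. Each prediction still scans every training vector, which may be part of why such an approach is slow on a full data set.

```python
import heapq
import math
from collections import Counter
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 3) -> str:
    """Predict a label by majority vote among the k nearest training points."""
    # Negated distances turn heapq's min-heap into a max-heap: the root is
    # always the farthest of the k candidates kept so far.
    heap: List[Tuple[float, int, str]] = []
    for idx, (vector, label) in enumerate(train):
        dist = euclidean(vector, query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx, label))
        elif dist < -heap[0][0]:
            # Closer than the current farthest candidate: swap it in.
            heapq.heapreplace(heap, (-dist, idx, label))
    votes = Counter(label for _, _, label in heap)
    return votes.most_common(1)[0][0]


# Made-up 2-D points standing in for document vectors.
train = [
    ([0.0, 0.0], "sport"), ([0.0, 1.0], "sport"), ([1.0, 0.0], "sport"),
    ([5.0, 5.0], "tech"), ([5.0, 6.0], "tech"),
]
print(knn_predict(train, [0.2, 0.1], k=3))
```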

Word Clouds

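Business

[Image: business word cloud]

Entertainment

[Image: entertainment word cloud]

Politics

[Image: politics word cloud]

Sport

[Image: sport word cloud]

Tech

[Image: tech word cloud]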

Classification

Classifiers: Multinomial Naive Bayes (MultinomialNB), SVM, Random Forest (RF), KNN

Vectorization: Bag Of Words, Tf-idf
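To make the two vectorization schemes concrete, the sketch below computes Bag of Words counts and textbook TF-IDF weights (term frequency times log(N / document frequency)) for already-tokenized documents. The function names and sample documents are hypothetical. Library implementations commonly use smoothed or normalized variants, so the numbers here need not match what the experiments produced.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(docs: List[List[str]]) -> List[Dict[str, int]]:
    """Raw term counts per document."""
    return [dict(Counter(doc)) for doc in docs]


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Made-up tokenized documents.
docs = [["market", "shares", "market"], ["film", "award"], ["market", "film"]]
bow = bag_of_words(docs)
weights = tf_idf(docs)
print(bow[0])      # {'market': 2, 'shares': 1}
print(weights[0])
```

In the first sample document, "shares" ends up with a higher weight than "market" even though it occurs less often, because "market" also appears in another document.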

Dimensionality Reduction: PCA, SVD and ICA

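ROC curves

[Figure: ROC curves, MultinomialNB with Bag of Words]

[Figure: ROC curves, MultinomialNB with Tf-idf]

[Figure: ROC curves, SVM with Bag of Words]

[Figure: ROC curves, SVM with Tf-idf]

[Figure: ROC curves, Random Forest with Bag of Words]

[Figure: ROC curves, Random Forest with Tf-idf]

[Figure: ROC curves, KNN with Bag of Words]

[Figure: ROC curves, KNN with Tf-idf]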
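For a five-class problem, one common way to draw ROC curves is one-vs-rest: each category in turn is treated as the positive class and all others as negative. Whether the plots above were produced this way is an assumption. The sketch below computes the (FPR, TPR) points and the trapezoidal area under the curve for a single such binary problem. The labels and scores are made up, and `roc_points` and `auc` are hypothetical helper names.

```python
from typing import List, Tuple


def roc_points(y_true: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the decision threshold downwards.

    Assumes y_true holds 0/1 labels with at least one of each.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# 1 = article belongs to the category of interest, 0 = any other category.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```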

Clustering

Clusterer: K-means

Vectorization: Bag Of Words, Tf-idf, Word2vec

Dimensionality Reduction: PCA, SVD and ICA
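As a rough illustration of the clustering step, the following is a plain-Python sketch of the standard K-means (Lloyd's) iteration on small made-up 2-D points, standing in for document vectors after dimensionality reduction. The function names, the random-sample initialization, and the sample data are all assumptions. The experiments themselves may well rely on a library implementation.

```python
import random
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points: List[Sequence[float]], k: int, iterations: int = 100,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(list(points), k)]
    assignments: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignments, centroids


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Cluster indices are arbitrary, so comparing clusters with the true categories requires matching each cluster to a category first.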

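PCA

[Figure: Bag of Words with PCA (bow_pca)]

[Figure: Tf-idf with PCA (tf_idf_pca)]

[Figure: Word2vec with PCA (word2vec_pca)]

SVD

[Figure: Bag of Words with SVD (bow_svd)]

[Figure: Tf-idf with SVD (tf_idf_svd)]

[Figure: Word2vec with SVD (word2vec_svd)]

ICA

[Figure: Bag of Words with ICA (bow_ica)]

[Figure: Tf-idf with ICA (tf_idf_ica)]

[Figure: Word2vec with ICA (word2vec_ica)]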