---
output: 
  github_document:
    html_preview: false
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

<!-- badges: start -->
[![Travis build status](https://travis-ci.org/news-r/textanalysis.svg?branch=master)](https://travis-ci.org/news-r/textanalysis)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
<!-- badges: end -->

# textanalysis

Text Analysis in R via Julia.

<img src="./man/figures/logo.png" height="200" align="right" />

## Installation

textanalysis is a wrapper around a [Julia](https://julialang.org/) package, so Julia must be installed.

```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("news-r/textanalysis") # github
```

## Setup

You _must_ run `init_textanalysis()` at the beginning of every session; otherwise you will encounter errors and be prompted to do so.

```{r}
library(textanalysis) # load the package

init_textanalysis() # initialise
```

Some functions depend on the development version of the Julia package. To install it, run:

```r
install_textanalysis(version = "latest")
```

## Basic Examples

```{r}
# build document
str <- paste(
  "They <span>write</span>, it writes too!!!",
  "This is another sentence.",
  "More stuff in this document."
)
doc <- string_document(str)

# basic cleanup
prepare(doc)
get_text(doc)

# stem
stem_words(doc)
get_text(doc)

# corpus
doc2 <- token_document("Hey write another document.")

# combine
corpus <- corpus(doc, doc2)

# standardize
standardize(corpus, "token_document")

# prepare corpus
prepare(corpus, strip_html_tags = FALSE)
get_text(corpus)

# lexicon + lexical stats
(lexicon <- lexicon(corpus))
lexical_frequency(corpus, "document")

# inverse index
inverse_index(corpus)

# dtm
m <- document_term_matrix(corpus)

# create func to easily add lexicon
bind_lexicon <- function(data){
  data %>% 
    as.data.frame() %>% 
    dplyr::bind_cols(
      lexicon %>% 
        dplyr::select(-n),
      .
    )
}

# term-frequency
tf(m) %>% bind_lexicon()

# tf-idf
tf_idf(m) %>% bind_lexicon()

# bm-25
# https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
bm_25(m) %>% bind_lexicon()

# sentiment
sentiment(corpus)

# summarise in 2 sentences
summarize(string_document(str), ns = 2L)
```
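For intuition about what the term-frequency and tf-idf weights represent, here is a minimal base-R sketch on a toy count matrix. It is not the package's implementation: the object names (`counts`, `tf_sketch`, `idf_sketch`, `tfidf_sketch`) are hypothetical, the matrix is assumed to have documents in rows and terms in columns, and the underlying Julia package may use a different normalisation.

```r
# Toy document-term counts: rows are documents, columns are terms.
counts <- matrix(
  c(2, 1, 0,
    0, 1, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("write", "document", "stuff"))
)

# Term frequency: each count divided by the total count of its document.
tf_sketch <- counts / rowSums(counts)

# Inverse document frequency:
# log(number of documents / number of documents containing the term).
doc_freq <- colSums(counts > 0)
idf_sketch <- log(nrow(counts) / doc_freq)

# tf-idf: scale each term column of the term frequencies by that term's idf.
tfidf_sketch <- sweep(tf_sketch, 2, idf_sketch, `*`)

tfidf_sketch
```

A term that appears in every document (here `document`) gets an idf of zero, so its tf-idf weight vanishes; terms concentrated in few documents are weighted up.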

## Latent Dirichlet Allocation

Fit LDA on the corpus shipped with the [gensimr](https://gensimr.news-r.org) package.

```{r}
set_seed(42L)

data("corpus", package = "gensimr")
documents <- to_documents(corpus) # convert vector to documents

crps <- corpus(documents)
dtm <- document_term_matrix(crps)

# 2 topics
# 1K iterations
lda_data <- lda(dtm, 2L, 1000L)

# classification
lda_data$ntopics_ndocs

mat <- dtm_matrix(dtm, "dense")

tfidf <- tf_idf(mat)

km <- kmeans(tfidf, centers = 2)
```

## Hash trick

```{r}
hash_func <- create_hash_function(10L)
hash("a", hash_func)
hash(doc) # doc has built-in hash
```
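As a rough illustration of the idea behind the hashing trick (mapping tokens to a fixed number of buckets instead of keeping a full lexicon), here is a small base-R sketch. The functions `sketch_hash_token()` and `sketch_hash_vector()` are hypothetical and use a simple polynomial hash chosen for readability; they are not what the package or the Julia library uses internally.

```r
# Map a single token to a bucket index between 1 and n_buckets.
# A simple polynomial rolling hash over the token's code points (illustrative only).
sketch_hash_token <- function(token, n_buckets = 10L) {
  h <- 0
  for (code in utf8ToInt(token)) {
    h <- (h * 31 + code) %% n_buckets
  }
  as.integer(h) + 1L
}

# Turn a vector of tokens into a fixed-length vector of bucket counts.
sketch_hash_vector <- function(tokens, n_buckets = 10L) {
  idx <- vapply(tokens, sketch_hash_token, integer(1), n_buckets = n_buckets)
  tabulate(idx, nbins = n_buckets)
}

sketch_hash_vector(c("a", "b", "a"), n_buckets = 10L)
```

The resulting vector always has `n_buckets` entries regardless of vocabulary size; the trade-off is that distinct tokens can collide in the same bucket.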

## Naive Bayes Classifier

```{r}
classes <- factor(c("financial", "legal"))
model <- init_naive_classifer(classes)

train <- tibble::tibble(
  text = c("this is financial doc", "this is legal doc"),
  labels = factor(c("financial", "legal"))
)

train_naive_classifier(model, train, text, labels)

test <- tibble::tibble(
  text = "this should be predicted as a legal document"
)
predict_class(model, test, text)
```

## Co-occurrence Matrix

The plot method uses [echarts4r](https://echarts4r.john-coene.com).

```r
matrix <- coom(crps)
plot(matrix)
```

![](man/figures/coom.png)
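
To give a sense of what a co-occurrence matrix contains, the following base-R sketch counts how many documents each pair of terms shares. This is a simplification under stated assumptions: the toy `docs` list and the names `incidence` and `cooc_sketch` are hypothetical, and `coom()` may compute co-occurrence differently (for example within a sliding window, with its own normalisation).

```r
# Toy tokenised documents.
docs <- list(
  c("they", "write", "document"),
  c("write", "another", "document"),
  c("more", "stuff")
)

terms <- sort(unique(unlist(docs)))

# Binary incidence matrix: one row per document, one column per term.
incidence <- t(vapply(
  docs,
  function(tokens) as.numeric(terms %in% tokens),
  numeric(length(terms))
))
colnames(incidence) <- terms

# Term-by-term counts of documents in which both terms appear.
cooc_sketch <- crossprod(incidence)
diag(cooc_sketch) <- 0 # ignore a term co-occurring with itself

cooc_sketch
```

The matrix is symmetric, and each off-diagonal cell holds the number of documents containing both terms.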