- New functions `add_frequency()` and `add_tf_idf()` allow for consistent phrasing of workflows. These new methods are supported by `visualize()` and `tabulize()`.
- New vectorized functions support using dplyr's `mutate()` and similar use cases: `get_frequency()` for returning counts and ratios of values in a vector; `is_new()` and `is_hapax()` for testing uniqueness of values in a vector; `get_cumulative_vocabulary()`, `get_ttr()`, `get_hir()`, and `get_htr()` for measuring the cumulative change of a vector over time; `get_match()` and `get_sentiment()` for matching values in a dictionary; and `get_tf()`, `get_tf_by()`, `get_idf_by()`, and `get_tfidf_by()` for weighting elements of term frequency-inverse document frequency.
- Bar plots of words per document now use better logic with labels, and a new `label_color` argument allows for customizing label color when needed.
- Added "skills ramp" article documenting vectorized functions and customized tables and figures
- `get_gutenberg_corpus()` should do less, and now it does. Other functionality is available via gutenbergr.
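
The vectorized measures listed above (counts, first appearances, hapax tests, cumulative vocabulary, type-token ratio) can be sketched in base R as follows. This is a minimal illustration of the ideas only: every `sketch_*` helper name is hypothetical, and the package's own functions may define or compute these measures differently.

```r
# Base R only. The sketch_* helpers are hypothetical names for illustration;
# they are not the package's API and may not match its definitions.
words <- c("the", "cat", "saw", "the", "dog", "the")

# Count of each value, returned at the same length as the input vector.
sketch_frequency <- function(x) as.integer(table(x)[as.character(x)])

# TRUE the first time a value appears in the vector.
sketch_is_new <- function(x) !duplicated(x)

# TRUE for values that appear exactly once in the whole vector.
sketch_is_hapax <- function(x) sketch_frequency(x) == 1L

# Running count of distinct values seen so far.
sketch_cumulative_vocabulary <- function(x) cumsum(sketch_is_new(x))

# Running type-token ratio: distinct values so far divided by values so far.
sketch_ttr <- function(x) sketch_cumulative_vocabulary(x) / seq_along(x)

# Because each helper returns one value per input element, the results can
# sit side by side as columns, which is what makes them usable in mutate().
result <- data.frame(
  word = words,
  n = sketch_frequency(words),
  new = sketch_is_new(words),
  hapax = sketch_is_hapax(words),
  vocabulary = sketch_cumulative_vocabulary(words),
  ttr = sketch_ttr(words)
)
print(result)
```

Returning a vector of the same length as the input is the design choice that matters here, since it lets each measure be added as a column without reshaping the data.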
- New function `contextualize()` shows terms in a window of context
- New function `add_index()` adds a column showing word indices within each document
- `load_texts()` adds support to keep original capitalization and punctuation alongside the tokenized `word` column with the `keep_original` argument. This process does not work in all instances, so the option defaults to `FALSE`.
- `add_dictionary()` includes an option to keep original terms. This is useful for n-gram dictionaries, where a match might otherwise span multiple rows.
- `add_ngrams()` supports negative ranges, for building context windows
- `add_partitions()` supports overlapping partitions
- `standardize_titles()` capitalizes words after terminal punctuation
- `add_dictionary()` now supports n-gram dictionaries, matching across multiple words
- `make_dictionary()` has a slightly changed syntax, with clearer argument names `definitions` and `name`
- Along with its related `visualize()` methods, `plot_doc_word_bars()` improves support for `color_y = TRUE` and `reorder_y = TRUE`
- When naming colors, `change_colors()` now allows setting colors for unnamed values
- `standardize_titles()` capitalizes Roman numerals
- `load_texts()` adds support for custom tokenization using the dots parameter from `tidytext::unnest_tokens()`
- New function `add_partitions()` adds a partition column, useful for getting same-sized samples
- `identify_by()` now works with multiple columns, and it keeps existing metadata columns. This is especially useful with the new `add_partitions()` column, using something like `my_corpus() |> add_partitions() |> identify_by(title, partition)` before continuing to work with partitioned documents. To return framing to unpartitioned data, use `identify_by(title)` or whatever other column is most relevant.
- New visualization and tabulization methods for `expand_documents()`
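
The idea of a partition column combined with a multi-column document identifier can be sketched in base R. This is a minimal illustration under stated assumptions: the data frame, the partition size of 2, the position-based rule, and the pasted `doc_id` format are all hypothetical, and they are not necessarily how `add_partitions()` or `identify_by()` work internally.

```r
# Base R only; an illustrative sketch, not the package's implementation.
# A toy corpus with one row per token (hypothetical data).
corpus <- data.frame(
  title = rep(c("A", "B"), times = c(5L, 3L)),
  word = paste0("w", 1:8)
)

# Hypothetical partition size: two tokens per partition.
size <- 2L

# Position of each token within its own title.
position <- ave(seq_along(corpus$title), corpus$title, FUN = seq_along)

# Tokens 1-2 fall in partition 1, tokens 3-4 in partition 2, and so on.
# A trailing partition may be shorter than the others in this sketch.
corpus$partition <- (position - 1L) %/% size + 1L

# Combine two columns into one identifier so later steps can treat
# each partition of each title as its own document.
corpus$doc_id <- paste(corpus$title, corpus$partition, sep = "_")

print(corpus)
```

Grouping by the combined identifier treats each partition as a document, while grouping by `title` alone returns the framing to whole texts.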
- Functions now imported: `count()` and `drop_na()`
- When the ggraph package is loaded, `plot_bigrams()` now uses a color scale on edges, rather than spot color on nodes, with full support for `change_colors()`
- Improved documentation with website articles for customizing colors and showing code comparisons
- First "public" release! 🎉
- Unnecessary components removed and dependencies reduced
- Examples standardized and made reproducible
- `change_colors()` now works with `plot_bigrams()`
- `change_colors()` now includes a "dubois" colorset
- `tabulize()` documentation is now improved for online output
- `standardize_titles()` now works with factors
- Added default behavior for `visualize()` on a corpus
- Part of speech tagging should now work for more texts
- New `tabulize()` generic function for preparing tables with supported methods
- Standardizing argument names between `visualize()` and `tabulize()`
- New package documentation for getting started
- New `collapse_rows()` function for clean tables using `gt::gt()`
- New feature in `standardize_titles()` to keep initial articles
- New options in `plot_doc_word_bars()` to keep the order of Y-axis values consistent and to color by Y-axis value instead of by facet
- Rename `add_lexical_diversity()` to `add_vocabulary()`
- Add option for renaming existing `doc_id` column when using `identify_by()`
- `get_gutenberg_corpus()` now retrieves HTML versions of texts from Project Gutenberg and parses header tags for section markers
- New function `parse_html()` for reading headers in an HTML file
- New function `move_header_to_text()` for converting header to text
- New function `identify_by()` to simplify using something other than `doc_id`
- Improved internal linking within documentation
- Better working `visualize()` function as generic with supported methods
- Improved `change_colors()` with added support for the Okabe-Ito colorset and the option of starting with something other than the first color of a palette. With these changes, color options have been removed from other visualization functions to consolidate them within `change_colors()`.
- When a data set includes only one unique `doc_id`, visualizations are no longer divided into facets.
- In an effort to reduce the number of dependencies, many packages have been removed from "Imports" (geomtextpath, ggrepel, glue, NLP, openNLP, plotly, RColorBrewer, stopwords, textstem, wordcloud). Where appropriate, these have been shifted to "Suggests" or dropped entirely.