NaLaFi

This repo contains data and code to work on Natural Language Fingerprints (NaLaFi), and to replicate the results in Bentz (2023). The code should be run in the following order:

Data Generation

  • randomStringGenerator.Rmd: generates random strings for comparison to natural languages and other sign strings.
  • shuffledTextGenerator.Rmd: takes the files in folder NaLaFi/data/writing and shuffles the characters randomly.
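The two generation steps can be illustrated with a minimal sketch. The repository's code is in R Markdown; the Python below is not taken from it. The function names `random_string` and `shuffle_characters` are hypothetical, and uniform sampling over the alphabet is an assumption, since the actual generator may use other distributions.

```python
import random


def random_string(alphabet, length, seed=None):
    """Draw `length` characters uniformly at random (with replacement).

    Assumption: uniform sampling; the repository's generator may differ.
    """
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


def shuffle_characters(text, seed=None):
    """Return `text` with its characters in random order.

    Character counts are preserved; only the order changes.
    """
    rng = random.Random(seed)
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)
```

Shuffling keeps the unigram distribution of a text intact while destroying its sequential structure, which is what makes shuffled texts a useful baseline next to fully random strings.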

Simple Stats for the Data

  • simpleStats.Rmd: gives an overview of the files in NaLaFi/data, namely the number of files per subcorpus and the number of characters per file.

Sampling of Character Strings

  • sampler.Rmd: samples chunks of UTF-8 characters of pre-defined length (e.g. 10, 100, 1000) and stores them in NaLaFi/samples. Note that this folder should be emptied before re-running the code.
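As a rough illustration of the sampling step, the sketch below draws fixed-length chunks from a text. It is written in Python rather than the repository's R Markdown. The name `sample_chunks` is hypothetical, and random start positions are an assumption; the actual sampler may select chunks differently.

```python
import random


def sample_chunks(text, chunk_length, n_chunks, seed=None):
    """Sample `n_chunks` substrings of `chunk_length` characters from `text`.

    Assumption: start positions are drawn uniformly at random, so chunks
    may overlap. Python strings are indexed by Unicode code point, so a
    chunk never splits a multi-byte UTF-8 character.
    """
    if chunk_length <= 0 or len(text) < chunk_length:
        return []
    rng = random.Random(seed)
    max_start = len(text) - chunk_length
    starts = [rng.randint(0, max_start) for _ in range(n_chunks)]
    return [text[s:s + chunk_length] for s in starts]
```

Writing one chunk per line to a file would give the one-string-per-line layout that the estimation step expects.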

Estimations of Feature Values

  • estimations.Rmd: calculates the feature values (TTR, unigram entropy, entropy rate, repetition rate) for each string of UTF-8 characters (one per line) in the files of NaLaFi/samples. The output is a CSV file stored at NaLaFi/results/features.csv. Note that this file should be deleted before re-running the code.
  • estimationPlots.Rmd: provides plots for the estimated feature values.
  • stabilizationAnalyses.Rmd: estimates feature values at increasing step sizes (i.e. for a given number of characters), and creates plots of "stabilization", i.e. how feature values change with the number of characters.
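Two of the features listed above have simple standard definitions, sketched here in Python for illustration (the repository itself uses R Markdown). The entropy function is the plug-in (maximum likelihood) estimator, which is an assumption: the repository may use a different estimator. Entropy rate and repetition rate are left out because their exact definitions are not given here. All function names are hypothetical, and computing the curve over growing prefixes is likewise an assumption.

```python
import math
from collections import Counter


def type_token_ratio(s):
    """Number of distinct characters divided by the total number of characters."""
    if not s:
        raise ValueError("empty string")
    return len(set(s)) / len(s)


def unigram_entropy(s):
    """Plug-in estimate of character unigram entropy, in bits."""
    if not s:
        raise ValueError("empty string")
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def stabilization_curve(s, step):
    """Feature values on growing prefixes of `s` (assumed scheme).

    Returns a list of (n_characters, ttr, entropy) tuples for prefix
    lengths step, 2*step, ... up to len(s).
    """
    return [
        (n, type_token_ratio(s[:n]), unigram_entropy(s[:n]))
        for n in range(step, len(s) + 1, step)
    ]
```

Plotting the second or third tuple element against the first shows how quickly each feature settles as more characters are observed.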

Classification

  • classificationKnn.Rmd: classifies the character strings into "writing" and "non-writing" based on the feature values (TTR, unigram entropy, entropy rate, repetition rate) with the k-nearest neighbors (KNN) method, and stores the results in results/KNN.
  • classificationLR.Rmd: classifies the character strings with a logistic regression (LR) model, and stores the results in results/LR.
  • classificationSVM.Rmd: classifies the character strings with a support vector machine (SVM), and stores the results in results/SVM.
  • classificationMLP.Rmd: classifies the character strings with different Multilayer Perceptron (MLP) architectures, and stores the results in results/MLP.
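To make the classification setup concrete, here is a minimal k-nearest-neighbor sketch in plain Python. It is not the repository's implementation, which is in R Markdown. The function name `knn_predict`, the Euclidean distance, and majority voting are assumptions, and the feature vectors in the usage lines are made up for illustration only, not results from the repository.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    Assumption: Euclidean distance on the raw feature vectors.
    """
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in neighbors[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Made-up two-dimensional feature vectors, for illustration only.
train_X = [(0.10, 3.9), (0.12, 4.1), (0.90, 1.0), (0.85, 1.2)]
train_y = ["writing", "writing", "non-writing", "non-writing"]
prediction = knn_predict(train_X, train_y, (0.11, 4.0), k=3)
```

In practice the features would usually be scaled before computing distances, since entropies and ratios live on different ranges.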

Hyperparameters

  • HyperParamTuning.Rmd: gives diagnostic plots for hyperparameter values and model performance.

Reference

Bentz (2023). The Zipfian Challenge: Learning the statistical fingerprint of natural languages. CoNLL, Singapore.