From a2c99ca73d6a48fc4040e067213cc92296331fc4 Mon Sep 17 00:00:00 2001
From: Patrick Copeland <pcopeland@mitre.org>
Date: Tue, 14 Nov 2017 09:46:42 -0500
Subject: [PATCH 1/3] Add initial analytics docs

---
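Note for reviewers (not part of the commit): the 7-gram prefilter described
under "Elasticsearch" in docs/analytics.md can be sketched in plain Python.
This is a minimal sketch under stated assumptions, not the code in this
repository. The helper names `ngrams`, `parse_ssdeep`, and `may_match` are
hypothetical, digests are assumed to have the `chunksize:chunk:double_chunk`
form, and ssdeep's collapsing of long runs of repeated characters is not
reproduced here.

```python
def ngrams(text, n=7):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def parse_ssdeep(digest):
    """Split 'chunksize:chunk:double_chunk' into its three parts."""
    chunksize, chunk, double_chunk = digest.split(":", 2)
    return int(chunksize), chunk, double_chunk


def may_match(digest_a, digest_b, n=7):
    """Cheap prefilter: False means the pair can be skipped.

    Two digests are only comparable when their chunk sizes are equal or
    differ by a factor of two, and a non-zero score needs a shared
    substring of length n in the parts that line up.
    """
    size_a, chunk_a, double_a = parse_ssdeep(digest_a)
    size_b, chunk_b, double_b = parse_ssdeep(digest_b)
    if size_a == size_b:
        # Same chunk size: chunk vs chunk, double-chunk vs double-chunk.
        return bool(ngrams(chunk_a, n) & ngrams(chunk_b, n)
                    or ngrams(double_a, n) & ngrams(double_b, n))
    if size_a == 2 * size_b:
        # a's chunk covers the same span as b's double-chunk.
        return bool(ngrams(chunk_a, n) & ngrams(double_b, n))
    if size_b == 2 * size_a:
        # a's double-chunk covers the same span as b's chunk.
        return bool(ngrams(double_a, n) & ngrams(chunk_b, n))
    return False
```

Under these assumptions, a pair for which `may_match` returns False can be
skipped, and only the remaining pairs need a call to `ssdeep.compare`.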
 docs/analytics.md | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 docs/analytics.md

diff --git a/docs/analytics.md b/docs/analytics.md
new file mode 100644
index 00000000..a4bdf322
--- /dev/null
+++ b/docs/analytics.md
@@ -0,0 +1,40 @@
+# Analytics #
+Enabling analytics and advanced queries is the primary advantage of running 
+several tools against a sample, extracting as much information as possible, and
+storing in a common datastorecs.
+
+Types of analytics and queries of interest:
+
+- cluster samples
+- outlier samples
+- samples for deep-dive analysis
+- gaps in current toolset
+
+## ssdeep Comparison ##
+Fuzzy hashing is an effective method to identify similar files based on common
+byte strings despite changes in the byte order and structure of the files.
+[ssdeep](https://ssdeep-project.github.io/ssdeep/index.html) provides a fuzzy
+hash implementation and provides the capability to compare hashes.
+
+Comparing ssdeep hashes at scale is a challenge. [[1]](https://www.virusbulletin.com/virusbulletin/2015/11/optimizing-ssdeep-use-scale/)
+originally described a method for comparing ssdeep hashes at scale.
+
+The ssdeep analytic computes ```ssdeep.compare``` for all pairs of samples
+that can produce a non-zero result and provides the capability to return all
+samples clustered based on the ssdeep hash.
+
+### Elasticsearch ###
+When possible, it can be effective to push work to the Elasticsearch cluster,
+which supports horizontal scaling. For the ssdeep comparison, Elasticsearch
+[NGram Tokenizers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html)
+are used to compute 7-grams of the chunk and double-chunk portions
+of the ssdeep hash as described here [[2]](http://www.intezer.com/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/).
+This prevents ever comparing two ssdeep hashes where the result will be zero.
+
+### Python ###
+Because we need to compute ```ssdeep.compare```, the ssdeep analytic cannot be
+done entirely in Elasticsearch. Python is used to query Elasticsearch, compute
+```ssdeep.compare``` on the results, and update the documents in Elasticsearch.
+
+
+

From 0dad6a745ca211be9c8dadd0d116f356ba9939ba Mon Sep 17 00:00:00 2001
From: awest1339 <awest1339@users.noreply.github.com>
Date: Tue, 14 Nov 2017 08:28:39 -0700
Subject: [PATCH 2/3] Add some details to analytics docs

---
 docs/analytics.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/analytics.md b/docs/analytics.md
index a4bdf322..c806faa4 100644
--- a/docs/analytics.md
+++ b/docs/analytics.md
@@ -1,14 +1,16 @@
 # Analytics #
 Enabling analytics and advanced queries is the primary advantage of running 
 several tools against a sample, extracting as much information as possible, and
-storing in a common datastorecs.
+storing the output in a common datastore.
 
-Types of analytics and queries of interest:
+The following are some example types of analytics and queries that may be of interest:
 
 - cluster samples
 - outlier samples
 - samples for deep-dive analysis
 - gaps in current toolset
+- machine learning analytics on tool outputs
+- others
 
 ## ssdeep Comparison ##
 Fuzzy hashing is an effective method to identify similar files based on common
@@ -36,5 +38,5 @@ Because we need to compute ```ssdeep.compare```, the ssdeep analytic cannot be
 done entirely in Elasticsearch. Python is used to query Elasticsearch, compute
 ```ssdeep.compare``` on the results, and update the documents in Elasticsearch.
 
-
-
+### Deployment ###
+We use a Celery beat task to kick off the ssdeep comparison nightly at 2am local time, when the system is at lower user loads. This ensures that the analytic will be run on all samples without adding an exorbinant load to the system.

From e75db426e615f48fc135c7791007f97e5ce1b49d Mon Sep 17 00:00:00 2001
From: Patrick Copeland <pcopeland@mitre.org>
Date: Tue, 14 Nov 2017 13:01:28 -0500
Subject: [PATCH 3/3] Update analytics docs

(I don't really care about line width for docs; my editor just does it.)
---
 docs/analytics.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/analytics.md b/docs/analytics.md
index c806faa4..dbeeccbb 100644
--- a/docs/analytics.md
+++ b/docs/analytics.md
@@ -3,7 +3,8 @@ Enabling analytics and advanced queries is the primary advantage of running
 several tools against a sample, extracting as much information as possible, and
 storing the output in a common datastore.
 
-The following are some example types of analytics and queries that may be of interest:
+The following are some example types of analytics and queries that may be of
+interest:
 
 - cluster samples
 - outlier samples
@@ -39,4 +40,8 @@ done entirely in Elasticsearch. Python is used to query Elasicsearch, compute
 ```ssdeep.compare``` on the results, and update the documents in Elasticsearch.
 
 ### Deployment ###
-We use a Celery beat task to kick off the ssdeep comparison nightly at 2am local time, when the system is at lower user loads. This ensures that the analytic will be run on all samples without adding an exorbinant load to the system.
+[celery beat](http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html)
+schedules and kicks off the ssdeep comparison task nightly at 2am local time,
+when the system is experiencing less load from users. This ensures that the
+analytic runs on all samples without adding an exorbitant load to the
+system.