The objective is to provide a simple and fast delta provider: the final goal is to show the differences between two versions of the same data, whether they are separated in time (changes between yesterday and today) or spread across different machines.
The core of the solution is the bare-metal Unix diff command, more specifically the unified diff format, which has been used (among many other things) to patch the Linux kernel source code (see the 1996 Linux Journal article, for those who want to shed a tear over the past :-) ).
Basically, given two files, diff -u returns (simplified):
1,this line is the same in both files
-2,this line contains AAA in old file
+2,this line contains BBB in new file
-3,this line is just in old file
+4,this line is just in new file
Where
• Line with id 1 is in both files and is unchanged, so it has no leading sign
• Line with id 2 differs between the old and new file, so it appears twice: once with a leading minus sign and the old values, and once with a plus sign and the new values
• Line with id 3 is only in the old file, so it appears just once, with a minus sign
• Line with id 4 is only in the new file, so it appears just once, with a plus sign
Which leads to the conclusion that (considering only rows with a leading sign)
• Lines with a leading + are INSERTS
• Lines with a leading - are DELETES
• Lines with the same id repeated are UPDATES (the - line contains the old values, the + line the new ones)
Starting from those basic assumptions, generating deltas is pretty straightforward.
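To make those rules concrete, here is a minimal, purely illustrative Python sketch (not one of the attached scripts) that classifies already-filtered diff -u lines, assuming the row key is the first CSV field, as in the files used below:

import sys

old, new = {}, {}                          # key -> full row, split by leading sign
for line in sys.stdin:                     # expects only lines starting with + or -
    if line.startswith(("+++", "---")):    # skip diff's own file-header lines
        continue
    sign, row = line[0], line[1:].rstrip("\n")
    key = row.split(",", 1)[0]             # assumption: first CSV field is the key
    (old if sign == "-" else new)[key] = row

for key in sorted(set(old) | set(new)):
    if key in old and key in new:          # same key on both sides -> UPDATE
        print("U", old[key], "->", new[key])
    elif key in old:                       # only in the old file -> DELETE
        print("D", old[key])
    else:                                  # only in the new file -> INSERT
        print("I", new[key])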
Here I show my quick-and-dirty solution, which should be good enough for a P.O.C.
Many files are attached; let’s start from the beginning:
• oneLineDiff.sh
is used to do the real magic: it calls the diff -u command, filters out the lines without a leading sign, and compresses the output (a plausible reconstruction follows this list).
No magic, no 200 lines of code with a 20-page manual (those numbers swap if the code is in Python :-P), no custom code, 99.9999999999% bug-free, a one-line solution, runs everywhere with no changes, works out of the box, no external dependencies, no compiler/interpreter, …
Maybe the K.I.S.S.-est solution I’ve ever found :-)
• deltasJSON.py
takes the oneLineDiff.sh output and processes it, producing a JSON row for every difference between the files (a stripped-down sketch also follows this list).
Please note this is just a P.O.C.; there are many ways to write this code better! :-)
• deltasSQL.py
is the very same, but produces an SQL file. Same terms & conditions apply :-)
• deltasCYPHER.py
same again, but emits Cypher statements (it appears in the test run below). Ditto :-)
• randomData.entities.py
creates CSVs with random data (a stand-in sketch appears just before the execution example below).
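The attached oneLineDiff.sh is the reference; a plausible reconstruction of what it boils down to (my guess, assuming gzipped inputs as in the run below) is:

#!/bin/bash
# Hypothetical sketch of oneLineDiff.sh: unified diff of two gzipped files,
# keep only the lines with a leading sign (note this also keeps diff's own
# ---/+++ file-header lines), recompress the result.
diff -u <(zcat "$1") <(zcat "$2") | grep '^[+-]' | gzip

Invoked as ./oneLineDiff.sh old.csv.gz new.csv.gz > out.diff.gz, exactly as in the run below.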
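Likewise, a stripped-down, hypothetical take on deltasJSON.py, emitting only a few of the fields visible in the real output further down (naive CSV splitting, fine for this generated data, and ignoring edge cases such as the changed CSV header row):

import gzip, json, sys

table, diff_gz, header_csv = sys.argv[1], sys.argv[2], sys.argv[3]
header = [h.strip().lower() for h in open(header_csv).read().split(",")]

old, new = {}, {}
with gzip.open(diff_gz, "rt") as f:
    for line in f:
        if line.startswith(("+++", "---")):      # skip diff's file-header lines
            continue
        sign, row = line[0], line[1:].rstrip("\n")
        fields = [v.strip('"') for v in row.split(",")]
        (old if sign == "-" else new)[fields[0]] = fields

for key in sorted(set(old) | set(new)):
    op = "U" if key in old and key in new else ("D" if key in old else "I")
    print(json.dumps({
        "tableName": table,
        "sourceTablePk": key,
        "operationType": op,
        "oldValuesJson": dict(zip(header, old.get(key, []))),
        "newValuesJson": dict(zip(header, new.get(key, []))),
    }))

The SQL and Cypher variants would only swap the final print for statement formatting.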
A performance test has been executed; the logs are below.
The source CSV files have 200,000 rows each.
Each CSV file has 5 columns, detailed in the header files (attached as well):
ID,LABEL,ENTITYNAME,ATTR0,ATTR1
The data files are created to reproduce the worst-case scenario:
• File1 has rows with IDs from 1 to 200000
• File2 has rows with IDs from 100001 to 300000
• IDs from 100001 to 200000 are shared, but all their attributes are changed
So, to sum up, we will have
• 100k new IDs -> 100k “inserts”
• 100k changed IDs -> 100k “updates”
• 100k disappeared IDs -> 100k “deletes”
In diff terms that is 100k '+' lines, 100k '-' lines, and 100k '-'/'+' pairs, i.e. 400,000 signed rows; together with the two changed CSV header rows and diff’s own ---/+++ file-header lines, that accounts for the 400,004 lines counted below.
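randomData.entities.py is attached and is the reference; for completeness, here is a hypothetical stand-in consistent with the files shown below, with the argument order inferred from the run (prefix, row count, output directory, optional ID offset):

import gzip, random, sys

prefix, rows, outdir = sys.argv[1], int(sys.argv[2]), sys.argv[3]
offset = int(sys.argv[4]) if len(sys.argv) > 4 else 0

with gzip.open("%s/%s.csv.gz" % (outdir, prefix), "wt") as f:
    # header in the neo4j-import style seen in the execution output
    f.write("id:ID(entity%s),:LABEL,entityName,attr0,attr1\n" % prefix)
    for i in range(offset + 1, offset + rows + 1):
        f.write('"%d","entity%s","Ent_%d","attr0_%d","attr1_%d"\n'
                % (i, prefix, i, random.randint(0, 999999), random.randint(0, 999999)))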
Execution example:
date
mkdir DONTBACKUP/
./randomData.entities.py 200KTest.20010101 200000 DONTBACKUP/ &
./randomData.entities.py 200KTest.20111111 200000 DONTBACKUP/ 100000 &
wait
zcat DONTBACKUP/200Ktest.20010101.csv.gz | head -4
zcat DONTBACKUP/200Ktest.20111111.csv.gz | head -4
zcat DONTBACKUP/200Ktest.20010101.csv.gz | tail -3
zcat DONTBACKUP/200Ktest.20111111.csv.gz | tail -3
zcat DONTBACKUP/200Ktest.20010101.csv.gz | grep ^\"111111
zcat DONTBACKUP/200Ktest.20111111.csv.gz | grep ^\"111111
time ./oneLineDiff.sh \
DONTBACKUP/200Ktest.20010101.csv.gz \
DONTBACKUP/200Ktest.20111111.csv.gz \
> DONTBACKUP/200Ktest.diff.gz
zcat DONTBACKUP/200Ktest.diff.gz | wc -l
echo ID,LABEL,ENTITYNAME,ATTR0,ATTR1 > DONTBACKUP/200Ktest.header.csv
time ./deltasJSON.py \
200Ktest \
DONTBACKUP/200Ktest.diff.gz \
DONTBACKUP/200Ktest.header.csv \
| gzip > DONTBACKUP/200Ktest.json.gz
time ./deltasSQL.py \
200Ktest \
DONTBACKUP/200Ktest.diff.gz \
DONTBACKUP/200Ktest.header.csv \
| gzip > DONTBACKUP/200Ktest.sql.gz
time ./deltasCYPHER.py \
200Ktest \
DONTBACKUP/200Ktest.diff.gz \
DONTBACKUP/200Ktest.header.csv \
| gzip > DONTBACKUP/200Ktest.cy.gz
zcat DONTBACKUP/200Ktest.json.gz | wc -l
zcat DONTBACKUP/200Ktest.json.gz | grep \"111111\"
date
Execution output example:
$ date
Tue, Aug 16, 2016 12:05:42 PM
$ mkdir DONTBACKUP/
$ ./randomData.entities.py 200KTest.20010101 200000 DONTBACKUP/ &
[1] 1204
$ ./randomData.entities.py 200KTest.20111111 200000 DONTBACKUP/ 100000 &
[2] 9740
$ wait
[1]- Done ./randomData.entities.py 200KTest.20010101 200000 DONTBACKUP/
[2]+ Done ./randomData.entities.py 200KTest.20111111 200000 DONTBACKUP/ 100000
$ zcat DONTBACKUP/200Ktest.20010101.csv.gz | head -4
id:ID(entity200KTest.20010101),:LABEL,entityName,attr0,attr1
"1","entity200KTest.20010101","Ent_1","attr0_527593","attr1_392675"
"2","entity200KTest.20010101","Ent_2","attr0_268000","attr1_20936"
"3","entity200KTest.20010101","Ent_3","attr0_912182","attr1_618131"
$ zcat DONTBACKUP/200Ktest.20111111.csv.gz | head -4
id:ID(entity200KTest.20111111),:LABEL,entityName,attr0,attr1
"100001","entity200KTest.20111111","Ent_100001","attr0_46062","attr1_466131"
"100002","entity200KTest.20111111","Ent_100002","attr0_226397","attr1_925353"
"100003","entity200KTest.20111111","Ent_100003","attr0_331961","attr1_362192"
$ zcat DONTBACKUP/200Ktest.20010101.csv.gz | tail -3
"199998","entity200KTest.20010101","Ent_199998","attr0_747332","attr1_56769"
"199999","entity200KTest.20010101","Ent_199999","attr0_693643","attr1_4989"
"200000","entity200KTest.20010101","Ent_200000","attr0_708617","attr1_734619"
$ zcat DONTBACKUP/200Ktest.20111111.csv.gz | tail -3
"299998","entity200KTest.20111111","Ent_299998","attr0_922601","attr1_176458"
"299999","entity200KTest.20111111","Ent_299999","attr0_588278","attr1_253595"
"300000","entity200KTest.20111111","Ent_300000","attr0_295461","attr1_36491"
$ zcat DONTBACKUP/200Ktest.20010101.csv.gz | grep ^\"111111
"111111","entity200KTest.20010101","Ent_111111","attr0_703808","attr1_665025"
$ zcat DONTBACKUP/200Ktest.20111111.csv.gz | grep ^\"111111
"111111","entity200KTest.20111111","Ent_111111","attr0_632068","attr1_273703"
$
$ time ./oneLineDiff.sh \
> DONTBACKUP/200Ktest.20010101.csv.gz \
> DONTBACKUP/200Ktest.20111111.csv.gz \
> > DONTBACKUP/200Ktest.diff.gz
real 0m1.385s
user 0m1.337s
sys 0m0.243s
$ zcat DONTBACKUP/200Ktest.diff.gz | wc -l
400004
$ echo ID,LABEL,ENTITYNAME,ATTR0,ATTR1 > DONTBACKUP/200Ktest.header.csv
$ time ./deltasJSON.py \
> 200Ktest \
> DONTBACKUP/200Ktest.diff.gz \
> DONTBACKUP/200Ktest.header.csv \
> | gzip > DONTBACKUP/200Ktest.json.gz
real 0m7.533s
user 0m11.902s
sys 0m0.357s
$ time ./deltasSQL.py \
> 200Ktest \
> DONTBACKUP/200Ktest.diff.gz \
> DONTBACKUP/200Ktest.header.csv \
> | gzip > DONTBACKUP/200Ktest.sql.gz
real 0m3.638s
user 0m4.601s
sys 0m0.217s
$ time ./deltasCYPHER.py \
> 200Ktest \
> DONTBACKUP/200Ktest.diff.gz \
> DONTBACKUP/200Ktest.header.csv \
> | gzip > DONTBACKUP/200Ktest.cy.gz
real 0m4.131s
user 0m3.790s
sys 0m0.248s
$
$ zcat DONTBACKUP/200Ktest.json.gz | wc -l
3250019
$ zcat DONTBACKUP/200Ktest.json.gz
[cut]
{
  "tableName": "200Ktest",
  "sourceTablePk": "111111",
  "headerCsv": "ID,LABEL,ENTITYNAME,ATTR0,ATTR1",
  "operationType": "U",
  "oldValuesCsv": [
    "111111",
    "entity200KTest.20010101",
    "Ent_111111",
    "attr0_88029865",
    "attr1_3997504"
  ],
  "oldValuesKv": [
    {
      "id": "111111"
    },
    {
      "label": "entity200KTest.20010101"
    },
    {
      "entityname": "Ent_111111"
    },
    {
      "attr0": "attr0_88029865"
    },
    {
      "attr1": "attr1_3997504"
    }
  ],
  "oldValuesJson": {
    "id": "111111",
    "label": "entity200KTest.20010101",
    "entityname": "Ent_111111",
    "attr0": "attr0_88029865",
    "attr1": "attr1_3997504"
  },
  "newValuesCsv": [
    "111111",
    "entity200KTest.20111111",
    "Ent_111111",
    "attr0_16702374",
    "attr1_5453705"
  ],
  "newValuesKv": [
    {
      "id": "111111"
    },
    {
      "label": "entity200KTest.20111111"
    },
    {
      "entityname": "Ent_111111"
    },
    {
      "attr0": "attr0_16702374"
    },
    {
      "attr1": "attr1_5453705"
    }
  ],
  "newValuesJson": {
    "id": "111111",
    "label": "entity200KTest.20111111",
    "entityname": "Ent_111111",
    "attr0": "attr0_16702374",
    "attr1": "attr1_5453705"
  },
  "updateJsonNewVals": {
    "label": "entity200KTest.20111111",
    "attr0": "attr0_16702374",
    "attr1": "attr1_5453705"
  },
  "updateJsonOldNewVals": [
    {
      "label": {
        "is": "entity200KTest.20111111",
        "was": "entity200KTest.20010101"
      }
    },
    {
      "attr0": {
        "is": "attr0_16702374",
        "was": "attr0_88029865"
      }
    },
    {
      "attr1": {
        "is": "attr1_5453705",
        "was": "attr1_3997504"
      }
    }
  ]
}
[cut]
$ zcat DONTBACKUP/200Ktest.cy.gz
[cut]
-- OLD VALUES= "111111","entity200KTest.20010101","Ent_111111","attr0_88029865","attr1_3997504"
-- NEW VALUES= "111111","entity200KTest.20111111","Ent_111111","attr0_16702374","attr1_5453705"
MATCH (x:200Ktest {id: '"111111"'} )
SET
LABEL: 'entity200KTest.20111111',
ATTR0: 'attr0_16702374',
ATTR1: 'attr1_5453705',
WHERE ID="111111";
[cut]
$ zcat DONTBACKUP/200Ktest.sql.gz
[cut]
-- OLD VALUES= "111111","entity200KTest.20010101","Ent_111111","attr0_88029865","attr1_3997504"
-- NEW VALUES= "111111","entity200KTest.20111111","Ent_111111","attr0_16702374","attr1_5453705"
UPDATE 200Ktest SET
label = "entity200KTest.20111111", attr0 = "attr0_16702374", attr1 = "attr1_5453705"
WHERE ID="111111";
[cut]
$ date
Tue, Aug 16, 2016 12:06:03 PM