From 9f17887420c7c2f80e7e006c3010b91c4f858d9e Mon Sep 17 00:00:00 2001
From: Andre Castro
Date: Wed, 3 Jul 2024 15:14:19 +0200
Subject: [PATCH] CVS_Creation_Process.md

---
 csv/2024-2028/CVS_Creation_Process.md | 36 +++++++++++++++++++++++++++
 csv/README.md                         |  2 ++
 2 files changed, 38 insertions(+)
 create mode 100644 csv/2024-2028/CVS_Creation_Process.md

diff --git a/csv/2024-2028/CVS_Creation_Process.md b/csv/2024-2028/CVS_Creation_Process.md
new file mode 100644
index 0000000..4d49d0e
--- /dev/null
+++ b/csv/2024-2028/CVS_Creation_Process.md
@@ -0,0 +1,36 @@
+# CSV creation from DFG's XLSXs
+
+## For each of the Excel (.xlsx) files
+
+1. remove rows 1 and 2 (title, empty)
+2. remove empty columns A and G
+3. save both the DE and EN sheets as CSVs (to allow the following operations)
+    3.1 CSV export: check "Quote all text cells" to avoid issues with commas inside cells
+    3.2 **from this point onward we only work on the CSVs, not on the .xlsx**
+
+## In the CSVs (easier to edit and see errors)
+
+1. add headers for columns A and B. EN: `Subject Number`, `Subject`. DE: `Fachnummer`, `Fach`
+2. add "Subject Area" and "Scientific Discipline" to the header (row 1) in columns D and E
+3. remove the repeated header rows (all except row 1): 57, 137, 169
+4. remove empty rows (search in column A)
+5. fill in the missing values (in the Review Board, Subject Area and Scientific Discipline columns). This is tedious but important: we cannot rely on merged cells in the CSV, and these values are at the core of the tree structure (@SArndt-TIB let me know if this needs clarification)
+
+## Join both CSVs
+
+* just a copy-paste
+* ensure that the EN terms come before the DE terms
+* headers should be in the following sequence:
+```
+Subject Number
+Subject
+Review Board
+Subject Area
+Scientific Discipline
+Fachnummer
+Fach
+Fachkollegium
+Fachgebiet
+Wissenschaftsbereich
+```
+
diff --git a/csv/README.md b/csv/README.md
index 2c5bfb2..f217bcb 100644
--- a/csv/README.md
+++ b/csv/README.md
@@ -8,6 +8,8 @@ The ontology is created with [create_ontology.py](/scripts/create_ontology.py),
 
 The cells also contain line breaks and trailing white spaces. These may vary in between versions. This is a problem for [create_ontology.py](/scripts/create_ontology.py). The script may not be working with new versions of the Fachsystematik, unless the table is cleaned up, e.g. unexpected line breaks need to be removed, new trailing white spaces need to be removed, etc. until the script can parse through the whole file.
 
+For more detailed info on the CSV creation see [./2024-2028/CVS_Creation_Process.md](./2024-2028/CVS_Creation_Process.md)
+
 ## Checking the alignment of German and English version in the .csv file
 
 The ontology can only be created properly, if English and German version of the Fachsystematik align exactly in the .csv file. This can be tested with [parse_csv.py](/scripts/parse_csv.py).
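
The manual CSV steps described in CVS_Creation_Process.md (removing empty rows, filling in the values lost from merged cells, joining EN and DE) could in principle be scripted. Below is a minimal sketch using only Python's standard library. It is not part of the repository's scripts: the function names are hypothetical, and it assumes each cleaned table has exactly five columns, with Review Board, Subject Area and Scientific Discipline in columns C to E and the merged-cell value present on the first row of each group.

```python
import csv
import io

# Header names taken from the sequence listed in the process description.
EN_HEADERS = ["Subject Number", "Subject", "Review Board",
              "Subject Area", "Scientific Discipline"]
DE_HEADERS = ["Fachnummer", "Fach", "Fachkollegium",
              "Fachgebiet", "Wissenschaftsbereich"]


def drop_empty_rows(rows):
    """Drop rows whose first cell (column A) is empty."""
    return [row for row in rows if row and row[0].strip()]


def fill_down(rows, columns=(2, 3, 4)):
    """Copy the last seen value into empty cells of the given columns.

    Assumption: an empty cell in these columns means "same as above",
    because the spreadsheet used merged cells there.
    """
    last_seen = {}
    filled = []
    for row in rows:
        row = list(row)  # do not modify the caller's rows
        for col in columns:
            if row[col].strip():
                last_seen[col] = row[col]
            else:
                row[col] = last_seen.get(col, "")
        filled.append(row)
    return filled


def join_en_de(en_rows, de_rows):
    """Put the EN columns before the DE columns, row by row, under one header."""
    if len(en_rows) != len(de_rows):
        raise ValueError("EN and DE tables must have the same number of rows")
    body = [en + de for en, de in zip(en_rows, de_rows)]
    return [EN_HEADERS + DE_HEADERS] + body


def to_csv(rows):
    """Serialise rows with every cell quoted, so commas inside cells are safe."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

A possible use would be `to_csv(join_en_de(fill_down(drop_empty_rows(en)), fill_down(drop_empty_rows(de))))`, where `en` and `de` are the data rows (without header rows) read from the two exported CSVs. The length check in `join_en_de` only catches row-count mismatches; the alignment of the EN and DE terms would still need to be checked with [parse_csv.py](/scripts/parse_csv.py).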