CVS_Creation_Process.md
Andre Castro committed Jul 3, 2024
1 parent 73a0cbb commit 9f17887
Showing 2 changed files with 38 additions and 0 deletions.
36 changes: 36 additions & 0 deletions csv/2024-2028/CVS_Creation_Process.md
@@ -0,0 +1,36 @@
# CSV creation from DFG's XLSXs

## For each of the Excel (.xlsx) files

1. remove rows 1,2 (title, empty)
2. remove empty columns A and G
3. save both DE and EN sheets into a CSV (to allow the following operations)
3.1 CSV export: check "Quote all text cells" to avoid issues with commas within the cells
3.2 **from this point onward we shall only work on the CSVs and not the .xlsx**
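
The "Quote all text cells" option can also be reproduced in a script. The following is a minimal sketch using Python's standard `csv` module, not the export used for this commit; the function name `write_quoted_csv` and the sample rows are hypothetical placeholders.

```python
import csv
import io


def write_quoted_csv(rows, handle):
    """Write rows so that every field is wrapped in double quotes."""
    # QUOTE_ALL quotes every field, so commas inside a cell cannot be
    # mistaken for column separators when the file is read back.
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)


# Hypothetical placeholder rows, not real Fachsystematik entries.
sample_rows = [
    ["000-00", "Placeholder subject, with a comma"],
    ["000-01", "Second placeholder"],
]

buffer = io.StringIO()
write_quoted_csv(sample_rows, buffer)
print(buffer.getvalue())
```

Reading the result back with `csv.reader` should return the original cells, including the one that contains a comma.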

## in CSVs (easier to edit and see errors)

1. add headers for columns A and B. EN: `Subject Number`, `Subject`; DE: `Fachnummer`, `Fach`
2. add "Subject Area" and "Scientific Discipline" to the header (row 1) in columns D, E
3. remove header rows (except row 1): 57, 137, 169
4. remove empty rows (search in column A)
5. fill in the missing values (in the Review Board, Subject Area, Scientific Discipline columns) - this is tedious but important, as we cannot rely on merged cells in the CSV, and it is at the core of the tree structure (@SArndt-TIB let me know if this needs clarification)
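
The fill-in step could be scripted instead of done by hand. The sketch below assumes that each formerly merged cell keeps its value in the first row of the merged block and is empty in the rows beneath it, so the last seen value can be copied downward. The function name `fill_down` and the sample rows are hypothetical.

```python
FILL_COLUMNS = ["Review Board", "Subject Area", "Scientific Discipline"]


def fill_down(rows, columns=FILL_COLUMNS):
    """Copy the last non-empty value of each column into the empty cells below it."""
    last_seen = {}
    filled = []
    for row in rows:
        row = dict(row)  # work on a copy, leave the input untouched
        for col in columns:
            value = (row.get(col) or "").strip()
            if value:
                last_seen[col] = value  # remember the value for later rows
            else:
                row[col] = last_seen.get(col, "")  # empty cell: reuse last value
        filled.append(row)
    return filled


# Hypothetical placeholder rows, as csv.DictReader would yield them.
rows = [
    {"Subject Number": "000-01", "Subject": "A", "Review Board": "Board X",
     "Subject Area": "Area Y", "Scientific Discipline": "Discipline Z"},
    {"Subject Number": "000-02", "Subject": "B", "Review Board": "",
     "Subject Area": "", "Scientific Discipline": ""},
    {"Subject Number": "001-01", "Subject": "C", "Review Board": "Board W",
     "Subject Area": "", "Scientific Discipline": ""},
]

for row in fill_down(rows):
    print(row)
```

The same function would work for the DE sheet by passing `columns=["Fachkollegium", "Fachgebiet", "Wissenschaftsbereich"]`. The result still deserves a manual check, since a cell that is empty for any reason other than merging would be filled as well.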

## Join both CSVs

* just a copy-pasta
* ensure that EN comes before the DE terms
* headers should be in the following sequence:
```
Subject Number
Subject
Review Board
Subject Area
Scientific Discipline
Fachnummer
Fach
Fachkollegium
Fachgebiet
Wissenschaftsbereich
```
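
As a sketch of the join, assuming the EN and DE tables are pasted side by side (EN columns first, then DE columns) and contain the same number of rows in the same order, the following builds the combined table with the header sequence listed above. The function name `join_sheets` and the sample rows are hypothetical.

```python
EN_HEADERS = ["Subject Number", "Subject", "Review Board",
              "Subject Area", "Scientific Discipline"]
DE_HEADERS = ["Fachnummer", "Fach", "Fachkollegium",
              "Fachgebiet", "Wissenschaftsbereich"]


def join_sheets(en_rows, de_rows):
    """Place each EN row next to the DE row at the same position."""
    if len(en_rows) != len(de_rows):
        raise ValueError(
            f"row count differs: EN has {len(en_rows)}, DE has {len(de_rows)}"
        )
    joined = [EN_HEADERS + DE_HEADERS]  # EN headers come before DE headers
    for en, de in zip(en_rows, de_rows):
        joined.append([en[h] for h in EN_HEADERS] + [de[h] for h in DE_HEADERS])
    return joined


# Hypothetical placeholder rows, as csv.DictReader would yield them.
en_rows = [dict(zip(EN_HEADERS, ["000-01", "A", "Board X", "Area Y", "Discipline Z"]))]
de_rows = [dict(zip(DE_HEADERS, ["000-01", "A (de)", "Kollegium X", "Gebiet Y", "Bereich Z"]))]

for line in join_sheets(en_rows, de_rows):
    print(line)
```

Raising an error on differing row counts is a design choice of this sketch: a silent `zip` would truncate the longer sheet and hide a misalignment.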

2 changes: 2 additions & 0 deletions csv/README.md
@@ -8,6 +8,8 @@ The ontology is created with [create_ontology.py](/scripts/create_ontology.py),

The cells also contain line breaks and trailing white spaces, which may vary between versions. This is a problem for [create_ontology.py](/scripts/create_ontology.py): the script may not work with new versions of the Fachsystematik unless the table is cleaned up first, e.g. by removing unexpected line breaks and new trailing white spaces, until the script can parse the whole file.
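
The clean-up of a single cell could look like the following minimal sketch. It is not part of [create_ontology.py](/scripts/create_ontology.py); the function name `clean_cell` is hypothetical, and replacing a line break with a single space (rather than removing it) is an assumption.

```python
import re


def clean_cell(text):
    """Replace line breaks inside a cell with one space and trim the ends."""
    # A line break, together with any whitespace around it, becomes one space.
    text = re.sub(r"\s*[\r\n]+\s*", " ", text)
    # Remove leading and trailing white space.
    return text.strip()


print(repr(clean_cell("Placeholder\nsubject  ")))
print(repr(clean_cell("first part \r\n second part")))
```

Words that were hyphenated across a line break inside a cell would end up as `part- part` with this rule, so such cells still need a manual check.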

For more detailed info on the CSV creation see [./2024-2028/CVS_Creation_Process.md](./2024-2028/CVS_Creation_Process.md)

## Checking the alignment of German and English version in the .csv file

The ontology can only be created properly if the English and German versions of the Fachsystematik align exactly in the .csv file. This can be tested with [parse_csv.py](/scripts/parse_csv.py).
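
One simple alignment check is to compare the subject number in the English and German columns of every row. The sketch below is an illustration only and does not reproduce the logic of [parse_csv.py](/scripts/parse_csv.py), which is not shown here. It assumes a joined CSV with the column names `Subject Number` and `Fachnummer`; the function name `find_misaligned_rows` and the sample data are hypothetical.

```python
import csv
import io


def find_misaligned_rows(handle):
    """Return (row number, EN number, DE number) for rows whose numbers differ."""
    reader = csv.DictReader(handle)
    mismatches = []
    # The header is row 1, so the first data row is row 2.
    for row_no, row in enumerate(reader, start=2):
        en = (row.get("Subject Number") or "").strip()
        de = (row.get("Fachnummer") or "").strip()
        if en != de:
            mismatches.append((row_no, en, de))
    return mismatches


# Hypothetical placeholder data with one deliberately shifted row.
sample = io.StringIO(
    '"Subject Number","Subject","Fachnummer","Fach"\n'
    '"000-01","A","000-01","A (de)"\n'
    '"000-02","B","000-03","C (de)"\n'
)
print(find_misaligned_rows(sample))
```

An empty list means every row carries the same number in both languages. It does not prove that the subject names themselves are correct translations.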
