This project checks the quality of data and metadata in a Dataverse (https://dataverse.org/) installation. Using the Dataverse API, the program examines all published datasets in order to establish their degree of data and metadata quality.
The project consists of:
- A document to define FAIRness and data & metadata quality
- A batch program that checks many attributes of the published datasets and scores them
The project will go through the following milestones:
- November 2022 - Write a document to define FAIRness and data & metadata quality. In the same document, state the project goals and how to achieve them.
- December 2022 - Code a batch program that checks many attributes of the published datasets and scores them
- January 2023 - Test & produce a list of datasets eligible for improvement
The coding language is PHP, version 7 or later. You can launch the program with a simple "php qc.php". All files are commented in English. The project has the following files:
- qc.php - this is the main program.
- qc.ini - this is the configuration file.
- qc_constants.php - this file defines the constants. It reads their values from qc.ini, so you normally do not need to edit it, but you can take a look if you want.
- qc_listfunctions.php - this file contains the functions that harvest the data.
- qc_printfunctions.php - this file is used only to print the results. It writes a txt file and a csv file.
- qc_qualityfunctions.php - this is the core of the project. This file contains functions that check the dataset quality.
- qc_utilfunctions.php - this is a collection of utility functions, such as the logging and cURL functions.
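As a rough illustration of the harvesting step, the sketch below lists datasets through the Dataverse Search API and pulls out their persistent identifiers. The function names (buildSearchUrl, extractDatasetIds, fetchJson) and the example.org URL are hypothetical; they are not taken from qc_listfunctions.php or qc_utilfunctions.php, which may work differently.

```php
<?php
// Hypothetical helper names, for illustration only.

// Build a Search API URL that lists datasets one page at a time.
function buildSearchUrl(string $apiBase, int $start = 0, int $perPage = 100): string
{
    $query = http_build_query([
        'q'        => '*',
        'type'     => 'dataset',
        'start'    => $start,
        'per_page' => $perPage,
    ]);
    return rtrim($apiBase, '/') . '/search?' . $query;
}

// Extract the persistent identifiers from a decoded Search API response.
function extractDatasetIds(array $response): array
{
    $items = $response['data']['items'] ?? [];
    $ids = [];
    foreach ($items as $item) {
        if (isset($item['global_id'])) {
            $ids[] = $item['global_id'];
        }
    }
    return $ids;
}

// Fetch a URL with cURL and decode the JSON body; returns null on failure.
function fetchJson(string $url): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $decoded = json_decode($body, true);
    return is_array($decoded) ? $decoded : null;
}
```

A caller would typically loop, increasing $start by $perPage until a page comes back with no items, and pass each identifier on to the quality checks.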
The most important file is qc_qualityfunctions.php. It contains the routines that check for quality. For each dataset, the program performs:
- ORGANIZATIONAL CHECK
- COMPLETENESS CHECK
- DATA-ACCESS CHECK
- USER-ORIENTED CHECKS
- PUBLISHER CHECKS
- FILE-ORIENTED CHECKS
Each check returns a score (typically from 1 to 5). The scores of all checks are then averaged to give the overall score of the dataset.
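The averaging step can be sketched as follows. The check names and score values are invented for illustration, and averageScore is a hypothetical function, not necessarily how qc_qualityfunctions.php combines the scores.

```php
<?php
// Hypothetical function: plain arithmetic mean of the per-check scores.
function averageScore(array $scores): float
{
    if (count($scores) === 0) {
        return 0.0; // no checks ran, so there is nothing to average
    }
    return array_sum($scores) / count($scores);
}

// Invented scores for one dataset, one entry per check.
$scores = [
    'organizational' => 4,
    'completeness'   => 3,
    'data_access'    => 5,
    'user_oriented'  => 2,
    'publisher'      => 4,
    'file_oriented'  => 3,
];

printf("Average score: %.2f\n", averageScore($scores));
```

With these sample values the scores sum to 21 over six checks, so the average is 3.50.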
Nothing special is required: you just need PHP and the PHP cURL library, which is standard. Configure qc.ini, make sure you have created an output folder, and then launch the quality check with the command:
php qc.php
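Before the first run, the prerequisites above can be verified with a small script. This is a hypothetical pre-flight check and not part of the project's files; the folder name "out" is assumed from the default output paths.

```php
<?php
// Hypothetical pre-flight check: returns a list of problems keyed by topic.
function preflightProblems(string $outDir): array
{
    $problems = [];
    if (version_compare(PHP_VERSION, '7.0.0', '<')) {
        $problems['php'] = 'PHP 7 or later is required, found ' . PHP_VERSION;
    }
    if (!extension_loaded('curl')) {
        $problems['curl'] = 'The PHP cURL extension is not loaded';
    }
    if (!is_dir($outDir)) {
        $problems['outdir'] = "Output folder '$outDir' does not exist";
    } elseif (!is_writable($outDir)) {
        $problems['outdir'] = "Output folder '$outDir' is not writable";
    }
    return $problems;
}

// Report every problem found; print nothing but "OK" when all is well.
$problems = preflightProblems('out');
foreach ($problems as $topic => $message) {
    echo "[$topic] $message\n";
}
if (count($problems) === 0) {
    echo "OK\n";
}
```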
You need to set some parameters in qc.ini before running the program:
- Scoring votes: the following constants define the scoring scale. Votes range from 1 to 5, with 5 being the best.
SCUMVOTE = 1
LOWVOTE = 2
MEDIUMVOTE = 3
HIGHVOTE = 4
MEGAVOTE = 5
- Output files: provide a path and file name for each output file, or keep the defaults below and create the "out" directory.
TXTDS_OUT = "out/ds.txt"
CSVDS_OUT = "out/ds.csv"
T_LOG = "out/log.txt"
- API key and unblock key: insert your API key if you also want to scan draft datasets, in the form:
T_APIK = "key=xxxxxxxxxxxxxxxxxxxxxxxx"
Insert your unblock key if you also want to scan users, in the form:
T_UNBLOCK = "unblock-key=xxxxxxxxxxxxxx"
- Mandatory parameters
T_URL = "https://dataverse.xxxxx.com.it/api/"
Mandatory: insert your Dataverse URL here
T_DVURL = "https://dataverse.xxxxx.com/dataverse/"
Mandatory: insert the name of your root dataverse here
ROOTDV = "xxxxxxx Dataverse"
Mandatory: insert your typical department dataverse name here
DEPTDV = "department of"
Mandatory: insert your publisher here
ROOTPUBLISHER = "xxxxxxx Dataverse"
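Because a missing mandatory parameter only shows up at run time, it can help to validate the configuration first. The sketch below parses an INI string with PHP's parse_ini_string and reports which of the mandatory keys listed above are absent or empty. The function missingMandatoryKeys and the example.org values are hypothetical, and the project itself may load qc.ini differently.

```php
<?php
// Key names are the mandatory parameters listed above; the checker is a sketch.
const MANDATORY_KEYS = ['T_URL', 'T_DVURL', 'ROOTDV', 'DEPTDV', 'ROOTPUBLISHER'];

// Return the mandatory keys that are absent or blank in the parsed config.
function missingMandatoryKeys(array $config): array
{
    $missing = [];
    foreach (MANDATORY_KEYS as $key) {
        if (!isset($config[$key]) || trim((string) $config[$key]) === '') {
            $missing[] = $key;
        }
    }
    return $missing;
}

// Sample configuration with invented values; ROOTPUBLISHER is left out on purpose.
$ini = <<<'INI'
T_URL = "https://dataverse.example.org/api/"
T_DVURL = "https://dataverse.example.org/dataverse/"
ROOTDV = "Example Dataverse"
DEPTDV = "department of"
INI;

$config = parse_ini_string($ini);
$missing = missingMandatoryKeys($config === false ? [] : $config);

if (count($missing) > 0) {
    echo 'Missing mandatory parameters: ' . implode(', ', $missing) . "\n";
}
```

To check the real file instead of the sample string, replace the parse_ini_string call with parse_ini_file('qc.ini').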