
support AI-readiness data checks in data quality engine #366

Open
mbjones opened this issue Jul 18, 2023 · 2 comments

Comments

@mbjones
Member

mbjones commented Jul 18, 2023

For the FAIR Data Quality check engine (issue #328), consult the ESIP AI-Readiness Checklist for a list of data quality checks that the community considers important for assessing whether data are ready for ML tooling. See:

I propose that these would be a good candidate set of checks: they have already been vetted by ESIP, and they would be a useful way to vet the data quality engine. Maybe it should be its own suite?

@mbjones
Member Author

mbjones commented Jul 18, 2023

Also see the Analysis Ready Data (ARD) standards from CEOS:

There is a new OGC working group focused on ARD data standards: https://www.ogc.org/press-release/ogc-forms-new-analysis-ready-data-standards-working-group/

@jeanetteclark
Collaborator

I reformatted the AI Readiness checklist into a CSV with a column indicating whether each "check" could actually be implemented in an automated way. The values in that column are a best-guess, first-instincts kind of answer. Based on the list, I identified the following checks that are already implemented:

  • Is there contact information for subject-matter experts?
  • Is there a clear data license?
  • Is the license standardized and machine-readable (e.g. Creative Commons)?
  • Is it available in at least one open, non-proprietary format?
  • Is there a comprehensive data dictionary/codebook to describe parameters?
  • Does it include details on the spatial and temporal extent?
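As a sketch of what one of these checks might look like in code, here is a minimal, self-contained example for the license checks, assuming a check can be expressed as a function that takes a metadata document and returns a status/output pair. This is an illustration on my part, not the engine's actual check API, and the `check_clear_license` name and EML structure are assumptions for the example:

```python
# Hypothetical sketch of the "clear, machine-readable license" checks.
# Assumes EML-style metadata with an intellectualRights element; the
# function name and return shape are illustrative, not the engine's API.
import re
import xml.etree.ElementTree as ET

# Creative Commons license/public-domain URLs count as machine-readable.
CC_PATTERN = re.compile(r"creativecommons\.org/(licenses|publicdomain)/[\w./-]+")

def check_clear_license(eml_text: str) -> dict:
    """Report whether the metadata declares a license, and whether it is
    a standardized, machine-readable one (e.g. Creative Commons)."""
    root = ET.fromstring(eml_text)
    rights = root.find(".//intellectualRights")
    if rights is None:
        return {"status": "FAILURE",
                "output": "No intellectualRights element found"}
    text = "".join(rights.itertext())
    if CC_PATTERN.search(text):
        return {"status": "SUCCESS",
                "output": "Machine-readable Creative Commons license found"}
    return {"status": "WARNING",
            "output": "License present but not recognized as machine-readable"}

eml = """<eml>
  <dataset>
    <intellectualRights>
      This work is licensed under https://creativecommons.org/licenses/by/4.0/
    </intellectualRights>
  </dataset>
</eml>"""
print(check_clear_license(eml)["status"])  # SUCCESS
```

A real check would of course resolve the metadata document from the repository rather than take a string, but the pass/warn/fail shape is the point.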

The following checks could be easily implemented:

  • Have null values/gaps been filled?
  • What is the timeliness of the data?
  • Is there quantitative information about data resolution in space and time?
  • Is the provenance tracked and documented?
  • Is the data dictionary standardized?
  • Do the parameters follow a defined standard?
  • Are parameters crosswalked in an ontology or common vocabulary (e.g. NIEM)?
  • What is the file format?
  • Is it machine-readable?
  • Has the data been anonymized / de-identified?
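To make "easily implemented" concrete, here is a rough sketch of the "have null values/gaps been filled?" check, assuming the data object is a CSV; the set of null tokens and the function name are my assumptions, not anything defined by the checklist or the engine:

```python
# Hypothetical sketch of a null-values/gaps check over a CSV data object.
# NULL_TOKENS is an illustrative set of common missing-value markers.
import csv
import io

NULL_TOKENS = {"", "NA", "NaN", "null", "-999"}

def check_null_values(csv_text: str) -> dict:
    """Count cells that look like null/gap values and report a status."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    missing = 0
    for row in reader:
        for value in row.values():
            total += 1
            if value is None or value.strip() in NULL_TOKENS:
                missing += 1
    if total == 0:
        return {"status": "ERROR", "output": "No data rows found"}
    status = "SUCCESS" if missing == 0 else "WARNING"
    return {"status": status,
            "output": f"{missing}/{total} cells ({missing / total:.1%}) "
                      f"are null or gap values"}

sample = "date,temp\n2023-01-01,4.2\n2023-01-02,NA\n"
print(check_null_values(sample)["status"])  # WARNING
```

Several of the other checks in this group (file format, machine-readability, resolution metadata) could follow the same pattern: inspect the object or its metadata, then emit a status plus a human-readable explanation.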

The rest either are not applicable, or would be difficult/impossible to implement.

ML-checks.csv
