
Modify results functions #999

Open
gowthamrao opened this issue Nov 14, 2022 · 4 comments
@gowthamrao
Member

gowthamrao commented Nov 14, 2022

I would like to request functions that help manipulate CohortDiagnostics output. Several use cases justify these functions:

  1. Change cohort ID: ability to go through the results in the output zip file and replace oldCohortId with newCohortId, i.e. take a data frame with oldCohortId and newCohortId fields.
  2. Change database ID: ability to go through the results in the output zip file and replace oldDatabaseId and oldDatabaseName with newDatabaseId and newDatabaseName.
  3. Filter results: ability to filter a large result set by cohortId and databaseId, i.e. the function should take arrays of cohortIds and databaseIds.
  4. Compare cohort SQL hashes between zip file outputs and report any two cohorts that have the same SQL hash but different cohortIds.
  5. Join two or more results zip files into a single zip file.
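For illustration, items 1 and 3 could be sketched as a small utility that rewrites CSV entries inside a results zip. This is a hypothetical helper, not existing CohortDiagnostics API: the function name `remap_and_filter` and the column name `cohort_id` are assumptions about the results format.

```python
import csv
import io
import zipfile

def remap_and_filter(src, dst, id_map=None, keep_cohort_ids=None):
    """Rewrite CSV entries inside a results zip: replace old cohort IDs
    per id_map ({old: new}) and, when keep_cohort_ids is given, drop rows
    whose (remapped) cohort ID is not in that set."""
    id_map = id_map or {}
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for name in zin.namelist():
            raw = zin.read(name)
            if not name.endswith(".csv"):
                zout.writestr(name, raw)  # copy non-CSV entries untouched
                continue
            reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                cid = row.get("cohort_id")  # assumed column name
                if cid is not None:
                    cid = id_map.get(cid, cid)
                    if keep_cohort_ids is not None and cid not in keep_cohort_ids:
                        continue  # filter: keep only requested cohorts
                    row["cohort_id"] = cid
                writer.writerow(row)
            zout.writestr(name, buf.getvalue())
```

A real implementation would apply the same remap to every table that carries a cohort ID column, and handle databaseId/databaseName the same way.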
@gowthamrao
Member Author

Justification:

  1. Integrate results from multiple sites that have run diagnostics using different definitions.
  2. Split large studies into smaller ones.
  3. Fix issues in labels, e.g. the use of spaces in databaseId.

@azimov
Collaborator

azimov commented Nov 16, 2022

Whilst more validation is good, most of these things should be set at the time the study is designed. Doing (and allowing) ad hoc comparisons to merge bits of data is bad practice. I don't think it's a good idea to allow users to merge random results together - these are things that can lead to massive interpretation errors. It's much better practice to force investigator discipline at the study design step than to have utilities to merge badly collected data.

@gowthamrao
Member Author

gowthamrao commented Nov 17, 2022

Forcing investigator discipline is much harder in network studies, where contributions come from various sources. The use case I am interested in is the following:

  1. A contributor submits to the OHDSI Phenotype Library and the requirements for submission are met. The submission involves executing the cohort as developed in their local instance, with a local atlasId and databaseId.
  2. Once the peer review is complete and it is decided to accept this cohort, we need to integrate the submission into https://github.com/ohdsi-studies/PhenotypeLibraryDiagnostics . To integrate the initial contribution into the existing output of the PhenotypeLibraryDiagnostics study, we need to extract the accepted cohorts (dropping any unapproved cohortIds) and re-id the cohortId and, if needed, the databaseId. I can't ask contributors to re-run just because the OHDSI Phenotype Library has now assigned the submitted cohort a new ID (it is hard enough for data partners to run it once).

This issue is not limited to PhenotypeLibraryDiagnostics study. I have encountered the need to mix and match outputs from other studies.

@gowthamrao
Member Author

  6. Change cohort name

Sorry, missed this one.
