From 1473ec7357c199939ddc7152991d2dbf05ca2b96 Mon Sep 17 00:00:00 2001 From: Ariel Rokem Date: Wed, 19 Jun 2024 12:08:13 -0700 Subject: [PATCH 1/2] More about challenges, use cases, cross-sector interactions. --- references.bib | 10 ++++ sections/01-introduction.qmd | 15 +++--- sections/02-challenges.qmd | 65 +++++++++++++++++++------- sections/xx-cross-sector.qmd | 91 ++++++++++++++++++++++++------------ sections/xx-use-cases.qmd | 53 ++++++++++----------- 5 files changed, 152 insertions(+), 82 deletions(-) diff --git a/references.bib b/references.bib index ffe4178..f05565f 100644 --- a/references.bib +++ b/references.bib @@ -1,3 +1,13 @@ +@MISC{Nosek2019CultureChange, + title = "Strategy for Culture Change", + author = "Nosek, Brian", + abstract = "Strategy for Culture Change", + howpublished = "\url{https://www.cos.io/blog/strategy-for-culture-change}", + note = "Accessed: 2024-6-19", + language = "en" +} + + @ARTICLE{Poldrack2024BIDS, title = "The Past, Present, and Future of the Brain Imaging Data Structure diff --git a/sections/01-introduction.qmd b/sections/01-introduction.qmd index f6ef5b8..b2dd77c 100644 --- a/sections/01-introduction.qmd +++ b/sections/01-introduction.qmd @@ -30,12 +30,15 @@ non-profit organization that was founded in the 1990s developed a set of guidelines for licensing of OSS that is designed to protect the rights of developers and users. On the more technical side, tools such as the Git Source-code management system support open-source development workflows that -can be adopted in the development of standards. When these social and technical -innovations are put together they enable a host of positive defining features -of OSS, such as transparency, collaboration, and decentralization. These -features allow OSS to have a remarkable level of dynamism and productivity, -while also retaining the ability of a variety of stakeholders to guide the -evolution of the software to take their needs and interests into account. 
+can be adopted in the development of standards. Governance approaches have been +honed to address the challenges of managing a range of stakeholder interests +and to mediate between large numbers of weakly-connected individuals that +contribute to OSS. When these social and technical innovations are put together +they enable a host of positive defining features of OSS, such as transparency, +collaboration, and decentralization. These features allow OSS to have a +remarkable level of dynamism and productivity, while also retaining the ability +of a variety of stakeholders to guide the evolution of the software to take +their needs and interests into account. The present report seeks to explore how OSS processes and tools have affected the development of data and metadata standards. The report will triangulate diff --git a/sections/02-challenges.qmd b/sections/02-challenges.qmd index c04c7b8..ae9e394 100644 --- a/sections/02-challenges.qmd +++ b/sections/02-challenges.qmd @@ -1,38 +1,69 @@ -# Challenges for open source data and metadata standards, and some solutions +# Opportunities and risks for open-source standards -## Too much flexibility, or too little +Data and metadata standards that adopt tools and practices of OSS ("open-source +standards" henceforth) stand to reap many of the benefits that the OSS model +has provided in the development of other technologies. At the same time, these +tools and practices are associated with risks that need to be mitigated. -It's a story as old as time (or at least as old as standards): users fail to -consider existing standards, or perceive an existing standard as not offering -enough flexibility to cover some use case, and they embark on the development -of a new standard [^1]. +## Flexibility vs. stability -[^1]: So old in fact that an oft-cited [XKCD comic](https://xkcd.com/927/) has -been devoted to it. +One of the defining characteristics of OSS is its dynamism and its rapid +evolution. 
Because OSS can be used by anyone and, in most cases, contributions +can be made by anyone, innovations flow into OSS in a bottom-up fashion from +user/developers. Pathways to contribution by members of the community are often +well-defined: both from the technical perspective (e.g., through a pull request +on GitHub, or other similar mechanisms), as well as from the social perspective +(e.g., whether contributors need to accept certain licensing conditions through +a contributor licensing agreement) and the socio-technical perspective (e.g., +how many people need to review a contribution, what are the timelines for a +contribution to be reviewed and accepted, what are the release cycles of the +software that make the contribution available to a broader community of users, +etc.). Similarly, open-source standards may also find themselves addressing use +cases and solutions that were not originally envisioned through bottom-up +contributions of members of a research community to which the standard +pertains. However, while this dynamism provides an avenue for flexibility it +also presents a source of tension. This is because data and metadata standards +apply to already existing datasets, and changes may affect the compliance of these +existing datasets. + +## Mismatches between standards developers and user communities -Another failure is the mismatch between developers of the standard and users. There is an inherent gap in both interest and ability to engage with the -technical details undergirding standards and their development between the +technical details undergirding standards and their development between the core developers of the standard and their users. 
In extreme cases, these interests -may be at odds, as developers implement sophisticated mechanisms to automate -the creation of the standard or advocate for more technically advanced +may even be at odds, as developers implement sophisticated mechanisms to +automate the creation of the standard or advocate for more technically advanced mechanisms for evolving the standard, leaving potential users sidelined in the development of the standard, and limiting their ability to provide feedback about the practical implications of changes to the standards. ## Unclear pathways for standards success -Standards typically develop organically through sustained and persistent efforts from dedicated -groups of data practitioneers. These include scientists and the broader ecosystem of data curators and users. However there is no playbook on the structure and components of a data standard, or the pathway that moves a data implementation to a data standard. -As a result, data standardization lacks formal avenues for research grants. +Standards typically develop organically through sustained and persistent +efforts from dedicated groups of data practitioners. These include scientists +and the broader ecosystem of data curators and users. However, there is no +playbook on the structure and components of a data standard, or the pathway +that moves a data implementation to a data standard. As a result, data +standardization lacks formal avenues for research grants. ## Cross domain funding gaps -Data standardization investment is justified if the standard is generalizable beyond any specific science domain. However while the use cases are domain sciences based, data standardization is seen as a data infrastrucutre and not a science investment. Moreover due to how science research funding works, scientists lack incentives to work across domains, or work on infrastructure problems. 
+Data standardization investment is justified if the standard is generalizable +beyond any specific science domain. However, while the use cases are domain +sciences based, data standardization is seen as a data infrastructure and not a +science investment. Moreover, due to how science research funding works, +scientists lack incentives to work across domains, or work on infrastructure +problems. ## Data instrumentation issues -Data for scientific observations are often generated by proprietary instrumentation due to commercialization or other profit driven incentives. There islack of regulatory oversight to adhere to available standards or evolve Significant data transformation is required to get data to a state that is amenable to standards, if available. If not available, there is lack of incentive to set aside investment or resources to invest in establishing data standards. +Data for scientific observations are often generated by proprietary +instrumentation due to commercialization or other profit-driven incentives. +There is a lack of regulatory oversight to adhere to available standards or evolve them. +Significant data transformation is required to get data to a state that is +amenable to standards, if available. If not available, there is a lack of +incentive to set aside investment or resources to invest in establishing data +standards. ## Sustainability diff --git a/sections/xx-cross-sector.qmd b/sections/xx-cross-sector.qmd index 4978d0b..3be647e 100644 --- a/sections/xx-cross-sector.qmd +++ b/sections/xx-cross-sector.qmd @@ -2,34 +2,63 @@ The importance of standards stems not only from discussions within research fields about how research can best be conducted to take advantage of existing -and growing datasets, but also arises from interactions with other sectors. - -For example, an ongoing series of policy discussions that address the -interactions between research communities and the general public. 
In the United -States, these policies are expressed, for example, in memos issued by the -directors of the White House Office of Science and Technology Policy (OSTP), -James Holdren (in 2013) and Alondra Nelson (in 2022). While these memos focused -primarily on making peer-reviewed publications funded by the US Federal -government available to the general public, they also lay an increasingly -detailed path toward the publication and general availability of the data that -is collected in research that is funded by the US government. The general -guidance and overall spirit of these memos dovetail with more specific policy -guidance related to data and metadata standards. For example, the importance of -standards was underscored in a recent report by the Subcommittee on Open -Science of the National Science and Technology Council on the "Desirable -characteristics of data repositories for federally funded research" -[@nstc2022desirable]. The report explicitly called out the importance of -"allow[ing] datasets and metadata to be accessed, downloaded, or exported from -the repository in widely used, preferably non-proprietary, formats consistent -with standards used in the disciplines the repository serves." This highlights -the need for data and metadata standards across a variety of different kinds of -data. In addition, a report from the National Institute of Standards and -Technology on "U.S. Leadership in AI: A Plan for Federal Engagement in -Developing Technical Standards and Related Tools" emphasized that -- -specifically for the case of AI -- "U.S. government agencies should prioritize -AI standards efforts that are [...] Consensus-based, [...] Inclusive and -accessible, [...] Multi-path, [...] Open and transparent, [...] and [that] -result in globally relevant and non-discriminatory standards..." [@NIST2019]. 
-The converging characteristics of standards that arise from these reports -suggest that considerable thought needs to be given to how standards arise so -that these goals are achieved. +and growing datasets, but also arises from interactions with other sectors. Several different kinds of cross-sector interactions can be defined as having important +impact on the development of open-source standards. + +## Governmental policy-setting + +The development of open practices in research has entailed an ongoing +interaction and dialogue with various governmental bodies that set policies for +research. For example, for research that is funded by the public, this entails +an ongoing series of policy discussions that address the interactions between +research communities and the general public. One way in which this manifests in +the United States specifically is in memos issued by the directors of the White +House Office of Science and Technology Policy (OSTP), James Holdren (in 2013) and +Alondra Nelson (in 2022). While these memos focused primarily on making +peer-reviewed publications funded by the US Federal government available to the +general public, they also lay an increasingly detailed path toward the +publication and general availability of the data that is collected in research +that is funded by the US government. The general guidance and overall spirit of +these memos dovetail with more specific policy guidance related to data and +metadata standards. For example, the importance of standards was underscored in +a recent report by the Subcommittee on Open Science of the National Science and +Technology Council on the "Desirable characteristics of data repositories for +federally funded research" [@nstc2022desirable]. 
The report explicitly called +out the importance of "allow[ing] datasets and metadata to be accessed, +downloaded, or exported from the repository in widely used, preferably +non-proprietary, formats consistent with standards used in the disciplines the +repository serves." This highlights the need for data and metadata standards +across a variety of different kinds of data. In addition, a report from the +National Institute of Standards and Technology on "U.S. Leadership in AI: A +Plan for Federal Engagement in Developing Technical Standards and Related +Tools" emphasized that -- specifically for the case of AI -- "U.S. government +agencies should prioritize AI standards efforts that are [...] Consensus-based, +[...] Inclusive and accessible, [...] Multi-path, [...] Open and transparent, +[...] and [that] result in globally relevant and non-discriminatory +standards..." [@NIST2019]. The converging characteristics of standards that +arise from these reports suggest that considerable thought needs to be given to +how standards arise so that these goals are achieved. + +A compelling road map towards implementation and adoption of +community-developed standards is offered in a blog post authored by the Center +for Open Science's Brian Nosek, entitled "Strategy for Culture Change" +[@Nosek2019CultureChange]. The core idea is that affecting a turn toward open +science requires an alignment of not only incentives and values, but also +technical infrastructure and user experience. A sociotechnical bridge between +these pieces, which make adoption of standards possible, and maybe even easy, +and the policy goals, arises from a community of practice that makes adoption +of standards normative. Once all of these pieces are in place, making adoption +of open science standards required becomes more straightforward and less +onerous. 
+ +## Funding + +While government-set policy is primarily directed towards research that is +funded through governmental funding agencies, there are other ways in which +funding relates to the development of open-source standards. One way is in +funding the development of these standards. For example, the National +Institutes of Health have provided some of the funding for the development of +the Brain Imaging Data Structure standard in neuroscience. + + + diff --git a/sections/xx-use-cases.qmd b/sections/xx-use-cases.qmd index ca8e401..8932a8e 100644 --- a/sections/xx-use-cases.qmd +++ b/sections/xx-use-cases.qmd @@ -7,8 +7,7 @@ astronomy, high-energy physics and earth sciences have a relatively long history of shared data resources from organizations such as LSST and CERN, while other fields have only relatively recently become aware of the value of data sharing and its impact. These disparate histories inform how standards -have evolved and how OSS practices have pervaded their -development. +have evolved and how OSS practices have pervaded their development. ## Astronomy @@ -20,7 +19,7 @@ used in astronomy reads and writes the FITS format. It was developed by observatories in the 1980s to store image data in the visible and x-ray spectrum. It has been endorsed by IAU, as well as funding agencies. Though the format has evolved over time, “once FITS, always FITS”. That is, the format -cannot be evolved to introduce changes that break backwards-compatibility. +cannot be evolved to introduce changes that break backwards compatibility. Among the features that make FITS so durable is that it was designed originally to have a very restricted metadata schema. That is, FITS records were designed to be the lowest common denominator of word lengths in computer systems at the @@ -30,16 +29,17 @@ be stored in this format and relationships between data from different instruments can be related, rendering manual and error-prone procedures for conforming images obsolete. 
-## High-energy physics +## High-energy physics (HEP) -In HEP standards to collect the data have been established and the community is -fairly homogeneous, so standards have very high penetration [@Basaglia2023-dq]. -A top-down approach is taken so that within every large collaboration standards -are enforced, and this adoption is centrally managed. Access to raw data is -essentially impossible, and making it publicly available is both technically -very hard and potentially ill-advised. Analysis tools are tuned specifically to -the standards. Incentives to use the standards are provided by funders that -require data management plans that specify how the data is shared. +Because data collection is centralized, standards to collect and store HEP data +have been established and the adoption of these standards in data analysis has +high penetration [@Basaglia2023-dq]. A top-down approach is taken so that +within every large collaboration standards are enforced, and this adoption is +centrally managed. Access to raw data is essentially impossible, and making it +publicly available is both technically very hard and potentially ill-advised. +Therefore, analysis tools are tuned specifically to the standards. Incentives +to use the standards are provided by funders that require data management plans +that specify how the data is shared. ## Neuroscience @@ -54,22 +54,19 @@ collection efforts [@Koch2012-ve]. This change has been brought on through a combination of technical advances in data acquisition techniques, which now generate large and very high-dimensional/information-rich datasets, cultural changes, which have ushered in new norms of transparency and reproducibility, -and funding initiatives that have encouraged this kind of data collection -(including the US BRAIN Initiative and the Allen Institute for Brain Science). -Neuroscience presents an interesting example because these changes are -relatively recent. 
This means that standards for data and metadata in -neuroscience have been prone to adopt many of the elements of OSS development. -Two salient examples in neuroscience are the Neurodata Without Borders file -format for neurophysiology data [@Rubel2022NWB] and the Brain Imaging Data -Structure (BIDS) standard for neuroimaging data [@Gorgolewski2016BIDS]. BIDS in -particular owes some of its success to the adoption and adaptation of OSS -mechanisms [@Poldrack2024BIDS]. One of the challenges that the BIDS standard -faces is that it covers only a subset of the large range of neuroscience data -types that it could cover. To evolve and include more different use cases, the -BIDS community adopted a mechanism called a BIDS Enhancement Proposal (BEP). -This mechanism is directly inspired by the Python programming language -community, which developed the Python Enhancement Proposal procedure, that is -used to introduce new ideas into the language. Though the BEP mechanism takes a +and funding initiatives that have encouraged this kind of data collection. +However, because these changes are recent relative to the other cases mentioned +above, standards for data and metadata in neuroscience have been prone to adopt +many elements of modern OSS development. Two salient examples in neuroscience +are the Neurodata Without Borders file format for neurophysiology data +[@Rubel2022NWB] and the Brain Imaging Data Structure (BIDS) standard for +neuroimaging data [@Gorgolewski2016BIDS]. BIDS in particular owes some of its +success to the adoption of OSS development mechanisms [@Poldrack2024BIDS]. For +example, small changes to the standard are managed through the GitHub pull +request mechanism; larger changes are managed through a BIDS Enhancement +Proposal (BEP) process that is directly inspired by the Python programming +language community's Python Enhancement Proposal procedure, which is used to +introduce new ideas into the language. 
Though the BEP mechanism takes a slightly different technical approach, it tries to emulate the open-ended and community-driven aspects of Python development to accept contributions from a wide range of stakeholders and tap a broad base of expertise. From 0ca8773c4a68ca45b02fbcd9546bc25afd799a99 Mon Sep 17 00:00:00 2001 From: Ariel Rokem Date: Wed, 19 Jun 2024 12:09:41 -0700 Subject: [PATCH 2/2] Try to push the products into a separate products branch. So we can remove them from main, avoiding git yuckiness. --- .github/workflows/publish.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index a84850a..a080c47 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -52,6 +52,7 @@ jobs: add: '_manuscript/index.pdf' author_name: 'GitHub Actions' message: 'Add paper.pdf at ${{ github.sha }}' + new_branch: products - name: Commit DOCX @@ -61,4 +62,5 @@ jobs: add: '_manuscript/index.docx' author_name: 'GitHub Actions' message: 'Add paper.doxc at ${{ github.sha }}' + new_branch: products