Allow different (multiple) inputs #106

jameshadfield · 2024-12-02T04:17:38Z

By having all phylogenetic workflows start from two lists of inputs
(config.inputs, config.additional_inputs) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done).
2. Adding extra / private data follows a similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable augur merge behaviour. The canonical data can be
   removed / replaced via step (1) if needed.
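
The ordering requirement above can be illustrated with a minimal sketch. It is not the code in this PR: the helper names `ordered_inputs` and `merge_metadata_args`, the input name `canonical` and the file paths are all hypothetical. It assumes each config entry has a `name` and a `metadata` key, and that `augur merge` takes its metadata tables as `NAME=FILE` pairs whose order affects the result.

```python
def ordered_inputs(config):
    """Return canonical inputs followed by additional inputs, in config order.

    Hypothetical helper: assumes both config lists hold dicts with a "name" key.
    """
    combined = list(config.get("inputs", [])) + list(config.get("additional_inputs", []))
    names = [entry["name"] for entry in combined]
    # Input names must be unique so each table can be identified unambiguously.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"Duplicate input names: {', '.join(duplicates)}")
    return combined


def merge_metadata_args(config):
    """Build NAME=FILE arguments, canonical inputs first, additional inputs last."""
    return [f"{entry['name']}={entry['metadata']}" for entry in ordered_inputs(config)]


# Illustrative config only; names and paths are made up for this sketch.
example_config = {
    "inputs": [{"name": "canonical", "metadata": "canonical.tsv"}],
    "additional_inputs": [{"name": "secret", "metadata": "secret.tsv"}],
}
merge_args = merge_metadata_args(example_config)
```

Keeping the additional inputs in a separate list means they always land after the canonical inputs, regardless of how a user arranges their config file.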

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

Note that this workflow uses an old version of the CI workflow,
https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240
which copies example_data. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.
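
A rough sketch of what such an overlay would do to the config, under the assumption that overlay dictionaries are merged recursively while lists such as `inputs` are replaced wholesale (my understanding of how Snakemake combines multiple config files, but treat it as an assumption here). The function name `apply_overlay` and every path below are hypothetical.

```python
import copy


def apply_overlay(base, overlay):
    """Return a new config with overlay values applied on top of base.

    Nested dicts are merged key by key; any other value (including a list)
    replaces the base value entirely.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base config pointing at canonical data.
base_config = {
    "inputs": [
        {"name": "canonical",
         "metadata": "canonical/metadata.tsv",
         "sequences": "canonical/sequences_{segment}.fasta"},
    ],
    "additional_inputs": [],
}

# Hypothetical CI overlay that swaps the canonical inputs for example data.
ci_overlay = {
    "inputs": [
        {"name": "example",
         "metadata": "example_data/metadata.tsv",
         "sequences": "example_data/sequences_{segment}.fasta"},
    ],
}

ci_config = apply_overlay(base_config, ci_overlay)
```

Because the `inputs` list is replaced rather than appended to, the CI run would see only the example data, with no need to copy files into place first.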

See added docs for examples.

victorlin

Comments regarding sequence merge

Snakefile

victorlin · 2024-12-04T18:55:53Z

Snakefile

+    input:
+        metadata = lambda w: collect_inputs(segment=w.segment)
+    output:
+        metadata = "results/sequences_merged_{segment}.fasta"


Naming nitpick:

Suggested change

-    input:
-        metadata = lambda w: collect_inputs(segment=w.segment)
-    output:
-        metadata = "results/sequences_merged_{segment}.fasta"
+    input:
+        sequences = lambda w: collect_inputs(segment=w.segment)
+    output:
+        sequences = "results/sequences_merged_{segment}.fasta"

jameshadfield · 2024-12-04T23:26:33Z

README.md

+additional_inputs:
+  - name: secret
+    metadata: secret.tsv
+    sequences: secret_{segment}.fasta
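
As a hedged illustration of how a `{segment}` placeholder in such an entry might be resolved per segment, here is a minimal sketch. It is not the `collect_inputs` function from this PR, whose body is not shown here; the name `collect_sequence_paths` and the segment value `ha` are hypothetical. It also shows why a misspelt key in the config is worth failing on early.

```python
def collect_sequence_paths(config, segment):
    """Return the sequence file for every configured input, for one segment.

    Hypothetical helper: assumes each entry has a "name" and a "sequences"
    path template that may contain a "{segment}" placeholder.
    """
    entries = list(config.get("inputs", [])) + list(config.get("additional_inputs", []))
    paths = []
    for entry in entries:
        if "sequences" not in entry:
            # Catches typos in the config key instead of silently skipping the input.
            raise KeyError(f"Input {entry.get('name', '?')!r} has no 'sequences' key")
        paths.append(entry["sequences"].format(segment=segment))
    return paths


# Illustrative config mirroring the README snippet above.
example_config = {
    "additional_inputs": [
        {"name": "secret", "metadata": "secret.tsv", "sequences": "secret_{segment}.fasta"},
    ],
}
ha_paths = collect_sequence_paths(example_config, segment="ha")
```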


jameshadfield · 2024-12-16T02:38:00Z

Closing in favor of #112

jameshadfield force-pushed the james/refactor-inputs branch from 82abda3 to c6d92ed Compare December 2, 2024 20:47

jameshadfield mentioned this pull request Dec 2, 2024

Provide a generic pattern for including additional user data alongside curated data nextstrain/pathogen-repo-guide#72

Open

victorlin reviewed Dec 4, 2024

jameshadfield commented Dec 4, 2024

victorlin mentioned this pull request Dec 4, 2024

merge: Support sequences with cross-checking nextstrain/augur#1601

Open

5 tasks

jameshadfield force-pushed the james/refactor-inputs branch from c6d92ed to 05b4622 Compare December 12, 2024 03:55

jameshadfield force-pushed the james/update-config-syntax branch from fb99903 to c60554a Compare December 12, 2024 03:55

jameshadfield mentioned this pull request Dec 15, 2024

Use augur merge for sequences nextstrain/zika#76

Draft

4 tasks

jameshadfield closed this Dec 16, 2024

This was referenced Dec 16, 2024

merge: Support sequences nextstrain/augur#1579

Open

Multiple inputs / overriding inputs #112

Open

victorlin deleted the james/refactor-inputs branch December 17, 2024 23:31

jameshadfield mentioned this pull request Dec 18, 2024

WIP add config schema & generate HTML docs #107

Closed
