Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - allow running workflow from outside of repo #103

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jameshadfield
Copy link
Member

Following on from nextstrain/measles#55 partly as a learning exercise and partly because I want to use it for avian-flu. The main differences here are

  • A much more complicated snakemake workflow and config structure
  • A workflow which expected to be run with --configfile, i.e. there's no "default configuration values". This is how mpox works too, among others.

Tested using two scenarios:

  1. GISAID builds. Running in a separate directory outside of the avian-flu repo. I used a custom-gisaid.yaml config of
extends: gisaid.yaml
segments:
  - pb2

# NOTE - this reveals a (rather big) issue with our config style as this _won't_
# overwrite the builds as we intend it to, it'll merge in (and thus not change anything)
builds:
  h7n9: 
    - all-time

target_sequences_per_tree: 1000 # for testing

And running via

AVIAN_FLU="path/to/avian-flu/repo"
snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 --configfile custom-gisaid.yaml -pf auspice/avian-flu_h7n9_ha_all-time.json

(The explicit target file was needed due to the config structure - see the note in the YAML above - and this makes the config override of segments irrelevant.)

  1. H5N1-cattle-outbreak. Running in a separate directory outside of the avian-flu repo (and different from the GISAID builds above). I used a config.yaml config of
extends: h5n1-cattle-outbreak.yaml
segments:
  - pb2

And ran via snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 -pf


Overall it worked really well. I'll start some threads in the code to continue discussion of various parts.

Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated
Comment on lines 29 to 38
Resolve a relative *path* given in a configuration value. Before resolving
any '{x}' substrings are replaced by their corresponding wildcards (if the
`wildcards` argument is provided).

Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this
is not the Snakefile you are looking at now but (e.g.) the one in
avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
Copy link
Member Author

@jameshadfield jameshadfield Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joverlee521 if we do implement a search order like this it opens up the possibility of moving the "default" files from (e.g.) ./config/ to ./gisaid/ or ./gisaid/defaults/. Similarly for custom rules. If we go this direction I think it warrants a rethink of how we want to structure repos.

@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch 3 times, most recently from ad0f15c to 90b994c Compare November 28, 2024 03:12
Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated
class InvalidConfigError(Exception):
pass

def resolve_config_path(original_path, wildcards=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to me: need to resolve any custom rules using this (?) function, currently they're just

for rule_file in config.get('custom_rules', []):
    include: rule_file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a decision for us to make here which wasn't needed in measles. The h5n1-cattle-outbreak config uses a custom rule which needs to be searched for within the avian-flu repo. Other workflows use a similar approach for CI - e.g. zika.

Approach 1: Use the same resolving approach as resolve_config_path to search for rulefiles.

Approach 2: Use the approach of measles (custom rules must be relative to the working directory) and replace our usage of custom_rules with a "include" directive within h5n1-cattle-outbreak/Snakefile.

There are pros and cons to each I think. Given that we use custom_rules in CI, approach 1 is nice as it'll allow CI to be run in a separate directory as desired. Approach 2 also has the "gotcha" where you develop a workflow with custom-rules within avian-flu, it all works, but then it won't work when you run it in another directory. There's also the bigger picture issue where if multiple configfiles define custom_rules only one of them will be used because they're lists.

Given the above I think using the search order from Approach 1 is the best, but we should also reference cattle-flu.smk within h5n1-cattle-outbreak/Snakefile so that overlays with custom_rules don't have to specify it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory.

This is why, FWIW, in measles I intentionally only resolved custom rules files based on the working analysis directory.

For the current "build-configs" pattern like CI or our own automation, I'd think to convert those to analysis directories. I didn't demo this for measles to keep changes to a minimum, but it's an easy change and keeps our interfaces consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory

Yeah I see that. I guess a clarifying question is "do we want to allow all workflows to be run from an external analysis directory". If that's the case, and I think it probably is, then we either avoid custom_rules entirely, expand the interface to have something like workflow_rules, or shift our usage of custom rules to their own (sub-)workflows (concrete example). Is the latter what you mean by "I'd think to convert those to analysis directories"?

If we use different workflows for CI rather than "custom_rules" as we do now, then a cattle-outbreak CI might look like avian-flu/phylogenetic/ci/h5n1-cattle-outbreak/Snakefile which would include: ../../h5n1-cattle-outbreak/Snakefile which would include: "../Snakefile". I haven't really though through how inheritance like this will work in practice.

cc @joverlee521

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're answering yes, but saying that CI isn't a "workflow" in and of itself so it's ok that it can't be run from an external analysis directory?

I think we have to cut off at some point. The CI seems like a very Nextstrain/workflow developer specific config that would not need to be run by an external user. If someone really wanted to, they can copy the contents of the CI directory to their own analysis directory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone really wanted to

Just for the absence of doubt, here I'm saying that someone is us as it relates to how we run CI (on GitHub actions). Currently we check out the repo and run with --configfile <ci.yaml>. With these changes we would be doing the same, but with a slightly different invocation syntax, e.g. as Tom sketched out above. But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow. As I said, I'm not involved much in the CI space, but just wanted to explain myself fully here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @joverlee521!

I've updated the PR to search for custom rules relative to the analysis (working) directory. The h5n1-cattle-outbreak/Snakemake workflow now imports it's own "custom" rules as expected.

I haven't moved the "base" Snakefile to rules/base.smk, but this PR is probably the place to do it if we want to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameshadfield: But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow.

@joverlee521: Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.

We could have this either way. It's very doable to use nextstrain run for this, and we might want to do so in order for it to be an instructive example/regularly test that bit of interface.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting comments-in-progress as I'd started reviewing this WIP last week, then the holiday happened and I didn't click submit. (I realize the WIP here has changed again since that time too.)

Although, it seems like some comments I remember writing got lost… ugh.

Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 90b994c to 5c81b87 Compare December 8, 2024 22:57
See <https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1732568407123369>
for context.

Able to be run via a number of different ways:

- From the 'avian-flu' repo:
    - `snakemake -s gisaid/Snakefile ...`
    - `cd gisaid && snakemake ...`
- From a separate analysis directory, where ${AVIAN_FLU} is the path
  to the (locally checked out) avian-flu repo
    - without any config overlays: `snakemake -s ${AVIAN_FLU}/gisaid/Snakefile`
    - with a `config.yaml` overlay: (same as above)
    - with a `foo.yaml` overlay: `snakemake -s ${AVIAN_FLU}/gisaid/Snakefile --configfile foo.yaml`
Shifts to a concept where custom-rules are only for use in analysis
directories, and the custom snakemake file is sourced relative to that
working directory. See <#103 (comment)>
for more discussion about the benefits and limitations of this.

The cattle-outbreak workflow (`h5n1-cattle-outbreak/Snakefile`) now
directly imports the rules it needs rather than using the custom-rules
machinery.
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 5c81b87 to 0bc582d Compare December 12, 2024 03:55
jameshadfield added a commit that referenced this pull request Dec 16, 2024
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

Note that this workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
Snakefile Show resolved Hide resolved
Comment on lines +28 to +31
Resolve a relative *path* given in a configuration value. Returns a
function which takes a single argument *wildcards* and returns the resolved
path with any '{x}' substrings are replaced by their corresponding wildcards
filled in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple miscellaneous/nitpicky things here:

  1. Might be worth noting in the docstring that the wildcard expansion is via Snakemake's expand()
  2. The docstring and commentary within the method are very specific about paths being relative — might be a good idea to add a guard for that? Right now if you pass in an absolute path, as long as there's a file at that path, this function will pass (the expanded version of) that path right back to you — and maybe that's okay, or even desired! - but then that deserves some sort of acknowledgement in the docs
  3. Might be nice to let path to be a pathlib object in addition to a string; especially because right now the error message you get back is going to be very confusing, given how pathlib objects stringify

Comment on lines +33 to +38
Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this is
not the Snakefile you are looking at now but (e.g.) the one in
avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've casually skimmed thru parts of this PR but haven't actually reviewed it. I wanted to start discussion early though on this search order. I think it would be better to have a single fallback, e.g. a search order with only two locations: the working analysis directory and a fixed directory based on the workflow source. That is, we'd pick one of workflow.basedir or workflow.current_basedir and use that one consistently, not allow either. The more search paths, the harder the behaviour is to explain, and the easier it is to get tripped up in a tangle you didn't expect as paths in different parts of the repo collide.

jameshadfield added a commit that referenced this pull request Jan 1, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
jameshadfield added a commit that referenced this pull request Jan 7, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
jameshadfield added a commit that referenced this pull request Jan 7, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants