WIP - allow running workflow from outside of repo #103
base: master
e6b070d to 275f585
Snakefile
Outdated
Resolve a relative *path* given in a configuration value. Before resolving
any '{x}' substrings are replaced by their corresponding wildcards (if the
`wildcards` argument is provided).

Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this
   is not the Snakefile you are looking at now but (e.g.) the one in
   avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
@joverlee521 if we do implement a search order like this it opens up the possibility of moving the "default" files from (e.g.) `./config/` to `./gisaid/` or `./gisaid/defaults/`. Similarly for custom rules. If we go this direction I think it warrants a rethink of how we want to structure repos.
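To make the proposed search order concrete, here is a minimal sketch of a "first match wins" lookup. The function name `first_existing` and the idea of passing the candidate directories as a list are assumptions for illustration, not the code in this PR.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(relative_path: str, search_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first `directory / relative_path` that exists, or None.

    `search_dirs` is expected in priority order, e.g. (hypothetically)
    [analysis directory, entry Snakefile directory, base Snakefile directory].
    """
    for directory in search_dirs:
        candidate = Path(directory) / relative_path
        if candidate.exists():
            return candidate
    return None
```

With this shape, moving default files from `./config/` into `./gisaid/` would only change which directory in the list produces the match, not the calling code.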
ad0f15c to 90b994c
Snakefile
Outdated
class InvalidConfigError(Exception):
    pass

def resolve_config_path(original_path, wildcards=None):
Reminder to me: need to resolve any custom rules using this (?) function, currently they're just
for rule_file in config.get('custom_rules', []):
    include: rule_file
There's a decision for us to make here which wasn't needed in measles. The h5n1-cattle-outbreak config uses a custom rule which needs to be searched for within the avian-flu repo. Other workflows use a similar approach for CI - e.g. zika.
Approach 1: Use the same resolving approach as `resolve_config_path` to search for rulefiles.
Approach 2: Use the approach of measles (custom rules must be relative to the working directory) and replace our usage of `custom_rules` with a "include" directive within `h5n1-cattle-outbreak/Snakefile`.
There are pros and cons to each I think. Given that we use `custom_rules` in CI, approach 1 is nice as it'll allow CI to be run in a separate directory as desired. Approach 2 also has the "gotcha" where you develop a workflow with custom-rules within `avian-flu`, it all works, but then it won't work when you run it in another directory. There's also the bigger picture issue where if multiple configfiles define `custom_rules` only one of them will be used because they're lists.
Given the above I think using the search order from Approach 1 is the best, but we should also reference `cattle-flu.smk` within `h5n1-cattle-outbreak/Snakefile` so that overlays with `custom_rules` don't have to specify it.
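The "only one of them will be used because they're lists" point can be illustrated with a small sketch. It assumes config overlays are merged dict-by-dict with any non-dict value (including a list) replaced wholesale, which I believe matches how multiple configfiles combine in Snakemake, but treat that as an assumption. The function `overlay_config` and the file paths are hypothetical.

```python
from collections.abc import Mapping


def overlay_config(base: dict, overlay: Mapping) -> dict:
    """Recursively merge `overlay` into a copy of `base`.

    Nested dicts are merged key by key; every other value, lists
    included, is replaced by the overlay's value (assumed behaviour).
    """
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = overlay_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical default config and user overlay, both defining custom_rules.
defaults = {"custom_rules": ["rules/cattle-flu.smk"], "filter": {"min_length": 100}}
overlay = {"custom_rules": ["my_rules.smk"], "filter": {"min_date": 2024}}
merged = overlay_config(defaults, overlay)
```

Here `merged["custom_rules"]` contains only the overlay's entry, so the default rule file is silently dropped, while the nested `filter` dict keeps keys from both sides.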
Hmm. It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory.
This is why, FWIW, in measles I intentionally only resolved custom rules files based on the working analysis directory.
For the current "build-configs" pattern like CI or our own automation, I'd think to convert those to analysis directories. I didn't demo this for measles to keep changes to a minimum, but it's an easy change and keeps our interfaces consistent.
> It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory
Yeah I see that. I guess a clarifying question is "do we want to allow all workflows to be run from an external analysis directory". If that's the case, and I think it probably is, then we either avoid `custom_rules` entirely, expand the interface to have something like `workflow_rules`, or shift our usage of custom rules to their own (sub-)workflows (concrete example). Is the latter what you mean by "I'd think to convert those to analysis directories"?
If we use different workflows for CI rather than "custom_rules" as we do now, then a cattle-outbreak CI might look like `avian-flu/phylogenetic/ci/h5n1-cattle-outbreak/Snakefile` which would `include: ../../h5n1-cattle-outbreak/Snakefile` which would `include: "../Snakefile"`. I haven't really thought through how inheritance like this will work in practice.
cc @joverlee521
> we're answering yes, but saying that CI isn't a "workflow" in and of itself so it's ok that it can't be run from an external analysis directory?
I think we have to cut off at some point. The CI seems like a very Nextstrain/workflow developer specific config that would not need to be run by an external user. If someone really wanted to, they can copy the contents of the CI directory to their own analysis directory.
> If someone really wanted to
Just for the absence of doubt, here I'm saying that someone is us as it relates to how we run CI (on GitHub actions). Currently we check out the repo and run with `--configfile <ci.yaml>`. With these changes we would be doing the same, but with a slightly different invocation syntax, e.g. as Tom sketched out above. But we wouldn't be using an approach where we ran CI in a separate analysis directory with `nextstrain run ...` as if it were just another workflow. As I said, I'm not involved much in the CI space, but just wanted to explain myself fully here.
Ah gotcha. Yeah, for CI I'd think we'd want to test the source code so it'd be running `nextstrain build`.
Thanks @joverlee521!
I've updated the PR to search for custom rules relative to the analysis (working) directory. The `h5n1-cattle-outbreak/Snakefile` workflow now imports its own "custom" rules as expected.
I haven't moved the "base" Snakefile to `rules/base.smk`, but this PR is probably the place to do it if we want to.
> @jameshadfield: But we wouldn't be using an approach where we ran CI in a separate analysis directory with `nextstrain run ...` as if it were just another workflow.
> @joverlee521: Yeah, for CI I'd think we'd want to test the source code so it'd be running `nextstrain build`.
We could have this either way. It's very doable to use `nextstrain run` for this, and we might want to do so in order for it to be an instructive example/regularly test that bit of interface.
Submitting comments-in-progress as I'd started reviewing this WIP last week, then the holiday happened and I didn't click submit. (I realize the WIP here has changed again since that time too.)
Although, it seems like some comments I remember writing got lost… ugh.
90b994c to 5c81b87
See <https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1732568407123369> for context.
Able to be run via a number of different ways:
- From the 'avian-flu' repo:
  - `snakemake -s gisaid/Snakefile ...`
  - `cd gisaid && snakemake ...`
- From a separate analysis directory, where ${AVIAN_FLU} is the path to the (locally checked out) avian-flu repo
  - without any config overlays: `snakemake -s ${AVIAN_FLU}/gisaid/Snakefile`
  - with a `config.yaml` overlay: (same as above)
  - with a `foo.yaml` overlay: `snakemake -s ${AVIAN_FLU}/gisaid/Snakefile --configfile foo.yaml`
Shifts to a concept where custom-rules are only for use in analysis directories, and the custom snakemake file is sourced relative to that working directory. See <#103 (comment)> for more discussion about the benefits and limitations of this. The cattle-outbreak workflow (`h5n1-cattle-outbreak/Snakefile`) now directly imports the rules it needs rather than using the custom-rules machinery.
5c81b87 to 0bc582d
By having all phylogenetic workflows start from two lists of inputs (`config.inputs`, `config.additional_inputs`) we enable a broad range of uses with a consistent interface.
1. Using local ingest files is trivial (see added docs) and doesn't need a bunch of special-cased logic that is prone to falling out of date (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an additional config list being used so that we are explicit that the new data is additional and enforce an ordering which is needed for predictable `augur merge` behaviour. The canonical data can be removed / replaced via step (1) if needed.
I considered adding additional data after the subtype-filtering step, which would avoid the need to add subtype in the metadata but requires encoding this in the config overlay. I felt the chosen way was simpler and more powerful.
Note that this workflow uses an old version of the CI workflow, <https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240> which copies `example_data`. We could upgrade to the latest version and use a config overlay to swap out the canonical inputs with the example data.
Note that one of the side effects of the current implementation is that merged inputs use the same filepath irrespective of the workflow. For instance, both gisaid & h5n1-cattle-outbreak use the intermediate path `results/metadata_merged.tsv`, which means it's not possible to maintain runs of both those analysis concurrently if both were to use merged inputs. Using separate analysis directories, e.g. <#103> will help avoid this shortcoming.
Resolve a relative *path* given in a configuration value. Returns a
function which takes a single argument *wildcards* and returns the resolved
path with any '{x}' substrings are replaced by their corresponding wildcards
filled in
Couple miscellaneous/nitpicky things here:
- Might be worth noting in the docstring that the wildcard expansion is via Snakemake's `expand()`
- The docstring and commentary within the method are very specific about paths being relative — might be a good idea to add a guard for that? Right now if you pass in an absolute path, as long as there's a file at that path, this function will pass (the expanded version of) that path right back to you — and maybe that's okay, or even desired! - but then that deserves some sort of acknowledgement in the docs
- Might be nice to let `path` be a pathlib object in addition to a string; especially because right now the error message you get back is going to be very confusing, given how pathlib objects stringify
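A rough sketch of what the last two suggestions might look like together: a guard that rejects absolute paths, and acceptance of either `str` or `pathlib.Path`. The name `make_path_resolver` and the `search_dirs` argument are assumptions for illustration. To stay self-contained it fills wildcards with `str.format` and a plain dict rather than Snakemake's `expand()`.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Sequence, Union


class InvalidConfigError(Exception):
    pass


def make_path_resolver(
    path: Union[str, os.PathLike], search_dirs: Sequence[Path]
) -> Callable[[Mapping[str, str]], str]:
    """Return a function mapping wildcards to the first matching file path."""
    path = os.fspath(path)  # normalises str and pathlib.Path to a plain str
    if os.path.isabs(path):
        raise InvalidConfigError(f"Expected a relative path, got {path!r}")

    def resolve(wildcards: Mapping[str, str]) -> str:
        # Fill '{x}' placeholders (stand-in for Snakemake's expand()).
        filled = path.format(**wildcards)
        for directory in search_dirs:
            candidate = Path(directory) / filled
            if candidate.exists():
                return str(candidate)
        raise InvalidConfigError(
            f"Unable to resolve {filled!r} in any of: "
            + ", ".join(str(d) for d in search_dirs)
        )

    return resolve
```

Whether absolute paths should be rejected or passed through is the open question in the comment above; the guard here just shows one of the two options.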
Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this is
   not the Snakefile you are looking at now but (e.g.) the one in
   avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
I've casually skimmed thru parts of this PR but haven't actually reviewed it. I wanted to start discussion early though on this search order. I think it would be better to have a single fallback, e.g. a search order with only two locations: the working analysis directory and a fixed directory based on the workflow source. That is, we'd pick one of `workflow.basedir` or `workflow.current_basedir` and use that one consistently, not allow either. The more search paths, the harder the behaviour is to explain, and the easier it is to get tripped up in a tangle you didn't expect as paths in different parts of the repo collide.
By having all phylogenetic workflows start from two lists of inputs (`config.inputs`, `config.additional_inputs`) we enable a broad range of uses with a consistent interface.
1. Using local ingest files is trivial (see added docs) and doesn't need a bunch of special-cased logic that is prone to falling out of date (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an additional config list being used so that we are explicit that the new data is additional and enforce an ordering which is needed for predictable `augur merge` behaviour. The canonical data can be removed / replaced via step (1) if needed.
I considered adding additional data after the subtype-filtering step, which would avoid the need to add subtype in the metadata but requires encoding this in the config overlay. I felt the chosen way was simpler and more powerful.
When considering sequences the structure is more complex than metadata because the influenza genome is segmented and we wish to allow users to provide additional data for only some segments (see docstring for `_parse_config_input`). For non-segmented pathogens the simpler structure used here for metadata could also be used for sequences.
This workflow uses an old version of the CI workflow, <https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240> which copies `example_data`. We could upgrade to the latest version and use a config overlay to swap out the canonical inputs with the example data.
Note that one of the side effects of the current implementation is that merged inputs use the same filepath irrespective of the workflow. For instance, both gisaid & h5n1-cattle-outbreak use the intermediate path `results/metadata_merged.tsv`, which means it's not possible to maintain runs of both those analysis concurrently if both were to use merged inputs. Using separate analysis directories, e.g. <#103> will help avoid this shortcoming.
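As a sketch of why the two lists give a predictable ordering, the function below concatenates `inputs` and `additional_inputs` (in that order) into `NAME=FILE` pairs of the kind `augur merge --metadata` accepts. The per-input keys `name` and `metadata` and the function name are assumptions for illustration; the real config structure is described in the added docs, not here.

```python
from typing import Dict, List


def merge_metadata_args(config: Dict) -> List[str]:
    """Build ordered NAME=FILE arguments: canonical inputs first, additional inputs after."""
    inputs = list(config.get("inputs", [])) + list(config.get("additional_inputs", []))
    names = [entry["name"] for entry in inputs]
    if len(set(names)) != len(names):
        # Each merged input needs a distinct name.
        raise ValueError(f"Input names must be unique, got {names}")
    return [f"{entry['name']}={entry['metadata']}" for entry in inputs]


# Hypothetical config: one canonical input plus one private addition.
example_config = {
    "inputs": [{"name": "ncbi", "metadata": "ingest/results/metadata.tsv"}],
    "additional_inputs": [{"name": "private", "metadata": "private/metadata.tsv"}],
}
args = merge_metadata_args(example_config)
```

Because additional inputs always come after the canonical ones, whichever precedence rule the merge step applies is applied the same way on every run.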
Following on from nextstrain/measles#55 partly as a learning exercise and partly because I want to use it for avian-flu. The main differences here are
`--configfile`, i.e. there's no "default configuration values". This is how mpox works too, among others.
Tested using two scenarios:
`custom-gisaid.yaml` config of
And running via
(The explicit target file was needed due to the config structure - see the note in the YAML above - and this makes the config override of `segments` irrelevant.)
`config.yaml` config of
And ran via
snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 -pf
Overall it worked really well. I'll start some threads in the code to continue discussion of various parts.