wip! workflows as programs prototyping #55
base: main
```diff
@@ -37,9 +37,10 @@ curate:
   # For the Nextstrain team, this is currently
   # 'https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv'
   geolocation_rules_url: "https://raw.githubusercontent.com/nextstrain/ncov-ingest/master/source-data/gisaid_geoLocationRules.tsv"
-  # The path to the local geolocation rules within the pathogen repo
-  # The path should be relative to the ingest directory.
-  local_geolocation_rules: "defaults/geolocation_rules.tsv"
+  # The path to the local geolocation rules for this pathogen.
+  # The path should be relative to the working directory (e.g. --directory).
+  # If the path doesn't exist in the working directory, the file in the workflow's defaults/ directory is used instead (if it exists).
```
Comment on lines +40 to +42

Slightly concerned that users will be confused by the term "working directory", especially those who are not familiar with Snakemake. (I had originally made everything relative to the

For user reference, we can say "your analysis directory", perhaps with an explanatory parenthetical like "(Snakemake's working directory, `--directory`, often your current directory)". This seems like a tripping hazard we can address various ways.

Totally! Just want to make sure we address it as paths are often the most confusing thing... Seems like the working directory can be different things depending on how you run the workflow. I do like that (3) is the most straight-forward so all for moving in that direction.

Following this up after work in avian-flu and discussions in slack, I think we'll need words to describe the following 4 locations (even if some are only for developers). This is assuming we go with the slight variant to (3) which allows things like
```diff
+  local_geolocation_rules: "geolocation_rules.tsv"
   # List of field names to change where the key is the original field name and the value is the new field name
   # The original field names should match the ncbi_datasets_fields provided above.
   # This is the first step in the pipeline, so any references to field names in the configs below should use the new field names
```
```diff
@@ -91,8 +92,9 @@ curate:
   # Name to use for the generated abbreviated authors field
   abbr_authors_field: "abbr_authors"
   # Path to the manual annotations file
-  # The path should be relative to the ingest directory
-  annotations: "defaults/annotations.tsv"
+  # The path should be relative to the working directory (e.g. --directory).
+  # If the path doesn't exist in the working directory, the file in the workflow's defaults/ directory is used instead (if it exists).
+  annotations: "annotations.tsv"
   # The ID field in the metadata to use to merge the manual annotations
   annotations_id: "accession"
   # The ID field in the metadata to use as the sequence id in the output FASTA file
```
```diff
@@ -1,15 +1,15 @@
 strain_id_field: "accession"
 files:
-  exclude: "defaults/dropped_strains.txt"
-  include_genome: "defaults/include_strains_genome.txt"
-  include_N450: "defaults/include_strains_N450.txt"
-  reference: "defaults/measles_reference.gb"
-  reference_N450: "defaults/measles_reference_N450.gb"
-  reference_N450_fasta: "defaults/measles_reference_N450.fasta"
-  colors: "defaults/colors.tsv"
-  auspice_config: "defaults/auspice_config.json"
-  auspice_config_N450: "defaults/auspice_config_N450.json"
-  description: "defaults/description.md"
+  exclude: "dropped_strains.txt"
+  include_genome: "include_strains_genome.txt"
+  include_N450: "include_strains_N450.txt"
+  reference: "measles_reference.gb"
+  reference_N450: "measles_reference_N450.gb"
+  reference_N450_fasta: "measles_reference_N450.fasta"
+  colors: "colors.tsv"
+  auspice_config: "auspice_config.json"
+  auspice_config_N450: "auspice_config_N450.json"
+  description: "description.md"
```
Comment on lines +3 to +12

Slightly concerned that this change may confuse external users who want to trace the default files. It's not immediately clear that these are pulling from "defaults/".

So the If we're terribly concerned and don't think a comment is enough, we can also leave the

Lines 20 to 24 in 6584e68

I think the additional comments should be enough. I thought the

That's my understanding as well. The search order is:

By not using

Sorry, yes, you're both correct with the code in this PR as-is and what I said. The bit I forgot to say as I was rushing out the door (ahh! 🤦) was that we could strip the (This is actually the first pattern I tinkered with over a month ago, but I thought it was more confusing than a simple unchanging path + search order.)

Hmm? There's only one

If I run multiple invocations of the workflow with different overlays, or invocations of different workflows, each will write sets of files into the analysis directory, and these sets may not overlap or may partially overlap. To use an avian-flu example, I may have an analysis directory with both gisaid & cattle-outbreak datasets.

Ah, I see. Multiple invocations (whether same workflow or not) writing to the same analysis directory seemed to me like something we didn't need to support (in terms of keeping things separate): there's always the chance of overwriting files in that case (often overwriting is desired/expected) and confusion around what's output from which invocation. Do you think we definitely do need to support that sort of use case? If so, the implications of that would seem to extend far beyond the proposed

No - but it's something to keep in mind as the structure is built out, because what we're building out does support it and I think it'll be a common use case. For instance, if we could include some identifier based on the invocation within the filename we wouldn't be simply overwriting

Right, of course, but we're still overwriting all/most of the other files in

A timestamped directory to contain all this individual-run information makes the most sense to me and is the most conventional… but I don't think we need to worry about this structure right now. I think it's too early to implement now; we should focus on the core workflows-as-programs changes first before building on them while they're still forming. I also think there's some consideration to take on what outputs get differentiated between invocations (e.g. protected from overwrite) and what outputs don't, and why that split is useful vs. the default situation of "nothing's protected" or "everything's protected".
```diff
 filter:
   group_by: "country year"
   sequences_per_group: 20
```
```diff
@@ -0,0 +1,6 @@
+# Shared
+
+> **Warning**
+> Please be aware of the multiple workflows that will be affected when editing files in this directory!
+
+This directory holds files that are shared across multiple workflows.
```
```diff
@@ -0,0 +1,31 @@
+import os.path
+
+def resolve_config_path(path):
+    """
+    Resolve a relative *path* given in a configuration value.
+
+    Resolves *path* as relative to the workflow's ``defaults/`` directory (i.e.
+    ``os.path.join(workflow.basedir, "defaults", path)``) if it doesn't exist
+    in the workflow's analysis directory (i.e. the current working
+    directory, or workdir, usually given by ``--directory`` (``-d``)).
+
+    This behaviour allows a default configuration value to point to a default
+    auxiliary file while also letting the file used be overridden either by
+    setting an alternate file path in the configuration or by creating a file
+    with the conventional name in the workflow's analysis directory.
+    """
+    global workflow
+
+    if not os.path.exists(path):
+        # Special-case defaults/… for backwards compatibility with older
+        # configs. We could achieve the same behaviour with a symlink
+        # (defaults/defaults → .) but that seems less clear.
+        if path.startswith("defaults/"):
+            defaults_path = os.path.join(workflow.basedir, path)
+        else:
+            defaults_path = os.path.join(workflow.basedir, "defaults", path)
+
+        if os.path.exists(defaults_path):
+            return defaults_path
+
+    return path
```
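The search order can be tried standalone. The sketch below makes one assumption worth flagging: the real helper reads `workflow.basedir` from Snakemake's global `workflow` object, whereas here the base directory is passed in explicitly so the example runs outside a Snakefile.

```python
import os
import os.path
import tempfile

def resolve_config_path(path, basedir):
    """Sketch of the helper above, with the workflow's basedir passed in
    explicitly (rather than read from Snakemake's global `workflow`
    object) so it can run outside a Snakefile."""
    if not os.path.exists(path):
        # Fall back to the workflow's defaults/ directory, special-casing
        # paths that already start with defaults/.
        if path.startswith("defaults/"):
            candidate = os.path.join(basedir, path)
        else:
            candidate = os.path.join(basedir, "defaults", path)
        if os.path.exists(candidate):
            return candidate
    return path

# Demonstrate the fallback: the file exists in the (pretend) workflow's
# defaults/ directory but not in the current analysis directory.
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "defaults"))
    default_file = os.path.join(repo, "defaults", "annotations.tsv")
    open(default_file, "w").close()

    resolved = resolve_config_path("annotations.tsv", basedir=repo)
    print(resolved)
```

A file named `annotations.tsv` in the analysis directory would short-circuit the fallback and be returned as-is.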
The behaviour of multiple configfiles is somewhat documented but I think will prove a tripping hazard to people. Multiple configfiles (as per this section of code, and/or via additional `--configfile`s) will overwrite scalars and lists; dicts will be merged, but nested dicts won't be; there's no way to remove a dictionary key as far as I can find.

This has implications for how we structure configs, specifically around how we encode flags to skip rules or skip a particular argument.
Nod, for sure, but "multiple config files" is already our established approach, so this is making that more accessible/standard but isn't changing approaches. We could change approaches, though, by combining the configs ourselves outside of Snakemake's machinery. I didn't think there was appetite for that, but it's very possible.

Hmm. In my recollection (and testing right now), nested dicts are merged? That is, the config merge is deep. The merge is additive though, so yes, there is no way to remove keys.
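The merge semantics described here (deep for dicts, replace for scalars and lists, purely additive) can be sketched in a few lines of plain Python; this mirrors the observed behaviour but is not Snakemake's actual code:

```python
def deep_merge(config, overlay):
    """Recursively merge *overlay* into *config*: nested dicts merge
    key-by-key, scalars and lists are replaced wholesale, and existing
    keys are never removed."""
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            deep_merge(config[key], value)
        else:
            config[key] = value
    return config

base = {"filter": {"group_by": "country year", "sequences_per_group": 20}}
overlay = {"filter": {"sequences_per_group": 5}}
merged = deep_merge(base, overlay)

# "group_by" survives: the overlay can overwrite it, but has no way to
# delete it.
print(merged)
# → {'filter': {'group_by': 'country year', 'sequences_per_group': 5}}
```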
Oh yeah, I wasn't formatting my YAML correctly. That they are merged will be very helpful for us.
I can see the point of view that this isn't a workflows-as-programs thing but rather a pathogen-repo-{guide,template} thing. Similar to my comment in a recent meeting about merging in additional data, I think these concepts should be thought of as intertwined rather than isolated in order to make workflows-as-programs as successful as possible.

Our current use of multiple config files is largely just adding in specific steps; e.g. I think the examples in measles are a good summary of how we use them more generally:

I think it's rarer¹ to use additional configs to change parameters of builds, or to subset build targets etc. If we want to recommend this approach -- an approach that I think we should use and workflows-as-programs seems to lead us towards anyway -- then it's not going to play nicely with our current config structures. Some examples:

(`h1n1pdm_2y`, `h3n2_2y` etc). So config overlays aren't going to be able to target a subset of those builds.

cc @joverlee521

¹ The most complex workflows are seasonal-flu and ncov, so maybe those utilise the style of config merging I'm talking about here. I don't think seasonal-flu does at all, but maybe? It'd be great to have examples in Nextstrain where we do make use of config merging to replace / subset config parameters.
...

(avian-flu, seasonal-flu, RSV examples elided)

I think it's totally fair to draw a line, somewhere (not exactly sure where), and say the pathogen repos on one side of it are "legacy" and will continue to be maintained the way they have been historically, and will not use these new patterns, while pathogen repos on the other side of the line are "current" and will be kept reasonably up-to-date and consistent with our best practices. (We can also decide, down the road, to invest the effort into converting "legacy" repos into "current" repos, if and when that feels like it has a positive return-on-effort.)

Note: not currently taking a strong stance on where that line is 😁

Note that measles and yellow fever do the same thing.
Thanks for the example! That's a great example of a simple config structure (i.e. a single dataset target, no wildcards in the workflow) where merging works nicely. It will be interesting to see how that develops as WNV expands (e.g. to have NA & global targets): does it go the mpox route of one config (or one config overlay) per target, or does it tend towards a more complicated config structure? It's specifically the more complicated structures I was focusing on above.

(It also uses an approach I think we've moved away from, namely

which makes it straightforward to remove/add options/params as you're in charge of the entire string. As a thought exercise, if the base config instead specified

and the overlay config wanted to skip that argument entirely, how would it do it?)
I can't for the life of me find where I wrote about this recentlyish, but I'd suggested previously:

The niceness of simple config structures is why, in Slack recently, I suggested we consider "small multiples" of config (optionally with tooling to allow derived configs/inheritance) rather than a large complex cross-product config.
oh, it was a freakin' Google Docs comment.
P.S. This behaviour changed in Snakemake 6.11.0 (Nov 2021). Before that nested dicts were not merged. (Yep, it was another potentially-breaking change in a minor version release.)
Yeah, saw that. I think something like that's the right move, and the corresponding snakemake logic needs to ensure that it drops the `--coalescent` argument name as well as the null value. (We can probably use a helpful helper function here.)

I prototyped a similar approach for nextstrain/avian-flu#104 but in the context of defining what builds to run (for gisaid there are 48 by default, so it's nice to (a) not enumerate them all and (b) be able to subset them for a particular analysis). One of the prototypes looked like this:
But I didn't like the necessity of listing out all the builds you don't want to run. I switched to a list-based interface which is much nicer in this respect. (There were other reasons for switching too, namely the hierarchy here isn't right and to make it right results in very verbose dictionaries.)
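On the helper-function idea floated above for null-valued arguments, a minimal sketch might look like the following. The name `optional_arg` and the config shape are illustrative assumptions, not code from this PR:

```python
def optional_arg(flag, value):
    """Render "<flag> <value>" for a shell command, or nothing at all
    when the value is None (YAML null), so that a null overlay value
    drops the argument name along with its value.

    Hypothetical helper, not from the PR under discussion."""
    if value is None:
        return ""
    return f"{flag} {value}"

# A base config sets the option; an overlay config sets it to null
# (parsed as Python None) to skip the argument entirely.
print(repr(optional_arg("--coalescent", "opt")))  # → '--coalescent opt'
print(repr(optional_arg("--coalescent", None)))   # → ''
```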