Ingest with nextclade #62
base: master
Changes from all commits
28d8e50
7c6b069
bbc56af
8adda40
43fba93
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# TSV file that is a mapping of column names for Nextclade output TSV | ||
# The first column should be the original column name of the Nextclade TSV | ||
# The second column should be the new column name to use in the final metadata TSV | ||
# Nextclade can have pathogen specific output columns so make sure to check which | ||
# columns would be useful for your downstream phylogenetic analysis. | ||
seqName seqName | ||
clade clade | ||
Would it make sense to include the |
||
coverage coverage | ||
totalMissing missing_data | ||
totalSubstitutions divergence | ||
totalNonACGTNs nonACGTN | ||
qc.overallStatus QC_overall | ||
qc.missingData.status QC_missing_data | ||
qc.mixedSites.status QC_mixed_sites | ||
qc.privateMutations.status QC_rare_mutations | ||
qc.snpClusters.status QC_snp_clusters | ||
qc.frameShifts.status QC_frame_shifts | ||
qc.stopCodons.status QC_stop_codons | ||
frameShifts frame_shifts | ||
privateNucMutations.reversionSubstitutions private_reversion_substitutions | ||
privateNucMutations.labeledSubstitutions private_labeled_substitutions | ||
privateNucMutations.unlabeledSubstitutions private_unlabeled_substitutions | ||
privateNucMutations.totalReversionSubstitutions private_total_reversion_substitutions | ||
privateNucMutations.totalLabeledSubstitutions private_total_labeled_substitutions | ||
privateNucMutations.totalUnlabeledSubstitutions private_total_unlabeled_substitutions | ||
privateNucMutations.totalPrivateSubstitutions private_total_private_substitutions | ||
qc.snpClusters.clusteredSNPs private_snp_clusters | ||
qc.snpClusters.totalSNPs private_total_snp_clusters | ||
Comment on lines +8 to +28
Should we only keep the clade assignment and drop all of these other columns? These QC outputs are specific to the HA segment, so it might not make sense to keep them as part of the overall metadata.
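For reference when weighing which columns to keep, here is a minimal Python sketch of how a field map in this format could be read. It assumes the two columns are tab-separated and that comment lines start with `#`; `read_field_map` and `EXAMPLE` are hypothetical names used for illustration and are not part of this PR.

```python
import io


def read_field_map(lines):
    """Return {original_name: new_name} from tab-separated lines.

    Assumption: comment lines start with '#' and blank lines are ignored.
    """
    field_map = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            raise ValueError(f"Expected two tab-separated columns, got: {line!r}")
        original, new = parts[0], parts[1]
        field_map[original] = new
    return field_map


# Hypothetical excerpt of a field map, for illustration only.
EXAMPLE = (
    "# original column\tnew column\n"
    "seqName\tseqName\n"
    "clade\tclade\n"
    "totalMissing\tmissing_data\n"
)

field_map = read_field_map(io.StringIO(EXAMPLE))

# Comma-separated list of the original column names, in file order.
subset_fields = ",".join(field_map)
```

Trimming the map down to the `seqName` and `clade` rows would shrink `subset_fields` to just those two names, so dropping the QC columns should only require editing the field map file.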
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
""" | ||
This part of the workflow handles running Nextclade on the curated metadata | ||
and sequences. | ||
""" | ||
|
||
|
||
DATASET_NAME = config["nextclade"]["dataset_name"] | ||
|
||
|
||
rule get_nextclade_dataset: | ||
"""Download Nextclade dataset""" | ||
output: | ||
dataset=f"data/nextclade/{DATASET_NAME}.zip", | ||
benchmark: | ||
"benchmarks/get_nextclade_dataset.txt" | ||
params: | ||
dataset_name=DATASET_NAME | ||
shell: | ||
""" | ||
nextclade3 dataset get \ | ||
--name={params.dataset_name:q} \ | ||
--output-zip={output.dataset} \ | ||
--verbose | ||
""" | ||
|
||
|
||
rule run_nextclade: | ||
input: | ||
dataset=f"data/nextclade/{DATASET_NAME}.zip", | ||
# The H5NX datasets should only be for the HA segment | ||
sequences="{data_source}/results/sequences_ha.fasta", | ||
output: | ||
nextclade="{data_source}/results/nextclade.tsv", | ||
benchmark: | ||
"{data_source}/benchmarks/run_nextclade.txt" | ||
shell: | ||
""" | ||
nextclade3 run \ | ||
{input.sequences} \ | ||
--input-dataset {input.dataset} \ | ||
--output-tsv {output.nextclade} | ||
""" | ||
|
||
|
||
rule join_metadata_and_nextclade: | ||
input: | ||
nextclade="{data_source}/results/nextclade.tsv", | ||
metadata="{data_source}/data/merged_segment_metadata.tsv", | ||
nextclade_field_map=config["nextclade"]["field_map"], | ||
output: | ||
metadata="{data_source}/results/metadata.tsv", | ||
params: | ||
# Making this param optional because we don't have a curate pipeline for fauna data | ||
metadata_id_field=config.get("curate", {}).get("output_id_field", "strain"), | ||
nextclade_id_field=config["nextclade"]["id_field"], | ||
shell: | ||
""" | ||
export SUBSET_FIELDS=`grep -v '^#' {input.nextclade_field_map} | awk '{{print $1}}' | tr '\n' ',' | sed 's/,$//g'` | ||
|
||
csvtk fix-quotes -t {input.nextclade} \ | ||
| csvtk -t cut -f $SUBSET_FIELDS \ | ||
| csvtk -t rename2 \ | ||
-F \ | ||
-f '*' \ | ||
-p '(.+)' \ | ||
-r '{{kv}}' \ | ||
-k {input.nextclade_field_map} \ | ||
| csvtk del-quotes -t \ | ||
| tsv-join -H \ | ||
--filter-file - \ | ||
--key-fields {params.nextclade_id_field} \ | ||
--data-fields {params.metadata_id_field} \ | ||
--append-fields '*' \ | ||
--write-all '?' \ | ||
{input.metadata} \ | ||
| tsv-select -H --exclude {params.nextclade_id_field} \ | ||
> {output.metadata} | ||
""" |
I've added the Nextclade Clade as a separate coloring so we can do comparisons across clade labels, but maybe we'll remove `h5_label_clade` eventually? Would love to hear your thoughts here @lmoncla 🙏
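To make the intent of the `join_metadata_and_nextclade` shell pipeline easier to review, here is a rough Python sketch of the same steps: subset and rename the Nextclade columns, left-join them onto the metadata, fill unmatched records with `?`, and drop the Nextclade ID column. It is an illustration under stated assumptions, not the workflow's implementation: rows are plain dicts, IDs are unique, the renamed Nextclade columns do not collide with metadata columns, and the default field names and sample values are hypothetical.

```python
def join_metadata_and_nextclade(
    metadata_rows,
    nextclade_rows,
    field_map,
    metadata_id_field="strain",      # assumed default, mirrors the rule's fallback
    nextclade_id_field="seqName",    # assumed: name of the ID column after renaming
    missing="?",
):
    """Left-join renamed Nextclade columns onto metadata rows (lists of dicts)."""
    # Keep only the mapped Nextclade columns and rename them.
    # A column missing from the Nextclade rows raises KeyError.
    renamed = [
        {new: row[original] for original, new in field_map.items()}
        for row in nextclade_rows
    ]
    # Assumption: Nextclade IDs are unique; a duplicate would overwrite here.
    by_id = {row[nextclade_id_field]: row for row in renamed}
    # Every renamed column except the Nextclade ID gets appended.
    appended = [name for name in field_map.values() if name != nextclade_id_field]

    joined = []
    for meta in metadata_rows:
        match = by_id.get(meta[metadata_id_field])
        out = dict(meta)
        for name in appended:
            # Metadata records without a Nextclade match get the placeholder.
            out[name] = match[name] if match is not None else missing
        joined.append(out)
    return joined


# Hypothetical sample data, for illustration only.
metadata = [
    {"strain": "A/1", "country": "USA"},
    {"strain": "A/2", "country": "Canada"},
]
nextclade = [
    {"seqName": "A/1", "clade": "2.3.4.4b", "totalMissing": "12", "extra": "x"},
]
field_map = {"seqName": "seqName", "clade": "clade", "totalMissing": "missing_data"}

result = join_metadata_and_nextclade(metadata, nextclade, field_map)
```

Every metadata record is kept, so sequences that Nextclade did not process (for example, records without an HA segment) end up with `?` in the appended columns rather than being dropped.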