Merge pull request #39 from laderast/tgl-edits
Tgl edits
caalo authored Mar 27, 2024
2 parents 6f68b89 + a11ba32 commit e64c719
Showing 8 changed files with 93 additions and 52 deletions.
24 changes: 15 additions & 9 deletions 01-intro.Rmd
@@ -61,7 +61,7 @@ workflow my_workflow {

The input `fq` is defined to be a File variable type. WDL supports various variable types, such as String, Integer, Float, and Boolean. For more information on types in WDL, we recommend [OpenWDL's documentation on variable types](https://docs.openwdl.org/en/stable/WDL/variable_types/).

To access a task-level input variable in a task's command section, it is usually referenced using \~{this} notation. To access a workflow-level variable in a workflow, it is referenced just by its name without any special notation. To access a workflow-level variable in a task, it must be passed into the task as an input.
To access a task-level input variable in a task's command section, it is usually referenced using \~{*input_name*} notation, such as `~{fastq}` in the example below. To access a workflow-level variable in a workflow, it is referenced just by its name without any special notation, such as `fq` in the example below. To access a workflow-level variable in a task, it must be passed into the task as an input.

<!-- resources/basic_03.wdl -->

@@ -75,7 +75,7 @@ task do_something {
}
command <<<
echo "First ten lines of ~{basename_of_fq}: "
head ~{fastq}
head ~{fastq} ## example of referring to task-level input
>>>
}
@@ -88,7 +88,7 @@ workflow my_workflow {
call do_something {
input:
fastq = fq,
fastq = fq, ## example of referring to workflow-level input
basename_of_fq = basename_of_fq
}
}
@@ -111,7 +111,7 @@ task do_something {
head ~{fastq} >> output.txt
>>>
output {
File first_ten_lines = "output.txt"
File first_ten_lines = "output.txt" ## output variable for task
}
}
@@ -129,16 +129,16 @@ workflow my_workflow {
}
output {
File ten_lines = do_something.first_ten_lines
File ten_lines = do_something.first_ten_lines ## referring to task output here
}
}
```

## Using JSONs to control workflow inputs

Running a WDL workflow generally requires two files: A .wdl file, which contains the actual workflow, and a .json file, which provides the inputs for the workflow.
Running a WDL workflow generally requires two files: A `.wdl` file, which contains the actual workflow, and a `.json` file, which provides the inputs for the workflow.

In the example we showed earlier, the workflow takes in a file referred to by the variable `fq`. This needs to be provided by the user. Typically, this is done with a JSON file. Here's what a JSON file for this workflow might look like:
In the example we showed earlier, the workflow takes in a file referred to by the input variable `fq`. This needs to be provided by the user. Typically, this is done with a JSON file. Here's what a JSON file for this workflow might look like:

<!-- resources/basic_04.json -->

@@ -148,7 +148,7 @@ In the example we showed earlier, the workflow takes in a file referred to by th
}
```

JSON files consist of key-value pairs. In this case, the key is `"my_workflow.fq"` and the value is the path `"./data/example.fq"`. The first part of the key is the name of the workflow as written in the WDL file, in this case `my_workflow`. The variable being represented is referred to by its name, in this case, `fq`. So, the file located at the path `./data/example.fq` is being input as a variable called `fq` into the workflow named `my_workflow`.
JSON files consist of key-value pairs. In this case, the key is `"my_workflow.fq"` and the value is the path `"./data/example.fq"`. The first part of the key is the name of the workflow as written in the WDL file, in this case `my_workflow`. The variable being represented is referred to by its name, in this case, `fq`. So, in the above example, the file located at the path `./data/example.fq` is being input as a variable called `fq` into the workflow named `my_workflow`.
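
The key naming described above can be sketched in a few lines of Python. This is only an illustration, not part of the original course material: the helper `namespaced_inputs` is a hypothetical name, and the workflow and input names are taken from the example above.

```python
import json


def namespaced_inputs(workflow_name, inputs):
    """Prefix each input name with the workflow name, e.g. "my_workflow.fq"."""
    return {f"{workflow_name}.{name}": value for name, value in inputs.items()}


# Workflow name and input name come from the example above.
inputs = namespaced_inputs("my_workflow", {"fq": "./data/example.fq"})

# Serialize to the JSON text that would be saved as the inputs file.
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Generating the file this way mainly helps when the same workflow is run over many samples, since only the dictionary of values changes between runs.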

Files aren't the only type of variable you can refer to when using JSONs. Here's an example JSON for every common WDL variable type.

@@ -182,7 +182,13 @@ Resources:

## Running WDL via a computing engine

In order to run a WDL workflow, we need a computing engine to execute it. The two most popular WDL executors are miniwdl and Cromwell. Both can run WDLs on a local machine, High Performance Computing (HPC), or cloud computing backend. If you are trying to run WDL at Fred Hutch Cancer Center's HPC system, you should use the PROOF executor.
In order to run a WDL workflow, we need a *computing engine* to execute it. The two most popular WDL executors are [miniwdl](https://miniwdl.readthedocs.io/en/latest/index.html) and [Cromwell](https://cromwell.readthedocs.io/en/stable/). Both computing engines can run WDLs on multiple configurations:

- A local machine
- A High Performance Computing (HPC) Cluster
- A Cloud Computing backend (such as AWS/Terra/DNAnexus).

If you are trying to run WDL at Fred Hutch Cancer Center's HPC system, you should use the PROOF executor.

If you are computing on an HPC cluster or in the cloud, consult your institution's information technology guidance for the best practice for running a WDL computing engine.
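
As a rough sketch of what a local run looks like with each engine, the Python snippet below only assembles the command lines and prints them; it does not execute anything. The file names `my_workflow.wdl`, `inputs.json`, and `cromwell.jar` are placeholders, and the helper functions are hypothetical. Check `miniwdl run --help` and the Cromwell documentation for the options your installed version supports.

```python
import shlex


def miniwdl_command(wdl_path, inputs_json_path):
    """Build the argument list for a local miniwdl run with a JSON inputs file."""
    return ["miniwdl", "run", wdl_path, "-i", inputs_json_path]


def cromwell_command(cromwell_jar, wdl_path, inputs_json_path):
    """Build the argument list for Cromwell's single-workflow `run` mode."""
    return ["java", "-jar", cromwell_jar, "run", wdl_path, "--inputs", inputs_json_path]


# Placeholder file names; substitute your own workflow, inputs, and jar path.
miniwdl_argv = miniwdl_command("my_workflow.wdl", "inputs.json")
cromwell_argv = cromwell_command("cromwell.jar", "my_workflow.wdl", "inputs.json")

print(shlex.join(miniwdl_argv))
print(shlex.join(cromwell_argv))
```

Either list could be passed to `subprocess.run` once the engine is installed. HPC and cloud backends usually need additional engine configuration that is specific to your institution.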

27 changes: 15 additions & 12 deletions 02-workflow-plan.Rmd
@@ -17,25 +17,28 @@ The workflow diagram:

The tasks involved:

1. `BwaMem` aligns the samples to the reference genome (hg19).
2. `MarkDuplicates` marks PCR duplicates.
3. `ApplyBaseRecalibrator` perform base quality recalibration.
4. `Mutect2` performs paired somatic mutation calling.
5. `annovar` annotates the called somatic mutations.
|Task|Function|Inputs|Outputs|
|----|--------|------|-------|
|`BwaMem`|aligns the samples to the reference genome (hg19)|FASTA (`.fa`) file|`.bam` file|
|`MarkDuplicates`|marks PCR duplicates|`.bam` file|marked `.bam` file|
|`ApplyBaseRecalibrator`|performs base quality recalibration|marked `.bam` file|`.bam` file|
|`Mutect2`|performs paired somatic mutation calling|`.bam` file|`.vcf` file|
|`annovar`|annotates the called somatic mutations|`.vcf` file with somatic mutations|annotated `.vcf` file|


## Workflow testing strategy

As we build out our workflow, how do we know it is running correctly besides getting a message such as "Workflow finished with status 'Succeeded'" or an exit code 0? In [software development](https://www.atlassian.com/continuous-delivery/software-testing), it is essential to test your code to see whether it generates the expected output given a specified input. This principle also applies to bioinformatics workflow development:
As we build out our workflow, how can we verify that it is running correctly besides getting a message such as "Workflow finished with status 'Succeeded'" or an exit code 0? In [software development](https://www.atlassian.com/continuous-delivery/software-testing), it is essential to test your code to see whether it generates the expected output given a specified input. This principle also applies to bioinformatics workflow development:

1. Unit Testing: We need to incorporate tests to ensure that each task we develop is correct.
1. *Unit Testing*: We need to incorporate tests to ensure that each task we develop is correct.

2. End-to-end testing: When we connect all the tasks together to form a workflow, we test that the workflow running end-to-end is correct.
2. *End-to-end testing*: When we connect all the tasks together to form a workflow, we test that the workflow running end-to-end is correct.

Here are some guidelines for any form of testing:

- The data you use for testing is representative of "real" data.
- The data you use for testing needs to be representative of "real" data.

- You have an expectation of what the resulting output is *before* you run your workflow on it. It can be as specific as an MD5 checksum, or as vague as a certain file format.
- You should have an *expectation* of what the resulting output is *before* you run your workflow on it. It can be as specific as an MD5 checksum, or as vague as a certain file format.

- The process is quick to run, ideally in the range of just a few minutes. This often means using a small subset of actual data.
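
The two kinds of expectation mentioned above, a specific MD5 checksum and a looser file-format check, can be sketched in Python as follows. This is a minimal illustration rather than the authors' testing setup: the function names and the demo file are made up. In a real test the expected checksum would be recorded ahead of time, not computed from the file being checked as the demo does.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_md5(path, expected_md5):
    """Specific expectation: the output is byte-for-byte what we predicted."""
    return md5_of_file(path) == expected_md5.lower()


def looks_like_vcf(path):
    """Vague expectation: the output at least starts with a VCF header line."""
    with open(path, "r") as handle:
        return handle.readline().startswith("##fileformat=VCF")


# Demo on a throwaway file standing in for a workflow output.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_output = Path(tmp_dir) / "demo.vcf"
    demo_output.write_text("##fileformat=VCFv4.2\n#CHROM\tPOS\n")
    # Stand-in for a checksum recorded before the run.
    expected = hashlib.md5(demo_output.read_bytes()).hexdigest()
    md5_ok = matches_expected_md5(demo_output, expected)
    md5_wrong = matches_expected_md5(demo_output, "0" * 32)
    format_ok = looks_like_vcf(demo_output)

print(md5_ok, md5_wrong, format_ok)
```

A checksum test is strict and will fail if a tool embeds timestamps or version strings in its output, so a format check is sometimes the more practical expectation for files such as `.bam` or `.vcf`.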

@@ -46,7 +49,7 @@ Here are some guidelines for any form of testing:
As an example, we use whole exome sequencing data from three cell lines from the [Cancer Cell Line Encyclopedia](https://pubmed.ncbi.nlm.nih.gov/31068700/).

### Tumor 1 : HCC4006
HCC4006 is a lung cancer cell line that has a mutation in the gene *EGFR* (Epidermal Growth Factor Receptor) a proto-oncogene. Mutations in *EGFR* result in the abnormal constitutive activation of the EGFR signaling pathway and drive cancer. In this cell line specifically the *EGFR* mutation is an in-frame deletion in Exon 19. This mutation results in the constitutive activation of the EGFR protein and is therefore oncogenic.
HCC4006 is a lung cancer cell line that has a mutation in the gene *EGFR* (Epidermal Growth Factor Receptor), a proto-oncogene. Mutations in *EGFR* result in the abnormal constitutive activation of the EGFR signaling pathway and drive cancer. In this cell line specifically, the *EGFR* mutation is an in-frame deletion in Exon 19. This mutation results in the constitutive activation of the EGFR protein and is therefore oncogenic.

### Tumor 2 : CALU1
CALU1 is a lung cancer cell line that has a mutation in the gene *KRAS* (Kirsten rat sarcoma viral oncogene homolog). *KRAS* is also a proto-oncogene, and the most common cancer-causing mutations lock the protein in an active conformation. Constitutive activation of *KRAS* results in carcinogenesis. In this cell line, *KRAS* has a missense point mutation that substitutes cysteine (C) for glycine (G) at position 12 of the KRAS protein (commonly known as the KRAS G12C mutation). This mutation results in the constitutive activation of KRAS and drives carcinogenesis.
@@ -55,4 +58,4 @@ CALU1 is a lung cancer cell line that has a mutation in the gene *KRAS* (Kirsten
MOLM 13 is a human leukemia cell line commonly used in research. Although it is also a cancer cell line, for the purposes of this workflow example we treat it as a "normal". This cell line has no mutations in *EGFR* or *KRAS* and is therefore a practical surrogate for a conventional normal sample.

### Test data details
FASTQ files for all three samples were derived from their respective whole exome sequencing data. However for the purpose of this guide we have limited the sequencing reads to span +/- 200 bp around the mutation sites for both genes. In doing so we are able to shrink the data files for quick testing.
FASTQ files for all three samples were derived from their respective whole exome sequencing data. However, for the purpose of this guide we have limited the sequencing reads to span +/- 200 bp around the mutation sites for both genes. In doing so we are able to shrink the data files for quick testing.