Merge pull request #79 from tmooney/anvil_walkthrough
Publish AnVIL Walkthrough
tmooney authored Jun 19, 2024
2 parents 1914c5d + 6bbf232 commit 7b362f3
Showing 2 changed files with 92 additions and 15 deletions.
27 changes: 12 additions & 15 deletions README.md
@@ -4,8 +4,6 @@ Workflows used for QC of WGS or WES data
### Single Sample QC
This WDL pipeline implements QC in human whole-genome or exome/targeted sequencing data.

For more on the metrics and aggregation of metrics from multiple workflow executions, please see the [qc-metric-aggregator](https://github.com/genome/qc-metric-aggregator) repository.

#### Background

As part of the [AnVIL](https://anvilproject.org/) Data Processing Working Group, a Quality Control (QC) workflow was developed to harmonize and summarize the QC for all WGS and WES sequence data sets ingested and released on the AnVIL from the [Centers for Common Disease Genomics](https://www.genome.gov/Funded-Programs-Projects/NHGRI-Genome-Sequencing-Program/Centers-for-Common-Disease-Genomics). The QC workflows are a starting point or reference for any data submission to the AnVIL.
@@ -75,7 +73,7 @@ The WDL includes multiple software packages (Picard, VerifyBamID2, Samtools flag
The current QC pass/fail status is based on three metrics: coverage, freemix, and sample contamination. QC metrics can be made available in the AnVIL workspace to aid users in sample selection.

### Example QC processing results table
Below is an example output, generated by the workflow in a qc_results_sample data table.

| Metric Name | Metric Description | Pass threshold | Purpose | Source Tool |
|-------------|---------------------|-----------------|---------|-------------|
@@ -95,6 +93,8 @@ Below is the current output, generated by the workflow in a qc_results_sample da
| read1_pf_mismatch_rate | Read1 base mismatch rate | < 0.05 | Sequence quality | Picard Collect Alignment Summary Metrics |
| read2_pf_mismatch_rate | Read2 base mismatch rate | < 0.05 | Sequence quality | Picard Collect Alignment Summary Metrics |

See also the aforementioned [listing of all possible workflow outputs](/docs/outputs.md).

1. Select QC status criteria

Data submitters should establish the specific metrics and thresholds for determining the pass/fail criteria on their dataset. This repo contains some [example threshold files](threshold_files/) that can be used when running this workflow. There is [a complete list of possible threshold metrics here](/docs/thresholds.md).
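
As a rough illustration of how such a threshold file is applied, here is a minimal Python sketch; the `metric`, `operator`, and `value` column names are hypothetical placeholders, not the repository's actual schema (see [docs/thresholds.md](/docs/thresholds.md) and the files in [threshold_files/](threshold_files/) for the real layout):

```python
import csv

def load_thresholds(path):
    """Read a threshold TSV into {metric_name: (operator, cutoff)}.

    The column names used here ("metric", "operator", "value") are
    illustrative only; consult threshold_files/ for the real format.
    """
    thresholds = {}
    with open(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            thresholds[row["metric"]] = (row["operator"], float(row["value"]))
    return thresholds

def metric_passes(value, operator, cutoff):
    """Evaluate a single metric value against its pass threshold."""
    comparisons = {
        "<": value < cutoff,
        "<=": value <= cutoff,
        ">": value > cutoff,
        ">=": value >= cutoff,
    }
    return comparisons[operator]

# Example: per the table above, read1_pf_mismatch_rate passes when < 0.05.
print(metric_passes(0.002, "<", 0.05))  # True
```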
@@ -103,23 +103,20 @@ Data submitters should establish the specific metrics and thresholds for determi

Data submitters are responsible for running the WDL on their data to generate the QC metrics. If the workflow is given a threshold file, it can report QC status outputs directly to the AnVIL data table as additional columns. The AnVIL Data Processing Working Group has also created a QC aggregator Jupyter notebook. Once QC status criteria have been determined, the thresholds can be modified in the notebook. The criteria are used to assign a QC status of pass or fail. If a sample fails multiple times, it is assigned No QC under QC status.
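
As a sketch of the status assignment described above (not the notebook's actual code), per-metric results could be summarized into a sample-level status like this; interpreting "fails multiple times" as more than one failing metric is an assumption, and the metric names are examples only:

```python
def sample_qc_status(metric_results, max_failures=1):
    """Summarize per-metric pass/fail results into a sample-level QC status.

    metric_results maps metric name -> True if the metric passed its
    threshold. Treating more than `max_failures` failing metrics as
    "No QC" is an assumption, not the notebook's exact rule.
    """
    failures = sum(1 for passed in metric_results.values() if not passed)
    if failures == 0:
        return "pass"
    if failures <= max_failures:
        return "fail"
    return "No QC"

print(sample_qc_status({"freemix": True, "mean_coverage": True}))    # pass
print(sample_qc_status({"freemix": True, "mean_coverage": False}))   # fail
print(sample_qc_status({"freemix": False, "mean_coverage": False}))  # No QC
```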

3. Post QC Processing to AnVIL Workspaces

The output from the QC aggregator is a QC summary results TSV file. Data submitters pass the QC summary results file to the AnVIL ingestion team, which pushes the QC summary results to the workspaces. These results contain the QC status for every sample, including those that fail QC or have no QC. The example below is the QC results table in the 1000 Genomes workspace.
There is a [text walkthrough for WGS processing on AnVIL](/docs/anvil_walkthrough.md) that uses the NA12878 data listed in the [example JSON for WGS](SingleSampleQc.json).

Sample QC Results Table

![QC results in a 1000 Genomes workspace](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/qc-results.png)

### Additional Resources - Upcoming AnVIL Tools

The AnVIL Data Processing Working Group is evaluating two tools to add to the submission process that estimate (genetic) sex and compare it to reported sex. The goal is to identify, at a cohort level, any major issues between the genomic data and the reported phenotype data. Variation in sex chromosome copy number (e.g., XXY, XO, somatic mosaicism) means that genetic sex prediction is not 100% accurate, although it is an excellent tool for detecting major cohort-level issues.

#### Exome QC Processing

There is an additional [example JSON for exome](SingleSampleQc.exome.json). It includes all the inputs necessary to run aside from the input BAM, sample name, and threshold file. After supplying those inputs, the workflow can be run just as in the WGS case, and outputs can similarly be written to the data table.

### Additional resources

Video - Walkthrough of WGS QC Processing

[![Video - Walkthrough of WGS QC Processing](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/AnVILonDockstore_still.png)](https://youtu.be/WLpnoXySuIw "Walkthrough of WGS QC Processing - Click to Watch")

If the current version of the example JSON is used as in this video, the chosen threshold file should be uploaded to the workspace files and its `gs://` path filled in as an input instead of pointing to the `https://` URL in this repo. Thereafter, the outputs tab of the workflow can be used to assign which results from the workflow should be added as columns to the data table. See also the Terra documentation on [writing workflow outputs to the data table](https://support.terra.bio/hc/en-us/articles/4500420806299-Writing-workflow-outputs-to-the-data-table).
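
For those who prefer to script that upload, below is a hedged sketch of copying a threshold file into the workspace bucket; it assumes the `firecloud` (FISS) and `google-cloud-storage` Python packages are installed and uses placeholder workspace names:

```python
import firecloud.api as fapi
from google.cloud import storage

# Placeholder workspace coordinates; substitute your billing project and workspace.
namespace, workspace = "my-billing-project", "my-qc-workspace"

# Look up the workspace's Google Cloud Storage bucket.
bucket_name = fapi.get_workspace(namespace, workspace).json()["workspace"]["bucketName"]

# Copy the chosen threshold file into the workspace bucket.
blob = storage.Client().bucket(bucket_name).blob("anvil_wgs_thresholds.tsv")
blob.upload_from_filename("anvil_wgs_thresholds.tsv")

# Use this gs:// path as the workflow's threshold file input.
print(f"gs://{bucket_name}/{blob.name}")
```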

[Video - How to combine data across workspaces](https://www.youtube.com/watch?v=1vz4kupdkms) also includes an example of running the QC workflow.
80 changes: 80 additions & 0 deletions docs/anvil_walkthrough.md
@@ -0,0 +1,80 @@
# Running the QC workflow in AnVIL

This walkthrough demonstrates running the `SingleSampleQc.wdl` workflow in AnVIL and writing the results to a data table.

## Prerequisites

* Creating workspaces and running workflows requires accounts and billing to be configured. For more on this, see [Setting up Lab Accounts and Billing in AnVIL](https://anvilproject.org/learn/investigators/setting-up-lab-accounts).


## Preparing a workspace

1. Log into [AnVIL](https://anvil.terra.bio) if you haven't already.
2. Visit the Workspaces page by either choosing "View Workspaces" on the main page or "Workspaces" from the left menu.
3. Create a new workspace by using the "+" button next to the Workspaces heading. Fill out a name, billing project, and any other desired fields. (A description can be handy for anyone looking at this workspace in the future!)
* For running QC on an existing workspace, one would instead search on the Workspaces page for the existing workspace and choose "Clone" from the three-dot menu.
4. Wait while the system provisions the new workspace.

## Adding data

There are several ways to add data to a workspace. This walkthrough uses a simple approach with public data for demonstration purposes. See also [Ingest Data](https://anvilproject.org/learn/data-submitters/submission-guide/ingesting-data) for more on the AnVIL process for adding data.

1. Go to the "Data" tab of the new workspace.
2. Select the "Import Data" button.
3. Select "Upload TSV".
4. Switch to "Text Import".
5. Copy the following TSV:
```tsv
entity:sample_id cram is_wgs
NA12878 gs://broad-public-datasets/NA12878/NA12878.cram true
```
6. Paste the TSV into the box and "Start Import Job".
7. Once the import completes, click the "sample" table and see that it contains the single row.
* The cram will have an info alert that the file is not stored within the workspace. That is expected here; for a real dataset, the files would normally be part of the workspace.
8. Select the "Files" section under other data.
9. Download the [thresholds TSV](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/threshold_files/anvil_wgs_thresholds.tsv) and "Upload" it to the workspace.
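
The same import can also be scripted with the FISS API instead of clicking through the UI; a minimal sketch, assuming the `firecloud` Python package and placeholder workspace names:

```python
import firecloud.api as fapi

# Placeholder workspace coordinates; substitute your own.
namespace, workspace = "my-billing-project", "my-qc-workspace"

# Write the same single-row TSV from step 5 to a local file.
tsv_path = "samples.tsv"
with open(tsv_path, "w") as handle:
    handle.write("entity:sample_id\tcram\tis_wgs\n")
    handle.write("NA12878\tgs://broad-public-datasets/NA12878/NA12878.cram\ttrue\n")

# Create (or update) the "sample" table from the TSV.
fapi.upload_entities_tsv(namespace, workspace, tsv_path).raise_for_status()

# Confirm the row is present.
print(fapi.get_entities(namespace, workspace, "sample").json())
```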

## Setting up the workflow and inputs

1. Visit [the dockstore page for the QC Workflow](https://dockstore.org/workflows/github.com/genome/qc-analysis-pipeline:master).
2. In the "Launch with" section, choose "AnVIL".
3. Choose the correct workspace in the "Destination Workspace" dropdown and "Import" the workflow.
4. Ensure the "Run workflow(s) with inputs defined by data table" option is selected.
5. Select the "sample" data table under "Step 1" and then choose "Select Data" under Step 2.
6. Tick the box for the one and only row and choose "OK".
7. Download the [input JSON](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/SingleSampleQc.json) from the GitHub repository, then choose "upload JSON" on the workflow page and upload the JSON that was downloaded.
* This will fill in all the values necessary to run the workflow. Since the JSON already includes the path to a CRAM and a sample name, this alone would be sufficient to launch a single workflow if we weren't using a data table.
8. Swap out the following values in the "input value" boxes on the "Inputs" tab:
| Variable | Input value |
|-----------|------------------|
| base_name | `this.sample_id` |
| input_bam | `this.cram` |
| is_wgs | `this.is_wgs` |
9. For the "evaluation_thresholds", press the folder icon in the input value box and then select the `anvil_wgs_thresholds.tsv` uploaded in the previous section.
10. Be sure to "Save" before continuing.
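
The three substitutions in step 8 are Terra data-table expressions stored in the workflow's method configuration. For reference, here is a hedged sketch of applying them with the FISS API; the configuration name/namespace and the fully qualified input names (prefixed with `SingleSampleQc.`) are assumptions about how the Dockstore import names things, not values confirmed by this repository:

```python
import firecloud.api as fapi

# Placeholders; substitute your workspace and the configuration created by the Dockstore import.
namespace, workspace = "my-billing-project", "my-qc-workspace"
config_namespace, config_name = namespace, "SingleSampleQc"

# Fetch the current method configuration.
config = fapi.get_workspace_config(namespace, workspace, config_namespace, config_name).json()

# Point the inputs at columns of the "sample" table (same values as step 8).
# The fully qualified names assume the workflow is named SingleSampleQc.
config["inputs"]["SingleSampleQc.base_name"] = "this.sample_id"
config["inputs"]["SingleSampleQc.input_bam"] = "this.cram"
config["inputs"]["SingleSampleQc.is_wgs"] = "this.is_wgs"

fapi.update_workspace_config(namespace, workspace, config_namespace, config_name, config).raise_for_status()
```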

## Running the workflow

1. Switch to the "Outputs" tab and verify that the "Input value" column is filled in for each output. If not, choose "Use defaults" in the header of the "Input value" column and be sure to "Save".
2. Tick the "Delete intermediate outputs" box so that the working files are not retained after the workflow completes.
3. Choose "Run analysis" to open the launcher.
4. Confirm that this will launch `1` analysis, enter a description if desired, and "Launch" the analysis.
5. Wait while the workflow runs; this may take several hours. An e-mail will be sent when the workflow completes.
* While it is in progress, the "Job History" tab can be used to monitor its progress. More info is available in [Terra's Job History overview](https://support.terra.bio/hc/en-us/articles/360037096272-Job-History-overview-monitoring-workflows).
* The workflow uses preemptible instances to reduce costs, so some steps may take multiple attempts to complete if they are interrupted.
6. Visit the "Data" tab again to see that the outputs of the workflow have been added to the "sample" table in the workspace.
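
The same submission can be launched and monitored programmatically with the FISS API; a hedged sketch reusing the placeholder names from above (the configuration name is an assumption):

```python
import time
import firecloud.api as fapi

# Placeholders; substitute your workspace and workflow configuration.
namespace, workspace = "my-billing-project", "my-qc-workspace"
config_namespace, config_name = namespace, "SingleSampleQc"

# Launch one workflow on the NA12878 row of the "sample" table.
response = fapi.create_submission(
    namespace, workspace, config_namespace, config_name,
    entity="NA12878", etype="sample", use_callcache=True,
)
response.raise_for_status()
submission_id = response.json()["submissionId"]

# Poll the submission until it finishes (this can take hours).
while True:
    status = fapi.get_submission(namespace, workspace, submission_id).json()["status"]
    print(status)
    if status in ("Done", "Aborted"):
        break
    time.sleep(600)  # check every ten minutes
```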

## MultiQC notebook

1. Open the "Analyses" tab of the workspace and press the "Start" button.
2. Download [the notebook](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/notebook/MultiQC.ipynb) from the GitHub repository, then upload it using the overlay that appears.
3. Click on the "MultiQC.ipynb" that appears in the list.
4. Choose the "Open" button next to the preview.
5. As noted in the notebook, set the "Startup script" to `https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/notebook/multiqc_terra_startup.sh` which will load a special version of MultiQC into the notebook's VM.
6. MultiQC requires more resources than the default, so switch the "CPUs" to 2 and the "Memory (GB)" to 13.
* The VM should be set to auto-pause after a period of inactivity; this is important for reducing costs. At any time, the VM can also be manually paused by clicking the cloud icon with the lightning bolt in the sidebar of the page and pressing "Pause". Under "Settings" the environment can also be deleted to eliminate the ongoing costs associated with the VM.
7. Press "Create" and wait for the VM to be ready.
8. Press "Open" again to open the notebook in the VM.
9. Select the cell that begins `import firecloud.api as fapi` and press the "Run" button.
10. Select the cell below it that begins `import asyncio` and press the "Run" button again. This one will launch MultiQC.
11. Once it finishes, the report will be linked in a line of output starting with "Report has been uploaded to...". Click that link to visit the Cloud Storage page for the report and then click the "Authenticated URL" to view it in your browser.
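
Beyond MultiQC, the QC columns written to the data table can also be pulled directly from a notebook cell; a minimal sketch using the same `firecloud.api` import as the notebook (the attribute names queried, such as `freemix`, are examples and depend on which outputs were written to the table):

```python
import os
import firecloud.api as fapi

# Terra notebook VMs expose the workspace identity via environment variables.
namespace = os.environ["WORKSPACE_NAMESPACE"]
workspace = os.environ["WORKSPACE_NAME"]

# Fetch the sample table, which now includes the workflow's output columns.
samples = fapi.get_entities(namespace, workspace, "sample").json()
for sample in samples:
    attributes = sample["attributes"]
    # Attribute names here are examples; list attributes.keys() to see what was written.
    print(sample["name"], attributes.get("freemix"), attributes.get("mean_coverage"))
```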
