From 7523ad7bd84177ff8bbb35e5bcc2fce52a312106 Mon Sep 17 00:00:00 2001
From: Thomas Mooney
Date: Wed, 19 Jun 2024 09:19:03 -0500
Subject: [PATCH 1/4] Add a walkthrough for running the example WGS analysis in AnVIL.

---
 docs/anvil_walkthrough.md | 80 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)
 create mode 100644 docs/anvil_walkthrough.md

diff --git a/docs/anvil_walkthrough.md b/docs/anvil_walkthrough.md
new file mode 100644
index 0000000..b22c055
--- /dev/null
+++ b/docs/anvil_walkthrough.md
@@ -0,0 +1,80 @@
+# Running the QC workflow in AnVIL
+
+This walkthrough demonstrates running the `SingleSampleQc.wdl` workflow in AnVIL and writing the results to a datatable.
+
+## Pre-requisites
+
+* Creating workspaces and running workflows requires accounts and billing to be configured. For more on this, see [Setting up Lab Accounts and Billing in AnVIL](https://anvilproject.org/learn/investigators/setting-up-lab-accounts).
+
+
+## Preparing a workspace
+
+1. Log into [AnVIL](https://anvil.terra.bio) if you haven't already.
+2. Visit the Workspaces page by either choosing "View Workspaces" on the main page or "Workspaces" from the left menu.
+3. Create a new workspace by using the "+" button next to the Workspaces heading. Fill out a name, billing project, and any other desired fields. (A description can be handy for anyone looking at this workspace in the future!)
+    * For running QC on an existing workspace, one would instead search on the Workspaces page for the existing workspace and choose "Clone" from the three-dot menu.
+4. Wait while the system provisions the new workspace.
+
+## Adding data
+
+There are several ways to add data to a workspace. This walkthrough will use a simple one using public data for demonstration purposes. See also [Ingest Data](https://anvilproject.org/learn/data-submitters/submission-guide/ingesting-data) for more on the AnVIL process for adding data.
+
+1. Go to the "Data" tab of the new workspace.
+2. Select the "Import Data" button.
+3. Select "Upload TSV".
+4. Switch to "Text Import".
+5. Copy the following TSV:
+    ```tsv
+    entity:sample_id	cram	is_wgs
+    NA12878	gs://broad-public-datasets/NA12878/NA12878.cram	true
+    ```
+6. Paste the TSV into the box and "Start Import Job".
+7. Once the import completes, click the "sample" table and see that it contains the single row.
+    * The cram will have an info alert that the file is not stored within the workspace. That is expected here, but for a real dataset normally the files would be a part of the workspace.
+8. Select the "Files" section under other data.
+9. Download the [thresholds TSV](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/threshold_files/anvil_wgs_thresholds.tsv) and "Upload" it to the workspace.
+
+## Setting up the workflow and inputs
+
+1. Visit [the Dockstore page for the QC Workflow](https://dockstore.org/workflows/github.com/genome/qc-analysis-pipeline:master).
+2. In the "Launch with" section, choose "AnVIL".
+3. Choose the correct workspace in the "Destination Workspace" dropdown and "Import" the workflow.
+4. Ensure the "Run workflow(s) with inputs defined by data table" option is selected.
+5. Select the "sample" data table under "Step 1" and then choose "Select Data" under Step 2.
+6. Tick the box for the one and only row and choose "OK".
+7. Download the [input JSON](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/SingleSampleQc.json) from the GitHub repository, then choose "upload JSON" on the workflow page and upload the JSON that was downloaded.
+    * This will fill in all the values necessary to run the workflow. Since the JSON includes the path to a CRAM and sample name already, if we didn't have a datatable this would be sufficient to launch a single workflow now.
+8. Swap out the following values in the "input value" boxes on the "Inputs" tab:
+    | Variable  | Input value      |
+    |-----------|------------------|
+    | base_name | `this.sample_id` |
+    | input_bam | `this.cram`      |
+    | is_wgs    | `this.is_wgs`    |
+9. For the "evaluation_thresholds", press the folder icon in the input value box and then select the `anvil_wgs_thresholds.tsv` uploaded in the previous section.
+10. Be sure to "Save" before continuing.
+
+## Running the workflow
+
+1. Switch to the "Outputs" tab and verify that the "Input value" column is filled in for each output. If not, choose "Use defaults" in the header of the "Input value" column and be sure to "Save".
+2. Tick the "Delete intermediate outputs" box so that the working files are not retained after the workflow completes.
+3. Choose "Run analysis" to open the launcher.
+4. Confirm that this will launch `1` analysis, enter a description if desired, and "Launch" the analysis.
+5. Wait while the workflow runs; this may take several hours. An e-mail will be sent when the workflow completes.
+    * While it is in progress, the "Job History" tab can be used to monitor its progress. More info is available in [Terra's Job History overview](https://support.terra.bio/hc/en-us/articles/360037096272-Job-History-overview-monitoring-workflows).
+    * The workflow uses pre-emptible instances to reduce costs, so some steps may take multiple attempts to complete if they were interrupted.
+6. Visit the "Data" tab again to see that the outputs of the workflow have been added to the "sample" table in the workspace.
+
+## MultiQC notebook
+
+1. Open the "Analyses" tab of the workflow and press the "Start" button.
+2. Download [the notebook](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/notebook/MultiQC.ipynb) from the GitHub repository, then upload it using the overlay that appeared.
+3. Click on the "MultiQC.ipynb" that appears in the list.
+4. Choose the "Open" button next to the preview.
+5. As noted in the notebook, set the "Startup script" to `https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/notebook/multiqc_terra_startup.sh` which will load a special version of MultiQC into the notebook's VM.
+6. MultiQC requires more resources than the default, so switch the "CPUs" to 2 and the "Memory (GB)" to 13.
+    * The VM should be set to auto-pause after a period of inactivity. This is important for reducing costs. At any time, the VM can also be manually paused by clicking the cloud with lightning bolt in the sidebar of the page and pressing Pause. Under "Settings" the environment can also be deleted to eliminate the ongoing costs associated with the VM.
+7. Press "Create" and wait for the VM to be ready.
+8. Press "Open" again to open the notebook in the VM.
+9. Select the cell that begins `import firecloud.api as fapi` and press the "Run" button.
+10. Select the cell below it that begins `import asyncio` and press the "Run" button again. This one will launch MultiQC.
+11. Once it finishes the report will be linked in a line of output starting with "Report has been uploaded to...". Click that link to visit the Cloud Storage page for the report and then click the "Authenticated URL" to view it in your browser.

From a7efbce23e087aa8d9a3169012649e5232a53b25 Mon Sep 17 00:00:00 2001
From: Thomas Mooney
Date: Wed, 19 Jun 2024 15:23:04 -0500
Subject: [PATCH 2/4] Cleanup some old sections of README

---
 README.md | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 9a11395..2d0761e 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,6 @@ Workflows used for QC of WGS or WES data
 ### Single Sample QC
 This WDL pipeline implements QC in human whole-genome or exome/targeted sequencing data.
 
-For more on the metrics and aggregation of metrics from multiple workflow executions, please see the [qc-metric-aggregator](https://github.com/genome/qc-metric-aggregator) repository.
-
 #### Background
 As part of the [AnVIL](https://anvilproject.org/) Data Processing Working Group, a Quality Control (QC) workflow was developed to harmonize and summarize the QC for all WGS and WES sequence data sets ingested and released on the AnVIL from the [Centers for Common Disease Genomics](https://www.genome.gov/Funded-Programs-Projects/NHGRI-Genome-Sequencing-Program/Centers-for-Common-Disease-Genomics). The QC workflows are a starting point or reference for any data submission to the AnVIL.
 
@@ -75,7 +73,7 @@ The WDL includes multiple software packages (Picard, VerifyBamID2, Samtools flag
 The current QC pass/fail status is based on three metrics: coverage, freemix, and sample contamination. QC metrics can be made available in the AnVIL workspace to aid users in sample selection.
 
 ### Example QC processing results table
-Below is the current output, generated by the workflow in a qc_results_sample data table.
+Below is an example output, generated by the workflow in a qc_results_sample data table.
 
 | Metric Name | Metric Description | Pass threshold | Purpose | Source Tool |
 |-------------|---------------------|-----------------|---------|-------------|
@@ -95,6 +93,8 @@ Below is the current output, generated by the workflow in a qc_results_sample da
 | read1_pf_mismatch_rate | Read1 base mismatch rate | < 0.05 | Sequence quality | Picard Collect Alignment Summary Metrics |
 | read2_pf_mismatch_rate | Read2 base mismatch rate | < 0.05 | Sequence quality | Picard Collect Alignment Summary Metrics |
 
+See also the aforementioned [listing of all possible workflow outputs](/docs/outputs.md).
+
 1. Select QC status criteria
 
 Data submitters should establish the specific metrics and thresholds for determining the pass/fail criteria on their dataset. This repo contains some [example threshold files](threshold_files/) that can be used when running this workflow. There is [a complete list of possible threshold metrics here](/docs/thresholds.md).
@@ -107,19 +107,10 @@ Video - Walkthrough of WGS QC Processing
 
 [![Video - Walkthrough of WGS QC Processing](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/AnVILonDockstore_still.png)](https://youtu.be/WLpnoXySuIw "Walkthrough of WGS QC Processing - Click to Watch")If the current version of the example JSON is used as in this video, the chosen threshold file should be uploaded to the workspace files and the `gs://` path should be filled in as an input instead of pointing to the `https://` URL in this repo. Thereafter, the outputs tab of the workflow can be used to assign which results from the workflow should be added as columns to the data table. See also the Terra dcoumentation on [writing workflow outputs to the data table](https://support.terra.bio/hc/en-us/articles/4500420806299-Writing-workflow-outputs-to-the-data-table).
 
-
-3. Post QC Processing to AnVIL Workspaces
-
-The output from the QC aggregator is a QC summary results TSV file. Data submitters will pass off the QC summary results file to AnVIL ingestion team. The AnVIL team will push the QC summary results to the workspaces, which will contain the QC status including those that fail QC or have no QC. The example below is the QC results table in 1000 Genomes workspace.
-
 Sample QC Results Table
 
 ![QC results in a 1000 Genomes workspace](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/qc-results.png)
 
-### Additional Resources - Upcoming AnVIL Tools
-
-AnVIL Data Processing Working Group is evaluating two tools to add to the submission process to estimate (genetic) sex and compare that to reported sex. The goal is to identify at a cohort level any major issues between the genomic data and the reported phenotype data. Variation in sex chromosome copy number (e.g., XXY, XO, somatic mosaicism) means that genetic sex prediction is not 100% accurate, although it is an excellent tool for detecting major cohort-level issues.
-
 
 #### Exome QC Processing
 There is an additional [example JSON for exome](SingleSampleQc.exome.json). It should have all the inputs necessary to run aside from the input BAM, sample name, and threshold file. After supplying those inputs, the workflow can be run just as in the WGS case and outputs can simiarly be written to the data table.

From cd807170d2d30e955fc1b94fe82ee3f9800b9b39 Mon Sep 17 00:00:00 2001
From: Thomas Mooney
Date: Wed, 19 Jun 2024 15:25:19 -0500
Subject: [PATCH 3/4] Add link to AnVIL walkthrough to README

---
 README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/README.md b/README.md
index 2d0761e..aa231b2 100644
--- a/README.md
+++ b/README.md
@@ -107,6 +107,9 @@ Video - Walkthrough of WGS QC Processing
 
 [![Video - Walkthrough of WGS QC Processing](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/AnVILonDockstore_still.png)](https://youtu.be/WLpnoXySuIw "Walkthrough of WGS QC Processing - Click to Watch")If the current version of the example JSON is used as in this video, the chosen threshold file should be uploaded to the workspace files and the `gs://` path should be filled in as an input instead of pointing to the `https://` URL in this repo. Thereafter, the outputs tab of the workflow can be used to assign which results from the workflow should be added as columns to the data table. See also the Terra dcoumentation on [writing workflow outputs to the data table](https://support.terra.bio/hc/en-us/articles/4500420806299-Writing-workflow-outputs-to-the-data-table).
 
+There is also a [text walkthrough for WGS processing on AnVIL](/docs/anvil_walkthrough.md) that uses the NA12878 data listed in the [example JSON for WGS](SingleSampleQc.json).
+
+
 Sample QC Results Table
 
 ![QC results in a 1000 Genomes workspace](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/qc-results.png)

From 6bbf232ad3dcfc6f0bab459e27b7972b1ff2d948 Mon Sep 17 00:00:00 2001
From: Thomas Mooney
Date: Wed, 19 Jun 2024 15:35:05 -0500
Subject: [PATCH 4/4] Rearrange video and add second one.

---
 README.md | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index aa231b2..03bb66c 100644
--- a/README.md
+++ b/README.md
@@ -103,12 +103,7 @@ Data submitters should establish the specific metrics and thresholds for determi
 
 Data Submitters are responsible for running the WDL on their data to generate the QC metrics. If this workflow is given a threshold file, then it can report QC status outputs directly to the AnVIL data table as additional columns. AnVIL Data Processing Working Group has also created QC aggregator Jupyter notebook. Once QC status criteria have been determined, the thresholds can be modified in the notebook. The criteria is used to assign QC status of pass or fail. If a sample fails multiple times, it is assigned No QC under QC status.
 
-Video - Walkthrough of WGS QC Processing
-
-[![Video - Walkthrough of WGS QC Processing](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/AnVILonDockstore_still.png)](https://youtu.be/WLpnoXySuIw "Walkthrough of WGS QC Processing - Click to Watch")If the current version of the example JSON is used as in this video, the chosen threshold file should be uploaded to the workspace files and the `gs://` path should be filled in as an input instead of pointing to the `https://` URL in this repo. Thereafter, the outputs tab of the workflow can be used to assign which results from the workflow should be added as columns to the data table. See also the Terra dcoumentation on [writing workflow outputs to the data table](https://support.terra.bio/hc/en-us/articles/4500420806299-Writing-workflow-outputs-to-the-data-table).
-
-There is also a [text walkthrough for WGS processing on AnVIL](/docs/anvil_walkthrough.md) that uses the NA12878 data listed in the [example JSON for WGS](SingleSampleQc.json).
-
+There is a [text walkthrough for WGS processing on AnVIL](/docs/anvil_walkthrough.md) that uses the NA12878 data listed in the [example JSON for WGS](SingleSampleQc.json).
 
 Sample QC Results Table
 
@@ -117,3 +112,11 @@ Sample QC Results Table
 
 #### Exome QC Processing
 There is an additional [example JSON for exome](SingleSampleQc.exome.json). It should have all the inputs necessary to run aside from the input BAM, sample name, and threshold file. After supplying those inputs, the workflow can be run just as in the WGS case and outputs can simiarly be written to the data table.
+
+### Additional resources
+
+Video - Walkthrough of WGS QC Processing
+
+[![Video - Walkthrough of WGS QC Processing](https://raw.githubusercontent.com/genome/qc-analysis-pipeline/master/images/AnVILonDockstore_still.png)](https://youtu.be/WLpnoXySuIw "Walkthrough of WGS QC Processing - Click to Watch")If the current version of the example JSON is used as in this video, the chosen threshold file should be uploaded to the workspace files and the `gs://` path should be filled in as an input instead of pointing to the `https://` URL in this repo. Thereafter, the outputs tab of the workflow can be used to assign which results from the workflow should be added as columns to the data table. See also the Terra documentation on [writing workflow outputs to the data table](https://support.terra.bio/hc/en-us/articles/4500420806299-Writing-workflow-outputs-to-the-data-table).
+
+[Video - How to combine data across workspaces](https://www.youtube.com/watch?v=1vz4kupdkms) also includes an example of running the QC workflow.
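
Note on the "Adding data" step of the walkthrough added in patch 1: the pasted TSV has three columns (`entity:sample_id`, `cram`, `is_wgs`) and one row for NA12878. For a dataset with more than one sample, the same table could be generated rather than typed by hand. The following is only a minimal sketch under that assumption; the helper name `build_sample_tsv` is hypothetical and is not part of this repository or of any AnVIL/Terra API.

```python
import csv
import io


def build_sample_tsv(samples):
    """Build a data-table TSV from (sample_id, cram_path, is_wgs) tuples.

    Hypothetical helper: the column names mirror the TSV shown in the
    walkthrough; nothing here calls AnVIL or Terra.
    """
    buf = io.StringIO()
    # Tab-delimited output with plain "\n" line endings.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity:sample_id", "cram", "is_wgs"])
    for sample_id, cram_path, is_wgs in samples:
        # The walkthrough's TSV spells booleans as lowercase true/false.
        writer.writerow([sample_id, cram_path, "true" if is_wgs else "false"])
    return buf.getvalue()


if __name__ == "__main__":
    rows = [
        ("NA12878", "gs://broad-public-datasets/NA12878/NA12878.cram", True),
    ]
    print(build_sample_tsv(rows), end="")
```

The printed text can be pasted into the "Text Import" box exactly as in step 5 of that section; additional tuples in `rows` would become additional rows of the "sample" table.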