Merge pull request #22 from Sage-Bionetworks-Workflows/addparams

Add new main workflow for miRNA reads, add additional alignment parameters to all workflows
Sage-Bionetworks-Workflows · Sep 22, 2020 · c3461f6
2 parents 55f70d5 + fb523d7
commit c3461f6
Showing 19 changed files with 864 additions and 9 deletions.
diff --git a/README.md b/README.md
@@ -21,6 +21,7 @@ Three main workflows are present in the root of this repository:
 * [bam_paired.cwl](bam_paired.cwl): This workflow processes input BAM files from paired-end sequencing reads
 * [fastq_paired.cwl](fastq_paired.cwl): This workflow processes paired end fastq files
 * [fastq_single.cwl](fastq_single.cwl): This workflow processes single end fastq files
+* [mirna_single.cwl](mirna_single.cwl): This workflow processes single-end fastq files from miRNA libraries 
 
 Subworkflows that the main workflows utilize are present in the [subworkflows](subworkflows) folder. 
 
@@ -69,7 +70,7 @@ Each workflow requires the following inputs:
 
 * `cwl_wf_url`: A URL that points to a commit or tagged version of this github repository at the time of job submission. "https://github.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/tree/5832931a9569d9d8fba26a36146a682870d6f5f7", for example. Guidance on generating a permanent github link can be found [here](https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files#press-y-to-permalink-to-a-file-in-a-specific-commit).
 * `cwl_args_url`: A raw github URL that points to the input parameters file for the job that you are running. "https://raw.githubusercontent.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/5832931a9569d9d8fba26a36146a682870d6f5f7/jobs/test-paired-bam/job.json", for example. To find the raw URL for a file on github, navigate to the file and follow the instructions for generating a [permanent url](https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files#press-y-to-permalink-to-a-file-in-a-specific-commit). You can then click on the `raw` button to open the raw URL in your browser. 
-* `index_synapseid`: A [Synapse](https://www.synapse.org/) ID for the folder that contains a STAR-indexed reference genome. An example can be found in `syn22152278`
+* `index_synapseid`: A [Synapse](https://www.synapse.org/) ID for the folder that contains a STAR-indexed reference genome. An example can be found in `syn22152278`. Two gtf files must be present in this folder to run the `mirna_single.cwl` workflow: A main gtf file with the filename extension ".annotation.gtf" and a gtf file that contains only miRNA annotations with the filename extension ".subset.gtf". An example of a miRNA-compatible reference genome folder can be found in `syn22342700` 
 * `nthreads`: An integer value that represents the number of compute threads that the STAR aligner should use. 
 * `synapse_parentid`: A [Synapse](https://www.synapse.org/) ID for the folder that output tables will be uploaded to. 
 * `synapse_config`: A [Synapse](https://www.synapse.org/) configuration file that will be used to authenticate data downloads and uploads during workflow execution
@@ -81,6 +82,8 @@ The fastq_paired.cwl workflow also requires the following input:
 
 An example input json file that contains values for these required inputs can be found [here](https://raw.githubusercontent.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/7d64748a3a6d7cc8cfd9f30fc43c1b9bc79b3b3f/jobs/test-paired-bam/job.json)
 
+An example input json file that contains example parameters for the mirna_single.cwl workflow can be found [here](https://github.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/blob/master/jobs/test-single-mirna/job.json)
+
 ### Optional Job inputs
 
 You can optionally supply an input parameter that specifies the strandedness parameter of the library that will be used by Picard Tools. To do so, add the `strand_specificity` argument to your job.json file. The three valid string options for this parameter are:
@@ -101,6 +104,20 @@ If this argument is not provided, the default value of `2` will be used. This is
 
 An example input json file that contains the required inputs and these optional inputs can be found [here](https://raw.githubusercontent.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/7d64748a3a6d7cc8cfd9f30fc43c1b9bc79b3b3f/jobs/test-paired-fastq/job.json)
 
+In addition, you may optionally specify the following parameters for the STAR alignment (Note that it is highly recommended to customize these arguments for the mirna_single.cwl workflow):
+
+* `alignEndsType` : A string specifying the type of read ends alignment
+* `outFilterMismatchNmax` : Integer specifying the maximum number of mismatches per pair
+* `outFilterMultimapScoreRange` : Integer specifying the score range for multi-mapping alignments
+* `outFilterMultimapNmax` : Integer specifying the maximum number of multiple alignments for a read
+* `outFilterScoreMinOverLread` : Integer specifying the minimum score for an alignment to be reported, normalized to read length 
+* `outFilterMatchNminOverLread` : Integer specifying the minimum number of matched bases for an alignment to be reported, normalized to read length
+* `outFilterMatchNmin` : Integer specifying the minimum number of matched bases for an alignment to be reported
+* `alignSJDBoverhangMin` : Integer specifying the minimum block size for annotated spliced alignments 
+* `alignIntronMax` : Integer specifying the maximum intron size
+
+For further details about these parameters, please refer to the [STAR manual](https://chagall.med.cornell.edu/RNASEQcourse/STARmanual.pdf)
+
 ## Resource Requirements
 
 Resource requirements are specified using the CWL `ResourceRequirement` class. Each subworkflow contains specific requests for RAM, disk space, and number of threads. These values are set for average-sized RNA Sequencing input files for alignment against the human reference genome. If the default values are not sufficient, please modify the `ResourceRequirement` values in the subworkflow CWL files. 
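
Reviewer note: the optional STAR parameters documented in the README hunk above share their names with STAR's own command-line options. The sketch below illustrates how a job.json dictionary could be turned into a list of STAR arguments, skipping any parameter that was not supplied. It is only an illustration, not the way the CWL tools in this repository build the command; the function name `star_args_from_job` and the type table are assumptions, with the types copied from the `string?` / `int?` declarations added in this pull request.

```python
# Optional STAR parameters from the README, with the Python types that
# correspond to the CWL input types declared in this pull request.
STAR_OPTIONAL_PARAMS = {
    "alignEndsType": str,
    "outFilterMismatchNmax": int,
    "outFilterMultimapScoreRange": int,
    "outFilterMultimapNmax": int,
    "outFilterScoreMinOverLread": int,
    "outFilterMatchNminOverLread": int,
    "outFilterMatchNmin": int,
    "alignSJDBoverhangMin": int,
    "alignIntronMax": int,
}


def star_args_from_job(job):
    """Hypothetical helper: build STAR flags from the optional job inputs."""
    args = []
    for name, expected_type in STAR_OPTIONAL_PARAMS.items():
        value = job.get(name)
        if value is None:
            # Parameter omitted from job.json: leave STAR's default in place.
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise TypeError(f"{name} must be of type {expected_type.__name__}")
        args += [f"--{name}", str(value)]
    return args


# A subset of the values used in jobs/test-single-mirna/job.json.
example_job = {
    "alignEndsType": "EndToEnd",
    "outFilterMismatchNmax": 1,
    "outFilterMatchNmin": 16,
    "nthreads": 1,
}
example_args = star_args_from_job(example_job)
```

Inputs that are not in the table, such as `nthreads`, are ignored by this helper, and a wrongly typed value raises `TypeError` before anything is run.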

diff --git a/bam_paired.cwl b/bam_paired.cwl
@@ -33,6 +33,25 @@ inputs:
     type: string?
   - id: column_number
     type: int?
+  - id: alignEndsType
+    type: string?
+  - id: outFilterMismatchNmax
+    type: int?
+  - id: outFilterMultimapScoreRange
+    type: int?
+  - id: outFilterMultimapNmax
+    type: int?
+  - id: outFilterScoreMinOverLread
+    type: int?
+  - id: outFilterMatchNminOverLread
+    type: int?
+  - id: outFilterMatchNmin
+    type: int?
+  - id: alignSJDBoverhangMin
+    type: int?
+  - id: alignIntronMax
+    type: int?
+
 outputs:
   - id: clean_counts
     outputSource:
@@ -83,6 +102,24 @@ steps:
         source: synapse_config
       - id: synapseid
         source: synapseid
+      - id: alignEndsType
+        source: alignEndsType
+      - id: outFilterMismatchNmax
+        source: outFilterMismatchNmax
+      - id: outFilterMultimapScoreRange
+        source: outFilterMultimapScoreRange
+      - id: outFilterMultimapNmax
+        source: outFilterMultimapNmax
+      - id: outFilterScoreMinOverLread
+        source: outFilterScoreMinOverLread
+      - id: outFilterMatchNminOverLread
+        source: outFilterMatchNminOverLread
+      - id: outFilterMatchNmin
+        source: outFilterMatchNmin
+      - id: alignSJDBoverhangMin
+        source: alignSJDBoverhangMin
+      - id: alignIntronMax
+        source: alignIntronMax
     out:
       - id: splice_junctions
       - id: reads_per_gene

diff --git a/fastq_paired.cwl b/fastq_paired.cwl
@@ -29,6 +29,25 @@ inputs:
     type: string?
   - id: column_number
     type: int?
+  - id: alignEndsType
+    type: string?
+  - id: outFilterMismatchNmax
+    type: int?
+  - id: outFilterMultimapScoreRange
+    type: int?
+  - id: outFilterMultimapNmax
+    type: int?
+  - id: outFilterScoreMinOverLread
+    type: int?
+  - id: outFilterMatchNminOverLread
+    type: int?
+  - id: outFilterMatchNmin
+    type: int?
+  - id: alignSJDBoverhangMin
+    type: int?
+  - id: alignIntronMax
+    type: int?
+
 outputs:
   - id: clean_counts
     outputSource:
@@ -81,6 +100,24 @@ steps:
         source: synapseid
       - id: synapseid_2
         source: synapseid_2
+      - id: alignEndsType
+        source: alignEndsType
+      - id: outFilterMismatchNmax
+        source: outFilterMismatchNmax
+      - id: outFilterMultimapScoreRange
+        source: outFilterMultimapScoreRange
+      - id: outFilterMultimapNmax
+        source: outFilterMultimapNmax
+      - id: outFilterScoreMinOverLread
+        source: outFilterScoreMinOverLread
+      - id: outFilterMatchNminOverLread
+        source: outFilterMatchNminOverLread
+      - id: outFilterMatchNmin
+        source: outFilterMatchNmin
+      - id: alignSJDBoverhangMin
+        source: alignSJDBoverhangMin
+      - id: alignIntronMax
+        source: alignIntronMax
     out:
       - id: splice_junctions
       - id: reads_per_gene

diff --git a/fastq_single.cwl b/fastq_single.cwl
@@ -33,6 +33,24 @@ inputs:
     type: string?
   - id: column_number
     type: int?
+  - id: alignEndsType
+    type: string?
+  - id: outFilterMismatchNmax
+    type: int?
+  - id: outFilterMultimapScoreRange
+    type: int?
+  - id: outFilterMultimapNmax
+    type: int?
+  - id: outFilterScoreMinOverLread
+    type: int?
+  - id: outFilterMatchNminOverLread
+    type: int?
+  - id: outFilterMatchNmin
+    type: int?
+  - id: alignSJDBoverhangMin
+    type: int?
+  - id: alignIntronMax
+    type: int?
 outputs:
   - id: clean_counts
     outputSource:
@@ -83,6 +101,24 @@ steps:
         source: synapse_config
       - id: synapseid
         source: synapseid
+      - id: alignEndsType
+        source: alignEndsType
+      - id: outFilterMismatchNmax
+        source: outFilterMismatchNmax
+      - id: outFilterMultimapScoreRange
+        source: outFilterMultimapScoreRange
+      - id: outFilterMultimapNmax
+        source: outFilterMultimapNmax
+      - id: outFilterScoreMinOverLread
+        source: outFilterScoreMinOverLread
+      - id: outFilterMatchNminOverLread
+        source: outFilterMatchNminOverLread
+      - id: outFilterMatchNmin
+        source: outFilterMatchNmin
+      - id: alignSJDBoverhangMin
+        source: alignSJDBoverhangMin
+      - id: alignIntronMax
+        source: alignIntronMax
     out:
       - id: splice_junctions
       - id: reads_per_gene

diff --git a/jobs/default/options.json b/jobs/default/options.json
@@ -1,11 +1,13 @@
 {
   "zone": "us-east-1a",
+  "cluster_name": "rna-seq-reprocessing-scicomp-toil-cluster-v001",
   "tmpdir": "/var/lib/toil",
-  "run_name": "def",
+  "run_name": "tst",
   "log_level": "INFO",
   "retry_count": "3",
   "target_time": "1",
   "default_disk": "450G",
+  "max_nodes": "5",
   "node_types": "m5.4xlarge",
   "node_storage": "500",
   "preemptable_compensation": "0.5",

diff --git a/jobs/test-UW-mirna/job.json b/jobs/test-UW-mirna/job.json
@@ -0,0 +1,32 @@
+{
+    "cwl_wf_url": "https://github.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq",
+    "cwl_args_url": "https://raw.githubusercontent.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/master/jobs/test-UW-mirna/job.json",
+    "alignEndsType": "EndToEnd",
+    "outFilterMismatchNmax": 1,
+    "outFilterMultimapScoreRange": 0,
+    "outFilterMultimapNmax": 10,
+    "outFilterScoreMinOverLread": 0,
+    "outFilterMatchNminOverLread": 0,
+    "outFilterMatchNmin": 16,
+    "alignSJDBoverhangMin": 1000,
+    "alignIntronMax": 1,
+    "index_synapseid": "syn22337116",
+    "nthreads": 1,
+    "synapse_parentid": "syn22352005",
+    "synapse_config": {
+        "class": "File",
+        "path": "/etc/synapse/.synapseConfig"
+    },
+    "synapseid": [
+        "syn22334734",
+        "syn22334729",
+        "syn22334741",
+        "syn22334744",
+        "syn22334745",
+        "syn22334706",
+        "syn22334712",
+        "syn22334731",
+        "syn22334728",
+        "syn22334714"
+    ]
+}
diff --git a/jobs/test-UW-mirna/options.json b/jobs/test-UW-mirna/options.json
@@ -0,0 +1,17 @@
+{
+  "zone": "us-east-1a",
+  "cluster_name": "rna-seq-reprocessing-scicomp-toil-cluster-v001",
+  "tmpdir": "/var/lib/toil",
+  "run_name": "uwm",
+  "log_level": "DEBUG",
+  "retry_count": "3",
+  "target_time": "1",
+  "max_nodes": "5",
+  "default_disk": "450G",
+  "node_types": "m5.4xlarge",
+  "node_storage": "500",
+  "preemptable_compensation": "0.5",
+  "rescue_frequency": "9000",
+  "cwl": "mirna_single.cwl"
+}
+
diff --git a/jobs/test-paired-bam/options.json b/jobs/test-paired-bam/options.json
@@ -1,11 +1,13 @@
 {
   "zone": "us-east-1a",
+  "cluster_name": "rna-seq-reprocessing-scicomp-toil-cluster-v001",
   "tmpdir": "/var/lib/toil",
-  "run_name": "def",
+  "run_name": "tst",
   "log_level": "INFO",
   "retry_count": "3",
   "target_time": "1",
   "default_disk": "450G",
+  "max_nodes": "5",
   "node_types": "m5.4xlarge",
   "node_storage": "500",
   "preemptable_compensation": "0.5",

diff --git a/jobs/test-paired-fastq/options.json b/jobs/test-paired-fastq/options.json
@@ -1,11 +1,13 @@
 {
   "zone": "us-east-1a",
+  "cluster_name": "rna-seq-reprocessing-scicomp-toil-cluster-v001",
   "tmpdir": "/var/lib/toil",
-  "run_name": "def",
+  "run_name": "tst",
   "log_level": "INFO",
   "retry_count": "3",
   "target_time": "1",
   "default_disk": "450G",
+  "max_nodes": "5",
   "node_types": "m5.4xlarge",
   "node_storage": "500",
   "preemptable_compensation": "0.5",

diff --git a/jobs/test-single-fastq/options.json b/jobs/test-single-fastq/options.json
@@ -1,11 +1,13 @@
 {
   "zone": "us-east-1a",
+  "cluster_name": "rna-seq-reprocessing-scicomp-toil-cluster-v001",
   "tmpdir": "/var/lib/toil",
-  "run_name": "def",
+  "run_name": "tst",
   "log_level": "INFO",
   "retry_count": "3",
   "target_time": "1",
   "default_disk": "450G",
+  "max_nodes": "5",
   "node_types": "m5.4xlarge",
   "node_storage": "500",
   "preemptable_compensation": "0.5",

diff --git a/jobs/test-single-mirna/job.json b/jobs/test-single-mirna/job.json
@@ -0,0 +1,23 @@
+{
+    "cwl_wf_url": "https://github.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq",
+    "cwl_args_url": "https://raw.githubusercontent.com/Sage-Bionetworks-Workflows/dockstore-workflow-rnaseq/master/jobs/test-single-mirna/job.json",
+    "alignEndsType": "EndToEnd",
+    "outFilterMismatchNmax": 1,
+    "outFilterMultimapScoreRange": 0,
+    "outFilterMultimapNmax": 10,
+    "outFilterScoreMinOverLread": 0,
+    "outFilterMatchNminOverLread": 0,
+    "outFilterMatchNmin": 16,
+    "alignSJDBoverhangMin": 1000,
+    "alignIntronMax": 1,
+    "index_synapseid": "syn22342700",
+    "nthreads": 1,
+    "synapse_parentid": "syn22152380",
+    "synapse_config": {
+        "class": "File",
+        "path": "/tmp/.synapseConfig"
+    },
+    "synapseid": [
+        "syn22351902"
+    ]
+}
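
Reviewer note: the README change in this pull request says the folder referenced by `index_synapseid` must contain one file ending in ".annotation.gtf" and one ending in ".subset.gtf" for `mirna_single.cwl`. A minimal sketch of that check on a plain list of filenames follows. How the filenames are obtained from Synapse is left out, and the function name `missing_mirna_gtfs` and the example filenames are hypothetical.

```python
# Filename suffixes the README requires for the miRNA workflow.
REQUIRED_GTF_SUFFIXES = (".annotation.gtf", ".subset.gtf")


def missing_mirna_gtfs(filenames):
    """Return the required suffixes that no filename in the list ends with."""
    return [
        suffix
        for suffix in REQUIRED_GTF_SUFFIXES
        if not any(name.endswith(suffix) for name in filenames)
    ]


# Hypothetical folder listings, for illustration only.
complete_folder = ["genome.annotation.gtf", "genome.subset.gtf", "SA"]
incomplete_folder = ["genome.annotation.gtf", "SA"]

complete_missing = missing_mirna_gtfs(complete_folder)
incomplete_missing = missing_mirna_gtfs(incomplete_folder)
```

An empty result means both gtf files are present; otherwise the list names the suffixes that still need to be added to the index folder.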
diff --git a/jobs/test-single-mirna/options.json b/jobs/test-single-mirna/options.json
@@ -0,0 +1,15 @@
+{
+  "zone": "us-east-1a",
+  "cluster_name": "rna-seq-reprocessing-scicomp-toil-cluster-v001",
+  "tmpdir": "/var/lib/toil",
+  "run_name": "tst",
+  "log_level": "INFO",
+  "retry_count": "3",
+  "target_time": "1",
+  "default_disk": "450G",
+  "max_nodes": "5",
+  "node_types": "m5.4xlarge",
+  "node_storage": "500",
+  "preemptable_compensation": "0.5",
+  "rescue_frequency": "9000"
+}