5. Pipeline construction
- Introduction
- Quality trimming pipeline
- Assembly pipeline
- Read mapping pipeline
- Practical - Build a pipeline
Introduction
Most bioinformatic work involves stringing programs together, directing the output of one into another with the occasional modification in between to keep the next program in your pipeline happy. Bash scripting offers a nice way to 1) streamline your workflow, 2) make it easy to re-run your pipeline, especially when you have many samples that need the same processing, 3) make your work reproducible, and 4) document your work for others, but also for yourself. It can also be useful to incorporate code that computes summary statistics on output. For example, FastQC is great, and looking at the HTML report only takes a few minutes, but what if you have hundreds of samples? FastQC also outputs flat text files that can be used to extract and generate summary statistics that inform the next steps of your pipeline. (I also recently learned about MultiQC, which post-processes multiple FastQC outputs.)
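For example, each FastQC result archive contains a summary.txt with one PASS/WARN/FAIL flag per analysis module, so a few lines of bash can tabulate a given module across all of your samples. A minimal sketch, assuming the unzipped FastQC directories sit under a hypothetical fastqc_output/ directory:
# tabulate the adapter-content flag from every FastQC summary.txt
for SUMMARY in fastqc_output/*_fastqc/summary.txt; do
    SAMPLE=$(basename $(dirname ${SUMMARY}))
    FLAG=$(grep "Adapter Content" ${SUMMARY} | cut -f 1)
    echo -e "${SAMPLE}\t${FLAG}"
done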
Furthermore, any time you publish work that relies on a bioinformatic pipeline, the correct thing to do as a scientist is to publish your code so that others can try to reproduce it. It is also enormously helpful for your own work to be diligent about documentation and to start using GitHub as early as possible in your bioinformatic career.
Quality trimming pipeline
Below is an example of a bash script that runs our quality filtering pipeline on our metagenomic reads. Take a look and see if you can follow the logic.
#!/bin/bash
# template from Antonio Fernandez-Guerra
##### Set variables #####
time(
# command-line variables
NAME=$1 # sample name (e.g., 20110404)
WD=$2 # working directory, starting from root
RAWDIR=$3 # raw read directory (e.g., fastq_files)
TRIMQS=$4 # quality score threshold (e.g., 10, 20, 30)
# hard-coded variables
MINLEN=100
# software versions
FASTQCVER=0.11.2
BBDUKVER=35.14
FASTQ=${RAWDIR}/${NAME}.raw.fastq.gz
##### Split R1 and R2 #####
SPLITEXE=~/marmic2017/software/bbmap/bbmap-35.14/reformat.sh
R1=${RAWDIR}/${NAME}.R1.raw.fastq.gz
R2=${RAWDIR}/${NAME}.R2.raw.fastq.gz
echo "Splitting reads 1 and 2 for sample:" ${NAME}
${SPLITEXE} in=${FASTQ} out1=${R1} out2=${R2}
##### Trim adapters and quality filter using bbduk #####
mkdir -p ${WD}/clean_fastq
TRIMWD=${WD}/clean_fastq
TRIMEXE=~/marmic2017/software/bbmap/bbmap-35.14/bbduk.sh
ADAPTERS=~/marmic2017/software/bbmap/bbmap-35.14/resources/truseq.fa.gz
R1CLEAN=${TRIMWD}/${NAME}.R1.clean.fastq.gz
R2CLEAN=${TRIMWD}/${NAME}.R2.clean.fastq.gz
echo "Clipping and quality trimming reads for sample:" ${NAME}
${TRIMEXE} in1=${R1} in2=${R2} out1=${R1CLEAN} out2=${R2CLEAN} ref=${ADAPTERS} \
    ktrim=r k=28 mink=12 hdist=1 tbo=t tpe=t qtrim=rl trimq=${TRIMQS} minlength=${MINLEN}
##### Run FASTQC #####
mkdir -p ${WD}/fastqc_pre-clean
mkdir -p ${WD}/fastqc_post-clean
mkdir -p ${WD}/qc_reports
FASTQCWDPRE=${WD}/fastqc_pre-clean
FASTQCWDPOST=${WD}/fastqc_post-clean
REP=${WD}/qc_reports
FASTQCEXE=~/marmic2017/software/fastqc/fastqc-0.11/fastqc
echo "Running fastqc for sample:" ${NAME}
${FASTQCEXE} ${R1} ${R2} -o ${FASTQCWDPRE}
${FASTQCEXE} ${R1CLEAN} ${R2CLEAN} -o ${FASTQCWDPOST}
illu1REP=${REP}/${NAME}.R1
illu2REP=${REP}/${NAME}.R2
touch ${illu1REP}
touch ${illu2REP}
cd ${FASTQCWDPRE}
unzip ${NAME}.R1.raw_fastqc.zip
unzip ${NAME}.R2.raw_fastqc.zip
cd ${FASTQCWDPOST}
unzip ${NAME}.R1.clean_fastqc.zip
unzip ${NAME}.R2.clean_fastqc.zip
cd ${WD}
echo "Analyzing fastqc & collecting sequence stats for sample:" ${NAME}
#Analyze fastqc output
R1FQCPREA=$(grep Adapter ${FASTQCWDPRE}/${NAME}.R1.raw_fastqc/summary.txt | cut -f 1)
R2FQCPREA=$(grep Adapter ${FASTQCWDPRE}/${NAME}.R2.raw_fastqc/summary.txt | cut -f 1)
R1FQCPRED=$(grep Duplication ${FASTQCWDPRE}/${NAME}.R1.raw_fastqc/summary.txt | cut -f 1)
R2FQCPRED=$(grep Duplication ${FASTQCWDPRE}/${NAME}.R2.raw_fastqc/summary.txt | cut -f 1)
R1FQCPOSTA=$(grep Adapter ${FASTQCWDPOST}/${NAME}.R1.clean_fastqc/summary.txt | cut -f 1)
R2FQCPOSTA=$(grep Adapter ${FASTQCWDPOST}/${NAME}.R2.clean_fastqc/summary.txt | cut -f 1)
R1FQCPOSTD=$(grep Duplication ${FASTQCWDPOST}/${NAME}.R1.clean_fastqc/summary.txt | cut -f 1)
R2FQCPOSTD=$(grep Duplication ${FASTQCWDPOST}/${NAME}.R2.clean_fastqc/summary.txt | cut -f 1)
#Count sequence statistics (read count, mean/max/min length) with a helper function
read_stats() {
    zcat $1 | awk 'NR%4==2{if(i==0){max=length($0); min=length($0)}; if(length($0)>max){max=length($0)}; if(length($0)<min){min=length($0)}; L=L+length($0); i++}END{print i"\t"int(L/i)"\t"max"\t"min}'
}
L1PRE=$(read_stats ${RAWDIR}/${NAME}.R1.raw.fastq.gz)
L2PRE=$(read_stats ${RAWDIR}/${NAME}.R2.raw.fastq.gz)
L1POST=$(read_stats ${TRIMWD}/${NAME}.R1.clean.fastq.gz)
L2POST=$(read_stats ${TRIMWD}/${NAME}.R2.clean.fastq.gz)
PRER1=$(echo ${L1PRE} | cut -f1 -d ' ')
POSTR1=$(echo ${L1POST} | cut -f1 -d ' ')
PCTRETR1=$(bc <<< "scale=2;100*${POSTR1}/${PRER1}")
PRER2=$(echo ${L2PRE} | cut -f1 -d ' ')
POSTR2=$(echo ${L2POST} | cut -f1 -d ' ')
PCTRETR2=$(bc <<< "scale=2;100*${POSTR2}/${PRER2}")
DATE=$(date)
#Cleanup, organize files and generate report
printf "COGITO Sample: ${NAME}\n" >> ${illu1REP}
printf "Illumina read pair: 1\n" >> ${illu1REP}
printf "Analysis date: ${DATE}\n" >> ${illu1REP}
printf "Number of reads (post/pre): $(echo ${L1POST} | cut -f1 -d ' ')/$(echo ${L1PRE} | cut -f1 -d ' ')\n" >> ${illu1REP}
printf "Percent reads retained: $(echo ${PCTRETR1})\n" >> ${illu1REP}
printf "Mean read length (post/pre): $(echo ${L1POST} | cut -f2 -d ' ')/$(echo ${L1PRE} | cut -f2 -d ' ')\n" >> ${illu1REP}
printf "Maximum read length (post/pre): $(echo ${L1POST} | cut -f3 -d ' ')/$(echo ${L1PRE} | cut -f3 -d ' ')\n" >> ${illu1REP}
printf "Minimum read length (post/pre): $(echo ${L1POST} | cut -f4 -d ' ')/$(echo ${L1PRE} | cut -f4 -d ' ')\n" >> ${illu1REP}
printf "Adapter FASTQC control (post/pre): ${R1FQCPOSTA}/${R1FQCPREA}\n" >> ${illu1REP}
printf "Duplication FASTQC control (post/pre): ${R1FQCPOSTD}/${R1FQCPRED}\n" >> ${illu1REP}
printf "BBDUK version: ${BBDUKVER}\n" >> ${illu1REP}
printf "FASTQC version: ${FASTQCVER}\n" >> ${illu1REP}
printf "COGITO Sample: ${NAME}\n" >> ${illu2REP}
printf "Illumina read pair: 2\n" >> ${illu2REP}
printf "Analysis date: ${DATE}\n" >> ${illu2REP}
printf "Number of reads (post/pre): $(echo ${L2POST} | cut -f1 -d ' ')/$(echo ${L2PRE} | cut -f1 -d ' ')\n" >> ${illu2REP}
printf "Percent reads retained: $(echo ${PCTRETR2})\n" >> ${illu2REP}
printf "Mean read length (post/pre): $(echo ${L2POST} | cut -f2 -d ' ')/$(echo ${L2PRE} | cut -f2 -d ' ')\n" >> ${illu2REP}
printf "Maximum read length (post/pre): $(echo ${L2POST} | cut -f3 -d ' ')/$(echo ${L2PRE} | cut -f3 -d ' ')\n" >> ${illu2REP}
printf "Minimum read length (post/pre): $(echo ${L2POST} | cut -f4 -d ' ')/$(echo ${L2PRE} | cut -f4 -d ' ')\n" >> ${illu2REP}
printf "Adapter FASTQC control (post/pre): ${R2FQCPOSTA}/${R2FQCPREA}\n" >> ${illu2REP}
printf "Duplication FASTQC control (post/pre): ${R2FQCPOSTD}/${R2FQCPRED}\n" >> ${illu2REP}
printf "BBDUK version: ${BBDUKVER}\n" >> ${illu2REP}
printf "FASTQC version: ${FASTQCVER}\n" >> ${illu2REP}
# remove unzipped fastqc directories
rm -r ${FASTQCWDPRE}/${NAME}.R1.raw_fastqc
rm -r ${FASTQCWDPRE}/${NAME}.R2.raw_fastqc
rm -r ${FASTQCWDPOST}/${NAME}.R1.clean_fastqc
rm -r ${FASTQCWDPOST}/${NAME}.R2.clean_fastqc
echo "Pre-processing analysis complete for sample:" ${NAME}
)
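Because the sample name and trimming threshold are command-line arguments, the same script can be run over any number of samples with a simple loop. A minimal sketch, assuming the script above is saved as qc_pipeline.sh (a hypothetical name), the working directory path is a placeholder, and sample names are listed one per line in samples.txt:
# run the quality trimming pipeline for every sample listed in samples.txt
mkdir -p logs
while read SAMPLE; do
    bash qc_pipeline.sh ${SAMPLE} ~/marmic2017/wd fastq_files 20 > logs/${SAMPLE}.qc.log 2>&1
done < samples.txt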
Assembly pipeline
Next is an example script that assembles the quality-trimmed reads into contigs with MEGAHIT.
#!/bin/bash
##### Set variables #####
time(
# command-line variables
WD=$1 # Working directory with adapter clipped fastq
READ1=$2 # A comma separated list of read 1 files
READ2=$3 # A comma separated list of read 2 files
MINCONTIG=$4 # Minimum contig length
PRESET=$5 # meta, meta-sensitive, meta-large, bulk
# software versions
MEGAHITVER=1.0.2
# Assemble libraries
# MEGAHIT creates the output directory itself and will refuse to run if it already exists
OUT=${WD}/megahit_output
MHITEXEC=~/marmic2017/software/megahit/megahit-1.0.2/megahit
# Run MEGAHIT
${MHITEXEC} -1 ${READ1} -2 ${READ2} -o ${OUT} --presets ${PRESET} --min-contig-len ${MINCONTIG}
# Collect stats
~/marmic2017/software/bbmap/bbmap-35.14/stats.sh in=${OUT}/final.contigs.fa > ${OUT}/megahit.stats
)
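The read lists are passed as comma-separated strings, so the clean reads from several samples can be co-assembled in a single call. A hypothetical invocation, assuming the script is saved as assembly_pipeline.sh and that two samples named A and B have been through the quality trimming pipeline above:
bash assembly_pipeline.sh ~/marmic2017/wd \
    ~/marmic2017/wd/clean_fastq/A.R1.clean.fastq.gz,~/marmic2017/wd/clean_fastq/B.R1.clean.fastq.gz \
    ~/marmic2017/wd/clean_fastq/A.R2.clean.fastq.gz,~/marmic2017/wd/clean_fastq/B.R2.clean.fastq.gz \
    500 meta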
Read mapping pipeline
Finally, an example script that maps the quality-trimmed reads back onto the assembled contigs with bbmap, then sorts, deduplicates, and indexes the resulting alignments.
#!/bin/bash
# template from Antonio Fernandez-Guerra
##### Set variables #####
WD=$1 # full path of parent directory
NAME=$2 # sample name / prefix to individual NAME fastq files
READDIR=$3 # full path of directory with quality trimmed reads to be mapped to assembly - fastq
ASSEMDIR=$4 # full path of directory with assembly contigs - fasta
THREADS=$5 # number of threads for samtools-rocksort
# software versions
BBMAP=35.14
SAMTOOLS=1.2
ROCKSORT=0.2
PICARD=1.119
# Make directories
mkdir -p ${WD}/bbmap_output
MAPDIR=${WD}/bbmap_output
cd ${MAPDIR}
# run bbmap (global alignments)
BBEXEC=~/marmic2017/software/bbmap/bbmap-35.14/bbmap.sh
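# the first bbmap call only builds the reference index (written to ./ref in the current directory)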
${BBEXEC} ref=${ASSEMDIR}/final.contigs.fa
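# the second call maps the reads, keeping alignments at >= 97% identity, and writes per-contig coverage stats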
${BBEXEC} in1=${READDIR}/${NAME}.R1.clean.fastq.gz in2=${READDIR}/${NAME}.R2.clean.fastq.gz out=${MAPDIR}/${NAME}.sam fast=t idfilter=0.97 idtag=t covstats=${MAPDIR}/${NAME}.cov.stats
SAMEXEC=~/marmic2017/software/samtools/samtools-1.2/bin/samtools
# fast hacker version of samtools called samtools-rocksort (http://devblog.dnanexus.com/tag/samtools-rocksort/)
ROCEXEC=~/marmic2017/software/samtools-rocksort/samtools-rocksort-0.2/bin/samtools
${SAMEXEC} faidx ${ASSEMDIR}/final.contigs.fa
# convert sam to bam, keeping mapped reads with MAPQ >= 10, and sort with rocksort
${SAMEXEC} view -@ ${THREADS} -q 10 -F 4 -bt ${ASSEMDIR}/final.contigs.fa.fai ${MAPDIR}/${NAME}.sam | ${ROCEXEC} rocksort -@ ${THREADS} -m 3 - ${NAME}.sorted
rmdir ${MAPDIR}/${NAME}.sorted.rocksort
rm ${MAPDIR}/${NAME}.sam
# mark duplicates with Picard tools
MARKDUPEXEC=~/marmic2017/picard-tools-1.119/MarkDuplicates.jar
JAVA_OPT="-Xms2g -Xmx32g -XX:ParallelGCThreads=4 -XX:MaxPermSize=2g -XX:+CMSClassUnloadingEnabled"
java ${JAVA_OPT} \
-jar ${MARKDUPEXEC} \
INPUT=${MAPDIR}/${NAME}.sorted.bam \
OUTPUT=${MAPDIR}/${NAME}.sorted.markdup.bam \
METRICS_FILE=${MAPDIR}/${NAME}.sorted.markdup.metrics \
AS=TRUE \
VALIDATION_STRINGENCY=LENIENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
REMOVE_DUPLICATES=TRUE
# re-sort, index and collect stats on the resulting markdup bam file
${ROCEXEC} rocksort -@ ${THREADS} -m 3 ${MAPDIR}/${NAME}.sorted.markdup.bam ${MAPDIR}/${NAME}.sorted.markdup.sorted
${SAMEXEC} index ${MAPDIR}/${NAME}.sorted.markdup.sorted.bam
${SAMEXEC} flagstat ${MAPDIR}/${NAME}.sorted.markdup.sorted.bam > ${MAPDIR}/${NAME}.sorted.markdup.sorted.flagstat
rmdir ${MAPDIR}/${NAME}.sorted.markdup.sorted.rocksort
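A hypothetical invocation of the mapping script, assuming it is saved as mapping_pipeline.sh and that the quality trimming and assembly scripts above were run in the same working directory:
bash mapping_pipeline.sh ~/marmic2017/wd 20110404 \
    ~/marmic2017/wd/clean_fastq ~/marmic2017/wd/megahit_output 4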
Practical - Build a pipeline
Pair up with your neighbor for this final exercise. Four project goals are outlined below. Choose a project and conceptualize a pipeline you would use to process the data. Using a flow chart like the pipeline diagrams presented over the last few days, sketch out each step and include notation for the input data, calculated data, and programs. Once everyone is finished, we will compare and critique each pipeline as a class.
For those of you up for a challenge, you can attempt to execute each step, or even go as far as building a bash script, using the raw and calculated data from the last week. Alternatively, you can use data generated during a previous course (e.g., Symbiosis). What you get out of this last day is really up to you; we will use the time to help you on a more individual basis so that you can gain a little more independence at your own pace.
Project scenarios:
[1] Diversity: Extract 16S rRNA reads from a metagenome for classification.
[2] Function: Extract gene-level abundance information from a metagenome.
[3] Assembly: Perform an assembly using PacBio reads and do an error correction with Illumina reads.
[4] Binning: Identify a bin of interest from day 4 and refine through sub-assembly.