5. Pipeline construction
- Introduction
- Quality trimming pipeline
- Assembly pipeline
- Read mapping pipeline
- Practical - Build a pipeline
Introduction
Most bioinformatic work involves stringing programs together, directing the output of one into another with the occasional modification in between to keep the next program in your pipeline happy. Bash scripting offers a nice way to 1) streamline your workflow, 2) make it easy to re-run your pipeline, especially when you have many samples that need the same processing, 3) make your work reproducible, and 4) document your work for others, but also for yourself. It can also be useful to incorporate code that computes summary statistics on output. For example, FastQC is great, and looking at the HTML report only takes a few minutes, but what if you have hundreds of samples? FastQC also outputs flat text files that can be used to extract and generate summary statistics that inform the next steps of your pipeline. (I also recently learned about MultiQC, which post-processes multiple FastQC outputs.)
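For example, each FastQC result archive contains a summary.txt with one PASS/WARN/FAIL flag per analysis module, so a few lines of bash can tabulate a given module across all of your samples. A minimal sketch, assuming the unzipped FastQC directories sit under a hypothetical fastqc_output/ directory:
# tabulate the adapter-content flag from every FastQC summary.txt
for SUMMARY in fastqc_output/*_fastqc/summary.txt; do
    SAMPLE=$(basename $(dirname ${SUMMARY}))
    FLAG=$(grep "Adapter Content" ${SUMMARY} | cut -f 1)
    echo -e "${SAMPLE}\t${FLAG}"
done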
Furthermore, any time you publish work that relies on a bioinformatic pipeline, the correct thing to do as a scientist is to publish your code so that others can try to reproduce it. It is also enormously helpful for your own work to be diligent about documentation and to start using GitHub as early as possible in your bioinformatic career.
Quality trimming pipeline
Below is an example of a bash script that runs our quality filtering pipeline on our metagenomic reads. Take a look and see if you can follow the logic.
#!/bin/bash
# template from Antonio Fernandez-Guerra
##### Set variables #####
time(
# command-line variables
NAME=$1 # sample name (e.g., 20110404)
WD=$2 # working directory, starting from root
RAWDIR=$3 # raw read directory (e.g., fastq_files)
TRIMQS=$4 # quality score threshold (e.g., 10, 20, 30)
# hard-coded variables
MINLEN=100
# software versions
FASTQCVER=0.11.2
BBDUKVER=35.14
FASTQ=${RAWDIR}/${NAME}.raw.fastq.gz
##### Split R1 and R2 #####
SPLITEXE=~/marmic2017/software/bbmap/bbmap-35.14/reformat.sh
R1=${RAWDIR}/${NAME}.R1.raw.fastq.gz
R2=${RAWDIR}/${NAME}.R2.raw.fastq.gz
echo "Splitting reads 1 and 2 for sample:" ${NAME}
${SPLITEXE} in=${FASTQ} out1=${R1} out2=${R2}
##### Trim adapters and quality filter using bbduk #####
mkdir -p ${WD}/clean_fastq
TRIMWD=${WD}/clean_fastq
TRIMEXE=~/marmic2017/software/bbmap/bbmap-35.14/bbduk.sh
ADAPTERS=~/marmic2017/software/bbmap/bbmap-35.14/resources/truseq.fa.gz
R1CLEAN=${TRIMWD}/${NAME}.R1.clean.fastq.gz
R2CLEAN=${TRIMWD}/${NAME}.R2.clean.fastq.gz
echo "Clipping and quality trimming reads for sample:" ${NAME}
${TRIMEXE} in1=${R1} in2=${R2} out1=${R1CLEAN} out2=${R2CLEAN} ref=${ADAPTERS} \
    ktrim=r k=28 mink=12 hdist=1 tbo=t tpe=t qtrim=rl trimq=${TRIMQS} minlength=${MINLEN}
##### Run FASTQC #####
mkdir -p ${WD}/fastqc_pre-clean
mkdir -p ${WD}/fastqc_post-clean
mkdir -p ${WD}/qc_reports
FASTQCWDPRE=${WD}/fastqc_pre-clean
FASTQCWDPOST=${WD}/fastqc_post-clean
REP=${WD}/qc_reports
FASTQCEXE=~/marmic2017/software/fastqc/fastqc-0.11/fastqc
echo "Running fastqc for sample:" ${NAME}
${FASTQCEXE} ${R1} ${R2} -o ${FASTQCWDPRE}
${FASTQCEXE} ${R1CLEAN} ${R2CLEAN} -o ${FASTQCWDPOST}
illu1REP=${REP}/${NAME}.R1
illu2REP=${REP}/${NAME}.R2
touch ${illu1REP}
touch ${illu2REP}
cd ${FASTQCWDPRE}
unzip ${NAME}.R1.raw_fastqc.zip
unzip ${NAME}.R2.raw_fastqc.zip
cd ${FASTQCWDPOST}
unzip ${NAME}.R1.clean_fastqc.zip
unzip ${NAME}.R2.clean_fastqc.zip
cd ${WD}
echo "Analyzing fastqc & collecting sequence stats for sample:" ${NAME}
#Analyze fastqc output
R1FQCPREA=$(grep Adapter ${FASTQCWDPRE}/${NAME}.R1.raw_fastqc/summary.txt | cut -f 1)
R2FQCPREA=$(grep Adapter ${FASTQCWDPRE}/${NAME}.R2.raw_fastqc/summary.txt | cut -f 1)
R1FQCPRED=$(grep Duplication ${FASTQCWDPRE}/${NAME}.R1.raw_fastqc/summary.txt | cut -f 1)
R2FQCPRED=$(grep Duplication ${FASTQCWDPRE}/${NAME}.R2.raw_fastqc/summary.txt | cut -f 1)
R1FQCPOSTA=$(grep Adapter ${FASTQCWDPOST}/${NAME}.R1.clean_fastqc/summary.txt | cut -f 1)
R2FQCPOSTA=$(grep Adapter ${FASTQCWDPOST}/${NAME}.R2.clean_fastqc/summary.txt | cut -f 1)
R1FQCPOSTD=$(grep Duplication ${FASTQCWDPOST}/${NAME}.R1.clean_fastqc/summary.txt | cut -f 1)
R2FQCPOSTD=$(grep Duplication ${FASTQCWDPOST}/${NAME}.R2.clean_fastqc/summary.txt | cut -f 1)
#Count sequence statistics (read count, mean/max/min length) with a helper function
read_stats() {
    zcat $1 | awk 'NR%4==2{if(i==0){max=length($0); min=length($0)}; if(length($0)>max){max=length($0)}; if(length($0)<min){min=length($0)}; L=L+length($0); i++}END{print i"\t"int(L/i)"\t"max"\t"min}'
}
L1PRE=$(read_stats ${RAWDIR}/${NAME}.R1.raw.fastq.gz)
L2PRE=$(read_stats ${RAWDIR}/${NAME}.R2.raw.fastq.gz)
L1POST=$(read_stats ${TRIMWD}/${NAME}.R1.clean.fastq.gz)
L2POST=$(read_stats ${TRIMWD}/${NAME}.R2.clean.fastq.gz)
PRER1=$(echo ${L1PRE} | cut -f1 -d ' ')
POSTR1=$(echo ${L1POST} | cut -f1 -d ' ')
PCTRETR1=$(bc <<< "scale=2;100*${POSTR1}/${PRER1}")
PRER2=$(echo ${L2PRE} | cut -f1 -d ' ')
POSTR2=$(echo ${L2POST} | cut -f1 -d ' ')
PCTRETR2=$(bc <<< "scale=2;100*${POSTR2}/${PRER2}")
DATE=$(date)
#Cleanup, organize files and generate report
printf "COGITO Sample: ${NAME}\n" >> ${illu1REP}
printf "Illumina read pair: 1\n" >> ${illu1REP}
printf "Analysis date: ${DATE}\n" >> ${illu1REP}
printf "Number of reads (post/pre): $(echo ${L1POST} | cut -f1 -d ' ')/$(echo ${L1PRE} | cut -f1 -d ' ')\n" >> ${illu1REP}
printf "Percent reads retained: $(echo ${PCTRETR1})\n" >> ${illu1REP}
printf "Mean read length (post/pre): $(echo ${L1POST} | cut -f2 -d ' ')/$(echo ${L1PRE} | cut -f2 -d ' ')\n" >> ${illu1REP}
printf "Maximum read length (post/pre): $(echo ${L1POST} | cut -f3 -d ' ')/$(echo ${L1PRE} | cut -f3 -d ' ')\n" >> ${illu1REP}
printf "Minimum read length (post/pre): $(echo ${L1POST} | cut -f4 -d ' ')/$(echo ${L1PRE} | cut -f4 -d ' ')\n" >> ${illu1REP}
printf "Adapter FASTQC control (post/pre): ${R1FQCPOSTA}/${R1FQCPREA}\n" >> ${illu1REP}
printf "Duplication FASTQC control (post/pre): ${R1FQCPOSTD}/${R1FQCPRED}\n" >> ${illu1REP}
printf "BBDUK version: ${BBDUKVER}\n" >> ${illu1REP}
printf "FASTQC version: ${FASTQCVER}\n" >> ${illu1REP}
printf "COGITO Sample: ${NAME}\n" >> ${illu2REP}
printf "Illumina read pair: 2\n" >> ${illu2REP}
printf "Analysis date: ${DATE}\n" >> ${illu2REP}
printf "Number of reads (post/pre): $(echo ${L2POST} | cut -f1 -d ' ')/$(echo ${L2PRE} | cut -f1 -d ' ')\n" >> ${illu2REP}
printf "Percent reads retained: $(echo ${PCTRETR2})\n" >> ${illu2REP}
printf "Mean read length (post/pre): $(echo ${L2POST} | cut -f2 -d ' ')/$(echo ${L2PRE} | cut -f2 -d ' ')\n" >> ${illu2REP}
printf "Maximum read length (post/pre): $(echo ${L2POST} | cut -f3 -d ' ')/$(echo ${L2PRE} | cut -f3 -d ' ')\n" >> ${illu2REP}
printf "Minimum read length (post/pre): $(echo ${L2POST} | cut -f4 -d ' ')/$(echo ${L2PRE} | cut -f4 -d ' ')\n" >> ${illu2REP}
printf "Adapter FASTQC control (post/pre): ${R2FQCPOSTA}/${R2FQCPREA}\n" >> ${illu2REP}
printf "Duplication FASTQC control (post/pre): ${R2FQCPOSTD}/${R2FQCPRED}\n" >> ${illu2REP}
printf "BBDUK version: ${BBDUKVER}\n" >> ${illu2REP}
printf "FASTQC version: ${FASTQCVER}\n" >> ${illu2REP}
# remove unzipped fastqc directories
rm -r ${FASTQCWDPRE}/${NAME}.R1.raw_fastqc
rm -r ${FASTQCWDPRE}/${NAME}.R2.raw_fastqc
rm -r ${FASTQCWDPOST}/${NAME}.R1.clean_fastqc
rm -r ${FASTQCWDPOST}/${NAME}.R2.clean_fastqc
echo "Pre-processing analysis complete for sample:" ${NAME}
)
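Because the sample name and trimming threshold are command-line arguments, the same script can be run over any number of samples with a simple loop. A minimal sketch, assuming the script above is saved as qc_pipeline.sh (a hypothetical name), the working directory path is a placeholder, and sample names are listed one per line in samples.txt:
# run the quality trimming pipeline for every sample listed in samples.txt
mkdir -p logs
while read SAMPLE; do
    bash qc_pipeline.sh ${SAMPLE} ~/marmic2017/wd fastq_files 20 > logs/${SAMPLE}.qc.log 2>&1
done < samples.txt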
Assembly pipeline
Next is an example script that assembles the quality-trimmed reads into contigs with MEGAHIT.
#!/bin/bash
##### Set variables #####
time(
# command-line variables
WD=$1 # Working directory with adapter clipped fastq
READ1=$2 # A comma separated list of read 1 files
READ2=$3 # A comma separated list of read 2 files
MINCONTIG=$4 # Minimum contig length
PRESET=$5 # meta, meta-sensitive, meta-large, bulk
# software versions
MEGAHITVER=1.0.2
# Assemble libraries
# MEGAHIT creates the output directory itself and will refuse to run if it already exists
OUT=${WD}/megahit_output
MHITEXEC=~/marmic2017/software/megahit/megahit-1.0.2/megahit
# Run MEGAHIT
${MHITEXEC} -1 ${READ1} -2 ${READ2} -o ${OUT} --presets ${PRESET} --min-contig-len ${MINCONTIG}
# Collect stats
~/marmic2017/software/bbmap/bbmap-35.14/stats.sh in=${OUT}/final.contigs.fa > ${OUT}/megahit.stats
)
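The read lists are passed as comma-separated strings, so the clean reads from several samples can be co-assembled in a single call. A hypothetical invocation, assuming the script is saved as assembly_pipeline.sh and that two samples named A and B have been through the quality trimming pipeline above:
bash assembly_pipeline.sh ~/marmic2017/wd \
    ~/marmic2017/wd/clean_fastq/A.R1.clean.fastq.gz,~/marmic2017/wd/clean_fastq/B.R1.clean.fastq.gz \
    ~/marmic2017/wd/clean_fastq/A.R2.clean.fastq.gz,~/marmic2017/wd/clean_fastq/B.R2.clean.fastq.gz \
    500 meta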
Read mapping pipeline
Finally, an example script that maps the quality-trimmed reads back onto the assembled contigs with bbmap, then sorts, deduplicates, and indexes the resulting alignments.
#!/bin/bash
# template from Antonio Fernandez-Guerra
##### Set variables #####
WD=$1 # full path of parent directory
NAME=$2 # sample name / prefix to individual NAME fastq files
READDIR=$3 # full path of directory with quality trimmed reads to be mapped to assembly - fastq
ASSEMDIR=$4 # full path of directory with assembly contigs - fasta
THREADS=$5 # number of threads for samtools-rocksort
# software versions
BBMAP=35.14
SAMTOOLS=1.2
ROCKSORT=0.2
PICARD=1.119
# Make directories
mkdir -p ${WD}/bbmap_output
MAPDIR=${WD}/bbmap_output
cd ${MAPDIR}
# run bbmap (global alignments)
BBEXEC=~/marmic2017/software/bbmap/bbmap-35.14/bbmap.sh
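# the first bbmap call only builds the reference index (written to ./ref in the current directory)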
${BBEXEC} ref=${ASSEMDIR}/final.contigs.fa
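# the second call maps the reads, keeping alignments at >= 97% identity, and writes per-contig coverage stats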
${BBEXEC} in1=${READDIR}/${NAME}.R1.clean.fastq.gz in2=${READDIR}/${NAME}.R2.clean.fastq.gz out=${MAPDIR}/${NAME}.sam fast=t idfilter=0.97 idtag=t covstats=${MAPDIR}/${NAME}.cov.stats
SAMEXEC=~/marmic2017/software/samtools/samtools-1.2/bin/samtools
# fast hacker version of samtools called samtools-rocksort (http://devblog.dnanexus.com/tag/samtools-rocksort/)
ROCEXEC=~/marmic2017/software/samtools-rocksort/samtools-rocksort-0.2/bin/samtools
${SAMEXEC} faidx ${ASSEMDIR}/final.contigs.fa
# convert sam to bam, keeping mapped reads with MAPQ >= 10, and sort with rocksort
${SAMEXEC} view -@ ${THREADS} -q 10 -F 4 -bt ${ASSEMDIR}/final.contigs.fa.fai ${MAPDIR}/${NAME}.sam | ${ROCEXEC} rocksort -@ ${THREADS} -m 3 - ${NAME}.sorted
rmdir ${MAPDIR}/${NAME}.sorted.rocksort
rm ${MAPDIR}/${NAME}.sam
# mark duplicates with Picard tools
MARKDUPEXEC=~/marmic2017/picard-tools-1.119/MarkDuplicates.jar
JAVA_OPT="-Xms2g -Xmx32g -XX:ParallelGCThreads=4 -XX:MaxPermSize=2g -XX:+CMSClassUnloadingEnabled"
java ${JAVA_OPT} \
-jar ${MARKDUPEXEC} \
INPUT=${MAPDIR}/${NAME}.sorted.bam \
OUTPUT=${MAPDIR}/${NAME}.sorted.markdup.bam \
METRICS_FILE=${MAPDIR}/${NAME}.sorted.markdup.metrics \
AS=TRUE \
VALIDATION_STRINGENCY=LENIENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
REMOVE_DUPLICATES=TRUE
# re-sort, index and collect stats on the resulting markdup bam file
${ROCEXEC} rocksort -@ ${THREADS} -m 3 ${MAPDIR}/${NAME}.sorted.markdup.bam ${MAPDIR}/${NAME}.sorted.markdup.sorted
${SAMEXEC} index ${MAPDIR}/${NAME}.sorted.markdup.sorted.bam
${SAMEXEC} flagstat ${MAPDIR}/${NAME}.sorted.markdup.sorted.bam > ${MAPDIR}/${NAME}.sorted.markdup.sorted.flagstat
rmdir ${MAPDIR}/${NAME}.sorted.markdup.sorted.rocksort
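A hypothetical invocation of the mapping script, assuming it is saved as mapping_pipeline.sh and that the quality trimming and assembly scripts above were run in the same working directory:
bash mapping_pipeline.sh ~/marmic2017/wd 20110404 \
    ~/marmic2017/wd/clean_fastq ~/marmic2017/wd/megahit_output 4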
Practical - Build a pipeline
Pair up with your neighbor for this final exercise. Four project goals are outlined below. Choose a project and conceptualize a pipeline you would use to process the data. Using a flow chart like the pipeline diagrams presented over the last few days, sketch out each step and include notation for the input data, calculated data, and programs. Once everyone is finished, we will compare and critique each pipeline as a class.
For those of you up for a challenge, you can attempt to execute each step, or even go as far as building a bash script, using the raw and calculated data from the last week. Alternatively, you can use data generated during a previous course (e.g., Symbiosis). What you get out of this last day is really up to you; we will use the time to help you on a more individual basis so that you can gain a little more independence at your own pace.
Project scenarios:
[1] Diversity: Extract 16S rRNA reads from a metagenome for classification.
[2] Function: Extract gene-level abundance information from a metagenome.
[3] Assembly: Perform an assembly using PacBio reads and do an error correction with Illumina reads.
[4] Binning: Identify a bin of interest from day 4 and refine through sub-assembly.