4. Metagenome binning

Introduction

In part 3 you assembled a complete metagenome and then mapped the reads back to that assembly. What we want to figure out now is which contigs in that assembly belong together, i.e. which ones are derived from the same organism. Current state-of-the-art binning programs use a combination of sequence composition (typically based on tetranucleotide frequency), and differential coverage. We'll be using MetaBAT for our binning, using differential coverage across the three sampling dates you mapped yesterday.

Once we've run MetaBAT we'll take a look at the bins we get out and see if we can get something close to a complete genome.

Binning with MetaBAT

MetaBAT uses tetranucleotide frequency and differential coverage to automatically bin metagenome assemblies. We'll need our sorted, indexed BAM files from before, and our final.contigs.fa file we got from the assembler.

First let's make a new directory for the binning output

cd ~/marmic_NGS2017/day_4/
mkdir -p metabat

And let's copy some pre-computed bam files from yesterday:

rsync -av --bwlimit=10000 /bioinf/transfer/marmic_NGS2017/day_3/metagenome/sorted_markdup_sorted_bams ~/marmic_NGS2017/day_3/metagenome/bbmap_output/

Calculating read coverage

Now we'll create coverage depth profiles for each contig and each sample

~/marmic_NGS2017/software/metabat/jgi_summarize_bam_contig_depths --outputDepth metabat/depth.txt ~/marmic_NGS2017/day_3/metagenome/bbmap_output/sorted_markdup_sorted_bams/201104*.bam

Have a look at the file we just created. What information is in there? See if you can use some of your bash skillz to find out the read coverage for the longest contig. How does coverage of this contig vary between the three sampling dates?

Binning

Now we actually run the binning software

~/marmic_NGS2017/software/metabat/metabat -t 3 -i ~/marmic_NGS2017/day_3/metagenome/megahit_output/final.contigs.fa -a metabat/depth.txt -o metabat/Bin

Seawater contains lots of different microbial species, why do you think we didn't get a lot of bins here?

Looking at the metabat output

We can use the program infoseq to get some information about our bins now. Let's take a look at the output that infoseq gives us.

infoseq metabat/Bin.1.fa | head

You can see it gives a bunch of data for each contig that's in this bin. Have a go at collecting some summary information for one of the bins, like total length (in base pairs), and GC content. How many contigs are there in each bin?

Evaluating your bins

Unfortunately you won't be able to do this yourselves, but I'll show you the steps involved in assessing the completeness and contamination/redundancy in your bins using CheckM and anvi'o

Provide feedback

Saved searches

Use saved searches to filter your results more quickly