-
Notifications
You must be signed in to change notification settings - Fork 3
4. Metagenome binning
In part 3 you assembled a complete metagenome and then mapped the reads back to that assembly. What we want to figure out now is which contigs in that assembly belong together, i.e. which ones are derived from the same organism. Current state-of-the-art binning programs use a combination of sequence composition (typically based on tetranucleotide frequency), and differential coverage. We'll be using MetaBAT for our binning, using differential coverage across the three sampling dates you mapped yesterday.
Once we've run MetaBAT we'll take a look at the bins we get out and see if we can get something close to a complete genome.
MetaBAT uses tetranucleotide frequency and differential coverage to automatically bin metagenome assemblies. We'll need our sorted, indexed BAM files from before, and our final.contigs.fa file we got from the assembler.
First let's make a new directory for the binning output
cd ~/marmic_NGS2017/day_4/
mkdir -p metabat
And let's copy some pre-computed bam files from yesterday:
rsync -av --bwlimit=10000 /bioinf/transfer/marmic_NGS2017/day_3/metagenome/sorted_markdup_sorted_bams ~/marmic_NGS2017/day_3/metagenome/bbmap_output/
Now we'll create coverage depth profiles for each contig and each sample
~/marmic_NGS2017/software/metabat/jgi_summarize_bam_contig_depths --outputDepth metabat/depth.txt ~/marmic_NGS2017/day_3/metagenome/bbmap_output/sorted_markdup_sorted_bams/201104*.bam
Have a look at the file we just created. What information is in there? See if you can use some of your bash skillz to find out the read coverage for the longest contig. How does coverage of this contig vary between the three sampling dates?
Now we actually run the binning software
~/marmic_NGS2017/software/metabat/metabat -t 3 -i ~/marmic_NGS2017/day_3/metagenome/megahit_output/final.contigs.fa -a metabat/depth.txt -o metabat/Bin
Seawater contains lots of different microbial species, why do you think we didn't get a lot of bins here?
We can use the program infoseq to get some information about our bins now. Let's take a look at the output that infoseq gives us.
infoseq metabat/Bin.1.fa | head
You can see it gives a bunch of data for each contig that's in this bin. Have a go at collecting some summary information for one of the bins, like total length (in base pairs), and GC content. How many contigs are there in each bin?
Unfortunately you won't be able to do this yourselves, but I'll show you the steps involved in assessing the completeness and contamination/redundancy in your bins using CheckM and anvi'o