3.2. Gene prediction and annotation

Introduction

Now that we have our assembly, we have longer sequences that that give us more context about potential gene function and taxonomic affiliation. Here we will go over how to annotate our metagenome contigs generated from section 3.2.

Gene prediction

The first step in annotation is to predict genes or open reading frames (ORFs). For metagenomes, it is not generally possible to use a training dataset. For example, if you were annotating a single genome either from an isolate or a high-quality genome bin, you could provide extrinsic information to build a training dataset from closely-related taxa, which aids in the detection of genes in a more lineage-specific manner (called 'normal' or 'single' mode (depending on the version) in the gene caller Prodigal). For metagenomes, low quality draft genomes, small viruses and plasmids, ab initio methods must be used to predict genes based on intrinsic sequence characteristics and signals (e.g., promotor signals, transcription factor binding sites, start and stop codons). This mode is referred to as 'anonymous' or 'meta' in Prodigal. This procedure is fairly robust in prokaryotes with high coding densities and a lack of introns. It is considerably more complicated for eukaryotes.

First, lets make sure we are in the right directory and remind ourselves of its contents:

cd ~/marmic2017/day_3/metagenome/
ls -lh

You should have an assembly folder called 'megahit_output'. Examine the directory contents:

ls -lh megahit_output/

We first create a new output directory for our results and then run gene prediction on the final.contigs.fa using Prodigal:

mkdir prodigal_output
/bioinf/software/Prodigal/Prodigal-2.6.2/prodigal -h    #look at the help menu

How is the propper mode for metagenomes in this version of Prodigal called?

/bioinf/software/Prodigal/Prodigal-2.6.2/prodigal -i megahit_output/final.contigs.fa -o prodigal_output/coords.gbk -a prodigal_output/proteins.faa -p meta

Output includes a genbank file and amino acid fasta file (.faa). Take some time to examine them and extract some basic count information.

Annotation

There are many different protein databases to run your gene predictions against. Do you remember some of them from the Symbiosis and Bioinformatics course? HMMER searches for homologs against any database or set of sequences of interest using Hidden Markov Models. Many HMM profiles are precomputed and ready for download. The full Pfam database HMM, for example, can be downloaded from their FTP site) or you can download an HMM for a single family of proteins. If you have your own set of sequences you want to search against, you can build your own HMM profile.

Below is an example of how to download an HMM profile from a particular family of proteins. We do not want to download the full Pfam database, as it is currently more than 1Gb in size. Instead we will download the TonB_dep_Rec, Sulfatase and SusD HMM profiles:

mkdir pfam_search
cd pfam_search
wget http://pfam.xfam.org/family/PF00593/hmm
mv hmm TonB_dep_Rec.hmm 
#make sure to rename them before downloading the next as they are automatically just called 'hmm'
wget http://pfam.xfam.org/family/PF00884/hmm
wget http://pfam.xfam.org/family/PF07980/hmm

Before we can run HMMER against our three Pfam profiles we need to concatenate the three profiles to one file. Do you remember how to concatenate files? Now we are ready for the first part of running HMMER. Like for other tools the database that we want to search against needs to be transformed into a binary format that is easier to read for the computer. This is done by running 'hmmpress':

/bioinf/software/hmmer/hmmer-3.1b2/bin/hmmpress concatenated_profiles.hmm

How many additional files have been created in your folder now? For a detailed description of each see the HMMER user guide.

Now we can use 'hmmscan' to search our protein sequences against the database:

/bioinf/software/hmmer/hmmer-3.1b2/bin/hmmscan --tblout protein_pfam_search.txt concatenated_profiles.hmm ../prodigal_output/proteins.faa

Inspect the outputfile. How many hits are there against the different profiles? We now used hmmscan to search our sequences against the profile database. This is ok if you have a small profile database and only a few sequences you search against it. But when working with the entire Pfam database and larger number of sequences it becomes very slow. You should then think about using 'hmmsearch' which instead searches HMM profiles against a sequence database. Here is a nice explanation if you are interested in why this speeds up the search.

Gene coverage extraction

Antonio Fernandez-Guerra has put together a great tutorial, Proxy to gene abundance. It covers how to extract abundance from read mappings (SAM/BAM files) and gene annotations (GFF files) using programs BEDOPS and BEDTOOLS. These are widely used and important tools. I encourage you to explore this further on your own.

GenDB

In this part, you will learn how to annotate genes in a single genome using [GenDB] (https://www.uni-giessen.de/fbz/fb08/Inst/bioinformatik/software/gendb) annotation system. This will subsequently enable you to perform a computer-assisted reconstruction of an organisms' metabolic potential from its annotated genes.

GenDB is an annotation system for prokaryotic genomes. It has a relational database management system as backend ([MySQL] (https://www.mysql.com/)) and and a web interface as a front end (JCoast). The GenDB annotation engine performs similarity searches against [NCBI non-redundant protein] (https://www.ncbi.nlm.nih.gov/protein), InterPro, PFAM, KEGG and [COG] (https://www.ncbi.nlm.nih.gov/COG/) databases. These searches were done using a large collection of software tools such as BLAST and [HMMER] (http://hmmer.org/). Searching various databases with multiple tools gives you the opportunity to compare and curate the gene annotation in a given locus.

First, you will learn how to create a GenDB project and how to import a file to this project.

find the file Reinekea_MarMic.gbk in the corresponding folder and copy/paste to your Desktop:

cd /bioinf/transfer/marmic_NGS2017/day_3/genDB
cp Reinekea_MarMic.gbk ~/Desktop

login to the GenDB main computer:

ssh gendb

navigate to the folder containing the GenDB control scripts:

cd /vol/gendb/bin

list contents of folder:

ls -la

create a new GenDB project (use your own name as prefix). Neither username nor password, just press enter!

./create_project -D -p <your name>_MarMic_2017
Write down the name of your project on a paper! We need this information to delete your project afterwards.

import gbk file into the GenDB project (importing fasta file is also possible). You need to provide your database passsword which was sent by Lennart beforehand:

./import_EMBL_GBK -p <your name>_MarMic_2017 -G -f ~/Desktop/Reinekea_MarMic.gbk

After creating the project and importing the file, search tools should be created and submitted. This will run different tools against various databases and give you multiple search results for a gene locus. Considering the time limitation in the course, YOU WILL NOT RUN these tools in your newly created project. Instead, you will switch to a different project which has the search results for all tools.

Now you will start working in JCoast and curate the gene annotation!

copy/paste the jcoast file (jcoast_1.7b4.jar) from the software folder to your Desktop:

cd /bioinf/transfer/marmic_NGS2017/software
cp jcoast_1.7b4.jar ~/Desktop

change the directory:

cd ~/Desktop

start JCoast:

java -jar jcoast_1.7b4.jar

write your username and GenDB password. Provide a host:

host: maia.mpi-bremen.de

select the following project:

Reinekea_Hel1_31_D35_Final

spend some time to get familiar with the JCoast interface.
you will check the search results from different tools and manually curate some genes in the genome. You will be assigned to some gene loci during the tutorial.
if you find any interesting gene, use [MetaCyc] (https://metacyc.org) or [BRENDA] (http://www.brenda-enzymes.de/index.php) to predict its metabolic function.
try to make some ecological conclusions and answer the following questions:

-  Did you find any contradictory annotation for your genes? 
-  What kind of genes you annotate?
-  Can you make some predictions about the ecophysiology of this bacterium?

You can explore other (meta)genome annotation tools such as [RAST] (http://rast.nmpdr.org/), [mg-RAST] (http://metagenomics.anl.gov/) and [IMG] (https://img.jgi.doe.gov/cgi-bin/m/main.cgi). There will be a short demo about RAST at the end of the tutorial.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly