-
Notifications
You must be signed in to change notification settings - Fork 6
Tutorial 3_0
This tutorial provide a step-by-step analysis using the PanPhlAn software to map some samples against the E.rectale species pangenome.
- 1. Download and install PanPhlAn
- 2. Download the metagenomics samples
- 3. Download the reference genomes
- 4. Map samples against pangenome
- 5. Profiling strains
- Visualizing the matrix
Let's start by creating a folder for this analysis
mkdir panphlan_tutorial
cd panphlan_tutorial-tutoril
Then download the PanPhlAn software
git clone https://github.com/SegataLab/panphlan.git
In this example, we use 25 samples coming from various datasets (different studies, pathologies, country, westernized and non-westernized...). Sample can be downloaded using the following command :
wget https://www.dropbox.com/s/oi26jg0v7ktlavc/panphlan_tutorial_samples.tar.bz2
(samples have been pre-processed and filtered in order to keep only matching sequence and thus speed up download and running time needed for this tutorial ) Then, the archive must be extracted in order to get one file per sample
tar -xvjf panphlan_tutorial_samples.tar.bz2
25 fastq files are now available in a samples_fastq/
folder. They correspond to various metagenomes samples from datasets spanning across studies and countries.
Here are some metadata :
sampleID Dataset Country
CCMD34381688ST-21-0 ZellerG_2014 DEU
G78505 VatanenT_2016 RUS
G88884 SchirmerM_2016 NLD
G88970 SchirmerM_2016 NLD
G89027 SchirmerM_2016 NLD
H2M514903 LiJ_2017 CHN
H3M518116 LiJ_2017 CHN
HD-1 QinN_2014 CHN
HD-5 QinN_2014 CHN
HV-6 QinN_2014 CHN
LD-48 QinN_2014 CHN
M1.48.ST BritoIL_2016 FJI
M2.48.ST BritoIL_2016 REF
M2.58.ST BritoIL_2016 FJI
N021 WenC_2017 CHN
RA023 WenC_2017 CHN
S353 KarlssonFH_2013 SWE
SID530054 FengQ_2015 AUT
SRS011302 HMP_2012 USA
SZAXPI003417-4 YuJ_2015 CHN
T2D-025 QinJ_2012 CHN
T2D-063 QinJ_2012 CHN
T2D-105 QinJ_2012 CHN
W1.27.ST BritoIL_2016 FJI
YSZC12003_36795 XieH_2016 GBR
The pangenome of Eubacterium rectale can be easily retrieved via PanPhlAn pangenome database using the follwoing line :
python panphlan/panphlan_download_pangenome.py -i Eubacterium_rectale
Let's check the content of the folder downloaded (named Eubacterium_rectale
)
ls -l Eubacterium_rectale/
Eubacterium_rectale.1.bt2
Eubacterium_rectale.2.bt2
Eubacterium_rectale.3.bt2
Eubacterium_rectale.4.bt2
Eubacterium_rectale.rev.1.bt2
Eubacterium_rectale.rev.2.bt2
Eubacterium_rectale_pangenome.tsv
Eubacterium_rectale_pangenome_contigs.fna
panphlan_Eubacterium_rectale_annot.tsv
This will download and extract a folder containing :
-
Eubacterium_rectale_pangenome_contigs.fna
: all contigs from the reference genomes -
Eubacterium_rectale.X.bt2
: the bowtie2 indexes (6 files) -
panphlan_Eubacterium_rectale_annot.tsv
: an annotation mapping tsv file, can be useful for further analysis of gene families of interest -
Eubacterium_rectale_pangenome.tsv
: a pangenome tsv file containing information about gene families UniRef90 ID, gene family name and position (genome, contig, start and stop) like the following example.
head Eubacterium_rectale_pangenome.tsv
UniRef90_A0A174CG45 ywqN_1 GCA_000209955 gnl|X|MDHPEDEI_1 5355 5880
UniRef90_A0A174HJF4 ymdB_1 GCA_000209955 gnl|X|MDHPEDEI_1 7141 7519
UniRef90_D4JYV8 sttH GCA_000209955 gnl|X|MDHPEDEI_1 7563 8091
UniRef90_A0A173V9T5 chrA GCA_000209955 gnl|X|MDHPEDEI_1 9240 9807
UniRef90_A0A173V954 btr_1 GCA_000209955 gnl|X|MDHPEDEI_1 12790 14395
UniRef90_C4ZFH4 araN_1 GCA_000209955 gnl|X|MDHPEDEI_1 17732 18986
UniRef90_A0A173V9Z2 ugpA_1 GCA_000209955 gnl|X|MDHPEDEI_1 19135 20014
UniRef90_A0A173V933 sugB GCA_000209955 gnl|X|MDHPEDEI_1 20025 20880
UniRef90_D4JMJ2 xerC_1 GCA_000209955 gnl|X|MDHPEDEI_1 23844 24993
UniRef90_A0A174NHE9 mgrA GCA_000209955 gnl|X|MDHPEDEI_1 26198 26636
UniRef90_A0A174EBC4 prpC GCA_000209955 gnl|X|MDHPEDEI_1 28391 29738
UniRef90_C7GEM8 clsC_1 GCA_000209955 gnl|X|MDHPEDEI_1 30340 31774
UniRef90_A0A174CL23 hssR_1 GCA_000209955 gnl|X|MDHPEDEI_1 31791 32463
UniRef90_A0A173UV77 baeS_1 GCA_000209955 gnl|X|MDHPEDEI_1 32462 33506
Here is one example for a sample.
python panphlan/panphlan_map.py -i samples_fastq/CCMD34381688ST-21-0.fastq.bz2 \
--indexes Eubacterium_rectale/Eubacterium_rectale \
-p Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv \
-o map_results/CCMD34381688ST-21-0_erectale.tsv
One could automatized all samples mapping with a simple loop:
for f in samples_fastq/*; do
fname=$(basename $f)
python panphlan/panphlan_map.py \
-i samples_fastq/${fname} \
--indexes Eubacterium_rectale/Eubacterium_rectale \
-p Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv \
-o map_results/${fname%.*}_erectale.tsv ;
done;
If the output folder map_results/
does not exist, it will be created.
The script has to be run one time for each sample. If samples are numerous, using GNU parallel can be a relevant option.
For a quicker progress through this tutorial, the results from the mapping process for all 25 samples can be directly downloaded :
wget https://www.dropbox.com/s/z81e87dvtzp6pu3/panphlan_tutorial_map_results.tar.bz2
tar -xvjf panphlan_tutorial_map_results.tar.bz2
The most important part of the PanPhlAn software is the profiling part (panphlan_profiling.py
).
python panphlan/panphlan_profiling.py -i map_results/ \
--o_matrix result_profile_erectale.tsv \
-p Eubacterium_rectale/Eubacterium_rectale_pangenome.tsv \
--add_ref
This line will generate the presence/absence matrix of gene families. 1
means the gene family is present in the sample, 0
it is not.
CCMD34381688ST-21-0 G78505 G88884 G88970 G89027 H2M514903 H3M518116 HD-1 HD-5 HV-6 LD-48 M1 M2_48 M2_58 N021 RA023 S353 SID530054 SRS011302 SZAXPI003417-4 T2D-025 T2D-063 T2D-105 W1 YSZC12003_36795
UniRef90_A0A069A530 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0
UniRef90_A0A076YZQ3 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1
UniRef90_A0A095XLP0 1 0 1 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1
UniRef90_A0A095XNF2 1 0 1 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1
UniRef90_A0A095ZX26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1
UniRef90_A0A0D0SCG4 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
In order to assess whether or not a gene family is present in a sample, PanPlhAn draws a coverage curve of gene families for each sample. Such curve can be obtained using the --o_covplot erectale_covplot
creating the erectale_covplot.png
figure :
Here, the peak on the leftmost part of the plot has been cut in order to highlight the plateau and the differences between sample. The option --covplot_ymax 100
has been used.
PanPhlAn's aim is to detect a plateau in the coverage curve. Default threshold can be changed for more or less sensitive options. Too stringent option lead to discarding a sample from the matrix as the species will be judged non present in the sample.
Least stringent option is performed by adding --min_coverage 1 --left_max 1.70 --right_min 0.30
in the command line.
By default only the mapped samples whose coverage curves display a satisfying plateau are present in the final matrix. In order to add the reference genomes in the matrix as well, the option --add_strains
must be specified.
Moreover, if a specific file mapping UniRef90 gene families names to other annotation data is provided, it can be added in the matrix.
Here it is panphlan_Eubacterium_rectale_annot.tsv
retrieved with the pangenome download
For example :
Eubacterium_rectale/panphlan_Eubacterium_rectale_annot.tsv
NR90 NR50 GO KO KEGG Pfam EC eggNOG
UniRef90_A0A069A530 UniRef50_C7H750
UniRef90_A0A069A530 UniRef50_C7H750 PF09035
UniRef90_A0A069A530 UniRef50_C7H750 GO:0015074,GO:0006310 PF09035
UniRef90_A0A076YZQ3 UniRef50_A0A2N9Z693
UniRef90_A0A076YZQ3 UniRef50_A0A2N9Z693 PF02195
UniRef90_A0A076YZQ3 UniRef50_A0A2N9Z693 GO:0003677 PF02195
UniRef90_A0A076YZQ3 UniRef50_A0A2N9Z693 GO:0003677 K03497 rus:RBI_I01277 PF02195
UniRef90_A0A095XLP0 UniRef50_A7B041
UniRef90_A0A095XLP0 UniRef50_A7B041 PF00890,PF02910
given as argument as in here:
panphlan/panphlan_profile.py -c 39491 -i map_results/ -o profile_erectale.csv --funct_annot HUMAnN2_CHOCOPhlAn_201901_functional_annotation_mapping.tsv --verbose
will add an annotation column in the presence/absence matrix. Here, without any specification, the second field is taken (UniRef50) but it can be chosen via the --field
argument (by default --field
has value 1, thus using the second field of the annotation file).
Using the matrix .tsv
file, visualization can be made through PCoA plots or Heatmaps of gene families presence/absence fingerprint for each sample.
For example, a PCoA of E.rectale strains highlights the existence of westernized (AUT, USA, NLD, SWE, GBR...) cluster compared to a non-westernized (CHN) one.
Such clusters can also be visualized through heatmap and hierarchical clustering using the same color code for countries. On such Heatmap, the core genome can also be visualized (yellow columns on the right)
PanPhlAn is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.
- PanPhlAn 3.0
- PanPhlAn 1.3