- Instituto de Ecología, Genética y Evolución de Buenos Aires (IEGEBA), CONICET, Argentina
- Departamento de Ecología, Genética y Evolución (EGE), Facultad de Ciencias Exactas y Naturales (FCEN), Universidad de Buenos Aires (UBA), Argentina
- E-mail: niconm89@gmail.com | nmoreyra@ege.fcen.uba.ar
- Twitter: @nicomoreyra
BUSCO2Tree is a pipeline designed to automate downstream phylogenomic analyses for generating a species tree after running BUSCO on multiple genomes/proteomes/transcriptomes. This guide provides an example data set obtained from a recent study [1] where four lepidopteran genomes were assessed using BUSCO v5.5.0 to obtain completeness scores and dipteran single-copy orthologs. These common BUSCOs among all genomes serve as the input for this pipeline.
- Author
- What is BUSCO2Tree?
- Installation Guide
- Example Data
- How to Run BUSCO2Tree
- Understanding the Output of BUSCO2Tree
- License Information
git clone https://github.com/niconm89/BUSCO2Tree.git
cd BUSCO2Tree
./BUSCO2Tree.py -h
At the time of release, this pipeline was tested with:
BUSCO2Tree calls upon various bioinformatics software tools and uses python modules that are not part of this pipeline. Some tools are mandatory while others are optional. Please install all tools that are required for running BUSCO2Tree in the mode of your choice.
Running BUSCO2Tree requires a Linux-system (only tested on this platform) with bash
and Python3
(and some modules), but depending on user requirements it may be usefull to install TrimAl
and IQTree
as well (see Running BUSCO2tree).
On Ubuntu, Python3 is usually installed by default, and python3
will be in your $PATH
variable, and BUSCO2Tree will automatically locate it. In case your system is not configured in this way, you may include the directory containing the python3 binary to the $PATH
or just call the pipeline's main script BUSCO2Tree.py using a command like:
/path/to/python3 /path/to/BUSCO2Tree.py -h
Note: You may want to simplify
BUSCO2Tree requires the following Python modules to be installed (except to Biopython, the remiaining modules are commonly installed by default with Python):
- argparse
- os
- time
- subprocess
- Biopython
On Ubuntu, install Python3 package manager with:
`sudo apt-get install python3-pip`
Then, install Biopython with:
`sudo pip3 install biopython`
In case you do not have root permitions, you can follow the steps shown in Installing dependencies with Anaconda.
trimAl automatically removes spurious sequences or poorly aligned regions from a multiple sequence alignment. Download trimAl from http://trimal.cgenomics.org/downloads or install it using Anaconda.
Download IQTree from http://www.iqtree.org/#download or install it using Anaconda.
Adding directories containing binaries and scripts to your $PATH
variable enables your system to automatically locate these tools. It is a requirement for running BUSCO2Tree to do this, because BUSCO2Tree will try to find them from the location of the $PATH
variable. To do this:
a) For your current bash
session (you will have to repeat these steps after you close your terminal or finish the current bash sessions), type:
PATH="/home/nmoreyra/Soft/miniconda3/bin:/home/nmoreyra/Soft/bin:$PATH"
export PATH
b) For all your BASH sessions, add the above lines to a linux startup script (e.g.~/.bashrc
).
It is not necessary to install mafft since it is executed using MafftCommandline
module available in Biopython. However, in case you need to make run Mafft by your own, you can download it from here or install it using Anaconda.
On Ubuntu, if you do not have root permissions on the Linux machine, you can install the modules with by setting up an Anaconda (https://www.anaconda.com/distribution/) environment as follows:
conda create -n BUSCO2Tree -c anaconda python #it will create an environment with python including os, argparse, time and subprocess modules
conda activate BUSCO2Tree
conda install -c conda-forge biopython
conda install -c bioconda mafft
conda install -c bioconda trimal
conda install -c bioconda iqtree
Alternatively, user may install dependencies from each software's website.
BUSCO2Tree is a collection of five Python scripts. The main script that will be called in order to run the pipeline is BUSCO2Tree.py, and additional four Python scripts that employ different stages of the analysis are:
-
Step 1: find_singlecopy_BUSCOs.py
-
Step 2: align_BUSCOs.py
-
Step 3: create_matrix.py
-
Step 4: phylogenetic_analysis.py
It is recommended to give executable permissions to all python scripts that are part of BUSCO2Tree as this facilitates the use of the entire pipeline or each script individually. This should already be the case if you download BUSCO2Tree from GitHub.
It is important that the x
in -rwxr-xr-x
is present for each script when you run ls -l BUSCO2Tree.py scripts/*py
. If that is not the case, you may run the following command in order to change file attributes:
chmod a+x BUSCO2Tree.py scripts/*py
You may find it helpful to add the directories in which BUSCO2Tree.py
and supplementary python scripts reside to your $PATH
environment variable. For a single bash session, enter:
PATH=/your_path_to_BUSCO2Tree/:$PATH
export PATH
To make this $PATH
modification available to all bash sessions, add the above lines to a startup script (e.g.~/.bashrc
).
This repository also provides the script BUSCO2Tree_fna.py, which follows the same methodology but utilizes the coding sequences of orthologs to generate the phylogenetic tree instead of using amino acid sequence alignments. This functionality is specifically designed for use with the results of BUSCO runs in genome or transcriptime mode.
The example data provide here was obtained from a study that was part of my PhD, in which I and my collegues search for genomic innovations that could be associated with cactophily and host plant specialization. The study is Phylogenomics provides insights into the evolution of cactophily and host plant shifts in Drosophila. I employed the genomes of the thirteen species included in this study and run BUSCO v5.4.6 on each one in search of BUSCO groups for the eukaryota lineage obtained from ODB v10.
To test BUSCO2Tree, please decompress the BUSCO results zip file placed in the example directory.
BUSCO2Tree.py --help
usage: BUSCO2Tree.py [-h] -s STEPS [STEPS ...] [-o OUTDIR] [-t THREADS] [-b BUSCODIR] [-d ODB] [-l LINEAGE] [-f FASTADIR] [-cnf <config file>] [-cmd <command>] [--trim] [--trimparams TRIMPARAMS] [-a ALIGNDIR] [-fmt {nexus,phylip}] [-m MATRIX] [-p PARTITIONS] [-P PREFIX] [-B BOOTSTRAP] [-st {DNA,AA}]
options:
-h, --help show this help message and exit
General arguments:
-s STEPS [STEPS ...], --steps STEPS [STEPS ...]
Steps to run. Only single or consecutive steps are allowed. Available options are: 1) find common single copy BUSCOs; 2) align
common single copy BUSCOs; 3) create phylogenetic matrix by concatenating BUSCO alignmets; 4) Generate the phylogenetic tree.
-o OUTDIR, --outdir OUTDIR
Path to the directory where the output of each step will be saved. If it does not exists, it will be created. Default:
B2T_output
-t THREADS, --threads THREADS
Number of threads to use. Default: 4
Step1: Find common single-copy BUSCO groups among BUSCO outputs:
-b BUSCODIR, --buscodir BUSCODIR
Path to the directory where individual BUSCO outputs are located (you can select the BUSCO output when running in batch mode).
Directories names will be taken as the genomes names.
-d ODB, --odb ODB OrthoDB version. This parameter is only use to complete the path of the single copy sequences directory. Default: 10
-l LINEAGE, --lineage LINEAGE
OrthoDB lineage. This parameter is only use to complete the path of the single copy sequences directory. Default: eukaryota
Step2: Align common BUSCO groups:
-f FASTADIR, --fastadir FASTADIR
Path to the directory containing the BUSCO groups in fasta format.
-cnf <config file>, --config <config file>
Config file for alignment setting. Users can find a config file template in the docs directory. If no file is provided, the
alignments will be done using default parameters.
-cmd <command>, --command <command>
MAFFT parameters to apply to each alingment. The parmeters must be defined in a command line style, between quote marks, and
the avoiding the names of in/output files, e.g. "--unalignlevel 0.1 --leavegappyregion --ep 0.12 --globalpair --maxiterate
1000".
--trim Trim alignments using trimAl in automated mode, which must be available in the path.
--trimparams TRIMPARAMS
trimAl parameters to apply when removing poorly aligned regions. The parameters must be defined in quotes marks and avoiding
in/output files, e.g. "-gt 0.3".
Step3: Create phylogenetic matrix and partition files:
-a ALIGNDIR, --aligndir ALIGNDIR
Path to the directory containing the alignments in fasta format (step 3).
-fmt {nexus,phylip}, --format {nexus,phylip}
Format for the phylogenetic matrix generated by concatenating BUSCOs alignment files (step 3). Default: phylip
Step4: Genetare phylogenetic tree:
-m MATRIX, --matrix MATRIX
Path to the phylogenetic matrix file.
-p PARTITIONS, --partitions PARTITIONS
Path to the partitions file in nexus format.
-P PREFIX, --prefix PREFIX
Prefix to name the output dataset and results. Default: iqtree
-B BOOTSTRAP, --bootstrap BOOTSTRAP
Number of bootstrap replicate to run. Default: 1000
-st {DNA,AA}, --seqtype {DNA,AA}
Type of sequence. Default: AA
End of the help
The commands shown below were tested using example data.
BUSCO2Tree.py --steps 1 --buscodir Drosophila_proteins --lineage eukaryota --outdir B2T --threads 8
BUSCO2Tree.py --steps 2 --fastadir B2T/01_single-copy/common_busco_sequences --outdir B2T --threads 8
Note: it is possible to activate alignment trimming by setting --trim
parameter. In addition, the parameter --trimparams
can be set when it is necessary to use a timAl user-defined command.
BUSCO2Tree.py --steps 3 --aligndir B2T/02_alignments --outdir B2T --threads 8
Note: if step 2 was run using the --trim/trimparams parameters, trimmed alignments will be placed in BT2/02_alignments/trimAl.
BUSCO2Tree.py --steps 4 --matrix B2T/03_matrix/matrix.phylip --partitions B2T/03_matrix/busco_coords.partitions.tsv --outdir B2T --threads 8
BUSCO2Tree.py -s 1 2 3 4 -b Drosophila_proteins --trim -o B2T -t 8
Note: this command takes ~23 minutes.
BUSCO2Tree.py -s 1 2 -b Drosophila_proteins --config /path/to/BUSCO2Tree/example/mafft_config.txt --trim --trimparams "-gt 0.3" -o B2T -t 8
BUSCO2Tree.py -s 1 2 3 -b Drosophila_proteins -cnf mafft_config.txt --trim --trimparams "-gt 0.3" --format nexus -o B2T -t 8
BUSCO2Tree.py -s 2 3 -f B2T/01_single-copy/common_busco_sequences --trim --trimparams "-gt 0.3" -o B2T -t 8
BUSCO2Tree.py -s 2 3 4 -f B2T/01_single-copy/common_busco_sequences --trim --trimparams "-gt 0.3" -B 500 -P TEST234 -o B2T -t 8
BUSCO2Tree.py -s 3 4 -a B2T/02_alignments --trim -B 1500 -P TEST34 -o B2T -t 8
Running the complete pipeline will generate an output directory containing four main folder, one per each step:
B2T/
├── 01_single-copy
│ ├── common_busco_sequences
│ | ├── #common BUSCO groups fasta files
│ ├── genomes_names.txt
│ └── list_common_busco.txt
├── 02_alignments
│ ├── #common BUSCO groups alignments
│ ├── trimAl
│ │ ├── #trimmed alignments
├── 03_matrix
│ ├── busco_coords.partitions.tsv
│ └── matrix.phylip
└── 04_phylogenetic_tree
├── iqtree.log
└── iqtree.model.gz
- B2T/01_single-copy/common_busco_sequences: it contains common BUSCO groups among species in fasta files.
- B2T/01_single-copy/genomes_names.txt: file with names of the species placed in the directory set for --buscodir.
- B2T/01_single-copy/list_common_busco.txt: file with IDs of common BUSCO group.
- B2T/02_alignments: it contains alignments of common BUSCO groups in fasta format.
- B2T/02_alignments/trimAl: it contains trimmed alignments in fasta format.
- B2T/03_matrix/busco_coords.partitions.tsv: partition file in nexus format. Each partition corresponds to a BUSCO group.
- B2T/03_matrix/matrix.phylip: phylogenetic matrix file in phylip format (nexus is also allowed).
- B2T/04_phylogenetic_tree: it contains the IQtree outputs.
When running BUSCO2Tree step by step (or using combinations of them), it is recommended to place the results (--outdir) in the same main directory where subdirectories containing files (fasta, aligment, and matrix/partitions files) are. For instance, following the tree directory shown above, if we want to run step 3:
BUSCO2Tree.py --steps 3 --aligndir B2T/02_alignments/trimAl --outdir B2T --threads 8
Check that --aligndir in set with a directry placed within the B2T directory and --outdir is thus set with B2T. As a result, a directory named "03_matrix" will be created within B2T.