This software will produce a table visualization of the presence or absence of encoded proteins in a given set of genomes. Both, the proteins and set of proteomes are required as input by the user.
The primary output will be the table of sORTholog protein data filled using the NCBI Protein Database, as well as PDF and PNG files of the table visualization.
- Installation and Running
- Important Notes for running on the UFL cluster
- Walk-Through and File Production
- Pipeline in image
- Rule descriptions
- Rule 1: fetch_fasta_from_seed
- Rule 2: fetch_proteins_database
- Rule 2.1.1: make_seed_psiblast (optional)
- Rule 2.1.2: psiblast (optional)
- Rule 2.1.3: read_psiblast (optional)
- Rule 2.2.1: hmmsearch (optional)
- Rule 2.2.2: read_hmmsearch (optional)
- Rule 2.3: merge_hmmsearch_psiblast (optional)
- Rule 2.4: make_fasta (optional)
- Rule 3: merge_fasta
- Rule 4: blast
- Rule 5: prepare_for_silix
- Rule 6: silix
- Rule 7: find_family
- Rule 8: make_PA_table
- Rule 9: plots
- Additional rules
This pipeline uses the Snakemake workflow, that is only available on unix environment. Any linuxOS or MacOS should be able to run sORTholog. Windows users may have access to unix either using a cluster working on unix or using the ubuntu virtual environment.
Snakemake and Snakedeploy are best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it can be installed via Mambaforge. For other options see here.
To install Mamba by conda, run
conda install mamba -n base -c conda-forge
Given that Mamba is installed, run
mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
to install both Snakemake and Snakedeploy in an isolated environment.
For all following commands (step 2 to 5) ensure that this environment is activated via
conda activate snakemake
Given that Snakemake and Snakedeploy are installed and available (see Step 1: install Snakemake and Snakedeploy), the workflow can be deployed as follows.
First, create an appropriate working directory, in which sORTholog will be deployed, on your system in the place of your choice as follow (note to change the path and file name to the one you want to create):
mkdir path/to/sORTholog-workdir
NB: More on the mkdir command.
Then go to your sORTholog working directory as follow:
cd path/to/sORTholog-workdir
Second, run
snakedeploy deploy-workflow https://github.com/vdclab/sORTholog . --tag 0.4.7
Snakedeploy will create two folders workflow
and config
. The former contains the deployment of the chosen workflow as a Snakemake module, the latter contains configuration files which will be modified in the next step in order to configure the workflow to your needs. Later, when executing the workflow, Snakemake will automatically find the main Snakefile
in the workflow
subfolder.
The three files described in this section (Taxid file, Seed file, and Config file) can be edited with a text editor. We recommend notepad++ if you do not have already a favorite text editor.
The taxid file is a table, in the format of a tabulated text file (e.g. .txt, .tsv). This table should contain 2 columns: TaxId
and NCBIGroups
.
The TaxId
columns should be filled with the TaxId of the genome you want to analyze. You can use any clade TaxId, sORTholog will take care of downloading all the genome of this clade.
The NCBIGroups
can be left blanked, but knowing this group would make the download of genomes faster. Here a list of accepted terms: all
, archaea
, bacteria
, fungi
, invertebrate
, metagenomes
, plant
, protozoa
, vertebrate_mammalian
, vertebrate_other
, viral
. This will allow to look in the right folder in the FTP of NCBI to download the proteome. By choosing all
, it will look in all folders and slow down the process.
Ideally your taxid file should be in your sORTholog working directory or in the config
folder.
Example provided in the file doc/dummy_taxids.tsv folder in the GitHub page.
The seed file contains the list of proteins you want to identify in your genomes. It is a table, in the format of a tabulated text file (e.g. .txt, .tsv). Ideally your seed file should be in your workdir
folder or in the config
folder.
Here is a list of collumns for this file:
- seed : Mandatory, name of the protein as you want it to appear in the final figure.
- protein_id: Mandatory, either the NCBI protein id of a protein you are sure have the function you are searching for.
- hmm: name of the hmm profile file associated with the protein family, including the extension, often
.hmm
. (no default, if not mentioned, psiBLAST will be triggered instead) - evalue: Number between 0 and 1, e-value threshold of blast.
- pident: Number between 0 and 1, percentage of identity (expressed in frequency) threshold of blast.
- cov: Number between 0 and 1, coverage threshold (expressed in frequency) for blast results. The coverage is based on the coverage of the query.
- color: hexadecimal code of the color you want the positive results to be in the final figure for this seed. Example provided in the file doc/dummy_seeds.tsv folder in the GitHub page.
- Columns in the seed files should be separated by tabulation, introduction of spaces or other hidden characters might disrupt the reading of the file.
- Do not introduce empty lines in the seed file.
- Make sure to have the correct extension for your hmm profile.
This file is in the config
folder, and is named config.yaml
. You can edit this file following these instructions (also in comments in the config.yaml
file)
- seed : path and name of your seed file.
- taxid: path and name of your taxid file.
- project_name: Name of your project. This will create a new folder in the folder
results
that will contains all the files generated during the run, including your final results. - output_folder: Name of the folder you want the generated file to be directed to. By default, it is set to
../sORTholog_deployed/results
.
\
, but unix uses foreward slash /
. Make sure to use the latter.
- ndg_option:
- section:
refseq
(default) orgenbank
. Database from which you are pulling the proteomes. - assembly_levels:
all
(default),complete
,chromosome
,scaffold
orcontig
. Type of assembly you want. - refseq_categories:
reference
,all
(default). Only refseq reference genomes or all of refseq. - groups:
all
(default),archaea
,bacteria
,fungi
,invertebrate
,metagenomes
,plant
,protozoa
,vertebrate_mammalian
,vertebrate_other
,viral
. Set this group if unspecified in theNCBITaxId
columns of the Taxid file.
- section:
- update_db:
True
orFalse
(default). Update the Taxonomy dump, will increase run time.
- perso_database: Path to personal proteome database. It should consist of a multifasta file with all the proteins you want to add to the search.
- perso_annotation: table in tabulated text format that contains the information of the annotation of the fasta file in perso_database, columns: protein_id, genome_id[, genome_name]
\
, but unix uses foreward slash /
. Make ure to use the latter.
- speedup:
True
orFalse
. Define if you want to use psi-blast or hmmsearch to filter your proteome before running a blast all vs all. It is recommended on very large amount of genomes.
- psiblast_e_val: Number between 0 and 1, default
0.01
. E-value used to accept a result in psi-blast. It is recommended to use a high e-value to gather super family related protein rather than being too stringent in this step. - iteration: Number between 0 and 5, default
5
. Number of iteration of the psi-blast. The more iterations, the more you gather phylogenetically distant proteins.
- hmm_profiles: path and name of your folder containing all the hmm profiles mentioned in your seed file.
- e_val: Number between 0 and 1, default
0.0001
. E-value default threshold if left empty in the seed file. - focus:
domain
orfull
(default: full). Analyse the results over the full size of the query or based on domain detection.
- filter:
e_value
,score
orboth
(default: e_value). Se up if you want to filter your blast results by e-value, bit score or both. - e_val: Number between 0 and 1, default
0.0001
. E-value default threshold if left empty in the seed file. - pid: Number between 0 and 1, default
0.35
. Percentage of identity (expressed in frequency) default threshold if left empty in the seed file. - cov: Number between 0 and 1, default
0.8
. Coverage of the query (expressed in frequency) default threshold if left empty in the seed file.
- cov_min:
mean
,subject
,query
,shortest
orlongest
(default: mean). Source or the length divider to calculate the coverage (from query, subject or an average of both) - pid_min:
mean
,subject
,query
,shortest
,longest
orHSP
(default: mean). Source or the length divider to calculate the percentage of identity (from query, subject or an average of both) - length_min: positive number, default 100. Mininimum length to accept partial sequences in families
default_values_plot:
- color: hexadecimal color, (default:
#131516
) - colored_border:
True
orFalse
(default). Turn the color of the border a darker shade inferred from the background.
Here a comparison of the two behaviours:
- round_border:
True
orFalse
(default). Turn border to roundish shape.
Here a comparison of the two behaviours:
- PAtab_table: Path of the table you want to use to transform it into the plot figure. The table has to be a tabulated text file of a table that has the seeds in columns, the genomes in lines.
Given that the workflow has been properly deployed and configured, it can be executed as follows.
For running the workflow while deploying any necessary software via conda, run Snakemake with
snakemake --cores 1 --use-conda
If you want to run the workflow with the additional step to speedup the analysis that contains a big dataset you can either:
- Change in the
config.yaml
the parameterspeedup
and change the value toTrue
- Run the workflow adding
-C speedup=True
to the command line as follow
snakemake --cores 1 --use-conda -C speedup=True
You can have access to your result following the path:
[path/to/sORTholog-workdir]/results/[project-name]/results
You will find the table of presence of absence patab_table.tsv
, the melt version of this table patab_melt.tsv
and a folder plots
containing the figure representing the table.
After finalizing your data analysis, you can automatically generate an interactive visual HTML report for inspection of results together with parameters and code inside of the browser via
snakemake --report report.zip --report-stylesheet config/report.css
The resulting report.zip file can be passed on to collaborators, provided as a supplementary file in publications.
- To work on the HiPerGator server, user must first open a terminal (for unix) or command prompt (for windows).
User must access the server by entering the following command:
ssh (GatorLink ID)@hpg.rc.ufl.edu
User will be required to enter their GatorLink account password, then asked to be verified by dual push.
- User will require access to files on the server. To access these files via command line, use the following:
cd /blue/(your group)/(GatorLink ID)
You can access to the files of this folder on your computer by adding a server. The address for Windows is as follow: \\exasmb.rc.ufl.edu
, for Unix (Linux or MacOS) the address: smb://exasmb.rc.ufl.edu
.
More info here.
Before starting the installation of the program, you will need to load conda
; this required software is listed as follows:
module load conda
Before running the workflow, another config file need to be assessed to configure the slurm on HiPerGator. This file is config/slurm/cluster-config.yaml
Here are the options:
- account: account linked to your HiPerGator. (insert your account here)
the subsequent options can be left by default:
- qos: qos account name for HiPerGator.
- time: time per rule in minutes.
- nodes: number of nodes to use per rules.
- ntasks: number of simultaneous tasks.
- cpus-per-task: number of CPU per tasks on HiPerGator.
- mem: memory allocated per rule if not indicated elsewhere.
- output: destination of the output log.
- error: destination of the error log.
- job-name: name of the job on HiPerGator.
Given that the workflow has been properly deployed and configured, it can be executed as follows.
For running the workflow while deploying any necessary software via conda and running on the cluster, run Snakemake with
snakemake -j 5 --use-conda --profile config/slurm
As previously, if you want to run the workflow with the additional step to speedup the analysis that contains a big dataset you can either:
- Change in the
config.yaml
the parameterspeedup
and change the value toTrue
- Run the workflow adding
-C speedup=True
to the command line as follow:
snakemake -j 5 --use-conda --profile config/slurm -C speedup=True
As previously you can also generate a report following Step 5: Generate report.
If you want to run it as a job,
- make sure that the pipeline is deploy where you are and that you change the
config/config.yaml
accordingly to your needs - create a file name, for exemple
my_project.sh
, and copy and paste the following, changing your e-mail address and other relevant parameters:
#!/bin/sh
#SBATCH --mail-user=ufid@ufl.edu # Where to send mail
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --job-name=sbatch_snakemake # Name of the job in slurm
#SBATCH --cpus-per-task=1 # Use 1 core
#SBATCH --mem=4G # Memory allocated to run the workflow
#SBATCH --time=7-0:00:00 # Time for the workflow to run, format DD-HH:MM:SS
#SBATCH --nodes=1 # Number of node to start the workflow
#SBATCH --output=logs/sbatch_snakemake-%j.out # Output of the sbatch job
#SBATCH --error=logs/sbatch-snakemake-%j.err # Error log of the sbatch job, including the snakemake workflow notifications
echo "Date start $(date)"
# activate a specific conda environment, if you so choose
module load conda
conda activate snakemake
# make things fail on errors
set -o nounset
set -o errexit
set -x
### run your commands here!
snakemake --cores 1 --use-conda --profile config/slurm
snakemake --report report.zip --report-stylesheet config/report.css
echo "Date end $(date)"
- run the following command
sbatch my_project.sh
This pipeline consists of 9/12 steps called rules that take input files and create output files. Here is a description of the pipeline.
-
As snakemake is set up, there is a last rule, called
all
, that serves to call the last output files and make sure they were created. -
A folder containing your work will be created:
[project_name]/ <- top-level project folder (your project_name)
│
│
├── report.html <- This report file
│
├── logs <- Collection of log outputs, e.g. from cluster managers
│
├── databases <- Generated analysis database related files
│ ├── all_taxid
│ │ ├─ protein_table.tsv <- Table with the informations about the proteins of the downloaded taxid
│ │ ├─ summary_assembly_taxid.tsv <- Table with the informations about the downloaded genome from NCBI
│ │ ├─ taxid_all_together.fasta <- Fasta file of the downloaded taxid
│ │ └─ taxid_checked.txt <- List of the downloaded taxid
│ │
│ ├── merge_fasta
│ │ └─ taxid_checked.txt <- Fasta with the concatenation of the genome of interest and seeds
│ │
│ └── seeds
│ ├─ seeds.fasta <- Fasta file of the seeds
│ └─ new_seeds.tsv <- Table with the informations about the seeds
│
├── analysis_thresholds <- (Optional) Generate by the rule report_thresholds
│ ├── tables
│ │ └─ table--seed.tsv <- (Optional) Table with the information of each pair of hits in for the seed family (one per seed)
│ │
│ └── report_figure_thresholds.html <- (Optional) HTML report with the plots made by report_thresholds
│
└── results <- Final results for sharing with collaborators, typically derived from analysis sets
├── fasta <- (Optional) folder with the fasta file of the orthologs of the seeds
├── patab_melt.tsv <- Table with the information of sORTholog one information by line
├── patab_table.tsv <- Table with the information of presence absence with genome in index and seeds in columns and proteins Id in the cell
└── plots <- Plots and table on which the plot are created
-
Before starting the pipeline, your taxids will be checked and updated for the lowest descendant levels. This will create a file of checked taxids. It will skip this step if this file already exists.
-
When restarting the pipeline, the software will check if you made any changes in the seed file before running. If changes have been made, it will run what is necessary, else nothing will happen.
Rule description: Fetch the fasta protein sequence of the seed from the seed table. Then they are writen in the output file.
· input file:
- seed file: type = str
format = seed name | protein id | hmm | e-value | percentage of identity | coverage | color
description: seed file input by user
· output files:
- multi fasta file: type = str
description = multifasta output of the seed protein sequences
- new seed file: type = str
columns = seed name | protein id | hmm | e-value | percentage of identity | coverage | color
description = Same file as input but the 'protein id' of each file is changed to match the fasta file
Rule description: Fetch the information for each protein of each genome in the taxid list. That includes: the protein ncbi id, sequence, length and annotation, as well as in which genome it is found. Information for the genome include genome ncbi id, name, taxid and if complete or partial.
· input file:
- taxid file: type = str
description = list of taxid in column, second additional column for teh Taxid NCBI group
· output files:
- multi fasta file: type = str
description = multifasta output of the genome proteomes in one file
- table of genomes informations: type = str
format = assembly_accession | bioproject | biosample | wgs_master | excluded_from_refseq | refseq_category | relation_to_type_material | taxid | species_taxid | organism_name | infraspecific_name | isolate | version_status | assembly_level | release_type | genome_rep | seq_rel_date | asm_name | submitter | gbrs_paired_asm | paired_asm_comp | ftp_path
description = table of the information collected from the NCBI Assembly.
- table of protein informations: type = str
format = protein id | protein name | genome name | genome id | length
description = table of the information collected on the proteins.
- new taxid file: type = str
description = list of taxid downloaded in column
Rule description: filter the seed fasta to keep only the seed that do not have a hmm profile associated with them in the seed initial seed file
· input file:
- fasta seed: type = str
description = multifasta of the seed sequences
· output file:
- fasta: type = str
description = multifasta of the seed sequences that do not have a hmm profile associated with
Rule description: Use the seed protein sequences to make a psi-BLAST against all the taxid proteome.
· input files:
- seed fasta: type = str
description = multifasta of the seed sequences that do not have a hmm profile associated with
- taxid fasta: type = str
description = multifasta of the genome sequences
· output file:
- blast out: type = str
format = query accession | query length | query sequence | query start position | querry end position | subject accession | subject length | subject sequence| subject start position | subject end position | length of alignment | percentage of identity | e-value | bitscore | querry coverage
description = blast out format 6 in tabulation, no header
Rule description: Read the psiBLAST output, remove unwanted lines and extract the list of protein matches.
· input files:
- psiblast output: type = str
format = query accession | query length | query sequence | query start position | querry end position | subject accession | subject length | subject sequence| subject start position | subject end position | length of alignment | percentage of identity | e-value | bitscore | querry coverage
description = blast out format 6 in tabulation, no header
· output files:
- list all prot: type = str
description = list of all potein identifications gathered in the psiBLAST in column
Rule description: Use hmm profile(s) provided to fetch similar proteins in the taxid proteome using hmmsearch from hmmer
· input file:
- hmm: type = list of str
description = list of hmm profile provided in the hmm folder and indictaed in the seed table
- taxid db: type = str
description = multifasta of the taxid proteome
· output file:
- domtblout: type = str
format = protein id | protein accession | protein length | query | query accession | query length | full evalue | full score | full bias | domain number | domain of | domain c-evalue | domain c-evalue | domain score | domain_bias | query from | query to | ali from | ali to | env from | env to | target description
description = domtblout output format of hmmsearch
- seed hmm: type = str
description = merge file of the hmm profile provided
Rule description: Read the hmmsearch output and extract the list of protein matches.
· input file:
- hmm out: type = str
format = protein id | protein accession | protein length | query | query accession | query length | full evalue | full score | full bias | domain number | domain of | domain c-evalue | domain c-evalue | domain score | domain_bias | query from | query to | ali from | ali to | env from | env to | target description
description = domtblout output format of hmmsearch
· output file:
- list all prot: type = str
description = list of all potein identifications gathered in the hmmsearch in column
Rule description: merge the list of proteines from both the psiBLAT and hmmsearch results.
· input files:
- list hmmsearch: type = str
description = list of all potein identifications gathered in the hmmsearch in column
- list psiblast: type = str
description = list of all potein identifications gathered in the psiBLAST in column
· output file:
- list all prot: type = str
description = list of all potein identifications gathered in both the psiblast and the hmmsearch in column
Rule description: Create a fasta file from the psiblast results and the result of the protein information in the rule cat_proteins_info.
· input files:
- protein table: type = str
format = protein id | protein name | genome name | genome id | length
description = table of protein information
- list all prot: type = str
description = list of all potein identifications gathered in both the psiblast and the hmmsearch in column
· output file:
- multifasta: type = str
description = multifasta file of all the unique protein ids.
Rule description: Merge the proteome of interest with the seeds
· input files:
- taxid fasta: type = str
description = all proteome multifasta or multifasta file of all the unique protein ids if speedup true
- seed fasta: type = str
description = multifasta file of all the seeds from the rule fetch_fasta_from_seed
· output file:
- fasta for blast: type = str
description = concatenation of the 2 input multifasta files
Rule description: Blast all versus all of the fasta of all proteins.
· input file:
- fasta for blast: type = str
description = concatenation of the the proteome taxid and seeds multifasta files
· output file:
- blast out: type = str
format = query id | subject id | percentage of identity | length of match | mismatch | gapopen | query start position | query end position | subject start position | subject end position | e-value | bitscore
description = output format 6 of blast, no header
Rule description: Filter the blast results from the rule blast with the threshold specified for each seed in the seed file. Filters include the identity score, coverage and e-value, decided by the user. Create one new filtered blast result for each seed.
· input files:
- seed file: type = str
format = seed name | protein id | hmm | e-value | percentage of identity | coverage | color
description: seed file input by user
- blast out: type = str
format = query id | subject id | percentage of identity | length of match | mismatch | gapopen | query start position | query end position | subject start position | subject end position | e-value | bitscore
description = blast output 6 from the rule blast
· output files:
- blast output to send to silix: type = list of str
format = query id | subject id | percentage of identity | length of match | mismatch | gapopen | query start position | query end position | subject start position | subject end position | e-value | bitscore
description = list of blast output filtered for each seed.
Rule description: Uses Silix to create a network of protein and give a file of the protein segregated in groups. If the blast output file is empty, just create an empty file.
· input files:
- blast out: type = str
format = query id | subject id | percentage of identity | length of match | mismatch | gapopen | query start position | query end position | subject start position | subject end position | e-value | bitscore
description = blast output filtered for a specific seed from the rule prepare_for_silix.
- fasta: type = str
description = multifasta of proteins with seed from the rule merge fasta
· output file:
- fnodes: type = str
format = family | protein id
description = fnodes file, table of protein id and family number, without headers.
Rule description: Find the group of each seed in each individual seed and record it.
· input files:
- fnodes: type = str
format = family | protein id
description = fnodes file, table of protein id and family number, without headers from the rule silix.
- seed file: type = str
format = seed name | protein id | hmm | e-value | percentage of identity | coverage | color
description: seed file input by user
· output file:
- updadted fnodes: type = str
format = family | protein id | seed
description = updated fnodes with only the family of the seed.
Rule description: Check the presence of protein similar to the seed in each taxid and create a table of presence abscence. This table will be plotted in a colored table.
· input files:
- seed file: type = str
format = seed name | protein id | hmm | e-value | percentage of identity | coverage | color
description: seed file input by user
- protein table: type = str
format = protein id | protein name | genome name | genome id | length
description = final table of protein information
- updadted fnodes: type = str
format = family | protein id | seed
description = updated fnodes with only the family of the seed
· output files:
- final table melt: type = str
format = genome_id | genome_name | seed | PA | color | protein_id
description = presence/abscence table, with header. Each line is a protein information
- final table crosstab: type = str
format = row = genome_id, columns = seed 1 | seed 2 .. seed x
description = presence/abscence table, with header. Each line is a genome, each column is a seed.
Rule description: Plot the table from make_PA_table
in a colored table.
· input file:
- final table melt : type = str
format = genome_id | genome_name | seed | PA | color | protein_id
description = presence/abscence table, with header. Each line is a protein information
· output files:
- pdf: type = list of str
description = plots in pdf of the final table centered on one seed
- png: type = list of str
description = plots in png of the final table centered on one seed
Rule description: Pair of rules that works together. From the blast output, create a report containing 2 scatter plots: one comparing the coverage and percentage identity of the hits and the other adding the evalue to the comparaison.
To run the rules do
snakemake report_threshold --cores 1 --use-conda
· input files:
- seed file: type = str
format = seed name | protein id | hmm | e-value | percentage of identity | coverage | color
description: seed file input by user
- protein table: type = str
format = protein id | protein name | genome name | genome id | length
description = final table of protein information.
- updadted fnodes: type = str
format = family | protein id | seed
description = updated fnodes with only the family of the seed
- blast out: type = str
format = query id | subject id | percentage of identity | length of match | mismatch | gapopen | query start position | query end position | subject start position | subject end position | e-value | bitscore
description = output format of blast
· output file:
- table_thresholds : type = str
format = protein1 | protein 2 | pident | coverage | evalue | length
description = table with the columns from the blast output of only the pairs containing at least one member of the seed family.
· input file:
- tables_thresholds : type = list of str
format = protein1 | protein 2 | pident | coverage | evalue | length
description = table with the columns from the blast output of only the pairs containing at least one member of the seed family.
· output file:
- report_html : type = str
description = report in html that contains the figure made using the coverage/percentage identity/evalue of all hits for each seeds family
Rule description: Create a fasta for each orthologs found (one fasta file per seed)
To run the rule do
snakemake extract_protein --use_conda --cores 1
or if on the cluster, do
snakemake extract_protein --use_conda --cores 1 --profile config/slurm
· input files:
- final table melt : type = str
format = genome_id | genome_name | seed | PA | color | protein_id
description = presence/abscence table, with header. Each line is a protein information
- prot sequence: type = str
description = multifasta file of all the unique protein ids
· output file:
- fasta files : type = list of str
description = name of all the fasta output with the format [name of the seed].fasta
Rule description: Plot a presence absence table that you may have edited in pdf and png.
To run the rule do
snakemake quick_plots -C PAtab_table=path/to/pa_table.tsv --use_conda --cores 1
or if on the cluster, do
snakemake quick_plots -C PAtab_table=path/to/pa_table.tsv --use_conda --cores 1 --profile config/slurm
Example of the presence/absence table
in the doc folder in the GitHub page
· input files:
- PA table : type = str
format = genome_id in the first column and the next columns are annotation columns
description = presence/abscence table, with header.
· output files:
- pdf: type = list of str
description = plots in pdf of the final table centered on one seed
- png: type = list of str
description = plots in png of the final table centered on one seed
Rule description: Remove the folder database, logs and processing_files to only keep results
To run the rule do
snakemake clean --cores 1
· input files:
- database, logs and processing_files directory