iCREs-based GRN inference feature #23

Merged: 17 commits, May 24, 2024
11 changes: 9 additions & 2 deletions .gitignore
@@ -8,6 +8,13 @@ singularity_cache/
# ignore large motif mapping files
*motif_mappings*.bed

# ignore icres files (too large for repo)
*icres*.bed

# ignore iCREs output files (confidentiality until publication)
example/outputs_icres/*
!example/outputs_icres/.gitkeep

# ignore nf-test executable
nf-test

@@ -18,8 +25,8 @@ nf-test
tests/outputs/

# ignore SLURM output and error files
-slurm.*.out
-slurm.*.err
+slurm*.out
+slurm*.err

# ignore jupyter notebook checkpoints
.ipynb_checkpoints/
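
The SLURM change in this hunk swaps `slurm.*.out` / `slurm.*.err` for `slurm*.out` / `slurm*.err`. SLURM's default output name is `slurm-<jobid>.out`, which the old pattern does not cover because it requires a literal dot right after `slurm`. The sketch below illustrates the difference with Python's `fnmatch`, which only approximates gitignore globbing for plain file names; the file names are hypothetical.

```python
from fnmatch import fnmatch

OLD_PATTERN = "slurm.*.out"  # needs a literal "." immediately after "slurm"
NEW_PATTERN = "slurm*.out"   # accepts any characters (or none) after "slurm"

# Hypothetical file names, chosen only for illustration.
names = ["slurm.12345.out", "slurm-12345.out", "slurm_job.out"]


def matches(pattern, candidates):
    """Return the candidates that match the glob pattern."""
    return [name for name in candidates if fnmatch(name, pattern)]


old_hits = matches(OLD_PATTERN, names)
new_hits = matches(NEW_PATTERN, names)
print("old pattern ignores:", old_hits)
print("new pattern ignores:", new_hits)
```

To check real gitignore behaviour rather than this approximation, `git check-ignore -v <file>` reports which rule, if any, ignores a given path.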
26 changes: 20 additions & 6 deletions README.md
@@ -13,12 +13,10 @@ MINI-AC uses a dual license to offer the distribution of the software under a pr

Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and two maize genome versions (B73 RefGen_v4 and B73 RefGen_v5). Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
* **genome-wide**: strategy where the whole non-coding genome is considered for motif mappings. It captures all the ACRs of the input dataset for the GRN prediction, which is advised when working with species with long intergenic regions and distal regulatory elements, like maize for example.
-* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check the instructions [here](docs/configuration_pipeline.md).

+* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check the instructions [here](docs/pipeline_configuration.md).

A detailed overview of the necessary input files and expected output files can be found in this [example](example), done on **maize V4 with the genome-wide mode**, and using as input a single-cell-derived ACR dataset of mesophyll and bundle sheath.


## **Inputs**
* **MINI-AC mode**: genome-wide or locus-based.
* **Species**: Arabidopsis or maize (maize genome version 4 or 5).
@@ -63,15 +61,31 @@ NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10

## Usage


-Define the paths with the input files and the desired parameters setting in the [configuration file](docs/configuration_pipeline.md), and run it executing the following Nextflow command:
+Define the paths with the input files and the desired parameters setting in the [configuration file](docs/pipeline_configuration.md), and run it executing the following Nextflow command:

```shell
nextflow -C mini_ac.config run mini_ac.nf --mode <genome_wide|locus_based> --species <arabidopsis|maize_v4|maize_v5>
```

Having problems running MINI-AC? Check the [FAQ](docs/FAQ.md).

## iCREs-based MINI-AC [NOT AVAILABLE UNTIL PUBLICATION]

Given the amount of resources available to profile regulatory DNA in maize, we curated a collection of integrated cis-regulatory elements (iCREs) by combining and comparing different CRE-profiling methods (details to be published).

We implemented a new framework in which it is possible to run MINI-AC given a list of maize genes. It works by retrieving the genomic coordinates of the iCREs associated with the genes of interest, and submitting them to motif enrichment and GRN inference using the genome-wide mode of MINI-AC. iCREs-based MINI-AC can only be run for maize, and not for Arabidopsis. In addition, we offer two sets of iCREs to choose from for a run: the "maxF1" (`maxf1`) set or the "all" (`all`) set. The first is a smaller but more precise set of putative CREs (fewer false positives), while the second is a more comprehensive collection of maize putative CREs.

To download the files with the genomic coordinates of the iCREs, the following commands should be executed in the **top-level directory of the repository**:

```shell
NOT AVAILABLE UNTIL PUBLICATION
```

To run iCREs-based MINI-AC, the [configuration file](./mini_ac_icres.config) should be prepared as explained [here](./docs/pipeline_configuration.md). Only two parameters change in comparison to the regular MINI-AC runs. Instead of providing a BED file with ACR genomic coordinates, a list of gene IDs from the maize genome version V4 or V5 should be provided, as exemplified [here](./example/inputs/gene_set_files/UP_gene_set.txt). In addition, an iCREs set should be specified (`maxf1` or `all`). Next, the following Nextflow command should be executed:

```shell
nextflow -C mini_ac_icres.config run mini_ac_icres.nf --icres_set <all|maxf1> --species <maize_v4|maize_v5>
```
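
As a minimal sketch of how the two allowed parameter values could be checked before launching a run, the hypothetical helper below (`build_icres_command` is not part of MINI-AC) assembles the command shown above. The allowed values are taken from this README; the command is only built and printed, not executed.

```python
import shlex

# Allowed values as stated in the README text above.
ICRES_SETS = {"all", "maxf1"}
SPECIES = {"maize_v4", "maize_v5"}


def build_icres_command(icres_set, species, config="mini_ac_icres.config"):
    """Return the Nextflow argument list for an iCREs-based run (hypothetical helper)."""
    if icres_set not in ICRES_SETS:
        raise ValueError(f"icres_set must be one of {sorted(ICRES_SETS)}, got {icres_set!r}")
    if species not in SPECIES:
        raise ValueError(f"species must be one of {sorted(SPECIES)}, got {species!r}")
    return [
        "nextflow", "-C", config, "run", "mini_ac_icres.nf",
        "--icres_set", icres_set, "--species", species,
    ]


cmd = build_icres_command("maxf1", "maize_v5")
print(shlex.join(cmd))
```

Rejecting `arabidopsis` up front mirrors the restriction that iCREs-based MINI-AC runs only on maize.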

## Support

@@ -81,7 +95,7 @@ Should you encounter a bug or have any questions or suggestions, please [open an

When publishing results generated using MINI-AC, please cite:

-Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal. https://doi.org/10.1111/tpj.16483.
+Nicolás Manosalva Pérez, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal 117, no. 1 (2024): 280–301. https://doi.org/10.1111/tpj.16483.

## Contact

51 changes: 51 additions & 0 deletions bin/geneList2iCREs.py
@@ -0,0 +1,51 @@
# %%
import argparse


def parseArgs():
    # Three positional arguments: annotated iCREs BED, gene list, output BED.
    parser = argparse.ArgumentParser(
        description='Script to get a BED file with iCREs '
                    'coordinates given a list of genes',
        conflict_handler='resolve')

    parser.add_argument('annotated_icres', type=str,
                        help='BED file with 4th column being '
                             'an annotated gene ID')

    parser.add_argument('gene_list', type=str,
                        help='One column file containing gene IDs '
                             'of interest')

    parser.add_argument('bed_of_genes_icres', type=str,
                        help='Output BED file with coordinates '
                             'of iCREs associated with genes of interest')

    return parser.parse_args()


args = parseArgs()

annot_icres = args.annotated_icres
genes_oi_file = args.gene_list
output_file = args.bed_of_genes_icres

# %%
# Collect the gene IDs of interest (first column of the gene list file).
genes_oi = set()

with open(genes_oi_file, "r") as fin:
    for line in fin:
        rec = line.strip().split("\t")
        gene_id = rec[0]
        if gene_id:  # skip blank lines
            genes_oi.add(gene_id)

# Write the coordinates (columns 1-3) of every iCRE annotated to a gene of interest.
with open(output_file, "w") as fout:
    with open(annot_icres, "r") as fin:
        for line in fin:
            rec = line.strip().split("\t")
            if len(rec) < 4:  # skip blank or malformed lines
                continue
            gene_id = rec[3]
            if gene_id in genes_oi:
                fout.write("\t".join(rec[0:3]))
                fout.write("\n")
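
The filtering step in `bin/geneList2iCREs.py` can be tried without any files. The sketch below restates the same logic as a small function over in-memory data; `filter_icres` is a hypothetical name, and the coordinates and gene IDs are made up for illustration.

```python
import io


def filter_icres(annotated_icres, genes_of_interest):
    """Yield (chrom, start, end) for iCREs whose 4th column is a gene of interest."""
    for line in annotated_icres:
        rec = line.rstrip("\n").split("\t")
        if len(rec) >= 4 and rec[3] in genes_of_interest:
            yield tuple(rec[0:3])


# Toy annotated BED content: chrom, start, end, annotated gene ID.
bed = io.StringIO(
    "chr1\t100\t200\tGeneA\n"
    "chr1\t300\t400\tGeneB\n"
    "chr2\t50\t80\tGeneA\n"
)

hits = list(filter_icres(bed, {"GeneA"}))
for chrom, start, end in hits:
    print(chrom, start, end, sep="\t")
```

Holding the gene IDs in a set keeps each membership test constant-time, so the BED file is streamed once regardless of how many genes are requested.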
Empty file added data/icres/.gitkeep
2 changes: 1 addition & 1 deletion docs/FAQ.md
@@ -2,7 +2,7 @@

## Q: MINI-AC failed, how can I fix it?
A:
-* Check the [config file](/docs/configuration_pipeline.md):
+* Check the [config file](/docs/pipeline_configuration.md):
* Did you specify the correct [executor](https://www.nextflow.io/docs/latest/executor.html) (e.g. SGE, SLURM, ...)? Cluster-related options (i.e., all the lines starting with `clusterOptions`) should also be adapted to match the options of the selected executor.
* Did you [specify to Singularity the path to the temporary directory](https://docs.sylabs.io/guides/3.5/user-guide/bind_paths_and_mounts.html)? It can be done by adjusting the parameter ```runOptions``` of singularity in Nextflow to ```--bind /absolute/path/to/tmp/folder```. To know the absolute path to the tmp folder in linux execute in the command line ```echo $TMPDIR```
