
cd-hit-dup Segmentation fault (core dumped) #139

Open
scross92 opened this issue Aug 23, 2023 · 5 comments

@scross92
I installed cd-hit-auxtools via conda. However, when I run cd-hit-dup, it fails with Segmentation fault (core dumped). Users in previous issues #27 and #97 also hit segmentation fault errors, but those were with cd-hit-est, and neither issue came to a clear conclusion on how to fix them.

I tried removing and reinstalling the cd-hit-auxtools in my conda env but that did not resolve the issue.

I feel it may be a relatively easy fix that I am overlooking. I appreciate any help!

Here is the output when I run cd-hit-dup:

cd-hit-dup -i /path/to/fastq -o output.fq -u 30 -e 2
Read 100000 sequences ...
From input: /path/to/fastq
Total number of sequences: 125659
Longest: 76
Shortest: 75
Start clustering duplicated sequences ...
primer = 0
Clustered 10000 sequences with 9987 clusters ...
Clustered 20000 sequences with 19938 clusters ...
Clustered 30000 sequences with 29863 clusters ...
Clustered 40000 sequences with 39766 clusters ...
Clustered 50000 sequences with 49655 clusters ...
Clustered 60000 sequences with 59512 clusters ...
Clustered 70000 sequences with 69342 clusters ...
Clustered 80000 sequences with 79195 clusters ...
Clustered 90000 sequences with 89032 clusters ...
Clustered 100000 sequences with 98821 clusters ...
Clustered 110000 sequences with 108635 clusters ...
Clustered 120000 sequences with 118421 clusters ...
Number of reads: 125659
Number of clusters found: 123942
Number of clusters with abundance above the cutoff (=1): 123942
Number of clusters with abundance below the cutoff (=1): 0
Writing clusters to files ...
Segmentation fault (core dumped)

@replikation
A core dump is usually a resource problem: too little disk space or RAM.

@scross92 (Author)
Normally I would agree. However, I have plenty of disk space (nearly 20 TB) and RAM (250 GB) available, so that shouldn't be the issue. Interestingly, cd-hit-dup runs completely fine from a Singularity container on this same machine, so this looks like a problem with the current conda release of cd-hit-auxtools. I may adjust my pipeline to use a Singularity container instead.
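
For anyone who wants to try the same, a minimal sketch of the container route (the image tag is a placeholder; pick a current one from the biocontainers registry):

# Hypothetical invocation: run cd-hit-dup from the biocontainers image
# instead of the conda build. Replace <tag> with a real tag from quay.io.
singularity exec docker://quay.io/biocontainers/cd-hit-auxtools:<tag> \
    cd-hit-dup -i /path/to/fastq -o output.fq -u 30 -e 2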

@sapoudel commented Feb 13, 2024

@scross92 Which container are you using? I am having the same issue with both the conda build and biocontainers' image:
https://quay.io/repository/biocontainers/cd-hit-auxtools?tab=tags&tag=latest

I also have plenty of disk space and RAM available, and subsampling to 25k reads still gives the error, so it's not an out-of-memory issue. However, it works if I only pass one read direction at a time, e.g.:

cd-hit-dup -i <(zcat fastq/fastq_1.fastq.gz) -o output.fq1
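
Applied to both mates separately (my assumption of the obvious extension; note this deduplicates each file independently rather than as pairs):

# Assumed extension of the single-direction workaround: run each mate on
# its own. This avoids the crash but does not deduplicate reads as pairs.
cd-hit-dup -i <(zcat fastq/fastq_1.fastq.gz) -o output.fq1
cd-hit-dup -i <(zcat fastq/fastq_2.fastq.gz) -o output.fq2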

@epiliper
Also experiencing this issue, using the same container @sapoudel mentioned. I've also tried compiling the cd-hit-dup binary in an Ubuntu Docker container of my own, which hits the same crash. The workaround @sapoudel described avoids the seg fault, but it's less than ideal since it doesn't merge paired-end reads.

Wanted to bump.

@epiliper
UPDATE with a solution that worked for me:

I don't think cd-hit-dup can take reads from a process substitution of zcat on gzipped fastqs; I managed to run deduplication in paired-end mode by unzipping the fastq.gz files into plain fastqs first and using those.

The code below shows cd-hit-dup running on paired-end data in a Nextflow pipeline using this biocontainer:
https://quay.io/repository/biocontainers/cd-hit-auxtools

## cd-hit-dup seg faults when -i/-i2 are given gzipped input, so unzip first
gunzip ${prefix}_1.fastq.gz
gunzip ${prefix}_2.fastq.gz

cd-hit-dup \
    -i  ${prefix}_1.fastq -o  ${prefix}_1_dedup.fastq \
    -i2 ${prefix}_2.fastq -o2 ${prefix}_2_dedup.fastq
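
If downstream steps expect gzipped files, a small follow-up (my addition, not part of the original snippet) recompresses the output:

# Assumed follow-up: recompress deduplicated fastqs for downstream steps
gzip ${prefix}_1_dedup.fastq ${prefix}_2_dedup.fastq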

The code below would cause a seg fault:

cd-hit-dup \
    -i  <(zcat ${prefix}_1.fastq.gz) -o  ${prefix}_1_dedup.fastq.gz \
    -i2 <(zcat ${prefix}_2.fastq.gz) -o2 ${prefix}_2_dedup.fastq.gz

For context, I tried this with 2 files containing ~1e6 reads each.
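
For pipelines that only receive gzipped inputs, a self-contained wrapper along these lines (a sketch built on the finding above; the argument layout and file names are hypothetical) avoids touching the original archives:

#!/usr/bin/env bash
# Sketch: decompress to a temp dir, run cd-hit-dup in paired-end mode on
# plain fastqs, recompress the deduplicated output, and clean up. Built on
# the thread's finding that cd-hit-dup crashes on piped/gzipped input but
# works on plain fastq files.
set -euo pipefail

r1_gz=$1; r2_gz=$2; out_prefix=$3        # hypothetical argument layout
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

gunzip -c "$r1_gz" > "$tmp/r1.fastq"     # keeps the original .gz intact
gunzip -c "$r2_gz" > "$tmp/r2.fastq"

cd-hit-dup \
    -i  "$tmp/r1.fastq" -o  "${out_prefix}_1_dedup.fastq" \
    -i2 "$tmp/r2.fastq" -o2 "${out_prefix}_2_dedup.fastq"

gzip "${out_prefix}_1_dedup.fastq" "${out_prefix}_2_dedup.fastq"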
