API to change output suffix to .gz #35

d-cameron · 2019-10-31T06:14:04Z

The majority of VCF-handling bioinformatics libraries use a .vcf.gz suffix, even for block-gzipped output. writeVcf() with index=TRUE does not support this and forcibly sets the suffix to .bgz.

The following commands do exactly the same thing:

writeVcf(vcf, "example.vcf", index=TRUE)
writeVcf(vcf, "example.vcf.bgz", index=TRUE)
writeVcf(vcf, "example.vcf.gz", index=TRUE)

Desired behaviour: when the output file is specified as .vcf.gz, writeVcf() writes to that file instead of silently changing the suffix to .vcf.bgz.
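
Until such an option exists, one possible workaround is to rename the files after writing. The sketch below is base R only and assumes that the tabix index sits next to the data file as <file>.tbi; rename_bgz_to_gz is a hypothetical helper name, not part of VariantAnnotation or Rsamtools.

```r
# Hypothetical helper (not part of VariantAnnotation): rename a bgzipped VCF
# and, if present, its tabix index from the .bgz suffix to .gz.
rename_bgz_to_gz <- function(bgz_path) {
  stopifnot(grepl("\\.bgz$", bgz_path), file.exists(bgz_path))
  gz_path <- sub("\\.bgz$", ".gz", bgz_path)
  ok <- file.rename(bgz_path, gz_path)
  # Assumption: the index was written alongside the data file as <file>.tbi
  tbi <- paste0(bgz_path, ".tbi")
  if (file.exists(tbi)) {
    ok <- ok && file.rename(tbi, paste0(gz_path, ".tbi"))
  }
  if (!ok) stop("could not rename ", bgz_path)
  gz_path
}

# Usage, after one of the writeVcf() calls above has produced example.vcf.bgz:
# rename_bgz_to_gz("example.vcf.bgz")
```

Renaming leaves the file contents untouched, so the output is still block gzipped; only the suffix changes.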

d-cameron · 2019-10-31T06:14:17Z

PapenfussLab/gridss#269

bschilder · 2022-05-03T18:12:12Z

I agree, this is very unexpected. Could a warning at least be generated when writeVcf changes the path name?

vjcitn · 2022-05-03T19:16:31Z

I suppose a message could be produced. The man page for bgzip in Rsamtools shows why this is happening. If you have time to make a PR, the code of interest is at

VariantAnnotation/R/methods-writeVcf.R

Line 257 in e966a1b

filenameGZ <- bgzip(scon$description, overwrite = TRUE)
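
A rough sketch of what such a message could look like, in base R only. The path rule below models the behaviour reported at the top of this issue (any trailing .gz or .bgz ends up as .bgz); it is an assumption, not code taken from Rsamtools or VariantAnnotation, and both function names are hypothetical.

```r
# Models the behaviour reported in this issue (an assumption, not the
# Rsamtools source): a trailing .gz or .bgz is dropped, then .bgz is appended.
reported_output_path <- function(requested) {
  paste0(sub("\\.b?gz$", "", requested), ".bgz")
}

# Hypothetical helper: emit a message when the path actually written
# differs from the path the caller asked for.
notify_if_renamed <- function(requested,
                              actual = reported_output_path(requested)) {
  if (!identical(actual, requested)) {
    message("writeVcf: output written to '", actual,
            "' rather than '", requested, "'")
  }
  invisible(actual)
}

# notify_if_renamed("example.vcf.gz")   # emits a message
# notify_if_renamed("example.vcf.bgz")  # silent, the path is unchanged
```

In a real PR the actual path would presumably be the value returned by bgzip() on the line quoted above, so the modelled rule would not be needed.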

bschilder · 2022-05-03T20:31:10Z

In the interest of bettering VariantAnnotation, while also being mindful I'm not able to devote too much time to projects I'm not a maintainer or author of, I propose the following divvying of work:

@vjcitn

readVcf: implement the faster chunking method in readVcf discussed here (or other ways of speeding up this functionality).
vcf2df: I can include my initial version in my PR (I've made some further improvements since the original post), but after I've submitted the PR I'd ask that you optimise this further to reduce compute time as much as possible.

@bschilder

select_vcf_fields: Function that speeds up readVcf by only importing non-empty columns.
writeVcf: Add the message discussed in this present Issue.

Does this sound fair to you? I can get started once item # 1 on your list is completed (readVcf). That way the rest of the changes I plan to make will be optimised for the updated version of VariantAnnotation.

Best,
Brian

hpages · 2022-08-24T20:06:27Z

@vjcitn @bschilder Are you guys planning to follow up on this?

vjcitn · 2022-08-24T20:32:18Z

@bschilder Did you make the PR that you mentioned? I cannot work on the other piece for some time.

bschilder · 2022-08-25T12:43:59Z

@hpages I didn't make this PR because @vjcitn determined it was beyond the scope of VariantAnnotation to include these functionalities (or at least some of them). So instead, I added them to our lab's package MungeSumstats. If you've changed your mind about this @vjcitn I'd be happy to share my existing code.

regarding readVcf: Speed up readVcf #59 (comment)

In summary, I personally have no basis for going into readVcf and trying to speed it up. It may be possible, but given that the original authors have left the project and there is a reasonable path to divide and conquer with the ingestion process, I am not inclined to do much more.

regarding vcf2df : Converting VCF --> data.table #57 (comment)
regarding select_vcf_fields: Not sure a decision was reached on this. @vjcitn could you confirm whether you think this would be useful to integrate into VariantAnnotation? I seem to recall some concerns about omitting certain columns based on the sampling of only the first N rows.
regarding writeVcf: @vjcitn not sure I heard back about this, is this something you'd find helpful for me to implement?

vjcitn added the enhancement label May 3, 2022

bschilder mentioned this issue May 4, 2022

Speed up readVcf #59

Closed
