API to change output suffix to .gz #35

Open
d-cameron opened this issue Oct 31, 2019 · 7 comments

Comments

@d-cameron
Contributor

The majority of VCF-handling bioinformatics libraries use a .vcf.gz suffix, even for block-gzipped output. writeVcf() with index=TRUE does not support this and forcibly sets the suffix to .bgz.

The following commands do exactly the same thing:

writeVcf(vcf, "example.vcf", index=TRUE)
writeVcf(vcf, "example.vcf.bgz", index=TRUE)
writeVcf(vcf, "example.vcf.gz", index=TRUE)

Desired behaviour: specifying a .vcf.gz output file actually writes to that file, instead of silently changing the suffix of the output file to .vcf.bgz.
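
Until such an API exists, one possible workaround is to rename the output after writing. The following is a minimal sketch in base R, not part of VariantAnnotation: `rename_bgz_to_gz` is a hypothetical helper name, and it assumes the tabix index sits next to the data file as `<file>.tbi`.

```r
# Hypothetical helper (not part of VariantAnnotation): rename a bgzipped
# VCF and its tabix index from a .bgz suffix to a .gz suffix.
rename_bgz_to_gz <- function(bgz_path) {
  stopifnot(grepl("\\.bgz$", bgz_path), file.exists(bgz_path))
  gz_path <- sub("\\.bgz$", ".gz", bgz_path)
  if (!file.rename(bgz_path, gz_path)) {
    stop("could not rename ", bgz_path)
  }
  # Assumption: the index is stored alongside the data file as <file>.tbi.
  tbi_old <- paste0(bgz_path, ".tbi")
  if (file.exists(tbi_old)) {
    file.rename(tbi_old, paste0(gz_path, ".tbi"))
  }
  invisible(gz_path)
}

# Usage (needs VariantAnnotation, so shown as comments only):
# writeVcf(vcf, "example.vcf", index = TRUE)  # writes example.vcf.bgz, per the report above
# rename_bgz_to_gz("example.vcf.bgz")         # leaves example.vcf.gz (+ .tbi if present)
```

Renaming the index together with the data file keeps the two paired for tools that look up the index by appending .tbi to the data file name.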

@d-cameron
Contributor Author

PapenfussLab/gridss#269

@bschilder
Contributor

I agree, this is very unexpected. Could a warning at least be generated when writeVcf changes the path name?

@vjcitn
Contributor

vjcitn commented May 3, 2022

I suppose a message could be produced. The man page for bgzip in Rsamtools shows why this is happening. If you have time to make a PR the code of interest is at

filenameGZ <- bgzip(scon$description, overwrite = TRUE)
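
A rough sketch of what such a message could look like, in base R only. `report_path_change` is a hypothetical name, not existing VariantAnnotation code; it assumes the caller has both the requested file name and the path that was actually written (for example, the value assigned to `filenameGZ` above).

```r
# Hypothetical helper (not VariantAnnotation code): emit a message when the
# path actually written differs from the path the caller asked for.
report_path_change <- function(requested, written) {
  changed <- !identical(requested, written)
  if (changed) {
    message("output written to '", written, "' instead of '", requested, "'")
  }
  invisible(changed)
}

# Emits a message because the suffix differs from the one requested.
report_path_change("example.vcf.gz", "example.vcf.bgz")
```

Using message() rather than warning() keeps the notice informational; which of the two is appropriate is a judgment call for the maintainers.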

@bschilder
Contributor

bschilder commented May 3, 2022

In the interest of bettering VariantAnnotation, while also being mindful I'm not able to devote too much time to projects I'm not a maintainer or author of, I propose the following divvying of work:

@vjcitn

  • readVcf: implement the faster chunking method in readVcf discussed here (or other ways of speeding up this functionality).
  • vcf2df: I can include my initial version in my PR (I've made some further improvements since the original post), but after I've submitted the PR I'd ask that you optimise this further to reduce compute time as much as possible.

@bschilder

  • select_vcf_fields: Function that speeds up readVcf by only importing non-empty columns.
  • writeVcf: Add the message discussed in this present Issue.

Does this sound fair to you? I can get started once item # 1 on your list is completed (readVcf). That way the rest of the changes I plan to make will be optimised for the updated version of VariantAnnotation.

Best,
Brian

@hpages
Contributor

hpages commented Aug 24, 2022

@vjcitn @bschilder Are you guys planning to follow up on this?

@vjcitn
Contributor

vjcitn commented Aug 24, 2022

@bschilder did you make the PR that you mentioned? I cannot work on the other piece for some time.

@bschilder
Contributor

bschilder commented Aug 25, 2022

@hpages I didn't make this PR because @vjcitn determined it was beyond the scope of VariantAnnotation to include these functionalities (or at least some of them). So instead, I added them to our lab's package MungeSumstats. If you've changed your mind about this @vjcitn I'd be happy to share my existing code.

In summary, I personally have no basis for going into readVcf and trying to speed it up. It may be possible, but given that the original authors have left the project and there is a reasonable path to divide and conquer with the ingestion process, I am not inclined to do much more.

  • regarding vcf2df : Converting VCF --> data.table #57 (comment)

  • regarding select_vcf_fields: Not sure a decision was reached on this. @vjcitn could you confirm whether you think this would be useful to integrate into VariantAnnotation? I seem to recall some concerns about omitting certain columns based on the sampling of only the first N rows.

  • regarding writeVcf: @vjcitn not sure I heard back about this, is this something you'd find helpful for me to implement?

4 participants