Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pb_read and pb_write functions #115

Merged
merged 16 commits into from
Dec 30, 2023
Merged

Add pb_read and pb_write functions #115

merged 16 commits into from
Dec 30, 2023

Conversation

tanho63
Copy link
Collaborator

@tanho63 tanho63 commented Dec 27, 2023

Closes #97.

I thought briefly about making this a wrapper around pb_download_url + read function that accepts URLs, but I don't think it had the flexibility I wanted plus I ran into issues downloading from private repositories that I later learned was around not being able to pass an auth token to it.

I think this is the most flexible approach to the problem but would love to hear any thoughts

@tanho63 tanho63 requested a review from cboettig December 27, 2023 23:37
Copy link
Member

@cboettig cboettig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this strategy (specifically the guess_read_function()) always makes me nervous. While all abstractions are a bit leaky, I've always found attempts to write a generic "read" function as an abstract method particularly leaky. There's just too much going on in read_ methods to abstract away (what about other data serializations, like spatial formats? what about lazy reads / remote reads etc).

I'm fine with a convenience wrapper that makes common use patterns more concise. Maybe all that's needed is a bit more documentation saying something like 'for common formats such as ..., along with advice about how to bypass this convenience if a user prefers their own read function (say, from readr` package, or for some other format).

@tanho63
Copy link
Collaborator Author

tanho63 commented Dec 29, 2023

too much going on in read_ methods to abstract away (what about other data serializations, like spatial formats?

I believe this will fail on spatial formats since it would not be one of csv, tsv, rds, parquet? (unfamiliar with how geoparquet works and whether arrow::read_parquet will process geoparquet, but I assume yes?).

what about lazy reads / remote reads etc?

Yep, this will read eagerly by design/default, and maybe it's a bad thing for folks who should be thinking about optimizing - however uninformed users would currently do pb_download anyways so it's not necessarily much different than that?

I agree with improving the docs, e.g.

  • explaining that cloud native is best performance (more clearly linking to vignette)
  • demonstrating ways to read from URL (maybe adding to getting started vignette?)
  • better demonstrating examples of passing in a different read function and explaining what is supported ?

R/pb_read.R Show resolved Hide resolved
R/pb_read.R Show resolved Hide resolved
R/pb_write.R Show resolved Hide resolved
vignettes/piggyback.Rmd Outdated Show resolved Hide resolved
@tanho63 tanho63 requested a review from cboettig December 30, 2023 18:47
@tanho63
Copy link
Collaborator Author

tanho63 commented Dec 30, 2023

Flow state hit me like a bus, many apologies for this PR running away from me. Since your last review (diff), I:

  • gitignored + got rid of the stored docs/ folder because ropensci builds pkgdown externally + I've been using it to test out how the vignette looks after generating it
  • updated DESCRIPTION with the newly shortened description from the readme (because I realized it hadn't been updated)
  • fleshed out man files for read_function and write_function args, added man files for guess_read_function and guess_write_function that fully explain how it gets mapped
  • added various \dontshow blocks to hide the interactive() and try() blocks as inspired by the examplesIf
  • updated README to not evaluate any chunks except for regenerating codemeta (improves syntax highlighting in RStudio to use {r} instead of just r)
  • rewrote vignette/piggyback.Rmd again to try and address your various feedback points - this is probably the main thing that needs looked at

Copy link
Member

@cboettig cboettig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@tanho63 tanho63 merged commit 4589222 into master Dec 30, 2023
6 checks passed
@tanho63 tanho63 deleted the tan/pb-rw/97 branch December 30, 2023 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pb_read / pb_write to read/write directly into memory?
2 participants