
Expose dandiarchive as webdav service #166

Closed
4 of 5 tasks
yarikoptic opened this issue Dec 6, 2023 · 31 comments · Fixed by #167

yarikoptic commented Dec 6, 2023

This would facilitate integration with various external projects (e.g., OSDF with Pelican underneath) that can interface with WebDAV services. Treat it as unifying our API under a standard API for file access.

TODOs:

  • decide how to map dandiset versions, either within the path or via some WebDAV-supported versioning mechanism (if one exists)
  • double-check whether WebDAV supports "redirection", so that we do not need to tunnel all bytes from S3 through the WebDAV service but can simply redirect
  • implement the mapping/endpoint: it might even be just the /paths endpoint, but likely not, since that one is dandiset-specific; maybe it could live at the top level of the API
  • decide on implementation
    • could be internal to dandi-archive. Pros: more efficient, since it would operate directly on the DB within Python, likely minimizing flooding of the logs; it would come "included" by default with any dandi-archive installation; it would not require another web server; and since our API is still at 0.x and introduces breakages from time to time, it would be easier to keep consistent. Cons: it would likely be easier/faster to develop/fix as an independent service
    • could be an additional external service. Pros: faster/easier to develop; would stress-test our API and reveal what it is missing. Cons: would require running yet another web server
  • deploy as webdav.dandiarchive.org or at some other endpoint (discussion below)

Maybe it could be implemented as an independent service, not part of the API.

attn @jwodder -- how much did you play with webdav?

edits:

  • https://github.com/mar10/wsgidav looks very promising and already has (non-production) backends for various things, e.g. for mercurial: https://wsgidav.readthedocs.io/en/latest/addons-mercurial.html .
  • re structure -- I think we should largely mimic our S3 layout, just adding a little more consistency: have dandisets/ and zarrs/ (not zarr/) folders (a problem for zarrs/ though -- heavy folder listing).
  • on the best way to expose versions, some possibilities, assuming we have some global prefix ending with {dandiset_id}:
    • following the mercurial example above -- have draft/, released/ (or latest/ -- the most recent release), and releases/{VERSION}. It does require going through draft/ or another prefix folder to get to content, but it is consistent, so I like it most.
    • have draft/, released/ (or latest/), and then all versions at the same level. Kinda OK, but since numbered releases would most likely sort first, oldest first, I think it would not be that convenient a default view...
    • "smarty pants" version -- make versioning optional by reacting to a version regex at the top level, and otherwise serve the tree of the draft or most recent version... too smart -- don't like it
    • following our URL schema dandi://INSTANCE/DANDISET_ID[@VERSION][/PATH] and thus incorporating the version into the dandiset folder name -- IMHO ugly
  • FWIW: filed Wishlist: example/support for redirection to online (http/https) resources mar10/wsgidav#303
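The draft/ / latest/ / releases/{VERSION} layout preferred above can be pinned down as a small path parser. This is only an illustrative sketch: the function name, the six-digit dandiset-id pattern, and the returned triple are assumptions, not an agreed interface.

```python
import re

# Assumed layout from the discussion above:
#   dandisets/{dandiset_id}/draft/...
#   dandisets/{dandiset_id}/latest/...
#   dandisets/{dandiset_id}/releases/{VERSION}/...
DAV_PATH_RE = re.compile(
    r"^dandisets/(?P<dandiset_id>\d{6})/"
    r"(?:(?P<selector>draft|latest)|releases/(?P<version>[0-9.]+))"
    r"(?:/(?P<asset_path>.*))?$"
)

def parse_dav_path(path: str) -> tuple[str, str, str]:
    """Split a WebDAV path into (dandiset_id, version selector, asset path)."""
    m = DAV_PATH_RE.match(path.strip("/"))
    if m is None:
        raise ValueError(f"not a recognized path: {path!r}")
    version = m["selector"] or m["version"]
    return m["dandiset_id"], version, m["asset_path"] or ""
```

For example, `parse_dav_path("/dandisets/000108/draft/sub-01/a.nwb")` yields `("000108", "draft", "sub-01/a.nwb")`.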

jwodder commented Dec 6, 2023

@yarikoptic

how much did you play with webdav?

None.


satra commented Dec 6, 2023

i would add a check to this issue for whether webdav can be done without egressing data bytes through a server.

yarikoptic commented:

i would add a check to this issue for whether webdav can be done without egressing data bytes through a server.

I added a check; and here is a reply from ChatGPT (haven't tried it yet):

Yes, you can implement a WebDAV backend that does not store or provide the actual file bytes but instead redirects to a target URL elsewhere when a client requests a file. This can be achieved by customizing the behavior of the WebDAV server to handle requests in a way that serves external resources through redirection.

Here's a simplified example of how you can create a WebDAV server that redirects requests to external URLs using the PyWebDAV library in Python:

from pywebdav.server import DAVServer, DAVResource
from pywebdav.types import DAVError

class RedirectDAVResource(DAVResource):
    def __init__(self, path, redirect_url):
        super().__init__(path)
        self.redirect_url = redirect_url

    def GET(self):
        raise DAVError(302, self.redirect_url)

class MyDAVServer(DAVServer):
    def __init__(self, root_path):
        super().__init__()
        self.root_path = root_path

    def get_resource_inst(self, path):
        # In this example, we return a RedirectDAVResource for all files
        # You can add logic here to determine if a file should be redirected
        return RedirectDAVResource(path, "https://example.com/external/resource")

if __name__ == '__main__':
    server = MyDAVServer('/path/to/your/data/store')
    server.run()

In this example, we define a custom RedirectDAVResource class that inherits from DAVResource. When a GET request is made to this resource, it raises a 302 HTTP status code with a Location header set to the desired redirection URL.

The get_resource_inst method of the MyDAVServer class returns instances of RedirectDAVResource for all requested files, but you can customize this logic to decide which files should be redirected and specify the target URL accordingly.

Please note that this is a basic example, and you can expand on it to meet your specific requirements for redirection, access control, and handling different types of resources. Depending on your use case, you might also want to handle other WebDAV methods (e.g., PUT, DELETE) as needed.

@yarikoptic yarikoptic transferred this issue from dandi/dandi-archive Dec 7, 2023
yarikoptic commented:

I think the best approach would be to start with an independent service just to try feasibility etc., using https://github.com/mar10/wsgidav, and set it up so that for now we could just try it locally; if it works nicely, we then reassess inclusion into dandi-archive.

@jwodder could you please provide a prototype implementation using https://github.com/mar10/wsgidav, looking at the available backends (e.g. for mercurial: https://wsgidav.readthedocs.io/en/latest/addons-mercurial.html) and others, so we establish WebDAV over api.dandiarchive.org and expose it as a read-only WebDAV service with the following tree:

  • dandisets/{dandiset_id} with following possible folders under
    • draft/ - always there, has a tree for current draft version
    • latest/ - present if there was a released version, would have a tree of the most recent version
    • releases/ with {version} subfolders - if there were releases

For individual files, it should forward (redirect) to the S3 HTTP URL.
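A sketch of what that forwarding could build on, assuming the archive's existing REST shape: an asset-listing endpoint that accepts a path prefix filter, and a per-asset download endpoint that itself redirects to S3. The endpoint paths and parameters here are assumptions to verify against the live API, not a confirmed contract.

```python
from urllib.parse import urlencode

API_URL = "https://api.dandiarchive.org/api"  # production API root

def asset_listing_url(dandiset_id: str, version: str, path_prefix: str = "") -> str:
    # Assumed: the asset listing endpoint accepts a `path` prefix filter,
    # which a collection (directory) listing would be built from.
    query = urlencode({"path": path_prefix, "page_size": 1000})
    return f"{API_URL}/dandisets/{dandiset_id}/versions/{version}/assets/?{query}"

def asset_redirect_url(dandiset_id: str, version: str, asset_id: str) -> str:
    # A WebDAV GET for a file could simply redirect here; this endpoint in
    # turn answers with a redirect to the S3 URL, so no bytes are tunneled.
    return f"{API_URL}/dandisets/{dandiset_id}/versions/{version}/assets/{asset_id}/download/"
```

The design point is that the WebDAV layer never proxies file bytes: it only maps paths to URLs and lets HTTP redirects do the rest.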


jwodder commented Dec 7, 2023

@yarikoptic Where should this wsgidav instance be deployed?

yarikoptic commented:

For now I have no hosting ready for it -- this should be code/a service anyone could just run/try locally. If we see that it works well, we would then either proceed with adopting it within our dandi-archive instance(s) or look into establishing separate hosting for it.


jwodder commented Dec 13, 2023

@yarikoptic

  • I'm assuming you want the WebDAV view to be read-only; is that correct?
  • How exactly should assets under a given version be laid out? Should there be a flat listing of all assets in the version, or should the assets be grouped into the directory hierarchy implied by the forward slashes in their paths?
  • How should Zarr assets be represented? Should they just be directories of entries or something else?
  • What's the point of the dandisets/ path prefix? Is the root of the WebDAV service ever going to contain any entries other than dandisets/?

yarikoptic commented:

@yarikoptic

  • I'm assuming you want the WebDAV view to be read-only; is that correct?

For this portion -- yes! Depending on our success with the read-only facility, we might (much later) want to look into supporting uploads through it too -- that might be quite cool / user-friendly (as long as we can provide feedback etc.).

  • How exactly should assets under a given version be laid out? Should there be a flat listing of all assets in the version, or should the assets be grouped into the directory hierarchy implied by the forward slashes in their paths?

Not flat -- they should follow their directory hierarchy, like we have in the datalad dandisets and the files view on dandiarchive.

  • How should Zarr assets be represented? Should they just be directories of entries or something else?

my webdav knowledge is very limited... AFAIK a "directory" in webdav is also a collection; in other words, I do not know a way to have different "types" of collections. So zarr assets should indeed be just directories, and paths under them should be redirected to the corresponding paths under the corresponding zarr on S3.
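Since Zarr entries sit 1-to-1 under a zarr/{zarr_id}/ prefix on the bucket (as the s3cmd listing later in this thread shows), the redirect target is pure string construction. A minimal sketch; the public bucket URL is an assumption:

```python
BUCKET_URL = "https://dandiarchive.s3.amazonaws.com"  # assumed public bucket endpoint

def zarr_entry_redirect(zarr_id: str, entry_path: str) -> str:
    # Mirror the bucket layout: s3://dandiarchive/zarr/{zarr_id}/{entry_path}
    return f"{BUCKET_URL}/zarr/{zarr_id}/{entry_path.lstrip('/')}"
```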

  • What's the point of the dandisets/ path prefix? Is the root of the WebDAV service ever going to contain any entries other than dandisets/?

I thought of it indeed as some kind of future-proofing, since we do have separate "prefix/"es on S3; if there is demand, we might later want to expose zarrs/ or metadata/ or some other elements.


jwodder commented Dec 13, 2023

@yarikoptic

paths under them should be redirected to corresponding paths under corresponding zarr on S3.

What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects. When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?

yarikoptic commented:

@yarikoptic

paths under them should be redirected to corresponding paths under corresponding zarr on S3.

What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects.

hm,

  • the whole premise/point of working on WebDAV for us is that the service is useful only if it can operate not by channeling bytes through itself but by just redirecting to the corresponding URLs on S3. Hence the 2nd checkbox in the original description of this issue
  • searching for "redirect" within wsgidav finds some hints, and the issue I filed was labeled (instead of being closed as "not possible"), suggesting that it is generally possible for WebDAV and likely for wsgidav
  • it should be just a regular HTTP redirect
    • in the longer run (again -- if we find it useful) I thought we could make them 308 Permanent Redirect for assets in released versions, and 302 Found for assets in the draft version. For now 302 would be good for all redirects.
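That policy is small enough to state as code; a sketch of the eventual rule (the thread settles on 302 everywhere for now):

```python
def redirect_status(version: str) -> int:
    """Pick a redirect status for an asset GET.

    Draft content is mutable, so clients must not cache the target;
    released versions are immutable, so a permanent redirect is safe.
    """
    return 302 if version == "draft" else 308
```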

When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?

yes -- entries within the Zarr organized into a directory hierarchy -- pretty much 1-to-1 with how it is on S3; overall that gives an example of which redirects/responses we need.


jwodder commented Dec 13, 2023

@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437), but the wsgidav implementation does not seem to support defining redirects.


jwodder commented Dec 13, 2023

@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:

yarikoptic commented:

@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437),

phewph, good!

but the wsgidav implementation does not seem to support defining redirects.

:-/ is it possible to just reply with some standard HTTP response there?

@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:

  • While the Archive has an endpoint that groups assets by directory, dandi-cli currently does not support it.

even without adding it to dandi-cli, is it much more than just a client.paginate request directly to the API?

since embargoed zarrs are not even supported yet and everything is public, let's just use boto directly to get a listing of the "index" for an S3 prefix.

yarikoptic commented:

FWIW, for redirects there was a fresh follow-up, mar10/wsgidav#303 (comment), confirming that they are not directly supported but possibly relatively easy to add in order to test the idea out.


jwodder commented Dec 13, 2023

@yarikoptic

is it possible to just reply with some standard HTTP response there?

Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.
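One conceivable workaround, not verified against wsgidav: since a WsgiDAVApp is just a WSGI application, a thin middleware in front of it could answer GET/HEAD for file paths with a redirect before the DAV machinery is ever reached, while PROPFIND and other methods pass through. A minimal sketch, where resolve is a hypothetical callable mapping a request path to an S3 URL (or None to fall through):

```python
class RedirectMiddleware:
    """WSGI wrapper that short-circuits GET/HEAD with an HTTP redirect."""

    def __init__(self, app, resolve):
        self.app = app          # the wrapped WSGI app (e.g. a wsgidav WsgiDAVApp)
        self.resolve = resolve  # callable: request path -> target URL or None

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") in ("GET", "HEAD"):
            target = self.resolve(environ.get("PATH_INFO", ""))
            if target is not None:
                start_response("302 Found", [("Location", target)])
                return [b""]
        # Everything else (PROPFIND, OPTIONS, ...) goes to the DAV app.
        return self.app(environ, start_response)
```

This keeps wsgidav itself unmodified; only the file GETs are intercepted.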

let's just use boto directly to get a listing of the "index" for S3 prefix.

I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.


yarikoptic commented Dec 13, 2023

@yarikoptic

is it possible to just reply with some standard HTTP response there?

Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.

what about the overload approach mentioned in mar10/wsgidav#303 (comment) ?

let's just use boto directly to get a listing of the "index" for S3 prefix.

I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.

nope. I did listing of directories in datalad (now in datalad-deprecated) with old boto, and you can do that quickly (takes no time -- less than a sec), e.g. with

❯ time s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
2022-04-21 23:26         7859  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
2022-02-26 22:22           24  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
2022-04-21 23:26        14925  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
s3cmd -c ~/.s3cfg-dandi-backup ls   0.10s user 0.02s system 19% cpu 0.572 total

in the CLI.

chatgpt gave the following example code for boto3, which runs in 0.5 sec locally for me (so it is not listing the entire zarr), and there may be even better ways:
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create a new S3 client with anonymous access
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = 'dandiarchive'
prefix = 'zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/'

def list_directories_and_files(bucket, prefix):
    paginator = s3_client.get_paginator('list_objects_v2')
    result = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
    
    for page in result:
        if "CommonPrefixes" in page:
            for subdir in page['CommonPrefixes']:
                print('Subdirectory: ' + subdir['Prefix'])

        if "Contents" in page:
            for file in page['Contents']:
                if not file['Key'].endswith('/'):
                    print('File: ' + file['Key'])

list_directories_and_files(bucket_name, prefix)
❯ time python <(xclip -o)
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
python <(xclip -o)  0.22s user 0.03s system 49% cpu 0.512 total


jwodder commented Dec 13, 2023

@yarikoptic

what about the overload approach mentioned in mar10/wsgidav#303 (comment) ?

I have no idea how to implement that as a user of wsgidav without forking wsgidav.

yarikoptic commented:

update: @jwodder redid it in Rust in https://github.com/dandi/dandidav . A sample instance is running at https://dandi.centerforopenneuroscience.org/ (not automatically deployed). Sample external service URLs to try:

Next stage would be deployment:

  • @dandi/archive-admin could you guide us through the steps for "properly" "integrating" that beastie as webdav.dandiarchive.org within our infrastructure, so that configuration is centralized etc.
  • @satra -- is it for you to register a subdomain?


satra commented Jan 31, 2024

the subdomain can be registered through our aws account via route53, but that involves us running the service. has forwarding to s3 for retrieval been implemented? if not, i would at least start with integrating at a local level so that anyone could run it. if yes, then proceed with the setup.

yarikoptic commented:

has forwarding to s3 for retrieval been implemented?

AFAIK yes. @jwodder can confirm whether that is generally so. A known exception is dandiset.yaml (Rust-knowledgeable folks can review the source). Anyways, it would be nice to also add some traffic/load/request stats for that node to see how well it copes under load.

@waxlamp waxlamp transferred this issue from dandi/dandi-infrastructure Feb 5, 2024
yarikoptic commented:

@satra @waxlamp I would like to proceed with moving the dandidav deployment into the "official" dandiarchive.org space from its temporary home at https://dandi.centerforopenneuroscience.org/ .

Please guide @jwodder and me through what we need to do to accomplish the drill.


satra commented Feb 20, 2024

create a new instance in the aws account or heroku account and add a route53 cname alias for it. ideally there are a few considerations with respect to horizontal scalability, but before we get to those, get a basic setup running. also estimate the costs of this service based on the infrastructure you choose. pinging @aaronkanzer, who may be able to help with some considerations depending on the choices.

perhaps a devops doc could help you and others in the future as to how to deploy new services. note that once we move over to k8s on 2i2c, we will want to use that substrate for future services.

yarikoptic commented:

am I correct that heroku would be the better target, since it would hard-limit us on resources so we do not break the bank?


satra commented Feb 21, 2024

you can limit things in aws as well (fixed instance, no load balancer, etc.), but it may be quicker/easier in heroku.


waxlamp commented Feb 22, 2024

It seems there are Rust buildpacks for Heroku. As for infrastructure, I think it would be prudent to manage the necessary resources through our Terraform setup.

@mvandenburgh, I think you have the necessary background to look into this and formulate an operations plan. Could you please start with these two questions:

  1. How do we deploy a Rust-based web application on Heroku?
  2. What AWS resources do we need to define in our Terraform (TF) materials?

@waxlamp waxlamp transferred this issue from dandi/dandidav Feb 22, 2024
@waxlamp waxlamp assigned mvandenburgh and unassigned satra Feb 22, 2024

jwodder commented Feb 22, 2024

Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run git rev-parse --short HEAD in order to embed the current Git commit in the binary.

yarikoptic commented:

maybe @kabilar and @aaronkanzer could also help on this end, since they are replicating the DANDI infrastructure setup?


aaronkanzer commented Feb 23, 2024

I don't have the bandwidth for a few days, but I might suggest first doing a proof-of-concept in Heroku alone, outside of dandi-infrastructure, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which, fortunately/unfortunately, is not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh

@yarikoptic @jwodder perhaps just provisioning a Heroku dyno, including a Procfile with your Rust app, and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on #166 (comment))

Then you could have observability and stress-testing in the short term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL with the dyno (which we would eventually CNAME in Route53 to webdav.dandiarchive.org). If successful, we could append or easily build out more IaC in dandi-infrastructure.

Just some thoughts...

mvandenburgh commented:

I don't have the bandwidth for a few days, but I might suggest first doing a proof-of-concept in Heroku alone, outside of dandi-infrastructure, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which, fortunately/unfortunately, is not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh

Thanks @aaronkanzer, I definitely agree it makes sense to do an initial proof of concept outside of Terraform.

@yarikoptic @jwodder perhaps just provisioning a Heroku dyno, including a Procfile with your Rust app, and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on #166 (comment))

Then you could have observability and stress-testing in the short term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL with the dyno (which we would eventually CNAME in Route53 to webdav.dandiarchive.org). If successful, we could append or easily build out more IaC in dandi-infrastructure.

Agreed, I think this is the approach we should take - I'll start out by trying to set this up.

mvandenburgh commented:

Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run git rev-parse --short HEAD in order to embed the current Git commit in the binary.

@jwodder Heroku provides these values as environment variables at runtime - https://devcenter.heroku.com/articles/dyno-metadata#dyno-metadata. Is using the HEROKU_SLUG_COMMIT environment variable sufficient here?


jwodder commented Mar 5, 2024

@mvandenburgh I've created a PR to fetch the Git commit from HEROKU_SLUG_COMMIT if no normal Git information is available: dandi/dandidav#95
