Expose dandiarchive as webdav service #166
i would add a check in this issue for whether webdav can be done without egressing data bytes through a server. |
I added a check: and here is a reply from chatgpt (didn't try yet):

Yes, you can implement a WebDAV backend that does not store or provide the actual file bytes but instead redirects to a target URL elsewhere when a client requests a file. This can be achieved by customizing the behavior of the WebDAV server to handle requests in a way that serves external resources through redirection. Here's a simplified example of how you can create a WebDAV server that redirects requests to external URLs using the `pywebdav` library:

```python
from pywebdav.server import DAVServer, DAVResource  # DAVResource needed for the subclass below
from pywebdav.types import DAVError

class RedirectDAVResource(DAVResource):
    def __init__(self, path, redirect_url):
        super().__init__(path)
        self.redirect_url = redirect_url

    def GET(self):
        # Answer GET with a 302 pointing at the external URL instead of file bytes
        raise DAVError(302, self.redirect_url)

class MyDAVServer(DAVServer):
    def __init__(self, root_path):
        super().__init__()
        self.root_path = root_path

    def get_resource_inst(self, path):
        # In this example, we return a RedirectDAVResource for all files.
        # You can add logic here to determine if a file should be redirected.
        return RedirectDAVResource(path, "https://example.com/external/resource")

if __name__ == '__main__':
    server = MyDAVServer('/path/to/your/data/store')
    server.run()
```

In this example, we define a custom `RedirectDAVResource` whose `GET` raises a 302 redirect instead of serving content, and `MyDAVServer.get_resource_inst` returns it for every requested path. Please note that this is a basic example, and you can expand on it to meet your specific requirements for redirection, access control, and handling different types of resources. Depending on your use case, you might also want to handle other WebDAV methods (e.g., PUT, DELETE) as needed. |
I think the best would be to start with an independent service to just try feasibility etc, using https://github.com/mar10/wsgidav, and make it for now so we could just try locally; if it works nicely then we reassess inclusion into dandi-archive. @jwodder could you please provide a prototype implementation using https://github.com/mar10/wsgidav, looking at available backends, e.g. for mercurial: https://wsgidav.readthedocs.io/en/latest/addons-mercurial.html and others, so we establish webdav over api.dandiarchive.org to expose it as a read-only webdav where it would have a tree;
for individual files, it should forward to the S3 HTTP URL. |
@yarikoptic Where should this wsgidav instance be deployed? |
for now I have no hosting ready for it -- it should be code/a service anyone could just run/try locally. If we see that it works well, we would then either proceed with adapting it within our dandi-archive instance(s) or look into establishing separate hosting for it. |
|
For this portion - yes! Depending on our success with it in read-only facility, we might (much later) want to look into providing support for uploading stuff too through it -- might be quite cool / user-friendly (as long as we can provide feedback etc).
not flat -- should be according to their directory hierarchy, like we have in datalad dandisets and files view on dandiarchive.
my webdav knowledge is very limited... AFAIK "directory" in webdav is a collection as well, or in other words I do not know a way to have some different "types" of collections. So as such -- zarr assets indeed should be just directories, and then paths under them should be redirected to corresponding paths under corresponding zarr on S3.
I thought of it indeed as some kind of future proofing since we do have separate "prefix/"es on S3, and indeed if there would be demand, we might want to expose later |
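For concreteness, the path-to-S3 mapping described above could be sketched like this; the helper name and exact URL form are my assumptions, while the bucket name and `zarr/{zarr_id}/` key prefix match those used elsewhere in this thread:

```python
# Sketch: map an entry path under a Zarr "directory" in WebDAV to the
# public S3 object URL that a GET could redirect to. Bucket name and
# zarr/ prefix follow the dandiarchive bucket layout discussed in this
# thread; the helper itself is hypothetical.
S3_BASE = "https://dandiarchive.s3.amazonaws.com"

def zarr_entry_to_s3_url(zarr_id: str, entry_path: str) -> str:
    return f"{S3_BASE}/zarr/{zarr_id}/{entry_path.lstrip('/')}"

print(zarr_entry_to_s3_url("8b0493dd-32d6-4f75-8ca2-b57b28fc9695", "0/0/0"))
```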
What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects. When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else? |
hm,
yes -- entries within Zarr organized into a directory hierarchy -- pretty much 1-to-1 as to how it is on S3. Overall example with what redirects/responses we need on an example
|
@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437), but the wsgidav implementation does not seem to support defining redirects. |
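Since wsgidav is itself a WSGI application, one conceivable workaround (a sketch, not anything wsgidav provides) is a thin WSGI middleware in front of it that answers GET on file paths with a 302 before wsgidav ever handles the request; `resolve_url` here is a hypothetical lookup returning a target URL or `None`:

```python
# Sketch: plain WSGI middleware issuing 302 redirects for GET requests.
# Any path resolve_url declines falls through to the wrapped DAV app,
# so PROPFIND/listing requests are still handled normally.
def redirect_middleware(inner_app, resolve_url):
    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") == "GET":
            target = resolve_url(environ.get("PATH_INFO", ""))
            if target is not None:
                # Short-circuit: send the client to the external URL
                start_response("302 Found", [("Location", target)])
                return [b""]
        return inner_app(environ, start_response)
    return app
```

This keeps wsgidav untouched; only bytes for non-redirected responses pass through the server.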
@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:
|
phewph, good!
:-/ is it possible to just reply with some standard HTTP response there?
even without adding to dandi-cli, is it much more just some
since embargoed zarrs are not even supported yet, everything is public, let's just use boto directly to get a listing of the "index" for S3 prefix. |
FWIW, for redirects there was a fresh followup mar10/wsgidav#303 (comment) confirming that not directly supported but possibly relatively easy to add to test the idea out |
Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.
I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things. |
what about the overload approach mentioned in mar10/wsgidav#303 (comment) ?
nope. I did listing of directories in datalad (now in datalad-deprecated) with old boto, and you can quickly (takes no time -- less than a sec) do that e.g. in cli. chatgpt gave the following example code for boto3 which runs in 0.5 sec locally for me (so not listing the entire zarr there), and there could be even better ways may be:

```python
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create a new S3 client with anonymous access
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = 'dandiarchive'
prefix = 'zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/'

def list_directories_and_files(bucket, prefix):
    paginator = s3_client.get_paginator('list_objects_v2')
    # Delimiter='/' makes S3 group keys into CommonPrefixes, i.e. "subdirectories"
    result = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
    for page in result:
        if "CommonPrefixes" in page:
            for subdir in page['CommonPrefixes']:
                print('Subdirectory: ' + subdir['Prefix'])
        if "Contents" in page:
            for file in page['Contents']:
                if not file['Key'].endswith('/'):
                    print('File: ' + file['Key'])

list_directories_and_files(bucket_name, prefix)
```
|
I have no idea how to implement that as a user of wsgidav without forking wsgidav. |
update: @jwodder redid in Rust in https://github.com/dandi/dandidav . A sample instance is running at https://dandi.centerforopenneuroscience.org/ (not automatically deployed). Sample external services URLs to try:
Next stage would be deployment:
|
the subdomain can be registered through our aws account route53, but that involves us running the service. has forwarding to s3 for retrieval been implemented? if not, i would at least start with integrating at a local level so anyone could run it. if yes, then proceed with the setup. |
AFAIK yes. @jwodder can confirm if that is generally so. Known exception is |
@satra @waxlamp I would like to proceed with moving the dandidav deployment into the "official" dandiarchive.org space from its temporary https://dandi.centerforopenneuroscience.org/ . Please guide us and @jwodder through what we need to do to accomplish the drill. |
create a new instance in the aws account or heroku account and add a route53 cname alias for it. ideally, there are a few considerations with respect to horizontal scalability, but before we do that get a basic setup running. also estimate the costs of this service based on the infrastructure you choose. pinging @aaronkanzer who may be able to help with some considerations depending on choices. perhaps a devops doc could help you and others in the future as to how to deploy new services. note that once we move over k8s to 2i2c, we will want to use that substrate for future services. |
am I correct that heroku would be a better target since it would hard-limit us on resources so we do not break the bank? |
you can limit things in aws as well (fixed instance, no load balancer, etc.,.). but it may be quicker/easier in heroku. |
It seems there are Rust buildpacks for Heroku. As for infrastructure, I think it would be prudent to manage the necessaries through our Terraform setup. @mvandenburgh, I think you have the necessary background to look into this and formulate an operations plan. Could you please start with these two questions:
|
Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run |
may be also @kabilar and @aaronkanzer could help on this end since they are replicating DANDI infrastructure setup ? |
I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of Terraform. @yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a `Procfile` would do. Then you could have observability and stress-testing in the short-term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to the official subdomain). Just some thoughts... |
Thanks @aaronkanzer, I definitely agree it makes sense to do an initial proof of concept outside of Terraform.
Agreed, I think this is the approach we should take - I'll start out by trying to set this up. |
@jwodder Heroku provides these values as environment variables at runtime - https://devcenter.heroku.com/articles/dyno-metadata#dyno-metadata. Is using the `HEROKU_SLUG_COMMIT` environment variable an option here? |
@mvandenburgh I've created a PR to fetch the Git commit from `HEROKU_SLUG_COMMIT`.
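For reference, reading that dyno-metadata value at runtime is a one-liner; this sketch (the helper name and fallback string are my assumptions) just degrades gracefully when the variable is absent, e.g. when running locally:

```python
import os

def deployed_commit() -> str:
    # HEROKU_SLUG_COMMIT is set by Heroku's dyno-metadata feature
    # (linked above); outside Heroku it is simply not set.
    return os.environ.get("HEROKU_SLUG_COMMIT", "unknown")
```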
To facilitate integration in various external projects (e.g., OSDF with Pelican underneath) which can interface to webdav services. Treat it as unification of our API to a standard API for file access.

TODOs:

- `/paths` endpoint, but likely not since that one is dandiset specific, but may be it could be at the top level of API
- may be it could be implemented as independent service, not part of API.

attn @jwodder -- how much did you play with webdav?

edits:

- `dandisets/` and `zarrs/` (not `zarr/`) folders (problem for `zarr/` though -- heavy folder listing). `{dandiset_id}`.
- `draft/`, `released/` (or `latest/` - most recent release), `releases/{VERSION}`. It kinda makes it require `draft/` or other prefix folder to get to content, but it is consistent so I like it most.
- `draft/`, `released/` (or `latest/`) and then all versions at the same level. kinda ok, but since numbered releases would most likely sort first and oldest first -- I think it would be not that convenient of a default view...
- `dandi://INSTANCE/DANDISET_ID[@VERSION][/PATH]` and thus incorporating version into the folder name for dandiset -- IMHO ugly
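A tiny sketch of the layout option liked above; the helper name, the example dandiset id, and the exact folder spellings are hypothetical, not anything implemented:

```python
# Sketch: build the WebDAV folder path for a dandiset version under the
# draft/latest/releases layout discussed above. Everything here is a
# hypothetical illustration of that layout, not an existing API.
def dandiset_webdav_path(dandiset_id: str, version: str = "draft") -> str:
    if version in ("draft", "latest"):
        # draft/ and latest/ sit directly under the dandiset folder
        return f"dandisets/{dandiset_id}/{version}/"
    # numbered releases live under releases/{VERSION}
    return f"dandisets/{dandiset_id}/releases/{version}/"
```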