Assetstore Import Tracker / Repeater #197

Closed
manthey opened this issue Mar 1, 2022 · 9 comments
Comments

@manthey
Contributor

manthey commented Mar 1, 2022

This is a summary of a long-desired feature. Once a repo is created for such a feature, any issues related to it should be moved there (e.g., #193).

We'd like to have a Girder plugin that records when any Import action is done on an assetstore. This would record all of the options -- path, destination, etc. -- for arbitrary assetstore types (probably by hooking the import endpoint event), plus the time the import started.
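The recording idea above can be sketched in plain Python. This is a toy in-memory event bus standing in for Girder's real `events` module; the event name and the `record_import` handler are hypothetical, not Girder's actual API:

```python
import datetime

# Toy in-memory stand-in for Girder's event system; the real plugin
# would bind a handler on the assetstore import endpoint instead.
_handlers = {}
import_log = []  # one record per import action

def bind(event_name, handler):
    _handlers.setdefault(event_name, []).append(handler)

def trigger(event_name, info):
    for handler in _handlers.get(event_name, []):
        handler(info)

def record_import(info):
    # Capture every option passed to the import, plus the start time.
    import_log.append({
        'assetstoreId': info['assetstoreId'],
        'params': dict(info['params']),  # path, destination, etc.
        'started': datetime.datetime.now(datetime.timezone.utc),
    })

bind('rest.post.assetstore.import.before', record_import)
trigger('rest.post.assetstore.import.before', {
    'assetstoreId': 'abc123',
    'params': {'importPath': '/data/images', 'destinationType': 'folder'},
})
print(len(import_log))
```

Because the handler copies the whole `params` dict, repeating the import later is just replaying a stored record against the same endpoint.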

We want to show a list of import actions, sorted most-recent first, with appropriate details and a button to repeat the import exactly as it was done before. This list would be accessible from a button somewhere on the assetstore list page and would probably need to be paged. For repeated imports with exactly the same options and assetstore, maybe instead of showing each import as a separate line, it would show a "number of times" and the most recent time? In the list, we want to show sensible names, not just Girder ids, for collections and folders.
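The grouping of repeated identical imports could work roughly like this sketch. The record shape (assetstore id, options tuple, ISO timestamp) is invented for illustration, not the plugin's actual schema:

```python
# Hypothetical import records: (assetstore id, frozen options, start time).
records = [
    ('store1', (('importPath', '/data/a'),), '2022-02-01T10:00'),
    ('store1', (('importPath', '/data/a'),), '2022-03-01T09:00'),
    ('store2', (('importPath', '/data/b'),), '2022-02-15T12:00'),
]

def summarize(records):
    """Collapse imports with identical assetstore + options into one row
    showing how many times they ran and the most recent start time."""
    rows = {}
    for store, options, started in records:
        key = (store, options)
        count, latest = rows.get(key, (0, ''))
        rows[key] = (count + 1, max(latest, started))
    # Sort most-recent first, as the list page would.
    return sorted(
        ((store, dict(options), count, latest)
         for (store, options), (count, latest) in rows.items()),
        key=lambda row: row[3], reverse=True)

for row in summarize(records):
    print(row)
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch free of date parsing.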

As a bonus, it would be great if when we went to an assetstore import page we showed the last few (10?) imports that were done for that assetstore, so that the user could redo them or see how they wanted to do something differently.

A further feature would be optionally modifying how repeated imports are done: currently, if a file doesn't exist in the expected target directory, it is created. We frequently import a directory tree of files, then organize them in Girder so they are no longer conceptually in the original directory tree; reimporting makes duplicates of all of these files. It would be great if there were an option in import to "skip if the file is already in Girder somewhere" -- this can be done by matching the import path. If the file size has changed, we would update the existing file. A more sophisticated method would be to compute a hash and match on that -- the file might have been renamed either on the assetstore OR in Girder, and, if the hash matches, it would be nice not to create a duplicate. This would be slower, since the hash has to be computed.
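A minimal sketch of the skip/update/hash-match decision described above, assuming hypothetical lookup structures rather than Girder's real models:

```python
import hashlib

def import_decision(path, size, existing, hash_index=None, data=None):
    """Decide what to do with a file seen during a repeated import.

    existing: {import_path: {'size': int, 'fileId': str}} -- files Girder
    already knows about, keyed by their original import path.
    hash_index: optional {sha512_hexdigest: fileId} for the slower
    hash-based match.  All names here are illustrative only.
    """
    known = existing.get(path)
    if known is not None:
        if known['size'] == size:
            return ('skip', known['fileId'])   # already imported, unchanged
        return ('update', known['fileId'])     # same path, size changed
    if hash_index is not None and data is not None:
        digest = hashlib.sha512(data).hexdigest()
        if digest in hash_index:
            # Renamed on the assetstore or in Girder, but same content.
            return ('skip', hash_index[digest])
    return ('create', None)                    # genuinely new file

existing = {'/data/a.tif': {'size': 100, 'fileId': 'f1'}}
print(import_decision('/data/a.tif', 100, existing))   # ('skip', 'f1')
print(import_decision('/data/a.tif', 120, existing))   # ('update', 'f1')
```

The cheap path/size checks run first; the hash branch only fires when the caller is willing to read the file's bytes, which matches the "slower but safer" trade-off described above.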

It would be nice to have a feature to flag any file in Girder that is no longer available on an assetstore. For filesystem assetstores, this would confirm the path is reachable. For S3 assetstores, this would have to confirm the asset is still in the bucket (so would probably be slow). If we did this, we would probably want to show a list of such files (or only such files on a specific assetstore, or only such files from a specific import path) and then have an option to delete the associated Girder items (and probably prune empty Girder folders, too).

@manthey
Contributor Author

manthey commented Mar 1, 2022

@dgutman Did I miss anything in our desired feature list here? I recognize that you would like a cron-like task to repeat imports at some point. I think we need hash-matching for that to actually do what we want, and I think it is too risky to ever automate deleting missing items. If we ever cron imports, then we should probably cron checking for missing files and report that somewhere (next to the imports list, maybe?) so that the admin can decide what to do.

Ages ago I was involved in a project where we automatically added and removed files from a database as they came and went on NAS-like devices. Devices with intermittent availability (for instance, across any network) made automatic removal very risky.

@dgutman
Contributor

dgutman commented Mar 1, 2022 via email

@manthey
Contributor Author

manthey commented Mar 1, 2022

The import endpoint supports include/exclude regular expressions. We don't expose that in the UI (we probably should).
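For illustration, the include/exclude filtering might behave like this sketch; whether the endpoint uses `re.search` or `re.match` semantics is an assumption here:

```python
import re

def filter_paths(paths, include=None, exclude=None):
    """Apply include/exclude regular expressions to candidate paths.
    A path is kept if it matches the include pattern (when given)
    and does not match the exclude pattern (when given)."""
    result = []
    for path in paths:
        if include and not re.search(include, path):
            continue
        if exclude and re.search(exclude, path):
            continue
        result.append(path)
    return result

paths = ['/data/slide1.svs', '/data/notes.txt', '/data/slide2.svs']
print(filter_paths(paths, include=r'\.svs$'))
print(filter_paths(paths, exclude=r'\.txt$'))
```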

@manthey
Contributor Author

manthey commented Mar 1, 2022

It sounds like when we check for missing files, we would just add some chunk of metadata to the file (and possibly to its parent item) that we could remove again if the file comes back. Then showing missing files could trivially be done by a virtual folder that matches on that metadata. Since the presence/missing check is likely to be stale by the time we actually try to access something, any actions that expect the flag to be one way or the other would have to check again.
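The metadata-flag approach could look something like this sketch; the `importTracker.missing` key and the document shape are invented for illustration:

```python
MISSING_FLAG = 'importTracker.missing'  # hypothetical metadata key

def update_missing_flag(file_doc, present):
    """Set or clear the missing-file marker on a file document; the flag
    is removed again if the file comes back."""
    meta = file_doc.setdefault('meta', {})
    if present:
        meta.pop(MISSING_FLAG, None)
    else:
        meta[MISSING_FLAG] = True
    return file_doc

def missing_files(files):
    """What a virtual folder matching on the flag would return."""
    return [f for f in files if f.get('meta', {}).get(MISSING_FLAG)]

files = [
    update_missing_flag({'name': 'a.tif'}, present=True),
    update_missing_flag({'name': 'b.tif'}, present=False),
]
print([f['name'] for f in missing_files(files)])  # ['b.tif']
```

Since the flag can be stale, any destructive action would re-run the presence check immediately before acting rather than trusting the stored marker.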

Throwing errors when a file is missing is outside the scope of this plugin (and probably differs in the Girder interface versus the HistomicsUI interface). Let's address what we want to do about that in a different issue.

@Leengit

Leengit commented Mar 1, 2022

I don't know enough about the Girder implementation to know whether this is sufficiently relevant, but just in case it is ... rsync supports both hash-based and file-size-based comparison, and it can avoid re-transferring something that has been temporarily absent via its --link-dest flag. A command line that syncs a source directory into its newest copy, while consulting several previous copies, looks something like:

rsync -a farway:MySource/ 2022-03-01/ --link-dest=2022-02-28/ --link-dest=2022-02-27/ --link-dest=2022-02-26/

Although 2022-03-01/ in this example should start as an empty directory, only the files that have changed will be copied there. The rest of the files will be present too, but as hard links to the correspondingly located files in the directories listed via --link-dest, assuming they match on hash and timestamp. Additionally, because these are hard links, we can delete old dates' directories without losing a file that is also present in a more recent date's directory.

N.B.: the last time I checked, which was about 10 years ago, there was a limit of perhaps 20 --link-dest directories. Also, I don't recall whether rsync checks both the timestamp and the hash by default; it may be necessary to turn those checks on explicitly.

@manthey
Contributor Author

manthey commented Mar 1, 2022

@Leengit We aren't copying anything in this -- we are just indexing files that exist somewhere -- it could be a filesystem or an S3 bucket or a GridFS server, etc. "Import" is an indexing operation, not a copy operation.

@jjnesbitt

I've begun work on this here: https://github.com/DigitalSlideArchive/import-tracker

@manthey
Contributor Author

manthey commented Apr 20, 2022

@AlmightyYakob We should move the individual parts of this task to issues on https://github.com/DigitalSlideArchive/import-tracker.

@manthey
Contributor Author

manthey commented May 4, 2022

I've moved all the details from this issue to separate issues in https://github.com/DigitalSlideArchive/import-tracker, so I'm closing this issue.

@manthey manthey closed this as completed May 4, 2022