Human-readable/user-defined filename (& path) for datasets #7525

Open
VJalili opened this issue Mar 14, 2019 · 1 comment

VJalili commented Mar 14, 2019

Currently, Galaxy (ObjectStore) organizes files (representing datasets) under admin-defined paths following the [0-9]{3}(\/dataset_)[0-9] pattern. For instance:

├── database
│   ├── files1
│   │   ├── 000
│   │   │   ├── dataset_1.dat
│   │   │   ├── dataset_2.dat
│   │   │   ├── dataset_3.dat
...
AWS S3
├── my_bucket
│   ├── files2
│   │   ├── 000
│   │   │   ├── dataset_4.dat
│   │   │   ├── dataset_5.dat
│   │   │   ├── dataset_6.dat

where some datasets are stored locally and some are stored in an AWS S3 bucket, all under a 000 folder and named dataset_X.dat, where X is the dataset id.

There are a number of advantages to this pattern of naming/structuring files; for instance:

  • the absolute file path can be generated dynamically, on the fly; e.g., see:
    if not dir_only:
        path = os.path.join(path, alt_name if alt_name else "dataset_%s.dat" % obj_id)
  • it follows best-practice guidelines that recommend against encoding metadata in filenames;
  • by obfuscating filenames and organizing them under a specific structure, it mildly enforces data immutability and consistency (i.e., data stored in a file has remained unchanged since it was last accessed), hence ensuring reproducibility of analysis.
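As a rough illustration of the first point, the sketch below derives a full path from nothing but a base folder and a dataset id. The helper names (hashed_dir, dataset_path) and the grouping rule (1000 ids per folder, one folder level per three digits) are assumptions made for this example, not a copy of Galaxy's implementation; only the final os.path.join-style line mirrors the snippet quoted above.

```python
import posixpath


def hashed_dir(obj_id):
    """Hypothetical helper: 1000 ids per folder, one level per three digits."""
    digits = str(obj_id // 1000)
    width = -(-len(digits) // 3) * 3  # round the length up to a multiple of three
    padded = digits.zfill(width)
    return posixpath.join(*(padded[i:i + 3] for i in range(0, len(padded), 3)))


def dataset_path(base, obj_id, alt_name=None, dir_only=False):
    """Build the path from the id alone; no lookup table is needed."""
    path = posixpath.join(base, hashed_dir(obj_id))
    if not dir_only:
        # Same expression as the snippet quoted in the list above.
        path = posixpath.join(path, alt_name if alt_name else "dataset_%s.dat" % obj_id)
    return path


print(dataset_path("database/files1", 3))
print(dataset_path("my_bucket/files2", 5, dir_only=True))
```

Because the path is a pure function of the id, nothing about the file's location needs to be stored per dataset, which is exactly what a user-defined filename would give up.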

However, in spite of its advantages, this pattern comes with some disadvantages. For instance:

  • it makes it difficult to mount data on Galaxy, analyze them, and push results back to a folder (e.g., the same folder that was mounted). For instance, a user who has their data in an S3 bucket would like to mount that bucket into Galaxy and analyze all the files (objects) inside that bucket. Currently, one of the challenges is that the filenames do not adhere to Galaxy's expected naming pattern (at least from ObjectStore's perspective);
  • it prevents users from naming files in a way that makes sense to them. For instance, users often encode some metadata (e.g., antibody name, experiment duration, tissue name, or lab name) in filenames and write scripts that parse the filenames and run the appropriate post-processing according to that metadata. By obfuscating filenames, we make it harder for users to correlate input and output files.
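To make the second point concrete, here is a minimal sketch of the kind of script users write. The naming convention (antibody, tissue, and duration in hours separated by underscores) and the function name parse_metadata are invented for this example.

```python
import re

# Hypothetical convention: <antibody>_<tissue>_<hours>h.<extension>
PATTERN = re.compile(r"^(?P<antibody>[^_]+)_(?P<tissue>[^_]+)_(?P<hours>\d+)h\.\w+$")


def parse_metadata(filename):
    """Return the metadata encoded in a filename, or None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["hours"] = int(fields["hours"])
    return fields


print(parse_metadata("H3K27ac_liver_48h.bam"))
print(parse_metadata("dataset_42.dat"))
```

The second call returns None: once a file has been renamed to dataset_X.dat, a script like this has nothing left to parse.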

Galaxy (ObjectStore) abstracts datasets from persisted files, where users are in control of datasets (can create/use/share/delete/purge them) and Galaxy is in control of files, ensuring their accessibility, consistency, and immutability. One way to enforce Galaxy's exclusive control over files is the aforementioned naming pattern. However, for various reasons (such as those listed above), it is advantageous to allow users to be in control of files, so that Galaxy can populate a history by reading files from a user-specified folder, regardless of how they are named, and without having to duplicate them only to comply with the naming pattern that Galaxy (ObjectStore) requires.

Enabling this feature via ObjectStore poses some challenges; for instance:

  • ObjectStore needs to be able to operate on a per-user basis (see PR: User-based ObjectStore #4840);
  • ObjectStore should be able to read and write datasets from/to user-specified filenames, maybe leveraging external_filename and _extra_files_path:
    model.Dataset.table = Table(
        "dataset", metadata,
        Column("id", Integer, primary_key=True),
        Column("create_time", DateTime, default=now),
        Column("update_time", DateTime, index=True, default=now, onupdate=now),
        Column("state", TrimmedString(64), index=True),
        Column("deleted", Boolean, index=True, default=False),
        Column("purged", Boolean, index=True, default=False),
        Column("purgable", Boolean, default=True),
        Column("object_store_id", TrimmedString(255), index=True),
        Column("external_filename", TEXT),
        Column("_extra_files_path", TEXT),
        Column('file_size', Numeric(15, 0)),
        Column('total_size', Numeric(15, 0)),
        Column('uuid', UUIDType()))
  • ensure data consistency/immutability via file checksums (see PRs: Dataset Checksum #4659 and Implement dataset source and hash tracking #7487);
  • be able to mount a folder or a cloud-based bucket in ObjectStore, e.g., via FUSE or symlink.
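A minimal sketch of how the second and third points could combine, under stated assumptions: DatasetRecord is a stand-in for a row of the dataset table (only external_filename comes from the table above; the sha256 field is hypothetical), resolve_path and is_unchanged are invented names, and the fixed 000 folder is a simplification.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Stand-in for a row of the dataset table; sha256 is a hypothetical field."""
    id: int
    external_filename: Optional[str] = None
    sha256: Optional[str] = None


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets are never loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_path(record, store_root):
    """Prefer the user-specified file; otherwise fall back to the default name."""
    if record.external_filename:
        return record.external_filename
    # Simplification: every id is placed under a single 000 folder.
    return os.path.join(store_root, "000", "dataset_%s.dat" % record.id)


def is_unchanged(record, store_root):
    """Compare the file's current checksum with the one recorded earlier."""
    return sha256_of(resolve_path(record, store_root)) == record.sha256


with tempfile.TemporaryDirectory() as folder:
    user_file = os.path.join(folder, "H3K27ac_liver_48h.bam")
    with open(user_file, "wb") as handle:
        handle.write(b"reads")
    record = DatasetRecord(id=7, external_filename=user_file, sha256=sha256_of(user_file))
    print(is_unchanged(record, folder))  # file untouched since it was registered
    with open(user_file, "ab") as handle:
        handle.write(b"!")
    print(is_unchanged(record, folder))  # file modified outside Galaxy
```

The checksum takes over the role the obfuscated name plays today: the file can live anywhere under any name, and a change made behind Galaxy's back is still detectable.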

ping @jgoecks @jmchilton @martenson @dannon @afgane @luke-c-sargent @natefoo

@jmchilton
Member

> Galaxy (ObjectStore) abstracts datasets from persisted files, where users are in control of datasets (can create/use/share/delete/purge them) and Galaxy is in control of files ensuring their accessibility, consistency, and immutability.

It is a seemingly small point, but it has huge architectural implications: users do not control the datasets, only the dataset associations and metadata. If users controlled the datasets and not the dataset associations, we would have to re-architect dataset copying, sharing, etc.; I think we would need new abstractions, APIs, and UIs for how these things would work.

The two questions we want to answer I think are:

  • Do we want to have a variant (subclass, flag, etc.) of galaxy.model.Dataset that does indeed belong to a user (or group)? I heard a lot of yeses at the meeting, but there didn't seem to be a consensus.
  • If yes, what does that look like at the database layer, the model layer, the API, and the UI? Some things are obvious: if a user copies a dataset that is truly owned by another user, we need to make a physical copy of the data.

If there is an architecture for this that works, I guess I'd prefer to see it done, and shown to be doable, with existing object store constructs and models first.

If there was a connection between the Dataset and a User in the database and we implemented this throughout the app - then it seems like the pluggable media stuff would fit right in and seem natural.

I think solving all of that is a precursor to thinking about files instead of datasets? If we could say "this job's outputs should belong to user X", then we could create the datasets in that fashion, put them where they need to be in that user's object store, etc.
