Human-readable/user-defined filename (& path) for datasets #7525

VJalili · 2019-03-14T00:57:32Z

Currently, Galaxy (ObjectStore) organizes files (representing datasets) under admin-defined paths following the [0-9]{3}(\/dataset_)[0-9] pattern. For instance:

├── database
│   ├── files1
│   │   ├── 000
│   │   │   ├── dataset_1.dat
│   │   │   ├── dataset_2.dat
│   │   │   ├── dataset_3.dat
...
AWS S3
├── my_bucket
│   ├── files2
│   │   ├── 000
│   │   │   ├── dataset_4.dat
│   │   │   ├── dataset_5.dat
│   │   │   ├── dataset_6.dat

Here, some datasets are stored locally and some are stored in an AWS S3 bucket, all under a 000 folder and named dataset_X.dat, where X is the dataset id.

There are a number of advantages to this pattern of naming/structuring files; for instance:

The absolute file path can be generated dynamically and on the fly; e.g., see:

galaxy/lib/galaxy/objectstore/__init__.py

Lines 376 to 377 in d566ed9

    if not dir_only:
        path = os.path.join(path, alt_name if alt_name else "dataset_%s.dat" % obj_id)

It follows best-practice guidelines that recommend against encoding metadata in filenames.

By obfuscating filenames and organizing them under a specific structure, it mildly enforces data immutability and consistency (i.e., data stored in a file has remained unchanged since it was last accessed), hence helping ensure reproducibility of analyses.
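
To illustrate the first advantage, here is a minimal sketch of how such a path can be derived from nothing but the dataset id. The helper names (dataset_rel_path, dataset_abs_path) and the rule of 1000 ids per three-digit directory are assumptions made for illustration; Galaxy's actual directory hashing may differ, especially for larger ids.

```python
import os


def dataset_rel_path(obj_id, alt_name=None, ids_per_dir=1000):
    """Build a relative path such as '000/dataset_1.dat' from a dataset id.

    Simplifying assumption: ids are bucketed into three-digit directories
    holding `ids_per_dir` entries each. This is NOT Galaxy's real hashing.
    """
    bucket = "%03d" % (obj_id // ids_per_dir)
    # Same fallback as the snippet quoted above: alt_name wins if given.
    filename = alt_name if alt_name else "dataset_%s.dat" % obj_id
    return os.path.join(bucket, filename)


def dataset_abs_path(files_root, obj_id, alt_name=None):
    """Join an admin-defined root (e.g. 'database/files1') with the relative path."""
    return os.path.join(files_root, dataset_rel_path(obj_id, alt_name))
```

With this scheme no lookup table is needed: dataset_abs_path("database/files1", 3) resolves to the dataset_3.dat entry under database/files1/000 shown in the tree above.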

However, in spite of its advantages, this pattern comes with some disadvantages. For instance:

It makes it hard to mount data in Galaxy, analyze it, and push results back to a folder (e.g., the same folder that was mounted). For instance, a user who has data in an S3 bucket may want to mount that bucket into Galaxy and analyze all the files (objects) inside it. Currently, one of the challenges is that the filenames do not adhere to Galaxy's expected naming pattern (at least from ObjectStore's perspective).
It prevents users from naming files in the way that makes the most sense to them. For instance, users often encode metadata (e.g., antibody name, experiment duration, tissue name, or lab name) in filenames and write scripts that parse the filenames and run the appropriate post-processing according to that metadata. By obfuscating filenames, we make it harder for users to correlate input and output files.
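
As an illustration of the second point, the kind of user script in question might look like the sketch below. The naming convention (antibody, tissue, duration, and lab separated by underscores) and the function name are hypothetical; real conventions vary from lab to lab.

```python
import os
from typing import Dict

# Hypothetical convention: <antibody>_<tissue>_<duration>_<lab>.<ext>
FIELDS = ("antibody", "tissue", "duration", "lab")


def parse_filename_metadata(path: str) -> Dict[str, str]:
    """Extract metadata fields from a user-chosen filename."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    parts = stem.split("_")
    if len(parts) != len(FIELDS):
        # An obfuscated name such as 'dataset_4.dat' ends up here.
        raise ValueError("unexpected filename layout: %r" % path)
    return dict(zip(FIELDS, parts))
```

A name such as my_bucket/H3K27ac_liver_48h_labA.bed parses cleanly, whereas dataset_4.dat carries none of this information, so the script can no longer match outputs to the inputs they came from.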

Galaxy (ObjectStore) abstracts datasets from persisted files, where users are in control of datasets (can create/use/share/delete/purge them) and Galaxy is in control of files, ensuring their accessibility, consistency, and immutability. One way to enforce Galaxy's exclusive control over files is the naming pattern described above. However, for various reasons (such as those listed above), it would be advantageous to let users control files, so that Galaxy can populate a history by reading files from a user-specified folder, regardless of how they are named, and without having to duplicate them only to comply with the naming pattern required by Galaxy (ObjectStore).

There are some challenges to enabling this feature via ObjectStore; for instance:

ObjectStore needs to be able to operate on a per-user basis (see PR: User-based ObjectStore #4840);

ObjectStore should be able to read and write datasets from/to user-specified filenames, maybe leveraging external_filename and _extra_files_path:

galaxy/lib/galaxy/model/mapping.py

Lines 210 to 224 in d566ed9

    
    model.Dataset.table = Table(
        "dataset", metadata,
        Column("id", Integer, primary_key=True),
        Column("create_time", DateTime, default=now),
        Column("update_time", DateTime, index=True, default=now, onupdate=now),
        Column("state", TrimmedString(64), index=True),
        Column("deleted", Boolean, index=True, default=False),
        Column("purged", Boolean, index=True, default=False),
        Column("purgable", Boolean, default=True),
        Column("object_store_id", TrimmedString(255), index=True),
        Column("external_filename", TEXT),
        Column("_extra_files_path", TEXT),
        Column('file_size', Numeric(15, 0)),
        Column('total_size', Numeric(15, 0)),
        Column('uuid', UUIDType()))

ObjectStore should ensure data consistency/immutability via file checksums; see PRs: Dataset Checksum #4659 and Implement dataset source and hash tracking #7487;
ObjectStore should be able to mount a folder or a cloud-based bucket, e.g., via FUSE or symlink.
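
Two of the challenges above (user-specified filenames and checksum-based consistency) can be sketched together. This is a rough illustration under stated assumptions, not ObjectStore's implementation: DatasetRecord is a hypothetical stand-in for a row of the dataset table, its sha256 field is an assumed addition that does not exist in the table shown above, and the hard-coded 000 directory only holds for the small ids used in the examples in this issue.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical stand-in for a row of the "dataset" table."""
    id: int
    external_filename: Optional[str] = None
    sha256: Optional[str] = None  # assumed field, not in the table above


def resolve_path(record, files_root):
    """Prefer the user-specified file; otherwise use the default pattern."""
    if record.external_filename:
        return record.external_filename
    # Simplification: '000' is only correct for small dataset ids.
    return os.path.join(files_root, "000", "dataset_%s.dat" % record.id)


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(record, files_root):
    """Return True only if a checksum was recorded and the file still matches it."""
    if record.sha256 is None:
        return False
    return file_sha256(resolve_path(record, files_root)) == record.sha256
```

The point of the checksum is that once users control the files, the naming pattern no longer protects against modification, so immutability has to be verified rather than assumed.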

ping @jgoecks @jmchilton @martenson @dannon @afgane @luke-c-sargent @natefoo


jmchilton · 2019-03-15T19:19:43Z

> Galaxy (ObjectStore) abstracts datasets from persisted files, where users are in control of datasets (can create/use/share/delete/purge them) and Galaxy is in control of files ensuring their accessibility, consistency, and immutability.

It is a seemingly small point, but it has huge architectural implications: users do not control the datasets, only the dataset associations and metadata. If users controlled the datasets and not the dataset associations, we would have to re-architect dataset copying, sharing, etc. We would need new abstractions, APIs, and UIs for how these things would work, I think.

The two questions we want to answer, I think, are:

Do we want to have a variant (subclass, flag, etc.) of galaxy.model.Dataset that does indeed belong to a user (or group)? I heard a lot of yeses at the meeting, but there didn't seem to be a consensus.
If yes, what does that look like at the database layer, the model layer, the API, and the UI? Some things are obvious: for example, if a user copies a dataset that is truly owned by another user, we need to make a physical copy of the data.
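
To make the copy semantics in the second question concrete, here is a minimal sketch. Everything in it is hypothetical: the Dataset and DatasetAssociation classes are simplified stand-ins rather than Galaxy's models, owner_id is the proposed (not existing) ownership link, and physical_copy is a caller-supplied function assumed to duplicate the bytes and return the new path.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dataset:
    path: str
    owner_id: Optional[int] = None  # None: Galaxy-managed, shareable by reference


@dataclass
class DatasetAssociation:
    user_id: int
    dataset: Dataset


def copy_for_user(
    source: DatasetAssociation,
    target_user_id: int,
    physical_copy: Callable[[str, int], str],
) -> DatasetAssociation:
    """Copy an association, duplicating the data only when ownership requires it."""
    dataset = source.dataset
    if dataset.owner_id is None or dataset.owner_id == target_user_id:
        # Galaxy-managed data (or the user's own): a new association
        # pointing at the same dataset is enough.
        return DatasetAssociation(target_user_id, dataset)
    # Data truly owned by another user: make a physical copy.
    new_path = physical_copy(dataset.path, target_user_id)
    return DatasetAssociation(target_user_id, Dataset(new_path, owner_id=target_user_id))
```

In this sketch, copying stays as cheap as it is today for unowned datasets and only becomes a data-moving operation once an owner is attached, which is where the database, API, and UI questions start to matter.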

If there is an architecture for this that works, I guess I'd prefer to see it done, and doable, with existing object store constructs and models first.

If there were a connection between the Dataset and a User in the database, and we implemented this throughout the app, then it seems like the pluggable media stuff would fit right in and seem natural.

I think solving all of that is a precursor to thinking about files instead of datasets? If we could say "this job's outputs should belong to user X", then we could create the datasets in that fashion, put them where they need to be in that user's object store, etc.
