In 1101 base workflow #21

ehanson8 · 2024-12-13T16:25:33Z

Purpose and background context

Certain code blocks have bounced around while writing this and I am finding there many different ways to organize the boundaries between the CLI command, BaseWorkflow class, and the ItemSubmission class. The abstractmethods on the BaseWorkflow class outline the boundary I'm seeing between the workflow and the item, with the expectation that unwritten CLI commands will be the interface with the DSS SQS input/output queues and the SES emails. My guiding principle was to minimize the number of values passed between various classes, but as always, this is wide open to iteration!

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/IN-1101

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: * The BaseWorkflow and ItemSubmission classes are the foundation from which a significant portion of the application's functionality will be derived. How this addresses that need: * Add BaseWorflow class with abstract methods outlining the expected functionality. * Add ItemSubmission class with largely complete functionality and corresponding unit tests and fixtures Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1101

ehanson8 · 2024-12-13T16:29:14Z

dsc/item_submission.py

+        self.s3_client.put_file(
+            json.dumps(dspace_metadata),
+            bucket,
+            self.metadata_keyname,
+        )
+        metadata_uri = f"s3://{bucket}/{self.metadata_keyname}"
+        logger.info(f"Metadata uploaded to S3: {metadata_uri}")
+        self.metadata_uri = metadata_uri


I considered exception handling here but as with wiley-deposits, I think it's better to crash the run on a ClientError than continue if we can't get a metadata file into S3

ghukill

Submitting "Comment" as I have some questions and comments sprinkled through.

Overall, I'm definitely understanding the design. And in these early stages, I think it's nice that it's relatively simple.

At this early stage, what I'm most interested in are the relationships between a Workflow and an ItemSubmission. I left some comments about typings, and how it seems like you're anticipating that every source record is a dict. If so, I left some comments / requests to lean further into that direction.

I can't shake the feeling that perhaps some kind of class like SourceItem or ItemMetadata could be helpful, instead of passing around dictionaries... but maybe wait until you've had time to respond to these first round of questions.

ghukill · 2024-12-13T18:50:05Z

dsc/base.py

+        batch_metadata = self.get_batch_metadata()
+        for item_metadata in batch_metadata:
+            item_identifier = self.get_item_identifier(item_metadata)


This might be a can of worms, but I think it gets at one of my primary questions at this point.

So get_batch_metadata() is expected to return a list of metadata objects for each item that will be submitted. And then we pass that metadata object to get_item_identifier(), where the metadata object could be anything that this method is equipped to handle (noting item_metadata:Any in the signature).

Where I think I'm going with this, is that get_batch_metadata() and get_item_identifier() are kind of tightly coupled -- and this isn't necessarily a bad thing -- in that the objects returned by get_batch_metadata() must satisfy the input to get_item_identifier().

While perhaps many workflows might have a CSV file they parse and then return a dict, and get_item_identifier() could just grab a key... perhaps others might even return their own dataclass object or something, and then get_item_identifier() would use a property on that class.

However, I see that we assign this "metadata object" to ItemSubmission.source_metadata which itself is typed as dict[str, Any].

I was going to propose something like:

# near top of file ItemMetadata = TypeVar("ItemMetadata") ... ... def get_batch_metadata(self) -> list[ItemMetadata]: ... def get_item_identifier(self, item_metadata: ItemMetadata) -> str: ...

Where we through through a generic type that these are related. But, this kind of generic type + flexibility doesn't really work if ItemSubmission is expecting a dictionary for source_metadata.

To keep things simple, do you think it'd be worth just typing the output of get_batch_metadata() to something like list[dict]? And then get_item_identifier() could accept a dict? Could always revisit if some workflow would really benefit from not yielding a dictionary for each item's metadata.

I would say every method the the workflow classes are tightly coupled and opinionated to that workflow. In retrospect, I thought I errored in typing item_metadata=Any in get_item_identifier and source_metadata=dict[str, Any] in ItemSubmission since I didn't expect that object to be transformed between getting the identifier and instantiating ItemSubmission.

However! That is now suggesting to me that a get_item_metadata(batch_metadata) method is needed and that those type hints should remain:

batch_metadata = self.get_batch_metadata() for item_metadata in self.get_item_metadata(batch_metadata): item_identifier = self.get_item_identifier(item_metadata)

Currently, batch_metadata in most workflows is a CSV so get_item_metadata would be a simple dict iterator. For OpenCourseWare, it's a list of zip files, so get_item_metadata iterates and extracts a dict from each zip file. If a future source uses XML, get_item_metadata would iterate and generate a dict from the XML. Thoughts?

So would it be accurate to say?

get_batch_metadata() --> retrieves metadata from somewhere, but what is returned still contains multiple items; probably most often downloading from S3, but theoretically could get it other ways too (e.g. an HTTP request to ETD... 😎 ). The result might be a single CSV reader/dataframe, a single parsed XML file object, etc.

get_item_metadata() --> would yield actual dictionaries of item metadata, for each item, doing whatever is needed (e.g. unzipping files, parsing XML, looping through CSV rows, etc.)

I'm having trouble articulating what feels confusing about the relationship of,

get_batch_metadata

get_item_identifier

get_bitstream_uris

and now the proposed get_item_metadata

It feels kind of like they are very tightly coupled, and creating new workflows that extend this class would be jumping through hoops to define those inter-related methods. When -- and maybe missing something here -- a workflow could just be responsible for defining something like fetch_items() which would yield nearly everything you need to instantiate an ItemSubmission.

Which, and I'm feeling dizzy from circular questioning, why wouldn't they directly yield an ItemSubmission then?

Each of those workflows could, and should, define helper methods as needed to support this work. But in some cases, like reading a CSV from S3, it feels like get_batch_metadata (reading the CSV) and yielding items get_item_metadata could be achieved with a couple lines like:

import pandas as pd df = pd.read_csv("s3://....") for _, row in df.iterrows(): yield row # <----- basically an Item

or to extend that for some questions above:

import pandas as pd from dsc import ItemSubmission class FooWorkflow: def __init__(self, csv_location): self.csv_location = csv_location def fetch_items(): df = pd.read_csv(self.csv_location) for _, row in df.iterrows(): # extract identifiers, bit streams, etc. yield ItemSubmission(...)

Sorry if confusing the issue, and happy to hold off on these questions until more is established.

ghukill · 2024-12-13T19:01:55Z

dsc/base.py

+            metadata_keyname = f"{self.s3_prefix}/{item_identifier}_metadata.json"
+            item_submission = ItemSubmission(
+                source_metadata=item_metadata,
+                metadata_mapping=self.metadata_mapping,


My next can-o-worms question is here.

So it's safe to assume that metadata_mapping is a pretty critically important part of each workflow? This would be a dictionary that maps the item's "original" metadata dictionary to metadata fields that align with the dspace metadata profile?

Super rough example:

{ "title":"dc.title", ... }

Do you expect there will be any difficulties with nesting, logic, or multivalued fields?

For example: what if the input metadata for workflow Foo has a column in the CSV with semicolon delimited subjects, e.g. apples;washington;mountains. And the goal would be multiple dc.subject field values for each. I'm unsure how a dictionary mapping could explode a single value into multiple values.

But, this assumes quite a bit. First, that DSpace's metadata profile can (or will) allow for more complex or multivalued fields. Second, that we don't just require the input metadata format to have basically a 1:1 mapping to DSpace fields already present.

I think this consideration points to ItemSubmission.create_dspace_metadata() (which I have not yet looked closely at) which I would expect to take the source_metadata and apply the metadata_mapping to get the final item metadata for DSpace.

Going to leave this comment here, as-is, and followup in that method.

ghukill · 2024-12-13T19:09:01Z

dsc/item_submission.py

+
+                field_value = self.source_metadata.get(field_mapping["source_field_name"])
+                if field_value:
+                    delimiter = field_mapping["delimiter"]


Ah-ha! Related to my comment above about metadata_mapping, I see now that each entry in that is an object itself, which can provide more information for this mapping.

Request: can you add an example of what this metadata mapping might look like? With an example, I think I would understand that a single dictionary could suffice.

ghukill · 2024-12-13T19:10:40Z

tests/conftest.py

+        "dc.contributor": {
+            "source_field_name": "contributor",
+            "language": None,
+            "delimiter": "|",


Thanks for adding an example here of a delimiter. My comment above requesting an example in the docstring stands, but double-helpful to have this example here (and test it, of course).

Yes, I'll add that to the docstring!

ghukill · 2024-12-13T19:26:00Z

dsc/base.py

Commenting here, but this touches on file organization more generally.

what if base.py was renamed to workflows.py, where it may contain additional classes or helper functions as the project grows

what if item_submission.py was renamed to items.py, where it's a bit more generic to just providing classes about "Items" (maybe starting with ItemSubmission, but perhaps there is room eventually for something like ItemSource or ItemMetadata)

create a folder called /sources where actual workflows that extend these could live

Resulting in a structure like:

/ dsc / sources - .gitkeep # ensures the directory persists, even before source workflows are defined - wiley.py # which would contain a class like `WileyWorkflow` - ... - ... - items.py - workflows.py

This could help make clear that anything under /sources is an extension of these base files and classes. And then each of those "base" files would be descriptive about what kind of classes it held.

NOTE: we could create the /sources directory now and include a file like .gitkeep which would ensure it gets created and committed, even though we don't have sources yet.

ghukill · 2024-12-13T19:34:18Z

dsc/base.py

+
+        MUST NOT be overridden by workflow subclasses.
+        """
+        s3_client = S3Client()


I might propose that ItemSubmission could initialize it's own S3Client to use. I know we've touched on caching and reusing clients, but I think the performance gain is basically non-existent here, given how quickly this is initialized.

I would think that ItemSubmission.generate_and_upload_dspace_metadata() could just instantiate and use a client right when it needs it, instead of applying to the ItemSubmission class instance.

Or, have a property like this if multiple methods may eventually require an S3 client:

@property def s3_client(self): return S3Client()

ghukill · 2024-12-13T19:51:53Z

dsc/item_submission.py

+
+
+@dataclass
+class ItemSubmission:


Do you think ItemSubmission would benefit from some kind of validate() method? I'm unsure offhand what that might look for, but I'm noticing now that we call generate_and_upload_dspace_metadata() in the for loop of the workflow and it might be helpful to check things look good before we fire it off.

I did have a valid_dspace_metadata method in wiley-deposits but I may want to rework it since that had specific field names to validate while these workflows may be less constrained. But yes, there should be some validation!

Side note: If we find that dataclass module is the right fit for this module, I propose calling the proposed validate() method in a __post_init__ method!

ghukill · 2024-12-13T20:00:55Z

dsc/base.py

+        self.s3_prefix: str | None = s3_prefix
+
+    @final
+    def generate_submission_batch(self) -> Iterator[tuple[str, list[str]]]:


I'm realizing after my first pass that I have a couple of fundamental questions about this base workflow method.

When this is complete, is it correct that:

each ItemSubmission instantiated in the loop will have uploaded a JSON file to S3 that is kind of like a SIP for DSS to retrieve and ingest into DSpace

it yields the URI of each new ItemSubmission, which I believe to be the S3 URI of the JSON file uploaded to S3 for that item

it also yields ItemSubmission.bitstream_uris, which are not yet getting set on ItemSubmission

My question: what is utilizing the results of this yielding? Will it be another class method? or the CLI itself? Unless I'm grossly misunderstanding this application, wouldn't something need to send these as an SQS message to the DSS input queue?

Related question -- and apologies if this will come later -- what happens with the bistream URIs? Are they included in the SQS message for DSS?

The yielded URIs will be utilized by the deposit command to generate the SQS message, I mentioned in the PR description

with the expectation that unwritten CLI commands will be the interface with the DSS SQS input/output queues and the SES emails.

The workflow class is intended for just the workflow-opinionated code while the CLI command will be a wrapper with universal functionality (e.g. sending SQS messages, SES emails)

And yes, the bitstream URIs are in the the SQS message as well as the URI of the JSON metadata file that is uploaded to S3, DSS submission message spec for reference.

The steps would be:

ItemSubmission instantiated with source metadata and bitstreams URIs

DSpace metadata generated from source metadata

DSpace metadata uploaded to S3 as JSON file and URI is returned

ItemSubmission yields metadata_uris and bitstream_uris to deposit` CLI command for SQS message

deposit CLI command generates and send SQS message to DSS

Ah, sorry missed that in the PR! Thanks for pointing that out.

And thanks for sharing the DSS submission message spec. That document is SUPER helpful, and I think it'd be handy to include in docstrings in this application where applicable.

And, thanks for the steps; I had missed that get_bitstream_uris() is a method that workflows will need to define, and that's where they come from.

This is quite helpful for helping me clarify my next question: any reason why we couldn't yield the ItemSubmission object itself? If I'm understanding correctly, it will have the metadata_uri and the bitstream_uris.

Happy to wait on that until the CLI level orchestration is in place, where these would get sent out as SQS messages. Are you envisioning some kind of DSSInputMessage class that could encapsulate preparing the final SQS payload for sending, that would meet the spec?

Which I think has a natural follow-up question: why couldn't ItemSubmission have such a method to perform this work? Like to_dss_input_queue_message()?

Happy to meet and discuss if helpful, or keep hashing out in comments. When I see a class like ItemSubmission, which is responsible for uploading the components of a SIP to S3 (metadata and files), I wonder why it wouldn't have the capabilities of actually preparing (and maybe actually sending?) the SQS message for DSS.

Which I think has a natural follow-up question: why couldn't ItemSubmission have such a method to perform this work? Like to_dss_input_queue_message()?

I think it certainly could! Let's get @jonavellecuerdo 's thoughts as well but now I'm inclined to build SQS message formatting into ItemSubmission rather than the CLI command

I do think yielding the actual ItemSubmission instance will allow the object to be accessed outside the context of the BaseWorkflow.generate_submission_batch method. 🤔 I also think if all ItemSubmission instances must comply with the DSS submission message spec, then--as @ehanson8 writes--it also follows that defining the SQS message formatting in ItemSubmission would make sense!

* Restructure app to use workflows directory * Add DSS_INPUT_QUEUE env var * Add InvalidDSpaceMetadataError exception * Refactor generate_and_upload_dspace_metadata method to upload_dspace_metadata * Shift create_dspace_metadata and validate_dspace_metadata from ItemSubmission to BaseWorkflow * Add init params to BaseWorkflow * Add stub run method to BaseWorkflow class * Add dspace_metadata fixture * Add tests and fixtures for BaseWorkflow @Final methods

coveralls · 2024-12-17T17:21:39Z

Pull Request Test Coverage Report for Build 12401646765

Details

87 of 88 (98.86%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.9%) to 96.992%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
dsc/workflows/base/init.py	63	64	98.44%

Totals
Change from base Build 12264644297:	0.9%
Covered Lines:	258
Relevant Lines:	266

💛 - Coveralls

ghukill

Updates are looking good! The call earlier today was helpful in understanding some of the changes.

Left a few comments.

ghukill · 2024-12-17T21:09:08Z