In 1101 base workflow #21

Merged: 5 commits merged into main from IN-1101-base-workflow on Dec 19, 2024
Conversation

ehanson8 (Contributor):

Purpose and background context

Certain code blocks have bounced around while writing this, and I am finding there are many different ways to organize the boundaries between the CLI command, the BaseWorkflow class, and the ItemSubmission class. The abstract methods on the BaseWorkflow class outline the boundary I'm seeing between the workflow and the item, with the expectation that unwritten CLI commands will be the interface with the DSS SQS input/output queues and the SES emails. My guiding principle was to minimize the number of values passed between the various classes, but as always, this is wide open to iteration!
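
As a rough orientation for reviewers, the boundary I'm describing looks something like this (method names follow the ones discussed below; the signatures are illustrative, not the final API):

from abc import ABC, abstractmethod
from typing import Any

class BaseWorkflow(ABC):
    """Workflow-opinionated code only; CLI commands handle SQS/SES."""

    @abstractmethod
    def get_batch_metadata(self) -> list[Any]:
        """Retrieve the batch of source metadata for this workflow."""

    @abstractmethod
    def get_item_identifier(self, item_metadata: Any) -> str:
        """Derive the item identifier from a single item's source metadata."""

    @abstractmethod
    def get_bitstream_uris(self, item_identifier: str) -> list[str]:
        """Collect the S3 URIs of an item's bitstreams."""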

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* The BaseWorkflow and ItemSubmission classes are the foundation from which a significant portion of the application's functionality will be derived.

How this addresses that need:
* Add BaseWorkflow class with abstract methods outlining the expected functionality.
* Add ItemSubmission class with largely complete functionality and corresponding unit tests and fixtures.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1101
Comment on lines 29 to 36
self.s3_client.put_file(
    json.dumps(dspace_metadata),
    bucket,
    self.metadata_keyname,
)
metadata_uri = f"s3://{bucket}/{self.metadata_keyname}"
logger.info(f"Metadata uploaded to S3: {metadata_uri}")
self.metadata_uri = metadata_uri
@ehanson8 (Contributor, Author) · Dec 13, 2024:

I considered exception handling here, but as with wiley-deposits, I think it's better to crash the run on a ClientError than to continue if we can't get a metadata file into S3.

@ehanson8 marked this pull request as ready for review December 13, 2024 16:30
@ghukill left a comment:

Submitting "Comment" as I have some questions and comments sprinkled through.

Overall, I'm definitely understanding the design. And in these early stages, I think it's nice that it's relatively simple.

At this early stage, what I'm most interested in are the relationships between a Workflow and an ItemSubmission. I left some comments about typing, and how it seems like you're anticipating that every source record is a dict. If so, I left some comments / requests to lean further in that direction.

I can't shake the feeling that perhaps some kind of class like SourceItem or ItemMetadata could be helpful, instead of passing around dictionaries... but maybe wait until you've had time to respond to this first round of questions.
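
To make that half-formed thought a bit more concrete, a hypothetical sketch (the class name and fields here are invented, not a proposal for specific names):

from dataclasses import dataclass
from typing import Any

@dataclass
class ItemMetadata:
    item_identifier: str
    source_metadata: dict[str, Any]
    bitstream_uris: list[str]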

dsc/base.py Outdated
Comment on lines 49 to 51
batch_metadata = self.get_batch_metadata()
for item_metadata in batch_metadata:
    item_identifier = self.get_item_identifier(item_metadata)

This might be a can of worms, but I think it gets at one of my primary questions at this point.

So get_batch_metadata() is expected to return a list of metadata objects for each item that will be submitted. And then we pass that metadata object to get_item_identifier(), where the metadata object could be anything that this method is equipped to handle (noting item_metadata:Any in the signature).

Where I think I'm going with this, is that get_batch_metadata() and get_item_identifier() are kind of tightly coupled -- and this isn't necessarily a bad thing -- in that the objects returned by get_batch_metadata() must satisfy the input to get_item_identifier().

While perhaps many workflows might have a CSV file they parse and then return a dict, and get_item_identifier() could just grab a key... perhaps others might even return their own dataclass object or something, and then get_item_identifier() would use a property on that class.

However, I see that we assign this "metadata object" to ItemSubmission.source_metadata which itself is typed as dict[str, Any].

I was going to propose something like:

# near top of file
from typing import TypeVar

ItemMetadata = TypeVar("ItemMetadata")
...

def get_batch_metadata(self) -> list[ItemMetadata]:
    ...

def get_item_identifier(self, item_metadata: ItemMetadata) -> str:
    ...

Where we convey through a generic type that these are related. But this kind of generic type + flexibility doesn't really work if ItemSubmission is expecting a dictionary for source_metadata.

To keep things simple, do you think it'd be worth just typing the output of get_batch_metadata() to something like list[dict]? And then get_item_identifier() could accept a dict? Could always revisit if some workflow would really benefit from not yielding a dictionary for each item's metadata.
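
In other words, something like this simpler, non-generic version (just illustrating the suggestion, not final signatures):

def get_batch_metadata(self) -> list[dict]:
    ...

def get_item_identifier(self, item_metadata: dict) -> str:
    ...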

@ehanson8 (Contributor, Author):

I would say every method in the workflow classes is tightly coupled and opinionated to that workflow. In retrospect, I think I erred in typing item_metadata: Any in get_item_identifier and source_metadata: dict[str, Any] in ItemSubmission, since I didn't expect that object to be transformed between getting the identifier and instantiating ItemSubmission.

However! That is now suggesting to me that a get_item_metadata(batch_metadata) method is needed and that those type hints should remain:

batch_metadata = self.get_batch_metadata()
for item_metadata in self.get_item_metadata(batch_metadata):
    item_identifier = self.get_item_identifier(item_metadata)

Currently, batch_metadata in most workflows is a CSV so get_item_metadata would be a simple dict iterator. For OpenCourseWare, it's a list of zip files, so get_item_metadata iterates and extracts a dict from each zip file. If a future source uses XML, get_item_metadata would iterate and generate a dict from the XML. Thoughts?
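
To sketch what I mean (a rough illustration only; this assumes batch_metadata is an open CSV file object, which isn't settled yet):

import csv
from collections.abc import Iterator
from typing import Any

def get_item_metadata(self, batch_metadata: Any) -> Iterator[dict[str, Any]]:
    # CSV-based workflows: just a thin dict iterator over the rows
    yield from csv.DictReader(batch_metadata)

# the OpenCourseWare override would instead loop over the zip files in
# batch_metadata and yield a dict extracted from each one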

@ghukill · Dec 13, 2024:

So would it be accurate to say?

  1. get_batch_metadata() --> retrieves metadata from somewhere, but what is returned still contains multiple items; probably most often downloading from S3, but theoretically could get it other ways too (e.g. an HTTP request to ETD... 😎 ). The result might be a single CSV reader/dataframe, a single parsed XML file object, etc.

  2. get_item_metadata() --> would yield actual dictionaries of item metadata, for each item, doing whatever is needed (e.g. unzipping files, parsing XML, looping through CSV rows, etc.)

I'm having trouble articulating what feels confusing about the relationship of,

  • get_batch_metadata
  • get_item_identifier
  • get_bitstream_uris
  • and now the proposed get_item_metadata

It feels kind of like they are very tightly coupled, and creating new workflows that extend this class would mean jumping through hoops to define those inter-related methods. When -- and maybe I'm missing something here -- a workflow could just be responsible for defining something like fetch_items(), which would yield nearly everything you need to instantiate an ItemSubmission.

Which, and I'm feeling dizzy from circular questioning, why wouldn't they directly yield an ItemSubmission then?

Each of those workflows could, and should, define helper methods as needed to support this work. But in some cases, like reading a CSV from S3, it feels like reading the CSV (get_batch_metadata) and yielding items (get_item_metadata) could be achieved with a couple of lines like:

import pandas as pd

df = pd.read_csv("s3://....")
for _, row in df.iterrows():
   yield row # <----- basically an Item

or to extend that for some questions above:

import pandas as pd

from dsc import ItemSubmission

class FooWorkflow:

    def __init__(self, csv_location):
        self.csv_location = csv_location

    def fetch_items(self):
        df = pd.read_csv(self.csv_location)
        for _, row in df.iterrows():
            # extract identifiers, bitstreams, etc.
            yield ItemSubmission(...)

Sorry if confusing the issue, and happy to hold off on these questions until more is established.

dsc/base.py Outdated
metadata_keyname = f"{self.s3_prefix}/{item_identifier}_metadata.json"
item_submission = ItemSubmission(
    source_metadata=item_metadata,
    metadata_mapping=self.metadata_mapping,

My next can-o-worms question is here.

So it's safe to assume that metadata_mapping is a pretty critically important part of each workflow? This would be a dictionary that maps the item's "original" metadata dictionary to metadata fields that align with the dspace metadata profile?

Super rough example:

{
    "title":"dc.title",
    ...
}

Do you expect there will be any difficulties with nesting, logic, or multivalued fields?

For example: what if the input metadata for workflow Foo has a column in the CSV with semicolon delimited subjects, e.g. apples;washington;mountains. And the goal would be multiple dc.subject field values for each. I'm unsure how a dictionary mapping could explode a single value into multiple values.
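
To make the semicolon example concrete, here is roughly the kind of explode step I'd imagine (the shape of the DSpace metadata entries below is a guess, and the field names are hypothetical):

mapping_entry = {"source_field_name": "subjects", "language": None, "delimiter": ";"}
source_metadata = {"subjects": "apples;washington;mountains"}

field_value = source_metadata[mapping_entry["source_field_name"]]
delimiter = mapping_entry["delimiter"]
values = field_value.split(delimiter) if delimiter else [field_value]
metadata_entries = [
    {"key": "dc.subject", "value": value, "language": mapping_entry["language"]}
    for value in values
]
# -> three separate dc.subject entries, one per delimited value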

But, this assumes quite a bit. First, that DSpace's metadata profile can (or will) allow for more complex or multivalued fields. Second, that we don't just require the input metadata format to have basically a 1:1 mapping to DSpace fields already present.

I think this consideration points to ItemSubmission.create_dspace_metadata() (which I have not yet looked closely at) which I would expect to take the source_metadata and apply the metadata_mapping to get the final item metadata for DSpace.

Going to leave this comment here, as-is, and followup in that method.


field_value = self.source_metadata.get(field_mapping["source_field_name"])
if field_value:
    delimiter = field_mapping["delimiter"]

Ah-ha! Related to my comment above about metadata_mapping, I see now that each entry in that is an object itself, which can provide more information for this mapping.

Request: can you add an example of what this metadata mapping might look like? With an example, I think I would understand that a single dictionary could suffice.

"dc.contributor": {
"source_field_name": "contributor",
"language": None,
"delimiter": "|",

Thanks for adding an example here of a delimiter. My comment above requesting an example in the docstring stands, but double-helpful to have this example here (and test it, of course).

@ehanson8 (Contributor, Author):

Yes, I'll add that to the docstring!

dsc/base.py Outdated
@ghukill · Dec 13, 2024:

Commenting here, but this touches on file organization more generally.

  • what if base.py was renamed to workflows.py, where it may contain additional classes or helper functions as the project grows
  • what if item_submission.py was renamed to items.py, where it's a bit more generic to just providing classes about "Items" (maybe starting with ItemSubmission, but perhaps there is room eventually for something like ItemSource or ItemMetadata)
  • create a folder called /sources where actual workflows that extend these could live

Resulting in a structure like:

/ dsc
  / sources
    - .gitkeep  # ensures the directory persists, even before source workflows are defined
    - wiley.py  # which would contain a class like `WileyWorkflow`
    - ...
    - ...
  - items.py
  - workflows.py

This could help make clear that anything under /sources is an extension of these base files and classes. And then each of those "base" files would be descriptive about what kind of classes it held.

NOTE: we could create the /sources directory now and include a file like .gitkeep which would ensure it gets created and committed, even though we don't have sources yet.

dsc/base.py Outdated

MUST NOT be overridden by workflow subclasses.
"""
s3_client = S3Client()

I might propose that ItemSubmission could initialize its own S3Client to use. I know we've touched on caching and reusing clients, but I think the performance gain is basically non-existent here, given how quickly this is initialized.

I would think that ItemSubmission.generate_and_upload_dspace_metadata() could just instantiate and use a client right when it needs it, instead of applying to the ItemSubmission class instance.

Or, have a property like this if multiple methods may eventually require an S3 client:

@property
def s3_client(self):
    return S3Client()

@ehanson8 (Contributor, Author):

Will do!



@dataclass
class ItemSubmission:

Do you think ItemSubmission would benefit from some kind of validate() method? I'm unsure offhand what that might look for, but I'm noticing now that we call generate_and_upload_dspace_metadata() in the for loop of the workflow and it might be helpful to check things look good before we fire it off.

@ehanson8 (Contributor, Author):

I did have a valid_dspace_metadata method in wiley-deposits but I may want to rework it since that had specific field names to validate while these workflows may be less constrained. But yes, there should be some validation!

Contributor:

Side note: if we find that a dataclass is the right fit for this class, I propose calling the proposed validate() method in a __post_init__ method!
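
Roughly like this (a minimal sketch, assuming ItemSubmission stays a dataclass and grows a validate() method; the actual checks are TBD):

from dataclasses import dataclass
from typing import Any

@dataclass
class ItemSubmission:
    source_metadata: dict[str, Any]

    def __post_init__(self) -> None:
        # run validation as soon as the instance is constructed
        self.validate()

    def validate(self) -> None:
        ...  # checks TBD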

dsc/base.py Outdated
self.s3_prefix: str | None = s3_prefix

@final
def generate_submission_batch(self) -> Iterator[tuple[str, list[str]]]:

I'm realizing after my first pass that I have a couple of fundamental questions about this base workflow method.

When this is complete, is it correct that:

  1. each ItemSubmission instantiated in the loop will have uploaded a JSON file to S3 that is kind of like a SIP for DSS to retrieve and ingest into DSpace
  2. it yields the URI of each new ItemSubmission, which I believe to be the S3 URI of the JSON file uploaded to S3 for that item
  3. it also yields ItemSubmission.bitstream_uris, which are not yet getting set on ItemSubmission

My question: what is utilizing the results of this yielding? Will it be another class method? or the CLI itself? Unless I'm grossly misunderstanding this application, wouldn't something need to send these as an SQS message to the DSS input queue?

Related question -- and apologies if this will come later -- what happens with the bitstream URIs? Are they included in the SQS message for DSS?

@ehanson8 (Contributor, Author):

The yielded URIs will be utilized by the deposit command to generate the SQS message. As I mentioned in the PR description:

with the expectation that unwritten CLI commands will be the interface with the DSS SQS input/output queues and the SES emails.

The workflow class is intended for just the workflow-opinionated code while the CLI command will be a wrapper with universal functionality (e.g. sending SQS messages, SES emails)

And yes, the bitstream URIs are in the SQS message, as well as the URI of the JSON metadata file that is uploaded to S3 (see the DSS submission message spec for reference).

The steps would be:

  1. ItemSubmission instantiated with source metadata and bitstreams URIs
  2. DSpace metadata generated from source metadata
  3. DSpace metadata uploaded to S3 as JSON file and URI is returned
  4. ItemSubmission yields metadata_uri and bitstream_uris to the deposit CLI command for the SQS message
  5. deposit CLI command generates and sends the SQS message to DSS
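
Very roughly, I picture the deposit command's loop looking something like this (nothing here is written yet; the parameters to send_submission_message are a guess at the eventual signature, not a final API):

# 'workflow' is an already-instantiated workflow class
sqs_client = SQSClient()
for metadata_uri, bitstream_uris in workflow.generate_submission_batch():
    # build and send the DSS input queue message from the yielded URIs,
    # plus the workflow-level values (output queue, collection handle, etc.)
    sqs_client.send_submission_message(
        workflow.workflow_name,
        workflow.output_queue,
        workflow.collection_handle,
        metadata_uri,
        bitstream_uris,
    )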

@ghukill · Dec 13, 2024:

Ah, sorry missed that in the PR! Thanks for pointing that out.

And thanks for sharing the DSS submission message spec. That document is SUPER helpful, and I think it'd be handy to include in docstrings in this application where applicable.

And, thanks for the steps; I had missed that get_bitstream_uris() is a method that workflows will need to define, and that's where they come from.

This is quite helpful for helping me clarify my next question: any reason why we couldn't yield the ItemSubmission object itself? If I'm understanding correctly, it will have the metadata_uri and the bitstream_uris.

Happy to wait on that until the CLI level orchestration is in place, where these would get sent out as SQS messages. Are you envisioning some kind of DSSInputMessage class that could encapsulate preparing the final SQS payload for sending, that would meet the spec?

Which I think has a natural follow-up question: why couldn't ItemSubmission have such a method to perform this work? Like to_dss_input_queue_message()?

Happy to meet and discuss if helpful, or keep hashing out in comments. When I see a class like ItemSubmission, which is responsible for uploading the components of a SIP to S3 (metadata and files), I wonder why it wouldn't have the capabilities of actually preparing (and maybe actually sending?) the SQS message for DSS.
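
For the sake of illustration, something like this is what I have in mind (the field names are placeholders; the real ones should come from the DSS submission message spec):

def to_dss_input_queue_message(self, submission_system: str, collection_handle: str) -> dict:
    # assemble the SQS message body for the DSS input queue from what
    # ItemSubmission already knows, plus a couple of workflow-level values
    return {
        "submission_system": submission_system,
        "collection_handle": collection_handle,
        "metadata_uri": self.metadata_uri,
        "bitstream_uris": self.bitstream_uris,
    }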

@ehanson8 (Contributor, Author):

Which I think has a natural follow-up question: why couldn't ItemSubmission have such a method to perform this work? Like to_dss_input_queue_message()?

I think it certainly could! Let's get @jonavellecuerdo's thoughts as well, but now I'm inclined to build SQS message formatting into ItemSubmission rather than the CLI command.

Contributor:

I do think yielding the actual ItemSubmission instance will allow the object to be accessed outside the context of the BaseWorkflow.generate_submission_batch method. 🤔 I also think if all ItemSubmission instances must comply with the DSS submission message spec, then--as @ehanson8 writes--it also follows that defining the SQS message formatting in ItemSubmission would make sense!

* Restructure app to use workflows directory
* Add DSS_INPUT_QUEUE env var
* Add InvalidDSpaceMetadataError exception
* Refactor generate_and_upload_dspace_metadata method to upload_dspace_metadata
* Shift create_dspace_metadata and validate_dspace_metadata from ItemSubmission to BaseWorkflow
* Add init params to BaseWorkflow
* Add stub run method to BaseWorkflow class
* Add dspace_metadata fixture
* Add tests and fixtures for BaseWorkflow @Final methods
@coveralls commented Dec 17, 2024:

Pull Request Test Coverage Report for Build 12401646765

Details

  • 87 of 88 (98.86%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.9%) to 96.992%

Changes missing coverage:
  • dsc/workflows/base/__init__.py: 63 of 64 changed/added lines covered (98.44%)

Totals:
  • Change from base Build 12264644297: +0.9%
  • Covered Lines: 258
  • Relevant Lines: 266

💛 - Coveralls

@ghukill left a comment:

Updates are looking good! The call earlier today was helpful in understanding some of the changes.

Left a few comments.


Very minor, very optional: class BaseWorkflow could be defined in the dsc/workflows/base/__init__.py file if we wanted to avoid another file like base_workflow.py or base.py.

@ghukill · Dec 18, 2024:

Sometimes I think it's easier to work backwards from what imports feel natural, then see how files/classes could be organized.

My expectation would be something like:

# if I'm defining a brand new workflow
from dsc.workflows.base import BaseWorkflow

# if I'm extending a base workflow like SimpleCSV
# still get from workflows.base, because its __all__ provides it
from dsc.workflows.base import SimpleCSV

# if I need to _use_ a workflow (maybe in the CLI?)
# workflows.__init__.py provides this import
from dsc.workflows import Wiley

This suggests something like this to me:

/ dsc
    / workflows
        - __init__.py
        /  base
            - __init__.py # BaseWorkflow defined here
            - simple_csv.py # SimpleCSV defined here
        - wiley.py
        - ocw.py 

The __all__ object could be used in dsc.workflows.base.__init__.py to provide access to SimpleCSV, Wiley, etc. Just a possibility!
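
For example, the top-level dsc/workflows/__init__.py could be what makes that last import work (class and module names are taken from the examples above, purely illustrative):

# dsc/workflows/__init__.py
from dsc.workflows.base import BaseWorkflow, SimpleCSV
from dsc.workflows.wiley import Wiley

__all__ = ["BaseWorkflow", "SimpleCSV", "Wiley"]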

@ehanson8 (Contributor, Author):

I like this and am currently implementing!

Comment on lines 34 to 35
submission_system: The system to which item submissions will be sent (e.g.
DSpace@MIT)

This reveals some ignorance about the greater context of the application, but when would it not be DSpace@MIT? For Dome?

Would it make sense to make this only values from an Enum? Is this used directly in the SQS message that then tells DSS something special?

Or... I wonder if it would make sense for each workflow to define this on the class itself (i.e. a class-level attribute; here's an example from GeoHarvester), instead of passing it to the Workflow when instantiated? Seems like any given workflow will only ever want to deposit into a single repository.
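
i.e. something like this (the names below are invented here just to illustrate the class-level attribute idea):

from dsc.workflows.base import BaseWorkflow

class WileyWorkflow(BaseWorkflow):
    # values that are static for this workflow, set once on the class
    workflow_name = "wiley"
    submission_system = "DSpace@MIT"
    output_queue = "dss-wiley-output"  # hypothetical queue name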


If class-level (aka workflow-level) static attributes make sense, I do wonder about other things passed here. For example: would output_queue also be pretty workflow-specific? Or collection_handle? I honestly don't know, but maybe worth considering.

@ehanson8 (Contributor, Author):

You're correct: currently everything is DSpace@MIT, but eventually I would like DSS to point to Dome for DDC uploads, so this is to accommodate that. And yes, defining the init params on the derived workflow classes makes a lot of sense since they would largely be static for the workflow. s3_prefix and collection_handle are the only ones that could potentially be dynamic, though I'd like to lock that down with the content owners where possible.

dsc/workflows/base/base_workflow.py (3 resolved comments hidden)
"""
metadata_entries = []
for field_name, field_mapping in self.metadata_mapping.items():
if field_name not in ["item_identifier", "source_system_identifier"]:

Why would item_identifier or source_system_identifier be present in the mapping in the first place?

It seems as if the mapping would be constructed specifically for this application, and therefore could be omitted.

@ehanson8 (Contributor, Author):

item_identifier is provided by stakeholders in the CSV but may require additional parsing in the get_item_identifier method beyond retrieving it from the CSV (e.g. could be the full filename when we want to omit the extension).

source_system_identifier is for creating reports used to update related metadata in other repositories (e.g. a digital object URI in ArchivesSpace is the big use case) with the newly published DSpace handle. So it needs to be there for linking but isn't part of the DSpace metadata that's generated. My expectation is this will be done in the finalize command. But on second thought, it might be better to omit source_system_identifier until I'm ready to fully implement that functionality in the ArchivesSpace workflow, thoughts?


But second thought, it might be better to omit source_system_identifier until I'm ready to fully implement that functionality in the ArchivesSpace workflow, thoughts?

Might not be a bad idea? We've been trying a similar approach -- and it's hard -- in the parquet work, to keep the early passes really tight and focused. I think it's okay to stub something when it's almost guaranteed to show up later, and perhaps seeing its contours helps in understanding the current code under review. But yeah, maybe it'll be easier to reason about, name, and organize when things settle into place a bit more. Would make sense to me to wait.

@ehanson8 (Contributor, Author):

Removed!

dsc/workflows/base/base_workflow.py (resolved comment hidden)
Comment on lines 63 to 76
sqs_client = SQSClient()
for item_submission in self.item_submissions_iter():
    item_submission.upload_dspace_metadata(
        self.s3_bucket, item_submission.metadata_keyname
    )
    sqs_client.send_submission_message(
        item_submission.item_identifier,
        self.workflow_name,
        self.output_queue,
        self.submission_system,
        self.collection_handle,
        item_submission.metadata_uri,
        item_submission.bitstream_uris,
    )

I'm happy to wait on this discussion if it would be helpful to tease out when this is actually built out.

Given this proposal, it feels like it could be a little difficult to test with this approach, where both upload_dspace_metadata() and send_submission_message() appear to be crafting objects before they send them off into the ether.

What about a more verbose, but potentially easier to test flow:

for item_submission in self.item_submissions_iter():

    # prepare dspace metadata and upload to S3
    dspace_metadata = item_submission.prepare_dspace_metadata()
    self.upload_dspace_metadata(metadata_s3_key, dspace_metadata)
    
    # send SQS message to DSS input queue
    submission_sqs_message = item_submission.prepare_sqs_input_message()
    sqs_client.publish_submission_message(submission_sqs_message)

This makes some assumptions that you may or may not agree with:

  • the BaseWorkflow class would have the functionality to perform uploads to S3
  • the ItemSubmission class could only prepare payloads that will get sent
    • this removes the need for ItemSubmission to know anything about where they will go (S3 buckets, queues,
      etc.; all of which are sort of set on the workflow instance itself)
  • the ItemSubmission class is what prepares SQS messages, but the SQSClient actually sends it
    • or, perhaps the BaseWorkflow could have a method like submit_to_dss_input_queue() which itself would utilize the SQSClient for actually publishing the message, but it would still be the workflow that kicked that off and was ultimately responsible for it

Just food for thought. But I realize this just a stub, and will get more attention when that part is written.

@ehanson8 (Contributor, Author):

This is very helpful, things are already shifting from this proposed code and testing was a major consideration. This will be ready for full review in the next PR after I refactor the SQS functionality!

@jonavellecuerdo (Contributor) left a comment:

@ehanson8 I agree with the additional changes requested by @ghukill ! I also had one minor request re: consistent unit test names.

I'll wait for final approval once additional changes have been committed!

* Add ItemMetadatMissingRequiredFieldError exception
* Rename metadata_keyname > metadata_s3_key
* Shift BaseWorkflow to base/__init__.py and remove base_workflow.py
* Add required key to metadata mapping and corresponding unit test
* Update unit tests names to include _success
@ghukill left a comment:

All said, I feel comfortable approving! Truly nice work all around, and thanks for the discussions. It's one of the harder parts of a new project: establishing some patterns when parts are still stubbed or unfinished.

It's feeling opinionated enough that I can envision where future functionality may land, what it'll be called, etc. But, also simple and flexible enough that if future functionality requires some shifting, that would be doable as well.

Left a couple of comments, but wholly optional. Approved as-is, with or without changes.

Comment on lines +92 to +93
dspace_metadata = self.create_dspace_metadata(item_metadata)
self.validate_dspace_metadata(dspace_metadata)

Looking at this now, I could envision validating the generated metadata inside the creation. But I think there are arguably advantages to doing it this way as well. So NOT a request, just an observation, in case the pattern emerges elsewhere.

Comment on lines +17 to +18
metadata_s3_key: str
metadata_uri: str = ""

Minor, but maybe worth considering renaming metadata_uri to metadata_s3_uri to match the pattern of metadata_s3_key.

@ehanson8 (Contributor, Author) · Dec 18, 2024:

Updating to metadata_s3_uri and bitstream_s3_uris for clarity and consistency in the upcoming PR


Taking some of the other file reorganizing and renaming into consideration -- and thanks BTW, it's feeling good to navigate around! -- I could envision this file called something more high level like items.py. If other classes ever made sense to add specific to items, it would be a natural place for it.

I don't have evidence or even a pointable philosophy to support it, but when the classname mirrors the filename, it feels a bit off. While we may not need more "item" type classes, it's tight coupling between file and class names.

Totally optional.

@ehanson8 merged commit cd6b202 into main on Dec 19, 2024
2 checks passed
@ehanson8 deleted the IN-1101-base-workflow branch December 19, 2024 17:09