IN-1101 base workflow #21
Conversation
Why these changes are being introduced:
* The BaseWorkflow and ItemSubmission classes are the foundation from which a significant portion of the application's functionality will be derived.
How this addresses that need:
* Add BaseWorkflow class with abstract methods outlining the expected functionality.
* Add ItemSubmission class with largely complete functionality and corresponding unit tests and fixtures.
Side effects of this change:
* None
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1101
dsc/item_submission.py
Outdated
self.s3_client.put_file(
    json.dumps(dspace_metadata),
    bucket,
    self.metadata_keyname,
)
metadata_uri = f"s3://{bucket}/{self.metadata_keyname}"
logger.info(f"Metadata uploaded to S3: {metadata_uri}")
self.metadata_uri = metadata_uri
I considered exception handling here but, as with wiley-deposits, I think it's better to crash the run on a ClientError than continue if we can't get a metadata file into S3.
Submitting "Comment" as I have some questions and comments sprinkled through.
Overall, I'm definitely understanding the design. And in these early stages, I think it's nice that it's relatively simple.
At this early stage, what I'm most interested in are the relationships between a Workflow
and an ItemSubmission
. I left some comments about typings, and how it seems like you're anticipating that every source record is a dict
. If so, I left some comments / requests to lean further into that direction.
I can't shake the feeling that perhaps some kind of class like SourceItem
or ItemMetadata
could be helpful, instead of passing around dictionaries... but maybe wait until you've had time to respond to these first round of questions.
dsc/base.py
Outdated
batch_metadata = self.get_batch_metadata()
for item_metadata in batch_metadata:
    item_identifier = self.get_item_identifier(item_metadata)
This might be a can of worms, but I think it gets at one of my primary questions at this point.
So get_batch_metadata() is expected to return a list of metadata objects for each item that will be submitted. And then we pass that metadata object to get_item_identifier(), where the metadata object could be anything that this method is equipped to handle (noting item_metadata: Any in the signature).
Where I think I'm going with this, is that get_batch_metadata() and get_item_identifier() are kind of tightly coupled -- and this isn't necessarily a bad thing -- in that the objects returned by get_batch_metadata() must satisfy the input to get_item_identifier().
While perhaps many workflows might have a CSV file they parse and then return a dict, and get_item_identifier() could just grab a key... perhaps others might even return their own dataclass object or something, and then get_item_identifier() would use a property on that class.
However, I see that we assign this "metadata object" to ItemSubmission.source_metadata, which itself is typed as dict[str, Any].
I was going to propose something like:
# near top of file
from typing import TypeVar

ItemMetadata = TypeVar("ItemMetadata")

...

def get_batch_metadata(self) -> list[ItemMetadata]:
    ...

def get_item_identifier(self, item_metadata: ItemMetadata) -> str:
    ...
Where we express through a generic type that these are related. But, this kind of generic type + flexibility doesn't really work if ItemSubmission is expecting a dictionary for source_metadata.
To keep things simple, do you think it'd be worth just typing the output of get_batch_metadata() to something like list[dict]? And then get_item_identifier() could accept a dict? Could always revisit if some workflow would really benefit from not yielding a dictionary for each item's metadata.
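For illustration, a minimal sketch of that simpler, dict-based typing might look like the following (the class and method names come from this PR; the docstrings and abstract bodies are placeholders, not the actual implementation):
from abc import ABC, abstractmethod
from typing import Any


class BaseWorkflow(ABC):

    @abstractmethod
    def get_batch_metadata(self) -> list[dict[str, Any]]:
        """Return one metadata dict per item in the submission batch."""

    @abstractmethod
    def get_item_identifier(self, item_metadata: dict[str, Any]) -> str:
        """Pull the item identifier out of a single item's metadata dict."""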
I would say every method in the workflow classes is tightly coupled and opinionated to that workflow. In retrospect, I think I erred in typing item_metadata: Any in get_item_identifier and source_metadata: dict[str, Any] in ItemSubmission, since I didn't expect that object to be transformed between getting the identifier and instantiating ItemSubmission.
However! That is now suggesting to me that a get_item_metadata(batch_metadata) method is needed and that those type hints should remain:
batch_metadata = self.get_batch_metadata()
for item_metadata in self.get_item_metadata(batch_metadata):
    item_identifier = self.get_item_identifier(item_metadata)
Currently, batch_metadata in most workflows is a CSV, so get_item_metadata would be a simple dict iterator. For OpenCourseWare, it's a list of zip files, so get_item_metadata iterates and extracts a dict from each zip file. If a future source uses XML, get_item_metadata would iterate and generate a dict from the XML. Thoughts?
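To make that concrete, here is a rough, hypothetical sketch of a CSV-flavored get_item_metadata (it assumes get_batch_metadata hands back the CSV contents as text; only the two method names come from the PR):
import csv
from collections.abc import Iterator
from typing import Any


def get_item_metadata(self, batch_metadata: str) -> Iterator[dict[str, Any]]:
    """Yield one metadata dict per row of the batch CSV.

    A zip-based workflow (e.g. OpenCourseWare) or an XML-based one would
    override this to extract or parse a dict per item instead.
    """
    for row in csv.DictReader(batch_metadata.splitlines()):
        yield dict(row)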
So would it be accurate to say?
- get_batch_metadata() --> retrieves metadata from somewhere, but what is returned still contains multiple items; probably most often downloading from S3, but theoretically could get it other ways too (e.g. an HTTP request to ETD... 😎 ). The result might be a single CSV reader/dataframe, a single parsed XML file object, etc.
- get_item_metadata() --> would yield actual dictionaries of item metadata, for each item, doing whatever is needed (e.g. unzipping files, parsing XML, looping through CSV rows, etc.)
I'm having trouble articulating what feels confusing about the relationship of:
- get_batch_metadata
- get_item_identifier
- get_bitstream_uris
- and now the proposed get_item_metadata
It feels kind of like they are very tightly coupled, and creating new workflows that extend this class would be jumping through hoops to define those inter-related methods. When -- and maybe I'm missing something here -- a workflow could just be responsible for defining something like fetch_items(), which would yield nearly everything you need to instantiate an ItemSubmission.
Which, and I'm feeling dizzy from circular questioning, why wouldn't they directly yield an ItemSubmission then?
Each of those workflows could, and should, define helper methods as needed to support this work. But in some cases, like reading a CSV from S3, it feels like get_batch_metadata (reading the CSV) and get_item_metadata (yielding items) could be achieved with a couple lines like:
import pandas as pd

df = pd.read_csv("s3://....")
for _, row in df.iterrows():
    yield row  # <----- basically an Item
or to extend that for some questions above:
import pandas as pd

from dsc import ItemSubmission


class FooWorkflow:
    def __init__(self, csv_location):
        self.csv_location = csv_location

    def fetch_items(self):
        df = pd.read_csv(self.csv_location)
        for _, row in df.iterrows():
            # extract identifiers, bitstreams, etc.
            yield ItemSubmission(...)
Sorry if I'm confusing the issue, and happy to hold off on these questions until more is established.
dsc/base.py
Outdated
metadata_keyname = f"{self.s3_prefix}/{item_identifier}_metadata.json"
item_submission = ItemSubmission(
    source_metadata=item_metadata,
    metadata_mapping=self.metadata_mapping,
My next can-o-worms question is here.
So it's safe to assume that metadata_mapping is a pretty critically important part of each workflow? This would be a dictionary that maps the item's "original" metadata dictionary to metadata fields that align with the DSpace metadata profile?
Super rough example:
{
    "title": "dc.title",
    ...
}
Do you expect there will be any difficulties with nesting, logic, or multivalued fields?
For example: what if the input metadata for workflow Foo has a column in the CSV with semicolon-delimited subjects, e.g. apples;washington;mountains, and the goal would be multiple dc.subject field values for each? I'm unsure how a dictionary mapping could explode a single value into multiple values.
But, this assumes quite a bit. First, that DSpace's metadata profile can (or will) allow for more complex or multivalued fields. Second, that we don't just require the input metadata format to have basically a 1:1 mapping to DSpace fields already present.
I think this consideration points to ItemSubmission.create_dspace_metadata() (which I have not yet looked closely at), which I would expect to take the source_metadata and apply the metadata_mapping to get the final item metadata for DSpace.
Going to leave this comment here, as-is, and follow up in that method.
dsc/item_submission.py
Outdated
field_value = self.source_metadata.get(field_mapping["source_field_name"])
if field_value:
    delimiter = field_mapping["delimiter"]
Ah-ha! Related to my comment above about metadata_mapping, I see now that each entry in it is an object itself, which can provide more information for this mapping.
Request: can you add an example of what this metadata mapping might look like? With an example, I think I would understand that a single dictionary could suffice.
"dc.contributor": { | ||
"source_field_name": "contributor", | ||
"language": None, | ||
"delimiter": "|", |
Thanks for adding an example here of a delimiter. My comment above requesting an example in the docstring stands, but double-helpful to have this example here (and test it, of course).
Yes, I'll add that to the docstring!
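For reference while that docstring example gets written, a hedged sketch of what a fuller metadata_mapping might look like, and how a delimiter could explode one source value into multiple DSpace values (the dc.subject entry and the splitting logic here are illustrative, not the PR's actual code):
metadata_mapping = {
    "dc.contributor": {
        "source_field_name": "contributor",
        "language": None,
        "delimiter": "|",
    },
    "dc.subject": {
        "source_field_name": "subjects",
        "language": None,
        "delimiter": ";",
    },
}

source_metadata = {"subjects": "apples;washington;mountains"}

field_mapping = metadata_mapping["dc.subject"]
field_value = source_metadata.get(field_mapping["source_field_name"])
if field_value:
    delimiter = field_mapping["delimiter"]
    # one dc.subject entry per split value
    values = field_value.split(delimiter) if delimiter else [field_value]
    # values -> ["apples", "washington", "mountains"]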
dsc/base.py
Outdated
Commenting here, but this touches on file organization more generally.
- what if base.py was renamed to workflows.py, where it may contain additional classes or helper functions as the project grows
- what if item_submission.py was renamed to items.py, where it's a bit more generic to just providing classes about "Items" (maybe starting with ItemSubmission, but perhaps there is room eventually for something like ItemSource or ItemMetadata)
- create a folder called /sources where actual workflows that extend these could live
Resulting in a structure like:
/ dsc
    / sources
        - .gitkeep  # ensures the directory persists, even before source workflows are defined
        - wiley.py  # which would contain a class like `WileyWorkflow`
        - ...
        - ...
    - items.py
    - workflows.py
This could help make clear that anything under /sources is an extension of these base files and classes. And then each of those "base" files would be descriptive about what kind of classes it held.
NOTE: we could create the /sources directory now and include a file like .gitkeep, which would ensure it gets created and committed, even though we don't have sources yet.
dsc/base.py
Outdated
MUST NOT be overridden by workflow subclasses.
"""
s3_client = S3Client()
I might propose that ItemSubmission could initialize its own S3Client to use. I know we've touched on caching and reusing clients, but I think the performance gain is basically non-existent here, given how quickly this is initialized.
I would think that ItemSubmission.generate_and_upload_dspace_metadata() could just instantiate and use a client right when it needs it, instead of attaching it to the ItemSubmission class instance.
Or, have a property like this if multiple methods may eventually require an S3 client:
@property
def s3_client(self):
    return S3Client()
Will do!
@dataclass
class ItemSubmission:
Do you think ItemSubmission would benefit from some kind of validate() method? I'm unsure offhand what that might look for, but I'm noticing now that we call generate_and_upload_dspace_metadata() in the for loop of the workflow, and it might be helpful to check things look good before we fire it off.
I did have a valid_dspace_metadata method in wiley-deposits, but I may want to rework it since that had specific field names to validate while these workflows may be less constrained. But yes, there should be some validation!
Side note: if we find that the dataclass module is the right fit here, I propose calling the proposed validate() method in a __post_init__ method!
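A minimal sketch of that pattern, assuming ItemSubmission remains a dataclass and eventually grows a validate() method (the fields shown and the check inside validate() are placeholders, not the PR's actual validation rules):
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ItemSubmission:
    source_metadata: dict[str, Any]
    metadata_mapping: dict[str, Any]
    bitstream_uris: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # runs automatically right after the dataclass-generated __init__
        self.validate()

    def validate(self) -> None:
        # placeholder rule; the real validation is still to be worked out
        if not self.source_metadata:
            raise ValueError("source_metadata must not be empty")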
dsc/base.py
Outdated
self.s3_prefix: str | None = s3_prefix

@final
def generate_submission_batch(self) -> Iterator[tuple[str, list[str]]]:
I'm realizing after my first pass that I have a couple of fundamental questions about this base workflow method.
When this is complete, is it correct that:
- each ItemSubmission instantiated in the loop will have uploaded a JSON file to S3 that is kind of like a SIP for DSS to retrieve and ingest into DSpace
- it yields the URI of each new ItemSubmission, which I believe to be the S3 URI of the JSON file uploaded to S3 for that item
- it also yields ItemSubmission.bitstream_uris, which are not yet getting set on ItemSubmission
My question: what is utilizing the results of this yielding? Will it be another class method? Or the CLI itself? Unless I'm grossly misunderstanding this application, wouldn't something need to send these as an SQS message to the DSS input queue?
Related question -- and apologies if this will come later -- what happens with the bitstream URIs? Are they included in the SQS message for DSS?
The yielded URIs will be utilized by the deposit command to generate the SQS message; as I mentioned in the PR description, the expectation is that unwritten CLI commands will be the interface with the DSS SQS input/output queues and the SES emails.
The workflow class is intended for just the workflow-opinionated code, while the CLI command will be a wrapper with universal functionality (e.g. sending SQS messages, SES emails).
And yes, the bitstream URIs are in the SQS message, as well as the URI of the JSON metadata file that is uploaded to S3; see the DSS submission message spec for reference.
The steps would be:
1. ItemSubmission instantiated with source metadata and bitstream URIs
2. DSpace metadata generated from source metadata
3. DSpace metadata uploaded to S3 as JSON file and URI is returned
4. ItemSubmission yields metadata_uris and bitstream_uris to the deposit CLI command for the SQS message
5. The deposit CLI command generates and sends the SQS message to DSS
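A very rough sketch of that hand-off, purely for orientation since the deposit command is not written yet (everything below is hypothetical except generate_submission_batch, which yields (metadata_uri, bitstream_uris) tuples per its signature above):
def deposit(workflow) -> None:
    """Hypothetical sketch of the future `deposit` CLI command body."""
    for metadata_uri, bitstream_uris in workflow.generate_submission_batch():
        # build the DSS submission message (field layout per the DSS submission
        # message spec, not reproduced here) and send it to the DSS input queue
        message = {"metadata": metadata_uri, "files": bitstream_uris}  # placeholder keys
        print(f"would send to DSS input queue: {message}")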
Ah, sorry I missed that in the PR! Thanks for pointing that out.
And thanks for sharing the DSS submission message spec. That document is SUPER helpful, and I think it'd be handy to include in docstrings in this application where applicable.
And, thanks for the steps; I had missed that get_bitstream_uris() is a method that workflows will need to define, and that's where they come from.
This is quite helpful for clarifying my next question: any reason why we couldn't yield the ItemSubmission object itself? If I'm understanding correctly, it will have the metadata_uri and the bitstream_uris.
Happy to wait on that until the CLI level orchestration is in place, where these would get sent out as SQS messages. Are you envisioning some kind of DSSInputMessage class that could encapsulate preparing the final SQS payload for sending, that would meet the spec?
Which I think has a natural follow-up question: why couldn't ItemSubmission have such a method to perform this work? Like to_dss_input_queue_message()?
Happy to meet and discuss if helpful, or keep hashing out in comments. When I see a class like ItemSubmission, which is responsible for uploading the components of a SIP to S3 (metadata and files), I wonder why it wouldn't have the capabilities of actually preparing (and maybe actually sending?) the SQS message for DSS.
Which I think has a natural follow-up question: why couldn't ItemSubmission have such a method to perform this work? Like to_dss_input_queue_message()?
I think it certainly could! Let's get @jonavellecuerdo's thoughts as well, but now I'm inclined to build SQS message formatting into ItemSubmission rather than the CLI command.
I do think yielding the actual ItemSubmission instance will allow the object to be accessed outside the context of the BaseWorkflow.generate_submission_batch method. 🤔 I also think if all ItemSubmission instances must comply with the DSS submission message spec, then -- as @ehanson8 writes -- it also follows that defining the SQS message formatting in ItemSubmission would make sense!
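If the formatting does land on ItemSubmission, a heavily hedged sketch of such a method could be (the payload keys are placeholders; the authoritative shape is the DSS submission message spec):
from typing import Any


def to_dss_input_queue_message(self) -> dict[str, Any]:
    """Assemble the SQS message body for this item submission.

    Placeholder keys only -- the real field names come from the DSS
    submission message spec.
    """
    return {
        "metadata_uri": self.metadata_uri,
        "bitstream_uris": self.bitstream_uris,
    }
The deposit CLI command (or an SQSClient wrapper) would then serialize and publish the returned payload.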
* Restructure app to use workflows directory
* Add DSS_INPUT_QUEUE env var
* Add InvalidDSpaceMetadataError exception
* Refactor generate_and_upload_dspace_metadata method to upload_dspace_metadata
* Shift create_dspace_metadata and validate_dspace_metadata from ItemSubmission to BaseWorkflow
* Add init params to BaseWorkflow
* Add stub run method to BaseWorkflow class
* Add dspace_metadata fixture
* Add tests and fixtures for BaseWorkflow @final methods
Updates are looking good! The call earlier today was helpful in understanding some of the changes.
Left a few comments.
dsc/workflows/base/base_workflow.py
Outdated
Very minor, very optional: class BaseWorkflow could be defined in the dsc/workflows/base/__init__.py file if we wanted to avoid another file like base_workflow.py or base.py.
Sometimes I think it's easier to work backwards from what imports feel natural, then see how files/classes could be organized.
My expectation would be something like:
# if I'm defining a brand new workflow
from dsc.workflows.base import BaseWorkflow

# if I'm extending a base workflow like SimpleCSV
# still get from workflows.base, because its __all__ provides it
from dsc.workflows.base import SimpleCSV

# if I need to _use_ a workflow (maybe in the CLI?)
# workflows.__init__.py provides this import
from dsc.workflows import Wiley
This suggests something like this to me:
/ dsc
    / workflows
        - __init__.py
        / base
            - __init__.py  # BaseWorkflow defined here
            - simple_csv.py  # SimpleCSV defined here
        - wiley.py
        - ocw.py
The __all__ object could be used in dsc.workflows.base.__init__.py to provide access to SimpleCSV, Wiley, etc. Just a possibility!
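A small sketch of that re-export idea, assuming the layout above (SimpleCSV and Wiley are the hypothetical names from this thread, not classes that exist yet):
# dsc/workflows/base/__init__.py -- BaseWorkflow is defined earlier in this
# same file; SimpleCSV lives in simple_csv.py and is re-exported here
from dsc.workflows.base.simple_csv import SimpleCSV

__all__ = ["BaseWorkflow", "SimpleCSV"]


# dsc/workflows/__init__.py -- expose concrete workflows for use elsewhere (e.g. the CLI)
from dsc.workflows.wiley import Wiley

__all__ = ["Wiley"]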
I like this and am currently implementing!
dsc/workflows/base/base_workflow.py
Outdated
submission_system: The system to which item submissions will be sent (e.g.
    DSpace@MIT)
This reveals some ignorance about the greater context of the application, but when would it not be DSpace@MIT? For Dome?
Would it make sense to make this only values from an Enum? Is this used directly in the SQS message that then tells DSS something special?
Or... I wonder if it would make sense for each workflow to define this on the class itself (i.e. a class level attribute; here's an example from GeoHarvester), instead of passing it to the Workflow when instantiated. Seems like any given workflow will only ever want to deposit into a single repository.
If class level (aka workflow level) static attributes might make sense, I do wonder about other things passed here. For example: would output_queue also be pretty workflow specific? Or collection_handle? I honestly don't know, but maybe worth considering.
You're correct, currently everything is DSpace@MIT but eventually I would like DSS to point to Dome for DDC uploads, so this is to accommodate that. And yes, defining the init params on the derived workflow classes makes a lot of sense since they would largely be static for the workflow. s3_prefix and collection_handle are the only ones that could potentially be dynamic, though I'd like to lock that down with the content owners where possible.
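A hedged sketch of what workflow-level class attributes could look like on a derived workflow (the class name and attribute values are illustrative; only the attribute names come from this discussion, and how this interacts with BaseWorkflow's init params is left open):
from dsc.workflows.base import BaseWorkflow  # per the layout discussed above


class WileyWorkflow(BaseWorkflow):
    # effectively static for this workflow
    workflow_name = "wiley"
    submission_system = "DSpace@MIT"
    output_queue = "wiley-output-queue"  # placeholder value

    def __init__(self, s3_prefix: str, collection_handle: str) -> None:
        # the values that could still be dynamic are passed at instantiation
        self.s3_prefix = s3_prefix
        self.collection_handle = collection_handle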
dsc/workflows/base/base_workflow.py
Outdated
""" | ||
metadata_entries = [] | ||
for field_name, field_mapping in self.metadata_mapping.items(): | ||
if field_name not in ["item_identifier", "source_system_identifier"]: |
Why would item_identifier or source_system_identifier be present in the mapping in the first place?
It seems as if the mapping would be constructed specifically for this application, and therefore they could be omitted.
item_identifier is provided by stakeholders in the CSV but may require additional parsing in the get_item_identifier method beyond retrieving it from the CSV (e.g. it could be the full filename when we want to omit the extension).
source_system_identifier is for creating reports used to update related metadata in other repositories (e.g. a digital object URI in ArchivesSpace is the big use case) with the newly published DSpace handle. So it needs to be there for linking but isn't a part of the DSpace metadata that's generated. My expectation is this will be done in the finalize command. But on second thought, it might be better to omit source_system_identifier until I'm ready to fully implement that functionality in the ArchivesSpace workflow, thoughts?
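As a small illustration of the extra parsing mentioned above, a hypothetical get_item_identifier for the filename case might look like this (the CSV column name is an assumption):
from pathlib import Path
from typing import Any


def get_item_identifier(self, item_metadata: dict[str, Any]) -> str:
    """Return the item identifier for this workflow.

    Hypothetical example: the stakeholder CSV supplies a full filename and
    we want the identifier without its extension.
    """
    return Path(item_metadata["item_identifier"]).stem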
But second thought, it might be better to omit source_system_identifier until I'm ready to fully implement that functionality in the ArchivesSpace workflow, thoughts?
Might not be a bad idea? We've been trying a similar approach -- and it's hard -- in the parquet work, to keep the early passes really tight and focused. I think it's okay to stub something when it's almost guaranteed to show up later, and perhaps seeing its contours helps in understanding the current code under review. But yeah, maybe it'll be easier to reason about, name, and organize when things settle into place a bit more. Would make sense to me to wait.
Removed!
dsc/workflows/base/base_workflow.py
Outdated
sqs_client = SQSClient()
for item_submission in self.item_submissions_iter():
    item_submission.upload_dspace_metadata(
        self.s3_bucket, item_submission.metadata_keyname
    )
    sqs_client.send_submission_message(
        item_submission.item_identifier,
        self.workflow_name,
        self.output_queue,
        self.submission_system,
        self.collection_handle,
        item_submission.metadata_uri,
        item_submission.bitstream_uris,
    )
I'm happy to wait on this discussion if it would be helpful to tease it out when this is actually built out.
Given this proposal, it feels like it could be a little difficult to test with this approach, where both upload_dspace_metadata() and send_submission_message() appear to be crafting objects before they send them off into the ether.
What about a more verbose, but potentially easier to test flow:
for item_submission in self.item_submissions_iter():
    # prepare dspace metadata and upload to S3
    dspace_metadata = item_submission.prepare_dspace_metadata()
    self.upload_dspace_metadata(metadata_s3_key, dspace_metadata)

    # send SQS message to DSS input queue
    submission_sqs_message = item_submission.prepare_sqs_input_message()
    sqs_client.publish_submission_message(submission_sqs_message)
This makes some assumptions that you may or may not agree with:
- the BaseWorkflow class would have the functionality to perform uploads to S3
- the ItemSubmission class could only prepare payloads that will get sent
  - this removes the need for ItemSubmission to know anything about where they will go (S3 buckets, queues, etc.; all of which are sort of set on the workflow instance itself)
- the ItemSubmission class is what prepares SQS messages, but the SQSClient actually sends it
  - or, perhaps the BaseWorkflow could have a method like submit_to_dss_input_queue() which itself would utilize the SQSClient for actually publishing the message, but it would still be the workflow that kicked that off and was ultimately responsible for it
Just food for thought. But I realize this is just a stub, and it will get more attention when that part is written.
This is very helpful, things are already shifting from this proposed code and testing was a major consideration. This will be ready for full review in the next PR after I refactor the SQS functionality!
* Add ItemMetadatMissingRequiredFieldError exception
* Rename metadata_keyname > metadata_s3_key
* Shift BaseWorkflow to base/__init__.py and remove base_workflow.py
* Add required key to metadata mapping and corresponding unit test
* Update unit tests names to include _success
All said, I feel comfortable approving! Truly nice work all around, and thanks for the discussions. It's one of the harder parts of a new project: establishing some patterns when parts are still stubbed or unfinished.
It's feeling opinionated enough that I can envision where future functionality may land, what it'll be called, etc. But, also simple and flexible enough that if future functionality requires some shifting, that would be doable as well.
Left a couple of comments, but wholly optional. Approved as-is, with or without changes.
dspace_metadata = self.create_dspace_metadata(item_metadata)
self.validate_dspace_metadata(dspace_metadata)
Looking at this now, I could envision validating the generated metadata inside the creation. But I think there are arguably advantages to doing it this way as well. So NOT a request, just an observation, in case the pattern emerges elsewhere.
metadata_s3_key: str
metadata_uri: str = ""
Minor, but maybe worth considering renaming metadata_uri to metadata_s3_uri to match the pattern of metadata_s3_key.
Updating to metadata_s3_uri and bitstream_s3_uris for clarity and consistency in the upcoming PR.
Taking some of the other file reorganizing and renaming into consideration -- and thanks BTW, it's feeling good to navigate around! -- I could envision this file called something more high level like items.py. If other classes ever made sense to add specific to items, it would be a natural place for it.
I don't have evidence or even a pointable philosophy to support it, but when the classname mirrors the filename, it feels a bit off. While we may not need more "item" type classes, it's tight coupling between file and class names.
Totally optional.
Purpose and background context
Certain code blocks have bounced around while writing this and I am finding there are many different ways to organize the boundaries between the CLI command, the BaseWorkflow class, and the ItemSubmission class. The abstractmethods on the BaseWorkflow class outline the boundary I'm seeing between the workflow and the item, with the expectation that unwritten CLI commands will be the interface with the DSS SQS input/output queues and the SES emails. My guiding principle was to minimize the number of values passed between various classes, but as always, this is wide open to iteration!
How can a reviewer manually see the effects of these changes?
Not possible yet
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)