HRQB 22 - Employee Appointments #39
Conversation
Why these changes are being introduced:

In some instances, it makes sense to not cache results, e.g. inserts or records queries, where data may have changed from other actions.

How this addresses that need:

- Skips caching for some `QBClient` methods
- Adds a method to retrieve single table information from Quickbase

Side effects of this change: None

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
Why these changes are being introduced:

One requirement of the Quickbase implementation is data going back to 2019. Until now, no data limits were placed on data warehouse queries. This begins to establish this for pre-existing tasks, and will be enforced for future ones as well.

How this addresses that need:

- Limits Employee records to those with Employee Appointments that end on or after 2019

Side effects of this change: None

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
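That 2019 cutoff can be illustrated with a small sketch. This is hypothetical: the field name `end_date` and the function are assumptions for illustration, and in the PR the limit is actually applied in the data warehouse query, not in Python.

```python
from datetime import date

def filter_appointments(rows: list[dict], cutoff: date = date(2019, 1, 1)) -> list[dict]:
    """Keep only appointment rows whose end date is on or after the cutoff.

    Hypothetical stand-in for the date limit described above; the real
    project enforces this in the data warehouse SQL.
    """
    return [row for row in rows if row["end_date"] >= cutoff]

appointments = [
    {"name": "old appointment", "end_date": date(2018, 6, 30)},
    {"name": "current appointment", "end_date": date(2019, 6, 30)},
]
kept = filter_appointments(appointments)
```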
Why these changes are being introduced:

It is not uncommon for a load task to have multiple required parent tasks, which means the convenience method `single_input_dataframe` cannot be used. However, overriding `QuickbaseUpsertTask.get_records()` could introduce errors and inconsistencies.

How this addresses that need:

- Defines new property `QuickbaseUpsertTask.input_task_to_load` that explicitly defines which input task to use data from
- `QuickbaseUpsertTask.get_records` does not need to be overridden even if a Load task has multiple parent tasks

Side effects of this change: None

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
Why these changes are being introduced:

Tasks are needed to support loading of Quickbase table Employee Appointments.

How this addresses that need:

- Creates ETL tasks for Employee Appointments
- Creates ETL tasks for related lookup tables Job Titles, Position Titles, and Employee Types

Side effects of this change:

- Employee appointment data is loaded to QB

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
How this addresses that need:

- Updated testing approach for ETL tasks, with tasks complete by default as fixtures

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
```python
def input_task_to_load(self) -> str:
    """Upsert data from parent task 'TransformEmployeeAppointments'."""
    return "TransformEmployeeAppointments"
```
This is an example of using the optional property `input_task_to_load`. By setting this, it explicitly sets which parent task, from the multiple ones defined in `self.requires()`, to use as the input data for upserting.
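The pattern can be sketched in simplified, stdlib-only form. The class and method names below mirror this PR, but the bodies are hypothetical stand-ins (the real project uses luigi tasks and pickled dataframe targets):

```python
# Hypothetical sketch of the input_task_to_load pattern; not the
# project's actual implementation.

class Task:
    """Minimal stand-in for a pipeline task."""

    def requires(self) -> list["Task"]:
        return []

    def output(self):
        raise NotImplementedError


class QuickbaseUpsertTask(Task):
    # Optional: name of the parent task whose output should be upserted.
    input_task_to_load = None

    @property
    def named_inputs(self) -> dict:
        """Map parent task class names to their outputs."""
        return {task.__class__.__name__: task.output() for task in self.requires()}

    def get_records(self):
        inputs = self.named_inputs
        if self.input_task_to_load is not None:
            # explicitly pick one parent among many
            return inputs[self.input_task_to_load]
        if len(inputs) != 1:
            raise ValueError("multiple parent tasks: set input_task_to_load")
        return next(iter(inputs.values()))


class TransformEmployeeAppointments(Task):
    def output(self):
        return [{"HC ID": "L-001"}]


class LoadJobTitles(Task):
    def output(self):
        return [{"Job Title": "Librarian"}]


class LoadEmployeeAppointments(QuickbaseUpsertTask):
    # Two parents, so we must say which one feeds the upsert.
    input_task_to_load = "TransformEmployeeAppointments"

    def requires(self):
        return [TransformEmployeeAppointments(), LoadJobTitles()]
```

With this in place, `get_records()` needs no override: `LoadEmployeeAppointments().get_records()` returns the transform task's output even though the task has two parents.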
Just a couple of clarifying questions, but it's looking good!
```python
@@ -91,7 +91,7 @@ def single_input_dataframe(self) -> pd.DataFrame:
        return target.read()  # type: ignore[return-value]

    @property
    def named_inputs(self) -> dict[str, PandasPickleTarget | QuickbaseTableTarget]:
```
Hmm, why'd the type hint change? 🤔
Great question. This was over-engineering early on, where I had assumed that `QuickbaseUpsertTask`s might be used as inputs into other tasks. But that seems unlikely now.

If we make this more generic, then we can avoid some linting and typing hiccups where it could be a dictionary from a `QuickbaseTableTarget`, when we know 99% of the time it's a dataframe from `PandasPickleTarget`. But even that isn't worth the linting/typing headaches, as we often get the task target in a way that loses some of that thread. Seemed pretty clear that just relaxing this type was the best option.
```python
@@ -57,7 +61,7 @@ def make_request(
        """
        # hash the request to cache the response
        request_hash = (path, json.dumps(kwargs, sort_keys=True))
-       if self.cache_results and request_hash in self._cache:
+       if cache and self.cache_results and request_hash in self._cache:
```
Is there a scenario where `cache != self.cache_results`? I was curious why an `and` condition is used as opposed to `or`. Is it so `self.cache_results` will take precedence, and if so, how would `self.cache_results` be set? 🤔
The `and` supports this situation:

- `QBClient` instantiated with caching (default)
- `QBClient` makes a cached method request like listing tables
- `QBClient` makes a non-cached method request like upserting data

I had considered temporarily setting `qbclient.cache_results = False` for a non-cached request, but this is brittle if the request fails and the `QBClient` is then left in an intermediate state.

It's probably overkill, but this allows:

- completely preventing caching with `QBClient(cache_results=False)`
- temporarily / method-by-method skipping caching with `qbclient.make_request(..., cache=False)`
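A minimal sketch of how those two flags interplay, with a hypothetical `_send` stub standing in for the real HTTP call (this is a simplified stand-in, not the project's actual client):

```python
import json

class QBClient:
    """Simplified sketch of the caching behavior discussed above."""

    def __init__(self, cache_results: bool = True):
        self.cache_results = cache_results
        self._cache: dict = {}
        self.calls = 0  # count of real (non-cached) requests, for illustration

    def _send(self, path: str, **kwargs):
        """Hypothetical stub for the actual HTTP request."""
        self.calls += 1
        return {"path": path, **kwargs}

    def make_request(self, path: str, cache: bool = True, **kwargs):
        # hash the request to cache the response
        request_hash = (path, json.dumps(kwargs, sort_keys=True))
        # serve from cache only if BOTH the per-call flag and the
        # instance-level flag allow it
        if cache and self.cache_results and request_hash in self._cache:
            return self._cache[request_hash]
        response = self._send(path, **kwargs)
        if cache and self.cache_results:
            self._cache[request_hash] = response
        return response
```

Because the per-call `cache=False` never mutates `self.cache_results`, a failed request cannot leave the client stuck in a "caching disabled" state.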
I see. Thank you for explaining how it works!
```python
def test_task_transform_employee_appointments_libhr_data_merge_success(
    task_transform_employee_appointments_complete,
):
    row = task_transform_employee_appointments_complete.get_dataframe().iloc[0]
    assert row["HC ID"] == "L-001"
    assert row["Cost Object"] == 7777777
    assert row["Supervisor"] == "111111111"
    assert row["Related Department ID"] == 42.0
```
[Non-blocking] As I was reviewing these `_merge` tests, I initially thought it'd be helpful to include a check that we're only merging on values that appear on the left dataframe (since `how="left"`), i.e., by also asserting row counts. However, I started to wonder whether that would be attempting to test `pd.merge`, which doesn't seem like the thing to do. Just wanted to share my thoughts!
Appreciate the pondering! I'm finding testing in this project difficult, particularly at this grain of actual ETL tests that are dealing with the data warehouse, Quickbase, and sensitive data.

I think you're on to something about row counts, because it is in `pd.merge()` where we'd likely introduce duplicate rows.

I should have commented on this somewhere, but I'm envisioning some "runtime assertions" once things have settled (and we're getting close now, with only a couple more tables to add ETL tasks for). I think combing through some of these and adding checks on row counts, duplicates, etc., would be a nice integrity test when tasks are actually running. And they could be confirmed with mocked unit tests that would trigger duplicate rows in the merges as-is.

All said, it's on the brain! And I think it is somewhere in-between, like we need to think:

- I don't want duplicate rows in task XYZ, because ABC
- I have a runtime assertion in task XYZ that raises errors if that happens
- I can write a test that simulates that and confirms the runtime assertion works
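One shape such a runtime assertion could take, sketched with a stdlib-only stand-in for `pd.merge(how="left")` (the function and its name are hypothetical, not from this PR):

```python
def left_merge_unique(left: list[dict], right: list[dict], on: str) -> list[dict]:
    """Left-join two lists of dict rows on key `on`, with a runtime
    assertion that the join cannot duplicate left-side rows.

    Hypothetical stdlib stand-in for the pandas left merges discussed
    above: duplicate join keys on the right side are exactly what would
    silently multiply rows in pd.merge(how="left").
    """
    keys = [row[on] for row in right]
    if len(keys) != len(set(keys)):
        # the runtime assertion: fail loudly instead of duplicating rows
        raise ValueError(f"duplicate {on!r} values in right table would duplicate rows")
    lookup = {row[on]: row for row in right}
    # left rows with no match pass through unchanged, as in how="left"
    return [{**left_row, **lookup.get(left_row[on], {})} for left_row in left]
```

A unit test can then simulate the bad input (two right-side rows with the same key) and confirm the assertion fires, which tests our integrity check rather than re-testing `pd.merge` itself. (For the real pandas call, `pd.merge` also offers a built-in `validate="many_to_one"` argument to the same effect.)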
Sounds like a plan! Thank you for sharing. :)
1 optional suggestion
tests/test_qbclient_client.py
```python
def test_qbclient_parse_upsert_results_response_success(qbclient, mocked_qb_api_upsert):
    parsed_response = qbclient.parse_upsert_results(mocked_qb_api_upsert)
```
Optional: could skip the variable and directly assert on `qbclient.parse_upsert_results(mocked_qb_api_upsert)`.
Purpose and background context

This PR introduces tasks and updated pipelines to load employee appointment data. The Quickbase table `Employee Appointments` relies on some other Quickbase tables for lookup data, including:

- `Job Titles`
- `Position Titles`
- `Employee Types`

As such, tasks for loading `Employee Appointments` rely on other, new tasks that also populate those tables in Quickbase. This results in a dependency graph where, for `LoadEmployeeAppointments` to complete, so too must tasks like `LoadJobTitles` or `LoadPositionTitles`, all the way back to `ExtractDWEmployeeAppointments` for raw data warehouse data.

I don't think it will be productive to outline these dependencies in quite so much detail going forward, but wanted to take a moment to observe them here, as this is the first time some of these more complex dependencies have been introduced. In future tasks -- e.g. for `Employee Salary History` or `Employee Leave` -- this pattern will be common and repeated.

There are some misc updates like dependency updates and additional `QBClient` functionality, but the primary work is part of the commits listed above.

How can a reviewer manually see the effects of these changes?
Still not readily testable in deployed contexts, or locally without data warehouse and Quickbase credentials.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)