HRQB 40 - Fix merges without volatile 'HR Appointment Key' #115
Why these changes are being introduced:
The removal of the volatile 'HR Appointment Key' from a couple of QB tables broke merges in the transform tasks for employee leaves and salary history. Instead of merging on this volatile field, we need to merge on the 'Key' field, which is enduring because it is based on an MD5 hash of employee appointment data.
How this addresses that need:
* TransformEmployeeLeave and TransformEmployeeSalaryHistory both merge on QB Employee Appointments data by calculating a matching 'Key' value from data warehouse data
Side effects of this change:
* Rows in Quickbase will not appear modified, and leaves and salaries will not drift away from appointments
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-40
"appt_begin_date": Timestamp("2010-01-01 00:00:00"),
"appt_end_date": datetime.datetime(2011, 12, 1, 0, 0),
"appt_begin_date": Timestamp("2019-01-01 00:00:00"),
"appt_end_date": datetime.datetime(2022, 12, 1, 0, 0),
"position_id": "987654321",
"hr_appt_key": 123,
"absence_date": Timestamp("2010-07-01 00:00:00"),
"absence_date": Timestamp("2019-07-01 00:00:00"),
This update, while not strictly required here, was just to better align mocked data in fixtures.
Logic looks good, approved!
Looks good to me! Just have one non-blocking question re: generate_merge_key static method.
def generate_merge_key(
    mit_id: str,
    position_id: str,
    appt_begin_date: str,
    appt_end_date: str,
) -> str:
    return md5_hash_from_values(
        [
            mit_id,
            position_id,
            appt_begin_date,
            appt_end_date,
        ]
    )
I like that this function gives us more information on which fields (and their data types) are selected as merge keys!
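For readers without the repository at hand, here is a minimal sketch of how such a key could be produced. The body of md5_hash_from_values is not shown in this PR, so the version below is an assumption (values joined with a "|" separator before hashing), not the project's implementation. The sample IDs and dates are made up.

```python
import hashlib


def md5_hash_from_values(values: list[str]) -> str:
    """Hypothetical stand-in; the real helper's body is not shown in this PR."""
    # Assumption: join with a separator so ["ab", "c"] and ["a", "bc"] hash differently.
    data_string = "|".join(values).encode()
    return hashlib.md5(data_string).hexdigest()


def generate_merge_key(
    mit_id: str,
    position_id: str,
    appt_begin_date: str,
    appt_end_date: str,
) -> str:
    """Build a key from appointment fields that do not change between runs."""
    return md5_hash_from_values(
        [mit_id, position_id, appt_begin_date, appt_end_date]
    )


# Two "runs" over the same appointment details yield the same key.
key_run_1 = generate_merge_key("123456789", "987654321", "2019-01-01", "2022-12-01")
key_run_2 = generate_merge_key("123456789", "987654321", "2019-01-01", "2022-12-01")
```

Because the key depends only on the four appointment fields, it stays the same from one data warehouse run to the next, which is what the volatile 'HR Appointment Key' could not guarantee.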
    ["appt_begin_date", "appt_end_date", "absence_date"],
)
dw_leaves_df["emp_appt_merge_key"] = dw_leaves_df.apply(
    lambda row: TransformEmployeeAppointments.generate_merge_key(
[Non-blocking] Hmm, is there a reason why generate_merge_key needs to be a static method of the TransformEmployeeAppointments class as opposed to a standalone util method? 🤔 Later in the module, a key is created for leaves_df -- should TransformEmployeeLeave get its own static method generate_merge_key? 🤔
The reasoning here is kind of twofold.
The logic for generating a merge field key on TransformEmployeeAppointments is needed by other tasks, but it's specific to how that task does it, so keeping it associated with that task felt like it made sense.
Relatedly, for the other tasks that generate a merge key, no other task needs to know how they do it, at least not yet. If that changes, then I think a natural refactor would be to drop it into an externally available @staticmethod.
If this pattern continues where tasks need to be able to recreate the merge key for other tasks, it might be worth abstracting that out to the base tasks, and then each task just identifies what row data is used for it. But I don't think we're quite there yet. This could be the end of this need.... or not!
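To make that possible refactor concrete, here is a rough sketch of what it might look like. Everything in it is hypothetical: the class names, the merge_key_fields attribute, the "absence_type" field, and the "|" separator are illustrative assumptions, not code from this repository.

```python
import hashlib


class BaseTransformTask:
    """Hypothetical base task; not the project's actual base class."""

    # Each subclass names the row fields that make up its merge key.
    merge_key_fields: tuple[str, ...] = ()

    @classmethod
    def generate_merge_key(cls, row: dict[str, str]) -> str:
        # Only the declared fields contribute, in the declared order.
        values = [str(row[field]) for field in cls.merge_key_fields]
        return hashlib.md5("|".join(values).encode()).hexdigest()


class EmployeeAppointmentsTask(BaseTransformTask):
    merge_key_fields = ("mit_id", "position_id", "appt_begin_date", "appt_end_date")


class EmployeeLeaveTask(BaseTransformTask):
    # "absence_type" is an invented field, used only for illustration.
    merge_key_fields = ("mit_id", "absence_date", "absence_type")


# A leave row can recreate the appointment key without knowing how it is built.
leave_row = {
    "mit_id": "123456789",
    "position_id": "987654321",
    "appt_begin_date": "2019-01-01",
    "appt_end_date": "2022-12-01",
    "absence_date": "2019-07-01",
    "absence_type": "Vacation",
}
appt_key = EmployeeAppointmentsTask.generate_merge_key(leave_row)
leave_key = EmployeeLeaveTask.generate_merge_key(leave_row)
```

With this shape, each task only declares which row data feeds its key, and any other task can recreate that key by calling the owning task's class method.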
Purpose and background context
The removal of volatile data from the data warehouse (data that is internally consistent for a single data warehouse run, but not between runs) in a previous PR broke some merges for employee leaves and salary history.
Instead of these tasks merging on the HR Appointment Key from the data warehouse, we need to merge on the Quickbase Employee Appointments.Key field value. This Key value is generated from an MD5 hash of employee appointment details; this will not change between data warehouse runs.
This requires that both the TransformEmployeeLeave and TransformEmployeeSalaryHistory tasks generate this Key value in the same way as the TransformEmployeeAppointments task, which loads that data to Quickbase. To this end, a method was created on TransformEmployeeAppointments that other tasks can use, ensuring they generate the Key value in the same way.
The end result is identical loading of data: all leave and salary history records have associated employee appointments, but they will not "drift" over time as data warehouse runs change the internally consistent, but not externally consistent, HR Appointment Key.
Additionally, two new integrity checks were added to ensure that loading leave or salary data is never attempted without an associated employee appointment.
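As an illustration of the merge and the kind of integrity check described above, here is a small pandas sketch. The data, the "appointment_record_id" column, and the all_rows_have_appointments helper are made up for this example; the actual checks in the PR may be implemented differently.

```python
import pandas as pd

# Hypothetical appointment rows as already loaded to Quickbase.
qb_appts_df = pd.DataFrame(
    {"Key": ["abc123", "def456"], "appointment_record_id": [1, 2]}
)

# Hypothetical warehouse leave rows, each with a merge key computed per row.
dw_leaves_df = pd.DataFrame(
    {
        "emp_appt_merge_key": ["abc123", "def456", "abc123"],
        "absence_date": pd.to_datetime(["2019-07-01", "2019-07-02", "2019-08-01"]),
    }
)

# A left merge keeps every leave row; indicator=True adds a "_merge" column
# recording whether a matching appointment was found for each row.
merged_df = dw_leaves_df.merge(
    qb_appts_df,
    how="left",
    left_on="emp_appt_merge_key",
    right_on="Key",
    indicator=True,
)


def all_rows_have_appointments(df: pd.DataFrame) -> bool:
    """Return True only if every row matched an appointment in the merge."""
    return bool((df["_merge"] == "both").all())
```

A task could call a check like all_rows_have_appointments(merged_df) before loading and refuse to proceed if it returns False, so that a leave or salary row without an appointment is never sent to Quickbase.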
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES: improvement on data quality over time
What are the relevant tickets?
Developer
Code Reviewer(s)