HRQB 40 - Fix merges without volatile 'HR Appointment Key' #115

ghukill · 2024-07-24T14:16:57Z

Purpose and background context

The removal of volatile data from the data warehouse (data that is internally consistent for a single data warehouse run, but not between runs) in a previous PR broke some merges for employee leaves and salary history.

Instead of these tasks merging on the HR Appointment Key from the data warehouse, we need to merge on the Quickbase Employee Appointments.Key field value. This Key value is generated from an MD5 hash of employee appointment details; this will not change between data warehouse runs.

This requires that both the TransformEmployeeLeave and TransformEmployeeSalaryHistory tasks generate this Key value in the same way as the TransformEmployeeAppointments task, which loads that data to Quickbase. To this end, a method was created on TransformEmployeeAppointments that other tasks can use, ensuring they generate the Key value in the same way.
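To illustrate why a hashed key stays stable between data warehouse runs, here is a minimal sketch. The body of `md5_hash_from_values` is not shown in this PR, so the implementation below (joining values with a `|` delimiter before hashing) is an assumption, and the sample row values are made up.

```python
import hashlib


def md5_hash_from_values(values: list[str]) -> str:
    """Hypothetical stand-in for the project's helper of the same name."""
    # Assumption: values are joined with a delimiter and hashed as UTF-8.
    joined = "|".join(values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class TransformEmployeeAppointments:
    @staticmethod
    def generate_merge_key(
        mit_id: str,
        position_id: str,
        appt_begin_date: str,
        appt_end_date: str,
    ) -> str:
        return md5_hash_from_values(
            [mit_id, position_id, appt_begin_date, appt_end_date]
        )


# The same appointment as seen in two warehouse runs (made-up values):
# only the volatile hr_appt_key differs.
run_1 = {
    "mit_id": "123456789",
    "position_id": "987654321",
    "appt_begin_date": "2019-01-01",
    "appt_end_date": "2022-12-01",
    "hr_appt_key": 123,
}
run_2 = {**run_1, "hr_appt_key": 456}


def key_for(row: dict) -> str:
    """Build the merge key from appointment details only."""
    return TransformEmployeeAppointments.generate_merge_key(
        row["mit_id"],
        row["position_id"],
        row["appt_begin_date"],
        row["appt_end_date"],
    )


key_run_1 = key_for(run_1)
key_run_2 = key_for(run_2)
```

Because `hr_appt_key` never enters the hash, `key_run_1` and `key_run_2` are equal, while changing any of the four appointment fields yields a different key.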

The end result is an identical load of data: all leave and salary history records still have associated employee appointments, but they will no longer "drift" over time as data warehouse runs change the HR Appointment Key, which is consistent within a single run but not between runs.

Additionally, two new integrity checks were added to ensure that loading leave or salary data is never attempted without an associated employee appointment.
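The integrity checks themselves are not shown in this thread. As a rough sketch of the idea, assuming pandas DataFrames and the `emp_appt_merge_key` column name used elsewhere in this PR, a check could look for rows whose key matches no appointment. The function name and sample data are hypothetical.

```python
import pandas as pd


def find_rows_missing_appointment(
    child_df: pd.DataFrame,
    appointments_df: pd.DataFrame,
    key: str = "emp_appt_merge_key",
) -> pd.DataFrame:
    """Return rows of child_df whose merge key matches no appointment row."""
    appointment_keys = appointments_df[[key]].drop_duplicates()
    # indicator=True adds a "_merge" column marking where each row was found.
    merged = child_df.merge(appointment_keys, on=key, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


# Made-up data: the third leave row has no matching appointment.
appointments_df = pd.DataFrame({"emp_appt_merge_key": ["abc", "def"]})
leaves_df = pd.DataFrame(
    {
        "emp_appt_merge_key": ["abc", "def", "zzz"],
        "absence_date": ["2019-07-01", "2019-07-02", "2019-07-03"],
    }
)

orphans = find_rows_missing_appointment(leaves_df, appointments_df)
if not orphans.empty:
    print(f"{len(orphans)} leave row(s) have no matching employee appointment")
```

A real check would presumably fail the task rather than print, so that orphaned leave or salary rows are never loaded.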

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: improvement on data quality over time

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/HRQB-40

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:

After the removal of the volatile 'HR Appointment Key' from a couple of QB tables, merges in the transform tasks for employee leaves and salary history broke. Now, instead of merging on this volatile field, we need to merge on the 'Key' field, which is enduring because it is based on an MD5 hash of employee appointment data.

How this addresses that need:
* TransformEmployeeLeave and TransformEmployeeSalaryHistory both merge on QB Employee Appointments data by calculating a matching 'Key' value from data warehouse data

Side effects of this change:
* Rows in Quickbase will not appear modified, and leaves and salaries will not drift away from appointments

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-40

ghukill · 2024-07-24T14:19:23Z

tests/conftest.py

-                    "appt_begin_date": Timestamp("2010-01-01 00:00:00"),
-                    "appt_end_date": datetime.datetime(2011, 12, 1, 0, 0),
+                    "appt_begin_date": Timestamp("2019-01-01 00:00:00"),
+                    "appt_end_date": datetime.datetime(2022, 12, 1, 0, 0),
+                    "position_id": "987654321",
                    "hr_appt_key": 123,
-                    "absence_date": Timestamp("2010-07-01 00:00:00"),
+                    "absence_date": Timestamp("2019-07-01 00:00:00"),


This update, while not strictly required here, was just to better align mocked data in fixtures.

ehanson8

Logic looks good, approved!

jonavellecuerdo

Looks good to me! Just have one non-blocking question re: generate_merge_key static method.

jonavellecuerdo · 2024-07-24T19:50:13Z

hrqb/tasks/employee_appointments.py

+    def generate_merge_key(
+        mit_id: str,
+        position_id: str,
+        appt_begin_date: str,
+        appt_end_date: str,
+    ) -> str:
+        return md5_hash_from_values(
+            [
+                mit_id,
+                position_id,
+                appt_begin_date,
+                appt_end_date,
+            ]


I like that this function gives us more information on which fields (and their data types) are selected as merge keys!

jonavellecuerdo · 2024-07-25T16:01:46Z

hrqb/tasks/employee_leave.py

+            ["appt_begin_date", "appt_end_date", "absence_date"],
+        )
+        dw_leaves_df["emp_appt_merge_key"] = dw_leaves_df.apply(
+            lambda row: TransformEmployeeAppointments.generate_merge_key(


[Non-blocking] Hmm, is there a reason why generate_merge_key needs to be a static method of the TransformEmployeeAppointments class as opposed to a standalone util method? 🤔 Later in the module, a key is created for leaves_df--should TransformEmployeeLeave get its own static method generate_merge_key? 🤔

The reasoning here is kind of twofold.

The logic for generating a merge field key on TransformEmployeeAppointments is needed by other tasks, but it's specific to how that task does it, so keeping it associated with that task felt like it made sense.

Relatedly, for the other tasks that generate a merge key, no other task needs to know how they do it, at least not yet. If that changes, then I think a natural refactor would be to drop it into an externally available @staticmethod.

If this pattern continues where tasks need to be able to recreate the merge key for other tasks, it might be worth abstracting that out to the base tasks, and then each task just identifies what row data is used for it. But I don't think we're quite there yet. This could be the end of this need.... or not!
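For what it's worth, the base-task idea could be sketched roughly as below. This is not code from the PR: `BaseTransformTask`, `merge_key_fields`, the delimiter-based hashing, and the fields listed for TransformEmployeeLeave are all hypothetical, chosen only to show each task declaring which row data feeds its key.

```python
import hashlib


class BaseTransformTask:
    """Hypothetical base task: subclasses only declare which fields feed the key."""

    merge_key_fields: tuple[str, ...] = ()

    @classmethod
    def generate_merge_key(cls, row: dict) -> str:
        # Assumption: values are stringified, joined with a delimiter, and hashed.
        values = [str(row[field]) for field in cls.merge_key_fields]
        return hashlib.md5("|".join(values).encode("utf-8")).hexdigest()


class TransformEmployeeAppointments(BaseTransformTask):
    merge_key_fields = ("mit_id", "position_id", "appt_begin_date", "appt_end_date")


class TransformEmployeeLeave(BaseTransformTask):
    # Illustrative only: the real fields used for the leave key are not shown here.
    merge_key_fields = ("mit_id", "absence_date")


# Made-up row carrying every field either task might need.
row = {
    "mit_id": "123456789",
    "position_id": "987654321",
    "appt_begin_date": "2019-01-01",
    "appt_end_date": "2022-12-01",
    "absence_date": "2019-07-01",
    "hr_appt_key": 123,
}

appt_key = TransformEmployeeAppointments.generate_merge_key(row)
leave_key = TransformEmployeeLeave.generate_merge_key(row)
```

Any task could then recreate another task's key by calling that task's `generate_merge_key` on its own row data, without duplicating the field list.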

ghukill added 2 commits July 24, 2024 10:05

Add employee appointment integrity checks

33110bd

ghukill requested review from ehanson8 and jonavellecuerdo July 24, 2024 14:17

ehanson8 approved these changes Jul 24, 2024

jonavellecuerdo approved these changes Jul 25, 2024

ghukill merged commit ed72ce1 into main Jul 25, 2024
4 checks passed
