HRQB 40 - Fix merges without volatile 'HR Appointment Key' #115

Merged
merged 2 commits into from
Jul 25, 2024

Conversation

ghukill
Collaborator

@ghukill ghukill commented Jul 24, 2024

Purpose and background context

The removal of volatile data from the data warehouse (data that is internally consistent for a single data warehouse run, but not between runs) in a previous PR broke some merges for employee leaves and salary history.

Instead of these tasks merging on the HR Appointment Key from the data warehouse, we need to merge on the Quickbase Employee Appointments.Key field value. This Key value is generated from an MD5 hash of employee appointment details; this will not change between data warehouse runs.

This requires that both the TransformEmployeeLeave and TransformEmployeeSalaryHistory tasks generate this Key value in the same way as the TransformEmployeeAppointments task, which loads that data to Quickbase. To this end, a method was created on TransformEmployeeAppointments that other tasks can use, ensuring they generate the Key value in the same way.
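As a rough illustration of why a shared method keeps the keys aligned, here is a minimal sketch. The body of `md5_hash_from_values` is an assumption (a hypothetical stand-in for the project's helper of the same name, whose real implementation is not shown in this PR), and the sample values are made up.

```python
import hashlib


def md5_hash_from_values(values: list[str]) -> str:
    """Hypothetical stand-in for the project's helper of the same name."""
    # Join with a delimiter so ["ab", "c"] and ["a", "bc"] hash differently.
    joined = "|".join(values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class TransformEmployeeAppointments:
    @staticmethod
    def generate_merge_key(
        mit_id: str,
        position_id: str,
        appt_begin_date: str,
        appt_end_date: str,
    ) -> str:
        return md5_hash_from_values(
            [mit_id, position_id, appt_begin_date, appt_end_date]
        )


# Key as computed when appointments are loaded to Quickbase (sample values).
appointment_key = TransformEmployeeAppointments.generate_merge_key(
    "123456789", "987654321", "2019-01-01", "2022-12-01"
)

# A leave row from a later warehouse run: same appointment details, same key.
leave_row = {
    "mit_id": "123456789",
    "position_id": "987654321",
    "appt_begin_date": "2019-01-01",
    "appt_end_date": "2022-12-01",
    "absence_date": "2019-07-01",
}
leave_key = TransformEmployeeAppointments.generate_merge_key(
    leave_row["mit_id"],
    leave_row["position_id"],
    leave_row["appt_begin_date"],
    leave_row["appt_end_date"],
)
```

Because the key depends only on appointment details, not on anything assigned per warehouse run, `leave_key` and `appointment_key` match regardless of which run produced each row.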

The end result is an identical load of data: all leave and salary history records have associated employee appointments, but they will no longer "drift" over time as data warehouse runs change the internally consistent, but not externally consistent, HR Appointment Key.

Additionally, two new integrity checks were added to ensure that loading leave or salary data is never attempted without an associated employee appointment.
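A minimal sketch of what such an integrity check could look like, in plain Python rather than the pandas DataFrames the tasks actually use. The function names are hypothetical; only the `emp_appt_merge_key` column name comes from this PR.

```python
def find_rows_missing_appointment(
    rows: list[dict],
    appointment_keys: set[str],
    key_field: str = "emp_appt_merge_key",
) -> list[dict]:
    """Return rows whose merge key matches no known employee appointment."""
    return [row for row in rows if row.get(key_field) not in appointment_keys]


def check_all_rows_have_appointment(
    rows: list[dict], appointment_keys: set[str]
) -> None:
    """Raise before loading if any row lacks an associated appointment."""
    orphans = find_rows_missing_appointment(rows, appointment_keys)
    if orphans:
        raise ValueError(
            f"{len(orphans)} row(s) have no associated employee appointment"
        )


# Sample data: one leave row matches an appointment key, one does not.
appointment_keys = {"abc123"}
leave_rows = [
    {"emp_appt_merge_key": "abc123", "absence_date": "2019-07-01"},
    {"emp_appt_merge_key": "zzz999", "absence_date": "2019-07-02"},
]
orphans = find_rows_missing_appointment(leave_rows, appointment_keys)
```

Running the check before the load step means a key mismatch surfaces as an explicit error rather than as leave or salary rows silently detached from their appointments.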

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: improvement on data quality over time

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

ghukill added 2 commits July 24, 2024 10:05
Why these changes are being introduced:

After the removal of the volatile 'HR Appointment Key' from a couple
of QB tables, this broke merges in transform tasks for employee leaves
and salary history.  Now, instead of merging on this volatile field,
we need to merge on the 'Key' field which is enduring based on an MD5
hash of employee appointment data.

How this addresses that need:
* TransformEmployeeLeave and TransformEmployeeSalaryHistory both merge
on QB Employee Appointments data by calculating a matching 'Key' value
from data warehouse data

Side effects of this change:
* Rows in Quickbase will not appear modified, and leaves and salaries
will not drift away from appointments

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-40
Comment on lines -840 to +846
- "appt_begin_date": Timestamp("2010-01-01 00:00:00"),
- "appt_end_date": datetime.datetime(2011, 12, 1, 0, 0),
+ "appt_begin_date": Timestamp("2019-01-01 00:00:00"),
+ "appt_end_date": datetime.datetime(2022, 12, 1, 0, 0),
  "position_id": "987654321",
  "hr_appt_key": 123,
- "absence_date": Timestamp("2010-07-01 00:00:00"),
+ "absence_date": Timestamp("2019-07-01 00:00:00"),
Collaborator Author

This update, while not strictly required here, was just to better align mocked data in fixtures.

@ehanson8 ehanson8 left a comment

Logic looks good, approved!

@jonavellecuerdo jonavellecuerdo left a comment

Looks good to me! Just have one non-blocking question re: generate_merge_key static method.

Comment on lines +135 to +147
def generate_merge_key(
    mit_id: str,
    position_id: str,
    appt_begin_date: str,
    appt_end_date: str,
) -> str:
    return md5_hash_from_values(
        [
            mit_id,
            position_id,
            appt_begin_date,
            appt_end_date,
        ]

I like that this function gives us more information on which fields (and their data types) are selected as merge keys!

["appt_begin_date", "appt_end_date", "absence_date"],
)
dw_leaves_df["emp_appt_merge_key"] = dw_leaves_df.apply(
lambda row: TransformEmployeeAppointments.generate_merge_key(

[Non-blocking] Hmm, is there a reason why generate_merge_key needs to be a static method of the TransformEmployeeAppointments class as opposed to a standalone util method? 🤔 Later in the module, a key is created for leaves_df--should TransformEmployeeLeave get its own static method generate_merge_key? 🤔

Collaborator Author

@ghukill ghukill Jul 25, 2024

The reasoning here is kind of twofold.

The logic for generating a merge field key on TransformEmployeeAppointments is needed by other tasks, but it's specific to how that task does it, so keeping it associated with that task felt like it made sense.

Relatedly, for the other tasks that generate a merge key, no other tasks need to know how they do it, at least not yet. If that changes, then I think a natural refactor would be to drop it into an externally available @staticmethod.

If this pattern continues where tasks need to be able to recreate the merge key for other tasks, it might be worth abstracting that out to the base tasks, and then each task just identifies what row data is used for it. But I don't think we're quite there yet. This could be the end of this need.... or not!
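To make that possible refactor concrete, here is a rough sketch, not something in this PR. The `BaseTask` class, the `merge_key_fields` attribute, and the fields listed for TransformEmployeeLeave are all hypothetical, and the hashing is a stand-in for `md5_hash_from_values`.

```python
import hashlib
from typing import ClassVar


class BaseTask:
    """Hypothetical base task: subclasses declare which row fields form the key."""

    merge_key_fields: ClassVar[tuple[str, ...]] = ()

    @classmethod
    def generate_merge_key(cls, row: dict) -> str:
        if not cls.merge_key_fields:
            raise NotImplementedError(
                f"{cls.__name__} does not define merge_key_fields"
            )
        # Only the declared fields contribute; extra row data is ignored.
        values = [str(row[field]) for field in cls.merge_key_fields]
        return hashlib.md5("|".join(values).encode("utf-8")).hexdigest()


class TransformEmployeeAppointments(BaseTask):
    merge_key_fields = ("mit_id", "position_id", "appt_begin_date", "appt_end_date")


class TransformEmployeeLeave(BaseTask):
    # Hypothetical field choice, for illustration only.
    merge_key_fields = ("mit_id", "absence_date")


row = {
    "mit_id": "123456789",
    "position_id": "987654321",
    "appt_begin_date": "2019-01-01",
    "appt_end_date": "2022-12-01",
    "absence_date": "2019-07-01",
}
# Any task can recreate another task's key from the same row data.
appt_key = TransformEmployeeAppointments.generate_merge_key(row)
leave_key = TransformEmployeeLeave.generate_merge_key(row)
```

With this shape, each task only identifies its row data, and the hashing logic lives in one place.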

@ghukill ghukill merged commit ed72ce1 into main Jul 25, 2024
4 checks passed
3 participants