HRQB 22 - Employee Appointments #39

Merged: 9 commits into main, Jun 5, 2024

Conversation

@ghukill (Collaborator) commented Jun 3, 2024

Purpose and background context

This PR introduces tasks and updated pipelines to load employee appointment data. The Quickbase table Employee Appointments relies on some other Quickbase tables for lookup data, including:

  • Job Titles
  • Position Titles
  • Employee Types

As such, tasks for loading Employee Appointments rely on other, new tasks that also populate those tables in Quickbase. This results in a dependency graph that looks like this:

├── COMPLETE: FullUpdate()
   ├── COMPLETE: LoadEmployeeAppointments(...)
      ├── COMPLETE: LoadEmployeeTypes(...)
         ├── COMPLETE: TransformEmployeeTypes(...)
            ├── COMPLETE: ExtractDWEmployeeAppointments(...)
      ├── COMPLETE: LoadJobTitles(...)
         ├── COMPLETE: TransformUniqueJobTitles(...)
            ├── COMPLETE: ExtractDWEmployeeAppointments(...)
      ├── COMPLETE: LoadPositionTitles(...)
         ├── COMPLETE: TransformUniquePositionTitles(...)
            ├── COMPLETE: ExtractDWEmployeeAppointments(...)
      ├── COMPLETE: TransformEmployeeAppointments(...)
         ├── COMPLETE: ExtractDWEmployeeAppointments(...)
         ├── COMPLETE: ExtractQBLibHREmployeeAppointments(...)
         ├── COMPLETE: ExtractQBDepartments(...)
A couple of observations:

  • for task LoadEmployeeAppointments to complete, so too must tasks like LoadJobTitles and LoadPositionTitles
  • these tasks all rely on task ExtractDWEmployeeAppointments for raw data warehouse data

I don't think it will be productive to outline these dependencies in quite so much detail going forward, but wanted to take a moment to observe them here, as this is the first time some of these more complex dependencies have been introduced.

In future tasks -- e.g. for Employee Salary History or Employee Leave -- this pattern will be common and repeated.
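
For readers less familiar with luigi, here is a minimal sketch of how this kind of fan-out can be declared via requires(). The classes below are illustrative stubs mirroring the tree above, not the project's actual code, and all task parameters are elided:

import luigi

# stub stand-ins for the real tasks named in the dependency tree
class LoadEmployeeTypes(luigi.Task): ...
class LoadJobTitles(luigi.Task): ...
class LoadPositionTitles(luigi.Task): ...
class TransformEmployeeAppointments(luigi.Task): ...

class LoadEmployeeAppointments(luigi.Task):
    """Sketch: the load task declares every parent shown in the tree."""

    def requires(self):
        return [
            LoadEmployeeTypes(),
            LoadJobTitles(),
            LoadPositionTitles(),
            TransformEmployeeAppointments(),
        ]

luigi will not run LoadEmployeeAppointments until every task returned by requires() is complete, which is exactly the behavior the observations above describe.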

There are some miscellaneous updates as well, like dependency bumps and additional QBClient functionality, but the primary work is in the commits detailed below.

How can a reviewer manually see the effects of these changes?

Still not readily testable in deployed contexts, or locally without data warehouse and Quickbase credentials.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • HRQB-22: https://mitlibraries.atlassian.net/browse/HRQB-22

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

ghukill added 8 commits May 31, 2024 16:44

Why these changes are being introduced:

In some instances, it makes sense to not cache
results, e.g. inserts or records queries, where
data may have changed from other actions.

How this addresses that need:
* Skips caching for some QBClient methods
* Adds method to retrieve a single table's information from Quickbase

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-22

Why these changes are being introduced:

One requirement of the Quickbase implementation is data going
back to 2019. Until now, no limits were placed on how far back data
warehouse queries reached. This commit begins to establish such limits for
pre-existing tasks, and they will be enforced for future ones as well.

How this addresses that need:
* Limits Employee records to those with Employee Appointments that
end on or after 2019

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-22
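
As illustration only, the cutoff described above might look something like the following in pandas terms; the dataframe and column name are hypothetical stand-ins, not the project's actual query:

import pandas as pd

def limit_to_2019_and_later(appointments: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose (hypothetical) appointment end date is on or after 2019."""
    end_dates = pd.to_datetime(appointments["appt_end_date"])
    return appointments[end_dates >= pd.Timestamp("2019-01-01")]
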
Why these changes are being introduced:

It is not uncommon for a load task to have multiple required parent tasks,
which means the convenience method 'single_input_dataframe' cannot be used.
However, overriding QuickbaseUpsertTask.get_records() could introduce errors
and inconsistencies.

How this addresses that need:
* Define new property QuickbaseUpsertTask.input_task_to_load that explicitly
defines which input task to use data from
* QuickbaseUpsertTask.get_records does not need to be overridden even if Load
task has multiple parent tasks

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-22

Why these changes are being introduced:

Tasks to support loading of Quickbase table Employee Appointments.

How this addresses that need:
* Creates ETL tasks for Employee Appointments
* Creates ETL tasks for related lookup tables Job Titles, Position Titles, and
Employee Types

Side effects of this change:
* Employee appointment data is loaded to QB

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-22

How this addresses that need:
* Updated testing approach for ETL tasks, with tasks complete
by default as fixtures

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-22
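
A rough sketch of that fixture pattern, using a stand-in task that is purely illustrative (the project's actual fixtures and mocking may differ):

import luigi
import pytest

class FakeTransformTask(luigi.Task):
    """Stand-in for a Transform task, for illustration only."""

    def run(self):
        self._ran = True

    def complete(self):
        return getattr(self, "_ran", False)

@pytest.fixture
def task_transform_fake_complete():
    # run the task up front so tests receive it already complete
    task = FakeTransformTask()
    task.run()
    return task
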
@ghukill marked this pull request as ready for review June 3, 2024 16:00
Comment on lines +146 to +148
def input_task_to_load(self) -> str:
    """Upsert data from parent task 'TransformEmployeeAppointments'."""
    return "TransformEmployeeAppointments"

@ghukill (Collaborator, Author):

This is an example of using the optional property input_task_to_load. Setting it explicitly selects which of the multiple parent tasks defined in self.requires() to use as the input data for upserting.
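
A guess at the mechanism, pieced together from the named_inputs property visible elsewhere in this PR; a sketch, not necessarily the actual implementation:

import luigi

class QuickbaseUpsertTask(luigi.Task):  # simplified stand-in, not the real class
    def get_records(self):
        # when requires() returns multiple parents, pick the one parent
        # named by input_task_to_load and read its output dataframe
        target = self.named_inputs[self.input_task_to_load]
        return target.read()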

@jonavellecuerdo left a comment:

Just a couple of clarifying questions, but it's looking good!

@@ -91,7 +91,7 @@ def single_input_dataframe(self) -> pd.DataFrame:
        return target.read()  # type: ignore[return-value]

    @property
    def named_inputs(self) -> dict[str, PandasPickleTarget | QuickbaseTableTarget]:

@jonavellecuerdo:

Hmm, why'd the type hint change? 🤔

@ghukill (Collaborator, Author) replied Jun 3, 2024:

Great question. This was a bit of over-engineering early on, when I had assumed that QuickbaseUpsertTasks might be used as inputs into other tasks. But that seems unlikely now.

If we make this more generic, we can avoid some linting and typing hiccups where the value could be a QuickbaseTableTarget, when we know 99% of the time it's a dataframe from a PandasPickleTarget. Even trying to preserve that distinction isn't worth the linting/typing headaches, as we often retrieve the task target in a way that loses some of that thread. Seemed pretty clear that just relaxing this type was the best option.

@@ -57,7 +61,7 @@ def make_request(
     """
     # hash the request to cache the response
     request_hash = (path, json.dumps(kwargs, sort_keys=True))
-    if self.cache_results and request_hash in self._cache:
+    if cache and self.cache_results and request_hash in self._cache:

@jonavellecuerdo:

Is there a scenario where cache != self.cache_results? I was curious why an and condition is used as opposed to or. Is it so self.cache_results will take precedence, and if so, how would self.cache_results be set? 🤔

@ghukill (Collaborator, Author) replied:

The and supports this situation:

  • QBClient instantiated with caching (default)
  • QBClient makes a cached method request like listing tables
  • QBClient makes a non-cached method request like upserting data

I had considered temporarily setting qbclient.cache_results = False for a non-cached request, but this is brittle if the request fails and the QBClient is then in an intermediate state.

It's probably overkill, but this allows:

  • completely preventing caching by QBClient(cache_results=False)
  • temporarily / method-by-method skipping caching with qbclient.make_request(..., cache=False)
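
Putting the diff and those bullets together, the guard reads roughly like this; the method body and parameter order are assumptions sketched from the visible lines:

import json

def make_request(self, path, cache=True, **kwargs):
    # hash the request to cache the response
    request_hash = (path, json.dumps(kwargs, sort_keys=True))
    # a cache hit requires BOTH the per-call flag (cache=True, the default)
    # AND the client-level setting (self.cache_results)
    if cache and self.cache_results and request_hash in self._cache:
        return self._cache[request_hash]
    # ...otherwise perform the request, caching the response if allowed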

@jonavellecuerdo replied:

I see. Thank you for explaining how it works!

Comment on lines +39 to +46
def test_task_transform_employee_appointments_libhr_data_merge_success(
    task_transform_employee_appointments_complete,
):
    row = task_transform_employee_appointments_complete.get_dataframe().iloc[0]
    assert row["HC ID"] == "L-001"
    assert row["Cost Object"] == 7777777
    assert row["Supervisor"] == "111111111"
    assert row["Related Department ID"] == 42.0

@jonavellecuerdo commented Jun 3, 2024:

[Non-blocking] As I was reviewing these _merge tests, I initially thought it'd be helpful to include a check that we're only merging on values that appear on the left dataframe (since how="left") (i.e., by also asserting row counts). However, I started to wonder whether that would be attempting to test pd.merge, which doesn't seem like the thing to do. Just wanted to share my thoughts!

@ghukill (Collaborator, Author) replied Jun 3, 2024:

Appreciate the pondering! I'm finding testing in this project difficult, particularly at this grain of actual ETL tests that deal with the data warehouse, Quickbase, and sensitive data.

I think you're on to something about row counts, because it is in pd.merge() where we'd likely introduce duplicate rows.

I should have commented on this somewhere, but I'm envisioning some "runtime assertions" once things have settled (and we're getting close now, with only a couple more tables to add ETL tasks for). I think combing through some of these and adding checks on row counts, duplicates, etc., would be a nice integrity test when tasks are actually running. And they could be confirmed with mocked unit tests that would trigger duplicate rows in the merges as-is.

All said, it's on the brain! And I think the answer is somewhere in-between, where we need to think:

  • I don't want duplicate rows in task XYZ, because ABC
  • I have a runtime assertion in task XYZ that raises errors if that happens (one possible shape is sketched below)
  • I can write a test that simulates that and confirm runtime assertion works
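
For concreteness, one shape such a runtime assertion might take; everything here is a sketch, with the merge pattern borrowed loosely from the tests above:

import pandas as pd

def left_merge_with_integrity_check(
    left: pd.DataFrame, right: pd.DataFrame, on: str
) -> pd.DataFrame:
    """Left-merge two dataframes, raising if the merge fans out the left side's rows."""
    merged = left.merge(right, how="left", on=on)
    if len(merged) != len(left):
        raise ValueError(
            f"merge on {on!r} introduced {len(merged) - len(left)} extra rows"
        )
    return merged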

@jonavellecuerdo replied:

Sounds like a plan! Thank you for sharing. :)

@ghukill requested a review from jonavellecuerdo June 3, 2024 20:57

@ehanson8 left a comment:

1 optional suggestion



def test_qbclient_parse_upsert_results_response_success(qbclient, mocked_qb_api_upsert):
    parsed_response = qbclient.parse_upsert_results(mocked_qb_api_upsert)

@ehanson8:

Optional: could skip the variable and directly assert on qbclient.parse_upsert_results(mocked_qb_api_upsert)
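
i.e., something along these lines, where expected_parsed_response is a placeholder for whatever the original test asserts against:

def test_qbclient_parse_upsert_results_response_success(qbclient, mocked_qb_api_upsert):
    # expected_parsed_response is a placeholder for the real expected value
    assert qbclient.parse_upsert_results(mocked_qb_api_upsert) == expected_parsed_response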

@ghukill merged commit d33d0e8 into main Jun 5, 2024
3 checks passed