HRQB 22 - Employee Appointments #39
Conversation
Why these changes are being introduced:

In some instances, it makes sense to not cache results, e.g. inserts or records queries, where data may have changed from other actions.

How this addresses that need:

- Skips caching for some `QBClient` methods
- Adds a method to retrieve single table information from Quickbase

Side effects of this change: None

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
Why these changes are being introduced:

One requirement of the Quickbase implementation is data going back to 2019. Until now, no data limits were placed on data warehouse queries. This begins to establish this for pre-existing tasks, and will be enforced for future ones as well.

How this addresses that need:

- Limits Employee records to those with Employee Appointments that end on or after 2019

Side effects of this change: None

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
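That 2019 cutoff can be illustrated with a small sketch. This is hypothetical: the field name `end_date` and the function are assumptions for illustration, and in the PR the limit is actually applied in the data warehouse query, not in Python.

```python
from datetime import date

def filter_appointments(rows: list[dict], cutoff: date = date(2019, 1, 1)) -> list[dict]:
    """Keep only appointment rows whose end date is on or after the cutoff.

    Hypothetical stand-in for the date limit described above; the real
    project enforces this in the data warehouse SQL.
    """
    return [row for row in rows if row["end_date"] >= cutoff]

appointments = [
    {"name": "old appointment", "end_date": date(2018, 6, 30)},
    {"name": "current appointment", "end_date": date(2019, 6, 30)},
]
kept = filter_appointments(appointments)
```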
Why these changes are being introduced:

It is not uncommon for a load task to have multiple required parent tasks, which means the convenience method `single_input_dataframe` cannot be used. However, overriding `QuickbaseUpsertTask.get_records()` could introduce errors and inconsistencies.

How this addresses that need:

- Defines new property `QuickbaseUpsertTask.input_task_to_load` that explicitly defines which input task to use data from
- `QuickbaseUpsertTask.get_records` does not need to be overridden even if a Load task has multiple parent tasks

Side effects of this change: None

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
Why these changes are being introduced:

Tasks are needed to support loading of Quickbase table Employee Appointments.

How this addresses that need:

- Creates ETL tasks for Employee Appointments
- Creates ETL tasks for related lookup tables Job Titles, Position Titles, and Employee Types

Side effects of this change:

- Employee appointment data is loaded to QB

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
How this addresses that need:

- Updated testing approach for ETL tasks, with tasks complete by default as fixtures

Relevant ticket(s):

- https://mitlibraries.atlassian.net/browse/HRQB-22
```python
def input_task_to_load(self) -> str:
    """Upsert data from parent task 'TransformEmployeeAppointments'."""
    return "TransformEmployeeAppointments"
```
This is an example of using the optional property `input_task_to_load`. By setting this, it explicitly sets which parent task, from the multiple ones defined in `self.requires()`, to use as the input data for upserting.
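The pattern can be sketched in simplified, stdlib-only form. The class and method names below mirror this PR, but the bodies are hypothetical stand-ins (the real project uses luigi tasks and pickled dataframe targets):

```python
# Hypothetical sketch of the input_task_to_load pattern; not the
# project's actual implementation.

class Task:
    """Minimal stand-in for a pipeline task."""

    def requires(self) -> list["Task"]:
        return []

    def output(self):
        raise NotImplementedError


class QuickbaseUpsertTask(Task):
    # Optional: name of the parent task whose output should be upserted.
    input_task_to_load = None

    @property
    def named_inputs(self) -> dict:
        """Map parent task class names to their outputs."""
        return {task.__class__.__name__: task.output() for task in self.requires()}

    def get_records(self):
        inputs = self.named_inputs
        if self.input_task_to_load is not None:
            # explicitly pick one parent among many
            return inputs[self.input_task_to_load]
        if len(inputs) != 1:
            raise ValueError("multiple parent tasks: set input_task_to_load")
        return next(iter(inputs.values()))


class TransformEmployeeAppointments(Task):
    def output(self):
        return [{"HC ID": "L-001"}]


class LoadJobTitles(Task):
    def output(self):
        return [{"Job Title": "Librarian"}]


class LoadEmployeeAppointments(QuickbaseUpsertTask):
    # Two parents, so we must say which one feeds the upsert.
    input_task_to_load = "TransformEmployeeAppointments"

    def requires(self):
        return [TransformEmployeeAppointments(), LoadJobTitles()]
```

With this in place, `get_records()` needs no override: `LoadEmployeeAppointments().get_records()` returns the transform task's output even though the task has two parents.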
Just a couple of clarifying questions, but it's looking good!
```python
@@ -91,7 +91,7 @@ def single_input_dataframe(self) -> pd.DataFrame:
        return target.read()  # type: ignore[return-value]

    @property
    def named_inputs(self) -> dict[str, PandasPickleTarget | QuickbaseTableTarget]:
```
Hmm, why'd the type hint change? 🤔
Great question. This was over-engineering early on, where I had assumed that `QuickbaseUpsertTask`s might be used as inputs into other tasks. But that seems unlikely now.

If we make this more generic, then we can avoid some linting and typing hiccups where it could be a dictionary from a `QuickbaseTableTarget`, when we know 99% of the time it's a dataframe from `PandasPickleTarget`. But even that isn't worth the linting/typing headaches, as we often get the task target in a way that loses some of that thread. Seemed pretty clear that just relaxing this type was the best option.
```python
@@ -57,7 +61,7 @@ def make_request(
        """
        # hash the request to cache the response
        request_hash = (path, json.dumps(kwargs, sort_keys=True))
-       if self.cache_results and request_hash in self._cache:
+       if cache and self.cache_results and request_hash in self._cache:
```
Is there a scenario where `cache != self.cache_results`? I was curious why an `and` condition is used as opposed to `or`. Is it so `self.cache_results` will take precedence, and if so, how would `self.cache_results` be set? 🤔
The `and` supports this situation:

- `QBClient` instantiated with caching (default)
- `QBClient` makes a cached method request like listing tables
- `QBClient` makes a non-cached method request like upserting data

I had considered temporarily setting `qbclient.cache_results = False` for a non-cached request, but this is brittle if the request fails and the `QBClient` is then left in an intermediate state.

It's probably overkill, but this allows:

- completely preventing caching with `QBClient(cache_results=False)`
- temporarily / method-by-method skipping caching with `qbclient.make_request(..., cache=False)`
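A minimal sketch of how those two flags interplay, with a hypothetical `_send` stub standing in for the real HTTP call (this is a simplified stand-in, not the project's actual client):

```python
import json

class QBClient:
    """Simplified sketch of the caching behavior discussed above."""

    def __init__(self, cache_results: bool = True):
        self.cache_results = cache_results
        self._cache: dict = {}
        self.calls = 0  # count of real (non-cached) requests, for illustration

    def _send(self, path: str, **kwargs):
        """Hypothetical stub for the actual HTTP request."""
        self.calls += 1
        return {"path": path, **kwargs}

    def make_request(self, path: str, cache: bool = True, **kwargs):
        # hash the request to cache the response
        request_hash = (path, json.dumps(kwargs, sort_keys=True))
        # serve from cache only if BOTH the per-call flag and the
        # instance-level flag allow it
        if cache and self.cache_results and request_hash in self._cache:
            return self._cache[request_hash]
        response = self._send(path, **kwargs)
        if cache and self.cache_results:
            self._cache[request_hash] = response
        return response
```

Because the per-call `cache=False` never mutates `self.cache_results`, a failed request cannot leave the client stuck in a "caching disabled" state.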
I see. Thank you for explaining how it works!
```python
def test_task_transform_employee_appointments_libhr_data_merge_success(
    task_transform_employee_appointments_complete,
):
    row = task_transform_employee_appointments_complete.get_dataframe().iloc[0]
    assert row["HC ID"] == "L-001"
    assert row["Cost Object"] == 7777777
    assert row["Supervisor"] == "111111111"
    assert row["Related Department ID"] == 42.0
```
[Non-blocking] As I was reviewing these `_merge` tests, I initially thought it'd be helpful to include a check that we're only merging on values that appear on the left dataframe (since `how="left"`), i.e., by also asserting row counts. However, I started to wonder whether that would be attempting to test `pd.merge`, which doesn't seem like the thing to do. Just wanted to share my thoughts!
Appreciate the pondering! I'm finding testing in this project difficult, particularly at this grain of actual ETL tests that are dealing with the data warehouse, Quickbase, and sensitive data.

I think you're on to something about row counts, because it is in `pd.merge()` where we'd likely introduce duplicate rows.

I should have commented on this somewhere, but I'm envisioning some "runtime assertions" once things have settled (and we're getting close now, with only a couple more tables to add ETL tasks for). I think combing through some of these and adding checks on row counts, duplicates, etc., would be a nice integrity test when tasks are actually running. And they could be confirmed with mocked unit tests that would trigger duplicate rows in the merges as-is.

All said, it's on the brain! And I think it is somewhere in-between, like we need to think:

- I don't want duplicate rows in task XYZ, because ABC
- I have a runtime assertion in task XYZ that raises errors if that happens
- I can write a test that simulates that and confirms the runtime assertion works
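One shape such a runtime assertion could take, sketched with a stdlib-only stand-in for `pd.merge(how="left")` (the function and its name are hypothetical, not from this PR):

```python
def left_merge_unique(left: list[dict], right: list[dict], on: str) -> list[dict]:
    """Left-join two lists of dict rows on key `on`, with a runtime
    assertion that the join cannot duplicate left-side rows.

    Hypothetical stdlib stand-in for the pandas left merges discussed
    above: duplicate join keys on the right side are exactly what would
    silently multiply rows in pd.merge(how="left").
    """
    keys = [row[on] for row in right]
    if len(keys) != len(set(keys)):
        # the runtime assertion: fail loudly instead of duplicating rows
        raise ValueError(f"duplicate {on!r} values in right table would duplicate rows")
    lookup = {row[on]: row for row in right}
    # left rows with no match pass through unchanged, as in how="left"
    return [{**left_row, **lookup.get(left_row[on], {})} for left_row in left]
```

A unit test can then simulate the bad input (two right-side rows with the same key) and confirm the assertion fires, which tests our integrity check rather than re-testing `pd.merge` itself. (For the real pandas call, `pd.merge` also offers a built-in `validate="many_to_one"` argument to the same effect.)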
Sounds like a plan! Thank you for sharing. :)
1 optional suggestion
tests/test_qbclient_client.py
```python
def test_qbclient_parse_upsert_results_response_success(qbclient, mocked_qb_api_upsert):
    parsed_response = qbclient.parse_upsert_results(mocked_qb_api_upsert)
```
Optional: could skip the variable and directly assert on `qbclient.parse_upsert_results(mocked_qb_api_upsert)`.
Purpose and background context

This PR introduces tasks and updated pipelines to load employee appointment data. The Quickbase table `Employee Appointments` relies on some other Quickbase tables for lookup data, including:

- `Job Titles`
- `Position Titles`
- `Employee Types`

As such, tasks for loading `Employee Appointments` rely on other, new tasks that also populate those tables in Quickbase. This results in a dependency graph where, for `LoadEmployeeAppointments` to complete, so too must tasks like `LoadJobTitles` or `LoadPositionTitles`, all the way back to `ExtractDWEmployeeAppointments` for raw data warehouse data.

I don't think it will be productive to outline these dependencies in quite so much detail going forward, but wanted to take a moment to observe them here, as this is the first time some of these more complex dependencies have been introduced. In future tasks -- e.g. for `Employee Salary History` or `Employee Leave` -- this pattern will be common and repeated.

There are some misc updates like dependency updates and additional `QBClient` functionality, but the primary work is part of the commits listed above.

How can a reviewer manually see the effects of these changes?
Still not readily testable in deployed contexts, or locally without data warehouse and Quickbase credentials.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)