
HRQB 21 - Employees table #31

Merged · 8 commits · May 30, 2024

Conversation


@ghukill ghukill commented May 28, 2024

Purpose and background context

This PR introduces the first tasks that load data into Quickbase, specifically the Employees table. As a foundational table with no dependencies, and a relatively simple one, it made sense to start here.

As the first table to be loaded, it needed to be situated in a "pipeline" in the HRQBClient. While we may get into incremental updates over time, the first pipeline established is a full refresh/upsert called FullUpdate.

It's guaranteed that some columns will shift and new Employee fields will be added, but this establishes a baseline to work with while other tables follow shortly.

The goal of this PR is to knowledge-share the basic structure of ETL tasks for loading data into a single table; future ones will follow very similar patterns.
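The ETL chain described here (an extract task feeding a transform feeding a load, wired together by dependency) can be sketched framework-agnostically. The real tasks are Luigi tasks built on the project's base classes; the class bodies and the tiny runner below are illustrative stand-ins, not the PR's code:

```python
# Framework-agnostic sketch of the Extract -> Transform -> Load chain.
# The real project wires these as Luigi tasks via requires(); here each
# stage just declares its upstream dependency so the ordering is explicit.
# All method bodies are illustrative placeholders, not the PR's code.

class Task:
    upstream = None  # the stage this one depends on, if any

    def run(self, data):
        raise NotImplementedError


class ExtractDWEmployees(Task):
    upstream = None

    def run(self, data=None):
        # The real task queries the data warehouse via SQL files.
        return [{"mit_id": "123", "name": "A. Person"}]


class TransformEmployees(Task):
    upstream = ExtractDWEmployees

    def run(self, data):
        # Shape warehouse rows into Quickbase-style field names (invented here).
        return [{"MIT ID": row["mit_id"], "Full Name": row["name"]} for row in data]


class LoadEmployees(Task):
    upstream = TransformEmployees

    def run(self, data):
        # The real task upserts to Quickbase; here we just report a count.
        return {"upserted": len(data)}


def run_pipeline(task_cls):
    """Resolve upstream dependencies, then run each stage in order."""
    chain = []
    while task_cls is not None:
        chain.append(task_cls)
        task_cls = task_cls.upstream
    data = None
    for cls in reversed(chain):  # extract first, load last
        data = cls().run(data)
    return data


result = run_pipeline(LoadEmployees)
```

The dependency declaration is the important part: each stage only names its upstream, and the runner (Luigi's scheduler, in the real code) derives the execution order from that.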

Testing is still minimal for the tasks themselves. The base tasks, which these extend, are well tested. A meaningful test at this level would run against the data itself, where mocking begins to present a questionable return on investment for what the test provides. Some preliminary tests have been created, with the assumption that more will be added as other QB table tasks take shape and expose edge cases to test for; I'm proposing to keep coverage light in this PR, to focus on the mechanics of a pipeline, and revisit tests in future PRs.

How can a reviewer manually see the effects of these changes?

Still figuring out sensible access restrictions, so direct testing is not possible at this time, but some commands and output can be demonstrated.

Get status of pipeline:

pipenv run hrqb --verbose pipeline -p FullUpdate status

Output:

2024-05-28 14:09:08,827 INFO hrqb.cli.main() line 31: Logger 'root' configured with level=DEBUG
2024-05-28 14:09:08,827 INFO hrqb.cli.main() line 32: No Sentry DSN found, exceptions will not be sent to Sentry
2024-05-28 14:09:08,827 INFO hrqb.cli.main() line 34: Running process
2024-05-28 14:09:08,832 DEBUG hrqb.cli.pipeline() line 123: Successfully loaded pipeline: 'hrqb.tasks.pipelines.FullUpdate'
2024-05-28 14:09:08,834 INFO hrqb.cli.status() line 143: 
├── INCOMPLETE: FullUpdate()
   ├── INCOMPLETE: LoadEmployees(pipeline=FullUpdate, table_name=Employees, stage=Load)
      ├── INCOMPLETE: TransformEmployees(pipeline=FullUpdate, table_name=, stage=Transform)
         ├── INCOMPLETE: ExtractDWEmployees(table_name=, pipeline=FullUpdate, stage=Extract)
  • note the root FullUpdate task
  • note the ETL tasks that support loading the Employees table

Running the pipeline:

pipenv run hrqb --verbose pipeline -p FullUpdate run

Output:

...
...
2024-05-28 14:10:21,158 INFO hrqb.base.task.parse_and_log_upsert_results() line 255: Record processed: 751, created: 0, modified: 0, unchanged: 751
...
...
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 4 ran successfully:
    - 1 ExtractDWEmployees(table_name=, pipeline=FullUpdate, stage=Extract)
    - 1 FullUpdate(parent_pipeline_name=)
    - 1 LoadEmployees(pipeline=FullUpdate, table_name=Employees, stage=Load)
    - 1 TransformEmployees(pipeline=FullUpdate, table_name=, stage=Transform)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

2024-05-28 14:10:21,400 INFO hrqb.cli.run() line 192: Pipeline run result: SUCCESS
2024-05-28 14:10:21,401 INFO hrqb.cli.run() line 193: 
├── COMPLETE: FullUpdate()
   ├── COMPLETE: LoadEmployees(pipeline=FullUpdate, table_name=Employees, stage=Load)
      ├── COMPLETE: TransformEmployees(pipeline=FullUpdate, table_name=, stage=Transform)
         ├── COMPLETE: ExtractDWEmployees(table_name=, pipeline=FullUpdate, stage=Extract)
  • because data was already loaded, no records were modified (unchanged: 751 in logs)
    • demonstrating that upserts are idempotent, for this table at least
  • when complete, the entire FullUpdate pipeline is considered complete; only a single task so far, but others will follow
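A minimal sketch of why a keyed upsert behaves this way: records are matched on a merge field (the field name "MIT ID" below is hypothetical), so re-sending identical data creates and modifies nothing. The counters mirror the processed/created/modified/unchanged numbers in the log output above:

```python
# Toy upsert keyed on a merge field, to illustrate idempotency.
# The key name "MIT ID" and the counter shape are illustrative; the real
# numbers come from the Quickbase upsert API response.

def upsert(table: dict, records: list[dict], key: str) -> dict:
    counts = {"processed": 0, "created": 0, "modified": 0, "unchanged": 0}
    for record in records:
        counts["processed"] += 1
        existing = table.get(record[key])
        if existing is None:
            table[record[key]] = record
            counts["created"] += 1
        elif existing != record:
            table[record[key]] = record
            counts["modified"] += 1
        else:
            counts["unchanged"] += 1  # identical payload: no-op
    return counts


table: dict = {}
records = [{"MIT ID": "123", "Name": "A"}, {"MIT ID": "456", "Name": "B"}]
first = upsert(table, records, key="MIT ID")   # creates both records
second = upsert(table, records, key="MIT ID")  # identical data: all unchanged
```

Running the same upsert twice is safe because the second pass finds every record already present and identical, which is exactly the `unchanged: 751` behavior in the logs.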

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

ghukill added 4 commits May 28, 2024 13:27
Why these changes are being introduced:

All QuickbaseUpsertTask tasks provide a dataframe for
upserting to Quickbase. Many tasks will share the need
to modify field types for Quickbase.

How this addresses that need:
* Adds _normalize_records_for_upsert method on QuickbaseUpsertTask
* Adds logging for upsert results

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
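A hedged sketch of what such a normalization step might do, assuming the common case of coercing dates and NaN values into JSON-friendly types before an upsert. The helper names and rules below are illustrative and may differ from the PR's actual _normalize_records_for_upsert:

```python
# Illustrative record normalization before a Quickbase upsert: coerce
# Python/pandas-ish values into the simple types a JSON API expects.
# Helper names and rules are invented, not the PR's implementation.
import datetime
import math


def normalize_value(value):
    if isinstance(value, float) and math.isnan(value):
        return None  # pandas NaN -> JSON null
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.strftime("%Y-%m-%d")  # date-only string for Quickbase
    return value


def normalize_records(records: list[dict]) -> list[dict]:
    return [{k: normalize_value(v) for k, v in record.items()} for record in records]


rows = [{"Hire Date": datetime.date(2020, 1, 15), "Bonus": float("nan")}]
normalized = normalize_records(rows)
```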
Why these changes are being introduced:

Getting into the task work, it became evident that it would be
preferable to group all tasks related to a single table in a
single file. This does not prevent importing tasks as dependencies
from other files, but it groups like tasks.

How this addresses that need:
* Removes tasks/extract.py, tasks/transform.py, tasks/load.py

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
Why these changes are being introduced:

Tasks related to loading data into the QB table Employees.  This is
the first task in a new pipeline called FullUpdate.

How this addresses that need:
* Creates new pipeline task FullUpdate
* Creates ETL tasks for Employees table
* Establishes a pattern of a sql folder and table-specific task files

Side effects of this change:
* Data can theoretically be loaded into the Employees table if the
FullUpdate task is run

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
"""Pipeline to perform a full update of all Quickbase tables."""

def requires(self) -> Iterator[luigi.Task]:  # pragma: no cover
    from hrqb.tasks.employees import LoadEmployees
ghukill (Collaborator, Author) commented:

Imports are performed inside the requires() method to avoid importing them at the top level. This keeps imports in this file limited to HRQBPipelineTask tasks themselves.

def requires(self) -> Iterator[luigi.Task]:  # pragma: no cover
    from hrqb.tasks.employees import LoadEmployees

    yield LoadEmployees(pipeline=self.pipeline_name)
ghukill (Collaborator, Author) commented:

As more table ETL tasks are added, the final load task will get added here. I would estimate this will be 10-12 tasks long by the end.

ghukill added 2 commits May 28, 2024 14:24
Why these changes are being introduced:

Unit testing will be somewhat minimal for actual tasks, given that much logic
is pushed down to base tasks. But some low-hanging fruit is valuable,
like checking that tasks have SQL files to read, that explicit transformations are as expected, etc.

This sets a minimal baseline to aim for in future tasks.

How this addresses that need:
* Added tests/tasks directory with a one-to-one for task files
* Add preliminary unit tests for Employees table tasks

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
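The kind of "low-hanging fruit" test described in this commit might look like the following sketch; the transform helper and the test itself are invented for illustration, not taken from the PR:

```python
# Illustrative unit test for an explicit transformation step, the style
# of "low-hanging fruit" test described above. The helper below is a toy
# stand-in for a TransformEmployees mapping, not the PR's code.

def transform_employee_row(row: dict) -> dict:
    """Toy transform: map warehouse columns to a Quickbase-style field."""
    return {"Full Name": f"{row['first_name']} {row['last_name']}"}


def test_transform_employee_row():
    row = {"first_name": "Ada", "last_name": "Lovelace"}
    assert transform_employee_row(row) == {"Full Name": "Ada Lovelace"}


test_transform_employee_row()
```

Tests like this stay cheap because they exercise only the explicit mapping logic, leaving the plumbing to the already well-tested base tasks.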
@ehanson8 ehanson8 left a comment:

Looks great, a few notes

Review threads (resolved):
  • hrqb/base/task.py
  • hrqb/utils/__init__.py
  • hrqb/utils/__init__.py (outdated)
  • hrqb/tasks/employees.py (outdated)
@ghukill ghukill requested a review from ehanson8 May 29, 2024 18:38
@jonavellecuerdo jonavellecuerdo left a comment:

This is looking really great! Just have a few comments for you, @ghukill.

Review thread (resolved): hrqb/cli.py


def today_date() -> datetime.date:
    return datetime.datetime.now(tz=datetime.UTC).date()


def normalize_date(date: str | datetime.datetime) -> str | None:
    if date is None or date == "" or pd.isna(date):


ghukill (Collaborator, Author) replied:

Nice catch! Will remove that redundant is None clause.

Comment on lines 18 to 22
    if isinstance(date, str):
        date = date_parser(date)
    if isinstance(date, datetime.datetime):
        return date.strftime("%Y-%m-%d")
    return None


Hmm, when viewing this code locally, Pylance indicates "Code is unreachable" for the final return None. Should there be some error handling for dates that cannot be parsed?

Perhaps rewrite to:

import logging

from dateutil.parser import ParserError

def normalize_date(date: str | datetime.datetime) -> str | None:
    if date is None or date == "" or pd.isna(date):
        return None
    if isinstance(date, str):
        try:
            date = date_parser(date)
        except ParserError:
            logging.warning(f"Unable to parse date from '{date}'")
    if isinstance(date, datetime.datetime):
        return date.strftime("%Y-%m-%d")
    return None

ghukill (Collaborator, Author) replied:

This utility function has been a headache, and @ehanson8 commented on it too. I think this is a good change. You (and Pylance) are right that if the string is not parsable, an exception would be raised, so we'd never hit that return None (because if it did parse, we'd return in the next block). Updating!

Review thread (resolved, outdated): hrqb/tasks/employees.py
@ghukill ghukill requested a review from jonavellecuerdo May 30, 2024 16:05
@ghukill
Copy link
Collaborator Author

ghukill commented May 30, 2024

New commit to address comments @jonavellecuerdo

@ghukill ghukill merged commit 6b651bc into main May 30, 2024
3 of 5 checks passed

3 participants