
HRQB 21 - Employees table #31

Merged · 8 commits · May 30, 2024

Conversation


@ghukill ghukill commented May 28, 2024

Purpose and background context

This PR introduces the first tasks that load data into Quickbase, specifically the Employees table. As a foundational table with no dependencies, and a relatively simple one, it made sense to start here.

As the first table to be loaded, it needed to be situated in a "pipeline" in the HRQBClient. While we may get into incremental updates over time, the first pipeline established is a full refresh/upsert called FullUpdate.

It's guaranteed that some columns will shift and new Employee fields will be added, but this establishes a baseline to work with while other tables follow shortly.

The goal of this PR is to knowledge-share the basic structure of ETL tasks for loading data into a single table; future ones will follow very similar patterns.
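The ETL chain described here (an extract task feeding a transform feeding a load, wired together by dependency) can be sketched framework-agnostically. The real tasks are Luigi tasks built on the project's base classes; the class bodies and the tiny runner below are illustrative stand-ins, not the PR's code:

```python
# Framework-agnostic sketch of the Extract -> Transform -> Load chain.
# The real project wires these as Luigi tasks via requires(); here each
# stage just declares its upstream dependency so the ordering is explicit.
# All method bodies are illustrative placeholders, not the PR's code.

class Task:
    upstream = None  # the stage this one depends on, if any

    def run(self, data):
        raise NotImplementedError


class ExtractDWEmployees(Task):
    upstream = None

    def run(self, data=None):
        # The real task queries the data warehouse via SQL files.
        return [{"mit_id": "123", "name": "A. Person"}]


class TransformEmployees(Task):
    upstream = ExtractDWEmployees

    def run(self, data):
        # Shape warehouse rows into Quickbase-style field names (invented here).
        return [{"MIT ID": row["mit_id"], "Full Name": row["name"]} for row in data]


class LoadEmployees(Task):
    upstream = TransformEmployees

    def run(self, data):
        # The real task upserts to Quickbase; here we just report a count.
        return {"upserted": len(data)}


def run_pipeline(task_cls):
    """Resolve upstream dependencies, then run each stage in order."""
    chain = []
    while task_cls is not None:
        chain.append(task_cls)
        task_cls = task_cls.upstream
    data = None
    for cls in reversed(chain):  # extract first, load last
        data = cls().run(data)
    return data


result = run_pipeline(LoadEmployees)
```

The dependency declaration is the important part: each stage only names its upstream, and the runner (Luigi's scheduler, in the real code) derives the execution order from that.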

Testing is still minimal for the tasks themselves. The base tasks, which these extend, are well tested. A meaningful test at this level would run against the data itself, where mocking begins to present a questionable return on investment for what the test provides. Some preliminary tests have been created, with the assumption that more will be added as other QB table tasks take shape and expose edge cases to test for; I'm proposing to keep coverage light in this PR, to focus on the mechanics of a pipeline, and revisit tests in future PRs.

How can a reviewer manually see the effects of these changes?

Still figuring out sensible access restrictions, so direct testing is not possible at this time, but some commands and output can be demonstrated.

Get status of pipeline:

pipenv run hrqb --verbose pipeline -p FullUpdate status

Output:

2024-05-28 14:09:08,827 INFO hrqb.cli.main() line 31: Logger 'root' configured with level=DEBUG
2024-05-28 14:09:08,827 INFO hrqb.cli.main() line 32: No Sentry DSN found, exceptions will not be sent to Sentry
2024-05-28 14:09:08,827 INFO hrqb.cli.main() line 34: Running process
2024-05-28 14:09:08,832 DEBUG hrqb.cli.pipeline() line 123: Successfully loaded pipeline: 'hrqb.tasks.pipelines.FullUpdate'
2024-05-28 14:09:08,834 INFO hrqb.cli.status() line 143: 
├── INCOMPLETE: FullUpdate()
   ├── INCOMPLETE: LoadEmployees(pipeline=FullUpdate, table_name=Employees, stage=Load)
      ├── INCOMPLETE: TransformEmployees(pipeline=FullUpdate, table_name=, stage=Transform)
         ├── INCOMPLETE: ExtractDWEmployees(table_name=, pipeline=FullUpdate, stage=Extract)
  • note the root FullUpdate task
  • note the ETL tasks that support loading the Employees table

Running the pipeline:

pipenv run hrqb --verbose pipeline -p FullUpdate run

Output:

...
...
2024-05-28 14:10:21,158 INFO hrqb.base.task.parse_and_log_upsert_results() line 255: Record processed: 751, created: 0, modified: 0, unchanged: 751
...
...
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 4 ran successfully:
    - 1 ExtractDWEmployees(table_name=, pipeline=FullUpdate, stage=Extract)
    - 1 FullUpdate(parent_pipeline_name=)
    - 1 LoadEmployees(pipeline=FullUpdate, table_name=Employees, stage=Load)
    - 1 TransformEmployees(pipeline=FullUpdate, table_name=, stage=Transform)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

2024-05-28 14:10:21,400 INFO hrqb.cli.run() line 192: Pipeline run result: SUCCESS
2024-05-28 14:10:21,401 INFO hrqb.cli.run() line 193: 
├── COMPLETE: FullUpdate()
   ├── COMPLETE: LoadEmployees(pipeline=FullUpdate, table_name=Employees, stage=Load)
      ├── COMPLETE: TransformEmployees(pipeline=FullUpdate, table_name=, stage=Transform)
         ├── COMPLETE: ExtractDWEmployees(table_name=, pipeline=FullUpdate, stage=Extract)
  • because data was already loaded, no records were modified (unchanged: 751 in logs)
    • demonstrating that upserts are idempotent, for this table at least
  • when complete, the entire FullUpdate pipeline is considered complete; only a single task so far, but others will follow
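A minimal sketch of why a keyed upsert behaves this way: records are matched on a merge field (the field name "MIT ID" below is hypothetical), so re-sending identical data creates and modifies nothing. The counters mirror the processed/created/modified/unchanged numbers in the log output above:

```python
# Toy upsert keyed on a merge field, to illustrate idempotency.
# The key name "MIT ID" and the counter shape are illustrative; the real
# numbers come from the Quickbase upsert API response.

def upsert(table: dict, records: list[dict], key: str) -> dict:
    counts = {"processed": 0, "created": 0, "modified": 0, "unchanged": 0}
    for record in records:
        counts["processed"] += 1
        existing = table.get(record[key])
        if existing is None:
            table[record[key]] = record
            counts["created"] += 1
        elif existing != record:
            table[record[key]] = record
            counts["modified"] += 1
        else:
            counts["unchanged"] += 1  # identical payload: no-op
    return counts


table: dict = {}
records = [{"MIT ID": "123", "Name": "A"}, {"MIT ID": "456", "Name": "B"}]
first = upsert(table, records, key="MIT ID")   # creates both records
second = upsert(table, records, key="MIT ID")  # identical data: all unchanged
```

Running the same upsert twice is safe because the second pass finds every record already present and identical, which is exactly the `unchanged: 751` behavior in the logs.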

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

ghukill added 4 commits May 28, 2024 13:27
Why these changes are being introduced:

All QuickbaseUpsertTask tasks provide a dataframe for
upserting to Quickbase. Many tasks will share the need
to modify field types for Quickbase.

How this addresses that need:
* Adds _normalize_records_for_upsert method on QuickbaseUpsertTask
* Adds logging for upsert results

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
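A hedged sketch of what such a normalization step might do, assuming the common case of coercing dates and NaN values into JSON-friendly types before an upsert. The helper names and rules below are illustrative and may differ from the PR's actual _normalize_records_for_upsert:

```python
# Illustrative record normalization before a Quickbase upsert: coerce
# Python/pandas-ish values into the simple types a JSON API expects.
# Helper names and rules are invented, not the PR's implementation.
import datetime
import math


def normalize_value(value):
    if isinstance(value, float) and math.isnan(value):
        return None  # pandas NaN -> JSON null
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.strftime("%Y-%m-%d")  # date-only string for Quickbase
    return value


def normalize_records(records: list[dict]) -> list[dict]:
    return [{k: normalize_value(v) for k, v in record.items()} for record in records]


rows = [{"Hire Date": datetime.date(2020, 1, 15), "Bonus": float("nan")}]
normalized = normalize_records(rows)
```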
Why these changes are being introduced:

Getting into the task work, it became evident that it would be
preferable to group all tasks related to a single table in a
single file. This does not prevent importing tasks as dependencies
from other files, but it groups like tasks.

How this addresses that need:
* Removes tasks/extract.py, tasks/transform.py, tasks/load.py

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
Why these changes are being introduced:

Tasks related to loading data into the QB table Employees.  This is
the first task in a new pipeline called FullUpdate.

How this addresses that need:
* Creates new pipeline task FullUpdate
* Creates ETL tasks for Employees table
* Establishes a pattern of a sql folder and table-specific task files

Side effects of this change:
* Data can theoretically be loaded into the Employees table if the
FullUpdate task is run

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
"""Pipeline to perform a full update of all Quickbase tables."""

def requires(self) -> Iterator[luigi.Task]:  # pragma: no cover
    from hrqb.tasks.employees import LoadEmployees
ghukill (Collaborator, Author) commented:

Imports are performed inside the requires() method to avoid importing them at the top level. This keeps imports in this file limited to HRQBPipelineTask tasks themselves.

def requires(self) -> Iterator[luigi.Task]:  # pragma: no cover
    from hrqb.tasks.employees import LoadEmployees

    yield LoadEmployees(pipeline=self.pipeline_name)
ghukill (Collaborator, Author) commented:

As more table ETL tasks are added, the final load task will get added here. I would estimate this will be 10-12 tasks long by the end.

ghukill added 2 commits May 28, 2024 14:24
Why these changes are being introduced:

Unit testing will be somewhat minimal for actual tasks, given that much logic
is pushed down to base tasks. But some low-hanging fruit is valuable,
like checking that tasks have SQL files to read, that explicit transformations are as expected, etc.

This sets a minimal baseline to aim for in future tasks.

How this addresses that need:
* Added tests/tasks directory with a one-to-one for task files
* Add preliminary unit tests for Employees table tasks

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
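The kind of "low-hanging fruit" test described in this commit might look like the following sketch; the transform helper and the test itself are invented for illustration, not taken from the PR:

```python
# Illustrative unit test for an explicit transformation step, the style
# of "low-hanging fruit" test described above. The helper below is a toy
# stand-in for a TransformEmployees mapping, not the PR's code.

def transform_employee_row(row: dict) -> dict:
    """Toy transform: map warehouse columns to a Quickbase-style field."""
    return {"Full Name": f"{row['first_name']} {row['last_name']}"}


def test_transform_employee_row():
    row = {"first_name": "Ada", "last_name": "Lovelace"}
    assert transform_employee_row(row) == {"Full Name": "Ada Lovelace"}


test_transform_employee_row()
```

Tests like this stay cheap because they exercise only the explicit mapping logic, leaving the plumbing to the already well-tested base tasks.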
@ehanson8 ehanson8 left a comment:

Looks great, a few notes

Review threads (resolved):
  • hrqb/base/task.py
  • hrqb/utils/__init__.py
  • hrqb/utils/__init__.py (outdated)
  • hrqb/tasks/employees.py (outdated)
@ghukill ghukill requested a review from ehanson8 May 29, 2024 18:38
@jonavellecuerdo jonavellecuerdo left a comment:

This is looking really great! Just have a few comments for you, @ghukill.

Review thread (resolved): hrqb/cli.py


def today_date() -> datetime.date:
    return datetime.datetime.now(tz=datetime.UTC).date()


def normalize_date(date: str | datetime.datetime) -> str | None:
    if date is None or date == "" or pd.isna(date):


ghukill (Collaborator, Author) replied:

Nice catch! Will remove that redundant is None clause.

Comment on lines 18 to 22
    if isinstance(date, str):
        date = date_parser(date)
    if isinstance(date, datetime.datetime):
        return date.strftime("%Y-%m-%d")
    return None


Hmm, when viewing this code locally, Pylance indicates "Code is unreachable" for the final return None. Should there be some error handling for dates that cannot be parsed?

Perhaps rewrite to:

import logging

from dateutil.parser import ParserError

def normalize_date(date: str | datetime.datetime) -> str | None:
    if date is None or date == "" or pd.isna(date):
        return None
    if isinstance(date, str):
        try:
            date = date_parser(date)
        except ParserError:
            logging.warning(f"Unable to parse date from '{date}'")
    if isinstance(date, datetime.datetime):
        return date.strftime("%Y-%m-%d")
    return None

ghukill (Collaborator, Author) replied:

This utility function has been a headache, and @ehanson8 commented on it too. I think this is a good change. You (and Pylance) are right that if the string is not parsable, an exception would be raised, so we'd never hit that return None (because if it did parse, we'd return in the next block). Updating!

Review thread (resolved, outdated): hrqb/tasks/employees.py
@ghukill ghukill requested a review from jonavellecuerdo May 30, 2024 16:05
@ghukill
Copy link
Collaborator Author

ghukill commented May 30, 2024

New commit to address comments @jonavellecuerdo

@ghukill ghukill merged commit 6b651bc into main May 30, 2024
3 of 5 checks passed

3 participants