HRQB 21 - Employees table #31
Conversation
Why these changes are being introduced:
All QuickbaseUpsertTask tasks provide a dataframe for upserting to Quickbase. Many tasks share the need to modify field types for Quickbase.
How this addresses that need:
* Adds _normalize_records_for_upsert method on QuickbaseUpsertTask
* Adds logging for upsert results
Side effects of this change:
* None
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
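For readers unfamiliar with the pattern, here is a rough sketch of what a record-normalization hook like this could look like. The function body, the specific field handling, and the standalone form are illustrative assumptions, not the actual QuickbaseUpsertTask implementation.

```python
import pandas as pd


def _normalize_records_for_upsert(df: pd.DataFrame) -> list[dict]:
    """Illustrative sketch only: coerce dataframe values into JSON-friendly
    types before upserting to Quickbase. The real method lives on
    QuickbaseUpsertTask and may handle different field types."""
    records = df.copy()
    # format datetime columns as YYYY-MM-DD strings
    for column in records.select_dtypes(include=["datetime"]).columns:
        records[column] = records[column].dt.strftime("%Y-%m-%d")
    # replace NaN/NaT with None so missing values serialize cleanly
    records = records.where(pd.notnull(records), None)
    return records.to_dict(orient="records")
```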
Why these changes are being introduced:
Getting into the task work, it became evident that it would be preferable to group all tasks related to a single table in a single file. This does not prevent importing tasks as dependencies from other files, but it groups like tasks.
How this addresses that need:
* Removes tasks/extract.py, tasks/transform.py, tasks/load.py
Side effects of this change:
* None
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
Why these changes are being introduced:
Tasks are needed for loading data into the QB table Employees. This is the first task in a new pipeline called FullUpdate.
How this addresses that need:
* Creates new pipeline task FullUpdate
* Creates ETL tasks for the Employees table
* Establishes pattern of an sql folder and table-specific task files
Side effects of this change:
* Data can theoretically be loaded into the Employees table if the FullUpdate task is run
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
"""Pipeline to perform a full update of all Quickbase tables.""" | ||
|
||
def requires(self) -> Iterator[luigi.Task]: # pragma: no cover | ||
from hrqb.tasks.employees import LoadEmployees |
Imports are performed inside the requires() method to avoid importing them at the top level. This keeps imports in this file limited to the HRQBPipelineTask tasks themselves.
def requires(self) -> Iterator[luigi.Task]:  # pragma: no cover
    from hrqb.tasks.employees import LoadEmployees

    yield LoadEmployees(pipeline=self.pipeline_name)
As more table ETL tasks are added, the final load task will get added here. I would estimate this will be 10-12 tasks long by the end.
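To make that expected growth concrete, here is a sketch of how requires() could look later. The base class shown (luigi.WrapperTask) is a stand-in for HRQBPipelineTask, and the commented-out task names are hypothetical placeholders, not real modules.

```python
from collections.abc import Iterator

import luigi


class FullUpdate(luigi.WrapperTask):  # stand-in for the real HRQBPipelineTask subclass
    pipeline_name = "FullUpdate"  # simplified; the real task may derive this differently

    def requires(self) -> Iterator[luigi.Task]:
        # imports stay inside requires(), per the comment above
        from hrqb.tasks.employees import LoadEmployees

        yield LoadEmployees(pipeline=self.pipeline_name)
        # as more tables are added, their final load tasks are yielded here, e.g.:
        # yield LoadJobs(pipeline=self.pipeline_name)      # hypothetical
        # yield LoadSalaries(pipeline=self.pipeline_name)  # hypothetical
```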
Why these changes are being introduced:
Unit testing will be somewhat minimal for the actual tasks, given that much logic is pushed down to the base tasks. But some low-hanging fruit is valuable, like whether the tasks have SQL files to read, whether explicit transformations are as expected, etc. This sets a minimal baseline to aim for in future tasks.
How this addresses that need:
* Adds tests/tasks directory with a one-to-one mapping to task files
* Adds preliminary unit tests for Employees table tasks
Side effects of this change:
* None
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-21
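As an illustration of that "low-hanging fruit" level of testing, something like the following pytest sketch; the SQL path and test placement are hypothetical and may not match the project's actual layout.

```python
from pathlib import Path

# hypothetical path following the "sql folder" pattern described above
EMPLOYEES_SQL = Path("hrqb/tasks/sql/employees.sql")


def test_employees_extract_sql_file_exists_and_is_non_empty():
    # cheap but useful: the extract task's SQL file is present and has content
    assert EMPLOYEES_SQL.exists()
    assert EMPLOYEES_SQL.read_text().strip()
```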
Looks great, a few notes
This is looking really great! Just have a few comments for you, @ghukill.
hrqb/utils/__init__.py
Outdated
def today_date() -> datetime.date:
    return datetime.datetime.now(tz=datetime.UTC).date()


def normalize_date(date: str | datetime.datetime) -> str | None:
    if date is None or date == "" or pd.isna(date):
Minor, but pd.isna also returns True for None.
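For reference, a quick demonstration of that behavior with plain pandas (nothing project-specific):

```python
import pandas as pd

# pd.isna treats None like NaN/NaT, so a separate `is None` check is redundant
assert pd.isna(None)
assert pd.isna(float("nan"))
assert pd.isna(pd.NaT)
assert not pd.isna("")  # the empty-string case still needs its own check
```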
Nice catch! Will remove that redundant is None clause.
hrqb/utils/__init__.py
Outdated
if isinstance(date, str):
    date = date_parser(date)
if isinstance(date, datetime.datetime):
    return date.strftime("%Y-%m-%d")
return None
Hmm, when viewing this code locally, Pylance indicates "Code is unreachable" for the final return None. Should there be some error handling for dates that cannot be parsed?
Perhaps rewrite to:
import logging

from dateutil.parser import ParserError  # assumes date_parser comes from dateutil


def normalize_date(date: str | datetime.datetime) -> str | None:
    if date is None or date == "" or pd.isna(date):
        return None
    if isinstance(date, str):
        try:
            date = date_parser(date)
        except ParserError:
            logging.warning(f"Unable to parse date from '{date}'")
    if isinstance(date, datetime.datetime):
        return date.strftime("%Y-%m-%d")
    return None
This utility function has been a headache, and @ehanson8 commented on it too. I think this is a good change. You (and Pylance) are right that if the string is not parsable, it would raise an exception, so we'd never hit that return None (because if it did parse, we'd return in the next block). Updating!
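For illustration, a few calls showing the behavior the updated function is expected to have, assuming date_parser is dateutil.parser.parse and normalize_date is the version suggested above; the inputs below are made-up examples.

```python
import datetime

from hrqb.utils import normalize_date  # assumes the updated function is exported here

assert normalize_date("2024-05-01") == "2024-05-01"
assert normalize_date("May 1, 2024") == "2024-05-01"
assert normalize_date(datetime.datetime(2024, 5, 1, 12, 30)) == "2024-05-01"
assert normalize_date("") is None
assert normalize_date("not a date") is None  # logs a warning, then returns None
```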
New commit to address comments, @jonavellecuerdo
Purpose and background context
This PR introduces the first tasks that will load data into Quickbase, specifically the Employees table. As a foundational table, it made sense to start with, and it's relatively simple with no dependencies.

As the first table to be loaded, it needed to be situated in a "pipeline" in the HRQBClient. While we may get into incremental updates over time, the first pipeline established is a full refresh/upsert called FullUpdate.

It's guaranteed that some columns will shift and new Employee fields will be added, but this establishes a baseline to work with while other tables follow shortly. The goal of this PR is to share the basic structure of ETL tasks for loading data into a single table; future ones will follow very similar patterns.

Testing is still minimal for the tasks themselves. The base tasks, which these extend, are well tested. A meaningful test at this level would be against the data itself, where mocking it begins to present a questionable return on investment for what the test provides. I had originally proposed to skip tests for this PR, to focus on the mechanics of a pipeline, and to revisit them in future PRs. Some preliminary tests have since been created, with the assumption that future tests may be added as other QB table tasks take shape and expose edge cases to test for.

How can a reviewer manually see the effects of these changes?
Still figuring out sensible access restrictions, so at this time direct testing is not possible. But can demonstrate some commands and output.
Get status of pipeline:

Output:
* FullUpdate task
* Employees table

Running the pipeline:

Output:
* Upsert results (e.g. unchanged: 751) visible in logs
* FullUpdate pipeline is considered complete; only a single task so far, but others will follow

Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)