Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HRQB 41 - Improve task scoping for runs and data cleanup #106

Conversation

ghukill
Copy link
Collaborator

@ghukill ghukill commented Jul 16, 2024

Purpose and background context

This PR reworks the CLI --task flag into two new ones --include and --exclude. Using a combination of these, it's now possible to:

  • load a pipeline, but begin from any task in the pipeline dependencies
  • exclude tasks from running or having their data cleared between runs

These changes dovetail with some improvements to how pipelines run, including moving the run_pipeline() function to a method on HRQBPipelineTask, where it should only ever run anyhow.

Taken altogether, the output of HRQBClient will remain identical. All changes are for local development convenience. When working on a particular task or family of tasks, it is much easier to isolate them for development.

Some areas of interest:

  • Instead of a default requires() method, HRQBPipelineTasks now expect a default_requires() method (example)
  • luigi.utils.run_pipeline() moved to method on HRQBPipelineTasks and includes setup + cleanup logic
    • setup: determine if include/exclude tasks actually exist, clear out memory of previous runs, raise exceptions if tasks are required to run but have been explicitly excluded
    • cleanup: remove data from run if flagged to do so, skipping excluding tasks

How can a reviewer manually see the effects of these changes?

Some manual testing is possible with these changes, as we can use test fixtures.

The following should show all tasks asINCOMPLETE:

pipenv run hrqb --verbose pipeline \
--pipeline-module=tests.fixtures.full_annotated_pipeline \
--pipeline=AlphaNumeric \
status

Here is the status and tasks of this pipeline:

├── INCOMPLETE: AlphaNumeric()
   ├── INCOMPLETE: CombineLettersAndNumbers(table_name=, pipeline=AlphaNumeric, stage=Transform)
      ├── INCOMPLETE: MultiplyNumbers(table_name=, pipeline=AlphaNumeric, stage=Transform)
         ├── INCOMPLETE: GenerateNumbers(table_name=, pipeline=AlphaNumeric, stage=Extract)
      ├── INCOMPLETE: GenerateLetters(table_name=, pipeline=AlphaNumeric, stage=Extract)

If not, run this command to fully clear data:

pipenv run hrqb --verbose pipeline \
--pipeline-module=tests.fixtures.full_annotated_pipeline \
--pipeline=AlphaNumeric \
remove-data

Now, we can fun the full pipeline:

pipenv run hrqb --verbose pipeline \
--pipeline-module=tests.fixtures.full_annotated_pipeline \
--pipeline=AlphaNumeric \
run

Imagine now that we need to work on the MultiplyNumbers task. By using --include-tasks and --exclude-tasks we can pinpoint this task, remove it's data each time, but keep data from other required tasks.

The following shows that --include limits scope to MultiplyNumbers and its required tasks:

pipenv run hrqb --verbose pipeline \
--pipeline-module=tests.fixtures.full_annotated_pipeline \
--pipeline=AlphaNumeric \
--include=MultiplyNumbers \
--exclude=GenerateNumbers,GenerateLetters \
status

Given the includes/excludes, we can remove all data while keeping data from desired tasks:

pipenv run hrqb --verbose pipeline \
--pipeline-module=tests.fixtures.full_annotated_pipeline \
--pipeline=AlphaNumeric \
--include=MultiplyNumbers \
--exclude=GenerateNumbers,GenerateLetters \
remove-data

In our development loop for making and testing changes to MultiplyNumbers, we can run this command repeatedly knowing that it will only run and cleanup data we expect:

pipenv run hrqb --verbose pipeline \
--pipeline-module=tests.fixtures.full_annotated_pipeline \
--pipeline=AlphaNumeric \
--include=MultiplyNumbers \
--exclude=GenerateNumbers,GenerateLetters \
run --cleanup

The output from this indicates as much, showing fully complete and successful, then data removed:

2024-07-16 09:52:26,824 INFO hrqb.cli.run() line 183: Pipeline run result: SUCCESS
2024-07-16 09:52:26,825 INFO hrqb.cli.run() line 184: 
├── COMPLETE: AlphaNumeric()
   ├── COMPLETE: MultiplyNumbers(table_name=, pipeline=AlphaNumeric, stage=Transform)
      ├── COMPLETE: GenerateNumbers(table_name=, pipeline=AlphaNumeric, stage=Extract)
2024-07-16 09:52:26,826 INFO hrqb.cli.remove_data() line 163: Successfully removed target data(s).
2024-07-16 09:52:26,826 INFO hrqb.cli.remove_data() line 164: Updated status after data cleanup:
2024-07-16 09:52:26,827 INFO hrqb.cli.remove_data() line 165: 
├── INCOMPLETE: AlphaNumeric()
   ├── INCOMPLETE: MultiplyNumbers(table_name=, pipeline=AlphaNumeric, stage=Transform)
      ├── COMPLETE: GenerateNumbers(table_name=, pipeline=AlphaNumeric, stage=Extract)

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

ghukill added 3 commits July 15, 2024 16:05
Why these changes are being introduced:

This change was motivated by a desire to more easily control what tasks are
included and excluded when running a pipeline during local development.

Formerly it was possible to simulate selecting a single task from a pipeline to run,
but it really didn't run the pipeline itself in that scenario.  This also made task
output removal finicky, where it was not clear if dealing with a single task or a pipeline.

By introducing more substantive task include and exclude parameters for a pipeline,
that affect both what tasks run and have their data optionally removed, it allows
for more fluid local development.

This shift also suggested some awkward edges that could be reworked, like a
pipeline having its own 'run_pipeline()' method, which allows for some setup
and cleanup.

How this addresses that need:
* utils.luigi.run_pipeline moved to HRQBPipelineTask as a method
* HRQBPipelineTask gets 'included_tasks' and 'excluded_tasks' parameters
  * both are lists of strings, and are checked on pipeline init that they exist
  * explicit includes override the yielded tasks for a pipeline
  * explicit excludes prevent tasks from running and having data cleaned up

Side effects of this change:
* None really.  All effects are for local development.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-41
@ghukill ghukill marked this pull request as ready for review July 16, 2024 13:53
@ghukill ghukill requested review from ehanson8 and jonavellecuerdo and removed request for ehanson8 July 16, 2024 13:53
@ghukill ghukill changed the base branch from main to HRQB-39-remove-sensitive-logging-sentry July 16, 2024 13:58
@ghukill
Copy link
Collaborator Author

ghukill commented Jul 16, 2024

Opting to close out this PR. After some good discussions with @ehanson8, came to the conclusion that the conveniences these CLI flags add for development work do not outweigh the complexity this adds to the task running framework.

In fact, it kind of entrenches a bad pattern: running lots of tasks you don't need to, and keep the outputs (targets) from those tasks on disk.

An approach that might be more inline with the sensitive data requirements of the project, and the ergonomics of working on a particular task could be:

  1. for local development, all data is deleted after each run by default
  2. identify a task to run and only that task will run; no dependencies
  3. if data is kept between runs, there is friction to do so, and consider even encryption

@ghukill ghukill closed this Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant