Tracking file provenance #3712

Open

wants to merge 68 commits into master

Changes from 51 commits

Commits (68)
30f54ca
initial version of dynamic file list classes
astro-friedel May 13, 2024
69d8f02
integrated dynamic file into output file handling
astro-friedel May 21, 2024
882e3ba
data flow kernel changes to accommodate dynamic file lists
astro-friedel Jun 7, 2024
ce369aa
Merge remote-tracking branch 'upstream/master' into fixing_dynamic_fi…
astro-friedel Jun 7, 2024
7138adc
Auto stash before checking out "HEAD"
astro-friedel Jun 7, 2024
5bff70f
creation of file table in the monitoring
astro-friedel Jun 7, 2024
6025691
added initial file provenance data in database
astro-friedel Jun 14, 2024
efc3b14
fixed error where uuid's were not strings
astro-friedel Jun 17, 2024
222166a
fixed typos in names
astro-friedel Jun 17, 2024
92597f6
initial working version
astro-friedel Jun 18, 2024
8b922d9
Merge branch 'fixing_dynamic_file_inputs_and_outputs' into trackingFi…
astro-friedel Jun 27, 2024
632890b
added flask-wtf to monitoring requirements for form processing
astro-friedel Jun 27, 2024
17e5c43
added file size and md5sum tracking for files
astro-friedel Jun 27, 2024
d8df5fe
fixed issue with clean_copy in dynamic files
astro-friedel Jun 27, 2024
b16cad6
added initial provenance interface to flask pages
astro-friedel Jun 27, 2024
0275b28
indentation fix
astro-friedel Jul 1, 2024
3a1238b
fixed database code for provenance tracking
astro-friedel Jul 1, 2024
bb013fe
added environment tracking to monitoring
astro-friedel Jul 9, 2024
bc8247a
Merge remote-tracking branch 'upstream/master' into trackingFileProve…
astro-friedel Jul 31, 2024
45af5f9
added file provenance tracking as an option to monitoring framework
astro-friedel Jul 31, 2024
cd99828
better reporting on environment
astro-friedel Jul 31, 2024
558d170
ensure that files are tagged with the task id that generated them, no…
astro-friedel Jul 31, 2024
05caec8
get the task reporting the environment correctly
astro-friedel Jul 31, 2024
8f212ba
only provide file link if files were actually used in the workflow
astro-friedel Jul 31, 2024
3ade95a
only provide file link if there were files
astro-friedel Jul 31, 2024
7501cc3
properly report environment with file details
astro-friedel Jul 31, 2024
66238e5
properly format and report files
astro-friedel Jul 31, 2024
00ffa6f
make header responsive to url
astro-friedel Jul 31, 2024
da73f91
fix bug in file size reporting
astro-friedel Jul 31, 2024
76b8008
documentation on file provenance
astro-friedel Jul 31, 2024
93b17b0
fix bug in format
astro-friedel Jul 31, 2024
1e004a6
get the correct timestamp for the file
astro-friedel Sep 17, 2024
8dde82c
remove unneeded prints
astro-friedel Sep 17, 2024
cb550ee
auto determine file size, md5sum, timestamp if possible
astro-friedel Sep 17, 2024
5ebd009
refactor variable
astro-friedel Sep 17, 2024
baf2332
make sure dfk is propagated from dynamic file list to children
astro-friedel Sep 17, 2024
79211bc
documentation and annotation cleanup
astro-friedel Sep 17, 2024
825842f
cleanup
astro-friedel Sep 17, 2024
117e66d
Merge remote-tracking branch 'upstream/master' into trackingFileProve…
astro-friedel Sep 17, 2024
8c9a2a0
backed out DynamicFile stuff so that this branch is pure file tracking
astro-friedel Nov 12, 2024
eff8ab6
Merge branch 'master' into trackingFileProvenance
astro-friedel Nov 12, 2024
5ca48cf
Merge branch 'master' into trackingFileProvenance
astro-friedel Nov 27, 2024
9a05b2c
reorganized to group similar codes together
astro-friedel Nov 27, 2024
14aac2b
fixed message format
astro-friedel Nov 27, 2024
585fd03
fixed some typos
astro-friedel Nov 27, 2024
19f7747
updates to include misc info table
astro-friedel Nov 27, 2024
27f6391
updated docs
astro-friedel Nov 27, 2024
97ade30
fixed bug for remote files
astro-friedel Nov 27, 2024
33be080
test for provenance framework
astro-friedel Nov 27, 2024
07c2e45
flake8 fixes
astro-friedel Nov 27, 2024
97108e1
fixed missing line in docs
astro-friedel Nov 27, 2024
a837f08
removed extraneous ignores
astro-friedel Dec 3, 2024
6bef04f
reverted removal of trailing white spaces
astro-friedel Dec 3, 2024
5057d19
fixes per review comments
astro-friedel Dec 3, 2024
89d5e0a
ensure that md5sum is only calculated when file provenance tracking i…
astro-friedel Dec 3, 2024
c653cbc
fixes based on review comments
astro-friedel Dec 3, 2024
7efebad
added dfk as a required parameter to DataFuture
astro-friedel Dec 3, 2024
d6e7e5b
make sure file md5sum is only calculated
astro-friedel Dec 3, 2024
1fcdbc6
added full path and parsing for path for file database entries
astro-friedel Dec 3, 2024
b443cbb
fixed typos and tests
astro-friedel Dec 3, 2024
69cfc7b
put back required SECRET_KEY so that the file search form works
astro-friedel Dec 3, 2024
0316cf9
isort fixes
astro-friedel Dec 3, 2024
af51f0e
Merge branch 'Parsl:master' into trackingFileProvenance
astro-friedel Dec 3, 2024
9ed699d
removed unneeded import
astro-friedel Dec 3, 2024
ce609cc
mypy fixes
astro-friedel Dec 3, 2024
d646aaa
Merge remote-tracking branch 'upstream/master'
astro-friedel Dec 10, 2024
53f323d
fixed incorrect variable name
astro-friedel Dec 10, 2024
9444f42
Merge branch 'master' into trackingFileProvenance
astro-friedel Dec 10, 2024
10 changes: 10 additions & 0 deletions .gitignore
@@ -121,3 +121,13 @@ ENV/

# emacs buffers
\#*

runinfo*
parsl/tests/.pytest*

# documentation generation
docs/stubs/*
docs/1-parsl-introduction.ipynb

/tmp
parsl/data_provider/dyn.new.py
Binary file added docs/images/mon_env_detail.png
Binary file added docs/images/mon_file_detail.png
Binary file added docs/images/mon_file_provenance.png
Binary file added docs/images/mon_task_detail.png
Binary file added docs/images/mon_workflow_files.png
Binary file modified docs/images/mon_workflows_page.png
95 changes: 48 additions & 47 deletions docs/userguide/data.rst
@@ -3,7 +3,7 @@
Passing Python objects
======================

Parsl apps can communicate via standard Python function parameter passing
and return statements. The following example shows how a Python string
can be passed to, and returned from, a Parsl app.

@@ -12,31 +12,31 @@
@python_app
def example(name):
    return 'hello {0}'.format(name)

r = example('bob')
print(r.result())

Parsl uses the dill and pickle libraries to serialize Python objects
into a sequence of bytes that can be passed over a network from the submitting
machine to executing workers.

Thus, Parsl apps can receive and return standard Python data types
such as booleans, integers, tuples, lists, and dictionaries. However, not
all objects can be serialized with these methods (e.g., closures, generators,
and system objects), and so those objects cannot be used with all executors.

Parsl will raise a `SerializationError` if it encounters an object that it cannot
serialize. This applies to objects passed as arguments to an app, as well as objects
returned from an app. See :ref:`label_serialization_error`.
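
As a brief illustration, a minimal sketch of an app whose result cannot be serialized (the app and values are illustrative):

.. code-block:: python

    @python_app
    def bad_example():
        # generators cannot be serialized with dill/pickle,
        # so returning one fails when the result is sent back
        return (i * i for i in range(10))

    # bad_example().result() is expected to fail with a
    # serialization-related error on most remote executors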


Staging data files
==================

Parsl apps can take and return data files. A file may be passed as an input
argument to an app, or returned from an app after execution. Parsl
provides support to automatically transfer (stage) files between
the main Parsl program, worker nodes, and external data storage systems.

Input files can be passed as regular arguments, or a list of them may be
specified in the special ``inputs`` keyword argument to an app invocation.
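
For example, a minimal sketch (the app body and file path are illustrative):

.. code-block:: python

    from parsl import bash_app
    from parsl.data_provider.files import File

    @bash_app
    def cat(inputs=()):
        # inputs[0] is a Parsl File; .filepath gives its local path
        return 'cat {}'.format(inputs[0].filepath)

    cat(inputs=[File('/tmp/example.txt')]).result()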
@@ -69,13 +69,13 @@ interface.
Parsl files
-----------

Parsl uses a custom :py:class:`~parsl.data_provider.files.File` to provide a
location-independent way of referencing and accessing files.
Parsl files are defined by specifying the URL *scheme* and a path to the file.
Thus a file may represent an absolute path on the submit-side file system
or a URL to an external file.

The scheme defines the protocol via which the file may be accessed.
Parsl supports the following schemes: file, ftp, http, https, and globus.
If no scheme is specified Parsl will default to the file scheme.

@@ -89,8 +89,8 @@
README file.
File('https://github.com/Parsl/parsl/blob/master/README.rst')


Parsl automatically translates the file's location relative to the
environment in which it is accessed (e.g., the Parsl program or an app).
The following example shows how a file can be accessed in the app
irrespective of where that app executes.
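
A minimal sketch of such an app (the file name is illustrative):

.. code-block:: python

    from parsl import python_app
    from parsl.data_provider.files import File

    @python_app
    def read_file(inputs=()):
        # .filepath resolves to wherever the file is available
        # on the resource that executes the app
        with open(inputs[0].filepath) as f:
            return f.read()

    r = read_file(inputs=[File('/tmp/summary.txt')])
    print(r.result())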

@@ -113,22 +113,23 @@
As described below, the method by which these files are transferred
depends on the scheme and the staging providers specified in the Parsl
configuration.


Staging providers
-----------------

Parsl is able to transparently stage files between at-rest locations and
execution locations by specifying a list of
:py:class:`~parsl.data_provider.staging.Staging` instances for an executor.
These staging instances define how to transfer files in and out of an execution
location. This list should be supplied as the ``storage_access``
parameter to an executor when it is constructed.

Parsl includes several staging providers for moving files using the
schemes defined above. By default, Parsl executors are created with
three common staging providers:
the NoOpFileStaging provider for local and shared file systems
and the HTTP(S) and FTP staging providers for transferring
files to and from remote storage locations. The following
example shows how to explicitly set the default staging providers.

.. code-block:: python
@@ -146,12 +147,12 @@
)
]
)
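
The body of that example is collapsed in this view; as a sketch, using the provider classes named above, it looks roughly like this:

.. code-block:: python

    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.data_provider.file_noop import NoOpFileStaging
    from parsl.data_provider.http import HTTPSeparateTaskStaging
    from parsl.data_provider.ftp import FTPSeparateTaskStaging

    config = Config(
        executors=[
            HighThroughputExecutor(
                # equivalent to the default set of staging providers
                storage_access=[NoOpFileStaging(),
                                HTTPSeparateTaskStaging(),
                                FTPSeparateTaskStaging()],
            )
        ]
    )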

Parsl further differentiates when staging occurs relative to
the app invocation that requires or produces files.
Staging either occurs with the executing task (*in-task staging*)
or as a separate task (*separate task staging*) before app execution.
In-task staging
uses a wrapper that is executed around the Parsl task and thus
occurs on the resource on which the task is executed. Separate
@@ -167,9 +168,9 @@ NoOpFileStaging for Local/Shared File Systems
The NoOpFileStaging provider assumes that files specified either
with a path or with the ``file`` URL scheme are available both
on the submit and execution side. This occurs, for example, when there is a
shared file system. In this case, files will not be moved, and the
File object simply presents the same file path to the Parsl program
and any executing tasks.

Files defined as follows will be handled by the NoOpFileStaging provider.
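
As a sketch (paths are illustrative):

.. code-block:: python

    from parsl.data_provider.files import File

    # a plain path and an explicit file:// URL are equivalent here
    f1 = File('/home/user/data/input.txt')
    f2 = File('file:///home/user/data/input.txt')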

@@ -207,14 +208,14 @@
will be executed as a separate
Parsl task that will complete before the corresponding app
executes. These providers cannot be used to stage out output files.

The following example defines a file accessible on a remote FTP server.

.. code-block:: python

File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt')

When such a file object is passed as an input to an app, Parsl will download the file to whatever location is selected for the app to execute.
The following example illustrates how the remote file is implicitly downloaded from an FTP server and then converted. Note that the app does not need to know the location of the downloaded file on the remote computer, as Parsl abstracts this translation.

.. code-block:: python

@@ -234,17 +235,17 @@
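# the start of this example is collapsed above; a hedged sketch of it:
from parsl import python_app
from parsl.data_provider.files import File

@python_app
def convert(inputs=(), outputs=()):
    with open(inputs[0].filepath) as i, open(outputs[0].filepath, 'w') as o:
        o.write(i.read().upper())

inp = File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt')
out = File('/tmp/ARIN-STATS-FORMAT-CHANGE.txt.upper')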
# call the convert app with the Parsl file
f = convert(inputs=[inp], outputs=[out])
f.result()

HTTP and FTP separate task staging providers can be configured as follows.

.. code-block:: python

from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.data_provider.http import HTTPSeparateTaskStaging
from parsl.data_provider.ftp import FTPSeparateTaskStaging

config = Config(
    executors=[
        HighThroughputExecutor(
            storage_access=[HTTPSeparateTaskStaging(), FTPSeparateTaskStaging()]
@@ -263,10 +264,10 @@
task staging providers described above, but will do so in a wrapper around
individual app invocations, which guarantees that they will stage files to
a file system visible to the app.

A downside of this staging approach is that the staging tasks are less visible
to Parsl, as they are not performed as separate Parsl tasks.

In-task staging providers can be configured as follows.

.. code-block:: python
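
# a hedged sketch of the collapsed configuration; HTTPInTaskStaging and
# FTPInTaskStaging are the in-task counterparts of the providers above
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.data_provider.http import HTTPInTaskStaging
from parsl.data_provider.ftp import FTPInTaskStaging

config = Config(
    executors=[
        HighThroughputExecutor(
            storage_access=[HTTPInTaskStaging(), FTPInTaskStaging()]
        )
    ]
)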

@@ -345,16 +346,16 @@ In some cases, for example when using a Globus `shared endpoint <https://www.glo
)
]
)


Globus Authorization
""""""""""""""""""""

In order to transfer files with Globus, the user must first authenticate.
The first time that Globus is used with Parsl on a computer, the program
will prompt the user to follow an authentication and authorization
procedure involving a web browser. Users can authorize out of band by
running the parsl-globus-auth utility. This is useful, for example,
when running a Parsl program in a batch system where it will be unattended.

.. code-block:: bash
@@ -370,7 +371,7 @@ rsync

The ``rsync`` utility can be used to transfer files in the ``file`` scheme in configurations where
workers cannot access the submit-side file system directly, such as when executing
on an AWS EC2 instance or on a cluster without a shared file system.
However, the submit-side file system must be exposed using rsync.

rsync Configuration
@@ -399,13 +400,13 @@ and public IP address of the submitting system.
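
The configuration example is collapsed above; as a sketch (the user and host names are placeholders):

.. code-block:: python

    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.data_provider.rsync import RSyncStaging

    config = Config(
        executors=[
            HighThroughputExecutor(
                # "user@host" must name the username and public
                # address of the submitting system
                storage_access=[RSyncStaging("user@submit.example.org")],
            )
        ]
    )
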
rsync Authorization
"""""""""""""""""""

The rsync staging provider delegates all authentication and authorization to the
underlying ``rsync`` command. This command must be correctly authorized to connect back to
the submit-side system. The form of this authorization will depend on the systems in
question.

The following example installs an ssh key from the submit-side file system and turns off host key
checking, in the ``worker_init`` initialization of an EC2 instance. The ssh key must have
sufficient privileges to run ``rsync`` over ssh on the submit-side system.

.. code-block:: python
92 changes: 88 additions & 4 deletions docs/userguide/monitoring.rst
@@ -15,7 +15,7 @@ SQLite tools.
Monitoring configuration
------------------------

Parsl monitoring is only supported with the `parsl.executors.HighThroughputExecutor`.

The following example shows how to enable monitoring in the Parsl
configuration. Here the `parsl.monitoring.MonitoringHub` is specified to use port
@@ -50,6 +50,58 @@
)


File Provenance
---------------

The monitoring system can also be used to track file provenance. File provenance is defined as the history of a file, including:

* When the file was created
* File size in bytes
* File md5sum
* What task created the file
* What task(s) used the file
* What inputs were given to the task that created the file
* What environment was used (e.g., the ``worker_init`` entry from a :py:class:`~parsl.providers.ExecutionProvider`); this is not available with every provider.

The purpose of file provenance tracking is to provide a mechanism by which the user can see exactly how a file was created and used in a workflow. This can be useful for debugging, understanding the workflow, ensuring that the workflow is reproducible, and reviewing past work. The file provenance information is stored in the monitoring database and can be accessed using the ``parsl-visualize`` tool. To enable file provenance tracking, set the ``capture_file_provenance`` flag to ``True`` in the `parsl.monitoring.MonitoringHub` configuration.

This functionality also enables you to log informational messages from your scripts, to capture anything not automatically gathered. The main change to your code is to assign the return value of ``parsl.load`` to a variable, then use the ``log_info`` function to log messages in the database. Note that this feature is only available in the main script, not inside apps, unless you pass the variable (``my_cfg`` in the example below) as an argument to the app. The following example shows how to use this feature.

.. code-block:: python

import parsl
from parsl.monitoring.monitoring import MonitoringHub
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.addresses import address_by_hostname

import logging

config = Config(
    executors=[
        HighThroughputExecutor(
            label="local_htex",
            cores_per_worker=1,
            max_workers_per_node=4,
            address=address_by_hostname(),
        )
    ],
    monitoring=MonitoringHub(
        hub_address=address_by_hostname(),
        hub_port=55055,
        monitoring_debug=False,
        resource_monitoring_interval=10,
        capture_file_provenance=True,
    ),
    strategy='none'
)

my_cfg = parsl.load(config)

my_cfg.log_info("This is an informational message")

Known limitations: The file provenance feature captures the creation of files and the use of files in an app, but currently does not capture the modification of files it already knows about.

Visualization
-------------

@@ -75,7 +127,7 @@
By default, the visualization web server listens on ``127.0.0.1:8080``. If the w
$ ssh -L 50000:127.0.0.1:8080 username@cluster_address

This command will bind your local machine's port 50000 to the remote cluster's port 8080.
The dashboard can then be accessed via the local machine's browser at ``127.0.0.1:50000``.

.. warning:: Alternatively you can deploy the visualization server on a public interface. However, first check that this is allowed by the cluster's security policy. The following example shows how to deploy the web server on a public port (i.e., open to Internet via ``public_IP:55555``)::

@@ -99,12 +151,12 @@ Workflow Summary

The workflow summary page captures the run level details of a workflow, including start and end times
as well as task summary statistics. The workflow summary section is followed by the *App Summary* that lists
the various apps and invocation count for each.

.. image:: ../images/mon_workflow_summary.png


The workflow summary also presents three or four different views of the workflow (the number depends on whether file provenance is enabled and files were used in the workflow):

* Workflow DAG - with apps differentiated by colors: This visualization is useful to visually inspect the dependency
structure of the workflow. Hovering over the nodes in the DAG shows a tooltip for the app represented by the node and its task ID.
@@ -120,3 +172,35 @@

.. image:: ../images/mon_resource_summary.png

* Workflow file provenance (only if enabled and files were used in the workflow): This visualization gives a tabular listing of each task that created (output) or used (input) a file. Each listed file has a link to a page detailing the file's information.

.. image:: ../images/mon_workflow_files.png

File Provenance
^^^^^^^^^^^^^^^

The file provenance page provides an interface for searching for files and viewing their provenance. The % wildcard can be used in the search bar to match any number of characters. Any results are listed in a table below the search bar. Clicking on a file in the table will take you to the file's detail page.

.. image:: ../images/mon_file_provenance.png

File Details
^^^^^^^^^^^^

The file details page provides information about a specific file, including the file's name, size, md5sum, and the tasks that created and used the file. Clicking on any of the tasks will take you to their respective details page. If the file was created by a task there will be an entry for the Environment used by that task. Clicking that link will take you to the Environment Details page.

.. image:: ../images/mon_file_detail.png


Task Details
^^^^^^^^^^^^

The task details page provides information about a specific instantiation of a task. This information includes task dependencies, executor (environment), input and output files, and task arguments.

.. image:: ../images/mon_task_detail.png

Environment Details
^^^^^^^^^^^^^^^^^^^

The environment details page provides information on the compute environment in which a task was run, including the provider and launcher used and the ``worker_init`` string that was used.

.. image:: ../images/mon_env_detail.png