Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pooch retrieval of file registry #260

Merged
merged 3 commits into from
Dec 17, 2024

Conversation

sfmig
Copy link
Collaborator

@sfmig sfmig commented Dec 12, 2024

Description

What is this PR?

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
The file registry file is not overwritten everytime we run tests as we intend.

Right now the files-registry.txt file is not always downloaded fresh from GIN.

The current comment next to the pooch.retrieve call incorrectly states that if known_hash=None the file is always downloaded. But in fact when known_hash = None and the file already exists, the file is NOT downloaded. See the chain of events below:

If known_hash = None and the file already exists:
- > hash_matches returns TRUE (see here)
- >download_action returns "fetch, Fetching" (see here)
-> pooch.retrieve does nothing (see here)

What does this PR do?
To force the download of this file everytime, we need to ensure that the file does not exist before downloading it from GIN. So this PR deletes any files with the expected name at the expected location.

Additionally,

  • I specified a filename in pooch.retrieve for the file. Otherwise If filename is None, it is set as <hash-of-the-url> + <last-part-of-url>.
  • I added wheel as an additional dependency to the check-manifest precommit, so that it runs correctly in an environment with Python 3.12.

MWE to reproduce

I assume initially (Path.home() / ".crabs-exploration-test-data" / "files-registry.txt").is_file() is False.

In a conda environment with pooch, and ipython (for convenience), start an interactive Python session and run the snippet below.

conda create -n mwe-env python=3.12 pooch ipython -y
conda activate mwe-env
ipython
Snippet 1
from pathlib import Path
import pooch

GIN_TEST_DATA_REPO = (
    "https://gin.g-node.org/SainsburyWellcomeCentre/crabs-exploration-test-data"
)

# Cache the test data in the user's home directory
test_data_dir = Path.home() / ".crabs-exploration-test-data"

file_registry_path = test_data_dir / "files-registry.txt"

# Initialise pooch registry
registry = pooch.create(
    test_data_dir,
    base_url=f"{GIN_TEST_DATA_REPO}/raw/master/test_data",
)

# Download only the registry file from GIN
# if known_hash = None, the file is always downloaded. ---> Not True
# TODO: change in crabs too!
file_registry = pooch.retrieve(
    url=f"{GIN_TEST_DATA_REPO}/raw/master/files-registry.txt",
    known_hash=None,
    fname=file_registry_path.name,
    path=file_registry_path.parent,
)

# Load registry file onto pooch registry
registry.load_registry(file_registry)

When we run Snippet 1 several times, the file is not re-downloaded. We can verify this because we get the following message only the first time we run the snippet, but not any subsequent ones:

Downloading data from 'https://gin.g-node.org/SainsburyWellcomeCentre/crabs-exploration-test-data/raw/master/files-registry.txt' to file '/home/sminano/.crabs-exploration-test-data/files-registry.txt'.
SHA256 hash of downloaded file: 34801dd50ddfc5ebf67bb5a205b558197178f999cde325462a5881b8f1fdcc31
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.

Instead if we add a few lines to remove the "files-registry.txt" file before fetching it from GIN, we see the expected download message everytime.

Snippet 2
from pathlib import Path
import pooch

GIN_TEST_DATA_REPO = (
    "https://gin.g-node.org/SainsburyWellcomeCentre/crabs-exploration-test-data"
)

# Cache the test data in the user's home directory
test_data_dir = Path.home() / ".crabs-exploration-test-data"

# Remove the file registry if it exists
file_registry_path = test_data_dir / "files-registry.txt"
if file_registry_path.is_file():
    Path(file_registry_path).unlink()

# Initialise pooch registry
registry = pooch.create(
    test_data_dir,
    base_url=f"{GIN_TEST_DATA_REPO}/raw/master/test_data",
)

# Download only the registry file from GIN
# if known_hash = None, the file is always downloaded. ---> Not True
# TODO: change in crabs too!
file_registry = pooch.retrieve(
    url=f"{GIN_TEST_DATA_REPO}/raw/master/files-registry.txt",
    known_hash=None,
    fname=file_registry_path.name,
    path=file_registry_path.parent,
)

# Load registry file onto pooch registry
registry.load_registry(file_registry)

References

\

How has this PR been tested?

Tests pass locally and in CI.

Does this PR require an update to the documentation?

\

Checklist:

  • The code has been tested locally
  • [ n/a ] Tests have been added to cover all new functionality
  • [ n/a ] The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

We need to specify a filename in `pooch.retrieve` for the file to be correctly overwritten everytime.

If filename is None, it is set as <hash-of-the-url> + <last-part-of-url>. So if the URL stays the same, the file won't be updated, even if the contents are changed

Signed-off-by: sfmig <33267254+sfmig@users.noreply.github.com>
@codecov-commenter
Copy link

codecov-commenter commented Dec 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.18%. Comparing base (3532188) to head (f3740b7).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #260   +/-   ##
=======================================
  Coverage   48.18%   48.18%           
=======================================
  Files          24       24           
  Lines        1567     1567           
=======================================
  Hits          755      755           
  Misses        812      812           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sfmig sfmig requested a review from niksirbi December 12, 2024 14:14
Copy link
Member

@niksirbi niksirbi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job pinning this down @sfmig. I think your MWE proves the point quite convincingly (I reproduced it locally). So this PR is ready to merge!

By the way, I looked into why this is not a problem for movement. I think we are "saved" by the fact the the download first happens to a temporary local file, which then gets renamed to the "permanent" one. So there is no temporary file left at the end of it all, and the next download is happy to fetch it again.

@sfmig sfmig merged commit 42a7eb6 into main Dec 17, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants