Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run config diff check now considers number of images #646

Draft
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

marxide
Copy link
Contributor

@marxide marxide commented Mar 16, 2022

New input files are now detected when the config uses glob expressions. Fixes #645.

New input files are now detected when the config uses glob expressions. Fixes #645.
@marxide marxide added bug Something isn't working python Pull requests that update Python code labels Mar 16, 2022
@marxide marxide self-assigned this Mar 16, 2022
@ajstewart
Copy link
Contributor

Something I thought of on seeing this and #645, does a missing image impact negatively on the run at all? I can't remember if the images.parquet is added to or it is constructed again using the images input - assuming that everything is there from the previous one.

I kind of feel like there should be a refusal to run if a previous image that has already been processed in the run is not found. Given that this issue stems from the glob command, the user may not be aware that they have mucked up their command, or something has gone wrong with the file that they expected to be there with glob. As my concern is also with the new source and forced extractions that go back and use the images through the run - this could fail or not be correct if an image is unintentionally missing. Again I can't quite remember how the images dataframe is constructed again in add image mode, but I wouldn't be surprised if some part of it relies on the input from the 'add mode config file'.

@marxide
Copy link
Contributor Author

marxide commented Mar 17, 2022

I kind of feel like there should be a refusal to run if a previous image that has already been processed in the run is not found.

I agree. I'm also not yet sure what the consequences of missing inputs are during a re-run but regardless I think it would be better if the pipeline raised an error.

I'm also now thinking about the restore run functionality and how that would work with this. Suppose a run using globs is completed, then new images are added to the filesystem without changing the run config, and the run is re-run and fails. A check is run during the restorepiperun command that compares the previous config inputs with the images.parquet.bak file:

# check images match
img_f_list = prev_config["inputs"]["image"]
if isinstance(img_f_list, dict):
img_f_list = [
item for sublist in img_f_list.values() for item in sublist
]
img_f_list = [os.path.basename(i) for i in img_f_list]
prev_images = pd.read_parquet(
bak_files['images'], columns=['id', 'name', 'measurements_path']
)
if sorted(prev_images['name'].tolist()) != sorted(img_f_list):
raise CommandError(
'Images in previous config file does not'
' match those found in the previous images.parquet.bak.'
' Cannot restore pipeline run.'
)

I think this check would fail. The previous config is the same as the new config (it contains the same globs) but the parsed file list will be different since the filesystem changed. Perhaps checking that the parsed file list is a superset of the inputs given in images.parquet.bak would work?

@ajstewart
Copy link
Contributor

I think this check would fail. The previous config is the same as the new config (it contains the same globs) but the parsed file list will be different since the filesystem changed. Perhaps checking that the parsed file list is a superset of the inputs given in images.parquet.bak would work?

Uff yes I think you're right, the glob input is really not handled very well in these modes is it 😬, my bad!

I think each of these methods needs an explicit image check that resolves any globs and yes, that the set intersection shows that all images are contained in the glob in the configs/images for the respective mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New input files are not detected when run config uses globs
2 participants