
Add dust emissions #140

Open · znichollscr opened this issue Oct 22, 2024 · 27 comments
Labels: enhancement (New feature or request)

Comments

@znichollscr (Collaborator) commented Oct 22, 2024

@jfkok has done some great work pulling together dust emissions. This issue is for tracking their inclusion in input4MIPs.

znichollscr added the enhancement label Oct 22, 2024
@znichollscr (Collaborator Author)

An initial draft of the files is here, which includes a nice briefing on how to use the data.

My initial comments on this are:

  • the way the files are currently written, the scenario (historical, future increase, future decrease, future constant) and the region (global or regional) are both blended into the variable name. That isn’t how most of the data is handled. The more common pattern for this (where the variable is the same every time if I am not mistaken) would be to use the same variable name for every file (e.g. dustscalefactor) and then the region of application is identified by the grid label. It’s a bit weird, but the current practice is to identify the scenario by the source ID (although I think we could do something clearer than this, full discussion is here: Altering the DRS #64).
  • is there a reason that future starts in 2001? That will be different to how ScenarioMIP will be handling the split, but if this is for a custom experiment, it may not matter (but it will matter if we want these dust emissions included in the DECK's historical experiment)
  • the data does present a bit of a challenge for input4MIPs (the same comments apply to Stephanie’s data and it would be great to address them for CMIP7 (in CMIP6 there unfortunately wasn’t time)). The data has this idea of regions to which the data applies. The CF conventions do have a standard for this (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#geographic-regions), which is probably what we would want to use. I think that would then all just work, but it might require a bit of a tweak to how the data is written (and the grid label for regional stuff would probably then become “gr”, see the full list of options here: https://github.com/PCMDI/mip-cmor-tables/blob/main/MIP_grid_label.json). There's a minimal sketch of the region idea after this list.
  • Just to check my understanding: the variable is the same every time right? It’s a scaling factor and the idea is that models use this to scale their internally calculated dust to get closer to the change in dust over time?
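
To make that CF regions idea a bit more concrete, here's a minimal netCDF4 sketch of the kind of label variable that section of the conventions describes. A sketch only: the dimension size is arbitrary and the region names are placeholders (the real names should come from the CF standardized region list).

import netCDF4
import numpy as np

ds = netCDF4.Dataset("region-label-example.nc", "w")
ds.createDimension("time", None)
ds.createDimension("region", 2)

# Per the CF geographic-regions convention, a string-valued label variable
# with standard_name "region" identifies the area each value applies to
region = ds.createVariable("region", str, ("region",))
region.standard_name = "region"
region[:] = np.array(["africa", "australia"], dtype=object)

# The scale factor then just gains a "region" dimension
dustscalefactor = ds.createVariable("dustscalefactor", "f8", ("time", "region"))
dustscalefactor.units = "1"

ds.close()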

@znichollscr (Collaborator Author)

@jfkok making sure you get a notification and can find this

@jfkok commented Oct 24, 2024

Hi Zeb, thanks for your very helpful input. Responses follow below:

* the way the files are currently written, the scenario (historical, future increase, future decrease, future constant) and the region (global or regional) are both blended into the variable name. That isn’t how most of the data is handled. The more common pattern for this (where the variable is the same every time if I am not mistaken) would be to use the same variable name for every file (e.g. dustscalefactor) and then the region of application is identified by the grid label. It’s a bit weird, but the current practice is to identify the scenario by the source ID (although I think we could do something clearer than this, full discussion is here: [Altering the DRS #64](https://github.com/PCMDI/input4MIPs_CVs/discussions/64)).

Ah okay, that makes sense. So I'd need to create one file for the global scaling factor (currently in this file) and seven additional files for the regional scaling factors for the seven regions (currently in this file), is that correct? And then do I still keep the historical and the (three) future scaling factors in separate files? So for 4x8 = 32 files total?

* is there a reason that future starts in 2001? That will be different to how ScenarioMIP will be handling the split, but if this is for a custom experiment, it may not matter (but it will matter if we want these dust emissions included in the DECK’s historical experiment)

Yes, the reason is that the observational reconstruction ends in the year 2000 because it is based on sedimentary records of dust deposition (like ice cores), so there's much less data for the last two decades. However, my group is working on using satellite data to extend the reconstruction to 2023 and I expect that to be ready sometime next year.

* the data does present a bit of a challenge for input4MIPs (the same comments apply to Stephanie’s data and it would be great to address them for CMIP7 (in CMIP6 there unfortunately wasn’t time)). The data has this idea of regions to which the data applies. The CF conventions do have a standard for this (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#geographic-regions), which is probably what we would want to use. I think that would then all just work, but it might require a bit of a tweak to how the data is written (and the grid label for regional stuff would probably then become “gr”, see the full list of options here: https://github.com/PCMDI/mip-cmor-tables/blob/main/MIP_grid_label.json).

Thanks for pointing this out. The regions do have well-defined coordinates, though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

* Just to check my understanding: the variable is the same every time right? It’s a scaling factor and the idea is that models use this to scale their internally calculated dust to get closer to the change in dust over time?

Yes, that's exactly right. It's only the year and the region of application (global versus one of seven major dust aerosol source regions) that changes.

Thanks so much!

Jasper

@znichollscr (Collaborator Author)

And then do I still keep the historical and the (three) future scaling factors in separate files? So for 4x8 = 32 files total?

Sorry, bad explanation from me. I would suggest one file for global and one file for regional (you can put all seven regions in one file, just use the 'region' dimension, or whatever it is that the CF conventions call it, to differentiate them). Then yes, one file for historical and one file for each scenario. So you'll end up with 4 x 2 = 8 files.

Just to try and be a bit clearer:

  • in the global file, there should be one variable. Its only dimension should be "time". The grid label should be "gm" (for global mean).
  • in the regional file, there should be one variable. It should be two-dimensional, ("time", "region") (or whatever the CF-convention name for 'region' is). I would probably make the grid label "gn" (this is your native grid). Then there'll probably be other auxiliary coordinates in this file to help with defining the boundaries of each region, region names, etc. There's a rough sketch of this after the list.
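
If it helps, here's a rough xarray sketch of those two shapes. A sketch only: the names and values are made up, it's the dimensions that matter.

import numpy as np
import xarray as xr

years = np.arange(1850, 2001)
regions = ["region_a", "region_b"]  # placeholders for the seven source regions

# Global file: one variable, dimension "time" only (grid label "gm")
ds_global = xr.Dataset(
    {"dustscalefactor": (("time",), np.ones(years.size), {"units": "1"})},
    coords={"time": years},
)

# Regional file: same variable, dimensions ("time", "region") (grid label "gn"),
# with a string label coordinate identifying each region
ds_regional = xr.Dataset(
    {
        "dustscalefactor": (
            ("time", "region"),
            np.ones((years.size, len(regions))),
            {"units": "1"},
        )
    },
    coords={
        "time": years,
        "region": ("region", regions, {"standard_name": "region"}),
    },
)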

Got it re the historical vs. scenario split. That's fine. If we want these files to be used for DECK simulations, we'll have to do a bit of thinking. If they're just for a specific MIP experiment, they can stay as they are.

The regions do have well-defined coordinates, though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

Ah ok nice. Defining those boundary files would definitely be sufficient. (If it were me, I would just give the regions names first, make sure I can write the files, then go back and add the boundary definition second, because that boundary definition could be fiddly, but you may be able to skip straight to writing the boundary conditions!)

@jfkok commented Oct 24, 2024

Thanks, that's helpful. I'm working on implementing these changes and obtaining corrected files. In doing so, I realized that making the variable name the same for all files also means that the file names are identical for the four different global files (1 historical and 3 future scenarios) and for the four different regional files. So how would I distinguish the files if I can't put the scenario in the variable name?

The regions do have well-defined coordinates, though (the link you included mentioned "complex boundaries that cannot practically be specified using longitude and latitude boundary coordinates"). Would defining those boundaries in each file be sufficient, or do I need to do something else?

Ah ok nice. Defining those boundary files would definitely be sufficient. (If it were me, I would just give the regions names first, make sure I can write the files, then go back and add the boundary definition second, because that boundary definition could be fiddly, but you may be able to skip straight to writing the boundary conditions!)

I'm not sure I quite understand what you mean by "boundary files". Would it be sufficient to specify the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable? Or should I do something different?

Thanks!

Jasper

@znichollscr (Collaborator Author)

So how would I distinguish the files if I can't put the scenario in the variable name?

Excellent question. The answer is that, at the moment, you put it in the "source_id" bit of the file name. So, for example, your filenames would become the following (where I've also dropped off the noleap prefix that isn't part of the DRS; there's a sketch of the pattern after the list):

  • historical: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gm_185001-200012.nc
  • constant scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc
  • decreasing scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-decreasing_gm_200101-210012.nc
  • increasing scenario: dustscaling_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-increasing_gm_200101-210012.nc
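
In other words, the pattern being filled in is roughly the following (my paraphrase, not an official DRS template):

# Rough paraphrase of the filename pattern above; the field names are my
# labels, not official DRS terminology
variable_id = "dustscaling"
target_mip = "AerChemMIP"
grid_label = "gm"

for source_id, time_range in [
    ("UCLA-1-0-1", "185001-200012"),           # historical
    ("UCLA-1-0-1-constant", "200101-210012"),  # constant scenario
]:
    print(
        f"{variable_id}_input4MIPs_emissions_{target_mip}"
        f"_{source_id}_{grid_label}_{time_range}.nc"
    )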

As you can tell, this doesn't make that much sense and is easy to miss, which is why we're having the discussion in #64. That may lead to a renaming in future, but for now the above is what to go for.

@jfkok commented Oct 25, 2024

Thanks so much! I've implemented all your comments (I think) and uploaded the updated files here. Let me know if you think any further changes are needed.

One thing to note is that I added the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable. Let me know in case I should be doing something differently.

@znichollscr (Collaborator Author)

Nice, thanks.

Let me know if you think any further changes are needed.

Underscores in the source ID have to be changed to hyphens (there's a quick rename sketch after this list), i.e.:

  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_constant_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-constant_gm_200101-210012.nc
  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_increasing_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-increasing_gm_200101-210012.nc
  • dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_decreasing_gm_200101-210012.nc -> dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1-decreasing_gm_200101-210012.nc
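
If doing that by hand is a pain, something like this would do it. A sketch only: the folder name is a placeholder, and I'd double-check the results before deleting anything.

from pathlib import Path

# Fix "_constant"/"_increasing"/"_decreasing" so they're joined to the
# source ID with a hyphen rather than an underscore
for f in Path("dust-files").glob("*.nc"):
    new_name = f.name
    for scenario in ("constant", "increasing", "decreasing"):
        new_name = new_name.replace(f"UCLA-1-0-1_{scenario}", f"UCLA-1-0-1-{scenario}")
    if new_name != f.name:
        f.rename(f.with_name(new_name))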

Then I would suggest trying to run them through the validator and write them in the DRS: https://input4mips-validation.readthedocs.io/en/latest/how-to-guides/how-to-write-a-single-file-in-the-drs/

One thing to note is that I added the coordinates of the region boundaries as a "boundary coordinates" attribute of the "region" variable. Let me know in case I should be doing something differently.

Not sure, I haven't done this particular step before. If you run the files through the validator, the CF-checker will flag anything really wrong. In a couple of weeks I can pull the files down and have a look myself (meeting next week will take up time before then).

@jfkok commented Oct 28, 2024

Thanks Zeb, I corrected the file names.

I tried running the validator after installing it as an application per the instructions here. However, the command import input4mips_validation.io triggers an error message, pasted below. Do you know if there is an easy solution for this? Thanks!

[screenshot of the error message]

@znichollscr (Collaborator Author)

Do you know if there is an easy solution for this?

Hmm that's not very good. Let's dive down the rabbit hole here: climate-resource/input4mips_validation#78

@znichollscr (Collaborator Author)

@jfkok are there any updated files or should I just use this link (https://drive.google.com/drive/u/0/folders/1Xr1A4oqPj35h43MFV1864IXSvYSVtYKs) again?

@jfkok commented Dec 11, 2024 via email

@znichollscr (Collaborator Author) commented Dec 11, 2024

Looking good overall. I had to make some minor tweaks to the files to get them to pass validation; that was pretty straightforward. The Python code I used is below (it runs in the same environment as input4mips-validation), but you can probably do the same with Matlab easily. Looking at the changes, I assume they'll need to be applied to all files.

Python code for tweaks
import netCDF4

ds = netCDF4.Dataset(
    "dustscalefactor_input4MIPs_emissions_AerChemMIP_UCLA-1-0-1_gn_185001-200012.nc",
    "a",
)

# Capitalisation matters according to the spec
# (doesn't make sense to me, but here we are)
ds["dustscalefactor"].delncattr("Units")
# Apparently dimensionless is represented by "1"
# (again, wouldn't be my choice of how to do this)
ds["dustscalefactor"].setncattr("units", "1")
ds["dustscalefactor"].setncattr("long_name", "dustscalefactor")

# Capitalisation matters according to the spec
# (doesn't make sense to me, but here we are)
ds["region"].delncattr("Units")
# Apparently dimensionless is represented by "1"
# (again, wouldn't be my choice of how to do this)
ds["region"].setncattr("units", "1")
# Whitespace isn't allowed in attribute names
# (not sure why, it's just the spec)
ds["region"].setncattr(
    "boundary_coordinates", ds["region"].getncattr("Boundary coordinates")
)
ds["region"].delncattr("Boundary coordinates")
ds["region"].setncattr("long_name", "region")

# We use 'yr' rather than 'annual'
ds.setncattr("frequency", "yr")
# License info
ds.setncattr("license_id", "CC BY 4.0")
ds.setncattr(
    "license",
    (
        "The input4MIPs data linked to this entry is licensed "
        "under a Creative Commons Attribution 4.0 International "
        "(https://creativecommons.org/licenses/by/4.0/). "
        "Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse "
        "for terms of use governing CMIP6Plus output, "
        "including citation requirements and proper acknowledgment. "
        "The data producers and data providers make no warranty, either express "
        "or implied, including, but not limited to, warranties of merchantability "
        "and fitness for a particular purpose. "
        "All liabilities arising from the supply of the information "
        "(including any liability arising in negligence) "
        "are excluded to the fullest extent permitted by law."
    ),
)

# If you know this, helpful.
# If not, can just leave vague like this
ds.setncattr("nominal_resolution", "250 km")

ds.close()

The only other minor change is that, because you have annual data, the end of the file name should be "YYYY-YYYY" not "YYYYMM-YYYYMM" as you have now, i.e. for your historical files, "185001-200012.nc" becomes "1850-2000.nc".
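
A quick sketch of that rename too (again, the folder name is a placeholder):

import re
from pathlib import Path

# For annual data, "185001-200012.nc" should become "1850-2000.nc"
for f in Path("dust-files").glob("*.nc"):
    new_name = re.sub(r"_(\d{4})\d{2}-(\d{4})\d{2}\.nc$", r"_\1-\2.nc", f.name)
    if new_name != f.name:
        f.rename(f.with_name(new_name))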

If you're able to make those tweaks, we should be on the home straight.

@znichollscr (Collaborator Author)

The other thing we need to do is register UCLA as an institute. I've started that here if you're able to take a look: PCMDI/mip-cmor-tables#84

Thanks!

@jfkok commented Dec 12, 2024 via email

@znichollscr (Collaborator Author)

Thanks Zeb, that was very helpful. I've addressed all these issues and the file now passes the validation, including the cf checker!

Oh cool, I thought the cf-checker wasn't running on your machine, great to hear it's working!

Does that conclude the process or are there other steps we need to take?

Well, if you want to publish the data on ESGF there are a couple more steps. If you're not worried about that, we're done (although I'm also not sure that you needed to do any of these steps if you didn't want to publish on ESGF in the first place).

@znichollscr (Collaborator Author)

Assuming you do want to publish on ESGF, the next steps are:

  • check that this is the right institute to associate with your data: Add UCLA mip-cmor-tables#84
  • I'll pull your data down onto our server
  • Paul and I will then push it through the publication queue
  • Good to go

@jfkok commented Dec 13, 2024 via email

@znichollscr (Collaborator Author) commented Dec 13, 2024

Ok perfect. There is one last step that would be very helpful: can you please run the script below (it should be clear what it's doing; any questions, fire away; as above, it will run in your normal input4mips-validation environment). Once you've done that, ping me and @durack1 and we can get the files in the publication queue.

Python script to run
import shutil
from pathlib import Path

import iris
import netCDF4
from input4mips_validation.cvs import load_cvs
from input4mips_validation.dataset import Input4MIPsDataset
from input4mips_validation.inference.from_data import infer_time_start_time_end
from input4mips_validation.logging import setup_logging
from input4mips_validation.upload_ftp import upload_ftp
from input4mips_validation.validation.file import get_validate_file_result
from input4mips_validation.xarray_helpers.iris import ds_from_iris_cubes


setup_logging(enable=True, logging_level="INFO_FILE")


def create_fixed_tmp_file(file: Path, fixed_folder: Path) -> Path:
    fixed_file = fixed_folder / file.name
    shutil.copy(file, fixed_file)

    source_id = file.name.split("_")[4]

    ds = netCDF4.Dataset(fixed_file, "a")

    ds.setncattr("source_id", source_id)
    ds.setncattr("realm", "atmos")
    if "gn" in fixed_file.name:
        ds["region"].setncattr("long_name", "region")

    ds.close()

    return fixed_file


def validate_file_then_rewrite_file_in_drs(
    file: Path, output_root: Path, cv_source: str
) -> None:
    get_validate_file_result(
        file,
        cv_source=cv_source,
    ).raise_if_errors()

    cvs = load_cvs(cv_source=cv_source)

    ds = ds_from_iris_cubes(
        iris.load(file),
    )
    ds.attrs["nominal_resolution"] = "250 km"
    ds.attrs["target_mip"] = "AerChemMIP2"  # To confirm with Jasper

    time_start, time_end = infer_time_start_time_end(
        ds=ds,
        frequency_metadata_key="frequency",
        no_time_axis_frequency="fx",
        time_dimension="time",
    )

    full_file_path = cvs.DRS.get_file_path(
        root_data_dir=output_root,
        available_attributes=ds.attrs,
        time_start=time_start,
        time_end=time_end,
    )

    if full_file_path.exists():
        raise FileExistsError(full_file_path)

    full_file_path.parent.mkdir(parents=True, exist_ok=True)

    if full_file_path.name != file.name:
        Input4MIPsDataset.from_ds(ds, cvs=cvs).write(
            root_data_dir=output_root,
        )

    else:
        shutil.copy(file, full_file_path)

    print(f"File written according to the DRS in {full_file_path}")


def main() -> None:
    # Obviously point these wherever makes sense for you
    DATA_FOLDER = Path("dust-files")
    TMP_REWRITE_FOLDER = Path("dust-files-pre-validation-fixes")
    OUTPUT_ROOT = Path("dust-rewritten")

    # You shouldn't need to change this
    CV_SOURCE = "gh:main"

    TMP_REWRITE_FOLDER.mkdir(exist_ok=True, parents=True)

    files_to_upload = DATA_FOLDER.rglob("*.nc")

    for file in files_to_upload:
        tmp_file = create_fixed_tmp_file(file, TMP_REWRITE_FOLDER)
        validate_file_then_rewrite_file_in_drs(tmp_file, OUTPUT_ROOT, CV_SOURCE)

    cvs = load_cvs(cv_source=CV_SOURCE)
    upload_ftp(
        tree_root=OUTPUT_ROOT,
        ftp_dir_rel_to_root="UCLA-1-0-1-upload-1",
        password="jfkok@ucla.edu",
        cvs=cvs,
        username="anonymous",
        ftp_server="ftp.llnl.gov",
        ftp_dir_root="/incoming",
        # You can try making this 8, but the FTP server seems to not enjoy parallel uploads
        n_threads=1,
        dry_run=False,
        continue_on_error=False,
    )


if __name__ == "__main__":
    main()

@jfkok commented Jan 6, 2025

Happy 2025, @durack1, @znichollscr! I ran the python script and copied the output below. It seems to have run correctly, as far as I can tell.

[screenshot of the script output]

@znichollscr (Collaborator Author) commented Jan 6, 2025

Hmm I don't think so, no files were uploaded. Did you update DATA_FOLDER in the script to point to the directory with your files?

@znichollscr (Collaborator Author)

If yes, change

files_to_upload = DATA_FOLDER.rglob("*.nc")

to

# rglob returns a generator, which is always truthy, so materialise it
# into a list before asserting
files_to_upload = list(DATA_FOLDER.rglob("*.nc"))
assert files_to_upload, f"No re-written files found in {DATA_FOLDER}, please check"

because something isn't working...

@jfkok commented Jan 6, 2025 via email

@znichollscr (Collaborator Author)

Hmmm I can't see the screenshot and we don't seem to have anything on the server. Can you try sending the screenshot again please?

@jfkok commented Jan 6, 2025

That's odd! I've added the line of code you mentioned and that assertion doesn't trigger.

Here's the screenshot again, hopefully it comes across this time.
[screenshot of the script output]

@znichollscr (Collaborator Author)

Super weird, is there anything in the folder pointed to by OUTPUT_ROOT?
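
For example, in the same Python session (using OUTPUT_ROOT from the script above):

from pathlib import Path

print(list(Path("dust-rewritten").rglob("*.nc")))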

@znichollscr (Collaborator Author)

(For what it's worth, the timestamps for the "uploading files" message and the success message are the same, so it's clear that no uploading is happening.)
