Runtime error in Cloud-J module #2648

Open
foesterstroem opened this issue Dec 17, 2024 · 7 comments
Labels
category: Bug (Something isn't working)
topic: Photolysis (Related to photolysis rate computations)
topic: Runtime Error (Related to runtime issues, e.g. simulation stopped w/ error)

Comments

foesterstroem commented Dec 17, 2024

Your name

Freja Østerstrøm

Your affiliation

Aarhus University, Denmark

What happened? What did you expect to happen?

I recently set up an environment on our local cluster to run GCClassic 14.5.0 and successfully performed dry runs to download the input data.
When running a 4x5 benchmark simulation, I get a runtime error in the Cloud-J module.
The model output in the log file shows:

 Fast-J ----J-values----
 L=  O2       O3       O3(1D)   NO       H2SO4    H2COa    H2COb    H2O2     CH3OOH   NO2      NO3      N2O5     HNO2     HNO3     HNO4     ClNO3a   ClNO3b   ClNO2    Br2      BrNO2    Cl2      HOCl     OClO     ClOO     Cl2O2    ClO      BrO      BrNO3    HOBr     BrCl     N2O      CFCl3    CF2Cl2   F113     F114     F115     CCl4     CH3Cl    MeCCl3   CH2Cl2   CHF2Cl   F123     F141b    F142b    CH3Br    H1211    H1301    H2402    CH2Br2   CHBr3    CF3I     OCS      HAC      PAN      PPN      CH3NO3   ActAld   MeVK     MeAcr    GlyAld   MEKeto   PrAld    MGlyxl   Glyxla   Glyxlb   Glyxlc   Acet-a   Acet-b   ONIT1    MPN      ETHLN    PROPNN
MVKN     MACRN    NITP     HPALD1   HPALD2   PrAldP   ICN      MACRNP   MVKCN    ENOL     ONIT2    HP2      HMHP     CH3I     CH2I2    CH2ICl   CH2IBr   I2       HOI      IO       OIO      INO      IONO     IONO2    I2O2     I2O3     ICl      IBr      MENO3    ETNO3    IPRNO3   NPRNO3   BALD
 72 0.00E+00 1.20E-04 4.56E-09 0.00E+00 1.84E-08 7.03E-08 4.03E-07 3.38E-08 9.18E-08 5.17E-04 7.86E-02 2.00E-07 8.15E-05 4.12E-10 3.42E-06 1.61E-06 4.12E-08 4.80E-06 8.77E-03 1.18E-03 1.28E-04 8.97E-06 6.78E-03 1.47E-01 9.99E-05 9.31E-12 9.40E-04 1.32E-04 2.73E-04 1.81E-03 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 7.70E-14 4.07E-44 3.07E-14 2.10E-41 9.70E-10 4.49E-08 1.03E-45 6.49E-09 1.64E-09 1.77E-09 3.32E-10 5.82E-09 1.31E-07 4.92E-08 1.05E-08 9.22E-09 1.01E-07 6.75E-05 9.55E-06 1.29E-07 6.95E-07 8.80E-12 6.24E-13 4.89E-09 3.33E-09 3.37E-07 6.25E-08
At line 779 of file /home/freja/GCClassic_14_5_0/Rundirs/gc_4x5_merra2_fullchem_benchmark/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90 (unit = 6, file = 'stdout')
Fortran runtime error: Expected INTEGER for item 74 in formatted transfer, got REAL
(i3,1p, 72e9.2)
 ^

I get the same error in debug mode, but no additional information (I do have an issue with my debugger installation, which may be unrelated to this issue).

What are the steps to reproduce the bug?

In cldj_fjx_sub_mod.F90, lines 778-780 are a do loop that prints the J-values shown in the output above. A comment there says this information is not fed back to the model. Commenting out these three lines in the source code lets the model run.

!---diagnostics/variables below are JUST for PRINT and NOT returned to the
! CTM code

and

         !do L = LU,1,-1
         !   write(6,'(i3,1p, 72e9.2)') L,(VALJXX(L,K),K=1,NJX)
         !enddo

Could commenting out this output cause problems further down the line? Or could it be hiding something that is actually wrong in the model run and that I am not seeing?

The first species in this list that does not have its J-value printed is MVKN, which seems to have a very small absorption cross section (CHEM_INPUTS/FAST_JX/v2024-05/FJX_spec.dat), so I wonder whether the issue is related to J-values approaching 0.

Please attach any relevant configuration and log files.

No response

What GEOS-Chem version were you using?

14.5.0

What environment were you running GEOS-Chem on?

Local cluster

What compiler and version were you using?

gcc 12.2.0

Will you be addressing this bug yourself?

Yes, but I will need some help

In what configuration were you running GEOS-Chem?

GCClassic

What simulation were you running?

Full chemistry

At what resolution were you running GEOS-Chem?

4x5

What meteorology fields did you use?

MERRA-2

Additional information

No response

foesterstroem added the category: Bug label Dec 17, 2024
yantosca self-assigned this Dec 17, 2024
yantosca added the topic: Runtime Error and topic: Photolysis labels Dec 17, 2024
@yantosca
Contributor

Thanks for writing @foesterstroem. This looks like an input read error. If you are using version 14.5.0 then make sure that you are reading data from the CHEM_INPUTS/CLOUD_J/v2024-09/ folder as these files are needed for Cloud-J v8+.

Also, would you be able to attach your input files and log files to this issue? We can take a look.

Also tagging @lizziel, our local Cloud-J expert.

Contributor

lizziel commented Dec 17, 2024

Hi @foesterstroem, that print should only occur if Cloud-J debug prints are enabled since it is in this if block:
https://github.com/geoschem/Cloud-J/blob/f8a2b7f964bde1582fbc38c41d8872bc23a21735/src/Core/cldj_fjx_sub_mod.F90#L651

To be clear, the problem is the formatting of the write statement (not input read) and I wonder if there is a bug there. You must be running with verbose on, since that write statement is otherwise not called. It is trying to print all J-values for a single grid cell (the cell horizontal indexes are defined as (20,20) prior to the main Cloud-J call in file cldj_interface_mod.F90 in GEOS-Chem). The J-values are defined as real so it is odd it is looking for an integer.

I will look into whether this is a bug. For now, commenting it out should not have any impact. It might indicate an underlying issue, although I doubt it.
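
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the format `(i3,1p, 72e9.2)` describes 73 items (one integer plus 72 reals). When the I/O list is longer than that, Fortran format reversion starts a new record and reuses the format from its beginning, so item 74 (the 73rd J-value, which lines up with the MVKN column) is matched against `i3`. That would produce this message whenever NJX is larger than 72. The program below shows two ways to size the format from the list instead of hard-coding 72. Only the names LU, NJX, and VALJXX come from the source; the array sizes and values are made up for illustration.

```fortran
program jvalue_print_sketch
   ! Sketch only: LU, NJX, and VALJXX mimic the names used in the issue,
   ! but the sizes and values below are hypothetical.
   implicit none
   integer, parameter :: dp  = kind(1.0d0)
   integer, parameter :: LU  = 3      ! hypothetical number of levels
   integer, parameter :: NJX = 105    ! hypothetical count, larger than 72
   real(dp)           :: VALJXX(LU, NJX)
   character(len=32)  :: fmt
   integer            :: L, K

   VALJXX = 1.0e-9_dp

   ! Option 1 (Fortran 2008): unlimited repeat count. The group *(e9.2) is
   ! reused until the I/O list is exhausted, so i3 only sees the integer L.
   do L = LU, 1, -1
      write(6, '(i3,1p,*(e9.2))') L, (VALJXX(L,K), K=1,NJX)
   enddo

   ! Option 2: build the repeat count from NJX at run time.
   write(fmt, '(a,i0,a)') '(i3,1p,', NJX, 'e9.2)'
   do L = LU, 1, -1
      write(6, fmt) L, (VALJXX(L,K), K=1,NJX)
   enddo
end program jvalue_print_sketch
```

The unlimited repeat count needs a compiler with Fortran 2008 support; the run-time format string also works with older standards. Either way each level is printed on a single line, which is an assumption about the intended layout.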

Please do post your geoschem_config.yml and log file here. You can add .txt extension and drag and drop the files into the comment box. This will help me try to reproduce the issue.

Author

foesterstroem commented Dec 18, 2024

Hi, thanks for getting back to me.
The model should be reading the CHEM_INPUTS/CLOUD_J/v2024-09/ folder as is seen in the geoschem_config.yml file
geoschem_config.txt

Yes, I have been running with verbose on. I am currently retrying the benchmark run with the do loop restored (uncommented) and verbose off. FYI: I successfully completed a benchmark run yesterday/today using version 14.4.3 with verbose on.

I am attaching just part of the log-file: GC-log-sections.txt

The file is too big to attach in its entirety. It ran for ~12.5 model days before failing with a floating-point exception (apparently in the aerosol thermodynamics code):

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x14b33ac38b7f in ???
#1  0x1073231 in mach_hetp_calco7
	at /home/freja/GCClassic_14_5_0/Rundirs/gc_4x5_merra2_fullchem_benchmark/CodeDir/src/HETP/src/Core/hetp_mod.F90:5220
#2  0x108de29 in __hetp_mod_MOD_mach_hetp_main_15cases
	at /home/freja/GCClassic_14_5_0/Rundirs/gc_4x5_merra2_fullchem_benchmark/CodeDir/src/HETP/src/Core/hetp_mod.F90:408
#3  0x5f5bbb in __aerosol_thermodynamics_mod_MOD_do_ate._omp_fn.0
	at /home/freja/GCClassic_14_5_0/Rundirs/gc_4x5_merra2_fullchem_benchmark/CodeDir/src/GEOS-Chem/GeosCore/aerosol_thermodynamics_mod.F90:799
#4  0x14b33b65357d in gomp_thread_start
	at /tmp/freja/spack-stage/spack-stage-gcc-12.2.0-5boslqnpqh4pkgcys6q5bbmrzfhts2qh/spack-src/libgomp/team.c:129
#5  0x14b33afb81ce in ???
#6  0x14b33ac23e72 in ???
#7  0xffffffffffffffff in ???
/bin/bash: line 71: 4041629 Floating point exception(core dumped) ./gcclassic >> $log

Contributor

lizziel commented Dec 18, 2024

Hi @foesterstroem, I did a fullchem run yesterday with verbose on. It was successful, but I was surprised to see small differences in output between verbose on and verbose off in the GC-Classic benchmark simulation. I do not recommend doing any production runs with verbose on until we figure out why there are differences. Having verbose on also slows down the model, so we do not recommend using it for production runs anyway. Is there a reason you need it on for a 12+ day run?

@yantosca
Contributor

Thanks @foesterstroem and @lizziel. I have a hunch why this may happen. The gmax variable is defined as a local variable at line 4704 of HETP/hetp_mod.F90:

   real(dp)     :: omehi, omebe, y1, y2, y3, x3, dx, c1, c2, c2a, c3, gmax, ya, yb, xa, xb

But gmax is never initialized until here at line 5192:

      if (.not. soln) then
         gmax = 0.1_dp
         gmax = max(gmax, gama(1))
         gmax = max(gmax, gama(2))
         gmax = max(gmax, gama(3))
         gmax = max(gmax, gama(4))
         gmax = max(gmax, gama(5))
         gmax = max(gmax, gama(6))
         gmax = max(gmax, gama(7))
         gmax = max(gmax, gama(8))
         gmax = max(gmax, gama(9))
         gmax = max(gmax, gama(10))
         gmax = max(gmax, gama(11))
         gmax = max(gmax, gama(12))
         gmax = max(gmax, gama(13))
         gmax = max(gmax, gama(14))
         gmax = max(gmax, gama(15))
         gmax = max(gmax, gama(16))
         gmax = max(gmax, gama(17))
         gmax = max(gmax, gama(18))
         gmax = max(gmax, gama(19))
         gmax = max(gmax, gama(20))
         gmax = max(gmax, gama(21))
         gmax = max(gmax, gama(22))
         gmax = max(gmax, gama(23))
      end if

Note that gmax wouldn't be defined unless the soln logical is false. Then we get to line 5220 where we encountered the error:

!  ## Reinitialize activity coefficients if gmax > 100.0_dp
      if (gmax > 100.0_dp .and. (.not. soln)) then
         gama  = 0.1_dp
         gamin = 1.0e10_dp
         gamou = 1.0e10_dp
         calou = .true.
         frst  = .true.
      end if

So what could be happening is that gmax holds an arbitrary, uninitialized value (possibly NaN or infinity) when it is read at line 5220, and that could have tripped the error.

What I think will fix this is setting gmax to 0 at the top of the routine, at around line 4717:

!  ### Initialize variables ### 
   so4   = so4_i
   nh4   = nh4_i
   no3   = no3_i
   na    = na_i
   cl    = cl_i
   ca    = ca_i
   pk    = k_i
   mg    = mg_i
   aw    = rh
   t     = temp
   hso4  = 0.0_dp
   gnh3  = 0.0_dp
   ghno3 = 0.0_dp
   ghcl  = 0.0_dp
   h     = 0.0_dp
   lwn   = tiny
   so4_t = 0.0_dp
   nh4_t = 0.0_dp
   no3_t = 0.0_dp
   na_t  = 0.0_dp
   cl_t  = 0.0_dp
   ca_t  = 0.0_dp
   pk_t  = 0.0_dp
   mg_t  = 0.0_dp
   caso4 = 0.0_dp
   so4fr = 0.0_dp
   na2so4= 0.0_dp
   k2so4 = 0.0_dp
   mgso4 = 0.0_dp
   noroot=.false.
   frk   = 0.0_dp
   frmg  = 0.0_dp
   frca  = 0.0_dp
   frna  = 0.0_dp
   soln  = .false.
   calou = .true.
   gama  = 0.1_dp
   gamin = 1.0e10_dp
   gamou = 0.1_dp
   earlye = .false.

we could just add a

   gmax = 0.0_dp

to that list, so that gmax always has a defined, finite value regardless of the value of soln.
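
A minimal sketch of the proposed initialization, using a stripped-down stand-in for the logic quoted above rather than the HETP routine itself. It assumes gama has 23 elements (the quoted code only shows that indices 1 through 23 are read), and the values are made up. The nested guard at the end is an additional idea, not something proposed in this thread: Fortran does not guarantee short-circuit evaluation of `.and.`, so `gmax > 100.0_dp` may be evaluated even when soln is true.

```fortran
program gmax_init_sketch
   ! Sketch only: a simplified stand-in for the quoted logic, not the HETP code.
   ! The size of gama and all values are assumptions for illustration.
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp) :: gama(23), gmax
   logical  :: soln

   gama    = 0.1_dp
   gama(7) = 250.0_dp
   soln    = .true.       ! the branch that assigns gmax is skipped

   gmax = 0.0_dp          ! proposed initialization: defined on every path

   if (.not. soln) then
      ! Same result as the chain of max() calls over gama(1) .. gama(23)
      gmax = max(0.1_dp, maxval(gama))
   end if

   ! Testing soln first means gmax is only read after it has been assigned.
   if (.not. soln) then
      if (gmax > 100.0_dp) print *, 'reinitialize activity coefficients'
   end if

   print *, 'gmax =', gmax
end program gmax_init_sketch
```

With gfortran, compiling with `-finit-real=snan` together with `-ffpe-trap=invalid` is one way to make reads of uninitialized local reals fail early instead of depending on whatever value happens to be in memory.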

Contributor

lizziel commented Dec 18, 2024

That makes sense for that error. I think most compilers don't pick up on that unless compiled with debug flags. @foesterstroem, did you compile with debug on? I only ask because you should recompile with it off before doing a long run. It will greatly slow down the model.

Author

foesterstroem commented Dec 20, 2024

> Hi @foesterstroem, I did a fullchem run yesterday with verbose on. It was successful, but I was surprised to see small differences in output between verbose on and verbose off in the GC-Classic benchmark simulation. I do not recommend doing any production runs with verbose on until we figure out why there are differences. Having verbose on also slows down the model, so we do not recommend using it for production runs anyway. Is there a reason you need it on for a 12+ day run?

> That makes sense for that error. I think most compilers don't pick up on that unless compiled with debug flags. @foesterstroem, did you compile with debug on? I only ask because you should recompile with it off before doing a long run. It will greatly slow down the model.

Hi @lizziel, Thank you. I had left verbose on after running a debug version, and saw the initial Cloud-J error running in both debug and normal mode. For the second error with HETP, the particular run wasn't compiled in debug mode, but in the normal mode.

@yantosca, after the holiday break I will attempt a fresh compile of the benchmark run to see whether I still have the HETP/gmax issue and whether adding the code you suggest fixes it.
