
Possible bug on GEFS fcst segment #3001

Open
weihuang-jedi opened this issue Oct 11, 2024 · 3 comments · May be fixed by #3009
Assignees
Labels
bug Something isn't working

Comments

@weihuang-jedi
Contributor

What is wrong?

When running GEFS with forecast segments, the forecast hours of the segments appear to overlap, as shown below:

[Wei.Huang@hfe03 2021032312]$ grep cfhour fcst_mem00*

fcst_mem000_seg0.log: 6: in wrt run, nfhour= 0.333333333333333 cfhour=000
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 6.00000000000000 cfhour=006
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 18.0000000000000 cfhour=018
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 30.0000000000000 cfhour=030
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 42.0000000000000 cfhour=042
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048

fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

fcst_mem001_seg0.log: 6: in wrt run, nfhour= 0.333333333333333 cfhour=000
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 6.00000000000000 cfhour=006
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 18.0000000000000 cfhour=018
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 30.0000000000000 cfhour=030
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 42.0000000000000 cfhour=042
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

fcst_mem001_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

For mem000, seg 0 forecasts from hour 00 to 48, then seg 1 runs from 12 to 120; shouldn't seg 1 run from 48 to 120?
For mem001 and mem002, seg 0 runs from 00 to 120, and then seg 1 runs from 12 to 120; seg 1 is not needed here at all, right?

rocotostat shows this:

/apps/rocoto/1.3.7/bin/rocotostat -d c48gefs.db -w c48gefs.xml
CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
================================================================================================================================
202103231200 stage_ic 803100 SUCCEEDED 0 1 21.0
202103231200 wave_init 803099 SUCCEEDED 0 1 28.0
202103231200 prep_emissions 803098 SUCCEEDED 0 1 17.0
202103231200 fcst_mem000_seg0 803195 SUCCEEDED 0 1 1164.0
202103231200 fcst_mem000_seg1 803964 SUCCEEDED 0 1 2812.0
202103231200 fcst_mem001_seg0 803196 SUCCEEDED 0 1 2859.0
202103231200 fcst_mem001_seg1 805320 SUCCEEDED 0 1 2890.0
202103231200 fcst_mem002_seg0 803197 SUCCEEDED 0 1 2850.0
202103231200 fcst_mem002_seg1 805321 SUCCEEDED 0 1 2884.0

It seems mem000 used about one-third more CPU than needed, while mem001 and mem002 roughly doubled their CPU cost.

What should have happened?

For all members, when the forecast is run in two segments, we expect:
seg 0: forecast from 00 -> 48,
seg 1: forecast from 48 -> 120.
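
A quick way to see the overlap from the logs (a minimal sketch assuming the "cfhour=NNN" log format shown above; the loop and variable names are only illustrative):

# Print the first and last forecast hour written by each segment log.
for log in fcst_mem00*_seg*.log; do
    hours=$(grep -o 'cfhour=[0-9]*' "${log}" | cut -d= -f2)
    echo "${log}: first=$(echo "${hours}" | head -n1) last=$(echo "${hours}" | tail -n1)"
done

With the expected behavior, each seg0 log should end at 048 and each seg1 log should start after 048.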

What machines are impacted?

All or N/A

What global-workflow hash are you using?

The test is using EPIC's fork of global-workflow, which points to the current develop branch.

Steps to reproduce

To reproduce on Hera:

  1. compile with: build_all.sh -w
  2. configure with (see the sketch after this list):
    HPC_ACCOUNT=epic
    pslot=c48gefs
    RUNTESTS=/scratch1/NCEPDEV/stmp2/$USER/GEFSTESTS
    ./workflow/create_experiment.py
    --yaml ci/cases/pr/C48_S2SWA_gefs.yaml
  3. start the crontab.
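
Putting step 2 together as it might be typed in a shell (a sketch only; the account, pslot, and RUNTESTS path come from the list above, and exporting them before calling create_experiment.py is an assumption about how they are passed):

# Assumed invocation based on the configuration values listed above.
export HPC_ACCOUNT=epic
export pslot=c48gefs
export RUNTESTS=/scratch1/NCEPDEV/stmp2/${USER}/GEFSTESTS
./workflow/create_experiment.py --yaml ci/cases/pr/C48_S2SWA_gefs.yaml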

Additional information

COMROOT and EXPDIR on Hera at:

[Wei.Huang@hfe03 GEFSTESTS]$ pwd
/scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l
total 8
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 COMROOT
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 EXPDIR
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l *
COMROOT:
total 4
drwxr-sr-x 4 Wei.Huang stmp 4096 Oct 10 22:57 c48gefs

EXPDIR:
total 4
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 11 14:10 c48gefs

Do you have a proposed solution?

No

@weihuang-jedi added the bug and triage labels Oct 11, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA self-assigned this Oct 11, 2024
@WalterKolczynski-NOAA removed the triage label Oct 11, 2024
@WalterKolczynski-NOAA
Contributor

I just checked and this is definitely working correctly for gfs atm-only. Will try again with coupled, then gefs.

@WalterKolczynski-NOAA
Contributor

Looks like the WW3 restart files are not being written to the correct directory. There is a restart_wave directory in $DATA that is linked to $DATA_RESTART, but the restart files are getting written directly to the root $DATA. So when waves are on, it will never find wave restart files.
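
A minimal sketch of the kind of per-file linking that would work around this (assuming, per the commit message linked below, that the single-grid WW3 restart files land in the root of $DATA with names beginning with ww3; the glob and target directory here are illustrative, not the actual PR code):

# Illustrative only: link each WW3 restart file from the root of the run
# directory into the restart directory, instead of relying on the
# restart_wave directory link. The ww3.* glob is an assumption.
for rfile in "${DATA}"/ww3.*restart*; do
    [[ -e "${rfile}" ]] || continue
    ln -sf "${rfile}" "${DATArestart}/$(basename "${rfile}")"
done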

CC: @aerorahul

@WalterKolczynski-NOAA
Contributor

I've confirmed restart works correctly for S2S without waves. Will fix the wave restarts this week.

WalterKolczynski-NOAA added a commit to WalterKolczynski-NOAA/global-workflow that referenced this issue Oct 16, 2024
Fixes some issues that were preventing wave restarts from operating
correctly.

First, the wave restart files were not being correctly linked from
`$DATA` to `$DATArestart`. The files are placed in the root of
`$DATA` instead of in `${DATA}/WAVE_RESTART`, so now links for the
individual files are created.

Second, the incorrect filenames were being searched for and copied
as part of a rerun. Filenames were geared towards multigrid waves,
which use the grid names, but single-grid just uses `ww3`. Since
multigrid waves are deprecated in the workflow and will soon be removed
(NOAA-EMC#2637), these were updated to support only the single-grid option.

These fixes allow forecast segments (and emergency restarts) to
work correctly when waves are on.

Resolves NOAA-EMC#3001
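
As a rough illustration of the second fix described above (the filename pattern, variable names, and copy step are assumptions made for the sketch, not the actual PR code):

# Hypothetical sketch: on a rerun, look for single-grid WW3 restart files
# (ww3.* prefix) rather than the multigrid per-grid names, and copy the
# most recent one back into the run directory.
latest=$(ls -1 "${DATArestart}"/ww3.*restart* 2>/dev/null | tail -n1)
if [[ -n "${latest}" ]]; then
    cp "${latest}" "${DATA}/"
fi
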
@WalterKolczynski-NOAA WalterKolczynski-NOAA linked a pull request Oct 16, 2024 that will close this issue