
Possible bug on GEFS fcst segment #3001

Open
weihuang-jedi opened this issue Oct 11, 2024 · 3 comments · May be fixed by #3009
Assignees
Labels
bug Something isn't working

Comments

@weihuang-jedi
Contributor

What is wrong?

When running GEFS with forecast segments, the forecast hours of the segments appear to overlap, as shown below:

[Wei.Huang@hfe03 2021032312]$ grep cfhour fcst_mem00*

fcst_mem000_seg0.log: 6: in wrt run, nfhour= 0.333333333333333 cfhour=000
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 6.00000000000000 cfhour=006
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 18.0000000000000 cfhour=018
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 30.0000000000000 cfhour=030
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 42.0000000000000 cfhour=042
fcst_mem000_seg0.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048

fcst_mem000_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem000_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

fcst_mem001_seg0.log: 6: in wrt run, nfhour= 0.333333333333333 cfhour=000
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 6.00000000000000 cfhour=006
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 18.0000000000000 cfhour=018
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 30.0000000000000 cfhour=030
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 42.0000000000000 cfhour=042
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg0.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

fcst_mem001_seg1.log: 6: in wrt run, nfhour= 12.0000000000000 cfhour=012
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 24.0000000000000 cfhour=024
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 36.0000000000000 cfhour=036
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 48.0000000000000 cfhour=048
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 54.0000000000000 cfhour=054
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 60.0000000000000 cfhour=060
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 66.0000000000000 cfhour=066
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 72.0000000000000 cfhour=072
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 78.0000000000000 cfhour=078
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 84.0000000000000 cfhour=084
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 90.0000000000000 cfhour=090
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 96.0000000000000 cfhour=096
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 102.000000000000 cfhour=102
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 108.000000000000 cfhour=108
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 114.000000000000 cfhour=114
fcst_mem001_seg1.log: 6: in wrt run, nfhour= 120.000000000000 cfhour=120

For mem000, seg 0 forecasts from hour 00 to 48, then seg 1 runs from 12 to 120; shouldn't seg 1 run from 48 to 120?
For mem001 and mem002, seg 0 runs from 00 to 120, and then seg 1 runs from 12 to 120; seg 1 is not needed here at all, right?

rocotostat shows this:

/apps/rocoto/1.3.7/bin/rocotostat -d c48gefs.db -w c48gefs.xml
CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
================================================================================================================================
202103231200 stage_ic 803100 SUCCEEDED 0 1 21.0
202103231200 wave_init 803099 SUCCEEDED 0 1 28.0
202103231200 prep_emissions 803098 SUCCEEDED 0 1 17.0
202103231200 fcst_mem000_seg0 803195 SUCCEEDED 0 1 1164.0
202103231200 fcst_mem000_seg1 803964 SUCCEEDED 0 1 2812.0
202103231200 fcst_mem001_seg0 803196 SUCCEEDED 0 1 2859.0
202103231200 fcst_mem001_seg1 805320 SUCCEEDED 0 1 2890.0
202103231200 fcst_mem002_seg0 803197 SUCCEEDED 0 1 2850.0
202103231200 fcst_mem002_seg1 805321 SUCCEEDED 0 1 2884.0

It seems mem000 used about one-third more CPU than needed, while mem001 and mem002 roughly doubled their CPU cost.

What should have happened?

For all members, when the forecast is run in two segments, we expect:
seg 0: forecast from 00 -> 48,
seg 1: forecast from 48 -> 120.
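
A quick way to see the overlap from the logs (a minimal sketch assuming the "cfhour=NNN" log format shown above; the loop and variable names are only illustrative):

# Print the first and last forecast hour written by each segment log.
for log in fcst_mem00*_seg*.log; do
    hours=$(grep -o 'cfhour=[0-9]*' "${log}" | cut -d= -f2)
    echo "${log}: first=$(echo "${hours}" | head -n1) last=$(echo "${hours}" | tail -n1)"
done

With the expected behavior, each seg0 log should end at 048 and each seg1 log should start after 048.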

What machines are impacted?

All or N/A

What global-workflow hash are you using?

The test is using EPIC's fork of global-workflow, which points to the current develop branch.

Steps to reproduce

To reproduce on Hera:

  1. compile with: build_all.sh -w
  2. configure with (see the sketch after this list):
    HPC_ACCOUNT=epic
    pslot=c48gefs
    RUNTESTS=/scratch1/NCEPDEV/stmp2/$USER/GEFSTESTS
    ./workflow/create_experiment.py
    --yaml ci/cases/pr/C48_S2SWA_gefs.yaml
  3. start the crontab.
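
Putting step 2 together as it might be typed in a shell (a sketch only; the account, pslot, and RUNTESTS path come from the list above, and exporting them before calling create_experiment.py is an assumption about how they are passed):

# Assumed invocation based on the configuration values listed above.
export HPC_ACCOUNT=epic
export pslot=c48gefs
export RUNTESTS=/scratch1/NCEPDEV/stmp2/${USER}/GEFSTESTS
./workflow/create_experiment.py --yaml ci/cases/pr/C48_S2SWA_gefs.yaml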

Additional information

COMROOT and EXPDIR on Hera at:

[Wei.Huang@hfe03 GEFSTESTS]$ pwd
/scratch1/NCEPDEV/stmp2/Wei.Huang/GEFSTESTS
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l
total 8
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 COMROOT
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 10 22:56 EXPDIR
[Wei.Huang@hfe03 GEFSTESTS]$ ls -l *
COMROOT:
total 4
drwxr-sr-x 4 Wei.Huang stmp 4096 Oct 10 22:57 c48gefs

EXPDIR:
total 4
drwxr-sr-x 3 Wei.Huang stmp 4096 Oct 11 14:10 c48gefs

Do you have a proposed solution?

No

@weihuang-jedi added the bug and triage labels Oct 11, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA self-assigned this Oct 11, 2024
@WalterKolczynski-NOAA removed the triage label Oct 11, 2024
@WalterKolczynski-NOAA
Contributor

I just checked and this is definitely working correctly for gfs atm-only. Will try again with coupled, then gefs.

@WalterKolczynski-NOAA
Contributor

Looks like the WW3 restart files are not being written to the correct directory. There is a restart_wave directory in $DATA that is linked to $DATA_RESTART, but the restart files are getting written directly to the root $DATA. So when waves are on, it will never find wave restart files.
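
A minimal sketch of the kind of per-file linking that would work around this (assuming, per the commit message linked below, that the single-grid WW3 restart files land in the root of $DATA with names beginning with ww3; the glob and target directory here are illustrative, not the actual PR code):

# Illustrative only: link each WW3 restart file from the root of the run
# directory into the restart directory, instead of relying on the
# restart_wave directory link. The ww3.* glob is an assumption.
for rfile in "${DATA}"/ww3.*restart*; do
    [[ -e "${rfile}" ]] || continue
    ln -sf "${rfile}" "${DATArestart}/$(basename "${rfile}")"
done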

CC: @aerorahul

@WalterKolczynski-NOAA
Contributor

I've confirmed restart works correctly for S2S without waves. Will fix the wave restarts this week.

WalterKolczynski-NOAA added a commit to WalterKolczynski-NOAA/global-workflow that referenced this issue Oct 16, 2024
Fixes some issues that were preventing wave restarts from operating
correctly.

First, the wave restart files were not being correctly linked from
`$DATA` to `$DATArestart`. The files are placed in the root of
`$DATA` instead of in `${DATA}/WAVE_RESTART`, so now links for the
individual files are created.

Second, the incorrect filenames were being searched for and copied
as part of a rerun. Filenames were geared towards multigrid waves,
which use the grid names, but single-grid just uses `ww3`. Since
multigrid waves are deprecated in the workflow and will soon be removed
(NOAA-EMC#2637), these were updated to support only the single-grid option.

These fixes allow forecast segments (and emergency restarts) to
work correctly when waves are on.

Resolves NOAA-EMC#3001
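
As a rough illustration of the second fix described above (the filename pattern, variable names, and copy step are assumptions made for the sketch, not the actual PR code):

# Hypothetical sketch: on a rerun, look for single-grid WW3 restart files
# (ww3.* prefix) rather than the multigrid per-grid names, and copy the
# most recent one back into the run directory.
latest=$(ls -1 "${DATArestart}"/ww3.*restart* 2>/dev/null | tail -n1)
if [[ -n "${latest}" ]]; then
    cp "${latest}" "${DATA}/"
fi
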
@WalterKolczynski-NOAA WalterKolczynski-NOAA linked a pull request Oct 16, 2024 that will close this issue