
"KeyError: ..." when running "muscle3 profile --instances ..." (version 0.7.1) #274

Open
DavidPCoster opened this issue Oct 27, 2023 · 5 comments

@DavidPCoster
Contributor

When attempting to analyse a completed muscle3 run using

muscle3 profile --instances $RUN/performance.sqlite

I get

Traceback (most recent call last):
  File "/gss_efgw_work/scratch/g2dpc/GIT/ets_paf/UQ/muscle3_venv/bin/muscle3", line 8, in <module>
    sys.exit(muscle3())
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/muscle3/muscle3.py", line 70, in profile
    plot_instances(Path(performance_file))
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/muscle3/profiling.py", line 26, in plot_instances
    instances, compute, transfer, wait = db.instance_stats()
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/libmuscle/manager/profile_database.py", line 135, in instance_stats
    total_times = [(stop_run[i] - start_run[i]) * 1e-9 for i in instances]
  File "/gw/switm/muscle3/0.7.1/intel/2020/lib/python3.10/site-packages/libmuscle/manager/profile_database.py", line 135, in <listcomp>
    total_times = [(stop_run[i] - start_run[i]) * 1e-9 for i in instances]
KeyError: 'init'

where 'init' is the name of one of the actors (the first to be executed). I have hacked profile_database.py as follows:

    instances = list(start_run.keys())
    for i in instances:
        if i not in start_run:
            print(f'{i} not in start_run')
            start_run[i] = np.array(list(start_run.values())).min()
        if i not in stop_run:
            print(f'{i} not in stop_run')
            stop_run[i] = np.array(list(stop_run.values())).max()
    total_times = [(stop_run[i] - start_run[i]) * 1e-9 for i in instances]

(first and last lines are from the original code). This at least allows the code to run, and I get the message:

init not in stop_run

What I've done is a horrible hack, but I haven't been able to find why 'init' is not in stop_run ...
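
A generic way to see what is actually recorded in the database, using only the standard library and assuming nothing about the MUSCLE3 schema beyond the file being an ordinary SQLite database (table names and contents will be whatever MUSCLE3 actually writes):

    # Generic inspection of performance.sqlite using only the standard library;
    # no assumptions about the MUSCLE3 schema beyond it being a SQLite file.
    import sqlite3

    conn = sqlite3.connect('performance.sqlite')
    cur = conn.cursor()

    # List every table with its declared schema and row count.
    tables = cur.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'").fetchall()
    for name, sql in tables:
        count = cur.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        print(f'{name} ({count} rows)')
        print(f'  {sql}')

    conn.close()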

I'm not sure if anybody else has experienced this problem.

@LourensVeen
Contributor

It seems that there's no SHUTDOWN_WAIT event for init in the database (assuming the data is from a run with 0.7.1, otherwise there's a DEREGISTER missing). This suggests that either init didn't shut down cleanly, or that there's a mistake somewhere in MUSCLE3 that causes the event to not end up in the performance database. I guess the run finished cleanly? Did init run once, or never?

Maybe we could have the launcher record when processes are started and when they stop, cleanly or through a crash. Then unless the manager itself crashes (I'm sure you'll find a way 😃), the data should be there.

@DavidPCoster
Contributor Author

DavidPCoster commented Oct 27, 2023

The run finished cleanly:

Simulation completed successfully.
Output may be found in run_ETS_MAIN_ENCAPSULATED_WITH_ETS_INIT_20231027_094921

These are the messages from init in the log file:

(muscle3_venv) <g2dpc@s53 ~/GIT/ets_paf>grep ^init run_ETS_MAIN_ENCAPSULATED_WITH_ETS_INIT_20231027_094921/muscle3_manager.log
init 2023-10-27 09:49:30,956 DEBUG init: Sending message on equilibrium_out
init 2023-10-27 09:49:30,976 DEBUG init: Sending message on core_profiles_out
init 2023-10-27 09:49:30,977 DEBUG init: Sending message on transport_solver_numerics_out
init 2023-10-27 09:49:30,978 DEBUG init: Waiting for message on equilibrium_in
init 2023-10-27 09:49:30,978 INFO init: Connecting to peer ets_init at ['tcp:10.24.54.27:11080,10.28.54.27:11080,130.186.25.49:11080']
init 2023-10-27 09:49:33,216 DEBUG init: Received message on equilibrium_in
init 2023-10-27 09:49:33,217 DEBUG init: Waiting for message on core_profiles_in
init 2023-10-27 09:49:33,218 DEBUG init: Received message on core_profiles_in
init 2023-10-27 09:49:33,218 DEBUG init: Waiting for message on transport_solver_numerics_in
init 2023-10-27 09:49:33,219 DEBUG init: Received message on transport_solver_numerics_in
init 2023-10-27 09:49:33,230 DEBUG init: Sending message on final_equilibrium_out
init 2023-10-27 09:49:33,251 DEBUG init: Sending message on final_core_profiles_out
init 2023-10-27 09:49:33,254 DEBUG init: Sending message on final_core_transport_out
init 2023-10-27 09:49:33,254 DEBUG init: Sending message on final_core_sources_out
init 2023-10-27 09:49:33,255 DEBUG init: Sending message on final_transport_solver_numerics_out
init 2023-10-27 09:49:33,256 DEBUG init: Closing port equilibrium_out
init 2023-10-27 09:49:33,256 DEBUG init: Sending message on equilibrium_out
init 2023-10-27 09:49:33,257 DEBUG init: Closing port core_profiles_out
init 2023-10-27 09:49:33,257 DEBUG init: Sending message on core_profiles_out
init 2023-10-27 09:49:33,257 DEBUG init: Closing port transport_solver_numerics_out
init 2023-10-27 09:49:33,258 DEBUG init: Sending message on transport_solver_numerics_out
init 2023-10-27 09:49:33,258 DEBUG init: Closing port final_equilibrium_out
init 2023-10-27 09:49:33,258 DEBUG init: Sending message on final_equilibrium_out
init 2023-10-27 09:49:33,261 DEBUG init: Closing port final_core_profiles_out
init 2023-10-27 09:49:33,261 DEBUG init: Sending message on final_core_profiles_out
init 2023-10-27 09:49:33,265 DEBUG init: Closing port final_core_transport_out
init 2023-10-27 09:49:33,265 DEBUG init: Sending message on final_core_transport_out
init 2023-10-27 09:49:33,266 DEBUG init: Closing port final_core_sources_out
init 2023-10-27 09:49:33,266 DEBUG init: Sending message on final_core_sources_out
init 2023-10-27 09:49:33,266 DEBUG init: Closing port final_transport_solver_numerics_out
init 2023-10-27 09:49:33,267 DEBUG init: Sending message on final_transport_solver_numerics_out
init 2023-10-27 09:49:33,267 DEBUG init: Waiting for message on equilibrium_in
init 2023-10-27 09:49:33,268 DEBUG init: Received message on equilibrium_in
init 2023-10-27 09:49:33,268 DEBUG init: Port equilibrium_in is now closed
init 2023-10-27 09:49:33,268 DEBUG init: Waiting for message on core_profiles_in
init 2023-10-27 09:49:33,269 DEBUG init: Received message on core_profiles_in
init 2023-10-27 09:49:33,269 DEBUG init: Port core_profiles_in is now closed
init 2023-10-27 09:49:33,270 DEBUG init: Waiting for message on transport_solver_numerics_in
init 2023-10-27 09:49:33,271 DEBUG init: Received message on transport_solver_numerics_in
init 2023-10-27 09:49:33,271 DEBUG init: Port transport_solver_numerics_in is now closed
(muscle3_venv) <g2dpc@s53 ~/GIT/ets_paf>

@LourensVeen
Contributor

I think I've reproduced this on another model I'm working on, which similarly has a component that initialises a state, dispatches it to the rest of the simulation, and then quits. That should make it easier to figure out what's going on. This model is getting closer to being done, after which I should have a bit more time to work on MUSCLE3 again...

@LourensVeen
Contributor

LourensVeen commented Dec 17, 2023

I've found the issue: if a component doesn't have any incoming ports, then it doesn't have to wait for any predecessors to tell it whether they will send another message. So a component like your init will simply never do a SHUTDOWN_WAIT, because there's nothing to wait for. That event therefore isn't in the database, and the profiler gets confused and crashes.

There's a symmetrical situation with components that have no outgoing ports, which don't need to do a DISCONNECT_WAIT. In the current implementation the SHUTDOWN_WAIT is skipped while the DISCONNECT_WAIT becomes a no-op but is recorded anyway, which is inconsistent.

So now the question is what to do. Do we make these events optional and only record them if they really happened? Or do we always record them because we're still moving through this stage of the shutdown even if there is no work to do? At least it should be consistent, and for the former the profiler needs to be updated to not crash if the events are missing.

I think it's nicer if they're always there, that makes the database format less flexible and thus makes it easier to write analysis code. And the profiler should be able to deal with missing events anyway, because there are now old files that don't have them, and because an instance can crash and not generate this event.

  • Always record a SHUTDOWN_WAIT event
  • Make the profiler robust against missing SHUTDOWN_WAIT events
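
Roughly the kind of thing the second point would involve (illustrative names only, not the actual profile_database.py code): if an instance has a start time but no recorded stop time, fall back to the latest stop time seen for any instance, or skip the instance if there is no stop information at all.

    # Illustrative sketch only: tolerate instances that have a start event but
    # no recorded stop event (e.g. a missing SHUTDOWN_WAIT). Names do not
    # reflect the actual MUSCLE3 profile_database.py implementation.
    from typing import Dict, List, Tuple


    def total_runtimes(
            start_run: Dict[str, int],
            stop_run: Dict[str, int]) -> Tuple[List[str], List[float]]:
        """Return instance names and total runtimes in seconds.

        Timestamps are taken to be nanoseconds, as in the traceback above.
        """
        # If an instance never recorded a stop, fall back to the latest stop
        # time seen for any instance; with no stop times at all, skip it.
        fallback_stop = max(stop_run.values()) if stop_run else None

        names: List[str] = []
        times: List[float] = []
        for name, start in start_run.items():
            stop = stop_run.get(name, fallback_stop)
            if stop is None:
                continue
            names.append(name)
            times.append((stop - start) * 1e-9)
        return names, times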

@LourensVeen LourensVeen added the ready for release Issue is fixed in develop and ready for the next release label Dec 18, 2023
@LourensVeen
Contributor

Released with 0.7.2. Please give it a try, and could you close this issue if it works for you?
