We have occasionally (not every time) encountered a crash of pmon chassisd when we perform a "controlled" chassis reboot operation.
We see the following:
Sep 14 03:27:56.016816 SONiC-CHASSIS-lc04 INFO pmon#supervisord: pcied Failed to load platform Pcie module. Error : No module named 'sonic_platform.pcie', Fallback to default module
Sep 14 03:27:56.018586 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:56,015 WARN received SIGTERM indicating exit request
Sep 14 03:27:56.018586 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:56,016 INFO waiting for supervisor-proc-exit-listener, rsyslogd, chassisd, ledd, xcvrd, syseepromd, thermalctld, pcied to die
Sep 14 03:27:56.022445 SONiC-CHASSIS-lc04 DEBUG container: container_stop: BEGIN
...
Sep 14 03:27:57.071290 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,070 INFO stopped: pcied (exit status 143)
Sep 14 03:27:57.071290 SONiC-CHASSIS-lc04 INFO pmon#thermal_infos: Caught signal 'SIGTERM' - exiting...
Sep 14 03:27:57.071648 SONiC-CHASSIS-lc04 INFO pmon#thermalctld: Stop thermal monitoring loop
Sep 14 03:27:57.075350 SONiC-CHASSIS-lc04 INFO pmon#thermal_infos: Shutting down with exit code 143...
...
Sep 14 03:27:57.147934 SONiC-CHASSIS-lc04 INFO pmon#Chassis: Caught signal 'SIGTERM' - exiting...
Sep 14 03:27:57.147934 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,147 INFO stopped: thermalctld (exit status 143)
Sep 14 03:27:57.147934 SONiC-CHASSIS-lc04 INFO pmon#Chassis: Shutting down...
Sep 14 03:27:57.199220 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,198 INFO stopped: syseepromd (exit status 143)
Sep 14 03:27:57.222627 SONiC-CHASSIS-lc04 NOTICE pmon#: CMIS: Stopped
...
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 0 local: msgs 19084 success 19084 fail 0 notpresent 0 unknown 0 minrspwait 14541 maxrspwait 458390 long_rsp 102 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 1 local: msgs 190 success 190 fail 0 notpresent 0 unknown 0 minrspwait 14777 maxrspwait 458110 long_rsp 2 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 2 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 3 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 4 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 5 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) no_channel_avail 0 lock_held False
Sep 14 03:27:57.661303 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC destruction (30 30)
...
Sep 14 03:27:57.681317 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,680 INFO stopped: xcvrd (exit status 0)
Sep 14 03:27:57.720642 SONiC-CHASSIS-lc04 INFO teamd0#supervisord 2024-09-14 03:27:57,720 INFO stopped: tlm_teamd (exit status 0)
Sep 14 03:27:57.672780 SONiC-CHASSIS-lc04 INFO audisp-tacplus: message repeated 5 times: [ Audisp-tacplus: User command not match.]
Sep 14 03:27:57.812196 SONiC-CHASSIS-lc04 INFO audisp-tacplus: Audisp-tacplus: Accounting: user: _anp_dcfx_rw1, tty: pts0, host: SONiC-CHASSIS-lc04, command: /usr/bin/mkdir exit=0, type: 4, task ID: 16326
Sep 14 03:27:57.812354 SONiC-CHASSIS-lc04 INFO audisp-tacplus: Audisp-tacplus: User command not match.
Sep 14 03:27:57.868780 SONiC-CHASSIS-lc04 INFO macsec1#supervisord 2024-09-14 03:27:57,868 INFO stopped: macsecmgrd (exit status 255)
Sep 14 03:27:57.890803 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd Traceback (most recent call last):
Sep 14 03:27:57.890803 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 701, in <module>
Sep 14 03:27:57.891075 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd main()
Sep 14 03:27:57.891075 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 696, in main
Sep 14 03:27:57.891435 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd chassisd.run()
Sep 14 03:27:57.891435 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 674, in run
Sep 14 03:27:57.891836 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd self.module_updater.module_db_update()
Sep 14 03:27:57.891836 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd File "/usr/local/bin/chassisd", line 362, in module_db_update
Sep 14 03:27:57.892103 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd if fvs[CHASSIS_MODULE_INFO_NAME_FIELD] in notOnlineModules:
Sep 14 03:27:57.892103 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd KeyError: 'name'
Sep 14 03:27:57.812743 SONiC-CHASSIS-lc04 INFO audisp-tacplus: message repeated 5 times: [ Audisp-tacplus: User command not match.]
Sep 14 03:27:57.929118 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.928476022Z" level=info msg="shim disconnected" id=9f0d3cfc563a215f1aacd09578745c9c97e5ed7b15625273b4595eabbd69594f
Sep 14 03:27:57.929672 SONiC-CHASSIS-lc04 INFO dockerd[1054]: time="2024-09-14T03:27:57.928637322Z" level=info msg="ignoring event" container=9f0d3cfc563a215f1aacd09578745c9c97e5ed7b15625273b4595eabbd69594f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 14 03:27:57.929762 SONiC-CHASSIS-lc04 INFO dockerd[1054]: time="2024-09-14T03:27:57.928730602Z" level=info msg="ignoring event" container=90b8a0b6bd4cf8200217092d8d40f93d5ef4cd0ff147bafde8084e682c8ed3ec module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 14 03:27:57.929837 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929219822Z" level=warning msg="cleaning up after shim disconnected" id=9f0d3cfc563a215f1aacd09578745c9c97e5ed7b15625273b4595eabbd69594f namespace=moby
Sep 14 03:27:57.929887 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929255512Z" level=info msg="cleaning up dead shim"
Sep 14 03:27:57.929940 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929690723Z" level=info msg="shim disconnected" id=90b8a0b6bd4cf8200217092d8d40f93d5ef4cd0ff147bafde8084e682c8ed3ec
Sep 14 03:27:57.929985 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929779103Z" level=warning msg="cleaning up after shim disconnected" id=90b8a0b6bd4cf8200217092d8d40f93d5ef4cd0ff147bafde8084e682c8ed3ec namespace=moby
Sep 14 03:27:57.930043 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929821413Z" level=info msg="cleaning up dead shim"
Sep 14 03:27:57.937792 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.937689636Z" level=warning msg="cleanup warnings time=\"2024-09-14T03:27:57Z\" level=info msg=\"starting signal loop\" namespace=moby pid=18546\n"
Sep 14 03:27:57.937970 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.937898466Z" level=warning msg="cleanup warnings time=\"2024-09-14T03:27:57Z\" level=info msg=\"starting signal loop\" namespace=moby pid=18545\n"
Sep 14 03:27:57.940506 SONiC-CHASSIS-lc04 INFO systemd[1]: var-lib-docker-overlay2-10a6f809013a503d751d7764ef84e284b6030d318ea3702d3caf23d31452f080-merged.mount: Succeeded.
Sep 14 03:27:57.942302 SONiC-CHASSIS-lc04 INFO dockerd[1054]: time="2024-09-14T03:27:57.942211257Z" level=info msg="ignoring event" container=ff81f2c901824761f856f0388b75664f1af296b0aa773f9e15fa2b9676953fe7 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 14 03:27:57.942362 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.942118797Z" level=info msg="shim disconnected" id=ff81f2c901824761f856f0388b75664f1af296b0aa773f9e15fa2b9676953fe7
Sep 14 03:27:57.942387 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.942305527Z" level=warning msg="cleaning up after shim disconnected" id=ff81f2c901824761f856f0388b75664f1af296b0aa773f9e15fa2b9676953fe7 namespace=moby
Sep 14 03:27:57.942408 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.942333647Z" level=info msg="cleaning up dead shim"
Sep 14 03:27:57.951360 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,950 INFO exited: chassisd (exit status 1; not expected)
...
Here is the code segment where the crash occurred:
...
# In line card push the hostname of the module and num_asics to the chassis state db.
# The hostname is used as key to access chassis app db entries
if not self._is_supervisor():
    hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1)
    hostname = try_get(device_info.get_hostname, default="None")
    hostname_fvs = swsscommon.FieldValuePairs([(CHASSIS_MODULE_INFO_SLOT_FIELD, str(self.my_slot)),
                                               (CHASSIS_MODULE_INFO_HOSTNAME_FIELD, hostname),
                                               (CHASSIS_MODULE_INFO_NUM_ASICS_FIELD, str(len(module_info_dict[CHASSIS_MODULE_INFO_ASICS])))])
    self.hostname_table.set(hostname_key, hostname_fvs)

# Asics that are on the "not online" modules need to be cleaned up
asics = list(self.asic_table.getKeys())
for asic in asics:
    fvs = self.asic_table.get(asic)
    if isinstance(fvs, list):
        fvs = dict(fvs[-1])
    if fvs[CHASSIS_MODULE_INFO_NAME_FIELD] in notOnlineModules:
        self.asic_table._del(asic)
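The `KeyError: 'name'` in the traceback suggests that an ASIC entry's field-value pairs can lose the `name` field while PMON is being torn down. One defensive option is to look the field up with `dict.get()` and skip incomplete entries instead of indexing directly. This is only a sketch, not the actual SONiC fix: the standalone helper `cleanup_offline_asics` and the `FakeTable`-style duck typing are ours, and we assume the table object exposes `getKeys()`/`get()`/`_del()` as in the chassisd code above.

```python
CHASSIS_MODULE_INFO_NAME_FIELD = 'name'

def cleanup_offline_asics(asic_table, not_online_modules):
    """Remove ASIC entries whose owning module is not online.

    asic_table is assumed to behave like the swsscommon Table wrapper
    used by chassisd (getKeys()/get()/_del()). Entries whose field-value
    pairs no longer carry the 'name' field -- e.g. because they were
    deleted mid-iteration during shutdown -- are skipped rather than
    raising KeyError.
    """
    for asic in list(asic_table.getKeys()):
        fvs = asic_table.get(asic)
        if isinstance(fvs, list):
            fvs = dict(fvs[-1])
        # Tolerate a missing or empty entry instead of crashing.
        module_name = fvs.get(CHASSIS_MODULE_INFO_NAME_FIELD) if fvs else None
        if module_name is None:
            continue
        if module_name in not_online_modules:
            asic_table._del(asic)
```

With this shape, an entry that vanishes (or is written partially) between `getKeys()` and `get()` is simply skipped on this pass and cleaned up on a later one.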
Steps to reproduce the issue (Not always reproducible):
Perform sudo reboot from the supervisor.
One of the LCs "may" experience this PMON chassisd crash.
We are not sure we can reproduce this again, as it appears to be a race condition that is not easily hit.
Describe the results you received:
Describe the results you expected:
We should handle the module going down gracefully and not crash.
Output of show version:
This is one of the chassis images we are testing, based on the 202205 branch.
(paste your output here)
Output of show techsupport:
(paste your output here or download and attach the file here )
Additional information you deem important (e.g. issue happens only occasionally):
@gechiang Is it possible to get all log files from this LC and the SUP? We need the syslog files from both to check the reboot state (down cycle/up cycle) on both LC and SUP based on the syslog timestamps. This information will help us reproduce the issue.
Thanks
Based on the syslog, while chassisd is running module_db_update(), the reboot command is stopping PMON. This causes the chassisd backtrace. We will continue to investigate.
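A common way to keep a daemon from being interrupted mid-update by a reboot is to have the SIGTERM handler only set a flag, which the main loop checks between (not inside) DB update passes. The sketch below is illustrative only: the `GracefulDaemon` class and its method names are ours, not chassisd's actual structure, and this alone does not remove the race with other processes deleting DB entries, it only keeps shutdown requests from tearing state out from under an in-flight pass.

```python
import signal

class GracefulDaemon:
    """Illustrative shutdown pattern: SIGTERM only sets a flag; the main
    loop re-checks it between update passes, so a reboot request cannot
    interrupt an update (e.g. chassisd's module_db_update()) halfway
    through."""

    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Do no real work in the handler; just record the request.
        self.stop_requested = True

    def run(self, update_once):
        while not self.stop_requested:
            update_once()  # one full DB update pass, never interrupted
```

The daemon then exits at a pass boundary, with the DB in a consistent state, instead of raising from a half-finished iteration.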