Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chassisd Crashed while Linecard is in the process of going down due to supervisor is rebooting #20543

Open
gechiang opened this issue Oct 18, 2024 · 2 comments
Assignees
Labels
Chassis 🤖 Modular chassis support Triaged this issue has been triaged

Comments

@gechiang
Copy link
Collaborator

Description

We have encountered a crash on pmon chassisd occasionally (not all the time) when we perform a "controlled" chassis reboot operation

We see the following:

Sep 14 03:27:56.016816 SONiC-CHASSIS-lc04 INFO pmon#supervisord: pcied Failed to load platform Pcie module. Error : No module named 'sonic_platform.pcie', Fallback to default module
Sep 14 03:27:56.018586 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:56,015 WARN received SIGTERM indicating exit request
Sep 14 03:27:56.018586 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:56,016 INFO waiting for supervisor-proc-exit-listener, rsyslogd, chassisd, ledd, xcvrd, syseepromd, thermalctld, pcied to die
Sep 14 03:27:56.022445 SONiC-CHASSIS-lc04 DEBUG container: container_stop: BEGIN
...
Sep 14 03:27:57.071290 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,070 INFO stopped: pcied (exit status 143)
Sep 14 03:27:57.071290 SONiC-CHASSIS-lc04 INFO pmon#thermal_infos: Caught signal 'SIGTERM' - exiting...
Sep 14 03:27:57.071648 SONiC-CHASSIS-lc04 INFO pmon#thermalctld: Stop thermal monitoring loop
Sep 14 03:27:57.075350 SONiC-CHASSIS-lc04 INFO pmon#thermal_infos: Shutting down with exit code 143...
...
Sep 14 03:27:57.147934 SONiC-CHASSIS-lc04 INFO pmon#Chassis: Caught signal 'SIGTERM' - exiting...
Sep 14 03:27:57.147934 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,147 INFO stopped: thermalctld (exit status 143)
Sep 14 03:27:57.147934 SONiC-CHASSIS-lc04 INFO pmon#Chassis: Shutting down...
Sep 14 03:27:57.199220 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,198 INFO stopped: syseepromd (exit status 143)
Sep 14 03:27:57.222627 SONiC-CHASSIS-lc04 NOTICE pmon#: CMIS: Stopped
...
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 0 local: msgs 19084 success 19084 fail 0 notpresent 0 unknown 0 minrspwait 14541 maxrspwait 458390 long_rsp 102 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 1 local: msgs 190 success 190 fail 0 notpresent 0 unknown 0 minrspwait 14777 maxrspwait 458110 long_rsp 2 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 2 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 3 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 4 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) channel 5 local: msgs 0 success 0 fail 0 notpresent 0 unknown 0 minrspwait -1 maxrspwait 0 long_rsp 0 timeouts 0 already_in_use 0
Sep 14 03:27:57.661238 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC (30 30) no_channel_avail 0 lock_held False
Sep 14 03:27:57.661303 SONiC-CHASSIS-lc04 WARNING pmon#python3: MDIPC destruction (30 30)
...
Sep 14 03:27:57.681317 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,680 INFO stopped: xcvrd (exit status 0)
Sep 14 03:27:57.720642 SONiC-CHASSIS-lc04 INFO teamd0#supervisord 2024-09-14 03:27:57,720 INFO stopped: tlm_teamd (exit status 0)
Sep 14 03:27:57.672780 SONiC-CHASSIS-lc04 INFO audisp-tacplus: message repeated 5 times: [ Audisp-tacplus: User command not match.]
Sep 14 03:27:57.812196 SONiC-CHASSIS-lc04 INFO audisp-tacplus: Audisp-tacplus: Accounting: user: _anp_dcfx_rw1, tty: pts0, host: SONiC-CHASSIS-lc04, command: /usr/bin/mkdir exit=0, type: 4, task ID: 16326
Sep 14 03:27:57.812354 SONiC-CHASSIS-lc04 INFO audisp-tacplus: Audisp-tacplus: User command not match.
Sep 14 03:27:57.868780 SONiC-CHASSIS-lc04 INFO macsec1#supervisord 2024-09-14 03:27:57,868 INFO stopped: macsecmgrd (exit status 255)
Sep 14 03:27:57.890803 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd Traceback (most recent call last):
Sep 14 03:27:57.890803 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd   File "/usr/local/bin/chassisd", line 701, in <module>
Sep 14 03:27:57.891075 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd     main()
Sep 14 03:27:57.891075 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd   File "/usr/local/bin/chassisd", line 696, in main
Sep 14 03:27:57.891435 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd     chassisd.run()
Sep 14 03:27:57.891435 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd   File "/usr/local/bin/chassisd", line 674, in run
Sep 14 03:27:57.891836 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd     self.module_updater.module_db_update()
Sep 14 03:27:57.891836 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd   File "/usr/local/bin/chassisd", line 362, in module_db_update
Sep 14 03:27:57.892103 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd     if fvs[CHASSIS_MODULE_INFO_NAME_FIELD] in notOnlineModules:
Sep 14 03:27:57.892103 SONiC-CHASSIS-lc04 INFO pmon#supervisord: chassisd KeyError: 'name'
Sep 14 03:27:57.812743 SONiC-CHASSIS-lc04 INFO audisp-tacplus: message repeated 5 times: [ Audisp-tacplus: User command not match.]
Sep 14 03:27:57.929118 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.928476022Z" level=info msg="shim disconnected" id=9f0d3cfc563a215f1aacd09578745c9c97e5ed7b15625273b4595eabbd69594f
Sep 14 03:27:57.929672 SONiC-CHASSIS-lc04 INFO dockerd[1054]: time="2024-09-14T03:27:57.928637322Z" level=info msg="ignoring event" container=9f0d3cfc563a215f1aacd09578745c9c97e5ed7b15625273b4595eabbd69594f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 14 03:27:57.929762 SONiC-CHASSIS-lc04 INFO dockerd[1054]: time="2024-09-14T03:27:57.928730602Z" level=info msg="ignoring event" container=90b8a0b6bd4cf8200217092d8d40f93d5ef4cd0ff147bafde8084e682c8ed3ec module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 14 03:27:57.929837 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929219822Z" level=warning msg="cleaning up after shim disconnected" id=9f0d3cfc563a215f1aacd09578745c9c97e5ed7b15625273b4595eabbd69594f namespace=moby
Sep 14 03:27:57.929887 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929255512Z" level=info msg="cleaning up dead shim"
Sep 14 03:27:57.929940 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929690723Z" level=info msg="shim disconnected" id=90b8a0b6bd4cf8200217092d8d40f93d5ef4cd0ff147bafde8084e682c8ed3ec
Sep 14 03:27:57.929985 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929779103Z" level=warning msg="cleaning up after shim disconnected" id=90b8a0b6bd4cf8200217092d8d40f93d5ef4cd0ff147bafde8084e682c8ed3ec namespace=moby
Sep 14 03:27:57.930043 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.929821413Z" level=info msg="cleaning up dead shim"
Sep 14 03:27:57.937792 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.937689636Z" level=warning msg="cleanup warnings time=\"2024-09-14T03:27:57Z\" level=info msg=\"starting signal loop\" namespace=moby pid=18546\n"
Sep 14 03:27:57.937970 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.937898466Z" level=warning msg="cleanup warnings time=\"2024-09-14T03:27:57Z\" level=info msg=\"starting signal loop\" namespace=moby pid=18545\n"
Sep 14 03:27:57.940506 SONiC-CHASSIS-lc04 INFO systemd[1]: var-lib-docker-overlay2-10a6f809013a503d751d7764ef84e284b6030d318ea3702d3caf23d31452f080-merged.mount: Succeeded.
Sep 14 03:27:57.942302 SONiC-CHASSIS-lc04 INFO dockerd[1054]: time="2024-09-14T03:27:57.942211257Z" level=info msg="ignoring event" container=ff81f2c901824761f856f0388b75664f1af296b0aa773f9e15fa2b9676953fe7 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 14 03:27:57.942362 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.942118797Z" level=info msg="shim disconnected" id=ff81f2c901824761f856f0388b75664f1af296b0aa773f9e15fa2b9676953fe7
Sep 14 03:27:57.942387 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.942305527Z" level=warning msg="cleaning up after shim disconnected" id=ff81f2c901824761f856f0388b75664f1af296b0aa773f9e15fa2b9676953fe7 namespace=moby
Sep 14 03:27:57.942408 SONiC-CHASSIS-lc04 INFO containerd[856]: time="2024-09-14T03:27:57.942333647Z" level=info msg="cleaning up dead shim"
Sep 14 03:27:57.951360 SONiC-CHASSIS-lc04 INFO pmon#supervisord 2024-09-14 03:27:57,950 INFO exited: chassisd (exit status 1; not expected)
...

here is the code segment where the crash occured:

...

        # In line card push the hostname of the module and num_asics to the chassis state db.
        # The hostname is used as key to access chassis app db entries
        if not self._is_supervisor():
           hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1)
           hostname = try_get(device_info.get_hostname, default="None")
           hostname_fvs = swsscommon.FieldValuePairs([(CHASSIS_MODULE_INFO_SLOT_FIELD, str(self.my_slot)),
                                                        (CHASSIS_MODULE_INFO_HOSTNAME_FIELD, hostname),
                                                        (CHASSIS_MODULE_INFO_NUM_ASICS_FIELD, str(len(module_info_dict[CHASSIS_MODULE_INFO_ASICS])))])
           self.hostname_table.set(hostname_key, hostname_fvs)

        # Asics that are on the "not online" modules need to be cleaned up
        asics = list(self.asic_table.getKeys())
        for asic in asics:
            fvs = self.asic_table.get(asic)
            if isinstance(fvs, list):
                fvs = dict(fvs[-1])
            if fvs[CHASSIS_MODULE_INFO_NAME_FIELD] in notOnlineModules:
                self.asic_table._del(asic)

Steps to reproduce the issue (Not always reproducible):

  1. perform sudo reboot from supervisor
  2. one of the LC "may" experience this PMON chassisd crash issue.

Not sure if we are able to reproduce this again as this seems to be some race condition that may not easily hit.

Describe the results you received:

Describe the results you expected:

we should handle the module going down gracefully and not crash

Output of show version:

This is one of the chassis image we are testing based on 202205 branch

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@gechiang gechiang added the Chassis 🤖 Modular chassis support label Oct 18, 2024
@mlok-nokia
Copy link
Contributor

@gechiang Is it possible to get all log files from this LC and SUP? We need to have both syslog files to check the reboot state (down cycle/up cycle) on both LC and SUP based on the timestamp in the syslog. These info will help to reproduce the issue.
Thanks

@mlok-nokia
Copy link
Contributor

Based on the syslog, while chassisd is running the update_db_module(), reboot command is trying to stop the PMON. This causes the backtrace of the chassisd. We will continue to investigate it.

@vdahiya12 vdahiya12 added the Triaged this issue has been triaged label Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Chassis 🤖 Modular chassis support Triaged this issue has been triaged
Projects
Status: No status
Development

No branches or pull requests

3 participants