
Mix of reef snap versions results in microceph.daemon service failures #367

Closed
javacruft opened this issue Jun 14, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@javacruft
Collaborator

javacruft commented Jun 14, 2024

Issue Report

What version of MicroCeph are you using

reef/edge but mix of versions - 981 and 1026

microceph             18.2.0+snap556b907075  1026   reef/edge           canonical**  - on node-01
microceph             18.2.0+snap71f71782c5  981    reef/edge           canonical**  held other nodes

What are the steps to reproduce this issue ?

Multi-node local deployment using https://microstack.run/docs

What happens (observed behaviour) ?

I believe one of the snaps refreshed which then caused some of the clustering daemons to fail with the following error:

Jun 13 12:40:04 node-02.dom systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: Error: Unable to start daemon: Daemon failed to start: Failed to re-establish cluster connection: Failed to update schema version when joining cluster: no such column: schema
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: time="2024-06-13T12:40:04+02:00" level=info msg="Daemon stopped"
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: time="2024-06-13T12:40:04+02:00" level=debug msg="Database error" err="Failed to update schema version when joining cluster: no such column: schema"

What were you expecting to happen ?

For the mix of revisions to handle this schema upgrade/change in a more elegant fashion

@fnordahl
Member

FWIW, we are seeing a similar issue; there is a discussion with the LXD/Microcluster team in canonical/microovn#121

@UtkarshBhatthere
Contributor

@sabaini is this the schema incompatibility thingy you mentioned yesterday?

@UtkarshBhatthere UtkarshBhatthere added the bug Something isn't working label Jun 14, 2024
@UtkarshBhatthere
Contributor

@masnax any pointers on making it compatible with older revisions?

@sabaini
Collaborator

sabaini commented Jun 14, 2024

@UtkarshBhatthere yes, I was referring to this

@sabaini
Collaborator

sabaini commented Jun 14, 2024

I could reproduce this locally by upgrading one out of three nodes from stable to edge.

Steps:

  • Install microceph reef/stable (rev 981) on 3 nodes
  • Bootstrap and cluster 3 nodes
  • Add OSDs
  • Upgrade one to reef/edge (rev 1026)
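The steps above can be sketched roughly as the following command sequence (a sketch only: the join-token exchange between nodes and disk device names will vary with your environment and MicroCeph version):

```shell
# On each of the three nodes: install the older revision from reef/stable.
sudo snap install microceph --channel reef/stable

# On the first node: bootstrap the cluster.
sudo microceph cluster bootstrap

# On the first node: generate a join token for each additional node...
sudo microceph cluster add node-02

# ...and on that node, join using the token printed above.
sudo microceph cluster join <token>

# On each node: add storage (a real disk here; loop files also work).
sudo microceph disk add /dev/sdb

# Finally, refresh exactly one node to the newer channel to trigger
# the schema version mismatch described in this issue.
sudo snap refresh microceph --channel reef/edge
```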

In /v/l/syslog I see these messages:

Jun 14 08:29:48 aa-0 microceph.daemon[9040]: time="2024-06-14T08:29:48Z" level=debug msg="Database error" err="schema check gracefully aborted"
Jun 14 08:29:48 aa-0 microceph.daemon[9040]: time="2024-06-14T08:29:48Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://240.22.0.77:7443"

This seems to point to a schema migration issue.

@sabaini
Collaborator

sabaini commented Jun 14, 2024

Ticket CEPH-766

@mkalcok

mkalcok commented Jun 14, 2024

A bit more context can also be found in canonical/microcluster#66. The bottom line is that this is currently expected behavior: if there is a DB schema change, all members of the cluster must upgrade before the API becomes available again.
We are in talks (last few comments in the PR mentioned by @fnordahl) about improving the error message.
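To illustrate the gate described above, here is a minimal Go sketch of the idea: the database refuses requests until every cluster member reports the same schema version. The `member` type and `checkSchemaVersions` function are hypothetical illustrations, not the actual microcluster API; only the error message mirrors the one this issue's verification shows.

```go
package main

import "fmt"

// member holds the schema version a cluster member reports.
// Illustrative only; not the real microcluster data structure.
type member struct {
	Name          string
	SchemaVersion int
}

// checkSchemaVersions mimics the upgrade gate: if any member still runs
// an older schema version than the local node, the database stays
// unavailable and an explanatory error is returned instead.
func checkSchemaVersions(local int, members []member) error {
	behind := 0
	for _, m := range members {
		if m.SchemaVersion < local {
			behind++
		}
	}
	if behind > 0 {
		return fmt.Errorf("Database is waiting for an upgrade: %d cluster members have not yet received the update", behind)
	}
	return nil
}

func main() {
	// One node upgraded (schema 2), two still on the old revision (schema 1).
	members := []member{
		{"node-01", 2},
		{"node-02", 1},
		{"node-03", 1},
	}
	if err := checkSchemaVersions(2, members); err != nil {
		fmt.Println(err)
		// prints: Database is waiting for an upgrade: 2 cluster members have not yet received the update
	}
}
```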

@masnax
Contributor

masnax commented Jun 25, 2024

By the way, this should have been fixed by #371 which included canonical/microcluster#150.

@sabaini
Collaborator

sabaini commented Jul 23, 2024

Verification: performed an upgrade on a single node in a 3-node cluster and got an appropriate error message.

$ sudo microceph status
Error: failed listing disks: Database is waiting for an upgrade: 2 cluster members have not yet received the update

Other cluster members are functional, as is Ceph itself.

I believe this can be closed.

@sabaini sabaini closed this as completed Jul 23, 2024

6 participants