
Mix of reef snap versions results in microceph.daemon service failures #367

Closed
javacruft opened this issue Jun 14, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@javacruft
Collaborator

javacruft commented Jun 14, 2024

Issue Report

What version of MicroCeph are you using

reef/edge but mix of versions - 981 and 1026

microceph             18.2.0+snap556b907075  1026   reef/edge           canonical**  - on node-01
microceph             18.2.0+snap71f71782c5  981    reef/edge           canonical**  held other nodes

What are the steps to reproduce this issue ?

Multi-node local deployment using https://microstack.run/docs

What happens (observed behaviour) ?

I believe one of the snaps refreshed which then caused some of the clustering daemons to fail with the following error:

Jun 13 12:40:04 node-02.dom systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: Error: Unable to start daemon: Daemon failed to start: Failed to re-establish cluster connection: Failed to update schema version when joining cluster: no such column: schema
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: time="2024-06-13T12:40:04+02:00" level=info msg="Daemon stopped"
Jun 13 12:40:04 node-02.dom microceph.daemon[6409]: time="2024-06-13T12:40:04+02:00" level=debug msg="Database error" err="Failed to update schema version when joining cluster: no such column: schema"

What were you expecting to happen ?

For the mix of revisions to handle this schema upgrade/change in a more elegant fashion

@fnordahl
Member

FWIW, we are seeing a similar issue; there is a discussion with the LXD/Microcluster team in canonical/microovn#121

@UtkarshBhatthere
Contributor

@sabaini is this the schema incompatibility thingy you mentioned yesterday?

@UtkarshBhatthere UtkarshBhatthere added the bug Something isn't working label Jun 14, 2024
@UtkarshBhatthere
Contributor

@masnax any pointers on making it compatible with older revisions?

@sabaini
Collaborator

sabaini commented Jun 14, 2024

@UtkarshBhatthere yes, I was referring to this

@sabaini
Collaborator

sabaini commented Jun 14, 2024

I could reproduce this locally by upgrading one out of three nodes from stable to edge.

Steps:

  • Install microceph reef/stable (rev 981) on 3 nodes
  • Bootstrap and cluster 3 nodes
  • Add OSDs
  • Upgrade one to reef/edge (rev 1026)
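The steps above can be sketched roughly as the following command sequence (a sketch only: the join-token exchange between nodes and disk device names will vary with your environment and MicroCeph version):

```shell
# On each of the three nodes: install the older revision from reef/stable.
sudo snap install microceph --channel reef/stable

# On the first node: bootstrap the cluster.
sudo microceph cluster bootstrap

# On the first node: generate a join token for each additional node...
sudo microceph cluster add node-02

# ...and on that node, join using the token printed above.
sudo microceph cluster join <token>

# On each node: add storage (a real disk here; loop files also work).
sudo microceph disk add /dev/sdb

# Finally, refresh exactly one node to the newer channel to trigger
# the schema version mismatch described in this issue.
sudo snap refresh microceph --channel reef/edge
```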

In /v/l/syslog I see these messages:

Jun 14 08:29:48 aa-0 microceph.daemon[9040]: time="2024-06-14T08:29:48Z" level=debug msg="Database error" err="schema check gracefully aborted"
Jun 14 08:29:48 aa-0 microceph.daemon[9040]: time="2024-06-14T08:29:48Z" level=warning msg="Waiting for other cluster members to upgrade their versions" address="https://240.22.0.77:7443"

This seems to point to a schema migration issue.

@sabaini
Collaborator

sabaini commented Jun 14, 2024

Ticket CEPH-766

@mkalcok

mkalcok commented Jun 14, 2024

A bit more context can also be found in canonical/microcluster#66. The bottom line is that this is currently expected behavior: if there is a DB schema change, all members of the cluster must upgrade before the API becomes available again.
We are in talks (last few comments in the PR mentioned by @fnordahl) about improving the error message.
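To illustrate the gate described above, here is a minimal Go sketch of the idea: the database refuses requests until every cluster member reports the same schema version. The `member` type and `checkSchemaVersions` function are hypothetical illustrations, not the actual microcluster API; only the error message mirrors the one this issue's verification shows.

```go
package main

import "fmt"

// member holds the schema version a cluster member reports.
// Illustrative only; not the real microcluster data structure.
type member struct {
	Name          string
	SchemaVersion int
}

// checkSchemaVersions mimics the upgrade gate: if any member still runs
// an older schema version than the local node, the database stays
// unavailable and an explanatory error is returned instead.
func checkSchemaVersions(local int, members []member) error {
	behind := 0
	for _, m := range members {
		if m.SchemaVersion < local {
			behind++
		}
	}
	if behind > 0 {
		return fmt.Errorf("Database is waiting for an upgrade: %d cluster members have not yet received the update", behind)
	}
	return nil
}

func main() {
	// One node upgraded (schema 2), two still on the old revision (schema 1).
	members := []member{
		{"node-01", 2},
		{"node-02", 1},
		{"node-03", 1},
	}
	if err := checkSchemaVersions(2, members); err != nil {
		fmt.Println(err)
		// prints: Database is waiting for an upgrade: 2 cluster members have not yet received the update
	}
}
```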

@masnax
Contributor

masnax commented Jun 25, 2024

By the way, this should have been fixed by #371 which included canonical/microcluster#150.

@sabaini
Collaborator

sabaini commented Jul 23, 2024

Verification: performed an upgrade on a single node in a 3-node cluster and got an appropriate error message.

$ sudo microceph status
Error: failed listing disks: Database is waiting for an upgrade: 2 cluster members have not yet received the update

Other cluster members are functional, as is Ceph itself.

I believe this can be closed.

@sabaini sabaini closed this as completed Jul 23, 2024

6 participants