
api: Adding support for 'api extensions' #121

Closed

Conversation

@gabrielmougard (Contributor) commented Feb 16, 2024

Pre-requisite PR: canonical/microcluster#86


We'll need a system to centralize the extensions (that is, optional features that other MicroCluster-backed services might use) of different MicroCluster-based services.

We propose to centralize the extensions in the MicroCluster database, where a service (MicroOVN in this case) writes its OVN-related extensions when it is bootstrapped.

This will be needed by #113 and canonical/microcloud#245 to check that the deployed MicroOVN used by MicroCloud supports custom IP encapsulation for an OVN Geneve tunnel.
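
As a rough sketch (hypothetical helper names, not the actual MicroOVN or microcluster code), a service would declare the extensions it supports and any member could later check the cluster-wide set read back from the database:

// Hypothetical illustration of declaring and checking an API extension;
// the real microcluster registration API may differ.
package main

import "fmt"

// Extensions MicroOVN would advertise to the rest of the cluster.
var ovnExtensions = []string{
	"custom_encapsulation_ip",
}

// hasExtension reports whether a given extension is present in the set
// recorded for the cluster (e.g. as read back from the MicroCluster database).
func hasExtension(clusterExtensions []string, name string) bool {
	for _, ext := range clusterExtensions {
		if ext == name {
			return true
		}
	}
	return false
}

func main() {
	// A MicroCloud-side check before enabling a feature gated on the extension.
	fmt.Println(hasExtension(ovnExtensions, "custom_encapsulation_ip")) // true
}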

@mkalcok (Contributor) commented Feb 20, 2024

/canonical/self-hosted-runners/run-workflows a54d5c9

@mkalcok (Contributor) commented Feb 20, 2024

Overall LGTM. I feel much better about this feature being part of microcluster rather than a standalone implementation in microovn.
However, I assume that for this PR to pass we'll need to wait for microcluster to merge canonical/microcluster#86 and then upgrade our dependency, correct?

@gabrielmougard (Contributor, Author) commented Feb 20, 2024

@mkalcok exactly. The dependency upgrade shouldn't be too problematic; there will be only one thing to change in microovn/cmd/microovnd/main.go:

h.OnBootstrap = ovn.Bootstrap

would become

h.PreBootstrap = ovn.Bootstrap

or

h.PostBootstrap = ovn.Bootstrap

Not entirely sure, but I'm more confident about the PreBootstrap option.

@gabrielmougard changed the title from "Api: Adding support for 'api extensions'" to "api: Adding support for 'api extensions'" on Apr 29, 2024
@gabrielmougard (Contributor, Author) commented

@mkalcok updated

@mkalcok (Contributor) commented May 2, 2024

/canonical/self-hosted-runners/run-workflows 899b41d

@gabrielmougard (Contributor, Author) commented May 3, 2024

@mkalcok the TLS system test is failing here. I don't quite understand why... Do you have an idea (I must admit I'm running out of ideas)?

@mkalcok (Contributor) commented May 3, 2024

@gabrielmougard Originally I thought it was a pretty simple case of bad bash syntax: a missing space between the expression and the closing ] bracket in the if statement.

expression : [ -n SHA1 Fingerprint=CB:DC:0A:6F:99:21:70:DB:92:27:1A:D6:BD:2E:AC:1C:D0:5C:1E:7C]

Then I realized that you didn't change any tests and the error is coming from the BATS library 🤔. I'll take a look. We are sourcing BATS directly from its main branch, so it's possible that some error was introduced there.

@@ -29,7 +29,7 @@ func regenerateCaPut(s *state.State, r *http.Request) response.Response {
 	responseData := types.NewRegenerateCaResponse()

 	// Check that this is the initial node that received the request and recreate new CA certificate
-	if !client.IsForwardedRequest(r) {
+	if !client.IsNotification(r) {
@mkalcok (Contributor) commented May 3, 2024

@gabrielmougard So I checked and the problem is not with BATS. The command microovn certificates regenerate-ca indeed does not work. It fails with

Error: command failed: failed to generate new CA: Put "http://control.socket/1.0/ca": context deadline exceeded

as the CI shows.

I also noticed that it absolutely obliterates the CPU 😆. So perhaps this change from IsForwardedRequest to IsNotification is not working as expected.

The way this command is supposed to work is that the original node that receives the request from the client forwards it to the rest of the nodes in the cluster. This condition is then used to distinguish whether the node received a direct request from the client (and should forward requests to all other cluster members), or whether it's a forwarded message and the node should just do its own thing.

The high CPU consumption suggests to me that even the servers that received the forwarded message then try to forward it again to everyone else, creating a kind of death spiral.

But that's just a guess. I didn't dive too deep into it.
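
For illustration, the intended fan-out guard looks roughly like this (a self-contained sketch with hypothetical names, not the actual MicroOVN handler): forwarded requests carry a marker so that the receiving members do not forward them again.

// Hypothetical sketch of the fan-out guard: only the node handling the
// original client request forwards it; notified members just act locally.
package main

import (
	"fmt"
	"net/http"
)

const notificationHeader = "X-MicroOVN-Notification" // hypothetical marker header

func isNotification(r *http.Request) bool {
	return r.Header.Get(notificationHeader) != ""
}

func regenerateCA(r *http.Request, peers []string) {
	if !isNotification(r) {
		// Original client request: forward once to every other member,
		// marking the forwarded copies as notifications.
		for _, peer := range peers {
			req, err := http.NewRequest(http.MethodPut, "https://"+peer+"/1.0/ca", nil)
			if err != nil {
				continue
			}
			req.Header.Set(notificationHeader, "1")
			_, _ = http.DefaultClient.Do(req) // error handling elided for brevity
		}
	}
	// Both the original node and the notified nodes regenerate their own CA copy.
	fmt.Println("regenerating local CA")
}

func main() {
	req, _ := http.NewRequest(http.MethodPut, "http://localhost/1.0/ca", nil)
	regenerateCA(req, []string{"10.0.0.2:6443", "10.0.0.3:6443"})
}

If the guard is missing or inverted, every notified member re-broadcasts to everyone else, which would match the runaway CPU usage observed here.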

@gabrielmougard (Contributor, Author) commented

@mkalcok can we re-run the tests please?

@mkalcok (Contributor) commented May 3, 2024

/canonical/self-hosted-runners/run-workflows eaf8922

@mkalcok (Contributor) commented May 3, 2024

@gabrielmougard I wonder about the upgrade test failure. Is it possible that changes to schema.go, even though they look just cosmetic, trigger some kind of schema change that requires all cluster members to be upgraded before the dqlite cluster is accessible again?

@gabrielmougard (Contributor, Author) commented May 3, 2024

@mkalcok that is very possible. We introduced a change in the SQL extension mechanism in MicroCluster (canonical/microcluster#94). @masnax do you know how this would work with MicroOVN's custom SQL updates?

@mkalcok (Contributor) commented May 3, 2024

Hmm.. I tried slapping this piece of code after the snap upgrade because it was enough previously when the upgrade required a microovn db schema upgrade. However, it doesn't seem that the cluster recovers within 30 seconds. You probably know more about the microcluster schema upgrades than I do. Could you try manually:

  • set up a microovn cluster with the snap from the store
  • upgrade microovn with the snap from this branch (on every node)
  • figure out why the cluster is not coming up (microovn cluster list)

In any case, even after it gets resolved, since this change introduces a schema upgrade, we won't be able to backport it to the branch-22.03 branch. Though I don't suppose that's an issue for you @gabrielmougard. (just FYI @fnordahl)

Tangentially (also cc @fnordahl): if this becomes part of branch-24.03, we'll have a bit of a chicken-and-egg problem. MicroOVN will require the whole cluster to be updated before it starts to work properly again, but the underlying OVN will require a gradual upgrade (chassis nodes -> regular central nodes -> central node that performs the OVN db schema conversion). I wonder how we should approach this situation. Perhaps we could create a 24.03-pre-upgrade track in the snap store, and the upgrade process to 24.03 could look like this:

  • 22.03/stable (Starting point)
  • 24.03-pre-upgrade (Upgrading to this version will do the MicroOVN schema upgrade, but the underlying OVN will still be 22.03)
  • 24.03/stable (Upgrading to this version will upgrade OVN to 24.03 and do the OVN db schema conversion)

@masnax (Contributor) commented May 3, 2024

Hmm.. I tried slapping this piece of code after the snap upgrade because it was enough previously when the upgrade required a microovn db schema upgrade. However, it doesn't seem that the cluster recovers within 30 seconds. [...]

This is a consequence of the fact that the introduction of API extensions is itself a schema extension. To give an example, imagine 3 nodes.

  • node01 runs snap refresh and detects that its expected schema version is ahead of the other cluster members, so it waits.

  • node02 runs snap refresh and detects the same thing, but is only blocked on node03

  • node03 runs snap refresh, and since the previous two nodes have already updated their expected schema versions, this node can proceed with committing the changes to the schema.

  • node03 now progresses to comparing API versions since its updated schema supports this. It detects that it has the highest expected API version because node01 and node02 did not have the necessary schema updates in the earlier steps to record their expected API version, so node03 waits for those nodes to detect that node03 has committed the schema, so they can record their expected API updates.

So you get into a situation where all 3 nodes are waiting for each other. After 30s the loop repeats, so node01 will detect that its schema version matches the other nodes', and then waits only on node02 to update its API version. Node02 then finally does this, and the database opens for access.

So because the schema update that introduces API updates is part of the same update that increments the number of API updates, the update process takes at least 30s in this case.
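
A condensed sketch of the agreement check being described (hypothetical types, not the microcluster source): each member records the schema and API versions it expects, and nothing is committed until every member agrees.

// Hypothetical illustration of the cluster-wide agreement check.
package main

import "fmt"

type member struct {
	name             string
	expectedSchema   int
	expectedAPICount int // number of API extensions the member expects
}

// everyoneAgrees reports whether all members expect the same versions,
// i.e. whether it is safe to commit the schema change and open the database.
func everyoneAgrees(members []member) bool {
	for _, m := range members[1:] {
		if m.expectedSchema != members[0].expectedSchema ||
			m.expectedAPICount != members[0].expectedAPICount {
			return false
		}
	}
	return true
}

func main() {
	cluster := []member{
		{"node01", 2, 2}, // refreshed: expects the new schema and two API extensions
		{"node02", 2, 2},
		{"node03", 1, 0}, // not yet refreshed: still on the old versions
	}
	// Each member re-checks periodically (roughly every 30s here) and only
	// commits the upgrade once the whole cluster agrees.
	fmt.Println("safe to commit:", everyoneAgrees(cluster)) // false until node03 refreshes
}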

@mkalcok (Contributor) commented May 3, 2024

Thanks for the detailed explanation @masnax. I believe we've gotten ourselves into a sticky situation. microovn.daemon is failing to start after upgrade with:

May 03 17:49:37 microovn-upgrade-1 microovn.daemon[19686]: time="2024-05-03T17:49:37Z" level=warning msg="Local API extensions: [internal:runtime_extension_v1 custom_encapsulation_ip], cluster members API extensions: [[internal:runtime_extension_v1 custom_encapsulation_ip] [internal:runtime_extension_v1 custom_encapsulation_ip] [internal:runtime_extension_v1 custom_encapsulation_ip] [internal:runtime_extension_v1 custom_encapsulation_ip]]"
May 03 17:49:37 microovn-upgrade-1 microovn.daemon[19686]: time="2024-05-03T17:49:37Z" level=error msg="Failed to send database upgrade request" error="Patch \"https://10.75.224.171:6443/cluster/internal/database\": Unable to connect to \"10.75.224.171:6443\": dial tcp 10.75.224.171:6443: connect: connection refused"
May 03 17:49:37 microovn-upgrade-1 microovn.daemon[19686]: time="2024-05-03T17:49:37Z" level=error msg="Failed to send database upgrade request" error="Patch \"https://10.75.224.189:6443/cluster/internal/database\": Unable to connect to \"10.75.224.189:6443\": dial tcp 10.75.224.189:6443: connect: connection refused"
May 03 17:49:37 microovn-upgrade-1 microovn.daemon[19686]: time="2024-05-03T17:49:37Z" level=error msg="Failed to send database upgrade request" error="Patch \"https://10.75.224.222:6443/cluster/internal/database\": Unable to connect to \"10.75.224.222:6443\": dial tcp 10.75.224.222:6443: connect: connection refused"
May 03 17:49:37 microovn-upgrade-1 ovsdb-client[19737]: ovs|00001|reconnect|INFO|unix:/var/snap/microovn/common/run/switch/db.sock: connecting...
May 03 17:49:37 microovn-upgrade-1 ovsdb-client[19737]: ovs|00002|reconnect|INFO|unix:/var/snap/microovn/common/run/switch/db.sock: connected
May 03 17:49:37 microovn-upgrade-1 ovs-vsctl[19738]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set open_vswitch . external_ids:ovn-remote=
May 03 17:49:37 microovn-upgrade-1 ovs-vsctl[19738]: ovs|00002|db_ctl_base|ERR|external_ids:ovn-remote=: argument does not end in "=" followed by a value.
May 03 17:49:37 microovn-upgrade-1 microovn.daemon[19686]: Error: Daemon stopped with error: Failed to run post-start hook: Failed to update OVS's 'ovn-remote' configuration

I believe this comes from

_, err = VSCtl(
	s,
	"set", "open_vswitch", ".",
	fmt.Sprintf("external_ids:ovn-remote=%s", sbConnect),
)

and sbConnect is just an empty string. What I find interesting, though, is that the function that's supposed to fetch sbConnect,
func environmentString(s *state.State, port int) (string, string, error) {

is connecting to the database, but it's not failing; it's just returning empty strings.
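
For illustration only (not a change proposed in this PR), a defensive check would at least turn the silent empty string into an explicit error instead of passing it through to ovs-vsctl:

// Hypothetical guard around the value fetched for ovn-remote.
package main

import (
	"errors"
	"fmt"
)

func ovnRemoteArg(sbConnect string) (string, error) {
	if sbConnect == "" {
		return "", errors.New("southbound connection string is empty; refusing to set ovn-remote")
	}
	return fmt.Sprintf("external_ids:ovn-remote=%s", sbConnect), nil
}

func main() {
	if _, err := ovnRemoteArg(""); err != nil {
		fmt.Println("error:", err) // what the failing upgrade would have surfaced
	}
}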

@masnax (Contributor) commented May 3, 2024

Thanks for the detailed explanation @masnax. I believe we've gotten ourselves into a sticky situation. microovn.daemon is failing to start after upgrade with: [...] and the sbConnect is just an empty string. What I find interesting though is that the function that's supposed to fetch sbConnect [...] is connecting to the database, but it's not failing; it's just returning empty strings.

Crap, this was my bad. I forgot to set PRAGMA foreign_keys = OFF before altering the internal_cluster_members table. So when the table was replaced, all tables with foreign keys pointing to that table were getting wiped thanks to ON DELETE CASCADE.

Actually, microcluster should probably set that for each internal schema update, since any external table we don't know about could be referencing any of ours.
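
For context, a minimal sketch of the pattern being described, using Go's database/sql against SQLite (illustrative only; microcluster's actual migration code and table columns differ):

// Illustrative only: disable foreign-key enforcement on the connection
// performing a table rebuild so that dropping the old table does not
// cascade deletes into referencing tables, then re-enable it afterwards.
package main

import (
	"context"
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite driver for this sketch
)

func rebuildClusterMembersTable(ctx context.Context, db *sql.DB) error {
	// PRAGMA foreign_keys is per-connection, so pin a single connection.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "PRAGMA foreign_keys = OFF"); err != nil {
		return err
	}
	defer conn.ExecContext(ctx, "PRAGMA foreign_keys = ON")

	// Typical SQLite "alter" dance (hypothetical columns): create the new
	// table, copy the rows, drop the old table, rename the new one into place.
	stmts := []string{
		`CREATE TABLE internal_cluster_members_new (id INTEGER PRIMARY KEY, name TEXT NOT NULL)`,
		`INSERT INTO internal_cluster_members_new (id, name) SELECT id, name FROM internal_cluster_members`,
		`DROP TABLE internal_cluster_members`,
		`ALTER TABLE internal_cluster_members_new RENAME TO internal_cluster_members`,
	}
	for _, stmt := range stmts {
		if _, err := conn.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	db, err := sql.Open("sqlite3", "file:demo.db?mode=memory&cache=shared")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE internal_cluster_members (id INTEGER PRIMARY KEY, name TEXT NOT NULL)`); err != nil {
		panic(err)
	}
	if err := rebuildClusterMembersTable(context.Background(), db); err != nil {
		panic(err)
	}
}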

@masnax (Contributor) commented May 3, 2024

@gabrielmougard I have 2 PRs up in microcluster, canonical/microcluster#123 and canonical/microcluster#122, which should fix the issues detected here.

@gabrielmougard (Contributor, Author) commented

@masnax thanks! I'll have a look

@gabrielmougard (Contributor, Author) commented

@mkalcok I updated the MicroCluster deps to use @masnax's recent work; is it possible to re-run the CI?

Signed-off-by: Gabriel Mougard <gabriel.mougard@canonical.com>
Signed-off-by: Gabriel Mougard <gabriel.mougard@canonical.com>

microovn/cmd/microovnd: Pass the MicroOVN extensions map to the MicroCluster initialization process.

Signed-off-by: Gabriel Mougard <gabriel.mougard@canonical.com>
@gabrielmougard (Contributor, Author) commented

@mkalcok I just ran the full check-system test suite on my machine and it's all green. I'll let you run the CI when you think we won't be bothered by the rate limits :)

@gabrielmougard (Contributor, Author) commented

Thanks @masnax for your MicroCluster PRs!

@mkalcok (Contributor) commented Jun 10, 2024

/canonical/self-hosted-runners/run-workflows fdfc65c

@mkalcok (Contributor) commented Jun 11, 2024

@gabrielmougard the reason for the failing upgrade tests seems to be that the cluster needs a bit more time to converge after the internal schema upgrade. Adding

wait_microovn_online "$container" 60

here

done
perform_manual_upgrade_steps $TEST_CONTAINERS

should solve the issue.

Signed-off-by: Gabriel Mougard <gabriel.mougard@canonical.com>
@gabrielmougard (Contributor, Author) commented

@mkalcok can you re-run the CI? I added the wait_microovn_online condition for each container.

@mkalcok (Contributor) commented Jun 12, 2024

/canonical/self-hosted-runners/run-workflows e7c67e4

@fnordahl (Member) commented

So because the schema update that introduces API updates is part of the same update that increments the number of API updates, the update process takes at least 30s in this case.

@masnax seamless and painless upgrades are very important to our users, and as @mkalcok already pointed out, we rely on microcluster to manage the upgrade of the payload.

Our highest priority for this release is to make the upgrade process bullet proof, ensuring minimal data path downtime and keeping the end user informed (ref #130).

As far as I understand, this PR effectively makes the microcluster unavailable for an extended period of time, without any means of informing the user of why or what to do.

This does not come across as great UX and conflicts with our main goal for this release. Can the schema/extension migration process be improved in microcluster?

@masnax (Contributor) commented Jun 12, 2024

@fnordahl Sorry about that, I'm not sure what's happening with the upgrade process here. There was an issue earlier with the upgrade process taking a long time, but this was fixed (by this PR).

Running an upgrade locally appears instantaneous on my end, though I haven't tried it with MicroOVN's testsuite personally. I'll give it a run and let you know my findings.

@fnordahl (Member) commented

Running an upgrade locally appears instantaneous on my end, though I haven't tried it with MicroOVN's testsuite personally. I'll give it a run and let you know my findings.

That's excellent, thank you for looking into it!

FWIW, I did a manual test just before posting the previous comment, and the cluster appeared unresponsive until all nodes were upgraded. I think that is our main issue because we really need the cluster and its CLI to be responsive throughout the upgrade process to guide our users.

@masnax (Contributor) commented Jun 12, 2024

FWIW; I did a manual test just before posting the previous comment, and the cluster appeared unresponsive until all systems were upgraded. I think that is our main issue because we really need the cluster and its CLI to be responsive throughout the upgrade process to guide our users.

In this case it is absolutely necessary for a cluster member to enter a very restricted state if it encounters an upgrade that changes its database schema. This is because we can't risk applying a change to the schema until we are sure that all cluster members expect the same upgrade. If any one system applies the upgrade blindly, then very abruptly all running systems will become unable to properly read from or write to the database. An additional risk is the possibility of conflicting upgrades occurring on different cluster members.

The upgrade process works as follows: after encountering a new schema upgrade, that cluster member enters a waiting state which restricts access to the database and API until it receives a notification from the final cluster member to receive the upgrade. Only at this point is the upgrade actually applied, and the database and API become open for regular access. Until this point, any non-upgraded cluster member will continue to function. One added benefit here is that since schema upgrades are non-backwards compatible, we maintain the freedom to revert the upgrade until it is applied on the final system, since nothing will have been committed until that point.

I can think of these ways to keep a user informed of the upgrade process and overall cluster status going forward:

  • Right now, the status errors returned by microcluster are not that descriptive. For the most part, if the cluster is not in a ready state, then the user will just see Daemon not yet initialized. The messages pertaining to the upgrade are included in the logs, but are not returned as API error messages, so we can extend these messages to more precisely report the current state of the cluster.

  • microovn cluster list always reports either ONLINE or UNREACHABLE, so we can extend this to be a bit more nuanced about what the actual issue is. In many cases a node is reachable but not "ready", yet we still report it as UNREACHABLE.

  • Additionally, we can add a core microcluster status API that can return some information to be incorporated into MicroOVN's status output.

As for the slowness, I've narrowed it down to two components:

  • There has been a long-standing bug in microcluster which has since been fixed, but is present in MicroOVN today, which results in systems that join a cluster too close together not receiving a local record for each other until the next heartbeat. This is occurring in the test suite because the cluster is formed in quick succession and then the snaps are refreshed. When the daemons restart, some cluster members report UNREACHABLE until the next heartbeat. You may notice if you place wait_microovn_online before the snap refresh, it will still block since some cluster members are reporting UNREACHABLE right after they have joined the cluster, but before the snap refresh.

  • ovn.Start takes approximately 10-20 seconds to run since the whole cluster is restarted, and microcluster will not consider a system to have "started" until its OnStart hook has completed.

@mkalcok (Contributor) commented Jun 13, 2024

Thank you for the detailed insights @masnax.
One question that I have (@gabrielmougard) is whether there's a possibility of avoiding schema changes in this PR at all. The way I understand it is that you don't actually need to change the schema; you just bumped the microcluster dependency, and because of that we need a new way of representing schema extensions, which counts as a schema change.

What would be the risks/downsides of keeping the older version of microcluster and therefore avoiding the schema change completely?

@gabrielmougard (Contributor, Author) commented Jun 13, 2024

@mkalcok Well, if we want to go in that direction, I don't see a clean way for a node in the cluster to know what features are enabled in a MicroOVN node. On the MicroCloud side (the client querying the MicroOVN service), we could try our luck and call an API endpoint that might or might not be present in MicroOVN to expose MicroOVN's API extensions... If the endpoint exposing the extensions is there, that's great: we can read the extensions and proceed (or not) with our logic in MicroCloud. If it turns out that the deployed MicroOVN is too old and doesn't have such an endpoint (the call would return a not-found error), then I guess we could treat this as a lack of the extension and not proceed. This is a very brittle and unclean approach IMHO. Also, this approach assumes that all the MicroOVN nodes are always perfectly aligned on the same version (since we could not rely on the strong consistency of the MicroCluster database)... What if that is not the case, and the leader with version A has the endpoint but one node is still on version A-1 and doesn't have it? Then we might proceed on MicroCloud and execute the API call to set up an underlay on the entire MicroOVN cluster, which would result in an error or, worse, in an inconsistent networking setup for the Geneve tunnel.
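
For concreteness, the brittle probe described above would look roughly like this (the endpoint path and response shape are made up for illustration; nothing here is the actual MicroOVN or MicroCloud code):

// Hypothetical sketch: ask one MicroOVN node for its API extensions and
// treat a 404 as "this node predates the extensions endpoint". Note that
// this only reflects the one node that was asked, which is exactly the
// consistency problem discussed above.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func queryExtensions(baseURL string) ([]string, error) {
	resp, err := http.Get(baseURL + "/1.0/extensions") // hypothetical path
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotFound {
		return nil, nil // old MicroOVN: treat as "no extensions"
	}

	var exts []string
	if err := json.NewDecoder(resp.Body).Decode(&exts); err != nil {
		return nil, err
	}
	return exts, nil
}

func main() {
	exts, err := queryExtensions("https://10.0.0.1:6443")
	fmt.Println(exts, err)
}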

I agree with @masnax: there are a couple of improvements we can work on to make the UX better during the upgrade. But ultimately, I don't think there is a way around upgrading the schema (and bumping the version of MicroCluster) without introducing these very risky consistency issues that could even corrupt the MicroOVN networking setup.

@fnordahl (Member) commented

We would welcome an improved schema conversion process.

If it is possible to get that done before our release, great; if not, we need to be creative and think tactically if you want this included in the release.

@mkalcok (Contributor) commented Jun 13, 2024

@gabrielmougard my suggestion/question was more about whether it's possible to implement the API feature without the schema change. Looking at this PR, there's no real change to the schema; it just changed because you bumped the microcluster version and now there's a new way to represent "schema extensions".

Would it be possible to add this API without bumping the microcluster library, therefore preventing the need for schema change?

@gabrielmougard (Contributor, Author) commented

Naively, I would say yes, it is of course possible. Adding a new API endpoint in MicroOVN for exposing its API extensions would suffice. We'd then have to find a creative way on the client side in MicroCloud to ensure everything is correct on all the MicroOVN nodes. I'll let @masnax correct me if I say something wrong.

@mkalcok (Contributor) commented Jun 13, 2024

What role does the upgraded microcluster library play in this change? Does it add any feature that makes it easier to implement the "extensions" API?

@mkalcok (Contributor) commented Jun 13, 2024

RE merge conflicts:
To resolve the conflict that got created in the upgrade.bats, you can just move your change (the added wait_microovn_online function) here:

done
perform_manual_upgrade_steps $TEST_CONTAINERS

(btw, it's sufficient to run it in a single container. No need to run it in every container.)

@masnax (Contributor) commented Jun 13, 2024

@mkalcok Just like the schema upgrade, the set of API extensions must be something all cluster members agree on. It is effectively a record of the behaviour of the API. If any one cluster member unilaterally changes its API, then cluster-wide behaviour will be inconsistent and possibly broken.

For that to work, we need to record the API extensions of all cluster members in the global database so any one member can compare what set it expects to what all other cluster members (even currently offline ones) expect, and be instantly aware if a change happens on another system. This means a change to the schema.

So I don't think there is a way to add such a feature without reimplementing a mechanism that works in a similar way to the schema upgrade mechanism, because the API would be (and currently is) unstable when a single system runs snap refresh without coordinating with all other cluster members.

So maybe it's best to say that the real feature here is the addition and utilisation of a way to coordinate non-backwards compatible upgrades across the cluster.
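
As a rough illustration of that comparison (the daemon log earlier in this thread shows exactly such a local-versus-cluster check), with hypothetical helper names rather than the microcluster source:

// Hypothetical comparison of the local extension set against the sets
// recorded in the global database for every cluster member (including
// currently offline ones); any mismatch means the API must stay restricted.
package main

import "fmt"

func sameExtensions(local []string, members [][]string) bool {
	for _, memberExts := range members {
		if len(memberExts) != len(local) {
			return false
		}
		for i := range local {
			if memberExts[i] != local[i] {
				return false
			}
		}
	}
	return true
}

func main() {
	local := []string{"internal:runtime_extension_v1", "custom_encapsulation_ip"}
	cluster := [][]string{
		{"internal:runtime_extension_v1", "custom_encapsulation_ip"},
		{"internal:runtime_extension_v1"}, // a member that has not refreshed yet
	}
	fmt.Println("cluster consistent:", sameExtensions(local, cluster)) // false
}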

It would be great to know what your requirements are for such a feature. Would the steps I outlined above be sufficient for moving this forward? That is:

  • A clearer error on blocked endpoints other than Daemon not yet initialized
  • Allow microovn status to work in a limited capacity during an upgrade and include microcluster status about the upgrade.

@masnax (Contributor) commented Jun 13, 2024

(btw, it's sufficient to run it in a single container. No need to run it in every container.)

This is actually dangerous because of the bug I mentioned earlier:

There has been a long-standing bug in microcluster which has since been fixed, but is present in MicroOVN today, which results in systems that join a cluster too close together not receiving a local record for each other until the next heartbeat. This is occurring in the test suite because the cluster is formed in quick succession and then the snaps are refreshed. When the daemons restart, some cluster members report UNREACHABLE until the next heartbeat. You may notice if you place wait_microovn_online before the snap refresh, it will still block since some cluster members are reporting UNREACHABLE right after they have joined the cluster, but before the snap refresh.

While some members might report all cluster members are ONLINE, others (particularly node 2 and node 3) may not yet trust each other, and report each other as UNREACHABLE.

Ideally, there should always be a check after cluster formation that reports all nodes are no longer in a PENDING state, and are reachable from every other system.

FWIW, if the upgrade occurs after all systems are fully reachable and set up, the total down time is only about a second more than the total down time for a restart of all systems.

@mkalcok (Contributor) commented Jun 13, 2024

This is actually dangerous because of the bug I mentioned earlier:

I see, thank you. I didn't fully realize how it affects the cluster list.

It would be great to know what your requirements are for such a feature.

We are trying to facilitate a hassle-free upgrade of the underlying OVN cluster which, coincidentally, also involves a schema upgrade of a clustered database 😆. We hold back the schema upgrade until every node in the cluster expects the same version of the database. We don't want to leave the user blind during this process. For example, when 2/3 nodes in the cluster are upgraded, running cluster status outputs:

OVN Southbound: Upgrade or attention required!
Currently active schema: 20.21.0
Cluster report (expected schema versions):
	movn1: 20.33.0
	movn3: Missing API. MicroOVN needs upgrade
	movn2: 20.33.0

This lets the user know that a schema upgrade from 20.21.0 to 20.33.0 is pending and that it's the host movn3 that still needs to be upgraded.

You touched on the issue of inconsistent APIs; in this case we deal with it by assuming that anything that returns 404 on the endpoint that should report the "expected schema version" is a node that needs to be upgraded. In future upgrades, old hosts will simply report an older schema version.

I think that if the current error message Daemon not yet initialized could be replaced with something that:

  • clearly states that cluster operation is limited because not every member is running the same version
  • lists the cluster nodes that need to be upgraded

then that would be something we'd be happy with.

@masnax (Contributor) commented Jun 13, 2024

@mkalcok That all sounds reasonable to me, thank you :)

It shouldn't be hard to incorporate some cluster database elements into that cluster status endpoint, and allow it to work while a database upgrade is in progress.

One concern I have is about the cluster error message. There may be very many cluster members so reporting a list would require including some data structure in the error metadata.

Our principle has been to keep the error messages as simple strings for the most part without additional metadata that needs to be checked for and parsed. This is to prevent every error returned to the CLI needing to be checked for various types of metadata, and then transformed into a human-readable representation.

This isn't a hard rule or anything, but as an alternative do you think it would suffice to simply report a summary in the error message rather than include all cluster member statuses, of which there may be very many? As in simply report something like this:

  Cluster database upgrade to version 2 is in progress. Waiting for 5 cluster members to receive the update.

And then more detailed information can be available through the aforementioned cluster status command. Let me know what you think.


Connected to all of this, one thing we arrived at internally is that microcluster doesn't understand the difference between a restart and a reload of the daemon; if it did, the underlying OVN service could continue to run while the daemon is reloaded for a snap refresh. LXD follows this approach for snap refreshes to ensure instances remain online during a database upgrade, since the actual instances shouldn't care about that.

I'm unsure whether MicroOVN has some internal reloading mechanism for snap refreshes? Perhaps we can formalize this in microcluster so that the daemon can detect whether the intention is to simply reload; if so, we won't necessarily run the OnStop hook to shut down OVN services, and then try to start them via the OnStart hook after the database has settled from an upgrade.

This would mean that during a microcluster-level schema upgrade, the underlying OVN service can continue to run. I'm not sure how relevant this will be to the next MicroOVN version since it appears to me that you are also performing some upgrades of OVN itself, so I'm not sure if that entails downtime anyway.

@mkalcok (Contributor) commented Jun 14, 2024

And then more detailed information can be available through the aforementioned cluster status command. Let me know what you think.

This sounds very reasonable to me. The general error message can be terse and instruct the user to run microovn status. Then, if we have tools/API endpoints to determine which hosts still need an upgrade, we can generate a report about it there.

Regarding the refresh vs restart:

MicroOVN does not implement any OnStop hook, and I think that, realistically, we want OVN services to be restarted on snap refresh, as a new MicroOVN can include new versions of the OVN/OVS packages.
We tested regular (non-schema-changing) upgrades and the dataplane downtime is minimal/acceptable (though we still aim to improve it). I don't think we tested dataplane outage in the scenario with a MicroOVN schema change on upgrade. I'll give it a try. Thank you.

fnordahl added a commit to fnordahl/microovn that referenced this pull request Jun 17, 2024
Pending resolution of discussion in canonical#121.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>