
Reuse existing MicroCeph and MicroOVN clusters #259

Merged: 11 commits merged into canonical:main on May 3, 2024

Conversation

@masnax (Contributor) commented Jan 26, 2024

Closes #145

This PR allows initializing a new MicroCloud with some nodes that have already set up MicroCeph or MicroOVN in the past. This can be useful if you already have an existing MicroCeph cluster, for example, that you want to "upgrade" into a MicroCloud.

Just after confirming the list of systems to use for the MicroCloud, we try to fetch the list of cluster members from each system. The underlying microcluster package will either return the list or report that the database is not initialized. We record the first list obtained for each of MicroOVN and MicroCeph, and if every other system's list is either non-existent or identical to it, we can try to reuse those clusters.

The reuse process essentially amounts to asking the system that has an existing cluster to generate join tokens and send them to the node orchestrating the initialization. Normally you have to be trusted to make such a request, so instead we use the auth secret obtained from discovering the systems over mDNS to send a request through the MicroCloud proxy, which initiates the request over the unix socket on the clustered system.
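The cluster-member comparison described above can be sketched roughly as follows. This is an illustrative Go snippet, not the PR's actual implementation; the helper name `checkClusterConflicts` and its shape are invented for the example:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// checkClusterConflicts mirrors the rule described above: every system that is
// already clustered for a given service (MicroCeph or MicroOVN) must report an
// identical member list, otherwise the systems could never join each other.
// A nil member list means the system's database is not initialized, so it can
// simply join the reused cluster later.
func checkClusterConflicts(existingClusters map[string][]string) ([]string, error) {
	var reusable []string
	for system, members := range existingClusters {
		if members == nil {
			continue // Not initialized; no conflict possible.
		}

		// Sort a copy so that ordering differences don't count as conflicts.
		sorted := append([]string{}, members...)
		sort.Strings(sorted)

		if reusable == nil {
			reusable = sorted // Record the first list we obtain.
			continue
		}

		if strings.Join(sorted, ",") != strings.Join(reusable, ",") {
			return nil, fmt.Errorf("system %q belongs to a different cluster", system)
		}
	}

	return reusable, nil
}

func main() {
	clusters := map[string][]string{
		"node1": {"node1", "node2"},
		"node2": {"node1", "node2"},
		"node3": nil, // Service installed but not initialized.
	}

	members, err := checkClusterConflicts(clusters)
	fmt.Println(members, err) // Prints: [node1 node2] <nil>
}
```

If any system reports a member list that differs from the first one recorded, initialization errors out, matching the first bullet in the behaviour list below.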

The behaviour is as follows:

  • If two systems belong to different clusters for the same service, we will error out as they wouldn't be able to join each other.
  • If a cluster contains more nodes than what will end up in the MicroCloud, we will proceed as normal and ignore those nodes. Perhaps a user might want a larger MicroCeph cluster than their LXD cluster, for example.
  • If --auto is specified, we will strictly require no clustered systems.
  • The preseed file will have a new key reuse_existing_clusters which can be set to true or false. If true, we will reuse any clusters we find, and if false, we will skip that service entirely and set up as normal.
  • In interactive mode, we will prompt the user for each service whether they want to reuse the existing cluster or set up MicroCloud without that service.
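As a rough illustration of the preseed behaviour described above, a fragment using the new key might look like the following. Only `reuse_existing_clusters` is the key introduced by this PR; the surrounding field names are abbreviated from the existing preseed layout and should be treated as an assumption, not the authoritative schema:

```yaml
# Hypothetical preseed fragment; only reuse_existing_clusters is new here.
lookup_subnet: 10.0.0.0/24
reuse_existing_clusters: true   # true: reuse any MicroCeph/MicroOVN clusters found
systems:
  - name: node1
  - name: node2
  - name: node3
```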

Review threads (fixed):
  • microcloud/service/microcloud.go (×3)
  • microcloud/service/microceph.go
  • microcloud/service/microovn.go
@masnax (Contributor, author) commented Apr 9, 2024

@markylaing @roosterfish @tomponline

This one should be ready for review now.

@tomponline (Member):

Thanks @masnax

Once @markylaing and @roosterfish have approved, I'll do a final pass. Ta

@roosterfish (Contributor) left a comment:

Looks good to me, only a few smaller suggestions.

Review threads:
  • microcloud/service/microceph.go (outdated, resolved)
  • microcloud/cmd/microcloud/main_init.go (resolved)
  • microcloud/cmd/microcloud/main_init.go (outdated, resolved)
@roosterfish previously approved these changes on Apr 10, 2024:

LGTM!

@masnax (Contributor, author) commented Apr 11, 2024

@markylaing @tomponline Do either of you want to take a look at this before merging?

Also @ru-fu Looks like there's something going on with the doc checks here. Got any ideas?

@markylaing (Contributor):

> @markylaing @tomponline Do either of you want to take a look at this before merging?
>
> Also @ru-fu Looks like there's something going on with the doc checks here. Got any ideas?

Yes, I'd like to take a look. Will get to it first thing tomorrow.

@ru-fu (Contributor) commented Apr 12, 2024

This sounds like it needs doc updates.

@markylaing (Contributor):

> The preseed file will have a new key reuse_existing_clusters which can be set to true or false. If true, we will reuse any clusters we find, and if false, we will skip that service entirely and set up as normal.

If this key is false, does it error out if we find existing Ceph or OVN clusters as it does with the --auto flag? I'm presuming here that we can't set up a new Ceph or OVN cluster on top of one that pre-exists.

@markylaing (Contributor) left a comment:

I think the structure of it looks good, but some parts could be made a bit clearer.

Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently so I think it's worth a discussion around that.

Review threads:
  • microcloud/service/microcloud.go (resolved)
  • microcloud/service/lxd.go (outdated, resolved)
  • microcloud/service/microceph.go (outdated, resolved)
  • microcloud/cmd/microcloud/ask.go (outdated, resolved, ×2)
  • microcloud/cmd/microcloud/main_init.go (outdated, resolved)
A reviewer commented on the `return shared.ProxyFromEnvironment(r)` snippet:

Now that canonical/microcluster#83 is merged, does this need to be updated? Same for the other cases where this is being set.

@masnax (author) replied:

Instead of filling this PR with all of those changes, I've added a new PR that moves all this implementation into a helper so we can just call that instead: #287

The reviewer replied:

Ok, so #287 has to come after this one then?

Review threads:
  • microcloud/test/includes/microcloud.sh (outdated, resolved)
  • microcloud/cmd/microcloud/main_init.go (resolved)
@tomponline (Member):

> @markylaing @tomponline Do either of you want to take a look at this before merging?

I'll look over once approved by the others.

@roosterfish (Contributor):

> Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends?

Good point, I also thought about this the other day when reviewing the PR. Essentially, what we will probably no longer have is the secret used for the X-MicroCloud-Auth header to talk to the other MicroCloud daemon. So it has to be an action invoked by the administrator on the target system. I have it on my list of things for the spec so we can discuss it there :)

@tomponline (Member):

> Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently so I think it's worth a discussion around that.

It will almost certainly have to change as mDNS isn't going to be used as much. But that shouldn't necessarily block this PR, although it will likely have to be reworked as part of @roosterfish's planned changes.

@masnax (Contributor, author) commented Apr 12, 2024:

> Also, one concern I have with this is the reliance on mDNS for issuing the tokens. Will it be possible to add existing services if the join process changes such that explicit verification is required on both ends? I believe @roosterfish is working on this currently so I think it's worth a discussion around that.

At the moment, pretty much everything MicroCloud does requires mDNS to work because we can't be trusted on the other systems until we've clustered with them. We need mDNS for selecting disks, selecting networks, and selecting the nodes themselves, and now issuing tokens from existing clusters with this PR.

I'd assume the verification process would establish a long-lived trust for the duration of the join process similar to the mDNS auth secret, allowing us to remotely issue tokens, view available disks and network interfaces, and request a node to join the clusters.

@roosterfish (Contributor):

> I'd assume the verification process would establish a long-lived trust for the duration of the join process similar to the mDNS auth secret, allowing us to remotely issue tokens, view available disks and network interfaces, and request a node to join the clusters.

An option I am trying to validate for the spec is moving the forming of the MicroCloud microcluster to right after the discovery of new peers. This establishes trust between all the peers and allows them to talk to each other using mutual TLS to perform the actions we currently perform using the auth secret.

With this approach we no longer require any auth secret besides a join token for each of the peers that wants to join the MicroCloud.

@masnax (Contributor, author) commented Apr 12, 2024:

> An option I am trying to validate for the spec is moving the forming of the MicroCloud microcluster to right after the discovery of new peers. This establishes trust between all the peers and allows them to talk to each other using mutual TLS to perform the actions we currently perform using the auth secret.
>
> With this approach we no longer require any auth secret besides a join token for each of the peers that wants to join the MicroCloud.

For some historical context, we actually had a similar process in place very early on because I was hesitant to rely on the mDNS component so much.

Problem was mainly user experience. 90% of errors will happen around the configuration so a mistake would be costly as we would have to tear down the clusters. Plus it puts a significant pause (could be minutes for larger clusters) before we even get to the bulk of the user interactions.

@roosterfish (Contributor):

> 90% of errors will happen around the configuration so a mistake would be costly as we would have to tear down the clusters. Plus it puts a significant pause (could be minutes for larger clusters) before we even get to the bulk of the user interactions.

What do you mean by we have to tear down the clusters?

If I understood you correctly, you are talking about the pause that results from joining the cluster members manually? I guess this is something we have to accept as part of making the overall concept more secure, but it's worth evaluating options to make this as fast as possible for the end user.

@masnax (Contributor, author) commented Apr 15, 2024:

> What do you mean by we have to tear down the clusters?

The user runs microcloud init, but mistakenly hasn't prepared disks/network interfaces appropriately yet. If the systems are clustered before we ask the storage/network questions, the user only realizes the mistake after the nodes have been clustered. Now aborting the init process so the user can fix the mistake becomes much more expensive because we have already changed state on every system, before the user has made any choices.

Review threads:
  • doc/how-to/initialise.rst (outdated, resolved, ×3)
@ru-fu previously approved these changes on Apr 29, 2024:

Thanks! Docs look good now. :)

@masnax dismissed markylaing's stale review on April 30, 2024 14:44: changes addressed.

@masnax (Contributor, author) commented Apr 30, 2024:

@roosterfish @tomponline Could one of you please look this over one last time before merging? Thanks!

(That 1 test failure is an issue with the test suite being too hard on the runners, it passes consistently when run locally).

@tomponline (Member):

> (That 1 test failure is an issue with the test suite being too hard on the runners, it passes consistently when run locally).

@masnax does this mean that if we merge this, we will get that test failing going forward?

If so, then that's a merge blocker to me.

@masnax (Contributor, author) commented Apr 30, 2024:

> @masnax does this mean that if we merge this, we will get that test failing going forward?
>
> If so, then that's a merge blocker to me.

The test will be just as flaky on any PR regardless of whether this one is merged. The test was not affected by this PR; the issue is that we can't consistently run 4 VMs on the GitHub runners without them hitting issues like LXD being unable to start up.

@tomponline (Member):

I see @masnax, thanks for clarifying. It sounded like this PR was introducing the tests that were flaky.

I'll leave the final review to @roosterfish.

@roosterfish (Contributor) left a comment:

Just a few more small comments; the rest still looks good to me.

Review threads:
  • microcloud/api/types/services.go (resolved)
  • microcloud/cmd/microcloud/main_init.go (resolved, ×2)

Review thread:
  • microcloud/test/suites/basic.sh (resolved)
masnax added 11 commits May 2, 2024 17:21
During init, to handle the case where another system is already
clustered on a particular service, we need to be able to request
this node to issue a token for us from an as-yet untrusted system.
This endpoint is untrusted by the cluster, but authenticated with a
secret generated during mDNS lookup, so we can use it as a proxy to
the unix socket on the remote system, where we will be trusted and
can issue a token.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
…e cluster

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Detecting already clustered members will make this block very complex,
so separate it out from `waitForJoin` so that the scope of the helper is
just to wait for nodes to join

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
…ustered

The cluster size delta per service will become uneven if some nodes are
already clustered, and we can't guarantee that the local node isn't
already participating in some of those clusters, so we need to handle
each service more explicitly by carrying a map around the cluster join
process.

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
… helper

Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
Signed-off-by: Max Asnaashari <max.asnaashari@canonical.com>
@masnax masnax merged commit d5a3819 into canonical:main May 3, 2024
15 checks passed
masnax added a commit that referenced this pull request Jul 26, 2024
Closed the old PR to start fresh.

Adds two new commands:
* `microcloud service list` will list the cluster members for every
installed service, or report if it is not initialized. This will
effectively be the same as calling all of the following in succession:
```
lxc cluster list
microcloud cluster list
microceph cluster list
microovn cluster list
```

The information shown will be the name, address, dqlite role, and
current status of each member.

* `microcloud service add` will try to set up MicroOVN and MicroCeph on
all existing MicroCloud cluster members, optionally setting up storage
and networks for LXD. This is useful if MicroOVN or MicroCeph was at one
point not installed on the systems and skipped during `microcloud init`.
LXD and MicroCloud itself are required to already be set up.
Thanks to #259 we can also try to re-use a service that partially covers
existing MicroCloud cluster members. So if a MicroCloud is set up
without MicroCeph, and then the user manually configures MicroCeph to
partially cover the cluster, the user can then use `microcloud service
add` to further configure MicroCeph to work with MicroCloud, and set up
storage pools for LXD.
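For illustration only, the `microcloud service list` table described above might look something like this (the member names, addresses, and exact column layout are invented; only the columns themselves, name, address, dqlite role, and status, come from the description):

```
+-------+-----------------+-------+--------+
| NAME  | ADDRESS         | ROLE  | STATUS |
+-------+-----------------+-------+--------+
| node1 | 10.0.0.101:9443 | voter | ONLINE |
| node2 | 10.0.0.102:9443 | voter | ONLINE |
| node3 | 10.0.0.103:9443 | spare | ONLINE |
+-------+-----------------+-------+--------+
```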
Successfully merging this pull request may close these issues:

  • Handle MicroCeph already installed (#145)

6 participants