Managed Infrastructure Maintenance Operator - Milestone 1 #3571

Draft
wants to merge 69 commits into
base: master
d1f5abe
add MIMO to dev
hawkowl Feb 29, 2024
222d321
initial maint API
hawkowl Jul 15, 2024
a590176
update mimo DB to have a fetch which isn't just pending
hawkowl Jul 18, 2024
4b4331b
update db fakes for maintmanifests
hawkowl Jul 18, 2024
57247ab
mimo API conversions for admin API
hawkowl Jul 18, 2024
2d3f9d0
query + tests for maintenancemanifests
hawkowl Jul 18, 2024
a2f6b48
update with get
hawkowl Jul 18, 2024
3598a32
test fixes
hawkowl Jul 19, 2024
b33b158
cancellation tests and impl
hawkowl Jul 19, 2024
82bef02
renaming and tweaking
hawkowl Jul 22, 2024
b14267f
static validation
hawkowl Jul 22, 2024
db8b28e
update frontend to have create
hawkowl Jul 22, 2024
a54f550
code for deleting
hawkowl Jul 22, 2024
4a8fd69
add deleting endpoint
hawkowl Jul 22, 2024
6a6d73f
move clusteroperator check code for reuse
hawkowl Jul 26, 2024
86d2fbb
MIMO task/set cleanups
hawkowl Jul 26, 2024
4864984
mimo error code
hawkowl Jul 26, 2024
761149c
more work on sets
hawkowl Jul 26, 2024
d2ff55c
tls tasks work
hawkowl Jul 26, 2024
f6e3629
update for cleanups
hawkowl Aug 5, 2024
1495db1
move into the main CLI endpoint
hawkowl Aug 16, 2024
34f97a8
add a task for updating the operator flags, for testing
hawkowl Sep 17, 2024
ef01a31
add healthz endpoints for MIMO actuator
hawkowl Sep 18, 2024
7abd9cd
makefile target for running actuator locally
hawkowl Sep 18, 2024
eaeaf05
add mimo actuator steps in e2e helper
hawkowl Sep 18, 2024
56b2a32
start mimo in e2e
hawkowl Sep 18, 2024
3b7bee4
fix build
hawkowl Sep 18, 2024
7e1184a
go generate
hawkowl Sep 18, 2024
c48c69b
lint
hawkowl Sep 18, 2024
77f68b0
updates for basic mimo e2e
hawkowl Sep 19, 2024
615efa4
e2e testing
hawkowl Sep 19, 2024
ba92893
try and see what e2e is breaking with
hawkowl Sep 23, 2024
3256443
initial doc frame
hawkowl Sep 24, 2024
287475b
ARO-9263: Add ACR Token expiry
edisonLcardenas Aug 19, 2024
e4a15f8
ARO-9263: Add ACR Token Expiry Checker
edisonLcardenas Aug 20, 2024
dd5e653
ARO-9263: Add unit test for checker
edisonLcardenas Aug 21, 2024
fa57b45
ARE-9263: Rename function
edisonLcardenas Aug 22, 2024
151a7cb
ARO-9263: Restore missing "ProvisioningStateMaintenance"
edisonLcardenas Aug 22, 2024
c8f2a2e
ARO-9263: Fix import CI check failures
edisonLcardenas Aug 26, 2024
2490d48
ARO-9263: Refactoring tests and revising logic to check expiry date.
edisonLcardenas Sep 12, 2024
e166aa5
ARO-9263: Add another condition to check if expiry date is nil
edisonLcardenas Sep 12, 2024
85ca75b
refactor: update package groupings and error messages to resolve issu…
edisonLcardenas Sep 12, 2024
25e49f4
ARO-9263: Change expiry to the date the token was issued.
edisonLcardenas Sep 16, 2024
b7f881d
ARO-9263: Revise logic to check issue date instead of expiry
edisonLcardenas Sep 17, 2024
cb5f0e5
ARO-9263: Add constants to reduce redunant values
edisonLcardenas Sep 17, 2024
b623dfa
ARO-9263: Update test to check issue date in constant time to avoid f…
edisonLcardenas Sep 18, 2024
822dbdb
ARO-9263: Change or remove any references about expiry to issue date.
edisonLcardenas Sep 19, 2024
93c9999
ARO-9263: Fix lint issues
edisonLcardenas Sep 20, 2024
e63d63d
ARO-9263: Revise error message and reorder return statement
edisonLcardenas Sep 24, 2024
abce1ef
ARO-9263: Fix unit test
edisonLcardenas Sep 24, 2024
5892cc7
fix e2e, hopefully
hawkowl Sep 26, 2024
2a69cbc
pls
hawkowl Sep 26, 2024
ecea407
API OperatorFlagsMergeStrategy
SrinivasAtmakuri Mar 1, 2023
0d153ca
operator flags patches + tests
hawkowl Oct 2, 2024
41a781e
add the maintmanifests client to the RP frontend/backend in dev
hawkowl Oct 2, 2024
3be50c1
reset the cluster flags to stop other tests failing
hawkowl Oct 2, 2024
a3f2c02
Bump test file
hawkowl Oct 2, 2024
c07c86f
Update actuator_test.go
hawkowl Oct 2, 2024
47f1af9
fix the ARM resource deploying the partition key
hawkowl Oct 3, 2024
547c989
regen
hawkowl Oct 3, 2024
a18b250
lint fix
hawkowl Oct 3, 2024
176ecca
fixes for e2e
hawkowl Oct 3, 2024
3e8cc71
add the ability to add a debug flag
hawkowl Oct 4, 2024
5c26640
e2e fix
hawkowl Oct 4, 2024
55f0456
renames and fixes
hawkowl Oct 9, 2024
7fc6d19
go mod tidy
hawkowl Oct 10, 2024
15ecc94
add some documentation
hawkowl Oct 11, 2024
517333d
review cleanups and neatening things up
hawkowl Oct 17, 2024
7f275e0
try and fix e2e race condition
hawkowl Oct 18, 2024
6 changes: 6 additions & 0 deletions .pipelines/e2e.yml
@@ -70,6 +70,8 @@ jobs:

- script: |
export CI=true
# Tell the E2E binary to run the MIMO tests
export ARO_E2E_MIMO=true
. secrets/env
. ./hack/e2e/run-rp-and-e2e.sh

@@ -84,6 +86,9 @@ jobs:
run_selenium
validate_selenium_running

run_mimo_actuator
validate_mimo_actuator_running

run_rp
validate_rp_running

@@ -128,6 +133,7 @@ jobs:

delete_e2e_cluster
kill_rp
kill_mimo_actuator
kill_selenium
kill_podman
kill_vpn
10 changes: 7 additions & 3 deletions Makefile
@@ -2,7 +2,7 @@ SHELL = /bin/bash
TAG ?= $(shell git describe --exact-match 2>/dev/null)
COMMIT = $(shell git rev-parse --short=7 HEAD)$(shell [[ $$(git status --porcelain) = "" ]] || echo -dirty)
ARO_IMAGE_BASE = ${RP_IMAGE_ACR}.azurecr.io/aro
-E2E_FLAGS ?= -test.v --ginkgo.v --ginkgo.timeout 180m --ginkgo.flake-attempts=2 --ginkgo.junit-report=e2e-report.xml
+E2E_FLAGS ?= -test.v --ginkgo.vv --ginkgo.timeout 180m --ginkgo.flake-attempts=2 --ginkgo.junit-report=e2e-report.xml
E2E_LABEL ?= !smoke&&!regressiontest
GO_FLAGS ?= -tags=containers_image_openpgp,exclude_graphdriver_btrfs,exclude_graphdriver_devicemapper

@@ -67,7 +67,7 @@ aro: check-release generate

.PHONY: runlocal-rp
runlocal-rp:
-	go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro rp
+	go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} rp

.PHONY: az
az: pyenv
@@ -196,7 +196,11 @@ proxy:

.PHONY: runlocal-portal
runlocal-portal:
-	go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro portal
+	go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} portal
+
+.PHONY: runlocal-actuator
+runlocal-actuator:
+	go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} mimo-actuator

.PHONY: build-portal
build-portal:
4 changes: 4 additions & 0 deletions cmd/aro/main.go
@@ -28,6 +28,7 @@ func usage() {
fmt.Fprintf(flag.CommandLine.Output(), " %s operator {master,worker}\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s update-versions\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s update-role-sets\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s mimo-actuator\n", os.Args[0])
flag.PrintDefaults()
}

@@ -74,6 +75,9 @@ func main() {
case "update-role-sets":
checkArgs(1)
err = updatePlatformWorkloadIdentityRoleSets(ctx, log)
case "mimo-actuator":
checkArgs(1)
err = mimoActuator(ctx, log)
default:
usage()
os.Exit(2)
98 changes: 98 additions & 0 deletions cmd/aro/mimoactuator.go
@@ -0,0 +1,98 @@
package main

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"github.com/sirupsen/logrus"

	"github.com/Azure/ARO-RP/pkg/database"
	"github.com/Azure/ARO-RP/pkg/env"
	"github.com/Azure/ARO-RP/pkg/metrics/statsd"
	"github.com/Azure/ARO-RP/pkg/metrics/statsd/golang"
	"github.com/Azure/ARO-RP/pkg/mimo/actuator"
	"github.com/Azure/ARO-RP/pkg/mimo/tasks"
	"github.com/Azure/ARO-RP/pkg/proxy"
	"github.com/Azure/ARO-RP/pkg/util/service"
)

func mimoActuator(ctx context.Context, log *logrus.Entry) error {
	stop := make(chan struct{})

	_env, err := env.NewEnv(ctx, log, env.COMPONENT_MIMO_ACTUATOR)
	if err != nil {
		return err
	}

	var keys []string
	if _env.IsLocalDevelopmentMode() {
		keys = []string{}
	} else {
		keys = []string{
			"MDM_ACCOUNT",
			"MDM_NAMESPACE",
		}
	}

	if err = env.ValidateVars(keys...); err != nil {
		return err
	}

	m := statsd.New(ctx, log.WithField("component", "actuator"), _env, os.Getenv("MDM_ACCOUNT"), os.Getenv("MDM_NAMESPACE"), os.Getenv("MDM_STATSD_SOCKET"))

	g, err := golang.NewMetrics(_env.Logger(), m)
	if err != nil {
		return err
	}
	go g.Run()

	dbc, err := service.NewDatabase(ctx, _env, log, m, true)
	if err != nil {
		return err
	}

	dbName, err := service.DBName(_env.IsLocalDevelopmentMode())
	if err != nil {
		return err
	}

	clusters, err := database.NewOpenShiftClusters(ctx, dbc, dbName)
	if err != nil {
		return err
	}

	manifests, err := database.NewMaintenanceManifests(ctx, dbc, dbName)
	if err != nil {
		return err
	}

	dbg := database.NewDBGroup().
		WithOpenShiftClusters(clusters).
		WithMaintenanceManifests(manifests)

	dialer, err := proxy.NewDialer(_env.IsLocalDevelopmentMode())
	if err != nil {
		return err
	}

	a := actuator.NewService(_env, _env.Logger(), dialer, dbg, m)
	a.SetMaintenanceTasks(tasks.DEFAULT_MAINTENANCE_SETS)

	sigterm := make(chan os.Signal, 1)
	done := make(chan struct{})
	signal.Notify(sigterm, syscall.SIGTERM)

	go a.Run(ctx, stop, done)

	<-sigterm
	log.Print("received SIGTERM")
	close(stop)
	<-done

	return nil
}
9 changes: 9 additions & 0 deletions cmd/aro/rp.go
@@ -172,6 +172,15 @@ func rp(ctx context.Context, log, audit *logrus.Entry) error {
WithPlatformWorkloadIdentityRoleSets(dbPlatformWorkloadIdentityRoleSets).
WithSubscriptions(dbSubscriptions)

// MIMO only activated in development for now
if _env.IsLocalDevelopmentMode() {
dbMaintenanceManifests, err := database.NewMaintenanceManifests(ctx, dbc, dbName)
if err != nil {
return err
}
dbg.WithMaintenanceManifests(dbMaintenanceManifests)
}

f, err := frontend.NewFrontend(ctx, audit, log.WithField("component", "frontend"), _env, dbg, api.APIs, metrics, clusterm, feAead, hiveClusterManager, adminactions.NewKubeActions, adminactions.NewAzureActions, adminactions.NewAppLensActions, clusterdata.NewParallelEnricher(metrics, _env))
if err != nil {
return err
22 changes: 22 additions & 0 deletions docs/mimo/README.md
@@ -0,0 +1,22 @@
# MIMO Documentation

The Managed Infrastructure Maintenance Operator, or MIMO, is a component of the Azure Red Hat OpenShift Resource Provider (ARO-RP) which is responsible for automated maintenance of clusters provisioned by the platform.
MIMO specifically focuses on "managed infrastructure", the parts of ARO that are deployed and maintained by the RP and ARO Operator instead of by OCP (in-cluster) or Hive (out-of-cluster).

MIMO consists of two main components, the [Actuator](./actuator.md) and the [Scheduler](./scheduler.md). It is primarily accessed through the [Admin API](./admin-api.md).

## A Primer On MIMO

The smallest thing that you can tell MIMO to run is a **Task** (see [`pkg/mimo/tasks/`](../../pkg/mimo/tasks/)).
A Task is composed of reusable **Steps** (see [`pkg/mimo/steps/`](../../pkg/mimo/steps/)), reusing the framework utilised by AdminUpdate/Update/Install methods in `pkg/cluster/`.
A Task only runs in the scope of a single cluster.
A Task's Steps are run in sequence and can return either **Terminal** errors (which cause the Task to fail without being retried) or **Transient** errors (which indicate that the Task can be retried later).

Tasks are executed by the **Actuator** by way of creation of a **Maintenance Manifest**.
This Manifest is created with the cluster ID (which is elided from the cluster-scoped Admin APIs), the Task ID (which is currently a UUID), and optional priority, "start after", and "start before" times which are filled in with defaults if not provided.
The Actuator will treat these Maintenance Manifests as a work queue, taking ones which are past their "start after" time and executing them in order of earliest start-after and priority.
After each Task is run, its result is written into the Manifest as a state, with optional free-form status text.
Manifests past their "start before" time are marked with a "timed out" state and are not run.
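
As a rough illustration of the work-queue behaviour described above, the following Go sketch selects and orders pending Manifests. The `Manifest` struct and `Plan` function are hypothetical, timestamps are assumed to be Unix seconds, and "lower priority value runs first" is an assumption rather than something this document states.

```go
package main

import (
	"fmt"
	"sort"
)

// Manifest is a trimmed-down, hypothetical stand-in for a Maintenance Manifest.
type Manifest struct {
	ID        string
	Priority  int
	RunAfter  int64 // assumed to be Unix seconds
	RunBefore int64 // assumed to be Unix seconds
}

// Plan splits pending manifests into those that should run now (in queue
// order) and those whose "start before" time has passed. Manifests that have
// not yet reached their "start after" time are left out of both lists.
func Plan(pending []Manifest, now int64) (runnable, timedOut []Manifest) {
	for _, m := range pending {
		switch {
		case now > m.RunBefore:
			timedOut = append(timedOut, m)
		case now >= m.RunAfter:
			runnable = append(runnable, m)
		}
	}
	// Earliest start-after first; ties broken by priority (assumed ascending).
	sort.SliceStable(runnable, func(i, j int) bool {
		if runnable[i].RunAfter != runnable[j].RunAfter {
			return runnable[i].RunAfter < runnable[j].RunAfter
		}
		return runnable[i].Priority < runnable[j].Priority
	})
	return runnable, timedOut
}

func main() {
	pending := []Manifest{
		{ID: "a", Priority: 1, RunAfter: 900, RunBefore: 2000},
		{ID: "b", Priority: 0, RunAfter: 900, RunBefore: 2000},
		{ID: "c", Priority: 0, RunAfter: 100, RunBefore: 500},   // window has closed
		{ID: "d", Priority: 0, RunAfter: 1500, RunBefore: 3000}, // not due yet
		{ID: "e", Priority: 5, RunAfter: 800, RunBefore: 2000},
	}

	runnable, timedOut := Plan(pending, 1000)
	for _, m := range runnable {
		fmt.Println("run:", m.ID)
	}
	for _, m := range timedOut {
		fmt.Println("timed out:", m.ID)
	}
}
```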

Currently, Manifests are created by the Admin API.
In the future, the Scheduler will create some of these Manifests depending on cluster state/version and wall-clock time, providing the ability to perform tasks like rotations of secrets autonomously.
30 changes: 30 additions & 0 deletions docs/mimo/actuator.md
@@ -0,0 +1,30 @@
# Managed Infrastructure Maintenance Operator: Actuator

The Actuator is the MIMO component that executes tasks.
The process of running tasks looks like this:

```mermaid
graph TD;
START((Start))-->QUERY;
QUERY[Fetch all State = Pending] -->SORT;
SORT[Sort tasks by RUNAFTER and PRIORITY]-->ITERATE[Iterate over tasks];
ITERATE-- Per Task -->ISEXPIRED;
subgraph PerTask[ ]
ISEXPIRED{{Is now > RUNBEFORE?}}-- Yes --> STATETIMEDOUT([State = TimedOut]) --> CONTINUE[Continue];
ISEXPIRED-- No --> DEQUEUECLUSTER;
DEQUEUECLUSTER[Claim lease on OpenShiftClusterDocument] --> DEQUEUE;
DEQUEUE[Actuator dequeues task]--> ISRETRYLIMIT;
ISRETRYLIMIT{{Have we retried the task too many times?}} -- Yes --> STATERETRYEXCEEDED([State = RetriesExceeded]) --> CONTINUE;
ISRETRYLIMIT -- No -->STATEINPROGRESS;
STATEINPROGRESS([State = InProgress]) -->RUN[[Task is run]];
RUN -- Success --> SUCCESS
RUN-- Terminal Error-->TERMINALERROR;
RUN-- Transient Error-->TRANSIENTERROR;
SUCCESS([State = Completed])-->DELEASECLUSTER
TERMINALERROR([State = Failed])-->DELEASECLUSTER;
TRANSIENTERROR([State = Pending])-->DELEASECLUSTER;
DELEASECLUSTER[Release Lease on OpenShiftClusterDocument] -->CONTINUE;
end
CONTINUE-->ITERATE;
ITERATE-- Finished -->END;
```
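
The per-task branch of the flowchart can be read as a small state function. The Go sketch below is illustrative only: `Process`, `Outcome`, and the `attempts`/`maxAttempts` parameters are hypothetical names, timestamps are assumed to be Unix seconds, and the cluster lease and the intermediate `InProgress` write are reduced to comments.

```go
package main

import "fmt"

// State mirrors the state names used in the flowchart.
type State string

const (
	Pending         State = "Pending"
	Completed       State = "Completed"
	Failed          State = "Failed"
	RetriesExceeded State = "RetriesExceeded"
	TimedOut        State = "TimedOut"
)

// Outcome is a hypothetical summary of what running a Task produced.
type Outcome int

const (
	Success Outcome = iota
	TerminalError
	TransientError
)

// Process walks one manifest through the per-task branch of the flowchart and
// returns the state it ends in. run is only invoked if the manifest is still
// inside its window and under the retry limit.
func Process(now, runBefore int64, attempts, maxAttempts int, run func() Outcome) State {
	if now > runBefore {
		return TimedOut
	}
	// The real actuator claims a lease on the cluster document at this point.
	if attempts >= maxAttempts {
		return RetriesExceeded
	}
	// The manifest would be written as InProgress before the Task runs.
	switch run() {
	case Success:
		return Completed
	case TerminalError:
		return Failed
	default:
		// Transient errors put the manifest back in the queue.
		return Pending
	}
}

func main() {
	ok := func() Outcome { return Success }
	flaky := func() Outcome { return TransientError }

	fmt.Println(Process(100, 200, 0, 3, ok))    // inside window, succeeds
	fmt.Println(Process(300, 200, 0, 3, ok))    // window already closed
	fmt.Println(Process(100, 200, 3, 3, ok))    // retry budget used up
	fmt.Println(Process(100, 200, 1, 3, flaky)) // transient failure, requeued
}
```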
Empty file added docs/mimo/admin-api.md
Empty file.
3 changes: 3 additions & 0 deletions docs/mimo/scheduler.md
@@ -0,0 +1,3 @@
# MIMO Scheduler

The MIMO Scheduler is a planned component, but is not yet implemented.
1 change: 1 addition & 0 deletions docs/mimo/writing-tasks.md
@@ -0,0 +1 @@
# Writing MIMO Tasks
2 changes: 1 addition & 1 deletion go.mod
@@ -77,6 +77,7 @@ require (
github.com/vincent-petithory/dataurl v1.0.0
go.uber.org/mock v0.4.0
golang.org/x/crypto v0.28.0
+golang.org/x/exp v0.0.0-20240222234643-814bf88cf225
golang.org/x/net v0.30.0
golang.org/x/oauth2 v0.18.0
golang.org/x/sync v0.8.0
@@ -260,7 +261,6 @@ require (
go.opentelemetry.io/otel/metric v1.22.0 // indirect
go.opentelemetry.io/otel/trace v1.22.0 // indirect
go.starlark.net v0.0.0-20220328144851-d1966c6b9fcd // indirect
-golang.org/x/exp v0.0.0-20240222234643-814bf88cf225 // indirect
golang.org/x/mod v0.17.0 // indirect
golang.org/x/sys v0.26.0 // indirect
golang.org/x/term v0.25.0 // indirect
37 changes: 37 additions & 0 deletions hack/e2e/run-rp-and-e2e.sh
@@ -91,6 +91,43 @@ kill_portal() {
wait $rppid
}

run_mimo_actuator() {
    echo "########## 🚀 Run MIMO Actuator in background ##########"
    export AZURE_ENVIRONMENT=AzurePublicCloud
    ./aro mimo-actuator &
}

kill_mimo_actuator() {
    echo "########## Kill the MIMO Actuator running in background ##########"
    rppid=$(lsof -t -i :8445)
    kill $rppid
    wait $rppid
}

validate_mimo_actuator_running() {
    echo "########## ?Checking MIMO Actuator Status ##########"
    ELAPSED=0
    while true; do
        sleep 5
        http_code=$(curl -k -s -o /dev/null -w '%{http_code}' http://localhost:8445/healthz/ready)
        case $http_code in
            "200")
            echo "########## ✅ ARO MIMO Actuator Running ##########"
            break
            ;;
            *)
            echo "Attempt $ELAPSED - local MIMO Actuator is NOT up. Code : $http_code, waiting"
            sleep 2
            # give up after 20 attempts (roughly 140 seconds of sleeping) so CI is not blocked
            ELAPSED=$((ELAPSED + 1))
            if [ $ELAPSED -eq 20 ]; then
                exit 1
            fi
            ;;
        esac
    done
}

run_vpn() {
echo "########## 🚀 Run OpenVPN in background ##########"
echo "Using Secret secrets/$VPN"
41 changes: 41 additions & 0 deletions pkg/api/admin/mimo.go
@@ -0,0 +1,41 @@
package admin

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

type MaintenanceManifestState string

const (
	MaintenanceManifestStatePending         MaintenanceManifestState = "Pending"
	MaintenanceManifestStateInProgress      MaintenanceManifestState = "InProgress"
	MaintenanceManifestStateCompleted       MaintenanceManifestState = "Completed"
	MaintenanceManifestStateFailed          MaintenanceManifestState = "Failed"
	MaintenanceManifestStateRetriesExceeded MaintenanceManifestState = "RetriesExceeded"
	MaintenanceManifestStateTimedOut        MaintenanceManifestState = "TimedOut"
	MaintenanceManifestStateCancelled       MaintenanceManifestState = "Cancelled"
)

type MaintenanceManifest struct {
	// The ID for the resource.
	ID string `json:"id,omitempty"`

	State      MaintenanceManifestState `json:"state,omitempty"`
	StatusText string                   `json:"statusText,omitempty"`

	MaintenanceTaskID string `json:"maintenanceTaskID,omitempty"`
	Priority          int    `json:"priority,omitempty"`

	// RunAfter defines the earliest that this manifest should start running
	RunAfter int `json:"runAfter,omitempty"`
	// RunBefore defines the latest that this manifest should start running
	RunBefore int `json:"runBefore,omitempty"`
}

// MaintenanceManifestList represents a list of MaintenanceManifests.
type MaintenanceManifestList struct {
	// The list of MaintenanceManifests.
	MaintenanceManifests []*MaintenanceManifest `json:"value"`

	// The link used to get the next page of operations.
	NextLink string `json:"nextLink,omitempty"`
}
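
For a sense of what this type looks like on the wire, here is a small sketch that marshals a local copy of the struct. The copy is declared here only so the example runs standalone (with `State` simplified to a plain string), the task ID is a placeholder, the `encode` helper is hypothetical, and reading `runAfter`/`runBefore` as Unix seconds is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MaintenanceManifest is a local copy of the admin API type, declared here so
// the sketch is self-contained. State is simplified to a plain string.
type MaintenanceManifest struct {
	ID                string `json:"id,omitempty"`
	State             string `json:"state,omitempty"`
	StatusText        string `json:"statusText,omitempty"`
	MaintenanceTaskID string `json:"maintenanceTaskID,omitempty"`
	Priority          int    `json:"priority,omitempty"`
	RunAfter          int    `json:"runAfter,omitempty"`
	RunBefore         int    `json:"runBefore,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of a manifest.
func encode(m MaintenanceManifest) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	out, err := encode(MaintenanceManifest{
		MaintenanceTaskID: "00000000-0000-0000-0000-000000000000", // placeholder task ID
		RunAfter:          1700000000,                             // assumed Unix seconds
		RunBefore:         1700003600,                             // assumed Unix seconds
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because every field carries `omitempty`, zero values such as a `priority` of 0 or an empty `state` are dropped from the output, so an explicit 0 cannot be distinguished from an unset field in the JSON alone.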