-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Managed Infrastructure Maintenance Operator - Milestone 1 #3571
base: master
Are you sure you want to change the base?
Conversation
Please rebase pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've started a review, and reached my ingestion limit. I'll keep reviewing later.
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict { | ||
err.StatusCode = http.StatusPreconditionFailed | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we overwriting the http status condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like it's because of line 143. We're saying that in case of a conflict we want to change it to a status that will make the cosmosdb Retry function retry the request. If this is the case, I think adding a comment here would be helpful in case the behavior of functions that use the cosmosdb Retry function changes in the future.
return c.c.Get(ctx, clusterID, id, nil) | ||
} | ||
|
||
// QueueLength returns maintenanceManifests un-queued document count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is a little confusing. I think we need to work a little more on definitional language here.
e.g.:
- Scheduled
- Queued
- Pending
- etc
And what these statuses might all mean in practice.
It seems to me like QueueLength is returning the list of MaintenanceSets that are pending delivery to the actuator?
/azp run ci, e2e |
Azure Pipelines successfully started running 2 pipeline(s). |
Please rebase pull request. |
Please rebase pull request. |
pkg/mimo/actuator/manager.go
Outdated
docs, err := i.Next(ctx, -1) | ||
if err != nil { | ||
return false, err | ||
} | ||
if docs == nil { | ||
break | ||
} | ||
|
||
docList = append(docList, docs.MaintenanceManifestDocuments...) | ||
} | ||
|
||
manifestsToAction := make([]*api.MaintenanceManifestDocument, 0) | ||
|
||
sort.SliceStable(docList, func(i, j int) bool { | ||
if docList[i].MaintenanceManifest.RunAfter != docList[j].MaintenanceManifest.RunAfter { | ||
return docList[i].MaintenanceManifest.Priority < docList[j].MaintenanceManifest.Priority | ||
} | ||
|
||
return docList[i].MaintenanceManifest.RunAfter < docList[j].MaintenanceManifest.RunAfter | ||
}) | ||
|
||
evaluationTime := a.now() | ||
|
||
// Check for manifests that have timed out first | ||
for _, doc := range docList { | ||
if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) { | ||
// timed out, mark as such | ||
a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC()) | ||
|
||
_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error { | ||
d.MaintenanceManifest.State = api.MaintenanceManifestStateTimedOut | ||
d.MaintenanceManifest.StatusText = fmt.Sprintf("timed out at %s", evaluationTime.UTC()) | ||
return nil | ||
}) | ||
if err != nil { | ||
a.log.Error(err) | ||
} | ||
} else { | ||
// not timed out, do something about it | ||
manifestsToAction = append(manifestsToAction, doc) | ||
} | ||
} | ||
|
||
// Nothing to do, don't dequeue | ||
if len(manifestsToAction) == 0 { | ||
return false, nil | ||
} | ||
|
||
// Dequeue the document | ||
oc, err := a.oc.Get(ctx, a.clusterID) | ||
if err != nil { | ||
return false, err | ||
} | ||
|
||
oc, err = a.oc.DoDequeue(ctx, oc) | ||
if err != nil { | ||
return false, err // This will include StatusPreconditionFailed | ||
} | ||
|
||
taskContext := newTaskContext(a.env, a.log, oc) | ||
|
||
// Execute on the manifests we want to action | ||
for _, doc := range manifestsToAction { | ||
// here | ||
f, ok := a.tasks[doc.MaintenanceManifest.MaintenanceSetID] | ||
if !ok { | ||
a.log.Infof("not found %v", doc.MaintenanceManifest.MaintenanceSetID) | ||
continue | ||
} | ||
|
||
// Attempt a dequeue | ||
doc, err = a.mmf.Lease(ctx, a.clusterID, doc.ID) | ||
if err != nil { | ||
// log and continue if it doesn't work | ||
a.log.Error(err) | ||
continue | ||
} | ||
|
||
// if we've tried too many times, give up | ||
if doc.Dequeues > maxDequeueCount { | ||
err := fmt.Errorf("dequeued %d times, failing", doc.Dequeues) | ||
_, leaseErr := a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, api.MaintenanceManifestStateTimedOut, to.StringPtr(err.Error())) | ||
if leaseErr != nil { | ||
a.log.Error(err) | ||
} | ||
continue | ||
} | ||
|
||
// Perform the task | ||
state, msg := f(ctx, taskContext, doc, oc) | ||
_, err = a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, state, &msg) | ||
if err != nil { | ||
a.log.Error(err) | ||
} | ||
} | ||
|
||
// release the OpenShiftCluster | ||
_, err = a.oc.EndLease(ctx, a.clusterID, oc.OpenShiftCluster.Properties.ProvisioningState, api.ProvisioningStateMaintenance, nil) | ||
return true, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to split the logic into private funcs to improve readability, something like:
func (a *actuator) Process(ctx context.Context) (bool, error) {
// Fetch manifests
manifests, err := a.fetchManifests(ctx)
if err != nil {
return false, err
}
// Evaluate and segregate manifests
expiredManifests, actionableManifests := a.evaluateManifests(manifests)
// Handle expired manifests
a.handleExpiredManifests(ctx, expiredManifests)
// If no actionable manifests, return
if len(actionableManifests) == 0 {
return false, nil
}
// Dequeue the cluster document
oc, err := a.oc.DequeueCluster(ctx, a.clusterID)
if err != nil {
return false, err
}
// Execute tasks
taskContext := newTaskContext(a.env, a.log, oc)
a.executeTasks(ctx, taskContext, actionableManifests)
// Release the cluster lease
return true, a.oc.EndClusterLease(ctx, a.clusterID, oc)
}
func (a *actuator) fetchManifests(ctx context.Context) ([]*api.MaintenanceManifestDocument, error) {
// Fetch manifests logic here
}
func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) ([]*api.MaintenanceManifestDocument, []*api.MaintenanceManifestDocument) {
// Evaluation logic here
}
func (a *actuator) handleExpiredManifests(ctx context.Context, expiredManifests []*api.MaintenanceManifestDocument) {
// Handling expired manifests logic here
}
func (a *actuator) executeTasks(ctx context.Context, taskContext tasks.TaskContext, manifests []*api.MaintenanceManifestDocument) {
// Task execution logic here
}
pkg/database/mimo.go
Outdated
triggerc := cosmosdb.NewTriggerClient(collc, collMaintenanceManifests) | ||
for _, trigger := range triggers { | ||
_, err := triggerc.Create(ctx, trigger) | ||
if err != nil && !cosmosdb.IsErrorStatusCode(err, http.StatusConflict) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this line of code is used everywhere; I'm just wondering if it is really safe to ignore the 409 error here — maybe simply start by logging it somewhere 🤔?
|
||
return c.patchWithLease(ctx, clusterID, id, func(doc *api.MaintenanceManifestDocument) error { | ||
doc.LeaseOwner = c.uuid | ||
doc.Dequeues++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if doc.LeaseExpires should be updated here as well, seems like it is not updated in either patchWithLease() or patch()?
pkg/mimo/actuator/manager.go
Outdated
// Get the manifests for this cluster which need to be worked | ||
i, err := a.mmf.GetByClusterID(ctx, a.clusterID, "") | ||
if err != nil { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we log the error here? e.g., a.log.WithError(....)
or a.log.Error(....)
?
pkg/mimo/actuator/manager.go
Outdated
for { | ||
docs, err := i.Next(ctx, -1) | ||
if err != nil { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, shall we log the error here? e.g., a.log.WithError(....) or a.log.Error(....) ?
pkg/mimo/actuator/manager.go
Outdated
// timed out, mark as such | ||
a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC()) | ||
|
||
_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we implement a retry logic here? just to make the patch action more robust?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, left some comments for potential improvements, please have a check
7f1f0ea
to
7f275e0
Compare
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
Which issue this PR addresses:
Part of https://issues.redhat.com/browse/ARO-4895.
What this PR does / why we need it:
This PR is the initial feature branch for the MIMO M1 milestone.
Is there any documentation that needs to be updated for this PR?
Yes, see https://issues.redhat.com/browse/ARO-4895 .
How do you know this will function as expected in production?
Telemetry, monitoring, and documentation will need to be fleshed out. See https://issues.redhat.com/browse/ARO-4895 for details.