Opsgenie Alias for deduplication of alerts #460

Open
hhollenstain opened this issue Jan 9, 2023 · 18 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@hhollenstain

We currently use Opsgenie for paging and have found that the integration with Flux works pretty well. The main pain point is the missing alias for deduplication. Currently, when an alert is triggered, it continuously fires and creates new pages. Ideally, we could set an alias, fire once, and let Opsgenie handle additional notifications/triggers.

payload := OpsgenieAlert{

Opsgenie API docs
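
To make the request concrete, here is a minimal sketch of a payload that carries an alias. It assumes that Opsgenie deduplicates on the `alias` field of its create-alert request; the `opsgenieAlert` type, its `Alias` field, and the `aliasFor` helper are hypothetical names for illustration and are not taken from notification-controller.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// opsgenieAlert is a hypothetical payload type for this sketch. Only the
// Alias field is new compared with the fields discussed in this issue.
type opsgenieAlert struct {
	Message     string            `json:"message"`
	Alias       string            `json:"alias,omitempty"`
	Description string            `json:"description"`
	Details     map[string]string `json:"details,omitempty"`
}

// aliasFor derives a stable alias from the identity of the involved object,
// so repeated events for the same object map to the same alias.
func aliasFor(kind, namespace, name string) string {
	sum := sha256.Sum256([]byte(kind + "/" + namespace + "/" + name))
	return hex.EncodeToString(sum[:])
}

func main() {
	alert := opsgenieAlert{
		Message:     "Kustomization/somecomponent",
		Alias:       aliasFor("Kustomization", "somenamespace", "somecomponent"),
		Description: "health check failed", // made-up description
	}
	body, err := json.Marshal(alert)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Hashing the object identity is only one possible choice; any value that stays the same across repeated events for the same problem would serve as an alias.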

@al-lac
Contributor

al-lac commented Mar 15, 2023

This would be nice to have.

It would also be great if alerts resolved themselves automatically once the underlying problem is no longer occurring.

@makkes makkes added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Mar 15, 2023
@PatrickZeier-SAG

It would generally be nice to be able to customize the content of the fields. If you have a large OpsGenie/JSM instance where alerts from multiple systems are processed, you want more information than e.g. "Kustomization/somecomponent" in the title of the alert, as this is not very specific.

@stefanprodan
Member

event.Metadata gets injected in the payload sent to OpsGenie so you can add cluster name, region, etc. Can't this be used for deduplication?

@PatrickZeier-SAG

PatrickZeier-SAG commented Apr 16, 2024

In JSM I see this in the created alert:
image
(summary and testField were added by me in spec.eventMetadata of the Flux alert)

which according to the Jira API documentation matches this field:
image

And event.Metadata seems to refer to spec.eventMetadata and is also part of the payload:

payload := OpsgenieAlert{
		Message:     event.InvolvedObject.Kind + "/" + event.InvolvedObject.Name,
		Description: event.Message,
		Details:     event.Metadata,

With some Jira automation rules deduplication and title manipulation could work (need to check with some admin there on our side). Customization of the fields on Flux side would be a bit easier in my eyes, but see it as a feature request 😃.
Many thanks @stefanprodan for the hint!

@al-lac
Contributor

al-lac commented Apr 16, 2024

For Opsgenie i just set the alias to the description, works out most of the time, as long as the description does not contain a time string that is always different.

What would be great however would be a message that could also close the alert. So like a "recovery" message.
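
The time-string caveat can be illustrated with a small sketch. It assumes the alias is computed in code (for example in a relay in front of Opsgenie, which is not something this thread describes) and that the varying part is an RFC 3339-style timestamp; `normalizeForAlias` and the sample descriptions are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// timestampRE matches RFC 3339-style timestamps such as 2024-04-16T09:30:00Z.
var timestampRE = regexp.MustCompile(`\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?`)

// normalizeForAlias removes timestamps and collapses whitespace so that two
// descriptions differing only in their time string yield the same alias.
func normalizeForAlias(description string) string {
	stripped := timestampRE.ReplaceAllString(description, "")
	return strings.Join(strings.Fields(stripped), " ")
}

func main() {
	a := normalizeForAlias("health check failed at 2024-04-16T09:30:00Z for app")
	b := normalizeForAlias("health check failed at 2024-04-16T09:45:12Z for app")
	fmt.Println(a == b) // the two descriptions now collapse to one alias
}
```

Other varying fragments (durations, counters, random suffixes) would need their own handling; this sketch covers timestamps only.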

@stefanprodan
Member

Flux is stateless; there is no way to send recovery messages, as notification-controller doesn't know it has sent a previous error alert.

@PatrickZeier-SAG

For Opsgenie i just set the alias to the description, works out most of the time, as long as the description does not contain a time string that is always different.

What would be great however would be a message that could also close the alert. So like a "recovery" message.

That's also possible.
What I use for the alias: Message title (which I enriched with some more text like the cluster name) plus the revision that comes as metadata field from Flux by default.

So, for me the alias looks like this: [FluxCD] ({{extraProperties.cluster}}) {{message}} {{extraProperties.revision}}

As the message also contains the Kustomization name and I am only sending alerts about Kustomizations, this should be enough. Of course, someone could leave the alert open, and in the meantime another issue could occur in the cluster for this Kustomization that no longer matches the alert description. But as you said, @al-lac: when the description contains a time string, the deduplication won't work.
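
For readers who want to see what the alias template quoted above evaluates to, here is a small sketch that assembles the same string in Go. It is only an illustration of the resulting value, not how JSM evaluates its templates; `buildAlias` is a hypothetical helper and the revision value is made up.

```go
package main

import "fmt"

// buildAlias mirrors the JSM alias template quoted above:
// [FluxCD] ({{extraProperties.cluster}}) {{message}} {{extraProperties.revision}}
func buildAlias(message string, extra map[string]string) string {
	return fmt.Sprintf("[FluxCD] (%s) %s %s", extra["cluster"], message, extra["revision"])
}

func main() {
	extra := map[string]string{
		"cluster":  "mycluster",
		"revision": "main@sha1:0123abc", // made-up revision value
	}
	fmt.Println(buildAlias("Kustomization/somecomponent", extra))
}
```

Because the revision is part of the alias, a new commit produces a new alias and therefore a new alert, while repeated failures of the same revision are deduplicated.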

@al-lac
Contributor

al-lac commented Apr 16, 2024

@stefanprodan True, guess the only way that would work would be to send messages for every run that was ok, which would be a little noisy.

@stefanprodan
Member

@al-lac
Contributor

al-lac commented Apr 16, 2024

@PatrickZeier-SAG how did you manage to enrich it with the cluster name? Did you just add more information to spec.eventMetadata?

@stefanprodan i guess i would need to set the eventSeverity to info right? I guess i would need to filter on this then when creating / resolving alerts.

@PatrickZeier-SAG

@al-lac Exactly.
That's the alert:

apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata: 
  name: jsm
  namespace: somenamespace
spec: 
  providerRef: 
    name: jsm
  eventSeverity: error
  eventSources:     
    - kind: Kustomization
      name: '*'
      namespace: somenamespace
  eventMetadata: 
    cluster: "mycluster"

And this I can then access like described above with {{extraProperties.cluster}} in JSM (probably OpsGenie as well, never tested).

@al-lac
Contributor

al-lac commented Apr 16, 2024

@PatrickZeier-SAG thanks that works perfectly!

Now i just need to find a way to differentiate between errors, infos and recovery messages 😁

@PatrickZeier-SAG

@al-lac I would be happy to read about your solution if you find something 😃 . Especially the recovery message (I did not yet get that out of the code Stefan linked).

Idea for differentiation between severity types: You could add one Flux alert per severity but with different value in the eventMetadata. E.g. severity: error. Then you can parse this field in JSM/OpsGenie and set the alert priority or whatever you want to do with that info.

@al-lac
Contributor

al-lac commented Apr 16, 2024

@PatrickZeier-SAG ah yeah that is one way of handling this. Thanks for the tip!

Yeah, me neither. i don't see how a recovery message is different from the rest. Maybe @stefanprodan can elaborate further.

@stefanprodan
Member

For the same revision, Flux will emit a single info event and not spam. If, say, the health check fails for a new Git commit and then recovers, you get two events: error and info.

@al-lac
Contributor

al-lac commented Apr 17, 2024

Ok, i thought of doing it the way @PatrickZeier-SAG suggested, i.e. having two alerts, one for info and one for error. But as the info alert also includes the error events, i cannot use it to close the alert, as the two would always get in each other's way.

@stefanprodan ok, that is good to know. But how will i be able to differentiate between error and info if the severity does not get sent to the provider? If i had the severity (info / error), i could just match on the revision and resolve the alert once a new info message comes in with the same resource id.

So i would set the following as an alias on OpsGenie: -main@

But without the information if it is an error or info i cannot do the closing :-(

@stefanprodan
Member

But without the information if it is an error or info i cannot do the closing

Feel free to open a PR, all you need is adding event.Severity to the payload.
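
A rough sketch of what such a change might look like, under the assumption that the severity is passed along as one more entry in the details map. The `detailsWithSeverity` helper and the `severity` key name are hypothetical and are not taken from the notification-controller source or from any merged PR.

```go
package main

import "fmt"

// detailsWithSeverity copies the event metadata and adds the severity under a
// "severity" key (the key name is an assumption), leaving the input map untouched.
func detailsWithSeverity(metadata map[string]string, severity string) map[string]string {
	details := make(map[string]string, len(metadata)+1)
	for k, v := range metadata {
		details[k] = v
	}
	details["severity"] = severity
	return details
}

func main() {
	metadata := map[string]string{"cluster": "mycluster"}
	details := detailsWithSeverity(metadata, "error")
	fmt.Println(details["cluster"], details["severity"])
}
```

Copying the map instead of mutating it avoids changing metadata that other code may still read.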

@al-lac
Contributor

al-lac commented Apr 19, 2024

So with the changes from #796 i managed to set the eventSeverity to info and filter on the Opsgenie side so that only errors are turned into alerts.

However, i seem to not get enough info alerts with the following configuration:

---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
    name: gitops-notifications-opsgenie
    namespace: flux-system
spec:
    summary: Alert from flux for cluster a
    providerRef:
        name: opsgenie
    eventSeverity: info
    eventSources:
      - kind: GitRepository
        name: '*'
        namespace: cluster-a
      - kind: Kustomization
        name: '*'
        namespace: cluster-a
      - kind: HelmRelease
        name: '*'
        namespace: cluster-a

Shouldn't i then also get an alert for every finished reconciliation?

I let one kustomization fail and repaired it again, but i never got any recovery message or info message in Opsgenie...

The only thing i see in the Opsgenie log is this alert coming in every time a sync runs:
CustomResourceDefinition/clustersecretstores.external-secrets.io configured

5 participants