Icinga2 loses state history and notifications during restart #10179
Comments
Here is everything you will need to reproduce the same errors locally:
Here is the archive with all needed data:
I could also reproduce the same problem on both Debian and RHEL 9, but it was orders of magnitude less likely to happen there.
I have now also reproduced the same issue with icingadb.
How to Reproduce:
From my local tests on a VM with 4 cores and 8 GB of RAM, the error should be observable in around 10 minutes / two deploys. To run that, download the archive, then extract and run it:
tar -xaf dropped_state_query.tar.gz
cd dropped_state_query
cargo run --release -- --host 127.0.0.1 --user icingadb --password icingadb --icingadb | tee analyzed-history.log
Hint: Rust can be installed easily from https://rustup.rs/
Once the Rust program has found some missing state_history and/or notification entries, you can verify that by looking at the service history.
Hi all, Thank you for all the details. I am trying to reproduce the scenario, but so far without success. When was the screenshot taken? Was it immediately after executing the queries to check if something is missing in the database? If so, there is a high chance that Icinga DB had not inserted everything yet. Also, missing entries in the database do not necessarily mean that Icinga has not sent a notification. That should rather be verified using custom check and notification plugins. Best regards,
Hello Eric. I have run the tests over the weekend. Even after several days the entries do not appear in icingadb. This should be obvious, as I posted the screenshot on the 14th, while the state changes happened on the 10th. And as you know, if the changes had happened a few minutes earlier, the view would not have shown the date, but rather the time elapsed since then. As for the notifications, we became aware of the problem because one of our services went into critical without sending any notifications. That normally works, and we have tested the configuration to make sure it does. It was by pure coincidence that we noticed this, which led us to investigate.
@lippserd I have opened a PR to fix the issue.
@lippserd Have you seen my PR? I would really appreciate some feedback.
Describe the bug
If an object has a state change during an icinga2 restart (e.g. during a deploy), it is sometimes not written to the database and does not trigger the notifications.
To Reproduce
icingacli director basket restore < icinga-lost-statechange-basket.json
icingacli director config deploy
With that configuration running, deploy icinga2 a few times:
icingacli director config deploy --force --wait
Soon there will be state changes in the state history that should not be possible:
In this case, the service went from hard warning into soft warning. The soft warning history says that the last state was Ok, but that was never written into the history.
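The inconsistency described above can be checked mechanically: in a complete history, each entry's recorded previous state should match the state the preceding entry ended in. The following is only a minimal Python sketch of that idea, not the attached script. The `StateChange` record, its field names, and the state strings are hypothetical and do not reflect the actual Icinga DB schema.

```python
from dataclasses import dataclass


@dataclass
class StateChange:
    # Hypothetical record; field names are assumptions, not the Icinga DB schema.
    event_time: int       # seconds, any monotonic unit works
    previous_state: str   # state the object was in before this change
    new_state: str        # state the object is in after this change


def find_gaps(history):
    """Return (before, after) pairs where an entry seems to be missing in between."""
    ordered = sorted(history, key=lambda entry: entry.event_time)
    gaps = []
    for before, after in zip(ordered, ordered[1:]):
        # If the next entry claims a previous state that the history never
        # recorded as reached, at least one state change was lost.
        if after.previous_state != before.new_state:
            gaps.append((before, after))
    return gaps


# Made-up sample data mirroring the case above: hard warning is followed by
# soft warning whose previous state is "ok", but no "ok" entry was recorded.
history = [
    StateChange(100, "ok/hard", "warning/soft"),
    StateChange(160, "warning/soft", "warning/hard"),
    StateChange(400, "ok/hard", "warning/soft"),
]

for before, after in find_gaps(history):
    print(
        f"possible lost state change between t={before.event_time} "
        f"and t={after.event_time}: history ends in {before.new_state!r}, "
        f"next entry starts from {after.previous_state!r}"
    )
```

A check like this only flags that something is missing between two entries; it cannot tell how many state changes were dropped.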
To find lost state histories quicker I used the following script:
dropped_state_query.tar.gz
It takes the endpoint, user and password as parameters. If the database is PostgreSQL, it can be run with the --postgres flag.
Expected behavior
I expect icinga2 not to lose state changes like that.
Your Environment
Include as many relevant details about the environment you experienced the problem in
icinga2 --version:
icinga2 feature list:
icinga2 daemon -C:
Additional context
I could observe the loss of notifications in production, but have not yet reproduced that behavior locally. I suspect, however, that the two behaviors are linked.
We could also observe the same behavior when creating objects over the icinga2 api and then immediately sending a check-result. Once again, I have not replicated this locally yet, but I suspect the problem is the same in all these cases.
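For anyone who wants to try the API scenario, here is a rough Python sketch of the two requests involved: creating a service through the Icinga 2 REST API and then submitting a check result for it right away. It only builds the requests and does not send them. The base URL, the host and service names, and the use of the `dummy` check command are assumptions made for illustration; this is untested against the bug itself.

```python
import json
from urllib.parse import quote

# Assumption: a local Icinga 2 API listener on the default port.
API = "https://localhost:5665/v1"


def create_service_request(host, service):
    """Build the request that creates a passive service object."""
    # Service object names have the form "<host>!<service>"; URL-encode the "!".
    name = quote(f"{host}!{service}", safe="")
    body = {
        "attrs": {
            "check_command": "dummy",
            "enable_active_checks": False,
        }
    }
    return "PUT", f"{API}/objects/services/{name}", json.dumps(body)


def check_result_request(host, service, exit_status, output):
    """Build the request that submits a passive check result."""
    body = {
        "type": "Service",
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": exit_status,
        "plugin_output": output,
    }
    return "POST", f"{API}/actions/process-check-result", json.dumps(body)


# Hypothetical object names, used only for illustration.
requests_to_send = [
    create_service_request("example-host", "api-created-service"),
    check_result_request("example-host", "api-created-service", 2, "CRITICAL - test"),
]

for method, url, payload in requests_to_send:
    print(method, url)
    print(payload)
```

When actually sending these, the API expects HTTP basic authentication for an API user and an `Accept: application/json` header. Sending the second request immediately after the first is what mirrors the situation described above.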