Let redundancy groups not just fail #10190

Open
nilmerg opened this issue Oct 16, 2024 · 0 comments

nilmerg commented Oct 16, 2024

This is not a bug, not yet, at least. Once #10014 is fixed, this should be considered.

tl;dr

Redundancy groups should not only be able to fail. Just like hosts and services, they should also be able to be not reachable (or not determinable).

What's this about?

graph LR;
    ChildHost-->pa["ParentHostA (Group 1)"];
    ChildHost-->pb["ParentHostB (Group 1)"];
    pa-->ga["GrandParentA"];
    pb-->gb["GrandParentB"];

Right now, the child host will be unreachable in two cases:

  1. Both parents are down (expected)
  2. One of them is unreachable (The linked bug)

Once the second case is fixed, it will likely behave like this instead:

  1. Both parents are down (still, expected)
  2. Both parents are unreachable (new, expected)

But what happens in case only one of the parents is down and the other unreachable?

  1. One parent is down and the other unreachable (new, ??)

I suppose the child host will be unreachable in this case as well, since its availability is already influenced by its parents' reachability today. That logic will just be extended to consider redundancy groups in such a case.
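To make the cases above concrete, here is a minimal model of them. It is only a sketch and not Icinga 2 code: `Parent`, `child_reachable_today` and `child_reachable_proposed` are hypothetical names, and the rules encode nothing more than the behaviour described in this issue (the "today" variant includes the linked bug).

```python
from dataclasses import dataclass


@dataclass
class Parent:
    up: bool         # last known state: UP (True) or DOWN (False)
    reachable: bool  # whether the parent itself is reachable


def child_reachable_today(group):
    # Behaviour described above: the child is unreachable if a single
    # parent is unreachable (the linked bug) or if all parents are down.
    if any(not p.reachable for p in group):
        return False
    return any(p.up for p in group)


def child_reachable_proposed(group):
    # Assumed behaviour after the fix: a parent only counts for the
    # redundancy group if it is both up and reachable.
    return any(p.up and p.reachable for p in group)


healthy = Parent(up=True, reachable=True)
cut_off = Parent(up=True, reachable=False)
down = Parent(up=False, reachable=True)

print(child_reachable_today([healthy, cut_off]))     # False (the linked bug)
print(child_reachable_proposed([healthy, cut_off]))  # True
print(child_reachable_proposed([down, cut_off]))     # False (the open question)
```

The last call is the case in question: one parent is down, the other is UP but unreachable, and the model treats the child as unreachable.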

But then there is Dependency::IsAvailable

This is what is currently used to determine a redundancy group's state. However, it doesn't consider the parent's reachability at all: the parent might be UP but unreachable, and it still returns true.

So, in order to really ensure that the child host is unreachable in the third case (one parent down, the other unreachable), this might need to change. (The potential bug)
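The difference can be illustrated with a simplified model. This is a sketch under assumptions and does not mirror the real C++ signature or logic of Dependency::IsAvailable: `Checkable`, `is_reachable`, `dependency_available` and the `consider_reachability` flag are all hypothetical, and redundancy groups are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Checkable:
    name: str
    up: bool = True
    parents: list = field(default_factory=list)


def is_reachable(node):
    # Hypothetical rule: a node is reachable if all of its parent
    # dependencies are available (no redundancy groups in this sketch).
    return all(dependency_available(p, consider_reachability=True)
               for p in node.parents)


def dependency_available(parent, consider_reachability):
    if not parent.up:
        return False
    # The behaviour described above corresponds to skipping this check.
    if consider_reachability and not is_reachable(parent):
        return False
    return True


grandparent = Checkable("GrandParentB", up=False)
# ParentHostB still reports UP although its own parent is down.
parent = Checkable("ParentHostB", up=True, parents=[grandparent])

print(dependency_available(parent, consider_reachability=False))  # True
print(dependency_available(parent, consider_reachability=True))   # False
```

With `consider_reachability=False` the dependency on ParentHostB counts as available even though ParentHostB itself is cut off, which is the situation described above.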

Of course, this has larger side-effects, as it changes how dependencies work in general.

That change is what this issue is really about, as I believe it is a good thing. Why shouldn't a dependency fail if another one higher up in the hierarchy fails? I think this is effectively already the case anyway. The discrepancy will only show up if checks are disabled on the parent in question, because if checks run, the parent will eventually go down/unknown.

Additional note

What led me to this was the wish to understand how a redundancy group's state will be determined as part of Icinga/icingadb#347.

Until today, I thought it simply represented whether all related dependencies failed or not. But now a dependency might fail just because a parent is not reachable, and so might a redundancy group.

This all sounds fine, unless you want to identify the actual root problem in a failing dependency chain. (A root problem is a dependency node that is reachable and has a problem.)
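The definition of a root problem translates into a simple filter. The following is only an illustration with made-up data: `Node`, `root_problems` and the example chain are hypothetical, and the chain is linear (no redundancy group) for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    has_problem: bool
    reachable: bool


def root_problems(nodes):
    # A root problem is a node that is reachable and has a problem.
    return [n.name for n in nodes if n.reachable and n.has_problem]


chain = [
    Node("GrandParentB", has_problem=True, reachable=True),
    Node("ParentHostB", has_problem=False, reachable=False),
    Node("ChildHost", has_problem=True, reachable=False),
]

print(root_problems(chain))  # ['GrandParentB']
```

Here ChildHost has a problem too, but since it is not reachable, only GrandParentB is reported as the root problem.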

The current plan is to represent redundancy groups in a dependency chain as actual nodes with an actual state. While hosts and services also have a state, they can in addition be not reachable. Redundancy groups can only fail.

But a redundancy group, with one host member that is UP but not reachable, is technically also not reachable. So it's not just a failing dependency, but one that is … not determinable?
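One possible way to express this is a three-valued group state. This is purely a sketch of one interpretation, not a proposal for the actual data model: `GroupState`, `group_state` and the rule "no usable member and at least one unreachable member means unreachable" are assumptions.

```python
from enum import Enum


class GroupState(Enum):
    OK = "ok"
    FAILED = "failed"
    UNREACHABLE = "unreachable"  # i.e. not determinable


def group_state(members):
    # members: list of (up, reachable) tuples, one per group member
    if any(up and reachable for up, reachable in members):
        return GroupState.OK
    # Assumption: without a usable member, one unreachable member is
    # enough to make the whole group's state not determinable.
    if any(not reachable for _, reachable in members):
        return GroupState.UNREACHABLE
    return GroupState.FAILED


print(group_state([(True, False)]))                # GroupState.UNREACHABLE
print(group_state([(False, True), (False, True)])) # GroupState.FAILED
```

The first call is the example from above: a group with one member that is UP but not reachable ends up unreachable rather than merely failed.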
