
Failure communicating with Statuscake API resulting in duplicate tests #362

Closed
osilva opened this issue Sep 10, 2021 · 5 comments
Labels
kind/bug Something isn't working

Comments

osilva commented Sep 10, 2021

We are running v2.1.10 on a v1.19.12 cluster, with tests/monitors being created in Statuscake. We find that duplicate monitors are being created, and when I check the IMC pod logs I find:

{ "level": "error", "ts": 1631202713.767603, "logger": "statuscake-monitor", "msg": "Unable to retrieve monitor", "error": "Get \"https://app.statuscake.com/API/Tests/\": read tcp 10.51.4.130:50776->104.20.73.215:443: read: connection reset by peer", "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error \t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132 sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error ...

So it makes sense that if communication with Statuscake fails, IMC cannot see the existing monitor and creates it again, which is what happens.

I am not sure whether the problem is with IMC or with Statuscake, and am trying to get more information. So far I have not found errors on the underlying node, in the cluster, or with the network, but at the same time I don't know exactly what is happening inside the IMC container.

I'm new-ish to this, so it's VERY LIKELY I am missing something obvious, but I am unable to exec into the container itself:

OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory"

I see others have reported duplicate test creation as well but did not find a resolution.

The full error:
{"level":"error","ts":1631257648.135457,"logger":"statuscake-monitor","msg":"Unable to retrieve monitor","error":"Get \"https://app.statuscake.com/API/Tests/\": read tcp 10.51.4.130:45626->104.20.73.215:443: read: connection reset by peer","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:144\ngithub.com/stakater/IngressMonitorController/pkg/monitors/statuscake.(*StatusCakeMonitorService).GetAll\n\t/workspace/pkg/monitors/statuscake/statuscake-monitor.go:231\ngithub.com/stakater/IngressMonitorController/pkg/monitors/statuscake.(*StatusCakeMonitorService).GetByName\n\t/workspace/pkg/monitors/statuscake/statuscake-monitor.go:203\ngithub.com/stakater/IngressMonitorController/pkg/monitors.(*MonitorServiceProxy).GetByName\n\t/workspace/pkg/monitors/monitor-proxy.go:84\ngithub.com/stakater/IngressMonitorController/pkg/controllers.findMonitorByName\n\t/workspace/pkg/controllers/endpointmonitor_util.go:10\ngithub.com/stakater/IngressMonitorController/pkg/controllers.(*EndpointMonitorReconciler).Reconcile\n\t/workspace/pkg/controllers/endpointmonitor_controller.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1631257648.136304,"logger":"controllers.EndpointMonitor","msg":"Creating Monitor: test-service.mlsdevcloudfake.com-k8s-cluster-platform","endpointMonitor":"k8s-cluster-platform"}
{"level":"info","ts":1631257648.589671,"logger":"statuscake-monitor","msg":"Monitor Added: 6114024"}

osilva (Author) commented Sep 10, 2021

I realize we are running IMC on multiple clusters using the same credentials, so it's also possible we are hitting API rate limits on the Statuscake side. If so, it would be good to have a randomized back-off, or to retry when verifying whether a monitor exists.

Dadavan commented Sep 12, 2021

We are also experiencing this. We run IMC on multiple clusters, but only 2 of them seem to have this issue.

osilva (Author) commented Sep 15, 2021

We've been told by Statuscake support that it's likely an issue with API rate limiting. Would it be possible for IMC to detect the problem and retry, or to apply a back-off?

@rasheedamir rasheedamir added the kind/bug Something isn't working label Sep 17, 2021
Dadavan commented Oct 4, 2021

I think the problem is with the GetAll() function. Maybe it should also return an error; that way, if an error is returned, the reconciliation loop could continue without creating a new monitor. A new monitor should be created only when the lookup succeeds and the monitor is nil, not when GetAll() itself fails; in that case it should just log the error and retry on the next iteration. The catch is that this changes the MonitorService interface, so all of the service types would need to be updated, not only StatusCake.
What do you think?
EDIT: This is the exact same issue as #293

karl-johan-grahn (Contributor) commented

Closing in favor of #293
