
My fluentd pods keep restarting #72

Open
rileyhun opened this issue Apr 14, 2021 · 5 comments
Labels
bug Something isn't working

Comments


rileyhun commented Apr 14, 2021

I've noticed that my fluentd pods keep restarting. They are still collecting logs and sending them to Elasticsearch, so the workflow isn't broken per se, but in the last 13 hours the fluentd pods have restarted 61 times.

Describe the bug
The logs indicate the following:

    [warn]: unexpected error while calling stop on input plugin plugin=Fluent::Plugin::MonitorAgentInput plugin_id="monitor_agent" error_class=ThreadError error="killed thread"
    [warn]: [elasticsearch] failed to flush the buffer. retry_time=1 next_retry_seconds=XXX chunk="XXXXXX" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster
    [warn]: /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:291:in `shutdown'
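
For context, the restart count and the reason for the last container termination can be pulled with plain kubectl; a minimal sketch, assuming the logging namespace used in the helm command further down and a placeholder pod name:

    # list fluentd pods with their restart counts
    kubectl get pods -n logging | grep fluentd

    # show the last termination state (OOMKilled, Error, ...) of a restarting pod
    kubectl describe pod <fluentd-pod-name> -n logging | grep -A 5 "Last State"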

Version of Helm and Kubernetes:

Helm Version:

version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}

Kubernetes Version: 1.18.16

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-gke.502", GitCommit:"a2a88ab32201dca596d0cdb116bbba3f765ebd36", GitTreeState:"clean", BuildDate:"2021-03-08T22:06:24Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

Which version of the chart: latest

How to reproduce it (as minimally and precisely as possible):

extraConfigMaps:
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/containers.log.pos
      tag raw.kubernetes.*
      read_from_head true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>

    # Detect exceptions in the log output and forward them as one log entry.
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>

    # Concatenate multi-line logs.
    <filter **>
      @id filter_concat
      @type concat
      key message
      multiline_end_regexp /\n$/
      separator ""
    </filter>

    # Enrich records with Kubernetes metadata.
    <filter kubernetes.**>
      @id filter_kubernetes_metadata
      @type kubernetes_metadata
    </filter>

    # Fix JSON fields for Elasticsearch.
    <filter kubernetes.**>
      @id filter_parser
      @type parser
      key_name log
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>

    # Exclude kube-system logs.
    <match kubernetes.var.log.containers.**kube-system**.log>
      @type null
    </match>

    # Keep only records with the label fluentd=true.
    <filter kubernetes.**>
      @type grep
      <regexp>
        key $.kubernetes.labels.fluentd
        pattern true
      </regexp>
    </filter>

    # Drop logs from istio-proxy sidecar containers.
    <filter kubernetes.**>
      @type grep
      <exclude>
        key $.kubernetes.container_name
        pattern istio-proxy
      </exclude>
    </filter>

helm upgrade --install fluentd fluentd-elasticsearch-11.9.0.tgz --namespace=logging --values=../logging/fluentd-values.yaml
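
Since the chart version above is reported only as "latest", the deployed chart and the values actually in effect can be double-checked; a small sketch, using the fluentd release name and logging namespace from the command above:

    # show the chart and app version of the deployed release
    helm list -n logging

    # show the user-supplied values currently applied to the release
    helm get values fluentd -n logging
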
rileyhun added the bug label on Apr 14, 2021
monotek (Member) commented Apr 16, 2021

It seems Elasticsearch is not reachable.

rileyhun (Author) commented

It is reachable in the sense that my logs are being sent.

Ghazgkull (Contributor) commented

@rileyhun If you're able to view the Kubernetes events in your cluster around the time your fluentd pod restarts, look for an event whose event.involvedObject.name is the name of your fluentd pod. If you're on the latest version of this helm chart, I recently PRed a change so that the liveness probe now writes an error message when it fails; that message shows up in the event's event.message field. You should see something like "Liveness probe failed: Elasticsearch buffers found stuck longer than 300 seconds."
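
For reference, that event lookup can be done with a field selector; a minimal sketch, assuming the logging namespace from the install command above, with the pod name left as a placeholder:

    # events attached to a specific fluentd pod
    kubectl get events -n logging --field-selector involvedObject.name=<fluentd-pod-name>

    # or watch for probe failures across the namespace
    kubectl get events -n logging --field-selector reason=Unhealthy -w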

Ghazgkull (Contributor) commented

Also, that same PR fixed an issue where the liveness probe would fail during periods of near-zero log shipping, so the latest version of the chart addresses that as well.
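
To check which liveness probe a running pod actually carries (and therefore whether the updated chart is in use), the probe spec can be dumped; a sketch, assuming fluentd is the first container in the pod and the pod name is a placeholder:

    kubectl get pod <fluentd-pod-name> -n logging -o jsonpath='{.spec.containers[0].livenessProbe}'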


s7an-it commented Feb 26, 2022

I am using the default ECK operator with default settings here, apart from setting the connection strings, and no matter how I fine-tune those I can't avoid occasional pod restarts. It looks like the buffer settings influence it. I am using an intra-cluster-only setup but still see the restarts. Could you please advise whether you have tested this with ECK, @Ghazgkull @monotek, and whether you use any fine-tuned settings for it? I am on LTS ECK / EKS 1.20.X.
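
One way to see whether buffers are actually backing up before the probe kills the pod is fluentd's monitor_agent endpoint (the plugin_id="monitor_agent" in the log output above suggests it is enabled); a sketch, assuming curl is available in the fluentd image and monitor_agent listens on its default port 24220:

    kubectl exec -n logging <fluentd-pod-name> -- curl -s http://localhost:24220/api/plugins.json

The buffer_queue_length, buffer_total_queued_size and retry_count fields reported for the elasticsearch output indicate whether chunks are getting stuck.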
