Failed to execute method NodeOps.repair #1370

Open
JBOClara opened this issue Jul 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@JBOClara
Contributor

JBOClara commented Jul 16, 2024

What happened?

Cassandra container shows the following error in the logs:

com.datastax.oss.driver.api.core.servererrors.ServerError: Failed to execute method NodeOps.repair

Did you expect to see something different?

Calls to /api/v2/repairs should return 200 OK instead of 500 Internal Server Error.

How to reproduce it (as minimally and precisely as possible):

The error is visible in the Cassandra logs.

Environment

  • K8ssandra Operator version:

This error is visible with:

helm ls -A -a -d | grep k8ss
k8ssandra-operator       	k8ssandra-operator	1       	2024-05-22 17:13:25.033002 +0200 CEST   	deployed	k8ssandra-operator-1.16.0                 	1.16.0

and

k8ssandra-operator    	k8ssandra-operator	29      	2024-07-13 17:55:28.039314 +0200 CEST  	deployed	k8ssandra-operator-1.17.0          	1.17.0

k describe po -n k8ssandra-operator | grep "Image:" | sort -u
    Image:          cr.k8ssandra.io/k8ssandra/cass-management-api:4.1.4
    Image:          cr.k8ssandra.io/k8ssandra/system-logger:v1.21.0
    Image:          docker.io/k8ssandra/medusa:0.19.1
    Image:          docker.io/k8ssandra/medusa:0.21.0
    Image:          docker.io/thelastpickle/cassandra-reaper:3.5.0
    Image:          timberio/vector:0.26.0-alpine
    Image:         bitnami/kubectl:1.29.3
    Image:         busybox:1.28
    Image:         cr.k8ssandra.io/k8ssandra/cass-management-api:4.1.4
    Image:         cr.k8ssandra.io/k8ssandra/cass-operator:v1.21.0
    Image:         cr.k8ssandra.io/k8ssandra/k8ssandra-client:v0.4.0
    Image:         cr.k8ssandra.io/k8ssandra/k8ssandra-operator:v1.17.0
    Image:         docker.io/thelastpickle/cassandra-reaper:3.5.0
Image hash
k describe po -n k8ssandra-operator | grep "Image ID:" | sort -u
    Image ID:       cr.k8ssandra.io/k8ssandra/cass-management-api@sha256:e606bae0bd49e794dffdb508bd461e6734e8bba415ac30f2f58742f647fab38c
    Image ID:       cr.k8ssandra.io/k8ssandra/system-logger@sha256:a25251eb74ca08dc87d5ceb3d22bfcb7ac93c1ec7b673c3ce2f8c7bc32769c1f
    Image ID:       docker.io/k8ssandra/medusa@sha256:1a8e63b9dd49744cf13678584f9558c6452ed1b160de17c149174d6035e053d7
    Image ID:       docker.io/k8ssandra/medusa@sha256:4f2991f88c92441bd6ed5034c4a0cdab94b52e37590183753b2b5786eb25abd9
    Image ID:       docker.io/thelastpickle/cassandra-reaper@sha256:9e84f87108994d63bc76cec25b2cdd2e1f02072585f825fd2ca493b09371fc38
    Image ID:       docker.io/timberio/vector@sha256:13779856a8afe8240a1549208040dec12a50cd9b9d98b577d9327d2c212499d8
    Image ID:      cr.k8ssandra.io/k8ssandra/cass-management-api@sha256:e606bae0bd49e794dffdb508bd461e6734e8bba415ac30f2f58742f647fab38c
    Image ID:      cr.k8ssandra.io/k8ssandra/cass-operator@sha256:d851410079654d6f0acd55d220f647f042d7691dd28a6b3866efcc120c34aeae
    Image ID:      cr.k8ssandra.io/k8ssandra/k8ssandra-client@sha256:4cd4f97e74ea4ce256cb55aa166039471b977c5c4f75e92971d012579146b050
    Image ID:      cr.k8ssandra.io/k8ssandra/k8ssandra-operator@sha256:00cd1e0bab61aba16df7edcfbcdab5aa5c9d6c29d3656d1e467aca312090890d
    Image ID:      docker.io/bitnami/kubectl@sha256:f5fc0d561d9ef931f9ecb2e8b65d93eb92767c57f64897c56a100bfe28102c74
    Image ID:      docker.io/library/busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
    Image ID:      docker.io/thelastpickle/cassandra-reaper@sha256:9e84f87108994d63bc76cec25b2cdd2e1f02072585f825fd2ca493b09371fc38
  • Kubernetes version information:

kubectl version
    Client Version: v1.30.2
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.30.2-eks-db838b0


And:

kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b


* Kubernetes cluster kind:

EKS

* Manifests:

<details>
  <summary>Manifests</summary>

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  annotations:
    eks.amazonaws.com/skip-containers: cassandra,server-system-logger,server-config-init
  finalizers:
    - finalizer.cassandra.datastax.com
  generation: 1
  labels:
    app.kubernetes.io/component: cassandra
    app.kubernetes.io/name: k8ssandra-operator
    app.kubernetes.io/part-of: k8ssandra
    k8ssandra.io/cleaned-up-by: k8ssandracluster-controller
    k8ssandra.io/cluster-name: cassandra
    k8ssandra.io/cluster-namespace: k8ssandra-operator
  name: us-east
  namespace: k8ssandra-operator
spec:
  additionalServiceConfig:
    additionalSeedService: {}
    allpodsService: {}
    dcService: {}
    nodePortService: {}
    seedService: {}
  clusterName: cassandra
  config:
    cassandra-env-sh:
      additional-jvm-opts:
        - -Dcassandra.allow_alter_rf_during_range_movement=true
        - -Dcassandra.system_distributed_replication=us-east:3
        - -Dcassandra.jmx.authorizer=org.apache.cassandra.auth.jmx.AuthorizationProxy
        - -Djava.security.auth.login.config=$CASSANDRA_HOME/conf/cassandra-jaas.config
        - -Dcassandra.jmx.remote.login.config=CassandraLogin
        - -Dcom.sun.management.jmxremote.authenticate=true
        - -Djavax.net.ssl.trustStore=/mnt/client-truststore/truststore
        - -Djavax.net.ssl.keyStore=/mnt/client-keystore/keystore
        - -Djavax.net.debug=ssl
        - -Dcom.sun.management.jmxremote.registry.ssl=true
        - -Dcassandra.consistent.rangemovement=false
        - -Dcom.sun.management.jmxremote.ssl.need.client.auth=true
        - -Dcom.sun.management.jmxremote.registry.ssl=true
        - -Dcom.sun.management.jmxremote.ssl=true
        - -Dcassandra.allow_new_old_config_keys=true
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      auto_bootstrap: true
      auto_snapshot: true
      batch_size_fail_threshold: 1500KiB
      batch_size_warn_threshold: 10KiB
      client_encryption_options:
        enabled: true
        keystore: /mnt/client-keystore/keystore
        keystore_password: READACTED
        optional: false
        require_client_auth: false
        truststore: /mnt/client-truststore/truststore
        truststore_password: READACTED
      concurrent_counter_writes: 64
      concurrent_materialized_view_writes: 64
      concurrent_reads: 64
      concurrent_writes: 64
      counter_cache_size: 50MiB
      materialized_views_enabled: true
      native_transport_port: 9042
      num_tokens: 256
      range_request_timeout: 10000ms
      read_request_timeout: 15000ms
      request_timeout: 20000ms
      role_manager: CassandraRoleManager
      server_encryption_options:
        internode_encryption: all
        keystore: /mnt/server-keystore/keystore
        keystore_password: READACTED
        require_client_auth: false
        truststore: /mnt/server-truststore/truststore
        truststore_password: READACTED
      write_request_timeout: 2000ms
    jvm-server-options:
      initial_heap_size: 4294967296
      jmx-connection-type: local-no-auth
      jmx-port: 7199
      jmx-remote-ssl: true
      max_heap_size: 4294967296
    jvm11-server-options:
      garbage_collector: G1GC
  configBuilderResources: {}
  managementApiAuth: {}
  networking: {}
  podTemplateSpec:
    metadata: {}
    spec:
      containers:
        - env:
            - name: LOCAL_JMX
              value: "no"
            - name: MANAGEMENT_API_HEAP_SIZE
              value: "128000000"
            - name: MGMT_API_DISABLE_MCAC
              value: "true"
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/v0/probes/liveness
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 230
            periodSeconds: 15
            successThreshold: 1
            timeoutSeconds: 10
          name: cassandra
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/v0/probes/readiness
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 270
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 10
          resources: {}
          volumeMounts:
            - mountPath: /crypto
              name: certs
            - mountPath: /home/cassandra/.cassandra/cqlshrc
              name: cqlsh-config
              subPath: cqlshrc
            - mountPath: /home/cassandra/.cassandra/nodetool-ssl.properties
              name: nodetool-config
              subPath: nodetool-ssl.properties
            - mountPath: /mnt/client-keystore
              name: client-keystore
            - mountPath: /mnt/client-truststore
              name: client-truststore
            - mountPath: /mnt/server-keystore
              name: server-keystore
            - mountPath: /mnt/server-truststore
              name: server-truststore
        - name: server-system-logger
          resources: {}
        - env:
            - name: MEDUSA_MODE
              value: GRPC
            - name: MEDUSA_TMP_DIR
              value: /var/lib/cassandra
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: CQL_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: cassandra-medusa
            - name: CQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: cassandra-medusa
          image: docker.io/k8ssandra/medusa:0.21.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            exec:
              command:
                - /bin/grpc_health_probe
                - --addr=:50051
            failureThreshold: 10
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: medusa
          ports:
            - containerPort: 50051
              name: grpc
              protocol: TCP
          readinessProbe:
            exec:
              command:
                - /bin/grpc_health_probe
                - --addr=:50051
            failureThreshold: 10
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 10m
              memory: 116Mi
          volumeMounts:
            - mountPath: /etc/cassandra
              name: server-config
            - mountPath: /var/lib/cassandra
              name: server-data
            - mountPath: /etc/medusa
              name: cassandra-medusa
            - mountPath: /etc/podinfo
              name: podinfo
            - mountPath: /etc/certificates
              name: certificates
      initContainers:
        - command:
            - sysctl
            - -w
            - vm.max_map_count=1048575
          image: busybox:1.28
          name: sysctl
          resources: {}
          securityContext:
            privileged: true
        - name: server-config-init
          resources: {}
        - env:
            - name: MEDUSA_MODE
              value: RESTORE
            - name: MEDUSA_TMP_DIR
              value: /var/lib/cassandra
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: CQL_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: cassandra-medusa
            - name: CQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: cassandra-medusa
          image: docker.io/k8ssandra/medusa:0.21.0
          imagePullPolicy: IfNotPresent
          name: medusa-restore
          resources:
            limits:
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/cassandra
              name: server-config
            - mountPath: /var/lib/cassandra
              name: server-data
            - mountPath: /etc/medusa
              name: cassandra-medusa
            - mountPath: /etc/podinfo
              name: podinfo
            - mountPath: /etc/certificates
              name: certificates
      volumes:
        - name: certs
          secret:
            secretName: cassandra-jks-keystore
        - configMap:
            name: cqlsh-config
          name: cqlsh-config
        - configMap:
            name: nodetool-config
          name: nodetool-config
        - name: client-keystore
          secret:
            items:
              - key: keystore.jks
                path: keystore
            secretName: cassandra-jks-keystore
        - name: client-truststore
          secret:
            items:
              - key: truststore.jks
                path: truststore
            secretName: cassandra-jks-keystore
        - name: server-keystore
          secret:
            items:
              - key: keystore.jks
                path: keystore
            secretName: cassandra-jks-keystore
        - name: server-truststore
          secret:
            items:
              - key: truststore.jks
                path: truststore
            secretName: cassandra-jks-keystore
        - configMap:
            name: cassandra-medusa
          name: cassandra-medusa
        - downwardAPI:
            items:
              - fieldRef:
                  fieldPath: metadata.labels
                path: labels
          name: podinfo
        - name: certificates
          secret:
            secretName: medusa-certificates
  racks:
    - name: 1a
      nodeAffinityLabels:
        topology.kubernetes.io/zone: us-east-1a
    - name: 1d
      nodeAffinityLabels:
        topology.kubernetes.io/zone: us-east-1b
    - name: 1c
      nodeAffinityLabels:
        topology.kubernetes.io/zone: us-east-1c
  resources:
    limits:
      memory: 9Gi
    requests:
      cpu: "1"
      memory: 9Gi
  serverType: cassandra
  serverVersion: 4.1.4
  size: 3
  storageConfig:
    additionalVolumes:
      - mountPath: /etc/vector
        name: vector-config
        volumeSource:
          configMap:
            name: cassandra-us-east-cass-vector
      - mountPath: /opt/management-api/configs
        name: metrics-agent-config
        volumeSource:
          configMap:
            items:
              - key: metrics-collector.yaml
                path: metrics-collector.yaml
            name: cassandra-us-east-metrics-agent-config
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 300Gi
      storageClassName: ebs-xfs-sc
  superuserSecretName: cassandra-superuser
  systemLoggerResources:
    limits:
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
  users:
    - secretName: cassandra-reaper
      superuser: true
    - secretName: cassandra-medusa
      superuser: true

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  annotations:
    config.kubernetes.io/origin: |
      path: ../../base/k8ssandra-encrypted.yaml
    k8ssandra.io/initial-system-replication: '{"us-east":3}'
  finalizers:
    - k8ssandracluster.k8ssandra.io/finalizer
  generation: 5
  name: cassandra
  namespace: k8ssandra-operator
spec:
  auth: true
  cassandra:
    clientEncryptionStores:
      keystorePasswordSecretRef:
        name: jks-password
      keystoreSecretRef:
        key: keystore.jks
        name: cassandra-jks-keystore
      truststorePasswordSecretRef:
        name: jks-password
      truststoreSecretRef:
        key: truststore.jks
        name: cassandra-jks-keystore
    config:
      cassandraYaml:
        authenticator: PasswordAuthenticator
        authorizer: CassandraAuthorizer
        auto_bootstrap: true
        auto_snapshot: true
        batch_size_fail_threshold: 1500KiB
        batch_size_warn_threshold: 10KiB
        client_encryption_options:
          enabled: true
          optional: false
          require_client_auth: false
        concurrent_counter_writes: 64
        concurrent_materialized_view_writes: 64
        concurrent_reads: 64
        concurrent_writes: 64
        counter_cache_size: 50MiB
        materialized_views_enabled: true
        native_transport_port: 9042
        num_tokens: 256
        range_request_timeout: 10000ms
        read_request_timeout: 15000ms
        request_timeout: 20000ms
        server_encryption_options:
          internode_encryption: all
          require_client_auth: false
        write_request_timeout: 2000ms
      jvmOptions:
        additionalOptions:
          - -Djavax.net.debug=ssl
          - -Dcom.sun.management.jmxremote.registry.ssl=true
          - -Dcassandra.consistent.rangemovement=false
          - -Dcom.sun.management.jmxremote.ssl.need.client.auth=true
          - -Dcom.sun.management.jmxremote.registry.ssl=true
          - -Dcom.sun.management.jmxremote.ssl=true
          - -Dcassandra.allow_new_old_config_keys=true
        gc: G1GC
        heap_initial_size: 4Gi
        heap_max_size: 4Gi
        jmx_connection_type: local-no-auth
        jmx_port: 7199
        jmx_remote_ssl: true
    containers:
      - livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/liveness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 230
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 10
        name: cassandra
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/readiness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 270
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        volumeMounts:
          - mountPath: /crypto
            name: certs
          - mountPath: /home/cassandra/.cassandra/cqlshrc
            name: cqlsh-config
            subPath: cqlshrc
          - mountPath: /home/cassandra/.cassandra/nodetool-ssl.properties
            name: nodetool-config
            subPath: nodetool-ssl.properties
    datacenters:
      - initContainers:
          - command:
              - sysctl
              - -w
              - vm.max_map_count=1048575
            image: busybox:1.28
            name: sysctl
            securityContext:
              privileged: true
        metadata:
          name: us-east
        perNodeConfigInitContainerImage: mikefarah/yq:4
        racks:
          - name: 1a
            nodeAffinityLabels:
              topology.kubernetes.io/zone: us-east-1a
          - name: 1d
            nodeAffinityLabels:
              topology.kubernetes.io/zone: us-east-1b
          - name: 1c
            nodeAffinityLabels:
              topology.kubernetes.io/zone: us-east-1c
        resources:
          limits:
            memory: 9Gi
          requests:
            cpu: 1
            memory: 9Gi
        size: 3
        stopped: false
    extraVolumes:
      volumes:
        - name: certs
          secret:
            secretName: cassandra-jks-keystore
        - configMap:
            name: cqlsh-config
          name: cqlsh-config
        - configMap:
            name: nodetool-config
          name: nodetool-config
    metadata:
      annotations:
        eks.amazonaws.com/skip-containers: cassandra,server-system-logger,server-config-init
    mgmtAPIHeap: 128M
    networking:
      hostNetwork: false
    perNodeConfigInitContainerImage: mikefarah/yq:4
    serverEncryptionStores:
      keystorePasswordSecretRef:
        name: jks-password
      keystoreSecretRef:
        key: keystore.jks
        name: cassandra-jks-keystore
      truststorePasswordSecretRef:
        name: jks-password
      truststoreSecretRef:
        key: truststore.jks
        name: cassandra-jks-keystore
    serverType: cassandra
    serverVersion: 4.1.4
    softPodAntiAffinity: false
    storageConfig:
      cassandraDataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 300Gi
        storageClassName: ebs-xfs-sc
    telemetry:
      mcac:
        enabled: false
      prometheus:
        enabled: true
      vector:
        components:
          sinks:
            - config: |
                target = "stdout"
                [sinks.console_output.encoding]
                codec = "json"
              inputs:
                - cassandra_metrics
              name: console_output
              type: console
        enabled: true
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
        scrapeInterval: 30s
  medusa:
    certificatesSecretRef:
      name: medusa-certificates
    containerImage:
      name: medusa
      registry: docker.io
      repository: k8ssandra
      tag: 0.21.0
    containerResources:
      limits:
        memory: 512Mi
      requests:
        cpu: 10m
        memory: 116Mi
    storageProperties:
      bucketName: dow-backups
      concurrentTransfers: 10
      credentialsType: role-based
      maxBackupAge: 0
      maxBackupCount: 0
      multiPartUploadThreshold: 104857600
      prefix: cassandra-tests
      region: us-east-1
      secure: true
      storageProvider: s3
      storageSecretRef:
        name: ""
      transferMaxBandwidth: 90MB/s
  reaper:
    ServiceAccountName: default
    autoScheduling:
      enabled: true
      initialDelayPeriod: PT15S
      percentUnrepairedThreshold: 10
      periodBetweenPolls: PT10M
      repairType: AUTO
      scheduleSpreadPeriod: PT6H
      timeBeforeFirstSchedule: PT5M
    containerImage:
      name: cassandra-reaper
      repository: thelastpickle
      tag: 3.6.0
    deploymentMode: SINGLE
    heapSize: 2Gi
    httpManagement:
      enabled: true
    keyspace: reaper_db
    secretsProvider: internal
    telemetry:
      cassandra:
        endpoint:
          address: 0.0.0.0
      mcac:
        enabled: false
      prometheus:
        enabled: true
      vector:
        enabled: true
        resources:
          limits:
            cpu: 100m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
  secretsProvider: internal

</details>

* K8ssandra Operator Logs:

INFO [nioEventLoopGroup-2-2] 2024-07-16 12:31:35,347 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v2/repairs status=500 Internal Server Error
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:31:38,541 Cli.java:663 - address=/10.210.20.219:56784 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:31:43,538 Cli.java:663 - address=/10.210.20.219:51656 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:31:48,540 Cli.java:663 - address=/10.210.20.219:51666 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:31:58,539 Cli.java:663 - address=/10.210.20.219:48066 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:31:58,540 Cli.java:663 - address=/10.210.20.219:48068 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:02,818 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:02,820 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:02,909 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:05,371 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:05,373 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:05,466 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:08,541 Cli.java:663 - address=/10.210.20.219:55514 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:32:13,538 Cli.java:663 - address=/10.210.20.219:58392 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:18,540 Cli.java:663 - address=/10.210.20.219:58402 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:32:28,539 Cli.java:663 - address=/10.210.20.219:52776 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:28,540 Cli.java:663 - address=/10.210.20.219:52790 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:32:38,541 Cli.java:663 - address=/10.210.20.219:39932 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:40,870 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:40,873 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:40,989 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:41,561 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:41,564 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:41,657 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:43,539 Cli.java:663 - address=/10.210.20.219:44100 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:32:48,540 Cli.java:663 - address=/10.210.20.219:44112 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:32:58,538 Cli.java:663 - address=/10.210.20.219:36508 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:32:58,541 Cli.java:663 - address=/10.210.20.219:36520 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:08,541 Cli.java:663 - address=/10.210.20.219:52446 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-1] 2024-07-16 12:33:13,538 Cli.java:663 - address=/10.210.20.219:52002 url=/api/v0/probes/liveness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,148 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,150 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,152 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,161 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,162 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:17,254 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,537 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,539 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,540 Cli.java:663 - address=/10.210.20.219:52018 url=/api/v0/probes/readiness status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:18,643 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v1/ops/tables/compactions status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:26,184 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
INFO [nioEventLoopGroup-2-2] 2024-07-16 12:33:26,186 Cli.java:663 - address=/10.210.18.172:49500 url=/api/v0/metadata/endpoints status=200 OK
com.datastax.oss.driver.api.core.servererrors.ServerError: Failed to execute method NodeOps.repair
at com.datastax.oss.driver.api.core.servererrors.ServerError.copy(ServerError.java:54)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
at com.datastax.mgmtapi.CqlService.executePreparedStatement(CqlService.java:57)
at com.datastax.mgmtapi.resources.v2.RepairResourcesV2.lambda$repair$0(RepairResourcesV2.java:80)
at com.datastax.mgmtapi.resources.common.BaseResources.handle(BaseResources.java:67)
at com.datastax.mgmtapi.resources.v2.RepairResourcesV2.repair(RepairResourcesV2.java:71)
at jdk.internal.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:170)
at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:130)
at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:643)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:507)
at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$2(ResourceMethodInvoker.java:457)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:459)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:419)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:393)
at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:68)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:492)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
at org.jboss.resteasy.plugins.server.netty.RequestDispatcher.service(RequestDispatcher.java:86)
at org.jboss.resteasy.plugins.server.netty.RequestHandler.channelRead0(RequestHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61)
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)


**Anything else we need to know?**:

No



┆Issue is synchronized with this [Jira Story](https://datastax.jira.com/browse/K8OP-9) by [Unito](https://www.unito.io)
┆Issue Number: K8OP-9
JBOClara added the bug label on Jul 16, 2024
@iAlex97

iAlex97 commented Jul 31, 2024

We have also encountered this issue when enabling autoScheduling for Reaper. After further checking the mgmt-api logs, I think this is caused by Reaper using an invalid combination of default parameters (which only happens for Cassandra 4.x) when setting up automatic schedules. The error that led me to this conclusion:

INFO  [epollEventLoopGroup-5-3] 2024-07-31 08:44:16,274 RpcMethod41x.java:138 - Failed to execute method NodeOps.repair
java.lang.reflect.InvocationTargetException: null
	at jdk.internal.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at com.datastax.mgmtapi.rpc.RpcMethod41x.execute(RpcMethod41x.java:130)
	at com.datastax.mgmtapi.rpc.RpcMethod41x.execute(RpcMethod41x.java:33)
	at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.lambda$handle$1(QueryHandlerInterceptor.java:120)
	at com.datastax.mgmtapi.shims.CassandraAPI.handleRpcResult(CassandraAPI.java:73)
	at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.handle(QueryHandlerInterceptor.java:120)
	at com.datastax.mgmtapi.interceptors.QueryHandlerInterceptor.intercept(QueryHandlerInterceptor.java:80)
	at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java)
	at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:116)
	at org.apache.cassandra.transport.Message$Request.execute(Message.java:255)
	<redacted>
Caused by: java.io.IOException: Invalid repair combination. Incremental repair if Parallelism is not set
	at com.datastax.mgmtapi.NodeOpsProvider.repair(NodeOpsProvider.java:824)
	... 43 common frames omitted

The K8ssandraCluster CRD has autoScheduling.repairType set to AUTO, which for Cassandra 4.x behaves as INCREMENTAL and sets up the schedules accordingly.
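
For reference, a quick way to confirm what is configured on the cluster (a sketch assuming the cluster name cassandra and namespace k8ssandra-operator from the manifests above):

# Print the Reaper auto-scheduling settings from the K8ssandraCluster spec
kubectl get k8ssandracluster cassandra -n k8ssandra-operator \
  -o jsonpath='{.spec.reaper.autoScheduling}'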

From the Reaper docs, we understand that for an incremental repair the only allowed value for repairParallelism is PARALLEL:

Sets the default repair type unless specifically defined for each run. Note that this is only supported with the PARALLEL repairParallelism setting. For more details in incremental repair, please refer to the following article: http://www.datastax.com/dev/blog/more-efficient-repairs

This is checked by the management-api (in NodeOpsProvider.repair, per the stack trace above), which indeed throws the error that I'm seeing.

Exec-ing into a Reaper pod to check its configuration, we see that /etc/cassandra-reaper/config/cassandra-reaper.yml sets repairParallelism to the value of an environment variable called REAPER_REPAIR_PARALELLISM. The value of that variable is:

REAPER_REPAIR_PARALELLISM=DATACENTER_AWARE
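
A minimal sketch of that check (the deployment name cassandra-us-east-reaper is an assumption based on the cluster and datacenter names above; adjust to your actual Reaper deployment):

# Print the parallelism env variable and the rendered setting inside the Reaper pod
kubectl exec -n k8ssandra-operator deploy/cassandra-us-east-reaper -- \
  sh -c 'echo "$REAPER_REPAIR_PARALELLISM"; grep repairParallelism /etc/cassandra-reaper/config/cassandra-reaper.yml'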

We can further confirm this by looking at the Reaper tables inside Cassandra:

prod-superuser@cqlsh> use reaper_db;
prod-superuser@cqlsh:reaper_db> select * from repair_schedule_v1;

 id                                   | adaptive | creation_time                   | days_between | intensity | last_run                             | next_activation                 | owner           | pause_time                      | percent_unrepaired_threshold | repair_parallelism | repair_unit_id                       | run_history | segment_count | segment_count_per_node | state
--------------------------------------+----------+---------------------------------+--------------+-----------+--------------------------------------+---------------------------------+-----------------+---------------------------------+------------------------------+--------------------+--------------------------------------+-------------+---------------+------------------------+--------
 b3cb2180-4e7c-11ef-9f1c-4d0488525d6c |    False | 2024-07-30 14:04:57.112000+0000 |            7 |       0.9 | 2db1ccf0-4e97-11ef-92a3-c328b392dd6d | 2024-08-06 17:08:38.033000+0000 | auto-scheduling | 2024-07-30 14:10:43.136000+0000 |                           10 |        dc_parallel | b3c9e900-4e7c-11ef-9f1c-4d0488525d6c |        null |          null |                     64 | ACTIVE
 b3d533a0-4e7c-11ef-9f1c-4d0488525d6c |    False | 2024-07-30 14:04:57.178000+0000 |            7 |       0.9 |                                 null | 2024-07-30 20:09:57.150000+0000 | auto-scheduling |                            null |                           10 |        dc_parallel | b3d337d0-4e7c-11ef-9f1c-4d0488525d6c |        null |          null |                     64 | ACTIVE

which confirms that the default parallelism was set to dc_parallel, i.e. DATACENTER_AWARE.

My confusion comes from where this variable is set. From my limited research, it is not specified in the Reaper deployment, it is not set in the Dockerfile, and it cannot be configured from the CRD.

For possible workarounds, I see the following (sketches after this list):

  • set autoScheduling.repairType in the CRD to REGULAR, because ADAPTIVE is only recommended for Cassandra 3.x
  • manually edit the entries in the schedules table and set repair_parallelism to parallel
  • manually edit the Reaper deployment like so:
    1. Initial deployment of Reaper with the "wrong" config
    2. Scale the deployment down to 0
    3. Edit the deployment and set the env variable REAPER_REPAIR_PARALELLISM=PARALLEL
    4. Delete the reaper_db keyspace
    5. Scale the Reaper deployment back up, which should re-run the migrations and populate the schedules table with the proper parallelism value
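
Hedged sketches of the first two workarounds (cluster, namespace and keyspace names come from the manifests above; <schedule-id> is a placeholder for the ids shown in repair_schedule_v1):

# Workaround 1: switch auto-scheduling to REGULAR repairs on the K8ssandraCluster
kubectl patch k8ssandracluster cassandra -n k8ssandra-operator --type merge \
  -p '{"spec":{"reaper":{"autoScheduling":{"repairType":"REGULAR"}}}}'

# Workaround 2: set the parallelism on the existing schedules directly,
# from any cqlsh session with access to the cluster
cqlsh -e "UPDATE reaper_db.repair_schedule_v1 SET repair_parallelism = 'parallel' WHERE id = <schedule-id>;"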

@adejanovski what do you think?

@JBOClara
Contributor Author

Hello @adejanovski

Can you tell us if there is enough information?
