Some data were not uploaded to the S3 when the num of partition > 7 #481

Open
freshtang opened this issue Jan 19, 2024 · 6 comments

@freshtang

freshtang commented Jan 19, 2024

What happened?

image
Data for several partitions was not uploaded to S3 when the topic had more than 7 partitions.

What did you expect to happen?

All partition data should be uploaded to S3 successfully.

What else do we need to know?

A diff of two partitions follows:
image
image
No log lines like "Copied 00000000000000000000.log...." were found for the tieredTopic01173-13 partition.
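
One way to run the same check across every partition is to scan the broker log for partitions that never appear on a "Copied" line. The following is a minimal Python sketch, not part of the plugin. It assumes that each copy message mentions the topic-partition name on the same line as the word "Copied"; the function name and the `marker` parameter are hypothetical.

```python
import re
from collections import defaultdict


def partitions_without_copies(log_lines, partitions, marker="Copied "):
    """Return the partitions that never appear on a line containing `marker`.

    Assumption: a copy log line names the topic-partition it belongs to.
    """
    copies = defaultdict(int)
    for line in log_lines:
        if marker not in line:
            continue
        for tp in partitions:
            # Match the whole name, so "topic-1" does not also match "topic-13".
            if re.search(rf"(?<![\w.-]){re.escape(tp)}(?!\d)", line):
                copies[tp] += 1
    return [tp for tp in partitions if copies[tp] == 0]
```

Calling it with the lines of a broker log file and the names `tieredTopic01173-0` through `tieredTopic01173-13` would list the partitions that have no copy activity at all, which narrows down where to look next.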

freshtang changed the title from "Some data were not uploaded to the S3 when the partition num of topic > 7" to "Some data were not uploaded to the S3 when the num of partition > 7" on Jan 19, 2024
@jeqo
Contributor

jeqo commented Jan 23, 2024

@freshtang thanks for reporting this issue.

I can see you have a topic with 14 partitions.
Could you confirm whether the partitions you highlight have more than one segment? If a partition has only one (active) segment, there are no candidate segments to upload.
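
A quick way to check this on the broker is to list the segment files in a partition directory. The following is a minimal Python sketch under the usual Kafka on-disk layout, where each partition directory (for example `<log.dir>/tieredTopic01173-13`) holds segment files named after their base offset. The function name is hypothetical.

```python
from pathlib import Path


def rolled_segments(partition_dir):
    """Return the non-active segment files of a partition, sorted by base offset."""
    segments = sorted(
        (p for p in Path(partition_dir).glob("*.log") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    # The segment with the highest base offset is the active one,
    # so it is not a candidate for upload.
    return segments[:-1]
```

An empty result means the partition has only its active segment, so the absence of uploads would be expected.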

You could also expose JMX metrics and check whether there are any errors when copying segments (MBean: kafka.server:type=BrokerTopicMetrics,name=RemoteCopyErrorsPerSec).

@freshtang
Author

freshtang commented Jan 25, 2024

Thank you @jeqo for the follow-up. I have set segment.bytes to 100 MB, so the partitions have more than one segment.
I encountered this issue in Kafka 3.6.1, but it did not appear in Kafka 3.6.2-SNAPSHOT.

@jeqo
Contributor

jeqo commented Jan 30, 2024

@freshtang thanks for the context.

I can't find any upstream commit since 3.6.1 that might have fixed this.
Could you provide more info on the topic configuration (retention, etc.) and broker configs, and scripts to try to reproduce it?
You could use the demo: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/tree/main/demo as a baseline if it helps.

It is also worth mentioning that we haven't yet fully tested against KRaft. Could you also try to reproduce this against a ZooKeeper-based cluster?

@jeqo
Contributor

jeqo commented May 21, 2024

@freshtang we have seen similar behavior with other users. If this issue persists on your side, could you try the API timeout configuration for S3 (s3.api.call.timeout and s3.api.call.attempt.timeout)?
By default these are empty. My guess is that S3 operations are getting stuck and, without a timeout, they hang indefinitely, blocking the uploads for the assigned partition.
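
As an illustration of what setting those two keys could look like, here is a minimal Python sketch that renders them as broker property lines. Several things here are assumptions rather than confirmed plugin behavior: the `rsm.config.storage.` prefix (it depends on the broker's `remote.log.storage.manager.impl.prefix`), the unit (assumed to be milliseconds), and the example values. The helper names are hypothetical, so please check the plugin's configuration docs before applying anything.

```python
def s3_timeout_overrides(call_timeout_ms, attempt_timeout_ms,
                         prefix="rsm.config.storage."):
    """Build the two S3 timeout settings as a dict of property name to value.

    Assumptions: values are milliseconds; `prefix` matches the broker setup.
    """
    # Sanity check: a per-attempt timeout longer than the overall call
    # timeout could never take effect.
    if attempt_timeout_ms > call_timeout_ms:
        raise ValueError("attempt timeout should not exceed the call timeout")
    return {
        f"{prefix}s3.api.call.timeout": str(call_timeout_ms),
        f"{prefix}s3.api.call.attempt.timeout": str(attempt_timeout_ms),
    }


def to_properties(overrides):
    """Render the overrides as lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


# Hypothetical values: 60 s for the whole call, 20 s per attempt.
print(to_properties(s3_timeout_overrides(60_000, 20_000)))
```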

@freshtang
Author

@jeqo I tried to reproduce with a ZooKeeper-based cluster and the API timeout configuration (s3.api.call.timeout and s3.api.call.attempt.timeout), but the problem still exists.

@jeqo
Contributor

jeqo commented Jul 18, 2024

@freshtang thanks for confirming! Would it be possible to try this PR: #549 to see if the reduced allocation helps your deployment? We have seen other scenarios where the thread allocated to a certain partition runs out of memory (OOM) and leads to this behavior.
It would also be great if you could observe your threads to validate this hypothesis.
Looking forward to your results!
