
Issue on windows when closing several subscriptions #1249

Open
chrisejones opened this issue Feb 10, 2021 · 12 comments
@chrisejones

Hello,

We're seeing an issue, only on Windows, where closing several subscriptions at once seems to cause an unrelated subscription to become unavailable. We believe this is because the media driver may be spending too long flushing the log buffers for those subscriptions to disk, causing the aeron.image.liveness.timeout to be hit.

We've tried using a ramdisk as recommended in the Aeron documentation and that seemed to make closing subscriptions about twice as fast, which was not enough to solve this issue.

Is there a solution to this, other than reducing the number of streams? Would it be possible to make closing subscriptions asynchronous?
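For context, the timeout in question is a media driver system property taking a value in nanoseconds, with a documented default of 10 seconds. A sketch of how it might be raised (the 30s value and jar name are illustrative only, and raising the timeout masks, rather than fixes, a stalled driver):

```shell
# Raise aeron.image.liveness.timeout from the 10s default to 30s (value in ns).
java -Daeron.image.liveness.timeout=30000000000 \
     -cp aeron-all.jar io.aeron.driver.MediaDriver
```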

@mjpt777 (Collaborator) commented Feb 10, 2021

What version are you using? Is this the Java client?

@gr-mm commented Feb 10, 2021

Java media driver v1.31.1 (3 threads, spinning); C# client v1.29.0.

Based on network captures from the box, the media driver seems to stop sending status messages for everything for extended periods (tens of seconds) after we destroy the subscriptions and attempt to create new ones.

@mjpt777 (Collaborator) commented Feb 10, 2021

As of 1.28.0 the Java client does an async close of all resources. I'll check if this has been ported to the C# client.

Are the log buffers set to sparse=false and/or aeron.pre.touch.mapped.memory=true?

What type of drives are in those Windows machines and how often do you optimize them?
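The two settings asked about correspond to system properties, one driver-side and one client-side. A sketch of how they might be set (the jar and class names for the application are hypothetical):

```shell
# Driver side: back term buffers with fully allocated (non-sparse) files.
java -Daeron.term.buffer.sparse.file=false \
     -cp aeron-all.jar io.aeron.driver.MediaDriver

# Client side: pre-touch mapped memory so page faults are taken up front
# rather than on the duty cycle.
java -Daeron.pre.touch.mapped.memory=true -cp app.jar com.example.App
```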

@chrisejones (Author)

We're currently using sparse=true and pre-touch=false. The drives are SSDs behind an HPE RAID controller; I don't think it's possible to TRIM them.

We think the problem is actually the media driver blocking rather than the client: we can see that it stops sending status messages for unrelated subscriptions when this happens. Or could a blocked client cause this somehow? I think status messages are still sent when clients aren't making progress.

@mjpt777 (Collaborator) commented Feb 11, 2021

I believe you are right that this is the media driver taking too long to service the subscription changes.

Can you provide more details about how many active streams, term lengths, and how many subscriptions are closed and re-opened at the same time? This will help with creating a test so we can investigate.

@mjpt777 (Collaborator) commented Feb 11, 2021

Also, can you provide the Windows version and Java version this is running on?

@chrisejones (Author) commented Feb 12, 2021

The Windows version is Windows Server 2016 and the Java version is 1.8.0_121.

This is the full list of subscriptions and publications from one of the affected machines. We've seen the issue in different situations; closing and re-opening the subscriptions with stream id 70004, and roughly the same number with stream id 800XX, at the same time can cause it.

SUBSCRIPTIONS

| Subscriber Count | Stream Id Range | Channel Type | Publisher Count | Term Buffer Size |
| --- | --- | --- | --- | --- |
| 1 | 175 | ipc | 1 | 33554432 |
| 1 | 177 | ipc | 1 | 67108864 |
| 1 | 1980XX | ipc | 7 | 67108864 |
| 1 | 20011558XX | ipc | 14 | 67108864 |
| 1-8 | 1800XX | ipc | 27 | 134217728 |
| 1 | 52000 | udp | 1 | 65536 |
| 3 | 150018 | udp | 1 | 33554432 |
| 1 | 177 | udp | 4 | 33554432 |
| 1 | 180 | udp | 30 | 67108864 |
| 1 | 182 | udp | 2 | 67108864 |
| 1 | 500002 | udp | 4 | 67108864 |
| 2-3 | 70004 | udp | 4 | 268435456 |

PUBLICATIONS

| Stream Id | Channel Type | Count | Term Buffer Size | Type |
| --- | --- | --- | --- | --- |
| 175 | udp | 1 | 33554432 | concurrent |
| 102 | udp | 1 | 33554432 | concurrent |
| 150015 | udp | 8 | 67108864 | exclusive |
| 172 | udp | 7 | 67108864 | exclusive |
| 180 | udp | 1 | 67108864 | concurrent |
| 182 | udp | 1 | 67108864 | concurrent |
| 185 | udp | 1 | 67108864 | concurrent |

@mjpt777 (Collaborator) commented Feb 12, 2021

Thanks @chrisejones. Are the publications concurrent or exclusive?

@chrisejones (Author)

I've updated the table above.

@mjpt777 (Collaborator) commented Feb 16, 2021

> stream id 70004 and roughly the same number with the stream id 800XX

Did you mean 1800XX? Also, are they exclusive?

We have been simulating this and found a few things but nothing to the extent of the pauses you are describing.

@mjpt777 (Collaborator) commented Feb 19, 2021

To help avoid such issues we have made some changes. The number of commands processed per work cycle has been reduced from 10 to 2. We have also moved the interval-based sending of status messages from the conductor to the receiver thread, so status messages can continue if the conductor pauses due to file IO. While doing this, we discovered that the cached clocks could become stale when the conductor took a long pause; the sender and receiver now have their own local cached clocks so that heartbeats and status messages keep flowing.

It would be good if you could test the head of the main repository to see if this fixes the issue for you.

We would also recommend you upgrade your Java version from 1.8.0_121 to 1.8.0_282 as many bugs have since been fixed.
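A minimal sketch (not Aeron's actual classes) of the per-thread cached clock idea described above: each duty-cycle thread refreshes its own clock once per cycle, so its deadlines keep advancing even if another thread stalls on file IO.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CachedClockSketch
{
    // Hypothetical stand-in for a per-thread cached clock: the owning
    // duty-cycle thread calls update() once per cycle; other code reads
    // time() cheaply without a syscall.
    public static final class CachedMillisClock
    {
        private final AtomicLong timeMs = new AtomicLong();

        public void update(final long nowMs)
        {
            timeMs.lazySet(nowMs); // an ordered store is enough for a clock
        }

        public long time()
        {
            return timeMs.get();
        }
    }

    public static void main(final String[] args)
    {
        // One clock per thread: the receiver refreshes its own clock, so
        // status-message deadlines advance even if the conductor stalls.
        final CachedMillisClock receiverClock = new CachedMillisClock();
        receiverClock.update(System.currentTimeMillis());

        final long before = receiverClock.time();
        receiverClock.update(before + 4); // simulate the next duty cycle
        System.out.println(receiverClock.time() - before); // prints 4
    }
}
```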

@mjpt777 (Collaborator) commented Mar 10, 2021

I wonder if this could possibly be related to DNS name resolution. Are you using names rather than IP addresses? Can you try -Djava.net.preferIPv4Stack=true?
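As a sketch, the suggestion amounts to pinning the JVM to the IPv4 stack and preferring literal addresses in channel URIs (the jar, class name, and endpoint address below are examples only):

```shell
# Prefer the IPv4 stack to sidestep IPv6/DNS resolution surprises on Windows.
java -Djava.net.preferIPv4Stack=true -cp app.jar com.example.App

# In channel URIs, a literal IP address avoids DNS lookups entirely, e.g.:
#   aeron:udp?endpoint=192.168.1.10:40456   (instead of endpoint=somehost:40456)
```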

pull bot pushed a commit that referenced this issue Oct 29, 2021
* [Java] Go immediately from LEADER_LOG_REPLICATION to LEADER_REPLAY and don't update the leader's commit position.

* [Java] Move replication deadline forward if newLeadershipTerms are being received.

* [Java] Change LEADER_LOG_REPLICATION back to waiting for followers to replicate, but don't update the leader commitPosition until LEADER_REPLAY has completed.