ofi+psm2 client reconnection issues #270

Closed · ael-code opened this issue Jan 30, 2019 · 6 comments

ael-code commented Jan 30, 2019

Describe the bug
After the first client disconnects from the server, any subsequent client that tries to connect to the same server triggers an error on the server side; specifically, an assertion failure in the psm2 library:

ptl_am/am_reqrep_shmem.c:770: ptl->am_ep[shmidx].epaddr == ((void *)0)

To Reproduce

  • Do a first RPC with the first client
  • Shut down the first client
  • Do a second RPC with the second client (a minimal client sketch follows this list)
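
For illustration, here is a minimal, hypothetical sketch of what each client does, written against Mercury's C core API as it looked around that revision (asynchronous HG_Addr_lookup, HG_Forward, and a manual progress loop); the RPC name and server address are placeholders, error checking is omitted, and the lookup call may differ in later Mercury releases. Running two such client processes one after the other against the same ofi+psm2 server reproduces the assertion.

    /* Hypothetical client sketch -- not the project's actual test code.
     * It initializes Mercury over ofi+psm2, forwards one no-argument RPC
     * to the server, and finalizes.  "example_rpc" and the server address
     * are placeholders; error checking is omitted. */
    #include <mercury.h>

    static hg_addr_t svr_addr = HG_ADDR_NULL;
    static int lookup_done = 0;
    static int rpc_done = 0;

    static hg_return_t lookup_cb(const struct hg_cb_info *info)
    {
        svr_addr = info->info.lookup.addr;   /* resolved server address */
        lookup_done = 1;
        return HG_SUCCESS;
    }

    static hg_return_t forward_cb(const struct hg_cb_info *info)
    {
        (void) info;
        rpc_done = 1;                        /* RPC round trip completed */
        return HG_SUCCESS;
    }

    /* Drive Mercury progress until the given completion flag is set. */
    static void progress_until(hg_context_t *ctx, const int *flag)
    {
        while (!*flag) {
            unsigned int count;
            HG_Trigger(ctx, 0, 1, &count);
            HG_Progress(ctx, 100);
        }
    }

    int main(void)
    {
        const char *server = "ofi+psm2://...";        /* placeholder address */
        hg_class_t *cls = HG_Init("ofi+psm2", HG_FALSE);
        hg_context_t *ctx = HG_Context_create(cls);
        hg_id_t id = HG_Register_name(cls, "example_rpc", NULL, NULL, NULL);

        hg_op_id_t op_id;
        HG_Addr_lookup(ctx, lookup_cb, NULL, server, &op_id);
        progress_until(ctx, &lookup_done);

        hg_handle_t handle;
        HG_Create(ctx, svr_addr, id, &handle);
        HG_Forward(handle, forward_cb, NULL, NULL);   /* no input struct */
        progress_until(ctx, &rpc_done);

        /* Client shutdown: this is where the psm2 disconnect behaviour matters. */
        HG_Destroy(handle);
        HG_Addr_free(cls, svr_addr);
        HG_Context_destroy(ctx);
        HG_Finalize(cls);
        return 0;
    }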

Platform (please complete the following information):

  • Mercury version: master branch 6d9bfa0
  • Compiler version: gcc 8.1.0
  • OPA PSM2 version: 11.2.68
  • Libfabric version: 1.7.0 and 1.6.2 (both versions lead to the same error)

soumagne commented Feb 1, 2019

Hmm, thanks for reporting that. I think @j-xiong, who maintains the PSM2 provider, would be able to advise.


j-xiong commented Feb 1, 2019

It's a little bit weird to see a failure in the shm path while the client and the server are on different nodes. Is a full stack trace available at the point of the assertion failure?


ael-code commented Feb 2, 2019

It's a little bit weird to see a failure in the shm path while the client and the server are on different nodes.

This is not weird at all... the psm2 library has its own shared-memory implementation that is enabled by default. As I already said, the assertion error is in that library. I've also opened an issue on their official repository: cornelisnetworks/opa-psm2#34

In any case, I've discovered that we have the same issue even when the server and the client are on the very same node.

Is a full stack trace available at the point of the assertion failure?

I'll try to provide one, but in any case it is very easy to reproduce. You just need to issue two subsequent RPCs to the same psm2 server from two different clients on the same node.


j-xiong commented Feb 3, 2019

This is not weird at all... the psm2 library has its own shared-memory implementation that is enabled by default. As I already said, the assertion error is in that library. I've also opened an issue on their official repository: intel/opa-psm2#34

Yes, I understand the assertion is inside the psm2 library. Normally the shared-memory path is not supposed to be reached if connections only happen between different nodes. Did the server happen to also talk to other clients during the test?

In any case, I've discovered that we have the same issue even when the server and the client are on the very same node.

That is useful information.

I'll try to provide one, but in any case it is very easy to reproduce. You just need to issue two subsequent RPCs to the same psm2 server from two different clients on the same node.

I don't have a ready-to-use setup for Mercury, nor have I used one before, so it might be simpler if the trace is available.


ael-code commented Feb 7, 2019

Yes, we discovered that the error is triggered when a second client tries to contact the server from within the same node. If we set FI_PSM2_DISCONNECT=0 it simply doesn't work, because clients are not properly disconnected during shutdown, while if we set FI_PSM2_DISCONNECT=1 we end up with the error I've described to you.
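
For reference, a sketch of how either setting would be applied on the server side, assuming the variable is simply placed in the environment before Mercury (and therefore libfabric) is initialized; the program structure here is purely illustrative.

    /* Hypothetical server-side sketch: FI_PSM2_DISCONNECT is read by the
     * libfabric psm2 provider from the environment, so it has to be set
     * before Mercury (and therefore libfabric) is initialized. */
    #include <stdlib.h>
    #include <mercury.h>

    int main(void)
    {
        /* "0": clients are never disconnected -> later clients cannot reconnect.
         * "1": disconnect on client shutdown -> server hits the psm2 assertion. */
        setenv("FI_PSM2_DISCONNECT", "1", 1);

        hg_class_t *cls = HG_Init("ofi+psm2", HG_TRUE);  /* listening (server) class */
        /* ... register RPC handlers and run the progress loop ... */
        HG_Finalize(cls);
        return 0;
    }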

We managed to avoid the error by not using psm2 for local communication; we use it only for remote communication.
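
The comment does not say which transport replaced psm2 for node-local traffic; a sketch of one plausible arrangement, assuming Mercury's na+sm shared-memory plugin is used when the peer is on the same node, is shown below.

    /* Hypothetical helper: choose the Mercury init string per peer so the
     * psm2 shared-memory path is never exercised.  "na+sm" is an assumption;
     * the comment above only says psm2 is avoided for local traffic. */
    #include <mercury.h>

    static hg_class_t *init_hg_class(int peer_is_local, hg_bool_t listen)
    {
        const char *info = peer_is_local ? "na+sm" : "ofi+psm2";
        return HG_Init(info, listen);
    }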

So, if I understood correctly, this is probably an error in the psm2 library and not in Mercury.

ael-code closed this as completed Feb 7, 2019

sktzwhj commented May 27, 2020

Hi @ael-code, we ran into the same problem recently, but we were running the client and the server on different nodes.
