
Collective message failure with PSM2 #60

Open · patrick-legi opened this issue Feb 5, 2021 · 18 comments

@patrick-legi
Hi,
For several weeks I have been trying to understand a problem (wrong behavior) with Fortran MPI_ALLTOALLW calls. The problem only occurs on a Debian supercomputer that uses this opa-psm2 library for its Omni-Path architecture. Two OpenMPI developers and I have tested many other architectures (Intel or AMD CPUs, with Ethernet, Omni-Path or InfiniBand networks, running RedHat or SUSE OS), and the problem does not occur in any of these tests. Moreover, if I build OpenMPI on the Debian machine with the --without-psm2 flag, the problem does not occur, but Omni-Path performance is not reached.
I'm building OpenMPI 4.0.5 with gcc 6.3 or gcc 10.2 (same behavior).

Please find attached a really small test case showing the problem. If everything runs fine it prints "Test pass!"; otherwise it shows the wrong values and calls mpi_abort().
To run this test:

  1. make
  2. mpirun -np 4 ./test_layout_array

Patrick

DEBUG.tar.gz
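
For readers without the archive: the failing collective is MPI_ALLTOALLW, whose layout is driven entirely by per-peer counts, byte displacements and datatypes. The following is only a minimal sketch of that call shape, using plain MPI_INTEGER types and hypothetical names; it is not the code from DEBUG.tar.gz, which uses subarray datatypes:

  program alltoallw_shape
     use mpi
     implicit none
     integer :: comm_size, rank, ierr, i, intsize
     integer, allocatable :: scounts(:), rcounts(:), sdispls(:), rdispls(:)
     integer, allocatable :: stypes(:), rtypes(:), sendbuf(:), recvbuf(:)

     call MPI_INIT(ierr)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, comm_size, ierr)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
     call MPI_TYPE_SIZE(MPI_INTEGER, intsize, ierr)

     allocate(scounts(comm_size), rcounts(comm_size), sdispls(comm_size), &
              rdispls(comm_size), stypes(comm_size), rtypes(comm_size),   &
              sendbuf(comm_size), recvbuf(comm_size))

     scounts = 1 ; rcounts = 1
     stypes  = MPI_INTEGER ; rtypes = MPI_INTEGER
     do i = 1, comm_size
        sdispls(i) = (i-1) * intsize     ! ALLTOALLW displacements are in bytes
        rdispls(i) = (i-1) * intsize
        sendbuf(i) = 1000*rank + i       ! one element destined for each peer
     end do

     call MPI_ALLTOALLW(sendbuf, scounts, sdispls, stypes, &
                        recvbuf, rcounts, rdispls, rtypes, &
                        MPI_COMM_WORLD, ierr)

     print *, 'rank', rank, 'received', recvbuf
     call MPI_FINALIZE(ierr)
  end program alltoallw_shape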

@mwheinz

mwheinz commented Feb 5, 2021

Patrick, you never replied to my last email. Are you using CUDA?

Also - please post a sample command line showing exactly how you are running this test case, including successful and unsuccessful outputs and exactly which version of PSM2 you are using.

Intel never directly supported Debian and neither does Cornelis Networks, but I will take another look when I get a chance.

@patrick-legi
Author

patrick-legi commented Feb 5, 2021 via email

@patrick-legi
Author

About this test,

  make
   mpirun -np 4 ./test_layout_array

If everything runs fine it prints "Test pass!"; otherwise it shows the wrong values detected and calls mpi_abort().
The error also occurs with all processes on the same node.
There is a README file in the archive for more details.

No GPU on the node, and no multithreading (just MPI).

About PSM2 versions:

I've installed from GitHub. It seems to be the same as the OS version, as there are no recent commits. The COMMIT file contains:

30c52a0fd155774e18cc06328a1ba83c2a6a8104

For the OS provided libraries (also tested):

dpkg -l |grep -i psm2
ii  libfabric-psm2                         1.10.0-2-1ifs+deb9                amd64        Dynamic PSM2 provider for user-space Open Fabric Interfaces
ii  libpsm2-2                              11.2.185-1-1ifs+deb9              amd64        Intel PSM2 Libraries
ii  libpsm2-2-compat                       11.2.185-1-1ifs+deb9              amd64        Compat library for Intel PSM2
ii  libpsm2-dev                            11.2.185-1-1ifs+deb9              amd64        Development files for Intel PSM2
ii  openmpi-gcc-hfi                        4.0.3-8-1ifs+deb9                 amd64        Powerful implementation of MPI/SHMEM with PSM2

Note that I also see the bug with the OS-deployed openmpi-gcc-hfi.

Patrick

@mwheinz

mwheinz commented Feb 5, 2021

Okay. I'll try to look at this today. You're running 4 ranks on only one host?

@patrick-legi
Author

patrick-legi commented Feb 5, 2021

Yes, it is enough to show the problem. The problem size is also very small: in the main program it is set to 5x7 points to make it easy to track the problem with a debugger. Larger dimensions also show the problem.
Of course the real CFD code runs in 3D at high resolution on many nodes; this is just a test case showing the problem.

@nsccap

nsccap commented Feb 9, 2021

I can reproduce your issue on our production platform with all of the OpenMPIs I tried (2.1.2, 3.1.2 and 4.0.5). Our system is CentOS7 based with libpsm2-11.2.78-1.el7.
However, we run ~5 million jobs through this per year (most of them IntelMPI though) and we've had no issues reported by our quite wide user community. Are you sure your example is valid MPI?

@patrick-legi
Author

Hi Peter
I have also run this code successfully on many architectures, even if a bug in my CFD code or in this simple 2D test case is still possible.
When I disable the use of PSM2 on the Debian/Omni-Path cluster, the problem does not occur, but then I do not reach OPA's high performance.
When I set up this minimal test case (4 processes, very small resolution, 2D instead of 3D), the goal was to check all the subarray type parameters with gdb, and I did not find anything wrong.
Patrick
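
(Side note: rebuilding with --without-psm2 is not the only way to take PSM2 out of the path for a quick comparison. On an existing Open MPI 4.x install one can usually force the ob1 PML at run time, assuming the self, vader and tcp BTLs are present in that build; this avoids the cm PML and therefore the psm2 MTL without recompiling:

  mpirun --mca pml ob1 --mca btl self,vader,tcp -np 4 ./test_layout_array
)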

@nsccap

nsccap commented Feb 9, 2021

Note that I said I COULD reproduce it. In fact I could not make it run successfully with OpenMPI and PSM2 in any way. It did run ok without PSM2 or with IntelMPI on PSM2. This however does not guarantee that the code is correct (I've not had time to analyze it myself).

@patrick-legi
Author

I agree, even such a small code may have a bug inside... even after my thorough checks with gdb.
Does Intel MPI use the same PSM2 implementation as OpenMPI?
How can I help?

@nsccap

nsccap commented Feb 9, 2021

Well, I'm just a systems expert who read your thread on openmpi-users and thought I'd help you out by contributing my testing results. Also, it's only now that I realized you actually meant alltoallw (not a typo for alltoallv). I can imagine that being bugged without people noticing. In fact, this is the first time I've heard of an application that uses it (I'm sure there are examples I've missed, though). Anyway, alltoallw is not very common and probably sees very limited testing...

edit: yes, my IntelMPI test was using the same PSM2. One can do "export PSM2_IDENTIFY=1" before mpirun to get runtime info on what is being used.
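
For readers less familiar with this collective: ALLTOALLW generalizes ALLTOALLV by taking a datatype per peer and byte displacements instead of element displacements, which is why it exercises the transport's datatype handling much harder. A sketch of the two Fortran call shapes, with hypothetical buffer and count names (not taken from the test case):

  ! ALLTOALLV: one datatype for all peers, displacements counted in elements
  call MPI_ALLTOALLV(sbuf, scounts, sdispls, MPI_INTEGER, &
                     rbuf, rcounts, rdispls, MPI_INTEGER, comm, ierr)

  ! ALLTOALLW: one datatype per peer, displacements counted in bytes
  call MPI_ALLTOALLW(sbuf, scounts, sdispls, stypes, &
                     rbuf, rcounts, rdispls, rtypes, comm, ierr)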

@mwheinz

mwheinz commented Feb 9, 2021 via email

@mwheinz

mwheinz commented Feb 9, 2021

Patrick,

I just tried your DEBUG package on my machines and I did get your error when I used PSM2 - but I got the same error when I used verbs, so I still don't know that this is a PSM2 issue. Here's what I did:

[cn-priv-01:~/work/STL-61275/DEBUG](N/A)$ mpirun --mca mtl_base_verbose 9 --mca mtl ofi --mca mtl_ofi_provider_exclude psm2 --mca FI_LOG_LEVEL info -np 4 ./test_layout_array
[cn-priv-01:1211573] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211573] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211573] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211573] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211573] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[cn-priv-01:1211575] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211575] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211575] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211575] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211575] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[cn-priv-01:1211574] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211574] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211574] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211574] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211574] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[cn-priv-01:1211576] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211576] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211576] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211576] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211576] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
On 1 found 1007 but expect 3007
Test fails on process rank 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cn-priv-01:1211569] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
On 2 found 1007 but expect 4007
Test fails on process rank 2
[cn-priv-01:1211576] mtl_ofi.h:511: fi_tsendddata failed: No route to host(-113)
[cn-priv-01:1211569] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[cn-priv-01:1211569] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@mwheinz

mwheinz commented Feb 9, 2021

I get the same failure when using OFI sockets:

...

[cn-priv-01:~/work/STL-61275/DEBUG](N/A)$ mpirun --mca mtl_base_verbose 9 --mca mtl ofi --mca mtl_ofi_provider_include sockets -x FI_LOG_LEVEL=trace -x glibc.malloc.check=1 -np 4 ./test_layout_array
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213328:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213329:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213327:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213330:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
[cn-priv-01:1213328] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213328] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
[cn-priv-01:1213330] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213330] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
[cn-priv-01:1213327] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213327] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
[cn-priv-01:1213329] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213329] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
On 1 found 1007 but expect 3007
Test fails on process rank 1
On 2 found 1007 but expect 4007
Test fails on process rank 2
libfabric:1213328:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
libfabric:1213328:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
libfabric:1213329:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
libfabric:1213329:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cn-priv-01:1213323] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[cn-priv-01:1213323] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn-priv-01:~/work/STL-61275/DEBUG](N/A)$ 

...

@mwheinz

mwheinz commented Feb 9, 2021

I haven't used Fortran in ~20 years so I'm having trouble reading your sample app. What is the largest chunk of data that you send at one time?

@patrick-legi
Author

The largest chunk contains 4 elements, the smallest 1 element.
The structure (a variable called val) contains 2 arrays describing the organization of the chunks:

  1. Y4layoutOnX(ncpus,2): when data are stored along the X axis, it holds the Ymin,Ymax on each rank.
  2. X4layoutOnY(ncpus,2): when data are stored along the Y axis, it holds the Xmin,Xmax on each rank.

In Fortran the arrays are allocated with their real indices in the global array, e.g. (1:nx, ymin:ymax).
The test case switches from one organization to the other and back.
(attached images: alongy, alongY)
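
One detail worth keeping in mind when reading layouts like this: with MPI_TYPE_CREATE_SUBARRAY, the start coordinates are always zero-based offsets into the locally allocated array, regardless of the Fortran bounds it was declared with (such as 1:nx, ymin:ymax). A hedged sketch with hypothetical names (jlo/jhi for the rows going to one peer), not taken from the test case:

  integer :: sizes(2), subsizes(2), starts(2), subtype, ierr

  sizes    = (/ nx, ymax - ymin + 1 /)   ! extents of the local allocation
  subsizes = (/ nx, jhi - jlo + 1 /)     ! the piece exchanged with one rank
  starts   = (/ 0,  jlo - ymin /)        ! zero-based offsets inside the allocation
  call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
                                MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subtype, ierr)
  call MPI_TYPE_COMMIT(subtype, ierr)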

@mwheinz

mwheinz commented Feb 9, 2021

Well, that ruins that idea. Many transports have a maximum message size but don't enforce it, leading to data corruption - but you'd have to be sending 2 gigabytes or more in a single message for this to become a factor for PSM2.

@mwheinz

mwheinz commented Feb 9, 2021

Patrick, I'm going to continue to look at this when I can - but since I get the same error with verbs and with sockets, I really think you should move this to the OMPI repo.

@patrick-legi
Author

Thanks, Michael, for your help. I'll open an issue on the OMPI repo soon; this week I have a lot of teaching hours, so maybe at the end of the week. I will also point it to this discussion.
