
mpi: add test that physical cores are not oversubscribed #17

Merged: 2 commits into flux-framework:main from vcpu-test on Mar 1, 2024

Conversation

@wihobbs (Member) commented Feb 22, 2024

Problem: A common failure in some MPI implementations is that all the MPI ranks end up packed on the same core. This usually shows up if pmix is configured improperly. It causes lock contention, achieves no parallelization, and is bad, so we want to catch if that's happening.

Add a test to our CI that collects the physical cpu number for each MPI rank, then checks that no two are the same. Note that this will fail if cores are deliberately oversubscribed (but they shouldn't be in CI).

This is work-in-progress. A few things left to do:

  • vcpu numbers can be the same if you're running across two nodes. So each node needs to have its own array where it checks for collisions.
  • Currently, every rank in the MPI task checks to make sure there are no collisions. This is probably wasteful, although those cores would be sitting idle anyway, so maybe it's not.
  • We want to test an MPI_Allreduce (even though that's already tested in hello.c) but currently MPI_Allgather is used.
  • Memory check -- run under valgrind and make sure memory is being allocated/freed properly.
  • Clean-up of formatting to be consistent with Flux standards.

@wihobbs wihobbs marked this pull request as draft February 22, 2024 22:26
@wihobbs wihobbs force-pushed the vcpu-test branch 2 times, most recently from a0327e0 to 259dd37, on February 26, 2024 22:49
@wihobbs wihobbs marked this pull request as ready for review February 26, 2024 22:50
@wihobbs wihobbs requested a review from grondo February 26, 2024 22:50
@grondo left a comment

LGTM! Just a couple minor comments. Be sure to remove WIP before merging :-)

mpi/vcpu.c Outdated
#include <sched.h>
#define MASTER 0

int sched_getcpu (void);
@grondo commented:

This prototype declaration should not be necessary if sched.h is included as documented in the sched_getcpu(3) man page:

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    sched_getcpu():
           Since glibc 2.14:
               _GNU_SOURCE

@wihobbs (Member Author) replied:

Do you mind if I leave this? Both Cray's compilers and gcc throw a warning if I pull it out:

(s=33,d=0) tioga11 ~/flux-test-collective/mpi (vcpu-test)$ mpicc -o vcpu vcpu.c
vcpu.c: In function 'main':
vcpu.c:65:11: warning: implicit declaration of function 'sched_getcpu'; did you mean 'sched_getparam'? [-Wimplicit-function-declaration]
   65 |     cpu = sched_getcpu ();
      |           ^~~~~~~~~~~~
      |           sched_getparam

@grondo replied:

You get this warning because you are not defining _GNU_SOURCE before including sched.h as documented in the "Feature Test Macro Requirements" I referenced above.

It is slightly bad practice to declare a prototype for an external function when a valid header for it exists. If the prototype is incorrect, you may get a runtime crash that would have been easily avoided by including the correct prototype.
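(For illustration, a minimal standalone sketch of this suggestion: defining _GNU_SOURCE before including sched.h exposes the sched_getcpu() prototype, so the hand-written declaration is not needed. This is not the PR's vcpu.c, just an example.)

/* Define _GNU_SOURCE before any includes so glibc exposes
 * sched_getcpu() from sched.h, per feature_test_macros(7). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main (void)
{
    int cpu = sched_getcpu ();  /* CPU this thread is running on, or -1 on error */
    if (cpu < 0) {
        perror ("sched_getcpu");
        return 1;
    }
    printf ("running on cpu %d\n", cpu);
    return 0;
}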

@wihobbs (Member Author) replied:

Thanks. I just recalled that @garlick had asked for a small blurb at the top of the test describing its pass/fail conditions. Added that too.

mpi/vcpu.c Outdated (resolved)
@wihobbs wihobbs changed the title from "WIP: mpi: add test that physical cores are not oversubscribed" to "mpi: add test that physical cores are not oversubscribed" on Feb 27, 2024
@garlick (Member) commented Feb 27, 2024

I'm fuzzy on what the actual failure mode is here and what it might have to do with pmix? Can you elaborate or is there an issue to point to?

@wihobbs (Member Author) commented Feb 27, 2024

I'm fuzzy on what the actual failure mode is here and what it might have to do with pmix? Can you elaborate or is there an issue to point to?

A failure mode might have output that looks like this:

Hello from local rank 0 (global rank 0) on corona244 vcpu 47
Hello from local rank 1 (global rank 1) on corona244 vcpu 47
Hello from local rank 2 (global rank 2) on corona244 vcpu 47
Hello from local rank 3 (global rank 3) on corona244 vcpu 47

On corona, where there are 48 cores per node, it shouldn't be packing 4 MPI ranks on one core by default. (Note that I haven't actually ever seen this happen, I made that output up.)

If I try and force this behavior, it ends up looking like this:

 flux run --cores=1 --tasks-per-core=4 -o verbose=2 ./vcpu
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 SPLIT_TYPE FROM 0
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
PMI_Abort: (0) N/A
Hello from local rank 0 (global rank 0) on corona244 vcpu 47
33.249s: flux-shell[0]: FATAL: doom: rank 0 exited and exit-timeout=30s has expired

Re: what it might have to do with pmix...apparently, if pmix is configured improperly, it can have behavior similar to what I showed above -- packing a ton of MPI ranks on one core instead of spreading them out like it should. Again, I haven't ever seen it do this, but this was something @nhanford suggested we test for in case that behavior ever pops up.

One thing worth mentioning is that we considered using this code instead of the mpi hello-world test from flux-core (originally, that's how it was written) but @grondo suggested keeping it as a separate test to give us a broader testsuite.

Does that help? Sorry if it doesn't, I'm not an expert on this myself...

@nhanford (Contributor) commented Feb 27, 2024

Hello @garlick, this is a quick way of categorically ruling out Open MPI failing over to singleton mode when PMIx is misconfigured. Also, TCE-32 on the internal LC Jira shows an example along similar lines where affinity was being set incorrectly. However, I don't know if we ever ruled out PMIx on that one, but it is still worthwhile to detect discrepancies between what Flux set for affinity and where processes actually ended up, IMHO.

@wihobbs (Member Author) commented Feb 27, 2024

This is a quick way of categorically ruling out Open MPI failing over to singleton mode when PMIx is misconfigured.

Mmmm I think because of the localRank communicator split this might not pick up if they all show up as a bunch of singletons....this is back to WIP while I investigate that.

@wihobbs wihobbs marked this pull request as draft February 27, 2024 21:57
@garlick (Member) commented Feb 27, 2024

Maybe a simpler test to detect multiple singletons would be to just have rank 1 print its rank and if it's 1 pass, and if 0 fail?

Edit: Duh, that wouldn't work - but maybe just have it fail if the size is 1 and always run > 1?

Why the communicator split?

@garlick (Member) commented Feb 28, 2024

p.s. if @nhanford is on board here, this can be merged. I was just trying to understand.

@wihobbs (Member Author) commented Feb 28, 2024

I think we need to keep this test, since it's still looking at a behavior that could be problematic (core oversubscription when that shouldn't happen). However, I think we need an additional test to detect if multiple singletons are being created, because this test won't pick that up as it's currently designed.

One smoke test would be to have the rank 0 MPI process create a new file in /tmp/fluxci and call MPI_Abort if that file has already been created. Then, if there were multiple rank 0 instances the test would fail.
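(For illustration only, a rough sketch of that smoke test; the marker filename under /tmp/fluxci is hypothetical and this is not code from this PR.)

/* Sketch of the smoke test idea described above: rank 0 tries to
 * create a marker file exclusively; if the file already exists,
 * another "rank 0" got there first, which suggests multiple
 * singletons were launched, so abort. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    MPI_Init (&argc, &argv);

    int rank;
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int fd = open ("/tmp/fluxci/singleton-marker",
                       O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0 && errno == EEXIST) {
            fprintf (stderr, "marker already exists: multiple rank 0s?\n");
            MPI_Abort (MPI_COMM_WORLD, -1);
        }
        if (fd >= 0)
            close (fd);
    }

    MPI_Finalize ();
    return 0;
}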

Thanks @garlick and @nhanford for the comments, you made me realize I had misunderstood what behavior we were looking for!

@garlick (Member) commented Feb 28, 2024 via email

@wihobbs (Member Author) commented Feb 28, 2024

Why the communicator split?

I'm going to explain what I thought we were trying to test for in vcpu.c and maybe that will clear up why this was designed in this way.

I thought we were looking for output that looked like this:

Hello from local rank 0 (global rank 0) on corona244 vcpu 47
Hello from local rank 1 (global rank 1) on corona244 vcpu 47
Hello from local rank 0 (global rank 2) on corona243 vcpu 46
Hello from local rank 1 (global rank 3) on corona243 vcpu 46

where it had unnecessarily packed multiple MPI ranks onto one core instead of spreading them out. In order to detect this, I had each rank fetch its rank number and sched_getcpu number and write this to stdout, then hit a barrier, then report all the sched_getcpu numbers to an array via an MPI_Allgather call (this has the added benefit of testing point-to-point and all-to-all MPI communication, which we're already testing anyway, but I digress).

The rank 0 MPI task then iterated over that array to check that each one was unique. This worked well for detecting errors on one node, but failed when I tried to run it across two, because the sched_getcpu numbers can be duplicated -- two nodes have two of core # 1, 2, 3, ..., 96.

In order to solve that, I split the communicators by node, and had each node iterate over its own array to check that no two sched_getcpu numbers were the same. If an overlap was detected, it would call MPI_Abort.
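(To make the scheme concrete, here is a rough sketch of that design. It is an illustration rather than the PR's vcpu.c: the use of MPI_Comm_split_type with MPI_COMM_TYPE_SHARED for the per-node split, and all variable names, are assumptions.)

/* Sketch of the approach described above: gather each rank's CPU
 * number within a per-node communicator and abort if any two local
 * ranks report the same CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    MPI_Init (&argc, &argv);

    int rank;
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    /* Split MPI_COMM_WORLD into one communicator per node so that
     * sched_getcpu() values are only compared within a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type (MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                         MPI_INFO_NULL, &node_comm);

    int local_rank, local_size;
    MPI_Comm_rank (node_comm, &local_rank);
    MPI_Comm_size (node_comm, &local_size);

    int cpu = sched_getcpu ();
    int *cpus = malloc (local_size * sizeof (int));
    if (!cpus)
        MPI_Abort (MPI_COMM_WORLD, -1);

    /* Every rank on the node learns every other local rank's cpu. */
    MPI_Allgather (&cpu, 1, MPI_INT, cpus, 1, MPI_INT, node_comm);

    /* Check for collisions: no two local ranks should share a cpu. */
    for (int i = 0; i < local_size; i++)
        for (int j = i + 1; j < local_size; j++)
            if (cpus[i] == cpus[j]) {
                fprintf (stderr,
                         "local ranks %d and %d are both on cpu %d\n",
                         i, j, cpus[i]);
                MPI_Abort (MPI_COMM_WORLD, -1);
            }

    free (cpus);
    MPI_Comm_free (&node_comm);
    MPI_Finalize ();
    return 0;
}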

When I said:

Mmmm I think because of the localRank communicator split this might not pick up if they all show up as a bunch of singletons

I was actually wrong; this test won't pick up singletons, but it has nothing to do with the communicator split. This test won't pick up singletons because it relies on the assumption that all the ranks initialized when the test starts will report to each other their core number, which isn't the case if they're singletons (silly Hobbs). (And even if they could all report their core number, they might still be unique.) So, in the case of n singletons being created, you'll have n arrays of size 1 containing 1 sched_getcpu number, which will all pass the uniqueness test and exit with a 0 return code.

@wihobbs (Member Author) commented Feb 28, 2024

A file check would work but it would be easier and equivalent to fail if the mpi size is 1.

Nice idea. I'll make this in a new PR. If we're good to keep this test, I'm removing WIP.
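(A rough sketch of what that size check might look like; illustration only, not code from this PR or the follow-up.)

/* Sketch of the suggested singleton check: if Open MPI has fallen
 * back to singleton mode, each process sees a world size of 1, so a
 * test that is always launched with more than one task can simply
 * fail when MPI_Comm_size reports 1. */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    MPI_Init (&argc, &argv);

    int size;
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    if (size == 1) {
        fprintf (stderr, "world size is 1: tasks may have started as singletons\n");
        MPI_Abort (MPI_COMM_WORLD, -1);
    }

    MPI_Finalize ();
    return 0;
}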

@wihobbs wihobbs marked this pull request as ready for review February 28, 2024 17:09
@wihobbs wihobbs requested a review from grondo February 28, 2024 17:55
wihobbs and others added 2 commits March 1, 2024 09:02
Problem: A common failure in some MPI implementations is that
all the MPI ranks end up packed on the same core. This causes
lock contention, achieves no parallelization, and is bad, so we
want to catch if that's happening.

Add a test to our CI that collects the physical cpu number for
each MPI rank, then checks that no two are the same. Note that this
will fail if cores are deliberately oversubscribed (but they
shouldn't be in CI).

Co-authored-by: Nathan Hanford <8302958+nhanford@users.noreply.github.com>

Add a clang-format file to standardize all C formatting for future
tests.
@grondo left a comment

LGTM!

@wihobbs (Member Author) commented Mar 1, 2024

A lot of learning happened in this one. Thanks @grondo @garlick and @nhanford for all the help!

@wihobbs (Member Author) commented Mar 1, 2024

(and fun too 🎉 )

@wihobbs wihobbs merged commit 6fd80ab into flux-framework:main Mar 1, 2024
1 check passed