Framework: Switch CUDA AT2 build to be non-UVM and enable tests #13439
Which will also cause it to start running all of the appropriate tests. If I remember correctly, we had this disabled because the containers were running out of disk space, but we want this enabled for the "real" PR configuration. Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
We disable X11 everywhere else, so be consistent here. In the future, we probably want to enable this, since we DO have X11 in the containers, but getting that hooked up and working is for another day. Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
The CUDA tests look good, with four exceptions, detailed here: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=211376 @trilinos/intrepid2 My records show that the failing test was set to RUN_SERIAL for CUDA builds; I can do that here as well if that's still what we want to do. If any developers from the tagged teams can provide any insight into the four failing tests (and they do fail reliably), it would be much appreciated! I can turn them off, but I wanted to at least do SOME due diligence and see what the community thinks. |
Yes, please. The |
@cgcgcg - would you mind taking a look at the panzer/mini-em failure here? Looks to be a linear solver issue similar to what you have fixed in the past. |
I see this message in the output of the failing Stratimikos and Panzer tests:
|
For some reason, there are a couple of tests that are failing when RDMA support is initialized. I debugged it to the point of disabling the smcuda BTL in OpenMPI. My guess is that something is wrong with our container build of OpenMPI, OR there is something different hardware-wise about our new Ampere80 machines (I checked the PCI bus addresses because that was something that a brief Google investigation indicated, but they didn't look any worse than the Volta70 machines). Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
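For readers unfamiliar with the mechanism, here is a minimal sketch of one common way to exclude the smcuda BTL for a test run. It relies on Open MPI reading MCA parameters from `OMPI_MCA_<name>` environment variables, where a leading `^` excludes the listed components. The helper name `env_without_smcuda` is hypothetical, and this is not necessarily how the change in this PR is implemented.

```python
import os


def env_without_smcuda(base_env=None):
    """Return a copy of the environment with Open MPI's smcuda BTL excluded.

    Hypothetical helper for illustration only; the PR may disable the BTL
    through a different mechanism (for example, a launcher option).
    """
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
    # The leading '^' means "use every BTL except the ones listed".
    # Note: this overwrites any OMPI_MCA_btl value already present.
    env["OMPI_MCA_btl"] = "^smcuda"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the test driver, so the setting applies only to that run and leaves the caller's environment untouched.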
That looks like issues with |
@sebrowne Do we set |
We do not. I did do some debuggery and that particular error went away when I disabled the |
New results with the Kokkos option, my disabling of the smcuda BTL, and running the Intrepid2 test serially: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=217579 I'm seeing the same tests fail (except for the Intrepid2 one), but perhaps in a more straightforward way? I see NaN errors from Belos. |
@sebrowne Thanks for adding the option. Seems like the message went away. I'll have another look to see what's wrong. |
Fyi, The option |
We decided to make |
I just went down the rabbit hole. The issue in this test |
Perfect, thank you so much! I'll get on fixing it ASAP. |
@trilinos/framework
Motivation
Want to align the CUDA AT2 build with the old AutoTester one.
Related Issues
https://sems-atlassian-son.sandia.gov/jira/browse/TRILFRAME-673