Return resource control to ats to avoid scheduling issues with the flux adapter. #107

Open. jwhite242 wants to merge 1 commit into main from bugfix/jwhite242/flux_adapter_fix.

jwhite242
Collaborator

In testing, I found that splitting resource tracking between ats and flux was confusing the schedulers, resulting in much lower throughput than expected when running lots of short-lived jobs.

Additionally, the scheduled time limit on the flux mini run is disabled, returning control of that to ats's scheduling. Flux was refusing to schedule jobs that were too close to the end time, but ats wasn't able to pick up on that and just left resources idle.
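
To illustrate the time-limit problem described above, here is a toy sketch of the idea of keeping the "does this job still fit?" decision in a single scheduler. This is not the actual ats or flux code: `Job`, `fits_before_end`, and `select_runnable` are hypothetical names made up for this example, and the times are plain seconds rather than anything read from a real allocation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; not an ats or flux type.
    name: str
    time_limit_s: float  # requested wall-clock limit, in seconds


def fits_before_end(job: Job, now_s: float, end_s: float) -> bool:
    """Return True if the job's time limit fits in the time remaining."""
    return now_s + job.time_limit_s <= end_s


def select_runnable(jobs, now_s, end_s):
    """Split queued jobs into those that can start now and those that cannot.

    Doing this check in the front-end scheduler makes the decision visible
    to it, so it can skip or report jobs that will never start instead of
    waiting on them while resources sit idle.
    """
    runnable, skipped = [], []
    for job in jobs:
        if fits_before_end(job, now_s, end_s):
            runnable.append(job)
        else:
            skipped.append(job)
    return runnable, skipped


if __name__ == "__main__":
    queue = [Job("short", 30.0), Job("long", 600.0)]
    runnable, skipped = select_runnable(queue, now_s=0.0, end_s=120.0)
    print("runnable:", [j.name for j in runnable])
    print("skipped:", [j.name for j in skipped])
```

When a second scheduler applies the same kind of check on its own and silently declines a job, the first one has no way of knowing the job will never start, which matches the idling seen in testing.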

This MR addresses the issues mentioned in #106.

@dawson6
Member

dawson6 commented Feb 28, 2023

@jwhite242, can you go ahead and push your branch(es) to main, please? I like to do a pull so I can test the changes as part of the approval process.

Member

@dawson6 dawson6 left a comment

Looks good. In general, we like to do a rebase before the merge if you can do that.

@jwhite242 jwhite242 force-pushed the bugfix/jwhite242/flux_adapter_fix branch from 8d37ea5 to 5ccd2b5 on March 1, 2023 21:08
@jwhite242 jwhite242 force-pushed the bugfix/jwhite242/flux_adapter_fix branch from 5ccd2b5 to 5698a1e on March 16, 2023 17:42
@dawson6
Member

dawson6 commented Aug 16, 2023

Jeremy, can you test with the current version of ATS (7.0.114 or the main branch) on rzvernal?
We have reworked how we track jobs under flux, and this may have helped with this issue.

@jwhite242
Collaborator Author

jwhite242 commented Aug 17, 2023

> Jeremy, can you test with the current version of ATS (7.0.114 or the main branch) on rzvernal. We have reworked how we track jobs under flux, and this may have helped this issue.

@dawson6 We've been using 7.0.114 for ~3 weeks now, and things seem to be OK at the moment. We're not using any of the limiters (concurrency, time limit, etc.), though, in case you're looking for feedback on those.

I think we can close this PR for now.
