Add blackbox probing of slurm endpoints #147

Draft: wants to merge 1 commit into main

sjpb (Collaborator) commented Feb 15, 2022

Add probing of the slurmctld and slurmd ports using the Prometheus blackbox exporter. This will allow monitoring and alerting on whether slurmctld and slurmd are up and reachable.

The default behaviour (i.e. the "everything" layout) is to put the blackbox exporter on the Prometheus node, on the basis that this removes a failure point. Is that appropriate?

Note that there are Slurm exporters for Prometheus, but they seem slightly fragile and dependent on specific versions. Port probing sits at the infra layer anyway, so it is probably still useful alongside them.
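
For illustration only, the kind of check a port probe performs can be sketched as a plain TCP connect. This is not the blackbox exporter's code or this PR's configuration; the port numbers are the stock Slurm defaults (`SlurmctldPort` 6817, `SlurmdPort` 6818) and are an assumption about the deployment, and `tcp_probe` is a hypothetical helper name.

```python
import socket

# Assumed Slurm default ports; a site may override these in slurm.conf.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A call such as `tcp_probe("control-node", SLURMCTLD_PORT)` (hostname hypothetical) only shows that something accepts connections on the port, not that the daemon is healthy, which matches the "up/reachable" scope described above.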

TODO:

  • Expand slurmd endpoints to all compute/login nodes.
  • Label jobs appropriately.
  • Create dashboard.
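
One possible way to approach the first two items, sketched under assumptions: if the probe targets were supplied through Prometheus file-based service discovery, each group of nodes could become one target group with its own labels. The hostnames, the `service` and `role` label names, and the `slurmd_targets` helper are all hypothetical, and the PR may well generate targets differently (e.g. by templating from the inventory).

```python
import json

SLURMD_PORT = 6818  # assumed Slurm default SlurmdPort


def slurmd_targets(hosts, role, port=SLURMD_PORT):
    """Build one file_sd-style target group for the given hosts."""
    return {
        # Sort so the generated file is stable between runs.
        "targets": [f"{host}:{port}" for host in sorted(hosts)],
        # Hypothetical label names; these end up on the probe's time series.
        "labels": {"service": "slurmd", "role": role},
    }


# Hypothetical node names, one group per node role.
groups = [
    slurmd_targets(["compute-1", "compute-0"], role="compute"),
    slurmd_targets(["login-0"], role="login"),
]
print(json.dumps(groups, indent=2))
```

Keeping the role in a label rather than in the job name would let a single probe job cover both node types while still allowing alerts and dashboard panels to be split by role.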

Could also add (probably as another job) blackbox probing for e.g. DNS?

sjpb changed the base branch from main to ood February 15, 2022 16:30
sjpb marked this pull request as draft February 15, 2022 16:32
Base automatically changed from ood to main April 4, 2022 10:01