add limits for container resources like cpu. memory, GPU in addition to requests #3168

shalberd · 2023-06-27T14:27:46Z

Is your feature request related to a problem? Please describe.
@kevin-bates Currently, in the Elyra GUI, i.e. for generic pipelines and a component therein, you can set values for CPU, Memory, and GPU and GPU Vendor.

https://elyra.readthedocs.io/en/stable/user_guide/pipelines.html#resources-cpu-gpu-and-ram

Those values are set as container request sizes, except GPU, which is set as GPU limit and used as GPU limit, as far as I see. That goes against the doc that speaks of all as "resources required".

Resources: CPU, GPU, and RAM

- Resources that the notebook or script requires. RAM takes units of gigabytes (109 bytes).
- Specify a custom Kubernetes GPU vendor, if desired. The default vendor is nvidia.com/gpu. See [this topic in the Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) for more information.
- The values are ignored when the pipeline is executed locally.
- Example: amd.com/gpu

The problem is that not setting any limits, and not giving the user any way to set limits, leads to cluster instability potentially on Kubernetes and Openshift.

Describe the solution you'd like
Add fields and properties for CPU, GPU, Memory limits in GUI, to be used later in processors and templates.
Possibly also add default factor of x1, overwritable of course, for the limits field values, based on requests fields.

Would have to add those new fields / properties to the generic properties template as well.

Describe alternatives you've considered
No real way around it if using the graphical pipeline editors with setting for runtime containers.

Additional context
Came across this when discussing Kubernetes / Openshift Resource Limits setting in an Airflow2 compatible way. Discussion quickly came to, wait a sec, no limits set ...

#3167 (comment)

The text was updated successfully, but these errors were encountered:

shalberd · 2023-10-09T06:12:03Z

@lresende what do you think? Do you know why historically, only requests, but not limits were added?

kevin-bates · 2023-10-09T16:33:50Z

Hi @shalberd. First, I apologize for the delay in responding (new employer and different application).

I happen to be looking at this very topic for one of my other projects and have reached the conclusion that CPU limits should either not be set (assuming the container can be trusted to not go crazy with resources), or set to some value so as to prevent such a rogue action from consuming all CPU resources, but absolutely not the same as the request size. Memory request/limits, on the other hand, is different and it's probably fine to equate the two (so I would argue it could be set w/o UI/metadata changes).

This all has to do with the fact that CPU is a "compressible resource" while memory is not. Here are a couple articles I found really helpful:
https://sysdig.com/blog/kubernetes-limits-requests/
https://home.robusta.dev/blog/stop-using-cpu-limits?nocache=234

I think GPU limits are difference because, IIRC, there isn't a way to "request" GPU.

To introduce CPU limits, we'd either have to introduce a new value in the metadata (and forms) or come up with some kind of factor value (e.g., limit = 2x request) that can be set via an ENV or also exposed (although we should just expose the limit if this is appearing in the UI).

shalberd · 2023-10-11T13:37:22Z

Hi @kevin-bates

Two new values / fields in forms / GUI would be the approach I suggest, for CPU and memory limits. What the, overridable, filled-in defaults (if even possible with the GUI technology here) are, can be discussed further, you've got good input. But for now, I'd even be happy just being able to set limits myself, at user discretion, non-required.

@lresende does Elyra use Django for its GUI framework? Or Bottle or ... I see jinja templates, but I cannot make heads or tails out of the Web Framework technology used. Maybe indirectly through some Jupyter extension mechanism.

paloma-rebuelta · 2023-12-19T12:57:13Z

Hello @kevin-bates and @shalberd

I am also being affected by this issue. My kubernetes cluster has argo workflow resource defaults, which means that it will populate both request and limits if they are not specified. In the case of my Elyra pipeline, I am only able to specify the compute requests, which means that the argo controller will populate the limit defaults and the pipeline automatically fails because the requests surpasses those limits. Having the option to specify the limits (or at least have the both equate) would allows to trigger the pipelines as needed.

Thanks a lot for any help!

kevin-bates · 2023-12-20T17:21:11Z

Hi @paloma-rebuelta. It is very disappointing to hear that Argo's resource limit handling is implemented that way! They should (at a minimum) provide an option to use the request value as the limit, rather than a blanket fixed limit. It is commonly recommended to NOT set limits, particularly for CPU, so that a node's resources can be fully utilized. (Memory is a bit of a different story.)

Given it appears this PR is hung up on the UI side of things, here's a suggestion that I believe could be easily implemented AND be both backward (and forward once the UI is in place) compatible.

Introduce two new environment variables: ELYRA_NODE_CPU_LIMIT_FACTOR and ELYRA_NODE_MEM_LIMIT_FACTOR. These variables hold float values, default to 0, and are set prior to starting the Jupyter Lab/Elyra instance.
Each env must have a value >= 1.0. Values < 1.0 will be treated as 0, which is equivalent to the variable not being set.
When a node is processed, its request values are set (either from the UI or via defaults). At that time, the envs are checked for values >= 1.0. Each valid limit factor value is multiplied against the corresponding request value and the integer ceiling of the resultant value is used as the corresponding limit value.
Any limit factor values < 1.0 trigger no additional logic than what is performed today. As a result, doing nothing results in today's behavior.

Once the UI portion of the changes have been done, we should fallback on the logic above to determine the limit values when the UI fields are empty. As a result, the logic above essentially becomes the default value for the limits, which could include NOT setting a limit when both the UI field and the envs are not set (or set to values < 1.0).

Would you be willing to contribute such a PR?

shalberd · 2023-12-20T18:00:42Z

@kevin-bates I agree on your stepwise solution proposed here. If you want to, I can take a crack at it next year, let's say starting January.

@romeokienzler fyi

@paloma-rebuelta if you make a PR before me, that would be ok, but I will, if ok, start with the env var approach and PR in January. Also, since I have recently added GUI value fields myself and know now how those are handled and added I might even combine in the PR adding the new GUI fields, the env vars, and the overall new logic, as @kevin-bates described.

romeokienzler · 2023-12-20T19:31:10Z

Thanks a lot @shalberd looking forward

paloma-rebuelta · 2023-12-20T20:05:26Z

Hi @shalberd I created a PR just now, let me know if you are able to have a look! #3202

shalberd · 2023-12-20T20:07:03Z

I'll only get to it by January, am traveling now and on holidays. Maybe the others can have a closer look, thank you very much for the work and effort.

shalberd added the kind:enhancement New feature or request label Jun 27, 2023

shalberd mentioned this issue Oct 9, 2023

Make generic pipeline work with airflow2x #3167

Open

paloma-rebuelta mentioned this issue Dec 20, 2023

Adding resource limits that can be selected from the UI #3202

Merged

lresende closed this as completed in #3202 Jan 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add limits for container resources like cpu. memory, GPU in addition to requests #3168

add limits for container resources like cpu. memory, GPU in addition to requests #3168

shalberd commented Jun 27, 2023 •

edited

Loading

shalberd commented Oct 9, 2023

kevin-bates commented Oct 9, 2023

shalberd commented Oct 11, 2023 •

edited

Loading

paloma-rebuelta commented Dec 19, 2023

kevin-bates commented Dec 20, 2023 •

edited

Loading

shalberd commented Dec 20, 2023 •

edited

Loading

romeokienzler commented Dec 20, 2023

paloma-rebuelta commented Dec 20, 2023

shalberd commented Dec 20, 2023 •

edited

Loading

add limits for container resources like cpu. memory, GPU in addition to requests #3168

add limits for container resources like cpu. memory, GPU in addition to requests #3168

Comments

shalberd commented Jun 27, 2023 • edited Loading

shalberd commented Oct 9, 2023

kevin-bates commented Oct 9, 2023

shalberd commented Oct 11, 2023 • edited Loading

paloma-rebuelta commented Dec 19, 2023

kevin-bates commented Dec 20, 2023 • edited Loading

shalberd commented Dec 20, 2023 • edited Loading

romeokienzler commented Dec 20, 2023

paloma-rebuelta commented Dec 20, 2023

shalberd commented Dec 20, 2023 • edited Loading

shalberd commented Jun 27, 2023 •

edited

Loading

shalberd commented Oct 11, 2023 •

edited

Loading

kevin-bates commented Dec 20, 2023 •

edited

Loading

shalberd commented Dec 20, 2023 •

edited

Loading

shalberd commented Dec 20, 2023 •

edited

Loading