
autoscale / allocate jobs based on metrics use case #118

Open
gedw99 opened this issue Jul 11, 2023 · 3 comments
gedw99 commented Jul 11, 2023

I have a use case where I need to run long-running jobs on Hetzner, where the Hetzner Robot allows me to add and remove VMs.

So it will allow me to autoscale on cheap hardware.

In order to do this I need to detect RAM and CPU usage on each server where the asyncjobs agent runs. Detection is pretty easy.

Logic is:

  • if a server is below 10% utilisation, put it into "blocked" mode: block new job allocations and move any long-running jobs off to servers with 50 to 80% utilisation. It's essentially rebalancing, so we can kill servers.
  • if a server is above 80% utilisation, start a new server and let it take jobs off the queue.
  • on a new job, find the server with the lowest utilisation that is not "blocked".
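The three rules above can be sketched as a small scheduler helper. This is a hypothetical sketch: `Server`, `pickServer`, and `rebalanceAction` are invented names, not part of asyncjobs or the autoscaler.

```go
package main

import (
	"errors"
	"fmt"
)

// Server is a hypothetical view of one Hetzner VM running the jobs agent.
type Server struct {
	Name        string
	Utilisation float64 // 0.0 .. 1.0, combined RAM/CPU metric
	Blocked     bool    // no new job allocations while true
}

// pickServer returns the non-blocked server with the lowest utilisation,
// implementing the "on a new job" rule above.
func pickServer(servers []Server) (*Server, error) {
	var best *Server
	for i := range servers {
		s := &servers[i]
		if s.Blocked {
			continue
		}
		if best == nil || s.Utilisation < best.Utilisation {
			best = s
		}
	}
	if best == nil {
		return nil, errors.New("no unblocked server available; scale up")
	}
	return best, nil
}

// rebalanceAction applies the two threshold rules to a single server.
func rebalanceAction(s Server) string {
	switch {
	case s.Utilisation < 0.10:
		return "block-and-drain" // mark blocked, migrate long-running jobs away
	case s.Utilisation > 0.80:
		return "scale-up" // start a new server to take jobs off the queue
	default:
		return "keep"
	}
}

func main() {
	pool := []Server{
		{Name: "vm-1", Utilisation: 0.05},
		{Name: "vm-2", Utilisation: 0.60},
		{Name: "vm-3", Utilisation: 0.85, Blocked: true},
	}
	s, _ := pickServer(pool)
	fmt.Println(s.Name, rebalanceAction(*s)) // vm-1 block-and-drain
}
```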

So that the logic of these two use cases can be done, NATS KV would be an easy win.
If each agent sends metrics every minute, and NATS KV expires entries on a TTL of 1 hour, it will run itself.

Then the core can issue jobs based on these metrics.

ripienaar (Member) commented

Sounds like you are issuing jobs for a specific worker?

If not, you can start as many workers as you want and they will scale up and down automatically. The only question is how to handle in-flight jobs, as today I don't support a signal that says "complete current job and exit" rather than "complete current job and get next one".

How you handle auto scaling per se is out of the control of this tool though, afaik?


gedw99 commented Jul 12, 2023

Thanks for asking.. I'm sure you're pretty busy. I would like to get some experience with asyncjobs, and this seems like a really useful use case too.

Sounds like you are issuing jobs for a specific worker?

yes.

The autoscaler is this: https://github.com/woodpecker-ci/autoscaler
It's part of a self-hosted Git, build and CI system.

A Hetzner Robot provider is included that talks to the Hetzner IaaS to control VMs:
https://github.com/woodpecker-ci/autoscaler/blob/main/provider/hetzner.go which deploys the Woodpecker agent to a VM.

If not, you can start as many workers as you want and they will scale up and down automatically. The only question is how to handle in-flight jobs, as today I don't support a signal that says "complete current job and exit" rather than "complete current job and get next one".

Yes, I agree. In this case it's a single-run job style task: run, and when done, die, basically.

That would mean I don't have to do any rebalancing, because we are basically doing the serverless pattern, which is a smarter pattern I think :)

It makes me think of Google Cloud Run, where an instance only dies if there have been no HTTP calls in the last 5 minutes.

I think these two logic patterns are the way to do it. It's simpler than what I proposed originally.
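A sketch of what such a drain signal could look like on the worker side. This is hypothetical, not something the tool supports today; the `drain` channel stands in for whatever signal would eventually exist.

```go
package main

import "fmt"

// runWorker drains jobs from the queue, one at a time. When drain is closed,
// it finishes the job in hand and exits instead of taking the next one:
// "complete current job and exit" rather than "complete current job and get
// next one". It returns the number of jobs processed.
func runWorker(queue <-chan string, drain <-chan struct{}, process func(string)) int {
	done := 0
	for {
		// Prefer the drain signal when both channels are ready.
		select {
		case <-drain:
			return done
		default:
		}
		select {
		case <-drain:
			return done
		case job, ok := <-queue:
			if !ok {
				return done
			}
			process(job)
			done++
		}
	}
}

func main() {
	queue := make(chan string, 2)
	queue <- "job-1"
	queue <- "job-2"
	drain := make(chan struct{})

	// Serverless style: signal the drain as soon as the first job completes,
	// so the worker runs once and then dies.
	n := runWorker(queue, drain, func(j string) {
		fmt.Println("processed", j)
		close(drain)
	})
	fmt.Println("jobs done:", n) // jobs done: 1
}
```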

How you handle auto scaling per se is out of the control of this tool though, afaik?

To calculate the auto scaling, https://github.com/woodpecker-ci/autoscaler/blob/main/main.go#L23 holds a reference to all the agents.
The logic is in the next function below, called "getLoad".

I am realising that the way to build this is to have a dummy VM pool locally, to simulate a Hetzner VM.
https://github.com/woodpecker-ci/autoscaler/blob/main/client.go is the client, so I need a dummy version of it.
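A minimal in-memory stand-in could look like this. The `Provider` interface and `fakePool` are invented for illustration and only loosely mirror the real Hetzner client linked above.

```go
package main

import "fmt"

// Provider is a hypothetical subset of a cloud client: just enough to
// create and destroy agent VMs.
type Provider interface {
	DeployAgent(name string) error
	RemoveAgent(name string) error
	ListAgents() []string
}

// fakePool is an in-memory Provider for local testing, standing in for
// the real Hetzner-backed client.
type fakePool struct {
	agents map[string]bool
}

func newFakePool() *fakePool { return &fakePool{agents: map[string]bool{}} }

func (p *fakePool) DeployAgent(name string) error {
	if p.agents[name] {
		return fmt.Errorf("agent %s already exists", name)
	}
	p.agents[name] = true
	return nil
}

func (p *fakePool) RemoveAgent(name string) error {
	if !p.agents[name] {
		return fmt.Errorf("agent %s not found", name)
	}
	delete(p.agents, name)
	return nil
}

func (p *fakePool) ListAgents() []string {
	out := make([]string, 0, len(p.agents))
	for name := range p.agents {
		out = append(out, name)
	}
	return out
}

func main() {
	var pool Provider = newFakePool()
	pool.DeployAgent("vm-1")
	pool.DeployAgent("vm-2")
	pool.RemoveAgent("vm-1")
	fmt.Println("agents:", len(pool.ListAgents())) // agents: 1
}
```

The autoscaling logic can then be written against `Provider` and exercised locally before wiring in the real cloud client.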


gedw99 commented Jul 12, 2023

https://github.com/windsource/picus implements the same logic and uses Woodpecker CI.
