Worker unload/start loop #2428

Open
sreeprasannar opened this issue Jun 23, 2023 · 7 comments
Labels
support · triaged (Issue has been reviewed and triaged)

Comments

@sreeprasannar

Why would TorchServe workers restart over and over?
I'm using the TorchServe Docker image, version 0.8.0.

2023-06-20T23:15:44,435 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2899
2023-06-20T23:15:44,436 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:44,436 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:44,438 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:15:46,800 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2906
2023-06-20T23:15:46,801 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2906
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:46,810 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:46,811 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:15:50,202 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2912
2023-06-20T23:15:50,203 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:15:50,211 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2912
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:50,212 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:15:54,592 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2918
2023-06-20T23:15:54,593 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2918
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:15:54,604 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:15:54,605 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:00,982 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2924
2023-06-20T23:16:00,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2924
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:00,992 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:00,993 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:10,402 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2930
2023-06-20T23:16:10,403 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2930
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:10,412 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:10,413 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:24,789 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2936
2023-06-20T23:16:24,790 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2936
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:24,799 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:16:47,198 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2942
2023-06-20T23:16:47,199 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2942
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:16:47,208 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:16:47,209 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:17:22,596 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2948
2023-06-20T23:17:22,597 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2948
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:17:22,606 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:17:22,607 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:18:18,973 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2954
2023-06-20T23:18:18,973 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:18:18,982 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2954
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:18:18,983 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:19:49,389 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2961
2023-06-20T23:19:49,389 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2961
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:19:49,398 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:19:49,399 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-06-20T23:22:14,772 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=2993
2023-06-20T23:22:14,773 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Successfully loaded /app/config/metrics.yaml.
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - [PID]2993
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Torch worker started.
2023-06-20T23:22:14,782 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Python runtime: 3.9.16
2023-06-20T23:22:14,783 [INFO ] W-9000-auto_categorization_0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
@msaroufim
Member

msaroufim commented Jun 23, 2023

It's usually one of:

  1. Model failure: does anything interesting show up in your logs/model_log.log?
  2. Config problem: what does your config.properties look like? (A minimal sketch is shown below.)
  3. Resource constraints: did you give Docker enough CPU and RAM?
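
For reference, a minimal config.properties for one model looks roughly like the sketch below (property names are taken from the TorchServe configuration docs; the values here are only placeholders):

# Listener addresses for the inference, management, and metrics APIs
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

# Where model archives live and which ones to load at startup
model_store=/home/model-server/model-store
load_models=all

# Worker sizing and how long a worker may take to respond (default 120 s)
default_workers_per_model=1
default_response_timeout=120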

@msaroufim added the support and triaged labels on Jun 23, 2023
@sreeprasannar
Author

The interesting thing is that I see this in ts_log.log at the same times the workers in model_log.log get unloaded:

2023-06-20T23:22:14,783 [DEBUG] W-9000-auto_categorization_0.1 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: DefaultChannelPromise@7698e29a(failure: io.netty.channel.StacklessClosedChannelException)
	at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:243) ~[model-server.jar:?]
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[model-server.jar:?]
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[model-server.jar:?]
	at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403) ~[model-server.jar:?]
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119) ~[model-server.jar:?]
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:201) [model-server.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

@sreeprasannar
Author

It seems related to #2357

@sreeprasannar
Author

TS uses a back_off to schedule worker recovery, and you can see the recovery schedule in the log: when a worker dies, TS tries to recover the backend worker immediately for the first 5 rounds, and the new worker tries to load the model into memory right away, which is why the memory never appears to be released.
So it seems TorchServe waits about 2 minutes for the worker to start, then decides it's unresponsive and kills it; my guess is this just keeps happening over and over.

  • I believe my inference code does have a slow startup time, so I could try optimizing that.
  • I could also increase the timeout; it's currently around 120 seconds, so maybe I could raise that (see the sketch below).
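
If I'm reading the TorchServe configuration docs right, that timeout can be raised either globally in config.properties or per model when registering; a sketch (the 600-second value and model URL are only examples):

# config.properties: raise the global response timeout (seconds; default 120)
default_response_timeout=600

# or per model when registering, via the management API:
# POST /models?url=auto_categorization.mar&initial_workers=1&response_timeout=600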

@sreeprasannar
Author

The main issue is that I have install_py_dep_per_model=True, because the model has requirements that need to be installed. That per-worker dependency install seems to be the real cause of the slow startup (sketch below).
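
For context, that flag makes every new worker run pip install against the requirements.txt packaged into the .mar (added with torch-model-archiver's -r/--requirements-file option) before the handler even starts loading, so each restart pays that cost again. The relevant config line, as a sketch:

# config.properties: install the archive's bundled requirements for each model
install_py_dep_per_model=true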

@msaroufim
Member

Yeah, that flag makes everything so slow. I'd suggest either consolidating your dependencies for all models or passing your venv in via extra files.

@thirtyninetythree

I have this same error, but on a VM (not using Docker).
torchserve==0.8.1

from abc import ABC

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from ts.torch_handler.base_handler import BaseHandler


class DiffusersHandler(BaseHandler, ABC):
    """Diffusers handler class for text-to-image generation."""

    def __init__(self):
        super(DiffusersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """Load and initialize the Stable Diffusion ControlNet pipeline.

        Args:
            ctx (context): JSON object containing information about the
                model artifacts and parameters.
        """
        controlnet = ControlNetModel.from_pretrained(
            "./stable-diffusion/controlnet", torch_dtype=torch.float16
        )
        pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
            "./stable-diffusion",
            controlnet=controlnet,
            safety_checker=None,
            torch_dtype=torch.float16,
        )
        pipe.to("cuda")
        # Keep the pipeline on the handler so inference() can use it
        self.pipe = pipe

        self.initialized = True

The logs show the worker start/stop loop:

2023-07-02T09:24:04,960 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-07-02T09:24:04,960 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
	at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:415) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
2023-07-02T09:24:04,962 [DEBUG] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-stable-diffusion_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-07-02T09:24:04,962 [WARN ] W-9000-stable-diffusion_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again

This follows the TorchServe diffusers tutorial here.
