Log entire stdout and stderr for terminated backend worker process #3036

Merged: 6 commits into pytorch:master on Apr 22, 2024

Conversation

namannandan
Collaborator

@namannandan namannandan commented Mar 21, 2024

Description

In the current implementation, the reader threads that consume stdout and stderr from the backend worker process are terminated on worker failure before the existing Scanner buffer is drained:

while (isRunning.get() && scanner.hasNext()) {
    String result = scanner.nextLine();

This causes loss of data and prevents the full output (the error traceback, in the case of a backend worker crash) from being logged.
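
For context, the essence of the fix is to keep draining the stream until EOF, i.e. until the worker process has exited and its stdout/stderr pipe is closed, rather than bailing out as soon as isRunning flips to false. Below is a minimal sketch of that pattern; it is simplified and omits the per-line processing that the actual reader thread in WorkerLifeCycle performs.

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Simplified sketch of a drain-until-EOF reader thread (not the actual
// TorchServe implementation): read every buffered line, including a crash
// traceback, before the thread exits.
class OutputDrainer implements Runnable {
    private final InputStream is;

    OutputDrainer(InputStream is) {
        this.is = is;
    }

    @Override
    public void run() {
        try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
            // hasNextLine() blocks while the process is alive and returns
            // false only at EOF, so buffered output is not dropped when the
            // worker dies.
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                System.out.println(line); // the real thread routes this to the MODEL_LOG logger
            }
        }
    }
}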

Consider a handler with a deliberate bug as follows:

from ts.torch_handler.image_classifier import ImageClassifier
import torch._dynamo
torch._dynamo.reset()

class Handler(ImageClassifier):
    def initialize(self, context):
        assert False

On loading the model with the current implementation on the master branch (notice that the entire backend worker error traceback is logged only for the first attempt to start the worker; for subsequent attempts only a portion of the traceback is logged):

2024-03-22T11:42:06,792 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-22T11:42:06,793 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-22T11:42:06,793 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change null -> WORKER_STARTED
2024-03-22T11:42:06,797 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-03-22T11:42:06,812 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Connection accepted: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000.
2024-03-22T11:42:06,814 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1711132926814
2024-03-22T11:42:06,816 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1711132926816
2024-03-22T11:42:06,848 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - model_name: resnet, batchSize: 1
2024-03-22T11:42:08,734 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG - /Users/namannan/anaconda3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-03-22T11:42:08,735 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2024-03-22T11:42:11,110 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - ONNX enabled
2024-03-22T11:42:11,110 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-03-22T11:42:11,112 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-22T11:42:11,113 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-22T11:42:11,113 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 263, in <module>
2024-03-22T11:42:11,113 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     worker.run_server()
2024-03-22T11:42:11,113 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 231, in run_server
2024-03-22T11:42:11,113 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2024-03-22T11:42:11,113 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 194, in handle_connection
2024-03-22T11:42:11,114 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2024-03-22T11:42:11,114 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-22T11:42:11,114 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service = model_loader.load(
2024-03-22T11:42:11,114 [INFO ] KQueueEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-22T11:42:11,114 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_loader.py", line 143, in load
2024-03-22T11:42:11,114 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2024-03-22T11:42:11,115 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-22T11:42:11,115 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Volumes/workplace/pytorch/serve/model_store/resnet-50-default/handler.py", line 7, in initialize
2024-03-22T11:42:11,115 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     assert False
2024-03-22T11:42:11,115 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - AssertionError
2024-03-22T11:42:11,115 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:120sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:229) [model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2024-03-22T11:42:11,124 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: resnet, error: Worker died.
2024-03-22T11:42:11,125 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-22T11:42:11,125 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1711132931125
2024-03-22T11:42:11,126 [INFO ] W-9000-resnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet_1.0-stdout
2024-03-22T11:42:11,126 [INFO ] W-9000-resnet_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet_1.0-stderr
2024-03-22T11:42:11,126 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stderr
2024-03-22T11:42:11,126 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stdout
2024-03-22T11:42:11,128 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-03-22T11:42:12,132 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/Users/namannan/anaconda3/bin/python, /Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000, --metrics-config, /Users/namannan/anaconda3/lib/python3.10/site-packages/ts/configs/metrics.yaml]
2024-03-22T11:42:13,519 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - s_name_part0=/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock, s_name_part1=9000, pid=49773
2024-03-22T11:42:13,520 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Listening on port: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-03-22T11:42:13,530 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Successfully loaded /Users/namannan/anaconda3/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-03-22T11:42:13,530 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - [PID]49773
2024-03-22T11:42:13,531 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-22T11:42:13,531 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-03-22T11:42:13,531 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-22T11:42:13,531 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-03-22T11:42:13,532 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Connection accepted: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000.
2024-03-22T11:42:13,532 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1711132933532
2024-03-22T11:42:13,533 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1711132933533
2024-03-22T11:42:13,551 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - model_name: resnet, batchSize: 1
2024-03-22T11:42:13,882 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG - /Users/namannan/anaconda3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-03-22T11:42:13,883 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2024-03-22T11:42:15,556 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - ONNX enabled
2024-03-22T11:42:15,556 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-03-22T11:42:15,557 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-22T11:42:15,557 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 263, in <module>
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     worker.run_server()
2024-03-22T11:42:15,557 [INFO ] KQueueEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 231, in run_server
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2024-03-22T11:42:15,558 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 194, in handle_connection
2024-03-22T11:42:15,558 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:120sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:229) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2024-03-22T11:42:15,558 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-22T11:42:15,558 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: resnet, error: Worker died.
2024-03-22T11:42:15,559 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service = model_loader.load(
2024-03-22T11:42:15,559 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-22T11:42:15,559 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_loader.py", line 143, in load
2024-03-22T11:42:15,559 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-03-22T11:42:15,559 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2024-03-22T11:42:15,559 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stderr
2024-03-22T11:42:15,559 [INFO ] W-9000-resnet_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet_1.0-stderr
2024-03-22T11:42:15,559 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stdout
2024-03-22T11:42:15,559 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Volumes/workplace/pytorch/serve/model_store/resnet-50-default/handler.py", line 7, in initialize
2024-03-22T11:42:15,560 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.

On loading the model with the fix in this PR (notice that the entire backend worker error traceback is logged for every worker restart):

.....
.....
2024-03-22T11:33:14,454 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-22T11:33:14,454 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-22T11:33:14,455 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change null -> WORKER_STARTED
2024-03-22T11:33:14,475 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-03-22T11:33:14,492 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Connection accepted: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000.
2024-03-22T11:33:14,495 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1711132394495
2024-03-22T11:33:14,498 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1711132394498
2024-03-22T11:33:14,539 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - model_name: resnet, batchSize: 1
2024-03-22T11:33:16,866 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG - /Users/namannan/anaconda3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-03-22T11:33:16,867 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2024-03-22T11:33:20,913 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - ONNX enabled
2024-03-22T11:33:20,915 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-03-22T11:33:20,918 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-22T11:33:20,918 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-22T11:33:20,919 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 263, in <module>
2024-03-22T11:33:20,919 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     worker.run_server()
2024-03-22T11:33:20,919 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 231, in run_server
2024-03-22T11:33:20,919 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2024-03-22T11:33:20,919 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 194, in handle_connection
2024-03-22T11:33:20,920 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2024-03-22T11:33:20,920 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-22T11:33:20,920 [INFO ] KQueueEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-22T11:33:20,920 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service = model_loader.load(
2024-03-22T11:33:20,920 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_loader.py", line 143, in load
2024-03-22T11:33:20,921 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-22T11:33:20,921 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2024-03-22T11:33:20,921 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Volumes/workplace/pytorch/serve/model_store/resnet-50-default/handler.py", line 7, in initialize
2024-03-22T11:33:20,921 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     assert False
2024-03-22T11:33:20,921 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - AssertionError
2024-03-22T11:33:20,921 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:120sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:229) [model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2024-03-22T11:33:20,936 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: resnet, error: Worker died.
2024-03-22T11:33:20,936 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-22T11:33:20,936 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1711132400936
2024-03-22T11:33:20,937 [INFO ] W-9000-resnet_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet_1.0-stderr
2024-03-22T11:33:20,937 [INFO ] W-9000-resnet_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet_1.0-stdout
2024-03-22T11:33:20,937 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stderr
2024-03-22T11:33:20,938 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stdout
2024-03-22T11:33:20,939 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-03-22T11:33:21,944 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/Users/namannan/anaconda3/bin/python, /Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000, --metrics-config, /Users/namannan/anaconda3/lib/python3.10/site-packages/ts/configs/metrics.yaml]
2024-03-22T11:33:23,709 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - s_name_part0=/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock, s_name_part1=9000, pid=48707
2024-03-22T11:33:23,710 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Listening on port: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-03-22T11:33:23,721 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Successfully loaded /Users/namannan/anaconda3/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-03-22T11:33:23,722 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - [PID]48707
2024-03-22T11:33:23,722 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-22T11:33:23,722 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-03-22T11:33:23,722 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-22T11:33:23,722 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-03-22T11:33:23,724 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Connection accepted: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000.
2024-03-22T11:33:23,724 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1711132403724
2024-03-22T11:33:23,724 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1711132403724
2024-03-22T11:33:23,743 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - model_name: resnet, batchSize: 1
2024-03-22T11:33:24,116 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG - /Users/namannan/anaconda3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-03-22T11:33:24,117 [WARN ] W-9000-resnet_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2024-03-22T11:33:26,244 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - ONNX enabled
2024-03-22T11:33:26,245 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-03-22T11:33:26,246 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-22T11:33:26,246 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-22T11:33:26,247 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 263, in <module>
2024-03-22T11:33:26,246 [INFO ] KQueueEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-22T11:33:26,247 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     worker.run_server()
2024-03-22T11:33:26,247 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 231, in run_server
2024-03-22T11:33:26,247 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-22T11:33:26,247 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2024-03-22T11:33:26,247 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 194, in handle_connection
2024-03-22T11:33:26,247 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:120sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:229) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2024-03-22T11:33:26,248 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2024-03-22T11:33:26,248 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: resnet, error: Worker died.
2024-03-22T11:33:26,248 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-22T11:33:26,248 [DEBUG] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-22T11:33:26,248 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     service = model_loader.load(
2024-03-22T11:33:26,248 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-03-22T11:33:26,248 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Users/namannan/anaconda3/lib/python3.10/site-packages/ts/model_loader.py", line 143, in load
2024-03-22T11:33:26,248 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stderr
2024-03-22T11:33:26,249 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2024-03-22T11:33:26,248 [INFO ] W-9000-resnet_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet_1.0-stderr
2024-03-22T11:33:26,249 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -   File "/Volumes/workplace/pytorch/serve/model_store/resnet-50-default/handler.py", line 7, in initialize
2024-03-22T11:33:26,249 [WARN ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet_1.0-stdout
2024-03-22T11:33:26,249 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG -     assert False
2024-03-22T11:33:26,249 [INFO ] W-9000-resnet_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-03-22T11:33:26,249 [INFO ] W-9000-resnet_1.0-stdout MODEL_LOG - AssertionError
.....
.....

Fixes #3034

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

@lxning
Collaborator

lxning commented Mar 26, 2024

It seems that the following code does not address the root cause listed in the description section: "the reader threads that consume stdout and stderr from the backend worker process are terminated on worker failure before the existing Scanner buffer is drained."

               while (scanner.hasNextLine()) {
                    String result = scanner.nextLine();
                    if (result == null) {
                        if (isRunning.get()) {
                            continue;
                        } else {
                            break;
                        }
                    }

@namannandan
Collaborator Author

namannandan commented Apr 9, 2024

It seems that the following code does not address the root cause listed in the description section: "the reader threads that consume stdout and stderr from the backend worker process are terminated on worker failure before the existing Scanner buffer is drained."

               while (scanner.hasNextLine()) {
                    String result = scanner.nextLine();
                    if (result == null) {
                        if (isRunning.get()) {
                            continue;
                        } else {
                            break;
                        }
                    }

I believe it does address the root cause. Added a regression test to confirm the behavior: https://github.com/pytorch/serve/pull/3036/files#diff-b4f49147fd772129dbf21c64a502a950c5cb76a192a76de8df4e4d2949c6816f

Without the fix in this PR (notice that only a part of the traceback is logged, causing the test to fail):

$ python -m pytest test/pytest/test_handler_traceback_logging.py
========================================================================================================= test session starts ==========================================================================================================
platform darwin -- Python 3.8.13, pytest-7.3.1, pluggy-1.0.0
rootdir: /Volumes/workplace/pytorch/serve
configfile: pytest.ini
plugins: cov-4.1.0, mock-3.12.0, timeout-2.3.1
collected 1 item                                                                                                                                                                                                                       

test/pytest/test_handler_traceback_logging.py::test_model_inference_traceback 
------------------------------------------------------------------------------------------------------------ live log setup ------------------------------------------------------------------------------------------------------------
INFO     root:model_packaging.py:54 Successfully exported model test_model to file /private/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T/pytest-of-namannan/pytest-69/test_model0
DEBUG    urllib3.connectionpool:connectionpool.py:228 Starting new HTTP connection (1): localhost:8081
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8081 "POST /models?model_name=test_model&url=test_model.mar&initial_workers=1&synchronous=false&batch_size=1 HTTP/1.1" 202 47
FAILED
.....
.....
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2024-04-09T12:34:19,122 [WARN ] W-9000-test_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: test_model, error: Worker died.
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -   File "/Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -     service = model_loader.load(
2024-04-09T12:34:19,122 [DEBUG] W-9000-test_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-test_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -   File "/Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/model_loader.py", line 143, in load
2024-04-09T12:34:19,122 [WARN ] W-9000-test_model_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -   File "/private/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T/models/a778f6da54c744b1b9307178f3969e41/handler.py", line 12, in initialize
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -     self.call1()
2024-04-09T12:34:19,122 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -   File "/private/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T/models/a778f6da54c744b1b9307178f3969e41/handler.py", line 18, in call1
2024-04-09T12:34:19,123 [WARN ] W-9000-test_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-test_model_1.0-stderr
2024-04-09T12:34:19,123 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -     self.call2()
2024-04-09T12:34:19,123 [WARN ] W-9000-test_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-test_model_1.0-stdout
2024-04-09T12:34:19,123 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG -   File "/private/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T/models/a778f6da54c744b1b9307178f3969e41/handler.py", line 21, in call2
2024-04-09T12:34:19,123 [INFO ] W-9000-test_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-test_model_1.0-stderr
2024-04-09T12:34:19,123 [INFO ] W-9000-test_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
.....
.....
FAILED test/pytest/test_handler_traceback_logging.py::test_model_inference_traceback - Failed: Timeout >60.0s
===================================================================================================== 1 failed in 62.87s (0:01:02) =====================================================================================================

With the fix in this PR:

$ python -m pytest test/pytest/test_handler_traceback_logging.py
========================================================================================================= test session starts ==========================================================================================================
platform darwin -- Python 3.8.13, pytest-7.3.1, pluggy-1.0.0
rootdir: /Volumes/workplace/pytorch/serve
configfile: pytest.ini
plugins: cov-4.1.0, mock-3.12.0, timeout-2.3.1
collected 1 item                                                                                                                                                                                                                       

test/pytest/test_handler_traceback_logging.py::test_model_inference_traceback 
------------------------------------------------------------------------------------------------------------ live log setup ------------------------------------------------------------------------------------------------------------
INFO     root:model_packaging.py:54 Successfully exported model test_model to file /private/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T/pytest-of-namannan/pytest-70/test_model0
DEBUG    urllib3.connectionpool:connectionpool.py:228 Starting new HTTP connection (1): localhost:8081
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8081 "POST /models?model_name=test_model&url=test_model.mar&initial_workers=1&synchronous=false&batch_size=1 HTTP/1.1" 202 47
2024-04-09T12:40:55,177 [INFO ] W-9000-test_model_1.0-stdout MODEL_LOG - AssertionError
2024-04-09T12:40:55,177 [INFO ] W-9000-test_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-test_model_1.0-stdout
PASSED                                                                                                                                                                                                                           [100%]
---------------------------------------------------------------------------------------------------------- live log teardown -----------------------------------------------------------------------------------------------------------
DEBUG    urllib3.connectionpool:connectionpool.py:228 Starting new HTTP connection (1): localhost:8081
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8081 "DELETE /models/test_model HTTP/1.1" 200 52


========================================================================================================== 1 passed in 15.40s ==========================================================================================================

Collaborator

@mreso mreso left a comment

see comments, overall LGTM

@@ -380,10 +380,14 @@ public void terminate() {
     @Override
     public void run() {
         try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
-            while (isRunning.get() && scanner.hasNext()) {
+            while (scanner.hasNextLine()) {
Collaborator

I assume that hasNextLine() will return true for the input stream as long as the process is alive, even though the stream does not contain new input?

Collaborator Author

I believe hasNextLine() will block until the next line of input is available and return true once it is. It will return false once EOF is encountered or the underlying stream has been closed.
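
As a standalone illustration of this behavior (not TorchServe code, just a throwaway demo), the loop below only exits once the child process has terminated and its output pipe reaches EOF:

import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerEofDemo {
    public static void main(String[] args) throws Exception {
        // Spawn a short-lived child process that prints two lines and exits.
        Process p = new ProcessBuilder("sh", "-c", "echo line1; sleep 1; echo line2")
                .redirectErrorStream(true)
                .start();

        try (Scanner scanner = new Scanner(p.getInputStream(), StandardCharsets.UTF_8.name())) {
            // hasNextLine() blocks while waiting for output and returns false
            // only once the child has exited and the pipe is closed (EOF).
            while (scanner.hasNextLine()) {
                System.out.println("read: " + scanner.nextLine());
            }
        }
        System.out.println("EOF reached, child exit code: " + p.waitFor());
    }
}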

break

for line in traceback:
    assert any(line in log for log in logs)
Collaborator

I think this should be:

assert all(line in logs for line in traceback)

any will stop after the first line is found

Collaborator Author

@namannandan namannandan Apr 19, 2024

line in logs would not work because each captured log line includes other details, such as the timestamp and log level, along with the traceback line, which cannot be hardcoded in the test, so I believe the logic would have to be line in log for log in logs.

Agree that any will stop after the first line is found, which is why I loop over each of the traceback lines I expect to find in the captured logs:

        for line in traceback:
            assert any(line in log for log in logs)

@namannandan
Collaborator Author

Tested with the changes in this PR to validate that the stdout and stderr reader threads are terminated when the backend worker process is terminated:

# start torchserve and load model 
$ torchserve --ncs --start --model-store model_store
$ curl -X POST  "http://localhost:8081/models?url=resnet-18.mar&initial_workers=1"

# kill worker process
$ kill -9 79766
.....
.....
2024-04-18T17:08:24,729 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:120sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513) ~[?:?]
        at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675) ~[?:?]
        at org.pytorch.serve.wlm.Model.pollBatch(Model.java:459) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.pollBatch(BatchAggregator.java:190) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:37) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:193) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2024-04-18T17:08:24,729 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died., responseTimeout:120sec
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513) ~[?:?]
        at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675) ~[?:?]
        at org.pytorch.serve.wlm.Model.pollBatch(Model.java:459) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.pollBatch(BatchAggregator.java:190) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:37) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:193) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2024-04-18T17:08:24,731 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2024-04-18T17:08:24,731 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2024-04-18T17:08:24,731 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1713485304731
2024-04-18T17:08:24,731 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1713485304731
2024-04-18T17:08:24,732 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-04-18T17:08:24,732 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2024-04-18T17:08:25,738 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/Users/namannan/.pyenv/versions/3.8.13/bin/python, /Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000, --metrics-config, /Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2024-04-18T17:08:25,738 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/Users/namannan/.pyenv/versions/3.8.13/bin/python, /Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000, --metrics-config, /Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2024-04-18T17:08:27,607 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - s_name_part0=/var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock, s_name_part1=9000, pid=80614
2024-04-18T17:08:27,608 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - Listening on port: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-04-18T17:08:27,621 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - Successfully loaded /Users/namannan/.pyenv/versions/3.8.13/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2024-04-18T17:08:27,622 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - [PID]80614
2024-04-18T17:08:27,622 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - Torch worker started.
2024-04-18T17:08:27,622 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-04-18T17:08:27,622 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-04-18T17:08:27,622 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - Python runtime: 3.8.13
2024-04-18T17:08:27,622 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-04-18T17:08:27,622 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000
2024-04-18T17:08:27,624 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1713485307624
2024-04-18T17:08:27,624 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - Connection accepted: /var/folders/l4/hfy8rtpx0755sys3m8m8c48w0000gs/T//.ts.sock.9000.
2024-04-18T17:08:27,624 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1713485307624
2024-04-18T17:08:27,625 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1713485307625
2024-04-18T17:08:27,625 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1713485307625
2024-04-18T17:08:27,642 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - model_name: resnet-18, batchSize: 1
2024-04-18T17:08:30,387 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - ONNX enabled
2024-04-18T17:08:30,387 [INFO ] W-9000-resnet-18_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-04-18T17:08:30,599 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2974
2024-04-18T17:08:30,599 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 2974
2024-04-18T17:08:30,599 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-18T17:08:30,599 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-18T17:08:30,599 [INFO ] W-9000-resnet-18_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:767396.0|#WorkerName:W-9000-resnet-18_1.0,Level:Host|#hostname:88665a372f4b,timestamp:1713485310
2024-04-18T17:08:30,599 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery succeeded, reset recoveryStartTS
2024-04-18T17:08:30,599 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery succeeded, reset recoveryStartTS

As can be seen from the above logs, the reader threads exit the while loop after the backend worker process is terminated:

2024-04-18T17:08:24,724 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet-18_1.0-stdout
2024-04-18T17:08:24,724 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet-18_1.0-stdout
2024-04-18T17:08:24,725 [INFO ] W-9000-resnet-18_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet-18_1.0-stderr
2024-04-18T17:08:24,725 [INFO ] KQueueEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2024-04-18T17:08:24,725 [INFO ] W-9000-resnet-18_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet-18_1.0-stderr

try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
    while (isRunning.get() && scanner.hasNext()) {

} finally {
    logger.info("Stopped Scanner - {}", getName());
    lifeCycle.setSuccess(false);
    try {
        is.close();
    } catch (IOException e) {
        logger.error("Failed to close stream for thread {}", this.getName(), e);
    }
}

@namannandan namannandan requested a review from mreso April 19, 2024 00:37
@msaroufim msaroufim removed their request for review April 20, 2024 00:55
Collaborator

@mreso mreso left a comment

LGTM

@namannandan namannandan added this pull request to the merge queue Apr 22, 2024
Merged via the queue into pytorch:master with commit f2163e8 Apr 22, 2024
12 of 13 checks passed