support torchrun and optimize gpu assignment #2209

Merged: 11 commits merged into master on Mar 31, 2023
Conversation

@lxning (Collaborator) commented Mar 30, 2023

Description


  • Support torchrun parameters in the model config YAML file.
  • Support GPU assignment via "deviceIds" in the model config YAML file.
  • Update the documentation on model config and the GPU assignment algorithm.
  • Set the torchrun log-dir under logs.

Fixes #2207

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing


  • test gpu assignment
  1. model-config.yaml
cat model-store/t/model-config.yaml
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "pp"
deviceType: "gpu"
deviceIds: [2,3]

torchrun:
    nproc-per-node: 2

pippy:
    rpc_timeout: 1800
  2. log
2023-03-30T23:45:17,829 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - [PID]67915
2023-03-30T23:45:17,830 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Torch worker started.
2023-03-30T23:45:17,830 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Python runtime: 3.8.16
2023-03-30T23:45:17,830 [DEBUG] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-bloom_1.0 State change null -> WORKER_STARTED
2023-03-30T23:45:17,830 [DEBUG] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-bloom_1.0 State change null -> WORKER_STARTED
2023-03-30T23:45:17,834 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29500
2023-03-30T23:45:17,834 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29500
2023-03-30T23:45:17,840 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.29500.
2023-03-30T23:45:17,840 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29501
2023-03-30T23:45:17,840 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29501
2023-03-30T23:45:17,842 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.29501.
2023-03-30T23:45:17,843 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1680219917843
2023-03-30T23:45:17,843 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1680219917843
2023-03-30T23:45:17,858 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - model_name: bloom, batchSize: 1
2023-03-30T23:45:17,864 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - model_name: bloom, batchSize: 1
2023-03-30T23:45:18,439 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Transformers version 4.27.4
2023-03-30T23:45:18,440 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - CUDA_VISIBLE_DEVICES=2,3
2023-03-30T23:45:18,440 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Transformers version 4.27.4
2023-03-30T23:45:18,440 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - CUDA_VISIBLE_DEVICES=2,3
2023-03-30T23:45:18,470 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - n_devs=2
2023-03-30T23:45:18,471 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker0, 1: 0
2023-03-30T23:45:18,471 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker1, 1: 1
2023-03-30T23:45:18,471 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rank = 1 pid/device = 67916/cuda:1
2023-03-30T23:45:18,472 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - n_devs=2
2023-03-30T23:45:18,472 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker0, 0: 0
2023-03-30T23:45:18,472 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker1, 0: 1
2023-03-30T23:45:18,473 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rank = 0 pid/device = 67915/cuda:0
2023-03-30T23:45:18,859 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rpc.timeout=1800
2023-03-30T23:45:18,859 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rpc.timeout=1800
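
The log above shows the effect of the deviceIds setting: deviceIds: [2,3] from model-config.yaml appears as CUDA_VISIBLE_DEVICES=2,3 in the worker, so the two torchrun ranks address those GPUs as cuda:0 and cuda:1. A rough, illustrative sketch of that remapping in plain Python (not TorchServe internals):

import os

# Illustrative only: physical device ids [2, 3] from model-config.yaml become
# logical cuda:0 / cuda:1 once CUDA_VISIBLE_DEVICES is set for the worker.
device_ids = [2, 3]  # deviceIds from model-config.yaml
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)

nproc_per_node = 2  # torchrun nproc-per-node
for local_rank in range(nproc_per_node):
    # With CUDA_VISIBLE_DEVICES=2,3 each rank only sees two devices, indexed 0 and 1,
    # matching "rank = 0 ... cuda:0" / "rank = 1 ... cuda:1" in the log above.
    print(f"rank = {local_rank} -> cuda:{local_rank} (physical GPU {device_ids[local_rank]})")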

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

codecov bot commented Mar 30, 2023

Codecov Report

Merging #2209 (0cc3f6d) into master (4ca86f3) will increase coverage by 0.01%.
The diff coverage is n/a.

❗ Current head 0cc3f6d differs from pull request most recent head ddccf5d. Consider uploading reports for the commit ddccf5d to get more accurate results

@@            Coverage Diff             @@
##           master    #2209      +/-   ##
==========================================
+ Coverage   71.44%   71.46%   +0.01%     
==========================================
  Files          73       73              
  Lines        3338     3329       -9     
  Branches       57       57              
==========================================
- Hits         2385     2379       -6     
+ Misses        950      947       -3     
  Partials        3        3              
Impacted Files Coverage Δ
ts/service.py 74.35% <ø> (+3.09%) ⬆️

... and 2 files with indirect coverage changes


@lxning self-assigned this Mar 30, 2023
@lxning added this to the v0.8.0 milestone Mar 30, 2023
@lxning changed the title from "[WIP] support torchrun and optimize gpu assignment" to "support torchrun and optimize gpu assignment" Mar 30, 2023
@HamidShojanazeri (Collaborator) left a comment


thanks @lxning, added two minor comments

deviceIds: [0,1,2,3] # device index for gpu
Collaborator

@lxning are we still using it?

Collaborator Author

yes, this is a unit test


* The frontend parameters are controlled by TorchServe's frontend and specify the parameter names and default values. TorchServe now uses a priority order to determine the final value of a model's frontend parameters: the config.properties file has the lowest priority, followed by the model configuration YAML file, and finally the REST or gRPC model management API has the highest priority.

* The backend parameters are fully controlled by the user. A user's custom handler can access the backend parameters via the `model_yaml_config` property of the [context object](https://github.com/pytorch/serve/blob/master/ts/context.py#L24).

Collaborator

adding an example of how model_yaml_config can be useful, plus one showing how to access it in the handler.
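
For reference, a minimal sketch of what such a handler-side example could look like; the handler name is hypothetical and it assumes the pippy section from the model-config.yaml shown earlier in this PR:

# custom_handler.py (hypothetical example)
from ts.torch_handler.base_handler import BaseHandler


class CustomHandler(BaseHandler):
    def initialize(self, ctx):
        super().initialize(ctx)
        # Backend sections of model-config.yaml are exposed as a plain dict
        # via the context's model_yaml_config property (see the doc line above).
        yaml_config = getattr(ctx, "model_yaml_config", None) or {}
        # For the example config in this PR, pick up the pippy settings.
        pippy_config = yaml_config.get("pippy", {})
        self.rpc_timeout = pippy_config.get("rpc_timeout", 1800)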

@lxning merged commit c9fbc7c into master on Mar 31, 2023