support torchrun and optimize gpu assignment #2209

Merged: 11 commits merged into master on Mar 31, 2023
Conversation

@lxning (Collaborator) commented Mar 30, 2023

Description


  • Support torchrun parameters in the model config YAML file.
  • Support GPU assignment via "deviceIds" in the model config YAML file.
  • Update the documentation on model config and the GPU assignment algorithm.
  • Set the torchrun log-dir under logs.

Fixes #2207

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing


  • test gpu assignment
  1. model-config.yaml
cat model-store/t/model-config.yaml
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "pp"
deviceType: "gpu"
deviceIds: [2,3]

torchrun:
    nproc-per-node: 2

pippy:
    rpc_timeout: 1800
  2. log
2023-03-30T23:45:17,829 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - [PID]67915
2023-03-30T23:45:17,830 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Torch worker started.
2023-03-30T23:45:17,830 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Python runtime: 3.8.16
2023-03-30T23:45:17,830 [DEBUG] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-bloom_1.0 State change null -> WORKER_STARTED
2023-03-30T23:45:17,830 [DEBUG] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-bloom_1.0 State change null -> WORKER_STARTED
2023-03-30T23:45:17,834 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29500
2023-03-30T23:45:17,834 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29500
2023-03-30T23:45:17,840 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.29500.
2023-03-30T23:45:17,840 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29501
2023-03-30T23:45:17,840 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.29501
2023-03-30T23:45:17,842 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.29501.
2023-03-30T23:45:17,843 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1680219917843
2023-03-30T23:45:17,843 [INFO ] W-29500-bloom_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1680219917843
2023-03-30T23:45:17,858 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - model_name: bloom, batchSize: 1
2023-03-30T23:45:17,864 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - model_name: bloom, batchSize: 1
2023-03-30T23:45:18,439 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Transformers version 4.27.4
2023-03-30T23:45:18,440 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - CUDA_VISIBLE_DEVICES=2,3
2023-03-30T23:45:18,440 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - Transformers version 4.27.4
2023-03-30T23:45:18,440 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - CUDA_VISIBLE_DEVICES=2,3
2023-03-30T23:45:18,470 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - n_devs=2
2023-03-30T23:45:18,471 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker0, 1: 0
2023-03-30T23:45:18,471 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker1, 1: 1
2023-03-30T23:45:18,471 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rank = 1 pid/device = 67916/cuda:1
2023-03-30T23:45:18,472 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - n_devs=2
2023-03-30T23:45:18,472 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker0, 0: 0
2023-03-30T23:45:18,472 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - worker1, 0: 1
2023-03-30T23:45:18,473 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rank = 0 pid/device = 67915/cuda:0
2023-03-30T23:45:18,859 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rpc.timeout=1800
2023-03-30T23:45:18,859 [INFO ] W-29500-bloom_1.0-stdout MODEL_LOG - rpc.timeout=1800
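
The log above shows the effect of the deviceIds setting: deviceIds: [2,3] from model-config.yaml appears as CUDA_VISIBLE_DEVICES=2,3 in the worker, so the two torchrun ranks address those GPUs as cuda:0 and cuda:1. A rough, illustrative sketch of that remapping in plain Python (not TorchServe internals):

import os

# Illustrative only: physical device ids [2, 3] from model-config.yaml become
# logical cuda:0 / cuda:1 once CUDA_VISIBLE_DEVICES is set for the worker.
device_ids = [2, 3]  # deviceIds from model-config.yaml
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)

nproc_per_node = 2  # torchrun nproc-per-node
for local_rank in range(nproc_per_node):
    # With CUDA_VISIBLE_DEVICES=2,3 each rank only sees two devices, indexed 0 and 1,
    # matching "rank = 0 ... cuda:0" / "rank = 1 ... cuda:1" in the log above.
    print(f"rank = {local_rank} -> cuda:{local_rank} (physical GPU {device_ids[local_rank]})")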

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

codecov bot commented Mar 30, 2023

Codecov Report

Merging #2209 (0cc3f6d) into master (4ca86f3) will increase coverage by 0.01%.
The diff coverage is n/a.

❗ Current head 0cc3f6d differs from pull request most recent head ddccf5d. Consider uploading reports for the commit ddccf5d to get more accurate results

@@            Coverage Diff             @@
##           master    #2209      +/-   ##
==========================================
+ Coverage   71.44%   71.46%   +0.01%     
==========================================
  Files          73       73              
  Lines        3338     3329       -9     
  Branches       57       57              
==========================================
- Hits         2385     2379       -6     
+ Misses        950      947       -3     
  Partials        3        3              
Impacted Files Coverage Δ
ts/service.py 74.35% <ø> (+3.09%) ⬆️

... and 2 files with indirect coverage changes


@lxning self-assigned this Mar 30, 2023
@lxning added this to the v0.8.0 milestone Mar 30, 2023
@lxning changed the title from "[WIP] support torchrun and optimize gpu assignment" to "support torchrun and optimize gpu assignment" Mar 30, 2023
@HamidShojanazeri (Collaborator) left a comment


thanks @lxning, added two minor comments

deviceIds: [0,1,2,3] # device index for gpu
Collaborator

@lxning are we still using it?

Collaborator Author

yes, this is a unit test


* The frontend parameters are controlled by TorchServe's frontend and specify the parameter names and default values. TorchServe now uses a priority order to determine the final value of a model's frontend parameters: the config.properties file has the lowest priority, followed by the model configuration YAML file, and finally the REST or gRPC model management API has the highest priority.

* The backend parameters are fully controlled by the user. A user's custom handler can access the backend parameters via the `model_yaml_config` property of the [context object](https://github.com/pytorch/serve/blob/master/ts/context.py#L24).

Collaborator

adding an example of how model_yaml_config can be useful, plus one showing how to access it in the handler.
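
For reference, a minimal sketch of what such a handler-side example could look like; the handler name is hypothetical and it assumes the pippy section from the model-config.yaml shown earlier in this PR:

# custom_handler.py (hypothetical example)
from ts.torch_handler.base_handler import BaseHandler


class CustomHandler(BaseHandler):
    def initialize(self, ctx):
        super().initialize(ctx)
        # Backend sections of model-config.yaml are exposed as a plain dict
        # via the context's model_yaml_config property (see the doc line above).
        yaml_config = getattr(ctx, "model_yaml_config", None) or {}
        # For the example config in this PR, pick up the pippy settings.
        pippy_config = yaml_config.get("pippy", {})
        self.rpc_timeout = pippy_config.get("rpc_timeout", 1800)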

@lxning merged commit c9fbc7c into master on Mar 31, 2023