
minicpm-v with W4A16 quantization: no noticeable change in inference speed #1906

Open
2 tasks done
DankoZhang opened this issue Jul 3, 2024 · 13 comments


DankoZhang commented Jul 3, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

After quantizing minicpm-v with W4A16, GPU memory usage is indeed lower, but inference speed barely changed. What could be causing this?
Quantization script:
lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --search-scale False \
  --work-dir $WORK_DIR

Inference script:
from lmdeploy import pipeline, ChatTemplateConfig, TurbomindEngineConfig

pipe = pipeline(path,
                chat_template_config=ChatTemplateConfig(model_name='llama3'),
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5,
                                                     model_format='awq'))

Reproduction

Environment

Error traceback

@DankoZhang (Author)

Could someone help take a look at this issue?

@DankoZhang (Author)

@lvhan028 could you please take a look?


lvhan028 commented Jul 3, 2024

Please share how you measured the speed.
BTW, lmdeploy does not optimize the vision model inside a VLM; only the language-model part is optimized.

@DankoZhang (Author)

Please share how you measured the speed. BTW, lmdeploy does not optimize the vision model inside a VLM; only the language-model part is optimized.

I measure speed by sending the same batch of data to the quantized and unquantized models and recording the total elapsed time for each.


lvhan028 commented Jul 3, 2024

Could you paste the test code directly? It's easier to discuss against the code. BTW, the vision model's default batch size is only 1; it needs to be adjusted.
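As a minimal sketch of where that setting can be raised, assuming the VisionConfig class and its max_batch_size field exported by lmdeploy (check your installed version for the exact API):

from lmdeploy import pipeline, TurbomindEngineConfig, VisionConfig

# Assumption: VisionConfig.max_batch_size controls how many images the vision
# encoder processes per forward pass; the default is 1.
pipe = pipeline(path,
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5,
                                                     model_format='awq'),
                vision_config=VisionConfig(max_batch_size=4))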

@DankoZhang (Author)

Could you paste the test code directly? It's easier to discuss against the code. BTW, the vision model's default batch size is only 1; it needs to be adjusted.
def infer_data(self, input_path, output_path):
    total_time = 0
    fw = open(output_path, "w")
    with open(input_path, "r") as fr:
        for line in tqdm(fr.readlines()):
            r = json.loads(line.strip())
            url = r["image"]
            human = r["conversations"][0]['value']
            gpt = r["conversations"][1]['value']
            # import pdb;pdb.set_trace()
            # drop the image placeholder (assumed to be the literal '<image>' tag) from the prompt
            if '<image>' in human:
                human = human.replace("<image>\n", "")
            image = load_image(url.replace("data", "data/Images"))
            # time only the pipeline call itself
            start = time.time()
            response = self.pipe((human, image))
            end = time.time()
            total_time += end - start
            # print(gpt, response)
            r["predict"] = response.text
            fw.write(json.dumps(r, ensure_ascii=False) + "\n")
            fw.flush()
    fw.close()
    print(total_time)

The code above is what I use for timing; total_time is the number I compare.

@DankoZhang (Author)

Could you paste the test code directly? It's easier to discuss against the code. BTW, the vision model's default batch size is only 1; it needs to be adjusted.

Also, what does "the vision model's default batch size is only 1" mean? Where does it take effect, and is it related to the batched inference example in the GitHub docs (code below)?
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))

image_urls = [
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)


lvhan028 commented Jul 4, 2024

lmdeploy's advantage is in the LLM inference part; the vision part is not optimized. It looks like vision processing accounts for most of the latency, which makes the W4A16 gain over FP16 hard to see.
You can turn on INFO logging; it prints timestamps that are worth inspecting closely.
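A minimal sketch of turning those logs on, assuming the log_level argument accepted by pipeline in recent lmdeploy releases (older builds may instead honor the TM_LOG_LEVEL environment variable):

from lmdeploy import pipeline, ChatTemplateConfig, TurbomindEngineConfig

pipe = pipeline(path,
                chat_template_config=ChatTemplateConfig(model_name='llama3'),
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5,
                                                     model_format='awq'),
                log_level='INFO')  # assumption: emits timestamped vision/LLM stage logs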

@DankoZhang (Author)

The vision model's default batch size is only 1; it needs to be adjusted.

What does that sentence mean, and where do I adjust it?

lvhan028 self-assigned this Jul 4, 2024

irexyc commented Jul 4, 2024

@DankoZhang

The speedup is in the LLM part; to measure it, run the test without images.

You can refer to this document: docs/zh_cn/benchmark/profile_api_server.md
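A minimal sketch of a text-only comparison along those lines, assuming the prompts are plain strings (no images) and the script is run once against the FP16 model and once against the AWQ model:

import time
from lmdeploy import pipeline, TurbomindEngineConfig

# Assumption: path points to the model directory being measured;
# drop model_format='awq' when timing the FP16 baseline.
pipe = pipeline(path,
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5,
                                                     model_format='awq'))

prompts = ["Describe the climate of a coastal city in one paragraph."] * 32

start = time.time()
responses = pipe(prompts)
elapsed = time.time() - start
# Assumption: Response exposes generate_token_len in this lmdeploy version.
gen_tokens = sum(r.generate_token_len for r in responses)
print(f"{elapsed:.2f}s total, {gen_tokens / elapsed:.1f} generated tokens/s")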

@luoyangen

@irexyc Could you help? Why does the same command fail when I quantize minicpm? I'm running the official docker image with lmdeploy==0.5.0.

File "/opt/conda/lib/python3.12/site-packages/lmdeploy/cli/entrypoint.py", line 43, in run
args.run(args)
File "/opt/conda/lib/python3.12/site-packages/lmdeploy/cli/lite.py", line 139, in auto_awq
auto_awq(**kwargs)
File "/opt/conda/lib/python3.12/site-packages/lmdeploy/lite/apis/auto_awq.py", line 148, in auto_awq
save_vl_model(vl_model, model_path, work_dir)
File "/opt/conda/lib/python3.12/site-packages/lmdeploy/lite/apis/auto_awq.py", line 45, in save_vl_model
vl_model.save_pretrained(dst_path,
File "/opt/conda/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2661, in save_pretrained
shard = {tensor: state_dict[tensor] for tensor in tensors}
^^^^^^^^^^
UnboundLocalError: cannot access local variable 'state_dict' where it is not associated with a value


irexyc commented Jul 7, 2024

@luoyangen
If I remember correctly, the Python inside the docker image is 3.8; from your traceback it looks like you installed a separate environment yourself.

I'd suggest starting a fresh container and trying again without changing anything.

@luoyangen

@irexyc Downgrading transformers to 4.40.0 fixed it.
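For reference, a minimal way to pin that version in the same environment (assuming pip is the package manager in use):

pip install transformers==4.40.0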
