RuntimeError: CUDA error: an illegal memory access was encountered #82

Open
flaviadeutsch opened this issue May 29, 2023 · 6 comments

Comments

@flaviadeutsch

QLoRA with LLaMA 13B:

  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
    torch.cuda.synchronize()
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/cuda/__init__.py", line 688, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
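
The error message's own suggestion can be applied from inside the training script. This is a minimal sketch, assuming the variable is set before anything initializes CUDA (in practice, before `import torch` runs). The helper name `enable_blocking_launches` is made up for illustration and is not part of qlora or PyTorch:

```python
import os


def enable_blocking_launches(env=os.environ):
    """Request synchronous kernel launches so the traceback points at the failing call.

    Intended to run before CUDA is initialized; a value set later is
    generally not picked up by the already-running CUDA context.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_blocking_launches()
# import torch  # import torch (and start training) only after the variable is set
```

Prefixing the training command with `CUDA_LAUNCH_BLOCKING=1` in the shell should have the same effect. Training runs slower in this mode, so it is only meant for locating the failing kernel.
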
@KKcorps
Contributor

KKcorps commented May 29, 2023

I got this as well today

@lmc8133

lmc8133 commented May 30, 2023

What's your torch version? I used torch 2.0 at first and got the same problem; after I downgraded to 1.13.1 it worked well. Hope this helps.

@artidoro
Owner

artidoro commented Jun 1, 2023

I also got this when using decapoda-research/llama-7b-hf. With another HF conversion (a more recent one, I think) I did not hit the problem, so I recommend using a newer conversion if possible.

It looks like it can also be fixed by downgrading torch, but I haven't verified that.

@flaviadeutsch
Author

Which HF conversion, please?

@apachemycat

Same error with the latest (nightly) torch 2.0. `--per_device_train_batch_size 2 --gradient_accumulation_steps 1` runs fine, but with `--per_device_train_batch_size 3` an illegal memory access is encountered.
#82
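
Since a per-device batch of 2 runs here and 3 crashes, one possible workaround (not a fix for the underlying error) is to keep the per-device batch small and raise `--gradient_accumulation_steps` instead. The sketch below assumes the usual convention that samples per optimizer step = per-device batch * accumulation steps * number of devices. The function names and the target of 6 are hypothetical, chosen only for illustration:

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def accumulation_options(target, max_per_device, num_devices=1):
    """List (per_device_batch, accumulation_steps) pairs that hit `target` exactly."""
    options = []
    for per_device in range(1, max_per_device + 1):
        per_step = per_device * num_devices
        if target % per_step == 0:
            options.append((per_device, target // per_step))
    return options


# Example: reach 6 samples per optimizer step without exceeding a per-device batch of 2.
print(accumulation_options(target=6, max_per_device=2))
# [(1, 6), (2, 3)]
```

Each pair maps onto `--per_device_train_batch_size` and `--gradient_accumulation_steps` respectively.
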

@apachemycat

╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
1%|▋ | 62/10000 [05:00<13:21:35, 4.84s/it]
root@8d5798ac79ee:/wzh/qlora# pip list|grep torch
pytorch-triton 2.1.0+440fd1bf20
torch 1.13.1
torchaudio 2.1.0.dev20230622+cu121
torchsparseattn 0.2
torchvision 0.16.0.dev20230622+cu121

The error also occurs with torch 1.13.1.
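
Note that the listing above pairs torch 1.13.1 with 20230622 dev builds of torchaudio and torchvision for cu121, so this environment mixes releases. Whether that matters for this error is not established in this thread. To compare setups more easily, here is a small sketch that prints the relevant versions in one place. The package list is taken from the `pip list` output and the traceback above, and the helper name `installed_versions` is hypothetical:

```python
from importlib import metadata

# Names as they appear in the pip listing and traceback above.
PACKAGES = ["torch", "torchvision", "torchaudio", "pytorch-triton", "bitsandbytes"]


def installed_versions(names=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:16s} {version or 'not installed'}")
```

Posting this output along with the GPU model makes it easier to tell which combinations reproduce the crash.
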
