
DSK busy 100% #197

Open
Thin-Troll opened this issue May 19, 2022 · 7 comments

Comments

@Thin-Troll

Looking at atop, I noticed disk utilization between 95% and 100%.

I started to analyze it: I shut down all the working projects on this dedicated server and saw the load drop to 15-20%, so I assumed the projects were the cause. But that wasn't it: the load came back and climbed to 75-85%, and atop made it clear that whenever kworker appeared, disk utilization jumped instantly.

atop screenshots:

  1. https://i.stack.imgur.com/r81Wr.png

  2. https://i.stack.imgur.com/lsd8f.png

  3. https://i.stack.imgur.com/nQ86t.png

I looked at the perf log and perf top:

https://i.stack.imgur.com/1VOxm.png
https://i.stack.imgur.com/KdXFa.png

The drives are healthy; speed test results:

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.4319 s, 2.5 GB/s

    Timing buffered disk reads: 3878 MB in 3.00 seconds = 1292.39 MB/sec
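For context: the first measurement above is typical dd output, and the second matches the output of hdparm -t. The original commands were not shown, so the following is only a sketch of a write test of the same shape (the file path is a placeholder, and the size is scaled down from the 1 GiB the quoted run wrote):

```shell
# A dd run of the shape that produces the first measurement above
# (the quoted run wrote 1 GiB, i.e. count=1024; scaled down here).
# /tmp/ddtest is a placeholder path; real benchmarks often add
# oflag=direct so the page cache doesn't inflate the number.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256
rm -f /tmp/ddtest
```

dd reports bytes copied, elapsed time, and throughput on stderr when it finishes.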

What can I do next to localize the problem that is loading the disks to 95-100%?
Debian 10, kernel 4.19.181-1

The problem looks similar to the one described in closed issue #47 on GitHub.
Can you suggest options for fixing it, and where to start?

@johannesboon

This might be a kernel issue, others have reported that changing the I/O scheduler (elevator) helps:

echo "mq-deadline" | sudo tee /sys/block/nvme*/queue/scheduler

Source: netdata/netdata#5744 (comment)

Which scheduler have you been using? See: https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers

Some other software experienced issues because certain drivers did not use unique major/minor device numbers. I'm not sure how many partitions and devices you have, or whether that plays any role here. See: netdata/netdata#10841

@Thin-Troll

Thin-Troll commented May 20, 2022


cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

cat /sys/block/nvme1n1/queue/scheduler
[none] mq-deadline

I can change it, but I don't know what consequences to expect. How safe is this to do on a production server?
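(Aside on reading that output: the bracketed entry in the cat output above is the currently active scheduler, so both drives are on "none". A small helper to extract it, purely illustrative and not part of the thread:)

```shell
# Extract the active (bracketed) scheduler from a line in the format
# shown above, e.g. "[none] mq-deadline" -> "none".
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Prints nothing (rather than failing) if the device doesn't exist.
active_sched "$(cat /sys/block/nvme0n1/queue/scheduler 2>/dev/null)"
```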

@Thin-Troll

Also, I don't quite understand how to check for major/minor device number errors.
If nvme0n1 is listed above nvme1n1, is that a problem?

(screenshot attached)

@johannesboon

I can change it, but I don't know what consequences to expect. How safe is this to do on a production server?

I'm no expert on this, but I've never had problems changing it in production. It is designed to be safe to change without rebooting or unmounting, so as far as I know it should only affect performance, not cause any data corruption. See also: https://www.kernel.org/doc/html/latest/block/switching-sched.html

Of course, I don't know what the impact on your server/service would be if performance were to degrade.

Also, I don't quite understand how to check for major/minor device number errors.

Check whether the output of lsblk shows unique major:minor numbers.

If nvme0n1 is listed above nvme1n1, is that a problem?

Not that I know of.
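To make that uniqueness check concrete, here is one way to list the numbers and flag duplicates (my own sketch, not from the thread):

```shell
# Print NAME and MAJ:MIN for every block device, then flag any MAJ:MIN
# pair that appears more than once. The second command printing nothing
# means all device numbers are unique. Guarded in case lsblk is absent.
if command -v lsblk >/dev/null; then
    lsblk -o NAME,MAJ:MIN
    lsblk -rno MAJ:MIN | sort | uniq -d
fi
```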

@Thin-Troll

Thin-Troll commented May 20, 2022

Thank you

I will try and let you know

@Thin-Troll


Changing the scheduler really helped, but I didn't stop there.
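Worth noting for anyone landing here: echo | tee only changes the scheduler until the next reboot. One common way to persist it is a udev rule; the filename and match pattern below are my own suggestion, not something from this thread:

```shell
# Hypothetical /etc/udev/rules.d/60-iosched.rules making mq-deadline
# the default for all NVMe namespaces at boot:
sudo tee /etc/udev/rules.d/60-iosched.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="mq-deadline"
EOF
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block
```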

@xpufx

xpufx commented Aug 15, 2022

FYI, this is happening to me on a Proxmox VM where the underlying physical disks on the hypervisor are NVMe. Changing to the mq-deadline scheduler in the VM seems to get rid of the incorrect busy display. (The VM is running Debian Buster.)
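For checking this inside a VM, where the disks may show up as vda or sda rather than nvme*, a quick loop over all block devices works (again just a sketch):

```shell
# Print each block device and its scheduler line; the bracketed entry
# is the active scheduler. Skips cleanly if a path isn't readable.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%/queue/scheduler}
    printf '%-10s %s\n' "$dev" "$(cat "$f")"
done
```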
