
DSK busy 100% #197

Open
Thin-Troll opened this issue May 19, 2022 · 7 comments

Comments

@Thin-Troll

Looking at atop, I noticed disk utilization between 95% and 100%.

I started to analyze it: I shut down all the working projects on this dedicated server and saw the load drop to 15-20%, so I assumed the projects were the cause. But that wasn't it: the load came back and climbed to 75-85%, and atop made it clear that whenever kworker appeared, disk utilization jumped instantly.

atop screenshots:

  1. https://i.stack.imgur.com/r81Wr.png

  2. https://i.stack.imgur.com/lsd8f.png

  3. https://i.stack.imgur.com/nQ86t.png

I looked at the perf log and perf top:

https://i.stack.imgur.com/1VOxm.png
https://i.stack.imgur.com/KdXFa.png

The drives are healthy; speed test results:

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.4319 s, 2.5 GB/s

    Timing buffered disk reads: 3878 MB in 3.00 seconds = 1292.39 MB/sec
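For context: the first measurement above is typical dd output, and the second matches the output of hdparm -t. The original commands were not shown, so the following is only a sketch of a write test of the same shape (the file path is a placeholder, and the size is scaled down from the 1 GiB the quoted run wrote):

```shell
# A dd run of the shape that produces the first measurement above
# (the quoted run wrote 1 GiB, i.e. count=1024; scaled down here).
# /tmp/ddtest is a placeholder path; real benchmarks often add
# oflag=direct so the page cache doesn't inflate the number.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256
rm -f /tmp/ddtest
```

dd reports bytes copied, elapsed time, and throughput on stderr when it finishes.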

What can I do next to localize the problem that is loading the disks to 95-100%?
Debian 10, kernel 4.19.181-1

The problem looks similar to the one described in closed issue #47 on GitHub.
Can you suggest options for fixing it, and where to start?

@johannesboon

This might be a kernel issue, others have reported that changing the I/O scheduler (elevator) helps:

echo "mq-deadline" | sudo tee /sys/block/nvme*/queue/scheduler

Source: netdata/netdata#5744 (comment)

Which scheduler have you been using? See: https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers

Some other software experienced issues because certain drivers did not use unique major/minor device numbers. I'm not sure how many partitions and devices you have, or whether that plays any role here. See: netdata/netdata#10841

@Thin-Troll

Thin-Troll commented May 20, 2022


cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

cat /sys/block/nvme1n1/queue/scheduler
[none] mq-deadline

I can change it, but I don't know what consequences to expect. How safe is this to do on a production server?
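(Aside on reading that output: the bracketed entry in the cat output above is the currently active scheduler, so both drives are on "none". A small helper to extract it, purely illustrative and not part of the thread:)

```shell
# Extract the active (bracketed) scheduler from a line in the format
# shown above, e.g. "[none] mq-deadline" -> "none".
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Prints nothing (rather than failing) if the device doesn't exist.
active_sched "$(cat /sys/block/nvme0n1/queue/scheduler 2>/dev/null)"
```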

@Thin-Troll

Also, I don't quite understand how to check for major/minor device number errors.
If nvme0n1 is listed above nvme1n1, is that a problem?

(screenshot attached)

@johannesboon

I can change it, but I don't know what consequences to expect. How safe is this to do on a production server?

I'm no expert on this, but I've never had problems changing it in production. It is designed to be safe to change without rebooting or unmounting, so as far as I know it should only affect performance, not cause any data corruption. See also: https://www.kernel.org/doc/html/latest/block/switching-sched.html

Of course, I don't know what the impact on your server/service would be if performance were to degrade.

Also, I don't quite understand how to check for major/minor device number errors.

Check whether the output of lsblk shows unique major:minor numbers.

If nvme0n1 is listed above nvme1n1, is that a problem?

Not that I know of.
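To make that uniqueness check concrete, here is one way to list the numbers and flag duplicates (my own sketch, not from the thread):

```shell
# Print NAME and MAJ:MIN for every block device, then flag any MAJ:MIN
# pair that appears more than once. The second command printing nothing
# means all device numbers are unique. Guarded in case lsblk is absent.
if command -v lsblk >/dev/null; then
    lsblk -o NAME,MAJ:MIN
    lsblk -rno MAJ:MIN | sort | uniq -d
fi
```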

@Thin-Troll

Thin-Troll commented May 20, 2022

Thank you

I will try and let you know

@Thin-Troll


Changing the scheduler really helped, but I didn't stop there.
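Worth noting for anyone landing here: echo | tee only changes the scheduler until the next reboot. One common way to persist it is a udev rule; the filename and match pattern below are my own suggestion, not something from this thread:

```shell
# Hypothetical /etc/udev/rules.d/60-iosched.rules making mq-deadline
# the default for all NVMe namespaces at boot:
sudo tee /etc/udev/rules.d/60-iosched.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="mq-deadline"
EOF
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block
```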

@xpufx

xpufx commented Aug 15, 2022

FYI, this is happening to me on a Proxmox VM where the underlying physical disks on the hypervisor are NVMe. Changing to the mq-deadline scheduler in the VM seems to get rid of the incorrect busy display. (The VM is running Debian Buster.)
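For checking this inside a VM, where the disks may show up as vda or sda rather than nvme*, a quick loop over all block devices works (again just a sketch):

```shell
# Print each block device and its scheduler line; the bracketed entry
# is the active scheduler. Skips cleanly if a path isn't readable.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%/queue/scheduler}
    printf '%-10s %s\n' "$dev" "$(cat "$f")"
done
```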
