(error in btrfs_finish_ordered_io: 2829: errno=-5 IO failure). Can this be fixed or do I have a serious drive problem?

ThePoorPilot · 23 April 2021 21:33

A few months ago I had to reinstall Garuda due to a bunch of drive errors.

Yesterday, I ran a bad block test (read-only) and it reported 0 bad blocks.

Every so often, when I leave my computer on overnight, I will pull it up in the morning and there will be error messages that several system critical files such as ".bashrc" are unwritable.

After rebooting, this issue tends to correct itself. Today, I got back to my computer and saw those messages, but rebooting actually made things worse.

Now (probably because the block test touched the same part of the drive) I cannot boot to the GUI and get this error message in console mode:

Garuda Linux 5.11.16-zen1-1-zen (tty1)

mmpc login:
[ 12.759238] BTRFS: error (device nvme0n1p1) in btrfs_finish_ordered_io: 2829: errno=-5 IO failure
[ 12.759242] BTRFS: error (device nvme0n1p1) in btrfs_finish_ordered_io: 2829: errno=-5 IO failure

Converted to text using google lens
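For anyone trying to read a line like this, here is a minimal sketch that splits it into its fields. The helper name `parse_btrfs_error` and the regular expression are hypothetical; they are fitted only to the message quoted in this post, not taken from any btrfs tool.

```python
import errno
import os
import re

# Hypothetical pattern, fitted to the log line quoted in this thread.
BTRFS_ERROR = re.compile(
    r"BTRFS: error \(device (?P<device>\S+)\) in "
    r"(?P<function>\w+):\s*(?P<line>\d+): errno=(?P<errno>-?\d+) (?P<text>.+)"
)


def parse_btrfs_error(log_line):
    """Return the fields of a BTRFS error line as a dict, or None if it does not match."""
    match = BTRFS_ERROR.search(log_line)
    if match is None:
        return None
    code = abs(int(match.group("errno")))  # the kernel prints errno values as negatives
    return {
        "device": match.group("device"),
        "function": match.group("function"),
        "source_line": int(match.group("line")),
        "errno_name": errno.errorcode.get(code, "unknown"),
        "errno_text": os.strerror(code),
        "kernel_text": match.group("text").strip(),
    }


sample = (
    "BTRFS: error (device nvme0n1p1) in btrfs_finish_ordered_io: 2829: "
    "errno=-5 IO failure"
)
info = parse_btrfs_error(sample)
print(info)
```

errno 5 is `EIO`, the generic input/output error, so the message on its own does not say whether the drive, the RAM, or the filesystem is at fault. As far as I know, the number after the function name is the line in the kernel source where the error was raised, so it can differ between kernel versions.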

I'm really not knowledgeable on BTRFS and can't find much on this, does anyone have suggestions or is my drive toast?

Extra info:
I cannot log in to my main account since it runs fish.
I can log in as root with bash. Fish causes some bus error.
Pacman appears to have db lock, even though there is no reason for it to.

*Now, it just freezes at the log in screen in console mode

Edit:
btrfsck (run from my old Manjaro drive) shows some errors in a few blocks: PrivateBin

jonathon · 24 April 2021 00:28

This has come up a few times over the past month (e.g. [Solved - kinda] Can't log in : BTRFS: error in btrfs_run_delayed_refs:2124: errno:-5 IO failure, BTRFS error : system boot in emergency mode).

A quick web search (Garuda searxng) leads to this thread on the Arch BBS:

https://bbs.archlinux.org/viewtopic.php?pid=1772672#p1772672

and that might provide a starting point.

In any event, make sure you have a backup of any important data.

ThePoorPilot · 24 April 2021 00:53

Thanks for getting me in the loop.

I did take a look at that archlinux forum page, but I wasn't sure since the issue dealt with "btrfs_free_extent" instead of "btrfs_finish_ordered_io"

I don't know if that initial part means anything.

It still might mean something, but I guess I'll test things out now that I've got a backup.

I don't know too much about btrfs/drives, but I know it is very easy to make your problems worse. As such, I was afraid to do anything without asking first, given how fragile a state my drive is in.

I took a good backup of my home directory, so I guess I'll give btrfsck --init-extent-tree or maybe even btrfsck --repair a shot.

The thing that's different between me and the previous posts is that I have already had the same error and had to reinstall. Once the drive has been filled up to a certain block, it has problems. Perhaps this is because things didn't re-initialize or perhaps it is a hardware issue with the drive.

Every other user just bought a new drive or reinstalled, but I want to get to the bottom of this (if possible) since it is a repeating issue.

To test if it is a hardware issue, I can install an ext4 OS, fill the drive, and see if I run into similar issues.

If I can discern a hardware problem, perhaps I could try to get it replaced by Samsung under their "5 year limited-warranty"

jonathon · 24 April 2021 00:55

That kind of sounds like a bad block on the disk - although it’s an NVMe, so…

tbg · 24 April 2021 01:12

Bit of a long shot, but the errors could possibly be a result of faulty RAM rather than the drive itself. Perhaps run the long RAM test to be sure it is not the RAM causing the errors.

ThePoorPilot · 24 April 2021 01:40

I'll run a bunch of tests overnight. I am currently running badblocks and it's about 45% done.

After that, I'll run memtest from Grub

A few months ago, I did upgrade from 16 GB to 32 GB by adding two extra sticks, so it could be a relevant factor. I've also had a decent share of segmentation faults, stuttering, and BSODs on Windows over the years, so my CPU or MOBO could have issues too.

For about a year straight, I ran my PC all day as a crypto mining machine. Perhaps I am now just seeing the effects of that strain on the hardware (although the NVMe drive is very new). Even if that's the case, I made more than enough off the crypto to buy some new parts if necessary.

tbg · 24 April 2021 01:53

Strange as it may sound I have also seen I/O errors caused by specific kernels and schedulers they use. It wouldn't hurt to test some different kernels and schedulers.
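To see which scheduler a drive is using, the kernel exposes it in sysfs, with the active one shown in square brackets. A small sketch follows, assuming the device is `nvme0n1` as in the error message; the function names are made up for illustration.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split a sysfs scheduler line into (active scheduler, all available schedulers)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the one in use
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="nvme0n1"):
    """Read the scheduler line for a block device (device name is an assumption)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


# Example line in the format the kernel uses; not taken from the affected machine.
print(parse_scheduler_line("[none] mq-deadline kyber bfq"))
```

Switching is done by writing one of the listed names to the same file as root, and the change does not persist across reboots.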

ThePoorPilot · 24 April 2021 21:38

Here's a few things I ended up testing

linux-lts kernel didn't do much and I couldn't test a different scheduler since I couldn't even log in
Couldn't get anything productive done using chroot.

btrfsck --init-extent-tree did not run at all and btrfsck --repair seemed to make things worse. It got itself into loops deleting and repairing the same things over and over again.

My running theory is that part of the partition table or btrfs partition was corrupted due to unexpected loss of power some months ago. When I reinstalled, it reused the same partition and did not completely re-initialize it since it was going from btrfs to btrfs.

In fact, I recall the installation had a lot of problems for the first few days, they then stopped, and I forgot and was running on borrowed time.

I guess I am going to go ahead and attempt a reinstall. This time, I think I am going to try using GPT as opposed to MSDOS for the partition table (assuming that's supported).

Overall, none of the hardware tests I ran showed problems; only the btrfsck runs showed corruption. This makes me hope that my drive just needs to be completely re-partitioned to be fixed.

ThePoorPilot · 1 May 2021 02:13

Update on this, I just rebooted, and this issue is occurring once again.
A week after completely reinstalling, I am receiving basically the same message:

Garuda Linux 5.10.33-1-lts (tty1)

mmpc login:

[18.2650241] BTRFS: error (device nvme0n1p2) in btrfs_finish_ordered_io:2736: errno=-5 IO failure
[18.265891] BTRFS: error (device nvme0n1p2) in btrfs_finish_ordered_io:2736: errno=-5 IO failure

At this point, I am thinking this means my drive has a serious hardware issue.
What really puzzles me, however, is how none of the hardware tests I ran on the drive showed any problems.

I guess one last idea would be to try an ext4 OS on it. Perhaps it is an inherent BTRFS issue, or there is some other aspect of Garuda Linux causing a problem on the NVMe drive.

If I did put an ext4 OS on the drive, I am trying to think of a way to stress test the drive and cause potential errors to occur. Let me know if you've got any ideas!

I'd prefer not to wait a week or two for something to happen, as that would drastically slow down my ability to test the drive and ask for a warranty replacement.
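One way to force the issue rather than wait is a write-then-verify pass: fill part of the drive with random data, record a checksum for each file, and read everything back. The sketch below is only an illustration under that assumption; the function names, file names, and sizes are made up, and the demo runs in a temporary directory rather than on the suspect drive.

```python
import hashlib
import os
import tempfile


def write_test_files(directory, count=4, size=1 << 20):
    """Write `count` files of random bytes and return {path: sha256 hex digest}."""
    expected = {}
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(directory, f"stress_{i:04d}.bin")
        with open(path, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # ask the kernel to flush the data to the device
        expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def verify_test_files(expected):
    """Re-read every file; return (path, reason) for read errors and hash mismatches."""
    failures = []
    for path, digest in expected.items():
        try:
            with open(path, "rb") as handle:
                actual = hashlib.sha256(handle.read()).hexdigest()
        except OSError as exc:
            failures.append((path, f"read error: {exc}"))
            continue
        if actual != digest:
            failures.append((path, "checksum mismatch"))
    return failures


# Small demo in a throwaway directory; point `directory` at the drive for a real run.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_hashes = write_test_files(demo_dir, count=2, size=4096)
    demo_failures = verify_test_files(demo_hashes)
print(demo_failures)
```

One caveat: a read that follows the write immediately may be served from the page cache instead of the drive, so for a meaningful test the data set should be larger than RAM, or the verify step should run after a reboot or remount.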

Edit: When I tried to mount the drive in Manjaro, it spit out "can't read superblock." Using a few btrfs rescue commands, I got the drive to be mountable. I've got a little bit more testing to do; maybe I can get something to tell me there is a hardware error just to confirm it. Otherwise, I'll go the ext4 testing route.

tbg · 1 May 2021 02:43

Run a script on the drive to check for illegal filenames. I have encountered I/O errors on ext4 drives in the past that passed all the hardware tests. However, the drive had a lot of files on it with illegal file names and I suspected that was what was causing the problems.

Double check for files that are violating any naming conventions, and correct any you find with bad file names.
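A rough sketch of such a check is below. Linux itself only forbids `/` and the NUL byte in file names, so what counts as "illegal" here is an assumption: names with control characters, bytes that do not decode in the filesystem encoding, characters reserved on Windows, or leading/trailing whitespace. The function names are hypothetical.

```python
import os

# Assumption: these are legal on Linux but rejected by Windows filesystems.
WINDOWS_RESERVED = set('<>:"\\|?*')


def name_problems(name):
    """Return a list of reasons why a single file or directory name looks suspicious."""
    problems = []
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in name):
        problems.append("control character")
    # Python maps undecodable bytes in file names to lone surrogates U+DC80..U+DCFF.
    if any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name):
        problems.append("not decodable in the filesystem encoding")
    if WINDOWS_RESERVED & set(name):
        problems.append("character reserved on Windows")
    if name != name.strip():
        problems.append("leading or trailing whitespace")
    return problems


def scan(root):
    """Yield (path, problems) for every suspicious name below `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            problems = name_problems(name)
            if problems:
                yield os.path.join(dirpath, name), problems


print(name_problems("notes+backup.txt"))
print(name_problems("bad\nname"))
```

Calling `list(scan("/path/to/mount"))` (placeholder path) would list the flagged entries. Characters such as `+` are not flagged, since they are ordinary on Linux filesystems.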

linuzo · 1 May 2021 03:18

What brand of nvme??

Some NVMe drives do have bad blocks to begin with. People who run ext4 will never notice them, but when they move to a filesystem like btrfs, it will trigger on them. I don't think you should worry, though. I'd run a full test on it and see how many bad blocks there are; if it's really bad, reach out for a warranty replacement.

ThePoorPilot · 1 May 2021 03:36

It's a Samsung 970 EVO 500 GB

It's only 6-8 months old. For the first month I ran ext4 with Manjaro; I've been using Garuda since November last year.

I'll do more checks very soon! Thanks for the tips

ThePoorPilot · 1 May 2021 16:27

I ran a find command from here:

I'm not sure if that is exactly what you are thinking of, but it's the best I could find. Let me know if there is something more specific you suggest.

Anyways, the command caused these errors in the terminal:

find: ‘./@home/michael/.local/share/kwalletd’: Input/output error
find: ‘./@home/michael/.local/share/sddm’: Input/output error
find: ‘./@home/michael/.local/share/fish’: Input/output error
find: ‘./@home/michael/Videos’: Input/output error
find: ‘./@home/michael/.cache’: Input/output error

I believe this shows the places where there are problems.
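The same list of unreadable directories can be collected programmatically. This is a minimal sketch using the `onerror` hook of `os.walk`; the function name is made up, and the demo uses a missing directory as a stand-in, because a real `EIO` cannot be reproduced on a healthy disk.

```python
import errno
import os
import tempfile


def find_unreadable(root):
    """Walk `root` and return (path, errno name) for directories that cannot be listed."""
    failures = []

    def record(exc):
        # os.walk hands the OSError from the failed directory listing to this callback.
        failures.append((exc.filename, errno.errorcode.get(exc.errno, str(exc.errno))))

    for _ in os.walk(root, onerror=record):
        pass
    return failures


# Stand-in demo: a directory that does not exist reports ENOENT instead of EIO.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "missing")
    demo = find_unreadable(missing)
print(demo)
```

On the damaged filesystem, the entries for kwalletd, sddm, fish, Videos, and .cache would be expected to show up with `EIO`. This only catches directories that cannot be listed; files whose contents are unreadable would need to be opened and read to be detected.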
I am also looking at the output of the command for illegal characters in file names. It looks like it may not be the best command, as I am fairly sure + is a permissible character.
https://bin.garudalinux.org/?0d37486fb9d48a7d#AYefLGrwW7Ps8WuWRZPaHhk2ABnX263M3B6kxKxThf9V

ThePoorPilot · 1 May 2021 18:39

I still haven't gotten any hardware errors anywhere, so I really am confused.

The only other thing I can think of is that it looks like the drive runs pretty hot (since it is close to two GPUs). As you can see below, it reads 81C after only being used to copy a few files.

The S.M.A.R.T statistics also point out quite a few "unsafe shutdowns."

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 81 C
available_spare				: 100%
available_spare_threshold		: 10%
percentage_used				: 1%
endurance group critical warning summary: 0
data_units_read				: 15,280,007
data_units_written			: 29,182,021
host_read_commands			: 100,066,639
host_write_commands			: 426,400,585
controller_busy_time			: 752
power_cycles				: 224
power_on_hours				: 895
unsafe_shutdowns			: 177
media_errors				: 0
num_err_log_entries			: 245
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Temperature Sensor 1           : 81 C
Temperature Sensor 2           : 49 C
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
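
To put a few of these counters in proportion, here is a small sketch that parses the `key : value` lines and derives two figures. The parser is hypothetical and only handles the format pasted above. It assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes.

```python
SAMPLE = """\
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
temperature : 81 C
available_spare : 100%
data_units_written : 29,182,021
power_cycles : 224
unsafe_shutdowns : 177
media_errors : 0
num_err_log_entries : 245
"""


def parse_smart_log(text):
    """Turn `key : value` lines into a dict, keeping only values that start with a number."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        parts = value.split()
        if not sep or not parts:
            continue
        token = parts[0].replace(",", "").rstrip("%")
        if token.isdigit():
            fields[key.strip()] = int(token)
    return fields


log = parse_smart_log(SAMPLE)

# Share of power cycles that ended without a clean shutdown.
unsafe_ratio = log["unsafe_shutdowns"] / log["power_cycles"]

# One NVMe data unit is 1000 * 512 bytes = 512,000 bytes.
written_tb = log["data_units_written"] * 512_000 / 1e12

print(f"unsafe shutdowns: {unsafe_ratio:.0%} of power cycles")
print(f"data written: {written_tb:.2f} TB")
```

With the numbers above, 177 of 224 power cycles (about 79%) ended in an unsafe shutdown, and roughly 14.94 TB have been written. `media_errors` is 0 even though `num_err_log_entries` is 245, so the controller has logged errors but does not report any unrecovered media errors.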