Temporary system hangs/freezes when updating

tbg · 6 March 2021 03:32

I just remembered another change I made was to my sysctl settings.

I created:

/etc/sysctl.d/90-dirty.conf

With this file content:

vm.dirty_expire_centisecs = 500

I've no idea which if any of the many changes I made helped, but the issue went away. I probably made more changes, but I wasn't particularly meticulous about documenting all of them.
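
For anyone trying the same tweak, here is a minimal sketch of how one might check which writeback values the running kernel is actually using. The helper name `show_vm_dirty` and the particular list of knobs are my own choices for illustration, not something from the post above. Reading `/proc/sys/vm` needs no root.

```shell
# Print the dirty-page writeback tunables currently in effect.
# show_vm_dirty is a hypothetical helper name; the optional argument
# (default /proc/sys/vm) only exists so it can be pointed at another directory.
show_vm_dirty() {
    dir=${1:-/proc/sys/vm}
    for knob in dirty_expire_centisecs dirty_writeback_centisecs \
                dirty_ratio dirty_background_ratio; do
        if [ -r "$dir/$knob" ]; then
            printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
        fi
    done
}

show_vm_dirty
```

A new file under /etc/sysctl.d/ is normally picked up at boot. Running `sysctl --system` as root should load it without a reboot, after which the output above can be compared against the file.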

kremix · 6 March 2021 12:04

Disabling the option in Timeshift didn't help. On the MuQSS kernel I can compile on all 4 threads while browsing in Chrome and playing music without any problems. High I/O is what causes the freezes.

petsam · 6 March 2021 12:53

I wonder why nobody investigates the journal logs when discussing freezes.

This is fake troubleshooting, just chasing your tail in circles.
Keep on turning!!

jonathon · 6 March 2021 12:55

Feel free to step in any time.

petsam · 6 March 2021 13:05

Thanks for the invitation!

Sorry for my rudeness, but I was strongly tempted…

General guide
If your system freezes terminally, check the latest journal messages from the previous boot.
Depending on the evidence, proceed to further investigation.
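
As a rough sketch of that first step, something like the following could pull the error-level messages from the boot before the current one. The helper name `prev_boot_errors` is made up for this example. It assumes the journal is stored persistently on disk, otherwise there is no previous boot to read.

```shell
# prev_boot_errors is a hypothetical helper name.
# -b -1      : the boot before the current one
# -p <level> : only messages at this priority or worse (default: err)
# --no-pager : plain output, easy to paste into a forum post
prev_boot_errors() {
    journalctl -b -1 -p "${1:-err}" --no-pager
}
```

For example, `prev_boot_errors | tail -n 50` would show the last 50 such lines before the freeze, and `journalctl --list-boots` lists the boots the journal knows about.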

Bro · 6 March 2021 13:10

What was meant was that no subvolid is needed at all. Only subvol is needed.

I do not remember where I came across that information. I’m on ext4 on lappy.

petsam · 6 March 2021 13:46

I can read your mind for you. You saw this in the ArchWiki.

You only need one of subvol= and subvolid= to be set in the mount options/kernel parameters, if I interpret it correctly.
But when you boot into a snapshot, assuming you want to rescue a failed normal boot, you have to fix the default boot.
To do that, you have to:

Use timeshift to restore your current (snapshot) boot and immediately reboot to normal boot

If you decided to change your default/normal boot, while booted to your real normal boot, you have to:

Chroot to the intended new normal boot snapshot and (re)install grub.

But all of this is complicated and requires a good understanding of how btrfs subvolumes work when booting snapshots.

Disclaimer: I only RTFM. I have no hands-on experience with btrfs, so take the above lightly.

To the Team: Using the rescue function of grub-btrfs + Timeshift auto-snapshots makes users think they are not required to do anything else after booting into a snapshot.
Of course these steps are described in the Garuda wiki, but it seems (don’t we know?) most users don’t read useless, time-consuming bla bla. And then they face their fate, which leads them to the forum for help.
So something is missing. Either make them read, or create a mechanism to save them (as much as possible), such as a strong notification after booting into a snapshot that is not the default, or similar.

Bro · 6 March 2021 18:54

Excellent!

EDIT: As an adjunct to the above, to directly mount a specific snapshot, subvolid=nnn is needed, e.g.: mount /dev/sdb2 /mnt2 -o subvolid=262

petsam · 6 March 2021 20:17

Is that different in some way from the ArchWiki?

On boot, kernel params (mount as root).
Normal mount.
subvol uses path in FS.
To use subvolid, you need to... find it first.
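
A minimal sketch of the "find it first" step, assuming the usual `btrfs subvolume list` output where each line looks like `ID 262 gen 1234 top level 5 path some/snapshot/@`. The helper name `subvol_id` is made up for this example.

```shell
# subvol_id is a hypothetical helper name.
# Usage: subvol_id <mounted btrfs path> <subvolume path as listed>
# Field 2 of each listed line is the ID and the last field is the path.
# This simple match assumes the subvolume path contains no spaces.
subvol_id() {
    btrfs subvolume list "$1" | awk -v want="$2" '$NF == want { print $2 }'
}

# Example (needs root):
#   subvol_id / '@home'                      # look up an ID on a mounted filesystem
#   mount /dev/sdb2 /mnt2 -o subvolid=262    # then mount by ID, as in the post above
```

With subvol= the same mount would take the listed path instead of the number, which is why only one of the two is needed.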

Bro · 7 March 2021 00:15

No comment.

BrutalBirdie · 7 March 2021 21:10

Adding to that.

I just read the Kernel 5.11 notes for BTRFS.

BTRFS

(FEATURED) Switch homegrown locking used for the buffer tree implementation to a standard Linux rw_semaphore. Performance in highly contended scenarios seems to be much better in general, with tens of percent gains in some cases

(FEATURED) Performance improvements for dbench alike workloads

(FEATURED) Introduce rescue mount option rescue=ignoredatacsums, which ignores data checksums failures (these can actually happen when an application modifies a buffer in-flight when doing an O_DIRECT write)

(FEATURED) Introduce rescue mount option rescue=ignorebadroots. It attempts to make read-only mount possible when failing to read corrupted tree roots

(FEATURED) Introduce mount option rescue=all, which enables ‘ignorebadroots’ + ‘ignoredatacsums’ + ‘nologreplay’ at the same time

This might be useful for the Garuda rescue mode? If it’s not already used.
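
For reference, a hedged sketch of what using the combined option from a live/rescue environment might look like. The helper name `rescue_mount` is made up, the device and mount point are placeholders, and I am not claiming Garuda's rescue mode does this.

```shell
# rescue_mount is a hypothetical helper name.
# Usage: rescue_mount <device> <mount point>   (run as root, kernel 5.11+)
# ro         : rescue mounts are meant for read-only recovery
# rescue=all : ignorebadroots + ignoredatacsums + nologreplay, per the notes above
rescue_mount() {
    mount -t btrfs -o ro,rescue=all "$1" "$2"
}

# Example: rescue_mount /dev/sdb2 /mnt
```

The point of mounting this way would be to copy data off a damaged filesystem, not to keep running from it.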

tbg · 8 March 2021 00:28

Yes, I posted that a little while back. Very nice enhancements for btrfs. Btrfs is still a relatively new FS, but they have been really ramping up the feature set rapidly in the 5 series kernels. Definitely a positive sign of things to come for btrfs in the future.

smoky · 14 March 2021 08:50

Just my experience.
My system hung before, and a balance took ages (several hours). I tried things like disabling quota and qgroups, but nothing helped.
Then my btrfs got corrupted, I don't know from what,
so I did a fresh install.
After the first system update I got freezes again.
I set up my system the way I wanted, and once I had finished configuring and installing everything, I did this:
I reduced the snapshots to 3 (the default is 5).
I left Timeshift snapshots of the home folder unchecked (I had them enabled on my old install).
I left quota and qgroups enabled.
My pictures and videos are now only symlinked into home (before, they were stored in home).
As tbg mentioned, I ran a full balance and then restarted.
After the restart I ran another full balance and restarted again.
And what can I say, now it runs very fast, with no hangs and no freezes.
Now I always run a full balance after deleting big files, or maybe once a week depending on what I have been doing, and it's fine (maybe not a good idea for NVMe disks).
I'm happy with my system, I just don't know why it hung before.
I think maybe the snapshots were too big, because I had 5 automatic ones including the home folder, plus 2 manual ones.
Before, I also had Steam games, pictures and videos all in the home folder. Now only Steam is there, and pictures and videos are just symlinked into home.
The balance is fast now too, maybe 20 minutes.
I'm only writing down my experience here, maybe it helps someone.
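
For readers who have not run one before, a minimal sketch of the full balance described above. The helper name `full_balance` is made up for this example, and the commands need root.

```shell
# full_balance is a hypothetical helper name.
# Usage: full_balance [mount point]   (default /, run as root)
full_balance() {
    target=${1:-/}
    # Show allocated vs. used space before rewriting anything.
    btrfs filesystem usage "$target" || return 1
    # --full-balance acknowledges that every chunk will be rewritten,
    # which skips the warning countdown btrfs-progs otherwise shows.
    btrfs balance start --full-balance "$target"
}
```

While it runs, `btrfs balance status /` from another terminal reports progress. As the post says, this can take anywhere from minutes to hours depending on the drive and how many snapshots exist.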

tbg · 14 March 2021 19:27

Thank you for posting your experiences regarding your system slow downs. Hopefully with enough user feedback we can pinpoint the steps we need to take to help Garuda run more smoothly on all systems.

Perhaps running services to do regular maintenance tasks on a schedule might be the way to go. My feeling is this can't be an extremely common occurrence, or I think many more users would be reporting these slowdowns.

I've only gone through several short periods where my system was affected, but the slowdowns were drastic. In my mind there must be a commonality in the cases where these slowdowns happen after either a very large update or a snapshot restore.

My thinking, although I have no real evidence, is that this affects users whose OS drive is a small memory-based storage device (such as an SSD) more than users with a large platter drive.

I only use a small 120 GB SSD drive as my system drive and I use large 10 TB drives for storage. While the small SSD drives worked well with ext4, I'm starting to think they may be too cramped for more than 3 or so btrfs snapshots.

I also use a separate home partition from years of ext4 habit. I am now starting to think this is only exacerbating this issue as the empty space from home would help with reducing the percentage of flushed data accumulating on the root partition. I could change this manually of course, but I think I will leave changing my partition structures until my next full install. I plan to add a dedicated Ext2 boot partition to help grub save my default kernel boot choice. So, I think some structural changes may help with these slowdown issues as well. Again, just a suspicion.

I could be wrong, but I'm thinking that smaller drives, while fine as far as space goes, will have a far higher percentage of redundant flushed data when large blocks (such as snapshots) are flushed.

I have no idea if my suspicion is correct, but I'm thinking that if you are using a small drive, (or small partition possibly?) for your OS then you may need to perform system trim and balance operations far more frequently than the average user.

As I stated this is only my suspicion at this point, but I will be adjusting my maintenance schedule to see if the adjustments help prevent this in the future.
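
A rough sketch of what a lighter, more frequent maintenance pass could look like, using a trim plus a usage-filtered balance instead of a full one. The helper name `light_maintenance` and the 50 percent default are arbitrary choices for illustration, not a tested recommendation.

```shell
# light_maintenance is a hypothetical helper name.
# Usage: light_maintenance [mount point] [usage percent]   (run as root)
light_maintenance() {
    target=${1:-/}
    pct=${2:-50}
    # Discard unused blocks on the mounted filesystem.
    fstrim -v "$target" || return 1
    # Only rewrite data/metadata chunks that are at most $pct percent full,
    # which is much cheaper than a full balance.
    btrfs balance start -dusage="$pct" -musage="$pct" "$target"
}
```

For trim alone, `systemctl enable --now fstrim.timer` turns on the weekly timer shipped with util-linux. A function like the one above could be run from a similar timer if the suspicion about small drives turns out to be right.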

Edit: also in the back of my mind I have a sneaking suspicion that keeping many kernels installed also makes this situation worse. I have no proof, but as I was testing up to a half dozen different kernels at a time, that itself was probably a big factor in increasing flushed data from so many kernel updates taking place regularly. I assume most users don't run more than a couple of kernels, but I need to know good alternates to recommend to users having issues, so I usually have a bunch installed. While I have no evidence on this subject, I think keeping more than 2 kernels installed at a time could be a contributing factor with this issue.
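
To see how many kernels are actually installed, a small sketch along these lines may help. It assumes the Arch-style layout where each kernel ships a `pkgbase` file under /usr/lib/modules/<version>/. The helper name `list_kernels` is made up for this example.

```shell
# list_kernels is a hypothetical helper name.
# Assumption: an Arch-style layout where each installed kernel has a
# /usr/lib/modules/<version>/pkgbase file naming its package.
list_kernels() {
    dir=${1:-/usr/lib/modules}
    for f in "$dir"/*/pkgbase; do
        [ -r "$f" ] || continue
        # Package name, then the kernel version directory it belongs to.
        printf '%s\t%s\n' "$(cat "$f")" "$(basename "$(dirname "$f")")"
    done
}

list_kernels
```

Each output line is one installed kernel package, so more than two lines would match the situation described in the edit above.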

RodneyCK · 28 December 2021 16:50

Thanks for the info, this is happening to me now. It kept getting worse each day, my system was at a standstill.

I did the 'btrfs quota disable /' mentioned above, so we will see if it happens again during the next run.