Hey Garuda Community,
In the past two weeks, I had problems with the Garuda system update because of insufficient disk space. For a while I could not boot at all, but I finally recovered the system.
What happened
The Garuda updater failed during the upgrade phase and I could not update my system, allegedly due to insufficient disk space. Last Thursday I had 10 GiB available, but the upgrade part still failed. I restored a snapshot and rebooted, but after that I could not boot any snapshot or backup anymore: either I got a Samba-load-failure symptom, or the snapshot/backup subvolume was indeed missing the kernel files for whatever reason. Fallback snapshots got stuck as well.
I was afraid the game was over and I'd need to reinstall the system. I had invested many days in configuration and settings, and I didn't want to risk losing anything.
Then, after booting from a live USB, sudo btrfs check --force /dev/nvme0n1… indeed showed 4 GiB of "free space" and no errors. But when I tried to copy the missing kernel files to the default subvolume, from /run/media/garuda/garuda/@_backup_20231907…/boot/ to /run/media/garuda/garuda/@/boot/, I got "No space left on device". The device is new, so hardware wear is out of the question. I didn't understand what was happening.
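One possible explanation (an assumption on my side, but a known btrfs behavior): btrfs can report "No space left on device" even while data space is free, when the metadata chunks are exhausted. From the live USB, the per-chunk situation can be inspected with

  sudo btrfs filesystem usage /run/media/garuda/garuda

and if the Metadata line is nearly full while Data still has room, a filtered balance can sometimes free up allocatable space:

  sudo btrfs balance start -dusage=10 /run/media/garuda/garuda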
The btrfs filesystem contained about 7 snapshots, 5 of which were broken, and 3 backup subvolumes, 2 of which were broken.
After one clueless day of thinking about how I would migrate all configurations and settings into a fresh installation (fortunately I have a separate /home partition), I thought: what if the backup is actually in a working state, and the boot process crashes before the login screen simply because of a lack of free space on the partition?
Therefore I deleted the broken backup subvolumes and the broken snapshots, restored the good backup subvolume, and rebooted. And voilà! It worked!
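For anyone in the same situation, here is a minimal sketch of the manual equivalent from the live USB. The names after @_backup_ are placeholders, so list your subvolumes first:

  sudo btrfs subvolume list /run/media/garuda/garuda
  # delete the broken leftovers
  sudo btrfs subvolume delete /run/media/garuda/garuda/@_backup_BROKEN
  # move the broken default subvolume aside and make the good backup the new @
  sudo mv /run/media/garuda/garuda/@ /run/media/garuda/garuda/@_broken
  sudo btrfs subvolume snapshot /run/media/garuda/garuda/@_backup_GOOD /run/media/garuda/garuda/@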
Well, I still had to fix the /etc/fstab issue again, but I did not lose any relevant changes.
To make the update succeed this time, I made a manual snapshot, updated only the kernel packages with Discover, then updated some other packages. I removed unused applications and deleted the additionally created snapshots. I verified that less than 20 GiB of the 40+ GiB partition were used. Then I finally ran the Garuda updater, and this time updating 700+ packages caused no problems. In the end, it used about 33 GiB of the partition.
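For reference, the pre-update routine boils down to something like this (the snapper config name root is the default and may differ on other setups):

  sudo snapper -c root create --description "manual pre-update snapshot"
  sudo btrfs filesystem usage /   # check the real free space; plain df can mislead on btrfs
  sudo paccache -rk1              # trim the pacman package cache (from pacman-contrib)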
I will be more careful when the next big update comes.
Garuda Update
Based on my experience, I wonder what could be improved to prevent a failed system update from corrupting the system state (e.g. lost kernel files).
Disk space requirements upper bound
I had big trouble with the disk space requirements of the update. It failed every time with an insufficient-space error ("could not write"), even when enough space appeared to be available before the update.
Maybe you know a way to obtain the exact disk space requirements for the update?
But I guess this is impossible to know exactly, since the update process executes various scripts along the way, and I saw that it even builds a few packages locally, which requires an unknown amount of space on top of the download and install sizes.
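What one can get today is pacman's own estimate: start the update and abort at the confirmation prompt, which prints the totals. These numbers exclude install scripts, regenerated initramfs images, and local builds, so they are only a lower bound:

  sudo pacman -Syu
  # read "Total Download Size", "Total Installed Size" and "Net Upgrade Size",
  # then answer n to abort

Aborting after the database sync is fine as long as the next package operation is a full -Syu again, since partial upgrades against a newer database are risky.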
Using virtual disk space in RAM
The update process writes its data to disk and does not keep things in RAM. I have 32 GiB of RAM, plenty for any update I need, and the process could flush the data from RAM to disk once the update has finished, so the update itself would not run out of disk space.
Is there an option to map the working area of the update process into RAM?
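Partially, yes, at least for the packages that are built locally: makepkg has a BUILDDIR setting, and on Arch-based systems /tmp is normally a RAM-backed tmpfs, so pointing the build directory there keeps the build off the partition (assuming the build fits into RAM):

  # in /etc/makepkg.conf (or ~/.makepkg.conf)
  BUILDDIR=/tmp/makepkg

The downloaded packages and the files pacman actually installs still have to land on disk, though, so this removes only the build overhead, not the install footprint.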
Updating in multiple steps
It would be very useful if the Garuda Assistant could first update the most critical packages in isolation, instead of updating everything at once. This would remove a single point of failure and make updates of critical packages such as the kernel less risky.
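Partial upgrades are officially unsupported on Arch-based systems, so this would need careful design. Until something like that exists, one can at least verify the most critical files after an update and before rebooting. A sketch, assuming Garuda's default linux-zen kernel (file names differ for other kernels):

  ls -l /boot/vmlinuz-linux-zen /boot/initramfs-linux-zen.img
  sudo mkinitcpio -P   # regenerate the initramfs images if one is missing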
Message to the user
If the update fails due to insufficient disk space, the system can be left in a broken state (when the kernel files are missing): system software might not start anymore, and users should avoid further package operations or update attempts, which could replace old good snapshots with broken ones. Users should instead just restore the last good snapshot. A directive error message could help people do the right thing by telling them which snapshot to restore and to reboot.
Restoring the snapshot with the BTRFS Assistant also creates a backup, which in this situation is useless and just wastes more space; either that backup should not be created, or users should be told that they may delete the broken backup.
Such a conditional message could save users the trouble of destroying their good snapshots with bad ones.
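The tooling could also verify a snapshot's health before offering it for restore. With snapper's default layout, snapshots are reachable as plain directories, so a check is as simple as looking for the kernel files (the snapshot number 42 is a placeholder):

  sudo snapper -c root list
  ls /.snapshots/42/snapshot/boot/   # a good snapshot should contain vmlinuz-* and initramfs-*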
Thank you for reading.