Hey Garuda Community,
In the past two weeks, I had problems with the Garuda system update because of insufficient disk space. For a while I could not boot at all, but I finally recovered the system.
What happened
The Garuda updater failed during the upgrade phase and I could not update my system, allegedly due to insufficient disk space. Last Thursday I had 10 GiB available, but the upgrade part still failed. I restored a snapshot and rebooted, but after that I could not boot any snapshot or backup anymore: either I got a Samba-load-failure symptom, or the snapshot/backup subvolume was indeed missing the kernel files for whatever reason. Fallback snapshots got stuck as well.
I was afraid the game was over and I'd need to reinstall the system. I had invested many days in configuration and settings, and I didn't want to risk losing anything.
Then, after booting from a live USB, sudo btrfs check --force /dev/nvme0n1… indeed showed 4 GiB of "free space" and no errors. But when I tried to copy the missing kernel files to the default subvolume, from /run/media/garuda/garuda/@_backup_20231907…/boot/ to /run/media/garuda/garuda/@/boot/, I got "No space left on device". The device is new, so hardware wear is out of the question. I didn't understand what was happening.
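One possible explanation (an assumption on my side, but a known btrfs behavior): btrfs can report "No space left on device" even while data space is free, when the metadata chunks are exhausted. From the live USB, the per-chunk situation can be inspected with

  sudo btrfs filesystem usage /run/media/garuda/garuda

and if the Metadata line is nearly full while Data still has room, a filtered balance can sometimes free up allocatable space:

  sudo btrfs balance start -dusage=10 /run/media/garuda/garuda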
The btrfs filesystem contained about 7 snapshots, 5 of which were broken, and 3 backup subvolumes, 2 of which were broken.
After one clueless day of thinking about how I would migrate all configurations and settings into a fresh installation (fortunately I have a separate /home partition), I thought: what if the backup is actually in a working state, and the boot process crashes before the login screen simply because of a lack of free space on the partition?
Therefore I deleted the broken backup subvolumes and the broken snapshots, restored the good backup subvolume, and rebooted. And voilà! It worked!
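For anyone in the same situation, here is a minimal sketch of the manual equivalent from the live USB. The names after @_backup_ are placeholders, so list your subvolumes first:

  sudo btrfs subvolume list /run/media/garuda/garuda
  # delete the broken leftovers
  sudo btrfs subvolume delete /run/media/garuda/garuda/@_backup_BROKEN
  # move the broken default subvolume aside and make the good backup the new @
  sudo mv /run/media/garuda/garuda/@ /run/media/garuda/garuda/@_broken
  sudo btrfs subvolume snapshot /run/media/garuda/garuda/@_backup_GOOD /run/media/garuda/garuda/@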
Well, I still had to fix the /etc/fstab issue again, but I did not lose any relevant changes.
To make the update succeed this time, I made a manual snapshot, updated only the kernel packages with Discover, then updated some other packages. I removed unused applications and deleted the additionally created snapshots. I verified that less than 20 GiB of the 40+ GiB partition were used. Then I finally ran the Garuda updater, and this time updating 700+ packages caused no problems. In the end, it used about 33 GiB of the partition.
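For reference, the pre-update routine boils down to something like this (the snapper config name root is the default and may differ on other setups):

  sudo snapper -c root create --description "manual pre-update snapshot"
  sudo btrfs filesystem usage /   # check the real free space; plain df can mislead on btrfs
  sudo paccache -rk1              # trim the pacman package cache (from pacman-contrib)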
I will be more careful when the next big update comes.
Garuda Update
Based on my experience, I wonder what could be improved to prevent a failed system update from corrupting the system state (e.g. lost kernel files).
Disk space requirements upper bound
I had big trouble with the disk space requirements of the update. It failed every time with an insufficient-space error ("could not write"), even when enough space appeared to be available before the update.
Maybe you know a way to obtain the exact disk space requirements for the update?
But I guess this is impossible to know exactly, since the update process executes various scripts along the way, and I saw that it even builds a few packages locally, which requires an unknown amount of space on top of the download and install sizes.
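What one can get today is pacman's own estimate: start the update and abort at the confirmation prompt, which prints the totals. These numbers exclude install scripts, regenerated initramfs images, and local builds, so they are only a lower bound:

  sudo pacman -Syu
  # read "Total Download Size", "Total Installed Size" and "Net Upgrade Size",
  # then answer n to abort

Aborting after the database sync is fine as long as the next package operation is a full -Syu again, since partial upgrades against a newer database are risky.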
Using virtual disk space in RAM
The update process writes its data to disk and does not keep things in RAM. I have 32 GiB of RAM, plenty for any update I need, and the process could flush the data from RAM to disk once the update has finished, so the update itself would not run out of disk space.
Is there an option to map the working area of the update process into RAM?
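Partially, yes, at least for the packages that are built locally: makepkg has a BUILDDIR setting, and on Arch-based systems /tmp is normally a RAM-backed tmpfs, so pointing the build directory there keeps the build off the partition (assuming the build fits into RAM):

  # in /etc/makepkg.conf (or ~/.makepkg.conf)
  BUILDDIR=/tmp/makepkg

The downloaded packages and the files pacman actually installs still have to land on disk, though, so this removes only the build overhead, not the install footprint.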
Updating in multiple steps
It would be very useful if the Garuda Assistant could first update the most critical packages in isolation, instead of updating everything at once. This would remove a single point of failure and make updates of critical packages such as the kernel less risky.
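Partial upgrades are officially unsupported on Arch-based systems, so this would need careful design. Until something like that exists, one can at least verify the most critical files after an update and before rebooting. A sketch, assuming Garuda's default linux-zen kernel (file names differ for other kernels):

  ls -l /boot/vmlinuz-linux-zen /boot/initramfs-linux-zen.img
  sudo mkinitcpio -P   # regenerate the initramfs images if one is missing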
Message to the user
If the update fails due to insufficient disk space, the system can be left in a broken state (when the kernel files are missing): system software might not start anymore, and users should avoid further package operations or update attempts, which could replace old good snapshots with broken ones. Users should instead just restore the last good snapshot. A directive error message could help people do the right thing by telling them which snapshot to restore and to reboot.
Restoring the snapshot with the BTRFS Assistant also creates a backup, which in this situation is useless and just wastes more space; either that backup should not be created, or users should be told that they may delete the broken backup.
Such a conditional message could save users the trouble of destroying their good snapshots with bad ones.
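The tooling could also verify a snapshot's health before offering it for restore. With snapper's default layout, snapshots are reachable as plain directories, so a check is as simple as looking for the kernel files (the snapshot number 42 is a placeholder):

  sudo snapper -c root list
  ls /.snapshots/42/snapshot/boot/   # a good snapshot should contain vmlinuz-* and initramfs-*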
Thank you for reading.