Corrupt BTRFS filesystem, possible reason VirtualBox with NFS share?

clownshow · 4 January 2023 08:34

Hello,

I am posting here because this happened on TWO independent computers, both running Garuda Linux, within the span of two days and during the same procedure.

I have the following setup.

PHP project in a virtualbox/Vagrant setup
One folder mounted via NFS with the following parameters (Vagrantfile)

config.nfs.map_uid = Process.uid
config.nfs.map_gid = Process.gid
# left path is on the host machine, right path is inside the VM
config.vm.synced_folder "/home/foo/projects/php-project", "/home/project-root/current",
  id: "i",
  type: 'nfs',
  nfs_version: 4,
  nfs_udp: false,
  mount_options: ["actimeo=1", "nolock"],
  :nfs => { :mount_options => ["dmode=0755","fmode=0664"] }

I run composer update in the project folder, which installs all dependencies. It aborts because it cannot delete an old dependency folder.

Result:

BTRFS partition is read-only.

Error message I could extract from journalctl:

Jan 04 09:04:10 foo kernel: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=64411566080 slot=31, bad key order, prev (107391 96 39>
Jan 04 09:04:10 foo kernel: BTRFS info (device dm-0): leaf 64411566080 gen 17047 total ptrs 181 free space 1727 owner 18446744073709551610
Jan 04 09:04:10 foo kernel:         item 0 key (470 1 0) itemoff 16123 itemsize 160
Jan 04 09:04:10 foo kernel:                 inode generation 52 size 2559483904 mode 100644

Everything else is basically lost, because after that the FS becomes RO and dmesg is full of messages saying that stuff can not be written.
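
The ">" at the end of the first journalctl line above suggests the pager cut that line off. A minimal sketch for capturing the complete BTRFS kernel messages, assuming a systemd journal; the helper name btrfs_kernel_lines and the output path are made up for illustration:

```shell
# Hypothetical helper: keep only log lines that mention btrfs.
# Reads log text on stdin, so it works with journalctl or dmesg output.
# The match is case-insensitive, so "[btrfs]" call-trace lines are kept too.
btrfs_kernel_lines() {
    grep -i 'btrfs' || true   # finding no match is not treated as an error
}

# Usage (not run here). --no-pager prints full lines instead of chopping them.
# /run is normally a tmpfs, so it should stay writable when / is read-only:
#   journalctl -k -b --no-pager | btrfs_kernel_lines > /run/btrfs-errors.txt
```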

What I have tried:

First machine

Running btrfsck --force /dev/md-0
- Result: Invalid inodes on specific files in the vendor folder (the PHP equivalent of node_modules, where packages are installed when I run composer XXX).
  
  Sorry, I do not have the exact output anymore, because this was yesterday and it seemed like a contained problem then.
I tried to restore a snapshot from that day, only to realize that snapshots are just files and the filesystem itself was broken. It did nothing and was probably a bad idea.
Then I ran btrfsck --repair /dev/md-0, which is strongly advised against basically everywhere (don't create that command then, I guess). It fixed problems that had not shown up in the earlier btrfsck run, as well as the inode errors.
Afterwards the system booted once. I retried the PHP project, the filesystem went read-only again, and now the machine fails to boot completely.

Second machine

Running:

 # sudo btrfsck --force /dev/mapper/luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
 Opening filesystem to check...
 WARNING: filesystem mounted, continuing because of --force
 Checking filesystem on /dev/mapper/luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
 UUID: ccdd6485-2326-4b0e-a016-810aab378edd
 [1/7] checking root items
 [2/7] checking extents
 [3/7] checking free space tree
 [4/7] checking fs roots
 [5/7] checking only csums items (without verifying data)
 [6/7] checking root refs
 [7/7] checking quota groups skipped (not enabled on this FS)
 found 76418887680 bytes used, no error found
 total csum bytes: 68161524
 total tree bytes: 2291154944
 total fs tree bytes: 2104164352
 total extent tree bytes: 99516416
 btree space waste bytes: 360352256
 file data blocks allocated: 1905105891328
 referenced 116512391168

Which seems fine, but why then go into read-only mode?
Maybe it healed itself in the meantime.
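
Note the WARNING line in the output: the check ran against a mounted filesystem. The btrfs check documentation recommends unmounting first, so a clean result from a mounted filesystem may not tell the whole story. A small sketch of a guard that could run before the check; the function name is hypothetical, and the device has to be written the way it appears in the mount table:

```shell
# Hypothetical helper: succeed if DEVICE appears as a mount source.
# The mounts file defaults to /proc/mounts; it is a parameter so the
# check can be tried against a sample file.
device_is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage (not run here):
#   if device_is_mounted /dev/mapper/luks-<blkid>; then
#       echo "still mounted: run the check from a live system instead"
#   fi
```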

I am basically lost right now.

I can not really use the tools from the live stick on machine #1, because it is an old installation which still has Timeshift in use. I can not use the old live system for it, because I can not update it properly. Also even after updating it, it does not use the latest kernel, which I understand is strongly advised for working with BTRFS.

Machine #2 has not been investigated enough yet. As seen above, btrfsck currently reports no error, but that means nothing to me. I wanted to re-install machine #1 today, but if neither machine works properly, I am hesitant.

I am pretty lost right now.

What do you guys recommend in this scenario?

P.S.: Adding garuda-inxi as soon as I find a way to copy&paste it without a file in between.

Update (garuda-inxi from machine #2):

# sudo garuda-inxi | xsel --clipboard --logfile /dev/null

System:
  Kernel: 6.1.1-zen1-1-zen arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
    parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen
    root=UUID=ccdd6485-2326-4b0e-a016-810aab378edd rw rootflags=subvol=@
    quiet
    cryptdevice=UUID=8bbf76ae-7d61-4f11-baf6-975ba4e35aaa:luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
    root=/dev/mapper/luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa quiet splash
    rd.udev.log_priority=3 vt.global_cursor_default=0 loglevel=3 ibt=off
  Desktop: KDE Plasma v: 5.26.4 tk: Qt v: 5.15.7 info: latte-dock
    wm: kwin_x11 dm: SDDM Distro: Garuda Linux base: Arch Linux
Machine:
  Type: Laptop System: LENOVO product: xxxxxxxxxx v: ThinkPad P51
    serial: <filter> Chassis: type: 10 serial: <filter>
  Mobo: LENOVO model: xxxxxxxxxx serial: <filter> UEFI: LENOVO
    v: N1UET85W (1.59 ) date: 07/18/2022
Battery:
  ID-1: BAT0 charge: 15.2 Wh (20.5%) condition: 74.1/90.0 Wh (82.4%)
    volts: 10.6 min: 11.2 model: SMP 00NY493 type: Li-poly serial: <filter>
    status: discharging cycles: 327
CPU:
  Info: model: Intel Xeon E3-1505M v6 socket: BGA1440 (U3E1) note: check
    bits: 64 type: MT MCP arch: Kaby Lake level: v3 note: check built: 2018
    process: Intel 14nm family: 6 model-id: 0x9E (158) stepping: 9
    microcode: 0xF0
  Topology: cpus: 1x cores: 4 tpc: 2 threads: 8 smt: enabled cache:
    L1: 256 KiB desc: d-4x32 KiB; i-4x32 KiB L2: 1024 KiB desc: 4x256 KiB
    L3: 8 MiB desc: 1x8 MiB
  Speed (MHz): avg: 3144 high: 3150 min/max: 800/4000 base/boost: 3000/3000
    scaling: driver: intel_pstate governor: powersave volts: 1.1 V
    ext-clock: 100 MHz cores: 1: 3150 2: 3150 3: 3150 4: 3150 5: 3103 6: 3150
    7: 3150 8: 3150 bogomips: 48000
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities:
  Type: itlb_multihit status: KVM: Split huge pages
  Type: l1tf mitigation: PTE Inversion; VMX: conditional cache flushes, SMT
    vulnerable
  Type: mds mitigation: Clear CPU buffers; SMT vulnerable
  Type: meltdown mitigation: PTI
  Type: mmio_stale_data mitigation: Clear CPU buffers; SMT vulnerable
  Type: retbleed mitigation: IBRS
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: IBRS, IBPB: conditional, RSB filling,
    PBRSB-eIBRS: Not affected
  Type: srbds mitigation: Microcode
  Type: tsx_async_abort mitigation: TSX disabled
Graphics:
  Device-1: Intel HD Graphics P630 vendor: Lenovo driver: i915 v: kernel
    arch: Gen-9.5 process: Intel 14nm built: 2016-20 ports: active: eDP-1
    empty: none bus-ID: 00:02.0 chip-ID: 8086:591d class-ID: 0300
  Device-2: NVIDIA GM206GLM [Quadro M2200 Mobile] vendor: Lenovo
    driver: nvidia v: 525.60.11 alternate: nouveau,nvidia_drm non-free: 525.xx+
    status: current (as of 2022-12) arch: Maxwell code: GMxxx
    process: TSMC 28nm built: 2014-19 pcie: gen: 1 speed: 2.5 GT/s lanes: 16
    link-max: gen: 3 speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1436
    class-ID: 0302
  Device-3: Acer Integrated Camera type: USB driver: uvcvideo bus-ID: 1-8:2
    chip-ID: 5986:111c class-ID: 0e02 serial: <filter>
  Display: x11 server: X.Org v: 21.1.6 with: Xwayland v: 22.1.7
    compositor: kwin_x11 driver: X: loaded: modesetting,nvidia unloaded: nouveau
    alternate: fbdev,intel,nv,vesa dri: iris gpu: i915 display-ID: :0
    screens: 1
  Screen-1: 0 s-res: 1920x1080 s-dpi: 96 s-size: 508x285mm (20.00x11.22")
    s-diag: 582mm (22.93")
  Monitor-1: eDP-1 model: AU Optronics 0x61ed built: 2016 res: 1920x1080
    hz: 60 dpi: 142 gamma: 1.2 size: 344x193mm (13.54x7.6") diag: 394mm (15.5")
    ratio: 16:9 modes: 1920x1080
  API: OpenGL v: 4.6 Mesa 22.3.1 renderer: Mesa Intel HD Graphics P630 (KBL
    GT2) direct render: Yes
Audio:
  Device-1: Intel CM238 HD Audio vendor: Lenovo driver: snd_hda_intel
    v: kernel bus-ID: 00:1f.3 chip-ID: 8086:a171 class-ID: 0403
  Device-2: NVIDIA GM206 High Definition Audio driver: snd_hda_intel
    v: kernel pcie: speed: Unknown lanes: 63 link-max: gen: 6 speed: 64 GT/s
    bus-ID: 01:00.1 chip-ID: 10de:0fba class-ID: 0403
  Sound API: ALSA v: k6.1.1-zen1-1-zen running: yes
  Sound Server-1: PulseAudio v: 16.1 running: no
  Sound Server-2: PipeWire v: 0.3.63 running: yes
Network:
  Device-1: Intel Ethernet I219-LM vendor: Lenovo driver: e1000e v: kernel
    port: N/A bus-ID: 00:1f.6 chip-ID: 8086:15e3 class-ID: 0200
  IF: enp0s31f6 state: down mac: <filter>
  Device-2: Intel Wireless 8265 / 8275 driver: iwlwifi v: kernel pcie:
    gen: 1 speed: 2.5 GT/s lanes: 1 bus-ID: 04:00.0 chip-ID: 8086:24fd
    class-ID: 0280
  IF: wlp4s0 state: up mac: <filter>
  IF-ID-1: vboxnet0 state: up speed: 10 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 704.24 GiB used: 73.49 GiB (10.4%)
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Lenovo
    model: LENSE20256GMSP34MEAT2TA size: 238.47 GiB block-size: physical: 512 B
    logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 2.8.8341 temp: 37.9 C scheme: GPT
  SMART: yes health: PASSED on: 72d 21h cycles: 1,621
    read-units: 9,158,435 [4.68 TB] written-units: 23,437,439 [11.9 TB]
  ID-2: /dev/sda maj-min: 8:0 vendor: Seagate model: ST500LM021-1KJ152
    family: Laptop HDD size: 465.76 GiB block-size: physical: 4096 B
    logical: 512 B sata: 3.0 speed: 6.0 Gb/s type: HDD rpm: 7200
    serial: <filter> rev: LIM1 temp: 36 C scheme: GPT
  SMART: yes state: enabled health: PASSED on: 90d 22h cycles: 1508
    Pre-Fail: attribute: Spin_Retry_Count value: 100 worst: 100 threshold: 97
Partition:
  ID-1: / raw-size: 237.26 GiB size: 237.26 GiB (100.00%)
    used: 73.47 GiB (31.0%) fs: btrfs block-size: 4096 B dev: /dev/dm-0
    maj-min: 254:0 mapped: luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
  ID-2: /boot/efi raw-size: 512 MiB size: 511 MiB (99.80%)
    used: 14.2 MiB (2.8%) fs: vfat block-size: 512 B dev: /dev/nvme0n1p1
    maj-min: 259:1
  ID-3: /home raw-size: 237.26 GiB size: 237.26 GiB (100.00%)
    used: 73.47 GiB (31.0%) fs: btrfs block-size: 4096 B dev: /dev/dm-0
    maj-min: 254:0 mapped: luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
  ID-4: /var/log raw-size: 237.26 GiB size: 237.26 GiB (100.00%)
    used: 73.47 GiB (31.0%) fs: btrfs block-size: 4096 B dev: /dev/dm-0
    maj-min: 254:0 mapped: luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
  ID-5: /var/tmp raw-size: 237.26 GiB size: 237.26 GiB (100.00%)
    used: 73.47 GiB (31.0%) fs: btrfs block-size: 4096 B dev: /dev/dm-0
    maj-min: 254:0 mapped: luks-8bbf76ae-7d61-4f11-baf6-975ba4e35aaa
Swap:
  Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default)
  ID-1: swap-1 type: zram size: 31.07 GiB used: 0 KiB (0.0%) priority: 100
    dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 48.0 C pch: 44.5 C mobo: N/A
  Fan Speeds (RPM): fan-1: 2314 fan-2: 2329
Info:
  Processes: 303 Uptime: 57m wakeups: 1 Memory: 31.07 GiB
  used: 9.93 GiB (32.0%) Init: systemd v: 252 default: graphical
  tool: systemctl Compilers: gcc: 12.2.0 Packages: 1430 pm: pacman pkgs: 1425
  libs: 341 tools: pamac,paru pm: flatpak pkgs: 5 Shell: garuda-inxi (sudo)
  default: Bash v: 5.1.16 running-in: yakuake inxi: 3.3.24
Garuda (2.6.12-1):
  System install date:     2022-09-04
  Last full system update: 2022-12-23
  Is partially upgraded:   No
  Relevant software:       NetworkManager
  Windows dual boot:       No/Undetected
  Snapshots:               Snapper
  Failed units:

Update #2

"Good" news: The problem seems to be reproducible.
I could extract the following from sudo dmesg --follow this time:

[ 1692.455310] BTRFS: Transaction aborted (error -17)
[ 1692.455340] WARNING: CPU: 5 PID: 48505 at fs/btrfs/inode.c:6508 btrfs_create_new_inode.cold+0x14c/0x1b8 [btrfs]
[ 1692.455415] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc ccm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device qrtr bnep vboxnetflt(OE) intel_rapl_msr vboxnetadp(OE) intel_rapl_common vboxdrv(OE) uinput intel_tcc_cooling joydev x86_pkg_temp_thermal nvidia_drm(POE) mousedev intel_powerclamp iTCO_wdt nvidia_uvm(POE) nvidia_modeset(POE) ee1004 intel_pmc_bxt coretemp iTCO_vendor_support snd_hda_codec_hdmi snd_ctl_led btusb kvm_intel btrtl snd_hda_codec_realtek iwlmvm snd_hda_codec_generic uvcvideo btbcm kvm videobuf2_vmalloc snd_hda_intel irqbypass mac80211 btintel videobuf2_memops snd_intel_dspcfg rapl videobuf2_v4l2 intel_cstate snd_intel_sdw_acpi btmtk libarc4 snd_hda_codec intel_uncore psmouse videobuf2_common bluetooth snd_hda_core iwlwifi think_lmi videodev intel_lpss_pci snd_hwdepvfat i2c_i801 firmware_attributes_class thinkpad_acpi intel_lpss ecdh_generic wmi_bmof snd_pcm fat intel_wmi_thunderbolt mc crc16 e1000e cfg80211
[ 1692.455453]  i2c_smbus ie31200_edac snd_timer intel_pch_thermal idma64 ledtrig_audio platform_profile rfkill snd soundcore nvidia(POE) i2c_hid_acpi i2c_hid acpi_pad mac_hid crypto_user fuse zram bpf_preload ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic rtsx_pci_sdmmc gf128mul ghash_clmulni_intel mmc_core serio_raw sha512_ssse3 atkbd aesni_intel libps2 nvme crypto_simd vivaldi_fmap cryptd nvme_core xhci_pci rtsx_pci nvme_common xhci_pci_renesas i8042 serio radeon amdgpu gpu_sched drm_ttm_helper intel_agp crc32c_intel i915 drm_buddy video wmi ttm drm_display_helper cec intel_gtt
[ 1692.455486] CPU: 5 PID: 48505 Comm: nfsd Tainted: P           OE      6.1.1-zen1-1-zen #1 14158625220d9969ff9ca1425845f6e6a542f208
[ 1692.455488] Hardware name: LENOVO 20HJS1ER00/20HJS1ER00, BIOS N1UET85W (1.59 ) 07/18/2022
[ 1692.455489] RIP: 0010:btrfs_create_new_inode.cold+0x14c/0x1b8 [btrfs]
[ 1692.455543] Code: 19 00 00 eb cb 89 cf 89 8d 60 ff ff ff e8 59 a0 ff ff 8b 8d 60 ff ff ff 84 c0 74 4f 89 ce 48 c7 c7 30 21 a3 c1 e8 ca 18 9b cf <0f> 0b 8b 8d 60 ff ff ff 41 b8 01 00 00 00 eb ba 66 90 e9 7a ff ff
[ 1692.455544] RSP: 0018:ffffb356031839f8 EFLAGS: 00010286
[ 1692.455546] RAX: 0000000000000000 RBX: ffff8a2da0a1d428 RCX: 0000000000000027
[ 1692.455547] RDX: ffff8a339f961668 RSI: 0000000000000001 RDI: ffff8a339f961660
[ 1692.455548] RBP: ffffb35603183ad0 R08: 0000000000000001 R09: 00000000ffffffea
[ 1692.455549] R10: ffffffff9265b780 R11: 00000000fffff000 R12: ffffb35603183ae0
[ 1692.455550] R13: ffff8a2da0a1d23c R14: ffff8a2d2bc55428 R15: ffff8a2c50046a28
[ 1692.455551] FS:  0000000000000000(0000) GS:ffff8a339f940000(0000) knlGS:0000000000000000
[ 1692.455552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1692.455553] CR2: 00007ffba024b000 CR3: 0000000643210006 CR4: 00000000003726e0
[ 1692.455555] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1692.455555] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1692.455557] Call Trace:
[ 1692.455558]  <TASK>
[ 1692.455562]  btrfs_create_common+0xd9/0x1d0 [btrfs 78b153f35e51259f27b8ddd90cff8ef77f51246f]
[ 1692.455610]  vfs_mkdir+0x1e9/0x2c0
[ 1692.455615]  nfsd_create_locked+0x1fd/0x2c0 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455643]  nfsd_create+0x133/0x180 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455667]  nfsd4_create+0x17c/0x3f0 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455696]  nfsd4_proc_compound+0x3ad/0x6f0 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455722]  nfsd_dispatch+0x16b/0x280 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455745]  svc_process_common+0x284/0x5e0 [sunrpc a6ab4bea3b72c1c4d365975153183bc30e8e96ba]
[ 1692.455781]  ? svc_recv+0x54c/0x910 [sunrpc a6ab4bea3b72c1c4d365975153183bc30e8e96ba]
[ 1692.455815]  ? nfsd_svc+0x3b0/0x3b0 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455837]  ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455859]  svc_process+0xb1/0x100 [sunrpc a6ab4bea3b72c1c4d365975153183bc30e8e96ba]
[ 1692.455891]  nfsd+0xd9/0x190 [nfsd d77d7273737fb0c95bafb71c5fe56f732c41d9e8]
[ 1692.455913]  kthread+0xdb/0x110
[ 1692.455916]  ? kthread_complete_and_exit+0x20/0x20
[ 1692.455918]  ret_from_fork+0x1f/0x30
[ 1692.455922]  </TASK>
[ 1692.455923] ---[ end trace 0000000000000000 ]---
[ 1692.455924] BTRFS: error (device dm-0: state A) in btrfs_create_new_inode:6508: errno=-17 Object already exists
[ 1692.455928] BTRFS info (device dm-0: state EA): forced readonly
[ 1752.987829] audit: type=1701 audit(1672823986.535:223): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=50432 comm="smbd" exe="/usr/bin/smbd" sig=6 res=1
[ 1752.988507] systemd-journald[372]: /var/log/journal/d8450ad6b6774f22ae0a6dc735559d4b/system.journal: Journal file corrupted, rotating.
[ 1752.988513] systemd-journald[372]: Failed to write entry to /var/log/journal/d8450ad6b6774f22ae0a6dc735559d4b/system.journal (24 items, 719 bytes), rotating before retrying: Bad message

SGS · 4 January 2023 08:37

garuda-inxi | tb

or garuda-assistant/systemspecs

clownshow · 4 January 2023 09:09

Thanks. I found a command for copying garuda-inxi that also worked locally
garuda-inxi is there now.

tbg · 4 January 2023 13:56

I'm sorry to hear about your recent problems.

Have you searched online for similar recent occurrences with Arch or other Linux distros?

The reason I ask is that this may not be a Garuda specific problem. It could be a bug with your virtualization software, NFS, or other causes that might not be readily apparent. Possibly you have NFS configured improperly, but without details there's no way of knowing. I think you should run some checks on your NFS drive(s). You could do this from a Garuda live environment or from a specialized diagnostic/recovery disk to see if the hard drive is failing.

I'm not sure what else to suggest. Hopefully someone else will have other ideas to offer.

clownshow · 4 January 2023 15:36

I have, but I did not find anything specific. It is super weird that this just started to happen, and on two different systems with different update states (date-wise).
This is also the first occurrence of this happening. I worked with this setup before and did not see such problems.

Since I can reliably reproduce the problem, I could try to backtrack the updates to find the culprit (at least on machine #2) by restoring old snapshots.
For machine #1 I have less hope, since it does not boot currently.

Are there any guides on how to back up your btrfs content to another drive (preferably as a packaged image, since I currently only have an NTFS volume available) and restore it, or parts of it, in the new setup?
Then I would just re-install Garuda on machine #1. It was already incompatible with the new Snapper way of life (AFAIK), and GRUB was still broken (from the time when GRUB rolled out a broken-ish package that led to debug something something messages during boot).

Maybe it might be possible to backup the timeshift snapshots, too, so I could just backtrack them on machine #1 after reinstalling. But that would probably not work, since too much might have changed after a fresh install

I have the @home sub-volume covered by backintime and already packaged the /home/user_backup folder in a safe place, but since machine #1 does not boot anymore, and given the things above, I think just recreating the @home sub-volume would not cut it in this case.

clownshow · 9 January 2023 10:45

This is not THE solution, rather "a solution" or even better "a workaround"

Lessons learned:

Keep the backup of your @home subvolume/user folder somewhere NOT on @home.
You can use a live CD to delete the @home subvolume, recreate it and go your merry way (steps for it below).
Back up your stuff.
BTRFS is weird and apparently somehow unstable under certain circumstances. It has tooling that does not help and makes things worse in worst-case scenarios. Apparently it heals a lot of stuff on its own, and the fact that the system ran for about a year without errors shows that it CAN work stably. Also, the fact that this happened on two different devices with the same software setup shows that it depends on your usage and not on BTRFS alone.
For the memes: back up your stuff!

How to delete/re-create your @home subvolume from a LIVE environment:

This is written from memory. I am not 100% sure, that all the parameters are correct.

Decrypt your volume
sudo mount /dev/mapper/luks-<blkid> /mnt
sudo btrfs subvolume list /mnt
Find the subvolume id (subVolumeId) of your @home subvolume
sudo btrfs subvolume delete --commit-after --subvolid <subVolumeId> /mnt
sudo btrfs subvolume create /mnt/@home
sudo umount /mnt
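
The "find the subvolume id" step can be scripted. This is only a sketch: it assumes the usual "ID <n> gen <n> top level <n> path <name>" line format of btrfs subvolume list, and the helper name is hypothetical. Paths that contain spaces are not handled.

```shell
# Hypothetical helper: print the ID of the subvolume whose path matches
# exactly. Reads "btrfs subvolume list" output on stdin.
subvol_id_for_path() {
    awk -v p="$1" '$1 == "ID" && $NF == p { print $2 }'
}

# Usage (not run here):
#   sudo btrfs subvolume list /mnt | subvol_id_for_path @home
```

Because the match is exact, nested paths such as snapshots below @home are not returned by accident.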

I am not sure about the next steps, because I do not know how exactly your backup is setup.
If you have a backup of your whole @home subvolume, great. Restore that.

I only had a backintime backup of my userfolder, which I recreated like this.

sudo mount -t btrfs -o subvol=@home,defaults /dev/mapper/luks-<blkid> /mnt
sudo mkdir /mnt/<username> (where <username> is the user you are going to restore)
sudo chown 1000:1000 /mnt/<username> (we cannot use the username here, since this is a live environment where that user does not exist. If your user has a different user ID, use that.)
Restore your backup to that folder
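
Rather than assuming 1000:1000, the numeric IDs can be read from the installed system's passwd file. A sketch under the assumption that the root subvolume (@) is mounted somewhere in the live environment; the mount point /mnt-root and the helper name are made up for illustration:

```shell
# Hypothetical helper: print "uid:gid" for USER from a passwd-format file.
# Prints nothing if the user is not listed.
uid_gid_for_user() {
    awk -F: -v u="$1" '$1 == u { print $3 ":" $4 }' "$2"
}

# Usage (not run here), with the installed root subvolume at /mnt-root:
#   sudo chown "$(uid_gid_for_user <username> /mnt-root/etc/passwd)" /mnt/<username>
```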

If you do not have a backup, you can also just copy the contents of /etc/skel to that folder to make the user "usable" again. It would be as if you have just created a new user on your system, but your application settings and applications in general would survive the restore, if they were stored outside your home folder (e.g. in /etc).

Reboot.

Thoughts

This was a rather tedious process and I hated every part of it that I did not understand.

I am still missing crucial information.

1. Why did this happen?
I could replicate the error with every snapshot of my system, from the oldest to the newest. But this just showed me that this problem had been lurking in the shadows for some time and I just did not see it until it was too late. Maybe this was also due to the fact that BTRFS tried to fix it, thought it did, but did not really? Idk, but it never resurfaced RIGHT AFTER BOOTING, only when I tried to write to the broken file. So it is easy for the user to lose sight of it, as advanced as they might be.
2. What caused it?
I know that the software stack I mentioned in my initial post was relatively huge. I tried to dumb it down a bit in my recreation attempts, going as far as running composer on the host system itself. It did not prove anything though, since the error was the same and, as I described in 1., it was probably broken for a long time without BTRFS or me noticing it.
3. Why did it happen on two systems that are 100% isolated from each other?
This is what bugs me the most. Except for the software development stack, which proved to be working for MONTHS beforehand, those systems are isolated from each other. I did not restore any backups from one to the other, did not copy the FS or anything that could have REMOTELY triggered this behavior. They even had different software package versions and this issue still occurred. I mean, it could have just been bad luck in the NFS server implementation that caused the corruption, because a COW FS is voodoo to it, but it could also have been something completely different. Even the hardware is different on both systems.

I am confused and frankly a bit put off by this event.
It is easy to blame a distro for such behavior, even though it is just a collection of many, many moving parts. I really try not to, because I know that keeping this many moving parts together is very hard.
It could have been anything, from me turning the computer off too early or just a bad package somewhere, somehow.
I think, I just wished to get more support out of this forum, even though this is a hard case and everybody does this in their free time. This is not an "attack", just a wish

Anyhow. I hope this helps anybody who stumbles upon this BTRFS error.

meanruse · 9 January 2023 12:11

Thanks for sharing.

Just thinking (not that I really know anything):

Was copy-on-write enabled or disabled for the VM disk images?
I see it is advised to turn it off (the warning in the first link below):
VirtualBox - ArchWiki
Btrfs - ArchWiki
Gotchas - btrfs Wiki (the "fragmentation" section)
BTRFS: Solving issues with copy-on-write | FYHTECH (though old)
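
For anyone wanting to check the copy-on-write question on their own machine: in lsattr output the No_COW attribute shows up as an uppercase C in the flags field (a lowercase c means compression). A minimal sketch; the helper name is hypothetical and the folder in the usage line is only an assumed location for the VM images:

```shell
# Hypothetical helper: succeed if the first lsattr line on stdin carries
# the No_COW flag (uppercase C) in its flags field.
has_nocow_flag() {
    read -r nocow_flags nocow_rest || [ -n "$nocow_flags" ] || return 1
    case $nocow_flags in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage (not run here):
#   lsattr -d "$HOME/VirtualBox VMs" | has_nocow_flag && echo "No_COW is set"
```

Setting the flag with chattr +C on a directory only affects files created in it afterwards, so existing disk images would have to be copied in again to pick it up.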

Were the VM disk images preallocated fixed-size or the growing kind?

Did the VMs have raw disk access to the btrfs device?
Or maybe even access through NFS entails similar problems as you suspect?
2.8. Advanced Storage Configuration

Other random potentially useful stuff:
How to prevent disk corruption on Virtualbox - Super User (though the question is about btrfs in the VM, there may be something in the replies)

As far as I understand, there are indeed still some corner cases that trip up btrfs and, while I have no evidence it's necessary, I like to keep things that do not benefit much from being on btrfs on a separate ext4 partition.

tbg · 9 January 2023 16:06

I’m very happy you recovered from this event, even though it was a painful recovery. The important takeaway from all of this is to always have backups, as BTRFS is not a backup method (it is for system recovery). Sadly, most learn this lesson the hard way. You at least understand how important this aspect is to your system’s integrity.

I have used Linux a very long time and I can’t recall ever coming across a case quite like yours. I’m pretty sure that’s why you didn’t receive a ton of help. For others to offer assistance, they really need a little bit of understanding of what might have caused your situation. I think most assistants were simply at a loss in this case, because of its uniqueness. I’m sorry you felt the forum let you down, and I do understand you’re not intending to complain about the forum support volunteers.

Thank you for documenting your recovery process. This should make it easier for others to repair their system if this situation ever arises again (hopefully it won’t). Sorry we couldn’t have been of more help, but you at least learned a ton in the process. The most important thing is you did not give up, as many others might have done in this situation. You will go far in Linux with your type of “can do” attitude, nicely done.

clownshow · 11 January 2023 08:28

@meanruse The problem occurred with an NFS shared folder not with the hard drive of the VM, but thank you for the link list

system · 13 January 2023 08:29

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.

filo · 24 January 2023 13:58

@clownshow: I reopened the topic further to your "flag".
Please feel free to add any consideration you feel necessary and let me know if/when I can close the topic again.

clownshow · 17 October 2023 09:02

I do not work with that setup anymore, so I can not say if it works or not.
Another theory I had was, that Vagrant did not clean up properly and there were maybe TWO NFS mounts for the same folders “fighting” with each other causing BTRFS to freak.

That is just a theory though, I could never verify it. In light of a completely broken system on reproduction, I kinda do not want to right now
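
One cheap way to look for leftover duplicates, without reproducing the crash, would be to check whether the same host folder is listed more than once in the NFS exports file. This is only a sketch and would not prove or disprove the theory. It assumes Vagrant's NFS entries live in /etc/exports, that the path is the first field of each entry, and that paths contain no spaces; the helper name is hypothetical.

```shell
# Hypothetical helper: print every export path that occurs more than once.
# Comment lines (such as marker comments) and blank lines are skipped.
duplicate_export_paths() {
    awk '!/^[ \t]*#/ && NF { print $1 }' "${1:-/etc/exports}" | sort | uniq -d
}

# Usage (not run here):
#   duplicate_export_paths            # reads /etc/exports
#   sudo exportfs -v                  # shows what is exported right now
```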