So I'm just getting around to checking the logs on my backup server, and it says I have a permanently damaged file that’s unrepairable.

How is this even possible on a raidz2 volume where every member shows zero problems and there are no dead drives? Isn’t that the whole point of raidz2, that if one (er, two) drives have a problem the data is still recoverable? How can I figure out why this happened and why it was unrecoverable, and most importantly, how do I prevent it in the future?

It’s only my backup server and the original file is still A-OK, but I’m really concerned here!

zpool status -v:

3-2-1-backup@BackupServer:~$ sudo zpool status -v
pool: data_pool3
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
 see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 06:59:59 with 1 errors on Sun Nov 12 07:24:00 2023
config:

    NAME                        STATE     READ WRITE CKSUM
    data_pool3                  ONLINE       0     0     0
      raidz2-0                  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx1  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx2  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx3  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx4  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx5  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx6  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx7  ONLINE       0     0     0
        wwn-0x5000ccaxxxxxxxx8  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    data_pool3/(redacted)/(redacted)@backup_script:/Documentaries/(redacted)
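
As a starting point for working out what was hit, here is a minimal Python sketch that splits each entry of the "Permanent errors" list into dataset, snapshot, and path. It assumes the `dataset@snapshot:/path` layout seen in the output above; the function name `parse_error_entries` and the `SAMPLE` text are made up for illustration, and entries in any other layout are kept as raw text rather than guessed at.

```python
def parse_error_entries(status_text):
    """Split the file list under 'errors:' in `zpool status -v` output.

    Assumption: snapshot entries look like dataset@snapshot:/path, as in
    the output pasted in this post. Anything else is returned unparsed.
    """
    entries = []
    in_errors = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            in_errors = True
            continue
        if not in_errors or not stripped:
            continue
        head, sep, path = stripped.partition(":/")
        if sep and "@" in head:
            dataset, _, snapshot = head.partition("@")
            entries.append(
                {"dataset": dataset, "snapshot": snapshot, "path": "/" + path}
            )
        else:
            # Unknown layout (e.g. a plain mounted path): keep it as-is.
            entries.append({"dataset": None, "snapshot": None, "path": stripped})
    return entries


# Illustrative input copied from the error list above.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

    data_pool3/(redacted)/(redacted)@backup_script:/Documentaries/(redacted)
"""

if __name__ == "__main__":
    for entry in parse_error_entries(SAMPLE):
        print(entry)
```

Run against the list above, this shows the damaged copy sitting inside a snapshot (`@backup_script`), which is worth knowing before deciding what to restore or delete.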
  • 3-2-1-backup@alien.top (OP) · 11 months ago

    Well, two steps forward, one step back. The scrub I ran yesterday at least showed some errors, but I’m having trouble identifying what the actual problem is. I think I’ll sleep on it and form a new plan in the morning.

    Controller failure? RAM failure? dmesg shows absolutely nothing, no panics, nothing at all, so I don’t think it’s RAM. Hmmmm… maybe I’ll run memtest after I get some sleep.

    3-2-1-backup@BackupServer:~$ sudo zpool status -vx
    pool: data_pool3
    state: ONLINE
    status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
     scan: scrub repaired 40K in 07:07:07 with 4 errors on Tue Nov 28 22:39:33 2023
    config:
    
        NAME                        STATE     READ WRITE CKSUM
        data_pool3                  ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000ccax1         ONLINE       0     0     8
            wwn-0x5000ccax2         ONLINE       0     0    10
            wwn-0x5000ccax3         ONLINE       0     0     8
            wwn-0x5000ccax4         ONLINE       0     0     8
            wwn-0x5000ccax5         ONLINE       0     0     8
            wwn-0x5000ccax6         ONLINE       0     0     8
            wwn-0x5000ccax7         ONLINE       0     0     8
            wwn-0x5000ccax8         ONLINE       0     0     8
    
    errors: Permanent errors have been detected in the following files:
    
        data_pool3/(redacted)/downloads@backup_script-2023-11-28-0901:/(redacted).mkv
        data_pool3/(redacted)@backup_script-2023-11-28-2001:/ISOs/Ubuntu/23.10/ubuntu-23.10.1-desktop-amd64.iso
        data_pool3/(redacted)@backup_script-2023-11-07-0901:/(redacted).mkv
    

    Hey wow, even though my problem is getting worse (maybe), an actual honest-to-god ISO showed up in the problem file list!
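
To make the pattern in the CKSUM column easier to see, here is a rough Python sketch that pulls the per-device checksum counts out of `zpool status` text and counts how many disks are affected. The function names, the `wwn-` prefix filter, and the `SAMPLE` text are assumptions for illustration only. The underlying idea (errors on every member at once hint at something shared, such as the controller, cabling, or RAM, rather than one bad disk) is also just a working guess, not a diagnosis.

```python
def parse_cksum_counts(status_text):
    """Return {name: cksum_count} for rows of the config table.

    Assumption: rows have the columns NAME STATE READ WRITE CKSUM.
    Abbreviated counts that are not plain digits are skipped.
    """
    counts = {}
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            break
        if not in_config or not stripped:
            continue
        fields = stripped.split()
        if len(fields) >= 5 and fields[0] != "NAME" and fields[4].isdigit():
            counts[fields[0]] = int(fields[4])
    return counts


def affected_disks(counts, prefix="wwn-"):
    """Names of leaf disks with a nonzero CKSUM count.

    Assumption: leaf disks are the rows whose name starts with `prefix`.
    """
    return [n for n, c in counts.items() if n.startswith(prefix) and c > 0]


# Illustrative input copied from the config table above.
SAMPLE = """\
config:

    NAME                        STATE     READ WRITE CKSUM
    data_pool3                  ONLINE       0     0     0
      raidz2-0                  ONLINE       0     0     0
        wwn-0x5000ccax1         ONLINE       0     0     8
        wwn-0x5000ccax2         ONLINE       0     0    10
        wwn-0x5000ccax3         ONLINE       0     0     8
        wwn-0x5000ccax4         ONLINE       0     0     8
        wwn-0x5000ccax5         ONLINE       0     0     8
        wwn-0x5000ccax6         ONLINE       0     0     8
        wwn-0x5000ccax7         ONLINE       0     0     8
        wwn-0x5000ccax8         ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:
"""

if __name__ == "__main__":
    counts = parse_cksum_counts(SAMPLE)
    disks = [n for n in counts if n.startswith("wwn-")]
    bad = affected_disks(counts)
    print(f"{len(bad)} of {len(disks)} disks report checksum errors")
    if disks and len(bad) == len(disks):
        print("Every disk is affected; a shared component may be worth checking.")
```

Saving the output of each scrub and running it through something like this would show whether the counts keep climbing on all disks together or start to concentrate on one.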