So I'm just getting around to checking the logs on my backup server, and it says I have a permanently damaged file that's unrepairable.
How is this even possible on a raidz2 volume where each member shows zero problems and no dead drives? Isn't that the whole point of raidz2, that if one (er, two) drives have a problem, the data is still recoverable? How can I figure out why this happened and why it was unrecoverable, and most importantly, prevent it in the future?
It’s only my backup server and the original file is still A-OK, but I’m really concerned here!
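My first thought for the "why" is to pull the ZFS event log and each drive's SMART data. As I understand it (not sure), zpool events should at least record which vdev the checksum error came from, and smartctl (from smartmontools) would show whether a disk is quietly reallocating sectors:

3-2-1-backup@BackupServer:~$ sudo zpool events -v data_pool3 | less
3-2-1-backup@BackupServer:~$ sudo smartctl -a /dev/disk/by-id/wwn-0x5000ccaxxxxxxxx1

(I'd repeat the smartctl check for all eight disks, watching Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, and UDMA_CRC_Error_Count.)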
zpool status -v:
3-2-1-backup@BackupServer:~$ sudo zpool status -v
  pool: data_pool3
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 06:59:59 with 1 errors on Sun Nov 12 07:24:00 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        data_pool3                  ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx1  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx2  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx3  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx4  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx5  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx6  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx7  ONLINE       0     0     0
            wwn-0x5000ccaxxxxxxxx8  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        data_pool3/(redacted)/(redacted)@backup_script:/Documentaries/(redacted)
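Since the action line says to restore the file, and the bad copy lives inside a snapshot (@backup_script), my understanding is I can't just overwrite it in place; I'd have to destroy the affected snapshot, re-send it from the primary, then clear and scrub. A rough sketch (the redacted names are straight from the error list above):

3-2-1-backup@BackupServer:~$ sudo zfs destroy 'data_pool3/(redacted)/(redacted)@backup_script'
3-2-1-backup@BackupServer:~$ # re-run replication from the primary (zfs send | zfs recv)
3-2-1-backup@BackupServer:~$ sudo zpool clear data_pool3
3-2-1-backup@BackupServer:~$ sudo zpool scrub data_pool3
3-2-1-backup@BackupServer:~$ sudo zpool status -v data_pool3

I've read the permanent-error list can take a scrub (or even two) to empty out after the bad blocks are actually freed, so I won't panic if it doesn't clear instantly.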
Well, two steps forward, one step back. The scrub I ran yesterday at least surfaced some errors, but I'm having trouble pinning down what the actual problem is. I think I'll sleep on it and form a new plan in the morning.
Controller failure? RAM failure? dmesg shows absolutely nothing, no panics, no errors, so I'm not thinking it's RAM. Hmmmm… maybe I'll run memtest86+ after I get some sleep.
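Actually, before booting memtest86+ off a USB stick, I think there's a non-reboot sanity check: if this box has ECC RAM and the EDAC driver is loaded, the kernel exposes corrected/uncorrected error counters in sysfs. Something like (assuming Linux with EDAC support; if the files don't exist, the counters just aren't available):

3-2-1-backup@BackupServer:~$ sudo dmesg | grep -iE 'edac|mce|ecc'
3-2-1-backup@BackupServer:~$ grep -H . /sys/devices/system/edac/mc/mc*/ce_count
3-2-1-backup@BackupServer:~$ grep -H . /sys/devices/system/edac/mc/mc*/ue_count

ce_count is corrected errors, ue_count uncorrected; anything nonzero in ue_count would make RAM a prime suspect.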
Hey wow, even though my problem is getting worse (maybe), an actual honest-to-god ISO showed up in the problem file list!
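The nice thing about an ISO is that it's trivially verifiable: assuming it's a distro image with a published checksum (the file name below is made up), I can hash my original copy and compare against the distributor's value. Reading the damaged copy on the backup pool should just throw an I/O error anyway, since ZFS won't return data that fails its checksum:

3-2-1-backup@BackupServer:~$ sha256sum some_distro.iso
3-2-1-backup@BackupServer:~$ # compare against the checksum file published alongside the ISO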