Unrecoverable/Checksum errors on ZFS pools

My previous system had a lot of errors and file corruption, so I replicated its data to new drives, only to later identify the old RAM as bad.

Now I’m on an entirely new system (albeit with the old data replicated onto new drives), and I’m still getting this message:

    One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.

In the new system, I’ve upgraded basically everything.

  • Ryzen 3 3100
  • Gigabyte B450M-DS3H
  • 64GB (4X16GB 3200MHz) XPG D10 Memory
  • 4x10TB Seagate Ironwolf Pro 7200RPM HDDs
  • 4x3TB Seagate Barracuda Compute 7200RPM HDDs
  • LSI 9240-8i 6Gbps SAS HBA with P20 9211 IT mode firmware
  • Fractal design R5 ATX case

[Image: the old system, showing the new case and drives]

I had corrupted files that I either removed or replaced, so there are no known data errors left. Even so, every new scrub accumulates a lot of checksum errors. Here is my zpool status -v output:

root@Jupiter:/mnt/Drive/Data # zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 3.27M in 0 days 05:16:27 with 0 errors on Thu Jul 23 14:54:42 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/d7d7fc39-badb-11ea-934c-94de8027dc7f  ONLINE       0     0    23
            gptid/d7e5f1c8-badb-11ea-934c-94de8027dc7f  ONLINE       0     0    37
            gptid/d7ef5942-badb-11ea-934c-94de8027dc7f  ONLINE       0     0    37
            gptid/d7f81f80-badb-11ea-934c-94de8027dc7f  ONLINE       0     0    35

errors: No known data errors




  pool: Drive
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Jul 24 09:35:35 2020
        5.69T scanned at 797M/s, 5.31T issued at 743M/s, 5.69T total
        3.62M repaired, 93.29% done, 0 days 00:08:59 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        Drive                                           DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/dc3404b3-ba3f-11ea-a5b7-94de8027dc7f  DEGRADED     0     0    54  too many errors
            gptid/dc08c769-ba3f-11ea-a5b7-94de8027dc7f  DEGRADED     0     0    58  too many errors
            gptid/dc1366d2-ba3f-11ea-a5b7-94de8027dc7f  DEGRADED     0     0    54  too many errors
            gptid/dc428347-ba3f-11ea-a5b7-94de8027dc7f  ONLINE       0     0    50

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:02 with 0 errors on Fri Jul 24 03:46:02 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da8p2     ONLINE       0     0     0

errors: No known data errors  
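
To compare runs, it can help to total the CKSUM column per pool instead of reading it off by eye. The following is only a rough sketch: cksum_totals is a hypothetical helper name, and it assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM) with plain integer counts.

```shell
# Sum the CKSUM column per pool from `zpool status` text read on stdin.
# Assumption: rows look like "NAME STATE READ WRITE CKSUM", as in the output above.
# Counts that are not plain integers are skipped.
cksum_totals() {
    awk '
        # Remember which pool the following device rows belong to.
        $1 == "pool:" { pool = $2; total[pool] = 0 }
        # Device rows: second field is a state, fifth field is the CKSUM count.
        $2 ~ /^(ONLINE|DEGRADED|FAULTED)$/ && $5 ~ /^[0-9]+$/ { total[pool] += $5 }
        END { for (p in total) print p, total[p] }
    ' | sort
}
```

Piping the status output into it (zpool status | cksum_totals) prints one line per pool. For the output above that would be Data 132, Drive 216 and freenas-boot 0, with the Drive scrub still in progress at that point.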

Solution

The output above was captured during a scrub, but the same thing happened repeatedly: after each zpool clear, the next scrub produced new checksum errors, with no permanent file corruption. Running MemTest86 against all four 16GB modules together produced failures on test 8. Testing the modules in pairs yielded no errors, which suggests the modules themselves are fine and the problem only appears with all four installed at full speed.

I fixed it by reducing the RAM speed to 2933MHz. Test 8 would fail at 3200MHz. I will try a BIOS update and then 3200MHz again, but for now this works. No more checksum errors!
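
The clear-and-rescrub cycle I kept repeating can be sketched roughly as below. rescrub is a hypothetical helper name, the 60-second polling interval is arbitrary, and the loop assumes the scan line reads "scrub in progress" while a scrub is running, as it does in the output above.

```shell
# Reset a pool's error counters, scrub it again, and show the result.
rescrub() {
    pool=$1
    zpool clear "$pool" || return 1   # zero the READ/WRITE/CKSUM counters
    zpool scrub "$pool" || return 1   # start a fresh scrub
    # Poll until the "scan:" line no longer reports a scrub in progress.
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep 60
    done
    zpool status -v "$pool"           # non-zero CKSUM counts here are new errors
}
```

Called as rescrub Data or rescrub Drive. If the CKSUM column fills up again on every drive after each run, as it did for me, the drives themselves are an unlikely cause.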