Improve documentation around userspace tool behaviour and safety #863
base: devel
Changes from all commits
45d4a4f
7ef4d4c
4a41d02
d73c0fc
13159fb
69d72f5
48503d0
54b2eb1
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,67 @@ | ||
Scrub is a pass over all filesystem data and metadata and verifying the | ||
checksums. If a valid copy is available (replicated block group profiles) then | ||
the damaged one is repaired. All copies of the replicated profiles are validated. | ||
Scrub is a validation pass over all filesystem data and metadata that detects | ||
checksum errors, super block errors, metadata block header errors, and disk | ||
read errors. All copies of replicated profiles are validated by default. | ||
|
||
On filesystems that use replicated block group profiles (e.g. raid1), scrub will | ||
also automatically repair any damage by default by copying verified good data | ||
from one of the other replicas. | ||
|
||
.. warning:: | ||
Setting the ``No_COW`` (``chattr +C``) attribute on a file implicitly enables | ||
``nodatasum``. This means that while metadata for these files continues to | ||
be validated and corrected by scrub, the actual file data is not. | ||
|
||
Furthermore, btrfs does not currently mark missing or failed disks as | ||
unreliable, so it will continue to load-balance reads to potentially damaged | ||
replicas in a replicated filesystem. This is not a problem normally because | ||
damage is detected by checksum validation and a mirror copy is used, but | ||
because ``No_COW`` files are not protected by checksum, bad data may be | ||
returned even if a good copy exists on another replica. Which replica is used | ||
is determined by the setting in ``/sys/fs/btrfs/<uuid>/read_policy``. | ||
Review comment: Which replica is used is determined by the read policy, which can be changed through sysfs; however, the only currently implemented read policy in upstream Linux is 'pid'. So this isn't wrong today, but one day it may be. It might be better to say "The filesystem's configured read policy determines which replica will be used."
Review comment: Sure, I will clarify.
Review comment: I've reworded this to mention the read policy setting.
||
Currently, the only possible value for this setting is ``pid``, which uses | ||
the process ID of the executable reading the file to pick the replica. | ||
|
||
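The ``pid`` policy described above can be illustrated with a minimal sketch. It assumes the selection reduces to taking the reader's process ID modulo the number of copies; the function name `pick_replica` is hypothetical and is not part of btrfs or btrfs-progs.

```python
import os


def pick_replica(pid: int, num_copies: int) -> int:
    """Return the index of the replica a reader would be directed to.

    Assumption: the 'pid' read policy is modelled as pid modulo the
    number of copies. This is an illustration, not kernel code.
    """
    if num_copies < 1:
        raise ValueError("num_copies must be at least 1")
    return pid % num_copies


if __name__ == "__main__":
    # On a two-copy profile such as raid1, readers alternate between
    # replicas depending on whether their PID is even or odd.
    print(pick_replica(os.getpid(), 2))
```

Under this model, a given process keeps reading from the same replica for its whole lifetime, which is why a ``No_COW`` file can look intact to one process and damaged to another.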
Writing to a ``No_COW`` file after reading from a bad replica will overwrite | ||
all replicas with the bad data. Detecting and recovering from a failure in | ||
this case requires manual intervention before the file is rewritten to avoid | ||
data loss. See issue `#482 <https://github.com/kdave/btrfs-progs/issues/482>`_. | ||
Even with raid1c3 or higher, for performance reasons, btrfs does not use | ||
consensus reads on any files, even ``No_COW`` files, to validate or correct | ||
data errors. | ||
|
||
Notably, `systemd sets +C on journals by default <https://github.com/systemd/systemd/commit/11689d2a021d95a8447d938180e0962cd9439763>`_, | ||
and `libvirt ≥ 6.6 sets +C on storage pool directories by default <https://www.libvirt.org/news.html#v6-6-0-2020-08-02>`_. | ||
Other applications or distributions may also set +C to try to improve | ||
performance. | ||
|
||
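One way to find out whether a file or directory is affected by the warning above is to look for the ``C`` flag in ``lsattr`` output. The following is a rough sketch, assuming ``lsattr`` is installed and prints the flag field first, followed by the path. The helper names are hypothetical and the sample line in the comment is illustrative, not captured output.

```python
import subprocess


def flags_include_nocow(lsattr_line: str) -> bool:
    """Return True if the flag field of one lsattr output line contains 'C'.

    Assumption: the first whitespace-separated field holds the flags,
    for example '---------------C------ /var/log/journal'.
    """
    fields = lsattr_line.split(maxsplit=1)
    return bool(fields) and "C" in fields[0]


def is_nocow(path: str) -> bool:
    """Run 'lsattr -d' on path and report whether No_COW is set."""
    result = subprocess.run(
        ["lsattr", "-d", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return flags_include_nocow(result.stdout)
```

Only the flag field is inspected, so a path that happens to contain the letter ``C`` does not produce a false match.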
.. warning:: | ||
A read-write scrub will do no further harm to a damaged filesystem if it is not | ||
possible to perform a correct repair, so it is safe to use at almost any time. | ||
However, if a split-brain event occurs, btrfs scrub may cause unrecoverable data | ||
loss. This situation is unlikely and requires a specific sequence of events that | ||
cause an unhealthy device or device set to be mounted read-write in the absence | ||
of the healthy device or device set from the same filesystem. For example: | ||
|
||
1. Device set F fails and drops from the bus, while device set H continues to | ||
function and receive additional writes. | ||
2. After a reboot, healthy set H does not reappear immediately, but failed set | ||
F does. | ||
3. Failed set F is mounted read-write. At this point, it is no longer safe for | ||
set H to reappear as the transaction histories have diverged. Allowing set H | ||
and set F to recombine at any point will cause corruption of set H. Running | ||
scrub on a split-brained filesystem will overwrite good data from set H with | ||
other data from set F, increasing the amount of permanent data loss. | ||
|
||
.. note:: | ||
Scrub is not a filesystem checker (fsck) and does not verify nor repair | ||
structural damage in the filesystem. It really only checks checksums of data | ||
and tree blocks, it doesn't ensure the content of tree blocks is valid and | ||
consistent. There's some validation performed when metadata blocks are read | ||
from disk (:doc:`Tree-checker`) but it's not extensive and cannot substitute | ||
full :doc:`btrfs-check` run. | ||
Scrub is not a filesystem checker (fsck). It can only detect filesystem damage | ||
using the :doc:`Tree-checker` and checksum validation, and it can only repair | ||
filesystem damage by copying from other known good replicas. | ||
Review comment: Unfortunately scrub does not utilize tree-checker for metadata. Thus it only checks:
So it's not even as strong as tree-checker.
||
|
||
:doc:`btrfs-check` performs more exhaustive checking and can sometimes be | ||
used, with expert guidance, to rebuild certain corrupted filesystem structures | ||
in the absence of any good replica. However, when a replica exists, scrub is | ||
able to automatically correct most errors reported by ``btrfs-check``, so it should | ||
normally be run first to avoid false positives from ``btrfs-check``. | ||
|
||
The user is supposed to run it manually or via a periodic system service. The | ||
recommended period is a month but it could be less. The estimated device bandwidth | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,8 +9,9 @@ for description and may need further explanation what needs to be done. | |
Error: parent transid verify error | ||
---------------------------------- | ||
|
||
Reason: result of a failed internal consistency check of the filesystem's metadata. | ||
Type: permanent | ||
| Reason: result of a failed internal consistency check of the filesystem's metadata. | ||
| Type: correctable by ``btrfs-scrub`` if a good copy exists on another replica; otherwise, permanent | ||
| | ||
|
||
.. code-block:: none | ||
|
||
|
@@ -21,17 +22,20 @@ contains target block offset and generation that last changed this block. The | |
block it points to then upon read verifies that the block address and the | ||
generation matches. This check is done on all tree levels. | ||
|
||
The number in **faled on 30736384** is the logical block number, **wanted 10** | ||
The number in **failed on 30736384** is the logical block number, **wanted 10** | ||
is the expected generation number in the parent node, **found 8** is the one | ||
found in the target block. The number difference between the generation can | ||
give a hint when the problem could have happened, in terms of transaction | ||
commits. | ||
|
||
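The fields described above can be pulled out of a log line with a short parser. This is a sketch only: the sample message layout is an assumption built from the fragments quoted in the text (**failed on**, **wanted**, **found**), and the names `TransidError` and `parse_transid_error` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class TransidError(NamedTuple):
    logical: int  # logical block number from "failed on N"
    wanted: int   # generation expected by the parent node
    found: int    # generation stored in the target block

    @property
    def generation_gap(self) -> int:
        """Number of transaction commits between the two generations."""
        return self.wanted - self.found


# Assumed layout: "... failed on <logical> wanted <gen> found <gen>"
_PATTERN = re.compile(r"failed on (\d+)\s+wanted (\d+)\s+found (\d+)")


def parse_transid_error(line: str) -> Optional[TransidError]:
    """Return the parsed fields, or None if the line does not match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    logical, wanted, found = (int(group) for group in match.groups())
    return TransidError(logical, wanted, found)
```

For the values quoted in the text, the gap is 10 - 8 = 2, meaning the target block is two transaction commits older than its parent expects.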
Once the mismatched generations are stored on the device, it's permanent and | ||
cannot be easily recovered, because of information loss. The recovery tool | ||
``btrfs restore`` is able to ignore the errors and attempt to restore the data | ||
but due to the inconsistency in the metadata the data need to be verified by the | ||
user. | ||
Once the mismatched generations are stored on the device, without a good copy | ||
from another replica, it's permanent and cannot be easily recovered because of | ||
information loss. However, if a valid copy exists on another replica, btrfs will | ||
transparently correct the read error, and running ``btrfs scrub`` in read-write | ||
mode will fix the error permanently by copying the valid metadata block over the | ||
invalid one. Otherwise, the recovery tool ``btrfs restore`` is able to ignore | ||
Review comment: Nowadays we have Thus we may want to promote the new solution, and use
||
the errors and attempt to restore the data, but due to the inconsistency in the | ||
metadata, the restored data will need to be manually verified by the user. | ||
|
||
The root cause of the error cannot be easily determined, possible reasons are: | ||
|
||
|
Review comment: I do not think this note is needed.
Btrfs check handles metadata errors just like the kernel: if a mirror is corrupted but another good one is still available, btrfs check won't report it as an error. E.g:
So there is no false positive. And `btrfs scrub` has a much higher bar to clear (RW mount), meanwhile btrfs check is always the most comprehensive check tool.
Review comment: This directly contradicts my experience when using btrfs-progs 6.6.3. In which version was this changed/fixed?
Review comment: IIRC this has always been the case from the beginning.
Could you give an example where btrfs-check reports recoverable corrupted mirrors as a critical error?
You can use `btrfs-map-logical` to find the physical address of a metadata copy and corrupt it.
Review comment: The history of the problem is linked in the PR description. Briefly: a disk failed and temporarily dropped from the array. Therefore, the two disks were no longer in sync after rebooting. `btrfs check` claimed the filesystem was corrupted and printed messages to that end.
In consultation with @Zygo on IRC I manually verified by `btrfs inspect-internal` that `btrfs check` was reporting invalid node types/values that existed only on the bad disk. The kernel logs showed the kernel driver was noticing the invalid metadata nodes on the bad disk and switching to the good one, but `btrfs check` was claiming filesystem corruption even though the good disk was not corrupted and the filesystem metadata was fine. There was no indication by `btrfs check` that it was checking (only? due to PID load-balancing?) the bad device even when explicitly given the path to the good disk. Only after read-write `btrfs scrub` fixed up the metadata on the bad disk did `btrfs check` report no errors.
Review comment: Do you still have that fs?
I'm checking the `btrfs_read_extent_buffer()` function and still cannot find a way for the split-brain case to make btrfs-check use the bad copy without trying the good one.
I think it's more like a bug than an expected behavior.
Review comment: I do not, `btrfs replace` was used to replace the defective hardware a long time ago. However, until something is done to fix the lost writes problem (like this MVP idea from Oct 2), it seems like a synthetic reproducer should be trivial: create a raid1 fs from two disks, forcibly remove one disk, ensure btrfs continues writing to the remaining disk for a little bit, then unmount the fs, reconnect the forcibly removed disk, and remount.
If the problem can't be found in the latest version then maybe it was fixed between 6.6.3 and current head? This was btrfs-progs 6.6.3 (which is still the latest available packaged version for Debian) on kernel 6.9.10.
Sure, this seems likely to me.