Improve documentation around userspace tool behaviour and safety #863

Open
wants to merge 8 commits into devel
6 changes: 6 additions & 0 deletions Documentation/btrfs-check.rst
@@ -19,6 +19,12 @@ by the option *--readonly*.

:command:`btrfsck` is an alias of :command:`btrfs check` command and is now deprecated.

.. note::
Even though the filesystem checker requires a device argument, it scans for all
devices belonging to the same filesystem and may report metadata errors from other
devices that are correctable by :command:`btrfs scrub`. In this case, run scrub
first to ensure any correctable metadata errors are fixed to avoid false-positives.
Collaborator:

I do not think this note is needed.

btrfs check handles metadata errors just like the kernel: if one mirror is corrupted but another good copy is available, btrfs check won't report it as an error. E.g.:

Opening filesystem to check...
checksum verify failed on 30490624 wanted 0xcdcdcdcd found 0x86f5c8d8 << Mirror 1 corrupted for csum root
Checking filesystem on test.img
UUID: f01a583a-df1e-414b-a24c-7fe8bf2ef019
[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
[4/8] checking free space tree
[5/8] checking fs roots
[6/8] checking only csums items (without verifying data)
[7/8] checking root refs
[8/8] checking quota groups skipped (not enabled on this FS)
found 147456 bytes used, no error found <<< Still no error.
total csum bytes: 0
total tree bytes: 147456
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 140595
file data blocks allocated: 0
 referenced 0

So there is no false positive. And btrfs scrub has a much higher bar to clear (it requires a read-write mount), while btrfs check remains the most comprehensive check tool.

Author:

This directly contradicts my experience when using btrfs-progs 6.6.3. In which version was this changed/fixed?

Collaborator:

IIRC it has always been the case, from the beginning.

Would you mind giving an example where btrfs check reports recoverable corrupted mirrors as a critical error?
You can use btrfs-map-logical to find the physical address of a metadata copy and corrupt it.
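
For example (an untested sketch; 30490624 is the csum tree root logical address from the output above, and the physical offsets shown are illustrative):

$ btrfs-map-logical -l 30490624 test.img    # list the physical copies of the block
mirror 1 logical 30490624 physical 22020096 device test.img
mirror 2 logical 30490624 physical 55574528 device test.img
$ dd if=/dev/urandom of=test.img bs=4K count=1 seek=$((22020096 / 4096)) conv=notrunc    # clobber part of mirror 1 only
$ btrfs check test.img                      # should still report no error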

Author:

The history of the problem is linked in the PR description. Briefly: a disk failed and temporarily dropped from the array. Therefore, the two disks were no longer in sync after rebooting. btrfs check claimed the filesystem was corrupted and printed messages to that end.

In consultation with @Zygo on IRC, I manually verified with btrfs inspect-internal that btrfs check was reporting invalid node types/values that existed only on the bad disk. The kernel logs showed the kernel driver noticing the invalid metadata nodes on the bad disk and switching to the good one, but btrfs check claimed filesystem corruption even though the good disk was not corrupted and the filesystem metadata was fine. There was no indication from btrfs check that it was checking the bad device (only? due to PID load-balancing?) even when explicitly given the path to the good disk. Only after a read-write btrfs scrub fixed up the metadata on the bad disk did btrfs check report no errors.

Collaborator:

Do you still have that fs?

I'm checking the btrfs_read_extent_buffer() function and still cannot find a way the split-brain case could make btrfs check use the bad copy without trying the good one.

I think it's more likely a bug than expected behavior.

Author (@csnover, Oct 21, 2024):

> Do you still have that fs?

I do not; btrfs replace was used to replace the defective hardware a long time ago. However, until something is done to fix the lost-writes problem (like this MVP idea from Oct 2), a synthetic reproducer should be trivial: create a raid1 fs from two disks, forcibly remove one disk, ensure btrfs continues writing to the remaining disk for a little while, then unmount the fs, reconnect the forcibly removed disk, and remount.
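
An untested sketch of that reproducer using loop devices (paths, sizes, and device names are illustrative):

$ truncate -s 1G diskA.img diskB.img
$ DEVA=$(losetup --find --show diskA.img)
$ DEVB=$(losetup --find --show diskB.img)
$ mkfs.btrfs -d raid1 -m raid1 "$DEVA" "$DEVB"
$ mount "$DEVA" /mnt; echo before > /mnt/f; umount /mnt
$ losetup -d "$DEVB"                        # simulate the disk dropping off the bus
$ mount -o degraded "$DEVA" /mnt; echo after > /mnt/f; umount /mnt
$ DEVB=$(losetup --find --show diskB.img)   # "reconnect" the stale disk
$ btrfs device scan
$ mount "$DEVA" /mnt                        # reads served from the stale copies may now fail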

> I'm checking the btrfs_read_extent_buffer() function and still cannot find a way the split-brain case could make btrfs check use the bad copy without trying the good one.

If the problem can’t be found in the latest version then maybe it was fixed between 6.6.3 and current head? This was btrfs-progs 6.6.3 (which is still the latest available packaged version for Debian) on kernel 6.9.10.

> I think it's more likely a bug than expected behavior.

Sure, this seems likely to me.


.. warning::
Do not use *--repair* unless you are advised to do so by a developer
or an experienced user, and then only after having accepted that no *fsck*
71 changes: 62 additions & 9 deletions Documentation/ch-scrub-intro.rst
@@ -1,14 +1,67 @@
Scrub is a pass over all filesystem data and metadata and verifying the
checksums. If a valid copy is available (replicated block group profiles) then
the damaged one is repaired. All copies of the replicated profiles are validated.
Scrub is a validation pass over all filesystem data and metadata that detects
checksum errors, super block errors, metadata block header errors, and disk
read errors. All copies of replicated profiles are validated by default.

On filesystems that use replicated block group profiles (e.g. raid1), scrub will
also automatically repair any damage by default by copying verified good data
from one of the other replicas.

.. warning::
Setting the ``No_COW`` (``chattr +C``) attribute on a file implicitly enables
``nodatasum``. This means that while metadata for these files continues to
be validated and corrected by scrub, the actual file data is not.

Furthermore, btrfs does not currently mark missing or failed disks as
unreliable, so will continue to load-balance reads to potentially damaged
replicas in a replicated filesystem. This is not a problem normally because
damage is detected by checksum validation and a mirror copy is used, but
because ``No_COW`` files are not protected by checksum, bad data may be
returned even if a good copy exists on another replica. Which replica is used
is determined by the setting in ``/sys/fs/btrfs/<uuid>/read_policy``.
Comment:

Which replica is used is determined by the read policy, which can be changed through sysfs; however, the only currently implemented read policy in upstream Linux is 'pid'.

So this isn't wrong today, but one day it may be.

It might be better to say "The filesystem's configured read policy determines which replica will be used."

Author:

Sure, I will clarify.

Author:

I’ve reworded this to mention the read policy setting.

Currently, the only possible value for this setting is ``pid``, which uses
the process ID of the executable reading the file to pick the replica.
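
For example, the active policy can be inspected and changed through sysfs
(a sketch; substitute the filesystem UUID, and note that the active policy
is printed in brackets):

.. code-block:: none

   $ cat /sys/fs/btrfs/<uuid>/read_policy
   [pid]
   $ echo pid > /sys/fs/btrfs/<uuid>/read_policy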

Writing to a ``No_COW`` file after reading from a bad replica will overwrite
all replicas with the bad data. Detecting and recovering from a failure in
this case requires manual intervention before the file is rewritten to avoid
data loss. See issue `#482 <https://github.com/kdave/btrfs-progs/issues/482>`_.
Even with raid1c3 or higher, for performance reasons, btrfs does not use
consensus reads on any files, even ``No_COW`` files, to validate or correct
data errors.

Notably, `systemd sets +C on journals by default <https://github.com/systemd/systemd/commit/11689d2a021d95a8447d938180e0962cd9439763>`_,
and `libvirt ≥ 6.6 sets +C on storage pool directories by default <https://www.libvirt.org/news.html#v6-6-0-2020-08-02>`_.
Other applications or distributions may also set +C to try to improve
performance.
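
Whether a file or directory already has the attribute can be checked with
``lsattr`` (a sketch; the path is illustrative and the exact flag-column
layout varies between e2fsprogs versions):

.. code-block:: none

   $ lsattr -d /var/lib/libvirt/images
   ---------------C------ /var/lib/libvirt/images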

.. warning::
A read-write scrub will do no further harm to a damaged filesystem if it is not
possible to perform a correct repair, so it is safe to use at almost any time.
However, if a split-brain event occurs, btrfs scrub may cause unrecoverable data
loss. This situation is unlikely and requires a specific sequence of events that
cause an unhealthy device or device set to be mounted read-write in the absence
of the healthy device or device set from the same filesystem. For example:

1. Device set F fails and drops from the bus, while device set H continues to
function and receive additional writes.
2. After a reboot, healthy set H does not reappear immediately, but failed set
F does.
3. Failed set F is mounted read-write. At this point, it is no longer safe for
set H to reappear as the transaction histories have diverged. Allowing set H
and set F to recombine at any point will cause corruption of set H. Running
scrub on a split-brained filesystem will overwrite good data from set H with
other data from set F, increasing the amount of permanent data loss.

.. note::
Scrub is not a filesystem checker (fsck) and does not verify nor repair
structural damage in the filesystem. It really only checks checksums of data
and tree blocks, it doesn't ensure the content of tree blocks is valid and
consistent. There's some validation performed when metadata blocks are read
from disk (:doc:`Tree-checker`) but it's not extensive and cannot substitute
full :doc:`btrfs-check` run.
Scrub is not a filesystem checker (fsck). It can only detect filesystem damage
using the (:doc:`Tree-checker`) and checksum validation, and it can only repair
filesystem damage by copying from other known good replicas.
Collaborator:

Unfortunately scrub does not utilize the tree-checker for metadata.

Thus it only checks:

  • Bytenr
  • Fsid and chunk tree uuid
  • Checksum
  • Generation

So it's not even as strong as tree-checker.


:doc:`btrfs-check` performs more exhaustive checking and can sometimes be
used, with expert guidance, to rebuild certain corrupted filesystem structures
in the absence of any good replica. However, when a replica exists, scrub is
able to automatically correct most errors reported by ``btrfs-check``, so should
normally be run first to avoid false positives from ``btrfs-check``.

The user is supposed to run it manually or via a periodic system service. The
recommended period is a month but it could be less. The estimated device bandwidth
17 changes: 17 additions & 0 deletions Documentation/ch-volume-management-intro.rst
@@ -116,3 +116,20 @@ In order to remove a device, you need to convert the profile in this case:

$ btrfs balance start -mconvert=dup -dconvert=single /mnt
$ btrfs device remove /dev/sda /mnt

.. warning::
Do not run balance to convert from a profile with more redundancy to one with
less redundancy in order to remove a failing device from a filesystem.
As the name suggests, balance tries to balance data across devices.
Converting from e.g. raid1 to single may move data from the healthy device to
the failing device. This data will become irretrievable if the failing device
corrupts the new data or fails completely before ``btrfs device remove`` can
finish moving it back onto the healthy device.

To recover from a failing device with a replicated profile when you cannot
add enough new devices to maintain the required level of redundancy,
physically remove and replace the failing device, mount the filesystem with
``-o degraded``, then use :command:`btrfs-replace` to replace the missing
device with the new one. Once the device is replaced, check
``btrfs filesystem usage``, and if any single profiles are listed, run
``btrfs balance start -mconvert=raid1,soft -dconvert=raid1,soft`` to convert them back to raid1.
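
A sketch of this procedure, assuming the missing device has devid 2 and
``/dev/sdc`` is its replacement (device names and the devid are illustrative):

.. code-block:: none

   $ mount -o degraded /dev/sda /mnt
   $ btrfs filesystem show /mnt          # note the devid reported as missing
   $ btrfs replace start 2 /dev/sdc /mnt
   $ btrfs replace status /mnt           # wait for the replace to finish
   $ btrfs filesystem usage /mnt         # look for leftover "single" profiles
   $ btrfs balance start -mconvert=raid1,soft -dconvert=raid1,soft /mnt
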
5 changes: 5 additions & 0 deletions Documentation/index.rst
@@ -91,6 +91,11 @@ is in the :doc:`manual pages<man-index>`.

</td></tr></table>

Need help?
----------

Assistance is available from the `#btrfs channel on Libera Chat <https://web.libera.chat/#btrfs>`_ or the `linux-btrfs mailing list <https://subspace.kernel.org/vger.kernel.org.html>`_. Issues with the userspace btrfs tools can be reported to the `btrfs-progs issue tracker on GitHub <https://github.com/kdave/btrfs-progs/issues>`_.

.. raw:: html

<hr />
20 changes: 12 additions & 8 deletions Documentation/trouble-index.rst
@@ -9,8 +9,9 @@ for description and may need further explanation what needs to be done.
Error: parent transid verify error
----------------------------------

Reason: result of a failed internal consistency check of the filesystem's metadata.
Type: permanent
| Reason: result of a failed internal consistency check of the filesystem's metadata.
| Type: correctable by ``btrfs-scrub`` if a good copy exists on another replica; otherwise, permanent
|

.. code-block:: none

@@ -21,17 +22,20 @@ contains target block offset and generation that last changed this block. The
block it points to then upon read verifies that the block address and the
generation matches. This check is done on all tree levels.

The number in **faled on 30736384** is the logical block number, **wanted 10**
The number in **failed on 30736384** is the logical block number, **wanted 10**
is the expected generation number in the parent node, **found 8** is the one
found in the target block. The number difference between the generation can
give a hint when the problem could have happened, in terms of transaction
commits.

Once the mismatched generations are stored on the device, it's permanent and
cannot be easily recovered, because of information loss. The recovery tool
``btrfs restore`` is able to ignore the errors and attempt to restore the data
but due to the inconsistency in the metadata the data need to be verified by the
user.
Once the mismatched generations are stored on the device, without a good copy
from another replica, it's permanent and cannot be easily recovered because of
information loss. However, if a valid copy exists on another replica, btrfs will
transparently correct the read error, and running ``btrfs scrub`` in read-write
mode will fix the error permanently by copying the valid metadata block over the
invalid one. Otherwise, the recovery tool ``btrfs restore`` is able to ignore
Collaborator:

Nowadays we have the -o rescue=all,ro mount option, which is a more useful solution than btrfs restore.

Thus we may want to promote the new solution and present btrfs restore as the last resort (in most cases it's not much better than rescue=all).
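
A minimal sketch of the suggested approach (the device name and paths are illustrative):

$ mount -o ro,rescue=all /dev/sda /mnt    # tolerate as much damage as possible
$ cp -a /mnt/important-data /backup/      # salvage whatever is readable
$ umount /mnt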

the errors and attempt to restore the data, but due to the inconsistency in the
metadata, the restored data will need to be manually verified by the user.

The root cause of the error cannot be easily determined, possible reasons are:
