XFS Self Describing Metadata
----------------------------

Introduction
------------

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of
the structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes, however
it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced
writes weren't the ultimate cause of a corruption event. It may take a few
hours to a few days to perform such forensic analysis, so at this scale root
cause analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much
metadata to analyse and so that analysis blows out towards weeks/months of
forensic work. Most of the analysis work is slow and tedious, so as the
amount of analysis goes up, the more likely it is that the cause will be lost
in the noise. Hence the primary concern for supporting PB scale filesystems
is minimising the time and effort required for basic forensic analysis of the
filesystem structure.

Self Describing Metadata
------------------------

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what
it is supposed to be. We can't even identify if it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes,
it is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and
hence not detected by automated verification checks) but are not correct.
Finding and understanding how things like cross-linked block lists (e.g.
sibling pointers in a btree that end up with loops in them) occur is the key
to understanding what went wrong, but it is impossible to tell what order the
blocks were linked into each other or written to disk after the fact.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the
purposes of analysis. We can't protect against every possible type of error,
but we can ensure that common types of error are easily detectable. Hence the
concept of self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently
identify the type of metadata in the object, then the metadata doesn't
describe itself very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only
the AGFL, remote symlinks and remote attribute blocks do not contain
identifying magic numbers. Hence we can change the on-disk format of all
these objects to add more identifying information and detect this simply by
changing the magic numbers in the metadata objects. That is, if it has the
current magic number, the metadata isn't self identifying. If it contains a
new magic number, it is self identifying and we can do much more expansive
automated verification of the metadata object at runtime, during forensic
analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it
has not been changed as a result of external influences. Hence we need some
form of integrity check, and this is done by adding CRC32c validation to the
metadata block. If we can verify the block contains the metadata it was
intended to contain, a large amount of the manual verification work can be
skipped.

CRC32c was selected because metadata cannot be more than 64k in length in XFS
and hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it
is fast. So while CRC32c is not the strongest of the possible integrity
checks that could be used, it is more than sufficient for our needs and has
relatively little overhead. Adding support for larger integrity fields and/or
algorithms doesn't really provide any extra value over CRC32c, but it does
add a lot of complexity, and so there is no provision for changing the
integrity checking mechanism.
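
To make the check itself concrete, here is a minimal, unaccelerated userspace
sketch of CRC32c (the Castagnoli polynomial in its reflected form). This is
purely illustrative - the kernel uses optimised, often hardware accelerated
implementations, not a bit-at-a-time loop like this:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow but dependency-free; for illustration only.
 */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
	}
	return ~crc;
}
```

The standard check value for this algorithm is crc32c(0, "123456789", 9) ==
0xE3069283, which is a handy sanity test for any implementation.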

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing
to look at any other metadata. This means it needs to contain location
information. Just adding a block number to the metadata is not sufficient to
protect against mis-directed writes - a write might be misdirected to the
wrong LUN and so be written to the "correct block" of the wrong filesystem.
Hence location information must contain a filesystem identifier as well as a
block number.

Another key information point in forensic analysis is knowing who the
metadata block belongs to. We already know the type, the location, that it is
valid and/or corrupted, and how long ago it was last modified. Knowing the
owner of the block is important as it allows us to find other related
metadata to determine the scope of the corruption. For example, if we have an
extent btree object, we don't know what inode it belongs to and hence have to
walk the entire filesystem to find the owner of the block. Worse, the
corruption could mean that no owner can be found (i.e. it's an orphan block),
and so without an owner field in the metadata we have no idea of the scope of
the corruption. If we have an owner field in the metadata object, we can
immediately do top down validation to determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, whilst
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we
are looking at. The owner information can also identify misplaced writes
(e.g. a freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing
forensic analysis is how recently the block was modified. Correlation of a
set of corrupted metadata blocks based on modification times is important as
it can indicate whether the corruptions are related, whether there have been
multiple corruption events that led to the eventual failure, and even whether
there are corruptions present that the run-time verification is not
detecting.

For example, we can determine whether a metadata object is supposed to be
free space or still allocated if it is still referenced by its owner by
looking at when the free space btree block that contains the block was last
written compared to when the metadata object itself was last written. If the
free space block is more recent than the object and the object's owner, then
there is a very good chance that the block should have been removed from the
owner.

To provide this "written timestamp", each metadata block gets the Log
Sequence Number (LSN) of the most recent transaction it was modified on
written into it. This number will always increase over the life of the
filesystem, and the only thing that resets it is running xfs_repair on the
filesystem. Further, by use of the LSN we can tell if the corrupted metadata
all belonged to the same log checkpoint and hence have some idea of how much
modification occurred between the first and last instance of corrupt metadata
on disk and, further, how much modification occurred between the corruption
being written and when it was detected.
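
The comparisons described above rely only on being able to order two LSNs. As
a hedged sketch (the exact packing and helper names are assumptions for
illustration, not the kernel's implementation): an LSN can be viewed as a 64
bit value combining a log cycle number in the high 32 bits with a block
offset within that cycle in the low 32 bits, and ordering compares cycles
first, then blocks:

```c
#include <stdint.h>

/* Assumed illustrative LSN layout: cycle in high 32 bits, block in low. */
static uint32_t lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_block(uint64_t lsn) { return (uint32_t)lsn; }

static uint64_t make_lsn(uint32_t cycle, uint32_t block)
{
	return ((uint64_t)cycle << 32) | block;
}

/*
 * Returns <0, 0 or >0 depending on whether LSN a was assigned before,
 * at the same point as, or after LSN b in the life of the log.
 */
static int lsn_cmp(uint64_t a, uint64_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}
```

In the free space example above, lsn_cmp(freespace_lsn, object_lsn) > 0 would
indicate the free space record is the more recent write.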

Runtime Validation
------------------

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it
says it is and that the metadata fields are within bounds and internally
consistent. As such, we cannot catch all types of corruption that can occur
within a block as there may be certain limitations that operational state
enforces on the metadata, or there may be corruption of interblock
relationships (e.g. corrupted sibling pointer lists). Hence we still need
stateful checking in the main code body, but in general most of the per-field
validation is handled by the verifiers.

For read verification, the caller needs to specify the expected type of
metadata that it should see, and the IO completion process verifies that the
metadata object matches what was expected. If the verification process fails,
then it marks the object being read as EFSCORRUPTED. The caller needs to
catch this error (same as for IO errors), and if it needs to take special
action due to a verification error it can do so by catching the EFSCORRUPTED
error value. If we need more discrimination of error types at higher levels,
we can define new error numbers for different errors as necessary.

The first step in read verification is checking the magic number and
determining whether CRC validation is necessary. If it is, the CRC32c is
calculated and compared against the value stored in the object itself. Once
this is validated, further checks are made against the location information,
followed by extensive object specific metadata validation. If any of these
checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error
is set appropriately.

Write verification is the opposite of read verification - first the object is
extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert
it into the object. Once this is done the write IO is allowed to continue. If
any error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
----------

A typical on-disk structure needs to contain the following information:

struct xfs_ondisk_hdr {
	__be32	magic;		/* magic number */
	__be32	crc;		/* CRC, not logged */
	uuid_t	uuid;		/* filesystem identifier */
	__be64	owner;		/* parent object */
	__be64	blkno;		/* location on disk */
	__be64	lsn;		/* last modification in log, not logged */
};
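
To make the layout concrete, here is a hedged userspace stand-in for the
header above, with uuid_t replaced by a 16 byte array and the __be types by
fixed width integers. The sizes and offsets are those of the straightforward
packed layout of this stand-in on a typical LP64 machine, and illustrate the
shape of the header rather than the precise on-disk offsets of any particular
XFS structure:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t b[16]; } fs_uuid_t;	/* 16 byte identifier */

/* Userspace stand-in for the on-disk header described above. */
struct ondisk_hdr {
	uint32_t  magic;	/* magic number */
	uint32_t  crc;		/* CRC, not logged */
	fs_uuid_t uuid;		/* filesystem identifier */
	uint64_t  owner;	/* parent object */
	uint64_t  blkno;	/* location on disk */
	uint64_t  lsn;		/* last modification in log, not logged */
};
```

With natural alignment this packs with no padding: magic at offset 0, crc at
4, uuid at 8, owner at 24, blkno at 32 and lsn at 40, for 48 bytes total.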

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit
	  block number for location. The two of these combined provide the
	  same information as @owner and @blkno in the above structure, but
	  using 8 bytes less space on disk.

	- directory/attribute node blocks have a 16 bit magic number, and
	  the header that contains the magic number has other information in
	  it as well. Hence the additional metadata headers change the
	  overall format of the metadata.
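
The space saving in the first example can be sketched as follows. The
AG-relative addressing shown (absolute block = AG number * blocks per AG +
AG-relative block) is a simplified model for illustration, not the exact XFS
address arithmetic:

```c
#include <stdint.h>

/* Long form: absolute owner and block number, as in the header above. */
struct long_form {
	uint64_t owner;
	uint64_t blkno;
};

/*
 * Short form, as used by short btree blocks: the owning AG number and
 * an AG-relative block number are enough to recover both values.
 */
struct short_form {
	uint32_t owner;		/* AG number */
	uint32_t blkno;		/* block number within the AG */
};

/*
 * Simplified model: recover the absolute block number from the
 * AG-relative form, given a fixed number of blocks per AG.
 */
static uint64_t absolute_blkno(const struct short_form *s, uint64_t agblocks)
{
	return (uint64_t)s->owner * agblocks + s->blkno;
}
```

The short form is 8 bytes on disk instead of 16, which matters for btree
blocks where the header is pure overhead against fanout.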

A typical buffer read verifier is structured as follows:

#define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

static void
xfs_foo_read_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount *mp = bp->b_target->bt_mount;

	if ((xfs_sb_version_hascrc(&mp->m_sb) &&
	     !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
				XFS_FOO_CRC_OFF)) ||
	    !xfs_foo_verify(bp)) {
		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		xfs_buf_ioerror(bp, EFSCORRUPTED);
	}
}

The code ensures that the CRC is only checked if the filesystem has CRCs
enabled by checking the superblock for the feature bit, and then if the CRC
verifies OK (or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case where it can't, the code is structured as follows:

static bool
xfs_foo_verify(
	struct xfs_buf		*bp)
{
	struct xfs_mount	*mp = bp->b_target->bt_mount;
	struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	if (xfs_sb_version_hascrc(&mp->m_sb)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	}

	/* object specific verification checks here */

	return true;
}

If there are different magic numbers for the different formats, the verifier
will look like:

static bool
xfs_foo_verify(
	struct xfs_buf		*bp)
{
	struct xfs_mount	*mp = bp->b_target->bt_mount;
	struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	/* object specific verification checks here */

	return true;
}
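
The same stateless pattern can be exercised in userspace. The sketch below is
a hypothetical stand-in - the structure, magic value and helper names are
invented for illustration, and the fields are kept host-endian for brevity
even though the real on-disk format is big-endian:

```c
#include <stdint.h>
#include <string.h>

#define FOO_MAGIC	0x464f4f31u	/* hypothetical magic number */

/* Hypothetical self-describing header, host-endian for brevity. */
struct foo_hdr {
	uint32_t magic;
	uint8_t  uuid[16];	/* filesystem identifier */
	uint64_t owner;		/* parent object */
	uint64_t blkno;		/* location on disk */
};

/*
 * Stateless identity/location checks: is this block what it claims to
 * be, located where we actually read it from, and part of the
 * filesystem identified by fs_uuid?
 */
static int foo_verify(const struct foo_hdr *hdr,
		      const uint8_t fs_uuid[16], uint64_t blkno)
{
	if (hdr->magic != FOO_MAGIC)
		return 0;	/* wrong type of metadata entirely */
	if (memcmp(hdr->uuid, fs_uuid, 16) != 0)
		return 0;	/* misdirected write: wrong filesystem */
	if (hdr->blkno != blkno)
		return 0;	/* misplaced write within the filesystem */
	if (hdr->owner == 0)
		return 0;	/* no owner recorded */
	/* object specific verification checks would go here */
	return 1;
}
```

Note that the location passed in comes from where the buffer was read, not
from the buffer contents - that is what makes the check meaningful.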

Write verifiers are very similar to read verifiers; they just do things in
the opposite order. A typical write verifier:

static void
xfs_foo_write_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_target->bt_mount;
	struct xfs_buf_log_item	*bip = bp->b_fspriv;

	if (!xfs_foo_verify(bp)) {
		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		xfs_buf_ioerror(bp, EFSCORRUPTED);
		return;
	}

	if (!xfs_sb_version_hascrc(&mp->m_sb))
		return;

	if (bip) {
		struct xfs_ondisk_hdr	*hdr = bp->b_addr;

		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	}
	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
}

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we
then update the LSN field (when it was last modified) and calculate the CRC
on the metadata. Once this is done, we can issue the IO.

Inodes and Dquots
-----------------

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform
basic identification of the buffer - that they contain inodes or dquots, and
that there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to
the buffer.
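
The split between buffer-level and per-object checks can be sketched as
follows. The packed object format, magic value and helper names are invented
for illustration, and the CRC32c here is a slow bitwise version for
self-containment:

```c
#include <stddef.h>
#include <stdint.h>

#define OBJ_MAGIC	0x4f424a31u	/* hypothetical per-object magic */

/* Hypothetical packed on-disk object; several fit in one buffer. */
struct obj {
	uint32_t magic;
	uint32_t payload;
	uint32_t crc;		/* covers the fields before it */
};

/* Bitwise CRC32c (Castagnoli), illustrative only. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
	}
	return ~crc;
}

/*
 * Buffer-level check: only that the magic numbers are in the expected
 * spots, like the inode/dquot buffer verifiers described above.
 */
static int buf_verify(const struct obj *objs, int nr)
{
	for (int i = 0; i < nr; i++)
		if (objs[i].magic != OBJ_MAGIC)
			return 0;
	return 1;
}

/* Per-object check, done as each object is read out of the buffer. */
static int obj_verify(const struct obj *o)
{
	return o->crc == crc32c(0, o, offsetof(struct obj, crc));
}

/* Per-object CRC update, done as each object is written back. */
static void obj_stamp(struct obj *o)
{
	o->crc = crc32c(0, o, offsetof(struct obj, crc));
}
```

The key property this illustrates: corrupting one object's payload leaves the
buffer-level magic check passing while the per-object CRC check fails, which
is also why any in-buffer modification must re-stamp the object's CRC.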

The structure of the verifiers and the identifier checks is very similar to
the buffer code described above. The only difference is where they are
called. For example, inode read verification is done in xfs_iread() when the
inode is first read out of the buffer and the struct xfs_inode is
instantiated. The inode is already extensively verified during writeback in
xfs_iflush_int, so the only addition here is to add the LSN and CRC to the
inode as it is copied back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None
of the unlinked list modifications check or update CRCs, neither during
unlink nor log recovery. So it has gone unnoticed until now. This won't
matter immediately - repair will probably complain about it - but it needs to
be fixed.