dm-log-writes
=============

This target takes two devices: one that all IO passes through normally, and one
to which all of the write operations are logged. This is intended for file
system developers wishing to verify the integrity of metadata or data as the
file system is written to. There is a log_write_entry written for every WRITE
request, and the target is able to take arbitrary data from userspace to insert
into the log. The data in the WRITE requests is copied into the log so that the
replay happens exactly as the writes happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This makes it easier for userspace to replay the log
in a way that correlates to what is on disk and not what is in cache, and so to
detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures. Consider the following example (W means write, C means complete):

    W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

    W3,W2,flush,W1....

Again, this is to simulate what is actually on disk. It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.
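
As an illustration only, the ordering rules above can be modeled in a few lines
of POSIX shell. This is a toy simulation, not the kernel implementation. The
function name simulate_log and the WF<n>/CF<n> tokens for a submitted/completed
FUA write are made up for this sketch; W<n>, C<n>, Wflush and Cflush follow the
notation of the example.

```shell
# Toy model of the log ordering rules; NOT the kernel code.
# Events: W<n>/C<n>   = normal write n submitted/completed
#         WF<n>/CF<n> = FUA write n submitted/completed (hypothetical tokens)
#         Wflush/Cflush = flush submitted/completed
simulate_log() (
    pending=""    # completed writes waiting for the next flush
    attached=""   # writes spliced onto the in-flight flush
    log=""        # entries in the order they reach the log
    for ev in "$@"; do
        case "$ev" in
            Wflush) attached="$pending"; pending="" ;;
            Cflush) log="$log$attached flush"; attached="" ;;
            CF*)    log="$log FUA${ev#CF}" ;;         # FUA: logged on completion
            C*)     pending="$pending W${ev#C}" ;;    # held until the next flush
            W*)     ;;                                # submission logs nothing
        esac
    done
    # Writes still pending here would only be logged by a later flush.
    set -- $log
    IFS=,
    echo "$*"
)

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
# prints: W3,W2,flush
```

W1 is missing from the output because it completed after the flush was issued;
it would only be logged by the next flush, which is the trailing "W1...." in
the example.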

Any REQ_DISCARD requests are treated like WRITE requests. Otherwise we would
log all the DISCARD requests first, then the WRITE requests, and then the FLUSH
request. Consider the following example:

    WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

    DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.
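
To see why the order matters, here is a minimal sketch that replays both
sequences onto a single imaginary block and reports its final state. The
function name replay_block and the "data"/"empty" states are invented for this
illustration; a real replay writes to a block device.

```shell
# Toy replay of log entries onto one block; prints the block's final state.
# WRITE stores data, DISCARD drops it, FLUSH does not change the contents.
replay_block() (
    state=empty
    for entry in "$@"; do
        case "$entry" in
            WRITE)   state=data ;;
            DISCARD) state=empty ;;
            FLUSH)   ;;
        esac
    done
    echo "$state"
)

replay_block WRITE DISCARD FLUSH   # what actually happened; prints: empty
replay_block DISCARD WRITE FLUSH   # DISCARD logged on completion; prints: data
```

The two replays leave the block in different states, so a log that recorded
the DISCARD early would not reproduce what ended up on disk.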

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   dev_path     : Device that all of the IO will go to normally.
   log_dev_path : Device where the log entries are written to.
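
As a small sketch, the constructor line can be assembled into a device-mapper
table with a helper like the one below. The helper name log_writes_table and
the sector count are hypothetical; the "0 <length> log-writes ..." layout and
the device names are the ones used in the example usage in this document.

```shell
# Build a one-line device-mapper table for the log-writes target.
# $1 = device length in 512-byte sectors
# $2 = dev_path, $3 = log_dev_path
log_writes_table() {
    printf '0 %s log-writes %s %s\n' "$1" "$2" "$3"
}

# Hypothetical length; on a real system it would come from something like
# "blockdev --getsz /dev/sdb".
log_writes_table 2097152 /dev/sdb /dev/sdc
# prints: 0 2097152 log-writes /dev/sdb /dev/sdc
```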

ii) Status

    <#logged entries> <highest allocated sector>

    #logged entries          : Number of logged entries
    highest allocated sector : Highest allocated sector
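
A script can pick these two fields out of a status line. The sketch below
assumes the line has the usual "dmsetup status <name>" shape, that is start,
length and target type followed by the target-specific fields. The function
name parse_log_writes_status and the sample numbers are made up.

```shell
# Extract the two log-writes status fields from one status line.
# Assumed layout: <start> <length> log-writes <#logged entries> <highest sector>
parse_log_writes_status() {
    # Word-split the line into positional parameters.
    set -- $1
    # Refuse lines that belong to a different target type.
    [ "$3" = "log-writes" ] || return 1
    echo "entries=$4 highest_sector=$5"
}

# Hypothetical sample line; the numbers are invented.
parse_log_writes_status "0 2097152 log-writes 42 81920"
# prints: entries=42 highest_sector=81920
```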

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you're fsck'ing something reasonable. You would do something
        like this:

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system. You would do something like
this:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<some test that does fsync at the end>
dmsetup message log 0 mark fsync
md5sum /mnt/btrfs-test/foo
umount /mnt/btrfs-test

dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/btrfs-test
md5sum /mnt/btrfs-test/foo
<verify md5sums are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<fsstress to dirty the fs>
btrfs filesystem balance /mnt/btrfs-test
umount /mnt/btrfs-test
dmsetup remove log

replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
btrfsck /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
    --fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request and run the fsck command.
If the fsck passes, it will replay to the next FUA, and so on until the replay
is completed or the fsck command exits abnormally.
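
The check-at-every-FUA behaviour described above can be sketched as a simple
loop. This is a toy model and not how replay-log is implemented: the function
name replay_check_fua is invented, the entries are plain words instead of real
log records, and the check is an arbitrary command standing in for fsck.

```shell
# Toy model of "replay to the next FUA, then run a check".
# $1 = check command (run with sh -c); remaining arguments = entry kinds.
replay_check_fua() (
    check=$1
    shift
    n=0
    for entry in "$@"; do
        n=$((n + 1))
        # A real replay would write the entry to the replay device here.
        if [ "$entry" = FUA ]; then
            # Stop at the first failing check, like an abnormal fsck exit.
            sh -c "$check" || { echo "check failed after entry $n"; exit 1; }
        fi
    done
    echo "replayed $n entries"
)

replay_check_fua true WRITE FUA WRITE FUA    # prints: replayed 4 entries
replay_check_fua false WRITE FUA WRITE       # prints: check failed after entry 2
```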