Collection of MFS sources, documentation and a single header library

bzt 36d89da4c1 Initial commit hace 11 meses
fuse 36d89da4c1 Initial commit hace 11 meses
linux 36d89da4c1 Initial commit hace 11 meses
mfstool 36d89da4c1 Initial commit hace 11 meses
minix 36d89da4c1 Initial commit hace 11 meses
windows 36d89da4c1 Initial commit hace 11 meses
LICENSE 36d89da4c1 Initial commit hace 11 meses
README.md 36d89da4c1 Initial commit hace 11 meses
mfs.h 36d89da4c1 Initial commit hace 11 meses
test.c 36d89da4c1 Initial commit hace 11 meses

README.md

The Minix3 File System

Collection of Minix3 File System sources, documentation, and an MIT licensed, minimal, stb-style single header MFS library.

[[TOC]]

Preface

Lot of hobby OS developers use FAT, which is a schame, because we have a MUCH better alternative, the Minix3 file system (MFS). Both has a limit of 2^32 (4G - 1) on file size, but MFS allows not just 8.3 ASCII names, uses smaller portion of the disk for it's own purpose and it is much easier to implement. The only advantage FAT has over Minix is, that FAT is very well documented, while MFS has almost zero documentation and even what it has is extremely poor quality at least to say. Sure, they expect you to buy the book (which I did of course), but there's not much information about the actual on disk layout in the book either.

So I've decided to collect the sources and write a good documentation on the Minix3 File System, with code examples and lots of explanations, focusing on how files are stored (and hopefully not getting lost in the details). The goal is, that with the help of this repository hobby OS developers realize how simple it is to implement MFS, even though it's features are far more superior to FAT, like UTF-8 file names and UNIX permissions. With a little luck, one day MFS can be used for interchaning files among hobby OSes.

In the Tanenbaum book you are told to take a look at the Minix3 source code (archived from here, 2493 SLoC). But that is very hard to read, because it went through a couple of iterations and dead code weren't cleaned up; also comments in it focus on the API, not on the file system. When you read this source, keep in mind: block is the same as zone, and ignore all expressions with s_log_zone_size in it, because those are not supported any more (see super.c line 278).

To come to the rescue, this repo ships a dependency-free and memory allocation-free, read-only MFS driver in ANSI C. This is as minimal as it gets, just 193 SLoC only, but full of comments (about 150 more lines).

Other sources include the Linux kernel's Minix file system driver (archived from here, don't be afraid, this part is actually small, 2003 SLoC and easy to read, but also supports MFS1 and MFS2, not just the Minix3 File System), and mfstools (archived from here, likewise simple and easy to read, 1583 and this even includes a small genfs tool to create such file system images).

On Linux, you can use mfstool's genfs, or mkfs.minix -3 imagefile.bin that your distro ships; for Windows there's an Open Source read / write MFS driver (archived from here, you can study this latter's source as well, also quite simple, 2281 SLoC). On MacOS (and Linux), you can use the FUSE driver (archived from here, biggest of all, 3463 SLoC, but ships fsck.mfs and mkfs.mfs tools too).

On Disk Layout

First things first, the overall disk layout goes as follows:

0       2           M               I               D                       last block
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ ... +---+
|  sb   | imap      | zmap          | inode table   | data                    |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ ... +---+

The MFS organizes the disk in blocks, so each unit above is a block. This is independent to the underlying storage's actual sector size, it's a logical block. There are two restrictions on the size: a block must be a multiple of 512 bytes and at least 1024 bytes long. We don't know yet how big a block is, we must parse the superblock to figure that out.

Starting from the second block, there's the inode allocation map, which is a simple bitmap (not needed for a read-only fs driver).

Starting at M is the block (zone) allocation map, a similar bitmap (not needed for a read-only fs driver).

Starting at I is the inode table, which is a list of continuous inode structure records, each 64 bytes long. This is the one that we're mostly interested in.

Everything after this, starting from D, considered data blocks and stores the actual file contents data.

Superblock

Offset Size Description
0 1024 reserved for boot loader
1024 4 s_ninodes total number of inodes
1028 2 always zero
1030 2 s_imap_blocks number of inode allocation map blocks
1032 2 s_zmap_blocks number of block allocation map blocks
1034 10 not really important, don't care
1044 4 s_zones total number of blocks in this file system
1048 2 s_magic the bytes Z and M to identify MFS
1050 2 always zero
1052 2 s_block_size size of one block in bytes (1024 to 32768)
1054 2 s_disk_version file system sub-version, not really used

The first order of our business is to read 3 consecutive 512 byte sectors into memory, and check if the byte at offset 1048 is Z and the byte at offset 1049 is M. If not then we don't have a Minix3 File System on this floppy / disk / partition.

Once we confirmed that it is an MFS superblock that we're looking at, then s_block_size will tell us in bytes how big unit is used for this file system. This is the main building block, everything is expressed as a block number in the file system. This way it is independent to the hardware sector size. The most common block size these days is 4096.

Now we can calculate the following values:

  • sector per block, that's s_block_size / 512. This is how many hardware sectors we need to read to get one logical block.
  • the start of the inode allocation map (imap) is always 2.
  • M, the start of the block allocation map (zmap), is 2 + s_imap_blocks.
  • I, the start of the inode table is 2 + s_imap_blocks + s_zmap_blocks.
  • D, the start of the actual data is 2 + s_imap_blocks + s_zmap_blocks + s_ninodes * 64 / s_block_size.

As you can see, all are trivial. The last one is a little bit more complicated, but actually we won't need that for a read-only fs driver, as inodes store absolute block numbers (relative to the beginning of the file system).

Inode Allocation Map

Only needed when a file is created or deleted. This is a very very simple bitmap, the simplest data format there can be. This has one bit for each inode slot in the inode table, 0 means the inode slot is free, and 1 means the inode slot is in use. Inode 0 isn't used, so the first bit belongs to inode 1.

  • get which block contains the allocation info for inode N: (N - 1) / 8 / s_block_size + 2.
  • get the byte offset within that block: ((N - 1) / 8) % s_block_size.
  • get the bit mask within that byte: 1 << ((N - 1) & 7).

Block Allocation Map

Only needed for write operations. Just the same as above, a very simple bitmap, where 0 means that a block is free, and 1 means it is used by the file system. Since the file system metadata cannot be freed, the first D blocks are not stored in this bitmap.

  • get which block contains the allocation info for block N: (N - D) / 8 / s_block_size + 2 + s_imap_blocks.
  • get the byte offset within that block: ((N - D) / 8) % s_block_size.
  • get the bit mask within that byte: 1 << ((N - D) & 7).

Inode Table

This is the interesting part. All files and directories in the MFS file system must have an inode. This is what stores all the meta data about that file. Inode 0 isn't used, so the first slot belongs to inode 1, which is also the root of the file system.

  • get which block contains the structure for inode N: (N - 1) * 64 / s_block_size + 2 + s_imap_blocks + s_zmap_blocks.
  • get the structure's offset within that block: ((N - 1) * 64) % s_block_size.

Now for each inode structure, we have the following 64 bytes:

Offset Size Description
0 2 i_mode file permissions and type
2 2 i_nlinks number of directory entries with this inode
4 2 i_usr file owner's numeric id
6 2 i_grp file group's numeric id
8 4 i_size total size of the file in bytes
12 4 i_atime when was file data last accessed
16 4 i_mtime when was file data last changed
20 4 i_ctime when was inode data last changed
24 10*4 i_zone[10] file allocation info

The i_mode stores file access permissions on the least significant 12 bits, and the type is above that. So to check the inode's type, you must clear the permission bits, aka. have to do something like (i_mode & 0xF000) == S_IFREG. All bits are standardized by POSIX, so you can find the defines on every UNIX in /usr/include/sys/stat.h. Here we are only interested in three types: S_IFDIR (0x4000 a directory), S_IFREG (0x8000 a regular file) and S_IFLNK (0xA000 a symbolic link).

The date time fields i_atime, i_mtime and i_ctime store the number of seconds passed since 1st January 1970 midnight, but we won't need these.

And we come to the last, and most interesting part, the i_zone[] array. This is what MFS uses instead of a FAT table, this is where the file system stores which data blocks belong to the contents of this file.

File Offset Mappings

Direct Mapping (file offsets 0 .. 7 * s_block_size - 1)

Values in i_zone[0] to i_zone[6] are data block numbers, meaning they are pointing to the data blocks directly. To get the block number of the first block (offsets 0 to s_block_size - 1) in a file, you'd need i_zone[0]. Similarly, to get the second block's number (offsets s_block_size to 2 * s_block_size - 1) you'd need i_zone[1].

i_zone[0] -> data block
i_zone[1] -> data block
 ...
i_zone[6] -> data block
  • get the data block's number: i_zone[offset / s_block_size]
  • get the data offset within block: offset % s_block_size.

Indirect Mapping (file offsets 7 * s_block_size .. 7 * s_block_size + s_block_size / 4 * s_block_size - 1)

Value in i_zone[7] is a tricky one, it does not point to a data block. Instead it points to a block which contains further 4 bytes block numbers (s_block_size / 4 in total). Each of these block numbers point to the actual data block instead. So you must read an additional block into memory, let's say into uint32_t block_list[1024] (the actual dimension of the array is s_block_size / 4).

i_zone[7] -> block_list[0]    -> data block
             block_list[1]    -> data block
             block_list[2]    -> data block
              ...
             block_list[1022] -> data block
             block_list[1023] -> data block    (only if s_block_size == 4096)
  • get the block numbers: i_zone[7]
  • get the data block's number: block_list[(offset / s_block_size) - 7]
  • get the data offset within block: offset % s_block_size.

Double Indirect Mapping (file offsets above)

Value in i_zone[8] is even trickier. It's double indirect, so it points to a block which contains block numbers, which in turn point to another blocks with block numbers, and only those last block numbers point to data blocks.

i_zone[8] -> block_list[0] -> block_list[0] -> data block
                              block_list[1] -> data block
                              block_list[2] -> data block
                               ...
             block_list[1] -> block_list[0] -> data block
                              block_list[1] -> data block
                              block_list[2] -> data block
                               ...
              ...
  • get the 1st level block numbers: i_zone[8]
  • get the 2nd level block numbers: block_list[((offset / s_block_size) - 7 - s_block_size / 4) / (s_block_size / 4)]
  • get the data block's number: block_list[((offset / s_block_size) - 7 - s_block_size / 4) % (s_block_size / 4)]
  • get the data offset within block: offset % s_block_size.

In theory, i_zone[9] could be used for triple indirect lists, but Minix3 does not implement that. Regardless the MFS inode has place for triple indirection, so feel free to add it in your code.

Directories

Now some files are actually directories (when (i_mode & 0xF000) == S_IFDIR). These are stored exactly the same way like regular files, the only difference is, they contain 64 byte records, so their i_size is always multiple of 64. Each record looks like:

Offset Size Description
0 4 d_ino inode number for this entry (0 = deleted entry)
4 60 d_name[60] zero terminated UTF-8 file name for this entry

That simple. Multiple records may reference the same inode by d_ino (the number of such records is kept track in i_nlinks and these are called hard links in contrast to symbolic links). At least one record should point to an inode (i_nlinks is always at least 1). Without such reference, the data blocks, indirect blocks with block numbers as well as the slot in the inode table can be freed (except for the root directory of course, inode != 1).

Although Minix originally only used ASCII file names, as far as I can tell, there's absolutely no issue with storing an UTF-8 string in d_name[]. The file extension is not stored separatedly, it is simply part of the name, and d_name[] is zero terminated.

To traverse a path (like /usr/bin/bash for example), you start by reading in the root directory, which is stored on inode 1 (the very first structure in the inode table). Then you iterate on its contents (which is a list of directory entries), looking for the one which has usr in d_name[]. When found, you get the data for the inode d_ino. This is again should be a directory, so you iterate on its contents (again, list of directory entries), looking for bin in its d_name[]. And so forth. When you reach the last part of the path, that should contain a d_ino pointing to a regular file.

Actually the directory separator is not stored on disk, so you are free to use anything you'd like, \usr\bin\bash or \Windows\System32 even [usr.bin.bash] can work just as well.

You are not limited to UNIX paths with MFS.

Symbolic Links

These are again stored just like regular files, but they contain a path. When you traverse a path and encounter a symbolic link, then you should get the inode of the path it targets, replace the current directory's inode with that path's inode and keep on parsing the remaining part of the original path.

What is problematic that symbolic link target paths might point to another device if you support mount points. If you only have a single file system on a single device then this isn't an issue.

Example Code

You can find an example dependency-free and memory allocation-free, read-only MFS implementation in an stb-style single header ANSI C library. Libc not used, just memcpy, memset and memcmp as built-ins.

Include mfs.h, and in exactly one of your source files, also define MFS_IMPLEMENTATION. In that file you might also define your desired directory separator character (defaults to / slash), the maximum path's length (defaults to 1024) and provide a sector read function.

#define MFS_IMPLEMENTATION
#define MFS_DIRSEP '/'
#define PATH_MAX 1024
#include "mfs.h"

void loadsec(uint32_t lba, uint32_t cnt, void *dst)
{
    lba += partition_lba;
    ...
}

This function should read cnt sectors of 512 bytes from the starting lba address (also in 512 byte sectors) to buffer dst. If your file system resides on a partition, then don't forget to add the partition's starting address to lba.

After this, you can use

uint32_t mfs_open(const char *fn);

To open a file. This returns the size of the file if found, 0 if not, and -1 if Minix3 File System wasn't recorgnized on this disk. As a bonus, this implementation also handles symbolic links! Although it's not a fully featured implementation, only supports symlinks in the file name part of the path and not in the directory part. But it operates on the path string only without further disk access and it's extremely simple minimal code.

uint32_t mfs_read(uint32_t offs, uint32_t size, void *buf);

Reads in size bytes from byte offset offs into buf from the previously opened file. Returns the number of bytes read or 0 on error.

void mfs_close(void);

Closes the previously opened file.

Example Data

Taken from the book version of Minix3 (latest doesn't even boot, complains about some kmod missing, so I couldn't use that to create an image).

Superblock

From the beginning:

00000000  31 c0 8e d8 fa 8e d0 bc  00 7c fb 50 50 52 89 e5  |1........|.PPR..|  <- boot loader code and data
00000010  06 56 bf 0f 7d 84 d2 7d  18 26 c4 44 08 89 46 02  |.V..}..}.&.D..F.|     (most of it removed for clearity)
*
00000400  00 20 00 00 00 00 01 00  01 00 84 00 00 00 00 00  |. ..............|  <- the MFS superblock
00000410  ff ff ff ff 00 10 00 00  5a 4d 00 00 00 10 00 00  |........ZM......|     s_imap_blocks = 1
00000420  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     s_zmap_blocks = 1
*                                                                                  s_block_size = 0x1000 = 4096
00001000

Inode Table

From (2 + 1 + 1) * 4096:

00004000  ed 41 0e 00 00 00 00 00  00 05 00 00 44 04 79 65  |.A..........D.ye|  <- inode 1 (the root directory)
00004010  44 04 79 65 44 04 79 65  84 00 00 00 00 00 00 00  |D.yeD.ye........|     i_mode = 0x4000, S_IFDIR
00004020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     i_zone[0] = 0x84
*
00004040  ed 41 02 00 00 00 00 00  80 00 00 00 43 04 79 65  |.A..........C.ye|  <- inode 2 (directory)
00004050  43 04 79 65 43 04 79 65  85 00 00 00 00 00 00 00  |C.yeC.ye........|     i_mode = 0x4000, S_IFDIR
00004060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     i_zone[0] = 0x85
*
00004080  ff 41 02 00 00 00 00 00  c0 00 00 00 43 04 79 65  |.A..........C.ye|  <- inode 3 (directory)
00004090  08 04 79 65 43 04 79 65  86 00 00 00 00 00 00 00  |..yeC.ye........|     i_mode = 0x4000, S_IFDIR
000040a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     i_zone[0] = 0x86
*
000040c0  c0 80 01 00 00 00 00 00  0c 00 00 00 43 04 79 65  |............C.ye|  <- inode 4 (file)
000040d0  2e 01 79 65 43 04 79 65  87 00 00 00 00 00 00 00  |..yeC.ye........|     i_mode = 0x8000, S_IFREG
000040e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     i_zone[0] = 0x87
*
00004100  ed 41 02 00 02 00 00 00  40 0a 00 00 43 04 79 65  |.A......@...C.ye|  <- inode 5 (directory)
00004110  ea 80 55 43 43 04 79 65  88 00 00 00 00 00 00 00  |..UCC.ye........|     i_mode = 0x4000, S_IFDIR
00004120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     i_zone[0] = 0x88
*
00004140  ed 81 01 00 02 00 00 00  88 19 00 00 43 04 79 65  |............C.ye|  <- inode 6 (file)
00004150  eb 7f 55 43 43 04 79 65  89 00 00 00 8a 00 00 00  |..UCC.ye........|     i_mode = 0x8000, S_IFREG
00004160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     i_zone[0] = 0x89
*

Directory Entries

This is from the root directory, 0x84 (i_zone[0]) * 4096 (s_block_size) = 0x84000.

00084000  01 00 00 00 2e 00 00 00  00 00 00 00 00 00 00 00  |................|  <- current dir, "."
00084010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     d_ino = 1
*
00084040  01 00 00 00 2e 2e 00 00  00 00 00 00 00 00 00 00  |................|  <- parent dir, ".."
00084050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     d_ino = 1
*
00084080  02 00 00 00 75 73 72 00  00 00 00 00 00 00 00 00  |....usr.........|  <- "usr"
00084090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     d_ino = 2
*
000840c0  03 00 00 00 74 6d 70 00  00 00 00 00 00 00 00 00  |....tmp.........|
000840d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00084100  05 00 00 00 62 69 6e 00  00 00 00 00 00 00 00 00  |....bin.........|
00084110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00084140  00 00 00 00 43 44 00 00  00 00 00 00 00 00 00 00  |....CD..........|  <- a deleted entry
00084150  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     d_ino = 0, so skip
*
00084170  00 00 00 00 00 00 00 00  00 00 00 00 1d 00 00 00  |................|
00084180  1e 00 00 00 65 74 63 00  00 00 00 00 00 00 00 00  |....etc.........|  <- list continues normally
00084190  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|     d_ino = 30
*
000841c0  33 00 00 00 73 62 69 6e  00 00 00 00 00 00 00 00  |3...sbin........|
000841d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

That's all! Hope it's going to be useful!

bzt