Collection of MFS sources, documentation and a single header library
bzt 36d89da4c1 Initial commit | hace 11 meses | |
---|---|---|
fuse | hace 11 meses | |
linux | hace 11 meses | |
mfstool | hace 11 meses | |
minix | hace 11 meses | |
windows | hace 11 meses | |
LICENSE | hace 11 meses | |
README.md | hace 11 meses | |
mfs.h | hace 11 meses | |
test.c | hace 11 meses |
Collection of Minix3 File System sources, documentation, and an MIT licensed, minimal, stb-style single header MFS library.
[[TOC]]
Lot of hobby OS developers use FAT, which is a schame, because we have a MUCH better alternative, the Minix3 file system (MFS). Both has a limit of 2^32 (4G - 1) on file size, but MFS allows not just 8.3 ASCII names, uses smaller portion of the disk for it's own purpose and it is much easier to implement. The only advantage FAT has over Minix is, that FAT is very well documented, while MFS has almost zero documentation and even what it has is extremely poor quality at least to say. Sure, they expect you to buy the book (which I did of course), but there's not much information about the actual on disk layout in the book either.
So I've decided to collect the sources and write a good documentation on the Minix3 File System, with code examples and lots of explanations, focusing on how files are stored (and hopefully not getting lost in the details). The goal is, that with the help of this repository hobby OS developers realize how simple it is to implement MFS, even though it's features are far more superior to FAT, like UTF-8 file names and UNIX permissions. With a little luck, one day MFS can be used for interchaning files among hobby OSes.
In the Tanenbaum book you are told to take a look at the Minix3 source code (archived from
here, 2493 SLoC). But that is very hard
to read, because it went through a couple of iterations and dead code weren't cleaned up; also comments in it focus on the API,
not on the file system. When you read this source, keep in mind: block is the same as zone, and ignore all expressions with
s_log_zone_size
in it, because those are not supported any more (see super.c line 278).
To come to the rescue, this repo ships a dependency-free and memory allocation-free, read-only MFS driver in ANSI C. This is as minimal as it gets, just 193 SLoC only, but full of comments (about 150 more lines).
Other sources include the Linux kernel's Minix file system driver (archived from here, don't be afraid, this part is actually small, 2003 SLoC and easy to read, but also supports MFS1 and MFS2, not just the Minix3 File System), and mfstools (archived from here, likewise simple and easy to read, 1583 and this even includes a small genfs tool to create such file system images).
On Linux, you can use mfstool's genfs, or mkfs.minix -3 imagefile.bin
that your distro ships; for Windows there's an Open Source
read / write MFS driver (archived from here, you can study this latter's source as
well, also quite simple, 2281 SLoC). On MacOS (and Linux), you can use the FUSE driver (archived from
here, biggest of all, 3463 SLoC, but ships fsck.mfs and mkfs.mfs tools too).
First things first, the overall disk layout goes as follows:
0 2 M I D last block
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ ... +---+
| sb | imap | zmap | inode table | data |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ ... +---+
The MFS organizes the disk in blocks, so each unit above is a block. This is independent to the underlying storage's actual sector size, it's a logical block. There are two restrictions on the size: a block must be a multiple of 512 bytes and at least 1024 bytes long. We don't know yet how big a block is, we must parse the superblock to figure that out.
Starting from the second block, there's the inode allocation map, which is a simple bitmap (not needed for a read-only fs driver).
Starting at M
is the block (zone) allocation map, a similar bitmap (not needed for a read-only fs driver).
Starting at I
is the inode table, which is a list of continuous inode structure records, each 64 bytes long. This is the one that
we're mostly interested in.
Everything after this, starting from D
, considered data blocks and stores the actual file contents data.
Offset | Size | Description |
---|---|---|
0 | 1024 | reserved for boot loader |
1024 | 4 | s_ninodes total number of inodes |
1028 | 2 | always zero |
1030 | 2 | s_imap_blocks number of inode allocation map blocks |
1032 | 2 | s_zmap_blocks number of block allocation map blocks |
1034 | 10 | not really important, don't care |
1044 | 4 | s_zones total number of blocks in this file system |
1048 | 2 | s_magic the bytes Z and M to identify MFS |
1050 | 2 | always zero |
1052 | 2 | s_block_size size of one block in bytes (1024 to 32768) |
1054 | 2 | s_disk_version file system sub-version, not really used |
The first order of our business is to read 3 consecutive 512 byte sectors into memory, and check if the byte at offset 1048 is Z
and the byte at offset 1049 is M
. If not then we don't have a Minix3 File System on this floppy / disk / partition.
Once we confirmed that it is an MFS superblock that we're looking at, then s_block_size
will tell us in bytes how big unit is
used for this file system. This is the main building block, everything is expressed as a block number in the file system. This way
it is independent to the hardware sector size. The most common block size these days is 4096.
Now we can calculate the following values:
s_block_size / 512
. This is how many hardware sectors we need to read to get one logical block.2
.M
, the start of the block allocation map (zmap), is 2 + s_imap_blocks
.I
, the start of the inode table is 2 + s_imap_blocks + s_zmap_blocks
.D
, the start of the actual data is 2 + s_imap_blocks + s_zmap_blocks + s_ninodes * 64 / s_block_size
.As you can see, all are trivial. The last one is a little bit more complicated, but actually we won't need that for a read-only fs driver, as inodes store absolute block numbers (relative to the beginning of the file system).
Only needed when a file is created or deleted. This is a very very simple bitmap, the simplest data format there can be. This has one bit for each inode slot in the inode table, 0 means the inode slot is free, and 1 means the inode slot is in use. Inode 0 isn't used, so the first bit belongs to inode 1.
(N - 1) / 8 / s_block_size + 2
.((N - 1) / 8) % s_block_size
.1 << ((N - 1) & 7)
.Only needed for write operations. Just the same as above, a very simple bitmap, where 0 means that a block is free, and 1 means it
is used by the file system. Since the file system metadata cannot be freed, the first D
blocks are not stored in this bitmap.
(N - D) / 8 / s_block_size + 2 + s_imap_blocks
.((N - D) / 8) % s_block_size
.1 << ((N - D) & 7)
.This is the interesting part. All files and directories in the MFS file system must have an inode. This is what stores all the meta data about that file. Inode 0 isn't used, so the first slot belongs to inode 1, which is also the root of the file system.
(N - 1) * 64 / s_block_size + 2 + s_imap_blocks + s_zmap_blocks
.((N - 1) * 64) % s_block_size
.Now for each inode structure, we have the following 64 bytes:
Offset | Size | Description |
---|---|---|
0 | 2 | i_mode file permissions and type |
2 | 2 | i_nlinks number of directory entries with this inode |
4 | 2 | i_usr file owner's numeric id |
6 | 2 | i_grp file group's numeric id |
8 | 4 | i_size total size of the file in bytes |
12 | 4 | i_atime when was file data last accessed |
16 | 4 | i_mtime when was file data last changed |
20 | 4 | i_ctime when was inode data last changed |
24 | 10*4 | i_zone[10] file allocation info |
The i_mode
stores file access permissions on the least significant 12 bits, and the type is above that. So to check the inode's
type, you must clear the permission bits, aka. have to do something like (i_mode & 0xF000) == S_IFREG
. All bits are standardized
by POSIX, so you can find the defines on every UNIX in /usr/include/sys/stat.h
. Here we are only interested in three types:
S_IFDIR
(0x4000 a directory), S_IFREG
(0x8000 a regular file) and S_IFLNK
(0xA000 a symbolic link).
The date time fields i_atime
, i_mtime
and i_ctime
store the number of seconds passed since 1st January 1970 midnight, but we
won't need these.
And we come to the last, and most interesting part, the i_zone[]
array. This is what MFS uses instead of a FAT table, this is
where the file system stores which data blocks belong to the contents of this file.
Values in i_zone[0]
to i_zone[6]
are data block numbers, meaning they are pointing to the data blocks directly. To get the
block number of the first block (offsets 0 to s_block_size - 1) in a file, you'd need i_zone[0]
. Similarly, to get the second
block's number (offsets s_block_size to 2 * s_block_size - 1) you'd need i_zone[1]
.
i_zone[0] -> data block
i_zone[1] -> data block
...
i_zone[6] -> data block
i_zone[offset / s_block_size]
offset % s_block_size
.Value in i_zone[7]
is a tricky one, it does not point to a data block. Instead it points to a block which contains further 4 bytes
block numbers (s_block_size / 4 in total). Each of these block numbers point to the actual data block instead. So you must read an
additional block into memory, let's say into uint32_t block_list[1024]
(the actual dimension of the array is s_block_size / 4).
i_zone[7] -> block_list[0] -> data block
block_list[1] -> data block
block_list[2] -> data block
...
block_list[1022] -> data block
block_list[1023] -> data block (only if s_block_size == 4096)
i_zone[7]
block_list[(offset / s_block_size) - 7]
offset % s_block_size
.Value in i_zone[8]
is even trickier. It's double indirect, so it points to a block which contains block numbers, which in turn
point to another blocks with block numbers, and only those last block numbers point to data blocks.
i_zone[8] -> block_list[0] -> block_list[0] -> data block
block_list[1] -> data block
block_list[2] -> data block
...
block_list[1] -> block_list[0] -> data block
block_list[1] -> data block
block_list[2] -> data block
...
...
i_zone[8]
block_list[((offset / s_block_size) - 7 - s_block_size / 4) / (s_block_size / 4)]
block_list[((offset / s_block_size) - 7 - s_block_size / 4) % (s_block_size / 4)]
offset % s_block_size
.In theory, i_zone[9]
could be used for triple indirect lists, but Minix3 does not implement that. Regardless the MFS inode has
place for triple indirection, so feel free to add it in your code.
Now some files are actually directories (when (i_mode & 0xF000) == S_IFDIR
). These are stored exactly the same way like regular
files, the only difference is, they contain 64 byte records, so their i_size
is always multiple of 64. Each record looks like:
Offset | Size | Description |
---|---|---|
0 | 4 | d_ino inode number for this entry (0 = deleted entry) |
4 | 60 | d_name[60] zero terminated UTF-8 file name for this entry |
That simple. Multiple records may reference the same inode by d_ino
(the number of such records is kept track in i_nlinks
and
these are called hard links in contrast to symbolic links). At least one record should point to an inode (i_nlinks
is always
at least 1). Without such reference, the data blocks, indirect blocks with block numbers as well as the slot in the inode table can
be freed (except for the root directory of course, inode != 1).
Although Minix originally only used ASCII file names, as far as I can tell, there's absolutely no issue with storing an UTF-8 string
in d_name[]
. The file extension is not stored separatedly, it is simply part of the name, and d_name[]
is zero terminated.
To traverse a path (like /usr/bin/bash
for example), you start by reading in the root directory, which is stored on inode 1
(the very first structure in the inode table). Then you iterate on its contents (which is a list of directory entries), looking for
the one which has usr
in d_name[]
. When found, you get the data for the inode d_ino
. This is again should be a directory, so
you iterate on its contents (again, list of directory entries), looking for bin
in its d_name[]
. And so forth. When you reach
the last part of the path, that should contain a d_ino
pointing to a regular file.
Actually the directory separator is not stored on disk, so you are free to use anything you'd like, \usr\bin\bash
or
\Windows\System32
even [usr.bin.bash]
can work just as well.
You are not limited to UNIX paths with MFS.
These are again stored just like regular files, but they contain a path. When you traverse a path and encounter a symbolic link, then you should get the inode of the path it targets, replace the current directory's inode with that path's inode and keep on parsing the remaining part of the original path.
What is problematic that symbolic link target paths might point to another device if you support mount points. If you only have a single file system on a single device then this isn't an issue.
You can find an example dependency-free and memory allocation-free, read-only MFS implementation in an stb-style single header
ANSI C library. Libc not used, just memcpy
, memset
and memcmp
as built-ins.
Include mfs.h, and in exactly one of your source files, also define
MFS_IMPLEMENTATION
. In that file you might also define your desired directory separator character (defaults to /
slash), the
maximum path's length (defaults to 1024) and provide a sector read function.
#define MFS_IMPLEMENTATION
#define MFS_DIRSEP '/'
#define PATH_MAX 1024
#include "mfs.h"
void loadsec(uint32_t lba, uint32_t cnt, void *dst)
{
lba += partition_lba;
...
}
This function should read cnt
sectors of 512 bytes from the starting lba
address (also in 512 byte sectors) to buffer dst
.
If your file system resides on a partition, then don't forget to add the partition's starting address to lba
.
After this, you can use
uint32_t mfs_open(const char *fn);
To open a file. This returns the size of the file if found, 0 if not, and -1 if Minix3 File System wasn't recorgnized on this disk. As a bonus, this implementation also handles symbolic links! Although it's not a fully featured implementation, only supports symlinks in the file name part of the path and not in the directory part. But it operates on the path string only without further disk access and it's extremely simple minimal code.
uint32_t mfs_read(uint32_t offs, uint32_t size, void *buf);
Reads in size
bytes from byte offset offs
into buf
from the previously opened file. Returns the number of bytes read or 0 on
error.
void mfs_close(void);
Closes the previously opened file.
Taken from the book version of Minix3 (latest doesn't even boot, complains about some kmod missing, so I couldn't use that to create an image).
From the beginning:
00000000 31 c0 8e d8 fa 8e d0 bc 00 7c fb 50 50 52 89 e5 |1........|.PPR..| <- boot loader code and data
00000010 06 56 bf 0f 7d 84 d2 7d 18 26 c4 44 08 89 46 02 |.V..}..}.&.D..F.| (most of it removed for clearity)
*
00000400 00 20 00 00 00 00 01 00 01 00 84 00 00 00 00 00 |. ..............| <- the MFS superblock
00000410 ff ff ff ff 00 10 00 00 5a 4d 00 00 00 10 00 00 |........ZM......| s_imap_blocks = 1
00000420 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| s_zmap_blocks = 1
* s_block_size = 0x1000 = 4096
00001000
From (2 + 1 + 1) * 4096:
00004000 ed 41 0e 00 00 00 00 00 00 05 00 00 44 04 79 65 |.A..........D.ye| <- inode 1 (the root directory)
00004010 44 04 79 65 44 04 79 65 84 00 00 00 00 00 00 00 |D.yeD.ye........| i_mode = 0x4000, S_IFDIR
00004020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| i_zone[0] = 0x84
*
00004040 ed 41 02 00 00 00 00 00 80 00 00 00 43 04 79 65 |.A..........C.ye| <- inode 2 (directory)
00004050 43 04 79 65 43 04 79 65 85 00 00 00 00 00 00 00 |C.yeC.ye........| i_mode = 0x4000, S_IFDIR
00004060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| i_zone[0] = 0x85
*
00004080 ff 41 02 00 00 00 00 00 c0 00 00 00 43 04 79 65 |.A..........C.ye| <- inode 3 (directory)
00004090 08 04 79 65 43 04 79 65 86 00 00 00 00 00 00 00 |..yeC.ye........| i_mode = 0x4000, S_IFDIR
000040a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| i_zone[0] = 0x86
*
000040c0 c0 80 01 00 00 00 00 00 0c 00 00 00 43 04 79 65 |............C.ye| <- inode 4 (file)
000040d0 2e 01 79 65 43 04 79 65 87 00 00 00 00 00 00 00 |..yeC.ye........| i_mode = 0x8000, S_IFREG
000040e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| i_zone[0] = 0x87
*
00004100 ed 41 02 00 02 00 00 00 40 0a 00 00 43 04 79 65 |.A......@...C.ye| <- inode 5 (directory)
00004110 ea 80 55 43 43 04 79 65 88 00 00 00 00 00 00 00 |..UCC.ye........| i_mode = 0x4000, S_IFDIR
00004120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| i_zone[0] = 0x88
*
00004140 ed 81 01 00 02 00 00 00 88 19 00 00 43 04 79 65 |............C.ye| <- inode 6 (file)
00004150 eb 7f 55 43 43 04 79 65 89 00 00 00 8a 00 00 00 |..UCC.ye........| i_mode = 0x8000, S_IFREG
00004160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| i_zone[0] = 0x89
*
This is from the root directory, 0x84 (i_zone[0]) * 4096 (s_block_size) = 0x84000.
00084000 01 00 00 00 2e 00 00 00 00 00 00 00 00 00 00 00 |................| <- current dir, "."
00084010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| d_ino = 1
*
00084040 01 00 00 00 2e 2e 00 00 00 00 00 00 00 00 00 00 |................| <- parent dir, ".."
00084050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| d_ino = 1
*
00084080 02 00 00 00 75 73 72 00 00 00 00 00 00 00 00 00 |....usr.........| <- "usr"
00084090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| d_ino = 2
*
000840c0 03 00 00 00 74 6d 70 00 00 00 00 00 00 00 00 00 |....tmp.........|
000840d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00084100 05 00 00 00 62 69 6e 00 00 00 00 00 00 00 00 00 |....bin.........|
00084110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00084140 00 00 00 00 43 44 00 00 00 00 00 00 00 00 00 00 |....CD..........| <- a deleted entry
00084150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| d_ino = 0, so skip
*
00084170 00 00 00 00 00 00 00 00 00 00 00 00 1d 00 00 00 |................|
00084180 1e 00 00 00 65 74 63 00 00 00 00 00 00 00 00 00 |....etc.........| <- list continues normally
00084190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| d_ino = 30
*
000841c0 33 00 00 00 73 62 69 6e 00 00 00 00 00 00 00 00 |3...sbin........|
000841d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
That's all! Hope it's going to be useful!
bzt