
F2FS: Understanding a Filesystem Implementation through the mkfs.f2fs Source Code

Preface

The previous article summarized F2FS at a high level by reading the paper; this one digs further into F2FS through source-code analysis. On one hand, we can get started by studying F2FS's early commits; on the other hand, the mkfs.f2fs tool from the same period shows us what a freshly initialized filesystem looks like. This early code is certainly less stable than today's, but it makes the features described in the paper much easier to follow. In this article we look at the on-disk layout of a freshly formatted F2FS volume (i.e. its initialized state), and we unfold the details by debugging the source of the user-space mkfs.f2fs tool rather than the kernel.

Overview: On-disk layout

(Figure: F2FS on-disk layout)

As mentioned in the previous article, the F2FS layout is divided into 6 areas, as shown above. So what does its initialized state look like? We can find out by debugging mkfs.f2fs and inspecting the concrete data structures. Some preparation before getting started:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
cd f2fs-tools

# Find a commit close in time to the kernel code discussed above and check it out
git checkout -b jintianxiaomidaobilema 036d45e551ca5405c726f8ccb51f446620cd4af4

cd mkfs

# Create a roughly 1GB image for debugging
dd if=/dev/zero of=f2fs.img bs=1024 count=1000000

# Build mkfs.f2fs; -g3 keeps macro information available for debugging
gcc f2fs_format.c -g3 -o mkfs.f2fs

The F2FS data structures "on paper" can be read directly here; the author's comments are extremely detailed.

First set a breakpoint right before the main initialization routine f2fs_format_device() returns, run with the start-up arguments, and then print the f2fs_params variable. This variable holds the options parsed before initialization, so it also shows us what the default behaviour looks like.

(gdb) b f2fs_format.c:1273
(gdb) r f2fs.img
(gdb) set print pretty on
(gdb) p f2fs_params
$1 = {
  sector_size = 512,
  reserved_segments = 25,
  overprovision = 5,
  cur_seg = {473, 1, 0, 476, 475, 474},
  segs_per_sec = 1,
  secs_per_zone = 1,
  start_sector = 0,
  total_sectors = 2000000,
  sectors_per_blk = 8,
  blks_per_seg = 512,
  vol_label = "F2FS", '\000' <repeats 11 times>,
  heap = 1,
  fd = 3,
  device_name = 0x7fffffffe137 "f2fs.img",
  extension_list = 0x0
}

A few observations:

  • overprovision is the percentage of space reserved for GC; it is not exposed to users. reserved_segments is derived from overprovision (a small sketch of the related arithmetic follows the quoted man page below)
  • Heap-based allocation is on by default (heap = 1), which in practice affects how cur_seg is assigned: as quoted below, data is allocated from the beginning of the main area while node is allocated from the end
  • fd is only used by writetodisk() to write to the device/image file; it is not itself stored in the filesystem
" overprovision-ratio-percentage"
Specify the percentage over the volume size for overprovision area. This area
is hidden to users, and utilized by F2FS cleaner. The default percentage is 5%.


" heap-based-allocation"
Specify 1 or 0 to enable/disable heap based block allocation policy.
If the value is equal to 1, each of active log areas are initially
assigned separately according to the whole volume size.
The default value is 1.


... heap-style segment allocation which finds free segments for data
from the beginning of main area, while for node from the end of main
area.
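
To make the "hidden from users" part concrete, below is a small sketch of the arithmetic (my own illustration; it reuses the overprovision formula that appears later in f2fs_write_check_point_pack(), while the derivation of reserved_segments itself lives in f2fs_prepare_super_block() and is not reproduced here):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* values taken from the gdb dumps: 478 main segments, 25 reserved, 5% ratio */
        uint32_t segment_count_main = 478;
        uint32_t reserved_segments  = 25;
        uint32_t overprovision      = 5;

        /* same arithmetic as the checkpoint code quoted further below */
        uint32_t overprov_segment_count =
                (segment_count_main - reserved_segments) * overprovision / 100
                + reserved_segments;

        /* everything in the overprovision area is invisible to the user */
        printf("overprovision segments = %u out of %u\n",
               overprov_segment_count, segment_count_main);     /* 47 out of 478 */
        return 0;
}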

Super block initialization

The data structure in f2fs.h looks like this:

/*
 * For superblock
 */
struct f2fs_super_block {
        __le32 magic;                   /* Magic Number */
        __le16 major_ver;               /* Major Version */
        __le16 minor_ver;               /* Minor Version */
        __le32 log_sectorsize;          /* log2 sector size in bytes */
        __le32 log_sectors_per_block;   /* log2 # of sectors per block */
        __le32 log_blocksize;           /* log2 block size in bytes */
        __le32 log_blocks_per_seg;      /* log2 # of blocks per segment */
        __le32 segs_per_sec;            /* # of segments per section */
        __le32 secs_per_zone;           /* # of sections per zone */
        __le32 checksum_offset;         /* checksum offset inside super block */
        __le64 block_count;             /* total # of user blocks */
        __le32 section_count;           /* total # of sections */
        __le32 segment_count;           /* total # of segments */
        __le32 segment_count_ckpt;      /* # of segments for checkpoint */
        __le32 segment_count_sit;       /* # of segments for SIT */
        __le32 segment_count_nat;       /* # of segments for NAT */
        __le32 segment_count_ssa;       /* # of segments for SSA */
        __le32 segment_count_main;      /* # of segments for main area */
        __le32 segment0_blkaddr;        /* start block address of segment 0 */
        __le32 cp_blkaddr;              /* start block address of checkpoint */
        __le32 sit_blkaddr;             /* start block address of SIT */
        __le32 nat_blkaddr;             /* start block address of NAT */
        __le32 ssa_blkaddr;             /* start block address of SSA */
        __le32 main_blkaddr;            /* start block address of main area */
        __le32 root_ino;                /* root inode number */
        __le32 node_ino;                /* node inode number */
        __le32 meta_ino;                /* meta inode number */
        __u8 uuid[16];                  /* 128-bit uuid for volume */
        __le16 volume_name[512];        /* volume name */
        __le32 extension_count;         /* # of extensions below */
        __u8 extension_list[F2FS_MAX_EXTENSION][8];     /* extension array */
} __packed;

We again use gdb to dump the initialized state, as shown below:

(gdb) p super_block
$2 = {
  magic = 4076150800,
  major_ver = 1,
  minor_ver = 0,
  log_sectorsize = 9,
  log_sectors_per_block = 3,
  log_blocksize = 12,
  log_blocks_per_seg = 9,
  segs_per_sec = 1,
  secs_per_zone = 1,
  checksum_offset = 0,
  block_count = 250000,
  section_count = 478,
  segment_count = 487,
  segment_count_ckpt = 2,
  segment_count_sit = 2,
  segment_count_nat = 4,
  segment_count_ssa = 1,
  segment_count_main = 478,
  failure_safe_block_distance = 0,
  segment0_blkaddr = 512,
  start_segment_checkpoint = 512,
  sit_blkaddr = 1536,
  nat_blkaddr = 2560,
  ssa_blkaddr = 4608,
  main_blkaddr = 5120,
  root_ino = 3,
  node_ino = 1,
  meta_ino = 2,
  volume_serial_number = 0,
  volume_name = {70, 50, 70, 83, 0 <repeats 508 times>},
  extension_count = 22,
  extension_list = {"jpg\000\000\000\000", "gif\000\000\000\000", "png\000\000\000\000", "avi\000\000\000\000", "divx\000\000\000", "mp4\000\000\000\000", "mp3\000\000\000\000",
    "3gp\000\000\000\000", "wmv\000\000\000\000", "wma\000\000\000\000", "mpeg\000\000\000", "mkv\000\000\000\000", "mov\000\000\000\000", "asx\000\000\000\000", "asf\000\000\000\000",
    "wmx\000\000\000\000", "svi\000\000\000\000", "wvx\000\000\000\000", "wm\000\000\000\000\000", "mpg\000\000\000\000", "mpe\000\000\000\000", "rm\000\000\000\000\000",
    "ogg\000\000\000\000", "\000\000\000\000\000\000\000" <repeats 41 times>}
}

The super block part is fairly easy to understand: besides the basic block-device block/sector configuration and the various segment counts, it also reflects the default 1:1:1 segment-section-zone hierarchy and the start block address of each area (*_blkaddr). (Note that the gdb dump comes from the tool's own copy of the structure, which presumably explains why a few field names, e.g. start_segment_checkpoint vs. cp_blkaddr, differ slightly from the f2fs.h definition above.)
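
As a quick sanity check (my own sketch, not code from mkfs.f2fs), the *_blkaddr values in the dump follow directly from stacking the metadata areas back to back, each area occupying segment_count_* × blks_per_seg blocks:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t blks_per_seg     = 512;
        uint32_t segment0_blkaddr = 512;        /* from the super block dump */

        uint32_t cp_blkaddr   = segment0_blkaddr;
        uint32_t sit_blkaddr  = cp_blkaddr  + 2 * blks_per_seg; /* segment_count_ckpt = 2 */
        uint32_t nat_blkaddr  = sit_blkaddr + 2 * blks_per_seg; /* segment_count_sit  = 2 */
        uint32_t ssa_blkaddr  = nat_blkaddr + 4 * blks_per_seg; /* segment_count_nat  = 4 */
        uint32_t main_blkaddr = ssa_blkaddr + 1 * blks_per_seg; /* segment_count_ssa  = 1 */

        /* expected: cp=512 sit=1536 nat=2560 ssa=4608 main=5120 */
        printf("cp=%u sit=%u nat=%u ssa=%u main=%u\n",
               cp_blkaddr, sit_blkaddr, nat_blkaddr, ssa_blkaddr, main_blkaddr);
        return 0;
}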

SIT initialization

/**
 * @brief       It initialize SIT Data structure
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_init_sit_area(void)
{
        u_int32_t blk_size_bytes;
        u_int32_t seg_size_bytes;
        u_int32_t index = 0;
        u_int64_t sit_seg_blk_offset = 0;
        u_int8_t *zero_buf = NULL;

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        seg_size_bytes = (1 << le32_to_cpu(super_block.log_blocks_per_seg)) *
                                blk_size_bytes;

        zero_buf = calloc(sizeof(u_int8_t), seg_size_bytes);
        if(zero_buf == NULL) {
                printf("\n\tError: Calloc Failed for sit_zero_buf!!!\n");
                return -1;
        }

        sit_seg_blk_offset = le32_to_cpu(super_block.sit_blkaddr) *
                                                blk_size_bytes;

        for (index = 0;
                index < (le32_to_cpu(super_block.segment_count_sit) / 2);
                                                                index++) {
                if (writetodisk(f2fs_params.fd, zero_buf, sit_seg_blk_offset,
                                        seg_size_bytes) < 0) {
                        printf("\n\tError: While zeroing out the sit area \
                                        on disk!!!\n");
                        return -1;
                }
                sit_seg_blk_offset = sit_seg_blk_offset + seg_size_bytes;
        }

        free(zero_buf);
        return 0 ;
}

/**
 * @brief       It writes buffer to disk or storage meant to be formatted
 *              with F2FS.
 * @param       fd File descriptor for device
 * @param       buf buffer to be written
 * @param       offset where to bw written on the device
 * @param       length length of the device
 * @return      0 if success
 */
static int writetodisk(int32_t fd, void *buf, u_int64_t offset, size_t length)
{
        if (lseek64(fd, offset, SEEK_SET) < 0) {
                printf("\n\tError: While lseek to the derised location!!!\n");
                return -1;
        }

        if (write(fd, buf, length) < 0) {
                printf("\n\tError: While writing data to the disk!!! Error Num : \
                                %d\n", errno);
                return -1;
        }

        return 0;
}

The SIT initialization function f2fs_init_sit_area() does not touch any in-memory data structure; its core is simply writetodisk() writing to the on-disk SIT area.

From the earlier dump we know super_block.segment_count_sit = 2, i.e. the SIT area actually occupies 2 segments, but only half of it (1 segment) is written out to the image here, and what is written is all zeroes (zero_buf).
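
Why 2 segments, and why zero only half of them? My understanding (an assumption, to be checked against f2fs_prepare_super_block()) is that the SIT keeps two copies of every SIT block so that checkpoints can flip between them, and mkfs only needs to clear the first copy. A rough sizing sketch, assuming a 74-byte packed f2fs_sit_entry (2 + 64 + 8 bytes), i.e. 55 entries per 4KB block:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t segment_count_main  = 478;             /* from the super block dump */
        uint32_t blks_per_seg        = 512;
        uint32_t sit_entry_per_block = 4096 / 74;       /* assumed: 55 */

        /* blocks needed to describe every main-area segment once */
        uint32_t sit_blocks = (segment_count_main + sit_entry_per_block - 1) /
                              sit_entry_per_block;      /* 9 */
        /* two copies, rounded up to whole segments */
        uint32_t sit_segments = 2 * ((sit_blocks + blks_per_seg - 1) / blks_per_seg);

        printf("SIT blocks per copy = %u, segment_count_sit = %u\n",
               sit_blocks, sit_segments);               /* 9 and 2 */
        return 0;
}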

NAT initialization

For now the NAT is initialized exactly the same way as the SIT; the only difference is that the NAT occupies 4 segments, so the rest is omitted.

Note: during the root directory initialization that follows, the NAT is updated again for the root inode.

Root directory initialization

The next step of the flow is f2fs_create_root_dir(), which boils down to 3 steps:

  1. f2fs_write_root_inode(): create the root inode in the main area
  2. f2fs_update_nat_root(): continue initializing the NAT, i.e. record the root inode's address mapping there
  3. f2fs_add_default_dentry_root(): add the default dentries for the root inode

Let's walk through this code next

/**
 * @brief       It initializes and writes root inode on device.
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_write_root_inode(void)
{
        struct f2fs_node *raw_node = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t data_blk_nor;
        u_int64_t main_area_node_seg_blk_offset = 0;

        raw_node = calloc(sizeof(struct f2fs_node), 1);
        if (raw_node == NULL) {
                printf("\n\tError: Calloc Failed for raw_node!!!\n");
                return -1;
        }

        // set the root inode number
        raw_node->footer.nid = super_block.root_ino;
        raw_node->footer.ino = super_block.root_ino;
        raw_node->footer.cp_ver = cpu_to_le64(1);
        // point to the next block
        raw_node->footer.next_blkaddr = cpu_to_le32(
                        le32_to_cpu(super_block.main_blkaddr) +
                        f2fs_params.cur_seg[CURSEG_HOT_NODE] *
                        f2fs_params.blks_per_seg + 1);

        raw_node->i.i_mode = cpu_to_le16(0x41ed);
        raw_node->i.i_links = cpu_to_le32(2);
        raw_node->i.i_uid = cpu_to_le32(getuid());
        raw_node->i.i_gid = cpu_to_le32(getgid());

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        // file size in different units (bytes / blocks)
        raw_node->i.i_size = cpu_to_le64(1 * blk_size_bytes); /* dentry */
        raw_node->i.i_blocks = cpu_to_le64(2);

        raw_node->i.i_ctime = cpu_to_le32(time(NULL));
        raw_node->i.i_ctime_nsec = 0;
        raw_node->i.i_mtime = cpu_to_le32(time(NULL));
        raw_node->i.i_mtime_nsec = 0;
        raw_node->i.i_xattr_nid = 0;
        raw_node->i.i_flags = 0;
        raw_node->i.current_depth = cpu_to_le32(1);

        // location of HOT DATA
        data_blk_nor = le32_to_cpu(super_block.main_blkaddr) +
                f2fs_params.cur_seg[CURSEG_HOT_DATA] * f2fs_params.blks_per_seg;
        // i_addr[ADDRS_PER_INODE=927] holds pointers to data blocks
        raw_node->i.i_addr[0] = cpu_to_le32(data_blk_nor);

        raw_node->i.i_ext.fofs = 0;
        raw_node->i.i_ext.blk_addr = cpu_to_le32(data_blk_nor);
        raw_node->i.i_ext.len = cpu_to_le32(1);

        // first take the start address of the main area
        main_area_node_seg_blk_offset = le32_to_cpu(super_block.main_blkaddr);
        // location of HOT NODE
        main_area_node_seg_blk_offset += f2fs_params.cur_seg[CURSEG_HOT_NODE] *
                                        f2fs_params.blks_per_seg;
        main_area_node_seg_blk_offset *= blk_size_bytes;

        // write the root inode to the HOT NODE offset
        if (writetodisk(f2fs_params.fd, raw_node, main_area_node_seg_blk_offset,
                                sizeof(struct f2fs_node)) < 0) {
                printf("\n\tError: While writing the raw_node to disk!!!\n");
                return -1;
        }

        memset(raw_node, 0xff, sizeof(struct f2fs_node));

        // from the next block onwards, smear everything with 0xff
        if (writetodisk(f2fs_params.fd, raw_node,
                                main_area_node_seg_blk_offset + 4096,
                                sizeof(struct f2fs_node)) < 0) {
                printf("\n\tError: While writing the raw_node to disk!!!\n");
                return -1;
        }
        free(raw_node);
        return 0;
}

/**
 * @brief       It updates NAT for root Inode
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_update_nat_root(void)
{
        // a block-sized structure holding 4k/sizeof(f2fs_nat_entry) f2fs_nat_entry items
        // each entry is a triple <version, ino, block_addr>: simply an ino-to-address mapping entry
        struct f2fs_nat_block *nat_blk = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t nat_seg_blk_offset = 0;

        nat_blk = calloc(sizeof(struct f2fs_nat_block), 1);
        if(nat_blk == NULL) {
                printf("\n\tError: Calloc Failed for nat_blk!!!\n");
                return -1;
        }

        /* update root */
        // record the root inode's ino-to-address mapping; the address is already known from the analysis above
        nat_blk->entries[super_block.root_ino].block_addr = cpu_to_le32(
                le32_to_cpu(super_block.main_blkaddr) +
                f2fs_params.cur_seg[CURSEG_HOT_NODE] * f2fs_params.blks_per_seg);
        nat_blk->entries[super_block.root_ino].ino = super_block.root_ino;

        // what remains are the node inode (ino=1) and the meta inode (ino=2)

        /* update node nat */
        // not sure why this is 1 (presumably just a non-zero placeholder, since node_ino/meta_ino have no real on-disk block)
        nat_blk->entries[super_block.node_ino].block_addr = cpu_to_le32(1);
        nat_blk->entries[super_block.node_ino].ino = super_block.node_ino;

        /* update meta nat */
        nat_blk->entries[super_block.meta_ino].block_addr = cpu_to_le32(1);
        nat_blk->entries[super_block.meta_ino].ino = super_block.meta_ino;

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);

        // locate the NAT area and write it there
        nat_seg_blk_offset = le32_to_cpu(super_block.nat_blkaddr) *
                                                        blk_size_bytes;

        if (writetodisk(f2fs_params.fd, nat_blk, nat_seg_blk_offset,
                                sizeof(struct f2fs_nat_block)) < 0) {
                printf("\n\tError: While writing the nat_blk set0 to disk!!!\n");
                return -1;
        }

        free(nat_blk);
        return 0;
}

/**
 * @brief       It updates default dentries in Root Inode
 * @param       None
 * @return      0 if success
 */
static int8_t f2fs_add_default_dentry_root(void)
{
        // some background on the dentry block: roughly a bitmap plus a number of dentries, see the bonus section below for details
        struct f2fs_dentry_block *dent_blk = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t data_blk_offset = 0;

        dent_blk = calloc(sizeof(struct f2fs_dentry_block), 1);
        if(dent_blk == NULL) {
                printf("\n\tError: Calloc Failed for dent_blk!!!\n");
                return -1;
        }

        // fill in the dentries belonging to the root inode (. and ..)
        dent_blk->dentry[0].hash_code = 0;
        dent_blk->dentry[0].ino = super_block.root_ino;
        dent_blk->dentry[0].name_len = cpu_to_le16(1);
        dent_blk->dentry[0].file_type = F2FS_FT_DIR;
        memcpy(dent_blk->filename[0], ".", 1);

        dent_blk->dentry[1].hash_code = 0;
        dent_blk->dentry[1].ino = super_block.root_ino;
        dent_blk->dentry[1].name_len = cpu_to_le16(2);
        dent_blk->dentry[1].file_type = F2FS_FT_DIR;
        memcpy(dent_blk->filename[1], "..", 2);

        /* bitmap for . and .. */
        // hand-written bitmap: slots 0 and 1 are now occupied
        dent_blk->dentry_bitmap[0] = (1 << 1) | (1 << 0);
        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        data_blk_offset = (le32_to_cpu(super_block.main_blkaddr) +
                        f2fs_params.cur_seg[CURSEG_HOT_DATA] *
                        f2fs_params.blks_per_seg) * blk_size_bytes;

        // write to HOT DATA
        if (writetodisk(f2fs_params.fd, dent_blk, data_blk_offset,
                                sizeof(struct f2fs_dentry_block)) < 0) {
                printf("\n\tError: While writing the dentry_blk to disk!!!\n");
                return -1;
        }

        free(dent_blk);
        return 0;
}

What f2fs_write_root_inode() does is much like filling in a VFS-level inode. Some F2FS-specific handling includes:

  • the root inode is written to HOT NODE, i.e. where the node logs start
  • the root inode's i_addr[0] acts as a pointer to data and points into HOT DATA

The root-inode-related output is shown below (note that i sits in a union with dn/in, so the output of the latter two is meaningless here):

(gdb) b f2fs_format.c:1051

(gdb) info locals
raw_node = 0x4098c0
blk_size_bytes = 4096
data_blk_nor = 247296
main_area_node_seg_blk_offset = 1019215872

# pretty print is temporarily turned off here to keep the output compact
(gdb) p *raw_node
$3 = { {i = {i_mode = 16877, i_advise = 0 '\000', i_reserved = 0 '\000', i_uid = 1000,
      i_gid = 1000, i_links = 2, i_size = 4096, i_blocks = 2, i_ctime = 1696424848,
      i_mtime = 1696424848, i_ctime_nsec = 0, i_mtime_nsec = 0, current_depth = 1,
      i_xattr_nid = 0, i_flags = 0, i_pino = 0, i_namelen = 0,
      i_name = '\000' <repeats 255 times>, i_ext = {fofs = 0, blk_addr = 247296, len = 1},
      i_addr = {247296, 0 <repeats 926 times>}, i_nid = {0, 0, 0, 0, 0}}, dn = {addr = {16877,
        1000, 1000, 2, 4096, 0, 2, 0, 1696424848, 0, 1696424848, 0, 0, 0, 1,
        0 <repeats 69 times>, 247296, 1, 247296, 0 <repeats 931 times>}}, in = {nid = {16877,
        1000, 1000, 2, 4096, 0, 2, 0, 1696424848, 0, 1696424848, 0, 0, 0, 1,
        0 <repeats 69 times>, 247296, 1, 247296, 0 <repeats 931 times>}}}, footer = {nid = 3,
    ino = 3, flag = 0, cp_ver = 1, next_blkaddr = 248833}}
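
A quick cross-check of the dumped values against the super block (my own arithmetic, using main_blkaddr = 5120, blks_per_seg = 512 and 4KB blocks):

  • data_blk_nor = main_blkaddr + cur_seg[CURSEG_HOT_DATA] × blks_per_seg = \(5120 + 473 \times 512 = 247296\), which is exactly i_addr[0] and i_ext.blk_addr
  • the root inode lands at block \(5120 + 476 \times 512 = 248832\), i.e. byte offset \(248832 \times 4096 = 1019215872\) = main_area_node_seg_blk_offset
  • footer.next_blkaddr = \(248832 + 1 = 248833\)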

f2fs_update_nat_root() picks up where the earlier NAT initialization left off: the NAT still has to record the mapping for the root inode. Concretely, one block holds a batch of mapping entries, the ino-to-address value is filled in, and the block is then written to the NAT area.
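
As a small aside on why a single NAT block is enough here (my own arithmetic, assuming the 9-byte packed f2fs_nat_entry defined further below): one 4KB NAT block holds 4096 / 9 = 455 entries, so nids 1-3 all land in block 0:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        const uint32_t nat_entry_per_block = 4096 / 9;  /* 455 */

        for (uint32_t nid = 1; nid <= 3; nid++)         /* node, meta, root */
                printf("nid %u -> NAT block %u, slot %u\n",
                       nid, nid / nat_entry_per_block, nid % nat_entry_per_block);
        return 0;
}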

f2fs_add_default_dentry_root() simply fills in the dentry information the root inode needs: it handles . and .., sets the bitmap, and writes the block to HOT DATA.
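
A side note on that hand-written bitmap (a minimal sketch of my own, not code from mkfs.f2fs): slot i of the dentry block is in use when bit i of dentry_bitmap is set in little-endian bit order (bit 0 is the LSB of byte 0), which is exactly what (1 << 1) | (1 << 0) encodes for the . and .. slots:

#include <stdio.h>

/* hypothetical helper mirroring what the dentry bitmap encodes:
 * slot i is occupied iff bit (i % 8) of byte (i / 8) is set */
static int dentry_slot_in_use(const unsigned char *bitmap, int i)
{
        return (bitmap[i / 8] >> (i % 8)) & 1;
}

int main(void)
{
        unsigned char dentry_bitmap[27] = {0};

        dentry_bitmap[0] = (1 << 1) | (1 << 0);         /* same value as mkfs.f2fs */

        for (int i = 0; i < 4; i++)
                printf("slot %d: %s\n", i,
                       dentry_slot_in_use(dentry_bitmap, i) ? "used" : "free");
        return 0;
}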

Bonus: a bit more about dentries

/* 4KB-sized directory entry block */
struct f2fs_dentry_block {
        /* validity bitmap for directory entries in each block */
        __u8 dentry_bitmap[SIZE_OF_DENTRY_BITMAP];
        __u8 reserved[SIZE_OF_RESERVED];
        struct f2fs_dir_entry dentry[NR_DENTRY_IN_BLOCK];
        __u8 filename[NR_DENTRY_IN_BLOCK][F2FS_NAME_LEN];
} __packed;

/* One directory entry slot representing F2FS_NAME_LEN-sized file name */
struct f2fs_dir_entry {
        __le32 hash_code;     /* hash code of file name */
        __le32 ino;           /* inode number */
        __le16 name_len;      /* lengh of file name */
        __u8 file_type;               /* file type */
} __packed;

The dentry fields are quite intuitive: locating a file comes down to little more than the file type, the name hash and the inode number.

A dentry block contains:

  • a bitmap for fast lookup and updates
  • some reserved bytes that are presumably of little use
  • the dentry array used to locate files
  • a filename string pool recording the names

Let's dig a bit deeper into the concrete layout of a dentry block

/* the number of dentry in a block */
#define NR_DENTRY_IN_BLOCK      214
#define F2FS_NAME_LEN           8       /* 256 Unicode */

#define SIZE_OF_DIR_ENTRY       11      /* by byte */
#define SIZE_OF_DENTRY_BITMAP   ((NR_DENTRY_IN_BLOCK + BITS_PER_BYTE - 1) / \
                                        BITS_PER_BYTE)
#define SIZE_OF_RESERVED        (PAGE_SIZE - ((SIZE_OF_DIR_ENTRY + \
                                F2FS_NAME_LEN) * \
                                NR_DENTRY_IN_BLOCK + SIZE_OF_DENTRY_BITMAP))

// a bit dizzying to read, so let's just print the values
// Note: after compiling with -g3 the sizes can be obtained directly in gdb
(gdb) macro expand SIZE_OF_DENTRY_BITMAP
expands to: ((214 + 8 - 1) / 8)
(gdb) p SIZE_OF_DENTRY_BITMAP
$4 = 27

(gdb) macro expand SIZE_OF_RESERVED
expands to: (4096 - ((11 + 8) * 214 + ((214 + 8 - 1) / 8)))
(gdb) p SIZE_OF_RESERVED
$5 = 3

In other words, a single dentry block:

  • stores 214 dentries, occupying \(214 \times 11 = 2354\) bytes
  • spends \(214 \times 8 = 1712\) bytes on the filename slots
  • needs a bitmap of \(\lceil \frac{214}{8} \rceil = 27\) bytes (rounded up)
  • leaves the remainder of the 4KB block as reserved: \(4096 - 2354 - 1712 - 27 = 3\) bytes

So the value 214 simply maximizes the utilization of the dentry block: one more entry would require another \(11 + 8 = 19\) bytes plus 1 bitmap bit, which no longer fits in the 3 reserved bytes.
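
A throwaway check (my own, not from the source) that 214 really is the largest entry count that fits in a 4KB block:

#include <stdio.h>

/* find the largest n such that n dentries (11 bytes each), n filename
 * slots (8 bytes each) and an n-bit bitmap still fit in 4096 bytes */
int main(void)
{
        int n = 0;

        while ((11 + 8) * (n + 1) + ((n + 1) + 7) / 8 <= 4096)
                n++;

        printf("max entries = %d, reserved = %d bytes\n",
               n, 4096 - (11 + 8) * n - (n + 7) / 8);   /* 214 and 3 */
        return 0;
}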

Checkpoint initialization

Flow

The checkpoint part of the initialization is also a sequence of steps, and it involves not only the checkpoint itself but also the summary block data structure.

/**
 * @brief       It writes check poiint pack on Check point Area
 * @param       None
 * @return      0 if succes
 */
static int8_t f2fs_write_check_point_pack(void)
{
        struct f2fs_checkpoint *ckp = NULL;
        struct f2fs_summary_block *sum = NULL;
        u_int32_t blk_size_bytes;
        u_int64_t cp_seg_blk_offset = 0;
        u_int32_t crc = 0;
        int i;

        ckp = calloc(F2FS_CP_BLOCK_SIZE, 1);
        if (ckp == NULL) {
                printf("\n\tError: Calloc Failed for f2fs_checkpoint!!!\n");
                return -1;
        }

        sum = calloc(sizeof(struct f2fs_summary_block), 1);
        if (sum == NULL) {
                printf("\n\tError: Calloc Failed for summay_node!!!\n");
                return -1;
        }

        // each checkpoint consists of 2 pages (blocks), placed at the head and the tail of the pack; during recovery, matching copies mean the pack is valid
        // the CP area also holds 2 checkpoints of different versions (checkpoint packs), updated alternately

        /* 1. cp page 1 of checkpoint pack 1 */
        // set the current version to 1
        ckp->checkpoint_ver = 1;
        // record the starting segment number of each current log; the concrete values were shown earlier
        ckp->cur_node_segno[0] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_HOT_NODE]);
        ckp->cur_node_segno[1] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_WARM_NODE]);
        ckp->cur_node_segno[2] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_COLD_NODE]);
        ckp->cur_data_segno[0] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_HOT_DATA]);
        ckp->cur_data_segno[1] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_WARM_DATA]);
        ckp->cur_data_segno[2] =
                cpu_to_le32(f2fs_params.cur_seg[CURSEG_COLD_DATA]);
        // the author notes that 16 logs are available in theory, but only 6 are used by default (3 for node, 3 for data); the unused ones are smeared with 0xff
        for (i = 3; i < MAX_ACTIVE_NODE_LOGS; i++) {
                ckp->cur_node_segno[i] = 0xffffffff;
                ckp->cur_data_segno[i] = 0xffffffff;
        }

        ckp->cur_node_blkoff[0] = cpu_to_le16(1);
        ckp->nat_upd_blkoff[0] = cpu_to_le16(1);
        ckp->cur_data_blkoff[0] = cpu_to_le16(1);
        // presumably because HOT DATA and HOT NODE each hold one valid block (the root dentry block and the root inode written during root directory initialization)
        ckp->valid_block_count = cpu_to_le64(2);
        ckp->rsvd_segment_count = cpu_to_le32(f2fs_params.reserved_segments);
        // for these fiddly calculations, see the dump further down
        ckp->overprov_segment_count = cpu_to_le32(
                        (le32_to_cpu(super_block.segment_count_main) -
                        le32_to_cpu(ckp->rsvd_segment_count)) *
                        f2fs_params.overprovision / 100);
        ckp->overprov_segment_count = cpu_to_le32(
                        le32_to_cpu(ckp->overprov_segment_count) +
                        le32_to_cpu(ckp->rsvd_segment_count));

        /* main segments - reserved segments - (node + data segments) */
        // free_segment_count here means the number of free segments in the main area
        // not sure why 6 is hardcoded... probably because the 6 active logs each occupy 1 segment?
        ckp->free_segment_count = cpu_to_le32(
                        le32_to_cpu(super_block.segment_count_main) - 6);
        // subtract the overprovision reservation
        ckp->user_block_count = cpu_to_le64(
                        ((le32_to_cpu(ckp->free_segment_count) + 6 -
                        le32_to_cpu(ckp->overprov_segment_count)) *
                         f2fs_params.blks_per_seg));
        // the checkpoint pack uses 5 blocks in total
        ckp->cp_pack_total_block_count = cpu_to_le32(5);
        ckp->cp_pack_start_sum = cpu_to_le32(1);
        ckp->valid_node_count = cpu_to_le32(1);
        ckp->valid_inode_count = cpu_to_le32(1);
        ckp->next_free_nid = cpu_to_le32(
                        le32_to_cpu(super_block.root_ino) + 1);

        // we saw earlier that only half of the SIT is actually used, hence the /2; same for the NAT below
        ckp->sit_ver_bitmap_bytesize = cpu_to_le32(
                        ((le32_to_cpu(super_block.segment_count_sit) / 2) <<
                         le32_to_cpu(super_block.log_blocks_per_seg)) / 8);

        ckp->nat_ver_bitmap_bytesize = cpu_to_le32(
                        ((le32_to_cpu(super_block.segment_count_nat) / 2) <<
                         le32_to_cpu(super_block.log_blocks_per_seg)) / 8);

        ckp->checksum_offset = cpu_to_le32(4092);

        crc = f2fs_cal_crc32(F2FS_SUPER_MAGIC, ckp,
                                        le32_to_cpu(ckp->checksum_offset));
        *((u_int32_t *)((unsigned char *)ckp +
                                le32_to_cpu(ckp->checksum_offset))) = crc;

        blk_size_bytes = 1 << le32_to_cpu(super_block.log_blocksize);
        // this is the start address of the checkpoint area
        cp_seg_blk_offset =
                le32_to_cpu(super_block.start_segment_checkpoint) * blk_size_bytes;

        // write this part out to disk
        if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
                                F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        /* 2. Prepare and write Segment summary for data blocks */
        // from here on the summary data structure comes into play; feel free to skip ahead to the summary analysis below
        SET_SUM_TYPE((&sum->footer), SUM_TYPE_DATA);

        // the first summary entry is the root inode
        sum->entries[0].nid = super_block.root_ino;
        sum->entries[0].bidx = 0;

        // the second block is filled with this summary
        cp_seg_blk_offset += blk_size_bytes;
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 3. Fill segment summary for data block to zero. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));

        cp_seg_blk_offset += blk_size_bytes;
        // the third block is filled with zeroes
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 4. Fill segment summary for data block to zero. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));

        /* inode sit for root */
        // record the SIT journal information, split into node and data, 3 entries each
        sum->n_sits = cpu_to_le16(6);
        sum->sit_j.entries[0].segno = ckp->cur_node_segno[0];
        sum->sit_j.entries[0].se.vblocks = cpu_to_le16((CURSEG_HOT_NODE << 10) | 1);
        f2fs_set_bit(0, sum->sit_j.entries[0].se.valid_map);
        sum->sit_j.entries[1].segno = ckp->cur_node_segno[1];
        sum->sit_j.entries[1].se.vblocks = cpu_to_le16((CURSEG_WARM_NODE << 10));
        sum->sit_j.entries[2].segno = ckp->cur_node_segno[2];
        sum->sit_j.entries[2].se.vblocks = cpu_to_le16((CURSEG_COLD_NODE << 10));

        /* data sit for root */
        sum->sit_j.entries[3].segno = ckp->cur_data_segno[0];
        sum->sit_j.entries[3].se.vblocks = cpu_to_le16((CURSEG_HOT_DATA << 10) | 1);
        f2fs_set_bit(0, sum->sit_j.entries[3].se.valid_map);
        sum->sit_j.entries[4].segno = ckp->cur_data_segno[1];
        sum->sit_j.entries[4].se.vblocks = cpu_to_le16((CURSEG_WARM_DATA << 10));
        sum->sit_j.entries[5].segno = ckp->cur_data_segno[2];
        sum->sit_j.entries[5].se.vblocks = cpu_to_le16((CURSEG_COLD_DATA << 10));

        cp_seg_blk_offset += blk_size_bytes;
        // the fourth block is filled with the SIT journal
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 5. cp page2 */
        cp_seg_blk_offset += blk_size_bytes;
        // the fifth block is filled with the tail copy of the checkpoint
        if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
                                F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        /* 6. cp page 1 of check point pack 2
         * Initiatialize other checkpoint pack with version zero
         */
        // for this checkpoint only the version number and the CRC change
        ckp->checkpoint_ver = 0;

        crc = f2fs_cal_crc32(F2FS_SUPER_MAGIC, ckp,
                                        le32_to_cpu(ckp->checksum_offset));
        *((u_int32_t *)((unsigned char *)ckp +
                                le32_to_cpu(ckp->checksum_offset))) = crc;

        // offset one segment (blks_per_seg = 512 blocks, which also equals one section here)
        // from the start address of the checkpoint area
        cp_seg_blk_offset = (le32_to_cpu(super_block.start_segment_checkpoint) +
                                f2fs_params.blks_per_seg) *
                                blk_size_bytes;
        // block 512 of the CP area records the version-0 checkpoint
        if (writetodisk(f2fs_params.fd, ckp,
                                cp_seg_blk_offset, F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        /* 7. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));
        SET_SUM_TYPE((&sum->footer), SUM_TYPE_DATA);
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 1 records zeroes
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 8. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 2 records zeroes
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 9. */
        memset(sum, 0, sizeof(struct f2fs_summary_block));
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 3 records zeroes
        if (writetodisk(f2fs_params.fd, sum, cp_seg_blk_offset,
                                sizeof(struct f2fs_summary_block)) < 0) {
                printf("\n\tError: While writing the sum_blk to disk!!!\n");
                return -1;
        }

        /* 10. cp page 2 of check point pack 2 */
        cp_seg_blk_offset += blk_size_bytes;
        // block 512 + 4 records the tail copy of the checkpoint
        if (writetodisk(f2fs_params.fd, ckp, cp_seg_blk_offset,
                                F2FS_CP_BLOCK_SIZE) < 0) {
                printf("\n\tError: While writing the ckp to disk!!!\n");
                return -1;
        }

        free(sum) ;
        free(ckp) ;
        return  0;
}

Summary block

In the flow above, the summary block looks rather opaque at first sight, so its data structures are worth sorting out

////////////////// the summary block as a whole


struct f2fs_summary_block {
        // #define ENTRIES_IN_SUM               512
        struct f2fs_summary entries[ENTRIES_IN_SUM];
        union {
                __le16 n_nats; // number of valid entries in nat_j.entries[] (the array length itself is fixed)
                __le16 n_sits; // number of valid entries in sit_j.entries[] (the array length itself is fixed)
        };
        // size = 4k - 512*sizeof(f2fs_summary) - sizeof(__le16) - sizeof(summary_footer) = 505 bytes
        union {
                struct nat_journal nat_j;
                struct sit_journal sit_j;
        };
        struct summary_footer footer;
} __attribute__((packed));

// records the reverse mapping, i.e. from a block back to its parent node
struct f2fs_summary {
        __le32 nid;             /* parent node id */
        union {
                __u8 reserved[3];
                struct {
                        __u8 version;           /* node version number */
                        __le16 bidx;            /* block index in parent node */
                } __attribute__((packed));
        };
} __attribute__((packed));


////////////////// the journal containers

// sizeof(struct nat_journal) = 505
// NAT_JOURNAL_ENTRIES = 38
struct nat_journal {
        struct nat_journal_entry entries[NAT_JOURNAL_ENTRIES];
        __u8 reserved[NAT_JOURNAL_RESERVED];
} __attribute__((packed));

// sizeof(struct sit_journal) = 505
// SIT_JOURNAL_ENTRIES = 6
struct sit_journal {
        struct sit_journal_entry entries[SIT_JOURNAL_ENTRIES];
        __u8 reserved[SIT_JOURNAL_RESERVED];
} __attribute__((packed));


////////////////// the individual journal entries


struct nat_journal_entry {
        __le32 nid;
        struct f2fs_nat_entry ne;
} __attribute__((packed));

// sizeof(struct sit_journal_entry) = 78
struct sit_journal_entry {
        __le32 segno;
        struct f2fs_sit_entry se;
} __attribute__((packed));


////////////////// the actual per-area entries


// we have already met the NAT triple above
struct f2fs_nat_entry {
        __u8    version;
        __le32  ino;
        __le32  block_addr;
} __attribute__((packed));

// the SIT had no elaborate initialization flow, which is why this one looks less familiar
struct f2fs_sit_entry {
        __le16 vblocks; // "valid blocks"; the checkpoint code above apparently packs the segment type into the high bits (<< 10) and the valid block count into the low 10 bits
        __u8 valid_map[SIT_VBLOCK_MAP_SIZE];
        __le64 mtime;
} __attribute__((packed));


////////////////// the footer odds and ends


// the footer sits at the tail of the summary block; it distinguishes the summary type and holds a checksum for consistency checking
struct summary_footer {
        unsigned char entry_type;
        __u32 check_sum;
} __attribute__((packed));

The data structures are numerous and not pretty to look at, but they are not hard to understand:

  • f2fs_summary_block: the block-sized on-disk unit holding the summary entries, with the remaining space used for a journal
  • f2fs_summary: the reverse mapping, recording the parent node and the block index within that parent
  • nat_journal/sit_journal: the journal containers, each holding a number of journal entries
  • nat_journal_entry/sit_journal_entry: a single journal entry, holding an id plus the concrete area entry
  • f2fs_nat_entry/f2fs_sit_entry: the actual entries the checkpoint needs to record, matching the data structures of the respective areas
  • summary_footer: stored at the tail of the summary block, used to distinguish the journal type and to hold the checksum
/* A superb explanatory diagram was drawn here, but you need to pay 998,244,353 CNY to unlock it */
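
In lieu of that diagram, here is how the pieces add up to exactly one 4KB block (my own arithmetic, assuming the packed sizes implied by the definitions above):

  • entries: \(512 \times 7 = 3584\) bytes (each f2fs_summary is 4 + 3 bytes)
  • the n_nats/n_sits union: 2 bytes
  • the journal union: \(4096 - 3584 - 2 - 5 = 505\) bytes, matching the sizeof comments above; inside it, the NAT flavour holds \(38 \times 13 = 494\) bytes of entries plus 11 reserved bytes, and the SIT flavour holds \(6 \times 78 = 468\) bytes of entries plus 37 reserved bytes
  • footer: \(1 + 4 = 5\) bytes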

Auxiliary debug information

The debug output is shown below, with a breakpoint set just before the free() calls:

(gdb) b f2fs_format.c:956

(gdb) info locals
ckp = 0x4098c0
sum = 0x40a8d0
blk_size_bytes = 4096
cp_seg_blk_offset = 4210688
crc = 1257347361
i = 8

(gdb) p *ckp
$6 = {
  checkpoint_ver = 0,
  user_block_count = 220672,
  valid_block_count = 2,
  rsvd_segment_count = 25,
  overprov_segment_count = 47,
  free_segment_count = 472,
  cur_node_segno = {476, 475, 474, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
  cur_node_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
  nat_upd_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
  cur_data_segno = {473, 1, 0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295},
  cur_data_blkoff = {1, 0, 0, 0, 0, 0, 0, 0},
  ckpt_flags = 0,
  cp_pack_total_block_count = 5,
  cp_pack_start_sum = 1,
  valid_node_count = 1,
  valid_inode_count = 1,
  next_free_nid = 4,
  sit_ver_bitmap_bytesize = 64,
  nat_ver_bitmap_bytesize = 128,
  checksum_offset = 4092,
  elapsed_time = 0,
  alloc_type = '\000' <repeats 15 times>,
  sit_nat_version_bitmap = ""
}
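
Cross-checking the dumped fields against the code above (my own arithmetic):

  • overprov_segment_count: \((478 - 25) \times 5 / 100 = 22\) (integer division), plus rsvd_segment_count = 25, gives 47
  • free_segment_count: \(478 - 6 = 472\), the 6 active logs each pinning one segment
  • user_block_count: \((472 + 6 - 47) \times 512 = 220672\)
  • sit_ver_bitmap_bytesize: \(((2 / 2) \times 512) / 8 = 64\); nat_ver_bitmap_bytesize: \(((4 / 2) \times 512) / 8 = 128\)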

The algorithmic idea

The flow above and the records in the summary block are not enough to see the full picture of checkpointing, because all we have here is the initialization. Still, it is clear that checkpoints are duplicated: the CP area spans 2 segments, one per checkpoint pack, and within each pack the checkpoint block is written twice, at the head and at the tail.

A basic algorithmic idea: if checkpoint pack #1 was written last time, the next write goes to checkpoint pack #0, which helps consistency checking. The so-called recovery process also no longer looks mysterious: although the complete algorithm cannot be learned from here, all that is done here is recording the metadata that recovery will need, nothing more.
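
To make the alternating-pack idea concrete, here is a hedged sketch of how a mount path could pick a checkpoint (my own illustration, not the kernel's actual code: real F2FS verifies the CRC at checksum_offset and compares checkpoint_ver between the head and tail cp pages of each pack):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* hypothetical model of one checkpoint pack: head/tail cp pages plus the
 * result of a CRC check (crc_ok stands in for the real verification) */
struct cp_page {
        uint64_t checkpoint_ver;
        bool     crc_ok;
};

struct cp_pack {
        struct cp_page head;    /* cp page 1, first block of the pack */
        struct cp_page tail;    /* cp page 2, last block of the pack */
};

static bool pack_valid(const struct cp_pack *p)
{
        return p->head.crc_ok && p->tail.crc_ok &&
               p->head.checkpoint_ver == p->tail.checkpoint_ver;
}

/* pick the valid pack with the newer version; NULL means unmountable */
static const struct cp_pack *pick_checkpoint(const struct cp_pack *p0,
                                             const struct cp_pack *p1)
{
        bool v0 = pack_valid(p0), v1 = pack_valid(p1);

        if (v0 && v1)
                return p0->head.checkpoint_ver >= p1->head.checkpoint_ver ? p0 : p1;
        return v0 ? p0 : (v1 ? p1 : NULL);
}

int main(void)
{
        /* the freshly formatted state: pack 1 has version 1, pack 2 version 0 */
        struct cp_pack pack1 = { {1, true}, {1, true} };
        struct cp_pack pack2 = { {0, true}, {0, true} };

        const struct cp_pack *cur = pick_checkpoint(&pack1, &pack2);
        printf("mount would use the pack with version %lu\n",
               (unsigned long)cur->head.checkpoint_ver);
        return 0;
}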

THE END

We now have a rough picture of F2FS's initialized state; the next article will explore F2FS at runtime. That's it for now.

And, as usual, closing with a picture unrelated to the content

