This post is not about cataloguing specific IO metrics; the focus is where to look when you want to observe a metric, namely the tracing points.

The content is fairly light; these are just some notes I jotted down while studying how observability tools are implemented.

IO request types

If you only care about reads and writes, you can probe the vfs_read() and vfs_write() functions; each hit is one unmerged read or write request. Going a step further, you can imitate the bcc.*slower tools and maintain the request type manually at each return point.

If you need more detail, see the rwbs[] array, a data structure maintained jointly by blktrace and the kernel. Beyond reads and writes, IO request types include FLUSH/FUA for persisting data to disk, DISCARD for TRIM commands, and so on. The problem with rwbs[] is that it is undocumented; I was fumbling around recently just modifying a simple bcc tool because of this. (I'm debating whether to send a PR; my writing is pretty clumsy.)
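Lacking documentation, here is my own cheat sheet for the characters, read off blk_fill_rwbs() in kernel/trace/blktrace.c — a sketch against roughly v5.x+ sources, so treat the mapping as version-dependent rather than a stable interface:

```c
#include <stddef.h>

/* Meaning of each character blk_fill_rwbs() may emit into rwbs[],
 * as read from kernel/trace/blktrace.c (not a stable interface). */
static const char *rwbs_meaning(char c)
{
    switch (c) {
    case 'R': return "read";
    case 'W': return "write";
    case 'D': return "discard";
    case 'E': return "secure erase (follows D)";
    case 'F': return "flush (leading) or FUA (trailing)";
    case 'A': return "readahead";
    case 'S': return "sync";
    case 'M': return "metadata";
    case 'N': return "none of the above";
    default:  return NULL;
    }
}
```

Note that 'F' is positional: a leading 'F' means the request carries a pre-flush, while a trailing 'F' means FUA.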

A more hardcore approach is to trace the cmd_flags field of struct request (note: not rq_flags, which is for the elevator and for blk-layer internal use). It encodes enum req_op, so a bitwise AND is enough to tell the type apart. This tracing point has a big problem too: it is pure kernel implementation rather than any kind of stable interface, and kernel 4.7, for one, changed it substantially. The bcc.biolatency tool still does it this way, though.
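For illustration, the bit layout looks like this in plain userspace C. The opcode values are copied from include/linux/blk_types.h (around v6.x) and are kernel implementation details, so treat them as assumptions tied to a specific version:

```c
#include <stdint.h>
#include <stdbool.h>

/* Opcode values copied from include/linux/blk_types.h (~v6.x).
 * The low REQ_OP_BITS bits of cmd_flags hold the opcode; the rest
 * are modifier flags such as REQ_SYNC and REQ_FUA. */
#define REQ_OP_BITS 8
#define REQ_OP_MASK ((1u << REQ_OP_BITS) - 1)

enum req_op {
    REQ_OP_READ    = 0,
    REQ_OP_WRITE   = 1,
    REQ_OP_FLUSH   = 2,
    REQ_OP_DISCARD = 3,
};

static unsigned int req_op(uint32_t cmd_flags)
{
    return cmd_flags & REQ_OP_MASK;
}

/* Mirrors the kernel's op_is_write(): odd opcodes modify the device. */
static bool op_is_write(uint32_t cmd_flags)
{
    return (req_op(cmd_flags) & 1) != 0;
}
```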

IO counts

The /sys/block/<dev>/stat file, built on the sysfs mechanism, maps to the part_stat_show() function. It is just a set of routinely maintained stat arrays, but the granularity is very fine. This is also how the iostat tool is implemented.
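As a sketch of what a consumer like iostat has to do, here is a parser for that file. Field names follow Documentation/block/stat.rst; older kernels expose only these 11 fields, newer ones append discard and flush fields:

```c
#include <stdio.h>

/* First 11 fields of /sys/block/<dev>/stat, per Documentation/block/stat.rst. */
struct blk_stat {
    unsigned long long read_ios, read_merges, read_sectors, read_ticks;
    unsigned long long write_ios, write_merges, write_sectors, write_ticks;
    unsigned long long in_flight, io_ticks, time_in_queue;
};

/* Parse one line as read from the stat file; returns 0 on success. */
static int parse_blk_stat(const char *line, struct blk_stat *s)
{
    int n = sscanf(line,
                   "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &s->read_ios, &s->read_merges, &s->read_sectors,
                   &s->read_ticks, &s->write_ios, &s->write_merges,
                   &s->write_sectors, &s->write_ticks, &s->in_flight,
                   &s->io_ticks, &s->time_in_queue);
    return n == 11 ? 0 : -1;
}
```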

NOTE: Do you know how much time I wasted hunting down part_stat_show()? Is writing documentation really that hard?

IO sizes

Again by tracing struct request. There is the static tracepoint block_rq_complete, plus the corresponding dynamic tracepoint blk_complete_request(). The static tracepoint, for example, provides an nr_sector argument (in units of sectors); converted to bytes, that is the IO size.

Alternatively, use the static tracepoint block_rq_issue and read bytes directly. This is how the bcc.bitesize tool is implemented.
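The two steps bitesize effectively performs — sector-to-byte conversion and power-of-two bucketing — can be sketched as below (the log2 slot here is a simplified stand-in, not bcc's exact bpf_log2l() definition):

```c
#define SECTOR_SHIFT 9  /* the block layer's sector is always 512 bytes */

static unsigned long long sectors_to_bytes(unsigned long long nr_sector)
{
    return nr_sector << SECTOR_SHIFT;
}

/* Power-of-two histogram slot: floor(log2(v)), with 0 mapping to slot 0. */
static unsigned int log2_slot(unsigned long long v)
{
    unsigned int slot = 0;
    while (v >>= 1)
        slot++;
    return slot;
}
```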

IO access patterns

Same as above. The static tracepoint block_rq_complete also has a sector argument, the starting address of the IO. Combined with nr_sector, comparing against the previous record tells you whether the access is sequential or random. The bcc.biopattern tool uses this trick.
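That comparison can be sketched as follows (state for a single device only; the real biopattern keys its bookkeeping by device):

```c
/* Classify an IO as sequential if it starts exactly where the
 * previous one ended (previous sector + previous nr_sector). */
struct seq_state {
    unsigned long long next_sector; /* end of the last observed IO */
};

static int is_sequential(struct seq_state *st,
                         unsigned long long sector, unsigned int nr_sector)
{
    int seq = (st->next_sector == sector);
    st->next_sector = sector + nr_sector;
    return seq;
}
```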

The device issuing the IO

Same as above. Use the dev argument, then split out the major and minor numbers bitwise. I haven't yet looked at how the kernel maps a device number back to a name, so to recover the device name, the dumbest approach is to walk /sys/block/<all_devs>/dev or /proc/diskstats and match; lsblk can also do it in one step.
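The bit split itself is simple: at these tracepoints dev is the kernel-internal dev_t, a 12-bit major above a 20-bit minor (see MAJOR()/MINOR() in include/linux/kdev_t.h). Note this differs from the encoding userspace makedev() produces:

```c
/* Kernel-internal dev_t layout (include/linux/kdev_t.h):
 * 12-bit major in the high bits, 20-bit minor in the low bits. */
#define MINORBITS 20
#define MINORMASK ((1u << MINORBITS) - 1)

static unsigned int dev_major(unsigned int dev)
{
    return dev >> MINORBITS;
}

static unsigned int dev_minor(unsigned int dev)
{
    return dev & MINORMASK;
}
```

For example, sda is conventionally (8, 0), so its tracepoint dev value would be 8 << 20.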

IO latency

The bcc.biolatency tool's approach is to compute time differences:

  • IO queueing latency is the time between blk_account_io_start() and blk_mq_start_request().
  • IO execution latency is the time between blk_mq_start_request() and blk_account_io_done().
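The bookkeeping behind such a time difference is just a map from each request to its start timestamp. Here is a plain-C sketch of the issue/done half; in the real tool this is a BPF hash keyed by the struct request pointer:

```c
#include <stdint.h>
#include <stddef.h>

#define TS_MAP_SIZE 64

/* Start-timestamp map: stamp at blk_mq_start_request(),
 * subtract at blk_account_io_done(). */
struct ts_entry {
    const void *req;   /* key: the request's identity */
    uint64_t start_ns;
};

static struct ts_entry ts_map[TS_MAP_SIZE];

static void on_issue(const void *req, uint64_t now_ns)
{
    for (size_t i = 0; i < TS_MAP_SIZE; i++) {
        if (ts_map[i].req == NULL) {
            ts_map[i].req = req;
            ts_map[i].start_ns = now_ns;
            return;
        }
    }
    /* map full: drop the sample, as a full BPF hash would */
}

/* Returns the execution latency in ns, or 0 if the issue was never seen. */
static uint64_t on_done(const void *req, uint64_t now_ns)
{
    for (size_t i = 0; i < TS_MAP_SIZE; i++) {
        if (ts_map[i].req == req) {
            uint64_t delta = now_ns - ts_map[i].start_ns;
            ts_map[i].req = NULL;
            return delta;
        }
    }
    return 0;
}
```

Queueing latency works the same way, just with the earlier pair of functions.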

The latter yields execution latency because by the time blk_mq_start_request() is called, we are already in the driver layer.

NOTE: As for queueing latency, a rough reading shows blk_account_io_start() is guaranteed to run before insertion into the hardware queue. My personal take is that tracing submit_bio() would be a bit better for the pre-queue side, since it fires somewhat earlier than blk_account_io_start().

An even more hardcore approach is the blktrace tool's implementation. For how blktrace defines its latency stages, see the what2act[] array:

static const struct {
        const char *act[2];
        void       (*print)(struct trace_seq *s, const struct trace_entry *ent,
                            bool has_cg);
} what2act[] = {
        [__BLK_TA_QUEUE]        = { {  "Q", "queue" },      blk_log_generic },
        [__BLK_TA_BACKMERGE]    = { {  "M", "backmerge" },  blk_log_generic },
        [__BLK_TA_FRONTMERGE]   = { {  "F", "frontmerge" }, blk_log_generic },
        [__BLK_TA_GETRQ]        = { {  "G", "getrq" },      blk_log_generic },
        [__BLK_TA_SLEEPRQ]      = { {  "S", "sleeprq" },    blk_log_generic },
        [__BLK_TA_REQUEUE]      = { {  "R", "requeue" },    blk_log_with_error },
        [__BLK_TA_ISSUE]        = { {  "D", "issue" },      blk_log_generic },
        [__BLK_TA_COMPLETE]     = { {  "C", "complete" },   blk_log_with_error },
        [__BLK_TA_PLUG]         = { {  "P", "plug" },       blk_log_plug },
        [__BLK_TA_UNPLUG_IO]    = { {  "U", "unplug_io" },  blk_log_unplug },
        [__BLK_TA_UNPLUG_TIMER] = { { "UT", "unplug_timer" }, blk_log_unplug },
        [__BLK_TA_INSERT]       = { {  "I", "insert" },     blk_log_generic },
        [__BLK_TA_SPLIT]        = { {  "X", "split" },      blk_log_split },
        [__BLK_TA_BOUNCE]       = { {  "B", "bounce" },     blk_log_generic },
        [__BLK_TA_REMAP]        = { {  "A", "remap" },      blk_log_remap },
};

And these map directly onto the latency diagram:

 Q------->G------------>I--------->M------------------->D----------------------------->C
 |-Q time-|-Insert time-|
 |--------- merge time ------------|-merge with other IO|
 |------------------ scheduler time ---------------------|-driver, adapter, storage time-|

 |----------------------- await time in iostat output ----------------------------------|

Q2Q — time between requests sent to the block layer
Q2G — time from a block I/O is queued to the time it gets a request allocated for it
G2I — time from a request is allocated to the time it is Inserted into the device's queue
Q2M — time from a block I/O is queued to the time it gets merged with an existing request
I2D — time from a request is inserted into the device's queue to the time it is actually
      issued to the device
M2D — time from a block I/O is merged with an existing request until the request is issued
      to the device
D2C — service time of the request by the device
Q2C — total time spent in the block layer for a request

Cross-referencing blk_register_tracepoints() with /include/trace/events/block.h then gives the tracing points:

blktrace event   static tracepoint     dynamic tracepoint
Q                block_bio_queue       submit_bio_noacct_nocheck()
G                block_getrq           blk_mq_submit_bio()
I                block_rq_insert       blk_mq_insert_requests()
                                       blk_mq_insert_request()
                                       bfq_insert_request()
                                       kyber_insert_requests()
                                       dd_insert_request()
M                block_bio_backmerge   bio_attempt_back_merge()
D                block_rq_issue        blk_mq_start_request()
C                block_rq_complete     blk_complete_request()
                                       blk_update_request()

Appendix: page cache

Lifted straight from the comments:

  • mark_page_accessed() for measuring cache accesses
  • mark_buffer_dirty() for measuring cache writes
  • add_to_page_cache_lru() for measuring page additions
  • account_page_dirtied() for measuring page dirties

The above is how the perf-tools.cachestat tool is implemented, while bcc.cachestat adds extra tracing points as fallbacks (mainly switching over to folios):

  • folio_mark_accessed() replaces mark_page_accessed().
  • mark_buffer_dirty() stays the same.
  • filemap_add_folio() replaces add_to_page_cache_lru().
  • folio_account_dirtied(), or the :writeback_dirty_folio tracepoint, replaces account_page_dirtied().

NOTE: folio_account_dirtied() does not double-count when mark_buffer_dirty() is called repeatedly.

cachestat can go on to derive the following statistics:

  • The read-only page cache access count equals folio_mark_accessed() minus mark_buffer_dirty().
  • The missed page cache access count equals filemap_add_folio() minus folio_account_dirtied().
  • The hit ratio is then just a percentage computed from the above.
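The arithmetic above, written out. The counter names here are my own; in the real bcc.cachestat they are per-interval call counts of the four functions, and the clamping of negative values is my defensive addition:

```c
/* Per-interval call counts of the four probed functions. */
struct cache_counts {
    long accessed;   /* folio_mark_accessed() */
    long buf_dirty;  /* mark_buffer_dirty() */
    long added;      /* filemap_add_folio() */
    long dirtied;    /* folio_account_dirtied() */
};

/* Hit ratio in percent, clamped to avoid negative counts. */
static double hit_ratio(const struct cache_counts *c)
{
    long total  = c->accessed - c->buf_dirty; /* read-only accesses */
    long misses = c->added - c->dirtied;      /* read misses */

    if (total <= 0)
        return 0.0;
    if (misses < 0)
        misses = 0;

    long hits = total - misses;
    if (hits < 0)
        hits = 0;
    return 100.0 * (double)hits / (double)total;
}
```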

mark_page_accessed() shows total cache accesses, and add_to_page_cache_lru() shows cache insertions (so does add_to_page_cache_locked(), which even includes a tracepoint, but doesn’t fire on later kernels). I thought for a second that these two were sufficient: assuming insertions are misses, I have misses and total accesses, and can calculate hits.

The problem is that accesses and insertions also happen for writes, dirtying cache data. So the other two kernel functions help tease this apart (remember, I only have function call rates to work with here). mark_buffer_dirty() is used to see which of the accesses were for writes, and account_page_dirtied() to see which of the insertions were for writes.

Something still feels a bit off to me, but the author, Brendan, wrote a whole post specifically saying it's fine, so if the expert says it's right, I guess it's right…