Linux File System Analysis (Part 4): IO Modes

Technology, 2022-07-10

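The file system is where the IO path begins. To satisfy the various IO needs of applications, the file system layer provides several IO modes. Buffered IO, covered in the previous part, is the most common one; this part walks through the other IO modes: what they are for, how they flow, and how they are implemented.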

iov_iter & kiocb

Before looking at how each IO mode is implemented, let's look at two data structures: iov_iter and kiocb.

iov_iter

    struct kvec {
        void *iov_base; /* and that should *never* hold a userland pointer */
        size_t iov_len;
    };

    struct iov_iter {
        /*
         * Bit 0 is the read/write bit, set if we're writing.
         * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
         * the caller isn't expecting to drop a page reference when done.
         */
        unsigned int type;      // read or write, plus other attributes
        size_t iov_offset;      // offset of the data start within the first iovec
        size_t count;           // amount of data
        union {
            const struct iovec *iov;        // same layout as kvec; describes a range of user-space memory
            const struct kvec *kvec;        // describes a range of kernel-space memory
            const struct bio_vec *bvec;     // describes a range within one memory page
            struct pipe_inode_info *pipe;
        };
        union {
            unsigned long nr_segs;          // number of iovecs
            struct {
                int idx;
                int start_idx;
            };
        };
    };

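An "iterator" is a common design in the kernel, usually used to describe how far the processing of some object has progressed. iov_iter was originally used mainly to track progress through user-space memory during one IO: *iov holds the user-space memory addresses, while iov_offset and count record the current progress, and both change continuously as the IO proceeds. The mechanism was later extended to other kernel features, with further variants added through the union.

Reference: https://lwn.net/Articles/625077/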
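To make the "progress" idea concrete, here is a minimal user-space sketch of an iterator over an iovec array. It is a simplified model and not the kernel's struct iov_iter: the names toy_iter, toy_iter_init and toy_copy_to_iter are made up for illustration, and only the iov, nr_segs, iov_offset and count fields are mimicked.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Simplified user-space model of an iovec iterator (hypothetical, not kernel code). */
struct toy_iter {
    const struct iovec *iov;  /* current segment */
    size_t nr_segs;           /* segments left, including the current one */
    size_t iov_offset;        /* bytes already consumed in the current segment */
    size_t count;             /* total bytes left to process */
};

void toy_iter_init(struct toy_iter *it, const struct iovec *iov, size_t nr_segs)
{
    size_t i;

    it->iov = iov;
    it->nr_segs = nr_segs;
    it->iov_offset = 0;
    it->count = 0;
    for (i = 0; i < nr_segs; i++)
        it->count += iov[i].iov_len;
}

/*
 * Copy up to len bytes from src into the memory described by the iterator
 * and advance the iterator. Returns the number of bytes copied.
 */
size_t toy_copy_to_iter(struct toy_iter *it, const void *src, size_t len)
{
    const char *p = src;
    size_t copied = 0;

    if (len > it->count)
        len = it->count;

    while (copied < len) {
        size_t room = it->iov->iov_len - it->iov_offset;
        size_t n = (len - copied < room) ? len - copied : room;

        memcpy((char *)it->iov->iov_base + it->iov_offset, p + copied, n);
        copied += n;
        it->iov_offset += n;
        it->count -= n;

        /* Current segment exhausted: step to the next one, if any. */
        if (it->iov_offset == it->iov->iov_len && it->nr_segs > 1) {
            it->iov++;
            it->nr_segs--;
            it->iov_offset = 0;
        }
    }
    return copied;
}
```

Each call leaves the iterator pointing at where the next copy should continue, which is the same role iov_offset and count play during a real IO.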

kiocb

    struct kiocb {
        struct file *ki_filp;   // the file structure created by open()

        /* The 'ki_filp' pointer is shared in a union for aio */
        randomized_struct_fields_start

        loff_t ki_pos;          // data offset
        void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);   // IO completion callback
        void *private;
        int ki_flags;           // IO attributes
        u16 ki_hint;
        u16 ki_ioprio;          /* See linux/ioprio.h */
        unsigned int ki_cookie; /* for ->iopoll */

        randomized_struct_fields_end
    };

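kiocb mainly holds a file structure and records the read/write offset; in effect it describes the progress on the file side of one IO.

iov_iter and kiocb thus describe the two ends of one IO: iov_iter describes the memory side and kiocb the file side. The file system provides two interfaces that wrap read and write operations on top of these two structures.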

    static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
                                         struct iov_iter *iter)
    {
        return file->f_op->read_iter(kio, iter);
    }

Reads the file data described by kiocb into the memory described by iov_iter.

    static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
                                          struct iov_iter *iter)
    {
        return file->f_op->write_iter(kio, iter);
    }

Writes the memory data described by iov_iter into the file described by kiocb.

The read and write logic of every IO mode in the file system is ultimately built on these two interfaces.
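From user space, the closest counterpart to this pair is a vectored read: the iovec array is what ends up described by an iov_iter, and the file offset is what ends up in kiocb->ki_pos. The following is a small sketch using the standard preadv() call; the helper name read_scattered is made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Read from path at offset off into two separate buffers with one call.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_scattered(const char *path, off_t off,
                       void *hdr, size_t hdr_len,
                       void *body, size_t body_len)
{
    /* Memory side of the IO: two user-space segments. */
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* File side of the IO: the descriptor plus an explicit offset. */
    n = preadv(fd, iov, 2, off);
    close(fd);
    return n;
}
```

The first buffer is filled completely before any data lands in the second one, which mirrors how the iterator advances through its segments.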

DirectIO

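Buffered IO, described in the previous part, caches data in the pagecache, but for various reasons an application may sometimes not want to go through the cache. Direct IO addresses this need. It is enabled by passing the O_DIRECT flag to open(), which sets O_DIRECT in file->f_flags; on each read or write, this flag determines whether to bypass the pagecache. The underlying file system driver must of course also implement the direct IO interfaces. Again taking ext4 as the example, let's look at the direct IO flow.

Direct IO also uses read(), write() and the related system calls, and the first part of its call stack is the same as for buffered IO: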

vfs_read() -> __vfs_read() -> call_read_iter() -> ext4_file_read_iter() -> generic_file_read_iter()

vfs_write() -> __vfs_write() -> call_write_iter() -> ext4_file_write_iter() -> __generic_file_write_iter()

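The main differences come after generic_file_read_iter() and __generic_file_write_iter(). Let's look at the flow of these two functions:

(Figure: generic_file_read_iter flowchart)

(Figure: generic_file_write_iter flowchart)

In direct IO mode the two functions behave very similarly, which can be summarized as:

1. Write back the dirty pages in the pagecache.
2. Call a_ops->direct_IO to transfer data between the disk and the user memory addresses in iov_iter.
3. If the previous step did not complete the whole transfer, read or write the remaining data in buffered IO mode.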

In ext4, a_ops->direct_IO points to ext4_direct_IO(). Taking read as the example, the call stack is:

ext4_direct_IO()->ext4_direct_IO_read()->__blockdev_direct_IO()->do_blockdev_direct_IO()

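do_blockdev_direct_IO() implements the core logic of direct IO. Its rough flow is:

1. Perform some basic checks first, such as whether the data goes beyond the inode size and whether the offset is aligned to the block size.
2. Decide whether this direct IO is synchronous or asynchronous, based on whether the kiocb->ki_complete callback is NULL.
3. Call do_direct_IO(), which converts the user buffer addresses into pages and submits the IO.
4. When the bio is initialized before submission, set the bio->bi_end_io callback to dio_bio_end_aio() for asynchronous IO or dio_bio_end_io() for synchronous IO.
5. If synchronous, call dio_await_completion() to wait for the IO to finish.
6. If asynchronous, return -EIOCBQUEUED directly. When the asynchronous IO finishes, completion is handled through bio->bi_end_io() (dio_bio_end_aio) -> dio_complete() -> kiocb->ki_complete().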

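One more note on asynchronous direct IO. The usual asynchronous direct IO path merely returns right after submitting the IO (submit_bio()) instead of waiting for completion. But there are many other places in the IO path that can block, for example waiting for the inode lock or for pagecache writeback, and submit_bio() itself can block too (typically when the number of concurrent IO requests has hit its limit and it has to wait for a free request; further reading: concurrent request management in the block layer). To make the IO path fully asynchronous, the RWF_NOWAIT flag was introduced. An application can set it, and further down the IO path it is translated into IOCB_NOWAIT and REQ_NOWAIT, so that this IO does not block in the file system or the block layer; where it would have had to wait, the call fails with EAGAIN instead. Reference: commit b745fafaf70c0a98a2e1e7ac8cb14542889ceb0e

In short, direct IO gives applications a way to read and write while bypassing the pagecache.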
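The alignment requirement mentioned above is the part that most often trips up applications. Below is a minimal user-space sketch of a direct read, assuming the caller knows a suitable alignment (for example 4096 bytes; the real requirement depends on the device and file system). The helper names round_up_pow2 and direct_read are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Round n up to a multiple of align (align must be a power of two). */
size_t round_up_pow2(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/*
 * Read from the start of path with O_DIRECT. The request length is rounded
 * up to the alignment, so the return value may be larger than len.
 * On success *out points to an aligned buffer that the caller must free().
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t direct_read(const char *path, size_t len, size_t align, void **out)
{
    size_t io_len = round_up_pow2(len, align);
    void *buf = NULL;
    ssize_t n;
    int fd;

    /* Buffer address, length and file offset all need to be aligned. */
    if (posix_memalign(&buf, align, io_len) != 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    n = pread(fd, buf, io_len, 0);
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return n;
}
```

Note that some file systems (tmpfs, for instance) reject O_DIRECT at open() time, so a real program would want a buffered fallback.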

DAX

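Besides direct IO, there is another case where bypassing the pagecache is usually desirable: when the underlying block device is itself memory-based. File systems provide the mount option -o dax, which makes all reads and writes on that file system skip the pagecache. Two conditions apply: the underlying block device must support DAX, and the file system must implement the DAX interfaces; currently only ext2, ext4 and xfs do. Since its use cases are limited overall, I won't cover it in detail; see Documentation/filesystems/dax.txt.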

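Synchronous writes

In the buffered IO write path, the call returns as soon as the data has been copied into the pagecache, so the application does not know when the data actually reaches the disk unless it explicitly calls a system call such as sync() or fdatasync(). To serve this scenario, file systems offer synchronous writes. Once enabled, the application still writes through buffered IO as usual, but at the end of every write the file system immediately writes the data from the pagecache to disk.

There are three main ways to enable synchronous writes: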

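1. Pass the O_SYNC flag to open().
2. Some file systems (ext2, ext4, xfs and others) support the mount option -o sync, which makes every write on that file system synchronous.
3. Some file systems (ext2, ext4, xfs and others) support setting a synchronous-write attribute on an individual file through ioctl; once set, every write to that file is synchronous.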

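The implementation is fairly simple: after buffered IO has written the data into the pagecache, generic_write_sync() checks whether synchronous write is enabled and, if so, performs the sync right away.

The difference between synchronous writes and direct IO is that direct IO bypasses the pagecache entirely, whereas a synchronous write first puts the data into the pagecache and then immediately writes it to disk.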
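As a rough user-space illustration, the two functions below write the same data either through a descriptor opened with O_SYNC or through a plain buffered write followed by an explicit fdatasync(). This is only a sketch; the function names are made up, and error handling is reduced to returning -1.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Variant A: every write() on this descriptor is a synchronous write. */
int write_with_o_sync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    int ret;

    if (fd < 0)
        return -1;
    ret = write_all(fd, buf, len);
    close(fd);
    return ret;
}

/* Variant B: ordinary buffered write, then flush explicitly. */
int write_then_fdatasync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int ret;

    if (fd < 0)
        return -1;
    ret = write_all(fd, buf, len);
    if (ret == 0 && fdatasync(fd) < 0)
        ret = -1;
    close(fd);
    return ret;
}
```

Variant A pays the flush cost on every write() call, while variant B lets the application batch several writes before one flush.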

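Asynchronous IO

With ordinary IO, and reads in particular, the application process is suspended until the data has been read and returned to user space, and only then can it carry on with its next task. From the CPU's point of view, a disk handling an IO takes a very long time, which often turns IO into the performance bottleneck of the whole workload. The original remedy was usually to design one or more dedicated IO threads to obtain asynchrony and concurrency and so improve IO performance. As the software and hardware of the IO stack kept improving, the system's capacity for concurrent IO kept growing, and simply adding more IO threads is not a good way for a workload to reach better IO performance: it increases thread switching overhead and complicates task management. To solve this, asynchronous IO was added at the file system layer, so that applications can do asynchronous IO without a complicated design and can easily raise their IO concurrency to reach higher performance.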

aio

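aio is the native asynchronous IO interface that Linux supported first. One limitation is that it must be used together with direct IO to actually behave asynchronously. Its three main system calls are:

- io_setup(): creates an asynchronous IO context.
- io_submit(): submits IO requests.
- io_getevents(): checks for completed IO requests.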
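A minimal sketch of how the three calls fit together is shown below. It issues the raw system calls through syscall() with the structures from <linux/aio_abi.h>, reads one range from a file, and waits for its completion. The helper name aio_read_once is made up for illustration; the file is opened without O_DIRECT to keep the example short, so io_submit() itself may block until the read is done.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Read len bytes at offset off from path through the native aio interface.
 * Returns the result of the read (bytes read, or a negative errno taken
 * from io_event.res), or -1 if setup or submission failed.
 */
long aio_read_once(const char *path, void *buf, size_t len, long long off)
{
    aio_context_t ctx = 0;          /* must be zero before io_setup() */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long ret = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (syscall(SYS_io_setup, 1L, &ctx) < 0)
        goto out_close;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    /* Submit one request, then block until one completion is available. */
    if (syscall(SYS_io_submit, ctx, 1L, list) == 1 &&
        syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) == 1)
        ret = (long)ev.res;

    syscall(SYS_io_destroy, ctx);
out_close:
    close(fd);
    return ret;
}
```

In a real program the submission and the wait would be separated, with other work done in between; here they sit back to back only to show the order of calls.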

I won't go further into usage here; see https://www.cnblogs.com/lexus/archive/2013/03/28/2987415.html. Let's look at how aio is implemented.

Data structures

    struct kioctx {
        struct percpu_ref users;
        atomic_t dead;

        struct percpu_ref reqs;

        unsigned long user_id;      // the io_context_t *ctxp returned to the application by io_setup(); also equal to mmap_base
        ... ...
        /*
         * This is what userspace passed to io_setup(), it's not used for
         * anything but counting against the global max_reqs quota.
         *
         * The real limit is nr_events - 1, which will be larger (see
         * aio_setup_ring())
         */
        unsigned max_reqs;          // the nr_events passed to io_setup()

        /* Size of ringbuffer, in units of struct io_event */
        unsigned nr_events;         // number of ringbuffer entries, computed from the requested nr_events

        unsigned long mmap_base;    // start address of the aio ring buffer memory
        unsigned long mmap_size;    // size of the aio ring buffer memory

        struct page **ring_pages;   // memory pages backing the aio ring buffer
        long nr_pages;              // number of aio ring buffer pages
        ... ...
        struct file *aio_ring_file; // file that the aio ring buffer memory is mapped from

        unsigned id;
    };

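kioctx represents one aio context and describes everything about it. It is created in the io_setup() system call and corresponds to aio_context_t in user space. Its most important job is to describe, through mmap_base, mmap_size, ring_pages, nr_pages and related fields, a region of memory that stores the aio request results (io_event) as a ringbuffer.

It is worth noting that this memory is allocated in user space, so the application can access it directly to check IO completion status. The way this is done is interesting: when kioctx is initialized, a virtual file is created and then mapped with mmap to obtain a piece of shared memory. Its start address mmap_base is returned to the application through the aio_context_t *ctxp argument of io_setup(), so both the kernel and the application can access the memory.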

    struct aio_ring {
        unsigned id;        /* kernel internal index number */
        unsigned nr;        /* number of io_events */
        unsigned head;      /* Written to by userland or under ring_lock
                             * mutex by aio_read_events_ring(). */
        unsigned tail;

        unsigned magic;
        unsigned compat_features;
        unsigned incompat_features;
        unsigned header_length;     /* size of aio_ring */

        struct io_event io_events[0];
    };

    struct io_event {
        __u64 data;     /* the data field from the iocb */
        __u64 obj;      /* what iocb this event came from */
        __s64 res;      /* result code for this event */
        __s64 res2;     /* secondary result */
    };

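The shared memory in kioctx holds the results of aio requests. Its layout is one aio_ring followed by nr_events io_event entries. The aio_ring structure mainly manages the state of the ringbuffer; the io_event entries that follow have the same layout as the user-space structure and hold the request results.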
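The head/tail bookkeeping of such a ring can be sketched in a few lines of plain C. This is a toy model with made-up names (toy_ring, toy_ring_push, toy_ring_pop), not the kernel code: it keeps one slot empty to tell "full" from "empty", which matches the comment in kioctx that the real limit is nr_events - 1, and it ignores locking and memory ordering entirely.

```c
#include <stdint.h>

#define TOY_NR 4    /* number of slots in the toy ring */

struct toy_event {
    uint64_t data;  /* caller-supplied tag */
    int64_t res;    /* result code */
};

struct toy_ring {
    unsigned nr;    /* number of slots */
    unsigned head;  /* next slot to consume (application side) */
    unsigned tail;  /* next slot to fill (kernel side) */
    struct toy_event events[TOY_NR];
};

void toy_ring_init(struct toy_ring *r)
{
    r->nr = TOY_NR;
    r->head = 0;
    r->tail = 0;
}

/* Producer: store one completed event. Returns 0, or -1 if the ring is full. */
int toy_ring_push(struct toy_ring *r, uint64_t data, int64_t res)
{
    unsigned next = r->tail + 1;

    if (next >= r->nr)
        next = 0;
    if (next == r->head)
        return -1;          /* full: one slot is always left empty */

    r->events[r->tail].data = data;
    r->events[r->tail].res = res;
    r->tail = next;
    return 0;
}

/* Consumer: fetch the oldest event. Returns 0, or -1 if the ring is empty. */
int toy_ring_pop(struct toy_ring *r, struct toy_event *out)
{
    if (r->head == r->tail)
        return -1;

    *out = r->events[r->head];
    r->head = (r->head + 1) % r->nr;
    return 0;
}
```

With TOY_NR set to 4, three events can be queued before toy_ring_push() reports the ring as full.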

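Processing flow

Initialization:

This is done in the io_setup() system call. Its main job is to initialize the kioctx structure; the aio_setup_ring() function creates the virtual file, sets up the memory mapping, and initializes the ringbuffer.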

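Submitting IO requests:

This is done by the io_submit() system call. Taking a read as the example, the main call stack is: io_submit() -> io_submit_one() -> __io_submit_one() -> aio_read().

aio_read() converts the iocb structure passed in from user space into a concrete IO operation. Its rough flow is:

1. Call aio_prep_rw() to fill in the kiocb. Most importantly, it sets the kiocb->ki_complete callback to aio_complete_rw(). Recall that do_blockdev_direct_IO(), analyzed earlier, decides whether an IO is asynchronous by checking whether kiocb->ki_complete is non-NULL.
2. Call aio_setup_rw() to convert the aio_buf and aio_nbytes passed in the iocb into an iov_iter.
3. Call rw_verify_area() to check that the range may be read.
4. Call call_read_iter() to read the file data. Combined with the earlier analysis of generic_file_read_iter() and direct IO, only when direct IO is used does call_read_iter() return -EIOCBQUEUED immediately; otherwise it still blocks until the IO completes.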

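IO request completion:

When the IO finishes, the aio_complete_rw() callback is triggered. It calls aio_complete(), which stores the result in the ringbuffer as an io_event and updates aio_ring. If a thread is blocked waiting for results, it is woken up. If an eventfd was configured for use with epoll, it is signalled to tell the epoll thread that there are events to handle.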

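Application waiting for IO completion:

The application obtains IO results through the io_getevents() system call, whose arguments control whether it blocks; results are returned to the application as io_event structures. The call stack is: io_getevents() -> do_io_getevents() -> read_events() -> aio_read_events() -> aio_read_events_ring(). In the end, aio_read_events_ring() copies the io_event data from the ringbuffer to user space and updates aio_ring.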

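Overall, aio achieves asynchrony by relying on the block layer's IO completion callback. As for why aio is only supported in the direct IO path, my understanding is that the buffered IO path, with its cache handling, is too complex, while the direct IO path is relatively simple. aio has been supported since Linux 2.6 but was never widely adopted, mainly because it is coupled too tightly to direct IO. Anyone who wants asynchronous IO has to use direct IO, which means giving up the benefits of the pagecache and aligning IO sizes and offsets to the block size, so it is inconvenient to use. And although shared memory was designed in, applications in practice usually do not touch it, so there is still a lot of data copying between kernel and user space as well as frequent system calls. The community was long dissatisfied with many aspects of aio's design, until the brand-new asynchronous IO interface io_uring appeared. It offers a completely new way of interacting for asynchronous IO, solving aio's usability problems and making better use of asynchronous IO performance.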

io_uring

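io_uring is the new-generation asynchronous IO interface released in Linux 5.1. Its core idea is to handle the interaction between kernel and application through several ringbuffers in shared memory, reducing data copies and communication overhead and lowering the system call frequency, thereby improving performance. By setting up workqueues and kernel threads, it processes IO requests asynchronously and no longer depends on the asynchronous implementation inside direct IO, so it can support many kinds of IO: direct IO, buffered IO, and even socket IO.

io_uring's system calls are very concise: there are only three, and ordinary IO needs just two of them.

- io_uring_setup(): creates an io_uring context.
- io_uring_enter(): a multi-purpose system call that can submit IO requests, wait for IO completion, or do both at once.
- io_uring_register(): registers fixed fds or data buffers.

Most of the interaction between application and kernel goes through the ringbuffers in shared memory, and the liburing library provides the user-space helper functions. For a detailed introduction and usage, see https://kernel.dk/io_uring.pdf. Let's look at how io_uring is implemented.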

Ringbuffer

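io_uring sets up two ringbuffers:

sq (submission queue): holds submitted IO requests. The application is the producer and updates tail; the kernel is the consumer and updates head. Its entries are called sqes.

cq (completion queue): holds completed IO requests. The kernel is the producer and updates tail; the application is the consumer and updates head. Its entries are called cqes.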

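In memory, the structure actually falls into three parts: sq, cq, and the sqe array. The head/tail in cq directly hold indices into the cqe array, whereas the head/tail in sq hold indices into a u32 array, sq_array, whose entries in turn hold indices into the sqe array. Why introduce sq_array to separate sq from the sqes? The io_uring design document mentioned above says it makes it easier for the application to submit several IO requests at once. Judging from liburing's implementation, the application can prepare multiple sqes in advance and only update the indices in sq_array when it actually submits the IO, though I have not fully understood the concrete advantage yet. The ringbuffers in memory therefore look roughly like this:

Image from https://mattermost.com/blog/iouring-and-go/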
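The sq_array indirection is easier to see in code. The following is a toy single-threaded model with made-up names (toy_sq, toy_sq_submit, toy_sq_consume), not the kernel or liburing implementation. It shows that the application can fill sqes in any slot and in any order, and that only the indices written into the array ring decide what gets submitted and in which order. Real code additionally needs memory barriers between producer and consumer, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTRIES 4u                  /* ring size, a power of two */
#define TOY_MASK    (TOY_ENTRIES - 1)

struct toy_sqe {
    uint8_t opcode;
    int32_t fd;
    uint64_t user_data;
};

struct toy_sq {
    uint32_t head;                      /* advanced by the consumer (kernel) */
    uint32_t tail;                      /* advanced by the producer (application) */
    uint32_t array[TOY_ENTRIES];        /* plays the role of sq_array */
    struct toy_sqe sqes[TOY_ENTRIES];   /* the sqe array itself */
};

/* Application side: publish the already prepared sqes[sqe_index]. */
int toy_sq_submit(struct toy_sq *sq, uint32_t sqe_index)
{
    if (sq->tail - sq->head == TOY_ENTRIES)
        return -1;                      /* ring is full */

    sq->array[sq->tail & TOY_MASK] = sqe_index;
    sq->tail++;
    return 0;
}

/* Kernel side: take the next submitted sqe, or NULL if none is pending. */
const struct toy_sqe *toy_sq_consume(struct toy_sq *sq)
{
    const struct toy_sqe *sqe;

    if (sq->head == sq->tail)
        return NULL;

    sqe = &sq->sqes[sq->array[sq->head & TOY_MASK]];
    sq->head++;
    return sqe;
}
```

For example, an application could prepare sqes[0] and sqes[2], then submit index 2 before index 0; the consumer sees them in submission order, not in slot order.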

Data structures

    struct io_uring_sqe {
        __u8 opcode;        /* type of operation for this sqe */
        __u8 flags;         /* IOSQE_ flags */
        __u16 ioprio;       /* ioprio for the request */
        __s32 fd;           /* file descriptor to do IO on */
        __u64 off;          /* offset into file */
        __u64 addr;         /* pointer to buffer or iovecs */
        __u32 len;          /* buffer size or number of iovecs */
        union {
            __kernel_rwf_t rw_flags;
            __u32 fsync_flags;
            __u16 poll_events;
            __u32 sync_range_flags;
            __u32 msg_flags;
            __u32 timeout_flags;
        };
        __u64 user_data;    /* data to be passed back at completion time */
        union {
            __u16 buf_index;    /* index into fixed buffers, if used */
            __u64 __pad2[3];
        };
    };

    struct io_uring_cqe {
        __u64 user_data;    /* sqe->data submission passed back */
        __s32 res;          /* result code for this event */
        __u32 flags;
    };

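struct io_uring_sqe and struct io_uring_cqe describe an sqe and a cqe respectively. An sqe contains all the information about one IO. A cqe, however, does not identify the IO; instead, the user_data of the corresponding sqe is passed back to the application. How to identify and handle completed IO requests is up to the user to implement on top of user_data.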

    struct io_rings {
        /*
         * Head and tail offsets into the ring; the offsets need to be
         * masked to get valid indices.
         *
         * The kernel controls head of the sq ring and the tail of the cq ring,
         * and the application controls tail of the sq ring and the head of the
         * cq ring.
         */
        struct io_uring sq, cq;
        /*
         * Bitmasks to apply to head and tail offsets (constant, equals
         * ring_entries - 1)
         */
        u32 sq_ring_mask, cq_ring_mask;
        /* Ring sizes (constant, power of 2) */
        u32 sq_ring_entries, cq_ring_entries;
        ...
        /*
         * Ring buffer of completion events.
         *
         * The kernel writes completion events fresh every time they are
         * produced, so the application is allowed to modify pending
         * entries.
         */
        struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
    };

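struct io_rings describes the sq and cq. It records the current head/tail indices of the sq and cq, and also contains the cqe array. The u32 array sq_array, which stores indices into the sqe array, is allocated together with io_rings, right after it.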
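The comments in struct io_rings say that head and tail must be masked to obtain valid indices, because the ring sizes are powers of two and the counters are never reset. A small sketch of that arithmetic, with made-up helper names, looks like this:

```c
#include <stdint.h>

/* Map a free-running head or tail counter to a slot in the ring. */
uint32_t ring_slot(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;     /* ring_mask == ring_entries - 1 */
}

/*
 * Number of entries currently between head and tail. Unsigned subtraction
 * stays correct even after the 32-bit counters wrap around.
 */
uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
```

With 8 entries (mask 7), a tail counter of 9 maps to slot 1, and the pending count is still right when tail has wrapped past zero while head has not.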

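All of these data structures live in memory shared between the application and the kernel. Two regions are allocated at initialization: struct io_rings + u32 *sq_array, and the array of struct io_uring_sqe.

io_uring implements shared memory differently from aio. aio allocates the shared memory and returns its address directly to the application. io_uring, after allocating the memory, creates a file descriptor fd and returns it to the application together with the offsets of each structure within the memory. The application has to use mmap to map these structures itself, which is somewhat involved in practice; fortunately liburing already wraps this up.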

    struct io_ring_ctx {
        ...
        struct {
            unsigned int flags;
            bool compat;
            bool account_mem;

            /*
             * Ring buffer of indices into array of io_uring_sqe, which is
             * mmapped by the application using the IORING_OFF_SQES offset.
             *
             * This indirection could e.g. be used to assign fixed
             * io_uring_sqe entries to operations and only submit them to
             * the queue when needed.
             *
             * The kernel modifies neither the indices array nor the entries
             * array.
             */
            u32 *sq_array;                  // array of sqe indices
            unsigned cached_sq_head;
            unsigned sq_entries;            // size of the sq ringbuffer
            unsigned sq_mask;
            unsigned sq_thread_idle;
            struct io_uring_sqe *sq_sqes;   // sqe ringbuffer

            struct list_head defer_list;    // list of deferred IO submissions
            struct list_head timeout_list;  // list of timeout IO submissions
        } ____cacheline_aligned_in_smp;

        /* IO offload */
        struct workqueue_struct *sqo_wq[2]; // workqueues used to process submitted IO
        struct task_struct *sqo_thread;     /* if using sq thread polling */
        struct mm_struct *sqo_mm;           // equals current->mm; indicates whether there is a user-space user
        wait_queue_head_t sqo_wait;
        struct completion sqo_thread_started;

        struct {
            unsigned cached_cq_tail;
            unsigned cq_entries;            // size of the cq ringbuffer
            unsigned cq_mask;
            struct wait_queue_head cq_wait;
            struct fasync_struct *cq_fasync;
            struct eventfd_ctx *cq_ev_fd;   // eventfd to notify, if used with epoll
            atomic_t cq_timeouts;
        } ____cacheline_aligned_in_smp;

        struct io_rings *rings;             // sq and cq ringbuffers

        /*
         * If used, fixed file set. Writers must ensure that ->refs is dead,
         * readers must ensure that ->refs is alive as long as the file* is
         * used. Only updated through io_uring_register(2).
         */
        struct file **user_files;           // fixed files registered with io_uring_register()
        unsigned nr_user_files;

        /* if used, fixed mapped user buffers */
        unsigned nr_user_bufs;              // fixed buffers registered with io_uring_register()
        struct io_mapped_ubuf *user_bufs;
        ...
    };

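Like aio, io_uring needs a context structure that describes everything, namely struct io_ring_ctx. Besides the ringbuffer-related structures introduced above, it also contains the structures used in the IO processing flow: the workqueues for submitting IO, the kthread used for polling, various flags and parameters, and so on.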

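Processing flow

The defining feature of io_uring is that the kernel and the application communicate through shared memory, so initialization, IO submission, handling of IO results and so on all require the kernel and the application to cooperate.

Initialization:

The application first calls io_uring_setup(), usually specifying only the entries argument, which determines the number of sqes and cqes; twice as many cqes as sqes are allocated. Inside io_uring_setup() the kernel creates the io_ring_ctx structure, creates the workqueues, kthread and other related components, allocates the ringbuffers, and returns an fd and the related parameters (struct io_uring_params) to the application. The application then has to mmap the shared ringbuffer memory and define its own private structures and methods for operating on the ringbuffers; liburing implements io_uring_queue_init() to do all of this.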

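Submitting IO requests:

The application needs to obtain a free sqe from the sqe ringbuffer, fill in the IO request information, update the tail of the sq, and finally call io_uring_enter() with the to_submit argument set, which triggers the kernel's IO submission flow. liburing implements this as follows:

1. Call io_uring_get_sqe() to obtain an sqe from the tail of the sqe ringbuffer.
2. Call io_uring_prep_readv(), io_uring_prep_writev() or similar functions to fill in the IO information of the sqe. Several sqes can be prepared before submitting to the kernel.
3. Call io_uring_submit(), which fills in sq_array and the sq tail and finally calls io_uring_enter().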

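If to_submit is set in the io_uring_enter() call, the kernel starts submitting IO. The rough flow is:

1. Fetch to_submit sqes from the sqe ringbuffer.
2. Based on the IO information in these sqes, build the kiocb and iov_iter structures, save sqe->user_data, and turn them into IO requests, which are added to the submission list io_ring_ctx->defer_list.
3. Wake up the workqueue, which takes the IO requests off the submission list and calls call_read_iter()/call_write_iter() to process them.
4. Update the head of the sq.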

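IO request completion:

When an IO completes, the kernel takes a cqe from the cqe ringbuffer, fills in the IO result and the previously saved sqe->user_data, and updates the tail of the cq. If a thread is blocked waiting for results, it is woken up. If an eventfd was configured for use with epoll, it is signalled to tell the epoll thread that there are events to handle.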

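Application waiting for IO completion:

Waiting for IO completion uses the same system call as submitting IO, io_uring_enter(). The difference is that the min_complete argument must be set, and the call blocks until min_complete IOs have completed. After handling the IO results based on user_data and res, the application needs to update the head of the cq.

One problem here is that if IO is submitted quickly but completions are processed slowly, the cq can overflow. When an IO completes but no cqe is available, some completion information is dropped and ctx->rings->cq_overflow is incremented. The application can read this counter, but it should avoid getting into this situation in the first place.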
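The overflow behaviour described here can be modelled with a toy completion queue. This is a simplified sketch with made-up names (toy_cq, toy_cq_post, toy_cq_reap), not the kernel code, and it drops completions exactly as the paragraph above describes; memory barriers are again left out.

```c
#include <stdint.h>

#define TOY_CQ_ENTRIES 2u                   /* ring size, a power of two */
#define TOY_CQ_MASK    (TOY_CQ_ENTRIES - 1)

struct toy_cqe {
    uint64_t user_data;     /* copied from the matching sqe */
    int32_t res;            /* result of the IO */
};

struct toy_cq {
    uint32_t head;          /* advanced by the application */
    uint32_t tail;          /* advanced by the kernel */
    uint32_t overflow;      /* completions dropped because the ring was full */
    struct toy_cqe cqes[TOY_CQ_ENTRIES];
};

/* Kernel side: post one completion, or count it as lost if no cqe is free. */
int toy_cq_post(struct toy_cq *cq, uint64_t user_data, int32_t res)
{
    struct toy_cqe *cqe;

    if (cq->tail - cq->head == TOY_CQ_ENTRIES) {
        cq->overflow++;
        return -1;
    }
    cqe = &cq->cqes[cq->tail & TOY_CQ_MASK];
    cqe->user_data = user_data;
    cqe->res = res;
    cq->tail++;
    return 0;
}

/* Application side: copy out up to max completions and advance head. */
unsigned toy_cq_reap(struct toy_cq *cq, struct toy_cqe *out, unsigned max)
{
    unsigned n = 0;

    while (n < max && cq->head != cq->tail) {
        out[n++] = cq->cqes[cq->head & TOY_CQ_MASK];
        cq->head++;
    }
    return n;
}
```

An application that keeps the number of in-flight requests at or below the cq size, and reaps completions before submitting more, never sees the overflow counter move.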

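Special features:

io_uring also provides some special features. They are only mentioned briefly here; for details see https://kernel.dk/io_uring.pdf.

LINK: ensures that IOs are submitted in order.

IOPOLL: supports hardware IO polling, implemented by a separate kernel thread.

SQPOLL: the application does not need to call io_uring_enter() frequently; the kernel polls the sq state itself and submits IO, implemented by a separate kernel thread.

FIXED_FILE: when the files involved in the IO are fixed, they can be registered with io_uring_register().

FIXED_BUFF: when the buffers involved in the IO are fixed, they can be registered with io_uring_register().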

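Much of io_uring's design borrows ideas from aio, such as shared memory and epoll integration, but it genuinely addresses the various needs applications have for asynchronous IO and significantly improves IO performance; some tests show that io_uring in polling mode even outperforms SPDK. Since its release io_uring has attracted a lot of attention and is updated very frequently, and I look forward to how it develops in the asynchronous IO space.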

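Summary

This part introduced the IO modes provided by the Linux file system, their use cases, and how they are implemented. It focused on asynchronous IO: its evolution, the problems it ran into, and the latest implementation proposed so far. As applications keep demanding higher IO performance, how asynchronous IO will ultimately develop remains something to keep watching.

This concludes the series. Looking back, it divided the code under the Linux fs directory into four main parts for discussion: vfs, fs drivers, cache, and IO modes. The aim was to record my own understanding of the Linux file system. The file system covers far too much ground and a lot was left untouched; if anything here is misunderstood, corrections from more experienced readers are very welcome.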

References:

https://lwn.net/Articles/625077/

https://blog.csdn.net/haiross/article/details/38869025

https://www.cnblogs.com/lexus/archive/2013/03/28/2987415.html

https://blog.csdn.net/u012398613/article/details/22897279

https://mattermost.com/blog/iouring-and-go/

https://kernel.dk/io_uring.pdf
