Each zio references its associated dirty data, disk blocks, spa, vdev, and any transforms to apply (compression, encryption, etc.). The work to be performed on a zio is encoded as a bitmask indicating which stages of the zio pipeline (see below) will be executed on it. All I/O fans in to zio_execute(), which is responsible for running the zio through the pipeline.
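To make the bitmask idea concrete, here's a minimal sketch (mine, not ZFS code) of how a pipeline mask is composed from stage bits and tested; the stage values mirror the zio_stage enum quoted later in this post:

/* Illustrative only: a pipeline is the OR of the stage bits it will run. */
enum zio_stage_example {
    EX_STAGE_OPEN              = 1 << 0,
    EX_STAGE_CHECKSUM_GENERATE = 1 << 5,
    EX_STAGE_READY             = 1 << 16,
    EX_STAGE_DONE              = 1 << 21
};

/* A hypothetical write pipeline: checksum the data, then READY and DONE. */
static const unsigned ex_pipeline =
    EX_STAGE_OPEN | EX_STAGE_CHECKSUM_GENERATE |
    EX_STAGE_READY | EX_STAGE_DONE;

/* A stage executes only if its bit is set in the zio's pipeline mask. */
static int
ex_stage_enabled(unsigned pipeline, enum zio_stage_example stage)
{
    return ((pipeline & stage) != 0);
}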
zio.c:1353
/*
 * Execute the I/O pipeline until one of the following occurs:
 *
 * (1) the I/O completes
 * (2) the pipeline stalls waiting for dependent child I/Os
 * (3) the I/O issues, so we're waiting for an I/O completion interrupt
 * (4) the I/O is delegated by vdev-level caching or aggregation
 * (5) the I/O is deferred due to vdev-level queueing
 * (6) the I/O is handed off to another thread.
 *
 * In all cases, the pipeline stops whenever there's no CPU work; it never
 * burns a thread in cv_wait().
 *
 * There's no locking on io_stage because there's no legitimate way
 * for multiple threads to be attempting to process the same I/O.
 */
static zio_pipe_stage_t *zio_pipeline[];
void
zio_execute(zio_t *zio)
{
    zio->io_executor = curthread;

    while (zio->io_stage < ZIO_STAGE_DONE) {
        enum zio_stage pipeline = zio->io_pipeline;
        enum zio_stage stage = zio->io_stage;
        int rv;
        ...
        /* Advance to the next stage set in this zio's pipeline mask. */
        do {
            stage <<= 1;
        } while ((stage & pipeline) == 0);
        ...
        zio->io_stage = stage;

        /* The stage bit's position indexes the zio_pipeline table below. */
        rv = zio_pipeline[highbit64(stage) - 1](zio);

        if (rv == ZIO_PIPELINE_STOP)
            return;
        ...
    }
}
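The do/while loop left-shifts the current stage bit until it reaches the next stage enabled in the pipeline mask, and highbit64(stage) - 1 turns that single set bit into an index into the zio_pipeline function table. Here's a standalone sketch of that arithmetic; ex_highbit64 is my stand-in for the kernel's highbit64, which returns the 1-based position of the highest set bit:

#include <stdio.h>

/* Stand-in for the kernel's highbit64(): 1-based position of the
 * highest set bit, or 0 if no bit is set. */
static int
ex_highbit64(unsigned long long v)
{
    int pos = 0;
    while (v != 0) {
        v >>= 1;
        pos++;
    }
    return (pos);
}

int
main(void)
{
    /* ZIO_STAGE_DONE is 1 << 21, so it selects zio_pipeline[21], zio_done. */
    printf("%d\n", ex_highbit64(1ULL << 21) - 1);   /* prints 21 */
    return (0);
}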
The zio pipeline stages are defined in zio_impl.h:38
/*
* XXX -- Describe ZFS I/O pipeline here. Fill in as needed.
*
* The ZFS I/O pipeline is comprised of various stages which are
* defined in the zio_stage enum below. The individual stages are
* used to construct these basic I/O operations:
* Read, Write, Free, Claim, and Ioctl.
*
* I/O operations: (XXX - provide detail for each of the operations)
*
* Read:
* Write:
* Free:
* Claim:
* Ioctl:
*
* Although the most common pipelines are used by the basic I/O operations
* above, there are some helper pipelines (one could consider them
* sub-pipelines) which are used internally by the ZIO module and are
* explained below:
*
* Interlock Pipeline:
* The interlock pipeline is the most basic pipeline and is used by all
* of the I/O operations. The interlock pipeline does not perform any I/O
* and is used to coordinate the dependencies between I/Os that are being
* issued (i.e. the parent/child relationship).
*
* Vdev child Pipeline:
* The vdev child pipeline is responsible for performing the physical I/O.
* It is in this pipeline where the I/Os are queued and possibly cached.
*
* In addition to performing I/O, the pipeline is also responsible for
* data transformations. The transformations performed are based on the
* specific properties that the user may have selected and modify the
* behavior of the pipeline. Examples of supported transformations are
* compression, dedup, and nop writes. Transformations will either modify
* the data or the pipeline. This list below further describes each of
* the supported transformations:
*
* Compression:
* ZFS supports three different flavors of compression -- gzip, lzjb, and
* zle. Compression occurs as part of the write pipeline and is performed
* in the ZIO_STAGE_WRITE_BP_INIT stage.
*
* Dedup:
* Dedup reads are handled by the ZIO_STAGE_DDT_READ_START and
* ZIO_STAGE_DDT_READ_DONE stages. These stages are added to an existing
* read pipeline if the dedup bit is set on the block pointer.
* Writing a dedup block is performed by the ZIO_STAGE_DDT_WRITE stage
* and added to a write pipeline if a user has enabled dedup on that
* particular dataset.
*
* NOP Write:
* The NOP write feature is performed by the ZIO_STAGE_NOP_WRITE stage
* and is added to an existing write pipeline if a cryptographically
* secure checksum (i.e. SHA256) is enabled and compression is turned on.
* The NOP write stage will compare the checksums of the current data
* on-disk (level-0 blocks only) and the data that is currently being written.
* If the checksum values are identical then the pipeline is converted to
* an interlock pipeline skipping block allocation and bypassing the
* physical I/O. The nop write feature can handle writes in either
* syncing or open context (i.e. zil writes) and as a result is mutually
* exclusive with dedup.
*/
/*
 * zio pipeline stage definitions
 *
 * The flags in the comments indicate which of the basic I/O operations
 * use each stage: R = Read, W = Write, F = Free, C = Claim, I = Ioctl.
 */
enum zio_stage {
    ZIO_STAGE_OPEN              = 1 << 0,   /* RWFCI */
    ZIO_STAGE_READ_BP_INIT      = 1 << 1,   /* R---- */
    ZIO_STAGE_FREE_BP_INIT      = 1 << 2,   /* --F-- */
    ZIO_STAGE_ISSUE_ASYNC       = 1 << 3,   /* RWF-- */
    ZIO_STAGE_WRITE_BP_INIT     = 1 << 4,   /* -W--- */
    ZIO_STAGE_CHECKSUM_GENERATE = 1 << 5,   /* -W--- */
    ZIO_STAGE_NOP_WRITE         = 1 << 6,   /* -W--- */
    ZIO_STAGE_DDT_READ_START    = 1 << 7,   /* R---- */
    ZIO_STAGE_DDT_READ_DONE     = 1 << 8,   /* R---- */
    ZIO_STAGE_DDT_WRITE         = 1 << 9,   /* -W--- */
    ZIO_STAGE_DDT_FREE          = 1 << 10,  /* --F-- */
    ZIO_STAGE_GANG_ASSEMBLE     = 1 << 11,  /* RWFC- */
    ZIO_STAGE_GANG_ISSUE        = 1 << 12,  /* RWFC- */
    ZIO_STAGE_DVA_ALLOCATE      = 1 << 13,  /* -W--- */
    ZIO_STAGE_DVA_FREE          = 1 << 14,  /* --F-- */
    ZIO_STAGE_DVA_CLAIM         = 1 << 15,  /* ---C- */
    ZIO_STAGE_READY             = 1 << 16,  /* RWFCI */
    ZIO_STAGE_VDEV_IO_START     = 1 << 17,  /* RWF-I */
    ZIO_STAGE_VDEV_IO_DONE      = 1 << 18,  /* RWF-- */
    ZIO_STAGE_VDEV_IO_ASSESS    = 1 << 19,  /* RWF-I */
    ZIO_STAGE_CHECKSUM_VERIFY   = 1 << 20,  /* R---- */
    ZIO_STAGE_DONE              = 1 << 21   /* RWFCI */
};
#define ZIO_INTERLOCK_STAGES            \
    (ZIO_STAGE_READY |                  \
    ZIO_STAGE_DONE)

#define ZIO_INTERLOCK_PIPELINE          \
    ZIO_INTERLOCK_STAGES

#define ZIO_VDEV_IO_STAGES              \
    (ZIO_STAGE_VDEV_IO_START |          \
    ZIO_STAGE_VDEV_IO_DONE |            \
    ZIO_STAGE_VDEV_IO_ASSESS)

...

#define ZIO_WRITE_COMMON_STAGES         \
    (ZIO_INTERLOCK_STAGES |             \
    ZIO_VDEV_IO_STAGES |                \
    ZIO_STAGE_ISSUE_ASYNC |             \
    ZIO_STAGE_CHECKSUM_GENERATE)

#define ZIO_WRITE_PHYS_PIPELINE         \
    ZIO_WRITE_COMMON_STAGES
...
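Expanding ZIO_WRITE_PHYS_PIPELINE by hand shows exactly which stage bits a physical write carries (this is just the macros above flattened out, not code from zio_impl.h):

ZIO_WRITE_PHYS_PIPELINE ==
    ZIO_STAGE_READY | ZIO_STAGE_DONE |                  /* interlock stages */
    ZIO_STAGE_VDEV_IO_START | ZIO_STAGE_VDEV_IO_DONE |
    ZIO_STAGE_VDEV_IO_ASSESS |                          /* vdev I/O stages */
    ZIO_STAGE_ISSUE_ASYNC | ZIO_STAGE_CHECKSUM_GENERATE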
My main interest in this discussion is normal writes, so I'm omitting the other pipeline definitions.
The pipeline itself is initialized later in zio.c.
zio.c:3335
/*
* =====================================================================
* I/O pipeline definition
* =====================================================================
*/
static zio_pipe_stage_t *zio_pipeline[] = {
    NULL,
    zio_read_bp_init,
    zio_free_bp_init,
    zio_issue_async,
    zio_write_bp_init,
    zio_checksum_generate,
    zio_nop_write,
    zio_ddt_read_start,
    zio_ddt_read_done,
    zio_ddt_write,
    zio_ddt_free,
    zio_gang_assemble,
    zio_gang_issue,
    zio_dva_allocate,
    zio_dva_free,
    zio_dva_claim,
    zio_ready,
    zio_vdev_io_start,
    zio_vdev_io_done,
    zio_vdev_io_assess,
    zio_checksum_verify,
    zio_done
};
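Note that the array order mirrors the zio_stage enum: entry 0 corresponds to ZIO_STAGE_OPEN (which needs no work function, hence the NULL), and entry 21 corresponds to ZIO_STAGE_DONE. This is why zio_execute can index the table with highbit64(stage) - 1.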
zio_execute can be called from a number of places, but for the purposes of this discussion of file system I/O we're only interested in zio_wait(zio) and zio_nowait(zio).
zio_execute can return without having completed the zio if the zio has been handed off to a taskqueue thread (this typically happens in the vdev subsystem). zio_done, the last stage in the pipeline (see above), will call cv_broadcast on zio->io_cv if zio->io_waiter is non-NULL.
int
zio_wait(zio_t *zio)
{
    int error;
    ...
    zio->io_waiter = curthread;

    zio_execute(zio);

    mutex_enter(&zio->io_lock);
    while (zio->io_executor != NULL)
        cv_wait(&zio->io_cv, &zio->io_lock);
    mutex_exit(&zio->io_lock);

    error = zio->io_error;
    zio_destroy(zio);

    return (error);
}
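The handshake between zio_wait and zio_done is the classic condition-variable pattern. Here's a minimal userland sketch of the same idea, using pthreads instead of the kernel's mutex/cv primitives (the struct and function names are mine, not ZFS's):

#include <pthread.h>
#include <stddef.h>

/* Simplified stand-in for the fields zio_wait/zio_done rely on. */
struct ex_zio {
    pthread_mutex_t io_lock;
    pthread_cond_t  io_cv;
    void           *io_executor;    /* non-NULL while the pipeline runs */
    int             io_error;
};

/* What zio_done effectively does for a waited-on zio: record the error,
 * clear io_executor, and wake the waiter. */
static void
ex_zio_done(struct ex_zio *z, int error)
{
    pthread_mutex_lock(&z->io_lock);
    z->io_error = error;
    z->io_executor = NULL;
    pthread_cond_broadcast(&z->io_cv);
    pthread_mutex_unlock(&z->io_lock);
}

/* What zio_wait effectively does: sleep until io_executor is cleared. */
static int
ex_zio_wait(struct ex_zio *z)
{
    pthread_mutex_lock(&z->io_lock);
    while (z->io_executor != NULL)
        pthread_cond_wait(&z->io_cv, &z->io_lock);
    pthread_mutex_unlock(&z->io_lock);
    return (z->io_error);
}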
void
zio_nowait(zio_t *zio)
{
    ASSERT(zio->io_executor == NULL);

    if (zio->io_child_type == ZIO_CHILD_LOGICAL &&
        zio_unique_parent(zio) == NULL) {
        /*
         * This is a logical async I/O with no parent to wait for it.
         * We add it to the spa_async_root_zio "Godfather" I/O which
         * will ensure they complete prior to unloading the pool.
         */
        spa_t *spa = zio->io_spa;

        zio_add_child(spa->spa_async_zio_root[CPU_SEQID], zio);
    }

    zio_execute(zio);
}
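Note the CPU_SEQID index: spa_async_zio_root is an array of "Godfather" root zios with one slot per CPU, so unparented async zios attach to the current CPU's root rather than all contending on a single root zio's lock (that's my reading of the per-CPU array, at least).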
See my next post, "Execution of a zio_write zio", for an example of how zio_execute runs the pipeline for a write zio.