Channel: The Byte Flow

Understanding ZIO - ZFS I/O notes part 3/4


Each zio references its associated dirty data, disk blocks, spa, vdev, and transforms (e.g. compression, encryption). The operations to be executed are defined by a bitmask indicating which stages of the zio_pipeline (see below) will run on it. All I/O fans in to zio_execute, which is responsible for driving the zio through the pipeline.


zio.c:1353
/*
 * Execute the I/O pipeline until one of the following occurs:
 *
 * (1) the I/O completes
 * (2) the pipeline stalls waiting for dependent child I/Os
 * (3) the I/O issues, so we're waiting for an I/O completion interrupt
 * (4) the I/O is delegated by vdev-level caching or aggregation
 * (5) the I/O is deferred due to vdev-level queueing
 * (6) the I/O is handed off to another thread.
 *
 * In all cases, the pipeline stops whenever there's no CPU work; it never
 * burns a thread in cv_wait().
 *
 * There's no locking on io_stage because there's no legitimate way
 * for multiple threads to be attempting to process the same I/O.
 */
static zio_pipe_stage_t *zio_pipeline[];

void
zio_execute(zio_t *zio)
{
	zio->io_executor = curthread;

	while (zio->io_stage < ZIO_STAGE_DONE) {
		enum zio_stage pipeline = zio->io_pipeline;
		enum zio_stage stage = zio->io_stage;
		int rv;
		...
		do {
			stage <<= 1;
		} while ((stage & pipeline) == 0);

		...
		zio->io_stage = stage;
		rv = zio_pipeline[highbit64(stage) - 1](zio);

		if (rv == ZIO_PIPELINE_STOP)
			return;
		...
	}
}

The zio pipeline stages are defined in zio_impl.h:38

/*
 * XXX -- Describe ZFS I/O pipeline here. Fill in as needed.
 *
 * The ZFS I/O pipeline is comprised of various stages which are
 * defined in the zio_stage enum below. The individual stages are
 * used to construct these basic I/O operations:
 * Read, Write, Free, Claim, and Ioctl.
 *
 * I/O operations: (XXX - provide detail for each of the operations)
 *
 * Read:
 * Write:
 * Free:
 * Claim:
 * Ioctl:
 *
 * Although the most common pipelines are used by the basic I/O operations
 * above, there are some helper pipelines (one could consider them
 * sub-pipelines) which are used internally by the ZIO module and are
 * explained below:
 *
 * Interlock Pipeline:
 * The interlock pipeline is the most basic pipeline and is used by all
 * of the I/O operations. The interlock pipeline does not perform any I/O
 * and is used to coordinate the dependencies between I/Os that are being
 * issued (i.e. the parent/child relationship).
 *
 * Vdev child Pipeline:
 * The vdev child pipeline is responsible for performing the physical I/O.
 * It is in this pipeline that I/Os are queued and possibly cached.
 *
 * In addition to performing I/O, the pipeline is also responsible for
 * data transformations. The transformations performed are based on the
 * specific properties that the user may have selected and modify the
 * behavior of the pipeline. Examples of supported transformations are
 * compression, dedup, and nop writes. Transformations will either modify
 * the data or the pipeline. The list below further describes each of
 * the supported transformations:
 *
 * Compression:
 * ZFS supports three different flavors of compression -- gzip, lzjb, and
 * zle. Compression occurs as part of the write pipeline and is performed
 * in the ZIO_STAGE_WRITE_BP_INIT stage.
 *
 * Dedup:
 * Dedup reads are handled by the ZIO_STAGE_DDT_READ_START and
 * ZIO_STAGE_DDT_READ_DONE stages. These stages are added to an existing
 * read pipeline if the dedup bit is set on the block pointer.
 * Writing a dedup block is performed by the ZIO_STAGE_DDT_WRITE stage
 * and added to a write pipeline if a user has enabled dedup on that
 * particular dataset.
 *
 * NOP Write:
 * The NOP write feature is performed by the ZIO_STAGE_NOP_WRITE stage
 * and is added to an existing write pipeline if a cryptographically
 * secure checksum (i.e. SHA256) is enabled and compression is turned on.
 * The NOP write stage will compare the checksums of the current data
 * on-disk (level-0 blocks only) and the data that is currently being written.
 * If the checksum values are identical then the pipeline is converted to
 * an interlock pipeline skipping block allocation and bypassing the
 * physical I/O.  The nop write feature can handle writes in either
 * syncing or open context (i.e. zil writes) and as a result is mutually
 * exclusive with dedup.
 */

/*
 * zio pipeline stage definitions
 */
enum zio_stage {
	ZIO_STAGE_OPEN			= 1 << 0,	/* RWFCI */

	ZIO_STAGE_READ_BP_INIT		= 1 << 1,	/* R---- */
	ZIO_STAGE_FREE_BP_INIT		= 1 << 2,	/* --F-- */
	ZIO_STAGE_ISSUE_ASYNC		= 1 << 3,	/* RWF-- */
	ZIO_STAGE_WRITE_BP_INIT		= 1 << 4,	/* -W--- */

	ZIO_STAGE_CHECKSUM_GENERATE	= 1 << 5,	/* -W--- */

	ZIO_STAGE_NOP_WRITE		= 1 << 6,	/* -W--- */

	ZIO_STAGE_DDT_READ_START	= 1 << 7,	/* R---- */
	ZIO_STAGE_DDT_READ_DONE		= 1 << 8,	/* R---- */
	ZIO_STAGE_DDT_WRITE		= 1 << 9,	/* -W--- */
	ZIO_STAGE_DDT_FREE		= 1 << 10,	/* --F-- */

	ZIO_STAGE_GANG_ASSEMBLE		= 1 << 11,	/* RWFC- */
	ZIO_STAGE_GANG_ISSUE		= 1 << 12,	/* RWFC- */

	ZIO_STAGE_DVA_ALLOCATE		= 1 << 13,	/* -W--- */
	ZIO_STAGE_DVA_FREE		= 1 << 14,	/* --F-- */
	ZIO_STAGE_DVA_CLAIM		= 1 << 15,	/* ---C- */

	ZIO_STAGE_READY			= 1 << 16,	/* RWFCI */

	ZIO_STAGE_VDEV_IO_START		= 1 << 17,	/* RWF-I */
	ZIO_STAGE_VDEV_IO_DONE		= 1 << 18,	/* RWF-- */
	ZIO_STAGE_VDEV_IO_ASSESS	= 1 << 19,	/* RWF-I */

	ZIO_STAGE_CHECKSUM_VERIFY	= 1 << 20,	/* R---- */

	ZIO_STAGE_DONE			= 1 << 21	/* RWFCI */
};

#define	ZIO_INTERLOCK_STAGES			\
	(ZIO_STAGE_READY |			\
	ZIO_STAGE_DONE)

#define	ZIO_INTERLOCK_PIPELINE			\
	ZIO_INTERLOCK_STAGES

#define	ZIO_VDEV_IO_STAGES			\
	(ZIO_STAGE_VDEV_IO_START |		\
	ZIO_STAGE_VDEV_IO_DONE |		\
	ZIO_STAGE_VDEV_IO_ASSESS)
...
#define	ZIO_WRITE_COMMON_STAGES			\
	(ZIO_INTERLOCK_STAGES |			\
	ZIO_VDEV_IO_STAGES |			\
	ZIO_STAGE_ISSUE_ASYNC |			\
	ZIO_STAGE_CHECKSUM_GENERATE)

#define	ZIO_WRITE_PHYS_PIPELINE			\
	ZIO_WRITE_COMMON_STAGES
...


My main interest in this discussion is normal writes, so I'm excluding all the other stages.

The pipeline itself is initialized later in zio.c.
zio.c:3335

/*
 * =====================================================================
 * I/O pipeline definition
 * =====================================================================
 */
static zio_pipe_stage_t *zio_pipeline[] = {
NULL,
zio_read_bp_init,
zio_free_bp_init,
zio_issue_async,
zio_write_bp_init,
zio_checksum_generate,
zio_nop_write,
zio_ddt_read_start,
zio_ddt_read_done,
zio_ddt_write,
zio_ddt_free,
zio_gang_assemble,
zio_gang_issue,
zio_dva_allocate,
zio_dva_free,
zio_dva_claim,
zio_ready,
zio_vdev_io_start,
zio_vdev_io_done,
zio_vdev_io_assess,
zio_checksum_verify,
zio_done
};

zio_execute can be called from a number of places, but for the purposes of this discussion of file system I/O we're just interested in zio_wait(zio) and zio_nowait(zio).


zio_execute can return without having completed the zio, having handed it off to a taskqueue thread (this typically happens in the vdev subsystem). zio_done, the last stage in the pipeline (see above), will call cv_broadcast on zio->io_cv if zio->io_waiter is non-NULL.

int
zio_wait(zio_t *zio)
{
	int error;
	...
	zio->io_waiter = curthread;

	zio_execute(zio);

	mutex_enter(&zio->io_lock);
	while (zio->io_executor != NULL)
		cv_wait(&zio->io_cv, &zio->io_lock);
	mutex_exit(&zio->io_lock);

	error = zio->io_error;
	zio_destroy(zio);

	return (error);
}

void
zio_nowait(zio_t *zio)
{
	ASSERT(zio->io_executor == NULL);

	if (zio->io_child_type == ZIO_CHILD_LOGICAL &&
	    zio_unique_parent(zio) == NULL) {
		/*
		 * This is a logical async I/O with no parent to wait for it.
		 * We add it to the spa_async_root_zio "Godfather" I/O which
		 * will ensure they complete prior to unloading the pool.
		 */
		spa_t *spa = zio->io_spa;

		zio_add_child(spa->spa_async_zio_root[CPU_SEQID], zio);
	}

	zio_execute(zio);
}

See my next post, "Execution of a zio_write zio", for an example of how zio_execute drives the pipeline for a write zio.

