Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

Don Dupuis (ddupuissprint@earthlink.net)
11 Apr 2002 21:23:20 -0500


I have a program called cspm at http://sourceforge.net/projects/cspm
that uses Stephen's sard patches. It can show statistics for all
disks, controllers, and the system as a whole: I/Os, usage, merges,
and blocks per second. The program is written in Qt. Please have a
look at it and provide some feedback.

Thanks.

Don Dupuis

On Thu, 2002-04-11 at 06:06, Zlatko Calusic wrote:
> Hi!
>
> I have ported Stephen's excellent sard I/O accounting patch to
> 2.5.8-pre3. Additionally, I have adapted and cleaned up the iostat
> utility (originally written by Greg Franks) and created a man page
> for it.
>
> All this gives Linux some much-needed functionality: extensive
> measurement and reporting of disk device performance and bottlenecks.
> No server should be without it. See below for a real-life story.
>
> The patch, iostat utility and man page can be found at
>
> <URL:http://linux.inet.hr/>
>
> Also, find the patch appended to this mail, for quicker browsing.
>
>
> Some observations:
> -----------------
>
> The patch is quite well tested and stable (that is, it shouldn't eat
> your machine :)) on both UP and SMP configurations, but I have tested
> it only on IDE hardware. Please test it on your configuration
> (especially SCSI) and report the results.
>
> I had to make some changes, as kernel internals have changed since
> the patch was originally written. Alan has also included this patch
> in his 2.4.x-ac kernel series, but there it is not working well.
>
> The first problem is that, somehow, mysteriously, the ios_in_flight
> variable drops to -1 when the disks are idle. Of course, this skews
> lots of other numbers and iostat reports garbage. I tried to find the
> cause of this behaviour, but failed (it looks like we have a request
> fired on each disk whose start is never accounted but whose
> completion is?!). So I resolved it this way:
>
>         if (hd->ios_in_flight)
>                 --hd->ios_in_flight;
>
> which works well, but I would still love to know how the number of
> I/Os in flight can drop below zero. :)
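>
> For what it's worth, ios_in_flight is declared unsigned int in
> genhd.h, so a "-1" there really wraps around to a huge positive
> value, and disk_round_stats() then multiplies that bogus count into
> aveq on every round-off, which is why everything downstream turns
> to garbage. A trivial userspace illustration of the wrap-around (a
> hypothetical snippet, not part of the patch):
>
>         #include <stdio.h>
>
>         int main(void)
>         {
>                 unsigned int ios_in_flight = 0;
>
>                 --ios_in_flight;        /* one unmatched completion */
>                 printf("%u\n", ios_in_flight);  /* 4294967295 */
>                 return 0;
>         }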
>
> The second problem/nuisance is that blk_partition_remap() destroys
> the partition information in bio->bi_dev before the request is
> queued. That's why the -ac kernel doesn't report per-partition
> information. To work around that, I had to add code to
> locate_hd_struct() to extract the partition information again, which
> is kind of ugly (first we destroy it, then we recreate it). The
> other way we could solve this is to put something like
> bio->bi_acctdev in the bio structure and keep the original device
> there, but that is not only a duplication of information (which is
> also ugly), it would also waste precious memory. I don't know
> whether the block remapping done in the above-mentioned function is
> strictly necessary. Jens, could you help with this and/or suggest
> which way to go?
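>
> For readers without the tree at hand, the remapping in question
> boils down to something like this (a from-memory sketch, not the
> exact code):
>
>         /* Make the sector disk-absolute and point the bio at the
>          * whole-disk device; after this, the partition the IO was
>          * originally aimed at can no longer be read off the bio. */
>         static inline void blk_partition_remap(struct bio *bio)
>         {
>                 struct gendisk *g = get_gendisk(bio->bi_dev);
>                 int minor_nr = minor(bio->bi_dev);
>
>                 bio->bi_sector += g->part[minor_nr].start_sect;
>                 bio->bi_dev = mk_kdev(major(bio->bi_dev),
>                         minor_nr & ~((1 << g->minor_shift) - 1));
>         }
>
> That lost mapping is what the loop in locate_hd_struct() below
> reconstructs, by comparing req->sector against each partition's
> start_sect/nr_sects.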
>
>
> TODO:
> ----
>
> Please check the patch on whatever hardware you're equipped with and
> report success or failure. Somebody should really go through the man
> page and correct the grammar mistakes :), and maybe make it more
> readable in places. By the time this patch is ready for inclusion in
> Linus's kernel, the old disk statistics counters should be removed
> (as they are buggy, incomplete, and obsolete).
>
> This patch, together with the iostat utility, is probably a
> necessary ingredient for making quality measurements of the new
> IDE-TCQ code. Too bad I have a Maxtor drive on a Promise controller
> here (any hope of me running TCQ on that combination in the future?).
>
>
> Real life story:
> ---------------
>
> So, I had this application heavily inserting data into an Oracle
> database, and its performance mattered to me. As the disk rattled
> the whole time the inserts were running, I decided I needed to spend
> some time reordering datafiles to achieve peak performance (disk I/O
> was obviously the performance bottleneck). But after an hour or two
> I had gained only a few percent improvement. How could that happen?!
>
> Well, it turned out that the database and application were already
> tuned quite well, and the load on the database was (correctly)
> CPU-bound. As I have an SMP machine, I forgot that the 40% free CPU
> could be on the second processor, obviously unusable by the app in
> question. And the disk rattling was normal: lots of fsyncs, some
> seeks...
>
> So, when I finally managed to port this patch to 2.5.x (we are
> speaking of a home development database; 2.5.x kernels are OK for
> those, right?! :)), iostat showed only ~20% disk utilization! That
> is, there was _NO_ heavy I/O contention, and the disks were _NOT_
> the culprit. Live and learn.
>
> If only I had known that before I wasted my time needlessly moving
> big datafiles around and fiddling with the configuration... :(
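>
> By the way, for anyone who wants to sanity-check iostat's
> utilization figure by hand: it falls straight out of the new "use"
> column in /proc/partitions, which (after the MSEC() rounding in the
> patch) is the cumulative number of milliseconds the device had IO in
> flight. A rough standalone sketch, not part of the iostat sources
> (the device name and sampling interval are just examples):
>
>         #include <stdio.h>
>         #include <string.h>
>         #include <unistd.h>
>
>         /* Find the named device in /proc/partitions and return its
>          * "use" field (cumulative busy time in milliseconds). */
>         static unsigned long read_use_ms(const char *disk)
>         {
>                 char line[256], name[64];
>                 unsigned long v[11];    /* rio ... aveq */
>                 int major, minor, blocks;
>                 unsigned long use = 0;
>                 FILE *f = fopen("/proc/partitions", "r");
>
>                 if (!f)
>                         return 0;
>                 while (fgets(line, sizeof(line), f)) {
>                         if (sscanf(line, "%d %d %d %63s %lu %lu %lu"
>                                    " %lu %lu %lu %lu %lu %lu %lu %lu",
>                                    &major, &minor, &blocks, name,
>                                    v, v+1, v+2, v+3, v+4, v+5, v+6,
>                                    v+7, v+8, v+9, v+10) == 15 &&
>                             strcmp(name, disk) == 0) {
>                                 use = v[9];     /* the "use" column */
>                                 break;
>                         }
>                 }
>                 fclose(f);
>                 return use;
>         }
>
>         int main(void)
>         {
>                 unsigned long u1 = read_use_ms("hda");
>
>                 sleep(5);
>                 /* busy ms accumulated over 5000 ms of wall clock */
>                 printf("util: %.1f%%\n",
>                        (read_use_ms("hda") - u1) * 100.0 / 5000.0);
>                 return 0;
>         }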
>
>
> Regards,
>
> Zlatko
>
>
> Index: 7.12/include/linux/blkdev.h
> --- 7.12/include/linux/blkdev.h Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.1 644)
> +++ 7.14/include/linux/blkdev.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.2 644)
> @@ -34,6 +34,7 @@
> int rq_status; /* should split this into a few status bits */
> kdev_t rq_dev;
> int errors;
> + unsigned long start_time;
> sector_t sector;
> unsigned long nr_sectors;
> unsigned long hard_sector; /* the hard_* are block layer
> Index: 7.12/include/linux/genhd.h
> --- 7.12/include/linux/genhd.h Tue, 15 Jan 2002 22:59:02 +0100 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1 644)
> +++ 7.14/include/linux/genhd.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1.2.1 644)
> @@ -61,6 +61,22 @@
> unsigned long start_sect;
> unsigned long nr_sects;
> devfs_handle_t de; /* primary (master) devfs entry */
> +
> + /* Performance stats: */
> + unsigned int ios_in_flight;
> + unsigned int io_ticks;
> + unsigned int last_idle_time;
> + unsigned int last_queue_change;
> + unsigned int aveq;
> +
> + unsigned int rd_ios;
> + unsigned int rd_merges;
> + unsigned int rd_ticks;
> + unsigned int rd_sectors;
> + unsigned int wr_ios;
> + unsigned int wr_merges;
> + unsigned int wr_ticks;
> + unsigned int wr_sectors;
> int number; /* stupid old code wastes space */
> };
>
> @@ -238,6 +254,19 @@
> #ifdef __KERNEL__
>
> char *disk_name (struct gendisk *hd, int minor, char *buf);
> +
> +/*
> + * disk_round_stats is used to round off the IO statistics for a disk
> + * for a complete clock tick.
> + */
> +void disk_round_stats(struct hd_struct *hd);
> +
> +/*
> + * Account for the completion of an IO request (used by drivers which
> + * bypass the normal end_request processing)
> + */
> +struct request;
> +void req_finished_io(struct request *);
>
> extern void devfs_register_partitions (struct gendisk *dev, int minor,
> int unregister);
> Index: 7.12/drivers/scsi/scsi_lib.c
> --- 7.12/drivers/scsi/scsi_lib.c Tue, 12 Feb 2002 09:26:34 +0100 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2 644)
> +++ 7.14/drivers/scsi/scsi_lib.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2.1.1 644)
> @@ -382,6 +382,7 @@
> */
> if (req->waiting)
> complete(req->waiting);
> + req_finished_io(req);
>
> add_blkdev_randomness(major(req->rq_dev));
>
> Index: 7.12/drivers/block/genhd.c
> --- 7.12/drivers/block/genhd.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.1 644)
> +++ 7.14/drivers/block/genhd.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.2 644)
> @@ -168,6 +168,10 @@
> read_unlock(&gendisk_lock);
> }
>
> +
> +/* Normalise the disk performance stats to a notional timer tick of 1ms. */
> +#define MSEC(x) ((x) * 1000 / HZ)
> +
> static int show_partition(struct seq_file *part, void *v)
> {
> struct gendisk *sgp = v;
> @@ -175,19 +179,34 @@
> char buf[64];
>
> if (sgp == gendisk_head)
> - seq_puts(part, "major minor #blocks name\n\n");
> + seq_puts(part, "major minor #blocks name "
> + "rio rmerge rsect ruse "
> + "wio wmerge wsect wuse "
> + "running use aveq\n\n");
>
> /* show all non-0 size partitions of this disk */
> for (n = 0; n < (sgp->nr_real << sgp->minor_shift); n++) {
> - if (sgp->part[n].nr_sects == 0)
> + struct hd_struct *hd = &sgp->part[n];
> +
> + if (!hd->nr_sects)
> continue;
> - seq_printf(part, "%4d %4d %10d %s\n",
> - sgp->major, n, sgp->sizes[n],
> - disk_name(sgp, n, buf));
> + disk_round_stats(hd);
> + seq_printf(part, "%4d %4d %10d %s "
> + "%d %d %d %d %d %d %d %d %d %d %d\n",
> + sgp->major, n, sgp->sizes[n],
> + disk_name(sgp, n, buf),
> + hd->rd_ios, hd->rd_merges,
> + hd->rd_sectors, MSEC(hd->rd_ticks),
> + hd->wr_ios, hd->wr_merges,
> + hd->wr_sectors, MSEC(hd->wr_ticks),
> + hd->ios_in_flight, MSEC(hd->io_ticks),
> + MSEC(hd->aveq));
> }
>
> return 0;
> }
> +
> +#undef MSEC
>
> struct seq_operations partitions_op = {
> start: part_start,
> Index: 7.12/drivers/block/ll_rw_blk.c
> --- 7.12/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.1 644)
> +++ 7.14/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 18:37:24 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.3 644)
> @@ -944,6 +944,136 @@
> printk(KERN_ERR "drive_stat_acct: cmd not R/W?\n");
> }
>
> +/* Return up to two hd_structs on which to do IO accounting for a given
> + * request. On a partitioned device, we want to account both against
> + * the partition and against the whole disk.
> + */
> +static void locate_hd_struct(struct request *req,
> + struct hd_struct **hd1,
> + struct hd_struct **hd2)
> +{
> + int p, n_part;
> + struct gendisk *gd;
> + struct hd_struct *hd;
> +
> + *hd1 = NULL;
> + *hd2 = NULL;
> +
> + gd = get_gendisk(req->rq_dev);
> + if (gd && gd->part) {
> + *hd1 = &gd->part[0];
> +
> + /* Try to reconstruct the partition */
> + n_part = 1 << gd->minor_shift;
> + for (p = 1; p < n_part; p++) {
> + hd = &gd->part[p];
> + /* skip empty and extended partitions */
> + if (hd->nr_sects < 3)
> + continue;
> + if (req->sector >= hd->start_sect
> + && req->sector < hd->start_sect + hd->nr_sects)
> + break;
> + }
> + if (p < n_part)
> + *hd2 = hd;
> + }
> +}
> +
> +/* Round off the performance stats on an hd_struct. The average IO
> + * queue length and utilisation statistics are maintained by observing
> + * the current state of the queue length and the amount of time it has
> + * been in this state for. Normally, that accounting is done on IO
> + * completion, but that can result in more than a second's worth of IO
> + * being accounted for within any one second, leading to >100%
> + * utilisation. To deal with that, we do a round-off before returning
> + * the results when reading /proc/partitions, accounting immediately for
> + * all queue usage up to the current jiffies and restarting the counters
> + * again.
> + */
> +void disk_round_stats(struct hd_struct *hd)
> +{
> + unsigned long now = jiffies;
> +
> + hd->aveq += hd->ios_in_flight * (now - hd->last_queue_change);
> + hd->last_queue_change = now;
> +
> + if (hd->ios_in_flight)
> + hd->io_ticks += (now - hd->last_idle_time);
> + hd->last_idle_time = now;
> +}
> +
> +static inline void down_ios(struct hd_struct *hd)
> +{
> + disk_round_stats(hd);
> + if (hd->ios_in_flight)
> + --hd->ios_in_flight;
> +}
> +
> +static inline void up_ios(struct hd_struct *hd)
> +{
> + disk_round_stats(hd);
> + ++hd->ios_in_flight;
> +}
> +
> +static void account_io_start(struct hd_struct *hd, struct request *req,
> + int merge, int sectors)
> +{
> + switch (rq_data_dir(req)) {
> + case READ:
> + if (merge)
> + hd->rd_merges++;
> + hd->rd_sectors += sectors;
> + break;
> + case WRITE:
> + if (merge)
> + hd->wr_merges++;
> + hd->wr_sectors += sectors;
> + break;
> + default: break;
> + }
> + if (!merge)
> + up_ios(hd);
> +}
> +
> +static void account_io_end(struct hd_struct *hd, struct request *req)
> +{
> + unsigned long duration = jiffies - req->start_time;
> + switch (rq_data_dir(req)) {
> + case READ:
> + hd->rd_ticks += duration;
> + hd->rd_ios++;
> + break;
> + case WRITE:
> + hd->wr_ticks += duration;
> + hd->wr_ios++;
> + break;
> + default: break;
> + }
> + down_ios(hd);
> +}
> +
> +void req_new_io(struct request *req, int merge, int sectors)
> +{
> + struct hd_struct *hd1, *hd2;
> +
> + locate_hd_struct(req, &hd1, &hd2);
> + if (hd1)
> + account_io_start(hd1, req, merge, sectors);
> + if (hd2)
> + account_io_start(hd2, req, merge, sectors);
> +}
> +
> +void req_finished_io(struct request *req)
> +{
> + struct hd_struct *hd1, *hd2;
> +
> + locate_hd_struct(req, &hd1, &hd2);
> + if (hd1)
> + account_io_end(hd1, req);
> + if (hd2)
> + account_io_end(hd2, req);
> +}
> +
> /*
> * add-request adds a request to the linked list.
> * queue lock is held and interrupts disabled, as we muck with the
> @@ -1018,6 +1148,8 @@
> * counts here.
> */
> if (q->merge_requests_fn(q, req, next)) {
> + struct hd_struct *hd1, *hd2;
> +
> elv_merge_requests(q, req, next);
>
> blkdev_dequeue_request(next);
> @@ -1027,6 +1159,15 @@
>
> req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
>
> + /* One last thing: we have removed a request, so we now have
> + * one less expected IO to complete for accounting purposes.
> + */
> + locate_hd_struct(req, &hd1, &hd2);
> + if (hd1)
> + down_ios(hd1);
> + if (hd2)
> + down_ios(hd2);
> +
> blkdev_release_request(next);
> }
> }
> @@ -1117,6 +1258,7 @@
> req->biotail = bio;
> req->nr_sectors = req->hard_nr_sectors += nr_sectors;
> drive_stat_acct(req, nr_sectors, 0);
> + req_new_io(req, 1, nr_sectors);
> attempt_back_merge(q, req);
> goto out;
>
> @@ -1142,6 +1284,7 @@
> req->sector = req->hard_sector = sector;
> req->nr_sectors = req->hard_nr_sectors += nr_sectors;
> drive_stat_acct(req, nr_sectors, 0);
> + req_new_io(req, 1, nr_sectors);
> attempt_front_merge(q, req);
> goto out;
>
> @@ -1210,6 +1353,8 @@
> req->waiting = NULL;
> req->bio = req->biotail = bio;
> req->rq_dev = bio->bi_dev;
> + req->start_time = jiffies;
> + req_new_io(req, 0, nr_sectors);
> add_request(q, req, insert_here);
> out:
> if (freereq)
> @@ -1649,6 +1794,7 @@
> {
> if (req->waiting)
> complete(req->waiting);
> + req_finished_io(req);
>
> blkdev_release_request(req);
> }
>
> --
> Zlatko

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/