Re: ext3 performance bottleneck as the number of spindles gets large

Andrew Morton (akpm@zip.com.au)
Fri, 21 Jun 2002 12:56:50 -0700


mgross wrote:
>
> ...
> >And please tell us some more details regarding the performance bottleneck.
> >I assume that you mean that the IO rate per disk slows as more
> >disks are added to an adapter? Or does the total throughput through
> >the adapter fall as more disks are added?
> >
> No, the IO block write throughput for the system goes down as drives are
> added under this workload. We measure the system throughput, not the
> per-drive throughput, but one could infer the per-drive throughput by
> dividing.
>
> Running bonnie++ with 300MB files doing 8KB sequential writes, we get
> the following system-wide throughput as a function of the number of
> drives attached and the number of adapters.
>
> One adapter
> 1 drive per adapter     127,702 KB/sec
> 2 drives per adapter     93,283 KB/sec
> 6 drives per adapter     85,626 KB/sec

127 megabytes/sec to a single disk? Either that's a very
fast disk, or you're using very small bytes :)

> 2 adapters
> 1 drive per adapter      92,095 KB/sec
> 2 drives per adapter    110,956 KB/sec
> 6 drives per adapter    106,883 KB/sec
>
> 4 adapters
> 1 drive per adapter     121,125 KB/sec
> 2 drives per adapter    117,575 KB/sec
> 6 drives per adapter    116,570 KB/sec
>

Possibly what is happening here is that a significant amount
of dirty data is being left in memory and is escaping the
measurement period. When you run the test against more disks,
the *total* amount of dirty memory is increased, so the kernel
is forced to perform more writeback within the measurement period.

So with two filesystems you're actually performing more I/O within the
measurement interval than you were with one.

You need to either ensure that all I/O is occurring *within the
measurement interval*, or make the test write so much data (wrt
main memory size) that any leftover unwritten stuff is insignificant.

bonnie++ is too complex for this work. Suggest you use
http://www.zip.com.au/~akpm/linux/write-and-fsync.c
which will just write and fsync a file. Time how long that
takes. Or you could experiment with bonnie++'s fsync option.
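
The guts of such a tool are tiny. Roughly the following (just a sketch;
the real write-and-fsync.c may differ in details such as buffer size,
but the idea is simply: write, fsync, stop the clock; the file and a
size in megabytes are assumed to come from the command line):

/*
 * Sketch of a write-and-fsync style tool: write <megabytes> of data to
 * <file>, fsync it, and report the elapsed time and bandwidth.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

#define CHUNK (1024 * 1024)             /* 1MB per write() - arbitrary */

int main(int argc, char *argv[])
{
        static char buf[CHUNK];
        struct timeval start, end;
        long megs, i;
        double secs;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "Usage: %s <file> <megabytes>\n", argv[0]);
                exit(1);
        }
        megs = atol(argv[2]);

        fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        memset(buf, 0xaa, sizeof(buf));
        gettimeofday(&start, NULL);

        for (i = 0; i < megs; i++) {
                if (write(fd, buf, CHUNK) != CHUNK) {
                        perror("write");
                        exit(1);
                }
        }
        if (fsync(fd)) {        /* the data isn't "written" until this returns */
                perror("fsync");
                exit(1);
        }
        gettimeofday(&end, NULL);
        close(fd);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1000000.0;
        printf("%ld MB in %.2f seconds: %.2f MB/sec\n",
               megs, secs, megs / secs);
        return 0;
}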

My suggestion is to work with this workload:

for i in /mnt/1 /mnt/2 /mnt/3 /mnt/4 ...
do
        write-and-fsync $i/foo 4000 &
done

which will write a 4 gig file to each disk. This will defeat any
caching effects, and it's a much simpler workload which lets you test
one thing in isolation.
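
If you want a single wall-clock number for the whole run, start the
clock before the first write and stop it only after the last fsync has
completed, so nothing escapes the measurement interval. Something
along these lines would do it (again just a sketch; the mount points
and the 4000MB size are only examples):

/*
 * Sketch: one writer per filesystem, run in parallel, timed as a whole
 * from the first write() until every child's fsync() has completed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <sys/time.h>

#define CHUNK   (1024 * 1024)
#define MEGS    4000                    /* ~4GB per file */

static const char *mounts[] = { "/mnt/1", "/mnt/2", "/mnt/3", "/mnt/4" };

static void write_one(const char *dir)
{
        static char buf[CHUNK];
        char path[256];
        long i;
        int fd;

        snprintf(path, sizeof(path), "%s/foo", dir);
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror(path);
                exit(1);
        }
        memset(buf, 0xaa, sizeof(buf));
        for (i = 0; i < MEGS; i++) {
                if (write(fd, buf, CHUNK) != CHUNK) {
                        perror("write");
                        exit(1);
                }
        }
        if (fsync(fd)) {
                perror("fsync");
                exit(1);
        }
        close(fd);
        exit(0);
}

int main(void)
{
        int i, nr = sizeof(mounts) / sizeof(mounts[0]);
        struct timeval start, end;
        double secs;

        gettimeofday(&start, NULL);
        for (i = 0; i < nr; i++) {
                pid_t pid = fork();

                if (pid == 0)
                        write_one(mounts[i]);   /* child: never returns */
                if (pid < 0) {
                        perror("fork");
                        exit(1);
                }
        }
        for (i = 0; i < nr; i++)                /* wait for every fsync */
                wait(NULL);
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1000000.0;
        printf("%d x %dMB in %.2f seconds: %.2f MB/sec aggregate\n",
               nr, MEGS, secs, (double)nr * MEGS / secs);
        return 0;
}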

So anyway. All this possibly explains the "negative scalability"
in the single-adapter case. For four adapters with one disk on
each, 120 megs/sec seems reasonable, assuming the sustained
write bandwidth of a single disk is 30 megs/sec.

With four adapters and six disks on each you should be doing better
than that. Something does appear to be wrong there.
