>On Thu, 2003-06-26 at 09:04, Nick Piggin wrote:
>>>One of the things I tried in this area was basically queue ownership.
>>>When each process woke up, he was given strict ownership of the queue
>>>and could submit up to N number of requests. One process waited for
>>>ownership in a yield loop for a max limit of a certain number of
>>>jiffies, all the others waited on the request queue.
>>Not sure exactly what you mean by one process waiting for ownership
>>in a yield loop, but why don't you simply allow the queue "owner" to
>>submit up to a maximum of N requests within a time limit. Once either
>>limit expires (or, rarely, another might become owner -) the process
>>would just be put to sleep by the normal queue_full mechanism.
>You need some way to wakeup the queue after that time limit has expired,
>in case the owner never submits another request. This can either be a
>timer or a process in a yield loop. Given that very short expire time I
>set (10 jiffies), I went for the yield loop.
>>>It generally increased the latency in __get_request wait by a multiple
>>>of N. I didn't keep it because the current patch is already full of
>>>subtle interactions, I didn't want to make things more confusing than
>>>they already were ;-)
>>Yeah, something like that. I think that in a queue full situation,
>>the processes are wanting to submit more than 1 request though. So
>>the better thoughput you can achieve by batching translates to
>>better effective throughput. Read my recent debate with Andrea
>>about this though - I couldn't convince him!
>Well, it depends ;-) I think we've got 3 basic kinds of procs during a
>1) wants to submit lots of somewhat contiguous io
>2) wants to submit a single io
>3) wants to submit lots of random io
>>From a throughput point of view, we only care about giving batch
>ownership to #1. giving batch ownership to #3 will help reduce context
>switches, but if it helps throughput than the io wasn't really random
>(you've got a good point about locality below, drive write caches make a
>huge difference there).
>The problem I see in 2.4 is the elevator can't tell any of these cases
>apart, so any attempt at batch ownership is certain to be wrong at least
>part of the time.
I think though, for fairness, if we allow one to submit a batch
of requests, we have to give that opportunity to the others.
And yeah, it does reduce context switches, and it does improve
throughput for "random" localised IO.
>>I have seen much better maximum latencies, 2-3 times the
>>throughput, and an order of magnitude less context switching on
>>many threaded tiobench write loads when using batching.
>>In short, measuring get_request latency won't give you the full
>Very true. But get_request latency is the minimum amount of time a
>single read is going to wait (in 2.4.x anyway), and that is what we need
>to focus on when we're trying to fix interactive performance.
The read situation is different to write. To fill the read queue,
you need queue_nr_requests / 2-3 (for readahead) reading processes
to fill the queue, more if the reads are random.
If this kernel is being used interactively, its not our fault we
might not give quite as good interactive performance. I'm sure
the fileserver admin would rather take the tripled bandwidth ;)
That said, I think a lot of interactive programs will want to do
more than 1 request at a time anyway.
>>>The real problem with this approach is that we're guessing about the
>>>number of requests a given process wants to submit, and we're assuming
>>>those requests are going to be highly mergable. If the higher levels
>>>pass these hints down to the elevator, we should be able to do a better
>>>job of giving both low latency and high throughput.
>>No, the numbers (batch # requests, batch time) are not highly scientific.
>>Simply when a process wakes up, we'll let them submit a small burst of
>>requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
>>make this more effective, fair, and without possible missed wakes because
>>io contexts means that multiple processes can be batching at the same
>>time, and dynamically allocated requests means it doesn't matter if we
>>go a bit over the queue limit.
>I agree 2.5 has a lot more room for the contexts to be effective, and I
>think they are a really good idea.
>>I think a decent solution for 2.4 would be to simply have the one queue
>>owner, but he allowed the queue to fall below the batch limit, wake
>>someone else and make them the owner. It can be a bit less fair, and
>>it doesn't work across queues, but they're less important cases.
>>>Between bios and the pdflush daemons, I think 2.5 is in pretty good
>>>shape to do what we need. I'm not 100% sure we need batching when the
>>>requests being submitted are not highly mergable, but I haven't put lots
>>>of thought into that part yet.
>>No, there are a couple of problems here.
>>First, good locality != sequential. I saw tiobench 256 random write
>>throughput _doubled_ because each process is writing within its own
>>Second, mergeable doesn't mean anything if your request size only
>>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
>>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
>Well, play around with raw io, my box writes at roughly disk speed with
>128k synchronous requests (contiguous writes).
Yeah, I'm not talking about request overhead - I think a 128K sized
request is just fine. But when there are 256 threads writing, with
FIFO method, 128 threads will each have 1 request in the queue. If
they are sequential writers, each request will probably be 128K.
That isn't enough to get good disk bandwidth. The elevator _has_ to
make a suboptimal decision.
With batching, say 8 processes have 16 sequential requests on the
queue each. The elevator can make good choices.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to firstname.lastname@example.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/