If you surf on over to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.15/ you'll see
some code which performs 64k I/Os, reading direct into the pagecache.
It reduces the cost of reading from disk by 25% in my testing.
(That code is ready to go - just waiting for Linus to rematerialise).
The remaining profile is interesting.  The workload is simply
`cat large_file > /dev/null':
address  ticks    %total      symbol
c012b448 33       0.200877    kmem_cache_free         
c0131af8 33       0.200877    flush_all_zero_pkmaps   
c01e51bc 33       0.200877    blk_recount_segments    
c01f9aec 34       0.206964    hpt374_udma_stop        
c016eb80 36       0.219138    ext2_get_block          
c0133320 37       0.225225    page_cache_readahead    
c013740c 37       0.225225    __getblk                
c0131ba0 41       0.249574    kmap_high               
c01fa1c4 41       0.249574    ata_start_dma           
c016e7dc 46       0.28001     ext2_block_to_path      
c01e5320 48       0.292184    blk_rq_map_sg           
c01c65d0 50       0.304358    radix_tree_reserve      
c014bfb0 53       0.32262     do_mpage_bio_readpage   
c01f4d88 54       0.328707    ata_irq_request         
c0136b34 64       0.389579    __get_hash_table        
c0126a00 72       0.438276    do_generic_file_read    
c016e910 82       0.499148    ext2_get_branch         
c0126610 88       0.535671    unlock_page             
c0106df4 91       0.553932    system_call             
c012b04c 94       0.572194    kmem_cache_alloc        
c01f2494 126      0.766983    ata_taskfile            
c01c66e8 163      0.992208    radix_tree_lookup       
c012d250 165      1.00438     rmqueue                 
c0105274 2781     16.9284     default_idle            
c0126e48 11009    67.0136     file_read_actor         
That's a single 500MHz PIII Xeon, reading at 35 megabytes/sec.
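To be clear about where the other 67% goes: file_read_actor is
essentially just the copy of each pagecache page into userspace.  A
rough sketch of its shape (simplified, not the actual 2.5 source -
the kmap there is also why kmap_high shows up in the profile):

/*
 * Rough sketch only - approximately what the read "actor" called from
 * do_generic_file_read() does for each pagecache page.  Simplified;
 * not a copy of the real 2.5 file_read_actor.
 */
#include <linux/fs.h>
#include <linux/errno.h>
#include <linux/highmem.h>
#include <asm/uaccess.h>

static int read_actor_sketch(read_descriptor_t *desc, struct page *page,
                             unsigned long offset, unsigned long size)
{
        char *kaddr;
        unsigned long left;

        if (size > desc->count)
                size = desc->count;

        kaddr = kmap(page);             /* hence kmap_high in the profile */
        left = __copy_to_user(desc->buf, kaddr + offset, size);
        kunmap(page);

        if (left) {
                size -= left;           /* bytes actually copied */
                desc->error = -EFAULT;
        }
        desc->count -= size;
        desc->written += size;
        desc->buf += size;
        return size;
}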
There's about 17% "overhead" here - everything which isn't the copy
into userspace (file_read_actor) or idle time.  Going to a larger
filesystem blocksize would provide almost zero benefit in the I/O
layers.  The savings from larger blocks and larger pages would show
up in the radix tree operations, get_block and a few other places.
At a guess, 8k blocks would cut the overhead to 10-12%.
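To put a rough number on where those savings would come from: each
block read costs at least one get_block call plus radix-tree/buffer
work, so the per-block operation rate scales inversely with the block
size.  A trivial back-of-envelope calculation (illustrative only -
the 10-12% above is still just a guess):

/* Back-of-envelope: per-block metadata operations generated by a
 * 35MB/s read stream at various filesystem block sizes. */
#include <stdio.h>

int main(void)
{
        const double mbytes_per_sec = 35.0;     /* measured above */
        const long block_sizes[] = { 1024, 4096, 8192, 16384 };
        unsigned int i;

        for (i = 0; i < sizeof(block_sizes) / sizeof(block_sizes[0]); i++) {
                double blocks_per_sec =
                        mbytes_per_sec * 1024.0 * 1024.0 / block_sizes[i];
                printf("%6ld-byte blocks: ~%.0f get_block/lookup ops per second\n",
                       block_sizes[i], blocks_per_sec);
        }
        return 0;
}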
And a larger block size significantly penalises bandwidth in the
many-small-file case.  The larger the blocks, the worse it gets.
You end up having to implement complexities such as tail-merging to
get around the new inefficiency which your workaround for the old
inefficiency created.
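The flip side, equally back-of-envelope: the tail waste a small file
suffers as the block size grows.  The 6k file size here is an
assumption for illustration, not a measurement:

/* Internal fragmentation ("tail" waste) for a small file at various
 * block sizes.  The 6k file size is an assumed example. */
#include <stdio.h>

int main(void)
{
        const long file_size = 6 * 1024;
        const long block_sizes[] = { 1024, 4096, 16384, 65536 };
        unsigned int i;

        for (i = 0; i < sizeof(block_sizes) / sizeof(block_sizes[0]); i++) {
                long bs = block_sizes[i];
                long blocks = (file_size + bs - 1) / bs;        /* round up */
                long wasted = blocks * bs - file_size;

                printf("%6ld-byte blocks: %ld blocks, %ld bytes wasted (%.0f%%)\n",
                       bs, blocks, wasted, 100.0 * wasted / (blocks * bs));
        }
        return 0;
}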
And larger pages with small blocks aren't an answer either - the CPU
load and seek costs of two blocks per page are measurable.  At four
blocks per page it gets serious.
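Same sort of arithmetic for the blocks-per-page case: every extra
block per page is another get_block call per readpage, and another
chance for the blocks to be discontiguous on disk (hence the seek
cost).  Again, just an illustration:

/* Per-readpage mapping work when the page size grows but the
 * filesystem block size stays small.  Illustration only. */
#include <stdio.h>

int main(void)
{
        const long block_size = 1024;
        const long page_sizes[] = { 4096, 8192, 16384 };
        unsigned int i;

        for (i = 0; i < sizeof(page_sizes) / sizeof(page_sizes[0]); i++) {
                long blocks_per_page = page_sizes[i] / block_size;

                printf("%6ld-byte pages, %ld-byte blocks: %ld get_block calls "
                       "and up to %ld separate disk extents per page\n",
                       page_sizes[i], block_size,
                       blocks_per_page, blocks_per_page);
        }
        return 0;
}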
Small pages and pagesize=blocksize are good.  I see no point in
going to larger pages or blocks until the current scheme is 
working efficiently and has been *proven* to still be unfixably
inadequate.
The current code sucks.  Simply amortising that suckiness across
larger blocks is not the right thing to do.