On a very large SMP machine with many CPUS scripts are run with tens of simultaneous jobs (fewer than the number of CPUs) like this:
some_program -in FIFO1 >OUTPUT1 2>s_p1.log </dev/null & some_program -in FIFO2 >OUTPUT2 2>s_p2.log </dev/null & ... some_program -in FIFO40 >OUTPUT40 2>s_p40.log </dev/null & splitter_program -in real_input.dat -out FIFO1,FIFO2...FIFO40
The splitter reads the input data flat out and distributes it to the FIFOs in order. (Records 1,41,81… to FIFO1; 2,42,82 to FIFO2, etc.) The splitter has low overhead and can pretty much process data as fast as the file system can supply it.
Each some_program processes its stream and writes it to its output file. However, nothing controls the order in which the file system sees these writes. The writes are also very small, on the order of 10 bytes. The script “knows” that there are 40 streams here and that they could be buffered in 20M (or whatever) chunks, and then each chunk written to the file system sequentially. That is, queued writes should be used to maximize write speed to the disks. The OS, however, just sees a bunch of writes at about the same rate on each of the 40 streams.
What happens in practice during a run is that the subprocesses get a lot of CPU time (in top, >80%), then a flush process appears (10% CPU), and all the others drop to low CPU (1%), then it goes back to the higher rate. These pauses go on for several seconds at a time. The flush means that the writes are overwhelming the file caching. Also I think the OS and/or the underlying RAID controller is probably bouncing the physical disk heads around erratically which is reducing the ultimate write speed to the physical disks. That is just a guess though, since it is hard to say what exactly is happening as there is file cache (in a system with over 500Gb of RAM) and a RAID controller between the writes and the disk.
Is there a program or method around for controlling this sort of IO, forcing the file system writes to queue nicely to maximize write speed?
The “buffer” program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn’t be an orderly queuing of the writes, so several could go out at the same time. If the data rate in the output streams was uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written because any process that cannot write an output will not reads its next input, and that would stall the splitter, as all I/O is synchronous. The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
There are dozens of parameters for tuning the file system, some of those might help. The scheduler was changed from cfq to deadline because the system was locking up for minutes at a time with the former.