Along with some other experimental work (multiple transmit queues, which have just now made it into HEAD), I now have per-cpu flow caches in a personal svn branch:
http://svn.freebsd.org/base/user/kmacy/HEAD_fast_multi_xmit/
Interestingly enough, with small numbers of connections and TSO enabled, there is little measurable benefit for either normal or 9k frames. Evidently TSO does a sufficiently good job of coalescing calls down into ip_output that the reduction in lock contention and lookup time doesn't measurably impact performance. However, with an MTU of 1500 bytes and TSO disabled there is a clear impact. Below, "current" is flow caching disabled with a single transmit queue, and "multiflow" is flow caching and multiple transmit queues enabled. Measurements are in Gbps using Robert Watson's tcpp:
./tcpp -c 10.0.0.150 -p 4 -t 100 -m 100 -b 10000000
This connects to 10.0.0.150 with 4 processes; each process creates 100 connections and pushes 10MB across each connection before closing it.
ministat -c 90 -w 74 current multiflow
x current
+ multiflow
+--------------------------------------------------------------------------+
| xx + |
|x x xxx + + |
|x x xx x xxxx ++ + ++ + + +++ + +|
| |____A__M_| |________AM_______| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 19 1.886103 2.457827 2.388812 2.2664012 0.20140299
+ 15 3.476315 4.744577 4.153121 4.1130334 0.35428067
Difference at 90.0% confidence
1.84663 +/- 0.163126
81.4786% +/- 7.19758%
(Student's t, pooled s = 0.2788)
As you can see from ministat, there is an 81% increase in throughput over the default implementation for 400 connections spread over 4 processes. Interestingly, there is a 30% increase in performance even for single connections. This would appear to indicate that rtentry and ARP lookups are fairly expensive.
I took these measurements a week ago. I've managed to substantially further improve aggregate throughput since then. However, I'll talk about that another time.