IP ToS routing problem and fix

Mark Frazer (mark@somanetworks.com)
Tue, 5 Jun 2001 21:26:34 -0400


--IS0zKkzwUGydFO0o
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi Alexey. I've been having a problem getting ToS routing working
properly, well, the way I wanted it to. Basically, it would not route
using anything except the RFC 1349 ToS settings.

I've prepared the attached patch to add a new configuration option to
allow routing based on RFC 2474 Diffserv DSCP values in the IPv4 TOS
field.

Attached is a test script which works for the following setup.

The test machine has eth1 and eth2 on the same ethernet network. There is
another machine on the test ethernet network that the test machine pings.
The IP addresses used in the setup are:
test:eth1 = 192.168.0.1.
test:eth2 = 192.168.0.2.
target = 192.168.0.3.

I've also included the output of the test for an unpatched 2.4.5 system
and a patched 2.4.5 system.

Please have a look and let me know what you think.

I wouldn't promote putting this in the regular kernel until I've let it
run around here for a few weeks.

cheers
-mark

-- 
Mark Frazer <mark@somanetworks.com>

--IS0zKkzwUGydFO0o Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="patch.dscp"

# Configure.help needs updating for the new config option. # This option makes the old IP_ROUTE_TOS boolean a choice # option allowing either the old RFC1349 behaviour or the # RFC2474 behaviour. diff -urN linux-2.4.5/Documentation/Configure.help linux-2.4.5-mjf/Documentation/Configure.help --- linux-2.4.5/Documentation/Configure.help Thu May 24 18:03:06 2001 +++ linux-2.4.5-mjf/Documentation/Configure.help Mon Jun 4 18:23:31 2001 @@ -3943,13 +3943,30 @@ equal "cost" and chooses one of them in a non-deterministic fashion if a matching packet arrives. -IP: use TOS value as routing key -CONFIG_IP_ROUTE_TOS - The header of every IP packet carries a TOS (Type Of Service) value +IP: use of TOS value as routing key +CONFIG_IP_ROUTE_TOSNOTUSED + The second octet of the IP header is defined by several Internet RFCs. + The linux routing facility can make a routing decision on several of + the interpretations of this RFC. This second octect in the header was + originally named the Type of Service field in Postel's RFC 971. This + choice allows you to select the treatment you wish to use in your + network. + + None indicates that the ToS field will not be used to make routing + decisions. + + TOS indicates that the ToS field will be treated as specified in + RFC 1349. The header of every IP packet carries a single TOS value with which the packet requests a certain treatment, e.g. low latency - (for interactive traffic), high throughput, or high reliability. If - you say Y here, you will be able to specify different routes for - packets with different TOS values. + (for interactive traffic), high throughput, high reliability, or minimum + cost. These settings are mutually exclusive. Note that the minimum cost + setting is clobbered if ECN is activated. + + DSCP indicates that the ToS field will be treated as specified in + RFC 2474. This selects treating the field as a DiffServ Differentiated + Service Code Point which allow the top 6 bits of the field to be used + to specify a per hop behaviour which is generally set by policy at each + routing point in the network. IP: use netfilter MARK value as routing key CONFIG_IP_ROUTE_FWMARK

# Minor fix do netfilter location in the kernel docs. diff -urN linux-2.4.5/Documentation/kernel-docs.txt linux-2.4.5-mjf/Documentation/kernel-docs.txt --- linux-2.4.5/Documentation/kernel-docs.txt Fri Apr 6 13:42:48 2001 +++ linux-2.4.5-mjf/Documentation/kernel-docs.txt Mon May 28 12:37:26 2001 @@ -333,7 +333,8 @@ * Title: "The Kernel Hacking HOWTO" Author: Various Talented People, and Rusty. - URL: http://www.samba.org/~netfilter/kernel-hacking-HOWTO.html + URL: http://netfilter.gnumonks.org/unreliable-guides/kernel-hacking/ + lk-hacking-guide.html Keywords: HOWTO, kernel contexts, deadlock, locking, modules, symbols, return conventions. Description: From the Introduction: "Please understand that I @@ -393,9 +394,8 @@ * Title: "Linux Kernel Locking HOWTO" Author: Various Talented People, and Rusty. - URL: - http://netfilter.kernelnotes.org/unreliable-guides/kernel-locking- - HOWTO.html + URL: http://netfilter.gnumonks.org/unreliable-guides/kernel-locking/ + lklockingguide.html Keywords: locks, locking, spinlock, semaphore, atomic, race condition, bottom halves, tasklets, softirqs. Description: The title says it all: document describing the

# I've placed the mask for valid DSCP values here. diff -urN linux-2.4.5/include/linux/ip.h linux-2.4.5-mjf/include/linux/ip.h --- linux-2.4.5/include/linux/ip.h Fri May 25 21:01:27 2001 +++ linux-2.4.5-mjf/include/linux/ip.h Mon Jun 4 18:31:30 2001 @@ -38,6 +38,8 @@ #define IPTOS_PREC_PRIORITY 0x20 #define IPTOS_PREC_ROUTINE 0x00 +/* RFC 2474 Diffserv code point */ +#define IPTOS_DSCP_MASK 0xFC /* IP options */ #define IPOPT_COPY 0x80

# Here, I use either the TOS mask from RFC 1349 or 2474, # as selected by the configuration. diff -urN linux-2.4.5/include/net/route.h linux-2.4.5-mjf/include/net/route.h --- linux-2.4.5/include/net/route.h Fri May 25 21:02:42 2001 +++ linux-2.4.5-mjf/include/net/route.h Mon Jun 4 18:42:36 2001 @@ -129,9 +129,17 @@ } #ifdef CONFIG_INET_ECN -#define IPTOS_RT_MASK (IPTOS_TOS_MASK & ~3) +#define IPTOS_ECN_TOS_MASK 0xFC #else -#define IPTOS_RT_MASK IPTOS_TOS_MASK +#define IPTOS_ECN_TOS_MASK 0xFF +#endif + +#ifdef CONFIG_IP_ROUTE_RFC1349_TOS +#define IPTOS_RT_MASK (IPTOS_TOS_MASK & IPTOS_ECN_TOS_MASK) +#elif defined CONFIG_IP_ROUTE_RFC2474_DSCP +#define IPTOS_RT_MASK (IPTOS_DSCP_MASK & IPTOS_ECN_TOS_MASK) +#else +#define IPTOS_RT_MASK (IPTOS_TOS_MASK & IPTOS_ECN_TOS_MASK) #endif

# set up the choice option diff -urN linux-2.4.5/net/ipv4/Config.in linux-2.4.5-mjf/net/ipv4/Config.in --- linux-2.4.5/net/ipv4/Config.in Tue May 1 23:59:24 2001 +++ linux-2.4.5-mjf/net/ipv4/Config.in Mon Jun 4 18:33:36 2001 @@ -14,7 +14,16 @@ bool ' IP: fast network address translation' CONFIG_IP_ROUTE_NAT fi bool ' IP: equal cost multipath' CONFIG_IP_ROUTE_MULTIPATH - bool ' IP: use TOS value as routing key' CONFIG_IP_ROUTE_TOS + choice ' IP: use of TOS value as routing key' \ + "None CONFIG_IP_ROUTE_TOSNOTUSED \ + TOS CONFIG_IP_ROUTE_RFC1349_TOS \ + DSCP CONFIG_IP_ROUTE_RFC2474_DSCP" None + if [ "$CONFIG_IP_ROUTE_RFC1349_TOS" = "y" ]; then + define_bool CONFIG_IP_ROUTE_TOS y + fi + if [ "$CONFIG_IP_ROUTE_RFC2474_DSCP" = "y" ]; then + define_bool CONFIG_IP_ROUTE_TOS y + fi bool ' IP: verbose route monitoring' CONFIG_IP_ROUTE_VERBOSE bool ' IP: large routing tables' CONFIG_IP_ROUTE_LARGE_TABLES fi

# A missed KERN_ERR in a printk. diff -urN linux-2.4.5/net/ipv4/fib_rules.c linux-2.4.5-mjf/net/ipv4/fib_rules.c --- linux-2.4.5/net/ipv4/fib_rules.c Tue May 1 23:59:24 2001 +++ linux-2.4.5-mjf/net/ipv4/fib_rules.c Thu May 31 14:01:19 2001 @@ -155,7 +155,7 @@ if (r->r_dead) kfree(r); else - printk("Freeing alive rule %p\n", r); + printk(KERN_ERR "Freeing alive rule %p\n", r); } } @@ -340,13 +340,13 @@ case RTN_UNREACHABLE: read_unlock(&fib_rules_lock); return -ENETUNREACH; + case RTN_PROHIBIT: + read_unlock(&fib_rules_lock); + return -EACCES; default: case RTN_BLACKHOLE: read_unlock(&fib_rules_lock); return -EINVAL; - case RTN_PROHIBIT: - read_unlock(&fib_rules_lock); - return -EACCES; } if ((tb = fib_get_table(r->r_table)) == NULL)

# a few comments for readability diff -urN linux-2.4.5/net/ipv4/route.c linux-2.4.5-mjf/net/ipv4/route.c --- linux-2.4.5/net/ipv4/route.c Wed May 16 13:31:27 2001 +++ linux-2.4.5-mjf/net/ipv4/route.c Thu May 31 16:32:56 2001 @@ -101,8 +101,8 @@ #define RT_GC_TIMEOUT (300*HZ) -int ip_rt_min_delay = 2 * HZ; -int ip_rt_max_delay = 10 * HZ; +int ip_rt_min_delay = 2 * HZ; /* try not to flush faster than */ +int ip_rt_max_delay = 10 * HZ; /* try to flush this fast */ int ip_rt_max_size; int ip_rt_gc_timeout = RT_GC_TIMEOUT; int ip_rt_gc_interval = 60 * HZ; @@ -398,6 +398,12 @@ static spinlock_t rt_flush_lock = SPIN_LOCK_UNLOCKED; +/* + * Call with delay == -1 to schedule a route cache flush ASAP + * which is in ip_rt_min_delay jiffies + * Call with delay == 0 to cause a route cache flush NOW! + * Call with another value to flush the cash in `delay' jiffies + */ void rt_cache_flush(int delay) { unsigned long now = jiffies;

--IS0zKkzwUGydFO0o Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=ip-test

#!/bin/bash

for dev in eth2 eth1 do ip route flush dev $dev ip link set $dev down ip addr flush dev $dev done

ip addr add 192.168.0.1/24 brd + dev eth1 ip link set eth1 up

ip addr add 192.168.0.2/24 brd + dev eth2 ip link set eth2 up

ip route flush dev eth2 ip route flush dev eth1

ip route add 192.168.0.0/24 dsfield 0xe0 proto static table main dev eth2 ip route add 192.168.0.0/24 dsfield 0xc0 proto static table main dev eth2 ip route add 192.168.0.0/24 dsfield 0x08 proto static table main dev eth2 ip route add 192.168.0.0/24 dsfield 0x88 proto static table main dev eth2 ip route add 192.168.0.0/24 dsfield 0x89 proto static table main dev eth2 ip route add 192.168.0.0/24 dsfield 0x8c proto static table main dev eth2

ip route add 192.168.0.0/24 dsfield 0x00 proto static table main dev eth1

ip route list table main

rtn=0 while read tos src comment do ping -Q $tos -c 1 192.168.0.3 > ip-test.$$ if grep -q "from 192.168.$src :" ip-test.$$ ; then msg=ok else msg=FAILED! rtn=$(( $rtn + 1 )) fi echo -e \\ntos $tos should come from $src: $msg grep from ip-test.$$ done <<EOF 0x00 0.1 0xa0 0.1 0xa8 0.1 0xc0 0.2 0xe0 0.2 0x88 0.2 0x89 0.2 # ECN blows away lsb when enabled 0x8c 0.2 EOF

/bin/rm ip-test.$$

if [ $rtn -eq 0 ] ; then echo -e \\nAll tests passed else echo -e \\nFailed $rtn tests fi

exit $rtn

--IS0zKkzwUGydFO0o Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="unpatched.results"

192.168.0.0/24 tos 0xe0 dev eth2 proto static scope link 192.168.0.0/24 tos 0xc0 dev eth2 proto static scope link 192.168.0.0/24 tos 0x8c dev eth2 proto static scope link 192.168.0.0/24 tos 0x89 dev eth2 proto static scope link 192.168.0.0/24 tos 0x88 dev eth2 proto static scope link 192.168.0.0/24 tos 0x08 dev eth2 proto static scope link 192.168.0.0/24 dev eth1 proto static scope link 10.11.0.0/16 dev eth0 proto kernel scope link src 10.11.11.103 127.0.0.0/8 dev lo scope link default via 10.11.1.1 dev eth0

tos 0x00 should come from 0.1: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=250 usec

tos 0xa0 should come from 0.1: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=142 usec

tos 0xa8 should come from 0.1: FAILED! PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=233 usec

tos 0xc0 should come from 0.2: FAILED! PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=156 usec

tos 0xe0 should come from 0.2: FAILED! PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=149 usec

tos 0x88 should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=149 usec

tos 0x89 should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=143 usec

tos 0x8c should come from 0.2: FAILED! PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=151 usec

Failed 4 tests

--IS0zKkzwUGydFO0o Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="patched.results"

192.168.0.0/24 tos 0xe0 dev eth2 proto static scope link 192.168.0.0/24 tos 0xc0 dev eth2 proto static scope link 192.168.0.0/24 tos 0x8c dev eth2 proto static scope link 192.168.0.0/24 tos 0x89 dev eth2 proto static scope link 192.168.0.0/24 tos 0x88 dev eth2 proto static scope link 192.168.0.0/24 tos 0x08 dev eth2 proto static scope link 192.168.0.0/24 dev eth1 proto static scope link 10.11.0.0/16 dev eth0 proto kernel scope link src 10.11.11.103 127.0.0.0/8 dev lo scope link default via 10.11.1.1 dev eth0

tos 0x00 should come from 0.1: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=234 usec

tos 0xa0 should come from 0.1: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=120 usec

tos 0xa8 should come from 0.1: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.1 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=204 usec

tos 0xc0 should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=137 usec

tos 0xe0 should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=122 usec

tos 0x88 should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=125 usec

tos 0x89 should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=125 usec

tos 0x8c should come from 0.2: ok PING 192.168.0.3 (192.168.0.3) from 192.168.0.2 : 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=255 time=126 usec

All tests passed

--IS0zKkzwUGydFO0o-- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/