Revisions to nftables, masquerade: packets go through wrong outbounding interface

further restrict the 2nd routing rule, to not alter behavior of routed UDP traffic using source port 51820

edited Aug 6, 2023 at 15:34

A.B


The combination of:

not having enabled Strict Reverse Path Forwarding (RFC 3704)

having the peer contact the server first (see the additional issue at the end)

having (at least) the kernel implementation figure out it should reply with the same source it was initially contacted to

allows WireGuard to somewhat work, so a ping 10.8.0.1 from peer gets a reply and allows any following internal WireGuard traffic to work.

The behavior of the routing stack with the ping test (including the actual dnat that happened in prerouting and the un-masquerade that will happen for the reply) can be summarized with these two commands that query the kernel about what route will be used:

The combination of:

not having enabled Strict Reverse Path Forwarding (RFC 3704)

having the peer contact the server first (see the additional issue at the end)

having (at least) the kernel implementation figure out it should reply with the same source it was initially contacted to

allows WireGuard to somewhat work, so a ping 10.8.0.1 from peer gets a reply and allows any following WireGuard traffic to continue using the same envelope addresses.

When no source address is stated for a local (non-routed) flow, the routing stack has to figure out which one should be used for the given route. This is especially important for UDP, where a socket is often kept unbound (i.e. with source 0.0.0.0, aka INADDR_ANY). This is not an issue for a TCP server, as the duplicated established socket created after accept(2) is no longer bound to 0.0.0.0 but to the correct address, which it then presents to the routing stack. Here, WireGuard uses UDP with INADDR_ANY; in particular, it doesn't bind to 198.51.100.3. That means it presents 0.0.0.0 as source and leaves the resolution of the outgoing source IP address to the kernel's routing stack.

ip rule add iif lo ipproto udp sport 51820 lookup subnets

(iif lo is special syntax meaning locally initiated (non-forwarded) traffic; it's not really about the lo interface.)
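Whether a locally generated lookup ends up in the subnets table can be checked by looking for the table name in the ip route get output. A minimal sketch, assuming a grep that supports -w (GNU or busybox); uses_table is a hypothetical helper name, not an iproute2 command, and it relies on ip route get naming the table only when it is not the main one, as in the outputs shown in this answer:

```shell
# uses_table (hypothetical helper): succeed if "ip route get", called with
# the remaining arguments, reports the routing table named in $1.
uses_table() {
    table=$1
    shift
    ip route get "$@" | grep -qw "table $table"
}

# Usage: a lookup without "iif" is a locally generated one, so it is
# expected to match the "iif lo" rule:
#   uses_table subnets from 0.0.0.0 ipproto udp sport 51820 to 192.0.2.233 \
#       && echo "WireGuard envelope is routed through table subnets"
```

The helper only inspects the text printed by ip route get, so it does not change any routing state.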

reorder the 2 parts of the answer (the envelope issue being probably invisible for OP)

edited Aug 6, 2023 at 15:17

A.B


This question is about solving a problem with forwarding done with a dnat rule rather than just local traffic. There's also a hidden problem with the WireGuard tunnel envelope, also fixed at the end.

not having enabled Strict Reverse Path Forwarding (RFC 3704)
having the peer contact the server first (see the additional issue at the end)
having (at least) the kernel implementation figure out it should reply with the same source it was initially contacted to

The behavior of the routing stack with the ping test (including the actual dnat that happened in prerouting and the un-masquerade that will happen for the reply) can be summarized with these two commands that query the kernel about what route will be used:

This also fixes the case with rp_filter=1, where the first route test above would just fail with RTNETLINK answers: Invalid cross-device link, even though one should normally add the wg0 route to this table too.

# ip route get from 10.8.0.105 iif wg0 to 192.0.2.2
192.0.2.2 from 10.8.0.105 via 198.51.100.1 dev ens19 table subnets 
    cache iif wg0

The ping test will now work correctly.

There's an additional, somewhat hidden, WireGuard envelope routing problem too.

When no source address is stated for a local (non-routed) flow, the routing stack has to figure out which one should be used for the given route. This is especially important for UDP, where a socket is often kept unbound (i.e. with source 0.0.0.0, aka INADDR_ANY). This is not an issue for a TCP server, as the duplicated established socket created after accept(2) is no longer bound to 0.0.0.0 but to the correct address, which it then presents to the routing stack. Here, WireGuard uses UDP with INADDR_ANY; in particular, it doesn't bind to 198.51.100.3. That means it presents 0.0.0.0 as source and leaves the resolution of the outgoing source IP address to the kernel's routing stack.

If the server's WireGuard had initiated the very first packet (rather than the peer doing so), it would have used 203.0.113.134 instead of 198.51.100.3. The routing stack has no specific ip rule for 0.0.0.0: the ip rule 32765: from 198.51.100.0/24 lookup subnets doesn't match, so no special policy routing is applied. In the end, the UDP packet leaves with source 203.0.113.134 through ens18.

It appears that the kernel implementation, at least, then continues to use the same address it was queried on. That's not to be relied upon: because of this, multi-homing with UDP services requires special support from applications (e.g. using IP_PKTINFO).

Sought outcome for WireGuard:

# ip route get from 198.51.100.105 to 192.0.2.233
192.0.2.233 from 198.51.100.105 via 198.51.100.1 dev ens19 table subnets uid 0 
    cache

Actual outcome, at least if the server is the first to initiate traffic:

# ip route get from 0.0.0.0 to 192.0.2.233
192.0.2.233 via 203.0.113.1 dev ens18 src 203.0.113.134 uid 0 
    cache

answered Aug 6, 2023 at 14:59

A.B


Remarks and adjustments (probably caused by obfuscation):

I'll assume the WireGuard peer uses 10.8.0.105 instead of 10.8.0.107, to match the nftables ruleset.
233.252.0.0 would cause problems in a simulation (especially on peer) because it's a multicast address. I'll use 192.0.2.233 below (with no network relation to 192.0.2.2).

This question is actually to solve a problem about UDP (WireGuard's tunnel envelope).

When not stating a source address, the routing stack has to figure out which one should be used for the given route.

Here, WireGuard uses INADDR_ANY. In particular it doesn't bind to 198.51.100.3. That means it presents as source 0.0.0.0 and leaves to the kernel's routing stack the resolution of the outgoing source IP address.

If the server's WireGuard had initiated the very first packet (rather than the peer doing so), it would have used 203.0.113.134 instead of 198.51.100.3. The routing stack has no specific ip rule for 0.0.0.0: the ip rule 32765: from 198.51.100.0/24 lookup subnets doesn't match, so no special policy routing is applied. In the end, the UDP packet leaves with source 203.0.113.134 through ens18.

It appears that the kernel implementation, at least, then continues to use the same address it was queried on. That's not to be relied upon: because of this, multi-homing with UDP services requires special support from applications. This is not an issue for TCP, as the duplicated established socket created after accept(2) is no longer bound to 0.0.0.0 but to the correct address, which it then presents to the routing stack.

Sought outcome for WireGuard:

# ip route get from 198.51.100.105 to 192.0.2.233
192.0.2.233 from 198.51.100.105 via 198.51.100.1 dev ens19 table subnets uid 0 
    cache

Actual outcome, at least if the server is the first to initiate traffic:

# ip route get from 0.0.0.0 to 192.0.2.233
192.0.2.233 via 203.0.113.1 dev ens18 src 203.0.113.134 uid 0 
    cache

The combination of:

not having enabled Strict Reverse Path Forwarding (RFC 3704)
having the peer contact the server first
having (at least) the kernel implementation figure out it should reply with the same source it was initially contacted to

allows WireGuard to somewhat work, so a ping 10.8.0.1 from peer gets a reply and allows any following internal WireGuard traffic to work.
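The first condition in the list above can be checked with sysctl. A minimal sketch, assuming the documented Linux behaviour that the value applied to an interface is the maximum of the conf/all and per-interface settings; effective_rp_filter is a hypothetical helper name:

```shell
# effective_rp_filter (hypothetical helper): print the rp_filter value the
# kernel applies to interface $1, i.e. the maximum of conf/all and conf/$1.
# 0 = no source validation, 1 = strict mode (RFC 3704), 2 = loose mode.
effective_rp_filter() {
    all=$(sysctl -n net.ipv4.conf.all.rp_filter)
    dev=$(sysctl -n "net.ipv4.conf.$1.rp_filter")
    if [ "$all" -gt "$dev" ]; then
        echo "$all"
    else
        echo "$dev"
    fi
}

# Usage: effective_rp_filter wg0
```

A result of 1 for wg0 would mean Strict Reverse Path Forwarding is in effect for packets arriving from the tunnel.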

Ping tests (including the actual dnat that happened in prerouting and the un-masquerade that will happen for the reply):

# ip route get from 192.0.2.2 iif ens19 to 10.8.0.105
10.8.0.105 from 192.0.2.2 dev wg0 
    cache iif ens19 
# ip route get from 10.8.0.105 iif wg0 to 192.0.2.2
192.0.2.2 from 10.8.0.105 via 203.0.113.1 dev ens18 
    cache iif wg0

Here one can see the reply uses the wrong interface. The address will be rewritten by the conntrack entry's content: still 198.51.100.105 even if it doesn't appear above.

This one is caused by a missing rule: anything that comes (back) from wg0 should use the table subnets. Fixed with:

ip rule add iif wg0 lookup subnets

(this also fixes the case with rp_filter=1, where the first route test above would just fail with RTNETLINK answers: Invalid cross-device link)

giving now:

# ip route get from 10.8.0.105 iif wg0 to 192.0.2.2
192.0.2.2 from 10.8.0.105 via 198.51.100.1 dev ens19 table subnets 
    cache iif wg0
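To compare such lookups at a glance, the outgoing interface can be pulled out of the ip route get output. A minimal sketch, assuming a POSIX shell with awk; route_dev is a hypothetical helper name, not part of iproute2:

```shell
# route_dev (hypothetical helper): run "ip route get" with the given
# arguments and print only the interface name that follows "dev".
route_dev() {
    ip route get "$@" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Usage, with the addresses from the tests above:
#   route_dev from 192.0.2.2 iif ens19 to 10.8.0.105
#   route_dev from 10.8.0.105 iif wg0 to 192.0.2.2
```

Going by the outputs shown above, the second lookup yields ens18 before the iif wg0 rule is added and ens19 afterwards.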

To really fix the WireGuard tunnel multi-homed routing itself, one can use a per-L4-protocol routing rule:

ip rule add ipproto udp sport 51820 lookup subnets

Giving:

# ip route get from 0.0.0.0 ipproto udp sport 51820 to 192.0.2.233
192.0.2.233 via 198.51.100.1 dev ens19 table subnets src 198.51.100.3 uid 0 
    cache

Despite presenting INADDR_ANY as source, having the UDP source port 51820 now selects the subnets routing table.
