
I have interfaces enp101s0f0u2u{1..3}, on each of which there is a device responding to 192.168.8.1.
I want a local process to be able to reach all of them simultaneously.
This is one process, so network namespaces are not an option.
I am looking for a solution that doesn't use socat or another proxy that can bind an outgoing interface.
I thought of locally making virtual IPs 192.168.8.1{1..3} to point to them.

What I have so far:

  • Interface enp101s0f0u2ux has ipv4 192.168.8.2x/32.
  • ip rule 100x: from all to 192.168.8.1x lookup 20x
  • ip route default dev enp101s0f0u2ux table 20x scope link src 192.168.8.2x

(this means the interface and src are correct when chosen automatically)
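For concreteness, here is a minimal sketch that expands the x placeholder in the three items above for x = 1, 2, 3. It only prints the commands (a dry run); the exact `ip` invocations are my reading of the shorthand above, so treat them as an assumption rather than the literal commands that were run.

```shell
#!/bin/sh
# Dry run: print the per-interface commands for one value of x.
# Pipe the output to `sh` as root to actually apply them.
emit_cmds() {
    x=$1
    dev="enp101s0f0u2u${x}"
    # Host address on the interface (192.168.8.2x/32).
    echo "ip addr add 192.168.8.2${x}/32 dev ${dev}"
    # Rule 100x: traffic to the virtual IP 192.168.8.1x uses table 20x.
    echo "ip rule add priority 100${x} to 192.168.8.1${x} lookup 20${x}"
    # Table 20x: everything leaves via this interface with src 192.168.8.2x.
    echo "ip route add default dev ${dev} table 20${x} scope link src 192.168.8.2${x}"
}

for x in 1 2 3; do
    emit_cmds "$x"
done
```

Printing instead of executing makes it easy to review the nine resulting commands before touching the routing tables.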

chain output {
    type nat hook output priority dstnat; policy accept;
    ip daddr 192.168.8.1x meta mark set 20x counter dnat to 192.168.8.1
}

(this means the destination ip is changed to .1, unfortunately I only found a way to do this before routing decision is made, so we need the next thing)

  • ip rule 110x: from all fwmark 20x lookup 20x

(this means that despite dst being 192.168.8.1, it goes to the …ux interface)

Now the hard part:

chain input {
    type nat hook input priority filter; policy accept;
    ip saddr 192.168.8.1 ip daddr 192.168.8.2x counter snat to 192.168.8.1x
}

(this should restore the src of the return packet to .1x, so the socket and application are not astonished)
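To make the x shorthand explicit, the following sketch prints the output and input chains with the rules expanded for all three interfaces. The table name `multidev` is hypothetical (the table name is not given above), and this only spells out the rules as described; it is not a fix for the behaviour described next.

```shell
#!/bin/sh
# Dry run: print an nft script with the x placeholder expanded for x = 1, 2, 3.
# The table name "multidev" is a hypothetical placeholder.
emit_nft() {
    echo "table ip multidev {"
    echo "    chain output {"
    echo "        type nat hook output priority dstnat; policy accept;"
    for x in 1 2 3; do
        # Mark with 20x, then rewrite the virtual destination .1x to .1.
        echo "        ip daddr 192.168.8.1${x} meta mark set 20${x} counter dnat to 192.168.8.1"
    done
    echo "    }"
    echo "    chain input {"
    echo "        type nat hook input priority filter; policy accept;"
    for x in 1 2 3; do
        # Intended to rewrite the reply source .1 back to the virtual .1x.
        echo "        ip saddr 192.168.8.1 ip daddr 192.168.8.2${x} counter snat to 192.168.8.1${x}"
    done
    echo "    }"
    echo "}"
}

emit_nft
```

The output can be saved to a file and loaded with `nft -f`.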

Unfortunately, at this point if I try to curl, tcpdump sees a 192.168.8.21.11111 > 192.168.8.1.80 (SYN) and multiple 192.168.8.1.80 > 192.168.8.21.11111 (SYN-ACK) attempts, but the input chain counter is not hit.

However, if I add the seemingly useless

chain postrouting {
  type nat hook postrouting priority srcnat; policy accept;
  ip daddr 192.168.8.1 counter masquerade
}

I get 1 packet hitting the input snat rule, and the application gets some data back! However, all the subsequent packets from 192.168.8.1 in the flow are dropped. Here is a tcpdump and a conntrack

I'm at the end of my rope, been at it for days. There's no firewall/filter happening (which conntrack would be opening for me), I have empty nftables besides the chains I showed here.

I cannot understand why the masquerade makes a difference, and in general what goes on in conntrack. (The entry gets created and destroyed twice, and then an entry starting from outside gets created?) Of note is that the entries are not symmetrical, they mention both 192.168.8.1 and 192.168.8.12 in each entry for opposite directions.

I especially don't understand how or why, in the absence of masquerade, the returning 192.168.8.1.80 > 192.168.8.21.11111 (SYN-ACK) packets get dropped instead of going to the input chain. Would this happen if the application TCP socket did CONNECT and so only wants replies from .11? But shouldn't input be able to intercept before the socket? And I can't snat in prerouting anyway, so where would this have to be done?

Update:

Adding

    type filter hook output priority raw; policy accept;
    ip daddr 192.168.8.11 counter notrack

makes it stop hitting this counter too:

    type nat hook output priority dstnat; policy accept;
    ip daddr 192.168.8.11 meta mark set 201 counter dnat to 192.168.8.1

Does notrack prevent packets from entering nat chains at all, rather than making every packet (not just the first of a flow) traverse them? And does it therefore prevent any -nat action altogether?

  • in my head, instead of the handstand here, you'd simply bind() the socket to the address of the interface you want to use, setting the port to 0, before you connect() to 192.168.8.1 with the target port (or use sendmsg, or whatever you have planned with that socket) Commented Jun 11 at 16:44
  • @MarcusMüller (1) I genuinely cannot control the actual application making the requests, and (2) I am unwilling to use a socks or http proxy or forward the specific ports with socat. (I tried, and it works, but I want to learn here.) Commented Jun 11 at 19:37
  • but you can control the programs, it seems, to a large degree: it seems you have all the privileges to run them in a modified environment. Which also means you can catch connect syscalls when they happen, and run a bind on the affected socket first. Commented Jun 12 at 9:50
  • @MarcusMüller that sounds intriguing. I have no idea how to approach that either, but what I am trying to do here is further my understanding of kernel networking. Commented Jun 12 at 10:56
  • as a first step, I'd lazily just attach gdb to your running process, and catch syscall connect. If that happens rarely enough, you can actually make a comm program in GDB that reacts to that syscall happening, fires off other calls, and modifies the arguments of that syscall. When that works in general, I'd replicate the same in bpftrace, and see whether I can add useful data to the socket, to make it easy for nftables to route things. Commented Jun 12 at 11:07

1 Answer

Well, there's round-robin DNS. The client programs that make connections use a hostname instead of an IP address, so they perform DNS lookups to learn the IP address.

A DNS server is set up to resolve this hostname for the service/device that's listening for connections on the several IP addresses. The DNS server responds to requests for the address (A) record of that hostname by giving out an answer that has the list of target IP addresses instead of just one IP address.

Each time the DNS server returns the answer, it rotates the list of addresses in the answer, so a different address is first in the list. The TTL (expiration) time in the A record is short, so client programs get a new list almost every time they make a connection.
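As a rough illustration of such a multi-address record set, the snippet below prints a zone-file fragment with three A records and a short TTL. The hostname `device.example.`, the 5-second TTL, and the 192.0.2.x addresses are placeholders invented for this sketch, not values from the question.

```shell
#!/bin/sh
# Print a zone-file fragment: one hostname, three A records, short TTL.
# Hostname, TTL, and addresses are hypothetical placeholders.
emit_zone() {
    for addr in 192.0.2.11 192.0.2.12 192.0.2.13; do
        # Fields: name, TTL in seconds, class, record type, address.
        echo "device.example. 5 IN A ${addr}"
    done
}

emit_zone
```

A server configured for round-robin would return these three records in a rotating order on successive queries.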

Most client DNS libraries will pick the first address in the list and try to connect to it. With different clients receiving a different address at the top of the list, client connections spread out fairly evenly among the target IP addresses. Since the clients connect directly to the target addresses, there's no proxying involved.

A static list of target addresses in the DNS server is not perfect. The DNS server often doesn't know when one of the addresses becomes unavailable, so it continues to give the address to clients. If the clients only try the address at the top of the list without falling back to the others, then some connections will fail.

BIND9 is one DNS server that supports this kind of multi-address answer to a DNS request. The feature has been around for 20+ years, so other DNS server packages likely support this as well. There are dynamic versions of this, where the DNS server can probe the target addresses to see which ones are still accepting connections. Or there's "service discovery", where the software at each target address registers with the DNS server, and any that doesn't keep refreshing that registration is removed from the DNS server's answer.

I think your solution won't be found at layers 4 and below of the network. It will likely be at the highest layer or outside networking like the above DNS technique.

  • I'm sorry if it wasn't clear, and I would be grateful to hear suggestions for how I can better word my question in that regard, but I am not looking for the three devices to be accessed randomly and interchangeably. I need them individually addressable, and specifically to sockets of local processes which don't use the bind syscall. (Although a solution that handles even those that bind the wrong interface is also welcome for educational purposes!) Commented Jun 11 at 19:46
