2

I am facing the following issue when running traceroute between two nodes in the same subnet. This is done to test whether the network connection between these two nodes is reliable. We were told to use this approach by a well-known DB vendor's support team.

When running the command traceroute -s 10.1.3.205 -r 10.1.3.210, replies to some probes are randomly missing and no RTT is reported for them:

traceroute to 10.1.3.210 (10.1.3.210), 30 hops max, 60 byte packets
 1  10.1.3.210 (10.1.3.210)  0.152 ms  0.064 ms *

In contrast, running traceroute in ICMP mode, traceroute -I -s 10.1.3.205 -r 10.1.3.210, is reliable and no probes go unanswered.

The same issue occurs on several Linux systems in our environment, with different patch levels and different versions of traceroute, regardless of whether the system is a VM or physical.

To simplify things and make the tcpdump output easier to read, I used the following command:

for i in {1..10}; do traceroute -s 10.1.3.205 -r 10.1.3.210 -n 1 -m 1 -q 1; done

The output is the following:

for i in {1..10}; do traceroute -s 10.1.3.205 -r 10.1.3.210 -n 1 -m 1 -q 1; done
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.203 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.067 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.067 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.071 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.067 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.075 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  *
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.142 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.067 ms
traceroute to 10.1.3.210 (10.1.3.210), 1 hops max, 28 byte packets
 1  10.1.3.210  0.054 ms

Every 7th packet gets no response, and this is reproducible. The support team now concludes that this packet loss points to an issue in our network setup.

When I run the same loop with a delay of 1 second between runs, all 10 packets are sent and answered:

for i in {1..10}; do traceroute -s 10.1.3.205 -r 10.1.3.210 -n 1 -m 1 -q 1; sleep 1; done
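To compare the two loops by number rather than by eye, the unanswered probes can be counted from the traceroute output. The following is a minimal sketch; count_lost_probes is a hypothetical helper name, and it assumes that an unanswered probe is printed as a standalone "*" token, as in the output shown above.

```shell
#!/usr/bin/env bash
# count_lost_probes (hypothetical helper, not part of traceroute):
# reads traceroute output on stdin and prints how many probes got no reply.
count_lost_probes() {
    # Skip the "traceroute to ..." header lines, then count every field
    # that is exactly "*" (traceroute's marker for an unanswered probe).
    awk '!/^traceroute to/ { for (i = 1; i <= NF; i++) if ($i == "*") lost++ }
         END { print lost + 0 }'
}

# Usage with the same probes as in the loops above:
# for i in {1..10}; do traceroute -s 10.1.3.205 -r 10.1.3.210 -n 1 -m 1 -q 1; done | count_lost_probes
```

Running it once without and once with the sleep 1 gives two numbers that can be attached to the support request.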

I doubt that this is a valid way to test network reliability, since traceroute is usually used to examine the network path over routed connections.

I ran a network connection test over several hours with niping from SAP, and no lost connections were discovered.

So is there anything obvious I have missed, and is using traceroute this way a feasible way to test network reliability?

2 Answers

6

As Steffen said, traceroute is the wrong tool for the job, because its error detection and reporting are simply not sufficient. So, say "hi" to the DB vendor team from us, and tell them other admins said that while traceroute is a good tool for detecting whether routes exist from source to target, it's not helpful for diagnosing UDP reliability issues in a single subnet.

That is doubly true because the network load caused by traceroute is minimal. If you want to find out whether your network drops UDP packets under high database load, traceroute does not even try to load the network.

Diagnosing dropped packets is probably something you'll want to do with the people running your network, but for a first load test, you could use iperf, like so on the client machine:

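# target UDP rate in bits per second; iperf accepts suffixes such as k and m
RATE_BITS_PER_SEC=10m
SERVER=10.1.3.210
CLIENT=10.1.3.205
# -u: UDP, -b: target rate, -B: bind to the client address, -i 1: report every second
iperf -c "${SERVER}" -B "${CLIENT}" -i 1 -u -b "${RATE_BITS_PER_SEC}"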

and so on the server machine:

iperf -s -i 1 -w 4M -u

If 10 Mb/s works, you can then increase the RATE_BITS_PER_SEC variable until your server starts printing numbers that indicate lost packets. That has to happen at some point, namely when you exceed wire speed; the question is how close you get to that.
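Stepping the rate up by hand gets tedious, so the increase can be scripted. This is only a sketch: rate_steps is a hypothetical helper that doubles the rate each round, and the 1000 Mbit/s ceiling and the 10 second run time (-t 10) in the usage comment are assumptions to adjust to your link.

```shell
#!/usr/bin/env bash
# rate_steps (hypothetical helper): print target rates in Mbit/s,
# doubling from START for as long as the rate does not exceed MAX.
rate_steps() {
    local rate=$1 max=$2
    while [ "$rate" -le "$max" ]; do
        echo "$rate"
        rate=$((rate * 2))
    done
}

# Usage on the client, with the iperf server already running on the other host.
# The ceiling of 1000 assumes a 1 Gbit/s link.
# SERVER=10.1.3.210; CLIENT=10.1.3.205
# for rate in $(rate_steps 10 1000); do
#     iperf -c "$SERVER" -B "$CLIENT" -i 1 -u -b "${rate}m" -t 10
# done
```

Watch the server-side reports while this runs; the first rate at which the loss column becomes non-zero is the interesting one.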

1
  • I will keep the iperf suggestion in mind as another option to stress the network connection. I will reference this post in the open support request; hopefully it will be accepted that the current test procedure is wrong. Commented Sep 10, 2024 at 7:26
5

UDP-based traceroute relies on ICMP replies: routers along the path answer with ICMP TTL exceeded when a probe's TTL expires, and the destination itself answers with ICMP port unreachable. Sending these ICMP messages is often rate limited, and sometimes they are simply not sent, or they are blocked or rate limited by intermediate systems.

Therefore, traceroute is not a useful tool for testing the reliability of a network with regard to packet loss. The tool was never intended for this. With appropriate knowledge, traceroute can help find strange things in the network that might also affect reliability, but it is not a general-purpose reliability testing tool.
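If the target is a Linux host, it is worth checking the host's own ICMP rate limiting before looking at network devices. The kernel documents two sysctls for this: net.ipv4.icmp_ratelimit (minimum gap in milliseconds between rate-limited replies, 0 disables the limit) and net.ipv4.icmp_ratemask (a bitmask of the ICMP types the limit applies to). The sketch below decodes the mask; icmp_type_is_limited is a hypothetical helper name, and whether this limit actually explains a given loss pattern is an assumption to verify on the systems involved.

```shell
#!/usr/bin/env bash
# icmp_type_is_limited (hypothetical helper): succeeds if bit TYPE is set
# in MASK, i.e. if that ICMP type is subject to net.ipv4.icmp_ratelimit.
icmp_type_is_limited() {
    local mask=$1 type=$2
    [ $(( (mask >> type) & 1 )) -eq 1 ]
}

# Usage on the target host (Linux only):
# sysctl -n net.ipv4.icmp_ratelimit       # gap between limited replies in ms; 0 = off
# mask=$(sysctl -n net.ipv4.icmp_ratemask)
# icmp_type_is_limited "$mask" 3  && echo "destination unreachable (type 3) is rate limited"
# icmp_type_is_limited "$mask" 11 && echo "time exceeded (type 11) is rate limited"
# icmp_type_is_limited "$mask" 0  || echo "echo reply (type 0) is not rate limited"
```

With a mask of 6168, the value given as the default in the kernel documentation, types 3 and 11 are limited while echo reply is not. That would fit a setup where UDP probes intermittently go unanswered and ICMP echo probes do not.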

1
  • I agree; this proves my assumption that traceroute is the wrong approach for testing. The ICMP rate limit is a good starting point, though: we will have a look at the configuration of the switches involved to see if any ICMP rate limit applies. Commented Sep 10, 2024 at 11:29