TL;DR: The delays you measure (32.24 ms between Start and
End, and 15.6 ms between End and Start) are exactly what
is expected from the amount of data you send through the serial port.
Yes, exactly: there is zero overhead from the CPU doing other things.
Let me show how the expected timings can be computed. First, a note
about the exact baud rate. A rate of 9600 bps implies that each bit
takes about 104.17 µs to send. However, because the Arduino is clocked
at 16 MHz, it cannot achieve this exact baud rate. Instead, when
you request 9600 bps, you get the closest value it can achieve,
which is about 9615 bps. Each bit then takes exactly
104 µs. Or at least “exactly” as per the Arduino clock: it really
takes 1664 CPU cycles.
Then, as already noted in previous answers, one character is worth
10 raw bits (one start bit, eight data bits and one stop bit with the
default 8N1 framing). It thus takes 1040 µs to transmit.
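The numbers above can be reproduced with a little integer arithmetic. The following is only a sketch: it assumes the UART bit period must be a whole multiple of 8 CPU cycles (as in the AVR double-speed mode) and simply rounds the ideal period to the nearest such multiple. All names (`kCpuHz`, `kCycleStep`, `cyclesPerBit`, and so on) are made up for this illustration and are not part of the Arduino core.

```cpp
#include <cassert>

// Illustrative constants; none of these names come from the Arduino core.
constexpr unsigned long kCpuHz = 16000000UL;  // 16 MHz CPU clock
constexpr unsigned long kCycleStep = 8;       // ASSUMPTION: bit period is a multiple of 8 cycles
constexpr unsigned long kBitsPerChar = 10;    // start bit + 8 data bits + stop bit

// Bit period in CPU cycles: the ideal period (kCpuHz / baud) rounded to the
// nearest multiple of kCycleStep.
constexpr unsigned long cyclesPerBit(unsigned long requestedBaud) {
    return (kCpuHz + requestedBaud * kCycleStep / 2)
           / (requestedBaud * kCycleStep) * kCycleStep;
}

// Baud rate actually obtained with that bit period.
constexpr double actualBaud(unsigned long requestedBaud) {
    return static_cast<double>(kCpuHz) / cyclesPerBit(requestedBaud);
}

// Time needed to send one character, in microseconds.
constexpr unsigned long charTimeUs(unsigned long requestedBaud) {
    return kBitsPerChar * cyclesPerBit(requestedBaud) / (kCpuHz / 1000000UL);
}

// Compile-time checks against the figures quoted in the text.
static_assert(cyclesPerBit(9600) == 1664, "one bit should last 1664 CPU cycles");
static_assert(charTimeUs(9600) == 1040, "one character should last 1040 us");
```

Under these assumptions, `actualBaud(9600)` evaluates to roughly 9615.4, which is the "about 9615 bps" quoted above.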
Now, let's look at your code. It should be noted that
    Serial.print(micros());
is compiled into something equivalent to
    unsigned long timestamp = micros();
    Serial.print(timestamp);
i.e. the timestamp is taken and then it is transmitted. Your loop()
is then equivalent to:
void loop()
{
    Serial.println();                       // 2 chars: CR and LF
    Serial.print("Start: ");                // 7 chars
    unsigned long time_start = micros();
    Serial.print(time_start);               // 6 chars
    Serial.print("/////////////////////");  // 21 chars
    Serial.print("End:");                   // 4 chars
    unsigned long time_end = micros();
    Serial.print(time_end);                 // 6 chars
}
By folding the last Serial.print() of one iteration into the beginning
of the next one, one gets the pseudo-code:
forever {
    print 15 chars;   // 15*1040 = 15600 µs
    read time_start;
    print 31 chars;   // 31*1040 = 32240 µs
    read time_end;
}
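As a cross-check of the character counts, here is a small sketch of the same arithmetic in plain C++. It assumes that each timestamp prints as 6 digits (as in the comments of the code above) and that each character takes 1040 µs; the helper names (`transmitTimeUs`, `endToStartUs`, `startToEndUs`) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr unsigned long kCharTimeUs = 1040;   // 10 bits of 104 us each
constexpr std::size_t kTimestampDigits = 6;   // ASSUMPTION: micros() prints as 6 digits

// Time needed to push `chars` characters through the UART, in microseconds.
unsigned long transmitTimeUs(std::size_t chars) {
    return static_cast<unsigned long>(chars) * kCharTimeUs;
}

// Characters sent between reading time_end and reading time_start:
// the previous time_end, the CR/LF from println(), then "Start: ".
unsigned long endToStartUs() {
    const std::size_t chars = kTimestampDigits
                            + std::string("\r\n").size()
                            + std::string("Start: ").size();
    return transmitTimeUs(chars);
}

// Characters sent between reading time_start and reading time_end:
// time_start itself, the 21 slashes, then "End:".
unsigned long startToEndUs() {
    const std::size_t chars = kTimestampDigits
                            + std::string(21, '/').size()
                            + std::string("End:").size();
    return transmitTimeUs(chars);
}
```

With these assumptions, `endToStartUs()` gives 15600 and `startToEndUs()` gives 32240, i.e. 15.6 ms and 32.24 ms.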
The computed durations match your printout exactly. This shows that,
once the Serial output buffer is full, the timing of your code
execution is completely governed by how fast the UART sends the bits
down the line. The time taken by the Arduino to do actual CPU work (calling
micros(), formatting numbers...) is irrelevant because the CPU works
in parallel with the UART and, being faster, it ends up waiting for
the UART anyway.