
We have a very demanding application that is split into multiple components, all running on the same physical machine. The machine has 56 cores and 32 GB of RAM.

The application has several components in Java and Scala, and one in Python. All are very intensive:

  • There's a lot of network IO as the components communicate with other devices in the network.
  • There's a lot of localhost IO - communication between components.
  • There's a lot of CPU usage - parsing the data coming over the network and running calculations on it.

Recently, something weird was discovered: When the components have their log level set to DEBUG, the system runs smoothly. When the log level is set to INFO, the system behaves erratically and seems to have a lot of contention around CPU (workers timing out, messages not being sent between components, etc.).

We do write a lot to DEBUG.

Is it possible that the IO caused by DEBUG (writing to many log files) actually reduces CPU contention and improves system stability?

1 Answer

It's rare, but yes, it is possible. The reasons vary enormously, though; all I can offer is a single example to prove that it can happen.

Example: Failure to apply Nagle's/Metcalfe's algorithm.

Let's say we are using optimistic locking somewhere; some know this as a 'retry-based locking system'. For example, PostgreSQL (or just about any other major, modern DB engine) in a TransactionIsolation.SERIALIZABLE configuration.

I'll use Ethernet as an example.

When you hook up 10 computers on the exact same line, to form a network, there's a problem: If 2 computers send data at the same time, then everybody just reads noise - 2 signals overlapping are useless.

The solution in the 80s was a so-called token-ring network: some computer would be the 'controller' and would be the only one allowed to 'speak', at least at first, to set up a strategy. It communicates to every other system on the wire which position it has in the 'ring'. The system worked as follows:

  • Computer 1 can send, and only computer 1. It gets 95ms to send whatever it wants. If it has nothing to send, it can send an 'I cede my time' signal instead.
  • The controller computer then sends a signal that indicates 'okay, that was it. Next!'. Computer 1 must have already stopped sending by then; this is merely to ensure every system keeps the same 'clock'. This sync window is 5ms.
  • Computer 2 now gets 95ms.
  • .. and so on.
  • After computer 10, computer 1 can send again.
  • Note that a full second passes between each 'cycle', so your ping time is on average about 500ms, which is a lot.

This system is both fair and contention free: Assuming no egregiously wrong clocks or misbehaving systems, 2 computers can never talk at the same time, and everybody gets fair use. However, if computer 1 wants to send a file to computer 2, then only 10% of the actual available capacity is used, and ping time is a full second here.
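To put rough numbers on that, here's a trivial back-of-the-envelope calculation in Java using the made-up figures from the example above (95ms send slots, 5ms sync windows, 10 stations - illustrative numbers only, not real protocol constants):

public class TokenRingMath {
  public static void main(String[] args) {
    int stations = 10;
    int sendSlotMs = 95;  // each station's transmit window
    int syncSlotMs = 5;   // the controller's 'okay, next!' window

    int cycleMs = stations * (sendSlotMs + syncSlotMs);          // 1000 ms per full cycle
    double avgWaitMs = cycleMs / 2.0;                            // ~500 ms average wait for your slot
    double utilizationOneSender = (double) sendSlotMs / cycleMs; // ~9.5% if only one station has data

    System.out.printf("cycle = %d ms, avg wait = %.0f ms, utilization with 1 sender = %.1f%%%n",
        cycleMs, avgWaitMs, utilizationOneSender * 100);
  }
}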

As a consequence, as simple and nice and elegant as it sounds, token ring sucked.

Enter Metcalfe and Ethernet. His solution was much, much simpler:

Whatever; there is no system, just send whenever you want. Every computer sends the moment it feels like sending.

To solve the problem of 'noise', all senders also listen to what they send. If they notice noise, they know there was a conflict. All systems can detect and will ignore noise, and any sender that detects that its own message ended up as noise will just wait a little bit and resend.

The one problem is that computers are annoyingly reliable: it's like two people about to walk into each other on the street who both step to the same side, apologize, both step to the same side again, and keep blocking each other. To avoid that, network cards actually roll dice (figuratively). The pause between detecting a failed send and re-sending it is [A] exponentially growing (if sending fails repeatedly, keep waiting longer and longer between attempts), and [B] includes a random amount of waiting. That random amount is crucial: it avoids the scenario where the conflicting senders continually wait identical times and hence keep conflicting forever.
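In code, that backoff-with-jitter idea looks roughly like the following minimal Java sketch. It is not the actual Ethernet/CSMA-CD algorithm; attemptSend and collided are hypothetical stand-ins for 'put bytes on the wire' and 'did I hear noise?', and the timing constants are arbitrary:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Sketch of 'wait exponentially longer, plus a random amount' after each conflict.
public class BackoffSender {
  public static void sendWithBackoff(Runnable attemptSend, BooleanSupplier collided)
      throws InterruptedException {
    for (int attempt = 0; ; attempt++) {
      attemptSend.run();
      if (!collided.getAsBoolean()) {
        return; // no noise detected: the send went through
      }
      // [A] exponential growth: the waiting window doubles on every failed attempt (capped)
      long windowMs = Math.min(1L << Math.min(attempt, 10), 1024L);
      // [B] random jitter: pick a random wait inside the window, so two conflicting
      //     senders are unlikely to retry at exactly the same moment again
      Thread.sleep(ThreadLocalRandom.current().nextLong(windowMs + 1));
    }
  }
}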

However, if too many computers all want to send too much data and you haven't fine-tuned that waiting stuff properly, then at some point your network's performance falls apart as the vast majority of packets are lost due to conflicts.

Slow one system down, though, and perhaps conflicts become much rarer, or don't occur at all.

As an example, imagine 10 computers that all want to send 1GB to their neighbour simultaneously. A token-ring setup would be faster than Ethernet by a decent margin: the bottleneck is the wire, and every computer has something to send whenever it gets a 'window'. The token ring is 100% utilized and never wastes time on noise, whereas the Ethernet system probably has on the order of 20% of the time 'wasted' due to clashing packets.

Introducing hard locking - which many log frameworks do (they either lock outright, or the code that writes to disk locks, or the kernel driver that actually ends up writing to disk does, or the log framework invokes fsync to tell the OS to block until the write has gone through, because the log statements immediately before a hard crash are usually the most important ones, and without fsync you'd lose them) - turns an Ethernet-like system into something that's more like token ring. And even without hard locking, merely slowing down a few computers on an Ethernet system can speed up the sum total set of jobs.
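To make that concrete, here is a simplified sketch of what such a 'hard locking + fsync' appender amounts to in Java. This is not the code of any particular log framework, just the shape of the mechanism: every logging thread queues up behind one lock and then waits until the OS confirms the bytes reached the disk.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BlockingFileAppender implements AutoCloseable {
  private final FileChannel channel;

  public BlockingFileAppender(Path logFile) throws IOException {
    this.channel = FileChannel.open(logFile,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
  }

  // synchronized: only one thread at a time may log; force(): block until it's on disk.
  public synchronized void append(String line) throws IOException {
    channel.write(ByteBuffer.wrap((line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8)));
    channel.force(false); // the fsync-style call discussed above
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}

Every thread that logs through something like this effectively gets a 'slot' behind the lock and the disk, which is exactly the token-ring-like serialization described above.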

I doubt you're writing a network kernel driver here. However, the exact same principle can be at work, for example, when multiple threads are all reading/writing to an optimistic-locking based DB using TransactionIsolation.SERIALIZABLE (for example: PostgreSQL) and are properly doing what they need to do when retry exceptions occur due to contention (namely: rerun the code that interacts with the DB). The Ethernet vs. token ring example just feels like the cleanest way to explain the principle.
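Such a retry loop typically looks roughly like the following sketch in plain JDBC (SqlWork is a made-up callback interface here; PostgreSQL reports serialization conflicts with SQLState 40001):

import java.sql.Connection;
import java.sql.SQLException;

public class SerializableRetry {
  @FunctionalInterface
  public interface SqlWork {
    void run(Connection conn) throws SQLException;
  }

  public static void runWithRetry(Connection conn, SqlWork work) throws SQLException {
    conn.setAutoCommit(false);
    conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    while (true) {
      try {
        work.run(conn);  // the actual reads/writes
        conn.commit();
        return;
      } catch (SQLException e) {
        conn.rollback();
        if (!"40001".equals(e.getSQLState())) {
          throw e;       // a real error, not a serialization conflict
        }
        // serialization failure due to contention: just run the whole block again
      }
    }
  }
}

The part that matters for this question is the while (true): the more contention there is, the more laps that loop takes, and every extra lap burns CPU and IO on work that gets thrown away.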

Just about every other optimistic locking scenario has the same issue: if there's far more to do than the system has the capacity to handle, the 'error rate' climbs so high that everything slows to a crawl, and, bizarrely enough, slowing certain parts down actually speeds up the total system.

If one of the 'high contention' blocks of code logs to debug and the other does not, then changing the 'ignore all logs below this level' option from INFO to DEBUG introduces fsync and/or disk locks in that block of code, which slows it way down and thus could lead to the total system actually becoming faster. Even the high-contention block itself can come out ahead: once it IS done with the slow-as-molasses log call, the odds that the job it is actually trying to do needs to be retried due to optimistic lock contention are much, much lower.

Optimistic Locking oversaturation is just one example of 'slow one part down, the total is faster'. There are many more.

Whichever it is, it is highly likely that some part of your system is wasting tons of time either retrying or inefficiently waiting on locks. For example, busy-waiting: checking very often whether a lock is now free, where the check itself is expensive. Say you aren't using real locks and instead have something like:

while (!checkIfSomeSystemIsFree()) {
  // busy-wait: poll the expensive check twenty times a second
  Thread.sleep(50L); // the enclosing method needs `throws InterruptedException`
  LOG.debug("foo");
}

If checkIfSomeSystemIsFree() is fairly CPU intensive, then a system where many threads all run the above loop can trivially become faster when you enable debug logging: the slow log call makes every thread poll less often.


1 Comment

That's a great explanation. I'm going to do a lot of digging to see if any of the resources (DB, message queue, etc) have any behavior like that. Much appreciated.
