Our organization runs Jenkins 2.303.1 on-prem and executes thousands of builds a day. The project I work on uses one Jenkins master and about ten build nodes. We build a few hundred Maven/Java/Spring applications with similar architectures.

In the build process, we have a "tools image" that contains java and mvn and some other tools.

Yesterday, we updated the build process to reference a newer version of the tools image that has some additional tools we need. Shortly after that update, we noticed that on four of the build nodes every build was failing in the same way, with approximately this command line and output:

+ bash -o pipefail -c mvn -U -s ... -Duser.home=/ clean compile test-compile 2>&1 | tee mvn.out
The JAVA_HOME environment variable is not defined correctly,
this environment variable is needed to run this program.

Note that this command is run by a "sh" pipeline step.

This error message comes from inside the "mvn" script. It is printed when the script does not find an executable file at $JAVA_HOME/bin/java.
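For reference, the check that produces this message can be approximated as follows. This is a sketch from my reading of the stock Maven launcher, not a copy of the script in our tools image; the function name check_java_home is hypothetical, and the real script handles more cases (for example, JAVA_HOME being unset). The point is that the test is for executability (-x), not merely existence.

```shell
# Approximation of the launcher's check. Assumption: the mvn script in the
# image tests "$JAVA_HOME/bin/java" with -x, as I believe the stock one does.
check_java_home() {
  javacmd="$1/bin/java"
  if [ ! -x "$javacmd" ]; then
    echo "The JAVA_HOME environment variable is not defined correctly," >&2
    echo "this environment variable is needed to run this program." >&2
    return 1
  fi
  return 0
}
```

Calling check_java_home "$JAVA_HOME" in a "sh" step right before the mvn call would show whether the same test fails outside the launcher.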

I then added several "sh" steps before the mvn call to show the output of the following:

  • which java
  • which mvn
  • ls -lt $JAVA_HOME/bin/java

On the "bad" nodes, the first two commands printed nothing. That means that neither "java" nor "mvn" is found in the PATH, or that they are not considered executable. On the "good" nodes, they print the expected locations of the "java" and "mvn" executables.
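To tell "not in the PATH" apart from "in the PATH but not considered executable", a probe along these lines could be added as another "sh" step. It is a minimal sketch that assumes bash is the shell; probe_cmd is a hypothetical helper, not something in our pipeline.

```shell
# probe_cmd NAME: walk PATH and report, for every directory containing NAME,
# whether the file exists (-e) and whether the shell's -x test passes.
probe_cmd() {
  local name="$1" dir candidate found=1
  local IFS=:
  for dir in $PATH; do
    candidate="${dir:-.}/$name"   # an empty PATH entry means the current directory
    if [ -e "$candidate" ]; then
      found=0
      if [ -x "$candidate" ]; then
        echo "$candidate: exists, test -x passes"
      else
        echo "$candidate: exists, test -x FAILS"
      fi
    fi
  done
  [ "$found" -eq 0 ] || echo "$name: no file with that name in any PATH directory"
  return "$found"
}
```

Running probe_cmd java and probe_cmd mvn on a "good" and a "bad" node would show whether the files are missing or whether only the executability test differs.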

The output from the third command is this:

-rwxr-xr-x. 1 root root 12768 Oct 17 21:48 /opt/java/openjdk/bin/java

I also added "env" output before this. It shows that "JAVA_HOME" is set to "/opt/java/openjdk", and that PATH contains the bin directories of both the mvn and java distributions.

This evidence shows several facts that don't make sense together. The "mvn" script is complaining that it cannot use $JAVA_HOME/bin/java, but the "ls" output shows that the file exists and has execute permission. The "which mvn" output says that "mvn" is not found in the PATH, but the bash command line above runs plain "mvn" without an absolute path, so the only way bash could reach it is through the PATH. It is clearly finding it, because otherwise the error message from inside the "mvn" script would never be printed.

I've tried to compare several aspects of the builds running on the "good" nodes with those running on the "bad" nodes. For instance, I copied the list of env vars from both and compared them, and found no significant differences.
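The env comparison was done roughly like this. It is a sketch rather than the exact commands used; env_diff and the file names env.good and env.bad are hypothetical, and it assumes no variable value contains a newline.

```shell
# env_diff GOOD BAD: print the lines that differ between two saved `env` dumps.
# Lines only in GOOD are prefixed "<", lines only in BAD are prefixed ">".
env_diff() {
  good_sorted=$(mktemp)
  bad_sorted=$(mktemp)
  sort "$1" > "$good_sorted"
  sort "$2" > "$bad_sorted"
  # Keep only the changed lines, dropping diff's position markers.
  diff "$good_sorted" "$bad_sorted" | grep '^[<>]'
  rm -f "$good_sorted" "$bad_sorted"
}
```

With env > env.good captured on a "good" node and env > env.bad on a "bad" one, env_diff env.good env.bad lists only the variables that differ.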

We tried restarting the bad build nodes. We tried purging the entire local docker cache and restarting docker. Neither of these steps made a difference.

I'm looking for any ideas of possible areas to explore to explain this problem. We've had several people staring at this for quite a while now, including one person who maintains the Jenkins build nodes, one person who maintains the tools image, and several others with considerable experience. We are all drawing a blank here.

  • If possible run bash with -xv with stderr into a file and look for non-printing chars, eg through cat -vet. Commented Jan 20, 2024 at 12:21
  • Start up the image with an interactive shell, debug from there. Do you get the same behavior? Commented Jan 21, 2024 at 15:18
  • It had occurred to me that we haven't tried exactly that scenario, but there are multiple scenarios where the symptom does NOT occur. It does happen on the four specific build nodes, but it doesn't happen on the other build nodes in the pool, and someone also tried creating a cluster pod with that tools image, and it also doesn't have the symptom. I can try to set up a test of this tomorrow, but I think it's unlikely to show the issue. I'm also going to try some tests with the previous version of this image on those "bad" build nodes. Commented Jan 21, 2024 at 17:51
