Prathish Deivendiran for Nife Labs

Posted on Jun 13 • Originally published at docs.nife.io

How to Unit Test Shell Scripts from LLMs Without Blowing Up Your Server

#llm #sandbox #shellscripts #nginx

So, you've got a shiny new shell script courtesy of ChatGPT, Copilot, or your favorite AI. It looks good, it even feels good. But that nagging doubt creeps in: "Is this thing really safe to run in production?"

This is the world of unit testing shell scripts generated by LLMs – a world where the stakes are high, sudo is a double-edged sword, and a single misplaced rm -rf can ruin your entire day. This post provides a battle-tested strategy to safely test and validate scripts that manage critical services like PM2, Docker, Nginx, or anything interacting with your system's state.

The Perils of Trusting LLM-Generated Shell Scripts

Large Language Models (LLMs) are fantastic for quickly generating shell scripts. However, even the best LLMs are prone to:

Making assumptions about your environment: They might assume specific package installations or directory structures that don't exist on your server.
Using incorrect binary names: For example, using pgrep -x PM2 instead of the correct pm2.
Overlooking side effects: Commands like systemctl restart docker aren't harmless; restarting the Docker daemon stops running containers by default, which means unexpected downtime.

Even if the script's logic is 90% correct, that remaining 10% can lead to:

Services restarting at the wrong time.
Data written to incorrect log paths.
Broken idempotency (repeated runs causing unintended changes).

That's why robust unit testing is crucial – not in the traditional pytest sense, but using shell-native methods to verify logic and safety.
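As one concrete shell-native check, idempotency can be tested by running an action twice and comparing the resulting state. The sketch below is only an illustration: `ensure_line` and the temporary config file are hypothetical and do not come from any particular generated script.

```bash
#!/usr/bin/env bash
# Sketch: an idempotency check. Run the action twice, then compare the state.
# ensure_line and conf are hypothetical names used for illustration.

ensure_line() {
  line="$1"
  file="$2"
  # Append the line only if an identical line is not already present.
  grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

conf="$(mktemp)"

ensure_line "worker_processes auto;" "$conf"
first="$(cat "$conf")"

ensure_line "worker_processes auto;" "$conf"
second="$(cat "$conf")"

if [ "$first" = "$second" ]; then
  echo "PASS: second run changed nothing"
else
  echo "FAIL: second run modified the file"
fi
```

A script that used a plain `echo ... >> file` here would fail this check, because every run would append another copy of the line.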

Strategy 1: Embrace the `--dry-run` Mode

Every LLM-generated script should include a --dry-run flag. This allows you to preview the script's actions without executing them.

Here's how to implement it:

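DRY_RUN=false
[[ "${1:-}" == "--dry-run" ]] && DRY_RUN=true

# In dry-run mode, print the command; otherwise log it and run it.
# An explicit if/else avoids the `A && B || C` pitfall, where C
# would still run if B (the echo) failed.
log_action() {
  if [[ "$DRY_RUN" == true ]]; then
    echo "[DRY RUN] $1"
  else
    echo "$(date): $1"
    eval "$1"
  fi
}

# Example usage:
log_action "sudo systemctl restart nginx"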

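This approach makes the script's actions traceable: you can inspect every command it intends to run before anything executes. A dry run does not make operations reversible, so destructive commands still deserve a backup or a sandbox.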
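One possible refinement of the wrapper above, sketched under the assumption that commands can be passed as separate arguments rather than as one string: the hypothetical `run` helper executes `"$@"` directly, which avoids `eval` and the quoting surprises that come with it. The helper name and the demo file are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch: a dry-run wrapper that avoids eval by taking the command as arguments.
# `run` is a hypothetical helper name, not part of the original script.

DRY_RUN="${DRY_RUN:-false}"

run() {
  if [ "$DRY_RUN" = true ]; then
    # Only report what would have been executed.
    echo "[DRY RUN] $*"
  else
    echo "$(date '+%F %T') running: $*"
    "$@"
  fi
}

# Demo with a harmless command: in dry-run mode the file is never created.
demo_dir="$(mktemp -d)"
demo_file="$demo_dir/created-by-run"

DRY_RUN=true
dry_output="$(run touch "$demo_file")"
echo "$dry_output"
```

Because the arguments are never re-parsed, a path containing spaces or shell metacharacters reaches the command unchanged.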

Strategy 2: Mock External Commands

You don't want docker restart or pm2 resurrect running during your tests. We can override these commands using mocking:

Create a mock-bin directory: mkdir mock-bin
Create a mock docker script:

   printf '#!/bin/bash\necho "[MOCK] $(basename "$0") $*"\n' > mock-bin/docker
   chmod +x mock-bin/docker

Add the mock directory to your PATH: export PATH="$(pwd)/mock-bin:$PATH"

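Now, any call to docker prints a harmless message instead of touching your containers. Repeat this process for other potentially disruptive commands like systemctl, pm2, and rm. Keep in mind that PATH-based mocks only intercept commands looked up by name: a script that calls /usr/bin/docker directly bypasses them, and sudo typically resets PATH, so mock sudo as well if the script uses it.

This technique is common in test suites built on the Bash Automated Testing System (BATS) and allows for isolated, safe testing.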
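The mock above only prints a message. A small extension, sketched here with hypothetical names (`MOCK_DIR`, `MOCK_LOG`, `restart_app`), is to have the mock record each call so a test can assert on what the script tried to do.

```bash
#!/usr/bin/env bash
# Sketch: a PATH-based mock that records how it was called.
# MOCK_DIR, MOCK_LOG and restart_app are hypothetical names used for illustration.

MOCK_DIR="$(mktemp -d)"
MOCK_LOG="$MOCK_DIR/calls.log"
export MOCK_LOG

# The mock appends its name and arguments to the log instead of doing anything.
cat > "$MOCK_DIR/docker" <<'EOF'
#!/bin/sh
echo "$(basename "$0") $*" >> "$MOCK_LOG"
EOF
chmod +x "$MOCK_DIR/docker"

# Put the mock directory first so it shadows any real docker binary.
PATH="$MOCK_DIR:$PATH"
export PATH

# Code under test: stands in for a function from the generated script.
restart_app() {
  docker restart my-app
}

restart_app

# Assert on the recorded call rather than on real side effects.
if grep -qx 'docker restart my-app' "$MOCK_LOG"; then
  echo "PASS: docker restart my-app was requested"
else
  echo "FAIL: expected call was not recorded"
fi
```

The same mock file can be copied under other names (systemctl, pm2), since it reports whichever name it was invoked as.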

Strategy 3: Leverage `shellcheck`

LLMs sometimes make mistakes with quoting, variables, or command usage. shellcheck is your invaluable ally here.

Simply run:

shellcheck myscript.sh

shellcheck will identify:

Unquoted variables ("$var" vs $var).
Incorrect command usage.
Malformed if conditions.

Think of it as a linter for your shell scripts: it catches syntactic and structural mistakes before the script runs, but it cannot tell you whether the logic is right.
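To see why the unquoted-variable warning (reported by shellcheck as SC2086) matters, here is a minimal sketch that needs nothing but the shell itself. `count_args` and `target` are hypothetical names used only for this demonstration.

```bash
#!/usr/bin/env bash
# Sketch: the effect of leaving a variable unquoted.
# count_args and target are hypothetical names used for illustration.

count_args() {
  # Print how many arguments the function received.
  echo "$#"
}

target="my logs"   # a path containing a space

unquoted_count="$(count_args $target)"     # word splitting: two arguments
quoted_count="$(count_args "$target")"     # one argument, as intended

echo "unquoted: $unquoted_count, quoted: $quoted_count"
# → unquoted: 2, quoted: 1
```

If `count_args` were `rm -rf`, the unquoted form would try to delete `my` and `logs` as two separate paths.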

Strategy 4: Modularize with Functions

Break your script into smaller, testable functions:

check_pm2() {
  # The [P] bracket keeps grep from matching its own process entry.
  ps aux | grep -q '[P]M2'
}

restart_all() {
  pm2 resurrect
  docker restart my-app
  systemctl restart nginx
}

This allows you to mock and call these functions individually within a test harness, avoiding the need to run the entire script each time.
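A minimal test harness for `restart_all` might look like the sketch below. It relies on the fact that shell functions take precedence over binaries found on PATH; the `CALLS` variable and the stub functions are hypothetical test scaffolding, not part of the script itself.

```bash
#!/usr/bin/env bash
# Sketch: testing one function in isolation by shadowing the commands it calls.
# CALLS and the stub functions are hypothetical test scaffolding.

# Function under test, as defined in the script.
restart_all() {
  pm2 resurrect
  docker restart my-app
  systemctl restart nginx
}

# Stubs: these functions run instead of the real commands and record each call.
CALLS=""
pm2()       { CALLS="${CALLS}pm2 $*;"; }
docker()    { CALLS="${CALLS}docker $*;"; }
systemctl() { CALLS="${CALLS}systemctl $*;"; }

restart_all

expected="pm2 resurrect;docker restart my-app;systemctl restart nginx;"
if [ "$CALLS" = "$expected" ]; then
  echo "PASS: restart_all issued the expected commands in order"
else
  echo "FAIL: got '$CALLS'"
fi
```

For this to work with a real script, the script has to be sourceable without running its main logic. A common bash idiom is to wrap the entry point in a guard such as `if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then main "$@"; fi`, where `main` is whatever function the script uses as its entry point.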

Strategy 5: Log Everything (Seriously!)

Log every decision point. Why? Because "works on my machine" is unhelpful when a container fails to restart or PM2 silently exits.

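# Allow tests to redirect the log; writing to /var/log usually requires root.
LOG_FILE="${LOG_FILE:-/var/log/pm2_watchdog.log}"

log() {
  echo "$(date '+%F %T') [LOG] $*" >> "$LOG_FILE"
}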

Comprehensive logging provides crucial debugging information when things go wrong.
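What "every decision point" can look like in practice is sketched below: both branches of a check are logged, and the log destination is overridable so a test can read it back. `LOG_FILE` and `check_and_log` are hypothetical names; the default path is the one used above.

```bash
#!/usr/bin/env bash
# Sketch: logging both outcomes of a decision, with a test-friendly log path.
# LOG_FILE and check_and_log are hypothetical names used for illustration.

LOG_FILE="${LOG_FILE:-/var/log/pm2_watchdog.log}"

log() {
  echo "$(date '+%F %T') [LOG] $*" >> "$LOG_FILE"
}

# Run a check command and record which branch was taken.
check_and_log() {
  if "$@" > /dev/null 2>&1; then
    log "check passed: $*"
  else
    log "check failed: $* (would trigger restart)"
    return 1
  fi
}

# In a test, point the log at a temporary file instead of /var/log.
LOG_FILE="$(mktemp)"

check_and_log true
check_and_log false || true

cat "$LOG_FILE"
```

Here `true` and `false` stand in for real health checks such as `check_pm2`.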

Strategy 6: Sandbox Your Tests

If you have access to Docker or a virtual machine, create a replica environment to run your tests. It's far better to break a test server than your production system!

For example:

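docker run --rm -it ubuntu:20.04 bash
# Then install necessary packages: pm2, docker, nginx, etc.
# Note: a plain container does not run systemd, so systemctl calls
# still need a mock (Strategy 2) or a full virtual machine.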
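For non-interactive runs, one option is a throwaway container with no network and the script mounted read-only. The sketch below uses hypothetical names (`run_in_sandbox`, `SANDBOX_IMAGE`, `SANDBOX_PREVIEW`) and assumes the script lives at `myscript.sh` in the current directory. By default it only prints the docker command, so it can be reviewed before anything runs.

```bash
#!/usr/bin/env bash
# Sketch: run a script in a disposable container with no network access.
# run_in_sandbox, SANDBOX_IMAGE and SANDBOX_PREVIEW are hypothetical names.

SANDBOX_IMAGE="${SANDBOX_IMAGE:-ubuntu:20.04}"
SANDBOX_PREVIEW="${SANDBOX_PREVIEW:-true}"   # true: print the command only

run_in_sandbox() {
  script_path="$1"   # absolute path to the script on the host
  shift
  # Build the docker command: --rm discards the container afterwards,
  # --network none cuts network access, :ro mounts the script read-only.
  set -- docker run --rm --network none \
    -v "$script_path:/work/script.sh:ro" \
    "$SANDBOX_IMAGE" bash /work/script.sh "$@"
  if [ "$SANDBOX_PREVIEW" = true ]; then
    echo "$*"
  else
    "$@"
  fi
}

preview="$(run_in_sandbox "$PWD/myscript.sh" --dry-run)"
echo "$preview"
```

Setting `SANDBOX_PREVIEW=false` would execute the command for real, which requires Docker to be installed and the image to contain everything the script expects.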

Bonus: Useful Tools

BATS: A powerful Bash unit testing framework.
shunit2: An xUnit-style testing framework for POSIX shells.
assert.sh: A simple shell assertion helper.
shellspec: A full-featured, RSpec-like testing framework.

Final Thoughts: Test Before You Trust

It's tempting to simply run an LLM-generated script, but in production environments, especially those managing critical services, testing is paramount. Use dry-run flags, mock commands, employ shellcheck, add comprehensive logging, and test in a sandbox. Prioritize safety – your sanity and uptime will thank you!

