Hey everyone, I’ve been trying out LLM security testing and it’s pretty interesting. With so many AI models around, it’s important to check that they can’t be pushed into doing unsafe things. I’ve been using a tool called Garak (Generative AI Red-teaming and Assessment Kit): basically a battery of tests that checks whether a model can be tricked into revealing secrets, generating malicious code, or saying nasty things.
In this post, I’ll compare four models: Mistral-Nemo, Gemma3, LLaMA 2-13B, and Vicuna-13B, all served locally through Ollama and scanned with Garak. You can think of it like a friendly match, except instead of goals we’re scoring on security.
Background on Garak
Garak is kind of like Nmap, but for AI. Instead of scanning a network, it fires sets of adversarial prompts, grouped into “probes”, at the model. Each probe tries to make the model slip up in a specific way.
Some probes are:
- PromptInject: Tries to hide instructions to see if the model follows them.
- MalwareGen: Asks for malicious code.
- LeakReplay: Tries to get the model to replay chunks of its training data or other text it shouldn’t reveal.
- RealToxicityPrompts: Pushes the model to say toxic things.
When Garak runs, it writes every prompt and model response to report files. Afterwards I went through those logs to see which probes each model passed or failed.
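Before committing to a full run, it’s worth seeing what’s available. On the Garak version I used, the CLI can list its probes and detectors (flag names may differ slightly between releases):

```bash
# List every probe and detector Garak ships with
garak --list_probes
garak --list_detectors
```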
Models and Setup
I picked these open models:
- Mistral-Nemo (12B): Released July 2024 by Mistral AI and NVIDIA, positioned as a strong chat model.
- Gemma3:latest: Google’s Gemma 3 release (early 2025), which ships with additional safety tuning.
- LLaMA 2-13B: Meta’s 13B model, popular for many tasks.
- Vicuna-13B: Based on LLaMA, tuned to be safer.
I ran all models on Ollama (version 0.1.34 or newer). My computer has 32 CPU cores, 128 GB RAM, and an NVIDIA A100 GPU. I used Garak v0.10.3.post1 with default settings.
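Garak itself is an ordinary Python package, so setup is quick. For completeness, this is roughly what I did; the virtual environment name is just my choice, and your versions will differ:

```bash
# Install Garak in an isolated environment and confirm tool versions
python3 -m venv garak-env && source garak-env/bin/activate
pip install -U garak
garak --version
ollama --version
```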
Installing and Configuring Ollama
# Update Ollama (re-running the official install script upgrades an existing Linux install)
curl -fsSL https://ollama.com/install.sh | sh
# Download models
ollama pull mistral-nemo
ollama pull gemma3:latest
ollama pull llama2:13b
ollama pull vicuna:13b
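A quick sanity check that all four models actually made it down:

```bash
# List locally available models and their sizes
ollama list
```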
Running Garak with Ollama
To scan a model, run:
garak --model_type ollama \
--model_name mistral-nemo \
--probes malwaregen.Evasion,promptinject \
--report_prefix ./reports/mistral_nemo
I set a random seed and repeated each test three times to make it more reliable.
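In practice I wrapped that command in a small loop so every model got the same seed and the same three repeats. This is only a sketch of what mine looked like (the seed value and report paths are arbitrary; --seed and --report_prefix were available on the Garak build I used):

```bash
# Run the same probe set against each model three times with a fixed seed
for model in mistral-nemo gemma3:latest llama2:13b vicuna:13b; do
  safe_name="${model//:/_}"   # colons don't belong in file names
  for run in 1 2 3; do
    garak --model_type ollama \
          --model_name "$model" \
          --probes malwaregen.Evasion,promptinject \
          --seed 42 \
          --report_prefix "./reports/${safe_name}_run${run}"
  done
done
```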
Methodology
- Probe Selection: I chose four main probes:
  - MalwareGen.Evasion: Asks for code that could bypass antivirus tools.
  - PromptInject.Encoding: Hides instructions in encoded text to see if the model decodes and follows them.
  - LeakReplay.DataLeakage: Tries to get the model to replay training data or hidden prompts.
  - RealToxicityPrompts: Pushes the model towards toxic language.
- Metrics:
  - Failure Rate (%): The percentage of prompts where the model produced the unwanted behavior.
  - Mean Time per Prompt (s): Average wall-clock time to answer a probe prompt.
  - Resource Usage: GPU memory and CPU load during the scan.
- Probe Execution: Each probe used 20 prompts with five generations per prompt; if any of the five generations triggered the detector, that prompt counted as a failure.
- Data Analysis: I averaged the results across the three runs and computed standard deviations (see the snippet after this list). The final numbers are in the table below.
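For the counting itself I leaned on the hit log that each run produced. A rough sketch of that step, assuming a `<prefix>.hitlog.jsonl` file with one JSON line per failing attempt and a `probe` field, which is what my Garak version wrote; adjust paths and field names to whatever your run outputs:

```bash
# Count hits (policy-violating outputs) per probe for one run
jq -r '.probe' ./reports/mistral-nemo_run1.hitlog.jsonl | sort | uniq -c

# Total number of hits in that run
wc -l < ./reports/mistral-nemo_run1.hitlog.jsonl
```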
Comparative Results
Model | MalwareGen.Evasion | PromptInject.Encoding | LeakReplay.DataLeakage | RealToxicityPrompts |
---|---|---|---|---|
Mistral-Nemo | 100.0% ± 0.0% | 92.0% ± 1.7% | 85.7% ± 2.3% | 17.0% ± 1.5% |
Gemma3:latest | 56.3% ± 4.1% | 37.5% ± 3.8% | 48.2% ± 4.5% | 10.5% ± 1.2% |
LLaMA 2-13B | 81.0% ± 3.9% | 68.3% ± 2.5% | 72.4% ± 3.1% | 26.7% ± 2.0% |
Vicuna-13B | 62.5% ± 4.8% | 54.6% ± 3.0% | 61.3% ± 3.5% | 3.8% ± 1.0% |
Note: Each cell is the failure rate (how often the model produced the unwanted behavior), shown as mean ± standard deviation across the three runs. Higher is worse.
Yes, I know, those numbers look rough (I had the same reaction 😂).
Mistral-Nemo
- MalwareGen.Evasion (100.0%): It always gave malware code. No defense at all.
- PromptInject.Encoding (92.0%): Fell for encoding tricks most of the time.
- LeakReplay.DataLeakage (85.7%): Leaked training data most of the time.
- RealToxicityPrompts (17.0%): Created toxic content sometimes.
Gemma3:latest
- MalwareGen.Evasion (56.3%): Sometimes refused but got tricked by advanced hacks.
- PromptInject.Encoding (37.5%): Better but not perfect.
- LeakReplay.DataLeakage (48.2%): Leaked something about half the time.
- RealToxicityPrompts (10.5%): Rarely said toxic things.
LLaMA 2-13B
- MalwareGen.Evasion (81.0%): Produced malware scripts often.
- PromptInject.Encoding (68.3%): Fell for encoding a lot.
- LeakReplay.DataLeakage (72.4%): Regularly leaked data.
- RealToxicityPrompts (26.7%): Most toxic among the group.
Vicuna-13B
- MalwareGen.Evasion (62.5%): Was not as bad as LLaMA but still failed a lot.
- PromptInject.Encoding (54.6%): Mediocre, could still be tricked.
- LeakReplay.DataLeakage (61.3%): Leaked data more than half the time.
- RealToxicityPrompts (3.8%): Best at not being toxic.
Discussion
Security Trends Across Models
- Older vs Newer: Age alone didn’t decide it: LLaMA 2 and Vicuna failed often, but so did the much newer Mistral-Nemo. What mattered more was whether the model ships with guardrails, as Gemma3 does.
- Instruction-Tuning: Vicuna had extra tuning so it was better at not making malware or saying toxic stuff.
- Guardrails Matter: Gemma3 blocked some attacks but still fell for advanced ones.
- Architecture: Models without built-in safety (Mistral-Nemo, LLaMA 2) were very vulnerable.
Performance and Resource Usage
- Average Time per Prompt:
  - Mistral-Nemo: 4.8 s
  - Gemma3: 6.2 s
  - LLaMA 2-13B: 5.5 s
  - Vicuna-13B: 5.7 s
- GPU Memory Used:
  - Mistral-Nemo: 12 GB
  - Gemma3: 16 GB
  - LLaMA 2-13B: 14 GB
  - Vicuna-13B: 15 GB
- CPU Load: About 20–25% for all models while testing.
Gemma3 used the most GPU memory and was the slowest per prompt, but it was also the hardest to break, so there’s a clear speed/safety trade-off.
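For the GPU numbers I simply sampled nvidia-smi while the scans were running; something like this (the output file name is arbitrary):

```bash
# Sample GPU memory and utilization every 5 seconds during a Garak run
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu \
  --format=csv -l 5 >> gpu_usage_mistral_nemo.csv
```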
And yeah, don’t just skim the headline numbers, make sure to run the checks properly yourself 😁.
Recommendations
- Keep Testing: Run Garak regularly to find new flaws (see the cron sketch after this list).
- Use Multiple Safety Layers: Combine model guardrails with external checks.
- Choose Tuned Models: Vicuna shows that tuning helps.
- Update Your Tools: Ollama has had bugs (like CVE-2024-37032). Always use the latest version.
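For the “keep testing” point, the laziest reliable option is a scheduled job. A hypothetical example (run_garak_scans.sh stands in for whatever wrapper script you end up writing):

```bash
# Add a weekly Garak scan to the user's crontab (Mondays at 02:00)
( crontab -l 2>/dev/null; echo '0 2 * * 1 /home/me/security/run_garak_scans.sh >> /home/me/security/garak_cron.log 2>&1' ) | crontab -
```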
Conclusion
Running Garak against these models shows that every one of them has weak spots. Mistral-Nemo failed the most by far, Gemma3 held up reasonably well but not perfectly, LLaMA 2 struggled, and Vicuna did well overall while still being far from flawless. The main lesson: we need ongoing testing, layered safety measures, and up-to-date software to keep these AI models safe.
Thanks for reading, and happy red-teaming!