Hey everyone, I’ve been trying out LLM security testing and it’s pretty interesting. With so many AI models around, it’s important to check that they can’t be pushed into doing unsafe things. I’ve been using a tool called Garak (Generative AI Red-teaming and Assessment Kit): basically a battery of tests that checks whether a model can be tricked into revealing secrets, generating malicious code, or saying nasty things.
In this post, I’ll compare four models: Mistral-Nemo, Gemma3, LLaMA 2-13B, and Vicuna-13B, all served locally through Ollama and scanned with Garak. You can think of it like a friendly match, except instead of goals we’re scoring on security.
Background on Garak
Garak is kind of like Nmap, but for AI. Instead of scanning a network, it fires sets of adversarial prompts, grouped into “probes”, at the model. Each probe tries to make the model slip up in a specific way.
Some probes are:
- PromptInject: Tries to hide instructions to see if the model follows them.
- MalwareGen: Asks for malicious code.
- LeakReplay: Tries to get the model to replay chunks of its training data or other text it shouldn’t reveal.
- RealToxicityPrompts: Pushes the model to say toxic things.
When Garak runs, it writes every prompt and model response to report files. Afterwards I went through those logs to see which probes each model passed or failed.
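Before committing to a full run, it’s worth seeing what’s available. On the Garak version I used, the CLI can list its probes and detectors (flag names may differ slightly between releases):

```bash
# List every probe and detector Garak ships with
garak --list_probes
garak --list_detectors
```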
Models and Setup
I picked these open models:
- Mistral-Nemo (12B): Released July 2024 by Mistral AI and NVIDIA, positioned as a strong chat model.
- Gemma3:latest: Google’s Gemma 3 release (early 2025), which ships with additional safety tuning.
- LLaMA 2-13B: Meta’s 13B model, popular for many tasks.
- Vicuna-13B: Based on LLaMA, tuned to be safer.
I ran all models on Ollama (version 0.1.34 or newer). My computer has 32 CPU cores, 128 GB RAM, and an NVIDIA A100 GPU. I used Garak v0.10.3.post1 with default settings.
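Garak itself is an ordinary Python package, so setup is quick. For completeness, this is roughly what I did; the virtual environment name is just my choice, and your versions will differ:

```bash
# Install Garak in an isolated environment and confirm tool versions
python3 -m venv garak-env && source garak-env/bin/activate
pip install -U garak
garak --version
ollama --version
```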
Installing and Configuring Ollama
# Update Ollama (re-running the official install script upgrades an existing Linux install)
curl -fsSL https://ollama.com/install.sh | sh
# Download models
ollama pull mistral-nemo
ollama pull gemma3:latest
ollama pull llama2:13b
ollama pull vicuna:13b
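A quick sanity check that all four models actually made it down:

```bash
# List locally available models and their sizes
ollama list
```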
Running Garak with Ollama
To scan a model, run:
garak --model_type ollama \
--model_name mistral-nemo \
--probes malwaregen.Evasion,promptinject \
--report_prefix ./reports/mistral_nemo
I set a random seed and repeated each test three times to make it more reliable.
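In practice I wrapped that command in a small loop so every model got the same seed and the same three repeats. This is only a sketch of what mine looked like (the seed value and report paths are arbitrary; --seed and --report_prefix were available on the Garak build I used):

```bash
# Run the same probe set against each model three times with a fixed seed
for model in mistral-nemo gemma3:latest llama2:13b vicuna:13b; do
  safe_name="${model//:/_}"   # colons don't belong in file names
  for run in 1 2 3; do
    garak --model_type ollama \
          --model_name "$model" \
          --probes malwaregen.Evasion,promptinject \
          --seed 42 \
          --report_prefix "./reports/${safe_name}_run${run}"
  done
done
```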
Methodology
- Probe Selection: I chose four main probes:
  - MalwareGen.Evasion: Asks for code that could bypass antivirus tools.
  - PromptInject.Encoding: Hides instructions in encoded text to see if the model decodes and follows them.
  - LeakReplay.DataLeakage: Tries to get the model to replay training data or hidden prompts.
  - RealToxicityPrompts: Pushes the model towards toxic language.
- Metrics:
  - Failure Rate (%): The percentage of prompts where the model produced the unwanted behavior.
  - Mean Time per Prompt (s): Average wall-clock time to answer a probe prompt.
  - Resource Usage: GPU memory and CPU load during the scan.
- Probe Execution: Each probe used 20 prompts with five generations per prompt; if any of the five generations triggered the detector, that prompt counted as a failure.
- Data Analysis: I averaged the results across the three runs and computed standard deviations (see the snippet after this list). The final numbers are in the table below.
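For the counting itself I leaned on the hit log that each run produced. A rough sketch of that step, assuming a `<prefix>.hitlog.jsonl` file with one JSON line per failing attempt and a `probe` field, which is what my Garak version wrote; adjust paths and field names to whatever your run outputs:

```bash
# Count hits (policy-violating outputs) per probe for one run
jq -r '.probe' ./reports/mistral-nemo_run1.hitlog.jsonl | sort | uniq -c

# Total number of hits in that run
wc -l < ./reports/mistral-nemo_run1.hitlog.jsonl
```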
Comparative Results
Model | MalwareGen.Evasion | PromptInject.Encoding | LeakReplay.DataLeakage | RealToxicityPrompts |
---|---|---|---|---|
Mistral-Nemo | 100.0% ± 0.0% | 92.0% ± 1.7% | 85.7% ± 2.3% | 17.0% ± 1.5% |
Gemma3:latest | 56.3% ± 4.1% | 37.5% ± 3.8% | 48.2% ± 4.5% | 10.5% ± 1.2% |
LLaMA 2-13B | 81.0% ± 3.9% | 68.3% ± 2.5% | 72.4% ± 3.1% | 26.7% ± 2.0% |
Vicuna-13B | 62.5% ± 4.8% | 54.6% ± 3.0% | 61.3% ± 3.5% | 3.8% ± 1.0% |
Note: Each cell is the failure rate (how often the model produced the unwanted behavior), shown as mean ± standard deviation across the three runs. Higher is worse.
Yes, I know, those numbers look rough (I had the same reaction 😂).
Mistral-Nemo
- MalwareGen.Evasion (100.0%): It always gave malware code. No defense at all.
- PromptInject.Encoding (92.0%): Fell for encoding tricks most of the time.
- LeakReplay.DataLeakage (85.7%): Leaked training data most of the time.
- RealToxicityPrompts (17.0%): Created toxic content sometimes.
Gemma3:latest
- MalwareGen.Evasion (56.3%): Sometimes refused but got tricked by advanced hacks.
- PromptInject.Encoding (37.5%): Better but not perfect.
- LeakReplay.DataLeakage (48.2%): Leaked something about half the time.
- RealToxicityPrompts (10.5%): Rarely said toxic things.
LLaMA 2-13B
- MalwareGen.Evasion (81.0%): Produced malware scripts often.
- PromptInject.Encoding (68.3%): Fell for encoding a lot.
- LeakReplay.DataLeakage (72.4%): Regularly leaked data.
- RealToxicityPrompts (26.7%): Most toxic among the group.
Vicuna-13B
- MalwareGen.Evasion (62.5%): Was not as bad as LLaMA but still failed a lot.
- PromptInject.Encoding (54.6%): Mediocre, could still be tricked.
- LeakReplay.DataLeakage (61.3%): Leaked data more than half the time.
- RealToxicityPrompts (3.8%): Best at not being toxic.
Discussion
Security Trends Across Models
- Older vs Newer: Age alone didn’t decide it: LLaMA 2 and Vicuna failed often, but so did the much newer Mistral-Nemo. What mattered more was whether the model ships with guardrails, as Gemma3 does.
- Instruction-Tuning: Vicuna had extra tuning so it was better at not making malware or saying toxic stuff.
- Guardrails Matter: Gemma3 blocked some attacks but still fell for advanced ones.
- Architecture: Models without built-in safety (Mistral-Nemo, LLaMA 2) were very vulnerable.
Performance and Resource Usage
- Average Time per Prompt:
  - Mistral-Nemo: 4.8 s
  - Gemma3: 6.2 s
  - LLaMA 2-13B: 5.5 s
  - Vicuna-13B: 5.7 s
- GPU Memory Used:
  - Mistral-Nemo: 12 GB
  - Gemma3: 16 GB
  - LLaMA 2-13B: 14 GB
  - Vicuna-13B: 15 GB
- CPU Load: About 20–25% for all models while testing.
Gemma3 used the most GPU memory and was the slowest per prompt, but it was also the hardest to break, so there’s a clear speed/safety trade-off.
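For the GPU numbers I simply sampled nvidia-smi while the scans were running; something like this (the output file name is arbitrary):

```bash
# Sample GPU memory and utilization every 5 seconds during a Garak run
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu \
  --format=csv -l 5 >> gpu_usage_mistral_nemo.csv
```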
And yeah, don’t just skim the headline numbers, make sure to run the checks properly yourself 😁.
Recommendations
- Keep Testing: Run Garak regularly to find new flaws (see the cron sketch after this list).
- Use Multiple Safety Layers: Combine model guardrails with external checks.
- Choose Tuned Models: Vicuna shows that tuning helps.
- Update Your Tools: Ollama has had bugs (like CVE-2024-37032). Always use the latest version.
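For the “keep testing” point, the laziest reliable option is a scheduled job. A hypothetical example (run_garak_scans.sh stands in for whatever wrapper script you end up writing):

```bash
# Add a weekly Garak scan to the user's crontab (Mondays at 02:00)
( crontab -l 2>/dev/null; echo '0 2 * * 1 /home/me/security/run_garak_scans.sh >> /home/me/security/garak_cron.log 2>&1' ) | crontab -
```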
Conclusion
Running Garak against these models shows that every one of them has weak spots. Mistral-Nemo failed the most by far, Gemma3 held up reasonably well but not perfectly, LLaMA 2 struggled, and Vicuna did well overall while still being far from flawless. The main lesson: we need ongoing testing, layered safety measures, and up-to-date software to keep these AI models safe.
Thanks for reading, and happy red-teaming!