Description
Name and Version
llama-server --version
version: 5554 (3600cc2)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
The same behavior occurs on every later build, up to and including b5604.
Operating systems
Mac
GGML backends
Metal
Hardware
M4 Max, M2 Max, M1 Max
Models
Gemma models:
the fastest to test is the [1B gemma-3 Q4_K_M](https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF)
./llama-server -hf ggml-org/gemma-3-1b-it-GGUF:Q4_K_M -ngl 200 -c 4096
Problem description & steps to reproduce
With b5552, the KV cache works as expected:
the 2nd query processes only 1 prompt token (vs. 250 for the first query)
cd ../../build-b5552/bin
./llama-server -m $LLAMA_CACHE/ggml-org_gemma-3-1b-it-GGUF_gemma-3-1b-it-Q4_K_M.gguf -ngl 200 -c 4096
python test_kv_cache.py
Sending request 1...
response 1: Blacksmith.
prompt_n: 250
system_fingerprint: b5552-3f55f781
duration = 0.136
Sending request 2...
response 2: Blacksmith.
prompt_n: 1
system_fingerprint: b5552-3f55f781
duration = 0.035
With b5554, the KV cache does not work as expected:
the 2nd query also processes all 250 prompt tokens (the same as the first query) and is much slower than it should be (0.101 s vs. 0.035 s on b5552)
cd ../../build-b5554/bin
./llama-server -m $LLAMA_CACHE/ggml-org_gemma-3-1b-it-GGUF_gemma-3-1b-it-Q4_K_M.gguf -ngl 200 -c 4096
python test_kv_cache.py
Sending request 1...
response 1: Blacksmith.
prompt_n: 250
system_fingerprint: b5554-3600cc28
duration = 0.150
Sending request 2...
response 2: Blacksmith.
prompt_n: 250
system_fingerprint: b5554-3600cc28
duration = 0.101
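To make the difference between the two builds easy to check automatically, here is a minimal sketch that classifies a pair of prompt_n values. The function name `prompt_cache_reused` and the 10% threshold are my own assumptions for illustration, not part of llama.cpp; the numbers are the ones from the two runs above.

```python
def prompt_cache_reused(first_prompt_n: int, second_prompt_n: int,
                        max_ratio: float = 0.1) -> bool:
    """Return True if the second identical request processed only a small
    fraction of the prompt tokens processed by the first request.

    max_ratio is an arbitrary threshold chosen for illustration.
    """
    if first_prompt_n <= 0:
        raise ValueError("first_prompt_n must be positive")
    return second_prompt_n <= first_prompt_n * max_ratio


# (first prompt_n, second prompt_n) as reported in the runs above
runs = {"b5552": (250, 1), "b5554": (250, 250)}

for build, (first, second) in runs.items():
    if prompt_cache_reused(first, second):
        status = "cache reused"
    else:
        status = "prompt fully reprocessed"
    print(f"{build}: {status}")
```

With the values above this reports b5552 as "cache reused" and b5554 as "prompt fully reprocessed".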
First Bad Commit
b5554
Relevant log output
#test_kv_cache.py
import requests
import json
import time
# Configuration
url = "http://localhost:8080/v1/chat/completions"
headers = {"Content-Type": "application/json"}
# Prompt (approx. 340 tokens, with fixed-answer question)
prompt = """
In the fantasy novel *The Shattered Crown*, set in the war-torn kingdom of Eryndor, a young blacksmith discovers a hidden prophecy etched on a mysterious amulet buried beneath his forge. The prophecy foretells the rise of a shadowed heir who will either unite the fractured realms or plunge them into eternal darkness. This blacksmith, unaware of his own lineage, embarks on a quest to find the Crown of Ages, a relic said to hold the power to restore balance. Joined by Lira, a rogue sorceress with a shadowed past, and Torren, a grizzled knight banished for treason, he faces trials including the labyrinthine Caves of Sorrow, where spectral guardians test his resolve, and the Court of Whispers, a den of political intrigue where allies betray for power. The blacksmith uncovers clues about his heritage, learning the prophecy may point to him or his ruthless brother, Varn, who leads a fanatical cult bent on conquest. The novel weaves themes of destiny, loyalty, and sacrifice, set against Eryndor’s misty vales, towering spires, and haunted forests. Who is the main character of *The Shattered Crown*?
"""
# Payload for chat completion
payload = {
    "model": "gemma-3-4b-it",
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": prompt}
    ],
    "max_tokens": 10,
    "temperature": 0.0  # Set to 0.0 for deterministic output
}
print()
# Send the same request twice; the second one should hit the KV cache
for i in range(1, 3):
    start = time.time()
    print(f"Sending request {i}...")
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    if response.status_code == 200:
        data = response.json()  # decode the body once
        print(f"response {i}:", data["choices"][0]["message"]["content"])
        # prompt_n: number of prompt tokens processed for this request
        print("prompt_n:", data["timings"]["prompt_n"])
        print("system_fingerprint:", data["system_fingerprint"])
        # print("First response:", response.text)
    else:
        print("Error:", response.status_code, response.text)
    duration = time.time() - start
    print(f"{duration = :.3f}")
    print()
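When switching between several build directories it is easy to test the wrong server, so it can help to check the build number from system_fingerprint. A small sketch follows; it assumes the fingerprint always has the form `b<build>-<commit>` seen in the output above, and `parse_fingerprint` is a hypothetical helper name, not something provided by llama.cpp.

```python
import re


def parse_fingerprint(fingerprint: str) -> tuple[int, str]:
    """Split a fingerprint such as 'b5554-3600cc28' into (build, commit).

    Assumes the 'b<build>-<hex commit>' layout seen in the logs above.
    """
    match = re.fullmatch(r"b(\d+)-([0-9a-f]+)", fingerprint)
    if match is None:
        raise ValueError(f"unexpected fingerprint format: {fingerprint!r}")
    return int(match.group(1)), match.group(2)


build, commit = parse_fingerprint("b5554-3600cc28")
print(f"build {build}, commit {commit}")
```

In the test script, the value of `response.json()["system_fingerprint"]` could be passed to this helper and compared against the build that is expected to be running.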