Eval bug: KV cache stopped working in b5554 version #14071

Open
@lefromage

Name and Version

llama-server --version
version: 5554 (3600cc2)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0

The same applies to every subsequent version up to and including b5604.

Operating systems

Mac

GGML backends

Metal

Hardware

M4 Max, M2 Max, M1 Max

Models

Gemma models.
The fastest to test is the [1B gemma-3 Q4_K_M](https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF):

./llama-server -hf ggml-org/gemma-3-1b-it-GGUF:Q4_K_M -ngl 200 -c 4096

Problem description & steps to reproduce

With b5552, the KV cache works as expected:
the 2nd query processes only 1 prompt token (vs. 250 for the first query).

cd ../../build-b5552/bin
./llama-server -m $LLAMA_CACHE/ggml-org_gemma-3-1b-it-GGUF_gemma-3-1b-it-Q4_K_M.gguf -ngl 200 -c 4096

python test_kv_cache.py

Sending request 1...
response 1: Blacksmith.
prompt_n: 250
system_fingerprint: b5552-3f55f781
duration = 0.136

Sending request 2...
response 2: Blacksmith.
prompt_n: 1
system_fingerprint: b5552-3f55f781
duration = 0.035

With b5554, the KV cache does not work as expected:
the 2nd query also processes 250 prompt tokens (the same as the first query) and is much slower than it should be (0.101 s vs. 0.035 s with b5552).

cd ../../build-b5554/bin
./llama-server -m $LLAMA_CACHE/ggml-org_gemma-3-1b-it-GGUF_gemma-3-1b-it-Q4_K_M.gguf -ngl 200 -c 4096

Sending request 1...
response 1: Blacksmith.
prompt_n: 250
system_fingerprint: b5554-3600cc28
duration = 0.150

Sending request 2...
response 2: Blacksmith.
prompt_n: 250
system_fingerprint: b5554-3600cc28
duration = 0.101

First Bad Commit

b5554

Relevant log output

# test_kv_cache.py

import requests
import json
import time

# Configuration
url = "http://localhost:8080/v1/chat/completions"
headers = {"Content-Type": "application/json"}

# Prompt with a fixed-answer question (the server reports prompt_n: 250 for the full chat)
prompt = """
In the fantasy novel *The Shattered Crown*, set in the war-torn kingdom of Eryndor, a young blacksmith discovers a hidden prophecy etched on a mysterious amulet buried beneath his forge. The prophecy foretells the rise of a shadowed heir who will either unite the fractured realms or plunge them into eternal darkness. This blacksmith, unaware of his own lineage, embarks on a quest to find the Crown of Ages, a relic said to hold the power to restore balance. Joined by Lira, a rogue sorceress with a shadowed past, and Torren, a grizzled knight banished for treason, he faces trials including the labyrinthine Caves of Sorrow, where spectral guardians test his resolve, and the Court of Whispers, a den of political intrigue where allies betray for power. The blacksmith uncovers clues about his heritage, learning the prophecy may point to him or his ruthless brother, Varn, who leads a fanatical cult bent on conquest. The novel weaves themes of destiny, loyalty, and sacrifice, set against Eryndor’s misty vales, towering spires, and haunted forests. Who is the main character of *The Shattered Crown*?
"""

# Payload for chat completion
payload = {
    "model": "gemma-3-4b-it",
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": prompt}
    ],
    "max_tokens": 10,
    "temperature": 0.0  # Set to 0.0 for deterministic output
}

print()
# Send the identical request twice; with a working KV cache the second
# request should report a much smaller prompt_n than the first.
for i in range(1, 3):
    start = time.time()
    print(f"Sending request {i}...")
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    if response.status_code == 200:
        body = response.json()  # parse the response body once
        print(f"response {i}:", body["choices"][0]["message"]["content"])
        # prompt_n: number of prompt tokens the server processed for this request
        print("prompt_n:", body["timings"]["prompt_n"])
        print("system_fingerprint:", body["system_fingerprint"])
    else:
        print("Error:", response.status_code, response.text)
    duration = time.time() - start
    print(f"{duration = :.3f}")
    print()
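
The pass/fail criterion implied by the two runs above could be made explicit with a small helper. This is only a sketch: the function name `cache_reused` and the 10% threshold `max_ratio` are assumptions for illustration, not part of llama.cpp or of the original script. It takes the two `prompt_n` values printed by the script and reports whether the second request reused the cached prompt.

```python
def cache_reused(first_prompt_n: int, second_prompt_n: int, max_ratio: float = 0.1) -> bool:
    """Return True if the second request processed far fewer prompt tokens than the first.

    max_ratio is a hypothetical threshold: the second prompt_n must be at most
    this fraction of the first for the prompt cache to count as reused.
    """
    if first_prompt_n <= 0:
        raise ValueError("first_prompt_n must be positive")
    return second_prompt_n <= first_prompt_n * max_ratio


# prompt_n values taken from the two runs reported in this issue
print(cache_reused(250, 1))    # b5552 run
print(cache_reused(250, 250))  # b5554 run
```

With the values logged above, the b5552 run satisfies the check and the b5554 run does not, which could make the script usable as a quick regression test when comparing builds.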
