
Feature Request: Exclude thinking tokens from server cache for reasoning models #14379

@firecoperana

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Remove thinking tokens from the server cache when the web UI also excludes them from the prompt it sends.

Motivation

The web UI has an option to strip thinking tokens before sending the conversation to the server, but the server still keeps the thinking tokens in its cache until the user sends a new prompt without them. This works, but it forces the server to reprocess the last response and also wastes context size.

Possible Implementation

First, detect which cached tokens belong to the thinking section. Then remove those tokens from both the server's cached token list and the KV cache in llama-server, as sketched below.
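A minimal sketch of the removal step, assuming the `llama_kv_cache_seq_rm` / `llama_kv_cache_seq_add` C API (these function names have changed across llama.cpp versions) and a hypothetical `remove_cached_range` helper; the `[begin, end)` span would come from matching the model's thinking markers (e.g. `<think>` ... `</think>`) against the cached tokens:

```cpp
// Hypothetical sketch, not actual llama-server code.
#include <vector>
#include "llama.h"

// Remove the token range [begin, end) of a slot's cached prompt from both
// the server-side token list and the KV cache, then shift the remaining
// cells left so the sequence positions stay contiguous.
static void remove_cached_range(llama_context * ctx,
                                llama_seq_id    seq_id,
                                std::vector<llama_token> & cache_tokens,
                                size_t begin, size_t end) {
    if (begin >= end || end > cache_tokens.size()) {
        return;
    }

    const llama_pos p0 = (llama_pos) begin;
    const llama_pos p1 = (llama_pos) end;

    // drop the KV cells holding the thinking tokens
    llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);

    // shift all cells after the removed span left by (p1 - p0);
    // p1 == -1 means "until the end of the sequence"
    llama_kv_cache_seq_add(ctx, seq_id, p1, -1, p0 - p1);

    // keep the server's mirror of the cached tokens in sync with the KV cache
    cache_tokens.erase(cache_tokens.begin() + begin, cache_tokens.begin() + end);
}
```

Shifting the remaining cells left is the same kind of position bookkeeping the server's existing context-shift path performs, so that mechanism could probably be reused rather than reimplemented.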
