Releases: LostRuins/koboldcpp
koboldcpp-1.94.2
are we comfy yet?
- NEW: Added unpacked mini-launcher: Now when unpacking KoboldCpp to a directory, a 5MB mini pyinstaller launcher is also generated in that same directory, which lets you easily start an unpacked KoboldCpp without needing to install Python or other dependencies. You can copy the unpacked directory and use it anywhere (thanks @henk717)
- NEW: Chroma Image Generation Support: Merged support for the Chroma model, a new architecture based on Flux Schnell (thanks @stduhpf)
- NEW: Added PhotoMaker Face Cloning: Use `--sdphotomaker` to load PhotoMaker along with any SDXL-based model. Then open the KoboldCpp SDUI and upload any reference image in the PhotoMaker input to clone the face! Works in all modes (inpaint/img2img/text2img).
- Swapping .gguf models in admin mode now allows overriding the config with a different one as well (both are customizable).
- Improved GBNF grammar performance by attempting a culled grammar search first (thanks @Reithan)
- Allow changing the main GPU with `--maingpu` when loading multi-GPU setups. The main GPU uses more VRAM and has a larger performance impact. By default it is the first GPU.
- Added configurable soft resolution limits and VAE tiling limits (thanks @wbruna), also fixed VAE tiling artifacts.
- Added `--sdclampedsoft`, which provides "soft" total-resolution clamping instead (e.g. 640 would allow 640x640, 512x768 and 768x512 images). It can be combined with `--sdclamped`, which provides hard clamping (no dimension can exceed it).
- Added `--sdtiledvae`, which replaces `--sdnotile`: allows specifying a size beyond which VAE tiling is applied.
- Added `--embeddingsmaxctx`: use this to limit the max context length for embedding models (if you run out of memory, this will help).
- Added `--embeddingsgpu` to allow offloading embedding model layers to GPU. This is NOT recommended as it doesn't provide much speedup, since embedding models already use the GPU for processing even without dedicated offload.
- Display available RAM on startup, and display the version number in the terminal window title
- ComfyUI emulation now covers the `/upload/image` endpoint, which allows Img2Img ComfyUI workflows. Files are stored temporarily in memory only (see the upload sketch after this list).
- Added more performance stats for token speeds and timings.
- Updated Kobold Lite, multiple fixes and improvements
- Fixed Chub.ai importer again
- Added card importer for char-archive.evulid.cc
- Added option to import image from webcam
- Allow markdown when streaming current turn
- Improved CSS import sanitizer (thanks @PeterPeet)
- Word Frequency Search (inspired by @trincadev's MyGhostWriter)
- Allow usermods and CSS to be loaded from file.
- Added WebSearch for corpo mode
- Added Img2Img support for ComfyUI backends
- Added ability to use custom OpenAI endpoint for TextDB embedding model
- Minor linting and splitter/merge tool by @ehoogeveen-medweb
- Fixed lookahead scanning for Author's note insertion point
- Merged new model support, fixes and improvements from upstream
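For reference, here is a minimal sketch of how a client could hit the ComfyUI-compatible `/upload/image` endpoint mentioned above. It assumes a local instance on the default port and the standard ComfyUI multipart field name (`image`); the exact response shape may differ, so treat it as illustrative rather than authoritative.

```python
# Minimal sketch: upload a reference image to the ComfyUI-compatible
# /upload/image endpoint (files are held temporarily in memory only).
# Assumes the default local port 5001 and ComfyUI's usual "image" form field.
import requests

with open("reference.png", "rb") as f:
    resp = requests.post(
        "http://localhost:5001/upload/image",
        files={"image": ("reference.png", f, "image/png")},
    )

resp.raise_for_status()
print(resp.json())  # ComfyUI-style response describing the stored image
```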
Hotfix 1.94.1 - Minor bugfixes, fixed Ollama-compatible vision, added AVX/AVX2 detection for backend auto-selection, cleaned up oldpc builds to only include oldpc files.
Hotfix 1.94.2 - Fixed an issue with SWA models when context is full; attempted a fix for a Vulkan OOM regression.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Deprecation Reminder: Binary filenames have been renamed. The files named `koboldcpp_cu12.exe`, `koboldcpp_oldcpu.exe`, `koboldcpp_nocuda.exe`, `koboldcpp-linux-x64-cuda1210`, and `koboldcpp-linux-x64-cuda1150` have been removed. Please switch to the new filenames.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
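If you would rather script against the server than use a browser, below is a minimal sketch using the KoboldAI-compatible text generation endpoint. It assumes the default port and an already loaded model; field names follow the standard KoboldAI API.

```python
# Minimal sketch: send a prompt to a locally running KoboldCpp instance
# via the KoboldAI-compatible /api/v1/generate endpoint (default port 5001).
import requests

payload = {
    "prompt": "Write a haiku about llamas.",
    "max_length": 80,      # number of tokens to generate
    "temperature": 0.7,
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```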
koboldcpp-1.93.2
those left behind
- NEW: Added Windows Shell integration. You can now associate `.gguf` files to open automatically in KoboldCpp (e.g. double clicking a gguf). If another KoboldCpp instance is already running locally on the same port, it will be replaced. The default handler can be installed/uninstalled from the 'Extras' tab (thanks @henk717)
  - This is handled by the `/api/extra/shutdown` API, which can only be triggered from localhost (see the sketch after this list).
  - Will not affect instances started without the `--singleinstance` flag. All this is automatic when you launch via Windows shell integration.
- NEW: Added an option to simply unload a model from the admin API, the server will free the memory but continue to run. You can then switch to a different model via the admin panel in Lite.
- NEW: Added Save and Load States (sessions). This allows you to take a Savestate Snapshot of the current context, and then reload it again later at any time. It is available over the admin API, and you can trigger it from the admin panel in Lite.
- Works similarly to 'session files' in llama.cpp, but the snapshot states are stored entirely in memory.
- Used correctly, it can allow you to swap between multiple different sessions/chats without any reprocessing at all.
- There are 3 available slots to use (total 4 including the current session).
- Fixed a regression with flash attention not working for some GPUs in the previous version.
- Added a text LoRA scale option. Removed text LoRA base as it was no longer used in modern ggufs. If provided it will be silently ignored.
- Function/Tool calling can now use higher temperatures (up to 1.0)
- Added more Ollama compatibility endpoints.
- Fixed a few clip skip issues in image generation.
- Added an adapter flag `add_sd_step_limit` to limit max image generation step counts.
- Fixed crash on thread count 0.
- Match a few common OpenAI TTS voice IDs
- Fixed a context bug with embeddings (still does not work with Qwen3 Embedding, but should work with most others)
- KoboldCpp Colab now uses KoboldCpp's internal downloader instead of downloading the models first externally.
- Updated Kobold Lite, multiple fixes and improvements
- Added support for embeddings models into KoboldAI Lite's TextDB (thanks @esolithe)
- Added support for saving and loading world info files independently (thanks @esolithe)
- NEW: Added new "Smart" Image Autogeneration mode. This allows the AI to decide when it should generate images, and to create image prompts automatically.
- Added a new scenario: Replaced defunct aetherroom.club with prompts.forthisfeel.club
- Added support for importing cards from character-tavern.com
- Improved Tavern World Info support
- Added support for welcome messages in corpo mode.
- Fixed copy to clipboard not working for some browsers.
- Interactive Storywriter scenario fix: now no longer overwrites your regex settings. However, hiding input text is now off by default.
- Added a toggle to make a usermod permanent. Use with caution.
- Markdown fixes, also prevent your username from being overwritten when changing chat scenario.
- Merged fixes and improvements from upstream
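As a rough illustration of the localhost-only shutdown hook described above, the sketch below assumes the endpoint accepts a plain POST with no body; this is how an external script could free the port before starting a replacement instance.

```python
# Rough sketch: ask a running local KoboldCpp instance (started with
# --singleinstance) to shut down so a new instance can take over its port.
# The /api/extra/shutdown endpoint only accepts requests from localhost.
import requests

try:
    requests.post("http://localhost:5001/api/extra/shutdown", timeout=5)
except requests.exceptions.RequestException:
    pass  # nothing was listening on that port, or it already exited
```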
Hotfix 1.93.1 - Fixed a crash due to outdated VC runtime dlls, fixed a bad adapter, added base64 embeddings support, added webcam upload support for KoboldAI Lite Add Image, fixed chubai importer, added more options for idle response trigger times.
Hotfix 1.93.2 - Reverted back to VS2019 + CUDA 12.1 for the Windows build to solve reports of crashes. Fixed issues with the embeddings endpoint. Added the `--embeddingsmaxctx` option.
Important Breaking Changes (File Naming Change Notice):
- For improved clarity and ease of use, many binaries are being RENAMED.
- Please observe the new name changes for your automated scripts to avoid disruption:
- Linux:
  - `koboldcpp-linux-x64-cuda1210` is now `koboldcpp-linux-x64` (CUDA 12, AVX2, newer PCs)
  - `koboldcpp-linux-x64-cuda1150` is now `koboldcpp-linux-x64-oldpc` (CUDA 11, AVX1, older PCs)
  - `koboldcpp-linux-x64-nocuda` is still `koboldcpp-linux-x64-nocuda` (no CUDA)
- Windows:
  - `koboldcpp_cu12.exe` is now `koboldcpp.exe` (CUDA 12, AVX2, newer PCs)
  - `koboldcpp_oldcpu.exe` is now `koboldcpp-oldpc.exe` (CUDA 11, AVX1, older PCs)
  - `koboldcpp_nocuda.exe` is now `koboldcpp-nocuda.exe` (no CUDA)
- If you are using our official URLs or docker images, this should be handled automatically, but ensure your docker image is up-to-date.
- If you are using platforms that do not support the main build, you can continue using the `oldpc` builds, which remain on CUDA 11 and AVX1 and will continue to be maintained. The CUDA 12+ version on the main build may be subject to change in the future.
- For now, both filenames are uploaded to avoid breaking existing scripts. The old filenames will be removed soon, so please update.
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try oldpc version instead (Cuda11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Deprecation Warning: The files named `koboldcpp_cu12.exe`, `koboldcpp_oldcpu.exe`, `koboldcpp_nocuda.exe`, `koboldcpp-linux-x64-cuda1210`, and `koboldcpp-linux-x64-cuda1150` will be removed very soon. Please switch to the new filenames.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.92.1
early bug is for the birds edition
- Added support for SWA mode, which uses much less memory for the KV cache; use `--useswa` to enable.
  - Note: SWA mode is not compatible with ContextShifting, and may result in degraded output when used with FastForwarding.
- Fixed an off-by-one error in some cases when Fast Forwarding that resulted in degraded output.
- Greatly improved tool calling by enforcing grammar on the output field names, and doing the automatic tool selection as a separate pass. Tool calling should be much more reliable now.
- Added model size information to the Hugging Face search and download menu
- CLI terminal output is now truncated in the middle of very long strings instead of at the end.
- Fixed unicode path handling for Image Generation models.
- Enabled threadpools, this should result in a speedup for Qwen3MoE.
- Merged Vision support for Llama4 models, simplified some vision preprocessing code.
- Fixes for prompt formatting for GLM4 models. GLM4 batch processing on Vulkan is fixed (thanks @0cc4m).
- Fixed incorrect AutoGuess adapter for some Mistral models. Also fixed some KoboldCppAutomatic placeholder tag replacements.
- AI Horde default advertised context now matches main max context by default. This can be changed.
- Disable `--showgui` if `--skiplauncher` is used
- StableUI now increments clip_skip and seed correctly when generating multiple images in a batch (thanks @wbruna)
- clip_skip is now stored inside image metadata, and random seed's actual number is also indicated.
- Added DDIM sampler for image generation.
- Added a simple optional Python requirements install script in `launch.cmd` for launching when run from unpacked directories.
- Updated Kobold Lite, multiple fixes and improvements
- Integrated dPaste.org (open source pastebin) as a platform for quickly sharing Save Files. You can also use a self-hosted instance by changing the endpoint URL. You can now share stories as a single URL via Save/Load > Share > Export Share as Web URL
- Added an option to allow Horizontal Stacking of multiple images in one row.
- Fixed importing of Chub.AI character cards as they changed their endpoint.
- Added support for RisuAI V3 character cards (.charx archive format), also fixed KAISTORY handling.
- SSE streaming is now the default for all cases. It can be disabled in Advanced Settings.
- Changed markdown renderer to render markdown separately for each instruct turn.
- Better passthrough for KoboldCppAutomatic instruct preset, especially with split tags.
- Added an option to use TTS from Pollinations API, which routes through OpenAI TTS models. Note that this TTS service has a server-side censorship via a content filter that I cannot control.
- Lite now sends stop sequences in OpenAI Chat Completions Endpoint mode (up to 4)
- Added ST-based randomizer macros like `{{roll:3d6}}` (thanks @hu-yijie)
- Added new Immortal sampler preset by Jeb Carter
- In polled streaming mode, you can fetch last generated text if the request fails halfway.
- Added an exit button when editing raw text in corpo mode.
- Re-enabled a debug option for using raw placeholder tags on request. Not recommended.
- Added a debug option that allows changing the connected API at runtime.
- Merged fixes and improvements from upstream
Hotfix 1.92.1 - Fixes for a GLM4 vulkan bug, allow extra EOG tokens to trigger a stop.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3 etc) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
kcpp_tools_rolling
This release contains the latest KoboldCpp tools used to convert and quantize models. Alternatively, you can also use the tools released by the llama.cpp project; they should be cross-compatible. The binaries here will be periodically updated.
koboldcpp-1.91
Entering search mode edition
- NEW: Huggingface Model Search Tool - Grabbing a model has never been easier! KoboldCpp now comes with a HF model browser so you can search and find the GGUF models you like directly from huggingface. Simply search for and select the model, and it will be downloaded before launch.
- Embedded aria2c downloader for windows builds - this provides extremely fast downloads and is automatically used when downloading models via provided URLs.
- Added CUDA target for compute capability 3.5. This may allow KoboldCpp to be used with K6000, GTX 780, K80. I have received some success stories - if you do try, share your experiences on the discussions page!
- Reduced CUDA binary sizes by switching most cuda cc targets to virtual, thanks to a good suggestion from Johannes at ggml-org#13135
- Improved ComfyUI emulation; it can now adapt to any kind of workflow so long as there is a KSampler node connected to a text prompt somewhere in it (see the workflow sketch after this list).
- Fixed GLM-4 prompt handling even for quants with incorrect BOS set.
- Added support for Classifier-Free Guidance (CFG) since I wanted to mess with it. At long last I have finally added CFG, but I don't really like it - results are not great. Anyway, if you wish to use it, simply check Enable Guidance or use `--enableguidance`, then set a negative prompt and CFG scale from the Lite tokens menu. Note that guidance doubles KV usage and halves generation speed. Overall, it was a disappointing addition and not really worth the effort.
- StableUI now clears the queue when cancelling a generation
- Further fixes for Zenity/YAD in multilingual environments
- Removed flash attention limits and warnings for Vulkan
- Updated Kobold Lite, multiple fixes and improvements
- Important Change: KoboldCppAuto is now the default instruct preset. This will let the KoboldCpp backend automatically choose the correct instruct tags to use at runtime, based on the model loaded. This is done transparently in the backend and not visible to the user. If it doesn't work properly, you can always still switch to your preferred instruct format (e.g. Alpaca).
- NEW: Corpo mode now supports Text mode and Adventure mode as well, making it usable in all 4 modes.
- Added quick save and delete buttons for corpo mode.
- Added Pollinations.ai as an option for TTS and Image Gen (optional online service)
- Instruct placeholders are now always used (but you can change what they map to, including themselves)
- Added confirmation box for loading from slots
- Improved think tag handling and output formatting.
- Added a new scenario: Nemesis
- Chat match any name is no longer on by default
- Fixed autoscroll jumping on edit in corpo mode
- Fix char spec v2 embedded WI import by @Cohee1207
- Merged fixes and improvements from upstream
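To illustrate the "any workflow with a KSampler connected to a text prompt" rule above, here is a hypothetical minimal ComfyUI-style prompt graph posted to the emulated `/prompt` endpoint. The node IDs, input wiring and response shape are illustrative assumptions, not the documented contract.

```python
# Sketch: a tiny ComfyUI-style workflow graph with a KSampler node whose
# "positive" input links to a CLIPTextEncode text prompt. Node IDs and the
# exact input set are illustrative; the emulation only needs to find the
# KSampler and trace it back to a text prompt.
import requests

workflow = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "seed": 42,
            "steps": 20,
            "cfg": 7.0,
            "positive": ["6", 0],  # link to the positive prompt node below
            "negative": ["7", 0],
        },
    },
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "a cozy cabin in the snow"}},
    "7": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}

resp = requests.post("http://localhost:5001/prompt", json={"prompt": workflow})
resp.raise_for_status()
print(resp.json())
```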
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.90.2
Qwen of the line edition
- NEW: Android Termux Auto-Installer - You can now set up KoboldCpp via Termux on Android with a single command, which triggers an automated installation script. Check it out here. Install Termux from F-Droid, then run the command with internet access, and everything will be set up, downloaded, compiled and configured for instant use with a Gemma3-1B model.
- Merged support for Qwen3. Now also triggers `--nobostoken` automatically if the model metadata explicitly indicates no_bos_token; it can still be enabled manually for other models.
- Fixes for THUDM GLM-4; note that this model enforces `--blasbatchsize 16` or smaller in order to get coherent output.
- Merged overhaul to the Qwen2.5VL projector. Both old (HimariO version) and new (ngxson version) mmprojs should work, retaining backwards compatibility. However, you should update to the new projectors.
- Merged functioning Pixtral support. Note that Pixtral is very token heavy, about 4000 tokens for a 1024px image; you can try increasing max `--contextsize` or lowering `--visionmaxres`.
- Added support for OpenAI Structured Outputs in the chat completions API; it also accepts the schema when sent as a stringified JSON object in the "grammar" field. You can use this to enforce JSON outputs with a specific schema (see the request sketch after this list).
- `--blasbatchsize -1` now exclusively uses a batch size of 1 when processing the prompt. Also permitted `--blasbatchsize 16`, which replicates the old behavior (a batch of 16 does not trigger GEMM).
- KCPP API server now correctly handles explicitly set nulled fields.
- Fixed Zenity/YAD detection not working correctly in the previous version.
- Improved input sanitization when launching and passing a URL as a model param. Also, for better security, `--onready` shell commands can still be used as a CLI parameter, but cannot be embedded into a .kcppt or .kcpps file.
- More robust checks for system glslc when building Vulkan shaders.
- Improved auto GPU layers when loading multi-part GGUF models (on 1 GPU), also slightly tightened memory estimation, and accounts for quantized KV when guessing layers.
- Added new flag `--mmprojcpu` that allows you to load and run the projector on CPU while keeping the main model on GPU.
- noscript mode randomizes generated image names to prevent browser caching.
- Updated Kobold Lite, multiple fixes and improvements
- Increased default tokens generated and slider limits (can be overridden)
- ChatGLM-4 and Qwen3 (chatml think/nothinking) presets added. You can disable thinking in Qwen3 by swapping between ChatML (No Thinking) and normal ChatML.
- Added toggle to disable LaTeX while leaving markdown enabled
- Merged fixes and improvements from upstream
- Hotfix 1.90.1:
- Reworked thinking tag handling. ChatML (No Thinking) is removed; instead, thinking can be forced or prevented for all instruct formats (Settings > Tokens > CoT).
- More GLM4 fixes, now works fine with larger batches on CUDA, on vulkan glm4 ubatch size is still limited to 16.
- Some chat completions parsing fixes.
- Updated Lite with a new scenario
- Hotfix 1.90.2:
- Pulled further upstream updates. Massive file size increase caused by ggml-org#13199, I can't do anything about it. Don't ask me.
- NEW: Added a Hugging Face model search tool! Now you can find, browse and download models straight from Hugging Face.
- Increased the `--defaultgenamount` range
- Attempted a fix for the YAD GUI launcher
- Added rudimentary websocket spoof for ComfyUI, increased comfyui compatibility.
- Fixed a few parsing issues for nulled chat completions params
- Automatically handle multipart file downloading, up to 9 parts.
- Fixed rope config not saving correctly to kcpps sometimes
- Merged fixes for Plamo models, thanks to @CISC
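A sketch of what an OpenAI-style Structured Outputs request could look like against the chat completions endpoint. The `response_format`/`json_schema` layout follows OpenAI's convention and the schema itself is made up for illustration; per the notes above, the schema can alternatively be sent as a stringified JSON object in the "grammar" field.

```python
# Sketch: enforce a JSON schema on the reply via OpenAI-style Structured
# Outputs on the chat completions endpoint. The schema is illustrative.
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "model": "kcpp",  # placeholder; the loaded model is used regardless
    "messages": [{"role": "user", "content": "Invent a character and reply as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "character", "schema": schema},
    },
}

resp = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```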
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.89
retro.mp4
- NEW: Improved NoScript mode - NoScript mode now has chat mode and image generation support entirely without JavaScript! Access it by default at http://localhost:5001/noscript in your browser. Tested to work on Internet Explorer 5, Netscape Navigator 4, NetSurf, Lynx, Dillo, and basically any browser made after 1999.
- Added new launcher flags `--overridekv` and `--overridetensors`, which work in the same way as llama.cpp's flags. `--overridekv` allows you to specify a single metadata property to be overwritten; the input format is `keyname=type:value`. `--overridetensors` allows you to place tensors matching a pattern onto a specific backend; the input format is `tensornamepattern=buffertype`.
- Enabled @jeffbolznv's coopmat2 support for Vulkan (supports flash attention, overall slightly faster). CM2 is only enabled if you have the latest Nvidia Game Ready Driver (576.02) and should provide all-round speedups. Note that the (OldCPU) Vulkan binaries will now exclude coopmat, coopmat2 and DP4A, so please use OldCPU mode if you encounter issues.
- Display available GPU memory when estimating layers
- Fixed RWKV model loading
- Added more sanity checks for Zenity, made YAD the default filepicker instead. If you still encounter issues, please select Legacy TK filepicker in the extras page, and report the issue.
- Minor fixes for stable UI inpainting brush selection.
- Enabled usage of Image Generation LoRAs even with a quantized diffusion model (the LoRA should still be unquantized)
- Fixed a crash when using certain image LoRAs due to graph size limits. Also reverted CLIP quant to f32 changes.
- CLI mode fixes
- Updated Kobold Lite, multiple fixes and improvements
- IMPORTANT: Relocated Tokens Tab and WebSearch Tab into Settings Panel (from context panel). Likewise, the regex and token sequence configs are now stored in settings rather than story (and will persist even on a new story).
- Fixed URLs not opening on new tab
- Reworked thinking tag handling - now separates display and submit regex behaviors (3 modes each)
- Added Retain History toggle for WebSearch to retain some old search results on subsequent queries.
- Added an editable template for the character creator (by @PeterPeet)
- Increased to 10 local and 10 remote save slots.
- Removed aetherroom club (dead site)
- Merged fixes and improvements from upstream
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.88
- NEW: Added Image Inpainting support to StableUI, and merged inpainting support from stable-diffusion.cpp (by @stduhpf)
- You can use the built-in StableUI to mask out areas to inpaint when editing with Img2Img (Similar to A1111). API docs for this are updated.
- Added slider for setting clip-skip in StableUI.
- Other improvements from stable-diffusion.cpp are also merged.
- Added Zenity and YAD support for displaying file picker dialogs on linux (by @henk717), if they are installed on your system they will be used. To continue using the previous TKinter file picker, you can select "Use Classic FilePicker" in the extras tab.
- Added a new API endpoint `/api/extra/json_to_grammar`, which can be used to convert a JSON schema into GBNF grammar (check the API docs for an example; a rough sketch also follows this list).
- Added `--maxrequestsize` flag; you can configure the server's max payload size before an HTTP request is dropped (default 32MB).
- Can now perform GPU memory estimation using vulkaninfo too (if nvidia-smi is not available).
- Merged Llama 4 support from upstream llama.cpp. Qwen3 is technically included too, but until it releases officially we won't know if it actually works.
- Fixed not autosetting backend and layers when swapping to new model in admin mode using a template.
- Added additional warnings in GUI and terminal when you try to use FlashAttention on Vulkan backend - generally this is discouraged due to performance issues.
- Fixed system prompt on gemma3 template
- Updated Kobold Lite, multiple fixes and improvements
- Added Llama4 prompt format
- Consolidated vision dropdown when selecting a vision provider
- Fixed think tokens formatting issue with markdown
- Merged fixes and improvements from upstream
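A rough sketch of calling the new `/api/extra/json_to_grammar` endpoint. The request wrapper used below (a `"schema"` key) is an assumption for illustration; the bundled API docs have the authoritative example.

```python
# Rough sketch: convert a JSON schema into GBNF grammar via
# /api/extra/json_to_grammar. The {"schema": ...} wrapper is an assumed
# payload shape; check the API docs for the real request format.
import requests

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

resp = requests.post(
    "http://localhost:5001/api/extra/json_to_grammar",
    json={"schema": schema},  # assumption, see API docs
)
resp.raise_for_status()
print(resp.text)  # GBNF grammar that can then be passed in the "grammar" field
```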
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.87.4
- NEW: Embeddings endpoint added - GGUF embedding models can now be loaded with `--embeddingsmodel` and accessed from `/v1/embeddings` or `/api/extra/embeddings`; this can be used to encode text for search or storage within a vector database (an example request follows this list).
- NEW: Added OuteTTS Voice Cloning Support! - Now you can upload Speaker JSONs over the TTS API, which represent a cloned voice when generating TTS. Read more here and try some sample speakers, or make your own.
- NEW: Merged Qwen2.5VL support from @HimariO's fork - Also fixed issues with Qwen2VL when multiple images are used.
- NEW: Automatic function (tool) calling - Improved tool calling support; thanks to help from @henk717, KoboldCpp can now work with tool calls from frontends such as OpenWebUI. Additionally, auto mode is also supported, allowing the model to decide for itself whether a function call is needed or not, and which tool to use (though manually selecting the desired tool with `tool_choice` still provides better results). Note that tool calling requires a relatively intelligent modern model to work correctly (recommended model: Gemma3). For more info on function calling, see here. The tool call detection template can be customized by setting `custom_tools_prompt` in the chat completions adapter. An example request also follows this list.
- NEW: Added Command Line Chat Mode - KoboldCpp has come full circle! Now you can use it fully without a GUI, just like good old llama.cpp. Simply run it with `--cli` to enter terminal mode, where you can chat interactively using the command line shell!
- Improved AMD rocwmma build detection, also changed the Vulkan build process (now requires compiling shaders)
- Merged DP4A Vulkan enhancements by @0cc4m for greater performance on legacy quants on AMD and Intel; please report if you encounter any issues.
- `--quantkv` can now be used without flash attention; when this is done it only applies quantized-K without quantized-V. Not really advised, performance can suffer.
- Truncated base64 image printouts in console (they were too long)
- Added a timeout for vulkaninfo in case it hangs.
- Fixed `--savedatafile` with relative paths
- Fixed llama3 template AutoGuess detection.
- Added localtunnel as an alternative fallback option in the Colab, in case Cloudflare tunnels happen to be blocked.
- Updated Kobold Lite, multiple fixes and improvements
- NEW: Added World Info Groups - You can now categorize your world info entries into groups (e.g. for a character/location/event) and easily toggle them on and off in a single click.
- You can also toggle each entry on/off individually without removing it from WI.
- You can easily Import and Export each world info group as JSON to use within another story or chat.
- Added a menu to upload a cloned speaker JSON for use in voice cloning. Read the section for OuteTTS voice cloning above.
- Multiplayer mode UI streamlining.
- Add toggle to allow uploading images as a new turn.
- Increased max resolution of uploaded images used with vision models.
- Switching a model in admin mode now auto refreshes Lite when completed
- Merged fixes and improvements from upstream
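A minimal sketch of the new embeddings endpoint, assuming the server was started with `--embeddingsmodel` pointing at a GGUF embedding model. The request shape follows the OpenAI embeddings convention; the model name is a placeholder.

```python
# Minimal sketch: request embeddings from the OpenAI-compatible /v1/embeddings
# endpoint. Assumes KoboldCpp was started with --embeddingsmodel.
import requests

payload = {
    "model": "kcpp-embeddings",  # placeholder; the loaded embedding model is used
    "input": ["the quick brown fox", "a lazy dog"],
}
resp = requests.post("http://localhost:5001/v1/embeddings", json=payload)
resp.raise_for_status()
for item in resp.json()["data"]:
    print(len(item["embedding"]), "dimensions")
```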
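And a sketch of OpenAI-style tool calling against the chat completions endpoint, including the auto mode described above. The `get_weather` tool and its parameters are invented purely for illustration.

```python
# Sketch: OpenAI-style function/tool calling on /v1/chat/completions.
# The "get_weather" tool is illustrative only.
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "kcpp",  # placeholder; the loaded model is used
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide; naming a tool is more reliable
}
resp = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"].get("tool_calls"))
```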
Hotfix 1.87.1 - Fixed embeddings endpoint, fixed gemma3 autoguess system tag.
Hotfix 1.87.2 - Fixed broken DP4A vulkan and savedatafile bug.
Hotfix 1.87.3 - Fixed a regression in savedatafile, also added queueing for SDUI.
Hotfix 1.87.4 - Revert gemma3 system role template as it was not working correctly. Increased warmup batch size. Merged further DP4A improvements.
This month marks KoboldCpp entering its third year! Somehow it has survived, against all odds. Thanks to everyone who has provided valuable feedback, contributions and support over the past months. Enjoy!
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
koboldcpp-1.86.2
- Integrated Gemma3 support. To use it, grab the gguf model and vision mmproj (such as this one) and load both of them in KoboldCpp, similar to earlier vision models. Everything else should work out of the box in Lite (click Add Img to paste or upload an image). Vision will also work in SillyTavern via the custom Chat Completions API (enabling inline images; an example request follows this list).
- Fixed OpenAI API `finish_reason` value and tool calling behaviors.
- Re-enabled support for CUDA compute capability 3.7 (K80)
- Allow option to save stories to google drive when used in Colab
- Added speculative success rate information in `/api/extra/perf/`
- Allow downloading Image Generation LoRAs from URL launch arguments
- Added image Generation param metadata to generated image (thanks @wbruna)
- CI builds now also rebuild Vulkan shaders.
- Replaced winclinfo.exe with a simpler version (see simpleclinfo.cpp) that only fetches GPU names.
- Allow admin mode to runtime swap between gguf model files as well, in addition to swapping between kcpps configs. When swapping models in this way, default GPU layers and selections will be picked.
- Updated Kobold Lite, multiple fixes and improvements
- Added a new instruct preset "KoboldCppAutomatic" which automatically obtains the instruct template from KoboldCpp.
- Improvements and fixes for side panel mode
- Merged fixes and improvements from upstream
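A sketch of sending an inline image to a Gemma3 + mmproj setup over the chat completions API, using the standard OpenAI base64 `image_url` convention that frontends like SillyTavern rely on; treat the exact size limits and field handling as model- and version-dependent.

```python
# Sketch: send an inline base64 image to /v1/chat/completions, the same
# mechanism vision-capable frontends use for inline images.
import base64
import requests

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "kcpp",  # placeholder; the loaded model + mmproj are used
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
}
resp = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```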
Hotfix 1.86.1:
- Added new option `--defaultgenamount`, which controls the max number of tokens generated by default (e.g. by a third party client using chat completions) if not specified
- Added new option `--nobostoken`, which prevents BOS tokens from being used automatically at the start. Not recommended unless you know what you're doing.
- Fixed bugs with CL VRAM detection, Gemma3 Vision on MacOS, and rescaling issues.
Hotfix 1.86.2:
- NEW: Allowed using quantized KV (`--quantkv`) together with context shift, as it's been enabled upstream. This means the only requirement for using quantized KV now is to enable flash attention. Report any issues you face.
- Fixed chat completions function (tool) calling again, which now works in all modes except "auto" (set `tool_choice` to "forced" to let the AI choose which tool to use).
- Improved mmproj memory estimation, fix for Image Gen metadata and input fields sanitization, fixed output image metadata.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI. Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.