Models & Model Compatibility

What models run on Featherless?

Model Compatibility

Featherless aims to provide serverless inference for all AI models. We currently support over 4,000 text-generation models, all of which are fine-tunes of the following base architectures:

| Family | Model Class | Context Length | Concurrency Cost |
|---|---|---|---|
| Deepseek 3 | deepseek-v3-lc | 32768 | 4 |
| Gemma 2 | gemma2-2b | 8192 | 1 |
| Gemma 3 | gemma3-12b | 16384 | 1 |
| Gemma 3 | gemma3-27b | 16384 | 2 |
| GLM 4 | glm4-9b | 16384 | 1 |
| GLM 4 | glm4-32b | 16384 | 4 |
| Llama 2 | tinyllama-1b1 | 2048 | 1 |
| Llama 2 | llama2-7b | 4096 | 1 |
| Llama 2 | llama2-solar-10b7-4k | 4096 | 1 |
| Llama 2 | llama2-13b-4k | 4096 | 1 |
| Llama 3 | llama3-8b-8k | 8192 | 1 |
| Llama 3 | llama3-15b-8k | 8192 | 1 |
| Llama 3 | llama3-70b-8k | 8192 | 4 |
| Llama 3.1 | llama31-8b-16k | 16384 | 1 |
| Llama 3.1 | llama31-70b-16k | 16384 | 4 |
| Llama 3.2 | llama32-1b | 32768 | 1 |
| Llama 3.3 | llama33-70b-16k | 16384 | 4 |
| Mistral | mistral-v02-7b-std-lc | 8192 | 1 |
| Mistral | mistral-v01-7b | 4096 | 1 |
| Mistral | mistral-nemo-12b-lc | 16384 | 1 |
| Mistral | mixtral-8x22b-lc | 16384 | 4 |
| Mistral 3 | mistral-24b-lc | 16384 | 2 |
| Mistral 3.1 | mistral-24b-2503 | 16384 | 2 |
| Qwen 1.5 | qwen15-1b8 | 32768 | 1 |
| Qwen 1.5 | qwen15-32b-lc | 16384 | 2 |
| Qwen 2 | qwen2-0b5 | 131072 | 1 |
| Qwen 2 | qwen2-7b-lc | 16384 | 1 |
| Qwen 2 | qwen2-14b-lc | 16384 | 1 |
| Qwen 2 | qwen2-32b-lc | 16384 | 2 |
| Qwen 2 | qwen2-72b-lc | 16384 | 4 |
| Qwen 2.5 | qwen25-0b5 | 32768 | 1 |
| Qwen 2.5 | qwen25-1b5 | 131072 | 1 |
| Qwen 2.5 | qwen25-7b-lc | 16384 | 1 |
| Qwen 2.5 | qwen25-14b-lc | 16384 | 1 |
| Qwen 2.5 | qwen25-32b-lc | 16384 | 2 |
| Qwen 2.5 | qwen25-72b-lc | 16384 | 4 |
| Qwen 3 | qwen3-0b6 | 40960 | 1 |
| Qwen 3 | qwen3-4b | 40960 | 1 |
| Qwen 3 | qwen3-8b | 16384 | 1 |
| Qwen 3 | qwen3-14b | 16384 | 1 |
| Qwen 3 | qwen3-32b | 16384 | 4 |
| Qwerky | qrwkv-32b-32k | 32768 | 1 |
| Qwerky | qrwkv-72b-32k | 32768 | 1 |
| RWKV 5 | rwkv5-7b | 16384 | 1 |
| RWKV 6 | rwkv6-7b-16k | 16384 | 1 |
| RWKV 6 | rwkv6-14b-16k | 16384 | 1 |
| RWKV 6 | rwkv6moe-37b-16k | 16384 | 1 |

HuggingFace Repository Requirements

For models to be loaded in Featherless, we require the following (a programmatic pre-flight check is sketched after this list):

  • a model card on the Hugging Face Hub

  • full weights (not LoRA or QLoRA adapters)

  • weights in safetensors format (not GGUF, not pickled torch tensors)

  • FP16 precision (though we quantize to FP8 as part of model boot)

  • no variation in tensor shape relative to one of the above base models (e.g. no change in embedding tensor size)
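
Featherless's actual ingestion checks aren't published, but most of these requirements can be approximated from repository metadata. A minimal sketch using the huggingface_hub client; the repo id is a placeholder, and the heuristics (README.md as the model card, adapter_config.json as the adapter marker, torch_dtype in config.json) are our assumptions, not the real pipeline:

```python
# Rough pre-flight check against the repository requirements above.
# The heuristics here are assumptions, not Featherless's actual checks.
import json

from huggingface_hub import HfApi, hf_hub_download


def check_repo(repo_id: str) -> list[str]:
    """Return a list of likely compatibility problems for a Hub repo."""
    info = HfApi().model_info(repo_id)
    filenames = [s.rfilename for s in info.siblings]
    problems = []

    # Model card: stored on the Hub as README.md in the repo root.
    if "README.md" not in filenames:
        problems.append("no model card (README.md) found")

    # Full weights in safetensors format (not GGUF).
    if not any(f.endswith(".safetensors") for f in filenames):
        problems.append("no .safetensors weights found")
    if any(f.endswith(".gguf") for f in filenames):
        problems.append("GGUF files present; full safetensors weights required")

    # LoRA/QLoRA adapters ship adapter_config.json instead of full weights.
    if "adapter_config.json" in filenames:
        problems.append("looks like a LoRA/QLoRA adapter, not full weights")

    # FP16 precision: most configs record the checkpoint dtype.
    if "config.json" in filenames:
        with open(hf_hub_download(repo_id, "config.json")) as f:
            dtype = json.load(f).get("torch_dtype")
        if dtype != "float16":
            problems.append(f"torch_dtype is {dtype!r}, expected float16")

    return problems


# "some-user/some-model" is a placeholder repo id.
print(check_repo("some-user/some-model") or "no obvious problems")
```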

Model Availability

Any public model from Hugging Face with 100+ downloads will automatically be available for inference on Featherless. Users may request public models with fewer downloads either by email or through the #model-suggestions channel on our Discord.
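
Availability can also be checked programmatically. A minimal sketch, assuming the OpenAI-compatible /v1/models listing on api.featherless.ai and an API key in a FEATHERLESS_API_KEY environment variable; the model id shown is illustrative:

```python
# Check whether a model is already available for inference.
# Assumes the OpenAI-compatible /v1/models listing endpoint.
import os

import requests

resp = requests.get(
    "https://api.featherless.ai/v1/models",
    headers={"Authorization": f"Bearer {os.environ['FEATHERLESS_API_KEY']}"},
)
resp.raise_for_status()
available = {m["id"] for m in resp.json()["data"]}

# Illustrative model id (see the Magnum example further down).
print("anthracite-org/magnum-v2-72b" in available)
```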

Private models meeting the compatibility requirements outlined here can be run on Featherless by Scale customers that have connected their Hugging Face account. To set this up, visit the private models page in the profile section of the web app.

Context Lengths

All models are served at one of 4k, 8k, 16k, or 32k context length - i.e. the total token count of the prompt plus the completion cannot exceed the context length of the model.

The context length at which a model can be used depends on its architecture, as shown in the following table.

| Context Length | Model Architectures Serving this Length |
|---|---|
| 4k | Llama 2 (7B, 11B, 13B) |
| 8k | Llama 3 (8B, 15B, 70B); Mistral v2 (7B) |
| 16k | Llama 3.1, 3.3 (8B, 70B); Mistral Nemo (12B); Mistral 3, 3.1 (24B); Qwen (1.5-32B, 2-72B, 2.5-72B, 3-32B) |
| 32k | Qwerky (32B, 72B); Deepseek (V3, R1) |

For example, since Anthracite’s Magnum is a Qwen 2 72B fine-tune, its context length is 16k.

Similarly, since Sao10K’s Fimbulvetr is a fine-tune of Llama 2 11B, its context length is 4k.
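
Because prompt plus completion must fit in the window, clients should cap max_tokens based on prompt size. A minimal sketch against the OpenAI-compatible chat endpoint; the base URL and model id follow the assumptions above, and the characters-per-token estimate is a crude stand-in for the model's real tokenizer:

```python
# Keep prompt + completion inside the model's context window.
import os

from openai import OpenAI

CONTEXT_LENGTH = 16384  # Qwen 2 72B fine-tunes are served at 16k

client = OpenAI(
    base_url="https://api.featherless.ai/v1",
    api_key=os.environ["FEATHERLESS_API_KEY"],
)

prompt = "Summarize the plot of Hamlet in three paragraphs."
prompt_tokens_estimate = len(prompt) // 4 + 1  # crude heuristic

response = client.chat.completions.create(
    model="anthracite-org/magnum-v2-72b",  # illustrative model id
    messages=[{"role": "user", "content": prompt}],
    # Cap the completion so prompt + completion stays within 16k.
    max_tokens=min(1024, CONTEXT_LENGTH - prompt_tokens_estimate),
)
print(response.choices[0].message.content)
```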

We aim to operate models at their maximum usable context; however, we continue to make tradeoffs to ensure sufficiently low TTFT (time to first token) and a consistent token throughput of > 10 tok/s for all models.

Quantization

Though our model ingestion pipeline requires weights in safetensors format at FP16 precision, all models are served at FP8 precision (they are quantized before loading). The exception to this rule is models under 5B parameters, which are run at FP16 precision. This tradeoff balances output quality with inference speed.
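
Serving precision is therefore a pure function of parameter count. A trivial sketch encoding the rule above (the helper name is ours, not an API):

```python
def serving_precision(param_count_billions: float) -> str:
    """Precision a model is served at, per the rule above:
    under 5B parameters runs at FP16, everything else at FP8."""
    return "FP16" if param_count_billions < 5 else "FP8"

assert serving_precision(2) == "FP16"   # e.g. gemma2-2b
assert serving_precision(72) == "FP8"   # e.g. qwen25-72b fine-tunes
```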
