Hugging Face Optimum
Installation
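Optimum can be installed using pip as follows:
python -m pip install optimum
If you'd like to use the accelerator-specific features of Optimum, you can install the required dependencies according to the table below: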
| Accelerator | Installation |
|---|---|
| ONNX Runtime | pip install --upgrade-strategy eager optimum[onnxruntime] |
| Intel Neural Compressor | pip install --upgrade-strategy eager optimum[neural-compressor] |
| OpenVINO | pip install --upgrade-strategy eager optimum[openvino,nncf] |
| Habana Gaudi Processor (HPU) | pip install --upgrade-strategy eager optimum[habana] |
| FuriosaAI | pip install --upgrade-strategy eager optimum[furiosa] |
The --upgrade-strategy eager option is needed to ensure the different packages are upgraded to the latest possible version.
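When installs are scripted (for example while provisioning environments), the commands from the table can be assembled programmatically. The sketch below is only an illustration: the helper `pip_install_args` and the `ACCELERATOR_EXTRAS` mapping are hypothetical names, with the extras copied from the table above. It uses `shlex.join` so that the square brackets are quoted for shells such as zsh, which would otherwise try to expand them as glob patterns.

```python
import shlex

# Extras copied from the installation table (hypothetical mapping name).
ACCELERATOR_EXTRAS = {
    "onnxruntime": "onnxruntime",
    "neural-compressor": "neural-compressor",
    "openvino": "openvino,nncf",
    "habana": "habana",
    "furiosa": "furiosa",
}


def pip_install_args(accelerator=None, eager=True):
    """Return the argument list for a `python -m pip install ...` call."""
    args = ["python", "-m", "pip", "install"]
    if accelerator is None:
        # Plain install without accelerator-specific dependencies.
        return args + ["optimum"]
    extras = ACCELERATOR_EXTRAS[accelerator]
    if eager:
        # Ask pip to upgrade dependencies as well, as the table recommends.
        args += ["--upgrade-strategy", "eager"]
    return args + [f"optimum[{extras}]"]


# Print a shell-safe command line for the OpenVINO extra.
print(shlex.join(pip_install_args("openvino")))
```

The returned list can also be passed directly to `subprocess.run`, in which case no shell quoting is needed.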
To install from source:
python -m pip install git+https://github.com/huggingface/optimum.git
For the accelerator-specific features, put optimum[accelerator_type]@ in front of the repository URL in the above command:
python -m pip install optimum[onnxruntime]@git+https://github.com/huggingface/optimum.git
Accelerated Inference
- ONNX / ONNX Runtime
- TensorFlow Lite
- OpenVINO
- Habana first-gen Gaudi / Gaudi2, more details here
The export and optimizations can be done both programmatically and from the command line.
Features summary
| Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite |
|---|---|---|---|---|
| Graph optimization | ✔ | N/A | ✔ | N/A |
| Post-training dynamic quantization | ✔ | ✔ | N/A | ✔ |
| Post-training static quantization | ✔ | ✔ | ✔ | ✔ |
| Quantization Aware Training (QAT) | N/A | ✔ | ✔ | N/A |
| FP16 (half precision) | ✔ | N/A | ✔ | ✔ |
| Pruning | N/A | ✔ | ✔ | N/A |
| Knowledge Distillation | N/A | ✔ | ✔ | N/A |
OpenVINO
This requires installing the OpenVINO extra: pip install --upgrade-strategy eager optimum[openvino,nncf]
To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. To load a PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, you can set export=True when loading your model.
- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
model.save_pretrained("./distilbert")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = classifier("He's a dreadful magician.")
You can find more examples in the documentation and in the examples.
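Since the snippet above saves the converted model with `save_pretrained`, later runs can load it from that directory without `export=True` and so skip the conversion step. A minimal sketch, assuming the `./distilbert` directory from the example already exists; the function name `load_exported_classifier` is hypothetical, and the tokenizer is fetched from the original checkpoint because the example above does not save it locally.

```python
def load_exported_classifier(
    model_dir="./distilbert",
    tokenizer_id="distilbert-base-uncased-finetuned-sst-2-english",
):
    """Build a text-classification pipeline from an already exported OpenVINO model."""
    # Imported lazily so that defining this helper does not require optimum[openvino].
    from optimum.intel import OVModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    # No export=True here: the directory already holds the OpenVINO model files.
    model = OVModelForSequenceClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)


# Example call (needs the saved model directory and network access for the tokenizer):
# classifier = load_exported_classifier()
# print(classifier("He's a dreadful magician."))
```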
Neural Compressor
This requires installing the Neural Compressor extra: pip install --upgrade-strategy eager optimum[neural-compressor]
Dynamic quantization can be applied to your model:
optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output ./quantized_distilbert
To load a model quantized with Intel Neural Compressor, hosted locally or on the Hugging Face Hub, you can do as follows:
from optimum.intel import INCModelForSequenceClassification
model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_id)
You can find more examples in the documentation and in the examples.
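The dynamic quantization shown with `optimum-cli` above can also be done from Python. The following is a sketch rather than a reference implementation: it assumes the `INCQuantizer` class from `optimum.intel` and `PostTrainingQuantConfig` from `neural_compressor.config`, and the function name `quantize_dynamic` and its default paths are hypothetical.

```python
def quantize_dynamic(
    model_name="distilbert-base-cased-distilled-squad",
    save_dir="./quantized_distilbert",
):
    """Apply post-training dynamic quantization with Intel Neural Compressor."""
    # Imported lazily so that defining this helper does not require optimum[neural-compressor].
    from neural_compressor.config import PostTrainingQuantConfig
    from optimum.intel import INCQuantizer
    from transformers import AutoModelForQuestionAnswering

    # Load the PyTorch model to be quantized.
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    # Dynamic quantization needs no calibration dataset.
    quantization_config = PostTrainingQuantConfig(approach="dynamic")
    quantizer = INCQuantizer.from_pretrained(model)
    # Quantize and write the resulting model to save_dir.
    quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)
    return save_dir


# Example call (downloads the checkpoint from the Hub):
# quantize_dynamic()
```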
ONNX + ONNX Runtime
This requires installing the ONNX Runtime extra: pip install optimum[exporters,onnxruntime]
It is possible to export models to the ONNX format and perform graph optimization as well as quantization:
optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx
The model can then be quantized using onnxruntime:
optimum-cli onnxruntime quantize \
--avx512 \
--onnx_model roberta_base_qa_onnx \
-o quantized_roberta_base_qa_onnx
These commands export deepset/roberta-base-squad2, perform O2 graph optimization on the exported model, and finally quantize it with the avx512 configuration.
For more information on the ONNX export, please check the documentation.
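The same export and quantization can be sketched in Python. This example assumes the `ORTQuantizer` and `AutoQuantizationConfig` classes from `optimum.onnxruntime`; the function name `export_and_quantize` and its default output directory are hypothetical. For brevity it skips the O2 graph optimization step that the CLI commands perform, so its output is not identical to theirs.

```python
def export_and_quantize(
    model_id="deepset/roberta-base-squad2",
    save_dir="quantized_roberta_base_qa_onnx",
):
    """Export a checkpoint to ONNX and apply dynamic quantization for AVX512 CPUs."""
    # Imported lazily so that defining this helper does not require optimum[onnxruntime].
    from optimum.onnxruntime import ORTModelForQuestionAnswering, ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # export=True converts the PyTorch checkpoint to ONNX on the fly.
    onnx_model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)
    quantizer = ORTQuantizer.from_pretrained(onnx_model)
    # Dynamic (not static) quantization, so no calibration data is needed.
    qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False)
    # Write the quantized model to save_dir.
    quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)
    return save_dir


# Example call (downloads the checkpoint from the Hub):
# export_and_quantize()
```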
Run the exported model using ONNX Runtime
Once the model is exported to the ONNX format, Optimum provides Python classes that let you run it seamlessly with ONNX Runtime as the backend:
- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline
model_id = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's Optimum?"
context = "Optimum is an awesome library everyone should use!"
results = qa_pipe(question=question, context=context)
For more details on how to run ONNX models with the ORTModelForXXX classes, see the documentation.
TensorFlow Lite
This requires installing the Exporters extra: pip install optimum[exporters-tf]
Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them:
optimum-cli export tflite \
-m deepset/roberta-base-squad2 \
--sequence_length 384 \
--quantize int8-dynamic roberta_tflite_model
Accelerated training
- Habana's Gaudi processors
- ONNX Runtime (optimized for GPUs)
Habana
This requires installing the Habana extra: pip install --upgrade-strategy eager optimum[habana]
- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForXxx.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
output_dir="path/to/save/folder/",
+ use_habana=True,
+ use_lazy_mode=True,
+ gaudi_config_name="Habana/bert-base-uncased",
...
)
# Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
...
)
# Use Habana Gaudi processor for training!
trainer.train()
You can find more examples in the documentation and in the examples.
ONNX Runtime
- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
output_dir="path/to/save/folder/",
optim="adamw_ort_fused",
...
)
# Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
+ feature="sequence-classification", # The model type to export to ONNX
...
)
# Use ONNX Runtime for training!
trainer.train()
You can find more examples in the documentation and in the examples.

