Jihun Lim

Model Distillation for Amazon Nova Vision: Fine-Tuning Text-Image-to-Text

In this post, I'll introduce a Text-Image-to-Text fine-tuning method to effectively transfer the Vision capabilities of Amazon Nova Pro Model to the Lite Model.

Before diving into the main content, I'd like to mention that I initially wanted to cover Model Distillation techniques in the Vision field directly, but the current support for this in Amazon Bedrock is limited. As an alternative, I'll share how to implement Vision Language Model distillation indirectly using the "Fine-Tuning: Text-Image-to-Text" approach.


โš—๏ธ Model Distillation

At re:Invent 2024, Amazon Bedrock introduced Model Distillation, a new model customization feature alongside Fine-tuning and Continued Pre-training. More recently (April 30), Amazon released Nova Premier as a teacher model for distilling complex tasks.

Model distillation is a technique that transfers knowledge from a large teacher model to a smaller student model, allowing you to reduce model size and computational costs while maintaining performance as much as possible.

Amazon Bedrock Model Distillation consists of two main steps: first, generating the required training data with the teacher model, and second, fine-tuning the student model on that generated data to create the distilled model.
Model Distillation
Bedrock doesn't officially support model distillation for image tasks at present. However, if you understand the basic principles of the distillation process, you can implement model distillation for image tasks on your own by using a teacher model to generate training data and performing fine-tuning separately.


📸 Task Setting - Comparing Image Labeling Tasks

Multimodal models with Vision Understanding capabilities include Image Captioning functionality that can describe given images. When you provide an image and request keyword extraction for desired styles (photography techniques, mood, objects, etc.), you can receive relevant keywords for that image.

👇 Image Labeling Example Prompt

You are an image keyword extraction expert. Please analyze the image and extract concise keywords optimized for search.

Extract keywords according to the following 5 categories, but provide the final result as a single list separated by commas without category distinctions:

1. Main objects/people: People (gender, age group, ethnicity), animals, objects, and other core elements
2. Location/background: Places, landscapes, environments (indoor/outdoor), time, season
3. Actions/emotions: Verbs describing activities, adjectives indicating mood
4. Visual characteristics: Main colors, composition, photography techniques, image style
5. Contextual elements: Fashion, landmarks, cultural context, event/festival-related information

Please provide 2-5 keywords per category, totaling 15-25 search-optimized keywords. Avoid duplications and be concise.

The image below shows the results of Image Labeling performed using the Nova Pro and Lite models on one of the photos from the ShutterstockInc/high_resolution_images dataset.

Image Labeling

Even with the same prompt, the two models' responses differ considerably. Keep in mind that the goal of this post is not to determine which model is superior at Image Labeling, but to make the Lite model produce responses similar to those of the Pro model!

To quantify how similar the two models' answers are, we measured the Jaccard index of the keyword sets they produced, which came out to 0.129. Now, let's see how much closer the responses can get by fine-tuning the Lite model on data generated by Pro.
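
For reference, below is a minimal sketch of how such a keyword-level Jaccard index can be computed. The comma splitting and lowercasing are my own assumptions about how the keyword lists are normalized, and the sample keywords are made up for illustration.

def keyword_set(response: str) -> set[str]:
    """Split a comma-separated keyword response into a normalized set."""
    return {kw.strip().lower() for kw in response.split(",") if kw.strip()}

def jaccard(a: str, b: str) -> float:
    """Jaccard index: |intersection| / |union| of the two keyword sets."""
    set_a, set_b = keyword_set(a), keyword_set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical responses from the Pro and Lite models for the same image.
pro_keywords = "woman, beach, sunset, golden hour, silhouette, ocean"
lite_keywords = "person, beach, evening, silhouette, sea"
print(f"Jaccard similarity: {jaccard(pro_keywords, lite_keywords):.3f}")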


๐Ÿง‘โ€๐Ÿ”ฌ Self-Implementation of VLM Model Distillation

Dataset Preparation Process

To distill VLM models ourselves, we'll perform fine-tuning using the Text-Image-to-Text approach. For this, we need to prepare the fine-tuning dataset in the following four steps.

In this post, we used the medium dataset of ShutterstockInc/high_resolution_images available on Hugging Face to implement VLM model distillation ourselves.

1. Image Preprocessing

The scope of image preprocessing is very broad. Here, assuming that classification suitable for specific tasks has been completed, we'll only cover preprocessing related to image resizing. Different tasks require different resolutions, but in most cases, high-resolution images are not necessary.

For example, Claude models calculate the token count of an image using the following formula: Token count = (width px × height px) ÷ 750

For a 300 × 199 image:

  • Total pixels: 300 × 199 = 59,700 pixels
  • Required tokens: 59,700 ÷ 750 = 79.6 ≈ 80 tokens

For a 1000 × 665 image:

  • Total pixels: 1000 × 665 = 665,000 pixels
  • Required tokens: 665,000 ÷ 750 = 886.67 ≈ 887 tokens

As you can see, token consumption varies greatly depending on image resolution, so it's important to appropriately reduce the size of high-resolution images before building a training dataset. This not only reduces model training costs but also contributes to improved processing speed, enabling efficient learning without performance degradation for most tasks.
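
To make the resizing step concrete, below is a minimal sketch using Pillow. The 1,024-pixel long-edge target, the directory names, and the JPEG quality are assumptions for illustration, and the token estimate simply applies the rule-of-thumb formula above.

from pathlib import Path
from PIL import Image

MAX_LONG_EDGE = 1024  # assumed target resolution; tune to your task

def resize_image(src: Path, dst: Path, max_edge: int = MAX_LONG_EDGE) -> None:
    """Downscale an image so its longer side is at most max_edge pixels."""
    with Image.open(src) as img:
        img.thumbnail((max_edge, max_edge))  # shrinks in place, keeps aspect ratio
        img.convert("RGB").save(dst, "JPEG", quality=90)

def estimate_tokens(width: int, height: int) -> int:
    """Approximate image token count with the (width × height) ÷ 750 rule of thumb."""
    return round(width * height / 750)

src_dir, dst_dir = Path("raw_images"), Path("resized_images")
dst_dir.mkdir(exist_ok=True)
for path in src_dir.glob("*.jpg"):
    resize_image(path, dst_dir / path.name)
    with Image.open(dst_dir / path.name) as resized:
        print(path.name, estimate_tokens(*resized.size), "tokens (approx.)")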

2. Reference Data Composition

In this process, we call the teacher model to generate prompt-response pair data. The responses generated by the teacher model are later used as fine-tuning data for the student model.

We called the teacher model through the Converse API that supports multimodal functionality, and saved the model's responses and corresponding image filenames in JSONL format for building the fine-tuning dataset.

system_prompts = [{"text": system_prompt}]
conversation = [
    {
        "role": "user",
        "content": [
            {"text": user_prompts},
            {
                "image": {
                    "format": "jpeg",
                    "source": { "bytes": image_bytes }
                }
            }
        ]
    }
]

response = client.converse(
    modelId=teacher_model_id,
    system=system_prompts,
    messages=conversation,
    inferenceConfig={"maxTokens": 1024, "temperature": 0.5, "topP": 0.9},
)

reponse_text = response["output"]["message"]["content"][0]["text"]
jsonl_data = { "image": image_path.name, "label": reponse_text }

3. Training Dataset Creation

Following Bedrock's fine-tuning requirements, we create the dataset needed for model training in JSONL format, referencing the Preparing data for fine-tuning Understanding models guidelines.

In this post, we prepare the data in the Single image custom fine-tuning format.
Using the data generated in step 2, we fill in the system and messages text fields and the uri field of the image to complete the dataset.
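
Below is a minimal sketch of how such a record might be assembled from the step-2 output. The exact field names should be verified against the Preparing data for fine-tuning Understanding models guide; the bucket name, file names, and prompt strings here are placeholders for illustration.

import json

BUCKET = "my-nova-ft-bucket"  # placeholder S3 bucket holding the resized images
system_prompt = "You are an image keyword extraction expert. ..."  # labeling prompt from above
user_prompts = "Extract search-optimized keywords for this image."  # placeholder user turn

def build_record(image_name: str, label: str) -> dict:
    """Assemble one Text-Image-to-Text training record (single-image format)."""
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": system_prompt}],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"text": user_prompts},
                    {
                        "image": {
                            "format": "jpeg",
                            "source": {"s3Location": {"uri": f"s3://{BUCKET}/images/{image_name}"}},
                        }
                    },
                ],
            },
            {"role": "assistant", "content": [{"text": label}]},
        ],
    }

# Convert the teacher responses from step 2 into the training JSONL file.
with open("teacher_labels.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        pair = json.loads(line)
        dst.write(json.dumps(build_record(pair["image"], pair["label"])) + "\n")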

4. Dataset Validation

Before starting the fine-tuning process, first check the validity of your dataset using the Dataset Validation for Fine-tuning Nova Understanding models script provided by the aws-samples GitHub repository.

Running the command python3 nova_ft_dataset_validator.py -i <file path> -m <model name> will perform the check, and if all samples pass validation, the message Validation successful, all samples passed will be displayed.

Fine-tuning

Once dataset preparation is complete, the fine-tuning process is very simple. Just specify the S3 location where the dataset is stored in the Amazon Bedrock console and set the necessary hyperparameter values.

For this training, we increased the Nova Lite model's epoch count from the default of 2 to 5, while keeping the other hyperparameters at their default values.
(Image: fine-tuning hyperparameter settings in the Bedrock console)
Upon completion of training, training result metrics are stored in the S3 location specified during the fine-tuning process. Through the step_wise_training_metrics.csv file, you can check training loss values for each step and epoch, allowing you to confirm the model's learning progress.
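
For reference, the same job can also be submitted programmatically instead of through the console. This is a minimal sketch only: the job and model names, role ARN, and S3 paths are placeholders, and the hyperparameter key should be checked against the Bedrock customization documentation.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="nova-lite-labeling-ft",  # placeholder names
    customModelName="nova-lite-labeling-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuningRole",
    baseModelIdentifier="amazon.nova-lite-v1:0",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-nova-ft-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-nova-ft-bucket/output/"},
    hyperParameters={"epochCount": "5"},  # assumed key; other values left at defaults
)
print(response["jobArn"])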


๐Ÿ–๏ธ Fine-Tuning Text-Image-to-Text Results

In this post, we used the medium dataset of 🤗 ShutterstockInc/high_resolution_images, which consists of 1,000 images.
Of these, 900 images were used as training data, and the remaining 100 were used to verify model performance after fine-tuning was completed. Given the limited amount of data, we conducted two training sessions, using 300 and 900 images respectively.

Nova Pro & Nova Lite Comparison

First, to check the performance difference between Nova Pro and Lite models without fine-tuning, we compared the analysis results for 100 images. The Jaccard similarity between the two models was found to be mostly distributed between 0.1 and 0.4.
(Figure: Jaccard similarity distribution, base Nova Lite vs. Nova Pro)

Nova Pro & Nova Lite (300 images)

After training on 300 samples, the Jaccard similarity improved to between 0.2 and 0.6. This shows that even with a relatively small amount of data, the Lite model's responses can move noticeably closer to the Pro model's.
(Figure: Jaccard similarity distribution after fine-tuning on 300 images)

Nova Pro & Nova Lite (900 images)

After training on 900 samples, the Jaccard similarity again fell between 0.2 and 0.6, and the model trained on 900 images (purple) showed slightly higher similarity than the model trained on 300 images (red).
(Figure: Jaccard similarity distribution after fine-tuning on 900 images vs. 300 images)

In this experiment, we used only 900 images due to image data limitations, but Amazon Bedrock's image fine-tuning feature supports up to 20,000 data points. Therefore, we expect performance to improve further if fine-tuning is performed with more data.
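
For completeness, here is a minimal sketch of how the fine-tuned custom model can be invoked for this kind of held-out comparison. A custom model has to be served through Provisioned Throughput before it can be called (the no-commitment option that also appears in the cost table below); the model ARN and names are placeholders, and system_prompts / conversation follow the same request shape as the teacher call earlier.

import boto3

bedrock = boto3.client("bedrock")
runtime = boto3.client("bedrock-runtime")

# Purchase no-commitment Provisioned Throughput for the fine-tuned model (placeholder ARN).
pt = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName="nova-lite-labeling-pt",
    modelId="arn:aws:bedrock:us-east-1:123456789012:custom-model/nova-lite-labeling-v1",
)
provisioned_arn = pt["provisionedModelArn"]

# Wait until the provisioned model is InService, then invoke it like any other model.
response = runtime.converse(
    modelId=provisioned_arn,
    system=system_prompts,
    messages=conversation,
    inferenceConfig={"maxTokens": 1024, "temperature": 0.5, "topP": 0.9},
)
student_keywords = response["output"]["message"]["content"][0]["text"]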


💸 Model Customization Costs

I've listed the costs incurred in the experiment, which I hope will help you estimate expected costs when planning future fine-tuning tasks. 🙃

Nova Lite Fine-Tuning Costs

Usage Type                            | Data Count | Cost       | Training Time | Provisioned Throughput Cost (No Commitment) | Model Storage Cost
USE1-NovaLite-Customization-Training  | 300 images | About $2.1 | About 1 hour  | $108.15 per hour                            | $1.95 per month
USE1-NovaLite-Customization-Training  | 900 images | About $7.5 | About 2 hours | $108.15 per hour                            | $1.95 per month

These figures do not include the cost of generating the prompt-response pair data with the teacher model. To estimate that cost, run the task once, measure the token consumption from the response, and calculate it separately.
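
As a rough sketch of that calculation: the Converse API response includes a usage block with input and output token counts, which can be multiplied by the model's per-token price. The prices below are placeholders, not current Nova Pro rates.

# Placeholder prices per 1,000 tokens -- look up the current Nova Pro pricing before using.
INPUT_PRICE_PER_1K = 0.0008
OUTPUT_PRICE_PER_1K = 0.0032

usage = response["usage"]  # 'response' is the client.converse(...) result from earlier
cost_per_call = (
    usage["inputTokens"] / 1000 * INPUT_PRICE_PER_1K
    + usage["outputTokens"] / 1000 * OUTPUT_PRICE_PER_1K
)

# Extrapolate to the full teacher-data generation run (e.g. 900 calls).
print(f"Estimated teacher data generation cost: ${cost_per_call * 900:.2f}")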


🌟 Conclusion

In this post, we explored how to implement model distillation indirectly through Text-Image-to-Text fine-tuning in a situation where Amazon Bedrock does not officially support model distillation for Vision tasks.

For successful VLM model distillation, a systematic dataset preparation process is essential. The steps of optimizing token consumption through image preprocessing, building reference data using teacher models, creating training datasets that meet Bedrock requirements, and validating datasets before fine-tuning directly impact model performance.

Also, after completing fine-tuning, it's necessary to confirm the model's performance improvement through a validation process. In this article, we measured response consistency between models using Jaccard similarity and found that as the amount of data increased, the Lite model came closer to the Pro model's responses.

While this indirect distillation method is not an officially supported feature, it shows that a lightweight model can achieve results similar to those of a high-performance model through proper dataset composition and fine-tuning. We hope that official support for Vision model distillation in Amazon Bedrock will expand in the future; until then, this approach can be useful in practical applications. I hope this methodology helps in your projects as well.

🤣 Actually, this post is part of what I experimented with while preparing for my AWS Seoul Summit 2025 presentation. I'll share the presentation video here when it becomes available!
