DEV Community

Hung____ for AWS Community Builders

Create training job for YOLO model on Amazon SageMaker with AWS Lambda

In this post, I will show you how to create a training job for the YOLO11x model on Amazon SageMaker through a Lambda function, and then deploy the trained model to an endpoint.

I have prepared a repo with all the code I use; please have a look:
https://github.com/Hung-00/Amazon-SageMaker-YOLO-training-job

The process

First, you need an image that contains all the packages and code files for training.

Building the image from scratch is a bit tricky, so I have made this simple Dockerfile that you can look at and try.

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

# Set timezone to avoid interactive prompt
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=UTC

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    python3-pip \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    libgl1-mesa-glx \
    tzdata \
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip
RUN pip install --upgrade pip

# Create requirements.txt with pinned versions
RUN echo "numpy==1.24.4\n\
sagemaker-training\n\
ultralytics==8.3.170\n\
albumentationsx\n\
opencv-python-headless==4.9.0.80\n\
pillow==10.1.0\n\
pandas==2.0.3\n\
matplotlib==3.7.2\n\
seaborn==0.12.2\n\
tqdm==4.66.1\n\
pyyaml==6.0.1\n\
scipy==1.10.1" > /requirements.txt

# Install all dependencies at once
RUN pip install -r /requirements.txt

# Set up the working directory
WORKDIR /opt/ml/code

# Copy training and inference code
COPY train.py /opt/ml/code/train.py
COPY code/inference.py /opt/ml/code/inference.py
COPY code/requirements.txt /opt/ml/code/requirements.txt
# Only for testing
COPY debug.py /opt/ml/code/debug.py

# Tell sagemaker-training-toolkit which script to run as the entry point
ENV SAGEMAKER_PROGRAM=train.py


Note the pinned version ultralytics==8.3.170, the latest version as of July 2025; you may need a newer version to use YOLO12 or later YOLO models.
albumentationsx is an upgraded version of albumentations that gives you better augmentations when training your model.
The steps that copy the code files are important because we are using sagemaker-training-toolkit. Read more about it here: https://github.com/aws/sagemaker-training-toolkit
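
To give an idea of how a script run by sagemaker-training-toolkit receives its settings, here is a minimal sketch. It is not the train.py from my repo: the function name and default values are assumptions for illustration. The toolkit passes hyperparameters as command-line flags and exposes container paths through SM_* environment variables.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths (illustrative defaults)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, e.g. --epochs 200
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--imgsz", type=int, default=640)
    # Container paths come from SM_* environment variables, with the
    # standard /opt/ml locations as fallbacks
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--validation",
                        default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"))
    parser.add_argument("--output-data-dir",
                        default=os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"))
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args())
```

Keeping the flag names identical to the hyperparameter keys means the same script works both in a SageMaker training job and when started by hand inside the container.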

I have also created a script, upload_image_to_ECR.py, that builds the image and uploads it straight to ECR; just make sure Docker is running.
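
If you prefer to push the image by hand, the usual sequence is login, build, tag, push. The sketch below only assembles those shell commands and the resulting image URI; the function name and the example account ID are hypothetical, and it is not the content of upload_image_to_ECR.py.

```python
def ecr_push_commands(account_id, region, repo, tag="latest"):
    """Return the ECR image URI and the shell commands that would push it."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image_uri = f"{registry}/{repo}:{tag}"
    commands = [
        # Authenticate the local Docker client against the ECR registry
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # Build from the Dockerfile in the current directory
        f"docker build -t {repo}:{tag} .",
        # Tag the local image with the full registry path, then push it
        f"docker tag {repo}:{tag} {image_uri}",
        f"docker push {image_uri}",
    ]
    return image_uri, commands


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID
    uri, cmds = ecr_push_commands("123456789012", "ap-southeast-1", "yolo11-training")
    print(uri)
    print("\n".join(cmds))
```

The ECR repository itself has to exist before the push, for example one created in the console.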

The image will be around 3.8 GB.

Alternatively, you can test locally first. Here are the commands to build and run the image on your machine:

docker build -t yolo .

docker run --rm -it \
  --gpus all \
  -v $(pwd)/local_test/input/data:/opt/ml/input/data \
  -v $(pwd)/local_test/model:/opt/ml/model \
  -v $(pwd)/local_test/output:/opt/ml/output \
  -e SM_MODEL_DIR=/opt/ml/model \
  -e SM_CHANNEL_TRAIN=/opt/ml/input/data/train \
  -e SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation \
  -e SM_OUTPUT_DATA_DIR=/opt/ml/output/data \
  yolo \
  /bin/bash

Your local_test folder should contain the input/data, model, and output subfolders used in the volume mounts above so that you can bring them into the container. You have to prepare the training data yourself.

Also, the dataset.yaml should look like this:

names:
- class_0
- class_1
- class_2
- class_3
nc: 4
path: /opt/ml/input/data
train: train
val: validation
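
If you have many classes, you may want to generate this file instead of typing it. Here is a small sketch that produces the same layout as above; the function name is hypothetical, and it assumes class names are plain words that need no YAML quoting.

```python
def build_dataset_yaml(class_names, root="/opt/ml/input/data",
                       train="train", val="validation"):
    """Build dataset.yaml text in the same layout as the example above."""
    lines = ["names:"]
    lines += [f"- {name}" for name in class_names]
    lines += [
        f"nc: {len(class_names)}",  # number of classes
        f"path: {root}",            # dataset root inside the container
        f"train: {train}",          # train folder, relative to path
        f"val: {val}",              # validation folder, relative to path
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dataset_yaml(["class_0", "class_1", "class_2", "class_3"]))
```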

Check that the installation succeeded:

python debug.py

Test result:

Start training in the container:

python train.py --epochs 1 --batch-size 2 --imgsz 640

Training results:

My train.py puts model.pt and the other result images into two folders:

/opt/ml/model/
/opt/ml/output/data/
Everything in these two directories gets uploaded to S3, so this is where you should save your trained model artifacts and other training results.
You can read more about this here: https://nono.ma/sagemaker-model-dir-output-dir-and-output-data-dir-parameters
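
As a rough illustration of this idea, the sketch below copies weight files into the model directory and everything else into the output data directory. It is an assumption about how such a step could look, not the code in my train.py; the function name is hypothetical, and files are copied flat, so two files with the same name would overwrite each other.

```python
import os
import shutil
from pathlib import Path


def export_artifacts(run_dir, model_dir=None, output_dir=None):
    """Copy .pt files to the model dir and all other files to the output dir."""
    model_dir = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    output_dir = Path(output_dir or os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"))
    model_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    copied = {"model": [], "output": []}
    for path in sorted(Path(run_dir).rglob("*")):
        if not path.is_file():
            continue
        # Weights go to the model dir; plots, CSVs, etc. go to output data
        key = "model" if path.suffix == ".pt" else "output"
        target = (model_dir if key == "model" else output_dir) / path.name
        shutil.copy2(path, target)
        copied[key].append(target.name)
    return copied
```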

In the model folder:

In the output/data folder:

After the training job finishes, the S3 output location will contain two archives, model.tar.gz and output.tar.gz, which hold everything above:

Now that you know how the code inside the image works, let's create a training job with Lambda.

Go to 2_create training_job\trigger_training.py in my GitHub repo, take that function, and deploy it as a new Lambda function.

In IAM, create a role that SageMaker can assume, that is, a role whose trusted entity is the SageMaker service.
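
For reference, a role that SageMaker can assume needs a trust policy naming the SageMaker service as principal. The sketch below prints such a policy as JSON; the function name is hypothetical, and the permissions the role also needs (S3, ECR, CloudWatch Logs) are not covered here.

```python
import json


def sagemaker_trust_policy():
    """Trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Paste the printed JSON into the "Trust relationships" tab of the role
    print(json.dumps(sagemaker_trust_policy(), indent=2))
```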

Set the SAGEMAKER_ROLE_ARN environment variable to the ARN of the role above and ECR_IMAGE_URI to the URI of the latest image in your ECR repo, for example ***********.dkr.ecr.ap-southeast-1.amazonaws.com/yolo11-training:latest

The S3 bucket that holds your data should look like this:

s3://your-data-bucket/
├── train/
│   ├── images/
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── ...
│   ├── labels/
│   │   ├── image1.txt
│   │   ├── image2.txt
│   │   └── ...
│   └── dataset.yaml
└── val/
    ├── images/
    │   └── ...
    └── labels/
        └── ...
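
Before uploading, it can help to check locally that the images and labels folders match. The sketch below lists images that have no label file with the same name; the function name and the set of image extensions are assumptions. Note that an image without a label may be intentional, since it is then treated as a background image.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unlabeled_images(split_dir):
    """Return image file names under images/ with no matching labels/*.txt."""
    split_dir = Path(split_dir)
    # Label files share the image's base name: image1.jpg -> image1.txt
    label_stems = {p.stem for p in (split_dir / "labels").glob("*.txt")}
    return sorted(
        p.name
        for p in (split_dir / "images").iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES and p.stem not in label_stems
    )
```

For example, calling find_unlabeled_images("train") on a local copy of the layout above returns an empty list when every image has a label.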

Create a test event in the format below, replacing the S3 paths with your own S3 URIs and giving it an output location:

    {
        "training_data_s3": "s3://your-bucket/path/to/train",
        "validation_data_s3": "s3://your-bucket/path/to/val",
        "output_s3": "s3://your-bucket/path/to/output",
        "instance_type": "ml.g4dn.xlarge",
        "hyperparameters": {
            "epochs": 200,
            "batch-size": 16,
            "learning-rate": 0.01,
            "imgsz": 640
        }
    }
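
To show how an event like this can map onto a SageMaker training job, here is a sketch that builds the request for boto3's create_training_job. It is not the code in trigger_training.py: the function name, the job-name prefix, the volume size, and the runtime limit are assumptions. The channel names train and validation are chosen to match the SM_CHANNEL_TRAIN and SM_CHANNEL_VALIDATION variables used earlier.

```python
from datetime import datetime, timezone


def build_training_job_request(event, role_arn, image_uri, now=None):
    """Translate the test event into create_training_job keyword arguments."""
    now = now or datetime.now(timezone.utc)

    def channel(name, s3_uri):
        # Each channel is downloaded to /opt/ml/input/data/<name>
        return {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    return {
        "TrainingJobName": f"yolo11x-{now:%Y%m%d-%H%M%S}",
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            channel("train", event["training_data_s3"]),
            channel("validation", event["validation_data_s3"]),
        ],
        "OutputDataConfig": {"S3OutputPath": event["output_s3"]},
        "ResourceConfig": {
            "InstanceType": event.get("instance_type", "ml.g4dn.xlarge"),
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,  # assumed size, adjust to your dataset
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 60 * 60},
        # The API requires hyperparameter values to be strings
        "HyperParameters": {
            key: str(value)
            for key, value in event.get("hyperparameters", {}).items()
        },
    }
```

Inside a Lambda handler, the request could then be sent with boto3.client("sagemaker").create_training_job(**request).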

Run the test event to create a training job. The job took around 35 minutes for 200 epochs with my settings.
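
Since the job runs longer than a Lambda invocation can, you may want to watch it from a notebook or your own machine. Here is a minimal polling sketch built on describe_training_job; the function name and the polling interval are assumptions, and client is expected to be a boto3 SageMaker client.

```python
import time

# Statuses after which the job will not change any more
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll the training job until it reaches a terminal status."""
    while True:
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        # Still InProgress or Stopping: wait before asking again
        sleep(poll_seconds)
```

Typical usage would be wait_for_training_job(boto3.client("sagemaker"), "your-job-name"), where the job name is the one shown in the SageMaker console.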

The results will be created in your output bucket:

Now go to 3_create_endpoint\create_endpoint.py, take the code, and deploy it as a Lambda function that creates the endpoint.

Create an IAM role for SageMaker to assume, with the AmazonS3FullAccess policy attached. Save its ARN in the SAGEMAKER_ENPOINT_ROLE_ARN environment variable.

Create a test event, replacing the path with your own output bucket and training folder:

    {
        "bucket_and_train_folder": "s3://your-output-bucket/trained-model/yolo11x-20250807-093817",
        "instance_type": "ml.c5.xlarge"
    }
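
Deploying an endpoint with boto3 takes three calls: create_model, create_endpoint_config, and create_endpoint. The sketch below only builds the three requests from an event like the one above and is not the code in create_endpoint.py. The function name, the resource names, and the inference_image_uri parameter are hypothetical, and it assumes the model archive sits at output/model.tar.gz under the training folder.

```python
def build_endpoint_requests(event, role_arn, inference_image_uri, name="yolo11x"):
    """Build the create_model, create_endpoint_config, create_endpoint requests."""
    # Assumption: the training job wrote <folder>/output/model.tar.gz
    folder = event["bucket_and_train_folder"].rstrip("/")
    model_data_url = f"{folder}/output/model.tar.gz"

    model = {
        "ModelName": f"{name}-model",
        "PrimaryContainer": {
            "Image": inference_image_uri,  # hypothetical inference image
            "ModelDataUrl": model_data_url,
        },
        "ExecutionRoleArn": role_arn,
    }
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model["ModelName"],
                "InstanceType": event.get("instance_type", "ml.c5.xlarge"),
                "InitialInstanceCount": 1,
            }
        ],
    }
    endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": endpoint_config["EndpointConfigName"],
    }
    return model, endpoint_config, endpoint
```

Each dictionary can be passed to the matching method of boto3.client("sagemaker") with ** unpacking, in the order returned.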

Run the test event to deploy the endpoint; it may take up to 5 minutes.

You can test the endpoint with the simple code below:

# Notebook cell: install OpenCV first
!pip install opencv-python
import base64, json, time

import boto3
import cv2


infer_start_time = time.time()

# Read the image into a numpy array
orig_image = cv2.imread('images-test/a.jpg')

# Convert the array into JPEG bytes
jpeg = cv2.imencode('.jpg', orig_image)[1]
# Serialize the JPEG using base64
payload = base64.b64encode(jpeg).decode('utf-8')

# Append conf and iou as extra comma-separated fields
conf = 0.85
iou = 0.8
payload = f"{payload},{conf},{iou}"

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(EndpointName="your-yolo11x-endpoint", ContentType='text/csv', Body=payload)

# The endpoint returns JSON
response_body = response['Body'].read()
result = json.loads(response_body.decode('utf-8'))

infer_end_time = time.time()

print(f"Inference Time = {infer_end_time - infer_start_time:0.4f} seconds")

print(result)
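
On the endpoint side, the text/csv body sent above has to be split back into the image and the two numbers. The sketch below shows one way this could be done; it is an assumption about the format handling, not the inference.py from my repo, and the function name is hypothetical. Splitting from the right is safe because base64 text never contains a comma.

```python
import base64


def parse_payload(body):
    """Split '<base64 jpeg>,<conf>,<iou>' into JPEG bytes and two floats."""
    image_b64, conf, iou = body.rsplit(",", 2)
    return base64.b64decode(image_b64), float(conf), float(iou)
```

The returned bytes could then be decoded into an image, for example with cv2.imdecode, before running the model.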

I hope you find this post helpful.

References:

  1. https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html
  2. https://nono.ma/sagemaker-model-dir-output-dir-and-output-data-dir-parameters
  3. https://stackoverflow.com/questions/69024005/how-to-use-sagemaker-estimator-for-model-training-and-saving
  4. https://github.com/aws/sagemaker-training-toolkit
  5. https://github.com/Hung-00/Amazon-SageMaker-YOLO-training-job
