About the code
Object detection YOLO v1 loss function implementation in Python + TensorFlow 2.x. It is based on the PyTorch implementations below and re-implemented with TensorFlow based on my research on the paper and other resources. I added comments indicating which part of the YOLO v1 paper corresponds to which code lines.
- github Machine-Learning-Collection/ML/Pytorch/object_detection/YOLO/loss.py
- github a-g-moore/YOLO/loss.py
The first one, Machine-Learning-Collection, has a code commentary on YouTube which I followed to understand the YOLO v1 loss function.
Objective
Get feedback on the Python coding and on whether the loss function works correctly.
References
- YOLOv1 from Scratch by Aladdin Persson
- YOLO Made Simple: Interpreting the You Only Look Once Paper
- YOLO — ‘You only look once’ for Object Detection explained
- Yolo Loss function explanation
- How are the confidence scores in YOLO computed?
- Yolo v1 - what is the purpose of the confidence Ci?
Implementation
Python dependencies
Python 3.9.13
numpy version: 1.24.1
TensorFlow version: 2.10.0
Keras version: 2.10.0
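How the loss is plugged in (sketch)
A minimal, hypothetical sketch of how I intend to use the loss with Keras; build_yolo_model and its layers are placeholders of my own, not the actual YOLO v1 network:
```
import tensorflow as tf
from loss import YOLOLoss

def build_yolo_model() -> tf.keras.Model:
    """Placeholder backbone whose head outputs S*S*(C+B*P) = 7*7*30 values."""
    inputs = tf.keras.Input(shape=(448, 448, 3))
    x = tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu")(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(7 * 7 * 30)(x)
    return tf.keras.Model(inputs, outputs)

model = build_yolo_model()
model.compile(optimizer="adam", loss=YOLOLoss())
# model.fit(x=images, y=labels)  # labels shaped (N, 7*7*25) per the loss docstring
```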
loss.py
"""
YOLO v1 loss function module based on:
1. https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/object_detection/YOLO/loss.py
2. https://github.com/a-g-moore/YOLO/blob/master/loss.py
[Terminology]
input image: 448x448 color image which is divided into an S x S grid.
grid: S x S divisions of an input image
cell: a cell in the grid
responsible cell:
[YOLO v1 paper]
If the center of an object falls into a grid cell, that grid cell
is responsible for detecting that object.
responsible bounding box:
[YOLO v1 paper]
YOLO predicts multiple bounding boxes per grid cell.
At training time we only want one bounding box predictor to be responsible
for each object. We assign one predictor to be “responsible” for predicting
an object based on which prediction has the highest current IOU with the
ground truth.
bbox: bounding box
localization: predicted bounding box (cp, x, y, w, h)
[YOLO v1 paper]
Each bounding box consists of 5 predictions: x, y, w, h, and confidence.
The (x, y) coordinates represent the center of the box relative to the
bounds of the grid cell. The width and height are predicted relative to
the whole image.
cp: confidence score = (Pr(Object) * IOU_truth_pred) = IOU
'p' distinguishes it from C/c used for Classification.
Pr(Object) is either 0 (no object) or 1 (object exists), hence cp is expected
to equal the IOU as stated in the paper (see the worked example below).
[YOLO v1 paper]
Formally we define confidence as (Pr(Object) * IOU_truth_pred). If no object
exists in that cell, the confidence scores should be zero.
Otherwise, we want the confidence score to equal the intersection over union
(IOU) between the predicted box and the ground truth.
x, y:
center of a bounding box measured from the top-left corner of its grid cell,
normalized to [0, 1] relative to the grid cell size, but it can be larger
than 1 if the center falls outside the cell.
w, h:
width and height of a bounding box, normalized to [0, 1] relative to the
image width and height, but they can be larger than 1 when the bounding box
is larger than the image itself.
Ci/C(i):
ground truth classification probability for class i in each cell.
C_hat(i):
predicted conditional classification probability for class i.
[YOLO v1 paper]
Each grid cell also predicts C conditional class probabilities, Pr(Classij|Object).
These probabilities are conditioned on the grid cell containing an object.
We only predict one set of class probabilities per grid cell, regardless of
the number of boxes B.
At test time we multiply the conditional class probabilities and the individual
box confidence predictions,
Pr(Class_i|Object) * (Pr(Object) * IOU_truth_pred) = Pr(Class_i) * IOU_truth_pred (1)
which gives us class-specific confidence scores for each box. These scores
encode both the probability of that class appearing in the box and how well
the predicted box fits the object.
Iobj_i: 0 or 1, telling whether an object appears in cell i.
Iobj_j: 1 if the j-th bounding box predictor in cell i is the “responsible” bbox.
IOU: Intersection Over Union
S: number of divisions per side. S=7 in YOLO v1.
B: number of bounding boxes to be predicted per cell. B=2 in YOLO v1.
P: size of a prediction of a bounding box = len(cp, x, y, w, h) = 5.
C:
number of classes to classify an object into. C=20 in YOLO v1.
YOLO v1 uses PASCAL VOC 2007 and 2012 datasets that have 20 classes.
[YOLO v1 paper]
We train the network for about 135 epochs on the training and validation
data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also
include the VOC 2007 test data for training.
non-max-suppression:
A mechanism to identify a single box for an object, to prevent multiple
cells from detecting the same object. At each cell, C_hat(i) is multiplied
with the confidence score cp of *every* bounding box, not just with the
best bounding box with the max IOU at the cell. This generates S*S*B
class-specific confidence scores over all the bounding boxes.
Then the box with the highest class-specific confidence score for an object
is identified as the box for that object. This is encoded in the formula in the paper:
Pr(Class_i|Object) * (Pr(Object) * IOU_truth_pred) = Pr(Class_i) * IOU_truth_pred (1)
See:
https://medium.com/diaryofawannapreneur/yolo-you-only-look-once-for-object-detection-explained-6f80ea7aaa1e
[YOLO v1 paper]
Figure 1: The YOLO Detection System.
Processing images with YOLO is simple and straightforward. Our system (1) resizes
the input image to 448 x 448, (2) runs a single convolutional network on the
image, and (3) thresholds the resulting detections by the model’s confidence.
...
At test time we multiply the conditional class probabilities and the individual
box confidence predictions,
Pr(Class_i|Object) * (Pr(Object) * IOU_truth_pred) = Pr(Class_i) * IOU_truth_pred (1)
which gives us class-specific confidence scores for each box. These scores
encode both the probability of that class appearing in the box and how well
the predicted box fits the object.
...
Often it is clear which grid cell an object falls in to and the network only
predicts one box for each object. However, some large objects or objects near
the border of multiple cells can be well localized by multiple cells.
Non-maximal suppression can be used to fix these multiple detections.
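[Worked example]
Illustrative numbers of my own, not from the paper. For a 448x448 image with
S=7, each cell is 64x64 pixels. An object centered at pixel (100, 200) with a
150x300 bounding box falls into the cell at (column 1, row 3) because
100 // 64 == 1 and 200 // 64 == 3, and is encoded as:
x = (100 - 1*64) / 64 = 0.5625
y = (200 - 3*64) / 64 = 0.125
w = 150 / 448 ≈ 0.335
h = 300 / 448 ≈ 0.670
If the responsible predicted box overlaps this ground truth with IOU = 0.7,
the target confidence is Pr(Object) * IOU = 1 * 0.7 = 0.7; in a cell with no
object, the target is 0. At test time, a box with cp = 0.6 in a cell where
Pr(dog|Object) = 0.9 yields the class-specific score 0.9 * 0.6 = 0.54 via
formula (1).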
[References]
* PASCAL VOC (Visual Object Classes) - http://host.robots.ox.ac.uk/pascal/VOC/
* PASCAL VOC 2007 - http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ (information and link to data)
* PASCAL VOC 2007 examples - http://host.robots.ox.ac.uk/pascal/VOC/voc2007/examples/index.html
* PASCAL VOC 2007 Development Kit - http://host.robots.ox.ac.uk/pascal/VOC/voc2007/htmldoc/index.html
(Details about the dataset)
---
Objects of the twenty classes listed above are annotated in the ground truth.
class:
the object class e.g. `car' or `bicycle'
bounding box:
an axis-aligned rectangle specifying the extent of the object visible in the image.
view:
`frontal', `rear', `left' or `right'.
The views are subjectively marked to indicate the view of the `bulk' of the object.
Some objects have no view specified.
`truncated':
an object marked as `truncated' indicates that the bounding box specified for
the object does not correspond to the full extent of the object e.g. an image
of a person from the waist up, or a view of a car extending outside the image.
`difficult':
an object marked as `difficult' indicates that the object is considered difficult
to recognize, for example an object which is clearly visible but unidentifiable
without substantial use of context. Objects marked as difficult are currently
ignored in the evaluation of the challenge.
---
* TensorFlow Data Set - https://www.tensorflow.org/datasets/catalog/voc
> This dataset contains the data from the PASCAL Visual Object Classes Challenge,
> corresponding to the Classification and Detection competitions.
> https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/object_detection/voc.py
---
PASCAL_VOC_CLASSES: List[str] = [
"aeroplane", # 0
"bicycle", # 1
"bird", # 2
"boat", # 3
"bottle", # 4
"bus", # 5
"car", # 6
"cat", # 7
"chair", # 8
"cow", # 9
"diningtable", # 10
"dog", # 11
"horse", # 12
"motorbike", # 13
"person", # 14
"pottedplant", # 15
"sheep", # 16
"sofa", # 17
"train", # 18
"tvmonitor" # 19
]
---
[NOTE]
assert and logger statements are for eager mode, for unit testing (pytest) purposes only.
"""
# pylint: disable=too-many-statements
import logging
from typing import (
Union,
Optional,
)
import numpy as np
from tensorflow import keras # pylint: disable=unused-import
import tensorflow as tf
from keras.losses import (
Loss,
# MeanSquaredError,
)
from constant import (
DEBUG_LEVEL,
DUMP,
EPSILON,
TYPE_FLOAT,
TYPE_INT,
YOLO_GRID_SIZE,
YOLO_PREDICTION_NUM_CLASSES,
YOLO_PREDICTION_NUM_BBOX,
YOLO_PREDICTION_NUM_PRED,
YOLO_LABEL_INDEX_CP,
)
from util_logging import (
get_logger,
)
from utils import (
intersection_over_union,
)
# --------------------------------------------------------------------------------
# Logging
# --------------------------------------------------------------------------------
_logger: logging.Logger = get_logger(__name__, level=DEBUG_LEVEL)
# --------------------------------------------------------------------------------
# Loss function
# --------------------------------------------------------------------------------
class YOLOLoss(Loss):
"""YOLO v1 objective (loss) layer
[References]
https://www.tensorflow.org/guide/keras/train_and_evaluate#custom_losses
If you need a loss function that takes in parameters beside T and y_pred,
you can subclass the tf.keras.losses.Loss class and implement the two methods:
1. __init__(self): accept parameters to pass during the call of the loss function
2. call(self, y_true, y_pred): compute the model's loss
https://www.tensorflow.org/api_docs/python/tf/keras/losses/Loss
> To be implemented by subclasses:
> call(): Contains the logic for loss calculation using y_true, y_pred.
"""
def __init__(
self,
S = YOLO_GRID_SIZE, # pylint: disable=invalid-name
B = YOLO_PREDICTION_NUM_BBOX, # pylint: disable=invalid-name
C = YOLO_PREDICTION_NUM_CLASSES, # pylint: disable=invalid-name
P = YOLO_PREDICTION_NUM_PRED, # pylint: disable=invalid-name
**kwargs
):
"""
Initialization of the Loss instance
Args:
S: Number of division per side (YOLO divides an image into SxS grids)
B: number of bounding boxes per grid cell
C: number of classes to detect
P: size of predictions len(cp, x, y, w, h) per bounding box
"""
super().__init__(**kwargs)
self.batch_size = TYPE_FLOAT(0)
# pylint: disable=invalid-name
self.N = -1 # Total cells in the batch (S * S * batch_size) pylint: disable=invalid-name
self.S = S # pylint: disable=invalid-name
self.B = B # pylint: disable=invalid-name
self.C = C # pylint: disable=invalid-name
self.P = P # pylint: disable=invalid-name
# --------------------------------------------------------------------------------
# lambda parameters to prioritise the localization vs classification
# --------------------------------------------------------------------------------
# [YOLO v1 paper]
# YOLO uses sum-squared error because it is easy to optimize,
# however it does not perfectly align with our goal of maximizing
# average precision. It weights localization error equally with
# classification error which may not be ideal.
# Also, in every image many grid cells do not contain any
# object. This pushes the “confidence” scores of those cells
# towards zero, often overpowering the gradient from cells
# that do contain objects. This can lead to model instability,
# causing training to diverge early on.
# To remedy this, we increase the loss from bounding box
# coordinate predictions and decrease the loss from confidence
# predictions for boxes that don’t contain objects. We
# use two parameters, lambda_coord=5 and lambda_noobj=0.5
# --------------------------------------------------------------------------------
self.lambda_coord: TYPE_FLOAT = TYPE_FLOAT(5.0)
self.lambda_noobj: TYPE_FLOAT = TYPE_FLOAT(0.5)
# --------------------------------------------------------------------------------
# Identity function Iobj_i tells if an object exists in a cell.
# [Original Paper]
# where Iobj_i denotes if object appears in cell i and Iobj_j denotes that
# the jth bounding box predictor in cell i is “responsible” for that prediction.
# Note that the loss function only penalizes classification error if an object
# is present in that grid cell (hence the conditional class probability discussed).
# --------------------------------------------------------------------------------
self.Iobj_i: Optional[tf.Tensor] = None # pylint: disable=invalid-name
self.Inoobj_i: Optional[tf.Tensor] = None # pylint: disable=invalid-name
def get_config(self) -> dict:
"""Returns the config dictionary for a Loss instance.
https://www.tensorflow.org/api_docs/python/tf/keras/losses/Loss#get_config
Return serializable layer configuration from which the instance can be reinstantiated.
"""
config = super().get_config().copy()
config.update({
'S': self.S,
'B': self.B,
'C': self.C,
'P': self.P,
'lambda_coord': self.lambda_coord,
'lambda_noobj': self.lambda_noobj,
})
return config
def loss_fn(self, y_true: tf.Tensor, y_pred:tf.Tensor) -> tf.Tensor:
"""Sum squared loss function
[YOLO v1 paper]
We use sum-squared error because it is easy to optimize
[Note]
tf.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.SUM)
does normalization at axis=-1, but YOLO v1 sums.
Args:
y_true: ground truth
y_pred: prediction
Returns: batch size normalized loss
"""
tf.debugging.assert_all_finite(x=y_true, message="expected y_true is finite")
tf.debugging.assert_all_finite(x=y_pred, message="expected y_pred is finite")
return tf.math.reduce_sum(tf.square(y_true - y_pred)) / self.batch_size
def Iobj_j( # pylint: disable=invalid-name
self,
bounding_boxes: tf.Tensor,
best_box_indices: tf.Tensor
) -> tf.Tensor:
"""
Identify the responsible bounding boxes and get predictions from them.
Iobj_j is supposed to be a binary function to return 0 or 1. However,
the effective result of Iobj_j is to get (x, y) or (w, h) or (cp) from
the predictions, hence repurpose it to return them.
[YOLO v1 paper]
Iobj_j denotes that the jth bounding box in cell i is “responsible”
for that prediction.
Args:
bounding_boxes:
Bounding boxes from all the cells in shape (N, B, D) where (B, D)
is the B bounding boxes from a cell. D depends on the content.
When (w,h) is passed, then D==2.
best_box_indices:
list of index to the best bounding box of a cell in shape (N,)
Returns: predictions from the responsible bounding boxes
"""
# --------------------------------------------------------------------------------
# From the bounding boxes of each cell, take the box identified by the best box index.
# MatMul X:(N, B, D) with OneHotEncoding:(N, B) extracts the rows as (N, D).
# --------------------------------------------------------------------------------
responsible_boxes: tf.Tensor = tf.einsum(
"nbd,nb->nd",
# Reshape using -1 cause an error ValueError: Shape must be rank 1 but is rank 0
# https://github.com/tensorflow/tensorflow/issues/46776
# tf.reshape(tensor=bounding_boxes, shape=(self.N, self.B, -1)),
tf.reshape(tensor=bounding_boxes, shape=(self.N, self.B, self.P)),
tf.one_hot(
# indices=tf.reshape(tensor=best_box_indices, shape=(-1)),
indices=tf.reshape(tensor=best_box_indices, shape=(self.N,)),
depth=self.B,
dtype=bounding_boxes.dtype
)
)
return responsible_boxes
def call(
self,
y_true: Union[np.ndarray, tf.Tensor],
y_pred: Union[np.ndarray, tf.Tensor]
) -> tf.Tensor:
"""YOLO loss function calculation
See ../image/yolo_loss_function.png and the original v1 paper.
Follow the defined Σ formula strictly. If summation is 1..B, then sum along 1..B.
[Steps]
1. Take the max IOU per cell (between the B predicted boxes and the truth at each cell).
2. Get Iobj_i_j for each cell i.
Iobj_i is 1 if cell i is responsible (cp in truth == 1) for an object, otherwise 0.
Iobj_i_j is 1 if Iobj_i is 1 and bbox j has the max IOU at the cell, otherwise 0.
3. Take sqrt(w) and sqrt(h) to reduce the impact from the object size.
4. Calculate localization loss.
5. Calculate confidence loss.
6. Calculate classification loss.
7. Sum the losses
[References]
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_squared_error
Args:
y_true: ground truth of shape (N, S, S, (C+P)) where N is the batch size.
y_pred: prediction of shape (N, S, S, (C+B*P)).
[Original Paper]
Each grid cell predicts B bounding boxes and confidence scores for
those boxes. These confidence scores reflect how confident the model
is that the box contains an object and also how accurate it thinks
the box is that it predicts.
Formally we define confidence as Pr(Object) * IOU between truth and pred .
If no object exists in that cell, the confidence scores should be zero.
Otherwise we want the confidence score to equal the intersection over
union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence.
The (x; y) coordinates represent the center of the box relative to the
bounds of the grid cell. The width and height are predicted relative to
the whole image. Finally the confidence prediction represents the IOU
between the predicted box and any ground truth box.
Returns: loss
"""
_name: str = "call()"
# --------------------------------------------------------------------------------
# Sanity checks
# --------------------------------------------------------------------------------
# The model output shape for single input image should be
# (S=7 * S=7 * (C+B*P)=30) or (S, S, (C+B*P)).
assert isinstance(y_pred, (np.ndarray, tf.Tensor))
assert y_pred.shape[-1] in (
(self.S * self.S * (self.C + self.B * self.P)),
(self.C + self.B * self.P)
)
# The label shape for one input image should be
# (S=7 * S=7 * (C+P)=25) or (S, S, (C+P)).
assert isinstance(y_true, (np.ndarray, tf.Tensor))
assert y_true.shape[-1] in (
(self.S * self.S * (self.C + self.P)),
(self.C + self.P)
)
self.batch_size = tf.cast(tf.shape(y_pred)[0], dtype=TYPE_FLOAT)
_logger.debug("%s: batch size:[%s]", _name, self.batch_size)  # self.N is not set yet here
tf.debugging.assert_all_finite(x=y_true, message="expected y_true is finite")
tf.debugging.assert_all_finite(x=y_pred, message="expected y_pred is finite")
tf.debugging.assert_non_negative(x=self.batch_size, message="expected batch size non negative")
# --------------------------------------------------------------------------------
# Reshape y_pred into N consecutive predictions in shape (N, (C+B*P)).
# Reshape y_true into N consecutive labels in shape (N, (C+P)).
# All we need are the predictions and label at each cell, hence no need to retain
# the (S x S) geometry of the grid.
# --------------------------------------------------------------------------------
# pylint: disable=invalid-name
Y: tf.Tensor = tf.reshape(tensor=y_pred, shape=(-1, self.C + self.B * self.P))
T: tf.Tensor = tf.reshape(tensor=y_true, shape=(-1, self.C + self.P))
# You can't use Python bool in graph mode. You should instead use tf.cond.
# assert tf.shape(Y)[0] == tf.shape(T)[0], \
# f"got different number of predictions:[{tf.shape(Y)[0]}] and labels:[{tf.shape(T)[0]}]"
# tf.assert_equal(x=tf.shape(Y)[0], y=tf.shape(T)[0], message="expected same number")
self.N = tf.shape(Y)[0]
# tf.print(_name, "number of cells to process (N)=", self.N)
self.Iobj_i = T[..., YOLO_LABEL_INDEX_CP:YOLO_LABEL_INDEX_CP+1]
self.Inoobj_i = 1.0 - self.Iobj_i
tf.debugging.assert_equal(x=tf.shape(self.Iobj_i), y=(self.N, 1), message="expected Iobj_i shape (N,1)")
# assert self.Iobj_i.shape == (self.N, 1), \
# f"expected shape {(self.N, 1)} got {self.Iobj_i.shape}."
# tf.assert_equal(x=self.Iobj_i.shape, y=(self.N, 1), message="expected same shape")
DUMP and _logger.debug("%s: self.Iobj_i:[%s]", _name, self.Iobj_i)
# --------------------------------------------------------------------------------
# Classification loss
# [YOLO v1 paper]
# Note that the loss function only penalizes classification error if an object is
# present in that cell (hence the conditional class probability discussed earlier).
# --------------------------------------------------------------------------------
classification_loss: tf.Tensor = self.loss_fn(
y_true=self.Iobj_i * T[..., :self.C],
y_pred=self.Iobj_i * Y[..., :self.C],
)
_logger.debug("%s: classification_loss[%s]", _name, classification_loss)
# --------------------------------------------------------------------------------
# Bounding box predictions (c, x, y, w, h)
# --------------------------------------------------------------------------------
box_pred: tf.Tensor = tf.reshape(
tensor=Y[..., self.C:], # Take B*(c,x,y,w,h)
shape=(-1, self.B, self.P)
)
box_true: tf.Tensor = tf.reshape(
tensor=T[..., self.C:],
shape=(-1, self.P)
)
DUMP and _logger.debug(
"%s: box_pred shape:%s\n[%s]", _name, tf.shape(box_pred), box_pred
)
# --------------------------------------------------------------------------------
# IoU between predicted bounding boxes and the ground truth at a cell.
# IOU shape (N, B)
# --------------------------------------------------------------------------------
IOU: tf.Tensor = tf.concat( # pylint: disable=invalid-name
values=[
intersection_over_union(
box_pred[..., j, 1:5], # (x,y,w,h) shape:(N,4) from one of B boxes
box_true[..., 1:5] # (x,y,w,h) shape:(N,4) from ground truth
)
for j in range(self.B) # IOU for each bounding box from B predicted boxes
],
axis=-1,
name="IOU"
)
tf.debugging.assert_equal(x=tf.shape(IOU), y=(self.N, self.B), message="expected shape (N,B)")
# --------------------------------------------------------------------------------
# Max IOU per grid cell (axis=-1)
# best_box_j tells which bbox j has the max IOU.
#
# [YOLO v1 paper]
# YOLO predicts multiple bounding boxes per grid cell. At training time we only
# want one bounding box predictor to be responsible for each object.
# We assign one predictor to be “responsible” for predicting an object based on
# which prediction has the highest current IOU with the ground truth.
# This leads to specialization between the bounding box predictors.
# Each predictor gets better at predicting certain sizes, aspect ratios, or
# classes of object, improving overall recall.
# --------------------------------------------------------------------------------
# pylint: disable=invalid-name
max_IOU: tf.Tensor = tf.math.reduce_max(input_tensor=IOU, axis=-1, keepdims=True)
tf.debugging.assert_equal(x=tf.shape(max_IOU), y=(self.N, 1), message="expected MAX IOU shape (N,1)")
DUMP and _logger.debug("%s: max_IOU[%s]", _name, max_IOU)
best_box_j: tf.Tensor = tf.reshape( # argmax drops the last dimension
tensor=tf.math.argmax(input=IOU, axis=-1, output_type=TYPE_INT),
shape=(self.N, 1)
)
DUMP and _logger.debug("%s: best_box_j:%s", _name, best_box_j)
del IOU
# --------------------------------------------------------------------------------
# Bbox prediction (cp, x, y, w, h) from the responsible bounding box j per cell.
# This corresponds to Iobj_j as picking up the best box j at each cell.
# If the best box j == 0, take YOLO_PREDICTION_INDEX_X1:YOLO_PREDICTION_INDEX_H1+1 as
# the (x, y, w, h) of the predicted localization. If j == 1, take the other box.
#
# [YOLO v1 paper]
# It also only penalizes bounding box coordinate error if that predictor is
# “responsible” for the ground truth box (i.e. has the highest IOU of any
# predictor in that grid cell).
# --------------------------------------------------------------------------------
best_boxes: tf.Tensor = self.Iobj_j(
bounding_boxes=box_pred,
best_box_indices=best_box_j
)
tf.debugging.assert_equal(
x=tf.shape(best_boxes), y=(self.N, self.P), message="expected bestbox shape (N,P)"
)
DUMP and _logger.debug("%s: best_boxes[%s]", _name, best_boxes)
# --------------------------------------------------------------------------------
# Localization loss (x, y)
# --------------------------------------------------------------------------------
x_y_loss = self.lambda_coord * self.loss_fn(
y_true=self.Iobj_i * box_true[..., 1:3], # shape (N, 2)
y_pred=self.Iobj_i * best_boxes[..., 1:3] # shape (N, 2) as (x,y) from (cp,x,y,w,h)
)
_logger.debug("%s: x_y_loss[%s]", _name, x_y_loss)
# --------------------------------------------------------------------------------
# Localization loss (sqrt(w), sqrt(h))
# https://datascience.stackexchange.com/questions/118674
# https://youtu.be/n9_XyCGr-MI?list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&t=2804
# --------------------------------------------------------------------------------
# Prevent an infinite gradient during back propagation: d/dx sqrt(x) = 0.5/sqrt(x),
# which diverges at x=0. Use sqrt(abs(x)+eps) to avoid infinity at x=0, and
# sign(x) to restore the original sign, which abs(x) discards.
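# A quick eager-mode check of this safeguard (illustrative sketch, not part of the loss):
#   x = tf.Variable(TYPE_FLOAT(0))
#   with tf.GradientTape() as tape:
#       y = tf.math.sign(x) * tf.math.sqrt(tf.math.abs(x) + EPSILON)
#   tape.gradient(y, x) is finite, whereas the gradient of tf.math.sqrt(x) at x=0 is inf.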
# --------------------------------------------------------------------------------
# [Original yolo v1 paper]
# Sum-squared error also equally weights errors in large boxes and small boxes.
# Our error metric should reflect that small deviations in large boxes matter
# less than in small boxes. To partially address this we predict the square root
# of the bounding box width and height instead of the width and height directly.
# --------------------------------------------------------------------------------
_w_h_pred: tf.Tensor = best_boxes[..., 3:5] # Shape (N,2) as (w,h) from (cp,x,y,w,h)
sqrt_w_h_pred: tf.Tensor = \
tf.math.sign(_w_h_pred) * tf.math.sqrt(tf.math.abs(_w_h_pred) + EPSILON)
_w_h_true: tf.Tensor = box_true[..., 3:5] # Shape (N,2) as (w,h) from (cp,x,y,w,h)
sqrt_w_h_true: tf.Tensor = \
tf.math.sign(_w_h_true) * tf.math.sqrt(tf.math.abs(_w_h_true) + EPSILON)
w_h_loss = self.lambda_coord * self.loss_fn(
y_true=self.Iobj_i * sqrt_w_h_true,
y_pred=self.Iobj_i * sqrt_w_h_pred
)
_logger.debug("%s: w_h_loss[%s]", _name, w_h_loss)
# --------------------------------------------------------------------------------
# Confidence loss with an object in a cell
# [YOLO v1 paper]
# These confidence scores reflect how confident the model is that the box contains
# an object and also how accurate it thinks the box is that it predicts.
# Formally we define confidence as Pr(Object) * IOU(truth, pred). If no object exists
# in that cell, the confidence scores should be zero.
# Otherwise we want the confidence score to equal the intersection over union (IOU)
# between the predicted box and the ground truth.
#
# https://stats.stackexchange.com/q/559122
# the ground-truth value 𝐶𝑖 is computed during training (IOU).
#
# https://github.com/aladdinpersson/Machine-Learning-Collection/pull/44/commits
# object_loss = self.mse(
# torch.flatten(exists_box * target[..., 20:21]),
# # To calculate confidence score in paper, I think it should multiply iou value.
# torch.flatten(exists_box * target[..., 20:21] * iou_maxes),
# )
# --------------------------------------------------------------------------------
confidence_pred: tf.Tensor = best_boxes[..., 0:1] # cp from (cp,x,y,w,h)
confidence_true: tf.Tensor = max_IOU
# assert tf.reduce_all(tf.shape(confidence_pred) == (self.N, 1)), \
# f"expected confidence shape:{(self.N, 1)}, got " \
# f"confidence_pred:{confidence_pred} confidence_truth:{tf.shape(confidence_true)}"
confidence_loss: tf.Tensor = self.loss_fn(
y_true=self.Iobj_i * confidence_true,
y_pred=self.Iobj_i * confidence_pred
)
_logger.debug("%s: confidence_loss[%s]", _name, confidence_loss)
# --------------------------------------------------------------------------------
# Confidence loss with no object
# --------------------------------------------------------------------------------
# [YOLO v1 paper]
# Also, in every image many grid cells do not contain any object.
# This pushes the “confidence” scores of those cells towards zero, often
# overpowering the gradient from cells that do contain objects.
# This can lead to model instability, causing training to diverge early on.
# To remedy this, we increase the loss from bounding box coordinate predictions
# and decrease the loss from confidence predictions for boxes that don’t contain
# objects. We use two parameters, coord and noobj to accomplish this.
# We set lambda_coord = 5 and lambda_noobj = 0.5.
# --------------------------------------------------------------------------------
# Each cell has B confidence predictions cp in (cp,x,y,w,h), giving shape (N, B).
# The loss formula Σ_i Σ_j [Inoobj_ij * (C(i)-C_hat(ij))^2] penalizes every box
# prediction in a no-object cell, hence sum the squared cp of each of the B boxes
# rather than squaring the sum of the B cps (sum of squares != square of sum).
no_obj_confidences_pred: tf.Tensor = \
self.Inoobj_i * box_pred[..., 0]    # (N,1) * (N,B) -> (N,B) via broadcast
tf.debugging.assert_equal(
x=tf.shape(no_obj_confidences_pred),
y=(self.N, self.B),
message="expected no_obj_confidences_pred shape:(N,B)"
)
# No subtraction of no_obj_confidence_true because it is always 0:
# no_obj_confidence_true = Inoobj_i * cp_i.
# When Inoobj_i = 1 with no object, cp_i in the truth is 0, hence the product is 0.
# When Inoobj_i = 0 with an object, the product is again 0.
no_obj_confidence_true: tf.Tensor = tf.zeros_like(no_obj_confidences_pred)
no_obj_confidence_loss: tf.Tensor = self.lambda_noobj * self.loss_fn(
y_true=no_obj_confidence_true,
y_pred=no_obj_confidences_pred
)
_logger.debug("%s: no_obj_confidence_loss[%s]", _name, no_obj_confidence_loss)
# --------------------------------------------------------------------------------
# Total loss
# tf.add_n can be more efficient than reduce_sum because it sums the tensors directly.
# --------------------------------------------------------------------------------
# tf.print("x_y_loss", x_y_loss)
# tf.print("w_h_loss", w_h_loss)
# tf.print("confidence_loss", confidence_loss)
# tf.print("no_obj_confidence_loss", no_obj_confidence_loss)
# tf.print("classification_loss", classification_loss)
# loss: tf.Tensor = tf.math.add_n([
# x_y_loss,
# w_h_loss,
# confidence_loss,
# no_obj_confidence_loss,
# classification_loss
# ])
loss: tf.Tensor = \
x_y_loss + \
w_h_loss + \
confidence_loss + \
no_obj_confidence_loss + \
classification_loss
return loss
def main():
"""Simple test run"""
loss_fn: Loss = YOLOLoss()
S: int = YOLO_GRID_SIZE # pylint: disable=invalid-name
C: int = YOLO_PREDICTION_NUM_CLASSES # pylint: disable=invalid-name
B: int = YOLO_PREDICTION_NUM_BBOX # pylint: disable=invalid-name
P: int = YOLO_PREDICTION_NUM_PRED # pylint: disable=invalid-name
y_pred: tf.Tensor = tf.constant(np.ones(shape=(1, S, S, C+B*P)), dtype=TYPE_FLOAT)
y_true: tf.Tensor = tf.constant(np.zeros(shape=(1, S, S, C+P)), dtype=TYPE_FLOAT)
loss: tf.Tensor = loss_fn(y_pred=y_pred, y_true=y_true)
print(loss)
if __name__ == "__main__":
main()
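To convince myself that the einsum + one-hot selection in Iobj_j picks exactly one box per cell, I used the following standalone toy example (the values are made up):
```
import tensorflow as tf

# N=2 cells, B=2 boxes, P=5 predictions (cp, x, y, w, h) per box.
boxes = tf.constant([
    [[0.9, 1.0, 2.0, 3.0, 4.0], [0.1, 5.0, 6.0, 7.0, 8.0]],  # cell 0
    [[0.2, 9.0, 8.0, 7.0, 6.0], [0.8, 5.0, 4.0, 3.0, 2.0]],  # cell 1
], dtype=tf.float32)
best = tf.constant([0, 1])  # responsible box index per cell

# One-hot of shape (N, B) selects one (P,) row per cell: result shape (N, P).
selected = tf.einsum("nbd,nb->nd", boxes, tf.one_hot(best, depth=2, dtype=boxes.dtype))
print(selected)  # [[0.9 1. 2. 3. 4.] [0.8 5. 4. 3. 2.]]
```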
constant.py
"""
Constant definitions
"""
# pylint: disable=invalid-name
import logging
import numpy as np
import tensorflow as tf
# --------------------------------------------------------------------------------
# Logging
# --------------------------------------------------------------------------------
DEBUG_LEVEL: int = logging.INFO
DUMP: bool = False
logging.basicConfig(level=DEBUG_LEVEL)
# --------------------------------------------------------------------------------
# TYPES
# --------------------------------------------------------------------------------
TYPE_FLOAT = np.float32
TYPE_INT = np.int32
ZERO: tf.Tensor = tf.constant(0, dtype=TYPE_FLOAT)
ONE: tf.Tensor = tf.constant(1, dtype=TYPE_FLOAT)
EPSILON = TYPE_FLOAT(1e-6) # small enough value e.g. to avoid div by zero
YOLO_GRID_SIZE: int = 7
# --------------------------------------------------------------------------------
# YOLO v1 Input
# --------------------------------------------------------------------------------
YOLO_V1_IMAGE_WIDTH: int = 448
YOLO_V1_IMAGE_HEIGHT: int = 448
# --------------------------------------------------------------------------------
# YOLO Model
# --------------------------------------------------------------------------------
YOLO_LEAKY_RELU_SLOPE: TYPE_FLOAT = TYPE_FLOAT(0.1)
# --------------------------------------------------------------------------------
# YOLO v1 Predictions
# YOLO v1 prediction format per grid cell = C (=20 class scores) + B*P.
# P = (cp=1, x=1, y=1, w=1, h=1)
# Total S * S grids, hence model prediction output = (S, S, (C+B*P)).
# --------------------------------------------------------------------------------
# Prediction shape = (C+B*P)
YOLO_PREDICTION_NUM_CLASSES: int = 20 # number of classes
YOLO_PREDICTION_NUM_BBOX: int = 2 # number of bbox per grid cell
YOLO_PREDICTION_NUM_PRED: int = 5 # (cp, x, y, w, h)
YOLO_PREDICTION_INDEX_CP1: int = 20 # Index to cp of the first bbox in (C+B*P)
# Index to x in the first BBox
YOLO_PREDICTION_INDEX_X1: int = YOLO_PREDICTION_INDEX_CP1 + 1
# Index to y in the first BBox
YOLO_PREDICTION_INDEX_Y1: int = YOLO_PREDICTION_INDEX_X1 + 1
# Index to w in the first BBox
YOLO_PREDICTION_INDEX_W1: int = YOLO_PREDICTION_INDEX_Y1 + 1
# Index to h in the first BBox
YOLO_PREDICTION_INDEX_H1: int = YOLO_PREDICTION_INDEX_W1 + 1
assert YOLO_PREDICTION_INDEX_X1 == 21
assert YOLO_PREDICTION_INDEX_H1 == 24
YOLO_PREDICTION_INDEX_CP2: int = YOLO_PREDICTION_INDEX_H1 + 1
YOLO_PREDICTION_INDEX_X2: int = YOLO_PREDICTION_INDEX_CP2 + 1
YOLO_PREDICTION_INDEX_Y2: int = YOLO_PREDICTION_INDEX_X2 + 1
YOLO_PREDICTION_INDEX_W2: int = YOLO_PREDICTION_INDEX_Y2 + 1
YOLO_PREDICTION_INDEX_H2: int = YOLO_PREDICTION_INDEX_W2 + 1
assert YOLO_PREDICTION_INDEX_X2 == 26
assert YOLO_PREDICTION_INDEX_H2 == 29
YOLO_LABEL_INDEX_CP: int = 20
YOLO_LABEL_INDEX_X: int = YOLO_LABEL_INDEX_CP + 1
YOLO_LABEL_INDEX_Y: int = YOLO_LABEL_INDEX_X + 1
YOLO_LABEL_INDEX_W: int = YOLO_LABEL_INDEX_Y + 1
YOLO_LABEL_INDEX_H: int = YOLO_LABEL_INDEX_W + 1
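To double-check the index layout, a small sketch of my own that slices one 30-element cell prediction with the constants above (the cell values are dummies):
```
import numpy as np
from constant import (
    YOLO_PREDICTION_NUM_CLASSES,
    YOLO_PREDICTION_INDEX_CP1,
    YOLO_PREDICTION_INDEX_X1,
    YOLO_PREDICTION_INDEX_H1,
    YOLO_PREDICTION_INDEX_CP2,
    YOLO_PREDICTION_INDEX_X2,
    YOLO_PREDICTION_INDEX_H2,
)

cell = np.arange(30)  # one cell prediction: C=20 class scores + 2 boxes * (cp,x,y,w,h)
classes = cell[:YOLO_PREDICTION_NUM_CLASSES]                        # indices 0..19
cp1 = cell[YOLO_PREDICTION_INDEX_CP1]                               # index 20
box1 = cell[YOLO_PREDICTION_INDEX_X1:YOLO_PREDICTION_INDEX_H1 + 1]  # (x,y,w,h) 21..24
cp2 = cell[YOLO_PREDICTION_INDEX_CP2]                               # index 25
box2 = cell[YOLO_PREDICTION_INDEX_X2:YOLO_PREDICTION_INDEX_H2 + 1]  # (x,y,w,h) 26..29
```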
utils package
"""
YOLO v1 utility module based on YOLOv1 from Scratch by Aladdin Persson.
https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/object_detection/YOLO/utils.py
[References]
https://datascience.stackexchange.com/q/118656/68313
TODO:
clarify whether it is OK to clip values without considering the back propagation.
"""
# pylint: disable=too-many-statements
import logging
import tensorflow as tf
from constant import (
TYPE_FLOAT,
EPSILON,
)
from util_logging import (
get_logger,
)
# --------------------------------------------------------------------------------
# Constant
# --------------------------------------------------------------------------------
MAX_EXPECTED_W_PREDICTION: TYPE_FLOAT = TYPE_FLOAT(5.0)
MAX_EXPECTED_H_PREDICTION: TYPE_FLOAT = TYPE_FLOAT(5.0)
# --------------------------------------------------------------------------------
# Logging
# --------------------------------------------------------------------------------
_logger: logging.Logger = get_logger(__name__)
# --------------------------------------------------------------------------------
# Utility
# --------------------------------------------------------------------------------
def intersection_over_union(
boxes_preds: tf.Tensor,
boxes_labels: tf.Tensor,
box_format: str ="midpoint"
) -> tf.Tensor:
"""
Calculates intersection over union
[Original YOLO v1 paper]
```
Each bounding box consists of 5 predictions: cp, x, y, w, h where cp is confidence.
The (x, y) is the center of the box relative to the bounds of the grid cell.
The w and h are predicted relative to the whole image.
confidence is the IOU between the predicted box and any ground truth box.
We normalize the bounding box width and height by the image width and height
so that they fall between 0 and 1
```
e.g. (x, y, w, h) = (0.3, 0.7, 0.6, 1.1). h can be 1.1 because the bounding
box that surrounds the object can outgrow the image itself.
Check if w, h within MAX_EXPECTED_W_PREDICTION and MAX_EXPECTED_H_PREDICTION.
See:
https://youtu.be/n9_XyCGr-MI?list=PLhhyoLH6Ijfw0TpCTVTNk42NN08H6UvNq&t=281
towardsdatascience.com/yolo-made-simple-interpreting-the-you-only-look-once-paper-55f72886ab73
Parameters:
boxes_preds (tensor): Predictions of Bounding Boxes (x, y, w, h) in shape:(N, 4)
boxes_labels (tensor): Correct labels of Bounding Boxes (x, y, w, h) in shape:(N, 4)
box_format (str): midpoint/corners, if boxes (x,y,w,h) or (x1,y1,x2,y2)
Returns:
IOU(Intersection over union) for all examples
"""
_name: str = "intersection_over_union()"
assert boxes_preds.ndim == 2, \
f"expected 2 dimensions with shape (N, 4), got {boxes_preds.ndim} dimensions."
assert boxes_labels.ndim == 2, \
f"expected 2 dimensions with shape (N, 4), got {boxes_labels.ndim} dimensions."
assert boxes_preds.shape == boxes_labels.shape, \
f"expected same shape, got {boxes_preds.shape} and {boxes_labels.shape}."
assert boxes_preds.shape[1] == boxes_labels.shape[1] == 4 # (x/0, y/1, w/2, h/3)
# Check if w, h within MAX_EXPECTED_W_PREDICTION and MAX_EXPECTED_H_PREDICTION
w_predicted: tf.Tensor = boxes_preds[..., 2]
h_predicted: tf.Tensor = boxes_preds[..., 3]
assert tf.math.reduce_all(w_predicted <= MAX_EXPECTED_W_PREDICTION + EPSILON), \
"expected w_predicted <= MAX_EXPECTED_W_PREDICTION, " \
f"got\n{w_predicted[(w_predicted > MAX_EXPECTED_W_PREDICTION + EPSILON)]}"
assert tf.math.reduce_all(h_predicted <= MAX_EXPECTED_H_PREDICTION + EPSILON), \
"expected height <= MAX_EXPECTED_H_PREDICTION, " \
f"got\n{h_predicted[(h_predicted > MAX_EXPECTED_H_PREDICTION + EPSILON)]}"
_logger.debug(
"%s: sample prediction (x, y, w, h) = %s",
_name,
(
boxes_preds[..., 0, 0],
boxes_preds[..., 0, 1],
boxes_preds[..., 0, 2],
boxes_preds[..., 0, 3])
)
N: int = boxes_preds.shape[0] # pylint: disable=invalid-name
_logger.debug("%s: total cells [%s]", _name, N)
# --------------------------------------------------------------------------------
# Corner coordinates of Bounding Boxes and Ground Truth
# --------------------------------------------------------------------------------
if box_format == "midpoint":
# predicted box left x coordinate
box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
# predicted box right x coordinate
box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
# predicted box top y coordinate (y grows downward in image coordinates)
box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
# predicted box bottom y coordinate
box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
box2_x1 = boxes_labels[..., 0:1] - boxes_labels[..., 2:3] / 2
box2_y1 = boxes_labels[..., 1:2] - boxes_labels[..., 3:4] / 2
box2_x2 = boxes_labels[..., 0:1] + boxes_labels[..., 2:3] / 2
box2_y2 = boxes_labels[..., 1:2] + boxes_labels[..., 3:4] / 2
elif box_format == "corners":
box1_x1 = boxes_preds[..., 0:1]
box1_y1 = boxes_preds[..., 1:2]
box1_x2 = boxes_preds[..., 2:3]
box1_y2 = boxes_preds[..., 3:4] # (N, 1)
box2_x1 = boxes_labels[..., 0:1]
box2_y1 = boxes_labels[..., 1:2]
box2_x2 = boxes_labels[..., 2:3]
box2_y2 = boxes_labels[..., 3:4]
else:
raise RuntimeError(f"invalid box_format {box_format}")
# --------------------------------------------------------------------------------
# Intersection
# - YOLO v1 predicts bbox w/h as relative to image w/h, e.g. 0.6 * image w/h.
# --------------------------------------------------------------------------------
x1 = tf.math.maximum(box1_x1, box2_x1) # pylint: disable=invalid-name
y1 = tf.math.maximum(box1_y1, box2_y1) # pylint: disable=invalid-name
# The intersection right/bottom corners take the minimum of the two boxes' corners.
x2 = tf.math.minimum(box1_x2, box2_x2) # pylint: disable=invalid-name
y2 = tf.math.minimum(box1_y2, box2_y2) # pylint: disable=invalid-name
assert x1.shape == (N, 1)
_logger.debug(
"%s: sample intersection corner coordinates (x1, y1, x2, y2) = %s",
_name, (x1[..., 0, 0], y1[..., 0, 0], x2[..., 0, 0], y2[..., 0, 0])
)
# Clip at 0 in case there is no intersection.
width: tf.Tensor = x2 - x1
height: tf.Tensor = y2 - y1
width = tf.clip_by_value(
width, clip_value_min=TYPE_FLOAT(0), clip_value_max=MAX_EXPECTED_W_PREDICTION
)
height = tf.clip_by_value(
height, clip_value_min=TYPE_FLOAT(0), clip_value_max=MAX_EXPECTED_H_PREDICTION
)
intersection: tf.Tensor = tf.math.multiply(width, height)
# --------------------------------------------------------------------------------
# Union
# --------------------------------------------------------------------------------
box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))
union: tf.Tensor = (box1_area + box2_area - intersection + EPSILON)
# --------------------------------------------------------------------------------
# IOU between (0, 1)
# --------------------------------------------------------------------------------
IOU: tf.Tensor = tf.clip_by_value( # pylint: disable=invalid-name
# EPSILON to avoid div by 0.
tf.math.divide(intersection, union),
clip_value_min=TYPE_FLOAT(0.0),
clip_value_max=TYPE_FLOAT(1.0)
)
_logger.debug("%s: sample IOU = %s", _name, IOU[0])
assert IOU.shape == (N, 1), f"expected IOU shape {(N, 1)}, got {IOU.shape}"
assert tf.math.reduce_all(IOU <= TYPE_FLOAT(1.0+EPSILON)), \
f"expected IOU <= 1.0, got\n{IOU[(IOU > TYPE_FLOAT(1.0+EPSILON))]}"
return IOU
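A quick hand-checked example of intersection_over_union (assuming the module is importable as utils; the expected IOU 0.09 / 0.23 ≈ 0.3913 is worked out by hand):
```
import tensorflow as tf
from constant import TYPE_FLOAT
from utils import intersection_over_union

# Two boxes in midpoint format (x, y, w, h), shape (N=1, 4).
pred = tf.constant([[0.5, 0.5, 0.4, 0.4]], dtype=TYPE_FLOAT)  # corners (0.3,0.3)-(0.7,0.7)
true = tf.constant([[0.6, 0.6, 0.4, 0.4]], dtype=TYPE_FLOAT)  # corners (0.4,0.4)-(0.8,0.8)
# intersection = 0.3 * 0.3 = 0.09, union = 0.16 + 0.16 - 0.09 = 0.23
print(intersection_over_union(pred, true))  # ~[[0.3913]]
```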
util_logging package
import logging
from typing import (
Optional,
)
DEFAULT_LOG_LEVEL: int = logging.INFO
def get_logger(name: str, level: Optional[int] = None) -> logging.Logger:
"""Logger instance factory method
See https://docs.python.org/2/howto/logging.html#logging-advanced-tutorial
The logger name should follow the package/module hierarchy
Args:
name: logger name following the package/module hierarchy
level: optional log level
Returns:
logger instance
"""
_logger = logging.getLogger(name=name)
if level:
_logger.setLevel(level)
else:
_logger.setLevel(DEFAULT_LOG_LEVEL)
return _logger
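For reference, usage is simply:
```
import logging
from util_logging import get_logger

log = get_logger(name=__name__, level=logging.DEBUG)
log.debug("message with %s", "arguments")
```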
Test
"""
Pytest for YOLO loss function
"""
import sys
import logging
import numpy as np
import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras # pylint: disable=unused-import
from constant import (
DEBUG_LEVEL,
DUMP,
TYPE_FLOAT,
YOLO_GRID_SIZE,
YOLO_PREDICTION_NUM_CLASSES,
YOLO_PREDICTION_NUM_BBOX,
YOLO_PREDICTION_NUM_PRED,
YOLO_LABEL_INDEX_CP,
)
from loss import (
YOLOLoss
)
# sys.path.append("../../../../../lib")
# sys.path.append("../src")
from util_logging import (
get_logger
)
# --------------------------------------------------------------------------------
# Logging
# --------------------------------------------------------------------------------
_logger: logging.Logger = get_logger(__name__, level=DEBUG_LEVEL)
def intersection_over_union(box1, box2):
box1x1 = box1[..., 0] - box1[..., 2] / 2
box1y1 = box1[..., 1] - box1[..., 3] / 2
box1x2 = box1[..., 0] + box1[..., 2] / 2
box1y2 = box1[..., 1] + box1[..., 3] / 2
box2x1 = box2[..., 0] - box2[..., 2] / 2
box2y1 = box2[..., 1] - box2[..., 3] / 2
box2x2 = box2[..., 0] + box2[..., 2] / 2
box2y2 = box2[..., 1] + box2[..., 3] / 2
box1area = torch.abs((box1x1 - box1x2) * (box1y1 - box1y2))
box2area = torch.abs((box2x1 - box2x2) * (box2y1 - box2y2))
x1 = torch.max(box1x1, box2x1)
y1 = torch.max(box1y1, box2y1)
x2 = torch.min(box1x2, box2x2)
y2 = torch.min(box1y2, box2y2)
intersection_area = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)
iou = intersection_area / (box1area + box2area - intersection_area + 1e-6)
return iou
class TorchYoloLoss(nn.Module):
def __init__(self, num_classes=20, num_boxes=2):
super(TorchYoloLoss, self).__init__()
self.mse = nn.MSELoss(reduction="sum")
self.num_classes = num_classes
self.num_boxes = num_boxes
self.lambda_noobj = 0.5
self.lambda_coord = 5
def forward(self, predictions, target):
_name: str = "forward()"
batch_size = target.size()[0]
predictions = predictions.reshape(-1, 7, 7, self.num_classes + 5 * self.num_boxes)
class_predictions = predictions[..., :self.num_classes]
class_target = target[..., :self.num_classes]
indicator_i = target[..., self.num_classes].unsqueeze(3)
DUMP and _logger.debug("%s indicator_i: shape %s\n%s", _name, indicator_i.shape, indicator_i)
# class loss
class_loss = self.mse(
indicator_i * class_predictions,
indicator_i * class_target
) / float(batch_size)
_logger.debug("%s: class_loss[%s]", _name, class_loss)
box_predictions = predictions[..., self.num_classes:].reshape(-1, 7, 7, self.num_boxes, 5)
# print(f"box_predictions: shape:{box_predictions.shape}\n{box_predictions}")
DUMP and _logger.debug(
"%s: box_predictions shape:%s\n[%s]",
_name, box_predictions.shape, box_predictions
)
box_target = target[..., self.num_classes:]
box_target = torch.cat((box_target, box_target), dim=3).reshape(-1, 7, 7, self.num_boxes, 5)
iou = torch.cat(
[
intersection_over_union(
box_predictions[..., i, 1:],
box_target[..., i, 1:]
).unsqueeze(3).unsqueeze(0)
for i in range(self.num_boxes)
],
dim=0
)
# print(f"iou: shape:{iou.shape}\n{iou}")
best_iou, best_box = torch.max(iou, dim=0)
# print(f"best_box: shape:{best_box.shape}\n{best_box}")
DUMP and _logger.debug("%s: best_box[%s]", _name, best_box)
first_box_mask = torch.cat((torch.ones_like(indicator_i), torch.zeros_like(indicator_i)), dim=3)
second_box_mask = torch.cat((torch.zeros_like(indicator_i), torch.ones_like(indicator_i)), dim=3)
# print(f"first_box_mask: shape:{first_box_mask.shape}\n{first_box_mask}")
# print(f"second_box_mask: shape:{second_box_mask.shape}\n{second_box_mask}")
indicator_ij = (indicator_i * ((1 - best_box) * first_box_mask + best_box * second_box_mask))
# print(f"indicator_ij: shape:{indicator_ij.shape}\n{indicator_ij}")
indicator_ij = indicator_ij.unsqueeze(4)
# print(f"indicator_ij: shape:{indicator_ij.shape}\n{indicator_ij}")
box_target[..., 0] = torch.cat((best_iou, best_iou), dim=3)
box_target = indicator_ij * box_target
# localization loss
xy_loss = self.lambda_coord * self.mse(
indicator_ij * box_predictions[..., 1:3],
indicator_ij * box_target[..., 1:3]
) / float(batch_size)
_logger.debug("%s: localization_xy_loss[%s]", _name, xy_loss)
wh_loss = self.lambda_coord * self.mse( # pylint: disable=no-member
indicator_ij * torch.sign(box_predictions[..., 3:5]) * torch.sqrt(
torch.abs(box_predictions[..., 3:5]) + 1e-6),
indicator_ij * torch.sign(box_target[..., 3:5]) * torch.sqrt(torch.abs(box_target[..., 3:5]) + 1e-6)
) / float(batch_size)
_logger.debug("%s: localization_wh_loss[%s]", _name, wh_loss)
# object loss
object_loss = self.mse(
indicator_ij * box_predictions[..., 0:1],
indicator_ij * box_target[..., 0:1]
) / float(batch_size)
_logger.debug("%s: confidence_loss[%s]", _name, object_loss)
# no object loss
no_object_loss = self.lambda_noobj * self.mse(
(1 - indicator_ij) * box_predictions[..., 0:1],
(1 - indicator_ij) * box_target[..., 0:1]
) / float(batch_size)
DUMP and _logger.debug(
"%s: (1-indicator_ij) * box_predictions[..., 0:1] \n%s",
_name, (1 - indicator_ij) * box_predictions[..., 0:1]
)
_logger.debug("%s: no_obj_confidence_loss[%s]", _name, no_object_loss)
loss = xy_loss + wh_loss + object_loss + no_object_loss + class_loss
_logger.debug("%s: loss[%s]", _name, loss)
return loss
def test_compare_with_torch_rand():
"""
Objective:
Verify the loss values from Pytorch and TF implementations with
random value initialization are close.
Expected:
1. Loss difference is within a limit.
"""
N: int = 4 # Batch size pylint: disable=invalid-name
S: int = YOLO_GRID_SIZE # pylint: disable=invalid-name
B: int = YOLO_PREDICTION_NUM_BBOX # pylint: disable=invalid-name
C: int = YOLO_PREDICTION_NUM_CLASSES # pylint: disable=invalid-name
P: int = YOLO_PREDICTION_NUM_PRED # pylint: disable=invalid-name
MAX_ALLOWANCE: int = 25 # pylint: disable=invalid-name
# --------------------------------------------------------------------------------
# random value initialization
# --------------------------------------------------------------------------------
# Bounding box predictions (cp, x, y, w, h)
pred: np.ndarray = np.random.random((N, S, S, C + B * P)).astype(TYPE_FLOAT)
# --------------------------------------------------------------------------------
# Bounding box ground truth
# --------------------------------------------------------------------------------
true: np.ndarray = np.random.random((N, S, S, C + P)).astype(TYPE_FLOAT)
# Set 0 or 1 to the confidence score of the ground truth.
# In ground truth, confidence=1 when there is an object in a cell, or 0.
true[..., YOLO_LABEL_INDEX_CP] = \
np.random.randint(low=0, high=2, size=N * S * S).astype(TYPE_FLOAT).reshape((N, S, S))
# Set only one of the C classes to 1 because the object class in a cell is known
# to be a specific class e.g. a dog. high=C keeps the index within the class slots.
index_to_true_class = np.random.randint(low=0, high=C, size=1)
true[..., :YOLO_LABEL_INDEX_CP] = TYPE_FLOAT(0)
true[..., index_to_true_class] = TYPE_FLOAT(1)
# --------------------------------------------------------------------------------
# Loss from Torch
# --------------------------------------------------------------------------------
y_pred_torch = torch.Tensor(pred)
y_true_torch = torch.Tensor(true)
_logger.debug("-" * 80)
_logger.debug("Torch")
_logger.debug("-" * 80)
torch_loss_instance = TorchYoloLoss()
loss_from_torch: np.ndarray = torch_loss_instance.forward(
predictions=y_pred_torch,
target=y_true_torch
).numpy()
# --------------------------------------------------------------------------------
# Loss from TF
# --------------------------------------------------------------------------------
y_pred_tf: tf.Tensor = tf.constant(pred)
y_true_tf: tf.Tensor = tf.constant(true)
_logger.debug("-" * 80)
_logger.debug("TF")
_logger.debug("-" * 80)
tf_loss_instance = YOLOLoss()
loss_from_tf: tf.Tensor = tf_loss_instance(
y_true=y_true_tf,
y_pred=y_pred_tf
).numpy()
# --------------------------------------------------------------------------------
# Test condition #1: loss diff is within a limit.
# The Torch and TF calculations give different values; the cause is not yet clear.
# The classification loss matches, but the terms where the Torch implementation
# computes MSE on values multiplied by indicator_ij differ.
# --------------------------------------------------------------------------------
assert np.allclose(a=loss_from_torch, b=loss_from_tf, atol=TYPE_FLOAT(MAX_ALLOWANCE)), \
f"loss_from_torch:{loss_from_torch} loss_from_tf:{loss_from_tf}"
if __name__ == "__main__":
logging.basicConfig(level=DEBUG_LEVEL)
for _ in range(5):
test_compare_with_torch_rand()
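Side note: for reproducible comparisons between the two implementations, it may help to seed all three random sources before generating the inputs (a sketch, not part of the test above):
```
import numpy as np
import torch
import tensorflow as tf

np.random.seed(42)      # the test inputs are generated with NumPy
torch.manual_seed(42)   # Torch-side determinism
tf.random.set_seed(42)  # TF-side determinism
```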