Generative Adversarial Networks (GANs) - A Beginners Overview and Experiments.

During my internship at the Responsible Artificial Intelligence Lab, I conducted research into Generative Adversarial Networks (GANs). I started by learning the fundamentals with the MNIST dataset and then advanced my work by combining a Deep Convolutional GAN (DCGAN) architecture with a Conditional GAN (CGAN). Below is a summary of my accomplishments, key takeaways, challenges faced, and future objectives:

Understanding the fundamentals of GANs:

GANs are essentially made up of two competing neural networks:

Generator: This network uses random noise as input to produce synthetic data. To fool the second network (its adversary), the generator aims to produce realistic images from the data fed as input.

Discriminator: This network receives both real data (from a dataset) and fake data (from the Generator). Its objective is to classify the real data as real correctly and the fake data as fake.

Real life Analogy of how GANs work

The Generator can be thought of as a forger having an adversary, the Discriminator (shop owner). The forger learns to create wine in such a way that the shop owner should not be able to distinguish it as fake anymore. The competition between these two teams is what improves their knowledge, until the forger (generator) succeeds in creating wine that seems realistic to the shop owner.

The competition between these networks creates a feedback loop: as the Generator gets better at creating convincing fakes, the Discriminator must become more discerning. Training them simultaneously requires a delicate balance—if one network gets too strong too fast, the other struggles to improve.

An overview of the GAN Adversarial Networks.

1. Generator:

Purpose:

The primary role of the Generator is to create fake data that resembles real data. In the case of image generation, it takes random noise as input and transforms it into images that ideally should look indistinguishable from real images in the training dataset.

How It Works:

Input: The Generator takes a random noise vector (usually sampled from a normal distribution) as its input. This noise serves as a seed from which the Generator can create various outputs.

Architecture: The Generator typically employs a series of transposed convolutional layers (also known as deconvolutional layers) to up sample the input noise. These layers reverse the down sampling process of convolutional layers used in the Discriminator. Activation Functions: The final layer usually employs a Tanh activation function to scale the output pixel values between -1 and 1 (commonly used in image generation). Intermediate layers often use ReLU or Leaky ReLU to maintain non-linearity and improve the flow of gradients.

Training Process: During training, the Generator aims to maximize the probability of deceiving the Discriminator by producing fake images that look real. It learns through backpropagation, receiving feedback from the Discriminator about how "real" or "fake" its outputs are.

2. Discriminator

Purpose:

The Discriminator's role is to differentiate between real data and the fake data generated by the Generator. It acts as a binary classifier, aiming to maximize its accuracy in identifying the real data from the fake.

How It Works

Input: The Discriminator takes both real images from the training dataset and fake images generated by the Generator as input.

Architecture: Usually, it downsamples the input using convolutional layers and then pooling layers. These layers assist the model in extracting significant features and learning spatial hierarchies. Activation Functions: A sigmoid activation function is typically used by the output layer to generate a probability score ranging from 0 (fake) to 1 (actual).

Training Process: The Discriminator is trained to maximize the probability of correctly classifying real and fake images. It learns through backpropagation based on the errors it makes during predictions. The Discriminator’s feedback helps the Generator improve by letting it know how “real” or “fake” its generated outputs are.

Deep Convolutional GAN (DCGAN)

For my first GAN, I explored the Deep Convolutional GAN (DCGAN). Deep Convolutional GANs (DCGANs) are a significant advancement in the GAN framework, proposed by Radford, Metz, and Chintala in their 2015 paper, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. This paper introduced some developments that could improve the performance of standard GANs. It provided valuable insights into the architectural features of DCGANs and the rationale behind them, making it easier to understand why these modifications lead to improved performance.

Key Features of DCGAN:

Standard GANs rely on fully connected layers, which don’t do a great job of capturing the spatial relationships in images. DCGANs, on the other hand, use convolutional layers, which are much better at recognizing features like edges and textures.

The Generator used transposed convolutions to upsample the data, creating more detailed images.

It has batch normalization to stabilize the training and avoid wild fluctuations.

The discriminator uses LeakyReLU as its activation function to ensure better gradient flow and Tanh in the Generator’s output layer to normalize data between -1 and 1.

Experiments

For my experiments, I used the MNIST dataset, which has grayscale images of handwritten digits (0-9). I implemented a training loop for 200 epochs, focusing on optimizing both the Generator and Discriminator. The losses reported after 200 epochs were as follows:

Generator Loss: 1.7830
Discriminator Loss: 0.5462

These results revealed that while the Discriminator was performing well, the Generator struggled to produce convincing images. This was because the discriminator was too strict on penalizing the generator for producing images it detected as fake. Such an imbalance in training dynamics is a common problem in GANs, where the Generator’s progress is hindered due to the Binary Cross-Entropy (BCE) loss function, which can lead to vanishing gradients and slow learning.

Challenges with the loss function (Binary Cross Entropy Cost)

1. Issues with Binary Cross-Entropy (BCE) Loss: While it provides a clear metric for distinguishing between real and fake images, it can lead to problems during training. Specifically, BCE can result in poor gradient flow when the Discriminator becomes too confident, assigning low probability scores to generated samples. This overconfidence can halt the learning process for the Generator, making it difficult to improve and contribute to other challenges like mode collapse.

2. Mode Collapse: Mode collapse is a phenomenon where the Generator produces a limited variety of outputs, often generating the same or very similar images for different inputs. This issue can severely restrict the diversity of the generated data, undermining the GAN's ability to learn and replicate the underlying distribution of the training data. Mode collapse is particularly problematic in applications where diversity is essential, such as in image synthesis.

3. Vanishing Gradients: Another issue was vanishing gradients, which can occur when the Discriminator becomes too powerful relative to the Generator. When the Discriminator learns to distinguish real from fake images too effectively, the Generator receives minimal gradient feedback, which is essential for updating its weights.

This situation can lead to stagnation in the Generator's learning, further exacerbating mode collapse and hindering overall model performance.

Solution: Earth Mover's Distance to BCE

One solution to the limitations of Binary Cross-Entropy (BCE) is the Earth Mover's Distance (EMD), also known as Wasserstein distance. EMD provides a better way to compare the distributions of real and generated data.

What is Earth Mover's Distance (EMD)? EMD measures the minimum "cost" to change one distribution into another. Think of it like moving a pile of dirt (representing generated samples) to create a new pile that looks like another distribution (representing real samples). EMD calculates how much effort it takes to make this transformation, considering how far each piece of dirt has to move.

The EMD between two probability distributions can be expressed as:

Using EMD as a loss function in GANs has several benefits over BCE:

Improved Gradient Flow: EMD gives clearer gradients when the Discriminator and Generator are far apart. Unlike BCE, where gradients can disappear if the Discriminator is too confident, WGAN avoids saturating gradients by using a critic instead of a strict discriminator.
Mitigating Mode Collapse: EMD encourages the Generator to explore more of the output space. This helps prevent mode collapse, where the Generator produces only a limited range of outputs.
Meaningful Training Progress: Since EMD focuses on the distance between distributions rather than simple yes-or-no classifications, it better reflects how well the Generator is performing. This leads to a more stable and understandable training process.

Wasserstein Loss (w-loss)

The use of EMD in GANs is often done through the Wasserstein loss (w-loss). This loss function relates directly to EMD and makes training GANs more practical.

Why w-loss Works:

Stronger Convergence: w-loss stabilizes training by providing clear signals for both the Discriminator and Generator. It helps both models learn better, especially when dealing with difficult data.
Avoiding Vanishing Gradients: Even when the Discriminator is very confident, w-loss keeps the loss meaningful. This prevents the Generator from stalling during training by ensuring it gets useful gradients.
Lipschitz Constraint Requirement: w-loss requires a Lipschitz constraint on the Discriminator. This means the Discriminator's output should change smoothly. We can enforce this constraint using methods like gradient penalty, which helps create stable training dynamics.

By switching from BCE to EMD and using w-loss, I noticed significant improvements in training. The process became more stable, and the Generator produced more diverse and realistic outputs.

Lipschitz Constraint and Solutions

To tackle stability and convergence issues, I looked into ways to enforce the Lipschitz constraint. Here are two main methods I found:

Weight Clipping: This simple approach involves limiting the weights of the Discriminator to a specific range (like [-0.01, 0.01]) after each update. While this method enforces the Lipschitz constraint, it has drawbacks. Clipping can make the Discriminator's function sharp and discontinuous, which can hurt its learning ability and reduce its flexibility.
Gradient Penalty: This method adds a penalty to the Discriminator's loss based on how much the output changes for its input. The gradient penalty encourages smoother transitions in the Discriminator’s decisions. Gradient penalty is more effective because it keeps the Discriminator flexible while ensuring stable training.

Gradient penalty is mostly used in Wasserstein GANs (WGANs), because it reduces problems like mode collapse and vanishing gradients. It allows the Generator to receive meaningful gradients, which helps it learn better and produce diverse outputs.

Training Results for DCGAN:

For 50 epochs, the training results for the DCGAN with Binary Cross-Entropy (BCE) loss were:

Generator Loss: 0.9231
Discriminator Loss: 0.6941

These results show that the DCGAN model is making solid progress, with both networks actively pushing each other to improve. The Generator is learning to generate more realistic images while the Discriminator continues to refine its ability to tell them apart.

Exploring Conditional GAN with DCGAN architecture

After experimenting with DCGAN, I decided to take on Conditional GANs (cGANs), which offer an exciting twist on the traditional GAN framework.

What is a CGAN?

Conditional GANs extend the GAN concept by allowing control over the generated output through conditioning. They take input from the user about which class they wish to generate an image for and they generate an image belonging to the same class. This conditioning mechanism enables the model to produce outputs that are more aligned to a specific criteria. For instance, when working with the MNIST dataset, I could specify which digit I wanted to generate (like “1” or “7”), and the model would respond by producing a corresponding image of that digit or you could specify that the model should generate a face with a pointed nose and blonde hair. This added layer of control makes cGANs incredibly powerful for tasks where specificity is required.

What Makes a CGAN Special?

The Conditional GAN introduces an additional input: conditioning information. This information can be anything that adds context to what the image should look like, such as:

A label (like a number indicating what digit to generate),
A class (like "cat" or "dog"),
Even a text description (like "moon across the sea").

The idea is to give both the Generator and the Discriminator some context, enabling the network to generate more targeted and relevant images.

Architecture Changes in CGAN

Generator Changes

In a standard GAN, the Generator takes a random noise vector z and outputs an image.

In a CGAN, the Generator takes two inputs:

1. Random noise z.

2. Conditioning information y (like a label, e.g., "4" for generating a handwritten digit "4").

These inputs are concatenated together into a single input vector, which then gets processed through the Generator network to produce an image that should match the given condition.

Discriminator Changes

In a CGAN, the Discriminator also receives the conditioning information y along with the image. The image and the label are combined, often by concatenating the label as an extra channel in the image. This setup forces the Discriminator to not only determine if an image is real or fake but also whether it matches the condition.

By adding conditioning, we can:

Directly Control the Output: You decide what kind of image you want by providing a condition, making the process predictable.
Improve Training Stability: Conditioning can help prevent mode collapse (when the Generator produces the same type of output repeatedly) by providing diverse training data through different labels.
Generate Diverse Outputs: A CGAN can create varied images for the same condition since the noise input allows for creativity, while the condition keeps the result relevant.

Implementation:

To implement the CGAN architecture, I started by utilizing one-hot encoded labels for conditioning. This approach allows the model to interpret the label data effectively:

Generator Modifications: In the Generator, I concatenated the noise vector with the one-hot encoded label. This means that when the model generates an image, it doesn’t just rely on random noise but also considers the label, guiding the generation process. This concatenation effectively integrates the information about which digit to produce, making the Generator's task more straightforward.
Discriminator Modifications: Similarly, in the Discriminator, the label was treated as an additional input channel. By including the label in the Discriminator’s input, I ensured that it could evaluate whether the generated image correctly corresponds to the specified digit. This adjustment helps the Discriminator learn the association between the label and the image, making it more discerning during the training process.

Training:

At epoch 20, the training showed encouraging progress, with the following results:

Generator Loss: 0.7779
Discriminator Loss: 0.6120

These loss values provide valuable insight into the model's performance:

Interpreting the Loss Values

Generator Loss (0.7779): This loss value indicates how well the Generator is doing in creating images that can deceive the Discriminator. A loss of 0.7779 suggests the Generator is learning effectively, producing images that look increasingly realistic. However, there is still room for improvement to ensure the generated images are even more convincing and accurately align with the conditioning information (like labels).
Discriminator Loss (0.6120): With a loss value of 0.6120, the Discriminator is performing well in distinguishing between real and generated images. Ideally, a Discriminator loss around 0.5 means it is being fooled about half the time, which indicates a perfect balance between the Generator and the Discriminator. A slightly higher value suggests that the Discriminator is doing a solid job but still finds some generated images challenging to classify.

Loss Curve

This stage of training highlights that the cGAN is on the right track, with both networks improving and pushing each other toward better results.

Conclusion:

In conclusion, diving into Generative Adversarial Networks has been a fun experiment that improved my understanding of deep learning, particularly through the exploration of Deep Convolutional GANs (DCGANs) and Conditional GANs (CGANs). Although it was difficult to implement the architectures of the GANs and ensure a delicate balance between the generator and discriminator, it was worth it. This experience has deepened my appreciation for the capabilities of GANs in generating realistic data and has taught me how important it is to experiment, as even small tweaks can make a big difference in how well the models work. I aim to refine my knowledge of GANs further and explore their applications in various fields.

Understanding the fundamentals of GANs:

Real life Analogy of how GANs work

An overview of the GAN Adversarial Networks.

1. Generator:

Purpose:

How It Works:

2. Discriminator

Purpose:

How It Works

Deep Convolutional GAN (DCGAN)

Key Features of DCGAN:

Experiments

Challenges with the loss function (Binary Cross Entropy Cost)

Solution: Earth Mover's Distance to BCE

Recommended by LinkedIn

Wasserstein Loss (w-loss)

Why w-loss Works:

Lipschitz Constraint and Solutions

Training Results for DCGAN:

Exploring Conditional GAN with DCGAN architecture

What is a CGAN?

What Makes a CGAN Special?

Architecture Changes in CGAN

Generator Changes

Discriminator Changes

Implementation:

Training:

Interpreting the Loss Values

Loss Curve

Conclusion:

More articles by Kenneth Dotse

Ghanaian Food Vision model

Others also viewed

Impact of Linear Activation on Convolution Networks

Generative Adversarial Networks: What it is, How they work, and My Experiments

The 10 Deep Learning Methods AI Practitioners Need to Apply

From scratch to XAI - A personal 1-week experience developing a simple explainable artificial neural network

Neural Forests: Weaving Intelligence with a New Hybrid Machine Learning Paradigm

Yandex, ISP RAS, and Sechenov University Create a Neural Network Using a Federated Machine Learning Approach

🧠 Day 5: Key Moments in Machine Learning History

Generative Adversarial Networks (GANs): An Introduction

A simple CNN In TensorFlow: Practical CIFAR-10 Guide

The Math Behind the Foundation of AI

Explore content categories