Generative Adversarial Networks (GANs) - A Beginner's Overview and Experiments
During my internship at the Responsible Artificial Intelligence Lab, I conducted research into Generative Adversarial Networks (GANs). I started by learning the fundamentals with the MNIST dataset and then advanced my work by combining a Deep Convolutional GAN (DCGAN) architecture with a Conditional GAN (CGAN). Below is a summary of my accomplishments, key takeaways, challenges faced, and future objectives:
Understanding the fundamentals of GANs:
GANs are essentially made up of two competing neural networks:
Generator: This network takes random noise as input and produces synthetic data. Its goal is to make that output realistic enough to fool the second network (its adversary).
Discriminator: This network receives both real data (from a dataset) and fake data (from the Generator). Its objective is to classify the real data as real correctly and the fake data as fake.
A real-life analogy of how GANs work
Think of the Generator as a forger of counterfeit wine and the Discriminator as a shop owner. The forger keeps refining the counterfeit so the shop owner can no longer tell it apart from the genuine article, while the shop owner keeps getting better at spotting fakes. This competition sharpens both of them, until the forger (Generator) produces wine that looks genuine to the shop owner (Discriminator).
The competition between these networks creates a feedback loop: as the Generator gets better at creating convincing fakes, the Discriminator must become more discerning. Training them simultaneously requires a delicate balance—if one network gets too strong too fast, the other struggles to improve.
An overview of the two adversarial networks
1. Generator:
Purpose:
The primary role of the Generator is to create fake data that resembles real data. In the case of image generation, it takes random noise as input and transforms it into images that ideally should look indistinguishable from real images in the training dataset.
How It Works:
Input: The Generator takes a random noise vector (usually sampled from a normal distribution) as its input. This noise serves as a seed from which the Generator can create various outputs.
Architecture: The Generator typically employs a series of transposed convolutional layers (sometimes called deconvolutional layers) to upsample the input noise. These layers reverse the downsampling performed by the convolutional layers in the Discriminator.
Activation Functions: The final layer usually employs a Tanh activation to scale the output pixel values between -1 and 1 (common in image generation). Intermediate layers often use ReLU or Leaky ReLU to maintain non-linearity and improve the flow of gradients.
Training Process: During training, the Generator aims to maximize the probability of deceiving the Discriminator by producing fake images that look real. It learns through backpropagation, receiving feedback from the Discriminator about how "real" or "fake" its outputs are.
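Putting the pieces above together, here is a minimal Generator sketch (written in PyTorch; the layer sizes and the name latent_dim are illustrative, not the exact configuration from my experiments):

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # Project the 1x1 noise "image" up to a 7x7 feature map
            nn.ConvTranspose2d(latent_dim, 128, kernel_size=7, stride=1, padding=0),
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            # 7x7 -> 14x14
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            # 14x14 -> 28x28, single grayscale channel
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # scale pixel values to [-1, 1]
        )

    def forward(self, z):
        # z has shape (batch, latent_dim); treat it as a 1x1 spatial map
        return self.net(z.view(z.size(0), -1, 1, 1))

z = torch.randn(16, 100)        # a batch of random noise vectors
fake_images = Generator()(z)    # shape: (16, 1, 28, 28)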
2. Discriminator
Purpose:
The Discriminator's role is to differentiate between real data and the fake data generated by the Generator. It acts as a binary classifier, aiming to maximize its accuracy in identifying the real data from the fake.
How It Works
Input: The Discriminator takes both real images from the training dataset and fake images generated by the Generator as input.
Architecture: The Discriminator usually downsamples its input with convolutional layers followed by pooling layers. These layers help the model extract significant features and learn spatial hierarchies.
Activation Functions: The output layer typically uses a sigmoid activation to produce a probability score between 0 (fake) and 1 (real).
Training Process: The Discriminator is trained to maximize the probability of correctly classifying real and fake images. It learns through backpropagation based on the errors it makes during predictions. The Discriminator’s feedback helps the Generator improve by letting it know how “real” or “fake” its generated outputs are.
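A matching Discriminator sketch (again PyTorch, with illustrative layer sizes):

import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # 28x28 -> 14x14
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            # 14x14 -> 7x7
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 1),
            nn.Sigmoid(),  # probability that the input image is real
        )

    def forward(self, img):
        return self.net(img)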
Deep Convolutional GAN (DCGAN)
For my first GAN, I explored the Deep Convolutional GAN (DCGAN). DCGANs are a significant advancement in the GAN framework, proposed by Radford, Metz, and Chintala in their 2015 paper, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. The paper introduced architectural guidelines that improve the performance of standard GANs and explained the rationale behind them, making it easier to understand why these modifications lead to better results.
Key Features of DCGAN:
Standard GANs rely on fully connected layers, which don’t do a great job of capturing the spatial relationships in images. DCGANs, on the other hand, use convolutional layers, which are much better at recognizing features like edges and textures.
The Generator uses transposed convolutions to upsample the data, creating more detailed images.
Both networks use batch normalization to stabilize training and avoid wild fluctuations.
The Discriminator uses LeakyReLU activations for better gradient flow, while the Generator's output layer uses Tanh to keep pixel values between -1 and 1.
Experiments
For my experiments, I used the MNIST dataset, which has grayscale images of handwritten digits (0-9). I implemented a training loop for 200 epochs, focusing on optimizing both the Generator and Discriminator. The losses reported after 200 epochs were as follows:
These results revealed that while the Discriminator was performing well, the Generator struggled to produce convincing images. The Discriminator was penalizing the Generator so heavily for images it detected as fake that the Generator received little useful feedback. Such an imbalance in training dynamics is a common problem in GANs: with the Binary Cross-Entropy (BCE) loss, a dominant Discriminator can cause vanishing gradients and slow the Generator's learning.
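For reference, the alternating update at the heart of that training loop looks roughly like this (a simplified sketch assuming PyTorch and the Generator/Discriminator sketches above; generator, discriminator, dataloader, num_epochs, latent_dim and the optimizer settings are placeholders):

import torch
import torch.nn as nn

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

for epoch in range(num_epochs):
    for real_images, _ in dataloader:
        batch = real_images.size(0)
        real_labels = torch.ones(batch, 1)
        fake_labels = torch.zeros(batch, 1)

        # Discriminator step: classify real images as 1 and generated images as 0
        fake_images = generator(torch.randn(batch, latent_dim)).detach()
        d_loss = criterion(discriminator(real_images), real_labels) + \
                 criterion(discriminator(fake_images), fake_labels)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: try to make the Discriminator label fresh fakes as real
        g_loss = criterion(discriminator(generator(torch.randn(batch, latent_dim))), real_labels)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()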
Challenges with the loss function (Binary Cross Entropy Cost)
1. Issues with Binary Cross-Entropy (BCE) Loss: While it provides a clear metric for distinguishing between real and fake images, it can lead to problems during training. Specifically, BCE can result in poor gradient flow when the Discriminator becomes too confident, assigning low probability scores to generated samples. This overconfidence can halt the learning process for the Generator, making it difficult to improve and contribute to other challenges like mode collapse.
2. Mode Collapse: Mode collapse is a phenomenon where the Generator produces a limited variety of outputs, often generating the same or very similar images for different inputs. This issue can severely restrict the diversity of the generated data, undermining the GAN's ability to learn and replicate the underlying distribution of the training data. Mode collapse is particularly problematic in applications where diversity is essential, such as in image synthesis.
3. Vanishing Gradients: Another issue was vanishing gradients, which can occur when the Discriminator becomes too powerful relative to the Generator. When the Discriminator learns to distinguish real from fake images too effectively, the Generator receives minimal gradient feedback, which is essential for updating its weights.
This situation can lead to stagnation in the Generator's learning, further exacerbating mode collapse and hindering overall model performance.
Solution: Replacing BCE with Earth Mover's Distance
One solution to the limitations of Binary Cross-Entropy (BCE) is the Earth Mover's Distance (EMD), also known as Wasserstein distance. EMD provides a better way to compare the distributions of real and generated data.
What is Earth Mover's Distance (EMD)? EMD measures the minimum "cost" to change one distribution into another. Think of it like moving a pile of dirt (representing generated samples) to create a new pile that looks like another distribution (representing real samples). EMD calculates how much effort it takes to make this transformation, considering how far each piece of dirt has to move.
The EMD between two probability distributions can be expressed as:
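W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]

Here P_r is the real data distribution, P_g is the Generator's distribution, and \Pi(P_r, P_g) is the set of all joint distributions whose marginals are P_r and P_g. Each \gamma is one "transport plan" for moving the generated pile of dirt onto the real one, and the infimum picks the cheapest plan.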
Using EMD as a loss function in GANs has several benefits over BCE: it provides meaningful gradients even when the Discriminator can separate real and fake samples easily, its value tracks the quality of the generated images more closely, and it reduces the risk of mode collapse and vanishing gradients.
Wasserstein Loss (w-loss)
The use of EMD in GANs is often done through the Wasserstein loss (w-loss). This loss function relates directly to EMD and makes training GANs more practical.
Why w-loss works: Because the critic's output is an unbounded score rather than a probability squashed into [0, 1], it does not saturate; the Generator therefore keeps receiving useful gradients even when the critic is well ahead, which directly addresses the vanishing-gradient problem described above.
By switching from BCE to EMD and using w-loss, I noticed significant improvements in training. The process became more stable, and the Generator produced more diverse and realistic outputs.
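In code the change from BCE is small: the Discriminator becomes a "critic" with no Sigmoid on its output, and the losses are simply means of its raw scores (a sketch, assuming PyTorch; critic, generator, z, real_images and fake_images are placeholders):

# Critic step: push scores for real images up and scores for generated images down
critic_loss = critic(fake_images).mean() - critic(real_images).mean()

# Generator step: push the critic's score for generated images up
gen_loss = -critic(generator(z)).mean()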
Lipschitz Constraint and Solutions
To tackle stability and convergence issues, I looked into ways to enforce the Lipschitz constraint. Here are two main methods I found:
The first is weight clipping, which simply forces the critic's weights into a small fixed range; it is easy to implement but can limit the critic's capacity. The second is the gradient penalty, which is the method most often used in Wasserstein GANs (WGANs) because it reduces problems like mode collapse and vanishing gradients. It lets the Generator receive meaningful gradients, which helps it learn better and produce diverse outputs.
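A sketch of the gradient penalty term (in the WGAN-GP style, assuming PyTorch): it penalizes the critic whenever the gradient norm at points interpolated between real and fake images drifts away from 1, which is one way of softly enforcing the Lipschitz constraint.

import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Sample points on straight lines between real and fake images
    eps = torch.rand(real.size(0), 1, 1, 1)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(outputs=scores, inputs=mixed,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    # Penalize deviations of the gradient norm from 1
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Added to the critic loss on every critic update:
# critic_loss = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)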
Training Results for DCGAN:
For 50 epochs, the training results for the DCGAN with Binary Cross-Entropy (BCE) loss were:
These results show that the DCGAN model is making solid progress, with both networks actively pushing each other to improve. The Generator is learning to generate more realistic images while the Discriminator continues to refine its ability to tell them apart.
Exploring Conditional GAN with DCGAN architecture
After experimenting with DCGAN, I decided to take on Conditional GANs (cGANs), which offer an exciting twist on the traditional GAN framework.
What is a CGAN?
Conditional GANs extend the GAN framework by allowing control over the generated output through conditioning. The user specifies which class they want an image of, and the model generates an image belonging to that class. This conditioning mechanism lets the model produce outputs that align with specific criteria. For instance, when working with the MNIST dataset, I could specify which digit I wanted to generate (such as "1" or "7"), and the model would produce a corresponding image of that digit; in other domains, you could ask for a face with a pointed nose and blonde hair. This added layer of control makes cGANs incredibly powerful for tasks where specificity is required.
What Makes a CGAN Special?
The Conditional GAN introduces an additional input: conditioning information. This information can be anything that adds context to what the image should look like, such as a class label (a digit from 0 to 9 for MNIST), a text description, or a set of attributes (hair colour, nose shape, and so on).
The idea is to give both the Generator and the Discriminator some context, enabling the network to generate more targeted and relevant images.
Architecture Changes in CGAN
Generator Changes
In a standard GAN, the Generator takes a random noise vector z and outputs an image.
In a CGAN, the Generator takes two inputs:
1. Random noise z.
2. Conditioning information y (like a label, e.g., "4" for generating a handwritten digit "4").
These inputs are concatenated together into a single input vector, which then gets processed through the Generator network to produce an image that should match the given condition.
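In PyTorch terms, the concatenation is a single call (a sketch; z, y and generator are placeholders):

gen_input = torch.cat([z, y], dim=1)   # z: noise (batch, latent_dim), y: label vector (batch, 10)
fake_images = generator(gen_input)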
Discriminator Changes
In a CGAN, the Discriminator also receives the conditioning information y along with the image. The image and the label are combined, often by concatenating the label as an extra channel in the image. This setup forces the Discriminator to not only determine if an image is real or fake but also whether it matches the condition.
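A sketch of that combination for 28x28 MNIST images, where each of the 10 label entries becomes a constant-valued image channel (images, y and discriminator are placeholders):

y_channels = y.view(-1, 10, 1, 1).expand(-1, -1, 28, 28)   # (batch, 10, 28, 28)
disc_input = torch.cat([images, y_channels], dim=1)        # (batch, 1 + 10, 28, 28)
validity = discriminator(disc_input)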
By adding conditioning, we can control which class of image the Generator produces and require the Discriminator to judge not only whether an image looks real but also whether it matches its label.
Implementation:
To implement the CGAN architecture, I started by utilizing one-hot encoded labels for conditioning. This approach allows the model to interpret the label data effectively:
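(A minimal sketch of this step, assuming PyTorch; generator and latent_dim refer to the sketches above.)

import torch
import torch.nn.functional as F

labels = torch.tensor([3, 7, 1])                    # the digits I want the Generator to produce
y = F.one_hot(labels, num_classes=10).float()       # e.g. [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] for "3"
z = torch.randn(labels.size(0), latent_dim)         # one noise vector per requested digit
fake_digits = generator(torch.cat([z, y], dim=1))   # images conditioned on the requested classes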
Training:
At epoch 20, the training showed encouraging progress, with the following results:
These loss values provide valuable insight into the model's performance:
Interpreting the Loss Values
Loss Curve
This stage of training highlights that the cGAN is on the right track, with both networks improving and pushing each other toward better results.
Conclusion:
In conclusion, diving into Generative Adversarial Networks has been a fun experiment that improved my understanding of deep learning, particularly through the exploration of Deep Convolutional GANs (DCGANs) and Conditional GANs (CGANs). Although it was difficult to implement the architectures of the GANs and ensure a delicate balance between the generator and discriminator, it was worth it. This experience has deepened my appreciation for the capabilities of GANs in generating realistic data and has taught me how important it is to experiment, as even small tweaks can make a big difference in how well the models work. I aim to refine my knowledge of GANs further and explore their applications in various fields.