Team 35 Project Midterm

CS 7641 Team Members: Rachit Chadha, Khalid Shaikh, Mini Jain, Asmit Kumar Singh, Amrit Khera

Introduction/Background

We have all experienced the curiosity that arises while watching anime, wondering what the text displayed within an episode means. Anime frequently employs textual visuals to introduce new storylines or conclude existing ones, and these elements play a pivotal role in the overall viewing experience. It would therefore be immensely beneficial if this text could be seamlessly translated and integrated into the video, preserving its original style, including the color and font, while altering only the content.

The potential applications of solving this challenge extend beyond anime: movie and show posters could be translated effortlessly into the desired language, making real-time text translation accessible and natural-looking.



Figure: Problem sample

Problem Definition

The objective of this project is to transform an image containing Japanese text, whether a poster, an anime frame, or a real-world photograph, into its English counterpart while preserving the essence, meaning, and visual style of the original image. This intricate challenge can be decomposed into the pipeline shown below.

Figure: Problem pipeline

OCR and machine translation have been the focus of extensive research and are effectively solved problems, leaving little scope for improvement on our part. The text eraser model detects and removes any textual elements within the image and fills the void seamlessly, maintaining the image's integrity; this is handled well by inpainting models such as DALL·E. Hence, the primary focus of our project is the Inter-Language font transfer model.

Dataset

Figure 1. Dataset Creation

The methodologies we are using require two sets of images: the input and the target. The image translation models learn the style of the given Japanese text and apply it to the translated English characters, producing English text in the original style; the model therefore needs information on both the content and the style of the image. The figure shows our dataset creation process, which starts from font .ttf files sourced from websites such as https://www.freejapanesefont.com and generates sample input and output pairs. The input image captures the style through Japanese font characters, with the English character to be stylized placed in the middle. We have created a dataset of 10,000 samples.

DATASET CREATION

The dataset was created by placing Japanese font characters in each training image alongside the English character in a standard font; the corresponding target image was that English character stylized in the Japanese font. We created two sets of data: one using 8 Japanese characters for each English character, and another using 3 Japanese stylized characters for each English character. This allowed our model to learn from diversified input. Both sets of data can be seen in the figure below. We ran our experiments on both sets to determine which training input helped the model generate the stylized English character better.

Figure: Dataset samples for the 8jap+1eng and 3jap+1eng configurations
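
As a concrete illustration of this process, the sketch below renders one (input, target) pair with Pillow, assuming a 3x3 grid with the plain English character in the centre cell and eight Japanese style characters around it; the font paths, characters, and sizes are illustrative placeholders rather than our exact generation script.

    # Sketch: render one (input, target) pair from a Japanese .ttf font with Pillow.
    # Font paths, characters, and sizes are illustrative placeholders.
    import random
    from PIL import Image, ImageDraw, ImageFont

    def render_char(char, font_path, size):
        """Render a single character centred on a white square canvas."""
        font = ImageFont.truetype(font_path, int(size * 0.8))
        img = Image.new("RGB", (size, size), "white")
        ImageDraw.Draw(img).text((size // 2, size // 2), char,
                                 font=font, fill="black", anchor="mm")
        return img

    def make_pair(eng_char, jp_chars, jp_font, eng_font, cell=64):
        """Input: 8 Japanese style characters around the plain English character.
        Target: the same English character rendered in the Japanese font."""
        inp = Image.new("RGB", (3 * cell, 3 * cell), "white")
        for r in range(3):
            for c in range(3):
                if (r, c) == (1, 1):   # centre cell: English character in a standard font
                    tile = render_char(eng_char, eng_font, cell)
                else:                  # surrounding cells: Japanese characters as style cues
                    tile = render_char(random.choice(jp_chars), jp_font, cell)
                inp.paste(tile, (c * cell, r * cell))
        target = render_char(eng_char, jp_font, 3 * cell)  # English char in the Japanese font
        return inp, target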

DATASET PREPROCESSING

Preprocessing the dataset plays a pivotal role in enhancing the robustness and generalization capabilities of the model. We therefore applied varied augmentations to the original dataset: normalization, rotation, random cropping, and jitter. These preprocessing techniques allow the model to handle a wide array of variations in the orientation and quality of input images, improve its resilience to different writing styles, and help prevent overfitting to specific patterns present in the training data.
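
A minimal sketch of such an augmentation pipeline with torchvision is shown below; the parameter values are illustrative rather than our final settings.

    # Sketch of the augmentations described above (rotation, random crop, jitter, normalization).
    # The parameter values are illustrative.
    import torchvision.transforms as transforms

    train_transform = transforms.Compose([
        transforms.Resize(286),                                # upscale slightly before cropping
        transforms.RandomCrop(256),                            # random cropping
        transforms.RandomRotation(degrees=10),                 # small random rotations
        transforms.ColorJitter(brightness=0.2, contrast=0.2),  # colour jitter
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # map pixel values to [-1, 1]
    ])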

Further, we combined the two images in each data sample into a single image for the model. This pairing lets the GAN learn from aligned input-target samples without additional labels, with each pair containing both the Japanese stylization and the English stylization, as seen in the figure below.

Figure: Combined input-target image pairs
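
The combination step itself is straightforward; below is a minimal sketch with Pillow, following the common side-by-side "A|B" layout used by Pix2Pix-style implementations (the file names are placeholders).

    # Sketch: paste input (A) and target (B) side by side into one aligned training sample.
    from PIL import Image

    def combine_pair(img_a, img_b):
        w, h = img_a.size
        combined = Image.new("RGB", (2 * w, h), "white")
        combined.paste(img_a, (0, 0))   # left half: Japanese style cues + plain English character
        combined.paste(img_b, (w, 0))   # right half: stylized English target
        return combined

    # Example usage with the pair generator sketched earlier (paths are placeholders):
    # inp, target = make_pair("A", jp_chars, "japanese_font.ttf", "standard_font.ttf")
    # combine_pair(inp, target).save("dataset/train/0001.png")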

Methods

1. Pix2Pix / Pix2PixHD


Figure 2. Pix2Pix image translation example

Pix2Pix is an image-to-image translation model that maps an image from one representation of a scene to another [1].

It's based on a Conditional Generative Adversarial Network (cGAN) architecture where a generator and discriminator are trained simultaneously in a min-max fashion. Pix2PixHD is an extension of Pix2Pix for higher resolution images using a multiscale generator [2], trained as shown in Fig 3.


Figure 3. Pix2Pix Training procedure. (a) Generator Training (b) Discriminator Training
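
For reference, the conditional adversarial objective from [1] that this min-max training optimizes can be written as

$$ \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big], $$

where the generator G tries to minimize this objective while the discriminator D tries to maximize it, i.e. $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$.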

2. CycleGAN

Like Pix2Pix, CycleGAN is a cGAN, but it is designed for unpaired image-to-image translation tasks where paired training data is unavailable [3].


Figure 4. CycleGAN image translation example

CycleGAN follows a training pattern similar to Pix2Pix, but with two generators and two discriminators arranged in a cycle instead of a single pair:

Figure 5. CycleGAN training, where G and F are generators and D_X and D_Y are discriminators. The model contains two mapping functions G : X → Y and F : Y → X with associated adversarial discriminators D_Y and D_X; D_Y encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for D_X and F.

CycleGAN presents a method that can learn to capture the special characteristics of one image collection and figure out how these characteristics could be translated to another image collection, all in the absence of any paired training examples. The key difference from Pix2Pix is that the model does not expect a paired dataset; that is, it does not require an image in one set to be mapped to a specific image in the other set. Instead, it expects two larger sets, such as "horses" and "zebras", and tries to learn a generic mapping between the two.

Figure 4.2. CycleGAN works without one-to-one image mapping
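
Formally, CycleGAN [3] combines adversarial losses for both mapping directions with a cycle consistency term:

$$ \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{cyc}(G, F), $$

$$ \mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big], $$

so that translating an image to the other domain and back should recover the original image.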

Proposed Methodology

Our methodology uses these state-of-the-art cGAN models (Pix2Pix, CycleGAN) to transfer the style of Japanese characters in visual media onto their English translations.


Figure 6. Proposed Methodology

Steps:

  1. Extract and translate Japanese text from the input image to English.
  2. Use our dataset to train the style transfer between Japanese and English characters.
  3. Apply the trained model to transfer the learned style onto the translated English characters, as sketched in the pseudocode below.
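
A high-level sketch of the pipeline ties these steps together; every function name below is a hypothetical placeholder for the corresponding component, not an implemented API.

    # High-level sketch of the proposed pipeline. All function names are hypothetical
    # placeholders for the corresponding components.
    def translate_image(image):
        boxes, jp_text = run_ocr(image)                # step 1: detect and read the Japanese text
        en_text = translate_jp_to_en(jp_text)          # step 1: machine-translate it to English
        clean_image = erase_text(image, boxes)         # erase the original text and inpaint the gap
        styled_chars = font_transfer_model(jp_text, en_text)  # steps 2-3: trained cGAN style transfer
        return composite(clean_image, styled_chars, boxes)    # place the styled English text back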

Results and Discussion

Our results are divided into two sections, Pix2Pix and CycleGAN. We trained both models on the same dataset and compared their results. We also tried to improve the results through hyperparameter tuning and longer training, and we visualized the training process to understand how each model learns.

1. Pix2Pix

We experimented with the Pix2Pix model on two dataset configurations: 3jap+1eng and 8jap+1eng. The 3jap+1eng results were noticeably poor, and the model did not seem to distinguish the target character from the Japanese style characters. We therefore went ahead with the 8jap+1eng configuration, which we trained for 100 epochs on 10,000 training samples.

Figure: Pix2Pix generated samples (8jap+1eng, 100 epochs)

We can see that the Pix2Pix model learns the mapping between the input characters and the English character in the Japanese font to a great extent. The model learns that the central character in the input image is the one to be translated, while the 8 characters around it provide style and font cues. For many characters the model successfully captures the color and the curves of the target. For some characters, such as Z, the edges are not as sharp as intended, so more experimentation is required there.

The model is not perfect yet. It fails to distinguish some similar characters, such as "O" and "Q" or "O" and "D", and it is not yet able to capture very complex styles like the shaded "B".

Figure: Pix2Pix failure cases on similar characters and complex styles

To see the progress of model learning, we plot some characters after just 5 epochs. During these initial epochs, the model appears to be learning like a child, gradually working out the shapes of the characters.

Figure: Pix2Pix outputs after 5 epochs

Model Improvement via Hyperparameter Tuning

We trained the Pix2Pix model to map from set A (8 Japanese characters and 1 English character) to set B (the target English character in the Japanese font). Earlier we trained the model for 100 epochs on a subset of the data; this time we tuned the hyperparameters, trained for 1,000 epochs, and employed deeper architectures for the discriminator and generator, as sketched below.
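
As an illustration of what we mean by a deeper discriminator, the sketch below builds a PatchGAN-style network in the spirit of [1], where increasing n_layers adds further downsampling blocks; the widths and layer counts shown are illustrative, not our exact configuration.

    # Sketch of a deeper PatchGAN-style discriminator (in the spirit of [1]).
    # Layer counts and widths are illustrative, not our exact configuration.
    import torch.nn as nn

    def patchgan_discriminator(in_channels=6, base=64, n_layers=4):
        # in_channels=6 because the conditional discriminator sees the input and
        # output images concatenated along the channel dimension.
        layers = [nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(1, n_layers):                 # larger n_layers -> deeper discriminator
            out_ch = min(ch * 2, base * 8)
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]  # patch-level real/fake scores
        return nn.Sequential(*layers)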

Figure: Pix2Pix outputs after hyperparameter tuning (1,000 epochs)
  1. We see a significant improvement, as shown above.
  2. The model captures the characters and the style much better than before, though it is still a little rough at times, as can be seen in the 'B' character. This remains an open area for research; training the model even longer, until convergence (which requires substantial compute and time), could further improve the results.

Pix2Pix Visual Analysis

We trained the Pix2Pix model to map from set A (8 Japanese characters and 1 English character) to set B (the target English character in the Japanese font) for 100 epochs. The loss plots can be seen below.

Figure: Pix2Pix training loss plots (G_GAN, G_L1, D_real, D_fake)
  1. G_GAN Loss: This loss represents the Generator's ability to create outputs that the Discriminator classifies as real. The G_GAN loss fluctuates over epochs, which is expected as the Generator learns. However, the general trend does not show a clear decrease, indicating that the Generator may struggle to fool the Discriminator consistently as training continues (the objective behind these loss terms is written out after this list).
  2. G_L1 Loss: The L1 loss typically measures the pixel-wise absolute difference between the generated images and the target images. The plot shows high variability but overall there is a downward trend. This suggests that the Generator is getting better at producing images that are closer to the target on average. Yet, the high variance implies there may be epochs where the model significantly diverges from the target, which could be a sign of instability in learning.
  3. D_real Loss: This metric reflects how well the Discriminator recognizes real images. The loss seems to have significant spikes but trends downward overall, which suggests that the Discriminator is becoming more proficient at identifying real images. The spikes could indicate moments where the Discriminator's performance varies greatly, possibly due to changes in the Generator's outputs or due to the inherent variability of the training data.
  4. D_fake Loss: This loss measures the Discriminator's ability to detect fake images from the Generator. The plot shows a downward trend, indicating that the Discriminator is improving at distinguishing fake images over time. However, like the D_real loss, there is some volatility, suggesting that the model may have epochs where its performance degrades, which is common in adversarial training as the two networks are in a constant tug-of-war.
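
The G_GAN and G_L1 terms above are the two components of the Pix2Pix objective from [1], which the generator minimizes jointly:

$$ G^* = \arg\min_G \max_D\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big], $$

where the first term drives G_GAN (fooling the discriminator) and the second corresponds to G_L1 (pixel-wise closeness to the target), with λ balancing the two.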

2. CycleGAN

We observe that although the CycleGAN model successfully learns some characteristics of the image translation, it does not achieve the same level of performance as Pix2Pix. Some of the decent generated samples from CycleGAN are shown below.

Figure: CycleGAN generated samples (epoch 100)

A clear problem with the model is that it cannot work out which character it is meant to generate; in the first image it is confused between L and O.

Figure: CycleGAN confusing similar characters (L vs O)

Upon analysis, we attribute the weaker performance to the lack of pairwise mapping between the characters and to CycleGAN's cycle consistency loss. This loss ensures that an image translated from one domain to the other and back again should resemble the original image. While this is useful for unpaired data, it can be less efficient than a direct pixel-wise loss (such as L1) when paired data is available. Because of this, the model tries to preserve some characteristics of the Japanese characters in its output so that, in the second half of the cycle, it can regenerate the image with one English character and 8 Japanese characters (a sketch of this loss term follows below). Our training and experimentation expose this flaw of the model architecture for our task.
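
A minimal sketch of this cycle consistency term makes the failure mode concrete; G_AB and G_BA below are PyTorch-style stand-ins for the two generators (the names are ours, not from an existing codebase).

    # Sketch of the cycle consistency loss discussed above. G_AB and G_BA are the two
    # generators; real_A is an "8 Japanese + 1 English" input, real_B a stylized target.
    import torch.nn.functional as nnf

    def cycle_consistency_loss(G_AB, G_BA, real_A, real_B, lambda_cyc=10.0):
        fake_B = G_AB(real_A)     # generated stylized English character
        rec_A = G_BA(fake_B)      # must reconstruct all nine input characters from fake_B
        fake_A = G_BA(real_B)
        rec_B = G_AB(fake_A)
        # Keeping rec_A close to real_A pushes fake_B to retain traces of the Japanese
        # characters -- the behaviour we observe in our generated samples.
        return lambda_cyc * (nnf.l1_loss(rec_A, real_A) + nnf.l1_loss(rec_B, real_B))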

CycleGan Visual Analysis

Figure: CycleGAN training loss plots

From the training loss visualization, we also see that the CycleGAN architecture is confused and the losses spike up and down considerably. This happens because when the model generates a good image for set B, the generator responsible for the B→A translation performs very badly, since it has no visual cues from which to regenerate the initial image. This leads to a deadlock in training, and even with a large number of epochs and hyperparameter tuning the model is not able to converge.

Evaluation

Pix2Pix

Our evaluation process for Japanese-to-English font style transfer employed a carefully curated dataset comprising pairs of Japanese text images and their corresponding English translations, encompassing diverse font styles. Our evaluation for Pix2Pix is based on the following:

EVALUATION METRICS


      Libraries involved
      
        import torch
        import torchvision.transforms as transforms
        from torch.utils.data import DataLoader
        from skimage import io
        from skimage.metrics import structural_similarity as ssim
        import numpy as np
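
The metric computation itself is not reproduced here; as a minimal sketch, SSIM (implied by the imports above) can be averaged over generated/ground-truth pairs as follows, assuming the images are stored as grayscale files (the paths are placeholders).

    # Minimal sketch: average SSIM between generated and ground-truth images.
    # File paths and the grayscale assumption are illustrative.
    import numpy as np
    from skimage import io
    from skimage.metrics import structural_similarity as ssim

    def mean_ssim(generated_paths, target_paths):
        scores = []
        for gen_path, tgt_path in zip(generated_paths, target_paths):
            gen = io.imread(gen_path, as_gray=True)
            tgt = io.imread(tgt_path, as_gray=True)
            scores.append(ssim(gen, tgt, data_range=gen.max() - gen.min()))
        return float(np.mean(scores))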
      
    

MODEL PERFORMANCE

CycleGAN

Our evaluation of CycleGAN's performance in this context provides insights into its capabilities and areas for improvement in comparison to the Pix2Pix model. We evaluated the model on the following metrics:

EVALUATION METRICS

    Libraries involved
      
        import torch
        import torchvision.transforms as transforms
        from torch.utils.data import DataLoader
        from skimage import io
        from skimage.metrics import structural_similarity as ssim
        import numpy as np
      
      

MODEL PERFORMANCE

Comparative Insights

This evaluation underscores CycleGAN's potential in image translation, yet also its limitations for tasks demanding high precision, such as font style transfer. Future enhancements could focus on improving the model's character differentiation capabilities and adapting its architecture for more detail-oriented tasks.

Our Timeline

High Level Gantt

Contribution Table

contribution-table

References

  [1] Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  [2] Wang, Ting-Chun, et al. "High-resolution image synthesis and semantic manipulation with conditional GANs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  [3] Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.

Video Presentation