Team 35 Project Midterm

CS 7641 Team Members: Rachit Chadha, Khalid Shaikh, Mini Jain, Asmit Kumar Singh, Amrit Khera

Introduction/Background

We have all experienced the curiosity that arises while watching anime, wondering what the text displayed within an episode means. Anime frequently employs textual visuals to introduce new storylines or conclude existing ones, and these elements play a pivotal role in the overall viewing experience. It would therefore be immensely beneficial if this text could be seamlessly translated and integrated into the video, preserving its original style, including the color and font, while altering only the content.

The potential applications of solving this challenge extend beyond anime: movie and show posters could be translated effortlessly into the desired language, making real-time text translation accessible and natural-looking.



Figure: Problem sample

Problem Definition

The objective of this project is to transform an image containing Japanese text, whether a poster, an anime frame, or a real-world photograph, into its English counterpart while preserving the essence, meaning, and visual style of the original image. This intricate challenge can be decomposed into the pipeline shown below.

Figure: Problem pipeline

OCR and machine translation have been the focus of extensive research and are effectively solved problems, leaving little scope for improvement on our part. The text eraser model detects and removes any textual elements within the image and fills the void seamlessly, maintaining the image's integrity; this is handled well by inpainting models such as DALL·E. Hence, the primary focus of our project is the Inter-Language font transfer model.

Dataset

Figure 1. Dataset Creation

The methodologies we are using require two sets of images: the input and the target. The image translation models learn the style of the given Japanese text and apply it to the translated English characters, producing English text in the original style; the model therefore needs information on both the content and the style of the image. The figure shows our dataset creation process, which starts from font .ttf files sourced from websites such as https://www.freejapanesefont.com and generates sample input and output pairs. The input image captures the style through Japanese font characters, with the English character to be stylized placed in the middle. We have created a dataset of 10,000 samples.

DATASET CREATION

The dataset was created by placing Japanese font characters in each training image alongside the English character in a standard font; the corresponding target image was that English character stylized in the Japanese font. We created two sets of data: one using 8 Japanese characters for each English character, and another using 3 Japanese stylized characters for each English character. This allowed our model to learn from diversified input. Both sets of data can be seen in the figure below. We ran our experiments on both sets to determine which training input helped the model generate the stylized English character better.

Figure: Dataset samples for the 8jap+1eng and 3jap+1eng configurations
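
As a concrete illustration of this process, the sketch below renders one (input, target) pair with Pillow, assuming a 3x3 grid with the plain English character in the centre cell and eight Japanese style characters around it; the font paths, characters, and sizes are illustrative placeholders rather than our exact generation script.

    # Sketch: render one (input, target) pair from a Japanese .ttf font with Pillow.
    # Font paths, characters, and sizes are illustrative placeholders.
    import random
    from PIL import Image, ImageDraw, ImageFont

    def render_char(char, font_path, size):
        """Render a single character centred on a white square canvas."""
        font = ImageFont.truetype(font_path, int(size * 0.8))
        img = Image.new("RGB", (size, size), "white")
        ImageDraw.Draw(img).text((size // 2, size // 2), char,
                                 font=font, fill="black", anchor="mm")
        return img

    def make_pair(eng_char, jp_chars, jp_font, eng_font, cell=64):
        """Input: 8 Japanese style characters around the plain English character.
        Target: the same English character rendered in the Japanese font."""
        inp = Image.new("RGB", (3 * cell, 3 * cell), "white")
        for r in range(3):
            for c in range(3):
                if (r, c) == (1, 1):   # centre cell: English character in a standard font
                    tile = render_char(eng_char, eng_font, cell)
                else:                  # surrounding cells: Japanese characters as style cues
                    tile = render_char(random.choice(jp_chars), jp_font, cell)
                inp.paste(tile, (c * cell, r * cell))
        target = render_char(eng_char, jp_font, 3 * cell)  # English char in the Japanese font
        return inp, target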

DATASET PREPROCESSING

Preprocessing the dataset plays a pivotal role in enhancing the robustness and generalization capabilities of the model. We therefore applied varied augmentations to the original dataset: normalization, rotation, random cropping, and jitter. These preprocessing techniques allow the model to handle a wide array of variations in the orientation and quality of input images, improve its resilience to different writing styles, and help prevent overfitting to specific patterns present in the training data.
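
A minimal sketch of such an augmentation pipeline with torchvision is shown below; the parameter values are illustrative rather than our final settings.

    # Sketch of the augmentations described above (rotation, random crop, jitter, normalization).
    # The parameter values are illustrative.
    import torchvision.transforms as transforms

    train_transform = transforms.Compose([
        transforms.Resize(286),                                # upscale slightly before cropping
        transforms.RandomCrop(256),                            # random cropping
        transforms.RandomRotation(degrees=10),                 # small random rotations
        transforms.ColorJitter(brightness=0.2, contrast=0.2),  # colour jitter
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # map pixel values to [-1, 1]
    ])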

Further, we combined the two images in each data sample into a single image for the model. This pairing lets the GAN learn from aligned input-target samples without additional labels, with each pair containing both the Japanese stylization and the English stylization, as seen in the figure below.

Figure: Combined input-target image pairs
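
The combination step itself is straightforward; below is a minimal sketch with Pillow, following the common side-by-side "A|B" layout used by Pix2Pix-style implementations (the file names are placeholders).

    # Sketch: paste input (A) and target (B) side by side into one aligned training sample.
    from PIL import Image

    def combine_pair(img_a, img_b):
        w, h = img_a.size
        combined = Image.new("RGB", (2 * w, h), "white")
        combined.paste(img_a, (0, 0))   # left half: Japanese style cues + plain English character
        combined.paste(img_b, (w, 0))   # right half: stylized English target
        return combined

    # Example usage with the pair generator sketched earlier (paths are placeholders):
    # inp, target = make_pair("A", jp_chars, "japanese_font.ttf", "standard_font.ttf")
    # combine_pair(inp, target).save("dataset/train/0001.png")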

Methods

1. Pix2Pix / Pix2PixHD


Figure 2. Pix2Pix image translation example

Pix2Pix is an image-to-image translation model that maps an image from one representation of a scene to another [1].

It's based on a Conditional Generative Adversarial Network (cGAN) architecture where a generator and discriminator are trained simultaneously in a min-max fashion. Pix2PixHD is an extension of Pix2Pix for higher resolution images using a multiscale generator [2], trained as shown in Fig 3.


Figure 3. Pix2Pix Training procedure. (a) Generator Training (b) Discriminator Training
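
For reference, the conditional adversarial objective from [1] that this min-max training optimizes can be written as

$$ \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big], $$

where the generator G tries to minimize this objective while the discriminator D tries to maximize it, i.e. $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$.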

2. CycleGAN

Like Pix2Pix, CycleGAN is a cGAN, but it is designed for unpaired image-to-image translation tasks where paired training data is unavailable [3].


Figure 4. CycleGAN image translation example

CycleGAN follows a training pattern similar to Pix2Pix, but with two generators and two discriminators arranged in a cycle instead of a single pair:

Figure 5. CycleGAN training, where G and F are generators and D_X and D_Y are discriminators. The model contains two mapping functions G : X → Y and F : Y → X with associated adversarial discriminators D_Y and D_X; D_Y encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for D_X and F.

CycleGAN presents a method that can learn to capture the special characteristics of one image collection and figure out how these characteristics could be translated to another image collection, all in the absence of any paired training examples. The key difference from Pix2Pix is that the model does not expect a paired dataset; that is, it does not require an image in one set to be mapped to a specific image in the other set. Instead, it expects two larger sets, such as "horses" and "zebras", and tries to learn a generic mapping between the two.

Figure 4.2. CycleGAN works without one-to-one image mapping
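
Formally, CycleGAN [3] combines adversarial losses for both mapping directions with a cycle consistency term:

$$ \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{cyc}(G, F), $$

$$ \mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big], $$

so that translating an image to the other domain and back should recover the original image.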

Proposed Methodology

Our methodology uses these state-of-the-art cGAN models (Pix2Pix, CycleGAN) to transfer the style of Japanese characters in visual media onto their English translations.


Figure 6. Proposed Methodology

Steps:

  1. Extract and translate Japanese text from the input image to English.
  2. Use our dataset to train the style transfer between Japanese and English characters.
  3. Apply the trained model to transfer the learned style onto the translated English characters, as sketched in the pseudocode below.
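
A high-level sketch of the pipeline ties these steps together; every function name below is a hypothetical placeholder for the corresponding component, not an implemented API.

    # High-level sketch of the proposed pipeline. All function names are hypothetical
    # placeholders for the corresponding components.
    def translate_image(image):
        boxes, jp_text = run_ocr(image)                # step 1: detect and read the Japanese text
        en_text = translate_jp_to_en(jp_text)          # step 1: machine-translate it to English
        clean_image = erase_text(image, boxes)         # erase the original text and inpaint the gap
        styled_chars = font_transfer_model(jp_text, en_text)  # steps 2-3: trained cGAN style transfer
        return composite(clean_image, styled_chars, boxes)    # place the styled English text back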

Results and Discussion

Our results are divided into two sections, Pix2Pix and CycleGAN. We trained both models on the same dataset and compared their results. We also tried to improve the results through hyperparameter tuning and longer training, and we visualized the training process to understand how each model learns.

1. Pix2Pix

We experimented with the Pix2Pix model on two dataset configurations: 3jap+1eng and 8jap+1eng. The 3jap+1eng results were noticeably poor, and the model did not seem to distinguish the target character from the Japanese style characters. We therefore went ahead with the 8jap+1eng configuration, which we trained for 100 epochs on 10,000 training samples.

Figure: Pix2Pix generated samples (8jap+1eng, 100 epochs)

We can see that the Pix2Pix model learns the mapping between the input characters and the English character in the Japanese font to a great extent. The model learns that the central character in the input image is the one to be translated, while the 8 characters around it provide style and font cues. For many characters the model successfully captures the color and the curves of the target. For some characters, such as Z, the edges are not as sharp as intended, so more experimentation is required there.

The model is not perfect yet. It fails to distinguish some similar characters, such as "O" and "Q" or "O" and "D", and it is not yet able to capture very complex styles like the shaded "B".

Figure: Pix2Pix failure cases on similar characters and complex styles

To see the progress of model learning, we plot some characters after just 5 epochs. During these initial epochs, the model appears to be learning like a child, gradually working out the shapes of the characters.

Figure: Pix2Pix outputs after 5 epochs

Model Improvement via Hyperparameter Tuning

We trained the Pix2Pix model to map from set A (8 Japanese characters and 1 English character) to set B (the target English character in the Japanese font). Earlier we trained the model for 100 epochs on a subset of the data; this time we tuned the hyperparameters, trained for 1,000 epochs, and employed deeper architectures for the discriminator and generator, as sketched below.
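
As an illustration of what we mean by a deeper discriminator, the sketch below builds a PatchGAN-style network in the spirit of [1], where increasing n_layers adds further downsampling blocks; the widths and layer counts shown are illustrative, not our exact configuration.

    # Sketch of a deeper PatchGAN-style discriminator (in the spirit of [1]).
    # Layer counts and widths are illustrative, not our exact configuration.
    import torch.nn as nn

    def patchgan_discriminator(in_channels=6, base=64, n_layers=4):
        # in_channels=6 because the conditional discriminator sees the input and
        # output images concatenated along the channel dimension.
        layers = [nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(1, n_layers):                 # larger n_layers -> deeper discriminator
            out_ch = min(ch * 2, base * 8)
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]  # patch-level real/fake scores
        return nn.Sequential(*layers)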

Figure: Pix2Pix outputs after hyperparameter tuning (1,000 epochs)
  1. We see a significant improvement, as shown above.
  2. The model captures the characters and the style much better than before, though it is still a little rough at times, as can be seen in the 'B' character. This remains an open area for research; training the model even longer, until convergence (which requires substantial compute and time), could further improve the results.

Pix2Pix Visual Analysis

We trained the Pix2Pix model to map from set A (8 Japanese characters and 1 English character) to set B (the target English character in the Japanese font) for 100 epochs. The loss plots can be seen below.

Figure: Pix2Pix training loss plots (G_GAN, G_L1, D_real, D_fake)
  1. G_GAN Loss: This loss represents the Generator's ability to create outputs that the Discriminator classifies as real. The G_GAN loss fluctuates over epochs, which is expected as the Generator learns. However, the general trend does not show a clear decrease, indicating that the Generator may struggle to fool the Discriminator consistently as training continues (the objective behind these loss terms is written out after this list).
  2. G_L1 Loss: The L1 loss typically measures the pixel-wise absolute difference between the generated images and the target images. The plot shows high variability but overall there is a downward trend. This suggests that the Generator is getting better at producing images that are closer to the target on average. Yet, the high variance implies there may be epochs where the model significantly diverges from the target, which could be a sign of instability in learning.
  3. D_real Loss: This metric reflects how well the Discriminator recognizes real images. The loss seems to have significant spikes but trends downward overall, which suggests that the Discriminator is becoming more proficient at identifying real images. The spikes could indicate moments where the Discriminator's performance varies greatly, possibly due to changes in the Generator's outputs or due to the inherent variability of the training data.
  4. D_fake Loss: This loss measures the Discriminator's ability to detect fake images from the Generator. The plot shows a downward trend, indicating that the Discriminator is improving at distinguishing fake images over time. However, like the D_real loss, there is some volatility, suggesting that the model may have epochs where its performance degrades, which is common in adversarial training as the two networks are in a constant tug-of-war.
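
The G_GAN and G_L1 terms above are the two components of the Pix2Pix objective from [1], which the generator minimizes jointly:

$$ G^* = \arg\min_G \max_D\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big], $$

where the first term drives G_GAN (fooling the discriminator) and the second corresponds to G_L1 (pixel-wise closeness to the target), with λ balancing the two.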

2. CycleGAN

We observe that although the CycleGAN model successfully learns some characteristics of the image translation, it does not achieve the same level of performance as Pix2Pix. Some of the decent generated samples from CycleGAN are shown below.

Figure: CycleGAN generated samples (epoch 100)

A clear problem with the model is that it cannot work out which character it is meant to generate; in the first image it is confused between L and O.

Figure: CycleGAN confusing similar characters (L vs O)

Upon analysis, we attribute the weaker performance to the lack of pairwise mapping between the characters and to CycleGAN's cycle consistency loss. This loss ensures that an image translated from one domain to the other and back again should resemble the original image. While this is useful for unpaired data, it can be less efficient than a direct pixel-wise loss (such as L1) when paired data is available. Because of this, the model tries to preserve some characteristics of the Japanese characters in its output so that, in the second half of the cycle, it can regenerate the image with one English character and 8 Japanese characters (a sketch of this loss term follows below). Our training and experimentation expose this flaw of the model architecture for our task.
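
A minimal sketch of this cycle consistency term makes the failure mode concrete; G_AB and G_BA below are PyTorch-style stand-ins for the two generators (the names are ours, not from an existing codebase).

    # Sketch of the cycle consistency loss discussed above. G_AB and G_BA are the two
    # generators; real_A is an "8 Japanese + 1 English" input, real_B a stylized target.
    import torch.nn.functional as nnf

    def cycle_consistency_loss(G_AB, G_BA, real_A, real_B, lambda_cyc=10.0):
        fake_B = G_AB(real_A)     # generated stylized English character
        rec_A = G_BA(fake_B)      # must reconstruct all nine input characters from fake_B
        fake_A = G_BA(real_B)
        rec_B = G_AB(fake_A)
        # Keeping rec_A close to real_A pushes fake_B to retain traces of the Japanese
        # characters -- the behaviour we observe in our generated samples.
        return lambda_cyc * (nnf.l1_loss(rec_A, real_A) + nnf.l1_loss(rec_B, real_B))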

CycleGan Visual Analysis

Figure: CycleGAN training loss plots

From the training loss visualization, we also see that the CycleGAN architecture is confused and the losses spike up and down considerably. This happens because when the model generates a good image for set B, the generator responsible for the B→A translation performs very badly, since it has no visual cues from which to regenerate the initial image. This leads to a deadlock in training, and even with a large number of epochs and hyperparameter tuning the model is not able to converge.

Evaluation

Pix2Pix

Our evaluation process for Japanese-to-English font style transfer employed a carefully curated dataset comprising pairs of Japanese text images and their corresponding English translations, encompassing diverse font styles. Our evaluation for Pix2Pix is based on the following:

EVALUATION METRICS


      Libraries involved
      
        import torch
        import torchvision.transforms as transforms
        from torch.utils.data import DataLoader
        from skimage import io
        from skimage.metrics import structural_similarity as ssim
        import numpy as np
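
The metric computation itself is not reproduced here; as a minimal sketch, SSIM (implied by the imports above) can be averaged over generated/ground-truth pairs as follows, assuming the images are stored as grayscale files (the paths are placeholders).

    # Minimal sketch: average SSIM between generated and ground-truth images.
    # File paths and the grayscale assumption are illustrative.
    import numpy as np
    from skimage import io
    from skimage.metrics import structural_similarity as ssim

    def mean_ssim(generated_paths, target_paths):
        scores = []
        for gen_path, tgt_path in zip(generated_paths, target_paths):
            gen = io.imread(gen_path, as_gray=True)
            tgt = io.imread(tgt_path, as_gray=True)
            scores.append(ssim(gen, tgt, data_range=gen.max() - gen.min()))
        return float(np.mean(scores))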
      
    

MODEL PERFORMANCE

CycleGAN

Our evaluation of CycleGAN's performance in this context provides insights into its capabilities and areas for improvement in comparison to the Pix2Pix model. We evaluated the model on the following metrics:

EVALUATION METRICS

    Libraries involved
      
        import torch
        import torchvision.transforms as transforms
        from torch.utils.data import DataLoader
        from skimage import io
        from skimage.metrics import structural_similarity as ssim
        import numpy as np
      
      

MODEL PERFORMANCE

Comparative Insights

This evaluation underscores CycleGAN's potential in image translation, yet also its limitations for tasks demanding high precision, such as font style transfer. Future enhancements could focus on improving the model's character differentiation capabilities and adapting its architecture for more detail-oriented tasks.

Our Timeline

High Level Gantt

Contribution Table

contribution-table

References

  [1] Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  [2] Wang, Ting-Chun, et al. "High-resolution image synthesis and semantic manipulation with conditional GANs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  [3] Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.

Video Presentation