THESIS
2024
1 online resource (xiii, 79 pages) : color illustrations
Abstract
In this thesis, we address the high resource cost of training recent large text-to-image (T2I) generative models. We propose a three-stage training strategy with stage-specific datasets to reduce training resources and time. i) Pixel dependency learning, where our model learns low-level pixel dependencies from the ImageNet dataset; this stage focuses on understanding the intrinsic pixel relationships in natural images. ii) Text-image alignment learning, where our model learns textual concepts from the SAM dataset, whose captions are refined by a large vision-language model; this stage aligns textual concepts with their visual representations. iii) High-resolution and aesthetic image generation, where our model is fine-tuned to generate high-resolution, aesthetic images; for this purpose, we use an internal dataset similar to JourneyDB. When we combine our three-stage training strategy with an existing parameter-efficient transformer-based diffusion model, experimental results demonstrate that our approach achieves comparable or even superior image quality and semantic control relative to the SOTA T2I model Stable Diffusion XL, while requiring only 10.8% of its training time.
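The three stages can be read as a staged curriculum in which the same model weights are carried from one stage to the next, with only the dataset and resolution changing. The following minimal Python sketch illustrates that structure; every identifier (Stage, train_stage, run_schedule) and all resolution and step numbers are hypothetical placeholders for illustration, not the thesis's actual code or hyperparameters.

    # Minimal, self-contained sketch of a three-stage schedule like the one
    # in the abstract. All names and numbers below are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Stage:
        name: str        # what the stage teaches the model
        dataset: str     # stage-specific data source
        resolution: int  # training resolution for the stage
        steps: int       # illustrative step budget (hypothetical)

    SCHEDULE = [
        # i) low-level pixel dependencies from natural images
        Stage("pixel_dependency", "ImageNet", 256, 100),
        # ii) text-image alignment; SAM captions refined by a vision-language model
        Stage("text_image_alignment", "SAM (refined captions)", 512, 100),
        # iii) high-resolution, aesthetic fine-tuning on a JourneyDB-like set
        Stage("hires_aesthetic", "internal (JourneyDB-like)", 1024, 100),
    ]

    def train_stage(model_state: dict, stage: Stage) -> dict:
        """Stub for one stage; the same weights carry over between stages."""
        for _ in range(stage.steps):
            pass  # one diffusion training step on stage.dataset would go here
        model_state["stages_done"].append(stage.name)
        return model_state

    def run_schedule() -> dict:
        # Each stage initializes from the previous one; this reuse is what
        # lets the total budget stay far below single-stage training.
        model_state = {"stages_done": []}
        for stage in SCHEDULE:
            model_state = train_stage(model_state, stage)
        return model_state

    if __name__ == "__main__":
        print(run_schedule())  # {'stages_done': ['pixel_dependency', ...]}

The key design choice the sketch makes explicit is weight reuse across stages: cheap, low-resolution pretraining does most of the work, so the expensive high-resolution aesthetic stage needs only a short fine-tuning budget.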