CoSeR extracts cognitive features from low-resolution (LR) images and generates high-quality reference images that align closely with the LR input in both semantics and texture. Incorporating the generated reference image together with the cognitive features notably boosts our super-resolution (SR) performance.
Existing real-world super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks.
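To make the overall flow concrete, here is a minimal Python sketch of the two-stage process described above. The function names (cognitive_encoder, stable_diffusion_generate, sr_denoiser) are hypothetical placeholders for the components, so treat this as an illustration of the data flow rather than the released implementation.

```python
# Hypothetical high-level sketch of the CoSeR flow; all function names are
# placeholders, not the actual API of the released code.

def cognitive_super_resolution(lr_image, cognitive_encoder,
                               stable_diffusion_generate, sr_denoiser):
    # 1) Understand the LR image: produce a cognitive embedding that couples
    #    appearance information with language-level semantics.
    cog_emb = cognitive_encoder(lr_image)

    # 2) Use the embedding to activate the pre-trained text-to-image prior and
    #    synthesize a high-quality reference image with matching semantics.
    ref_image = stable_diffusion_generate(cond=cog_emb)

    # 3) Restore the LR image, injecting all conditions (LR image, cognitive
    #    embedding, reference image) into the denoising process.
    return sr_denoiser(lr_image, cog_emb=cog_emb, ref=ref_image)
```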
Enriched by a comprehensive understanding of scene information, CoSeR excels at enhancing high-quality texture details. As demonstrated in the first and second rows, our results exhibit significantly clearer and more realistic fur and facial features for the animals. Similarly, in the third and fourth rows, our method adeptly reconstructs realistic textures such as the anemone tentacles and succulent leaves, achievements unmatched by other methods. In particular, our model's cognitive capabilities enable the recovery of semantic details that are almost lost in the low-resolution inputs: in the first row, only our model successfully restores the dhole's eyes, and in the fifth row, only our method reconstructs the sand within the hourglass.
Our CoSeR employs a dual-stage process for restoring LR images. First, a cognitive encoder analyzes the image content and passes the resulting cognitive embedding to the diffusion model. This activates pre-existing image priors within the pre-trained Stable Diffusion model, facilitating the restoration of intricate details. We further leverage this cognitive understanding to generate high-fidelity reference images that closely align with the input semantics; these reference images serve as auxiliary information to enhance the super-resolution results. Ultimately, our model simultaneously applies three conditional controls to the pre-trained Stable Diffusion model: the LR image, the cognitive embedding, and the reference image.
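The "All-in-Attention" design mentioned in the abstract consolidates these three conditions into a single module. Below is a minimal PyTorch sketch of that idea, assuming each condition contributes key/value tokens to one cross-attention call inside the denoising U-Net; the shapes, per-condition projections, and residual connection are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of an "all-in-attention" block: one attention call whose
# key/value tokens gather every condition (text prompt, cognitive embedding,
# LR features, reference-image features). Assumed design, for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AllInAttention(nn.Module):
    def __init__(self, query_dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        # separate key/value projections per condition stream (assumed choice)
        self.to_kv_text = nn.Linear(cond_dim, 2 * query_dim, bias=False)
        self.to_kv_cog = nn.Linear(cond_dim, 2 * query_dim, bias=False)
        self.to_kv_lr = nn.Linear(cond_dim, 2 * query_dim, bias=False)
        self.to_kv_ref = nn.Linear(cond_dim, 2 * query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)

    def forward(self, x, cog_emb, lr_feat, ref_feat, text_emb=None):
        # x:        (B, N, query_dim)  U-Net feature tokens being denoised
        # cog_emb:  (B, Nc, cond_dim)  cognitive embedding from the encoder
        # lr_feat:  (B, Nl, cond_dim)  tokens from the low-resolution image
        # ref_feat: (B, Nr, cond_dim)  tokens from the generated reference image
        b, n, d = x.shape
        q = self.to_q(x)

        streams = [(self.to_kv_cog, cog_emb), (self.to_kv_lr, lr_feat),
                   (self.to_kv_ref, ref_feat)]
        if text_emb is not None:
            streams.insert(0, (self.to_kv_text, text_emb))

        ks, vs = [], []
        for proj, cond in streams:
            k, v = proj(cond).chunk(2, dim=-1)
            ks.append(k)
            vs.append(v)
        k = torch.cat(ks, dim=1)  # all conditions share a single attention call
        v = torch.cat(vs, dim=1)

        def split(t):  # (B, T, D) -> (B, heads, T, D/heads)
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        return x + self.to_out(out)  # residual, as in Stable Diffusion blocks
```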
We choose to use a feature embedding for the cognition process, rather than directly generating a caption from the LR image, for several reasons. First, although guided by a language embedding, our cognitive embedding retains fine-grained image features, which is advantageous for generating reference images with high semantic similarity. The first row of the figure above shows BLIP2 captions generated from LR images: they fail to identify the precise taxon, color, and texture of the animals, leading to inferior generations compared to our cognitive adapter. Second, pre-trained image captioning models may produce inaccurate captions for LR images due to the shift in input distribution, whereas our cognitive adapter is more robust to LR inputs, as shown in the second row of the figure above. Third, a pre-trained image captioning model requires a substantial number of parameters, potentially reaching 7B, whereas our cognitive adapter is significantly lighter, with only about 3% of the parameters, resulting in favorable efficiency.
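As an illustration of how such a lightweight adapter could look, below is a minimal PyTorch sketch in which a small set of learnable query tokens cross-attends to frozen image-encoder features and outputs a fixed-length embedding in the diffusion model's text-conditioning space. The token count, dimensions, layer count, and use of standard TransformerDecoder layers are assumptions for illustration, not the paper's exact architecture.

```python
# Assumed sketch of a lightweight cognitive adapter: learnable queries
# cross-attend to frozen image features and emit a fixed-length embedding.
import torch
import torch.nn as nn


class CognitiveAdapter(nn.Module):
    def __init__(self, img_dim: int = 1024, embed_dim: int = 1024,
                 num_queries: int = 77, num_layers: int = 4, heads: int = 8):
        super().__init__()
        # learnable queries that become the cognitive embedding tokens
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=embed_dim, nhead=heads,
                                       dim_feedforward=4 * embed_dim,
                                       batch_first=True, norm_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N, img_dim) patch features from a frozen image encoder
        b = img_feats.shape[0]
        mem = self.img_proj(img_feats)
        x = self.queries.unsqueeze(0).expand(b, -1, -1)
        for layer in self.layers:
            # queries self-attend, then cross-attend to the image features
            x = layer(x, mem)
        return self.norm(x)  # (B, num_queries, embed_dim) cognitive embedding
```

Consistent with the description above, such an adapter's output would be guided toward the corresponding language embedding during training; the exact objective is not reproduced here.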
@article{sun2023coser,
  title={CoSeR: Bridging Image and Language for Cognitive Super-Resolution},
  author={Sun, Haoze and Li, Wenbo and Liu, Jianzhuang and Chen, Haoyu and Pei, Renjing and Zou, Xueyi and Yan, Youliang and Yang, Yujiu},
  journal={arXiv preprint arXiv:2311.16512},
  year={2023}
}