AI4media project

We present a novel graph Transformer generative adversarial network (GTGAN) to learn effective graph node relations in an end-to-end fashion for the challenging graph-constrained house generation task. The proposed graph Transformer-based generator includes a novel graph Transformer encoder that combines graph convolutions and self-attention in a Transformer to model both local and global interactions across connected and non-connected graph nodes. Specifically, the proposed connected node attention (CNA) and non-connected node attention (NNA) aim to capture the global relations across connected nodes and non-connected nodes in the input graph, respectively. The proposed graph modeling block (GMB) aims to exploit local vertex interactions based on a house layout topology. Moreover, we propose a new node classification-based discriminator to preserve the high-level semantic and discriminative node features for different house components. Finally, we propose a novel graph-based cycle-consistency loss that aims at maintaining the relative spatial relationships between ground truth and predicted graphs. Experiments on two challenging graph-constrained house generation tasks (i.e., house layout and roof generation) with two public datasets demonstrate the effectiveness of GTGAN in terms of objective quantitative scores and subjective visual realism. New state-of-the-art results are established by large margins on both tasks.
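
As a rough illustration of the split between connected and non-connected node attention described above, the sketch below runs plain scaled dot-product self-attention twice over a toy house graph: once restricted to adjacent rooms and once restricted to non-adjacent rooms. This is not the GTGAN implementation. The function names, the toy graph, the absence of learned query/key/value projections, and the choice to exclude a node from its own non-connected set are all assumptions made for illustration.

```python
import math


def masked_attention(features, mask):
    """Self-attention where node i may only attend to nodes j with mask[i][j] == 1.

    Hypothetical simplification: queries, keys and values are the raw
    node features (no learned projections).
    """
    n = len(features)
    d = len(features[0])
    out = []
    for i in range(n):
        allowed = [j for j in range(n) if mask[i][j]]
        if not allowed:
            out.append([0.0] * d)  # no permitted neighbours: emit zeros
            continue
        # Scaled dot-product scores against the permitted nodes only.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(d)
            for j in allowed
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * features[j][k] for w, j in zip(weights, allowed)) for k in range(d)]
        )
    return out


def split_masks(adjacency):
    """Build one mask for connected pairs and one for non-connected pairs."""
    n = len(adjacency)
    connected = [[1 if adjacency[i][j] else 0 for j in range(n)] for i in range(n)]
    # Assumption: a node is not counted as "non-connected" to itself.
    non_connected = [
        [1 if (i != j and not adjacency[i][j]) else 0 for j in range(n)]
        for i in range(n)
    ]
    return connected, non_connected


# Toy layout graph: 0 = living room, 1 = kitchen, 2 = bedroom, 3 = bathroom.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

connected, non_connected = split_masks(adjacency)
cna_out = masked_attention(features, connected)      # attends over adjacent rooms
nna_out = masked_attention(features, non_connected)  # attends over non-adjacent rooms
```

In this toy setup the kitchen is adjacent only to the living room, so its connected-node output is simply the living room's feature vector, while its non-connected output mixes the bedroom and bathroom features.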

Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation

We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation, where we translate an input image into another while respecting an external semantic guidance. The proposed SelectionGAN explicitly utilizes the semantic guidance information and consists of two stages. In the first stage, the input image and the conditional semantic guidance are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using the proposed multi-scale spatial pooling & channel selection module and the multi-channel attention selection module. Moreover, uncertainty maps automatically learned from attention maps are used to guide the pixel loss for better network optimization. Exhaustive experiments on four challenging guided image-to-image translation tasks (face, hand, body, and street view) demonstrate that our SelectionGAN is able to generate significantly better results than the state-of-the-art methods. Meanwhile, the proposed framework and modules are unified solutions and can be applied to solve other generation tasks such as semantic image synthesis. The code is available at https://github.com/Ha0Tang/SelectionGAN.
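
One plausible reading of the multi-channel attention selection step is sketched below: several candidate images are fused per pixel using attention weights that are normalised with a softmax across the candidates. The abstract does not spell out this formulation, so the fusion rule, the function names, and the flattened single-channel toy images are assumptions for illustration only; the repository linked above is the authoritative reference.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_select(candidates, attention_logits):
    """Fuse candidate images pixel by pixel.

    candidates[c][p]       : value of candidate image c at pixel p
    attention_logits[c][p] : unnormalised attention score for the same entry
    Both layouts are hypothetical (flattened, single-channel images).
    """
    n_channels = len(candidates)
    n_pixels = len(candidates[0])
    fused = []
    for p in range(n_pixels):
        # Normalise across candidates so the weights at each pixel sum to 1.
        weights = softmax([attention_logits[c][p] for c in range(n_channels)])
        fused.append(sum(weights[c] * candidates[c][p] for c in range(n_channels)))
    return fused


# Two candidate "images" with three pixels each.
candidates = [
    [0.0, 1.0, 0.2],
    [1.0, 0.0, 0.2],
]
logits = [
    [0.0, 5.0, 1.0],
    [0.0, -5.0, 1.0],
]
fused = attention_select(candidates, logits)
```

With equal scores a pixel becomes the average of the candidates, and with a strongly dominant score it is taken almost entirely from one candidate, which is the selection behaviour the module name suggests.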

PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation

Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis

Deep Unsupervised Key Frame Extraction for Efficient Video Classification

Bipartite Graph Reasoning GANs for Person Pose and Facial Image Synthesis

Local and Global GANs with Semantic-Aware Upsampling for Image Generation

AttentionGAN: Unpaired Image-to-Image Translation Using Attention-Guided Generative Adversarial Networks

Disentangle Saliency Detection into Cascaded Detail Modeling and Body Filling

Quasi-equilibrium Feature Pyramid Network for Salient Object Detection

3D-Aware Semantic-Guided Generative Model for Human Synthesis

Geometry-Contrastive Transformer for Generalized 3D Pose Transfer

Intrinsic-Extrinsic Preserved GANs for Unsupervised 3D Pose Transfer

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

AniFormer: Data-driven 3D Animation with Transformer

We present a novel task, i.e., animating a target 3D object through the motion of a raw driving sequence. Previous works inevitably require extra auxiliary correlations between source and target meshes, or intermediate factors, to capture the motions in the driving sequences. Instead, we introduce AniFormer, a novel Transformer-based architecture that generates animated 3D sequences by directly taking the raw driving sequences and arbitrary same-type target meshes as inputs. Specifically, we customize the Transformer architecture for 3D animation so that it generates mesh sequences by integrating styles from the target meshes and motions from the driving meshes. In addition, instead of the conventional single regression head in the vanilla Transformer, AniFormer generates multiple frames as outputs to preserve the sequential consistency of the generated meshes. To achieve this, we carefully design a pair of regression constraints, i.e., motion and appearance constraints, that provide strong regularization on the generated mesh sequences. AniFormer achieves high-fidelity, realistic, temporally coherent animated results and outperforms the compared state-of-the-art methods on benchmarks of diverse categories. Code is available at https://github.com/mikecheninoulu/AniFormer.

Cascaded Cross MLP-Mixer GANs for Cross-View Image Translation

Institution: ETH Zurich

Graph Transformer GANs for Graph-Constrained House Generation
