Transformer-in-Vision

A list of recent Transformer-based computer vision (CV) and related works. Comments and contributions are welcome!

The transformer is now a basic component, adopted in nearly all AI models. This list is updated irregularly.

New Hope: LLM-in-Vision

Resource

ChatGPT for Robotics: Design Principles and Model Abilities, [Paper], [Code]
DIFFUSIONDB [Page], [Paper]
LAION-5B [Page], [Paper]
LAVIS [Page], [Paper]
Imagen Video [Page], [Paper]
Phenaki [Page], [Paper]
DREAMFUSION [Page], [Paper]
MAKE-A-VIDEO [Page], [Paper]
Stable Diffusion [Page], [Paper]
NUWA-Infinity [Page], [Paper]
Parti [Page], [Code]
Imagen [Page], [Paper]
Gato: A Generalist Agent, [Paper]
PaLM: Scaling Language Modeling with Pathways, [Paper]
DALL·E 2 [Page], [Paper]
SCENIC: A JAX Library for Computer Vision Research and Beyond, [Code]
V-L joint learning study (with good tables): [METER], [Kaleido-BERT]
Attention is all you need, [Paper]
CLIP [Page], [Paper], [Code], [arXiv]
DALL·E [Page], [Code], [Paper]
huggingface/transformers
Kyubyong/transformer, TF
jadore801120/attention-is-all-you-need-pytorch, Torch
krasserm/fairseq-image-captioning
PyTorch Transformers Tutorials
ictnlp/awesome-transformer
basicv8vc/awesome-transformer
dk-liang/Awesome-Visual-Transformer
yuewang-cuhk/awesome-vision-language-pretraining-papers

Survey

(arXiv 2023.2) TRANSFORMER-BASED SENSOR FUSION FOR AUTONOMOUS DRIVING: A SURVEY, [Paper], [Page]
(arXiv 2023.2) Deep Learning for Video-Text Retrieval: a Review, [Paper]
(arXiv 2023.2) Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey, [Paper]
(arXiv 2023.2) Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey, [Paper]
(arXiv 2023.2) Knowledge Distillation in Vision Transformers: A Critical Review, [Paper]
(arXiv 2023.2) A Survey on Efficient Training of Transformers, [Paper]
(arXiv 2023.1) ChatGPT is not all you need. A State of the Art Review of large Generative AI models, [Paper]
(arXiv 2022.12) Transformers in Action Recognition: A Review on Temporal Modeling, [Paper]
(arXiv 2022.11) Vision Transformers in Medical Imaging: A Review, [Paper]
(arXiv 2022.11) A survey on knowledge-enhanced multimodal learning, [Paper]
(arXiv 2022.10) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, [Paper]
(arXiv 2022.10) A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective, [Paper]
(arXiv 2022.09) VISION TRANSFORMERS FOR ACTION RECOGNITION: A SURVEY, [Paper]
(arXiv 2022.09) Transformers in Remote Sensing: A Survey, [Paper], [Code]
(arXiv 2022.08) 3D Vision with Transformers: A Survey, [Paper], [Code]
(arXiv 2022.08) A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond, [Paper]
(arXiv 2022.07) Vision Transformers: State of the Art and Research Challenges, [Paper]
(arXiv 2022.07) SELF-SUPERVISED LEARNING FOR VIDEOS: A SURVEY, [Paper]
(arXiv 2022.06) Multimodal Learning with Transformers: A Survey, [Paper]
(arXiv 2022.05) Vision Transformer: Vit and its Derivatives, [Paper]
(arXiv 2022.05) Transformers in 3D Point Clouds: A Survey, [Paper]
(arXiv 2022.04) Visual Attention Methods in Deep Learning: An In-Depth Survey, [Paper]
(arXiv 2022.04) Vision-and-Language Pretrained Models: A Survey, [Paper]
(arXiv 2022.03) A Roadmap for Big Model, [Paper]
(arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review, [Paper](https://arxiv.org/pdf/2203.12944.pdf)
(arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, [Paper], [Project]
(arXiv 2022.02) A Survey of Vision-Language Pre-Trained Models, [Paper]
(arXiv 2022.02) VLP: A Survey on Vision-Language Pre-training, [Paper]
(arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]
(arXiv 2022.01) Video Transformers: A Survey, [Paper]
(arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]
(arXiv 2021.11) A Survey of Visual Transformers, [Paper]
(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]
(arXiv 2021.06) A Survey of Transformers, [Paper]
(arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]
(arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]
(arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]
(arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]
(arXiv 2021.01) A Survey on Visual Transformer, [Paper]
(arXiv 2020.9) Efficient Transformers: A Survey, [Paper]
(arXiv 2020.1) Transformers in Vision: A Survey, [Paper]

Recent Papers

2023.8

(arXiv 2023.8) VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control, [Paper], [Project]

2023.5

(arXiv 2023.5) Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields, [Paper]

2023.3

(arXiv 2023.3) Query-Dependent Video Representation for Moment Retrieval and Highlight Detection, [Paper], [Code]

2023.2

(arXiv 2023.2) Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities, [Paper]
(arXiv 2023.2) KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer, [Paper], [Code]
(arXiv 2023.2) HUMAN MOTIONFORMER: TRANSFERRING HUMAN MOTIONS WITH VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2023.2) Aligning Text-to-Image Models using Human Feedback, [Paper]
(arXiv 2023.2) Controlled and Conditional Text to Image Generation with Diffusion Prior, [Paper]
(arXiv 2023.2) Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? [Paper], [Code]
(arXiv 2023.2) OBJECT-CENTRIC VIDEO PREDICTION VIA DECOUPLING OF OBJECT DYNAMICS AND INTERACTIONS, [Paper], [Project]
(arXiv 2023.2) Distribution Normalization: An “Effortless” Test-Time Augmentation for Contrastively Learned Visual-language Models, [Paper], [Code]
(arXiv 2023.2) Teaching CLIP to Count to Ten, [Paper], [Project]
(arXiv 2023.2) Designing an Encoder for Fast Personalization of Text-to-Image Models, [Paper], [Project]
(arXiv 2023.2) Side Adapter Network for Open-Vocabulary Semantic Segmentation, [Paper], [Code]
(arXiv 2023.2) Learning Visual Representations via Language-Guided Sampling, [Paper]
(arXiv 2023.2) VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion, [Paper], [Code]
(arXiv 2023.2) Language-Driven Representation Learning for Robotics, [Paper], [Project]
(arXiv 2023.2) A Convolutional Vision Transformer for Semantic Segmentation of Side-Scan Sonar Data, [Paper], [Code]
(arXiv 2023.2) Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN, [Paper], [Code]
(arXiv 2023.2) VIEWCO: DISCOVERING TEXT-SUPERVISED SEGMENTATION MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY, [Paper], [Code]
(arXiv 2023.2) CertViT: Certified Robustness of Pre-Trained Vision Transformers, [Paper], [Code]
(arXiv 2023.2) Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions, [Paper]
(arXiv 2023.2) MaskedKD: Efficient Distillation of Vision Transformers with Masked Images, [Paper]
(arXiv 2023.2) A General Visual Representation Guided Framework with Global Affinity for Weakly Supervised Salient Object Detection, [Paper]
(arXiv 2023.2) ViTA: A Vision Transformer Inference Accelerator for Edge Applications, [Paper]
(arXiv 2023.2) Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer, [Paper], [Code]
(arXiv 2023.2) A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning, [Paper]
(arXiv 2023.2) StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization, [Paper]
(arXiv 2023.2) Meta Style Adversarial Training for Cross-Domain Few-Shot Learning, [Paper]
(arXiv 2023.2) HYNETER: HYBRID NETWORK TRANSFORMER FOR OBJECT DETECTION, [Paper]
(arXiv 2023.2) STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training, [Paper]
(arXiv 2023.2) Constraint and Union for Partially-Supervised Temporal Sentence Grounding, [Paper]
(arXiv 2023.2) STB-VMM: Swin Transformer Based Video Motion Magnification, [Paper]
(arXiv 2023.2) Fashion Image Retrieval with Multi-Granular Alignment, [Paper]
(arXiv 2023.2) LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation, [Paper]
(arXiv 2023.2) CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension, [Paper], [Code]
(arXiv 2023.2) MaskSketch: Unpaired Structure-guided Masked Image Generation, [Paper]
(arXiv 2023.2) Single Motion Diffusion, [Paper], [Code]
(arXiv 2023.2) Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, [Paper], [Code]
(arXiv 2023.2) ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence, [Paper]
(arXiv 2023.2) ForceFormer: Exploring Social Force and Transformer for Pedestrian Trajectory Prediction, [Paper]
(arXiv 2023.2) Video Probabilistic Diffusion Models in Projected Latent Space, [Paper]
(arXiv 2023.2) Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation, [Paper], [Code]
(arXiv 2023.2) Learning to Substitute Ingredients in Recipes, [Paper]
(arXiv 2023.2) Energy Transformer, [Paper]
(arXiv 2023.2) Efficiency 360: Efficient Vision Transformers, [Paper]
(arXiv 2023.2) A-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting, [Paper]
(arXiv 2023.2) Effective Data Augmentation With Diffusion Models, [Paper], [Project]
(arXiv 2023.2) PRedItOR: Text Guided Image Editing with Diffusion Prior, [Paper]
(arXiv 2023.2) TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation, [Paper]
(arXiv 2023.2) Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection, [Paper]
(arXiv 2023.2) MINOTAUR: Multi-task Video Grounding From Multimodal Queries, [Paper]
(arXiv 2023.2) Towards Efficient Visual Adaption via Structural Re-parameterization, [Paper], [Code]
(arXiv 2023.2) Efficient 3D Object Reconstruction using Visual Transformers, [Paper]
(arXiv 2023.2) Retrieval-augmented Image Captioning, [Paper]
(arXiv 2023.2) Robust Human Motion Forecasting using Transformer-based Model, [Paper]
(arXiv 2023.2) VQ3D: Learning a 3D-Aware Generative Model on ImageNet, [Paper], [Project]
(arXiv 2023.2) UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training, [Paper], [Code]
(arXiv 2023.2) A THEORETICAL UNDERSTANDING OF SHALLOW VISION TRANSFORMERS: LEARNING, GENERALIZATION, AND SAMPLE COMPLEXITY, [Paper]
(arXiv 2023.2) A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models, [Paper]
(arXiv 2023.2) Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters, [Paper], [Code]
(arXiv 2023.2) Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation, [Paper]
(arXiv 2023.2) Towards Local Visual Modeling for Image Captioning, [Paper], [Code]
(arXiv 2023.2) CLIP-RR: IMPROVED CLIP NETWORK FOR RELATION-FOCUSED CROSS-MODAL INFORMATION RETRIEVAL, [Paper]
(arXiv 2023.2) Anticipating Next Active Objects for Egocentric Videos, [Paper], [Code]
(arXiv 2023.2) UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling, [Paper], [Code]
(arXiv 2023.2) TEAM DETR: GUIDE QUERIES AS A PROFESSIONAL TEAM IN DETECTION TRANSFORMERS, [Paper], [Code]
(arXiv 2023.2) ConceptFusion: Open-set Multimodal 3D Mapping, [Paper], [Project]
(arXiv 2023.2) Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for Multi-Modal Fact Verification, [Paper], [Code]
(arXiv 2023.2) PolyFormer: Referring Image Segmentation as Sequential Polygon Generation, [Paper]
(arXiv 2023.2) Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation, [Paper]
(arXiv 2023.2) TFormer: A Transmission-Friendly ViT Model for IoT Devices, [Paper], [Code]
(arXiv 2023.2) Adding Conditional Control to Text-to-Image Diffusion Models, [Paper], [Code]
(arXiv 2023.2) Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames, [Paper]
(arXiv 2023.2) IS MULTI-MODAL VISION SUPERVISION BENEFICIAL TO LANGUAGE? [Paper]
(arXiv 2023.2) Data-Driven Stochastic Motion Evaluation and Optimization with Image by Spatially-Aligned Temporal Encoding, [Paper]
(arXiv 2023.2) Scaling Vision Transformers to 22 Billion Parameters, [Paper]
(arXiv 2023.2) Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation, [Paper], [Code]
(arXiv 2023.2) Mitigating Bias in Visual Transformers via Targeted Alignment, [Paper]
(arXiv 2023.2) IH-ViT: Vision Transformer-based Integrated Circuit Appearance Defect Detection, [Paper]
(arXiv 2023.2) Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning, [Paper]
(arXiv 2023.2) Learning by Asking for Embodied Visual Navigation and Task Completion, [Paper]
(arXiv 2023.2) Reversible Vision Transformers, [Paper], [Code1], [Code2]
(arXiv 2023.2) Neural Congealing: Aligning Images to a Joint Semantic Atlas, [Paper], [Project]
(arXiv 2023.2) Adversarial Prompting for Black Box Foundation Models, [Paper]
(arXiv 2023.2) Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective, [Paper], [Code]
(arXiv 2023.2) CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER ATTENTION, [Paper], [Code]
(arXiv 2023.2) Convolutional Neural Networks Trained to Identify Words Provide a Good Account of Visual Form Priming Effects, [Paper]
(arXiv 2023.2) Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models, [Paper]
(arXiv 2023.2) OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer, [Paper]
(arXiv 2023.2) Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval, [Paper], [Code]
(arXiv 2023.2) SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation, [Paper]
(arXiv 2023.2) PhysFormer++: Facial Video-based Physiological Measurement with SlowFast Temporal Difference Transformer, [Paper]
(arXiv 2023.2) Scaling Self-Supervised End-to-End Driving with Multi-View Attention Learning, [Paper]
(arXiv 2023.2) HumanMAC: Masked Motion Completion for Human Motion Prediction, [Paper], [Project]
(arXiv 2023.2) LAMPP: Language Models as Probabilistic Priors for Perception and Action, [Paper]
(arXiv 2023.2) Zero-Shot Robot Manipulation from Passive Human Videos, [Paper], [Project]
(arXiv 2023.2) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]
(arXiv 2023.2) LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval, [Paper]
(arXiv 2023.2) V1T: large-scale mouse V1 response prediction using a Vision Transformer, [Paper]
(arXiv 2023.2) AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION, [Paper], [Project]
(arXiv 2023.2) KDEformer: Accelerating Transformers via Kernel Density Estimation, [Paper], [Code]
(arXiv 2023.2) Semantic-Guided Image Augmentation with Pre-trained Models, [Paper]
(arXiv 2023.2) X-ReID: Cross-Instance Transformer for Identity-Level Person Re-Identification, [Paper]
(arXiv 2023.2) MOMA: Distill from Self-Supervised Teachers, [Paper]
(arXiv 2023.2) Learning to Agree on Vision Attention for Visual Commonsense Reasoning, [Paper]
(arXiv 2023.2) Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer, [Paper], [Code]
(arXiv 2023.2) LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers, [Paper]
(arXiv 2023.2) Oscillation-free Quantization for Low-bit Vision Transformers, [Paper]
(arXiv 2023.2) Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation, [Paper]
(arXiv 2023.2) Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining, [Paper], [Code]
(arXiv 2023.2) Leaving Reality to Imagination: Robust Classification via Generated Datasets, [Paper], [Code]
(arXiv 2023.2) CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets, [Paper], [Code]
(arXiv 2023.2) Zero-shot Image-to-Image Translation, [Paper], [Project]
(arXiv 2023.2) Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers, [Paper]
(arXiv 2023.2) EXPLICIT BOX DETECTION UNIFIES END-TO-END MULTI-PERSON POSE ESTIMATION, [Paper], [Code]
(arXiv 2023.2) CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based Image Translation, [Paper]
(arXiv 2023.2) DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps, [Paper]
(arXiv 2023.2) CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data, [Paper], [Code]
(arXiv 2023.2) DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2023.2) HDFormer: High-order Directed Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2023.2) IC^3: Image Captioning by Committee Consensus, [Paper], [Code]
(arXiv 2023.2) Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt, [Paper]
(arXiv 2023.2) QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning, [Paper]
(arXiv 2023.2) Vision Transformer-based Feature Extraction for Generalized Zero-Shot Learning, [Paper]
(arXiv 2023.2) Multimodal Chain-of-Thought Reasoning in Language Models, [Paper], [Code]
(arXiv 2023.2) CLIPood: Generalizing CLIP to Out-of-Distributions, [Paper]
(arXiv 2023.2) Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment, [Paper]
(arXiv 2023.2) The geometry of hidden representations of large transformer models, [Paper]
(arXiv 2023.2) Debiasing Vision-Language Models via Biased Prompts, [Paper], [Code]
(arXiv 2023.2) COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR OPEN-VOCABULARY VIDEO RELATION DETECTION, [Paper], [Code]
(arXiv 2023.2) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video, [Paper], [Code]
(arXiv 2023.2) Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization, [Paper]
(arXiv 2023.2) ADAPT: Action-aware Driving Caption Transformer, [Paper], [Code]

2023.1

(arXiv 2023.1) AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers, [Paper], [Code]
(arXiv 2023.1) EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata, [Paper], [Project]
(arXiv 2023.1) Head-Free Lightweight Semantic Segmentation with Linear Transformer, [Paper], [Code]
(arXiv 2023.1) Geometry-biased Transformers for Novel View Synthesis, [Paper], [Project]
(arXiv 2023.1) Continual Few-Shot Learning Using HyperTransformers, [Paper]
(arXiv 2023.1) SEMPPL: PREDICTING PSEUDO-LABELS FOR BETTER CONTRASTIVE REPRESENTATIONS, [Paper]
(arXiv 2023.1) Learning to Summarize Videos by Contrasting Clips, [Paper]
(arXiv 2023.1) Guiding Text-to-Image Diffusion Model Towards Grounded Generation, [Paper], [Project]
(arXiv 2023.1) Domain Expansion of Image Generators, [Paper], [Code]
(arXiv 2023.1) Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study, [Paper]
(arXiv 2023.1) Tracr: Compiled Transformers as a Laboratory for Interpretability, [Paper], [Code]
(arXiv 2023.1) CLIP the Gap: A Single Domain Generalization Approach for Object Detection, [Paper]
(arXiv 2023.1) Text to Point Cloud Localization with Relation-Enhanced Transformer, [Paper]
(arXiv 2023.1) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer, [Paper]
(arXiv 2023.1) Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks, [Paper]
(arXiv 2023.1) ViTs for SITS: Vision Transformers for Satellite Image Time Series, [Paper], [Code]
(arXiv 2023.1) CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP, [Paper]
(arXiv 2023.1) A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction, [Paper], [Project]
(arXiv 2023.1) USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval, [Paper], [Code]
(arXiv 2023.1) SAT: Size-Aware Transformer for 3D Point Cloud Semantic Segmentation, [Paper]
(arXiv 2023.1) Masked Visual Reconstruction in Language Semantic Space, [Paper], [Code]
(arXiv 2023.1) Vision Learners Meet Web Image-Text Pairs, [Paper], [Code]
(arXiv 2023.1) GLIGEN: Open-Set Grounded Text-to-Image Generation, [Paper], [Project]
(arXiv 2023.1) Learning Customized Visual Models with Retrieval-Augmented Knowledge, [Paper], [Project]
(arXiv 2023.1) UATVR: Uncertainty-Adaptive Text-Video Retrieval, [Paper]
(arXiv 2023.1) Learning Aligned Cross-modal Representations for Referring Image Segmentation, [Paper]
(arXiv 2023.1) T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations, [Paper], [Project]
(arXiv 2023.1) DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets, [Paper], [Code]
(arXiv 2023.1) CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition, [Paper]
(arXiv 2023.1) Generating Templated Caption for Video Grounding, [Paper]
(arXiv 2023.1) Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes, [Paper]
(arXiv 2023.1) SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network, [Paper]
(arXiv 2023.1) CLIPTER: Looking at the Bigger Picture in Scene Text Recognition, [Paper]
(arXiv 2023.1) Temporal Perceiving Video-Language Pre-training, [Paper]
(arXiv 2023.1) Joint Representation Learning for Text and 3D Point Cloud, [Paper], [Code]
(arXiv 2023.1) Effective End-to-End Vision Language Pretraining with Semantic Visual Loss, [Paper]
(arXiv 2023.1) PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection, [Paper]
(arXiv 2023.1) Face Recognition in the age of CLIP & Billion image datasets, [Paper]
(arXiv 2023.1) HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2023.1) Towards Models that Can See and Read, [Paper]
(arXiv 2023.1) Embodied Agents for Efficient Exploration and Smart Scene Description, [Paper]
(arXiv 2023.1) Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, [Paper]
(arXiv 2023.1) Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition, [Paper]
(arXiv 2023.1) Multimodal Video Adapter for Parameter Efficient Video Text Retrieval, [Paper]
(arXiv 2023.1) Self Supervision Does Not Help Natural Language Supervision at Scale, [Paper]
(arXiv 2023.1) MULTI-TARGET MULTI-CAMERA VEHICLE TRACKING USING TRANSFORMER-BASED CAMERA LINK MODEL AND SPATIAL-TEMPORAL INFORMATION, [Paper]
(arXiv 2023.1) ATMAN: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation, [Paper]
(arXiv 2023.1) DDS: Decoupled Dynamic Scene-Graph Generation Network, [Paper], [Code]
(arXiv 2023.1) Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences, [Paper]
(arXiv 2023.1) Image Memorability Prediction with Vision Transformers, [Paper]
(arXiv 2023.1) HOLISTICALLY EXPLAINABLE VISION TRANSFORMERS, [Paper]
(arXiv 2023.1) FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer, [Paper]
(arXiv 2023.1) LEGO-Net: Learning Regular Rearrangements of Objects in Rooms, [Paper], [Project]
(arXiv 2023.1) Zorro: the masked multimodal transformer, [Paper]
(arXiv 2023.1) Towards Robust Video Instance Segmentation with Temporal-Aware Transformer, [Paper]
(arXiv 2023.1) Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision, [Paper], [Project]
(arXiv 2023.1) Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation, [Paper], [Code]
(arXiv 2023.1) Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer, [Paper]
(arXiv 2023.1) Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps, [Paper]
(arXiv 2023.1) IMPROVING ACCURACY OF ZERO-SHOT ACTION RECOGNITION WITH HANDCRAFTED FEATURES, [Paper]
(arXiv 2023.1) Learning to View: Decision Transformers for Active Object Detection, [Paper]
(arXiv 2023.1) Visual Semantic Relatedness Dataset for Image Captioning, [Paper], [Code]
(arXiv 2023.1) VERSATILE NEURAL PROCESSES FOR LEARNING IMPLICIT NEURAL REPRESENTATIONS, [Paper], [Code]
(arXiv 2023.1) RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving, [Paper], [Code]
(arXiv 2023.1) Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting, [Paper]
(arXiv 2023.1) Image Super-Resolution using Efficient Striped Window Transformer, [Paper], [Code]
(arXiv 2023.1) Out of Distribution Performance of State of Art Vision Model, [Paper], [Code]
(arXiv 2023.1) Compact Transformer Tracker with Correlative Masked Modeling, [Paper], [Code]
(arXiv 2023.1) Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities, [Paper]
(arXiv 2023.1) Cut and Learn for Unsupervised Object Detection and Instance Segmentation, [Paper], [Code]
(arXiv 2023.1) Explaining Visual Biases as Words by Generating Captions, [Paper], [Code]
(arXiv 2023.1) Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring, [Paper], [Code]
(arXiv 2023.1) Multi-video Moment Ranking with Multimodal Clue, [Paper]
(arXiv 2023.1) SDF-FORMER: MONOCULAR SCENE RECONSTRUCTION WITH 3D SDF TRANSFORMERS, [Paper], [Project]
(arXiv 2023.1) Grounding Language Models to Images for Multimodal Generation, [Paper]
(arXiv 2023.1) Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning, [Paper]
(arXiv 2023.1) A Modular Multi-stage Lightweight Graph Transformer Network for Human Pose and Shape Estimation from 2D Human Pose, [Paper]
(arXiv 2023.1) Priors are Powerful: Improving a Transformer for Multi-camera 3D Detection with 2D Priors, [Paper]
(arXiv 2023.1) UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers, [Paper]
(arXiv 2023.1) Fairness-aware Vision Transformer via Debiased Self-Attention, [Paper]
(arXiv 2023.1) Anchor-Based Adversarially Robust Zero-Shot Learning Driven by Language, [Paper]
(arXiv 2023.1) Distilling Internet-Scale Vision-Language Models into Embodied Agents, [Paper]
(arXiv 2023.1) 6-DoF Robotic Grasping with Transformer, [Paper]
(arXiv 2023.1) Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling, [Paper], [Project]
(arXiv 2023.1) GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis, [Paper], [Code]
(arXiv 2023.1) STAIR: Learning Sparse Text and Image Representation in Grounded Tokens, [Paper]
(arXiv 2023.1) Aerial Image Object Detection With Vision Transformer Detector (ViTDet), [Paper]
(arXiv 2023.1) Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on Image Restoration, [Paper]
(arXiv 2023.1) Debiased Fine-Tuning for Vision-language Models by Prompt Regularization, [Paper], [Code]
(arXiv 2023.1) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, [Paper], [Code]
(arXiv 2023.1) Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval, [Paper]
(arXiv 2023.1) SEAFORMER: SQUEEZE-ENHANCED AXIAL TRANSFORMER FOR MOBILE SEMANTIC SEGMENTATION, [Paper], [Code]
(arXiv 2023.1) Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding, [Paper], [Project]
(arXiv 2023.1) Multimodal Event Transformer for Image-guided Story Ending Generation, [Paper]
(arXiv 2023.1) Style-Aware Contrastive Learning for Multi-Style Image Captioning, [Paper]
(arXiv 2023.1) 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models, [Paper]
(arXiv 2023.1) Semi-Parametric Video-Grounded Text Generation, [Paper]
(arXiv 2023.1) Robust Transformer with Locality Inductive Bias and Feature Normalization, [Paper]
(arXiv 2023.1) LEVERAGING THE THIRD DIMENSION IN CONTRASTIVE LEARNING, [Paper]
(arXiv 2023.1) Understanding Self-Supervised Pretraining with Part-Aware Representation Learning, [Paper]
(arXiv 2023.1) Hypergraph Transformer for Skeleton-based Action Recognition, [Paper]
(arXiv 2023.1) CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, [Paper]
(arXiv 2023.1) InstructPix2Pix: Learning to Follow Image Editing Instructions, [Paper], [Code]
(arXiv 2023.1) OvarNet: Towards Open-vocabulary Object Attribute Recognition, [Paper], [Project]
(arXiv 2023.1) Token Transformer: Can class token help window-based transformer build better long-range interactions? [Paper]
(arXiv 2023.1) Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering, [Paper], [Code]
(arXiv 2023.1) FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2023.1) Parallel Reasoning Network for Human-Object Interaction Detection, [Paper]
(arXiv 2023.1) In Defense of Structural Symbolic Representation for Video Event-Relation Prediction, [Paper]
(arXiv 2023.1) Scene Synthesis from Human Motion, [Paper], [Project]

2022.12

(arXiv 2022.12) EVA: Exploring the Limits of Masked Visual Representation Learning at Scale, [Paper], [Code]
(arXiv 2022.12) OneFormer: One Transformer to Rule Universal Image Segmentation, [Paper], [Code]
(arXiv 2022.12) MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation, [Paper], [Project]
(arXiv 2022.12) Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality, [Paper], [Code]
(arXiv 2022.12) Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations, [Paper], [Code]
(arXiv 2022.12) CLIP-FLOW: CONTRASTIVE LEARNING BY SEMISUPERVISED ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW ESTIMATION, [Paper]
(arXiv 2022.12) INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS, [Paper], [Code]
(arXiv 2022.12) MetaFormer Baselines for Vision, [Paper], [Code]
(arXiv 2022.12) ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design, [Paper], [Code]
(arXiv 2022.12) FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA, [Paper], [Project]
(arXiv 2022.12) Optimizing Prompts for Text-to-Image Generation, [Paper], [Code]
(arXiv 2022.12) Attentive Mask CLIP, [Paper]
(arXiv 2022.12) Rethinking Cooking State Recognition with Vision Transformers, [Paper]
(arXiv 2022.12) Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation, [Paper], [Code]
(arXiv 2022.12) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks, [Paper], [Code]
(arXiv 2022.12) RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers, [Paper]
(arXiv 2022.12) WAVENHANCER: UNIFYING WAVELET AND TRANSFORMER FOR IMAGE ENHANCEMENT, [Paper]
(arXiv 2022.12) AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP 3D REPRESENTATION LEARNING?, [Paper], [Code]
(arXiv 2022.12) SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering, [Paper]
(arXiv 2022.12) Emergent Analogical Reasoning in Large Language Models, [Paper]
(arXiv 2022.12) Unleashing the Power of Visual Prompting At the Pixel Level, [Paper], [Code]
(arXiv 2022.12) Does CLIP Bind Concepts? Probing Compositionality in Large Image Models, [Paper]
(arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer, [Paper], [Code]
(arXiv 2022.12) Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?, [Paper]
(arXiv 2022.12) Benchmarking Spatial Relationships in Text-to-Image Generation, [Paper], [Project]
(arXiv 2022.12) MetaCLUE: Towards Comprehensive Visual Metaphors Research, [Paper], [Project]
(arXiv 2022.12) Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation, [Paper], [Code]
(arXiv 2022.12) Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment, [Paper]
(arXiv 2022.12) Does unsupervised grammar induction need pixels?, [Paper]
(arXiv 2022.12) Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble, [Paper]
(arXiv 2022.12) MAViC: Multimodal Active Learning for Video Captioning, [Paper]
(arXiv 2022.12) What Makes for Good Tokenizers in Vision Transformer?, [Paper]
(arXiv 2022.12) Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations, [Paper], [Code]
(arXiv 2022.12) Generalized Decoding for Pixel, Image, and Language, [Paper], [Project]
(arXiv 2022.12) METEOR Guided Divergence for Video Captioning, [Paper], [Code]
(arXiv 2022.12) SLGTFORMER: AN ATTENTION-BASED APPROACH TO SIGN LANGUAGE RECOGNITION, [Paper], [Code]
(arXiv 2022.12) FROM IMAGES TO TEXTUAL PROMPTS: ZERO-SHOT VQA WITH FROZEN LARGE LANGUAGE MODELS, [Paper], [Code]
(arXiv 2022.12) 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions, [Paper], [Code]
(arXiv 2022.12) Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias, [Paper]
(arXiv 2022.12) Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method, [Paper], [Code]
(arXiv 2022.12) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, [Paper], [Project]
(arXiv 2022.12) Beyond SOT: It’s Time to Track Multiple Generic Objects at Once, [Paper]
(arXiv 2022.12) KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION, [Paper]
(arXiv 2022.12) SegViT: Semantic Segmentation with Plain Vision Transformers, [Paper], [Code]
(arXiv 2022.12) Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features, [Paper]
(arXiv 2022.12) Point·E: A System for Generating 3D Point Clouds from Complex Prompts, [Paper], [Code]
(arXiv 2022.12) Inductive Attention for Video Action Anticipation, [Paper]
(arXiv 2022.12) Image-and-Language Understanding from Pixels Only, [Paper], [Code]
(arXiv 2022.12) FlexiViT: One Model for All Patch Sizes, [Paper], [Code]
(arXiv 2022.12) Unsupervised Object Localization: Observing the Background to Discover Objects, [Paper], [Code]
(arXiv 2022.12) Vision Transformers are Parameter-Efficient Audio-Visual Learners, [Paper], [Project]
(arXiv 2022.12) Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation, [Paper]
(arXiv 2022.12) DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention, [Paper]
(arXiv 2022.12) Enhanced Training of Query-Based Object Detection via Selective Query Recollection, [Paper], [Code]
(arXiv 2022.12) TEXT-GUIDED MASK-FREE LOCAL IMAGE RETOUCHING, [Paper]
(arXiv 2022.12) Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization, [Paper], [Code]
(arXiv 2022.12) One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers, [Paper]
(arXiv 2022.12) ConQueR: Query Contrast Voxel-DETR for 3D Object Detection, [Paper]
(arXiv 2022.12) Examining the Difference Among Transformers and CNNs with Explanation Methods, [Paper]
(arXiv 2022.12) Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding, [Paper], [Code]
(arXiv 2022.12) Dual-branch Cross-Patch Attention Learning for Group Affect Recognition, [Paper]
(arXiv 2022.12) Cross-Modal Similarity-Based Curriculum Learning for Image Captioning, [Paper]
(arXiv 2022.12) NLIP: Noise-robust Language-Image Pre-training, [Paper]
(arXiv 2022.12) LidarCLIP or: How I Learned to Talk to Point Clouds, [Paper], [Code]
(arXiv 2022.12) CLIPSEP: LEARNING TEXT-QUERIED SOUND SEPARATION WITH NOISY UNLABELED VIDEOS, [Paper]
(arXiv 2022.12) Reproducible scaling laws for contrastive language-image learning, [Paper], [Code]
(arXiv 2022.12) WHAT DO VISION TRANSFORMERS LEARN? A VISUAL EXPLORATION, [Paper]
(arXiv 2022.12) Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models, [Paper], [Project]
(arXiv 2022.12) GPVIT: A HIGH RESOLUTION NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION, [Paper], [Code]
(arXiv 2022.12) Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders, [Paper], [Code]
(arXiv 2022.12) Parallel Queries for Human-Object Interaction Detection, [Paper]
(arXiv 2022.12) Structure-Guided Image Completion with Image-level and Object-level Semantic Discriminators, [Paper]
(arXiv 2022.12) Localized Latent Updates for Fine-Tuning Vision-Language Models, [Paper]
(arXiv 2022.12) CamoFormer: Masked Separable Attention for Camouflaged Object Detection, [Paper]
(arXiv 2022.12) FastMIM: Expediting Masked Image Modeling Pre-training for Vision, [Paper], [Code]
(arXiv 2022.12) OAMixer: Object-aware Mixing Layer for Vision Transformers, [Paper], [Code]
(arXiv 2022.12) Doubly Right Object Recognition: A Why Prompt for Visual Rationales, [Paper]
(arXiv 2022.12) RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE, [Paper], [Project]
(arXiv 2022.12) Egocentric Video Task Translation, [Paper]
(arXiv 2022.12) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes, [Paper], [Project]
(arXiv 2022.12) Curriculum Learning Meets Weakly Supervised Modality Correlation Learning, [Paper]
(arXiv 2022.12) IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions, [Paper]
(arXiv 2022.12) MultiAct: Long-Term 3D Human Motion Generation from Multiple Action Labels, [Paper]
(arXiv 2022.12) A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning, [Paper]
(arXiv 2022.12) Beyond Object Recognition: A New Benchmark towards Object Concept Learning, [Paper], [Project]
(arXiv 2022.12) ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation, [Paper], [Code]
(arXiv 2022.12) Structured Vision-Language Pretraining for Computational Cooking, [Paper]
(arXiv 2022.12) MIME: Human-Aware 3D Scene Generation, [Paper], [Project]
(arXiv 2022.12) OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models, [Paper], [Code]
(arXiv 2022.12) Task Bias in Vision-Language Models, [Paper]
(arXiv 2022.12) Multi-Concept Customization of Text-to-Image Diffusion, [Paper], [Code]
(arXiv 2022.12) Few-View Object Reconstruction with Unknown Categories and Camera Poses, [Paper], [Project]
(arXiv 2022.12) Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning, [Paper], [Code]
(arXiv 2022.12) Learning Video Representations from Large Language Models, [Paper], [Project]
(arXiv 2022.12) Frozen CLIP Model is Efficient Point Cloud Backbone, [Paper]
(arXiv 2022.12) DialogCC: Large-scale Multi-Modal Dialogue Dataset, [Paper], [Project]
(arXiv 2022.12) Group Generalized Mean Pooling for Vision Transformer, [Paper]
(arXiv 2022.12) LEARNING DOMAIN INVARIANT PROMPT FOR VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.12) LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models, [Paper]
(arXiv 2022.12) Hyperbolic Contrastive Learning for Visual Representations beyond Objects, [Paper], [Code]

2022.11

(arXiv 2022.11) Texts as Images in Prompt Tuning for Multi-Label Image Recognition, [Paper], [Code]
(arXiv 2022.11) Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation, [Paper]
(arXiv 2022.11) InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images, [Paper]
(arXiv 2022.11) VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval, [Paper], [Code]
(arXiv 2022.11) Completing point cloud from few points by Wasserstein GAN and Transformers, [Paper], [Code]
(arXiv 2022.11) Integrally Pre-Trained Transformer Pyramid Networks, [Paper], [Code]
(arXiv 2022.11) Data Augmentation Vision Transformer for Fine-grained Image Classification, [Paper]
(arXiv 2022.11) DETRs with Collaborative Hybrid Assignments Training, [Paper], [Code]
(arXiv 2022.11) Open-vocabulary Attribute Detection, [Paper], [Project]
(arXiv 2022.11) Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation, [Paper], [Code]
(arXiv 2022.11) Inversion-Based Creativity Transfer with Diffusion Models, [Paper], [Code]
(arXiv 2022.11) CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning, [Paper]
(arXiv 2022.11) SVFormer: Semi-supervised Video Transformer for Action Recognition, [Paper], [Code]
(arXiv 2022.11) Generalizable Implicit Neural Representations via Instance Pattern Composers, [Paper]
(arXiv 2022.11) Improving Visual-textual Sentiment Analysis by Fusing Expert Features, [Paper]
(arXiv 2022.11) Self-Supervised Learning based on Heat Equation, [Paper]
(arXiv 2022.11) Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors, [Paper]
(arXiv 2022.11) Paint by Example: Exemplar-based Image Editing with Diffusion Models, [Paper], [Code]
(arXiv 2022.11) Human or Machine? Turing Tests for Vision and Language, [Paper], [Code]
(arXiv 2022.11) Teach-DETR: Better Training DETR with Teachers, [Paper], [Code]
(arXiv 2022.11) Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition, [Paper]
(arXiv 2022.11) X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.11) Aligning Source Visual and Target Language Domains for Unpaired Video Captioning, [Paper]
(arXiv 2022.11) On the Transferability of Visual Features in Generalized Zero-Shot Learning, [Paper], [Code]
(arXiv 2022.11) Generalizable Industrial Visual Anomaly Detection with Self-Induction Vision Transformer, [Paper]
(arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification, [Paper], [Code]
(arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring, [Paper], [Code]
(arXiv 2022.11) Event Transformer+. A multi-purpose solution for efficient event data processing, [Paper]
(arXiv 2022.11) MagicPony: Learning Articulated 3D Animals in the Wild, [Paper], [Project]
(arXiv 2022.11) Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free Continual Learning of Vision Transformers, [Paper], [Code]
(arXiv 2022.11) Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations, [Paper], [Code]
(arXiv 2022.11) N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution, [Paper]
(arXiv 2022.11) Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models, [Paper], [Code]
(arXiv 2022.11) Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training, [Paper], [Code]
(arXiv 2022.11) Unifying Vision-Language Representation Space with Single-tower Transformer, [Paper]
(arXiv 2022.11) DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting, [Paper]
(arXiv 2022.11) Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference, [Paper]
(arXiv 2022.11) CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering, [Paper]
(arXiv 2022.11) Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics, [Paper]
(arXiv 2022.11) A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset, [Paper]
(arXiv 2022.11) Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection, [Paper]
(arXiv 2022.11) DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization, [Paper]
(arXiv 2022.11) TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer, [Paper]
(arXiv 2022.11) Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models, [Paper], [Code]
(arXiv 2022.11) Are Out-of-Distribution Detection Methods Reliable?, [Paper]
(arXiv 2022.11) GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
(arXiv 2022.11) CROSS-MODAL CONTRASTIVE LEARNING FOR ROBUST REASONING IN VQA, [Paper], [Code]
(arXiv 2022.11) LISA: Localized Image Stylization with Audio via Implicit Neural Representation, [Paper]
(arXiv 2022.11) MagicVideo: Efficient Video Generation With Latent Diffusion Models, [Paper], [Code]
(arXiv 2022.11) DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning, [Paper]
(arXiv 2022.11) Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation, [Paper]
(arXiv 2022.11) Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification, [Paper]
(arXiv 2022.11) Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation, [Paper]
(arXiv 2022.11) You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model, [Paper]
(arXiv 2022.11) Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers, [Paper]
(arXiv 2022.11) FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer, [Paper], [Code]
(arXiv 2022.11) PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism, [Paper]
(arXiv 2022.11) On the Robustness, Generalization, and Forgetting of Shape-Texture Debiased Continual Learning, [Paper]
(arXiv 2022.11) Vision Transformer with Super Token Sampling, [Paper], [Code]
(arXiv 2022.11) Detect Only What You Specify: Object Detection with Linguistic Target, [Paper]
(arXiv 2022.11) Visual Programming: Compositional visual reasoning without training, [Paper], [Project]
(arXiv 2022.11) ClipCrop: Conditioned Cropping Driven by Vision-Language Model, [Paper]
(arXiv 2022.11) SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training, [Paper]
(arXiv 2022.11) Blur Interpolation Transformer for Real-World Motion from Blur, [Paper]
(arXiv 2022.11) Mean Shift Mask Transformer for Unseen Object Instance Segmentation, [Paper], [Code]
(arXiv 2022.11) PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning, [Paper], [Code]
(arXiv 2022.11) Exploring Discrete Diffusion Models for Image Captioning, [Paper], [Code]
(arXiv 2022.11) PERCEIVER-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention, [Paper], [Code]
(arXiv 2022.11) Multitask Vision-Language Prompt Tuning, [Paper], [Code]
(arXiv 2022.11) Teaching Structured Vision & Language Concepts to Vision & Language Models, [Paper]
(arXiv 2022.11) WEIGHTED ENSEMBLE SELF-SUPERVISED LEARNING, [Paper]
(arXiv 2022.11) BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision, [Paper]
(arXiv 2022.11) Task Residual for Tuning Vision-Language Models, [Paper], [Code]
(arXiv 2022.11) αDARTS Once More: Enhancing Differentiable Architecture Search by Masked Image Modeling, [Paper]
(arXiv 2022.11) Delving into Transformer for Incremental Semantic Segmentation, [Paper]
(arXiv 2022.11) DETRDistill: A Universal Knowledge Distillation Framework for DETR-families, [Paper]
(arXiv 2022.11) PromptCap: Prompt-Guided Task-Aware Image Captioning, [Paper]
(arXiv 2022.11) UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH VIDEO UNIFORMER, [Paper], [Code]
(arXiv 2022.11) Masked Reconstruction Contrastive Learning with Information Bottleneck Principle, [Paper]
(arXiv 2022.11) Listen, denoise, action! Audio-driven motion synthesis with diffusion models, [Paper], [Project]
(arXiv 2022.11) ConStruct-VL: Data-Free Continual Structured VL Concepts Learning, [Paper]
(arXiv 2022.11) How to Fine-Tune Vision Models with SGD, [Paper]
(arXiv 2022.11) Progressive Tree-Structured Prototype Network for End-to-End Image Captioning, [Paper], [Code]
(arXiv 2022.11) CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge, [Paper], [Code]
(arXiv 2022.11) Visual Commonsense-aware Representation Network for Video Captioning, [Paper], [Code]
(arXiv 2022.11) Language Conditioned Spatial Relation Reasoning for 3D Object Grounding, [Paper], [Code]
(arXiv 2022.11) HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors, [Paper], [Code]
(arXiv 2022.11) Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information, [Paper], [Code]
(arXiv 2022.11) Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.11) D^3ETR: Decoder Distillation for Detection Transformer, [Paper]
(arXiv 2022.11) CAE v2: Context Autoencoder with CLIP Target, [Paper]
(arXiv 2022.11) Cross-Modal Adapter for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.11) TOKEN TURING MACHINES, [Paper]
(arXiv 2022.11) WILL LARGE-SCALE GENERATIVE MODELS CORRUPT FUTURE DATASETS?, [Paper], [Code]
(arXiv 2022.11) Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application, [Paper]
(arXiv 2022.11) SATVSR: Scenario Adaptive Transformer for Cross Scenarios Video Super-Resolution, [Paper]
(arXiv 2022.11) TransCC: Transformer-based Multiple Illuminant Color Constancy Using Multitask Learning, [Paper]
(arXiv 2022.11) Stare at What You See: Masked Image Modeling without Reconstruction, [Paper], [Code]
(arXiv 2022.11) HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers, [Paper]
(arXiv 2022.11) Cross-domain Federated Adaptive Prompt Tuning for CLIP, [Paper]
(arXiv 2022.11) YORO - Lightweight End to End Visual Grounding, [Paper]
(arXiv 2022.11) Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling, [Paper]
(arXiv 2022.11) BiViT: Extremely Compressed Binary Vision Transformer, [Paper]
(arXiv 2022.11) ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations, [Paper]
(arXiv 2022.11) Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment, [Paper]
(arXiv 2022.11) Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding, [Paper], [Project]
(arXiv 2022.11) Enhancing Few-Shot Image Classification with Cosine Transformer, [Paper], [Code]
(arXiv 2022.11) SCOTCH and SODA: A Transformer Video Shadow Detection Framework, [Paper]
(arXiv 2022.11) AU-Aware Vision Transformers for Biased Facial Expression Recognition, [Paper]
(arXiv 2022.11) Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces, [Paper], [Code]
(arXiv 2022.11) Large-Scale Bidirectional Training for Zero-Shot Image Captioning, [Paper]
(arXiv 2022.11) Grafting Pre-trained Models for Multimodal Headline Generation, [Paper]
(arXiv 2022.11) CabViT: Cross Attention among Blocks for Vision Transformer, [Paper], [Code]
(arXiv 2022.11) Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization, [Paper]
(arXiv 2022.11) SSGVS: Semantic Scene Graph-to-Video Synthesis, [Paper]
(arXiv 2022.11) One-Time Model Adaptation to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design, [Paper]
(arXiv 2022.11) An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention, [Paper]
(arXiv 2022.11) Zero-shot Visual Commonsense Immorality Prediction, [Paper], [Code]
(arXiv 2022.11) Hyperbolic Cosine Transformer for LiDAR 3D Object Detection, [Paper]
(arXiv 2022.11) Training a Vision Transformer from scratch in less than 24 hours with 1 GPU, [Paper], [Code]
(arXiv 2022.11) ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention, [Paper]
(arXiv 2022.11) SimOn: A Simple Framework for Online Temporal Action Localization, [Paper], [Code]
(arXiv 2022.11) ERNIE-UNIX^2: A UNIFIED CROSS-LINGUAL CROSS-MODAL FRAMEWORK FOR UNDERSTANDING AND GENERATION, [Paper]
(arXiv 2022.11) SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation, [Paper]
(arXiv 2022.11) Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions, [Paper]
(arXiv 2022.11) VieCap4H - VLSP 2021: ObjectAoA - Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning, [Paper]
(arXiv 2022.11) Watching the News: Towards VideoQA Models that can Read, [Paper], [Project]
(arXiv 2022.11) Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer, [Paper]
(arXiv 2022.11) Demystify Transformers & Convolutions in Modern Image Deep Networks, [Paper], [Code]
(arXiv 2022.11) InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions, [Paper], [Code]
(arXiv 2022.11) DEPTHFORMER: MULTIMODAL POSITIONAL ENCODINGS AND CROSS-INPUT ATTENTION FOR TRANSFORMER-BASED SEGMENTATION NETWORKS, [Paper]
(arXiv 2022.11) Sequential Transformer for End-to-End Person Search, [Paper]
(arXiv 2022.11) Prompting Large Pre-trained Vision-Language Models For Compositional Concept Learning, [Paper]
(arXiv 2022.11) CASA: Category-agnostic Skeletal Animal Reconstruction, [Paper]
(arXiv 2022.11) ViT-CX: Causal Explanation of Vision Transformers, [Paper]
(arXiv 2022.11) Disentangling Content and Motion for Text-Based Neural Video Manipulation, [Paper]
(arXiv 2022.11) Efficient Multi-order Gated Aggregation Network, [Paper]
(arXiv 2022.11) CLOP: Video-and-Language Pre-Training with Knowledge Regularizations, [Paper]
(arXiv 2022.11) MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation Detection and Localization, [Paper]
(arXiv 2022.11) Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.11) Zero-shot Video Moment Retrieval With Off-the-Shelf Models, [Paper]
(arXiv 2022.11) Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization, [Paper]
(arXiv 2022.11) A Transformer Architecture for Online Gesture Recognition of Mathematical Expressions, [Paper]
(arXiv 2022.11) Evaluating and Improving Factuality in Multimodal Abstractive Summarization, [Paper], [Code]
(arXiv 2022.11) RCDPT: RADAR-CAMERA FUSION DENSE PREDICTION TRANSFORMER, [Paper]
(arXiv 2022.11) Video Event Extraction via Tracking Visual States of Arguments, [Paper]
(arXiv 2022.11) The Lottery Ticket Hypothesis for Vision Transformers, [Paper]
(arXiv 2022.11) TEXTCRAFT: ZERO-SHOT GENERATION OF HIGH-FIDELITY AND DIVERSE SHAPES FROM TEXT, [Paper]
(arXiv 2022.11) PolyBuilding: Polygon Transformer for End-to-End Building Extraction, [Paper]
(arXiv 2022.11) RETHINKING HIERARCHIES IN PRE-TRAINED PLAIN VISION TRANSFORMER, [Paper], [Code]
(arXiv 2022.11) SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency, [Paper]
(arXiv 2022.11) Could Giant Pretrained Image Models Extract Universal Representations?, [Paper]
(arXiv 2022.11) MAEDAY: MAE for few and zero shot AnomalY-Detection, [Paper], [Code]
(arXiv 2022.11) Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations, [Paper]
(arXiv 2022.11) Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding, [Paper], [Code]
(arXiv 2022.11) SpaText: Spatio-Textual Representation for Controllable Image Generation, [Paper], [Project]
(arXiv 2022.11) Learning 3D Scene Priors with 2D Supervision, [Paper], [Project]
(arXiv 2022.11) PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation, [Paper], [Code]
(arXiv 2022.11) Spatial-Spectral Transformer for Hyperspectral Image Denoising, [Paper], [Code]
(arXiv 2022.11) Training Vision-Language Models with Less Bimodal Supervision, [Paper]
(arXiv 2022.11) Text-Only Training for Image Captioning using Noise-Injected CLIP, [Paper], [Code]
(arXiv 2022.11) Attention-based Neural Cellular Automata, [Paper]
(arXiv 2022.11) eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, [Paper], [Code]
(arXiv 2022.11) Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, [Paper], [Code]
(arXiv 2022.11) P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection, [Paper]
(arXiv 2022.11) tSF: Transformer-based Semantic Filter for Few-Shot Learning, [Paper]
(arXiv 2022.11) WITT: A WIRELESS IMAGE TRANSMISSION TRANSFORMER FOR SEMANTIC COMMUNICATIONS, [Paper], [Code]
(arXiv 2022.11) Pair DETR: Contrastive Learning Speeds Up DETR Training, [Paper]
(arXiv 2022.11) Interaction Visual Transformer for Egocentric Action Anticipation, [Paper]
(arXiv 2022.11) UDE: A Unified Driving Engine for Human Motion Generation, [Paper], [Code]
(arXiv 2022.11) Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation, [Paper], [Project]
(arXiv 2022.11) Knowledge Prompting for Few-shot Action Recognition, [Paper]
(arXiv 2022.11) UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance, [Paper], [Project]
(arXiv 2022.11) LVP-M^3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation, [Paper]
(arXiv 2022.11) PROCONTEXT: PROGRESSIVE CONTEXT TRANSFORMER FOR TRACKING, [Paper], [Code]
(arXiv 2022.11) Video based Object 6D Pose Estimation using Transformers, [Paper], [Code]
(arXiv 2022.11) S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention, [Paper], [Code]
(arXiv 2022.11) Holistic Interaction Transformer Network for Action Detection, [Paper], [Code]
(arXiv 2022.11) Learning and Retrieval from Prior Data for Skill-based Imitation Learning, [Paper], [Code]
(arXiv 2022.11) SimpleClick: Interactive Image Segmentation with Simple Vision Transformers, [Paper], [Code]
(arXiv 2022.11) TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition, [Paper], [Code]
(arXiv 2022.11) CPL: Counterfactual Prompt Learning for Vision and Language Models, [Paper], [Code]
(arXiv 2022.11) Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training, [Paper]
(arXiv 2022.11) Selective Query-guided Debiasing for Video Corpus Moment Retrieval, [Paper]
(arXiv 2022.11) Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning, [Paper], [Code]
(arXiv 2022.11) DENOISING MASKED AUTOENCODERS ARE CERTIFIABLE ROBUST VISION LEARNERS, [Paper], [Code]
(arXiv 2022.11) Token-Label Alignment for Vision Transformers, [Paper], [Code]
(arXiv 2022.11) CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory, [Paper], [Code]
(arXiv 2022.11) Multi-Scale Wavelet Transformer for Face Forgery Detection, [Paper]
(arXiv 2022.11) CLIP-PAE: PROJECTION-AUGMENTATION EMBEDDING TO EXTRACT RELEVANT FEATURES FOR A DISENTANGLED, INTERPRETABLE, AND CONTROLLABLE TEXT-GUIDED IMAGE MANIPULATION, [Paper]
(arXiv 2022.11) VISUAL PROMPT TUNING FOR TEST-TIME DOMAIN ADAPTATION, [Paper]
(arXiv 2022.11) FastCLIPstyler: Optimisation-free Text-based Image Style Transfer Using Style Representations, [Paper]
(arXiv 2022.11) PROGRESSIVE DENOISING MODEL FOR FINE-GRAINED TEXT-TO-IMAGE GENERATION, [Paper]
(arXiv 2022.11) DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics, [Paper], [Project]
(arXiv 2022.11) Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning, [Paper], [Code]
(arXiv 2022.11) ACCURATE IMAGE RESTORATION WITH ATTENTION RETRACTABLE TRANSFORMER, [Paper], [Code]
(arXiv 2022.11) Dilated Neighborhood Attention Transformer, [Paper], [Code]
(arXiv 2022.11) Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval, [Paper]
(arXiv 2022.11) TVLT: Textless Vision-Language Transformer, [Paper], [Code]

2022.10

(arXiv 2022.10) DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention, [Paper]
(arXiv 2022.10) TFORMER: 3D TOOTH SEGMENTATION IN MESH SCANS WITH GEOMETRY GUIDED TRANSFORMER, [Paper]
(arXiv 2022.10) ON-THE-FLY OBJECT DETECTION USING STYLEGAN WITH CLIP GUIDANCE, [Paper]
(arXiv 2022.10) Image-free Domain Generalization via CLIP for 3D Hand Pose Estimation, [Paper]
(arXiv 2022.10) Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition, [Paper]
(arXiv 2022.10) A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE MASKED AUTOENCODER FOR LEARNING VISUAL REPRESENTATIONS, [Paper]
(arXiv 2022.10) Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection, [Paper]
(arXiv 2022.10) Foreign Object Debris Detection for Airport Pavement Images based on Self-supervised Localization and Vision Transformer, [Paper]
(arXiv 2022.10) ViT-LSLA: Vision Transformer with Light Self-Limited-Attention, [Paper]
(arXiv 2022.10) Generative Negative Text Replay for Continual Vision-Language Pretraining, [Paper]
(arXiv 2022.10) PatchRot: A Self-Supervised Technique for Training Vision Transformers, [Paper]
(arXiv 2022.10) MULTIMODAL TRANSFORMER DISTILLATION FOR AUDIO-VISUAL SYNCHRONIZATION, [Paper]
(arXiv 2022.10) Multimodal Transformer for Parallel Concatenated Variational Autoencoders, [Paper]
(arXiv 2022.10) Differentially Private CutMix for Split Learning with Vision Transformer, [Paper]
(arXiv 2022.10) OHMG: ZERO-SHOT OPEN-VOCABULARY HUMAN MOTION GENERATION, [Paper]
(arXiv 2022.10) VLT: Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper]
(arXiv 2022.10) PSFORMER: POINT TRANSFORMER FOR 3D SALIENT OBJECT DETECTION, [Paper]
(arXiv 2022.10) GRAFTING VISION TRANSFORMERS, [Paper]
(arXiv 2022.10) Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems, [Paper]
(arXiv 2022.10) FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning, [Paper]
(arXiv 2022.10) Masked Vision-Language Transformer in Fashion, [Paper], [Code]
(arXiv 2022.10) Learning Variational Motion Prior for Video-based Motion Capture, [Paper]
(arXiv 2022.10) Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models, [Paper], [Code]
(arXiv 2022.10) TEXT2MODEL: MODEL INDUCTION FOR ZERO-SHOT GENERALIZATION USING TASK DESCRIPTIONS, [Paper]
(arXiv 2022.10) Learning Joint Representation of Human Motion and Language, [Paper]
(arXiv 2022.10) ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, [Paper]
(arXiv 2022.10) MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving, [Paper]
(arXiv 2022.10) Li3DeTr: A LiDAR based 3D Detection Transformer, [Paper]
(arXiv 2022.10) Masked Transformer for image Anomaly Localization, [Paper]
(arXiv 2022.10) Discovering Design Concepts for CAD Sketches, [Paper]
(arXiv 2022.10) Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering, [Paper]
(arXiv 2022.10) End-to-End Multimodal Representation Learning for Video Dialog, [Paper]
(arXiv 2022.10) TPFNet: A Novel Text In-painting Transformer for Text Removal, [Paper], [Code]
(arXiv 2022.10) IMU2CLIP: MULTIMODAL CONTRASTIVE LEARNING FOR IMU MOTION SENSORS FROM EGOCENTRIC VIDEOS AND TEXT NARRATIONS, [Paper]
(arXiv 2022.10) Can Transformer Attention Spread Give Insights Into Uncertainty of Detected and Tracked Objects?, [Paper]
(arXiv 2022.10) SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.10) End-to-end Tracking with a Multi-query Transformer, [Paper]
(arXiv 2022.10) Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets, [Paper], [Code]
(arXiv 2022.10) TAMFORMER: MULTI-MODAL TRANSFORMER WITH LEARNED ATTENTION MASK FOR EARLY INTENT PREDICTION, [Paper]
(arXiv 2022.10) VISUAL ANSWER LOCALIZATION WITH CROSS-MODAL MUTUAL KNOWLEDGE TRANSFER, [Paper], [Code]
(arXiv 2022.10) Visual Semantic Parsing: From Images to Abstract Meaning Representation, [Paper]
(arXiv 2022.10) End-to-end Transformer for Compressed Video Quality Enhancement, [Paper]
(arXiv 2022.10) PlanT: Explainable Planning Transformers via Object-Level Representations, [Paper], [Project]
(arXiv 2022.10) Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations, [Paper], [Code]
(arXiv 2022.10) GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction, [Paper]
(arXiv 2022.10) VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge, [Paper], [Code]
(arXiv 2022.10) Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision, [Paper]
(arXiv 2022.10) Learning Explicit Object-Centric Representations with Vision Transformers, [Paper]
(arXiv 2022.10) Abductive Action Inference, [Paper]
(arXiv 2022.10) Minutiae-Guided Fingerprint Embeddings via Vision Transformers, [Paper], [Code]
(arXiv 2022.10) 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows, [Paper]
(arXiv 2022.10) COMPOSING ENSEMBLES OF PRE-TRAINED MODELS VIA ITERATIVE CONSENSUS, [Paper], [Code]
(arXiv 2022.10) Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?, [Paper]
(arXiv 2022.10) Boosting vision transformers for image retrieval, [Paper], [Code]
(arXiv 2022.10) LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling, [Paper]
(arXiv 2022.10) Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding, [Paper]
(arXiv 2022.10) Face Pyramid Vision Transformer, [Paper], [Code]
(arXiv 2022.10) Context-Enhanced Stereo Transformer, [Paper], [Code]
(arXiv 2022.10) CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers, [Paper], [Code]
(arXiv 2022.10) Rethinking Learning Approaches for Long-Term Action Anticipation, [Paper], [Code]
(arXiv 2022.10) Extending Phrase Grounding with Pronouns in Visual Dialogues, [Paper], [Code]
(arXiv 2022.10) Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets, [Paper], [Code]
(arXiv 2022.10) Transformers For Recognition In Overhead Imagery: A Reality Check, [Paper]
(arXiv 2022.10) Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation, [Paper], [Code]
(arXiv 2022.10) UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection, [Paper]
(arXiv 2022.10) LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers, [Paper]
(arXiv 2022.10) Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization, [Paper], [Code]
(arXiv 2022.10) Language-free Training for Zero-shot Video Grounding, [Paper], [Code]
(arXiv 2022.10) Foreground Guidance and Multi-Layer Feature Fusion for Unsupervised Object Discovery with Transformers, [Paper], [Code]
(arXiv 2022.10) Towards Unifying Reference Expression Generation and Comprehension, [Paper]
(arXiv 2022.10) Robust Self-Supervised Learning with Lie Groups, [Paper]
(arXiv 2022.10) VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors, [Paper], [Code]
(arXiv 2022.10) VTC: Improving Video-Text Retrieval with User Comments, [Paper], [Project]
(arXiv 2022.10) SOLVING REASONING TASKS WITH A SLOT TRANSFORMER, [Paper], [Code]
(arXiv 2022.10) Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models, [Paper]
(arXiv 2022.10) Grounded Video Situation Recognition, [Paper], [Project]
(arXiv 2022.10) Single Image Super-Resolution Using Lightweight Networks Based on Swin Transformer, [Paper]
(arXiv 2022.10) Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation, [Paper], [Code]
(arXiv 2022.10) MovieCLIP: Visual Scene Recognition in Movies, [Paper]
(arXiv 2022.10) PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points, [Paper], [Code]
(arXiv 2022.10) TOWARDS SUSTAINABLE SELF-SUPERVISED LEARNING, [Paper]
(arXiv 2022.10) Visual-Semantic Contrastive Alignment for Few-Shot Image Classification, [Paper]
(arXiv 2022.10) i-MAE: ARE LATENT REPRESENTATIONS IN MASKED AUTOENCODERS LINEARLY SEPARABLE?, [Paper], [Code]
(arXiv 2022.10) 2nd Place Solution to ECCV 2022 Challenge: Transformer-based Action recognition in hand-object interacting scenarios, [Paper]
(arXiv 2022.10) 1st Place Solution to ECCV 2022 Challenge on HBHA: Transformer-based Global 3D Hand Pose Estimation in Two Hands Manipulating Objects Scenarios, [Paper]
(arXiv 2022.10) DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models, [Paper]
(arXiv 2022.10) CLIP-Driven Fine-grained Text-Image Person Re-identification, [Paper]
(arXiv 2022.10) Dense but Efficient VideoQA for Intricate Compositional Reasoning, [Paper]
(arXiv 2022.10) Multi-view Gait Recognition based on Siamese Vision Transformer, [Paper]
(arXiv 2022.10) TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation, [Paper], [Code]
(arXiv 2022.10) CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion, [Paper], [Project]
(arXiv 2022.10) A Unified View of Masked Image Modeling, [Paper], [Code]
(arXiv 2022.10) Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval, [Paper], [Code]
(arXiv 2022.10) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.10) TOKEN MERGING: YOUR VIT BUT FASTER, [Paper], [Code]
(arXiv 2022.10) Using Language to Extend to Unseen Domains, [Paper], [Code]
(arXiv 2022.10) SWINV2-IMAGEN: HIERARCHICAL VISION TRANSFORMER DIFFUSION MODELS FOR TEXT-TO-IMAGE GENERATION, [Paper]
(arXiv 2022.10) HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes, [Paper], [Project]
(arXiv 2022.10) Transfer-learning for video classification: Video Swin Transformer on multiple domains, [Paper]
(arXiv 2022.10) PERCEPTUAL GROUPING IN VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.10) How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders, [Paper], [Code]
(arXiv 2022.10) LINEAR VIDEO TRANSFORMER WITH FEATURE FIXATION, [Paper], [Code]
(arXiv 2022.10) Transformer-based dimensionality reduction, [Paper]
(arXiv 2022.10) Bridging the Domain Gap for Multi-Agent Perception, [Paper]
(arXiv 2022.10) TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos, [Paper], [Code]
(arXiv 2022.10) SCRATCHING VISUAL TRANSFORMER’S BACK WITH UNIFORM ATTENTION, [Paper]
(arXiv 2022.10) Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective, [Paper]
(arXiv 2022.10) TLDW: Extreme Multimodal Summarisation of News Videos, [Paper], [Code]
(arXiv 2022.10) Character-Centric Story Visualization via Visual Planning and Token Alignment, [Paper], [Code]
(arXiv 2022.10) COFAR: Commonsense and Factual Reasoning in Image Search, [Paper], [Code]
(arXiv 2022.10) Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers, [Paper], [Code]
(arXiv 2022.10) Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows, [Paper]
(arXiv 2022.10) Forecasting Human Trajectory from Scene History, [Paper], [Code]
(arXiv 2022.10) SGRAM: Improving Scene Graph Parsing via Abstract Meaning Representation, [Paper]
(arXiv 2022.10) Contrastive Language-Image Pre-Training with Knowledge Graphs, [Paper]
(arXiv 2022.10) A Saccaded Visual Transformer for General Object Spotting, [Paper]
(arXiv 2022.10) Vision Transformers provably learn spatial structure, [Paper]
(arXiv 2022.10) oViT: An Accurate Second-Order Pruning Framework for Vision Transformers, [Paper]
(arXiv 2022.10) Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning, [Paper], [Code]
(arXiv 2022.10) Non-Contrastive Learning Meets Language-Image Pre-Training, [Paper]
(arXiv 2022.10) Frame Mining: a Free Lunch for Learning Robotic Manipulation from 3D Point Clouds, [Paper], [Project]
(arXiv 2022.10) Pretrained Transformers Do not Always Improve Robustness, [Paper]
(arXiv 2022.10) Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training, [Paper]
(arXiv 2022.10) CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER, [Paper]
(arXiv 2022.10) SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds, [Paper]
(arXiv 2022.10) Trailers12k: Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification, [Paper]
(arXiv 2022.10) AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments, [Paper]
(arXiv 2022.10) MOVE: Unsupervised Movable Object Segmentation and Detection, [Paper]
(arXiv 2022.10) IS SYNTHETIC DATA FROM GENERATIVE MODELS READY FOR IMAGE RECOGNITION?, [Paper], [Code]
(arXiv 2022.10) Towards Transformer-based Homogenization of Satellite Imagery for Landsat-8 and Sentinel-2, [Paper]
(arXiv 2022.10) MCTNET: A MULTI-SCALE CNN-TRANSFORMER NETWORK FOR CHANGE DETECTION IN OPTICAL REMOTE SENSING IMAGES, [Paper]
(arXiv 2022.10) Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?, [Paper], [Code]
(arXiv 2022.10) TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers, [Paper], [Code]
(arXiv 2022.10) SQA3D: SITUATED QUESTION ANSWERING IN 3D SCENES, [Paper]
(arXiv 2022.10) When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture, [Paper], [Code]
(arXiv 2022.10) STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition, [Paper]
(arXiv 2022.10) PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning, [Paper]
(arXiv 2022.10) One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations, [Paper], [Code]
(arXiv 2022.10) IMAGINARYNET: LEARNING OBJECT DETECTORS WITHOUT REAL IMAGES AND ANNOTATIONS, [Paper], [Code]
(arXiv 2022.10) Feature-Proxy Transformer for Few-Shot Segmentation, [Paper], [Code]
(arXiv 2022.10) Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks, [Paper]
(arXiv 2022.10) UNIFIED VISION AND LANGUAGE PROMPT LEARNING, [Paper], [Code]
(arXiv 2022.10) Exploring Long-Sequence Masked Autoencoders, [Paper], [Code]
(arXiv 2022.10) MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting, [Paper]
(arXiv 2022.10) Interactive Language: Talking to Robots in Real Time, [Paper], [Project]
(arXiv 2022.10) RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer, [Paper], [Code]
(arXiv 2022.10) How to Train Vision Transformer on Small-scale Datasets?, [Paper], [Code]
(arXiv 2022.10) Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features, [Paper], [Code]
(arXiv 2022.10) Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers, [Paper]
(arXiv 2022.10) CURVED REPRESENTATION SPACE OF VISION TRANSFORMERS, [Paper]
(arXiv 2022.10) Foundation Transformers, [Paper], [Code]
(arXiv 2022.10) Underspecification in Scene Description-to-Depiction Tasks, [Paper]
(arXiv 2022.10) Continuous conditional video synthesis by neural processes, [Paper], [Code]
(arXiv 2022.10) SAIT: SPARSE VISION TRANSFORMERS THROUGH ADAPTIVE TOKEN PRUNING, [Paper]
(arXiv 2022.10) ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors, [Paper]
(arXiv 2022.10) SLOTFORMER: UNSUPERVISED VISUAL DYNAMICS SIMULATION WITH OBJECT-CENTRIC MODELS, [Paper], [Project]
(arXiv 2022.10) Learning by Asking Questions for Knowledge-based Novel Object Recognition, [Paper]
(arXiv 2022.10) Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets, [Paper], [Code]
(arXiv 2022.10) GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection, [Paper]
(arXiv 2022.10) Distilling Knowledge from Language Models for Video-based Action Anticipation, [Paper]
(arXiv 2022.10) Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning, [Paper], [Code]
(arXiv 2022.10) M3VIDEO: MASKED MOTION MODELING FOR SELF-SUPERVISED VIDEO REPRESENTATION LEARNING, [Paper]
(arXiv 2022.10) Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers, [Paper], [Code]
(arXiv 2022.10) FontTransformer: Few-shot High-resolution Chinese Glyph Image Synthesis via Stacked Transformers, [Paper]
(arXiv 2022.10) AISFormer: Amodal Instance Segmentation with Transformer, [Paper], [Code]
(arXiv 2022.10) ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view, [Paper]
(arXiv 2022.10) One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks, [Paper]
(arXiv 2022.10) PROMPT GENERATION NETWORKS FOR EFFICIENT ADAPTATION OF FROZEN VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.10) Generating Executable Action Plans with Environmentally-Aware Language Models, [Paper]
(arXiv 2022.10) AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization, [Paper]
(arXiv 2022.10) Improving Dense Contrastive Learning with Dense Negative Pairs, [Paper]
(arXiv 2022.10) Fine-Grained Image Style Transfer with Visual Transformers, [Paper], [Code]
(arXiv 2022.10) IT TAKES TWO: MASKED APPEARANCE-MOTION MODELING FOR SELF-SUPERVISED VIDEO TRANSFORMER PRE-TRAINING, [Paper]
(arXiv 2022.10) Contrastive Video-Language Learning with Fine-grained Frame Sampling, [Paper]
(arXiv 2022.10) Style-Guided Inference of Transformer for High-resolution Image Synthesis, [Paper]
(arXiv 2022.10) MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model, [Paper], [Code]
(arXiv 2022.10) LEARNING TO LOCATE VISUAL ANSWER IN VIDEO CORPUS USING QUESTION, [Paper], [Code]
(arXiv 2022.10) UNDERSTANDING EMBODIED REFERENCE WITH TOUCH-LINE TRANSFORMER, [Paper]
(arXiv 2022.10) Point Transformer V2: Grouped Vector Attention and Partition-based Pooling, [Paper], [Code]
(arXiv 2022.10) See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction, [Paper]
(arXiv 2022.10) USING BOTH DEMONSTRATIONS AND LANGUAGE INSTRUCTIONS TO EFFICIENTLY LEARN ROBOTIC TASKS, [Paper], [Project]
(arXiv 2022.10) Generating image captions with external encyclopedic knowledge, [Paper]
(arXiv 2022.10) LOCL: Learning Object-Attribute Composition using Localization, [Paper]
(arXiv 2022.10) SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models, [Paper], [Code]
(arXiv 2022.10) ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval, [Paper]
(arXiv 2022.10) Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling, [Paper], [Code]
(arXiv 2022.10) (Fusionformer): Exploiting the Joint Motion Synergy with Fusion Network Based On Transformer for 3D Human Pose Estimation, [Paper]
(arXiv 2022.10) Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs, [Paper], [Code]
(arXiv 2022.10) OGC: Unsupervised 3D Object Segmentation from Rigid Dynamics of Point Clouds, [Paper], [Code]
(arXiv 2022.10) Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing, [Paper], [Code]
(arXiv 2022.10) Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment, [Paper]
(arXiv 2022.10) VOLTA: VISION-LANGUAGE TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT, [Paper]
(arXiv 2022.10) OPEN-VOCABULARY SEMANTIC SEGMENTATION WITH MASK-ADAPTED CLIP, [Paper], [Project]
(arXiv 2022.10) MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning, [Paper]
(arXiv 2022.10) SELF-SUPERVISED VIDEO REPRESENTATION LEARNING WITH MOTION-AWARE MASKED AUTOENCODERS, [Paper], [Code]
(arXiv 2022.10) LEARNING TO DECOMPOSE VISUAL FEATURES WITH LATENT TEXTUAL PROMPTS, [Paper]
(arXiv 2022.10) DCVQE: A Hierarchical Transformer for Video Quality Assessment, [Paper]
(arXiv 2022.10) Fine-grained Object Categorization for Service Robots, [Paper]
(arXiv 2022.10) CLIP-DIFFUSION-LM: APPLY DIFFUSION MODEL ON IMAGE CAPTIONING, [Paper], [Code]
(arXiv 2022.10) A Memory Transformer Network for Incremental Learning, [Paper]
(arXiv 2022.10) Bridging CLIP and StyleGAN through Latent Alignment for Image Editing, [Paper]
(arXiv 2022.10) LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for Lightweight Snow Removal, [Paper]
(arXiv 2022.10) FS-DETR: FEW-SHOT DETECTION TRANSFORMER WITH PROMPTING AND WITHOUT RE-TRAINING, [Paper]
(arXiv 2022.10) Transformer-based Localization from Embodied Dialog with Large-scale Pre-training, [Paper]
(arXiv 2022.10) Turbo Training with Token Dropout, [Paper]
(arXiv 2022.10) Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks, [Paper]
(arXiv 2022.10) C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval, [Paper]
(arXiv 2022.10) Pose Guided Human Image Synthesis with Partially Decoupled GAN, [Paper]
(arXiv 2022.10) A Simple Plugin for Transforming Images to Arbitrary Scales, [Paper], [Project]
(arXiv 2022.10) Time-Space Transformers for Video Panoptic Segmentation, [Paper]
(arXiv 2022.10) MOAT: ALTERNATING MOBILE CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS, [Paper], [Code]
(arXiv 2022.10) IMAGEN VIDEO: HIGH DEFINITION VIDEO GENERATION WITH DIFFUSION MODELS, [Paper], [Project]
(arXiv 2022.10) clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP, [Paper]
(arXiv 2022.10) FQDet: Fast-converging Query-based Detector, [Paper], [Code]
(arXiv 2022.10) VARIATIONAL PROMPT TUNING IMPROVES GENERALIZATION OF VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.10) Grounding Language with Visual Affordances over Unstructured Data, [Paper], [Project]
(arXiv 2022.10) Granularity-aware Adaptation for Image Retrieval over Multiple Tasks, [Paper]
(arXiv 2022.10) WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT?, [Paper]
(arXiv 2022.10) Multi-view Human Body Mesh Translator, [Paper]
(arXiv 2022.10) EXPLORING THE ROLE OF MEAN TEACHERS IN SELF-SUPERVISED MASKED AUTO-ENCODERS, [Paper]
(arXiv 2022.10) Point Cloud Recognition with Position-to-Structure Attention Transformers, [Paper]
(arXiv 2022.10) TEMPORALLY CONSISTENT VIDEO TRANSFORMER FOR LONG-TERM VIDEO PREDICTION, [Paper], [Code]
(arXiv 2022.10) PHENAKI: VARIABLE LENGTH VIDEO GENERATION FROM OPEN DOMAIN TEXTUAL DESCRIPTIONS, [Paper]
(arXiv 2022.10) MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text, [Paper]
(arXiv 2022.10) Real-World Robot Learning with Masked Visual Pre-training, [Paper], [Project]
(arXiv 2022.10) BaseTransformers: Attention over base data-points for One Shot Learning, [Paper], [Code]
(arXiv 2022.10) Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition, [Paper]
(arXiv 2022.10) Vision Transformer Based Model for Describing a Set of Images as a Story, [Paper]
(arXiv 2022.10) Video Referring Expression Comprehension via Transformer with Content-aware Query, [Paper], [Code]
(arXiv 2022.10) EFFECTIVE SELF-SUPERVISED PRE-TRAINING ON LOW-COMPUTE NETWORKS WITHOUT DISTILLATION, [Paper]
(arXiv 2022.10) CLIP MODEL IS AN EFFICIENT CONTINUAL LEARNER, [Paper]
(arXiv 2022.10) Content-Based Search for Deep Generative Models, [Paper]
(arXiv 2022.10) MAPLE: MULTI-MODAL PROMPT LEARNING, [Paper], [Code]
(arXiv 2022.10) SYSTEMATIC GENERALIZATION AND EMERGENT STRUCTURES IN TRANSFORMERS TRAINED ON STRUCTURED TASKS, [Paper]
(arXiv 2022.10) WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS?, [Paper]
(arXiv 2022.10) DARTFORMER: FINDING THE BEST TYPE OF ATTENTION, [Paper]
(arXiv 2022.10) MOBILEVITV3: MOBILE-FRIENDLY VISION TRANSFORMER WITH SIMPLE AND EFFECTIVE FUSION OF LOCAL, GLOBAL AND INPUT FEATURES, [Paper], [Code]
(arXiv 2022.10) Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement, [Paper], [Project]
(arXiv 2022.10) EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, [Paper]
(arXiv 2022.10) Motion-inductive Self-supervised Object Discovery in Videos, [Paper]
(arXiv 2022.10) Fully Transformer Network for Change Detection of Remote Sensing Images, [Paper], [Code]
(arXiv 2022.10) TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-EFFICIENT TRANSFER LEARNING, [Paper]
(arXiv 2022.10) Visual Prompt Tuning for Generative Transfer Learning, [Paper]
(arXiv 2022.10) A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers, [Paper]
(arXiv 2022.10) LPT: LONG-TAILED PROMPT TUNING FOR IMAGE CLASSIFICATION, [Paper]
(arXiv 2022.10) Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning, [Paper]
(arXiv 2022.10) CLIP2POINT: TRANSFER CLIP TO POINT CLOUD CLASSIFICATION WITH IMAGE-DEPTH PRE-TRAINING, [Paper]
(arXiv 2022.10) Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration, [Paper]
(arXiv 2022.10) LANGUAGE-AWARE SOFT PROMPTING FOR VISION & LANGUAGE FOUNDATION MODELS, [Paper]
(arXiv 2022.10) ASIF: COUPLED DATA TURNS UNIMODAL MODELS TO MULTIMODAL WITHOUT TRAINING, [Paper]
(arXiv 2022.10) ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions, [Paper]
(arXiv 2022.10) PROMPT LEARNING WITH OPTIMAL TRANSPORT FOR VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.10) Bridged Transformer for Vision and Point Cloud 3D Object Detection, [Paper]
(arXiv 2022.10) Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry, [Paper]
(arXiv 2022.10) HUMAN MOTION DIFFUSION MODEL, [Paper], [Project]
(arXiv 2022.10) TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval, [Paper]
(arXiv 2022.10) UniCLIP: Unified Framework for Contrastive Language–Image Pre-training, [Paper]
(arXiv 2022.10) CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection, [Paper], [Code]
(arXiv 2022.10) Multi-dataset Training of Transformers for Robust Action Recognition, [Paper], [Code]
(arXiv 2022.10) Multi-Scale Human-Object Interaction Detector, [Paper]
(arXiv 2022.10) LGDN: Language-Guided Denoising Network for Video-Language Modeling, [Paper]
(arXiv 2022.10) RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.10) Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation, [Paper], [Code]
(arXiv 2022.10) Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features, [Paper], [Code]
(arXiv 2022.10) Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer, [Paper], [Code]
(arXiv 2022.10) Prepended Domain Transformer: Heterogeneous Face Recognition without Bells and Whistles, [Paper]
(arXiv 2022.10) Visual Knowledge Graph for Human Action Reasoning in Videos, [Paper]
(arXiv 2022.10) Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction, [Paper]
(arXiv 2022.10) VIMA: GENERAL ROBOT MANIPULATION WITH MULTIMODAL PROMPTS, [Paper], [Project]
(arXiv 2022.10) What Should the System Do Next?: Operative Action Captioning for Estimating System Actions, [Paper]
(arXiv 2022.10) DMMGAN: Diverse Multi Motion Prediction of 3D Human Joints using Attention-Based Generative Adversarial Network, [Paper]
(arXiv 2022.10) PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking, [Paper], [Code]

2022.09

(arXiv 2022.09) SELF-DISTILLATION FOR FURTHER PRE-TRAINING OF TRANSFORMERS, [Paper]
(arXiv 2022.09) Visuo-Tactile Transformers for Manipulation, [Paper], [Project]
(arXiv 2022.09) UNDERSTANDING PURE CLIP GUIDANCE FOR VOXEL GRID NERF MODELS, [Paper], [Project]
(arXiv 2022.09) Dual Progressive Transformations for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.09) Transformers for Object Detection in Large Point Clouds, [Paper]
(arXiv 2022.09) DIFFUSION-BASED IMAGE TRANSLATION USING DISENTANGLED STYLE AND CONTENT REPRESENTATION, [Paper]
(arXiv 2022.09) ERNIE-VIL 2.0: MULTI-VIEW CONTRASTIVE LEARNING FOR IMAGE-TEXT PRE-TRAINING, [Paper], [Code]
(arXiv 2022.09) LEARNING TRANSFERABLE SPATIOTEMPORAL REPRESENTATIONS FROM NATURAL SCRIPT KNOWLEDGE, [Paper]
(arXiv 2022.09) SMALLCAP: Lightweight Image Captioning Prompted with Retrieval Augmentation, [Paper], [Code]
(arXiv 2022.09) SPIKFORMER: WHEN SPIKING NEURAL NETWORK MEETS TRANSFORMER, [Paper]
(arXiv 2022.09) F-VLM: OPEN-VOCABULARY OBJECT DETECTION UPON FROZEN VISION AND LANGUAGE MODELS, [Paper]
(arXiv 2022.09) CONTRASTIVE CORPUS ATTRIBUTION FOR EXPLAINING REPRESENTATIONS, [Paper]
(arXiv 2022.09) Alignment-guided Temporal Attention for Video Action Recognition, [Paper]
(arXiv 2022.09) EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning, [Paper], [Code]
(arXiv 2022.09) SPOTLIGHT: MOBILE UI UNDERSTANDING USING VISION-LANGUAGE MODELS WITH A FOCUS, [Paper]
(arXiv 2022.09) DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION, [Paper], [Project]
(arXiv 2022.09) REST: RETRIEVE & SELF-TRAIN FOR GENERATIVE ACTION RECOGNITION, [Paper]
(arXiv 2022.09) Effective Vision Transformer Training: A Data-Centric Perspective, [Paper]
(arXiv 2022.09) Human-in-the-loop Robotic Grasping using BERT Scene Representation, [Paper], [Project]
(arXiv 2022.09) Revisiting Few-Shot Learning from a Causal Perspective, [Paper]
(arXiv 2022.09) Attacking Compressed Vision Transformers, [Paper]
(arXiv 2022.09) Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, [Paper]
(arXiv 2022.09) DeViT: Deformed Vision Transformers in Video Inpainting, [Paper]
(arXiv 2022.09) Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks, [Paper], [Code]
(arXiv 2022.09) Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding, [Paper]
(arXiv 2022.09) Motion Transformer for Unsupervised Image Animation, [Paper]
(arXiv 2022.09) Weighted Contrastive Hashing, [Paper], [Code]
(arXiv 2022.09) CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention, [Paper]
(arXiv 2022.09) Dialog Acts for Task-Driven Embodied Agents, [Paper]
(arXiv 2022.09) NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System, [Paper], [Code]
(arXiv 2022.09) Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding, [Paper], [Code]
(arXiv 2022.09) Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval, [Paper]
(arXiv 2022.09) Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding, [Paper]
(arXiv 2022.09) Anomaly Detection in Aerial Videos with Transformers, [Paper], [Code]
(arXiv 2022.09) AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition, [Paper]
(arXiv 2022.09) Motion Transformer with Global Intention Localization and Local Movement Refinement, [Paper], [Code]
(arXiv 2022.09) FREESEG: FREE MASK FROM INTERPRETABLE CONTRASTIVE LANGUAGE-IMAGE PRETRAINING FOR SEMANTIC SEGMENTATION, [Paper]
(arXiv 2022.09) Learning State-Aware Visual Representations from Audible Interactions, [Paper], [Code]
(arXiv 2022.09) Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline, [Paper]
(arXiv 2022.09) Leveraging Self-Supervised Training for Unintentional Action Recognition, [Paper]
(arXiv 2022.09) NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields, [Paper]
(arXiv 2022.09) All are Worth Words: a ViT Backbone for Score-based Diffusion Models, [Paper]
(arXiv 2022.09) Paraphrasing Is All You Need for Novel Object Captioning, [Paper]
(arXiv 2022.09) Collaboration of Pre-trained Models Makes Better Few-shot Learner, [Paper]
(arXiv 2022.09) Multi-modal Video Chapter Generation, [Paper], [Code]
(arXiv 2022.09) Best Prompts for Text-to-Image Models and How to Find Them, [Paper]
(arXiv 2022.09) Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration, [Paper], [Code]
(arXiv 2022.09) 3DPCT: 3D Point Cloud Transformer with Dual Self-attention, [Paper]
(arXiv 2022.09) LIGHTWEIGHT TRANSFORMERS FOR HUMAN ACTIVITY RECOGNITION ON MOBILE DEVICES, [Paper]
(arXiv 2022.09) PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training, [Paper]
(arXiv 2022.09) UniColor: A Unified Framework for Multi-Modal Colorization with Transformer, [Paper], [Code]
(arXiv 2022.09) Traffic Accident Risk Forecasting using Contextual Vision Transformers, [Paper]
(arXiv 2022.09) CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding, [Paper]
(arXiv 2022.09) Recipe Generation from Unsegmented Cooking Videos, [Paper]
(arXiv 2022.09) PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification, [Paper], [Code]
(arXiv 2022.09) Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia, [Paper]
(arXiv 2022.09) RNGDet++: Road Network Graph Detection by Transformer with Instance Segmentation and Multi-scale Features Enhancement, [Paper], [Code]
(arXiv 2022.09) Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering, [Paper]
(arXiv 2022.09) I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification, [Paper]
(arXiv 2022.09) Integer Fine-tuning of Transformer-based Models, [Paper]
(arXiv 2022.09) Open-vocabulary Queryable Scene Representations for Real World Planning, [Paper], [Code]
(arXiv 2022.09) DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection, [Paper]
(arXiv 2022.09) Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos, [Paper]
(arXiv 2022.09) Graph Reasoning Transformer for Image Parsing, [Paper]
(arXiv 2022.09) Quantum Vision Transformers, [Paper]
(arXiv 2022.09) Active Visual Search in the Wild, [Paper]
(arXiv 2022.09) PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation, [Paper], [Code]
(arXiv 2022.09) Learning Distinct and Representative Modes for Image Captioning, [Paper], [Code]
(arXiv 2022.09) TODE-Trans: Transparent Object Depth Estimation with Transformer, [Paper], [Code]
(arXiv 2022.09) Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising, [Paper]
(arXiv 2022.09) Integrative Feature and Cost Aggregation with Transformers for Dense Correspondence, [Paper]
(arXiv 2022.09) Axially Expanded Windows for Local-Global Interaction in Vision Transformers, [Paper]
(arXiv 2022.09) UNCERTAINTY AWARE MULTITASK PYRAMID VISION TRANSFORMER FOR UAV-BASED OBJECT RE-IDENTIFICATION, [Paper]
(arXiv 2022.09) TASKED: Transformer-based Adversarial learning for human activity recognition using wearable sensors via Self-KnowledgE Distillation, [Paper]
(arXiv 2022.09) EcoFormer: Energy-Saving Attention with Linear Complexity, [Paper], [Code](https://github.com/ziplab/EcoFormer)
(arXiv 2022.09) Panoramic Vision Transformer for Saliency Detection in 360° Videos, [Paper]
(arXiv 2022.09) THE BIASED ARTIST: EXPLOITING CULTURAL BIASES VIA HOMOGLYPHS IN TEXT-GUIDED IMAGE GENERATION MODELS, [Paper]
(arXiv 2022.09) Scene Graph Modification as Incremental Structure Expanding, [Paper], [Code]
(arXiv 2022.09) Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization, [Paper], [Code]
(arXiv 2022.09) Real-time Online Video Detection with Temporal Smoothing Transformers, [Paper]
(arXiv 2022.09) ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection, [Paper], [Code]
(arXiv 2022.09) Code as Policies: Language Model Programs for Embodied Control, [Paper], [Project]
(arXiv 2022.09) SQ-Swin: a Pretrained Siamese Quadratic Swin Transformer for Lettuce Browning Prediction, [Paper]
(arXiv 2022.09) Self-Attentive Pooling for Efficient Deep Learning, [Paper]
(arXiv 2022.09) Domain-Unified Prompt Representations for Source-Free Domain Generalization, [Paper], [Code]
(arXiv 2022.09) BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING, [Paper]
(arXiv 2022.09) Prompt-guided Scene Generation for 3D Zero-Shot Learning, [Paper]
(arXiv 2022.09) RE-IMAGEN: RETRIEVAL-AUGMENTED TEXT-TO-IMAGE GENERATOR, [Paper]
(arXiv 2022.09) Distribution Aware Metrics for Conditional Natural Language Generation, [Paper]
(arXiv 2022.09) CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models, [Paper]
(arXiv 2022.09) Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering, [Paper]
(arXiv 2022.09) PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer, [Paper], [Code]
(arXiv 2022.09) Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer? [Paper], [Code]
(arXiv 2022.09) EXPLORING VISUAL INTERPRETABILITY FOR CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]
(arXiv 2022.09) OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, [Paper]
(arXiv 2022.09) Test-Time Training with Masked Autoencoders, [Paper], [Code]
(arXiv 2022.09) VISUAL RECOGNITION WITH DEEP NEAREST CENTROIDS, [Paper], [Code]
(arXiv 2022.09) One-Shot Transfer of Affordance Regions? AffCorrs! [Paper], [Code]
(arXiv 2022.09) Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models, [Paper], [Code]
(arXiv 2022.09) A Light Recipe to Train Robust Vision Transformers, [Paper], [Code]
(arXiv 2022.09) On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition, [Paper]
(arXiv 2022.09) Number of Attention Heads vs. Number of Transformer-Encoders in Computer Vision, [Paper]
(arXiv 2022.09) Global Semantic Descriptors for Zero-Shot Action Recognition, [Paper], [Code]
(arXiv 2022.09) Revisiting Neural Scaling Laws in Language and Vision, [Paper]
(arXiv 2022.09) Small Transformers Compute Universal Metric Embeddings, [Paper]
(arXiv 2022.09) CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment, [Paper], [Code]
(arXiv 2022.09) CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer, [Paper]
(arXiv 2022.09) Transformers and CNNs both Beat Humans on SBIR, [Paper]
(arXiv 2022.09) PaLI: A Jointly-Scaled Multilingual Language-Image Model, [Paper]
(arXiv 2022.09) MUST-VQA: MUltilingual Scene-text VQA, [Paper], [Code]
(arXiv 2022.09) Leveraging Large Language Models for Robot 3D Scene Understanding, [Paper], [Code]
(arXiv 2022.09) A lightweight Transformer-based model for fish landmark detection, [Paper]
(arXiv 2022.09) PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers, [Paper], [Code]
(arXiv 2022.09) ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers, [Paper]
(arXiv 2022.09) Semantic2Graph: Graph-based Multi-modal Feature for Action Segmentation in Videos, [Paper]
(arXiv 2022.09) CenterFormer: Center-based Transformer for 3D Object Detection, [Paper], [Code]
(arXiv 2022.09) PreSTU: Pre-Training for Scene-Text Understanding, [Paper]
(arXiv 2022.09) OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training, [Paper]
(arXiv 2022.09) DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer, [Paper]
(arXiv 2022.09) SeRP: Self-Supervised Representation Learning Using Perturbed Point Clouds, [Paper]
(arXiv 2022.09) VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models, [Paper]
(arXiv 2022.09) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation, [Paper], [Code]
(arXiv 2022.09) ON THE COMPUTATIONAL COMPLEXITY OF SELF-ATTENTION, [Paper]
(arXiv 2022.09) Instruction-driven history-aware policies for robotic manipulations, [Paper], [Code]
(arXiv 2022.09) Towards Multi-Lingual Visual Question Answering, [Paper]
(arXiv 2022.09) PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation, [Paper], [Project]
(arXiv 2022.09) GLOBAL PROTOTYPE ENCODING FOR INCREMENTAL VIDEO HIGHLIGHTS DETECTION, [Paper], [Code]
(arXiv 2022.09) Self-Supervised Multimodal Fusion Transformer for Passive Activity Recognition, [Paper]
(arXiv 2022.09) FETA: Towards Specializing Foundation Models for Expert Task Applications, [Paper]
(arXiv 2022.09) Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers, [Paper]
(arXiv 2022.09) Exploring Target Representations for Masked Autoencoders, [Paper]
(arXiv 2022.09) ISS: IMAGE AS STEPPING STONE FOR TEXT-GUIDED 3D SHAPE GENERATION, [Paper]
(arXiv 2022.09) Towards Confidence-guided Shape Completion for Robotic Applications, [Paper], [Code]
(arXiv 2022.09) Pre-training image-language transformers for open-vocabulary tasks, [Paper]
(arXiv 2022.09) Improved Masked Image Generation with Token-Critic, [Paper]
(arXiv 2022.09) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, [Paper], [Code]
(arXiv 2022.09) Uformer-ICS: A Specialized U-Shaped Transformer for Image Compressive Sensing, [Paper]
(arXiv 2022.09) An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling, [Paper]
(arXiv 2022.09) Spatial-Temporal Transformer for Video Snapshot Compressive Imaging, [Paper], [Code]
(arXiv 2022.09) MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition, [Paper]
(arXiv 2022.09) SEFormer: Structure Embedding Transformer for 3D Object Detection, [Paper]
(arXiv 2022.09) ADTR: Anomaly Detection Transformer with Feature Reconstruction, [Paper]
(arXiv 2022.09) Learning Canonical Embeddings for Unsupervised Shape Correspondence with Locally Linear Transformations, [Paper]
(arXiv 2022.09) Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students, [Paper]
(arXiv 2022.09) PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection, [Paper], [Code]
(arXiv 2022.09) VITKD: PRACTICAL GUIDELINES FOR VIT FEATURE KNOWLEDGE DISTILLATION, [Paper], [Code]
(arXiv 2022.09) DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation, [Paper]
(arXiv 2022.09) SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised Skeleton Action Recognition, [Paper]
(arXiv 2022.09) What does a platypus look like? Generating customized prompts for zero-shot image classification, [Paper], [Code]
(arXiv 2022.09) AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation, [Paper], [Code]
(arXiv 2022.09) MimCo: Masked Image Modeling Pre-training with Contrastive Teacher, [Paper]
(arXiv 2022.09) Multi-modal Contrastive Representation Learning for Entity Alignment, [Paper]
(arXiv 2022.09) Zero-Shot Multi-Modal Artist-Controlled Retrieval and Exploration of 3D Object Sets, [Paper]
(arXiv 2022.09) Geometry Aligned Variational Transformer for Image-conditioned Layout Generation, [Paper]
(arXiv 2022.09) Real-time 3D Single Object Tracking with Transformer, [Paper], [Code]
(arXiv 2022.09) Video-Guided Curriculum Learning for Spoken Video Grounding, [Paper], [Code]
(arXiv 2022.09) FLAME: Free-form Language-based Motion Synthesis & Editing, [Paper]
(arXiv 2022.09) TOKENCUT: SEGMENTING OBJECTS IN IMAGES AND VIDEOS WITH SELF-SUPERVISED TRANSFORMER AND NORMALIZED CUT, [Paper], [Code]
(arXiv 2022.09) Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation, [Paper]
(arXiv 2022.09) MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition, [Paper], [Project]
(arXiv 2022.09) Visual Prompting via Image Inpainting, [Paper], [Project]
(arXiv 2022.09) RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection, [Paper], [Code]

2022.08

(arXiv 2022.08) On Grounded Planning for Embodied Tasks with Language Models, [Paper], [Project]
(arXiv 2022.08) Group Activity Recognition in Basketball Tracking Data - Neural Embeddings in Team Sports (NETS), [Paper]
(arXiv 2022.08) SWIN-TRANSFORMER-YOLOV5 FOR REAL-TIME WINE GRAPE BUNCH DETECTION, [Paper]
(arXiv 2022.08) SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization, [Paper], [Code]
(arXiv 2022.08) INJECTING IMAGE DETAILS INTO CLIP’S FEATURE SPACE, [Paper]
(arXiv 2022.08) Hierarchical Local-Global Transformer for Temporal Sentence Grounding, [Paper]
(arXiv 2022.08) EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing, [Paper]
(arXiv 2022.08) TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers, [Paper]
(arXiv 2022.08) ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer, [Paper], [Code]
(arXiv 2022.08) SoMoFormer: Multi-Person Pose Forecasting with Transformers, [Paper]
(arXiv 2022.08) A Circular Window-based Cascade Transformer for Online Action Detection, [Paper]
(arXiv 2022.08) ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer, [Paper]
(arXiv 2022.08) Robust Sound-Guided Image Manipulation, [Paper]
(arXiv 2022.08) TrojViT: Trojan Insertion in Vision Transformers, [Paper]
(arXiv 2022.08) User-Controllable Latent Transformer for StyleGAN Image Layout Editing, [Paper]
(arXiv 2022.08) Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification, [Paper]
(arXiv 2022.08) JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents, [Paper]
(arXiv 2022.08) TFusion: Transformer based N-to-One Multimodal Fusion Block, [Paper]
(arXiv 2022.08) VMFormer: End-to-End Video Matting with Transformer, [Paper], [Code]
(arXiv 2022.08) LOGICRANK: Logic Induced Reranking for Generative Text-to-Image Systems, [Paper]
(arXiv 2022.08) CLUSTR: EXPLORING EFFICIENT SELF-ATTENTION VIA CLUSTERING FOR VISION TRANSFORMERS, [Paper]
(arXiv 2022.08) Federated Zero-Shot Learning with Mid-Level Semantic Knowledge Transfer, [Paper]
(arXiv 2022.08) Prompt Tuning with Soft Context Sharing for Vision-Language Models, [Paper]
(arXiv 2022.08) Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment, [Paper], [Code]
(arXiv 2022.08) CounTR: Transformer-based Generalised Visual Counting, [Paper], [Code]
(arXiv 2022.08) Open-Set Semi-Supervised Object Detection, [Paper]
(arXiv 2022.08) gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window, [Paper]
(arXiv 2022.08) Adaptive Perception Transformer for Temporal Action Localization, [Paper], [Code]
(arXiv 2022.08) Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task, [Paper], [Code]
(arXiv 2022.08) Masked Autoencoders Enable Efficient Knowledge Distillers, [Paper], [Code]
(arXiv 2022.08) LaTeRF: Label and Text Driven Object Radiance Fields, [Paper]
(arXiv 2022.08) Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling, [Paper]
(arXiv 2022.08) Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding, [Paper], [Code]
(arXiv 2022.08) MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining, [Paper]
(arXiv 2022.08) Visual Subtitle Feature Enhanced Video Outline Generation, [Paper], [Code]
(arXiv 2022.08) CATS: COMPLEMENTARY CNN AND TRANSFORMER ENCODERS FOR SEGMENTATION, [Paper]
(arXiv 2022.08) Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization, [Paper]
(arXiv 2022.08) FashionVQA: A Domain-Specific Visual Question Answering System, [Paper]
(arXiv 2022.08) K-ORDER GRAPH-ORIENTED TRANSFORMER WITH GRAATTENTION FOR 3D POSE AND SHAPE ESTIMATION, [Paper]
(arXiv 2022.08) Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors, [Paper], [Code]
(arXiv 2022.08) Improving video retrieval using multilingual knowledge transfer, [Paper]
(arXiv 2022.08) EFFICIENT SPARSELY ACTIVATED TRANSFORMERS, [Paper]
(arXiv 2022.08) M2HF: MULTI-LEVEL MULTI-MODAL HYBRID FUSION FOR TEXT-VIDEO RETRIEVAL, [Paper]
(arXiv 2022.08) Accelerating Vision Transformer Training via a Patch Sampling Schedule, [Paper], [Project]
(arXiv 2022.08) A Dual Modality Approach For (Zero-Shot) Multi-Label Classification, [Paper]
(arXiv 2022.08) Offline Handwritten Mathematical Recognition using Adversarial Learning and Transformers, [Paper]
(arXiv 2022.08) Semantic-enhanced Image Clustering, [Paper]
(arXiv 2022.08) DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection, [Paper]
(arXiv 2022.08) ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition, [Paper], [Code]
(arXiv 2022.08) Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks, [Paper], [Project]
(arXiv 2022.08) PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling, [Paper], [Code]
(arXiv 2022.08) EFFICIENT ATTENTION-FREE VIDEO SHIFT TRANSFORMERS, [Paper]
(arXiv 2022.08) Flat Multi-modal Interaction Transformer for Named Entity Recognition, [Paper]
(arXiv 2022.08) Dance Style Transfer with Cross-modal Transformer, [Paper]
(arXiv 2022.08) Improved Image Classification with Token Fusion, [Paper]
(arXiv 2022.08) VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations, [Paper], [Code]
(arXiv 2022.08) TEXT TO IMAGE GENERATION: LEAVING NO LANGUAGE BEHIND, [Paper]
(arXiv 2022.08) Aspect-based Sentiment Classification with Sequential Cross-modal Semantic Graph, [Paper]
(arXiv 2022.08) Diverse Video Captioning by Adaptive Spatio-temporal Attention, [Paper]
(arXiv 2022.08) VLMAE: Vision-Language Masked Autoencoder, [Paper]
(arXiv 2022.08) SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction, [Paper]
(arXiv 2022.08) ILLUME: Rationalizing Vision-Language Models by Interacting with their Jabber, [Paper]
(arXiv 2022.08) ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos, [Paper]
(arXiv 2022.08) UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation, [Paper]
(arXiv 2022.08) InterTrack: Interaction Transformer for 3D Multi-Object Tracking, [Paper]
(arXiv 2022.08) Understanding Attention for Vision-and-Language Task, [Paper]
(arXiv 2022.08) Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning, [Paper]
(arXiv 2022.08) Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model, [Paper]
(arXiv 2022.08) Unifying Visual Perception by Dispersible Points Learning, [Paper], [Code]
(arXiv 2022.08) Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork, [Paper]
(arXiv 2022.08) ConMatch: Semi-Supervised Learning with Confidence-Guided Consistency Regularization, [Paper], [Code]
(arXiv 2022.08) The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs, [Paper]
(arXiv 2022.08) Open-Vocabulary Panoptic Segmentation with MaskCLIP, [Paper]
(arXiv 2022.08) Prompt Vision Transformer for Domain Generalization, [Paper]
(arXiv 2022.08) GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement, [Paper]
(arXiv 2022.08) CONVIFORMERS: CONVOLUTIONALLY GUIDED VISION TRANSFORMER, [Paper]
(arXiv 2022.08) Learning Spatial-Frequency Transformer for Visual Object Tracking, [Paper], [Code]
(arXiv 2022.08) Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis, [Paper]
(arXiv 2022.08) Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model, [Paper], [Code]
(arXiv 2022.08) LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, [Paper]
(arXiv 2022.08) ExpansionNet v2: Block Static Expansion in fast end to end training for Image Captioning, [Paper], [Code]
(arXiv 2022.08) Multi-modal Transformer Path Prediction for Autonomous Vehicle, [Paper]
(arXiv 2022.08) Flow-Guided Transformer for Video Inpainting, [Paper], [Code]
(arXiv 2022.08) TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency, [Paper], [Project]
(arXiv 2022.08) HoW-3D: Holistic 3D Wireframe Perception from a Single Image, [Paper], [Code]
(arXiv 2022.08) BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers, [Paper], [Code]
(arXiv 2022.08) MILAN: Masked Image Pretraining on Language Assisted Representation, [Paper], [Code]
(arXiv 2022.08) Hybrid Transformer Network for Deepfake Detection, [Paper]
(arXiv 2022.08) Semi-supervised Vision Transformers at Scale, [Paper]
(arXiv 2022.08) PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding, [Paper], [Code]
(arXiv 2022.08) Exploring Anchor-based Detection for Ego4D Natural Language Query, [Paper]
(arXiv 2022.08) Language Supervised Training for Skeleton-based Action Recognition, [Paper], [Code]
(arXiv 2022.08) Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2022.08) Ghost-free High Dynamic Range Imaging with Context-aware Transformer, [Paper], [Code]
(arXiv 2022.08) CLIP-based Neural Neighbor Style Transfer for 3D Assets, [Paper]
(arXiv 2022.08) Sports Video Analysis on Large-Scale Data, [Paper], [Code]
(arXiv 2022.08) How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification, [Paper]
(arXiv 2022.08) In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation, [Paper], [Code]
(arXiv 2022.08) DALLE-URBAN: Capturing the urban design expertise of large text to image transformers, [Paper], [Code]
(arXiv 2022.08) PlaneFormers: From Sparse View Planes to 3D Reconstruction, [Paper], [Code]
(arXiv 2022.08) Boosting Video-Text Retrieval with Explicit High-Level Semantics, [Paper]
(arXiv 2022.08) Distinctive Image Captioning via CLIP Guided Group Optimization, [Paper]
(arXiv 2022.08) Understanding Masked Image Modeling via Learning Occlusion Invariant Feature, [Paper]
(arXiv 2022.08) GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training, [Paper], [Code]
(arXiv 2022.08) Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model, [Paper], [Code]
(arXiv 2022.08) Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects, [Paper], [Code]
(arXiv 2022.08) Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation, [Paper]
(arXiv 2022.08) Frozen CLIP Models are Efficient Video Learners, [Paper], [Code]
(arXiv 2022.08) MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer, [Paper], [Code]
(arXiv 2022.08) HaloAE: An HaloNet based Local Transformer Auto-Encoder for Anomaly Detection and Localization, [Paper], [Code]
(arXiv 2022.08) IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation, [Paper]
(arXiv 2022.08) A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch, [Paper], [Code]
(arXiv 2022.08) PointConvFormer: Revenge of the Point-based Convolution, [Paper]
(arXiv 2022.08) ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding, [Paper]
(arXiv 2022.08) LaTTe: Language Trajectory TransformEr, [Paper], [Code]
(arXiv 2022.08) Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution, [Paper], [Code]
(arXiv 2022.08) TransMatting: Enhancing Transparent Objects Matting with Transformers, [Paper], [Project]
(arXiv 2022.08) Word-Level Fine-Grained Story Visualization, [Paper]
(arXiv 2022.08) Fine-Grained Semantically Aligned Vision-Language Pre-Training, [Paper]
(arXiv 2022.08) Expanding Language-Image Pretrained Models for General Video Recognition, [Paper], [Code]
(arXiv 2022.08) P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting, [Paper], [Code]
(arXiv 2022.08) DropKey, [Paper]
(arXiv 2022.08) MVSFormer: Multi-View Stereo with Pre-trained Vision Transformers and Temperature-based Depth, [Paper]
(arXiv 2022.08) Per-Clip Video Object Segmentation, [Paper]
(arXiv 2022.08) XCon: Learning with Experts for Fine-grained Category Discovery, [Paper], [Code]
(arXiv 2022.08) Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition, [Paper]
(arXiv 2022.08) RE-ATTENTION TRANSFORMER FOR WEAKLY SUPERVISED OBJECT LOCALIZATION, [Paper], [Code]
(arXiv 2022.08) TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, [Paper]
(arXiv 2022.08) Two-Stream Transformer Architecture for Long Form Video Understanding, [Paper]
(arXiv 2022.08) A Fast Text-Driven Approach for Generating Artistic Content, [Paper]
(arXiv 2022.08) DAHITRA: DAMAGE ASSESSMENT USING A NOVEL HIERARCHICAL TRANSFORMER ARCHITECTURE, [Paper]
(arXiv 2022.08) MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training, [Paper], [Code]
(arXiv 2022.08) Masked Vision and Language Modeling for Multi-modal Representation Learning, [Paper]
(arXiv 2022.08) SSformer: A Lightweight Transformer for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.08) Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer, [Paper]
(arXiv 2022.08) Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation, [Paper], [Code]
(arXiv 2022.08) Unified Normalization for Accelerating and Stabilizing Transformers, [Paper]
(arXiv 2022.08) An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, [Paper], [Project]
(arXiv 2022.08) Prompt-to-Prompt Image Editing with Cross Attention Control, [Paper]
(arXiv 2022.08) Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization, [Paper]
(arXiv 2022.08) Testing Relational Understanding in Text-Guided Image Generation, [Paper]
(arXiv 2022.08) UAVM: A Unified Model for Audio-Visual Learning, [Paper]
(arXiv 2022.08) Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation, [Paper], [Code]
(arXiv 2022.08) Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding, [Paper]
(arXiv 2022.08) One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning, [Paper]
(arXiv 2022.08) Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition, [Paper], [Code]
(arXiv 2022.08) SdAE: Self-distillated Masked Autoencoder, [Paper], [Code]
(arXiv 2022.08) Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics, [Paper]
(arXiv 2022.08) STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer, [Paper]
(arXiv 2022.08) D^3Former: Debiased Dual Distilled Transformer for Incremental Learning, [Paper], [Code]
(arXiv 2022.08) Local Perception-Aware Transformer for Aerial Tracking, [Paper], [Code]
(arXiv 2022.08) SIAMIXFORMER: A SIAMESE TRANSFORMER NETWORK FOR BUILDING DETECTION AND CHANGE DETECTION FROM BI-TEMPORAL REMOTE SENSING IMAGES, [Paper]
(arXiv 2022.08) Transformers as Meta-Learners for Implicit Neural Representations, [Paper], [Code]
(arXiv 2022.08) Video Question Answering with Iterative Video-Text Co-Tokenization, [Paper], [Code]
(arXiv 2022.08) Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem, [Paper], [Code]

2022.07

(arXiv 2022.07) Pro-tuning: Unified Prompt Tuning for Vision Tasks, [Paper]
(arXiv 2022.07) ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, [Paper], [Code]
(arXiv 2022.07) Curriculum Learning for Data-Efficient Vision-Language Alignment, [Paper]
(arXiv 2022.07) DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer, [Paper]
(arXiv 2022.07) Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers, [Paper], [Code]
(arXiv 2022.07) AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing, [Paper], [Project]
(arXiv 2022.07) Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion, [Paper], [Code]
(arXiv 2022.07) Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer, [Paper], [Code]
(arXiv 2022.07) Video Mask Transfiner for High-Quality Video Instance Segmentation, [Paper], [Project]
(arXiv 2022.07) A Variational AutoEncoder for Transformers with Nonparametric Variational Information Bottleneck, [Paper]
(arXiv 2022.07) Online Continual Learning with Contrastive Vision Transformer, [Paper]
(arXiv 2022.07) Retrieval-Augmented Transformer for Image Captioning, [Paper]
(arXiv 2022.07) Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition, [Paper], [Code]
(arXiv 2022.07) Is Attention All NeRF Needs?, [Paper], [Code]
(arXiv 2022.07) Convolutional Embedding Makes Hierarchical Vision Transformer Stronger, [Paper]
(arXiv 2022.07) SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding, [Paper], [Code]
(arXiv 2022.07) Deep Clustering with Features from Self-Supervised Pretraining, [Paper]
(arXiv 2022.07) Contrastive Masked Autoencoders are Stronger Vision Learners, [Paper]
(arXiv 2022.07) VICTOR: VISUAL INCOMPATIBILITY DETECTION WITH TRANSFORMERS AND FASHION-SPECIFIC CONTRASTIVE PRE-TRAINING, [Paper]
(arXiv 2022.07) Compositional Human-Scene Interaction Synthesis with Semantic Control, [Paper], [Code]
(arXiv 2022.07) Static and Dynamic Concepts for Self-supervised Video Representation Learning, [Paper]
(arXiv 2022.07) Unsupervised Domain Adaptation for Video Transformers in Action Recognition, [Paper], [Code]
(arXiv 2022.07) LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection, [Paper]
(arXiv 2022.07) TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking, [Paper]
(arXiv 2022.07) S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for Domain Incremental Learning, [Paper]
(arXiv 2022.07) WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models, [Paper], [Project]
(arXiv 2022.07) Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering, [Paper]
(arXiv 2022.07) Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds, [Paper]
(arXiv 2022.07) Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training, [Paper], [Code]
(arXiv 2022.07) V^2L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval, [Paper], [Code]
(arXiv 2022.07) NewsStories: Illustrating articles with visual summaries, [Paper], [Project]
(arXiv 2022.07) DETRs with Hybrid Matching, [Paper], [Code]
(arXiv 2022.07) GROUP DETR: FAST TRAINING CONVERGENCE WITH DECOUPLED ONE-TO-MANY LABEL ASSIGNMENT, [Paper]
(arXiv 2022.07) Improved Super Resolution of MR Images Using CNNs and Vision Transformers, [Paper]
(arXiv 2022.07) Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022, [Paper], [Code]
(arXiv 2022.07) An Impartial Take to the CNN vs Transformer Robustness Contest, [Paper]
(arXiv 2022.07) Generative Artisan: A Semantic-Aware and Controllable CLIPstyler, [Paper]
(arXiv 2022.07) MAR: Masked Autoencoders for Efficient Action Recognition, [Paper], [Code]
(arXiv 2022.07) Object State Change Classification in Egocentric Videos using the Divided Space-Time Attention Mechanism, [Paper], [Code]
(arXiv 2022.07) Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation, [Paper], [Code]
(arXiv 2022.07) Reference-based Image Super-Resolution with Deformable Attention Transformer, [Paper], [Code]
(arXiv 2022.07) JIGSAW-VIT: LEARNING JIGSAW PUZZLES IN VISION TRANSFORMER, [Paper], [Code]
(arXiv 2022.07) TransCL: Transformer Makes Strong and Flexible Compressive Learning, [Paper], [Code]
(arXiv 2022.07) 3D Siamese Transformer Network for Single Object Tracking on Point Clouds, [Paper], [Code]
(arXiv 2022.07) Intention-Conditioned Long-Term Human Egocentric Action Forecasting @ EGO4D Challenge 2022, [Paper], [Code]
(arXiv 2022.07) IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition, [Paper]
(arXiv 2022.07) Is GPT-3 all you need for Visual Question Answering in Cultural Heritage? [Paper]
(arXiv 2022.07) Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers, [Paper]
(arXiv 2022.07) Action Quality Assessment using Transformers, [Paper]
(arXiv 2022.07) Self-Distilled Vision Transformer for Domain Generalization, [Paper], [Code]
(arXiv 2022.07) Exploring CLIP for Assessing the Look and Feel of Images, [Paper], [Code]
(arXiv 2022.07) Transformer with Implicit Edges for Particle-based Physics Simulation, [Paper], [Code]
(arXiv 2022.07) Auto-regressive Image Synthesis with Integrated Quantization, [Paper]
(arXiv 2022.07) Efficient Modeling of Future Context for Image Captioning, [Paper], [Code]
(arXiv 2022.07) Zero-Shot Video Captioning with Evolving Pseudo-Tokens, [Paper], [Code]
(arXiv 2022.07) Panoptic Scene Graph Generation, [Paper], [Project], [Code]
(arXiv 2022.07) Facial Expression Recognition using Vanilla ViT backbones with MAE Pretraining, [Paper]
(arXiv 2022.07) Target-Driven Structured Transformer Planner for Vision-Language Navigation, [Paper]
(arXiv 2022.07) Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? [Paper]
(arXiv 2022.07) Hybrid CNN-Transformer Model For Facial Affect Recognition In the ABAW4 Challenge, [Paper]
(arXiv 2022.07) MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis, [Paper]
(arXiv 2022.07) SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer, [Paper], [Code]
(arXiv 2022.07) LocVTP: Video-Text Pre-training for Temporal Localization, [Paper], [Code]
(arXiv 2022.07) Temporal Saliency Query Network for Efficient Video Recognition, [Paper], [Code]
(arXiv 2022.07) Pose for Everything: Towards Category-Agnostic Pose Estimation, [Paper], [Code]
(arXiv 2022.07) Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration, [Paper], [Code]
(arXiv 2022.07) An Efficient Spatio-Temporal Pyramid Transformer for Action Detection, [Paper]
(arXiv 2022.07) Towards Efficient Adversarial Training on Vision Transformers, [Paper]
(arXiv 2022.07) TinyViT: Fast Pretraining Distillation for Small Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning, [Paper], [Code]
(arXiv 2022.07) Explicit Image Caption Editing, [Paper], [Code]
(arXiv 2022.07) AiATrack: Attention in Attention for Transformer Visual Tracking, [Paper], [Code]
(arXiv 2022.07) Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification, [Paper], [Code]
(arXiv 2022.07) Single Frame Atmospheric Turbulence Mitigation: A Benchmark Study and A New Physics-Inspired Transformer Model, [Paper], [Code]
(arXiv 2022.07) HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers, [Paper]
(arXiv 2022.07) GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features, [Paper]
(arXiv 2022.07) OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos, [Paper]
(arXiv 2022.07) FaceFormer: Scale-aware Blind Face Restoration with Transformers, [Paper]
(arXiv 2022.07) Multimodal Transformer for Automatic 3D Annotation and Object Detection, [Paper], [Code]
(arXiv 2022.07) Temporal and cross-modal attention for audio-visual zero-shot learning, [Paper], [Code]
(arXiv 2022.07) Locality Guidance for Improving Vision Transformers on Tiny Datasets, [Paper], [Code]
(arXiv 2022.07) Is an Object-Centric Video Representation Beneficial for Transfer? [Paper]
(arXiv 2022.07) DUQIM-Net: Probabilistic Object Hierarchy Representation for Multi-View Manipulation, [Paper]
(arXiv 2022.07) RELATIONAL FUTURE CAPTIONING MODEL FOR EXPLAINING LIKELY COLLISIONS IN DAILY TASKS, [Paper]
(arXiv 2022.07) Conditional DETR V2: Efficient Detection Transformer with Box Queries, [Paper]
(arXiv 2022.07) Exploiting Unlabeled Data with Vision and Language Models for Object Detection, [Paper], [Code]
(arXiv 2022.07) TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation, [Paper], [Code]
(arXiv 2022.07) Time Is MattEr: Temporal Self-supervision for Video Transformers, [Paper]
(arXiv 2022.07) IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection, [Paper]
(arXiv 2022.07) Don’t Stop Learning: Towards Continual Learning for the CLIP Model, [Paper]
(arXiv 2022.07) Action Quality Assessment with Temporal Parsing Transformer, [Paper]
(arXiv 2022.07) Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective, [Paper], [Code]
(arXiv 2022.07) Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement, [Paper]
(arXiv 2022.07) TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.07) Clover: Towards A Unified Video-Language Alignment and Fusion Model, [Paper], [Code]
(arXiv 2022.07) SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery, [Paper]
(arXiv 2022.07) FashionViL: Fashion-Focused Vision-and-Language Representation Learning, [Paper], [Code]
(arXiv 2022.07) Zero-Shot Temporal Action Detection via Vision-Language Prompting, [Paper], [Code]
(arXiv 2022.07) Rethinking Alignment in Video Super-Resolution Transformers, [Paper], [Code]
(arXiv 2022.07) Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding, [Paper]
(arXiv 2022.07) TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being? [Paper]
(arXiv 2022.07) Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection, [Paper]
(arXiv 2022.07) Semantic Novelty Detection via Relational Reasoning, [Paper]
(arXiv 2022.07) Unifying Event Detection and Captioning as Sequence Generation via Pre-Training, [Paper], [Code]
(arXiv 2022.07) Multi-manifold Attention for Vision Transformers, [Paper]
(arXiv 2022.07) UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird’s-Eye-View, [Paper]
(arXiv 2022.07) Position Prediction as an Effective Pretraining Strategy, [Paper]
(arXiv 2022.07) Lightweight Vision Transformer with Cross Feature Attention, [Paper]
(arXiv 2022.07) Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP, [Paper], [Code]
(arXiv 2022.07) X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval, [Paper]
(arXiv 2022.07) Learning Parallax Transformer Network for Stereo Image JPEG Artifacts Removal, [Paper]
(arXiv 2022.07) A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion, [Paper]
(arXiv 2022.07) Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning, [Paper]
(arXiv 2022.07) Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models, [Paper]
(arXiv 2022.07) Cross-Attention Transformer for Video Interpolation, [Paper]
(arXiv 2022.07) Towards Multimodal Vision-Language Models Generating Non-Generic Text, [Paper]
(arXiv 2022.07) QKVA grid: Attention in Image Perspective and Stacked DETR, [Paper], [Code]
(arXiv 2022.07) Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet, [Paper], [Code]
(arXiv 2022.07) Horizontal and Vertical Attention in Transformers, [Paper]
(arXiv 2022.07) CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition, [Paper], [Code]
(arXiv 2022.07) DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer, [Paper], [Code]
(arXiv 2022.07) DEPTHFORMER: MULTISCALE VISION TRANSFORMER FOR MONOCULAR DEPTH ESTIMATION WITH GLOBAL LOCAL INFORMATION FUSION, [Paper], [Code]
(arXiv 2022.07) LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval, [Paper]
(arXiv 2022.07) Dual Vision Transformer, [Paper], [Code]
(arXiv 2022.07) Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning, [Paper], [Code]
(arXiv 2022.07) Scaling Novel Object Detection with Weakly Supervised Detection Transformers, [Paper]
(arXiv 2022.07) Hunting Group Clues with Transformers for Social Group Activity Recognition, [Paper]
(arXiv 2022.07) Outpainting by Queries, [Paper], [Code]
(arXiv 2022.07) IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training, [Paper]
(arXiv 2022.07) Video Graph Transformer for Video Question Answering, [Paper], [Code]
(arXiv 2022.07) Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios, [Paper]
(arXiv 2022.07) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper], [Code]
(arXiv 2022.07) Image and Model Transformation with Secret Key for Vision Transformer, [Paper]
(arXiv 2022.07) eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.07) Compound Prototype Matching for Few-shot Action Recognition, [Paper]
(arXiv 2022.07) Long-term Leap Attention, Short-term Periodic Shift for Video Classification, [Paper], [Code]
(arXiv 2022.07) LightViT: Towards Light-Weight Convolution-Free Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Learning from Label Relationships in Human Affect, [Paper]
(arXiv 2022.07) MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing, [Paper]
(arXiv 2022.07) Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding, [Paper]
(arXiv 2022.07) Vision Transformer for NeRF-Based View Synthesis from a Single Input Image, [Paper], [Code]
(arXiv 2022.07) COSIM: Commonsense Reasoning for Counterfactual Scene Imagination, [Paper], [Code]
(arXiv 2022.07) Beyond Transfer Learning: Co-finetuning for Action Localisation, [Paper]
(arXiv 2022.07) RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection, [Paper]
(arXiv 2022.07) k-means Mask Transformer, [Paper], [Code]
(arXiv 2022.07) Training Transformers Together, [Paper], [Code]
(arXiv 2022.07) Improving Few-Shot Image Classification Using Machine- and User-Generated Natural Language Descriptions, [Paper]
(arXiv 2022.07) MaiT: Leverage Attention Masks for More Efficient Image Transformers, [Paper]
(arXiv 2022.07) Dual-Stream Transformer for Generic Event Boundary Captioning, [Paper], [Code]
(arXiv 2022.07) Softmax-free Linear Transformers, [Paper], [Code](https://github.com/fudan-zvg/SOFT)
(arXiv 2022.07) Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection, [Paper], [Code]
(arXiv 2022.07) Transformers are Adaptable Task Planners, [Paper], [Code]
(arXiv 2022.07) ARRAY CAMERA IMAGE FUSION USING PHYSICS-AWARE TRANSFORMERS, [Paper]
(arXiv 2022.07) OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.07) Weakly Supervised Grounding for VQA in Vision-Language Transformers, [Paper], [Code]
(arXiv 2022.07) PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning, [Paper]
(arXiv 2022.07) STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding, [Paper]
(arXiv 2022.07) Towards Counterfactual Image Manipulation via CLIP, [Paper]
(arXiv 2022.07) MatFormer: A Generative Model for Procedural Materials, [Paper]
(arXiv 2022.07) Multimodal Frame-Scoring Transformer for Video Summarization, [Paper]
(arXiv 2022.07) 3D Part Assembly Generation with Instance Encoded Transformer, [Paper]
(arXiv 2022.07) Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation, [Paper]
(arXiv 2022.07) Efficient Representation Learning via Adaptive Context Pooling, [Paper]
(arXiv 2022.07) Gaze Target Estimation inspired by Interactive Attention, [Paper], [Code]
(arXiv 2022.07) Generalizable Patch-Based Neural Rendering, [Paper], [Project]
(arXiv 2022.07) Interaction Transformer for Human Reaction Generation, [Paper]
(arXiv 2022.07) TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts, [Paper], [Project]
(arXiv 2022.07) FishFormer: Annulus Slicing-based Transformer for Fisheye Rectification with Efficacy Domain Exploration, [Paper]
(arXiv 2022.07) Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer, [Paper], [Code]
(arXiv 2022.07) Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases, [Paper], [Code]
(arXiv 2022.07) Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention, [Paper]
(arXiv 2022.07) MULTI-MODAL ROBUSTNESS ANALYSIS AGAINST LANGUAGE AND VISUAL PERTURBATIONS, [Paper], [Project]
(arXiv 2022.07) CoBEVT: Cooperative Bird’s Eye View Semantic Segmentation with Sparse Transformers, [Paper]
(arXiv 2022.07) Segmenting Moving Objects via an Object-Centric Layered Representation, [Paper]
(arXiv 2022.07) Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models, [Paper]
(arXiv 2022.07) Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval, [Paper]
(arXiv 2022.07) Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification, [Paper], [Code]
(arXiv 2022.07) Memory-Based Label-Text Tuning for Few-Shot Class-Incremental Learning, [Paper]
(arXiv 2022.07) Exploiting Context Information for Generic Event Boundary Captioning, [Paper], [Code]
(arXiv 2022.07) You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Divert More Attention to Vision-Language Tracking, [Paper], [Code]
(arXiv 2022.07) Can Language Understand Depth? [Paper], [Code]
(arXiv 2022.07) TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection, [Paper], [Code]
(arXiv 2022.07) DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning, [Paper]
(arXiv 2022.07) Transferring Textual Knowledge for Visual Recognition, [Paper], [Code]
(arXiv 2022.07) R^2-VOS: Robust Referring Video Object Segmentation via Relational Cycle Consistency, [Paper]
(arXiv 2022.07) CRFormer: A Cross-Region Transformer for Shadow Removal, [Paper]
(arXiv 2022.07) Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks, [Paper], [Code]
(arXiv 2022.07) Back to MLP: A Simple Baseline for Human Motion Prediction, [Paper], [Code]
(arXiv 2022.07) I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference, [Paper]
(arXiv 2022.07) Rethinking Query-Key Pairwise Interactions in Vision Transformers, [Paper]
(arXiv 2022.07) LARGE-SCALE ROBUSTNESS ANALYSIS OF VIDEO ACTION RECOGNITION MODELS, [Paper], [Code]
(arXiv 2022.07) VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations, [Paper], [Code]
(arXiv 2022.07) Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds, [Paper]
(arXiv 2022.07) MotionMixer: MLP-based 3D Human Body Pose Forecasting, [Paper], [Code]
(arXiv 2022.07) DALG: Deep Attentive Local and Global Modeling for Image Retrieval, [Paper]
(arXiv 2022.07) PolarFormer: Multi-camera 3D Object Detection with Polar Transformers, [Paper], [Code]
(arXiv 2022.07) CTrGAN: Cycle Transformers GAN for Gait Transfer, [Paper]
(arXiv 2022.07) LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, [Paper]
(arXiv 2022.07) Bootstrapped Masked Autoencoders for Vision BERT Pretraining, [Paper], [Code]
(arXiv 2022.07) ReAct: Temporal Action Detection with Relational Queries, [Paper], [Code]
(arXiv 2022.07) Benchmarking Omni-Vision Representation through the Lens of Visual Realms, [Paper], [Project]
(arXiv 2022.07) Convolutional Bypasses Are Better Vision Transformer Adapters, [Paper]
(arXiv 2022.07) LANGUAGE MODELLING WITH PIXELS, [Paper]
(arXiv 2022.07) Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection, [Paper]
(arXiv 2022.07) DEEPFAKE VIDEO DETECTION WITH SPATIOTEMPORAL DROPOUT TRANSFORMER, [Paper]
(arXiv 2022.07) iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer, [Paper]
(arXiv 2022.07) Imaging through the Atmosphere using Turbulence Mitigation Transformer, [Paper]
(arXiv 2022.07) Symmetry-Aware Transformer-based Mirror Detection, [Paper], [Code]
(arXiv 2022.07) Pyramid Transformer for Traffic Sign Detection, [Paper]
(arXiv 2022.07) Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning, [Paper], [Code]
(arXiv 2022.07) DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation, [Paper]
(arXiv 2022.07) Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Entry-Flipped Transformer for Inference and Prediction of Participant Behavior, [Paper]
(arXiv 2022.07) Wayformer: Motion Forecasting via Simple & Efficient Attention Networks, [Paper]
(arXiv 2022.07) Diverse Dance Synthesis via Keyframes with Transformer Controllers, [Paper]
(arXiv 2022.07) Learning to Estimate External Forces of Human Motion in Video, [Paper]
(arXiv 2022.07) Vision Transformer for Contrastive Clustering, [Paper], [Code]
(arXiv 2022.07) Pose2Room: Understanding 3D Scenes from Human Activities, [Paper]
(arXiv 2022.07) Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2022.07) Cross-Architecture Knowledge Distillation, [Paper]
(arXiv 2022.07) Distance Matters in Human-Object Interaction Detection, [Paper]

2022.06

(arXiv 2022.06) TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction, [Paper]
(arXiv 2022.06) GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation, [Paper], [Code]
(arXiv 2022.06) GSCLIP: A Framework for Explaining Distribution Shifts in Natural Language, [Paper]
(arXiv 2022.06) Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition, [Paper]
(arXiv 2022.06) Causality for Inherently Explainable Transformers: CAT-XPLAIN, [Paper], [Code]
(arXiv 2022.06) A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA, [Paper]
(arXiv 2022.06) Continual Learning with Transformers for Image Classification, [Paper]
(arXiv 2022.06) ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition, [Paper]
(arXiv 2022.06) Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment, [Paper], [Code]
(arXiv 2022.06) Leveraging Language for Accelerated Learning of Tool Manipulation, [Paper]
(arXiv 2022.06) RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval, [Paper]
(arXiv 2022.06) VLCAP: VISION-LANGUAGE WITH CONTRASTIVE LEARNING FOR COHERENT VIDEO PARAGRAPH CAPTIONING, [Paper], [Code]
(arXiv 2022.06) Video2StyleGAN: Encoding Video in Latent Space for Manipulation, [Paper]
(arXiv 2022.06) Text-Driven Stylization of Video Objects, [Paper], [Project]
(arXiv 2022.06) Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization, [Paper], [Code]
(arXiv 2022.06) CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation, [Paper]
(arXiv 2022.06) Towards Adversarial Attack on Vision-Language Pre-training Models, [Paper]
(arXiv 2022.06) CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2022.06) VISUALIZING AND UNDERSTANDING SELF-SUPERVISED VISION LEARNING, [Paper], [Code]
(arXiv 2022.06) VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection, [Paper]
(arXiv 2022.06) Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution, [Paper]
(arXiv 2022.06) DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection, [Paper]
(arXiv 2022.06) REVECA – Rich Encoder-decoder framework for Video Event CAptioner, [Paper], [Code]
(arXiv 2022.06) SAViR-T: Spatially Attentive Visual Reasoning with Transformers, [Paper]
(arXiv 2022.06) EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm, [Paper], [Code]
(arXiv 2022.06) DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations, [Paper]
(arXiv 2022.06) Capturing and Inferring Dense Full-Body Human-Scene Contact, [Paper], [Project]
(arXiv 2022.06) M&M Mix: A Multimodal Multiview Transformer Ensemble, [Paper]
(arXiv 2022.06) DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment, [Paper]
(arXiv 2022.06) Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds, [Paper], [Code]
(arXiv 2022.06) Global Context Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation, [Paper]
(arXiv 2022.06) One-stage Action Detection Transformer, [Paper]
(arXiv 2022.06) SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders, [Paper]
(arXiv 2022.06) TRANSFORMER-BASED MULTI-MODAL PROPOSAL AND RE-RANK FOR WIKIPEDIA IMAGE-CAPTION MATCHING, [Paper], [Code]
(arXiv 2022.06) Vicinity Vision Transformer, [Paper], [Code]
(arXiv 2022.06) EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications, [Paper], [Code]
(arXiv 2022.06) Temporally Consistent Semantic Video Editing, [Paper]
(arXiv 2022.06) VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation, [Paper]
(arXiv 2022.06) MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, [Paper], [Project]
(arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes, [Paper], [Code]
(arXiv 2022.06) Backdoor Attacks on Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Rectify ViT Shortcut Learning by Visual Saliency, [Paper]
(arXiv 2022.06) Learning Using Privileged Information for Zero-Shot Action Recognition, [Paper]
(arXiv 2022.06) Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning, [Paper], [Code]
(arXiv 2022.06) CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer, [Paper], [Project]
(arXiv 2022.06) SimA: Simple Softmax-free Attention for Vision Transformers, [Paper], [Code]
(arXiv 2022.06) UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS, [Paper], [Project]
(arXiv 2022.06) VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix, [Paper], [Code]
(arXiv 2022.06) ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022, [Paper]
(arXiv 2022.06) Video + CLIP Baseline for Ego4D Long-term Action Anticipation, [Paper], [Code]
(arXiv 2022.06) What makes domain generalization hard? [Paper]
(arXiv 2022.06) SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, [Paper], [Code]
(arXiv 2022.06) Disentangling visual and written concepts in CLIP, [Paper], [Project]
(arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos, [Paper]
(arXiv 2022.06) Patch-level Representation Learning for Self-supervised Vision Transformers, [Paper]
(arXiv 2022.06) Zero-Shot Video Question Answering via Frozen Bidirectional Language Models, [Paper], [Code]
(arXiv 2022.06) OmniMAE: Single Model Masked Pretraining on Images and Videos, [Paper], [Code]
(arXiv 2022.06) Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency, [Paper], [Code]
(arXiv 2022.06) LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling, [Paper], [Code]
(arXiv 2022.06) Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World, [Paper]
(arXiv 2022.06) Rethinking Generalization in Few-Shot Classification, [Paper], [Code]
(arXiv 2022.06) VCT: A Video Compression Transformer, [Paper]
(arXiv 2022.06) Forecasting of depth and ego-motion with transformers and self-supervision, [Paper]
(arXiv 2022.06) Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone, [Paper], [Code]
(arXiv 2022.06) SP-ViT: Learning 2D Spatial Priors for Vision Transformers, [Paper]
(arXiv 2022.06) A Simple Data Mixing Prior for Improving Self-Supervised Learning, [Paper], [Code]
(arXiv 2022.06) Prefix Language Models are Unified Modal Learners, [Paper], [Code]
(arXiv 2022.06) Masked Frequency Modeling for Self-Supervised Visual Pre-Training, [Paper], [Code](https://www.mmlab-ntu.com/project/mfm/index.html)
(arXiv 2022.06) Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer, [Paper]
(arXiv 2022.06) A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training, [Paper]
(arXiv 2022.06) Learning to Estimate Shapley Values with Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction, [Paper], [Code]
(arXiv 2022.06) GLIPv2: Unifying Localization and VL Understanding, [Paper], [Code]
(arXiv 2022.06) INDIGO: Intrinsic Multimodality for Domain Generalization, [Paper]
(arXiv 2022.06) TRANSDUCTIVE CLIP WITH CLASS-CONDITIONAL CONTRASTIVE LEARNING, [Paper]
(arXiv 2022.06) SILVER-BULLET-3D AT MANISKILL 2021: LEARNING-FROM-DEMONSTRATIONS AND HEURISTIC RULE-BASED METHODS FOR OBJECT MANIPULATION, [Paper], [Code]
(arXiv 2022.06) MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing, [Paper], [Code]
(arXiv 2022.06) Visual Transformer for Object Detection, [Paper]
(arXiv 2022.06) Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens, [Paper], [Code]
(arXiv 2022.06) TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer, [Paper]
(arXiv 2022.06) ReCo: Retrieve and Co-segment for Zero-shot Transfer, [Paper], [Project]
(arXiv 2022.06) MAREO: MEMORY- AND ATTENTION-BASED VISUAL REASONING, [Paper]
(arXiv 2022.06) Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis, [Paper]
(arXiv 2022.06) Object Scene Representation Transformer, [Paper]
(arXiv 2022.06) Comprehending and Ordering Semantics for Image Captioning, [Paper], [Code]
(arXiv 2022.06) Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO, [Paper]
(arXiv 2022.06) Peripheral Vision Transformer, [Paper], [Code]
(arXiv 2022.06) Efficient Decoder-free Object Detection with Transformers, [Paper], [Code]
(arXiv 2022.06) Prototypical Contrastive Language Image Pretraining, [Paper], [Code]
(arXiv 2022.06) SpA-Former: Transformer image shadow detection and removal via spatial attention, [Paper], [Code]
(arXiv 2022.06) A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers, [Paper]
(arXiv 2022.06) Can Foundation Models Talk Causality? [Paper]
(arXiv 2022.06) Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space, [Paper], [Code]
(arXiv 2022.06) MaskViT: Masked Visual Pre-Training for Video Prediction, [Paper]
(arXiv 2022.06) PromptPose: Language Prompt Helps Animal Pose Estimation, [Paper]
(arXiv 2022.06) Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos, [Paper]
(arXiv 2022.06) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.06) Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation, [Paper]
(arXiv 2022.06) Position Labels for Self-Supervised Vision Transformer, [Paper]
(arXiv 2022.06) Exploring Feature Self-relation for Self-supervised Transformer, [Paper]
(arXiv 2022.06) Patch-based Object-centric Transformers for Efficient Video Generation, [Paper], [Code]
(arXiv 2022.06) Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners, [Paper], [Code]
(arXiv 2022.06) VN-Transformer: Rotation-Equivariant Attention for Vector Neurons, [Paper]
(arXiv 2022.06) CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, [Paper], [Code]
(arXiv 2022.06) OOD Augmentation May Be at Odds with Open-Set Recognition, [Paper]
(arXiv 2022.06) Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer, [Paper]
(arXiv 2022.06) Cycle text2face: cycle text-to-face GAN via transformers, [Paper]
(arXiv 2022.06) Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer, [Paper], [Code]
(arXiv 2022.06) Transformer based Urdu Handwritten Text Optical Character Reader, [Paper]
(arXiv 2022.06) Spatial Entropy Regularization for Vision Transformers, [Paper]
(arXiv 2022.06) On Data Scaling in Masked Image Modeling, [Paper]
(arXiv 2022.06) Extreme Masking for Learning Instance and Distributed Visual Representations, [Paper]
(arXiv 2022.06) GateHUB: Gated History Unit with Background Suppression for Online Action Detection, [Paper]
(arXiv 2022.06) Anomaly detection in surveillance videos using transformer based attention model, [Paper], [Code]
(arXiv 2022.06) ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences, [Paper], [Code]
(arXiv 2022.06) EAANet: Efficient Attention Augmented Convolutional Networks, [Paper]
(arXiv 2022.06) Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning, [Paper]
(arXiv 2022.06) Recurrent Video Restoration Transformer with Guided Deformable Attention, [Paper], [Code]
(arXiv 2022.06) Rethinking the Openness of CLIP, [Paper]
(arXiv 2022.06) OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression, [Paper]
(arXiv 2022.06) Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval, [Paper]
(arXiv 2022.06) CONTRASTIVE GRAPH MULTIMODAL MODEL FOR TEXT CLASSIFICATION IN VIDEOS, [Paper]
(arXiv 2022.06) Separable Self-attention for Mobile Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation, [Paper], [Code]
(arXiv 2022.06) Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, [Paper]
(arXiv 2022.06) cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation, [Paper]
(arXiv 2022.06) Masked Unsupervised Self-training for Zero-shot Image Classification, [Paper], [Code]
(arXiv 2022.06) DETR++: Taming Your Multi-Scale Detection Transformer, [Paper]
(arXiv 2022.06) Structured Context Transformer for Generic Event Boundary Detection, [Paper]
(arXiv 2022.06) Revealing Single Frame Bias for Video-and-Language Learning, [Paper], [Code]
(arXiv 2022.06) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2022.06) Can CNNs Be More Robust Than Transformers? [Paper], [Code]
(arXiv 2022.06) Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding, [Paper]
(CVPR 2022) Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation, [Paper]
(arXiv 2022.06) A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge, [Paper], [Project]
(arXiv 2022.06) Revisiting the “Video” in Video-Language Understanding, [Paper], [Project]
(arXiv 2022.06) Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction, [Paper]
(arXiv 2022.06) Modeling Image Composition for Complex Scene Generation, [Paper], [Code]
(arXiv 2022.06) Unified Recurrence Modeling for Video Action Anticipation, [Paper]
(arXiv 2022.06) Prefix Conditioning Unifies Language and Label Supervision, [Paper]
(arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves Robustness, [Paper], [Code]
(arXiv 2022.06) VL-BEIT: Generative Vision-Language Pretraining, [Paper], [Code]
(arXiv 2022.06) EfficientFormer: Vision Transformers at MobileNet Speed, [Paper], [Code]
(arXiv 2022.06) REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering, [Paper]
(arXiv 2022.06) Siamese Image Modeling for Self-Supervised Vision Representation Learning, [Paper]
(CVPR 2022) Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection, [Paper], [Code]
(CVPR 2022) Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection, [Paper], [Code]
(CVPR 2022) Human Trajectory Prediction with Momentary Observation, [Paper]
(arXiv 2022.06) Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer, [Paper]
(arXiv 2022.06) Unifying Voxel-based Representation with Transformer for 3D Object Detection, [Paper], [Code]
(arXiv 2022.06) Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades, [Paper]
(arXiv 2022.06) Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training, [Paper]
(arXiv 2022.06) VALHALLA: Visual Hallucination for Machine Translation, [Paper], [Code]
(arXiv 2022.06) Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation, [Paper]
(arXiv 2022.06) CLIP4IDC: CLIP for Image Difference Captioning, [Paper], [Code]
(arXiv 2022.06) Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment, [Paper]
(arXiv 2022.06) Vision GNN: An Image is Worth Graph of Nodes, [Paper], [Code]
(arXiv 2022.06) Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction, [Paper], [Code]
(arXiv 2022.06) TubeFormer-DeepLab: Video Mask Transformer, [Paper]
(arXiv 2022.06) Video-based Human-Object Interaction Detection from Tubelet Tokens, [Paper]

2022.05

(arXiv 2022.05) HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER, [Paper]
(arXiv 2022.05) Robotic grasp detection based on Transformer, [Paper]
(arXiv 2022.05) Multimodal Masked Autoencoders Learn Transferable Representations, [Paper]
(arXiv 2022.05) Multimodal Fake News Detection via CLIP-Guided Learning, [Paper]
(arXiv 2022.05) WT-MVSNet: Window-based Transformers for Multi-view Stereo, [Paper]
(arXiv 2022.05) Object-wise Masked Autoencoders for Fast Pre-training, [Paper]
(arXiv 2022.05) A Closer Look at Self-supervised Lightweight Vision Transformers, [Paper]
(arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning, [Paper]
(arXiv 2022.05) CYCLIP: Cyclic Contrastive Language-Image Pretraining, [Paper], [Code]
(arXiv 2022.05) MDMLP: Image Classification from Scratch on Small Datasets with MLP, [Paper], [Code]
(arXiv 2022.05) SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners, [Paper], [Code]
(arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction, [Paper]
(arXiv 2022.05) Prompt-aligned Gradient for Prompt Tuning, [Paper], [Code]
(arXiv 2022.05) Illumination Adaptive Transformer, [Paper], [Code]
(arXiv 2022.05) HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling, [Paper]
(arXiv 2022.05) GMML is All you Need, [Paper], [Code]
(arXiv 2022.05) COMPLETEDT: POINT CLOUD COMPLETION WITH DENSE AUGMENT INFERENCE TRANSFORMERS, [Paper]
(arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks, [Paper]
(arXiv 2022.05) VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models, [Paper], [Benchmark], [Code]
(arXiv 2022.05) Architecture-Agnostic Masked Image Modeling – From ViT back to CNN, [Paper]
(arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [Paper], [Code]
(arXiv 2022.05) GIT: A Generative Image-to-text Transformer for Vision and Language, [Paper]
(arXiv 2022.05) 3DILG: Irregular Latent Grids for 3D Generative Modeling, [Paper]
(arXiv 2022.05) Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos, [Paper], [Code]
(arXiv 2022.05) Future Transformer for Long-term Action Anticipation, [Paper], [Project]
(arXiv 2022.05) X-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [Paper]
(arXiv 2022.05) Dynamic Query Selection for Fast Visual Perceiver, [Paper]
(arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers, [Paper]
(arXiv 2022.05) PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models, [Paper], [Code]
(arXiv 2022.05) Supporting Vision-Language Model Inference with Causality-pruning Knowledge Prompt, [Paper]
(arXiv 2022.05) Super Vision Transformer, [Paper], [Code]
(arXiv 2022.05) mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections, [Paper]
(arXiv 2022.05) VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering, [Paper]
(arXiv 2022.05) UMSNet: An Universal Multi-sensor Network for Human Activity Recognition, [Paper]
(arXiv 2022.05) Privacy-Preserving Image Classification Using Vision Transformer, [Paper]
(arXiv 2022.05) HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval, [Paper]
(arXiv 2022.05) ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions, [Paper], [Code]
(arXiv 2022.05) HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding, [Paper]
(arXiv 2022.05) Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning, [Paper]
(arXiv 2022.05) Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging, [Paper]
(arXiv 2022.05) Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality, [Paper], [Code]
(arXiv 2022.05) Visual Concepts Tokenization, [Paper]
(arXiv 2022.05) MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion, [Paper]
(arXiv 2022.05) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, [Paper], [Code]
(arXiv 2022.05) Evidence for Hypodescent in Visual Semantic AI, [Paper]
(arXiv 2022.05) Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer, [Paper], [Code]
(arXiv 2022.05) muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems, [Paper]
(arXiv 2022.05) Large Language Models are Zero-Shot Reasoners, [Paper]
(arXiv 2022.05) AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, [Paper], [Code]
(arXiv 2022.05) Green Hierarchical Vision Transformer for Masked Image Modeling, [Paper], [Code]
(arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation, [Paper]
(arXiv 2022.05) Cross-Architecture Self-supervised Video Representation Learning, [Paper], [Code]
(arXiv 2022.05) Prompt-based Learning for Unpaired Image Captioning, [Paper]
(arXiv 2022.05) MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning, [Paper], [Code]
(arXiv 2022.05) Fast Vision Transformers with HiLo Attention, [Paper], [Code]
(arXiv 2022.05) Fine-grained Image Captioning with CLIP Reward, [Paper], [Code]
(arXiv 2022.05) Mutual Information Divergence: A Unified Metric for Multimodal Generative Models, [Paper]
(arXiv 2022.05) MoCoViT: Mobile Convolutional Vision Transformer, [Paper]
(arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object Detection Transformer, [Paper]
(arXiv 2022.05) Inception Transformer, [Paper], [Code]
(arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation, [Paper]
(arXiv 2022.05) UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, [Paper]
(arXiv 2022.05) Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, [Paper], [Code]
(arXiv 2022.05) Training Vision-Language Transformers from Captions Alone, [Paper], [Code]
(arXiv 2022.05) Voxel-informed Language Grounding, [Paper], [Code]
(arXiv 2022.05) Cross-Enhancement Transformer for Action Segmentation, [Paper]
(arXiv 2022.05) TRT-ViT: TensorRT-oriented Vision Transformer, [Paper]
(arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection, [Paper]
(arXiv 2022.05) A graph-transformer for whole slide image classification, [Paper]
(arXiv 2022.05) VNT-Net: Rotational Invariant Vector Neuron Transformers, [Paper]
(arXiv 2022.05) Masked Image Modeling with Denoising Contrast, [Paper]
(arXiv 2022.05) Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling, [Paper]
(arXiv 2022.05) Masked Autoencoders As Spatiotemporal Learners, [Paper]
(arXiv 2022.05) BodyMap: Learning Full-Body Dense Correspondence Map, [Paper], [Code]
(arXiv 2022.05) Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers, [Paper]
(arXiv 2022.05) AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars, [Paper]
(arXiv 2022.05) Vision Transformer Adapter for Dense Predictions, [Paper], [Code]
(arXiv 2022.05) Demo: Real-Time Semantic Communications with a Vision Transformer, [Paper]
(arXiv 2022.05) MulT: An End-to-End Multitask Learning Transformer, [Paper], [Code]
(arXiv 2022.05) A CLIP-Hitchhiker’s Guide to Long Video Retrieval, [Paper]
(arXiv 2022.05) Video Frame Interpolation with Transformer, [Paper], [Code]
(arXiv 2022.05) Dense residual Transformer for Image Denoising, [Paper]
(arXiv 2022.05) Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT, [Paper]
(arXiv 2022.05) Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects, [Paper]
(arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos, [Paper], [Code]
(arXiv 2022.05) Learning to Retrieve Videos by Asking Questions, [Paper]
(arXiv 2022.05) One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, [Paper]
(arXiv 2022.05) Simple Open-Vocabulary Object Detection with Vision Transformers, [Paper], [Code]
(arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation, [Paper], [Code]
(arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object Detection with Transformers, [Paper], [Code-DETR], [Code-Deform-DETR]
(arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image Inpainting, [Paper], [Code]
(arXiv 2022.05) Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training, [Paper]
(arXiv 2022.05) Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild, [Paper]
(arXiv 2022.05) Generalizable Task Planning through Representation Pretraining, [Paper], [Project]
(arXiv 2022.05) EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers, [Paper]
(arXiv 2022.05) Activating More Pixels in Image Super-Resolution Transformer, [Paper], [Code]
(arXiv 2022.05) Row-wise Accelerator for Vision Transformer, [Paper]
(arXiv 2022.05) SparseTT: Visual Tracking with Sparse Transformers, [Paper], [Code]
(arXiv 2022.05) RoViST: Learning Robust Metrics for Visual Storytelling, [Paper], [Code]
(arXiv 2022.05) Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection, [Paper]
(arXiv 2022.05) Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering, [Paper]
(arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning, [Paper]
(arXiv 2022.05) ConvMAE: Masked Convolution Meets Masked Autoencoders, [Paper], [Code]
(arXiv 2022.05) Cross-lingual Adaptation for Recipe Retrieval with Mixup, [Paper]
(arXiv 2022.05) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework, [Paper]
(arXiv 2022.05) Transformer Tracking with Cyclic Shifting Window Attention, [Paper], [Code]
(arXiv 2022.05) Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning, [Paper]
(arXiv 2022.05) Prompt Distribution Learning, [Paper]
(arXiv 2022.05) CLIP-CLOP: CLIP-Guided Collage and Photomontage, [Paper]
(arXiv 2022.05) Dual-Level Decoupled Transformer for Video Captioning, [Paper]
(arXiv 2022.05) Declaration-based Prompt Tuning for Visual Question Answering, [Paper], [Code]
(arXiv 2022.05) P^3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision, [Paper]
(arXiv 2022.05) Language Models Can See: Plugging Visual Controls in Text Generation, [Paper], [Code]
(arXiv 2022.05) YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression, [Paper]
(arXiv 2022.05) Cross-view Transformers for real-time Map-view Semantic Segmentation, [Paper], [Code]
(arXiv 2022.05) i-Code: An Integrative and Composable Multimodal Learning Framework, [Paper]
(arXiv 2022.05) Visual Commonsense in Pretrained Unimodal and Multimodal Models, [Paper], [Project]
(arXiv 2022.05) Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification, [Paper]
(arXiv 2022.05) RecipeSnap - a lightweight image to recipe model, [Paper], [Code]
(arXiv 2022.05) CoCa: Contrastive Captioners are Image-Text Foundation Models, [Paper]
(arXiv 2022.05) Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP), [Paper]
(arXiv 2022.05) Cross-modal Representation Learning for Zero-shot Action Recognition, [Paper], [Code]
(arXiv 2022.05) Cross-Domain Object Detection with Mean-Teacher Transformer, [Paper]
(arXiv 2022.05) Better plain ViT baselines for ImageNet-1k, [Paper], [Code]
(arXiv 2022.05) Reinforced Swin-Convs Transformer for Underwater Image Enhancement, [Paper]
(arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, [Paper]
(arXiv 2022.05) Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering, [Paper]
(arXiv 2022.05) CenterCLIP: Token Clustering for Efficient Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.05) Arbitrary Shape Text Detection via Boundary Transformer, [Paper], [Code]
(arXiv 2022.05) HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact Guidance, [Paper], [Project]

2022.04

(arXiv 2022.04) Learn to Understand Negation in Video Retrieval, [Paper]
(arXiv 2022.04) LayoutBERT: Masked Language Layout Model for Object Insertion, [Paper]
(arXiv 2022.04) Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, [Paper], [Code]
(arXiv 2022.04) Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer, [Paper]
(arXiv 2022.04) SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation, [Paper]
(arXiv 2022.04) Where in the World is this Image? Transformer-based Geo-localization in the Wild, [Paper]
(arXiv 2022.04) Depth Estimation with Simplified Transformer, [Paper]
(arXiv 2022.04) A very preliminary analysis of DALL-E 2, [Paper]
(arXiv 2022.04) CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, [Paper], [Code]
(arXiv 2022.04) CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification, [Paper], [Code]
(arXiv 2022.04) TEMOS: Generating diverse human motions from textual descriptions, [Paper], [Project]
(arXiv 2022.04) PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining, [Paper]
(arXiv 2022.04) Symmetric Transformer-based Network for Unsupervised Image Registration, [Paper], [Code]
(arXiv 2022.04) Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos, [Paper], [Code]
(arXiv 2022.04) CapOnImage: Context-driven Dense-Captioning on Image, [Paper]
(arXiv 2022.04) Self-Supervised Learning of Object Parts for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [Paper]
(arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot Segmentation, [Paper]
(arXiv 2022.04) Self-Driving Car Steering Angle Prediction: Let Transformer Be a Car Again, [Paper], [Code]
(arXiv 2022.04) ClothFormer: Taming Video Virtual Try-on in All Module, [Paper]
(arXiv 2022.04) Deeper Insights into ViTs Robustness towards Common Corruptions, [Paper]
(arXiv 2022.04) ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation, [Paper], [Code]
(arXiv 2022.04) Understanding The Robustness in Vision Transformers, [Paper], [Code]
(arXiv 2022.04) MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval, [Paper]
(arXiv 2022.04) Contrastive Language-Action Pre-training for Temporal Localization, [Paper]
(arXiv 2022.04) Boosting Adversarial Transferability of MLP-Mixer, [Paper]
(arXiv 2022.04) Adaptive Split-Fusion Transformer, [Paper], [Code]
(arXiv 2022.04) Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?, [Paper], [Project]
(arXiv 2022.04) RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning, [Paper]
(arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout, [Paper], [Code]
(arXiv 2022.04) CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, [Paper]
(arXiv 2022.04) Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers, [Paper]
(arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images, [Paper], [Code]
(arXiv 2022.04) OCFormer: One-Class Transformer Network for Image Classification, [Paper]
(arXiv 2022.04) DRT: A Lightweight Single Image Deraining Recursive Transformer, [Paper], [Code]
(arXiv 2022.04) Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering, [Paper], [Code]
(arXiv 2022.04) ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer, [Paper]
(arXiv 2022.04) iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition, [Paper], [Code]
(arXiv 2022.04) Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition, [Paper]
(arXiv 2022.04) Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds, [Paper], [Code]
(arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object detection, [Paper]
(arXiv 2022.04) NFormer: Robust Person Re-identification with Neighbor Transformer, [Paper], [Code]
(arXiv 2022.04) Video Moment Retrieval from Text Queries via Single Frame Annotation, [Paper]
(arXiv 2022.04) GIMO: Gaze-Informed Human Motion Prediction in Context, [Paper]
(arXiv 2022.04) VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, [Paper]
(arXiv 2022.04) Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2022.04) Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer, [Paper], [Code]
(arXiv 2022.04) Multimodal Token Fusion for Vision Transformers, [Paper]
(arXiv 2022.04) Self-Calibrated Efficient Transformer for Lightweight Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Searching Intrinsic Dimensions of Vision Transformers, [Paper]
(arXiv 2022.04) Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks, [Paper]
(arXiv 2022.04) Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting, [Paper]
(arXiv 2022.04) Multi-Frame Self-Supervised Depth with Transformers, [Paper], [Code]
(arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction, [Paper], [Code]
(arXiv 2022.04) Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis, [Paper], [Code]
(arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object Detector, [Paper], [Code]
(arXiv 2022.04) VDTR: Video Deblurring with Transformer, [Paper], [Code]
(arXiv 2022.04) BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment, [Paper], [Code]
(arXiv 2022.04) Temporally Efficient Vision Transformer for Video Instance Segmentation, [Paper], [Code]
(arXiv 2022.04) VSA: Learning Varied-Size Window Attention in Vision Transformers, [Paper], [Code]
(arXiv 2022.04) XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding, [Paper]
(arXiv 2022.04) Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning, [Paper]
(arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient Multi-View Stereo, [Paper], [Code]
(arXiv 2022.04) Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer, [Paper]
(arXiv 2022.04) Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference, [Paper]
(arXiv 2022.04) COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval, [Paper]
(arXiv 2022.04) Image Captioning In the Transformer Age, [Paper], [Code]
(arXiv 2022.04) ResT V2: Simpler, Faster and Stronger, [Paper], [Code]
(arXiv 2022.04) Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer, [Paper], [Code]
(arXiv 2022.04) Temporal Progressive Attention for Early Action Prediction, [Paper], [Code]
(arXiv 2022.04) Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval, [Paper]
(arXiv 2022.04) Flamingo: a Visual Language Model for Few-Shot Learning, [Paper]
(arXiv 2022.04) Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [Paper], [Code]
(arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]
(arXiv 2022.04) R^2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction, [Paper], [Code]
(arXiv 2022.04) A New Dataset and Transformer for Stereoscopic Video Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization, [Paper]
(arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]
(arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]
(arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]
(arXiv 2022.04) ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, [Paper]
(arXiv 2022.04) Interactiveness Field in Human-Object Interactions, [Paper], [Code]
(arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]
(arXiv 2022.04) Residual Swin Transformer Channel Attention Network for Image Demosaicing, [Paper]
(arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]
(arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper], [Code]
(arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [Paper], [Code]
(arXiv 2022.04) What Matters in Language Conditioned Robotic Imitation Learning, [Paper], [Code]
(arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes, [Paper]
(arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension, [Paper]
(arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality?, [Paper]
(arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, [Paper], [Code]
(arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]
(arXiv 2022.04) Event Transformer, [Paper]
(arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]
(arXiv 2022.04) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, [Paper], [Code]
(arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper], [Code]
(arXiv 2022.04) Robust Cross-Modal Representation Learning with Progressive Self-Distillation, [Paper]
(arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]
(arXiv 2022.04) No Token Left Behind: Explainability-Aided Image Classification and Generation, [Paper]
(arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition, [Paper], [Code]
(arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]
(arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]
(arXiv 2022.04) Learning Trajectory-Aware Transformer for Video Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Learning to Induce Causal Structure, [Paper]
(arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]
(arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]
(arXiv 2022.04) Does Robustness on ImageNet Transfer to Downstream Tasks?, [Paper]
(arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]
(arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper], [Code]
(arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]
(arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient event data processing, [Paper], [Code]
(arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]
(arXiv 2022.04) Adapting CLIP For Phrase Localization Without Further Training, [Paper], [Code]
(arXiv 2022.04) FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment, [Paper], [Project]
(arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]
(arXiv 2022.04) Unsupervised Prompt Learning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.04) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer, [Paper], [Project]
(arXiv 2022.04) Unified Contrastive Learning in Image-Text-Label Space, [Paper], [Code]
(arXiv 2022.04) HunYuan_tvr for Text-Video Retrivial, [Paper]
(arXiv 2022.04) Learning to Compose Soft Prompts for Compositional Zero-Shot Learning, [Paper]
(arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]
(arXiv 2022.04) Temporal Alignment Networks for Long-term Video, [Paper], [Code]
(arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]
(arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]
(arXiv 2022.04) CM3: A Causal Masked Multimodal Model of the Internet, [Paper]
(arXiv 2022.04) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, [Paper], [Project]
(arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]
(arXiv 2022.04) Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, [Paper], [Project]
(arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]
(arXiv 2022.04) Learning Audio-Video Modalities from Image Captions, [Paper]
(arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]
(arXiv 2022.04) POS-BERT: Point Cloud One-Stage BERT Pre-Training, [Paper], [Code]
(arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]
(arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]
(arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper]
(arXiv 2022.04) Long Movie Clip Classification with State-Space Video Models, [Paper], [Code]
(arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]
(arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper], [Project]
(arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations, [Paper]
(arXiv 2022.04) SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space, [Paper]
(arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [Paper], [Code]
(arXiv 2022.04) Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task, [Paper]
(arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition, [Paper], [Project]
(arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper]
(arXiv 2022.04) Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, [Paper], [Project]
(arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]
(arXiv 2022.04) MaxViT: Multi-Axis Vision Transformer, [Paper]

2022.03

(arXiv 2022.03) A ConvNet for the 2020s, [Paper], [Code]
(arXiv 2022.03) DeepNet: Scaling Transformers to 1,000 Layers, [Paper]
(arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]
(arXiv 2022.03) ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval, [Paper]
(arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Project]
(arXiv 2022.03) CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation, [Paper]
(arXiv 2022.03) Deformable Video Transformer, [Paper]
(arXiv 2022.03) End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps, [Paper]
(arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]
(arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper], [App]
(arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]
(arXiv 2022.03) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]
(arXiv 2022.03) Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models, [Paper], [Code]
(arXiv 2022.03) Bringing Old Films Back to Life, [Paper], [Code]
(arXiv 2022.03) Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model, [Paper], [Code]
(arXiv 2022.03) SeqTR: A Simple yet Universal Network for Visual Grounding, [Paper], [Code]
(arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [Paper]
(arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]
(arXiv 2022.03) Learning Program Representations for Food Images and Cooking Recipes, [Paper], [Project]
(arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [Paper]
(arXiv 2022.03) VPTR: Efficient Transformers for Video Prediction, [Paper], [Code]
(arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]
(arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [Paper], [Code]
(arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]
(arXiv 2022.03) PromptDet: Expand Your Detector Vocabulary with Uncurated Images, [Paper], [Code]
(arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]
(arXiv 2022.03) Unified Transformer Tracker for Object Tracking, [Paper]
(arXiv 2022.03) X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]
(arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]
(arXiv 2022.03) mc-BEiT: Multi-choice Discretization for Image BERT Pre-training, [Paper]
(arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]
(arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [Paper]
(arXiv 2022.03) Treatment Learning Transformer for Noisy Image Classification, [Paper]
(arXiv 2022.03) Do Vision-Language Pretrained Models Learn Primitive Concepts?, [Paper]
(arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]
(arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]
(arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]
(arXiv 2022.03) Feature Selective Transformer for Semantic Image Segmentation, [Paper]
(arXiv 2022.03) Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos, [Paper], [Code]
(arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution, [Paper], [Code]
(arXiv 2022.03) Single-Stream Multi-Level Alignment for Vision-Language Pretraining, [Paper]
(arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]
(arXiv 2022.03) Collaborative Transformers for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]
(arXiv 2022.03) Brain-inspired Multilayer Perceptron with Spiking Neurons, [Paper], [Code]
(arXiv 2022.03) HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network, [Paper], [Code]
(arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]
(arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper]
(arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]
(arXiv 2022.03) NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge, [Paper]
(arXiv 2022.03) Facial Expression Recognition with Swin Transformer, [Paper]
(arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]
(arXiv 2022.03) Efficient Visual Tracking via Hierarchical Cross-Attention Transformer, [Paper], [Code]
(arXiv 2022.03) High-Performance Transformer Tracking, [Paper], [Code]
(arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]
(arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper]
(arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]
(arXiv 2022.03) Text to Mesh Without 3D Supervision Using Limit Subdivision, [Paper], [Project]
(arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection, [Paper], [Code]
(arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]
(arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, [Paper], [Code]
(arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]
(arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]
(arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]
(arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]
(arXiv 2022.03) Visual Prompt Tuning, [Paper]
(arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]
(arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, [Paper], [Code]
(arXiv 2022.03) METAMORPH: LEARNING UNIVERSAL CONTROLLERS WITH TRANSFORMERS, [Paper], [Project]
(arXiv 2022.03) A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning, [Paper]
(arXiv 2022.03) Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers, [Paper], [Project]
(arXiv 2022.03) Associating Objects with Scalable Transformers for Video Object Segmentation, [Paper], [[Project]](https://github.com/z-x-yang/AOT)
(arXiv 2022.03) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation, [Paper], [Code]
(arXiv 2022.03) Learning to generate line drawings that convey geometry and semantics, [Paper], [Project]
(arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper], [Code]
(arXiv 2022.03) AIMusicGuru: Music Assisted Human Pose Correction, [Paper]
(arXiv 2022.03) What to Hide from Your Students: Attention-Guided Masked Image Modeling, [Paper]
(arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [Paper]
(arXiv 2022.03) ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator, [Paper]
(arXiv 2022.03) Keypoints Tracking via Transformer Networks, [Paper], [Code]
(arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]
(arXiv 2022.03) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, [Paper]
(arXiv 2022.03) Self-supervised Video-centralised Transformer for Video Face Clustering, [Paper]
(arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]
(arXiv 2022.03) Global Tracking Transformers, [Paper], [Code]
(arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]
(arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation, [Paper]
(arXiv 2022.03) Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos, [Paper], [Project]
(arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]
(arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]
(arXiv 2022.03) Under the Hood of Transformer Networks for Trajectory Forecasting, [Paper]
(arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]
(arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]
(arXiv 2022.03) CNNs and Transformers Perceive Hybrid Images Similar to Humans, [Paper], [Code]
(arXiv 2022.03) Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory, [Paper], [Code]
(arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal Text and Image Data, [Paper]
(arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper]
(arXiv 2022.03) CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration, [Paper]
(arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]
(arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image Super Resolution, [Paper]
(arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]
(arXiv 2022.03) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]
(arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Code]
(arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper], [Code]
(arXiv 2022.03) CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning, [Paper], [Code]
(arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper], [Code]
(arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper], [Code]
(arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]
(arXiv 2022.03) simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers, [Paper], [Code]
(arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]
(arXiv 2022.03) End-to-End Video Text Spotting with Transformer, [Paper], [Code]
(arXiv 2022.03) Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation, [Paper], [Code]
(arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]
(arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]
(arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]
(arXiv 2022.03) ViM: Out-Of-Distribution with Virtual-logit Matching, [Paper], [Code]
(arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]
(arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]
(arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]
(arXiv 2022.03) Cascade Transformers for End-to-End Person Search, [Paper], [Code]
(arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]
(arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]
(arXiv 2022.03) Local-Global Context Aware Transformer for Language-Guided Video Segmentation, [Paper], [Code]
(arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]
(arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations? [Paper], [Code]
(arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR CROSS-VIEW GEO-LOCALIZATION, [Paper]
(arXiv 2022.03) DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training, [Paper]
(arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]
(arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper], [Code]
(arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]
(arXiv 2022.03) One-Shot Adaptation of GAN in Just One CLIP, [Paper]
(arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]
(arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer, [Paper]
(arXiv 2022.03) Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image, [Paper], [Code]
(arXiv 2022.03) Transframer: Arbitrary Frame Prediction with Generative Models, [Paper]
(arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]
(arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for Saliency Ranking, [Paper], [Code]
(arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS? [Paper], [Code]
(arXiv 2022.03) WegFormer: Transformers for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper], [Code]
(arXiv 2022.03) UNIFIED VISUAL TRANSFORMER COMPRESSION, [Paper], [Code]
(arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]
(arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]
(arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]
(arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution, [Paper]
(arXiv 2022.03) Revitalize Region Feature for Democratizing Video-Language Pre-training, [Paper], [Code]
(arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]
(arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]
(arXiv 2022.03) MotionCLIP: Exposing Human Motion Generation to CLIP Space, [Paper], [Project]
(arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]
(arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]
(arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper], [Code]
(arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]
(arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]
(arXiv 2022.03) DATR: Domain-adaptive transformer for multi-domain landmark detection, [Paper]
(arXiv 2022.03) EventFormer: AU Event Transformer for Facial Action Unit Event Detection, [Paper]
(arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]
(arXiv 2022.03) All in One: Exploring Unified Video-Language Pre-training, [Paper], [Code]
(arXiv 2022.03) CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment, [Paper]
(arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]
(arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]
(arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper]
(arXiv 2022.03) Disentangled Representation Learning for Text-Video Retrieval, [Paper]
(arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding, [Paper], [Dataset]
(arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]
(arXiv 2022.03) ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [Paper], [Code]
(arXiv 2022.03) Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision, [Paper], [Code]
(arXiv 2022.03) ActiveMLP: An MLP-like Architecture with Active Token Mixer, [Paper], [Code]
(arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]
(arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]
(arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval, [Paper]
(arXiv 2022.03) MVP: Multimodality-guided Visual Pre-training, [Paper]
(arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting, [Paper]
(arXiv 2022.03) Multi-Modal Mixup for Robust Fine-tuning, [Paper]
(arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant, [Paper], [Project]
(arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]
(arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper]
(arXiv 2022.03) WAVEMIX: RESOURCE-EFFICIENT TOKEN MIXING FOR IMAGES, [Paper]
(arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED AUDIO-VISUAL VOICE SEPARATION TRANSFORMER, [Paper], [Code]
(arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]
(arXiv 2022.03) EDGEFORMER: IMPROVING LIGHT-WEIGHT CONVNETS BY LEARNING FROM VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.03) Skating-Mixer: Multimodal MLP for Scoring Figure Skating, [Paper]
(arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]
(arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]
(arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning, [Paper]
(arXiv 2022.03) ChiTransformer: Towards Reliable Stereo from Cues, [Paper]
(arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]
(arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]
(arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) Multiscale Transformer for Hyperspectral Image Classification, [Paper]
(arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning, [Paper], [Code]
(arXiv 2022.03) Autoregressive Image Generation using Residual Quantization, [Paper]
(arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED IMAGE COMPRESSION, [Paper]
(arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]
(arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]
(arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER, [Paper]
(arXiv 2022.03) Towards Efficient and Scalable Sharpness-Aware Minimization, [Paper]
(arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper], [Code]
(arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION, [Paper], [Code]
(arXiv 2022.03) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, [Paper], [Code]
(arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR PAN-SHARPENING, [Paper], [Code]
(arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Cross Language Image Matching for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]
(arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]
(arXiv 2022.03) Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language, [Paper]
(arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]
(arXiv 2022.03) Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, [Paper]
(arXiv 2022.03) Modeling Coreference Relations in Visual Dialog, [Paper], [Code]
(arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR FACE PRESENTATION ATTACK DETECTION, [Paper]
(arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]
(arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]
(arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]
(arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]
(arXiv 2022.03) DCT-Former: Efficient Self-Attention with Discrete Cosine Transform, [Paper], [Code]
(arXiv 2022.03) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, [Paper]
(arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]
(arXiv 2022.03) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, [Paper]
(arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]
(arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]
(arXiv 2022.03) DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation, [Paper]
(arXiv 2022.03) D²ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]
(arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]
(arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]
(arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]
(arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]
(arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]
(arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]
(arXiv 2022.03) Masked Visual Pre-training for Motor Control, [Paper], [Project]
(arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.03) Lane Detection with Versatile AtrousFormer and Local Semantic Guidance, [Paper]
(arXiv 2022.03) Forecasting Characteristic 3D Poses of Human Actions, [Paper], [Code]

2022.02

(arXiv 2022.02) Bayesian Structure Learning with Generative Flow Networks, [Paper]
(arXiv 2022.02) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]
(arXiv 2022.02) An End-to-End Transformer Model for Crowd Localization, [Paper]
(arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]
(arXiv 2022.02) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, [Paper], [Code]
(arXiv 2022.02) ATTENTION ENABLES ZERO APPROXIMATION ERROR, [Paper]
(arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]
(arXiv 2022.02) AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING, [Paper], [Code]
(arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2022.02) LEARNING TO MERGE TOKENS IN VISION TRANSFORMERS, [Paper]
(arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]
(arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR UNSUPERVISED OBJECT DISCOVERY USING NORMALIZED CUT, [Paper], [Project]
(arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]
(arXiv 2022.02) CaMEL: Mean Teacher Learning for Image Captioning, [Paper]
(arXiv 2022.02) Hierarchical Perceiver, [Paper]
(arXiv 2022.02) Movies2Scenes: Learning Scene Representations Using Movie Similarities, [Paper]
(arXiv 2022.02) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]
(arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]
(arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]
(arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]
(arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper], [Code]
(arXiv 2022.02) DataMUX: Data Multiplexing for Neural Networks, [Paper], [Code]
(arXiv 2022.02) On Guiding Visual Attention with Language Specification, [Paper]
(arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR LIGHTING AGGREGATION ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [Paper]
(arXiv 2022.02) MISINFORMATION DETECTION IN SOCIAL MEDIA VIDEO POSTS, [Paper]
(arXiv 2022.02) Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [Paper]
(arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS, [Paper], [Code]
(arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]
(arXiv 2022.02) One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, [Paper]
(arXiv 2022.02) XAI for Transformers: Better Explanations through Conservative Propagation, [Paper]
(arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human Pose and Mesh Reconstruction for In-the-Wild Scenes, [Paper]
(arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]
(arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]
(arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]
(arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]
(arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]
(arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]
(arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]
(arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]
(arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]
(arXiv 2022.02) Visual Acoustic Matching, [Paper]
(arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]
(arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]
(arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]
(arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]
(arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]
(arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]
(arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]
(arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]
(arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]
(arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]
(arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]
(arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]
(arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]
(arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]
(arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]
(arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]
(arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]
(arXiv 2022.02) Spherical Transformer, [Paper]
(arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]
(arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]
(arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]
(arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]
(arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]
(arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]
(arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]
(arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]
(arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]
(arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]
(arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]
(arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]
(arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]
(arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]
(arXiv 2022.02) Local Feature Matching with Transformers for low-end devices LoFTR method adaptation approach, [Paper], [Code]
(arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators? [Paper]
(arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]
(arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

(arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]
(arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]
(arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]
(arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]
(arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]
(arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]
(arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]
(arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]
(arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]
(arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]
(arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]
(arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]
(arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]
(arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]
(arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]
(arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]
(arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]
(arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]
(arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]
(arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
(arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]
(arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]
(arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]
(arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]
(arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]
(arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]
(arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]
(arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]
(arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]
(arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]
(arXiv 2022.01) Patches Are All You Need? [Paper], [Code]
(arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]
(arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]
(arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]
(arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]
(arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]
(arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]
(arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]
(arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]
(arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]
(arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]
(arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]
(arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
(arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]
(arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]
(arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
(arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]
(arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]
(arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]
(arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]
(arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]
(arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]
(arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]
(arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]
(arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]
(arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]
(arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]
(arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
(arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]
(arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]
(arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]
(arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]
(arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]
(arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
(arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]
(arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
(arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]
(arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]
(arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]
(arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
(arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]
(arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]
(arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
(arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
(arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]
(arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]
(arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
(arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]
(arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]
(arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]
(arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
(arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]
(arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]
(arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]
(arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
(arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]
(arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
(arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
(arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

(arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]
(arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
(arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
(arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]
(arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]
(arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]
(arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
(arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]
(arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]
(arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]
(arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
(arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]
(arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
(arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
(arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
(arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]
(arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
(arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]
(arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
(arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
(arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
(arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]
(arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]
(arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]
(arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]
(arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]
(arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
(arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]
(arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
(arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
(arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
(arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]
(arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
(arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]
(arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]
(arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
(arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]
(arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]
(arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
(arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]
(arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZERO-SHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]
(arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]
(arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]
(arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]
(arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]
(arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]
(arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]
(arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
(arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]
(arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]
(arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]
(arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]
(arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]
(arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]
(arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
(arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
(arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]
(arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
(arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
(arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
(arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
(arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]
(arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
(arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]
(arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]
(arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]
(arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
(arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]
(arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]
(arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]
(arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]
(arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]
(arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]
(arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
(arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
(arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]
(arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
(arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
(arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]
(arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
(arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
(arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
(arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]
(arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
(arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]
(arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]
(arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
(arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]
(arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]
(arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
(arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]
(arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
(arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]
(arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
(arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
(arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]
(arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]
(arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
(arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]
(arXiv 2021.12) Transformer based trajectory prediction, [Paper]
(arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]
(arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]
(arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]
(arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
(arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
(arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
(arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]
(arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]
(arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]
(arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]
(arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]
(arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
(arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]
(arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]
(arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]
(arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]
(arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]
(arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
(arXiv 2021.12) Fast Point Transformer, [Paper]
(arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]
(arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]
(arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]
(arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
(arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]
(arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]
(arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]
(arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]
(arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]
(arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]
(arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]
(arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
(arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]
(arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
(arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
(arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]
(arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]
(arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
(arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]
(arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]
(arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]
(arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
(arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]
(arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]
(arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
(arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]
(arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
(arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
(arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
(arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]
(arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]
(arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]
(arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]
(arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]
(arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

(arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
(arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]
(arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
(arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
(arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]
(arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
(arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]
(arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
(arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
(arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
(arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
(arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
(arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]
(arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]
(arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]
(arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
(arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
(arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]
(arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]
(arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
(arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]
(arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
(arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]
(arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]
(arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]
(arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
(arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]
(arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
(arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]
(arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]
(arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
(arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]
(arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]
(arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]
(arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]
(arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]
(arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]
(arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]
(arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]
(arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]
(arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]
(arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]
(arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]
(arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]
(arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]
(arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
(arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]
(arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]
(arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
(arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]
(arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]
(arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
(arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
(arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]
(arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
(arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]
(arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]
(arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]
(arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]
(arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
(arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]
(arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
(arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]
(arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2021.11) , [Paper]
(arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
(arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
(arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]
(arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]
(arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]
(arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]
(arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
(arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
(arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
(arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]
(arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
(arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]
(arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]
(arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]
(arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
(arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
(arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]
(arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]
(arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]
(arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
(arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]
(arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]
(arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
(arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
(arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
(arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]
(arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]
(arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]
(arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]
(arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]
(arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]
(arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
(arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]
(arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
(arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]
(arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]
(arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]
(arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
(arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
(arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]
(arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
(arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]
(arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]
(arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
(arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]
(arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]
(arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
(arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]
(arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
(arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
(arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]
(arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
(arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]
(arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

(arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]
(arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]
(arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]
(arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]
(arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
(arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]
(arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
(arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]
(arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]
(arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]
(arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]
(arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]
(arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]
(arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]
(arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
(arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]
(arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]
(arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]
(arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]
(arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]
(arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]
(arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]
(arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
(arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]
(arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]
(arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
(arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
(arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
(arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
(arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
(arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
(arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]
(arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]
(arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
(arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]
(arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]
(arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]
(arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
(arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]
(arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]
(arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]
(arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
(arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]
(arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
(arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]
(arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
(arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]
(arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]
(arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]
(arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]
(arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
(arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]
(arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]
(arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]
(arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]
(arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
(arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]
(arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]
(arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]
(arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]
(arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]
(arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]
(arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]
(arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]
(arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]
(arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]
(arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]
(arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]
(arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

(arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]
(arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]
(arXiv 2021.09) Visually Grounded Concept Composition, [Paper]
(arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]
(arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
(arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]
(arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
(arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]
(arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
(arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]
(arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]
(arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]
(arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]
(arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]
(arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]
(arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]
(arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
(arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
(arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]
(arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
(arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]
(arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]
(arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
(arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]
(arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]
(arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]
(arXiv 2021.09) Panoptic Narrative Grounding, [Paper]
(arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]
(arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]
(arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]
(arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]
(arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]
(arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]
(arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
(arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]
(arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]
(arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]
(arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]
(arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]
(arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]
(ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]
(arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

(arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]
(arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
(arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
(arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
(arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]
(arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
(arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]
(arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
(arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]
(arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]