1. Vision-Language
- A Vision Check-up for Language Models
- The Neglected Tails in Vision-Language Models
- Beyond Average: Individualized Visual Scanpath Prediction
- ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
- Language Models as Black-Box Optimizers for Vision-Language Models
- Distilling Vision-Language Models on Millions of Videos
- SonicVisionLM: Playing Sound with Vision Language Models
- Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
- MMA: Multi-Modal Adapter for Vision-Language Models
- Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
- Building Vision-Language Models on Solid Foundations with Masked Distillation
- TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
⭐code
- On Scaling Up a Multilingual Vision and Language Model
- CogAgent: A Visual Language Model for GUI Agents
⭐code
- Towards Better Vision-Inspired Vision-Language Models
- SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
- Sequential Modeling Enables Scalable Learning for Large Vision Models
🏠project (large vision models)
- Seeing the Unseen: Visual Common Sense for Semantic Placement
- Efficient Vision-Language Pre-training by Cluster Masking
⭐code
🏠project
- VILA: On Pre-training for Visual Language Models
- EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
⭐code
🏠project
- SPIN: Simultaneous Perception Interaction and Navigation
- MAFA: Managing False Negatives for Vision-Language Pre-training
- Visual In-Context Prompting
⭐code
- Semantics-aware Motion Retargeting with Vision-Language Models
- DePT: Decoupled Prompt Tuning
⭐code
- Osprey: Pixel Understanding with Visual Instruction Tuning
⭐code
- FairCLIP: Harnessing Fairness in Vision-Language Learning
🏠project
- Efficient Test-Time Adaptation of Vision-Language Models
⭐code
- BioCLIP: A Vision Foundation Model for the Tree of Life
⭐code
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
⭐code
- Anchor-based Robust Finetuning of Vision-Language Models
- Multi-Modal Hallucination Control by Visual Information Grounding
- Do Vision and Language Encoders Represent the World Similarly?
- Dual-View Visual Contextualization for Web Navigation
- Any-Shift Prompting for Generalization over Distributions
- Non-autoregressive Sequence-to-Sequence Vision-Language Models
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
⭐code
- SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
⭐code
- RegionGPT: Towards Region Understanding Vision Language Model
- Enhancing Vision-Language Pre-training with Rich Supervisions
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
⭐code
- Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
⭐code
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
⭐code (vision-language compositional understanding)
- FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models
- Improved Baselines with Visual Instruction Tuning
🏠project
- Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
⭐code
- A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
⭐code
- Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
⭐code
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining (vision-language)
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
⭐code
- Iterated Learning Improves Compositionality in Large Vision-Language Models
- ViTamin: Designing Scalable Vision Models in the Vision-Language Era
⭐code
- Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
⭐code
- Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
🏠project
- Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
🏠project
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
⭐code
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
- Learning Vision from Models Rivals Learning Vision from Data
⭐code
- Probing the 3D Awareness of Visual Foundation Models
⭐code
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
🏠project
- Visual Understanding
- LLM
- PixelLM: Pixel Reasoning with Large Multimodal Model
🏠project
- OneLLM: One Framework to Align All Modalities with Language
- Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
- Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- See Say and Segment: Teaching LMMs to Overcome False Premises
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Driving Everywhere with Large Language Model Policy Adaptation
🏠project
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
🏠project
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
🏠project
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
⭐code
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- Pixel Aligned Language Models
🏠project
- SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
- OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
⭐code (multimodal large language models)
- Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs
- LISA: Reasoning Segmentation via Large Language Model
⭐code
- Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
⭐code
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
🏠project
- Honeybee: Locality-enhanced Projector for Multimodal LLM
⭐code (LLM)
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
⭐code
- SEED-Bench: Benchmarking Multimodal Large Language Models
⭐code
- PerceptionGPT: Effectively Fusing Visual Perception into LLM
- UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- ModaVerse: Efficiently Transforming Modalities with LLMs
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models
⭐code
🏠project
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
🏠project (large language models)
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
⭐code
- DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
⭐code
👍summary
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs
⭐code
🏠project
- Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
🏠project
- General Object Foundation Model for Images and Videos at Scale
⭐code
🏠project
👍GLEE: a general-purpose object perception foundation model built jointly by HUST and ByteDance
- Link-Context Learning for Multimodal LLMs
⭐code (LLMs)
- Cloud-Device Collaborative Learning for Multimodal Large Language Models
- LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model
⭐code
👍Results overview | CVPR 2024 fine-grained visual perception multimodal large models Pink and LocLLM
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
⭐code
👍Results overview | CVPR 2024 fine-grained visual perception multimodal large models Pink and LocLLM
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
⭐code
🏠project (MLLMs)
- GSVA: Generalized Segmentation via Multimodal Large Language Models
- VLN
- Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation
⭐code
👍VILP
- Volumetric Environment Representation for Vision-Language Navigation
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
- Vision-and-Language Navigation via Causal Learning
⭐code (vision-and-language navigation)
- Video-Language
- VidLA: Video-Language Alignment at Scale
- SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling
- VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
- VideoLLM-online: Online Video Large Language Model for Streaming Video
🏠project
- Visual Grounding
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- Viewpoint-Aware Visual Grounding in 3D Scenes
- Improved Visual Grounding through Self-Consistent Explanations
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
🏠project
- Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency (visual grounding)
- Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding
- Multi-Attribute Interactions Matter for 3D Visual Grounding
- Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
- Multi-modal Models
- GLaMM: Pixel Grounding Large Multimodal Model
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
⭐code
- What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
🏠project
- Multi-modal Learning for Geospatial Vegetation Forecasting
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- TRINS: Towards Multimodal Language Models that Can Read
- Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
- Vision Foundation Models
- Multi-view Understanding
- Visual Localization