ECCV 2018 paper

Overview of Selected ECCV 2018 Papers

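This post collects a set of computer vision papers from ECCV 2018, covering image segmentation, video analysis, 3D reconstruction, deep learning, and related areas. The techniques involved include, among others, semi-convolutional operators, cross-modal embeddings, video motion magnification, and attention-aware mask propagation, reflecting recent progress on tasks such as visual understanding, scene parsing, and object detection.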

Reference links

ECCV 2018 papers

Paper list

Semi-convolutional Operators for Instance Segmentation
Learnable PINs: Cross-Modal Embeddings for Person Identity
Learning-based Video Motion Magnification
Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation
CBAM: Convolutional Block Attention Module
BodyNet: Volumetric Inference of 3D Human Body Shapes
CNN-PS: CNN-based Photometric Stereo for General Non-Convex Surfaces
Spatio-temporal Transformer Network for Video Restoration
PS-FCN: A Flexible Learning Framework for Photometric Stereo
Dynamic Conditional Networks for Few-Shot Learning
Deep Factorised Inverse-Sketching
Separating Reflection and Transmission Images in the Wild
Ask, Acquire, and Attack: Data-free UAP Generation using Class Impressions
Rendering Portraitures from Monocular Camera and Beyond
Object Level Visual Reasoning in Videos
Dense Pose Transfer
Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning
Learning to Segment via Cut-and-Paste
Deep Boosting for Image Denoising
Fictitious GAN: Training GANs with Historical Models
Self-Supervised Relative Depth Learning for Urban Scene Understanding
Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss
Bi-box Regression for Pedestrian Detection and Occlusion Estimation
C-WSL: Count-guided Weakly Supervised Localization
Convolutional Networks with Adaptive Inference Graphs
Summarizing First-Person Videos from Third Persons' Points of View
Programmable Triangulation Light Curtains
Learning Single-View 3D Reconstruction with Limited Pose Supervision
Maximum Margin Metric Learning Over Discriminative Nullspace for Person Re-identification
Snap Angle Prediction for 360° Panoramas
Memory Aware Synapses: Learning what (not) to forget
Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks
Weakly- and Semi-Supervised Panoptic Segmentation
K-convexity shape priors for segmentation
Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images
Boosted Attention: Leveraging Human Attention for Image Captioning
Incremental Multi-graph Matching via Diversity and Randomness based Graph Clustering
Multi-view to Novel view: Synthesizing novel views with Self-Learned Confidence
Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation
Image Inpainting for Irregular Holes Using Partial Convolutions
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Fighting Fake News: Image Splice Detection via Learned Self-Consistency
End-to-End Joint Semantic Segmentation of Actors and Actions in Video
Visual Text Correction
Deep Co-Training for Semi-Supervised Image Recognition
Progressive Neural Architecture Search
Explainable Neural Computation via Stack Neural Module Networks
Attributes as Operators: Factorizing Unseen Attribute-Object Compositions
Scalable Exemplar-based Subspace Clustering on Class-Imbalanced Data
RCAA: Relational Context-Aware Agents for Person Search
Product Quantization Network for Fast Image Retrieval
Hand Pose Estimation via Latent 2.5D Heatmap Regression
Multimodal Unsupervised Image-to-image Translation
Depth-aware CNN for RGB-D Segmentation
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Learning Blind Video Temporal Consistency
Diverse Image-to-Image Translation via Disentangled Representations
Learning to Blend Photos
Switchable Temporal Propagation Network
Deeply Learned Compositional Models for Human Pose Estimation
Unsupervised Video Object Segmentation with Motion-based Bilateral Networks
CornerNet: Detecting Objects as Paired Keypoints
Unsupervised holistic image generation from key local patches
Group Normalization
Generalizing A Person Retrieval Model Hetero- and Homogeneously
CAR-Net: Clairvoyant Attentive Recurrent Network
Cross-Modal Hamming Hashing
PlaneMatch: Patch Coplanarity Prediction for Robust RGB-D Reconstruction
DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency
Distractor-aware Siamese Networks for Visual Object Tracking
Multiresolution Tree Networks for 3D Point Cloud Processing
Propagating LSTM: 3D Pose Estimation based on Joint Interdependency
Deep Video Quality Assessor: From Spatio-temporal Visual Sensitivity to A Convolutional Neural Aggregation Network
Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground
Face Recognition with Contrastive Convolution
Monocular Depth Estimation with Affinity, Vertical Pooling, and Label Enhancement
Domain Adaptation through Synthesis for Unsupervised Person Re-identification
Adding Attentiveness to the Neurons in Recurrent Neural Networks
Neural Stereoscopic Image Style Transfer
Learning Dynamic Memory Networks for Object Tracking
Gray-box Adversarial Training
GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints
Unsupervised Image-to-Image Translation with Stacked Cycle-Consistent Adversarial Networks
Light Structure from Pin Motion: Simple and Accurate Point Light Calibration for Physics-based Modeling
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries
Evaluating Capability of Deep Neural Networks for Image Classification via Information Plane
Super-Identity Convolutional Neural Network for Face Hallucination
SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network
Face Super-resolution Guided by Facial Component Heatmaps
ML-LocNet: Improving Object Localization with Multi-view Learning Network
Facial Expression Recognition with Inconsistently Annotated Datasets
Visual Question Answering as a Meta Learning Task
Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition
Semi-Dense 3D Reconstruction with a Stereo Event Camera
What do I Annotate Next? An Empirical Study of Active Learning for Action Localization
HybridNet: Classification and Reconstruction Cooperation for Semi-Supervised Learning
Self-Calibrating Isometric Non-Rigid Structure-from-Motion
Stroke Controllable Fast Style Transfer with Adaptive Receptive Fields
Reverse Attention for Salient Object Detection
Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
Diagnosing Error in Temporal Action Detectors
Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation
Massively Parallel Video Networks
Transductive Centroid Projection for Semi-supervised Large-scale Recognition
PSANet: Point-wise Spatial Attention Network for Scene Parsing
Robust Anchor Embedding for Unsupervised Video Person Re-Identification in the Wild
Semi-Supervised Deep Learning with Memory
Towards End-to-End License Plate Detection and Recognition: A Large Dataset and Baseline
Repeatability Is Not Enough: Learning Affine Regions via Discriminability
Learning Warped Guidance for Blind Face Restoration
Compressing the Input for CNNs with the First-Order Scattering Transform
Face De-Spoofing: Anti-Spoofing via Noise Modeling
Faces as Lighting Probes via Unsupervised Deep Highlight Extraction
Unsupervised Hard Example Mining from Videos for Improved Object Detection
On Offline Evaluation of Vision-based Driving Models
Deep Fundamental Matrix Estimation
ContextVP: Fully Context-Aware Video Prediction
Visual Psychophysics for Making Face Recognition Algorithms More Explainable
TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild
Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image
Semi-supervised Adversarial Learning to Generate Photorealistic Face Images of New Identities from 3D Morphable Model
Improved Structure from Motion Using Fiducial Marker Matching
Conditional Prior Networks for Optical Flow
Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training
DetNet: Design Backbone for Object Detection
BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
HairNet: Single-View Hair Reconstruction using Convolutional Neural Networks
Neural Network Encapsulation
StarMap for Category-Agnostic Keypoint and Viewpoint Estimation
Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation
Multi-Fiber Networks for Video Recognition
Towards Human-Level License Plate Recognition
Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition
Generalized Loss-Sensitive Adversarial Learning with Manifold Margins
Pose Proposal Networks
Less is More: Picking Informative Frames for Video Captioning
Robust Optical Flow in Rainy Scenes
Into the Twilight Zone: Depth Estimation using Joint Structure-Stereo Optimization
Structured Siamese Network for Real-Time Visual Tracking
Associating Inter-Image Salient Instances for Weakly Supervised Semantic Segmentation
Learning Deep Representations with Probabilistic Knowledge Transfer
Recycle-GAN: Unsupervised Video Retargeting
Escaping from Collapsing Modes in a Constrained Space
Integrating Egocentric Videos in Top-view Surveillance Videos: Joint Identification and Temporal Alignment
Cross-Modal and Hierarchical Modeling of Video and Text
Tackling 3D ToF Artifacts Through Learning and the FLAT Dataset
Visual-Inertial Object Detection and Mapping
Zero-Shot Object Detection
Tracking Emerges by Colorizing Videos
Actor-centric Relation Network
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
SkipNet: Learning Dynamic Routing in Convolutional Networks
Quantized Densely Connected U-Nets for Efficient Landmark Localization
Person Search in Videos with One Portrait Through Visual and Temporal Links
HybridFusion: Real-Time Performance Capture Using a Single Depth Sensor and Sparse IMUs
Variational Wasserstein Clustering
A Modulation Module for Multi-task Learning with Applications in Image Retrieval
Learning Human-Object Interactions by Graph Parsing Neural Networks
Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data
Decouple Learning for Parameterized Image Operators
Grassmann Pooling as Compact Homogeneous Bilinear Pooling for Fine-Grained Visual Classification
Liquid Pouring Monitoring via Rich Sensory Inputs
Leveraging Motion Priors in Videos for Improving Human Segmentation
Triplet Loss in Siamese Network for Object Tracking
Macro-Micro Adversarial Network for Human Parsing
Contour Knowledge Transfer for Salient Object Detection
Point-to-Point Regression PointNet for 3D Hand Pose Estimation
Fine-grained Video Categorization with Redundancy Reduction Attention
Analyzing Clothing Layer Deformation Statistics of 3D Human Motions
DOCK: Detecting Objects by transferring Common-sense Knowledge
Recurrent Squeeze-and-Excitation Context Aggregation Net for Single Image Deraining
Multi-Scale Spatially-Asymmetric Recalibration for Image Classification
Fast and Accurate Intrinsic Symmetry Detection
Open Set Domain Adaptation by Backpropagation
Choose Your Neuron: Incorporating Domain Knowledge through Neuron-Importance
CGIntrinsics: Better Intrinsic Image Decomposition through Physically-Based Rendering
Stereo Computation for a Single Mixture Image
Objects that Sound
Iterative Crowd Counting
Weakly Supervised Region Proposal Network and Object Detection
Image Super-Resolution Using Very Deep Residual Channel Attention Networks
Dividing and Aggregating Network for Multi-view Action Recognition
Layer-structured 3D Scene Inference via View Synthesis
Deblurring Natural Image Using Super-Gaussian Fields
Learning Category-Specific Mesh Reconstruction from Image Collections
Selective Zero-Shot Classification with Augmented Attributes
Real-time 'Actor-Critic' Tracking
Zero-Annotation Object Detection with Web Knowledge Transfer
Question-Guided Hybrid Convolution for Visual Question Answering
Fully Motion-Aware Network for Video Object Detection
Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
Geometric Constrained Joint Lane Segmentation and Lane Boundary Detection
Deterministic Consensus Maximization with Biconvex Programming
Lifting Layers: Analysis and Applications
Simultaneous Edge Alignment and Learning
Deep Feature Pyramid Reconfiguration for Object Detection
Unpaired Image Captioning by Language Pivoting
Goal-Oriented Visual Question Generation via Intermediate Rewards
Modeling Varying Camera-IMU Time Offset in Optimization-Based Visual-Inertial Odometry
Teaching Machines to Understand Baseball Games: Large-Scale Baseball Video Database for Multiple Video Understanding Tasks
Receptive Field Block Net for Accurate and Fast Object Detection
DeepGUM: Learning Deep Robust Regression with a Gaussian-Uniform Mixture Model
Deep Bilinear Learning for RGB-D Action Recognition
RelocNet: Continuous Metric Learning Relocalisation using Neural Nets
Generative Semantic Manipulation with Mask-Contrasting GAN
Interpolating Convolutional Neural Networks Using Batch Normalization
SketchyScene: Richly-Annotated Scene Sketches
An Adversarial Approach to Hard Triplet Generation
Toward Characteristic-Preserving Image-based Virtual Try-On Network
Estimating the Success of Unsupervised Image to Image Translation
SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images
Efficient Uncertainty Estimation for Semantic Segmentation in Videos
Deep Cross-modality Adaptation via Semantics Preserving Adversarial Learning for Sketch-based 3D Shape Retrieval
Deep Adversarial Attention Alignment for Unsupervised Domain Adaptation: the Benefit of Target Expectation Maximization
ICNet for Real-Time Semantic Segmentation on High-Resolution Images
Parallel Feature Pyramid Network for Object Detection
MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network
Deep Directional Statistics: Pose Estimation with Uncertainty Quantification
Person Search by Multi-Scale Matching
Learn-to-Score: Efficient 3D Scene Exploration by Predicting View Utility
Joint Representation and Truncated Inference Learning for Correlation Filter based Tracking
TS2C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection
Hierarchy of Alternating Specialists for Scene Recognition
Revisiting RCNN: On Awakening the Classification Power of Faster RCNN
A Hybrid Model for Identity Obfuscation by Face Replacement
3D Scene Flow from 4D Light Field Gradients
RIDI: Robust IMU Double Integration
Superpixel Sampling Networks
Towards Robust Neural Networks via Random Self-ensemble
The Sound of Pixels
Adaptive Affinity Fields for Semantic Segmentation
Joint Map and Symmetry Synchronization
EC-Net: an Edge-aware Point set Consolidation Network
ReenactGAN: Learning to Reenact Faces via Boundary Transfer
Semi-Supervised Generative Adversarial Hashing for Image Retrieval
Training Binary Weight Networks via Semi-Binary Decomposition
Part-Activated Deep Reinforcement Learning for Action Prediction
Learning to Anonymize Faces for Privacy Preserving Action Detection
Lifelong Learning via Progressive Distillation and Retrospection
Focus, Segment and Erase: An Efficient Network for Multi-Label Brain Tumor Segmentation
Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition
A Closed-form Solution to Photorealistic Image Stylization
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics
3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation
Rethinking the Form of Latent States in Image Captioning
Move Forward and Tell: A Progressive Generator of Video Descriptions
Joint Person Segmentation and Identification in Synchronized First- and Third-person Videos
Transductive Semi-Supervised Deep Learning using Min-Max Features
SAN: Learning Relationship between Convolutional Features for Multi-Scale Object Detection
Visual Tracking via Spatially Aligned Correlation Filters Network
Predicting Future Instance Segmentation by Forecasting Convolutional Features
MVSNet: Depth Inference for Unstructured Multi-view Stereo
Learning Monocular Depth by Distilling Cross-domain Stereo Networks
Person Re-identification with Deep Similarity-Guided Graph Neural Network
Learning and Matching Multi-View Descriptors for Registration of Point Clouds
Flow-Grounded Spatial-Temporal Video Prediction from Still Images
The Contextual Loss for Image Transformation with Non-Aligned Data
Online Dictionary Learning for Approximate Archetypal Analysis
Video Object Segmentation by Learning Location-Sensitive Embeddings
Hashing with Binary Matrix Pursuit
Learning to Capture Light Fields through a Coded Aperture Camera
Learning to Reconstruct High-quality 3D Shapes with Cascaded Fully Convolutional Networks
X2Face: A network for controlling face generation using images, audio, and pose codes
End-to-End Learning of Driving Models with Surround-View Cameras and Route Planners
Model Adaptation with Synthetic and Real Data for Semantic Dense Foggy Scene Understanding
DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural Architectures
Revisiting Autofocus for Smartphone Cameras
Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence
A Dataset of Flash and Ambient Illumination Pairs from the Crowd
Deep Burst Denoising
MaskConnect: Connectivity Learning by Gradient Descent
ISNN: Impact Sound Neural Network for Audio-Visual Object Classification
Dependency-aware Attention Control for Unconstrained Face Recognition with Image Sets
StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction
Compositing-aware Image Search
Online Multi-Object Tracking with Dual Matching Attention Networks
Improving Sequential Determinantal Point Processes for Supervised Video Summarization
Online Detection of Action Start in Untrimmed, Streaming Videos
Volumetric performance capture from minimal camera viewpoints
Coreset-Based Neural Network Compression
A Framework for Evaluating 6-DOF Object Trackers
Learning to Separate Object Sounds by Watching Unlabeled Video
Connecting Gaze, Scene, and Attention: Generalized Attention Estimation via Joint Modeling of Gaze and Scene Saliency
Neural Graph Matching Networks for Fewshot 3D Action Recognition
Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos
Attention-aware Deep Adversarial Hashing for Cross-Modal Retrieval
3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration
Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers
Variable Ring Light Imaging: Capturing Transient Subsurface Scattering with An Ordinary Camera
Graph R-CNN for Scene Graph Generation
Deep Domain Generalization via Conditional Invariant Adversarial Networks
Using LIP to Gloss Over Faces in Single-Stage Face Detection Networks
Pose-Normalized Image Generation for Person Re-identification
Videos as Space-Time Region Graphs
Learning 3D Human Pose from Structure and Motion
Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment
HiDDeN: Hiding Data with Deep Networks
Deep Cross-Modal Projection Learning for Image-Text Matching
Large Scale Urban Scene Modeling from MVS Meshes
Dual-Agent Deep Reinforcement Learning for Deformable Face Tracking
Unified Perceptual Parsing for Scene Understanding
Multimodal Dual Attention Memory for Video Story Question Answering
Deep Reinforcement Learning with Iterative Shift for Visual Tracking
Collaborative Deep Reinforcement Learning for Multi-Object Tracking
Deep Variational Metric Learning
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
Deep Pictorial Gaze Estimation
PSDF Fusion: Probabilistic Signed Distance Function for On-the-fly 3D Data Fusion and Scene Reconstruction
Multi-Scale Context Intertwining for Semantic Segmentation
Learning to Fuse Proposals from Multiple Scanline Optimizations in Semi-Global Matching
Saliency Detection in 360° Videos
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
AugGAN: Cross Domain Adaptation with GAN-based Data Augmentation
Incremental Non-Rigid Structure-from-Motion with Unknown Focal Length
Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries
Graininess-Aware Deep Feature Learning for Pedestrian Detection
Acquisition of Localization Confidence for Accurate Object Detection
Learning Shape Priors for Single-View 3D Completion and Reconstruction
R2P2: A ReparameteRized Pushforward Policy for Diverse, Precise Generative Path Forecasting
Synthetically Supervised Feature Learning for Scene Text Recognition
Localization Recall Precision (LRP): A New Performance Metric for Object Detection
Second-order Democratic Aggregation
Lip Movements Generation at a Glance
Probabilistic Video Generation using Holistic Attribute Control
AGIL: Learning Attention from Human for Visuomotor Tasks
Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd
Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360° Panoramic Imagery
Seeing Tree Structure from Vibration
Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation
HGMR: Hierarchical Gaussian Mixtures for Adaptive 3D Registration
Deep Imbalanced Attribute Classification using Visual Attention Aggregation
Cross-Modal Ranking with Soft Consistency and Noisy Labels for Robust RGB-T Tracking
Shift-Net: Image Inpainting via Deep Feature Rearrangement
Small-scale Pedestrian Detection Based on Topological Line Localization and Temporal Feature Aggregation
Sub-GAN: An Unsupervised Generative Model via Subspaces
VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions
Penalizing Top Performers: Conservative Loss for Semantic Segmentation Adaptation
Interactive Boundary Prediction for Object Selection
Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection
CIRL: Controllable Imitative Reinforcement Learning for Vision-based Self-driving
The Devil of Face Recognition is in the Noise
Where Will They Go? Predicting Fine-Grained Adversarial Multi-Agent Motion using Conditional Variational Autoencoders
Bi-Real Net: Enhancing the Performance of 1-bit CNNs with Improved Representational Capability and Advanced Training Algorithm
X-ray Computed Tomography Through Scatter
Shape Reconstruction Using Volume Sweeping and Learned Photoconsistency
Unsupervised CNN-based Co-Saliency Detection with Graphical Optimization
Unsupervised Person Re-identification by Deep Learning Tracklet Association
Seeing Deeply and Bidirectionally: A Deep Learning Approach for Single Image Reflection Removal
Learning Data Terms for Non-blind Deblurring
Mutual Learning to Adapt for Joint Human Parsing and Pose Estimation
Statistically-motivated Second-order Pooling
Video Re-localization
Orthogonal Deep Features Decomposition for Age-Invariant Face Recognition
Long-term Tracking in the Wild: a Benchmark
Affinity Derivation and Graph Merge for Instance Segmentation
Deep Model-Based 6D Pose Refinement in RGB
Zero-Shot Deep Domain Adaptation
Comparator Networks
Deep Regionlets for Object Detection
DCAN: Dual Channel-wise Alignment Networks for Unsupervised Scene Adaptation
Generating 3D Faces using Convolutional Mesh Autoencoders
ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking
Physical Primitive Decomposition
Inner Space Preserving Generative Pose Machine
Perturbation Robust Representations of Topological Persistence Diagrams
Hierarchical Relational Networks for Group Activity Recognition and Retrieval
Attention-based Ensemble for Deep Metric Learning
Neural Procedural Reconstruction for Residential Buildings
PyramidBox: A Context-assisted Single Shot Face Detector
Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition
Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
Broadcasting Convolutional Network for Visual Relational Reasoning
Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning
View-graph Selection Framework for SfM
DFT-based Transformation Invariant Pooling Layer for Visual Classification
Learning Compression from Limited Unlabeled Data
Bayesian Semantic Instance Segmentation in Open Set World
BOP: Benchmark for 6D Object Pose Estimation
3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints
Appearance-Based Gaze Estimation via Evaluation-Guided Asymmetric Regression
Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation
SegStereo: Exploiting Semantic Information for Disparity Estimation
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
Deep Attention Neural Tensor Network for Visual Question Answering
Pairwise Body-Part Attention for Recognizing Human-Object Interactions
Deep Clustering for Unsupervised Learning of Visual Features
Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features
Learning to Look around Objects for Top-View Representations of Outdoor Scenes
Uncertainty Estimates and Multi-Hypotheses Networks for Optical Flow
Normalized Blind Deconvolution
Selfie Video Stabilization
CubeNet: Equivariance to 3D Rotation and Translation
Improving Generalization via Scalable Neighborhood Component Analysis
Combining 3D Model Contour Energy and Keypoints for Object Tracking
Unsupervised Video Object Segmentation using Motion Saliency-Guided Spatio-Temporal Propagation
Pairwise Confusion for Fine-Grained Visual Classification
Modular Generative Adversarial Networks
Simultaneous 3D Reconstruction for Water Surface and Underwater Scene
Temporal Relational Reasoning in Videos
YouTube-VOS: Sequence-to-Sequence Video Object Segmentation
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
Women also Snowboard: Overcoming Bias in Captioning Models
Graph Distillation for Action Detection with Privileged Modalities
Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences
Proximal Dehaze-Net: A Prior Learning-Based Deep Network for Single Image Dehazing
Deep Component Analysis via Alternating Direction Neural Networks
SDC-Net: Video prediction using spatially-displaced convolution
Exploiting temporal information for 3D human pose estimation
Joint Camera Spectral Sensitivity Selection and Hyperspectral Image Recovery
ADVISE: Symbolism and External Knowledge for Decoding Advertisements
Person Search via A Mask-guided Two-stream CNN Model
GridFace: Face Rectification via Learning Local Homography Transformations
Weakly-supervised Video Summarization using Variational Encoder-Decoder and Web Prior
Compound Memory Networks for Few-shot Video Classification
Contextual-based Image Inpainting: Infer, Match, and Translate
Interpretable Intuitive Physics Model
Polarimetric Three-View Geometry
Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
Weakly-supervised 3D Hand Pose Estimation from Monocular RGB Images
T2Net: Synthetic-to-Realistic Translation for Solving Single-Image Depth Estimation Tasks
Instance-level Human Parsing via Part Grouping Network
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
PPF-FoldNet: Unsupervised Learning of Rotation Invariant 3D Local Descriptors
Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Robust fitting in computer vision: easy or hard?
Graph Adaptive Knowledge Transfer for Unsupervised Domain Adaptation
Single Image Intrinsic Decomposition without a Single Intrinsic Image
Disentangling Factors of Variation with Cycle-Consistent Variational Auto-Encoders
Deep Multi-Task Learning to Recognise Subtle Facial Expressions of Mental States
SRDA: Generating Instance Segmentation Annotation via Scanning, Reasoning and Domain Adaptation
DeepWrinkles: Accurate and Realistic Clothing Modeling
Recovering 3D Planes from a Single Image via Convolutional Neural Networks
Learning 3D Shapes as Multi-Layered Height-maps using 2D Convolutional Networks
A Geometric Perspective on Structured Light Coding
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Robust image stitching with multiple registrations
Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network
Object-centered image stitching
Learning to Dodge A Bullet: Concyclic View Morphing via Deep Learning
CTAP: Complementary Temporal Action Proposal Generation
Effective Use of Synthetic Data for Urban Scene Semantic Segmentation
ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems
ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids
Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency
Learning Discriminative Video Representations Using Adversarial Perturbations
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video
Compositional Learning for Human Object Interaction
Open-World Stereo Video Matching with Deep RNN
stagNet: An Attentive Semantic RNN for Group Activity Recognition
Double JPEG Detection in Mixed JPEG Quality Factors using Deep Convolutional Neural Network
Deep High Dynamic Range Imaging with Large Foreground Motions
Learning 3D Keypoint Descriptors for Non-Rigid Shape Matching
Learning Class Prototypes via Structure Alignment for Zero-Shot Recognition
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images
A Trilateral Weighted Sparse Coding Scheme for Real-World Image Denoising
Linear Span Network for Object Skeleton Detection
DDRNet: Depth Map Denoising and Refinement for Consumer Depth Cameras Using Cascaded CNNs
ELEGANT: Exchanging Latent Encodings with GAN for Transferring Multiple Face Attributes
Progressive Structure from Motion
GAL: Geometric Adversarial Loss for Single-View 3D-Object Reconstruction
Viewpoint Estimation---Insights & Model
Super-Resolution and Sparse View CT Reconstruction
NNEval: Neural Network based Evaluation Metric for Image Captioning
Monocular Depth Estimation Using Whole Strip Masking and Reliability-Based Refinement
Dynamic Filtering with Large Sampling Field for ConvNets
SaaS: Speed as a Supervisor for Semi-supervised Learning
AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos
Local Spectral Graph Convolution for Point Set Feature Learning
Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
VideoMatch: Matching based Video Object Segmentation
Wasserstein Divergence for GANs
Semi-supervised FusedGAN for Conditional Image Generation
Practical Black-box Attacks on Deep Neural Networks using Efficient Query Mechanisms
PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model
Context Refinement for Object Detection
Attention-GAN for Object Transfiguration in Wild Images
Pose Guided Human Video Generation
Exploring the Limits of Weakly Supervised Pretraining
Exploiting Vector Fields for Geometric Rectification of Distorted Document Images
Task-driven Webpage Saliency
Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation
DYAN: A Dynamical Atoms-Based Network For Video Prediction
SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters
Hard-Aware Point-to-Set Deep Metric for Person Re-identification
Coded Two-Bucket Cameras for Computer Vision
Egocentric Activity Prediction via Event Modulated Attention
Real-Time MDNet
Image Generation from Sketch Constraint Using Contextual GAN
Real-Time Hair Rendering using Sequential Adversarial Networks
Sparsely Aggregated Convolutional Networks
Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors
Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation
Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network
Deep Image Demosaicking using a Cascade of Convolutional Residual Denoising Networks
Modality Distillation with Multiple Stream Networks for Action Recognition
Direct Sparse Odometry With Rolling Shutter
Multi-Class Model Fitting by Energy Minimization and Mode-Seeking
Model-free Consensus Maximization for Non-Rigid Shapes
How good is my GAN?
Pose Partition Networks for Multi-Person Pose Estimation
3D-CODED: 3D Correspondences by Deep Deformation
Interpretable Basis Decomposition for Visual Explanation
Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry
HandMap: Robust Hand Pose Estimation via Intermediate Dense Guidance Map Supervision
Partial Adversarial Domain Adaptation
ExFuse: Enhancing Feature Fusion for Semantic Segmentation
Audio-Visual Event Localization in Unconstrained Videos
Understanding Degeneracies and Ambiguities in Attribute Transfer
Relaxation-Free Deep Hashing via Policy Gradient
How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization
Question Type Guided Attention in Visual Question Answering
Saliency Benchmarking Made Easy: Separating Models, Maps and Metrics
A Unified Framework for Multi-View Multi-Class Object Pose Estimation
A New Large Scale Dynamic Texture Dataset with Application to ConvNet Understanding
Dynamic Task Prioritization for Multitask Learning
Deep Feature Factorization For Concept Discovery
Diverse feature visualizations reveal invariances in early layers of deep neural networks
Reinforced Temporal Attention and Split-Rate Transfer for Depth-Based Person Re-Identification
NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications
Estimating Depth from RGB and Sparse Sensing
Grounding Visual Explanations
End-to-End Incremental Learning
Toward Scale-Invariance and Position-Sensitive Region Proposal Networks
Deep Regression Tracking with Shrinkage Loss
A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers
Adversarial Open-World Person Re-Identification
Conditional Image-Text Embedding Networks
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Dist-GAN: An Improved GAN using Distance Constraints
Pivot Correlational Neural Network for Multimodal Video Categorization
Generative Domain-Migration Hashing for Sketch-to-Image Retrieval
TBN: Convolutional Neural Network with Ternary Inputs and Binary Weights
Multi-object Tracking with Neural Gating Using Bilinear LSTM
Highly-Economized Multi-View Binary Compression for Scalable Image Clustering
Part-Aligned Bilinear Representations for Person Re-Identification
End-to-end View Synthesis for Light Field Imaging with Pseudo 4DCNN
Action Anticipation with RBF Kernelized Feature Mapping RNN
Joint Blind Motion Deblurring and Depth Estimation of Light Field
Learning to Navigate for Fine-grained Classification
Specular-to-Diffuse Translation for Multi-View Reconstruction
Clustering Convolutional Kernels to Compress Deep Neural Networks
Scale Aggregation Network for Accurate and Efficient Crowd Counting
Fine-Grained Visual Categorization using Meta-Learning Optimization with Sample Selection of Auxiliary Data
Sampling Algebraic Varieties for Robust Camera Autocalibration
Stacked Cross Attention for Image-Text Matching
Data-Driven Sparse Structure Selection for Deep Neural Networks
DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks
Attribute-Guided Face Generation Using Conditional CycleGAN
On the Solvability of Viewing Graphs
A-Contrario Horizon-First Vanishing Point Detection Using Second-Order Grouping Laws
Deep Volumetric Video From Very Sparse Multi-View Performance Capture
Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes
Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping
RT-GENE: Real-Time Eye Gaze Estimation in Natural Environments
Deep Video Generation, Prediction and Completion of Human Action Sequences
Quantization Mimic: Towards Very Tiny CNN for Object Detection
Deep Structure Inference Network for Facial Action Unit Recognition
Deep Shape Matching
Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses
Efficient Semantic Scene Completion Network with Spatial Group Convolution
Interaction-aware Spatio-temporal Pyramid Attention Networks for Action Classification
Deep Texture and Structure Aware Filtering Network for Image Smoothing
Learning to Solve Nonlinear Least Squares for Monocular Stereo
Unsupervised Class-Specific Deblurring
VSO: Visual Semantic Odometry
Semantic Match Consistency for Long-Term Visual Localization
Learning Priors for Semantic 3D Reconstruction
The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking
Learning with Biased Complementary Labels
NAM: Non-Adversarial Unsupervised Domain Mapping
Motion Feature Network: Fixed Motion Filter for Action Recognition
Transferable Adversarial Perturbations
Semantically Aware Urban 3D Reconstruction with Plane-Based Regularization
Learning Type-Aware Embeddings for Fashion Compatibility
Visual Reasoning with Multi-hop Feature Modulation
Object Detection in Video with Spatiotemporal Sampling Networks
Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes
Extreme Network Compression via Filter Group Approximation
Efficient Sliding Window Computation for NN-Based Template Matching
MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models
Single Image Highlight Removal with a Sparse and Low-Rank Reflection Model
ArticulatedFusion: Real-time Reconstruction of Motion, Geometry and Segmentation Using a Single Depth Camera
Museum Exhibit Identification Challenge for the Supervised Domain Adaptation and Beyond
Reconstruction-based Pairwise Depth Dataset for Depth Image Enhancement Using CNN
MRF Optimization with Separable Convex Prior on Partially Ordered Labels
Deep Generative Models for Weakly-Supervised Multi-Label Classification
Attend and Rectify: a gated attention mechanism for fine-grained recovery
ADVIO: An Authentic Dataset for Visual-Inertial Odometry
SRFeat: Single Image Super-Resolution with Feature Discrimination
Efficient 6-DoF Tracking of Handheld Objects from an Egocentric Viewpoint
Learning Visual Question Answering by Bootstrapping Hard Attention
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
Spatio-Temporal Channel Correlation Networks for Action Classification
Video Summarization Using Fully Convolutional Sequence Networks
Deep Autoencoder for Combined Human Pose Estimation and Body Model Upscaling
A Style-Aware Content Loss for Real-time HD Style Transfer
A Zero-Shot Framework for Sketch based Image Retrieval
Lambda Twist: An Accurate Fast Robust Perspective Three Point (P3P) Solver
Multi-modal Cycle-consistent Generalized Zero-Shot Learning
Modeling Visual Context is Key to Augmenting Object Detection Datasets
ForestHash: Semantic Hashing With Shallow Random Forests and Tiny Convolutional Networks
Extending Layered Models to 3D Motion
Scale-Awareness of Light Field Camera based Visual Odometry
Joint 3D tracking of a deformable object in interaction with a hand
Local Orthogonal-Group Testing
Occlusion-aware Hand Pose Estimation Using Hierarchical Mixture Density Network
Rolling Shutter Pose and Ego-motion Estimation using Shape-from-Template
Recognition in Terra Incognita
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
A Minimal Closed-Form Solution for Multi-Perspective Pose Estimation using Points and Lines
Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks
FishEyeRecNet: A Multi-Context Collaborative Deep Network for Fisheye Image Rectification
Unveiling the Power of Deep Tracking
LSQ++: Lower running time and higher recall in multi-codebook quantization
HBE: Hand Branch Ensemble Network for Real-time 3D Hand Pose Estimation
Retrospective Encoders for Video Summarization
Sequential Clique Optimization for Video Object Segmentation
Constraint-Aware Deep Neural Network Compression
Linear RGB-D SLAM for Planar Environments
Learning Region Features for Object Detection
Video Compression through Image Interpolation
Key-Word-Aware Network for Referring Expression Image Segmentation
LAPRAN: A Scalable Laplacian Pyramid Reconstructive Adversarial Network for Flexible Compressive Sensing Reconstruction
Recurrent Fusion Network for Image captioning
On Regularized Losses for Weakly-supervised CNN Segmentation
Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network
A Segmentation-aware Deep Fusion Network for Compressed Sensing MRI
End-to-End Deep Structured Models for Drawing Crosswalks
Few-Shot Human Motion Prediction via Meta-Learning
Correcting the Triplet Selection Bias for Triplet Loss
3D Face Reconstruction from Light Field Images: A Model-free Approach
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering
Sidekick Policy Learning for Active Visual Exploration
Good Line Cutting: towards Accurate Pose Tracking of Line-assisted VO/VSLAM
Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds
Attentive Semantic Alignment with Offset-Aware Correlation Kernels
"Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention
CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping
CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps
Single Image Water Hazard Detection using FCN with Reflection Attention Units
ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition
Bidirectional Feature Pyramid Network with Recurrent Attention Residual Modules for Shadow Detection
Where are the blobs: Counting by Localization with Point Supervision
Dense Semantic and Topological Correspondence of 3D Faces without Landmarks
Textual Explanations for Self-Driving Vehicles
Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-identification
Efficient Relative Attribute Learning using Graph Neural Networks
Contemplating Visual Emotions: Understanding and Overcoming Dataset Bias
Joint & Progressive Learning from High-Dimensional Data for Multi-Label Classification
Using Object Information for Spotting Text
MVTec D2S: Densely Segmented Supermarket Dataset
Video Object Detection with an Aligned Spatial-Temporal Memory
Asynchronous, Photometric Feature Tracking using Events and Frames
Deep Recursive HDRI: Inverse Tone Mapping using Generative Adversarial Networks
DeepKSPD: Learning Kernel-matrix-based SPD Representation for Fine-grained Image Recognition
Remote Photoplethysmography Correspondence Feature for 3D Mask Face Presentation Attack Detection
Fast Light Field Reconstruction With Deep Coarse-To-Fine Modeling of Spatial-Angular Clues
Deep Discriminative Model for Video Classification
Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image
Image Reassembly Combining Deep Learning and Shortest Path Problem
Coded Illumination and Imaging for Fluorescence Based Classification
GANimation: Anatomically-aware Facial Animation from a Single Image
Deep Kalman Filtering Network for Video Compression Artifact Reduction
A Deeply-initialized Coarse-to-fine Ensemble of Regression Trees for Face Alignment
Deep Expander Networks: Efficient Deep Networks from Graph Theory
Coloring with Words: Guiding Image Colorization Through Text-based Palette Generation
BusterNet: Detecting Copy-Move Image Forgery with Source/Target Localization
Task-Aware Image Downscaling
Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition
Self-Calibration of Cameras with Euclidean Image Plane in Case of Two Views and Known Relative Rotation Angle
To learn image super-resolution, use a GAN to learn how to do image degradation first
Multi-scale Residual Network for Image Super-Resolution
Efficient Global Point Cloud Registration by Matching Rotation Invariant Features Through Translation Search
FloorNet: A Unified Framework for Floorplan Reconstruction from 3D Scans
Facial Dynamics Interpreter Network: What are the Important Relations between Local Dynamics for Facial Trait Estimation?
Transferring GANs: generating images from limited data
A Dataset for Lane Instance Segmentation in Urban Environments
Visual Question Generation for Class Acquisition of Unknown Objects
DeepVS: A Deep Learning Based Video Saliency Prediction Approach
Saliency Preservation in Low-Resolution Grayscale Images
Pairwise Relational Networks for Face Recognition
Proxy Clouds for Live RGB-D Stream Processing and Consolidation
U-PC: Unsupervised Planogram Compliance
Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World
Deep Metric Learning with Hierarchical Triplet Loss
Efficient Dense Point Cloud Object Reconstruction using Deformation Vector Fields
DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation
Improving DNN Robustness to Adversarial Attacks using Jacobian Regularization
Joint Learning of Intrinsic Images and Semantic Segmentation
Recurrent Tubelet Proposal and Recognition Networks for Action Detection
Domain transfer through deep activation matching
Towards Privacy-Preserving Visual Recognition via Adversarial Training: A Pilot Study
Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera
Beyond local reasoning for stereo confidence estimation with deep learning
Self-supervised Knowledge Distillation Using Singular Value Decomposition
Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
Concept Mask: Large-Scale Segmentation from Semantic Concepts
Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net
Adaptively Transforming Graph Matching
Deep Continuous Fusion for Multi-Sensor 3D Object Detection
PARN: Pyramidal Affine Regression Networks for Dense Semantic Correspondence
Multimodal image alignment through a multiscale chain of neural networks with application to remote sensing
Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)
Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-out Classifiers
Start, Follow, Read: End-to-End Full-Page Handwriting Recognition
PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities
Adversarial Geometry-Aware Human Motion Prediction
WildDash - Creating Hazard-Aware Benchmarks
RefocusGAN: Scene Refocusing using a Single Image
Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving
Zero-shot keyword spotting for visual speech recognition in-the-wild
Learning Efficient Single-stage Pedestrian Detectors by Asymptotic Localization Fitting
Generative Adversarial Network with Spatial Attention for Face Attribute Editing
Scenes-Objects-Actions: A Multi-Task, Multi-Label Video Dataset
Descending, lifting or smoothing: Secrets of robust cost optimization
Deep Bilevel Learning
Realtime Time Synchronized Event-based Stereo
Understanding Perceptual and Conceptual Fluency at a Large Scale
Structure-from-Motion-Aware PatchMatch for Adaptive Optical Flow Estimation
Unsupervised Learning of Multi-Frame Optical Flow with Occlusions
Distortion-Aware Convolutional Filters for Dense Prediction in Panoramic Images
Accelerating Dynamic Programs via Nested Benders Decomposition with Application to Multi-Person Pose Estimation
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas
Joint optimization for compressive video sensing and reconstruction under hardware constraints
A+D Net: Training a Shadow Detector with Adversarial Shadow Attenuation
Simple Baselines for Human Pose Estimation and Tracking
Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance
Geolocation Estimation of Photos using a Hierarchical Model and Scene Classification
Universal Sketch Perceptual Grouping
License Plate Detection and Recognition in Unconstrained Scenarios
Affine Correspondences between Central Cameras for Rapid Relative Pose Estimation
ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases
Human Motion Analysis with Deep Metric Learning
Real-to-Virtual Domain Unification for End-to-End Autonomous Driving
Imagine This! Scripts to Compositions to Videos
Exploring Visual Relationship for Image Captioning