Paper1 CoKe: Contrastive Learning for Robust Keypoint Detection
Abstract: In this paper, we introduce a contrastive learning framework for keypoint detection (CoKe). Keypoint detection differs from other visual tasks where contrastive learning has been applied because the input is a set of images in which multiple keypoints are annotated. This requires contrastive learning to be extended such that the keypoints are represented and detected independently, which enables the contrastive loss to make the keypoint features different from each other and from the background. Our approach has two benefits: It enables us to exploit the power of contrastive learning for keypoint detection, and by detecting each keypoint independently the detection becomes more robust to occlusion compared to holistic methods, such as stacked hourglass networks, which attempt to detect all keypoints jointly. Our CoKe framework introduces several technical innovations. In particular, we introduce: (i) a clutter bank to represent non-keypoint features; (ii) a keypoint bank that stores prototypical representations of keypoints to approximate the contrastive loss between keypoints; and (iii) a cumulative moving average update to learn the keypoint prototypes while training the feature extractor. Our experiments on a range of diverse datasets (PASCAL3D+, MPII, ObjectNet3D) show that our approach works as well as, or better than, alternative methods for keypoint detection, even for human keypoints, for which the literature is vast. Moreover, we observe that CoKe is exceptionally robust to partial occlusion and previously unseen object poses.
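The keypoint bank with its cumulative moving average update and the contrastive objective against prototypes plus clutter can be sketched in a few lines. A minimal PyTorch sketch, assuming L2-normalized features; the bank sizes, temperature, and toy tensors are illustrative assumptions, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def cma_update(prototypes, feats, kp_ids, counts):
    # cumulative moving average of each keypoint's prototype
    for f, k in zip(feats, kp_ids):
        counts[k] += 1
        prototypes[k] += (f - prototypes[k]) / counts[k]
        prototypes[k] = F.normalize(prototypes[k], dim=0)
    return prototypes, counts

def coke_loss(feats, kp_ids, prototypes, clutter_bank, tau=0.07):
    # InfoNCE-style: pull each feature toward its own prototype, push it
    # away from the other prototypes and from clutter (background) features
    candidates = torch.cat([prototypes, clutter_bank], dim=0)  # (K+C, D)
    logits = feats @ candidates.t() / tau                      # (N, K+C)
    return F.cross_entropy(logits, kp_ids)

# toy usage
D, K, C, N = 128, 12, 64, 32
feats = F.normalize(torch.randn(N, D), dim=1)
kp_ids = torch.randint(0, K, (N,))
prototypes = F.normalize(torch.randn(K, D), dim=1)
clutter = F.normalize(torch.randn(C, D), dim=1)
prototypes, _ = cma_update(prototypes, feats, kp_ids, torch.zeros(K))
loss = coke_loss(feats, kp_ids, prototypes, clutter)
```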
Paper2 Fine-Context Shadow Detection Using Shadow Removal
Abstract: Current shadow detection methods perform poorly when detecting shadow regions that are small, unclear or have blurry edges. In this work, we attempt to address this problem on two fronts. First, we propose a Fine Context-aware Shadow Detection Network (FCSD-Net), where we constrain the receptive field size and focus on low-level features to learn fine context features better. Second, we propose a new learning strategy, called Restore to Detect (R2D), where we show that when a deep neural network is trained for restoration (shadow removal), it learns meaningful features to delineate the shadow masks as well. To make use of this complementary nature of the shadow detection and removal tasks, we train an auxiliary network for shadow removal and propose a complementary feature learning block (CFL) to learn and fuse meaningful features from the shadow removal network into the shadow detection network. We train the proposed network, FCSD-Net, using the R2D learning strategy across multiple datasets. Experimental results on three public shadow detection datasets (ISTD, SBU and UCF) show that our method improves shadow detection performance while detecting fine context better than other recent methods. Our proposed learning strategy can also be easily adopted as a useful pipeline in future advances in shadow detection and removal.
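A minimal sketch of a CFL-style fusion block, assuming same-resolution feature maps from the removal and detection branches; the gating design is an illustrative stand-in, not the paper's exact block:

```python
import torch
import torch.nn as nn

class CFLBlock(nn.Module):
    def __init__(self, c_det, c_rem):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(c_det + c_rem, c_det, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(c_rem, c_det, kernel_size=1)

    def forward(self, f_det, f_rem):
        # learn where removal features are complementary, then add them in
        g = self.gate(torch.cat([f_det, f_rem], dim=1))
        return f_det + g * self.proj(f_rem)

f_det, f_rem = torch.randn(1, 64, 56, 56), torch.randn(1, 32, 56, 56)
fused = CFLBlock(64, 32)(f_det, f_rem)   # (1, 64, 56, 56)
```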
Paper3 Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection
Abstract: Zero-shot temporal activity detection (ZSTAD) is the problem of simultaneous temporal localization and classification of activity segments that are previously unseen during training. This is achieved by transferring the knowledge learned from semantically-related seen activities. This ability to reason about unseen concepts without supervision makes ZSTAD very promising for applications where the acquisition of annotated training videos is difficult. In this paper, we design a transformer-based framework titled TranZAD, which streamlines the detection of unseen activities by casting ZSTAD as a direct set-prediction problem, removing the need for hand-crafted designs and manual post-processing. We show how a semantic information-guided contrastive learning strategy can effectively train TranZAD for the zero-shot setting, enabling the efficient transfer of knowledge from the seen to the unseen activities. To reduce confusion between unseen activities and unrelated background information in videos, we introduce a more efficient method of computing the background class embedding by dynamically adapting it as part of the end-to-end learning. Additionally, unlike existing work on ZSTAD, we do not assume knowledge of which classes are unseen during training and use the visual and semantic information of only the seen classes for the knowledge transfer. This makes TranZAD more viable for practical scenarios, which we evaluate by conducting extensive experiments on Thumos’14 and Charades.
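The dynamically adapted background embedding can be pictured as a learnable vector classified alongside the semantic embeddings of the seen classes. A minimal sketch under stated assumptions: the semantic vectors would come from, e.g., word embeddings of class names, and the shapes and temperature are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassifier(nn.Module):
    def __init__(self, dim, sem_embeds):          # sem_embeds: (S, dim) seen classes
        super().__init__()
        self.register_buffer("sem", F.normalize(sem_embeds, dim=1))
        self.bg = nn.Parameter(torch.randn(dim))  # background, adapted end-to-end

    def forward(self, queries, tau=0.05):         # queries: (N, dim) segment embeddings
        q = F.normalize(queries, dim=1)
        protos = torch.cat([self.sem, F.normalize(self.bg, dim=0)[None]], dim=0)
        return q @ protos.t() / tau               # logits over S classes + background

clf = SemanticClassifier(256, torch.randn(20, 256))
logits = clf(torch.randn(50, 256))                # (50, 21)
```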
Paper4 Is Your Noise Correction Noisy? PLS: Robustness To Label Noise With Two Stage Detection
Abstract: Designing robust algorithms capable of training accurate neural networks on uncurated datasets from the web has been the subject of much research, as it reduces the need for time-consuming human labor. The focus of many previous research contributions has been on the detection of different types of label noise; however, this paper proposes to improve the correction accuracy of noisy samples once they have been detected. In many state-of-the-art contributions, a two-phase approach is adopted where the noisy samples are detected before a corrected pseudo-label is guessed in a semi-supervised fashion. The guessed pseudo-labels are then used in the supervised objective without ensuring that the label guess is likely to be correct. This can lead to confirmation bias, which reduces noise robustness. Here we propose the pseudo-loss, a simple metric that we find to be strongly correlated with pseudo-label correctness on noisy samples. Using the pseudo-loss, we dynamically down-weight under-confident pseudo-labels throughout training to avoid confirmation bias and improve network accuracy. We additionally propose a confidence-guided contrastive objective that learns robust representations by interpolating between a class-bound (supervised) objective for confidently corrected samples and an unsupervised representation objective for under-confident label corrections. Experiments demonstrate the state-of-the-art performance of our Pseudo-Loss Selection (PLS) algorithm on a variety of benchmark datasets, including curated data synthetically corrupted with in-distribution and out-of-distribution noise, and two real-world web noise datasets. Our experiments are fully reproducible: github.com/PaulAlbert31/PLS.
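The pseudo-loss idea can be written down directly: score each sample's pseudo-label by the network's own cross-entropy on it and down-weight under-confident ones. A minimal sketch; the exponential mapping to a weight is an illustrative choice, not necessarily the paper's exact normalization:

```python
import torch
import torch.nn.functional as F

def pseudo_loss_weights(logits, pseudo_labels):
    # cross-entropy of the prediction w.r.t. its own pseudo-label;
    # a low pseudo-loss suggests the corrected label is likely right
    pl = F.cross_entropy(logits, pseudo_labels, reduction="none")  # (N,)
    return torch.exp(-pl)      # maps to (0, 1]; confident samples near 1

logits = torch.randn(8, 10)
pseudo = logits.softmax(dim=1).argmax(dim=1)       # guessed pseudo-labels
w = pseudo_loss_weights(logits, pseudo)
loss = (w * F.cross_entropy(logits, pseudo, reduction="none")).mean()
```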
Paper5 Zero-Shot Versus Many-Shot: Unsupervised Texture Anomaly Detection
Abstract: Research on unsupervised anomaly detection (AD) has recently progressed, significantly increasing detection accuracy. This paper focuses on texture images and considers how few normal samples are needed for accurate AD. We first highlight the critical nature of a problem that previous studies have overlooked: accurate detection gets harder for anisotropic textures when image orientations are not aligned between inputs and normal samples. We then propose a zero-shot method, which detects anomalies without using a normal sample. The method is free from the issue of unaligned orientation between input and normal images. It assumes the input texture to be homogeneous, detecting image regions that break the homogeneity as anomalies. We present a quantitative criterion to judge whether this assumption holds for an input texture. Experimental results show the broad applicability of the proposed zero-shot method and its good performance, comparable to or even higher than the state-of-the-art methods using hundreds of normal samples. The code and data are available from https://drive.google.com/drive/folders/10OyPzvI3H6llCZBxKxFlKWt1Pw1tkMK1.
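The homogeneity assumption can be illustrated with a crude statistic: score each patch by how far its local texture statistics deviate from the image-wide consensus. A toy NumPy sketch using gradient-magnitude statistics; the paper's features and criterion are more elaborate:

```python
import numpy as np

def patch_scores(img, p=32):
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    H, W = img.shape
    stats, coords = [], []
    for i in range(0, H - p + 1, p):
        for j in range(0, W - p + 1, p):
            patch = mag[i:i+p, j:j+p]
            stats.append([patch.mean(), patch.std()])
            coords.append((i, j))
    stats = np.array(stats)
    med = np.median(stats, axis=0)
    mad = np.median(np.abs(stats - med), axis=0) + 1e-8
    # patches whose statistics deviate from the image-wide consensus
    scores = np.abs((stats - med) / mad).sum(axis=1)
    return coords, scores

coords, scores = patch_scores(np.random.rand(256, 256))
```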
Paper6 MT-DETR: Robust End-to-End Multimodal Detection With Confidence Fusion
Abstract: Due to the growing need for autonomous driving, camera-based object detection has recently attracted much attention and seen successful development. However, unexpected and severe weather sometimes occurs in outdoor environments, making detection less effective and less predictable. In this case, additional sensors like lidar and radar are adopted to help the camera work in bad weather. However, existing multimodal detection methods do not exploit the complementary characteristics of different vehicle sensors. Therefore, a novel end-to-end multimodal multistage object detection network called MT-DETR is proposed. Unlike unimodal object detection networks, MT-DETR adds fusion modules and enhancement modules and adopts a hierarchical fusion mechanism. The Residual Fusion Module (RFM) and Confidence Fusion Module (CFM) are designed to fuse camera, lidar, radar, and time features. The Residual Enhancement Module (REM) reinforces each unimodal branch, while a multistage loss is introduced to strengthen each branch’s effectiveness. A synthesis algorithm for generating camera-lidar data pairs in foggy conditions further boosts performance in unseen adverse weather. Extensive experiments on various weather conditions of the STF dataset demonstrate that MT-DETR outperforms state-of-the-art methods. The generality of MT-DETR has also been confirmed by replacing the feature extractor in the experiments. The code and pre-trained models are available on https://github.com/Chushihyun/MT-DETR.
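A minimal sketch of the confidence-weighted fusion idea, assuming aligned per-modality feature maps; the 1x1-conv confidence heads are an illustrative stand-in for the CFM:

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    def __init__(self, channels, n_modalities):
        super().__init__()
        self.conf = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(n_modalities)]
        )

    def forward(self, feats):                       # list of (B, C, H, W)
        scores = torch.cat([c(f) for c, f in zip(self.conf, feats)], dim=1)
        w = torch.softmax(scores, dim=1)            # per-pixel modality weights
        return sum(w[:, i:i+1] * f for i, f in enumerate(feats))

cam, lidar, radar = (torch.randn(1, 64, 32, 32) for _ in range(3))
fused = ConfidenceFusion(64, 3)([cam, lidar, radar])
```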
Paper7 Out-of-Distribution Detection via Frequency-Regularized Generative Models
Abstract: Modern deep generative models can assign high likelihood to inputs drawn from outside the training distribution, posing threats to models in open-world deployments. While much research attention has been placed on defining new test-time measures of OOD uncertainty, these methods do not fundamentally change how deep generative models are regularized and optimized in training. In particular, generative models are shown to overly rely on background information to estimate the likelihood. To address the issue, we propose a novel frequency-regularized learning (FRL) framework for OOD detection, which incorporates high-frequency information into training and guides the model to focus on semantically relevant features. FRL effectively improves performance on a wide range of generative architectures, including variational auto-encoder, GLOW, and PixelCNN++. On a new large-scale evaluation task, FRL achieves state-of-the-art performance, outperforming the strong baseline Likelihood Regret by 10.7% (AUROC) while achieving 147x faster inference speed. Extensive ablations show that FRL improves OOD detection performance while preserving the image generation quality.
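The high-frequency signal that FRL injects into training can be obtained with a simple FFT high-pass filter. A minimal sketch; the cutoff radius is an illustrative hyper-parameter:

```python
import numpy as np

def high_frequency(img, radius=8):
    # keep only frequencies outside a centered low-frequency disk
    f = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    yy, xx = np.mgrid[:H, :W]
    mask = (yy - H // 2) ** 2 + (xx - W // 2) ** 2 > radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

hf = high_frequency(np.random.rand(64, 64))
```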
Paper8 Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-Grained Environments
Abstract: Many real-world scenarios in which DNN-based recognition systems are deployed have inherently fine-grained attributes (e.g., bird-species recognition, medical image classification). In addition to achieving reliable accuracy, a critical subtask for these models is to detect Out-of-distribution (OOD) inputs. Given the nature of the deployment environment, one may expect such OOD inputs to also be fine-grained w.r.t. the known classes (e.g., a novel bird species), which makes them extremely difficult to identify. Unfortunately, OOD detection in fine-grained scenarios remains largely underexplored. In this work, we aim to fill this gap by first carefully constructing four large-scale fine-grained test environments, in which existing methods are shown to have difficulties. In particular, we find that even explicitly incorporating a diverse set of auxiliary outlier data during training does not provide sufficient coverage over the broad region where fine-grained OOD samples lie. We then propose Mixture Outlier Exposure (MixOE), which mixes ID data and training outliers to expand the coverage of different OOD granularities, and trains the model such that the prediction confidence linearly decays as the input transitions from ID to OOD. Extensive experiments and analyses demonstrate the effectiveness of MixOE for building OOD detectors in fine-grained environments. The code is available at https://github.com/zjysteven/MixOE.
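The mixing operation is a mixup between an ID image and an auxiliary outlier, with a target that decays linearly from the true label toward maximal uncertainty. A minimal sketch; the Beta parameters and the uniform OOD target are illustrative, and a soft-label cross-entropy would consume y_mix:

```python
import torch

def mixoe_batch(x_id, y_id, x_oe, num_classes, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_id + (1 - lam) * x_oe
    onehot = torch.nn.functional.one_hot(y_id, num_classes).float()
    uniform = torch.full_like(onehot, 1.0 / num_classes)
    # target interpolates between the true label and maximal uncertainty
    y_mix = lam * onehot + (1 - lam) * uniform
    return x_mix, y_mix

x_id, x_oe = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
y_id = torch.randint(0, 10, (4,))
x_mix, y_mix = mixoe_batch(x_id, y_id, x_oe, num_classes=10)
```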
Paper9 Spatio-Temporal Action Detection Under Large Motion
Abstract: Current methods for spatiotemporal action tube detection often extend a bounding box proposal at a given key-frame into a 3D temporal cuboid and pool features from nearby frames. However, such pooling fails to accumulate meaningful spatiotemporal features if the position or shape of the actor shows large 2D motion and variability through the frames, due to large camera motion, large actor shape deformation, fast actor action, and so on. In this work, we aim to study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define actor motion with the intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales: an action with large motion results in lower IoU over time, while slower actions maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance, especially for actions under large motion, compared to the cuboid-aware baseline. As a result, we also report state-of-the-art results on the large-scale MultiSports dataset.
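The motion measure itself is easy to state: IoU between a track's boxes at fixed temporal offsets, with lower IoU meaning larger motion. A minimal sketch with toy boxes:

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def motion_score(track, deltas=(4, 8, 16)):
    # track: list of [x1, y1, x2, y2] per frame; lower IoU -> larger motion
    vals = [iou(track[t], track[t + d])
            for d in deltas for t in range(len(track) - d)]
    return 1.0 - float(np.mean(vals))

track = [[t, t, t + 50, t + 80] for t in range(32)]   # drifting actor
print(motion_score(track))
```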
Paper10 Performer: A Novel PPG-to-ECG Reconstruction Transformer for a Digital Biomarker of Cardiovascular Disease Detection
Abstract: Electrocardiography (ECG), an electrical measurement that captures cardiac activity, is the gold standard for diagnosing cardiovascular disease (CVD). However, ECG is infeasible for continuous cardiac monitoring due to its requirement for user participation. By contrast, photoplethysmography (PPG) provides easy-to-collect data, but its limited accuracy constrains its clinical usage. To combine the advantages of both signals, recent studies incorporate various deep learning techniques to reconstruct ECG signals from PPG; however, the lack of contextual information and the limited ability to denoise biomedical signals ultimately constrain model performance. In this research, we propose Performer, a novel Transformer-based architecture that reconstructs ECG from PPG and combines the PPG and reconstructed ECG as multiple modalities for CVD detection. This is the first time Transformer sequence-to-sequence translation has been applied to biomedical waveform reconstruction, combining the advantages of both PPG and ECG. We also create Shifted Patch-based Attention (SPA), an effective method to encode/decode biomedical waveforms. By handling various sequence lengths and capturing cross-patch connections, SPA captures both local features and global contextual representations. The proposed architecture achieves state-of-the-art performance of 0.29 RMSE for PPG-to-ECG reconstruction on the BIDMC database, surpassing prior studies. We also evaluated this model on the MIMIC-III dataset, achieving 95.9% accuracy in CVD detection, and on the PPG-BP dataset, achieving 75.9% accuracy in related CVD diabetes detection, indicating its generalizability. As a proof of concept, an earring wearable named PEARL (prototype) was designed to scale up the point-of-care (POC) healthcare system.
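One way to picture the shifted-patch idea is tokenizing the same waveform at several shifts so that tokens from different views straddle each other's patch boundaries. A toy sketch under that reading; the patch size and shifts are illustrative assumptions, not the SPA design:

```python
import torch

def shifted_patches(sig, patch=64, shifts=(0, 32)):
    # sig: (B, T). Returns a list of (B, N, patch) token tensors,
    # one tokenization per shift, for cross-patch context.
    views = []
    for s in shifts:
        x = sig[:, s:]
        n = x.shape[1] // patch
        views.append(x[:, : n * patch].reshape(x.shape[0], n, patch))
    return views

tokens = shifted_patches(torch.randn(2, 512))
```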
Paper11 LRA&LDRA: Rethinking Residual Predictions for Efficient Shadow Detection and Removal
Abstract: The majority of state-of-the-art shadow removal models (SRMs) reconstruct whole input images, so their capacity is needlessly spent on reconstructing non-shadow regions. SRMs that predict residuals remedy this to a degree, but fall short of providing an accurate and flexible solution. In this paper, we rethink residual predictions and propose Learnable Residual Attention (LRA) and Learnable Dense Reconstruction Attention (LDRA) modules, which operate over the input and the output of SRMs. These modules guide an SRM to concentrate on shadow region reconstruction and limit the reconstruction of non-shadow regions. The modules improve shadow removal (up to 20%) and detection accuracy across various backbones, and even improve the accuracy of other removal methods (up to 10%). In addition, the modules have minimal overhead (<1MB additional memory) and are implemented in a few lines of code. Furthermore, to combat the challenge of training SRMs with small datasets, we present a synthetic dataset generation pipeline. Using our pipeline, we create a dataset called PITSA, which has 10 times more unique shadow-free images than the largest benchmark dataset. Pre-training models on PITSA significantly improves shadow removal (+2 MAE on shadow regions) and detection accuracy of multiple methods. Our results show that LRA&LDRA, when plugged into a lightweight architecture pre-trained on PITSA, outperform state-of-the-art shadow removal (+0.7 all-region MAE) and detection (+0.1 BER) methods on the benchmark ISTD and SRD datasets, despite running faster (+5%) and consuming less memory (150x).
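The residual-attention idea can be sketched as predicting a soft shadow mask and spending reconstruction capacity only where the mask is active. A minimal illustrative module, not the exact LRA/LDRA design:

```python
import torch
import torch.nn as nn

class ResidualAttentionRemoval(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.residual = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, x):
        m = self.mask(x)                      # ~1 inside shadow, ~0 outside
        return x + m * self.residual(x), m    # non-shadow regions pass through

out, mask = ResidualAttentionRemoval()(torch.randn(1, 3, 64, 64))
```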
Paper12 Hyperdimensional Feature Fusion for Out-of-Distribution Detection
Abstract: We introduce powerful ideas from Hyperdimensional Computing into the challenging field of Out-of-Distribution (OOD) detection. In contrast to many existing works that perform OOD detection based on only a single layer of a neural network, we use similarity-preserving semi-orthogonal projection matrices to project the feature maps from multiple layers into a common vector space. By repeatedly applying the bundling operation, we create expressive class-specific descriptor vectors for all in-distribution classes. At test time, a simple and efficient cosine similarity calculation between descriptor vectors consistently identifies OOD samples with performance competitive with the current state-of-the-art whilst being significantly faster. We show that our method is orthogonal to recent state-of-the-art OOD detectors and can be combined with them to further improve performance.
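A minimal sketch of the pipeline: semi-orthogonal projections map per-layer features into one space, bundling (element-wise sum) builds class descriptors, and cosine similarity scores test samples. Shapes and sample counts are toys:

```python
import torch
import torch.nn.functional as F

D, layer_dims = 1024, [256, 512]                 # common space, two layers
proj = [torch.nn.init.orthogonal_(torch.empty(D, d)) for d in layer_dims]

def descriptor(layer_feats):
    # project each layer's feature and bundle (sum) into one vector
    return sum(P @ f for P, f in zip(proj, layer_feats))

# bundle descriptors of training samples into class-specific vectors
train = {c: [[torch.randn(256), torch.randn(512)] for _ in range(5)]
         for c in range(3)}
class_vecs = {c: sum(descriptor(s) for s in samples)
              for c, samples in train.items()}

test = descriptor([torch.randn(256), torch.randn(512)])
score = max(F.cosine_similarity(test, v, dim=0) for v in class_vecs.values())
# a low maximum similarity to every class descriptor flags the sample as OOD
```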
Paper13 One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text
Abstract: Active consumption of digital documents has yielded scope for research in various applications, including search. Traditionally, searching within a document has been cast as a text matching problem, ignoring the rich layout and visual cues commonly present in structured documents, forms, etc. To that end, we ask a mostly unexplored question: “Can we search for other similar snippets present in a target document page given a single query instance of a document snippet?”. We propose MONOMER to solve this as a one-shot snippet detection task. MONOMER fuses context from the visual, textual, and spatial modalities of snippets and documents to find the query snippet in target documents. We conduct extensive ablations and experiments showing that MONOMER outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Due to the scarcity of relevant data for the task at hand, we train MONOMER on programmatically generated data containing many visually similar query snippet and target document pairs from two datasets - Flamingo Forms and PubLayNet. We also conduct a human study to validate the generated data.
Paper14 Out-of-Distribution Detection With Reconstruction Error and Typicality-Based Penalty
Abstract: The task of out-of-distribution (OOD) detection is vital to realize safe and reliable operation for real-world applications. After the failure of likelihood-based detection in high dimensions had been revealed, approaches based on the typical set have been attracting attention; however, they still have not achieved satisfactory performance. Beginning by presenting the failure case of the typicality-based approach, we propose a new reconstruction error-based approach that employs normalizing flow (NF). We further introduce a typicality-based penalty, and by incorporating it into the reconstruction error in NF, we propose a new OOD detection method, penalized reconstruction error (PRE). Because PRE detects test inputs that lie off the in-distribution manifold, it also effectively detects adversarial examples. We show the effectiveness of our method through evaluations using the natural image datasets CIFAR-10, TinyImageNet, and ILSVRC2012.
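The scoring rule combines the two terms. A schematic sketch under stated assumptions: `encode`/`decode`, `log_prob`, and the typical log-likelihood stand in for a trained normalizing flow; the stubs below are purely illustrative, and the paper derives the reconstruction error from the flow itself:

```python
import numpy as np

def pre_score(x, encode, decode, log_prob, typical_logp, lam=1.0):
    recon = np.linalg.norm(x - decode(encode(x)))   # distance off the ID manifold
    typicality = abs(log_prob(x) - typical_logp)    # deviation from the typical set
    return recon + lam * typicality                 # higher score -> more likely OOD

# toy stand-ins: a lossy bottleneck and a Gaussian-like log-likelihood
encode = lambda x: np.where(np.abs(x) > 0.5, x, 0.0)
decode = lambda z: z
log_prob = lambda x: -0.5 * float(np.sum(x ** 2))
typical_logp = -16.0                                # e.g. mean train log-likelihood
print(pre_score(np.random.randn(32), encode, decode, log_prob, typical_logp))
```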
Paper15 Image-Consistent Detection of Road Anomalies As Unpredictable Patches
Abstract: We propose a novel method for anomaly detection, primarily aimed at autonomous driving. The design of the method, called DaCUP (Detection of anomalies as Consistent Unpredictable Patches), is based on two general properties of anomalous objects: an anomaly is (i) not from a class that could be modelled and (ii) not similar (in appearance) to non-anomalous objects in the image. To this end, we propose a novel embedding bottleneck in an auto-encoder-like architecture that enables modelling of a diverse, multi-modal known class appearance (e.g. road). Secondly, we introduce novel image-conditioned distance features that allow known class identification in a nearest-neighbour manner on-the-fly, greatly increasing the ability to distinguish true and false positives. Lastly, an inpainting module is utilized to model the uniqueness of detected anomalies and significantly reduce false positives by filtering regions that are similar to, and thus reconstructable from, their neighbourhood. We demonstrate that filtering regions based on their similarity to neighbouring regions, using e.g. an inpainting module, is general and can be used with other methods to reduce false positives. The proposed method is evaluated on several publicly available datasets for road anomaly detection and on a maritime benchmark for obstacle avoidance. The method achieves state-of-the-art performance in both tasks with the same hyper-parameters and no domain-specific design.
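The image-conditioned distance features can be sketched as comparing every pixel embedding to reference embeddings sampled on-the-fly from regions already believed to be the known class. A toy sketch; the sampling rule and cosine distance are illustrative choices:

```python
import torch
import torch.nn.functional as F

def distance_features(emb, known_mask, n_ref=128):
    # emb: (C, H, W) embeddings; known_mask: (H, W) bool for e.g. road
    C, H, W = emb.shape
    flat = F.normalize(emb.reshape(C, -1), dim=0)           # (C, HW)
    idx = torch.nonzero(known_mask.reshape(-1)).squeeze(1)
    ref = flat[:, idx[torch.randperm(len(idx))[:n_ref]]]    # (C, n_ref)
    sim = flat.t() @ ref                                    # (HW, n_ref)
    # low maximum similarity to any same-image reference -> likely anomaly
    return (1 - sim.max(dim=1).values).reshape(H, W)

emb = torch.randn(16, 48, 64)
mask = torch.zeros(48, 64, dtype=torch.bool); mask[32:, :] = True
dist_map = distance_features(emb, mask)
```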
Paper16 VSGD-Net: Virtual Staining Guided Melanocyte Detection on Histopathological Images
Abstract: Detection of melanocytes serves as a critical prerequisite in assessing melanocytic growth patterns when diagnosing melanoma and its precursor lesions on skin biopsy specimens. However, this detection is challenging due to the visual similarity of melanocytes to other cells in routine Hematoxylin and Eosin (H&E) stained images, leading to the failure of current nuclei detection methods. Stains such as Sox10 can mark melanocytes, but they require an additional step and expense and thus are not regularly used in clinical practice. To address these limitations, we introduce VSGD-Net, a novel detection network that learns melanocyte identification through virtual staining from H&E to Sox10. The method takes only routine H&E images during inference, resulting in a promising approach to support pathologists in the diagnosis of melanoma. To the best of our knowledge, this is the first study to investigate the detection problem using image synthesis features between two distinct pathology stains. Extensive experimental results show that our proposed model outperforms state-of-the-art nuclei detection methods.
Paper17 Heatmap-Based Out-of-Distribution Detection
Abstract: Our work investigates out-of-distribution (OOD) detection as a neural network output explanation problem. We learn a heatmap representation for detecting OOD images while visualizing in- and out-of-distribution image regions at the same time. Given a trained and fixed classifier, we train a decoder neural network to produce heatmaps with zero response for in-distribution samples and high-response heatmaps for OOD samples, based on the classifier features and the class prediction. Our main innovation lies in the heatmap definition for an OOD sample, as the normalized difference from the closest in-distribution sample. The heatmap serves as a margin to distinguish between in- and out-of-distribution samples. Our approach generates heatmaps not only for OOD detection, but also to indicate the in- and out-of-distribution regions of the input image. In our evaluations, our approach mostly outperforms prior work on fixed classifiers trained on CIFAR-10, CIFAR-100 and Tiny ImageNet. The code is publicly available at: https://github.com/jhornauer/heatmap_ood.
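The heatmap definition can be written out directly: zero response for an ID sample, and for an OOD sample the normalized difference from its closest in-distribution sample. A toy sketch in feature space; the shapes and channel aggregation are illustrative:

```python
import torch

def heatmap_target(feat, id_bank, is_ood):
    # feat: (C, H, W); id_bank: (N, C, H, W) of in-distribution features
    if not is_ood:
        return torch.zeros(feat.shape[1:])
    d = ((id_bank - feat) ** 2).flatten(1).sum(dim=1)   # distance to each ID sample
    nearest = id_bank[d.argmin()]
    diff = (feat - nearest).abs().mean(dim=0)           # aggregate over channels
    return diff / (diff.max() + 1e-8)                   # normalized margin heatmap

target = heatmap_target(torch.randn(8, 16, 16), torch.randn(10, 8, 16, 16), True)
```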
Paper18 Computer Vision for Ocean Eddy Detection in Infrared Imagery
Abstract: Reliable and precise detection of ocean eddies can significantly improve the monitoring of ocean surface and subsurface dynamics, as well as the characterization of local hydrographical and biological properties, such as the concentration of pelagic species. Today, most eddy detection algorithms operate on satellite altimetry gridded observations, which provide daily maps of sea surface height and surface geostrophic velocity. However, the reliability and spatial resolution of altimetry products are limited by the strong spatio-temporal averaging of the mapping procedure. Yet, the availability of high-resolution satellite imagery makes real-time object detection possible at a much finer scale, via advanced computer vision methods. We propose a novel eddy detection method via a transfer learning scheme, using the ground truth of high-resolution ocean numerical models to link the characteristic streamlines of eddies with their signature (gradients, swirls, and filaments) on Sea Surface Temperature (SST). A trained multi-task convolutional neural network is then employed to segment infrared satellite imagery of SST in order to retrieve the accurate position, size, and form of each detected eddy. EddyScan-SST is an operational oceanographic module that provides, in real time, key information on ocean dynamics to maritime stakeholders.
Paper19 FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection
Abstract: Due to its importance in facial behaviour analysis, facial action unit (AU) detection has attracted increasing attention from the research community. Leveraging the online knowledge distillation framework, we propose the “FAN-Trans” method for AU detection. Our model consists of a hybrid network of convolution layers and transformer blocks designed to learn per-AU features and to model AU co-occurrences. The model uses a pre-trained face alignment network as the feature extractor. After further transformation by a small learnable add-on convolutional subnet, the per-AU features are fed into transformer blocks to enhance their representation. As multiple AUs often appear together, we propose a learnable attention drop mechanism in the transformer block to learn the correlation between the features of different AUs. We also design a classifier that predicts AU presence by considering all AUs’ features, to explicitly capture label dependencies. Finally, we make the first attempt to adapt online knowledge distillation in the training stage for this task, further improving the model’s performance. Experiments on the BP4D and DISFA datasets show that our method achieves new state-of-the-art performance on both, demonstrating its effectiveness.
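One reading of the joint classifier is that each AU's logit is predicted from the concatenation of all per-AU tokens, so one AU's evidence can inform the others. A minimal illustrative sketch, not the paper's exact head:

```python
import torch
import torch.nn as nn

class JointAUClassifier(nn.Module):
    # predicts every AU from all AU tokens jointly (label dependencies)
    def __init__(self, n_aus, dim):
        super().__init__()
        self.head = nn.Linear(n_aus * dim, n_aus)

    def forward(self, au_tokens):                 # (B, n_aus, dim)
        return self.head(au_tokens.flatten(1))    # (B, n_aus) presence logits

logits = JointAUClassifier(12, 64)(torch.randn(2, 12, 64))
```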
Paper20 Task Agnostic and Post-Hoc Unseen Distribution Detection
Abstract: Despite recent advances in out-of-distribution (OOD) detection, anomaly detection, and uncertainty estimation, there is no approach that is both task-agnostic and post-hoc. To address this limitation, we design a novel clustering-based ensembling method, called Task Agnostic and Post-hoc Unseen Distribution Detection (TAPUDD), that utilizes the features extracted from a model trained on a specific task. Explicitly, it comprises TAP-Mahalanobis, which clusters the training datasets’ features and determines the minimum Mahalanobis distance of a test sample from all clusters. Further, we propose an Ensembling module that aggregates the computation of iterative TAP-Mahalanobis for different numbers of clusters to provide reliable and efficient cluster computation. Through extensive experiments on synthetic and real-world datasets, we observe that our task-agnostic approach can detect unseen samples effectively across diverse tasks and performs better than or on par with the existing task-specific baselines. We also demonstrate that our method remains viable even for large-scale classification tasks.
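A minimal sketch of TAP-Mahalanobis and the ensembling over cluster counts, using scikit-learn KMeans; the covariance regularization and the set of cluster counts are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def tap_mahalanobis(train_feats, test_feats, k):
    km = KMeans(n_clusters=k, n_init=10).fit(train_feats)
    dists = []
    for c in range(k):
        X = train_feats[km.labels_ == c]
        mu = X.mean(axis=0)
        cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])   # regularized covariance
        inv = np.linalg.inv(cov)
        diff = test_feats - mu
        dists.append(np.einsum("ij,jk,ik->i", diff, inv, diff))
    return np.min(dists, axis=0)          # distance to the closest cluster

def tapudd_score(train_feats, test_feats, ks=(2, 3, 5, 8)):
    # ensemble: average the TAP-Mahalanobis score over several cluster counts
    return np.mean([tap_mahalanobis(train_feats, test_feats, k) for k in ks], axis=0)

scores = tapudd_score(np.random.randn(500, 16), np.random.randn(10, 16))
```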
Paper21 Motion Aware Self-Supervision for Generic Event Boundary Detection
Abstract: The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic, taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices, hence creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augmenting it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to other self-supervised state-of-the-art methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task.
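The simplest differentiable motion cue of the kind such a module builds on is a temporal difference of per-frame features. A toy stand-in for the learned module, purely illustrative:

```python
import torch

def motion_features(frames):
    # differentiable motion cue: temporal difference of consecutive frames
    return frames[:, 1:] - frames[:, :-1]     # (B, T-1, C, H, W)

feats = motion_features(torch.randn(2, 16, 3, 32, 32))
```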
Paper22 Two-Level Data Augmentation for Calibrated Multi-View Detection
Abstract: Data augmentation has proven its usefulness in improving model generalization and performance. While it is commonly applied in computer vision applications, it is rarely used in multi-view systems, since geometric data augmentation can break the alignment among views. This is problematic because multi-view data tend to be scarce and expensive to annotate. In this work we propose to solve this issue by introducing a new multi-view data augmentation pipeline that preserves alignment among views. In addition to traditional augmentation of the input image, we also propose a second level of augmentation applied directly at the scene level. When combined with our simple multi-view detection model, our two-level augmentation pipeline outperforms all existing baselines by a significant margin on the two main multi-view multi-person detection datasets, WILDTRACK and MultiviewX.
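Scene-level augmentation can be illustrated by transforming ground-plane positions once and projecting the result into every view, so all views stay geometrically consistent. A toy sketch; the rigid transform, ranges, and identity homographies are illustrative assumptions:

```python
import numpy as np

def project(H, pts):                       # pts: (N, 2) ground-plane points
    ph = np.c_[pts, np.ones(len(pts))] @ H.T
    return ph[:, :2] / ph[:, 2:]

def scene_augment(ground_pts, view_homographies, max_shift=0.5):
    # one random rigid transform of the scene, applied before projecting
    # to every view, so alignment among views is preserved
    theta = np.random.uniform(-0.1, 0.1)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    t = np.random.uniform(-max_shift, max_shift, size=2)
    aug = ground_pts @ R.T + t
    return [project(H, aug) for H in view_homographies]

pts = np.random.rand(5, 2) * 10            # toy person positions on the ground
Hs = [np.eye(3) for _ in range(3)]         # toy per-view homographies
views = scene_augment(pts, Hs)
```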