Paper Abstract Reading Practice - Daily Practice

This post is for practicing abstract translation. English is my weak subject 😭 (I'm working toward becoming a "small expert" - I really hope that one day I can read a paper without relying on translation tools at all, ideally within my senior year! For now I'm still a "rookie"). Common ways an abstract-translation exercise is assessed:

1. You are first given two minutes to read the abstract silently, then translate it sentence by sentence.

2. Read the English aloud first, then translate it sentence by sentence.

3. You are asked what a particular word means (if you don't know, you may be asked to guess it from the surrounding sentences).

4. Describe the structure of the abstract (roughly: what problem existing research has, and what this paper's contribution is).

5. Questions about the content of the paper, e.g., what is the novelty the paper introduces?

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

Abstract:

Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach, addressing this task, has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures their characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.

Translation (my own attempt):

Multimodal sentiment analysis is an active research area that uses multimodal signals to perform affective analysis of user-generated videos. The main approach to this task has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates gaps between modalities, which poses significant challenges. In this paper, we aim to learn effective representations for each modality to help with fusion. We propose a novel framework, MISA, which projects each modality into two distinct subspaces. The first subspace is modality-invariant: there, the representations of different modalities learn their commonalities, reducing the modality gap. The second subspace is modality-specific: it is private to each modality and captures its characteristic features. These representations provide a holistic view of the multimodal data, which is used in the fusion that leads to the task predictions. Our experiments on the popular sentiment analysis benchmarks MOSI and MOSEI show that our model achieves significant gains over state-of-the-art models. We also consider the task of multimodal humor detection and run experiments on the recently proposed UR_FUNNY dataset. Here too, our model outperforms strong baselines, establishing MISA as a useful multimodal framework.

leverages multimodal signals: makes use of multimodal signals

active area: active (hot?) research area

affective understanding: understanding of emotions/sentiment

user-generated videos: videos created by users

projects each modality to two distinct subspaces: maps each modality into two different subspaces

modality-invariant: shared across modalities (the commonalities between modalities)

private to each modality and captures their characteristic features: unique to each modality, capturing its characteristic features

holistic view: an overall, whole-picture view

demonstrate significant gains: show significant improvements

fares better than: performs better than

establishing MISA as a useful multimodal framework: showing that MISA is a useful multimodal framework
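To make the abstract's "two distinct subspaces" idea concrete for myself, here is a minimal PyTorch sketch of projecting each modality's features into a shared (modality-invariant) and a private (modality-specific) subspace and concatenating everything for fusion. The layer sizes, the toy feature dimensions, and the simple linear prediction head are placeholder assumptions of mine, not the authors' actual MISA implementation.

```python
import torch
import torch.nn as nn

class TwoSubspaceProjector(nn.Module):
    """Project one modality's features into a shared (modality-invariant)
    and a private (modality-specific) subspace, as the abstract describes."""
    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        # hypothetical layer sizes; the real encoders in the paper are more elaborate
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.private = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())

    def forward(self, x: torch.Tensor):
        return self.shared(x), self.private(x)

# toy usage: three modalities (text, audio, video) with made-up feature sizes
feats = {"text": torch.randn(8, 300), "audio": torch.randn(8, 74), "video": torch.randn(8, 35)}
projectors = {m: TwoSubspaceProjector(f.shape[1]) for m, f in feats.items()}
invariant, specific = zip(*(projectors[m](f) for m, f in feats.items()))

# fusion: concatenate all six representations and feed a simple prediction head
fused = torch.cat(list(invariant) + list(specific), dim=-1)
head = nn.Linear(fused.shape[-1], 1)   # e.g., a sentiment regression score
print(head(fused).shape)               # torch.Size([8, 1])
```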

Vision meets Robotics: The KITTI Dataset

Abstract—We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10-100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations and range from freeways over rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.

Index Terms—dataset, autonomous driving, mobile robotics, field robotics, computer vision, cameras, laser, GPS, benchmarks, stereo, optical flow, SLAM, object detection, tracking, KITTI.

Abstract—We present a novel dataset collected from a Volkswagen station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10-100 Hz, using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner, and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse and capture real-world traffic situations, ranging from freeways through rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized, and timestamped, and we provide both the rectified and the raw image sequences. The dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection, and other tasks. This paper describes our recording platform, the data format, and the utilities we provide.

Index terms—dataset, autonomous driving, mobile robotics, field robotics, computer vision, cameras, laser, GPS, benchmarks, stereo, optical flow, SLAM, object detection, tracking, KITTI.

present: to introduce, to show sth.

traffic scenarios: traffic scenes

sensor modalities: kinds/types of sensors

stereo cameras: cameras that capture stereo (paired) images

high-precision: highly accurate
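A note to myself on what "calibrated" and "rectified" mean in practice: each rectified camera comes with a 3x4 projection matrix, so a 3D point in the rectified camera frame maps to pixel coordinates by a matrix multiply followed by a division by depth. Below is a small NumPy sketch of that projection; the matrix values are placeholders in the spirit of KITTI-style calibration, not real calibration data.

```python
import numpy as np

# Hypothetical 3x4 projection matrix of a rectified camera
# (placeholder values, not taken from an actual KITTI calibration file).
P = np.array([[721.5,   0.0, 609.6, 44.9],
              [  0.0, 721.5, 172.9,  0.2],
              [  0.0,   0.0,   1.0,  0.003]])

def project(points_3d: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Project Nx3 points in the rectified camera frame to pixel coordinates."""
    homo = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])  # N x 4 homogeneous
    uvw = homo @ P.T                                                 # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                                  # divide by depth

print(project(np.array([[1.0, 1.5, 10.0]]), P))  # pixel (u, v) of one 3D point
```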

Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry / SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti

Today, visual recognition systems are still rarely used in robotics applications. One of the main reasons for this may be the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel, challenging benchmarks for the tasks of stereo, optical flow, visual odometry / SLAM (simultaneous localization and mapping), and 3D object detection. Our recording platform is equipped with four high-resolution video cameras, a Velodyne laser scanner, and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences 39.2 km in length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians visible per image). Results from state-of-the-art algorithms show that methods ranking high on established datasets such as Middlebury perform below average when moved out of the laboratory into the real world. Our goal is to reduce this bias by providing the computer vision community with challenging benchmarks that pose novel difficulties. Our benchmarks are available online at: http://www.cvlibs.net/datasets/kitti

TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Object detection on drone-captured scenarios is a recent popular task. As drones always navigate in different altitudes, the object scale varies violently, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight bring in the motion blur on the densely packed objects, which leads to great challenge of object distinction. To solve the two issues mentioned above, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect different-scale objects. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with self-attention mechanism. We also integrate convolutional block attention model (CBAM) to find attention region on scenarios with dense objects. To achieve more improvement of our proposed TPH-YOLOv5, we provide bags of useful strategies such as data augmentation, multi-scale testing, multi-model integration and utilizing extra classifier. Extensive experiments on dataset VisDrone2021 show that TPH-YOLOv5 have good performance with impressive interpretability on drone-captured scenarios. On DET-test-challenge dataset, the AP result of TPH-YOLOv5 are 39.18%, which is better than previous SOTA method (DPNetV3) by 1.81%. On VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and achieves well-matched results with 1st place model (AP 39.43%). Compared to baseline model (YOLOv5), TPH-YOLOv5 improves about 7%, which is encouraging and competitive.

Object detection in drone-captured scenes is a recently popular task. Because drones fly at varying altitudes, object scales change drastically, which places a heavy burden on network optimization. Moreover, high-speed, low-altitude flight introduces motion blur on densely packed objects, making them much harder to distinguish. To address these two problems, we propose TPH-YOLOv5. On top of YOLOv5, we add an extra prediction head to detect objects at different scales. We then replace the original prediction heads with Transformer Prediction Heads (TPH) to exploit the predictive potential of the self-attention mechanism. We also integrate the convolutional block attention module (CBAM) to locate attention regions in scenes with dense objects. To further improve the proposed TPH-YOLOv5, we provide a set of useful strategies such as data augmentation, multi-scale testing, multi-model ensembling, and an extra classifier. Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 performs well on drone-captured scenes and has impressive interpretability. On the DET-test-challenge dataset, TPH-YOLOv5 reaches an AP of 39.18%, which is 1.81% better than the previous SOTA method (DPNetV3). In the VisDrone Challenge 2021, TPH-YOLOv5 took 5th place and achieved results comparable to the 1st-place model (AP 39.43%). Compared with the baseline model (YOLOv5), TPH-YOLOv5 improves by about 7%, which is encouraging and competitive.
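The CBAM mentioned in the abstract (convolutional block attention) is a self-contained module: channel attention built from average- and max-pooled descriptors, followed by spatial attention over channel-pooled maps. Here is a minimal PyTorch sketch of it as I understand the standard formulation; the reduction ratio, kernel size, and the toy feature-map shape are my own assumptions, not values taken from the TPH-YOLOv5 code.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention, applied to a feature map of shape (B, C, H, W)."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # channel attention: shared MLP over average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # spatial attention: conv over channel-wise average and max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))          # (B, C) from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))           # (B, C) from max pooling
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return x * torch.sigmoid(self.spatial(sp))

# toy check on a YOLO-style feature map
print(CBAM(256)(torch.randn(1, 256, 40, 40)).shape)  # torch.Size([1, 256, 40, 40])
```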
