[Localization Paper Reading Series] Can WiFi Estimate Person Pose?


0. Quick Overview

0.1 Paper Information

Title: Can WiFi Estimate Person Pose?
Source: a paper found on Papers with Code
Code: https://github.com/geekfeiw/WiSPPN

0.2 Overview

Single-person pose estimation with WiFi: the paper proposes the WiSPPN method, letting WiFi achieve camera-like results.

0.2.1 What the Paper Studies

[Figure: architecture of the proposed method]

0.2.3 Assessment

What this paper gave me:
1. The method is worth learning from: strong novelty, estimating person pose with WiFi
2. The experimental design is fairly complete
Rating: ⭐⭐⭐⭐
Worth re-reading: √ (open-source code available)

1.Abstract

1.1 Translation

WiFi human sensing has achieved great progress in indoor localization, activity classification, etc.

Retracing the development of these works, we have a natural question: can WiFi devices work like cameras for vision applications?

In this paper, we try to answer this question by exploring the ability of WiFi to estimate single-person pose.

We use a 3-antenna WiFi sender and a 3-antenna receiver to generate WiFi data.

Meanwhile, we use a synchronized camera to capture person videos for corresponding keypoint annotations.

We further propose a fully convolutional network (FCN), termed WiSPPN, to estimate single-person pose from the collected data and annotations.

Evaluation on over 80k images (16 sites and 8 persons) answers the aforesaid question positively.

Codes have been made publicly available at https://github.com/geekfeiw/WiSPPN.

1.2 Summary

  • To explore whether WiFi can estimate person pose, the authors propose WiSPPN, a method that estimates pose from WiFi data.

2.INTRODUCTION

2.1 Translation

Paragraph 1 (feasibility of estimating human pose with WiFi)

The key components of ubiquitous WiFi networks, WiFi devices, have been widely explored in many human sensing works such as indoor localization [1–4] and activity classification [5–7].

Retracing the development of these works, a natural question arises: whether WiFi devices can work like cameras for fine-grained human sensing tasks such as person pose estimation.

If the answer is yes, WiFi could be an alternative or supplementary solution for cameras in some situations, such as sensing through walls, under occlusion, and in the dark.

Besides these advantageous physical properties compared to cameras, WiFi devices are prevalent, require less deployment cost, and raise fewer privacy concerns for the public.

Paragraph 2 (learning the mapping from WiFi signals to person pose requires pose supervision)

Though estimating person pose with WiFi has high practical impact for the above reasons, it is full of challenges.

First, WiFi is designed for wireless communication and carries no direct information on person keypoint coordinates.

We cannot benefit from the most popular person pose estimation schema in computer vision, i.e., inferring the person location from an image and then regressing the keypoint heatmaps [8, 9].

Thus, in order to learn the mapping from WiFi signals to person pose, pose supervision must be prepared, and it must correspond with the WiFi signals.

To deal with this problem, we combine a camera with the WiFi antennas to capture person videos.

The camera and WiFi are synchronized with Unix time to guarantee the correspondence.

The pose supervision is derived from the videos through AlphaPose [8], an accurate yet fast open-source person pose estimation repository.

Paragraph 3 (how the system is designed for WiFi-based pose estimation + contributions)

Second, estimating person pose from WiFi signals is very pioneering, so we have little prior work to refer to.

Even in the computer vision community, it took decades to reach acceptable performance for image or video inputs.

Generally, deep networks that estimate from WiFi should be designed completely differently.

After an abundant survey, we propose WiSPPN (abbreviation of WiFi single person pose networks), which is a selective combination of CSI-Net [10], ResNet [11] and FCN [12].

To be exact, we utilize the up-sampling stage of CSI-Net to encode WiFi signals.

Then we use ResNet to extract features.

Moreover, we propose an innovative pose embedding approach inspired by the adjacency matrix in graph theory.

This approach takes the length constraints among pose coordinates into account and makes pose estimation feasible with an FCN.

By solving these two challenges, we achieve single-person pose estimation with WiFi.

Evaluation over 80k images shows that our approach achieves single-person pose estimation well.

The contributions of this paper can be summarized as follows.

  1. We put forward the question of whether WiFi can be used like cameras for vision problems. We answered this question positively by demonstrating that WiFi signals can be used for single-person pose estimation.

  2. To answer this question, we built a multi-modality system, collected a dataset, and proposed a novel deep network to learn the mapping from WiFi signals to person keypoint coordinates.

3 RELATED WORK

3.1 Translation

Paragraph 1 (related work on camera-based pose estimation)

Camera. Estimating multi-person pose from RGB images is a widely-studied problem [13–15].

The leading solutions of the COCO Challenge, such as AlphaPose [8] and CPN [9], tend to apply a person detector to crop every person from the image, then do single-person pose estimation on the cropped feature maps, regressing a heatmap for each body keypoint.

The coordinates with the highest confidence are the estimate of the single-person pose.

Paragraph 2 (related work on pose estimation with other sensors)

Other sensors. Due to the potential uses of pose estimation, researchers have applied many other sensors to estimate person body pose or sketch.

Wall++ [16] turns a common wall into large electrodes by painting it with water-based nickel paint [17].

The Wall++ can then sense the airborne electromagnetic variance caused by the human body and estimate person pose.

LiSense [18] and StarLight [19] use a ceiling LED and a photo-resistor blanket to capture the human body's shadow on the blanket, then reconstruct the body sketch/pose.

With improved technology, a person's hand can also be reconstructed by similar systems [20].

RF-Capture [21] and RF-Pose [22] implement radars with frequency-modulated continuous-wave (FMCW) equipment to estimate person body sketch/pose.

Even single-photon sensors can be used to reconstruct the person body [23, 24].

Compared to these sensors, devices with WiFi chips may be the most pervasive, such as routers, cell phones, and the blooming Internet-of-Things.

3.Background

Translation

3.1 WiFi Signals and Channel State Information

Paragraph 1 (the information carried by WiFi signals can be measured with CSI)

Under the IEEE 802.11n/g/ac protocols, WiFi works around 2.4/5 GHz (central frequency) with multiple channels.

In each channel, the bandwidth is 20/40/80/160 MHz.

Within the band, carriers of different frequencies are modulated to carry information for wireless communication in parallel, which is called orthogonal frequency-division multiplexing (OFDM) and is illustrated in the left of Fig. 2.

During propagation, WiFi carriers decay in power and shift in phase.

Moreover, their frequencies may also change when encountering a moving object due to the Doppler effect.

Channel State Information (CSI), a physical-layer indicator, can be used to represent these variations of the carriers.
[Figure 2. Left: OFDM; information is carried by n carriers in parallel. Right: 16-QAM; each modulated carrier carries 4 bits of data. To carry '1010', the carrier is modulated as x = 3 + 3i and broadcast. The received signal is y, and the variation during propagation, h = y/x, is used for human sensing.]

Paragraph 2 (explains Fig. 2 and equates WiFi signals with CSI)

Take the modulation method of 16-quadrature amplitude modulation (16-QAM) for example: as shown in the right of Fig. 2, one modulated carrier carries 4 bits of information at a time.

When the sender sends '1111' to the receiver, the carrier is modulated as x = 1 + 1i. Suppose the receiver receives y = 0.8 + 0.9i.

Then the variation happening during propagation is h = y/x = 0.85 + 0.05i, which is called the CSI of this carrier.
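To make the arithmetic concrete, here is a minimal sketch of that CSI computation using Python's built-in complex numbers (the symbol values are the ones from the example above):

```python
import numpy as np

# 16-QAM example from Fig. 2: the sender modulates '1111' as x = 1 + 1i.
x = 1 + 1j
# The receiver observes a distorted symbol y after propagation.
y = 0.8 + 0.9j

# The CSI of this carrier is the complex channel ratio h = y / x.
h = y / x
print(h)           # (0.85+0.05j)
print(abs(h))      # power decay of the carrier
print(np.angle(h)) # phase shift of the carrier
```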

For human sensing applications, the human body, as an object, is able to make the carriers change.

In this paper, we aim to learn the mapping rule from this change to single-person pose coordinates.

We set WiFi to work within a 20 MHz band; the CSI of 30 carriers can then be obtained through an open-source tool [25].

In the remaining content of this paper, WiFi signals and CSI refer to the same thing unless stated otherwise.

3.2 AlphaPose

Paragraph 1 (introduces AlphaPose, used to obtain pose annotations)

AlphaPose is an open-source multi-person pose estimation repository, which is also applicable to single-person pose estimation.

AlphaPose is a two-step framework, which first detects person bounding boxes with a person detector (YOLOv3 [26]) and then estimates a pose for each detected box with a pose regressor.

With the innovative regional multi-person pose estimation framework (RMPE) [8], AlphaPose gains resilience to inaccurate person detection, which largely improves pose estimation performance.

Please refer to [8] for more details on AlphaPose and RMPE.

Paragraph 2 (an example of AlphaPose estimation)

When applied to single-person pose estimation, AlphaPose generates n three-element predictions in the format (x_i, y_i, c_i), where n is the number of keypoints to be estimated, x_i and y_i are the coordinates of the i-th keypoint, and c_i is the confidence of those coordinates.

In this paper, we use the COCO person keypoint setting and n is 18. Four estimation examples of AlphaPose are shown at the top of Figure 1.
[Figure 1: examples of person pose estimation from a camera-based approach (AlphaPose [8]) and our WiFi-based approach. The rendered images in the first row are manually marked.]

4 Methodology

Translation

4.1 System Build

Paragraph 1 (pairing WiFi signals with collected pose data)

To do pose estimation from WiFi by learning, we must have pose annotations.

However, we cannot mark person pose coordinates in the WiFi signals, thus we use a camera aligned with the WiFi antennas to capture person videos.

Then the video is processed by AlphaPose [8] to obtain pose annotations as coordinates and confidences.

Besides, the camera and WiFi antennas are synchronized by their recorded time-stamps.

The WiFi CSI recording system is comprised of two ends, one 3-antenna sender and one 3-antenna receiver.

The sender broadcasts WiFi signals; meanwhile, the receiver parses CSI through [25] when receiving the broadcast WiFi.

In our setting, the parsed CSI is a tensor of size n × 30 × 3 × 3, where n is the number of received WiFi packets, 30 is the number of subcarriers, and the last two 3s represent the antenna counts of the sender and receiver, respectively.

The WiFi pose system is shown in Fig. 3. In our data acquisition, we set the sampling rates of the WiFi devices and the camera to 100 Hz and 20 Hz, respectively.

Thus we have a paired dataset in which every 5 CSI samples and one image frame are synchronized by their time-stamps.
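The paper does not spell out the matching rule beyond the time-stamps; a minimal sketch of one plausible pairing, matching each 20 Hz frame with its 5 nearest 100 Hz CSI samples (function and variable names are ours), could look like this:

```python
import numpy as np

def pair_csi_with_frames(csi_ts, frame_ts, samples_per_frame=5):
    """Pair each video frame with its nearest CSI samples by time-stamp.

    csi_ts   : (n_csi,)    Unix time-stamps of the 100 Hz CSI samples
    frame_ts : (n_frames,) Unix time-stamps of the 20 Hz video frames
    Returns one index array of CSI samples per frame.
    """
    pairs = []
    for t in frame_ts:
        # the 5 CSI samples closest in time to this frame
        nearest = np.argsort(np.abs(csi_ts - t))[:samples_per_frame]
        pairs.append(np.sort(nearest))
    return pairs
```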

4.2 Pose Adjacent Matrix

Paragraph 1 (the PAM avoids harming the generalization ability of pose estimation)

As Section 3.2 said, we have 18 person keypoint coordinates, (x, y, c), for each single-person video frame.

Note that AlphaPose may predict multiple persons for a single-person frame (false positives); in this situation, we only keep the one with the highest confidence.

Many works have demonstrated that regressing keypoint coordinates directly harms the generalization ability of person pose estimation [].

Thus in this paper, we learn to regress a pose adjacent matrix (PAM) instead of directly regressing person keypoint coordinates.

The PAM is a 3 × 18 × 18 matrix, which also encodes the pose coordinates and confidences of the 18 keypoints.

The PAM is comprised of three submatrices, x′, y′ and c′. The x′ and c′ are generated by Equation (1) from the 18 three-element entries (x_i, y_i, c_i), i ∈ {1, 2, ..., 18}. The y′ is generated similarly to x′.

x′_{i,j} = x_i if i = j, and x′_{i,j} = x_i − x_j if i ≠ j;
c′_{i,j} = c_i if i = j, and c′_{i,j} = c_i × c_j if i ≠ j.    (1)

Paragraph 2 (details of how the PAM is constructed)

To be specific, in the graph-theory view, we take the person skeleton as a directed complete graph (DCG) [27], with each keypoint as a node of the graph.

For the x′ and y′ of the PAM, the diagonal items are the coordinate values of these 18 nodes in the x and y axes, respectively.

Meanwhile, the elements at the other indexes are the displacements between two adjacent nodes, hence the name pose adjacent matrix.

For the c′ of the PAM, the diagonal items are the confidence values of the corresponding nodes.

Since we consider the displacement between two nodes to happen independently, we compute c′_{i,j} = c_i × c_j for the other indexes.

Finally, we innovatively embed the person keypoint coordinates as well as the displacements between keypoints into the PAM.
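A small NumPy sketch of this construction (the sign convention x′_{i,j} = x_i − x_j is our reading of Equation (1), inferred from the nose-to-neck example in the next paragraph):

```python
import numpy as np

def build_pam(kpts):
    """Build the 3 x 18 x 18 pose adjacent matrix.

    kpts : (18, 3) array of AlphaPose outputs (x_i, y_i, c_i).
    Diagonals hold the raw coordinates/confidences; off-diagonal
    entries hold pairwise displacements and products of confidences.
    """
    x, y, c = kpts[:, 0], kpts[:, 1], kpts[:, 2]
    xp = x[:, None] - x[None, :]   # x'_{i,j} = x_i - x_j for i != j
    yp = y[:, None] - y[None, :]   # y'_{i,j} = y_i - y_j for i != j
    cp = c[:, None] * c[None, :]   # c'_{i,j} = c_i * c_j for i != j
    np.fill_diagonal(xp, x)        # x'_{i,i} = x_i
    np.fill_diagonal(yp, y)        # y'_{i,i} = y_i
    np.fill_diagonal(cp, c)        # c'_{i,i} = c_i
    return np.stack([xp, yp, cp])  # shape (3, 18, 18)
```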

Paragraph 3 (the PAM's settings and advantages)

The main advancement of the PAM is that it provides an additional constraint on human skeleton shape for person pose estimation.

Take the displacement along the y axis from the nose to the neck for example: the displacement is negative in the majority of situations, because for a standing person the neck is below the nose in the human skeleton.

The sign takes the direction from nose to neck as a constraint.

Besides, the absolute value of the displacement takes the length from nose to neck as a constraint.

When we regress the PAM, the additional regression on its displacements works as a regularization term, taking the person skeleton shape into consideration and greatly increasing the generalization ability of the approach compared to regressing keypoint coordinates directly.

4.3 Network Framework

Paragraph 1 (the system framework)

We denote the training dataset as D = {(I_t, C_t), t ∈ [1, n]}, where I_t and C_t are a pair of synchronized image frame and CSI series, respectively, t is the sampling moment, and n is the dataset size.

We propose a novel deep network trained on D for the purpose of learning a mapping rule from CSI series to person body keypoints.

The network framework is comprised of AlphaPose [8] as the teacher network and WiSPPN as the student network, shown in Fig. 5.
[Figure 5: the WiSPPN system framework]

The teacher and student networks are termed T(·) and S(·), respectively.

For each (I_t, C_t) pair, T(·) takes I_t as input and, with its person detector and pose regressor, outputs the corresponding body keypoint coordinates and confidences, (x_t, y_t, c_t).

We then convert the outputs to a body pose adjacent matrix, PAM_t, with the aforesaid Equation (1). We formulate the operation of the teacher network as T(I_t) → PAM_t, where PAM_t is the cross-modality supervision used to teach S(·).

Paragraph 2 (the student network)

We now go into the details of S(·), i.e., WiSPPN.

In the training stage, S(·) takes C_t as input and outputs a corresponding prediction of the pose adjacent matrix.

Then S(·) is optimized under the supervision of PAM_t.

As shown in Fig. 5, WiSPPN consists of three key modules: the encoder, the feature extractor, and the decoder.

A C_t is converted to the PAM_t prediction by passing through these three modules successively.

Next we explain our design intentions and parameter details for these three modules.

Paragraph 3 (encoder: upsampling by bilinear interpolation)

Encoder. The encoder is designed to upsample C_t to a width and height suitable for mainstream convolutional backbone networks such as VGG [28] and ResNet [11].

Recall that our WiFi system is comprised of a sender and a receiver, both with 3 antennas, and outputs CSI samples of size 30 × 3 × 3 through an open-source tool [25], where 30 is the number of OFDM carriers described in Section 3.1.

As said in Section 4.1, one image matches 5 continuous CSI samples due to the sampling-rate mismatch, leading to C_t ∈ R^{5×30×3×3}; we reshape it to 150 × 3 × 3 along the time axis, which makes C_t ∈ R^{150×3×3}.

However, a general RGB image has a size like 3 × 224 × 224, where 3 is for the three color channels (red, green and blue) and the 224s are the height and width of the image.

To enlarge the width and height of CSI samples, CSI-Net [10] uses 8 stacked transposed convolutional layers to gradually upsample its input from 30 × 1 × 1 to 6 × 224 × 224, which is operation-consuming.

In WiSPPN, we instead apply one bilinear interpolation operation to directly convert C_t to R^{150×144×144} for further feature extraction.
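In PyTorch (which the paper uses), the whole encoder reduces to a reshape plus one interpolation call; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# One paired CSI clip: 5 packets x 30 subcarriers x 3 x 3 antenna pairs.
ct = torch.randn(5, 30, 3, 3)

# Fold the time axis into the channel axis: (5, 30, 3, 3) -> (150, 3, 3).
ct = ct.reshape(150, 3, 3)

# One bilinear interpolation up to 150 x 144 x 144
# (interpolate expects a batch dimension, added here with unsqueeze).
ct = F.interpolate(ct.unsqueeze(0), size=(144, 144),
                   mode='bilinear', align_corners=False)
print(ct.shape)  # torch.Size([1, 150, 144, 144])
```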

Paragraph 4 (feature extraction with ResNets)

Feature extractor. With the upsampled C_t, the feature extractor is used to learn efficient features for person pose estimation.

Because C_t lacks spatial information about person body keypoints compared to images, we need a powerful feature extractor to recover that spatial information.

Conventionally, a deeper network has a more powerful feature-learning ability, so we tend to use a deeper network as the feature extractor of WiSPPN.

However, deeper networks are prone to gradient vanishing or exploding, because the chain rule in backpropagation can produce exponential gradients in very deep convolutional layers.

The ResNets [11] are among the most widely-used backbone networks in the deep learning domain, especially in computer vision research.

The ResNets alleviate this problem with shortcut connections and residual blocks.

Considering this advantage, we stack 4 basic blocks of ResNet [11] (16 convolutional layers) as the feature extractor of WiSPPN (shown in Fig. 6), which learns features of size 300 × 18 × 18, termed F_t.
[Figure 6: four stacked residual blocks (16 convolutional layers) serve as the feature extractor and convert C_t to F_t.]
The detailed parameters of the feature extractor are listed in Table 1. Note that a batch normalization [29] and a rectified linear unit activation [30] follow every convolutional layer.

[Table 1: parameters of the feature extractor. 3 × 3 denotes a convolutional layer with 3 × 3 kernels; c and s denote the output channels and stride of the convolutional layer.]
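Since Table 1 is not reproduced here, the following PyTorch sketch only illustrates the overall structure: four residual stages (two basic blocks each, i.e., 16 convolutional layers) that shrink 144 × 144 down to 18 × 18 and widen the channels to 300. The intermediate channel widths are our placeholders.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard ResNet basic block: two 3x3 convs with a shortcut."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or cin != cout:
            self.down = nn.Sequential(
                nn.Conv2d(cin, cout, 1, stride, bias=False),
                nn.BatchNorm2d(cout))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

def make_stage(cin, cout, stride):
    # two basic blocks per stage -> 4 convs; 4 stages -> 16 conv layers
    return nn.Sequential(BasicBlock(cin, cout, stride),
                         BasicBlock(cout, cout))

extractor = nn.Sequential(
    make_stage(150, 150, 1),   # 144 x 144
    make_stage(150, 200, 2),   # -> 72 x 72
    make_stage(200, 250, 2),   # -> 36 x 36
    make_stage(250, 300, 2),   # -> 18 x 18
)

ft = extractor(torch.randn(1, 150, 144, 144))
print(ft.shape)  # torch.Size([1, 300, 18, 18])
```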

Paragraph 5 (decoder: two convolutional layers)

Decoder. The decoder is designed to adapt the shape between the learned features, F_t, and the supervision outputted by T(·), PAM_t.

As described in Section 4.2, the pose adjacent matrix is a novel form for embedding the 18 body keypoint coordinates and the corresponding confidences, with a size of 3 × 18 × 18.

In the pose estimation task, a body keypoint is localized with two coordinates, i.e., the x axis and the y axis.

Thus the decoder is designed to take F_t as input and predict the pose adjacent matrix in the (x, y) dimensions, yielding a predicted pose adjacent matrix pPAM_t ∈ R^{2×18×18}.

To achieve this purpose, we stack the two convolutional layers illustrated in Fig. 7, where Conv1 mainly condenses channel-wise information (from 300 to 36) and Conv2 mainly reorganizes the spatial information of F_t with 1 × 1 convolutional kernels.
[Figure 7: two convolutional layers serve as the decoder to predict the pose adjacent matrix. Abbreviations have the same meaning as in Table 1.]
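A sketch of the decoder under the same caveat: only the channel widths (300 → 36 → 2) are given above, so the kernel size of Conv1 here is an assumption.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    # Conv1: condense channel-wise information from 300 to 36
    nn.Conv2d(300, 36, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    # Conv2: reorganize spatial information with 1 x 1 kernels,
    # mapping to the two (x, y) submatrices of the predicted PAM
    nn.Conv2d(36, 2, kernel_size=1),
)

ppam = decoder(torch.randn(1, 300, 18, 18))
print(ppam.shape)  # torch.Size([1, 2, 18, 18])
```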

Paragraph 6 (summary of the designed network)

In summary, with the encoder, feature extractor and decoder, the student network WiSPPN predicts a pose adjacent matrix from each CSI input, C_t.

We formalize this process as S(C_t) → pPAM_t.

During the training stage, every predicted pPAM_t is supervised by the corresponding result of the teacher network, i.e., PAM_t.

Once the student network has learned well, it gains the ability to do single-person pose estimation with CSI input only.

Next we describe the training stage, including the loss computation and implementation details.

4.4 Pose Adjacent Matrix Similarity Loss

Paragraph 1 (training: loss computation)

As described above, T(·) outputs PAM ∈ R^{3×18×18} as supervision, and S(·) outputs pPAM ∈ R^{2×18×18} as prediction.

With the supervisions and the predictions, an L2 loss is the basic option for optimizing WiSPPN, as follows.

L = ||pPAM_x − PAM_x||_2^2 + ||pPAM_y − PAM_y||_2^2
where ||·||_2^2 is the operator computing the squared L2 distance; pPAM_x and PAM_x are the prediction and supervision of the pose adjacent matrix for the body keypoint coordinates in the x axis, respectively; pPAM_y and PAM_y are the corresponding matrices for the y axis.

In this paper, we further take the prediction confidences of the keypoints into the loss computation, as follows.

[Equation: the L2 loss above, with each term weighted by the keypoint confidences in the PAM's c′ submatrix]
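A PyTorch sketch of such a confidence-weighted loss; the exact weighting in the paper's equation is not reproduced above, so weighting each squared error by the confidence submatrix is our assumption:

```python
import torch

def pam_loss(ppam, pam):
    """Confidence-weighted L2 loss between predicted and teacher PAMs.

    ppam : (B, 2, 18, 18) student prediction (x' and y' submatrices)
    pam  : (B, 3, 18, 18) teacher supervision (x', y' and confidence c')
    """
    conf = pam[:, 2:3]       # (B, 1, 18, 18), broadcast over x'/y'
    err = ppam - pam[:, :2]  # residuals of the x' and y' submatrices
    return (conf * err.pow(2)).sum()
```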

4.5 Training Details and Pose Association

Paragraph 1 (training details)

We implemented WiSPPN with PyTorch 1.0. The network is trained for 20 epochs with an initial learning rate of 0.001, a batch size of 32, and the Adam optimizer. The learning rate decays by 0.5 at the 5th, 10th and 15th epochs.
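These settings map directly onto PyTorch's optimizer and scheduler APIs; a sketch, where model and loader are placeholders for a WiSPPN instance and a (CSI, teacher PAM) data loader, and pam_loss is the sketch from Section 4.4:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# halve the learning rate at epochs 5, 10 and 15
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[5, 10, 15], gamma=0.5)

for epoch in range(20):
    for ct, pam in loader:         # batches of CSI and teacher PAMs
        optimizer.zero_grad()
        loss = pam_loss(model(ct), pam)
        loss.backward()
        optimizer.step()
    scheduler.step()
```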

Once WiSPPN is trained, we use it to estimate person pose from the testing CSI samples.

Taking one sample for example, we can get a predicted PAM (pPAM ∈ R^{2×18×18}).

We take the diagonal elements in pPAM as the body keypoint predictions by the following equations.

For the x axis:

x_i = pPAM_x(i, i)

For the y axis:

y_i = pPAM_y(i, i)
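Reading the keypoints off the diagonals is a two-line operation; a minimal NumPy sketch:

```python
import numpy as np

def keypoints_from_ppam(ppam):
    """Recover the 18 keypoints from a predicted PAM.

    ppam : (2, 18, 18) array; channel 0 is x', channel 1 is y'.
    Returns an (18, 2) array of (x_i, y_i) predictions.
    """
    xs = np.diagonal(ppam[0])  # x_i = pPAM_x(i, i)
    ys = np.diagonal(ppam[1])  # y_i = pPAM_y(i, i)
    return np.stack([xs, ys], axis=1)
```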

5 Evaluation

Translation

5.1 Data Collection

Paragraph 1 (data collection)

We collected data under an approval of the Carnegie Mellon University IRB.

We recruited 8 volunteers and asked them to do casual daily actions in two rooms on campus, one laboratory room and one classroom.

Floor plans and data collection positions are illustrated in Fig. 8.

During the actions, we ran the system in Fig. 3 to record CSI samples and videos simultaneously.

For each volunteer, the first 80% of the recording is used to train the networks, and the last 20% is used to test them.

The training and testing data sizes are 79,496 and 19,931 samples, respectively.
[Figure 8: floor plans and site images of the data collection environments. Data were collected at 16 sites in 2 rooms. Arrows indicate the positions and orientations of the WiFi receiver; circles indicate the corresponding positions of the WiFi sender.]

5.2 Experimental Results

Paragraph 1 (results + the PCK formula)

The Percentage of Correct Keypoints (PCK) is widely used to evaluate the performance of the proposed approach [15, 31, 32].

PCK_i@a = (1/N) Σ_{n=1}^{N} I( ||pd_i − gt_i||_2^2 / √(rh² + lh²) ≤ a/100 )
where I(·) is a binary indicator that outputs 1 when the condition is true and 0 when it is false.

The remaining notations are the same as in the equations above.

N is the number of test frames; i denotes the index of the body joint, i ∈ {1, 2, ..., 18}.

The rh and lh are the positions of the right shoulder and the left hip, respectively.

Thus √(rh² + lh²) can be regarded as the length of the upper limb, which is used to normalize the prediction error, ||pd_i − gt_i||_2^2, where pd is the prediction coordinates and gt is the ground truth.
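A NumPy sketch of the metric as reconstructed above; we use the plain L2 distance for the error, and interpreting the threshold a as a percentage of the upper-limb length is our assumption:

```python
import numpy as np

def pck(pred, gt, limb_len, a):
    """Percentage of Correct Keypoints at threshold a.

    pred, gt : (N, 18, 2) predicted and ground-truth keypoint coordinates
    limb_len : (N,) per-frame upper-limb length used for normalization
    a        : threshold, e.g. 5 for PCK@5
    Returns an (18,) array with the PCK of each keypoint.
    """
    err = np.linalg.norm(pred - gt, axis=2)         # (N, 18) L2 errors
    correct = err / limb_len[:, None] <= a / 100.0  # normalized threshold
    return correct.mean(axis=0)
```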

Paragraph 2 (results shown in tables and figures)

The table below shows the estimation performance of the 18 body keypoints at PCK@5, PCK@10, PCK@20, PCK@30, PCK@40, and PCK@50.

[Table: PCK results. 'R.' and 'L.' denote right and left, respectively.]

From the table, we can see that WiSPPN does pose estimation well.

Figure 9 illustrates some estimation comparisons between AlphaPose and WiSPPN.

[Figure 9: result samples in the 2 rooms.]

The results show that WiSPPN can do single-person pose estimation with results comparable to cameras.

6 Conclusion

In this paper, we build a system and propose a novel network termed WiSPPN for fine-grained WiFi sensing, i.e., single-person pose estimation.

The experimental results show that WiFi sensors can achieve performance comparable to cameras in fine-grained human sensing.
