Abstract with audio and phonetic transcription: MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video


liqiangyong

Category column: Paper reading for incoming and first-year graduate students. Tags: audio and video, deep learning

Article link: https://blog.csdn.net/lqy61/article/details/132217301

Recent transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
['ri:sənt trænz'fɜ:mə beɪst sə'lu:ʃənz hæv bɪn ɪntrə'dju:st tu: 'estɪmeɪt θri: 'di: 'hju:mən pəʊz frɒm tu: 'di: 'ki:pɔɪnt 'si:kwəns baɪ kən'sɪdərɪŋ 'bɒdi dʒɔɪnts ə'mʌŋ ɔ:l freɪmz 'gləʊbəli tu: lɜ:n 'speɪʃəʊ 'tempərəl kɒrə'leɪʃən.]

We observe that the motions of different joints differ significantly.
[wi: əb'zɜ:v ðət ðə 'məʊʃnz əv 'dɪfrənt dʒɔɪnts 'dɪfə sɪg'nɪfɪkəntli.]

However, the previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatial-temporal correlation.
[haʊ'evə, ðə 'pri:viəs 'meθədz 'kænɒt ɪ'fɪʃəntli 'mɒdl ðə 'sɒlɪd 'ɪntəfreɪm kɒrɪ'spɒndəns əv i:tʃ dʒɔɪnt, 'li:dɪŋ tu: ɪnsə'fɪʃənt 'lɜ:nɪŋ əv 'speɪʃəl 'tempərəl kɒrə'leɪʃən.]

We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation.
[wi: prə'pəʊz 'mɪksti: ('mɪkst 'speɪʃəʊ 'tempərəl ɪn'kəʊdə), wɪtʃ həz ə 'tempərəl trænz'fɜ:mə blɒk tu: 'sepərətli 'mɒdl ðə 'tempərəl 'məʊʃn əv i:tʃ dʒɔɪnt ænd ə 'speɪʃəl trænz'fɜ:mə blɒk tu: lɜ:n 'ɪntədʒɔɪnt 'speɪʃəl kɒrə'leɪʃən.]

These two blocks are utilized alternately to obtain better spatio-temporal feature encoding.
[ði:z tu: blɒks ɑ: 'ju:təlaɪzd ɔ:l'tɜ:nətli tu: əb'teɪn 'betə 'speɪʃəʊ 'tempərəl 'fi:tʃər ɪn'kəʊdɪŋ.]

In addition, the network output is extended from the central frame to entire frames of the input video, thereby improving the coherence between the input and output sequences.
[ɪn ə'dɪʃən, ðə 'netwɜ:k 'aʊtpʊt ɪz ɪk'stendɪd frɒm ðə 'sentrəl freɪm tu: ɪn'taɪə freɪmz əv ði: 'ɪnpʊt 'vɪdiəʊ, ðeə'baɪ ɪm'pru:vɪŋ ðə kəʊ'hɪərəns bɪ'twi:n ði: 'ɪnpʊt ænd 'aʊtpʊt 'si:kwənsɪz.]
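
The alternating spatial and temporal blocks and the full-sequence output described above can be illustrated with a toy sketch. This is not the paper's implementation: it assumes single-head attention with identity Q/K/V projections, omits residual connections, normalization and MLP layers, and the spatial-then-temporal order is an assumption. All function names are hypothetical.

```python
import math

def self_attention(tokens):
    """Single-head dot-product self-attention with identity Q/K/V
    (a simplification; real transformer blocks learn these projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def spatial_block(x):
    # x[t][j] is a feature vector; attend across joints within each frame.
    return [self_attention(frame) for frame in x]

def temporal_block(x):
    # Attend across frames, separately for each joint's trajectory.
    T, J = len(x), len(x[0])
    per_joint = [self_attention([x[t][j] for t in range(T)]) for j in range(J)]
    return [[per_joint[j][t] for j in range(J)] for t in range(T)]

def mixed_encoder(x, depth=2):
    # Alternate the two blocks; the output keeps every input frame
    # (sequence-to-sequence) instead of only the central one.
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x

# Toy input: 3 frames, 2 joints, 2-dimensional features per joint.
x = [[[float(t + j), float(t - j)] for j in range(2)] for t in range(3)]
y = mixed_encoder(x)
print(len(y), len(y[0]), len(y[0][0]))
```

The output has one feature vector per joint per frame, the same layout as the input, which is what makes per-frame pose regression over the whole sequence possible.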

Extensive experiments are conducted on three benchmarks (i.e. Human3.6M, MPI-INF-3DHP, and HumanEva).
[ɪk'stensɪv ɪk'sperɪmənts ɑ: kən'dʌktɪd ɒn θri: 'bentʃmɑ:ks (aɪ 'i: 'hju:mən θri: pɔɪnt sɪks 'em, 'em pi: 'aɪ aɪ en 'ef θri: 'di: 'eɪtʃ pi:, ænd 'hju:mən i:'vɑ:).]

The results show that our model outperforms the state-of-the-art approach by 10.9% P-MPJPE and 7.6% MPJPE.
[ðə rɪ'zʌlts ʃəʊ ðət aʊə 'mɒdl aʊtpə'fɔ:mz ðə 'steɪt əv ði: ɑ:t ə'prəʊtʃ baɪ ten pɔɪnt naɪn pə'sent 'pi: em 'pi: dʒeɪ 'pi: 'i: ænd 'sevn pɔɪnt sɪks pə'sent 'em pi: dʒeɪ 'pi: 'i:.]
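
For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below shows that computation on made-up coordinates; the function name and data are illustrative only. P-MPJPE applies the same measure after rigidly aligning the prediction to the ground truth, and that alignment step is not shown here.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance over all joints of all frames.
    pred, gt: lists of frames; each frame is a list of (x, y, z) joints."""
    dists = [math.dist(p, g)
             for pred_frame, gt_frame in zip(pred, gt)
             for p, g in zip(pred_frame, gt_frame)]
    return sum(dists) / len(dists)

# One frame with two joints; the second joint is off by (0, 3, 4), i.e. 5 units.
gt = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
print(mpjpe(pred, gt))  # 2.5
```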

The code is available at this URL.
[ðə kəʊd ɪz ə'veɪləbl ət ðɪs 'ju: ɑ:r 'el.]