2021-09-22

This section describes the core components of our system: the data preprocessing module and the transformer-based predictor.

Data Preprocessing Module

The Fast Fourier Transform (FFT) is a standard tool for analyzing time-series signals. However, we want to capture the relationship between frequency and time in the motion signal, enriching the dimensionality of our data so that the model can learn more information. A signal obtained by the Fourier transform contains only frequency-domain content, with no corresponding time-domain information. Our solution is to replace the FFT with the Hilbert transform.
This, however, introduces a new problem: the Hilbert transform presupposes a stationary signal, but the motion signal obtained from the IMU sensor does not satisfy this condition, so the computed instantaneous frequency is likely to have no physical meaning. To obtain an ideal decomposition, we choose the Hilbert-Huang Transform (HHT), whose key idea is to use Ensemble Empirical Mode Decomposition (EEMD) to convert the non-stationary signal into stationary components before applying the Hilbert transform. We then concatenate the transformed data with the original signal into a 12-dimensional signal. Finally, we use a sliding window with a size of 3 seconds and a step of 1 second to split the signal into segments of the appropriate size for training and testing our predictor.
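The final windowing step can be sketched as follows. This is a minimal illustration, not our actual pipeline; the sampling rate `fs = 50` is a hypothetical value not specified in the text:

```python
import numpy as np

def sliding_windows(signal, fs, win_sec=3, step_sec=1):
    """Split a (T, C) multichannel signal into overlapping windows
    of win_sec seconds, advancing by step_sec seconds each time."""
    win, step = win_sec * fs, step_sec * fs
    n = (signal.shape[0] - win) // step + 1
    return np.stack([signal[i * step : i * step + win] for i in range(n)])

fs = 50  # hypothetical IMU sampling rate (Hz)
x = np.random.randn(10 * fs, 12)  # 10 s of the 12-dimensional signal
windows = sliding_windows(x, fs)
print(windows.shape)  # (8, 150, 12): 8 windows of 3 s each
```

With a 3-second window and a 1-second step, consecutive windows overlap by 2 seconds, so each sample contributes to up to three training segments.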

Transformer-Based Predictor

Although the RNN is the classic module for time-series problems, it is too simple for our 12-dimensional data. To learn more from complex signals and achieve high reconstruction precision, we introduce the transformer architecture. Nevertheless, two problems remain. First, when we walk, the mobile phone moves in three-dimensional space, so the data contains much richer spatial-temporal information; our model should learn useful information from these associations. Second, our motion dataset is smaller than conventional datasets, which tests the generalization ability of the model. Making the model friendly to new users (that is, making effective predictions while collecting little or no data from them) becomes a challenge.

Solution with Divided Space-Time Attention. To address these problems, we adopt “Divided Space-Time” attention in place of standard self-attention. A visualization of the module is given in Fig. X. The TimeSformer encoder consists of 3 encoder blocks. Within each block, we first compute temporal attention within each channel. The result of the temporal attention is then fed into the spatial attention computation instead of being passed directly to the MLP. According to [cite], this space-time factorization is not only more efficient but also leads to lower error.
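The factorization can be sketched in a few lines. This is a simplified illustration of the attention ordering only, assuming tokens of shape (time, channel, embedding); it omits the query/key/value projections, residual connections, and MLP of a real encoder block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time(x):
    """x: (T, C, D) tokens. Temporal attention within each channel,
    then spatial attention within each time step."""
    xt = np.transpose(x, (1, 0, 2))   # (C, T, D): per channel, attend over T
    xt = attention(xt, xt, xt)
    x = np.transpose(xt, (1, 0, 2))   # back to (T, C, D)
    return attention(x, x, x)         # per time step, attend over C

x = np.random.randn(5, 12, 8)  # 5 time steps, 12 channels, embedding dim 8
y = divided_space_time(x)
print(y.shape)  # (5, 12, 8)
```

The efficiency gain comes from attending over T and C separately (cost proportional to T² + C² per token) instead of over all T·C tokens jointly.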
Model generalization. We design our predictor by referring to the state-of-the-art transformer architecture [cite]. To avoid overfitting and reduce the number of parameters, we use a backbone network to extract features from the acceleration and gyroscope readings of each axis and their corresponding time-frequency signals. It is followed by the TimeSformer encoder and a convolutional feature extractor. Finally, the output layer is a linear regression.
To make our system perform well for different users, we adopt a meta-learning technique to train our model. Model-agnostic meta-learning (MAML) [cite] is one of the state-of-the-art meta-learning methods: a conceptually simple and general algorithm that has been shown to perform well on few-shot classification and regression problems. Given model parameters $\theta$, MAML adapts to a new task $\tau_t$ with SGD:

$$\theta_{t}^{\prime}=\theta-\alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_{\text{train}(t)}}\left(f_{\theta}\right)$$

where $t$ is the task number and $\alpha$ is the learning rate. $\mathcal{T}_{\text{train}(t)}$ and $\mathcal{T}_{\text{test}(t)}$ denote the training and test sets within task $t$. The tasks are sampled from a defined distribution $p(\tau_t)$. The meta-objective is:
$$\min_{\theta} \sum_{\tau_{t} \sim p(\tau)} \mathcal{L}_{\mathcal{T}_{\text{test}(t)}}\left(f_{\theta_{t}^{\prime}}\right)$$

The model aims to optimize the parameters $\theta$ such that, with just one SGD step, it can adapt to the new task. For the optimization in Eq. 2, this looks as follows:

$$\theta_{t}=\theta-\beta \nabla_{\theta} \mathcal{L}_{\mathcal{T}_{\text{test}(t)}}\left(f_{\theta_{t}^{\prime}}\right)$$

where $\beta$ is the meta step size. This yields an algorithm that learns an initialization of $\theta$ that can be adapted to new tasks efficiently in a small number of iterations.
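The two update rules above can be instantiated on a toy linear-regression task family. This is an illustrative sketch, not our training code: it uses the common first-order approximation of the meta-gradient (the exact version would differentiate through the inner update), and the task construction, step sizes, and dimensions are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(theta, X, y):
    # mean-squared-error loss and its gradient for f(x) = X @ theta
    err = X @ theta - y
    return float((err ** 2).mean()), 2 * X.T @ err / len(y)

def maml_step(theta, tasks, alpha=0.05, beta=0.05):
    """One meta-update over a batch of ((X_tr, y_tr), (X_te, y_te)) tasks."""
    meta_grad = np.zeros_like(theta)
    for (Xtr, ytr), (Xte, yte) in tasks:
        _, g = loss_and_grad(theta, Xtr, ytr)
        theta_t = theta - alpha * g                # inner SGD step (Eq. 1)
        _, g_meta = loss_and_grad(theta_t, Xte, yte)
        meta_grad += g_meta                        # first-order meta-gradient
    return theta - beta * meta_grad / len(tasks)   # meta-update (Eq. 3)

def make_task(slope, n=20, d=3):
    # each task is linear regression with a task-specific weight vector
    X = rng.standard_normal((n, d))
    w = np.full(d, slope)
    return (X[:10], X[:10] @ w), (X[10:], X[10:] @ w)

theta = np.zeros(3)
tasks = [make_task(s) for s in (0.5, 1.0, 1.5)]
for _ in range(100):
    theta = maml_step(theta, tasks)
```

After meta-training, `theta` is an initialization from which one inner step of SGD already fits each task's test split well, which is exactly the behavior Eq. 2 optimizes for.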

The problem with MAML is that the initial model can be trained biased towards some tasks, particularly those sampled in the meta-training phase. Such a biased initial model may not generalize well to an unseen task that deviates strongly from the meta-training tasks, especially when very few examples are available for the new task. We therefore introduce an extension of MAML in our solution, called Inequality-Minimization TAML. The algorithm directly minimizes the inequality of the losses of the initial model across a variety of tasks, forcing the meta-learner to learn an unbiased initial model without over-performing on particular tasks.
The idea is that the loss of the initial model on each task $\mathcal{T}_i$ is viewed as an income for that task. The TAML model then minimizes its loss inequality over multiple tasks to make the meta-learner task-agnostic. Formally, given a batch of sampled tasks $\{\mathcal{T}_i\}$ and their losses $\{\mathcal{L}_{\mathcal{T}_i}(f_\theta)\}$ under the initial model $f_\theta$, one can compute an inequality measure $\mathcal{I}_{\mathcal{E}}(\{\mathcal{L}_{\mathcal{T}_i}(f_\theta)\})$, as discussed below. The initial model parameter $\theta$ is then meta-learned by minimizing the following objective:

$$\mathbb{E}_{\mathcal{T}_{i} \sim p(\mathcal{T})}\left[\mathcal{L}_{\mathcal{T}_{i}}\left(f_{\theta_{i}}\right)\right]+\lambda \mathcal{I}_{\mathcal{E}}\left(\left\{\mathcal{L}_{\mathcal{T}_{i}}\left(f_{\theta}\right)\right\}\right)$$

The first term is the expected loss of the model $f_{\theta_i}$ after the update, while the second is the inequality of the losses of the initial model $f_\theta$ before the update. Both terms are functions of the initial model parameter $\theta$, since $\theta_i$ is updated from $\theta$. For the inequality measure $\mathcal{I}_{\mathcal{E}}$ we choose the Theil index, which is derived from redundancy in information theory. Suppose we have $M$ losses $\{\ell_i \mid i=1,\cdots,M\}$; then the Theil index is defined as

$$T_{T}=\frac{1}{M} \sum_{i=1}^{M} \frac{\ell_{i}}{\bar{\ell}} \ln \frac{\ell_{i}}{\bar{\ell}}$$
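The Theil index and the TAML objective above are straightforward to compute. A minimal sketch follows; the regularization weight `lam = 0.1` is a hypothetical setting, not a value from our experiments:

```python
import numpy as np

def theil_index(losses):
    """Theil index of a batch of per-task losses (assumed positive).
    Zero when all losses are equal; grows with inequality."""
    l = np.asarray(losses, dtype=float)
    r = l / l.mean()
    return float(np.mean(r * np.log(r)))

def taml_objective(adapted_losses, initial_losses, lam=0.1):
    # expected post-update loss plus lambda-weighted inequality
    # of the pre-update losses (Eq. above)
    return float(np.mean(adapted_losses)) + lam * theil_index(initial_losses)

print(theil_index([1.0, 1.0, 1.0]))       # 0.0: equal losses, no penalty
print(theil_index([0.1, 1.0, 10.0]) > 0)  # True: unequal losses are penalized
```

Because the Theil term vanishes only when every task's initial loss is equal, the regularizer pushes the meta-learned initialization away from favoring any particular subset of tasks.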

Implementation of our system. The training and deployment process of the model is shown in the Fig. First, we collect the volunteers' data to build the dataset for meta-learning. To improve the generalization of the final model to new users' data, we treat the data of each volunteer in each month as a separate task. Second, we use the TAML algorithm introduced above to train the transformer-based model, obtaining initialization parameters that suit every task. Finally, when faced with a new user, we only need a small amount of that user's data to update our model, adapting it to the new user and noticeably improving prediction accuracy.
