[ICLR 2025] CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding

Paper: arXiv 2412.07236

Code: github.com

The English here is typed entirely by hand! It is my summarizing and paraphrasing of the original paper, so unavoidable spelling and grammar mistakes may appear; if you spot any, comments and corrections are welcome! This post leans toward personal notes, so read with caution.

Contents

1. Thoughts

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Method

2.4. Experiments

2.4.1. Pre-training

2.4.2. Experiment Setup of Downstream BCI Tasks

2.4.3. Results

2.5. Conclusion

1. Thoughts

(1) Silence is tonight's Cambridge.

(2) Writing in English, I can't be a keyboard warrior, so I decided to write in Chinese.

2. Section-by-Section Close Reading

2.1. Abstract

        ①The spatial and temporal features of EEG signals are heterogeneous, so they need to be modelled independently

        ②They propose CBraMod to address the problems of modeling spatial-temporal dependencies and of heterogeneous EEG data formats

        ③Datasets: 12 public with 10 downstream tasks

 criss  adj. smart, stylish  n. Criss (a given name)

criss-cross adj. crossing back and forth, crisscrossed

2.2. Introduction

        ①Existing EEG processing methods:

        ②The authors state that the correlations between channels and those between time points are different, so vanilla global attention is not suitable for EEG signals

        ③CBraMod is pretrained on Temple University Hospital EEG Corpus (TUEG)

2.3. Method

        ①Overall framework:

(1)Patching & Masking

        ①Input EEG sample: S\in\mathbb{R}^{C\times T} with C channel and T timestamps

        ②Patch segmentation: for window length t, they reshape S into X\in\mathbb{R}^{C\times n\times t} with n=\lfloor\frac{T}{t}\rfloor patches per channel and X=\{x_{i,j}|i\in[1,2,...,C],j\in[1,2,...,n]\}

        ③A representation of a patch: x\in\mathbb{R}^t

        ④Total number of patches: |X|=Cn

        ⑤Mask: \mathcal{M}=\{m_{i,j}|i\in[1,2,...,C],j\in[1,2,...,n]\} sampled from a Bernoulli distribution with proportion r, where m_{i,j}\in\{0,1\} is the mask indicator of x_{i,j}

        ⑥Masked EEG patches:

\tilde{x}_{i,j} = \begin{cases} x_{i,j},\quad m_{i,j}=0 \\ x_{M},\quad m_{i,j}=1 \end{cases} \\ \tilde{X}=\{\tilde{x}_{i,j}|i\in[1,2,...,C],j\in[1,2,...,n]\}

where x_M\in\mathbb{R}^t denotes the mask token and \tilde{X}\in\mathbb{R}^{C\times n\times t} denotes the masked EEG patch set
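
The patching-and-masking step above can be sketched in a few lines of NumPy. This is a minimal sketch: the shapes C=19, T=6000, t=200 and mask proportion r=0.5 are assumed values, and the learnable mask token x_M is replaced by zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, t = 19, 6000, 200             # channels, timestamps, patch length (assumed)
S = rng.standard_normal((C, T))     # one EEG sample S ∈ R^{C×T}

n = T // t                          # n = floor(T / t) patches per channel
X = S[:, :n * t].reshape(C, n, t)   # X ∈ R^{C×n×t}

r = 0.5                             # mask proportion (hypothetical)
M = rng.binomial(1, r, size=(C, n)) # Bernoulli mask indicators m_{i,j} ∈ {0, 1}

x_M = np.zeros(t)                   # learnable mask token in the paper; zeros here
X_tilde = np.where(M[..., None] == 1, x_M, X)  # replace masked patches by x_M
```

Broadcasting M over the last axis applies one indicator per whole patch, matching the per-patch (not per-sample) masking described above.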

(2)Time-Frequency Patch Encoding

        ①Time-domain processing: they use a one-dimensional convolution layer, a group normalization layer, and a GELU activation function to process input \tilde{x}_{i,j} and obtain the time-domain embedding e_{i,j}^t\in\mathbb{R}^d with d dimensions

        ②Frequency-domain branch: they use fast Fourier transform (FFT) and a fully-connected layer to get frequency-domain embedding e_{i,j}^f\in\mathbb{R}^d

        ③Embedding fusion:

\begin{array} {c}e_{i,j}=e_{i,j}^t+e_{i,j}^f \\ E=\{e_{i,j}|i\in[1,2,...,C],j\in[1,2,...,n]\} \end{array}

where e_{i,j}\in\mathbb{R}^d is patch embedding, E\in\mathbb{R}^{C\times n\times d} is the set of patch embeddings
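
A minimal sketch of the time-frequency patch encoding. For brevity the learned conv1d + GroupNorm + GELU time branch is replaced by a single linear map, and both branches use random stand-in weights, not the paper's trained layers:

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 200, 200                     # patch length and embedding dim (assumed)

# Random stand-ins for the learned layers (paper: conv1d+GroupNorm+GELU for
# time; FFT followed by a fully-connected layer for frequency).
W_time = rng.standard_normal((t, d)) / np.sqrt(t)
W_freq = rng.standard_normal((t // 2 + 1, d)) / np.sqrt(t // 2 + 1)

def patch_embedding(x):
    """x ∈ R^t, one EEG patch -> fused embedding e_{i,j} ∈ R^d."""
    e_t = x @ W_time                          # time-domain embedding e^t
    e_f = np.abs(np.fft.rfft(x)) @ W_freq     # frequency-domain embedding e^f
    return e_t + e_f                          # fusion by element-wise sum

x = rng.standard_normal(t)
e = patch_embedding(x)
```

The one-sided `rfft` gives t/2+1 frequency bins for a real signal, which is why W_freq has that input size.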

(3)Asymmetric Conditional Positional Encoding

        ①ACPE: a convolution layer with kernel (k_s,k_t) and (\frac{k_{s}-1}{2},\frac{k_{t}-1}{2}) zero paddings (k_s> k_t). (The authors feel that because the kernel is rectangular, the encoding is "asymmetric" and can attend to spatial and positional information at the same time = =|||. Their solution really is... uh, simple and easy to understand.)

        ②A residual-like structure: feeding E into the ACPE yields:

E^{p}=\{e_{i,j}^{p}|i\in[1,2,...,C],j\in[1,2,...,n]\}

where E^{p}\in\mathbb{R}^{C\times n\times d} and e_{i,j}^p\in\mathbb{R}^d; then E and E^p are added together:

E^o=E+E^p=\{e_{i,j}+e_{i,j}^p|i\in[1,2,...,C],j\in[1,2,...,n]\}

where E^{o}\in\mathbb{R}^{C\times n\times d}
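
The ACPE step can be sketched as a convolution with an asymmetric kernel over the (C, n) patch grid, plus the residual sum. The kernel sizes k_s=7, k_t=3 are assumed for illustration; the real layer is learned, and here one shared scalar kernel is applied across all d embedding dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n, d = 19, 30, 200
k_s, k_t = 7, 3                     # asymmetric kernel, k_s > k_t (assumed sizes)
E = rng.standard_normal((C, n, d))  # patch embeddings
K = rng.standard_normal((k_s, k_t)) / (k_s * k_t)  # random stand-in kernel

# Zero padding of ((k_s-1)/2, (k_t-1)/2) keeps the (C, n) grid size unchanged.
pad_s, pad_t = (k_s - 1) // 2, (k_t - 1) // 2
Ez = np.pad(E, ((pad_s, pad_s), (pad_t, pad_t), (0, 0)))

Ep = np.zeros_like(E)               # ACPE output E^p
for i in range(C):
    for j in range(n):
        window = Ez[i:i + k_s, j:j + k_t, :]           # (k_s, k_t, d) neighborhood
        Ep[i, j] = np.tensordot(K, window, axes=([0, 1], [0, 1]))

Eo = E + Ep                          # residual connection: E^o = E + E^p
```

Because k_s > k_t, each position aggregates a taller spatial window than temporal window, which is the whole "asymmetric" idea.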

(4)Criss-Cross Transformer

        ①Pipeline of Criss-Cross Transformer Block:

The E^o above passes through LayerNorm to become \tilde{E}\in\mathbb{R}^{C\times n\times d}. I charitably guess that page limits are the reason the following formulas and explanations are not very detailed.

        ②First, the authors split \tilde{E} into a first-half group and a second-half group along the channels (by "channels" I mean the last hidden dimension, not the electrode channels C). In the upper path, attention is applied to each column of the first-half group:

F_k^j=\mathrm{Attention}(\tilde{E}^jW_k^Q,\tilde{E}^jW_k^K,\tilde{E}^jW_k^V)

The columns that were split apart at the beginning are, after attention is applied to each separately, concatenated back together:

\text{S-Attention}_k(\tilde{E})=[F_k^1,F_k^2,...,F_k^n]

The lower path works the same way, except attention there is applied to the second-half group along rows. Finally the column-attention and row-attention heads are concatenated:

\mathrm{Criss-Cross-Attention}(\tilde{E})=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,...,\mathrm{head}_K)

\mathrm{head}_k= \begin{cases} \text{S-Attention}_k(\tilde{E}), & \quad k\in[1,2,...,K/2] \\ \text{T-Attention}_k(\tilde{E}), & \quad k\in[K/2+1,K/2+2,...,K] \end{cases}

After concatenation this thing is E^{r}\in\mathbb{R}^{C\times n\times d}. (At this moment I'd like to know where exactly the criss-cross is. The two parts are simply stacked on top of each other and feel completely unrelated; they are not even interleaved the way cards cross between two hands in a riffle shuffle. What is the point?)

        ③Why not mention the shape of F? Why does it never appear anywhere else in the context?
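
My reading of the criss-cross attention above, sketched in NumPy: the first K/2 heads attend over channels within each time column (S-Attention), the remaining heads attend over time patches within each channel row (T-Attention), and all head outputs are concatenated. Dimensions and weights are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n, d, K = 19, 30, 64, 8          # grid, hidden dim, head count (assumed)
d_h = d // K                        # per-head dimension
E = rng.standard_normal((C, n, d))  # \tilde{E} after LayerNorm

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x, Wq, Wk, Wv):
    """Scaled dot-product attention over a (L, d) sequence -> (L, d_h)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d_h)) @ v

heads = []
for h in range(K):
    Wq, Wk, Wv = (rng.standard_normal((d, d_h)) / np.sqrt(d) for _ in range(3))
    out = np.zeros((C, n, d_h))
    if h < K // 2:                  # S-Attention: over the C channels, per column j
        for j in range(n):
            out[:, j] = attend(E[:, j], Wq, Wk, Wv)
    else:                           # T-Attention: over the n patches, per row i
        for i in range(C):
            out[i] = attend(E[i], Wq, Wk, Wv)
    heads.append(out)

Er = np.concatenate(heads, axis=-1)  # E^r ∈ R^{C×n×d}
```

So each position only ever attends along its own row or its own column, never over the full C×n grid, which is where the quadratic-cost saving over global attention comes from.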

(5)Masked EEG Reconstruction

        ①"A reconstruction head composed of fully-connected layers"? Did the authors major in literature? I really can't hold my fire. From now on everything in my own papers should be called "a detector composed of fully-connected layers".

        ②E^r is passed through the fully-connected layers to produce the final prediction \hat{X}\in\mathbb{R}^{C\times n\times t}

        ③The authors really love writing those extremely long set notations:

        ④MSE loss:

\mathcal{L}=\|\hat{X}^M-X^M\|^2
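
The reconstruction head and the masked MSE loss, sketched with a single random linear map standing in for the fully-connected head; only patches selected by the mask indicators M contribute to the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n, t, d = 19, 30, 200, 200
Er = rng.standard_normal((C, n, d))   # Criss-Cross Transformer output E^r
X = rng.standard_normal((C, n, t))    # original (unmasked) patches
M = rng.binomial(1, 0.5, size=(C, n)).astype(bool)  # mask indicators m_{i,j}

W = rng.standard_normal((d, t)) / np.sqrt(d)  # stand-in FC reconstruction head
X_hat = Er @ W                                # \hat{X} ∈ R^{C×n×t}

# MSE computed only over masked positions: L = ||X_hat^M - X^M||^2 (mean form)
loss = np.mean((X_hat[M] - X[M]) ** 2)
```

Boolean indexing with M gathers exactly the masked patches from both tensors, so unmasked patches never receive reconstruction gradient.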

2.4. Experiments

2.4.1. Pre-training

(1)Pre-training Dataset

        ①Dataset: Temple University Hospital EEG corpus (TUEG)

        ②Data: 69,652 clinical EEG recordings from 14,987 subjects across 26,846 sessions, with a total duration of 27,062 hours

(2)Preprocessing

        ①Screening: remove recordings whose total duration is no more than 5 minutes or whose absolute amplitude exceeds 100 µV

        ②Cropping: the first and last minute of each recording are discarded

        ③Electrode choosing: 19, including Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz, P4, T6, O1, O2

        ④Band-pass filter: 0.3 Hz–75 Hz

        ⑤Notch filter: 60 Hz

        ⑥Resampling: 200 Hz

        ⑦Segmentation: 30 s

        ⑧Normalization: 100 µV

        ⑨Remaining samples: 1,109,545

(3)Pre-training Settings 

        ①Patch duration: 1 s (200 data points)

        ②Layers of Criss-Cross Transformer Blocks: 12, with 200 hidden dimensions, 800 inner dimensions, and 8 heads

        ③Batch size: 128

        ④Optimizer: AdamW

        ⑤Learning rate: 5e-4

        ⑥Weight decay: 5e-2

2.4.2. Experiment Setup of Downstream BCI Tasks

        ①Statistics of datasets:

2.4.3. Results

        ①Emotion recognition performance:

        ②Motor Imagery Classification performance:

        ③Attention block ablation:

        ④Positional encoding ablation:

        ⑤Pre-training ablation:

where 1) w/o pre-training: directly training CBraMod on the downstream datasets; 2) dirty pre-training: pre-training CBraMod on the TUEG corpus without dropping bad samples; 3) clean pre-training: pre-training CBraMod on the TUEG corpus with bad samples dropped.

2.5. Conclusion

        ~
