Kaggle | Top 10 Solutions from the fMRI Data Competition (1-5)

Article link: https://blog.csdn.net/lazysnake666/article/details/122404591

The competition in one sentence: use neuroimaging data to predict five variables, including age.

See my earlier post for details~

懒麻蛇, WeChat official account 懒麻蛇: FYI | Kaggle fMRI-based prediction competition (first prize: $12,000)

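The competition ended in late June. I squeezed in some time to summarize the Top 10 solutions; I learned a lot from them, and I hope they give you some ideas too. The QR code links to Kaggle, so scan it if you want to read the original posts.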

Private Leaderboard ranking

1st

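No deep learning! The whole pipeline ran on a 12-core, 64 GB Ubuntu machine with no GPU. Instead, the authors computed a variety of summary statistics on the 3D fMRI data and applied PCA. Their progress log shows that PCA plus dictionary learning greatly improved model performance.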
We used Incremental PCA on fMRI data with n_components 200, batch-size 200. Channels were split into groups by 10 and flattened inside them (6 groups in total). As a result, we got 1200 PCA features. Dictionary-learning (DL) params: n-components 100, batch-size 100, and n-iters 10 (the same scheme with channels splitting).
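The grouped PCA scheme in the quote can be sketched roughly as follows. This is a minimal illustration on small random data, not the authors' code: the function name `grouped_ipca_features`, the array layout (subjects, channels, voxels), and the channel count of 53 (chosen so that groups of 10 give 6 groups) are assumptions, and the component and batch sizes are scaled down from the quoted 200/200 so the example runs quickly.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def grouped_ipca_features(volumes, group_size=10, n_components=200, batch_size=200):
    """Fit one IncrementalPCA per channel group and concatenate the projections.

    volumes: array of shape (n_subjects, n_channels, n_voxels) (assumed layout).
    Returns an array of shape (n_subjects, n_groups * n_components).
    """
    n_subjects, n_channels, _ = volumes.shape
    features = []
    for start in range(0, n_channels, group_size):
        # Flatten the channels of this group into one long feature vector per subject.
        group = volumes[:, start:start + group_size, :].reshape(n_subjects, -1)
        ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
        features.append(ipca.fit_transform(group))
    return np.hstack(features)


# Toy data: 40 subjects, 53 channels (hypothetical), 30 voxels per channel.
rng = np.random.default_rng(0)
volumes = rng.normal(size=(40, 53, 30)).astype(np.float32)
pca_features = grouped_ipca_features(volumes, group_size=10, n_components=5, batch_size=20)
print(pca_features.shape)  # 6 groups x 5 components per subject
```

With the real data the volumes would not fit in memory at once, so one would more likely stream batches from disk through `IncrementalPCA.partial_fit` instead of calling `fit_transform` on a single array.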
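They added a bias to the test features so that their distribution looks more like that of the training features. The logic is simple: if the test and training distributions are closer, a model trained on the training set is likely to perform better on the test set.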
We used minimization of Kolmogorov-Smirnov test’s statistic between train[col] and test[col] + b.
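One way to picture this bias correction is a simple grid search over candidate offsets, keeping the one that gives the smallest two-sample KS statistic. The sketch below uses synthetic data; the helper name `best_shift`, the candidate grid, and the grid-search strategy itself are assumptions, since the quote does not say how the minimization was carried out.

```python
import numpy as np
from scipy.stats import ks_2samp


def best_shift(train_col, test_col, candidates):
    """Return the offset b (from candidates) minimizing KS(train_col, test_col + b)."""
    stats = [ks_2samp(train_col, test_col + b).statistic for b in candidates]
    return float(candidates[int(np.argmin(stats))])


# Synthetic feature: the test column is shifted by -0.5 relative to train.
rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, scale=1.0, size=2000)
test_col = rng.normal(loc=-0.5, scale=1.0, size=2000)

candidates = np.linspace(-1.0, 1.0, 81)  # hypothetical search grid, includes b = 0
b = best_shift(train_col, test_col, candidates)

ks_before = ks_2samp(train_col, test_col).statistic
ks_after = ks_2samp(train_col, test_col + b).statistic
print(f"b = {b:.3f}, KS before = {ks_before:.3f}, KS after = {ks_after:.3f}")
```

In a real pipeline this would be repeated for each feature column, and `test[col] + b` would replace the original test column before prediction.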

The authors have shared all of their code on GitHub.

2nd

NEAT!!!! Stacking on the CSV data, deep learning on the 3D fMRI data, ensembling with the powerful XGBoost, and finally an optimization of the blending weights.

3D ResNet18 is the best CNN here for me. Comparing to that, ResNet50, ResNext50, ResNext101 doesn’t work well.

Splitting 3D fMRI into 3~6 pieces gave me boost, but 8 pieces are useless.
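The final weight-optimization step of this solution is not described in detail here, so the following is only a generic sketch of the idea: given out-of-fold predictions from several models, search for non-negative blending weights that sum to 1 and minimize the validation error. The function name `optimize_blend_weights`, the use of mean squared error as the objective, and the SLSQP solver are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def optimize_blend_weights(oof_preds, y_true):
    """Find weights w >= 0 with sum(w) == 1 minimizing MSE of oof_preds @ w.

    oof_preds: (n_samples, n_models) out-of-fold predictions.
    y_true:    (n_samples,) targets.
    """
    n_models = oof_preds.shape[1]

    def mse(w):
        return float(np.mean((oof_preds @ w - y_true) ** 2))

    result = minimize(
        mse,
        x0=np.full(n_models, 1.0 / n_models),  # start from a uniform blend
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x


# Synthetic example: three models with different noise levels.
rng = np.random.default_rng(0)
y_true = rng.normal(size=500)
noise_scales = np.array([0.3, 0.5, 1.0])
oof_preds = y_true[:, None] + rng.normal(size=(500, 3)) * noise_scales

weights = optimize_blend_weights(oof_preds, y_true)
blend_mse = float(np.mean((oof_preds @ weights - y_true) ** 2))
single_mse = np.mean((oof_preds - y_true[:, None]) ** 2, axis=0)
print(weights, blend_mse, single_mse.min())
```

The weights should be fitted on out-of-fold predictions rather than in-sample ones, otherwise the blend will favor whichever model overfits the most.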

3rd

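A solution this complex is a real brain-burner; it is a complete black box. To keep such a complex model under control, the author used pytorch-lightning + hydra + Weights & Biases. All three tools are worth a look.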

I used pytorch-lightning to wrap training and evaluation code and hydra to manage parameters. I used Weights & Biases to manage experiments and GCS to save results. Weights & Biases is better for me than the others (tensorboard, mlflow, etc.)

Weights & Biases

4th Neuroimager's solution

1. A neuroimager's solution!

2. They computed their own parcellation using K-means and Ward clustering.

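3. They computed connectivity on the ICA component maps using the custom parcellation, treating each component as if it were a time point, and tried several connectivity measures: correlation, covariance, partial correlation, tangent, and precision. On the resulting features they trained SVR, Ridge, Lasso, and Elastic Net models and stacked them, with SVR + Ridge as the stacking models.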
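The "component as time point" idea in step 3 can be sketched with plain NumPy: average each component map inside every parcel, which gives a (components x parcels) matrix, then compute parcel-by-parcel connectivity across the component axis. This is a rough illustration on random data, not the author's pipeline: the function names, the 53 components, the 8 parcels, and the toy label vector are all assumptions, and the tangent measure is left out because it needs a group-level reference matrix.

```python
import numpy as np


def parcel_signals(component_maps, labels):
    """Average each component map within each parcel.

    component_maps: (n_components, n_voxels); labels: (n_voxels,) parcel id per voxel.
    Returns (n_components, n_parcels); rows play the role of time points.
    """
    parcel_ids = np.unique(labels)
    return np.stack(
        [component_maps[:, labels == p].mean(axis=1) for p in parcel_ids], axis=1
    )


def connectivity_features(signals):
    """Return the upper triangle of several parcel-by-parcel connectivity matrices."""
    cov = np.cov(signals, rowvar=False)
    corr = np.corrcoef(signals, rowvar=False)
    precision = np.linalg.pinv(cov)  # inverse covariance
    d = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(d, d)  # partial correlation from precision
    np.fill_diagonal(partial, 1.0)
    iu = np.triu_indices_from(corr, k=1)
    matrices = {
        "correlation": corr,
        "covariance": cov,
        "precision": precision,
        "partial correlation": partial,
    }
    return {name: m[iu] for name, m in matrices.items()}


# Toy subject: 53 component maps (hypothetical), 400 voxels, 8 parcels.
rng = np.random.default_rng(0)
component_maps = rng.normal(size=(53, 400))
labels = np.arange(400) % 8  # stand-in for a K-means / Ward parcellation
feats = connectivity_features(parcel_signals(component_maps, labels))
print({name: v.shape for name, v in feats.items()})
```

Each vector in `feats` would be one subject's feature row for the downstream SVR, Ridge, Lasso, or Elastic Net models.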

5th

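Only the 3D fMRI part of the solution was shared, but using the 3D fMRI data well was the key to this competition. They trained models at two precisions, float16 and float32, and stacked them. float16 is faster, but the float32 models were more accurate. Resizing the 3D data from (50, 63, 53) to (75, 94, 79) improved the accuracy of the 3D CNN.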
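The resize step can be illustrated with `scipy.ndimage.zoom`. This is only a sketch on a random volume: the helper name `resize_volume` and the choice of linear interpolation (`order=1`) are assumptions, since the interpolation the authors actually used is not stated here.

```python
import numpy as np
from scipy.ndimage import zoom


def resize_volume(volume, target_shape, order=1):
    """Resample a 3D volume to target_shape using spline interpolation of the given order."""
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return zoom(volume, factors, order=order)


# Random stand-in for one 3D fMRI map with the shape quoted in the post.
volume = np.random.default_rng(0).normal(size=(50, 63, 53)).astype(np.float32)
resized = resize_volume(volume, (75, 94, 79))
print(volume.shape, "->", resized.shape)
```

Upsampling adds no new information, so any gain presumably comes from how the network's strides and pooling layers interact with the larger input; the trade-off is more memory and compute per sample.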

btw, is the picture in the top right some kind of Japanese humor....