10th-Place Solution for the PaddleHelix (螺旋桨) RNA Structure Prediction Competition
Team name: 白鹤亮对翅,黑熊飞双桨
Members: 刘建建, 史靖玮, 项建彪, 杨静俐
Results: score: 3.722, rmsd_avg: 0.269, rmsd_std: 0.067
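Task description
The "RNA base unpaired probability" measures whether each position in an RNA sequence can form a stable base pair. It is an important property of RNA structure and is applied in areas such as mRNA vaccine sequence design and drug development. For example, mRNA vaccine sequences are usually unstable, and the positions with a high unpaired probability are exactly the ones prone to degradation. Likewise, positions with a high unpaired probability are usually more likely to interact with other RNA sequences (RNA-RNA binding, for instance), a property that is also widely used in disease diagnosis and RNA drug development.
The competition provides 5,000 training samples. Participants are asked to develop a model, based on the training data and the PaddlePaddle platform, that predicts the RNA base unpaired probability.
(Tip: PaddlePaddle is the only deep learning framework allowed.)
Competition page: https://aistudio.baidu.com/aistudio/competition/detail/61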
Note
This is 10th place on the B leaderboard and does not represent the final competition result!
Project links
GitHub:https://github.com/jhcgt4869/Unpaired_Probability_Prediction_The_first_ten
AIStudio:https://aistudio.baidu.com/aistudio/projectdetail/1930064
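Approach
We first studied the baseline. Submitting it unchanged already scored well, so we looked for further improvements.
In practice the model showed signs of overfitting, so we applied early stopping.
We then explored other optimizations and eventually found a sweet spot in the score (a comparatively good checkpoint).
We ensembled the few models closest to that value to obtain the final result, which we then submitted.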
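The write-up does not say how the models were fused. As a minimal sketch, assuming the fusion is a plain arithmetic mean of the per-position predictions, the step could look like this. The function name `average_predictions` and the toy numbers are hypothetical and are not taken from the project code.

```python
def average_predictions(per_model_preds):
    """Average per-position unpaired probabilities across several models.

    per_model_preds: one entry per model; each entry is a list of sequences,
    and each sequence is a list of per-position probabilities.
    (Assumption: simple mean; the original fusion rule is not documented.)
    """
    n_seqs = len(per_model_preds[0])
    if any(len(preds) != n_seqs for preds in per_model_preds):
        raise ValueError("every model must predict the same number of sequences")

    fused = []
    for seq_idx in range(n_seqs):
        rows = [preds[seq_idx] for preds in per_model_preds]
        if len({len(row) for row in rows}) != 1:
            raise ValueError("sequence %d has mismatched lengths" % seq_idx)
        # zip(*rows) groups the predictions of all models for one position.
        fused.append([sum(col) / len(col) for col in zip(*rows)])
    return fused


# Toy predictions from two hypothetical checkpoints for two short sequences.
model_a = [[0.2, 0.8, 0.5], [0.1, 0.9]]
model_b = [[0.4, 0.6, 0.7], [0.3, 0.7]]
fused = average_predictions([model_a, model_b])
```

A weighted mean (for example, weighting each model by its dev loss) would be a natural variation, but nothing in the source indicates whether one was used.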
Competition dataset
# Check the dataset path
!tree /home/aistudio/data
/home/aistudio/data
├── data67691
│   └── test_log.txt
└── data82504
    └── B_board_112_seqs.txt
2 directories, 2 files
Baseline code structure
The baseline is built on PaddlePaddle 2.0.
# Check the source file structure
# !cd work; mkdir model
!tree /home/aistudio/work -L 2
/home/aistudio/work
├── data
│   ├── dev.txt
│   ├── test_nolabel.txt
│   └── train.txt
├── model
│   ├── model_dev=0.0673
│   ├── model_dev=0.0674
│   ├── model_dev=0.0678
│   ├── model_dev=0.0749
│   ├── model_dev=0.0752
│   ├── model_dev=0.0756
│   ├── model_dev=0.0762
│   └── placeholder.txt
├── model-0
│   └── model_dev=0.0772
├── README.txt
├── src
│   ├── const.py
│   ├── dataset.py
│   ├── __init__.py
│   ├── main.py
│   ├── network.py
│   ├── __pycache__
│   ├── utils.py
│   └── vocabulary.py
├── test_log.txt
└── train_log.txt
13 directories, 14 files
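The checkpoint directories above appear to encode the dev loss in their names (`model_dev=0.0673` and so on). Assuming that reading is correct, the following sketch ranks checkpoints by that value so the best few can be picked for fusion. The helper names `parse_dev_loss` and `lowest_dev_checkpoints` are hypothetical, and the source does not state which checkpoints were actually fused.

```python
import re

# Assumption: the number after "model_dev=" is the dev-set loss of the checkpoint.
CHECKPOINT_RE = re.compile(r"^model_dev=(\d+\.\d+)$")


def parse_dev_loss(name):
    """Return the dev loss encoded in a checkpoint name, or None if absent."""
    match = CHECKPOINT_RE.match(name)
    return float(match.group(1)) if match else None


def lowest_dev_checkpoints(names, k):
    """Return the k checkpoint names with the lowest encoded dev loss."""
    scored = [(parse_dev_loss(name), name) for name in names]
    scored = [(loss, name) for loss, name in scored if loss is not None]
    scored.sort()
    return [name for _, name in scored[:k]]


# Entries copied from the work/model directory listing above.
names = [
    "model_dev=0.0673", "model_dev=0.0674", "model_dev=0.0678",
    "model_dev=0.0749", "model_dev=0.0752", "model_dev=0.0756",
    "model_dev=0.0762", "placeholder.txt",
]
best_three = lowest_dev_checkpoints(names, 3)
```

In a notebook, `names` could instead come from `os.listdir("work/model")`.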
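Training script
python src/main.py train --model-path-base [model_directory_name]
This command trains a model and saves it to the specified location. The training log is written to train_log.txt by default.
Note: because initialization is unstable, training may need to be repeated several times. A reasonable mean squared error (MSE) loss on the validation (dev) set is 0.05-0.08.
Example
python src/main.py train --model-path-base model
You will see a training log similar to the following: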
epoch 1 batch 40 processed 640 batch-loss 0.1984 epoch-elapsed 0h00m10s total-elapsed 0h00m11s
epoch 1 batch 41 processed 656 batch-loss 0.2119 epoch-elapsed 0h00m10s total-elapsed 0h00m11s
epoch 1 batch 42 processed 672 batch-loss 0.2205 epoch-elapsed 0h00m11s total-elapsed 0h00m11s
epoch 1 batch 43 processed 688 batch-loss 0.2128 epoch-elapsed 0h00m11s total-elapsed 0h00m11s
# Dev Average Loss: 0.212 (MSE) -> 0.461 (RMSD)
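The sample log above can be read programmatically. The sketch below parses one batch line and converts an MSE value to RMSD, assuming RMSD is simply the square root of the MSE; the logged pair 0.212 and 0.461 is consistent with that up to rounding of the printed MSE, but the baseline's exact computation is not shown here. The names `parse_batch_line` and `mse_to_rmsd` are hypothetical.

```python
import math
import re

# Matches the leading fields of a batch line as printed in the sample log.
BATCH_LINE_RE = re.compile(
    r"epoch (\d+) batch (\d+) processed (\d+) batch-loss (\d+\.\d+)"
)


def parse_batch_line(line):
    """Extract epoch, batch, processed count and batch loss from a log line."""
    match = BATCH_LINE_RE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match.group(1)),
        "batch": int(match.group(2)),
        "processed": int(match.group(3)),
        "batch_loss": float(match.group(4)),
    }


def mse_to_rmsd(mse):
    """Assumption: RMSD is the square root of the mean squared error."""
    return math.sqrt(mse)


record = parse_batch_line(
    "epoch 1 batch 40 processed 640 batch-loss 0.1984 "
    "epoch-elapsed 0h00m10s total-elapsed 0h00m11s"
)
rmsd = mse_to_rmsd(0.212)  # roughly 0.460, close to the logged 0.461
```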
Notes
Run this module in a GPU-enabled environment.
# To train:
# python src/main.py train --model-path-base [model_directory_name]
!cd work; python src/main.py train --model-path-base model
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
2021-05-12 13:40:41.025758
# python3 src/main.py train --model-path-base model
# Training set contains 4750 Sequences.
# Validation set contains 250 Sequences.
# Paddle: Using device: CUDAPlace(0)
# Initializing model...
initializing vacabularies... done.
Sequence(6): ['<START>', '<STOP>', 'A', 'C', 'G', 'U']
Brackets(5): ['<START>', '<STOP>', '(', ')', '.']
# Checking validation 10 times an epoch (every 475 batches)
W0512 13:40:42.600050 511 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0512 13:40:42.604969 511 device_context.cc:372] device: 0, cuDNN Version: 7.6.
# Epoch 1 starting.
epoch 1 batch 1 processed 1 batch-loss 0.2189 epoch-elapsed 0h00m00s total-elapsed 0h00m04s
epoch 1 batch 2 processed 2 batch-loss 0.1878 epoch-elapsed 0h00m00s total-elapsed 0h00m04s
epoch 1 batch 3 processed 3 batch-loss 0.2259 epoch-elapsed 0h00m00s total-elapsed 0h00m04s
epoch 1 batch 4 processed 4 batch-loss 0.2033 epoch-elapsed 0h00m00s total-elapsed 0h00m04s
epoch 1 batch 5 processed 5 batch-loss 0.2201 epoch-elapsed 0h00m01s total-elapsed 0h00m05s
epoch 1 batch 6 processed 6 batch-loss 0.2206 epoch-elapsed 0h00m01s total-elapsed 0h00m05s
epoch 1 batch 7 processed 7 batch-loss 0.2026 epoch-elapsed 0h00m01s total-elapsed 0h00m05s
epoch 1 batch 8 processed 8 batch-loss 0.2187 epoch-elapsed 0h00m01s total-elapsed 0h00m05s
epoch 1 batch 9 processed 9 batch-loss 0.2143 epoch-elapsed 0h00m01s total-elapsed 0h00m05s
epoch 1 batch 10 processed 10 batch-loss 0.2183 epoch-elapsed 0h00m02s total-elapsed 0h00m06s
epoch 1 batch