AI4Bio-首届世界科学智能大赛：生命科学赛道——生物学年龄评价与年龄相关疾病风险预测特征优化-CSDN博客

本文链接：https://blog.csdn.net/qq_44350348/article/details/132502665

文章目录

six库出的一些小问题
赛题相关文献查阅
- 检索词确定
有关axis属性
NaN填充方法
特征选择方法
MLP测试
模型融合
使用MLP做AutoEncoder
参考

six库出的一些小问题

报错OSErroe，找不到six-1.16.0.dist-info/METADATA，去目录下检查了一下发现同时存在six-1.15.0.dist-info和six-1.16.0.dist-info两个版本，推测是发生冲突了。

将这两个都删掉，重新安装six就恢复了。估计是某天不小心多装了一个

赛题相关文献查阅

检索词确定

赛题数据好像是methylation 450k测得的cpg甲基化位点，目前想要了解的是如何对该甲基化位点进行特征工程

则可以确定检索式("methylation" OR CpG) AND ("feature engineer" OR "feature selection")

查找过程中又发现一些拓展词：("methylation" OR CpG OR DNAm) AND ("feature engineer" OR "feature selection" OR "deep learning")

也有些论文用AutoEncoder，有机会也研究一下

有关axis属性

在baseline中有这样一段代码

df['max'] = df[[i for i in range(df.shape[1])]].max(axis = 1)

拆分一下

columns = [i for i in range(df.shape[1])]
df_sub = df[columns]
max_col = df_sub.max(axis = 1)

这里容易误解的是axis = 1，很容易理解成按列找max，事实上这里axis指定的是被压缩的方向；亦即，axis = 0时，进行跨行操作；axis = 1时，进行跨列操作。

我感觉造成混淆的点是drop操作，drop中的axis指定的是按行或者按列drop。

NaN填充方法

尝试使用ElasticNet筛选填充数据的方法，备选方案如下

填充0
填充均值
填充插值。因为插值填充后可能还有NaN，因此会结合前两种使用
填充中位数

考虑到特征数量非常大，利用机器学习方法填充缺失数据在硬件上有些困难，暂不使用

在使用对应的方法处理数据集后，用KFold划分10个Fold，评估结果取10个Fold的平均

def check_dataset(X, y):
    '''
    use ElasticNet test the MAE of given data
    score is caculated by 10 CV subs
    Args:
    X (m, n): X of m examples, n features
    y (m,) : y of m values
    
    Return:
    mean score over 10 CV subs
    '''
    from sklearn.linear_model import ElasticNet
    # from sklearn.model_selection import train_test_split
    from sklearn.model_selection import RepeatedKFold
    from sklearn.metrics import mean_absolute_error
    rkf = RepeatedKFold(n_splits = 5, n_repeats = 2, random_state = 2023)
    total_score = 0
    for train_idx, test_idx in rkf.split(X, y):
        X_train, X_test, y_train, y_test = X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]
        model = ElasticNet()
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        score = mean_absolute_error(y_test, pred)
        total_score += score
    return total_score / 10

在不同大小的数据子集上测试不同填充方法的效果，步长为50

def test_fill_methods(test_range = 10):
    '''
    caculate fill methods score in different feature scale
    Args:
    test_range (scalar): times for test, n_rows will step up by 50
    
    Returns:
    scores (DataFrame): scores of test iters
    '''
    scores = pd.DataFrame()
    for n_rows in range(test_range):
        print(f'Processing rows = {(n_rows+1)*50}')
        df= pd.read_csv('../ai4bio/traindata.csv',
                  sep = ',',
                  index_col = 0,
                  nrows = (n_rows+1)*50
                  )
        age = pd.read_csv('../ai4bio/trainmap.csv',
                  sep=',',
                  index_col = 0)['age']
        scores = pd.concat([scores, pd.DataFrame(check_fill(df, age)).T], sort = False)
    return scores

用400个特征的数据进行评估

在这里插入图片描述

备选的几个方案中，先插值再填充均值的方法可能会更好一些。

特征选择方法

一些常用的特征选择方法（引自T. Doherty等, 《A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator》, BMC Bioinformatics, 卷 24, 期 1, 页 178, 5月 2023, doi: 10.1186/s12859-023-05282-4.）

在这里插入图片描述

低方差筛除

筛除掉train和test中值特征值变化不大的特征列

def var_filter(train_dir, batch_size):
    variances = []
    with h5py.File(train_dir, 'r') as file:
        for index in range(0, file['data'].shape[1], batch_size):
            print(f'Processing {index}', end='\r')
            data = file['data'][:,index:index + batch_size]
            var = np.var(data, axis = 0)
            variances = np.hstack([variances, var])
    return variances

绘制结果直方图观察在这里插入图片描述

可以观察到，很多特征的方差都不是很大，可以考虑适当筛选

Pearson’s correlation coefficient

实现基于pearson相关系数的特征选择，皮尔逊系数用于度量两个变量之间的线性相关关系，值域为[-1, 1]，可以

def pearson_filter(chunk, target, thresh):
    from sklearn.feature_selection import r_regression
    '''
    Args:
    chunk (m,n): DataFrame of m examples, n features
    target (m,): m target value
    thresh (scalar): thresh of pearson filter
    
    Return:
    relation (n,): array of col_names
    '''
    pearson = r_regression(chunk, target)
    relation = chunk.columns[pearson > thresh]
    return relation

绘图观察，pearson系数在0.3（低相关）能够保留适当数量的特征，高线性相关的特征很少

在这里插入图片描述

F-test

def ftest_filter(chunk, target, thresh):
    from sklearn.feature_selection import r_regression
    '''
    Args:
    chunk (m,n): DataFrame of m examples, n features
    target (m,): m target value
    thresh (scalar): thresh of pearson filter
    
    Return:
    relation (n,): array of bool_values
    '''
    _, p_values = f_regression(chunk, target)
    relation = p_values > thresh # 要求chunk与target差异不显著
    return relation

0.05置信下筛选出47388条特征，0.01置信下筛选出61812条特征

绘图分析

在这里插入图片描述

互信息

互信息能一定程度上反映非线性的相关性

def ML_filter(chunk, target):
    from sklearn.feature_selection import mutual_info_regression
    '''
    Args:
    chunk (m,n): DataFrame of m examples, n features
    target (m,): m target value
    thresh (scalar): thresh of pearson filter
    
    Return:
    relation (n,): array of ml
    '''
    ml = mutual_info_regression(chunk, target)
    return ml

绘图

在这里插入图片描述
互信息法对特征的筛选似乎不太明显

PCA降维

def pca_filter(chunk, ratio = 10):
    from sklearn.decomposition import PCA, IncrementalPCA
    '''
    Args:
    chunk (m,n): DataFrame of m examples, n features
    target (m,): m target value
    thresh (scalar): thresh of pearson filter
    
    Return:
    
    '''
    pca = PCA(n_components = chunk.shape[1]//ratio) # 降维到1/ratio
    data = pca.fit_transform(chunk)
    return data

def pca_decomposition(data_dir, batch_size = 5000, ratio = 10):
    X = pd.DataFrame()
    with h5py.File(data_dir,'r') as file:
        for index in range(0,file['data'].shape[1],batch_size): # 分组读取数据
            print(f'Processing {index}', end='\r')
            data=file['data'][:,index:index+batch_size]
            X = pd.concat([X, pd.DataFrame(pca_filter(data, ratio))], axis = 1, sort = False) 
    return X

实测PCA的效果不是很好，可能是特征之间具有高相关的原因（大概是部分位点之间存在相互作用？）

考虑到相关性可能不是线性的，测试一下kernelPCA，改用poly核再测试一下，效果还是不太行（甚至更差了），submit结果也很差。

筛选结果

测试一部分特征选择方法的有效性，由于选择后的特征仍然很多，参照文献，用特征重要性做二次筛选

一次筛选方法	二次筛选方法	一次筛选后数量	二次筛选后数量	10xCV平均后的成绩（使用CatBoostRegressor）
Pearson>0.3	ElasticNet>0.0001	63621	826	local_mae:3.6007 score:3.8895 mae_control:3.7841 mae_case:3.9949
方差+Pearson>0.3	CatBoostRegressor>0.0001	26353	3614	local_mae:3.6779 score:3.5472 mae_control:3.5962 mae_case:3.4982
F-test>0.05	ElasticNet>0.0001	47388	2605	local_mae:5.2942
PCA 4%	CatBoostRegressor>0.0001	19420	5839	local_mae:5.0130 score:15.9944 mae_control:17.3095 mae_case:14.6793
ML > 0.2	ElasticNet > 0.0001	72940	1362	local_mae:4.4434

ElasticNet参数：

alpha = 0.1, max_iter = 2000, warm_start=True, l1_ratio=0.7, random_state = 2023

CatBoost参数：

params = {'learning_rate': 0.1,'depth': 5, 'bootstrap_type':'Bernoulli','random_seed':2023,
		'od_type': 'Iter', 'od_wait': 100, 'random_seed': 11, 'allow_writing_files': True,
        'task_type':"CPU", # 这一步特征数可能超过20000，用CPU跑防止爆掉
        'devices':'0:1'}

目前来看pearson过滤的表现更好些，后面考虑用这个方法进行分析

画scatter图可视化预测结果

从论文中还看到了一种直观观察模型效果的方法

import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
plt.scatter(y, train_pred1)
plt.xlabel('real age')
plt.ylabel('predict age')
plt.legend(labels = [f'$R^2$: {r2_score(y, train_pred1):.5f}'])

在这里插入图片描述

这样观察结果可以更直观一些。

MLP测试

看到很多论文中都是用MLP做AutoEncoder，先实现个简单的MLP，配合pearson筛过的数据进行尝试

import numpy as np
import pandas as pd
import h5py
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import relu
import matplotlib.pyplot as plt

model = Sequential([
    Dense(500, activation = 'relu', input_shape=[trainset.shape[1],]),
    Dense(100, activation = 'relu'),
    Dense(1, activation = 'linear')
])
model.compile(optimizer = tf.keras.optimizers.legacy.Adam(0.001),
              loss = 'mae', metrics=['mape', 'mae'])

history = model.fit(trainset, y ,batch_size=512,
                    epochs=1000, validation_split = 0.3, verbose=1,
                   use_multiprocessing=False)

绘图查看Loss

def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.plot(hist['epoch'], hist['loss'],
             label='Train Error')
    plt.plot(hist['epoch'], hist['val_loss'],
             label = 'Val Error')
    plt.legend()
    
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Metrics')
    plt.plot(hist['epoch'], hist['mape'],
             label = 'Train Error')
    plt.plot(hist['epoch'], hist['val_mape'],
             label = 'Val Error')
    plt.legend()
    
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Metrics')
    plt.plot(hist['epoch'], hist['mae'],
             label = 'Train Error')
    plt.plot(hist['epoch'], hist['val_mae'],
             label = 'Val Error')
    plt.legend()
    plt.show()

plot_history(history)

绘制学习曲线，效果还不错
在这里插入图片描述

引入早停

callback_loss = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                             min_delta=5000000,patience=100)
history = model.fit(trainset, y ,batch_size=512,
                    epochs=1000, validation_split = 0.3, verbose=1,
                   use_multiprocessing=False,
                   callbacks=[callback_loss])

加个callback

filepath="weights.best.hdf5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='mae', verbose=1, save_best_only=True,
mode='auto')
history = model.fit(trainset, y ,batch_size=512,
                    epochs=10000, validation_split = 0.3, verbose=1,
                   use_multiprocessing=False,
                   callbacks=[checkpoint])

模型融合

考虑作为一个优化方向，常见的模型融合思路有：

对分类数据的Voting，例如对一个二分类问题，有3个模型，则可以进行投票，投票多的确定分数
对回归问题的Averaging，即取加权平均
Bagging，有放回抽样；首先K折，然后将K折后的模型融合，随机森林就是个例子
Boosting，在每次训练的时候都为上一轮分配错误的样例分配更高的抽样权重，常见的AdaBoosting，GBDT等都是；

Boosting和Bagging在实际编写中的一个差异在于，Boosting必须迭代，而Bagging可以并发
Stacking，本质上是一种分层结构，以两层Stacking为例，设第一层有三个模型：M1~3，分别使用三个模型预测train和test，得到P1~3, T1~3，令P1~3为train2，T1~3为test2
$Models:\{ M1, M2, M3\}\\ \begin{array}{c} \downarrow trainset & \downarrow testset \\ \{P1, P2, P3\} & \{T1, T2, T3\}\\ \downarrow & \downarrow\\ train2 & test2 \end{array}$
使用第二层的模型M4在train2上训练，预测test2
$\rightarrow M4 \to test2 \to pred$
需要注意的是，因为第一层的模型本身就是在train上训练的，如果不做交叉，预测出的P肯定会过拟合，因此要做交叉

例如对如下数据
$\left( \begin{array}{} x1 \\ x2 \\ x3 \end{array}\right)$
如果进行三折交叉，则
$\begin{array}{} train1 = (x1, x2) & val1 = (x3)\\ train2 = (x1, x3) & val2 = (x2)\\ train3 = (x2, x3) & val3 = (x1)\\ \end{array}$
预测出的P应该为
$\begin{array}{r} \left. \begin{array}{} train3 \to M1 \to val3 \to p1\\ train2 \to M1 \to val2 \to p2\\ train1 \to M1 \to val1 \to p3\\ \end{array} \right\} \to P1\\ \begin{array}{}\cdots &\to & M2 &\to & \cdots &&\end{array} \}\to P2 \\ \begin{array}{}\cdots &\to & M3 &\to & \cdots &&\end{array} \}\to P3 \end{array}$
在实际实现中，可以通过k折的idx进行赋值
```
train_predict = np.zeros(trainset.shape[0])
kf = KFold()
for i, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    X_train, X_val, y_train, y_val = X[train_idx], X[val_idx], y[train_idx], y[val_idx]
    model.fit(X_train, y_train) # 训练当前的模型
    val_pred = model.predict(X_val) # 预测得p_i
    
    train_predict[val_idx] = val_pred # 将p_i记录入P
```
而对于T，一种求解方法是直接应用模型Mi求解Ti；也可以在交叉训练的时候，每个K折的模型都求解一次Ti，取平均值；
$\to M1 \to T1\\ test \to M2 \to T2\\ test \to M3 \to T3$

这里容易绕，最好多看几次。

使用MLP做AutoEncoder

在这里插入图片描述

AE的图例，引自Rakshit S, Saha I, Chakraborty S S, et al. Deep learning for integrated analysis of breast cancer subtype specific multi-omics data[C]//TENCON 2018-2018 IEEE region 10 conference. IEEE, 2018: 1917-1922.，

首先训练一个MLP模型M1，使用隐层参数构建一个新的模型M2
$\to M1 \to M2$
使用M2分别预测train和test，得到训练集P和测试集T
$\to M2 \to P\\ test \to M2 \to T$
记得保存M1的参数便于后面复用（主要是怕内核崩掉）

# 保存参数
model.save_weights('AE_weights')
# 加载
model2.load_weights('AE_weights')

在P上训练新的模型M3，用来预测P
$\to M3 \to T \to predict$

参考

【机器学习】模型融合方法概述
T. Doherty等, 《A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator》, BMC Bioinformatics, 卷 24, 期 1, 页 178, 5月 2023, doi: 10.1186/s12859-023-05282-4.
《AI夏令营（第三期） - AI for Science 生命科学赛道》学习手册
AI夏令营第三期 - 生物学年龄评价与年龄相关疾病风险预测教程
竞赛实践路线分享
阿里云DSW环境部署教程
阿里云机器学习Pai-DSW服务器部署教程
首届世界科学智能大赛：生命科学赛道——生物学年龄评价与年龄相关疾病风险预测

AI4Bio-首届世界科学智能大赛：生命科学赛道——生物学年龄评价与年龄相关疾病风险预测 特征优化

文章目录

six库出的一些小问题

赛题相关文献查阅

检索词确定

有关axis属性

NaN填充方法

特征选择方法

低方差筛除

Pearson’s correlation coefficient

F-test

互信息

PCA降维

筛选结果

画scatter图可视化预测结果

MLP测试

模型融合

使用MLP做AutoEncoder

参考

AI4Bio-首届世界科学智能大赛：生命科学赛道——生物学年龄评价与年龄相关疾病风险预测特征优化