Datawhale AI Summer Camp: Molecular AI Prediction Notes

tret_hjm

Published 2024-07-21 12:40:50


Tags: artificial intelligence, notes, python

Article link: https://blog.csdn.net/swordwielder/article/details/140586260


I. Competition Task

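Using the provided demo dataset, split the data yourself, then apply deep learning, reinforcement learning, or other more capable AI methods to predict the degradation ability of PROTACs. If DC50 > 100 nM and Dmax < 80%, the degradation ability is considered poor (Label=0 in the demo dataset); if DC50 <= 100 nM or Dmax >= 80%, it is considered good (Label=1 in the demo dataset).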
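The labeling rule above can be written down as a small function. This is only a sketch: the function name `protac_label` and the sample values are made up for illustration, and it assumes both DC50 and Dmax are present for every row.

```python
import pandas as pd


def protac_label(dc50_nm: float, dmax_pct: float) -> int:
    """Return 1 (good degrader) if DC50 <= 100 nM or Dmax >= 80 %, else 0."""
    return int(dc50_nm <= 100 or dmax_pct >= 80)


# Hypothetical rows; the numbers are invented to exercise each branch of the rule.
demo = pd.DataFrame({
    "DC50 (nM)": [50.0, 500.0, 500.0, 100.0],
    "Dmax (%)": [60.0, 90.0, 40.0, 80.0],
})
demo["Label"] = [
    protac_label(dc50, dmax)
    for dc50, dmax in zip(demo["DC50 (nM)"], demo["Dmax (%)"])
]
print(demo)
```

Only the third row fails both conditions, so it is the only one labeled 0.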

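The training set looks as follows: [Figure: training set preview]

The test set looks as follows: [Figure: test set preview]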

II. Running the Baseline

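Impressions of BML Codelab:

Compared with Jupyter and PyCharm, coding on the platform does not require typing commands to create a virtual environment or setting up project files, so it is more convenient and concise. It also offers hints while you code, which makes it friendly for beginners.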

task1 (understanding the basic baseline code, with additional notes):

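import pandas as pd
import numpy as np
from lightgbm import LGBMClassifier

# Load the training and test sets
train = pd.read_excel('./data/data282671/traindata-new.xlsx')
test = pd.read_excel('./data/data282671/testdata-new.xlsx')
# Drop the two columns the label is derived from (axis=1 drops columns, axis=0 drops rows)
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)
# Iterate over the feature columns of train, from position 2 onward
for col in train.columns[2:]:
    # For object (string) columns, replace the values with a boolean "is missing" flag
    if train[col].dtype == object or test[col].dtype == object:
        train[col] = train[col].isnull()  # flag missing values in train
        test[col] = test[col].isnull()    # flag missing values in test
# Initialize the model; verbosity=-1 suppresses the training log
model = LGBMClassifier(verbosity=-1)
# Train: every row and every column from position 2 onward are features, Label is the target
model.fit(train.iloc[:, 2:].values, train['Label'])
# The test set has no Label column, so its features start at position 1
pred = model.predict(test.iloc[:, 1:].values)
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': pred
    }
).to_csv('submit.csv', index=None)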
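To see what the `isnull()` step in the loop does, here is a toy sketch. The column names and values are hypothetical, and the string column is built with an explicit `object` dtype so the check behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are invented for illustration.
toy = pd.DataFrame({
    "uuid": [1, 2, 3],
    "Assay": pd.Series(["Western blot", None, "HiBiT"], dtype=object),
    "Molecular Weight": [812.4, 950.1, 1003.7],
})

for col in toy.columns[1:]:
    if toy[col].dtype == object:
        # The text itself is discarded; only "is this cell missing?" is kept.
        toy[col] = toy[col].isnull()

print(toy)
```

The string column ends up as a boolean missing-value flag while the numeric column is untouched, so the basic baseline does not use the text content of those columns at all.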

task3 (understanding the advanced baseline code, with additional notes):

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import f1_score
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.feature_extraction.text import TfidfVectorizer
import tqdm, sys, os, gc, re, argparse, warnings
warnings.filterwarnings('ignore')

train = pd.read_excel('./data/data282671/traindata-new.xlsx')
test = pd.read_excel('./data/data282671/testdata-new.xlsx')
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)
# Collect columns that have fewer than 10 non-null values in the test set
drop_cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        drop_cols.append(f)
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)
# Stack train and test; ignore_index=True discards the original indexes and builds a new integer index starting at 0
data = pd.concat([train, test], axis=0, ignore_index=True)
cols = data.columns[2:]

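# Canonicalize each SMILES string with RDKit; the result is a one-element list per row
data['smiles_list'] = data['Smiles'].apply(lambda x:[Chem.MolToSmiles(mol, isomericSmiles=True) for mol in [Chem.MolFromSmiles(x)]])
# Join each list back into a single string
data['smiles_list'] = data['smiles_list'].map(lambda x: ' '.join(x))
# max_df=0.9 ignores tokens that occur in more than 90% of rows; min_df=1 keeps tokens that occur in at least one row
tfidf = TfidfVectorizer(max_df = 0.9, min_df = 1, sublinear_tf = True)
# Turn the strings into a TF-IDF feature matrix
res = tfidf.fit_transform(data['smiles_list'])
# Convert the feature matrix to a DataFrame
tfidf_df = pd.DataFrame(res.toarray())
# Name the columns smiles_tfidf_{i}, where i is the column index
tfidf_df.columns = [f'smiles_tfidf_{i}' for i in range(tfidf_df.shape[1])]
data = pd.concat([data, tfidf_df], axis=1)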
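def label_encode(series):
    # Convert a categorical (object) feature to integer codes.
    # unique() includes NaN while nunique() does not, so drop NaN first
    # to keep the values and the codes aligned; missing values stay NaN.
    unique = list(series.dropna().unique())
    return series.map(dict(zip(unique, range(len(unique)))))
for col in cols:
    if data[col].dtype == 'object':
        data[col] = label_encode(data[col])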
# Split data back into train and test: rows with a Label are training rows, rows without one are test rows
train = data[data.Label.notnull()].reset_index(drop=True)
test = data[data.Label.isnull()].reset_index(drop=True)
features = [f for f in train.columns if f not in ['uuid','Label','smiles_list']]
x_train = train[features]
x_test = test[features]
y_train = train['Label'].astype(int)

def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2022):
    # 5-fold split, shuffled, with a fixed random seed
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions for the training rows
    test = np.zeros(test_x.shape[0])    # averaged predictions for the test rows
    cv_scores = []                      # per-fold F1 scores
    # Each iteration yields a pair of index arrays: training rows and validation rows
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} {}************************************'.format(str(i+1), str(seed)))
        # Use .iloc for both features and labels so the split is purely positional
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        # CatBoost parameter dictionary
        params = {'learning_rate': 0.1, 'depth': 6, 'l2_leaf_reg': 10, 'bootstrap_type':'Bernoulli','random_seed':seed,
                  'od_type': 'Iter', 'od_wait': 100, 'allow_writing_files': False, 'task_type':'CPU'}
        # Instantiate the model
        model = clf(iterations=20000, **params, eval_metric='AUC')
        model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                  metric_period=100,
                  cat_features=[],
                  use_best_model=True,
                  verbose=1)
        # Predict the positive-class probability on the validation fold and on the test set
        val_pred  = model.predict_proba(val_x)[:,1]
        test_pred = model.predict_proba(test_x)[:,1]
        train[valid_index] = val_pred
        # Average the test predictions over the folds
        test += test_pred / kf.n_splits
        cv_scores.append(f1_score(val_y, np.where(val_pred>0.5, 1, 0)))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': np.where(cat_test>0.5, 1, 0)# 1 if the averaged probability is above 0.5, otherwise 0
    }
).to_csv('submit.csv', index=None)
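The TF-IDF step in the task3 code can be tried on a few toy molecules to see what the vectorizer actually treats as a "word". This is a sketch with three hand-picked SMILES strings rather than competition data, and it uses the same `TfidfVectorizer` settings as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy SMILES strings: ethanol, acetic acid, benzene.
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]

tfidf = TfidfVectorizer(max_df=0.9, min_df=1, sublinear_tf=True)
matrix = tfidf.fit_transform(smiles)

# With the default settings the text is lowercased and only runs of two or
# more alphanumeric characters become tokens, so brackets and bond symbols
# act as separators and single atoms such as "O" are dropped.
print(sorted(tfidf.vocabulary_))
print(matrix.shape)
```

Because lowercasing is on by default, aromatic `c` and aliphatic `C` are merged. If that distinction matters, one option to try is a character-level analyzer (`analyzer='char'` with `lowercase=False`), though whether it helps on this dataset is untested here.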

III. Libraries Learned and Problems Encountered

1.import pandas as pd: used for data processing and analysis; provides data structures such as DataFrame.

Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

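This message means that the program, or a library it uses, is missing the optional dependency xlrd. xlrd is a Python library for reading Excel files; since version 2.0.0 it no longer supports .xlsx files and only reads the older .xls format. If your program only needs to handle .xls files, run pip install "xlrd>=1.0.0" (the quotes stop the shell from treating > as a redirect). If you are reading an .xlsx file with pandas, you do not need xlrd: recent pandas versions use openpyxl for .xlsx files once it is installed (pip install openpyxl). You can then read the file with pandas' read_excel function, as follows: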

import pandas as pd

df = pd.read_excel('your_file.xlsx', engine='openpyxl')

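Here, the engine='openpyxl' argument explicitly selects openpyxl as the engine for reading the .xlsx file. If you omit engine and openpyxl is installed, pandas also uses it by default for .xlsx files.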

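2.import numpy as np: used for scientific computing and multi-dimensional array operations.

3.from lightgbm import LGBMClassifier: imports the LGBMClassifier class (a gradient-boosted decision tree classifier) from the lightgbm module.

4.from catboost import CatBoostClassifier: catboost is a gradient boosting machine learning library suited to classification and regression tasks; CatBoostClassifier is its classifier.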

ERROR: Could not install packages due to an OSError: [Errno 13] Permission d

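This error means the Python package could not be installed because of insufficient permissions. Possible fixes:

On Linux or macOS: sudo pip install package_name (pip install --user package_name also works there and avoids modifying the system Python)

On Windows: pip install --user package_name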

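5.from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold: sklearn.model_selection contains many model selection tools, such as cross-validation splitters.

6.from sklearn.metrics import f1_score: sklearn.metrics contains many metrics for evaluating model performance; f1_score computes the F1 score for classification problems.

7.from rdkit import Chem: rdkit is a cheminformatics and machine learning toolkit for working with chemical structures.

8.from rdkit.Chem import Descriptors: within rdkit's Chem module, the Descriptors submodule provides a large number of functions for computing molecular descriptors.

9.from sklearn.feature_extraction.text import TfidfVectorizer: provides the TF-IDF vectorizer, which converts text into feature vectors.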

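10.import tqdm, sys, os, gc, re, argparse, warnings:

tqdm: adds progress bars to long-running loops;

sys: a module closely tied to the Python interpreter; gives access to variables and functions used or maintained by the interpreter;

os: provides functions for interacting with the operating system;

gc: interface to the garbage collector, for example gc.collect() to trigger a collection manually;

re: the regular expression library, used for searching and replacing in strings;

argparse: used for writing user-friendly command-line interfaces;

warnings: the library for issuing warnings; here it is used to suppress warning messages.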
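Item 5 imports StratifiedKFold, although the task3 code only uses plain KFold. The sketch below shows what stratification does on a hypothetical imbalanced label vector (16 negatives, 4 positives); the data are invented and no model is involved.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 16 negatives and 4 positives.
y = np.array([0] * 16 + [1] * 4)
X = np.zeros((len(y), 1))  # the feature values do not affect the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2022)
positives_per_fold = [
    int(y[valid_idx].sum()) for _, valid_idx in skf.split(X, y)
]
print(positives_per_fold)
```

Each validation fold receives the same share of positives, whereas a plain KFold gives no such guarantee on small or imbalanced data. Swapping `KFold` for `StratifiedKFold` in `cv_model` is one thing to try; whether it improves the score on this task is not verified here.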

IV. Learning Notes and Reflections

task1：

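1.When extracting features, compare the training set with the test set and drop any training-set feature columns that the test set does not contain;

2.Compare the test set with the sample submission to determine the target variable. This competition uses Label as the target field, so that is the field to predict for the test set.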
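The two points above can be sketched on toy frames. The column `extra_only_in_train` and all values are hypothetical; only `uuid`, `Label`, and `Smiles` come from the competition data.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real training and test sets.
train = pd.DataFrame({
    "uuid": [1, 2],
    "Label": [0, 1],
    "Smiles": ["CCO", "CCN"],
    "extra_only_in_train": [0.1, 0.2],
})
test = pd.DataFrame({"uuid": [3], "Smiles": ["CCC"]})

target = "Label"
# Keep only the columns that both frames share, plus the target for train.
shared = [c for c in train.columns if c in test.columns]
train_aligned = train[shared + [target]]
test_aligned = test[shared]

print(list(train_aligned.columns))
print(list(test_aligned.columns))
```

This generalizes the hard-coded `drop(['DC50 (nM)', 'Dmax (%)'], axis=1)` in the baseline: any column missing from the test set is removed without being named explicitly.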

task3：

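1.Columns with roughly 95% of their values missing, or with a single constant value (all entries identical), can be dropped;

2.To simplify data processing and feature extraction, the training and test sets can be merged and processed together (when they do not influence each other much);

3.Consider feature importance and related statistics such as information gain;

4.Non-numeric features (not int or float), such as string columns, can be handed to the model to process directly;

5.Compare features to judge whether each one brings a gain; features with no gain can be dropped;

or use a filter method: check whether each feature correlates with the target and drop those that do not;

or look at the weights the model assigns to features and drop the low-weight ones;

6.High-dimensional matrices derived from strings can be reduced with dimensionality reduction;

7.For binary or multi-class problems, the best decision threshold can be chosen by optimizing a performance metric.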
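Point 7 can be sketched as a simple grid search over thresholds that maximizes F1. The helper name `best_f1_threshold`, the grid, and the sample probabilities are all assumptions made for illustration; in the task3 code the natural inputs would be `y_train` and the out-of-fold predictions `cat_train`.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_f1_threshold(y_true, proba, grid=None):
    """Return (threshold, f1) for the grid value with the highest F1 score."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    scores = [
        f1_score(y_true, (proba > t).astype(int), zero_division=0)
        for t in grid
    ]
    best = int(np.argmax(scores))  # first maximum if several thresholds tie
    return float(grid[best]), float(scores[best])


# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
proba = np.array([0.12, 0.33, 0.42, 0.81, 0.66, 0.22, 0.47, 0.56])

threshold, score = best_f1_threshold(y_true, proba)
default_score = f1_score(y_true, (proba > 0.5).astype(int))
print(threshold, score, default_score)
```

On this toy data a threshold below 0.5 recovers the two positives scored 0.42 and 0.47, so it beats the default cut-off. The threshold should be tuned on out-of-fold predictions rather than on the data used for final evaluation, otherwise the resulting score is optimistic.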
