Datawhale AI Summer Camp, Session 3: AI for Science (CSDN blog)

Article link: https://blog.csdn.net/weixin_74886757/article/details/140896850

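Task3 builds on the lgm (LightGBM) model from task2, improves it from two angles, features and model training, and also offers some other ideas for raising the score.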

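First, task3 provides three pieces of reference material. They stress how much the match between the siRNA sequence and the target gene, the GC content, and chemical modification matter for siRNA silencing efficiency. The takeaway is that biological features deserve attention when building the prediction model: they help optimize the model and improve prediction accuracy.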

1. Refining the task2 features

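Starting from the length, GC content, and other features introduced in task2, I describe them in finer detail and extend the code provided in the tutorial: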

import pandas as pd


def siRNA_feat_builder3(s: pd.Series, anti: bool = False):
    name = "anti" if anti else "sense"
    df = s.to_frame()

    # Length group: is the sequence exactly 21 nt long?
    df[f"feat_siRNA_{name}_len21"] = (s.str.len() == 21)

    # First and last base
    df[f"feat_siRNA_{name}_first_A"] = s.str.startswith("A")
    df[f"feat_siRNA_{name}_first_U"] = s.str.startswith("U")
    df[f"feat_siRNA_{name}_first_G"] = s.str.startswith("G")
    df[f"feat_siRNA_{name}_first_C"] = s.str.startswith("C")
    df[f"feat_siRNA_{name}_last_A"] = s.str.endswith("A")
    df[f"feat_siRNA_{name}_last_U"] = s.str.endswith("U")
    df[f"feat_siRNA_{name}_last_G"] = s.str.endswith("G")
    df[f"feat_siRNA_{name}_last_C"] = s.str.endswith("C")

    # GC content: does the overall GC fraction fall within [0.36, 0.52]?
    GC_frac = (s.str.count("G") + s.str.count("C")) / s.str.len()
    df[f"feat_siRNA_{name}_GC_in"] = (GC_frac >= 0.36) & (GC_frac <= 0.52)

    # Local GC content over the windows s[1:7] and s[7:14] (0-indexed slices)
    GC_frac1 = (s.str[1:7].str.count("G") + s.str[1:7].str.count("C")) / s.str[1:7].str.len()
    GC_frac2 = (s.str[7:14].str.count("G") + s.str[7:14].str.count("C")) / s.str[7:14].str.len()
    df[f"feat_siRNA_{name}_GC_in1"] = GC_frac1
    df[f"feat_siRNA_{name}_GC_in2"] = GC_frac2
    df[f"feat_siRNA_{name}_local_GC1"] = (GC_frac1 >= 0.36) & (GC_frac1 <= 0.52)
    df[f"feat_siRNA_{name}_local_GC2"] = (GC_frac2 >= 0.36) & (GC_frac2 <= 0.52)

    # Specific sequence patterns (leading and trailing dinucleotides)
    df[f"feat_siRNA_{name}_pattern_AA_UU"] = s.str.startswith("AA") & s.str.endswith("UU")
    df[f"feat_siRNA_{name}_pattern_GA_UU"] = s.str.startswith("GA") & s.str.endswith("UU")
    df[f"feat_siRNA_{name}_pattern_CA_UU"] = s.str.startswith("CA") & s.str.endswith("UU")
    df[f"feat_siRNA_{name}_pattern_UA_UU"] = s.str.startswith("UA") & s.str.endswith("UU")
    df[f"feat_siRNA_{name}_pattern_UU_AA"] = s.str.startswith("UU") & s.str.endswith("AA")
    df[f"feat_siRNA_{name}_pattern_UU_GA"] = s.str.startswith("UU") & s.str.endswith("GA")

    return df.iloc[:, 1:]

Following the pattern of

df[f"feat_siRNA_{name}_len21"] = (s.str.len() == 21)

GC_frac1 = (s.str[1:7].str.count("G") + s.str[1:7].str.count("C"))/s.str[1:7].str.len()

df[f"feat_siRNA_{name}_GC_in1"] = GC_frac1

the remaining feature code was added by analogy.
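The "by analogy" step can be sketched as a small helper that computes the GC fraction over any slice, so that each new window feature is a one-line addition. This is only an illustration: `gc_fraction`, the column names, and the toy sequences are made up here and are not part of the tutorial code.

```python
import pandas as pd


def gc_fraction(s: pd.Series, start=None, stop=None) -> pd.Series:
    """Fraction of G/C bases in s.str[start:stop] (whole sequence by default)."""
    window = s.str[start:stop]
    return (window.str.count("G") + window.str.count("C")) / window.str.len()


# Toy sequences, invented purely for illustration
toy = pd.Series(["AGCUGCAUGCAUGCAUGCAUU", "UUAAUUAAUUAAUUAAUUAAA"])

toy_feats = pd.DataFrame({
    "len21": toy.str.len() == 21,       # same test as the len21 feature
    "GC_all": gc_fraction(toy),         # overall GC fraction
    "GC_1_7": gc_fraction(toy, 1, 7),   # same window as GC_frac1
    "GC_7_14": gc_fraction(toy, 7, 14), # same window as GC_frac2
})
# Range flag in the same style as the *_GC_in feature
toy_feats["GC_all_in_range"] = toy_feats["GC_all"].between(0.36, 0.52)
print(toy_feats)
```

Adding another window (for example the last few bases) then only needs one more `gc_fraction(toy, start, stop)` call.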

2. Building features for the modified siRNA

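Likewise, I studied the code provided in the tutorial and extended it. I won't paste the task3 code here; instead I show my own approach and result directly.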

def siRNA_feat_builder3_mod(s: pd.Series, anti: bool = False):
    name = "anti" if anti else "sense"
    df = s.to_frame()
    
    voc_ls = list("AUGC")
    
    # Base class at the first and last positions of the modified RNA
    for pos in [0, -1]:
        for c in voc_ls:
            df[f"feat_siRNA_{name}_pos{pos+1}_{c}"] = (s.str[pos] == c)

    # Base class at the second and second-to-last positions of the modified RNA
    for pos in [1, -2]:
        for c in voc_ls:
            df[f"feat_siRNA_{name}_pos{pos+1}_{c}"] = (s.str[pos] == c)

    return df.iloc[:, 1:]

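The goal of siRNA_feat_builder3_mod is to extract features from the modified siRNA sequences: the base at the first and last positions, and the base at the second and second-to-last positions. The idea is to visit these specific positions in each siRNA sequence and record which base sits there.

The first loop visits the start position (the first base) and the end position (the last base).

For each base class (A, U, G, C), the feature is True if the sequence has that base at the position and False otherwise.

These feature columns are named feat_siRNA_{name}_pos{pos+1}_{c}, so pos1 refers to the first base and pos0 to the last.

The second loop visits the second position and the second-to-last position.

Again, for each base class (A, U, G, C), the feature is True if the sequence has that base at the position and False otherwise.

These feature columns use the same naming format, so pos2 refers to the second base and pos-1 to the second-to-last.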
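The positional flags described above can be checked on a toy input. The sketch below folds both loops into one hypothetical helper, `terminal_base_flags`, and drops the `feat_siRNA_{name}_` prefix for brevity; the sequences are invented for illustration.

```python
import pandas as pd


def terminal_base_flags(s: pd.Series, positions=(0, -1, 1, -2), vocab="AUGC") -> pd.DataFrame:
    """One boolean column per (position, base) pair, named pos{pos+1}_{base}."""
    out = {}
    for pos in positions:
        for c in vocab:
            # s.str[pos] picks a single character; negative indices count from the end
            out[f"pos{pos + 1}_{c}"] = s.str[pos] == c
    return pd.DataFrame(out, index=s.index)


# Toy sequences, invented purely for illustration
toy = pd.Series(["AUGC", "GGCA"])
flags = terminal_base_flags(toy)
print(flags.loc[0][flags.loc[0]].index.tolist())
```

Because every inspected position holds exactly one of the four bases, each row has exactly one True per position, which is a handy sanity check on real data.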

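# Substring (token) frequency statistics for the modified sequences
# (df and GenomicVocab come from the tutorial code)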
cols_mod = ["modified_siRNA_sense_seq", "modified_siRNA_antisense_seq"]
cols_mod_ls = ["modified_siRNA_sense_seq_list", "modified_siRNA_antisense_seq_list"]
all_tokens_mod = []

for col in cols_mod_ls:
    for seq_ls in df[col]:
        if pd.isna(seq_ls):
            continue
      
        all_tokens_mod.extend(seq_ls)

print('#all_tokens_mod: ', len(all_tokens_mod))

vocab_mod = GenomicVocab.create(all_tokens_mod, max_vocab=100000, min_freq=1)
print('#vocab_mod: ', len(vocab_mod.itos))

for col in cols_mod:
    
    df[f'{col}_freq'] = df[col].apply(lambda x: [vocab_mod.stoi[b] for b in x] if pd.notna(x) else [])

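This performs substring (token) frequency statistics on the modified sequences, as part of feature extraction for the modified siRNA. First, the tutorial code collects the tokens: each sequence in the *_seq_list columns is split into individual units, which are appended to the all_tokens_mod list. Next, a vocabulary is built from that token list with GenomicVocab.create. Finally, each modified siRNA sequence is encoded by looking up its elements in vocab_mod.stoi, and the result is stored in the {col}_freq column.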
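The collect, build, and encode steps can be sketched without the tutorial class. `GenomicVocab` itself is not reproduced in this post, so `build_vocab` below is a hypothetical stand-in built on `collections.Counter`. It assumes the vocabulary keeps tokens in descending frequency order with a padding entry at index 0, and the toy tokens are invented.

```python
from collections import Counter


def build_vocab(tokens, max_vocab=100000, min_freq=1):
    """Hypothetical stand-in for a vocabulary builder (not the tutorial's class).

    Keeps the most frequent tokens first; index 0 is reserved for padding.
    """
    counts = Counter(tokens)
    itos = ["<pad>"] + [tok for tok, n in counts.most_common(max_vocab - 1) if n >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi, counts


# Toy tokenized sequences, invented purely for illustration
toy_seqs = [["u", "s", "Af", "u"], ["Af", "u", "g"]]

# Step 1: collect every token from every sequence
all_tokens = [tok for seq in toy_seqs for tok in seq]

# Step 2: build the vocabulary from the token list
itos, stoi, counts = build_vocab(all_tokens)

# Step 3: encode each sequence as a list of vocabulary indices
encoded = [[stoi[tok] for tok in seq] for seq in toy_seqs]
print(len(all_tokens), len(itos), encoded)
```

Under these assumptions a smaller index means a more frequent token, so the encoded lists carry frequency-rank information rather than raw counts.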