Demonstration of two novel methods for predicting functional siRNA efficiency
Background: siRNAs play a key role in gene silencing (RNAi), but silencing efficiency differs greatly between target sites. There is therefore high demand for reliable siRNA prediction tools and for design methods able to pick out siRNAs with high silencing potential.
This paper uses two methods: a sequence-based statistical model and a Support Vector Machine (SVM)
Dataset:
Dieter’s Dataset and Satron’s Dataset
Note: a homogeneous and sufficiently large dataset is of high importance
Note: one should be very careful when combining datasets from different sources
The two datasets were trained independently: Dieter's dataset (21 nt) and Satron's dataset (nt)
Three cut-off values were used: 0.5, 0.6, 0.7 (where y is the value of siRNA inhibitory activity)
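As a small illustration of the cut-off step, a labeling sketch in Python; the function name and the assumption that activity y lies in [0, 1] with "y ≥ cutoff means functional" are mine, not from the paper:

```python
def label_sirnas(records, cutoff=0.5):
    """Split (sequence, activity) records into functional / non-functional
    by a cut-off on inhibitory activity y (assumed normalized to [0, 1])."""
    functional = [s for s, y in records if y >= cutoff]
    nonfunctional = [s for s, y in records if y < cutoff]
    return functional, nonfunctional
```

Re-running this with cutoff 0.5, 0.6, and 0.7 gives the three positive/negative splits the paper trains on.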
The sequence-based model
$P_i^+(R_i)$ is the probability of nucleotide $R_i$ at sub-site $i$ ($i = 1, \ldots, 19$) for the functional siRNAs
$P_i^-(R_i)$ is the probability of nucleotide $R_i$ at sub-site $i$ ($i = 1, \ldots, 19$) for the non-functional siRNAs
$\Psi^+$, $\Psi^-$ indicate the attribute quality of the dataset as positive or negative, respectively
$\Psi_0$ follows from Markov chain theory:
$\Psi_0^+(R_1, \ldots, R_{19}) = P_1^+(R_1) \cdots P_{19}^+(R_{19})$
$\Psi_0^-(R_1, \ldots, R_{19}) = P_1^-(R_1) \cdots P_{19}^-(R_{19})$
$\Delta(R_1, \ldots, R_{19}) = \omega^+ \Psi_0^+(R_1, \ldots, R_{19}) - \omega^- \Psi_0^-(R_1, \ldots, R_{19})$
If $\Delta > 0$, predict functional; if $\Delta \leq 0$, predict non-functional
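The sequence-based scoring scheme can be sketched as follows. The pseudocount smoothing and the equal weights $\omega^+ = \omega^- = 1$ are assumptions for illustration, not taken from the paper:

```python
from collections import defaultdict

NUCLEOTIDES = "AUGC"
L = 19  # number of scored sub-sites

def position_probs(seqs, pseudocount=1.0):
    """Estimate P_i(R_i): per-position nucleotide probabilities,
    with a small pseudocount to avoid zero probabilities."""
    counts = [defaultdict(float) for _ in range(L)]
    for s in seqs:
        for i, nt in enumerate(s[:L]):
            counts[i][nt] += 1.0
    probs = []
    for i in range(L):
        total = sum(counts[i][nt] + pseudocount for nt in NUCLEOTIDES)
        probs.append({nt: (counts[i][nt] + pseudocount) / total
                      for nt in NUCLEOTIDES})
    return probs

def psi0(seq, probs):
    """Zeroth-order Markov score: product of per-position probabilities."""
    p = 1.0
    for i, nt in enumerate(seq[:L]):
        p *= probs[i][nt]
    return p

def classify(seq, probs_pos, probs_neg, w_pos=1.0, w_neg=1.0):
    """Delta = w+ * Psi0+ - w- * Psi0-; positive Delta => functional."""
    delta = w_pos * psi0(seq, probs_pos) - w_neg * psi0(seq, probs_neg)
    return "functional" if delta > 0 else "non-functional"
```

`position_probs` would be fit once on the functional set and once on the non-functional set, giving the two probability tables that `classify` compares.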
Support Vector Machine
The advantages of SVM: (1) avoids overfitting; (2) handles large feature spaces; (3) extracts key information from the datasets
Features: binary encoding (A), nucleotide composition (B), and thermodynamic properties (C), combined into seven vector spaces: A, B, C, AB, AC, BC, ABC
Result: adding the nucleotide-composition attribute brought a 6-7% enhancement in prediction
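A sketch of how the binary (A) and nucleotide-composition (B) feature spaces might be encoded; the exact encodings in the paper may differ, and the thermodynamic features (C) are omitted here:

```python
NUCLEOTIDES = "AUGC"

def binary_features(seq):
    """Feature space A (assumed encoding): one-hot binary vector,
    4 bits per position."""
    vec = []
    for nt in seq:
        vec.extend(1.0 if nt == b else 0.0 for b in NUCLEOTIDES)
    return vec

def composition_features(seq):
    """Feature space B (assumed encoding): overall nucleotide fractions."""
    return [seq.count(b) / len(seq) for b in NUCLEOTIDES]

def combined_features(seq):
    """An 'AB'-style combined vector: binary encoding plus composition."""
    return binary_features(seq) + composition_features(seq)
```

Concatenating the per-attribute vectors in this way is how the AB, AC, BC, and ABC spaces would be built.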
Balancing the biased dataset in SVM training
The method for improving the SVM algorithm with respect to the dataset:
Why the algorithm should be improved: when the positive and negative datasets differ greatly in record numbers, especially when the dataset is not sufficiently large, the SVM learning machine is inclined to make a biased prediction toward the class with the larger dataset, resulting in a high false-positive or false-negative rate.
Method:
1. Randomly choose a subset from the larger dataset with the same number of records as the smaller dataset;
2. Repeat step 1 ten times to construct ten combinations of this "sub-larger dataset + whole smaller dataset", making sure these combinations cover at least 99% of the larger dataset;
3. Train the ten combinations by SVM in the seven vector spaces one by one;
4. Take the average result of the ten combinations as the overall result.
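Steps 1-2 can be sketched as follows; the resample-until-coverage loop is one possible way to satisfy the 99% coverage requirement, since the paper does not specify how the subsets are drawn:

```python
import random

def balanced_combinations(larger, smaller, n_rounds=10, coverage=0.99, seed=0):
    """Undersample the larger class to the size of the smaller one,
    n_rounds times; re-draw all subsets until their union covers at
    least `coverage` of the larger dataset."""
    rng = random.Random(seed)
    while True:
        subsets = [rng.sample(larger, len(smaller)) for _ in range(n_rounds)]
        covered = set()
        for s in subsets:
            covered.update(s)
        if len(covered) >= coverage * len(larger):
            # each combination = one sub-larger dataset + whole smaller dataset
            return [s + list(smaller) for s in subsets]
```

Each returned combination would then be trained by SVM in each of the seven vector spaces, and the ten results averaged (steps 3-4).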
Both methods are robust across the different cut-off values, and the SVM performed better than the sequence-based statistical model
Useful math :
$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$
$Sensitivity = \frac{TP}{TP+FN}$
$Specificity = \frac{TN}{TN+FP}$
ROC (receiver operating characteristic)
$PCC = \frac{\sum(X-\overline{X})(Y-\overline{Y})}{\sqrt{\sum(X-\overline{X})^2}\,\sqrt{\sum(Y-\overline{Y})^2}}$
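These metrics as a minimal Python sketch, directly from the standard definitions:

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def pcc(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs))
           * math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den
```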