Demonstration of two novel methods for predicting functional siRNA efficiency

Background: siRNAs play a central role in gene silencing (RNAi), but silencing efficiency differs greatly between target sites. There is therefore a high demand for reliable siRNA prediction tools and for design methods able to pick out siRNAs with high silencing potential.

This paper uses two methods: a sequence-based statistical model and a support vector machine (SVM).

Dataset:

Dieter’s Dataset and Satron’s Dataset

Note: a homogeneous and sufficiently large dataset is of high importance.

Note: one should be very careful when combining datasets from different sources.

The two datasets, Dieter's dataset (21 nt) and Satron's dataset (nt), were trained independently.

Three cut-off values are used: 0.5, 0.6, and 0.7 (y = the value of siRNA inhibitory activity).

The sequence-based model

$P_i^+(R_i)$ is the probability of nucleotide $R_i$ at sub-site $i$ ($i = 1, \ldots, 19$) for the functional siRNAs

$P_i^-(R_i)$ is the probability of nucleotide $R_i$ at sub-site $i$ ($i = 1, \ldots, 19$) for the non-functional siRNAs

$\Psi^+$ and $\Psi^-$ indicate the attribute quality of the dataset as positive or negative, respectively

$\Psi_0$ corresponds to the zeroth-order Markov chain model, in which the 19 sub-sites are treated as independent:

$\Psi_0^+(R_1, \ldots, R_{19}) = P_1^+(R_1) \cdots P_{19}^+(R_{19})$

$\Psi_0^-(R_1, \ldots, R_{19}) = P_1^-(R_1) \cdots P_{19}^-(R_{19})$

$\Delta(R_1, \ldots, R_{19}) = \omega^+ \Psi_0^+(R_1, \ldots, R_{19}) - \omega^- \Psi_0^-(R_1, \ldots, R_{19})$

If $\Delta > 0$, the siRNA is predicted functional; if $\Delta \leq 0$, non-functional.
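As a minimal sketch of the sequence-based model, assuming the per-site probabilities $P_i^\pm(R_i)$ are estimated as position-specific frequencies (the pseudocount and the default weights $\omega^\pm = 1$ here are illustrative assumptions, not values from the paper):

```python
import math
from collections import Counter

BASES = "ACGU"

def position_probs(seqs, length=19, pseudo=1.0):
    """Estimate P_i(R_i): per-position nucleotide frequencies with a pseudocount."""
    counts = [Counter() for _ in range(length)]
    for s in seqs:
        for i, base in enumerate(s[:length]):
            counts[i][base] += 1
    n = len(seqs)
    return [{b: (counts[i][b] + pseudo) / (n + pseudo * len(BASES)) for b in BASES}
            for i in range(length)]

def delta(seq, p_pos, p_neg, w_pos=1.0, w_neg=1.0):
    """Delta = w+ * Psi0+ - w- * Psi0-, with Psi0 the product of sub-site probabilities."""
    psi_pos = math.prod(p_pos[i][b] for i, b in enumerate(seq))
    psi_neg = math.prod(p_neg[i][b] for i, b in enumerate(seq))
    return w_pos * psi_pos - w_neg * psi_neg

def predict(seq, p_pos, p_neg):
    """Classify by the sign of Delta."""
    return "functional" if delta(seq, p_pos, p_neg) > 0 else "non-functional"
```

`position_probs` would be fitted once on the functional set and once on the non-functional set; `predict` then applies the $\Delta$ decision rule to a new 19-nt sequence.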

Support Vector Machine

The advantages of SVM: (1) avoids overfitting; (2) handles large feature spaces; (3) extracts key information from the datasets.

Features: binary encoding, nucleotide composition, and thermodynamic properties, combined into seven vector spaces (A, B, C, AB, AC, BC, ABC).

Result: adding the nucleotide-composition attribute gave a 6-7% improvement in prediction.
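A sketch of how two of the feature sets and their combination might look. The exact encodings are not spelled out in these notes, so the choices below are assumptions: "binary" is taken as one-hot encoding of each position, and "nucleotide composition" as mono- and dinucleotide frequencies; a combined space such as AB is simply the concatenation of the two vectors.

```python
from collections import Counter

BASES = "ACGU"

def binary_features(seq):
    """One-hot ('binary') encoding: 4 bits per position, 76-D for a 19-nt siRNA."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def composition_features(seq):
    """Nucleotide composition: mono- and dinucleotide frequencies (4 + 16 = 20-D)."""
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    feats = [mono[b] / len(seq) for b in BASES]
    feats += [di[a + b] / (len(seq) - 1) for a in BASES for b in BASES]
    return feats

def space_AB(seq):
    """Combined vector space: concatenation of the two base feature sets."""
    return binary_features(seq) + composition_features(seq)
```

The seven spaces A, B, C, AB, AC, BC, ABC would then each be a different concatenation of the three base feature sets, with a separate SVM trained per space.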

Balancing the biased dataset in SVM training

The method of improving SVM algorithm about the dataset:

Why the algorithm should be improved: based on these observations, the authors hypothesize that when the record numbers of the positive and negative datasets differ greatly, especially when the dataset is not sufficiently large, the SVM is inclined to make a biased prediction toward the class with more records, resulting in a high false-positive or false-negative rate.

Method:

1. Randomly choose a subset from the larger dataset so that the subset has the same number of records as the smaller dataset;

2. Repeat step 1 ten times to construct ten combinations of "sub-larger dataset + whole smaller dataset". Make sure that these combinations cover at least 99% of the larger dataset.

3. Train the ten combinations with SVM in each of the seven vector spaces, one by one.

4. Take the average result of the ten combinations as the overall result.

Both methods are robust across the different cut-off values, and the SVM performed better than the sequence-based statistical model.

Useful math :

$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$

$Sensitivity = \frac{TP}{TP + FN}$

$Specificity = \frac{TN}{TN + FP}$

ROC (receiver operating characteristic) curve

$PCC = \frac{\sum (X - \overline{X})(Y - \overline{Y})}{\sqrt{\sum (X - \overline{X})^2} \sqrt{\sum (Y - \overline{Y})^2}}$
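The evaluation measures above can be computed directly from the confusion-matrix counts and the paired activity values:

```python
import math

def accuracy(tp, fp, tn, fn):
    """(TP + TN) / (TP + FP + TN + FN)"""
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    """TP / (TP + FN): fraction of functional siRNAs correctly predicted."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): fraction of non-functional siRNAs correctly predicted."""
    return tn / (tn + fp)

def pcc(xs, ys):
    """Pearson correlation coefficient between predicted and measured activities."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs))
           * math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den
```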
