Demonstration of two novel methods for predicting functional siRNA efficiency
Background: siRNAs play a key role in gene silencing (RNAi), but silencing efficiency differs greatly between target sites. There is therefore high demand for reliable siRNA prediction tools and for design methods able to pick out siRNAs with high silencing potential.
This paper uses two methods: a sequence-based statistical model and a Support Vector Machine (SVM)
Dataset:
Dieter’s Dataset and Satron’s Dataset
Note: a homogeneous and sufficiently large dataset is of high importance
Note: one should be very careful when combining datasets from different sources
The two datasets were trained independently: Dieter's dataset (21 nt) and Satron's dataset (nt)
Three cut-off values were used: 0.5, 0.6, 0.7 (where y is the value of siRNA inhibitory activity)
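As a small illustration of the cut-off step, a labeling sketch in Python; the function name and the assumption that activity y lies in [0, 1] with "y ≥ cutoff means functional" are mine, not from the paper:

```python
def label_sirnas(records, cutoff=0.5):
    """Split (sequence, activity) records into functional / non-functional
    by a cut-off on inhibitory activity y (assumed normalized to [0, 1])."""
    functional = [s for s, y in records if y >= cutoff]
    nonfunctional = [s for s, y in records if y < cutoff]
    return functional, nonfunctional
```

Re-running this with cutoff 0.5, 0.6, and 0.7 gives the three positive/negative splits the paper trains on.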
The sequence-based model
$P_i^+(R_i)$ is the probability of nucleotide $R_i$ at sub-site $i$ ($i = 1, \ldots, 19$) for the functional siRNAs
$P_i^-(R_i)$ is the probability of nucleotide $R_i$ at sub-site $i$ ($i = 1, \ldots, 19$) for the non-functional siRNAs
$\Psi^+$, $\Psi^-$ indicate the attribute quality of the dataset as positive or negative, respectively
$\Psi_0$ follows from Markov chain theory:
$\Psi_0^+(R_1, \ldots, R_{19}) = P_1^+(R_1) \cdots P_{19}^+(R_{19})$
$\Psi_0^-(R_1, \ldots, R_{19}) = P_1^-(R_1) \cdots P_{19}^-(R_{19})$
$\Delta(R_1, \ldots, R_{19}) = \omega^+ \Psi_0^+(R_1, \ldots, R_{19}) - \omega^- \Psi_0^-(R_1, \ldots, R_{19})$
If $\Delta > 0$, predict functional; if $\Delta \leq 0$, predict non-functional
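The sequence-based scoring scheme can be sketched as follows. The pseudocount smoothing and the equal weights $\omega^+ = \omega^- = 1$ are assumptions for illustration, not taken from the paper:

```python
from collections import defaultdict

NUCLEOTIDES = "AUGC"
L = 19  # number of scored sub-sites

def position_probs(seqs, pseudocount=1.0):
    """Estimate P_i(R_i): per-position nucleotide probabilities,
    with a small pseudocount to avoid zero probabilities."""
    counts = [defaultdict(float) for _ in range(L)]
    for s in seqs:
        for i, nt in enumerate(s[:L]):
            counts[i][nt] += 1.0
    probs = []
    for i in range(L):
        total = sum(counts[i][nt] + pseudocount for nt in NUCLEOTIDES)
        probs.append({nt: (counts[i][nt] + pseudocount) / total
                      for nt in NUCLEOTIDES})
    return probs

def psi0(seq, probs):
    """Zeroth-order Markov score: product of per-position probabilities."""
    p = 1.0
    for i, nt in enumerate(seq[:L]):
        p *= probs[i][nt]
    return p

def classify(seq, probs_pos, probs_neg, w_pos=1.0, w_neg=1.0):
    """Delta = w+ * Psi0+ - w- * Psi0-; positive Delta => functional."""
    delta = w_pos * psi0(seq, probs_pos) - w_neg * psi0(seq, probs_neg)
    return "functional" if delta > 0 else "non-functional"
```

`position_probs` would be fit once on the functional set and once on the non-functional set, giving the two probability tables that `classify` compares.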
Support Vector Machine
The advantages of SVM: (1) avoids overfitting; (2) handles large feature spaces; (3) extracts key information from the datasets
Features: binary encoding (A), nucleotide composition (B), and thermodynamic properties (C), combined into seven vector spaces: A, B, C, AB, AC, BC, ABC
Result: adding the nucleotide-composition attribute brought a 6-7% enhancement in prediction
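A sketch of how the binary (A) and nucleotide-composition (B) feature spaces might be encoded; the exact encodings in the paper may differ, and the thermodynamic features (C) are omitted here:

```python
NUCLEOTIDES = "AUGC"

def binary_features(seq):
    """Feature space A (assumed encoding): one-hot binary vector,
    4 bits per position."""
    vec = []
    for nt in seq:
        vec.extend(1.0 if nt == b else 0.0 for b in NUCLEOTIDES)
    return vec

def composition_features(seq):
    """Feature space B (assumed encoding): overall nucleotide fractions."""
    return [seq.count(b) / len(seq) for b in NUCLEOTIDES]

def combined_features(seq):
    """An 'AB'-style combined vector: binary encoding plus composition."""
    return binary_features(seq) + composition_features(seq)
```

Concatenating the per-attribute vectors in this way is how the AB, AC, BC, and ABC spaces would be built.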
Balancing the biased dataset in SVM training
The method for improving the SVM algorithm with respect to the dataset:
Why the algorithm should be improved: when the positive and negative datasets differ greatly in record numbers, especially when the dataset is not sufficiently large, the SVM learning machine is inclined to make a biased prediction toward the class with the larger dataset, resulting in a high false-positive or false-negative rate.
Method:
1. Randomly choose a subset from the larger dataset with the same number of records as the smaller dataset;
2. Repeat step 1 ten times to construct ten combinations of this "sub-larger dataset + whole smaller dataset", making sure these combinations cover at least 99% of the larger dataset;
3. Train the ten combinations by SVM in the seven vector spaces one by one;
4. Take the average result of the ten combinations as the overall result.
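Steps 1-2 can be sketched as follows; the resample-until-coverage loop is one possible way to satisfy the 99% coverage requirement, since the paper does not specify how the subsets are drawn:

```python
import random

def balanced_combinations(larger, smaller, n_rounds=10, coverage=0.99, seed=0):
    """Undersample the larger class to the size of the smaller one,
    n_rounds times; re-draw all subsets until their union covers at
    least `coverage` of the larger dataset."""
    rng = random.Random(seed)
    while True:
        subsets = [rng.sample(larger, len(smaller)) for _ in range(n_rounds)]
        covered = set()
        for s in subsets:
            covered.update(s)
        if len(covered) >= coverage * len(larger):
            # each combination = one sub-larger dataset + whole smaller dataset
            return [s + list(smaller) for s in subsets]
```

Each returned combination would then be trained by SVM in each of the seven vector spaces, and the ten results averaged (steps 3-4).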
Both methods are robust across the different cut-off values, and the SVM performed better than the sequence-based statistical model
Useful math :
$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$
$Sensitivity = \frac{TP}{TP+FN}$
$Specificity = \frac{TN}{TN+FP}$
ROC (receiver operating characteristic)
$PCC = \frac{\sum(X-\overline{X})(Y-\overline{Y})}{\sqrt{\sum(X-\overline{X})^2}\,\sqrt{\sum(Y-\overline{Y})^2}}$
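These metrics as a minimal Python sketch, directly from the standard definitions:

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def pcc(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs))
           * math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den
```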