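I've been a bit busy lately, so updates are slow. Sorry about that.
The requirements are as above; my implementation follows:
(The code for this exercise is fairly simple, so I won't go into detail. Professor Ng stresses throughout the course that mature SVM libraries already exist. The point of the exercise is to understand SVMs and get comfortable using them, not to worry about how the SVM library is implemented.)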
1. Implement gaussianKernel, which computes the similarity between two examples
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return the similarity between x1
% and x2 computed using a Gaussian kernel with bandwidth
% sigma
%
%
sim = exp(-(x1 - x2)'*(x1 - x2)./(2*sigma*sigma));
% =============================================================
Let's look at the output:
Evaluating the Gaussian Kernel ...
Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = 2.000000 :
0.324652
(for sigma = 2, this value should be about 0.324652)
Program paused. Press enter to continue.
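To see where 0.324652 comes from, here is a small standalone check that does not need any of the exercise files. It just redoes the arithmetic by hand for the same x1, x2 and sigma shown in the output; the variable names d and sq_dist are mine, not from the exercise.

```octave
% Standalone check of the Gaussian kernel value printed above.
x1 = [1; 2; 1];
x2 = [0; 4; -1];
sigma = 2;

d = x1 - x2;                          % [1; -2; 2]
sq_dist = d' * d;                     % squared distance: 1 + 4 + 4 = 9
sim = exp(-sq_dist / (2 * sigma^2));  % exp(-9/8)

fprintf('%f\n', sim);                 % prints 0.324652
```

So the kernel is just exp(-9/8) here: the further apart the two points are relative to sigma, the closer the similarity gets to 0.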
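2. Implement dataset3Params: try every combination from the two 8-element candidate lists (C and sigma) and pick the pair with the lowest prediction_error on the cross validation set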
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return the optimal C and sigma
% learning parameters found using the cross validation set.
% You can use svmPredict to predict the labels on the cross
% validation set. For example,
% predictions = svmPredict(model, Xval);
% will return the predictions on the cross validation set.
%
% Note: You can compute the prediction error using
% mean(double(predictions ~= yval))
%
Clist = [ 0.01 0.03 0.1 0.3 1 3 10 30];
Sigmalist = [ 0.01 0.03 0.1 0.3 1 3 10 30];
prediction_error = [];
for i=1:8
    for j=1:8
        model = svmTrain(X, y, Clist(i), @(x1, x2) gaussianKernel(x1, x2, Sigmalist(j)));
        predictions = svmPredict(model,Xval);
        % Flat index: i selects C, j selects sigma
        prediction_error((i-1)*8 + j) = mean(double(predictions ~= yval));
    end
end
[~,p] = min(prediction_error,[],2);
% Invert p = (i-1)*8 + j; subtracting 1 first keeps multiples of 8 in range
C = Clist(floor((p-1)/8) + 1);
sigma = Sigmalist(mod(p-1,8) + 1);
% =========================================================================
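As an aside, another way to get from the smallest error back to the (C, sigma) pair is to keep the errors in an 8x8 matrix and let ind2sub do the index arithmetic. The sketch below is only an illustration: the err matrix is a made-up placeholder (not real cross validation results) whose minimum is deliberately placed at row 5, column 3, and the names err, row and col are my own.

```octave
Clist     = [0.01 0.03 0.1 0.3 1 3 10 30];
Sigmalist = [0.01 0.03 0.1 0.3 1 3 10 30];

% Placeholder error surface, NOT real results: zero at (row 5, col 3),
% growing with distance from that cell. Rows stand for C, columns for sigma.
[cc, rr] = meshgrid(1:8, 1:8);        % rr(r,c) = r, cc(r,c) = c
err = (rr - 5).^2 + (cc - 3).^2;

[~, p] = min(err(:));                 % linear (column-major) index of the minimum
[row, col] = ind2sub(size(err), p);   % convert back to row/column subscripts

C = Clist(row);
sigma = Sigmalist(col);
fprintf('C = %g, sigma = %g\n', C, sigma);   % prints C = 1, sigma = 0.1
```

In the real function, err(i, j) would be filled inside the double loop with the cross validation error for Clist(i) and Sigmalist(j).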
Let's look at the SVM decision boundary:
Looks good. Moving on.
3. Implement processEmail
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to add the index of str to
% word_indices if it is in the vocabulary. At this point
% of the code, you have a stemmed word from the email in
% the variable str. You should look up str in the
% vocabulary list (vocabList). If a match exists, you
% should add the index of the word to the word_indices
% vector. Concretely, if str = 'action', then you should
% look up the vocabulary list to find where in vocabList
% 'action' appears. For example, if vocabList{18} =
% 'action', then, you should add 18 to the word_indices
% vector (e.g., word_indices = [word_indices ; 18]; ).
%
% Note: vocabList{idx} returns a the word with index idx in the
% vocabulary list.
%
% Note: You can use strcmp(str1, str2) to compare two strings (str1 and
% str2). It will return 1 only if the two strings are equivalent.
%
for idx=1:length(vocabList)
    % vocabList{idx} yields the word itself; vocabList(idx) would be a 1x1 cell
    if strcmp(str, vocabList{idx})
        word_indices = [word_indices; idx];
    end
end
% =============================================================
Let's look at the output:
=========================
Word Indices:
86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992 961 1477 71 530 1699 531
Program paused. Press enter to continue.
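The lookup can also be done without the inner loop, because strcmp accepts a whole cell array and returns one logical per entry. Here is a minimal sketch of that idea. The four-word vocabList and the email_words list are hypothetical stand-ins so the snippet runs on its own; the real vocabulary has 1899 entries.

```octave
vocabList   = {'about'; 'action'; 'anyon'; 'know'};  % hypothetical mini vocabulary
email_words = {'anyon', 'know', 'zzz', 'action'};    % hypothetical stemmed words

word_indices = [];
for k = 1:numel(email_words)
    str = email_words{k};
    idx = find(strcmp(str, vocabList));   % empty when str is not in the vocabulary
    word_indices = [word_indices; idx];   % appending an empty idx changes nothing
end

disp(word_indices');                      % 3 4 2
```

'zzz' is not in the list, so it is silently skipped, which is the same behaviour as the loop version.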
4. Implement emailFeatures
% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return a feature vector for the
% given email (word_indices). To help make it easier to
% process the emails, we have have already pre-processed each
% email and converted each word in the email into an index in
% a fixed dictionary (of 1899 words). The variable
% word_indices contains the list of indices of the words
% which occur in one email.
%
% Concretely, if an email has the text:
%
% The quick brown fox jumped over the lazy dog.
%
% Then, the word_indices vector for this text might look
% like:
%
% 60 100 33 44 10 53 60 58 5
%
% where, we have mapped each word onto a number, for example:
%
% the -- 60
% quick -- 100
% ...
%
% (note: the above numbers are just an example and are not the
% actual mappings).
%
% Your task is take one such word_indices vector and construct
% a binary feature vector that indicates whether a particular
% word occurs in the email. That is, x(i) = 1 when word i
% is present in the email. Concretely, if the word 'the' (say,
% index 60) appears in the email, then x(60) = 1. The feature
% vector should look like:
%
% x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];
%
%
for idx=1:length(word_indices)
    x(word_indices(idx)) = 1;
end
% =========================================================================
Let's look at the output:
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor you re expect thi can be anywher from less than number buck a month
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb
if your run someth big to unsubscrib yourself from thi mail list send an
email to emailaddr
=========================
Length of feature vector: 1899
Number of non-zero entries: 45
Program paused. Press enter to continue.
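Note that repeated words only set the same entry again, which is why the 53 word indices printed earlier end up as 45 non-zero entries. A tiny sketch of that behaviour, using a made-up dictionary size of 10 and made-up indices (the exercise itself uses 1899), and the vectorized form of the assignment:

```octave
n = 10;                        % hypothetical dictionary size, for illustration only
word_indices = [3; 7; 3; 9];   % made-up indices; word 3 occurs twice

x = zeros(n, 1);
x(word_indices) = 1;           % vectorized: same effect as the loop over idx

fprintf('%d ', x'); fprintf('\n');              % 0 0 1 0 0 0 1 0 1 0
fprintf('non-zero entries: %d\n', nnz(x));      % non-zero entries: 3
```

Four indices go in, but only three entries are set, because the feature vector records whether a word occurs, not how often.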
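Finally, I actually went and found a real spam message to try this on.
I masked the numbers with xxxx. These people are annoying, but their privacy still deserves some respect, so the numbers are hidden.
I created a new file, spamSample12019.txt, to test with: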
Dear:
You need to invoice, it is worth paying attention!
Professional agent to open. Each. The place. Zheng. Regulations. Send. Ticket points discount! Manager Zhang 1326516xxxx WeChat / QQ: 180890xxxx
Here is the classification result:
==== Processed Email ====
dear you need to invoic it is worth pai attent profession agent to open each
the place zheng regul send ticket point discount manag zhang number wechat qq
number
=========================
Processed spamSample12019.txt
Spam Classification: 1
(1 indicates spam, 0 indicates not spam)
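As a nobody who works on low-level circuits, all I can say after watching this is: 666 (very impressive).
That's all. Thanks, everyone.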