Stanford ML Course in Python (Week 7 — Exercise ex6, Part 2)

This post completes the second part of programming exercise ex6, Spam Classification, in Python. The exercise's Introduction is as follows:

Many email services today provide spam filters that are able to classify emails into spam and non-spam email with high accuracy. In this part of the exercise, you will use SVMs to build your own spam filter.

The code is as follows:

ex6_

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 10:56:17 2020

@author: Lonely_hanhan
"""

'''
%% Machine Learning Online Class
%  Exercise 6 | Spam Classification with SVMs
%
%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     gaussianKernel.m
%     dataset3Params.m
%     processEmail.m
%     emailFeatures.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

'''
#import matplotlib.pyplot as plt
import scipy.io as sio
import numpy as np
#import scipy.optimize as op
import processEmail as pr
import emailFeatures as ef
from sklearn import svm
import getVocabList as ge

'''
%% ==================== Part 1: Email Preprocessing ====================
%  To use an SVM to classify emails into Spam v.s. Non-Spam, you first need
%  to convert each email into a vector of features. In this part, you will
%  implement the preprocessing steps for each email. You should
%  complete the code in processEmail.m to produce a word indices vector
%  for a given email.
'''
print('\nPreprocessing sample email (emailSample1.txt)\n')
#Extract Features
# Use a raw string so backslashes in the Windows path are not treated as escapes
file_contents = open(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\emailSample1.txt', 'r').read()
word_indices = pr.processEmail(file_contents)
# Print stats
print('Word Indices: ')
print(word_indices)


'''
%% ==================== Part 2: Feature Extraction ====================
%  Now, you will convert each email into a vector of features in R^n. 
%  You should complete the code in emailFeatures.m to produce a feature
%  vector for a given email.
'''
print('\nExtracting features from sample email (emailSample1.txt)\n')
features = ef.emailFeatures(word_indices)
# Print stats
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))



'''
%% =========== Part 3: Train Linear SVM for Spam Classification ========
%  In this section, you will train a linear classifier to determine if an
%  email is Spam or Not-Spam.

% Load the Spam Email dataset
% You will have X, y in your environment
'''

Data = sio.loadmat(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\spamTrain.mat')
X = Data['X']
y = Data['y'].flatten()

print('\nTraining Linear SVM (Spam Classification)\n')
print('(this may take 1 to 2 minutes) ...\n')

C = 0.1
clf = svm.SVC(C, kernel='linear', tol=1e-3)
clf.fit(X, y)
p = clf.predict(X)
print('Training Accuracy: {}'.format(np.mean(p == y) * 100))
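Since the kernel is linear, predict() reduces to the sign of w·x + b, which is what makes the weight inspection in Part 5 possible. A minimal sketch on toy data (the points below are made up for illustration, not course data):

```python
import numpy as np
from sklearn import svm

# Hypothetical, hand-made 2-D dataset.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y)

# For a binary linear SVM, predict() returns 1 exactly when w.x + b > 0.
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
manual = (scores > 0).astype(int)
print(manual)  # matches clf.predict(X)
```

The same identity is why `clf.coef_` can be read as a per-feature (here, per-word) weight vector.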
'''
%% =================== Part 4: Test Spam Classification ================
%  After training the classifier, we can evaluate it on a test set. We have
%  included a test set in spamTest.mat

% Load the test dataset
% You will have Xtest, ytest in your environment
'''
Data = sio.loadmat(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\spamTest.mat')
Xtest = Data['Xtest']
ytest = Data['ytest'].flatten()

p = clf.predict(Xtest)
print('Test Accuracy: {}'.format(np.mean(p == ytest) * 100))



'''
%% ================= Part 5: Top Predictors of Spam ====================
%  Since the model we are training is a linear SVM, we can inspect the
%  weights learned by the model to understand better how it is determining
%  whether an email is spam or not. The following code finds the words with
%  the highest weights in the classifier. Informally, the classifier
%  'thinks' that these words are the most likely indicators of spam.
%
'''
vocabList = ge.getVocabList()
# clf.coef_ holds the learned feature weights; np.argsort sorts ascending and
# returns indices, so reverse with [::-1] to get descending order
indices = np.argsort(clf.coef_).flatten()[::-1]
print(indices)

# Print the top 15 words, matching the course handout
for i in range(15):
    print('{} ({:0.6f})'.format(vocabList[indices[i]], clf.coef_.flatten()[indices[i]]))
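The argsort-and-reverse trick used above can be sanity-checked on a small made-up weight vector:

```python
import numpy as np

# Hypothetical weights standing in for clf.coef_.flatten().
weights = np.array([0.1, -0.3, 0.7, 0.05, 0.4])

# np.argsort sorts ascending; [::-1] reverses to descending.
top = np.argsort(weights)[::-1]
print(top[:3])  # indices of the three largest weights: [2 4 0]
```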


'''
%% =================== Part 6: Try Your Own Emails =====================
%  Now that you've trained the spam classifier, you can use it on your own
%  emails! In the starter code, we have included spamSample1.txt,
%  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples. 
%  The following code reads in one of these emails and then uses your 
%  learned SVM classifier to determine whether the email is Spam or 
%  Not Spam

% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different emails types). Try your own emails as well!
'''

print('\nPreprocessing sample email (emailSample2.txt)\n')
#Extract Features
file_contents = open(r'D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\emailSample2.txt', 'r').read()
word_indices = pr.processEmail(file_contents)
# Print stats
print('Word Indices: ')
print(word_indices)

print('\nExtracting features from sample email (emailSample2.txt)\n')
features = ef.emailFeatures(word_indices)
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))

Xm = features.reshape(1, -1)  # sklearn expects 2-D input: (n_samples, n_features)
pm = clf.predict(Xm)

print(pm) # 0 is not spam, 1 is spam

 processEmail.py

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 11:38:24 2020

@author: Lonely_duoha
"""
import getVocabList as ge
import re
import nltk, nltk.stem.porter
import numpy as np

def processEmail(email_contents):
    '''
    %PROCESSEMAIL preprocesses the body of an email and
    %returns a list of word_indices 
    %   word_indices = PROCESSEMAIL(email_contents) preprocesses 
    %   the body of an email and returns a list of indices of the 
    %   words contained in the email. 
    %
    '''
    # Load Vocabulary
    vocabList = ge.getVocabList()
    word_indices = np.array([], dtype=np.int64)
    # Lower case
    email_contents = email_contents.lower()
    #Strip all HTML
    #Looks for any expression that starts with < and ends with > and replace
    #and does not have any < or > in the tag it with a space
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Handle numbers: look for one or more digits
    email_contents = re.sub(r'\d+', 'number', email_contents)
    # Handle URLs: look for strings starting with http:// or https://
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    # Handle email addresses: look for strings with @ in the middle
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # Handle $ signs
    email_contents = re.sub('[$]+', 'dollar', email_contents)
    # ===================== Tokenize Email =====================
    # Output the email to screen as well
    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer()

    # Split on punctuation and whitespace (note: '-' is escaped so '.-:'
    # is not interpreted as a character range that swallows the digits)
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;% ]', email_contents)
    # Tokenize and also get rid of any punctuation
    for token in tokens:
        # Remove any non alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Stem the word (reduce it to its root form)
        token = stemmer.stem(token)
        
        if len(token) < 1:
            continue
        print(token)
        for k, v in vocabList.items():
            if token == v:
                # The token exists in the vocabulary: record its index
                word_indices = np.append(word_indices, k)
    
    print('==================')
    return word_indices
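The regex normalization steps above can be exercised on a short hypothetical snippet (same substitution order as in processEmail):

```python
import re

# Made-up email fragment, not from the course data.
s = 'Visit <a href="x">our site</a> at http://deals.example.com, only $100!'
s = s.lower()
s = re.sub(r'<[^<>]+>', ' ', s)                      # strip HTML tags
s = re.sub(r'\d+', 'number', s)                      # digits -> number
s = re.sub(r'(http|https)://[^\s]*', 'httpaddr', s)  # URLs -> httpaddr
s = re.sub('[$]+', 'dollar', s)                      # $ signs -> dollar
print(s)
```

Note how `$100` collapses to `dollarnumber`, which is exactly where the `dollarnumb` token in the sample output below comes from (after stemming).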
                        

 getVocabList.py

# -*- coding: utf-8 -*-
"""
Created on Fri Jan  3 11:44:58 2020

@author: Lonely_hanhan
"""

def getVocabList():
    '''
    %GETVOCABLIST reads the fixed vocabulary list in vocab.txt and returns a
    %cell array of the words
    %   vocabList = GETVOCABLIST() reads the fixed vocabulary list in vocab.txt 
    %   and returns a cell array of the words in vocabList.
    '''
    # Read the fixed vocabulary list
    #fid = open('D:\exercise\machine-learning-ex6\machine-learning-ex6\ex6\vocab.txt', 'r').read()
    # Store all dictionary words in cell array vocab{}
    #n = 1899 # Total number of words in the dictionary
    '''
    % For ease of implementation, we use a struct to map the strings => integers
    % In practice, you'll want to use some form of hashmap
    '''
    vocabList = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocabList[int(val)] = key
    return vocabList
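A note on lookup cost: processEmail above scans vocabList.items() once per token, which is O(|vocab|) per lookup. Because the mapping is one-to-one, inverting the dict once gives O(1) lookups; a minimal sketch with a made-up three-word vocabulary:

```python
# Hypothetical vocabulary in the same {index: word} shape as getVocabList().
vocabList = {1: 'anyon', 2: 'know', 3: 'host'}

# Invert once: word -> index.
word_to_index = {word: idx for idx, word in vocabList.items()}

tokens = ['know', 'zzz', 'host']  # 'zzz' is out of vocabulary
word_indices = [word_to_index[t] for t in tokens if t in word_to_index]
print(word_indices)  # [2, 3]
```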
            
            

 emailFeatures.py

# -*- coding: utf-8 -*-
"""
Created on Sun Jan  5 16:05:09 2020

@author: Lonely_hanhan
"""
import numpy as np
import getVocabList as ge
def emailFeatures(word_indices):
    '''
    %EMAILFEATURES takes in a word_indices vector and produces a feature vector
    %from the word indices
    %   x = EMAILFEATURES(word_indices) takes in a word_indices vector and 
    %   produces a feature vector from the word indices. 

    
    '''
    #  Total number of words in the dictionary
    n = 1899
    
    # You need to return the following variables correctly.
    x = np.zeros((n,1))
    
    '''
    % ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return a feature vector for the
%               given email (word_indices). To help make it easier to 
%               process the emails, we have have already pre-processed each
%               email and converted each word in the email into an index in
%               a fixed dictionary (of 1899 words). The variable
%               word_indices contains the list of indices of the words
%               which occur in one email.
% 
%               Concretely, if an email has the text:
%
%                  The quick brown fox jumped over the lazy dog.
%
%               Then, the word_indices vector for this text might look 
%               like:
%               
%                   60  100   33   44   10     53  60  58   5
%
%               where, we have mapped each word onto a number, for example:
%
%                   the   -- 60
%                   quick -- 100
%                   ...
%
%              (note: the above numbers are just an example and are not the
%               actual mappings).
%
%              Your task is take one such word_indices vector and construct
%              a binary feature vector that indicates whether a particular
%              word occurs in the email. That is, x(i) = 1 when word i
%              is present in the email. Concretely, if the word 'the' (say,
%              index 60) appears in the email, then x(60) = 1. The feature
%              vector should look like:
%
%              x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];
    
    
    '''
    vocabList = ge.getVocabList()
    m = len(word_indices)
    for i in range(m):
        if word_indices[i] in vocabList.keys():
            x[word_indices[i]-1] = 1
    return x
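The indicator loop above can also be written with NumPy fancy indexing, keeping the exercise's 1-indexed convention (the indices below are made up):

```python
import numpy as np

n = 1899                                     # vocabulary size, as in the exercise
word_indices = np.array([86, 916, 794, 86])  # hypothetical 1-indexed indices

x = np.zeros((n, 1))
x[word_indices - 1] = 1   # shift to 0-indexed; duplicate indices are harmless
print(int(x.sum()))       # 3 distinct words set
```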

 Output:

Preprocessing sample email (emailSample1.txt)

==== Processed Email ====
anyon
know
how
much
it
cost
to
host
a
web
portal
well
it
depend
on
how
mani
visitor
you
re
expect
thi
can
be
anywher
from
less
than
number
buck
a
month
to
a
coupl
of
dollarnumb
you
should
checkout
httpaddr
or
perhap
amazon
ecnumb
if
your
run
someth
big
to
unsubscrib
yourself
from
thi
mail
list
send
an
email
to
emailaddr
==================
Word Indices: 
[  86  916  794 1077  883  370 1699  790 1822 1831  883  431 1171  794
 1002 1893 1364  592 1676  238  162   89  688  945 1663 1120 1062 1699
  375 1162  479 1893 1510  799 1182 1237  810 1895 1440 1547  181 1699
 1758 1896  688 1676  992  961 1477   71  530 1699  531]

Extracting features from sample email (emailSample1.txt)

Length of feature vector: 1899
Number of non-zero entries: 45

Training Linear SVM (Spam Classification)

(this may take 1 to 2 minutes) ...

Training Accuracy: 99.825
Test Accuracy: 98.9
[1190  297 1397 ... 1764 1665 1560]
otherwis (0.500614)
clearli (0.465916)
remot (0.422869)
gt (0.383622)
visa (0.367710)
base (0.345064)
doesn (0.323632)
wife (0.269724)
previous (0.267298)
player (0.261169)
mortgag (0.257298)
natur (0.253941)
ll (0.253467)
futur (0.248297)
hot (0.246404)

Preprocessing sample email (emailSample2.txt)

==== Processed Email ====
folk
my
first
time
post
have
a
bit
of
unix
experi
but
am
new
to
linux
just
got
a
new
pc
at
home
dell
box
with
window
xp
ad
a
second
hard
diskfor
linux
partit
the
disk
and
have
instal
suse
number
number
from
cd
which
wentfin
except
it
didn
t
pick
up
my
monitor
i
have
a
dell
brand
enumberfpp
number
lcd
flat
panel
monitor
and
a
nvidia
geforcenumbertinumb
video
card
both
of
which
are
probabl
too
new
to
featur
in
suse
s
defaultset
i
download
a
driver
from
the
nvidia
websit
and
instal
it
use
rpm
then
i
ran
saxnumb
as
wa
recommend
in
some
post
i
found
on
the
net
butit
still
doesn
t
featur
my
video
card
in
the
avail
list
what
next
anoth
problem
i
have
a
dell
brand
keyboard
and
if
i
hit
capslock
twice
the
whole
machin
crash
in
linux
not
window
even
the
on
off
switch
isinact
leav
me
to
reach
for
the
power
cabl
instead
if
anyon
can
help
me
in
ani
way
with
these
prob
i
d
be
realli
grate
i
ve
search
the
net
but
have
run
out
of
idea
or
should
i
be
go
for
a
differ
version
of
linux
such
as
redhat
opinionswelcom
thank
a
lot
peter
irish
linux
user
group
emailaddrhttpaddr
for
un
subscript
inform
list
maintain
emailaddr
==================
Word Indices: 
[ 662 1084  652 1694 1280  756  186 1162 1752  594  225   64 1099 1699
  960  902  726 1099 1228  124  787  427  208 1860 1855 1885   21 1464
  752  960 1217 1666  464   74  756  847 1627 1120 1120  688  259 1840
  583  883  450 1249 1760 1084 1061  756  427  210 1120 1208 1061   74
 1792  246  204 1162 1840 1308 1708 1099 1699  626  825 1627  487  492
  688 1666 1824   74  847  883 1437 1671  116 1803 1376  825 1545 1280
  677 1171 1666 1095 1590  476  626 1084 1792  246  825 1666  139  961
 1835 1101   80 1309  756  427  210  909   74  810  785 1666 1845  988
  380  825  960 1113 1855  571 1666 1171 1163 1630  940 1018 1699 1365
  666 1666 1284  230  850  810   86  238  771 1018  825   75 1860 1675
  162 1371 1785 1462 1666 1095  225  756 1440 1192 1162  805 1182 1510
  162  718  666  452 1790 1162  960 1613  116 1379 1664  980  876  960
 1773  735  666 1744 1610  840  961  995  531]

Extracting features from sample email (emailSample2.txt)

Length of feature vector: 1899
Number of non-zero entries: 119
[0]

 

I have just started blogging to record my learning progress and deepen my understanding of the material. If anything here is wrong or could be improved, corrections are welcome.
