[NLP] 动手实现邮件分类算法

数据集获取:sms.tsv
CSDN:https://download.csdn.net/download/stillxjy/11168492
Github: https://github.com/jalajthanaki/NLPython/blob/master/ch8/Spamflteringapplication/data/sms.tsv
在这里插入图片描述
代码实现细节分析:
(1)导入包

import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

(2)加载数据并显示

#数据路径
path = '/home/stillxjy/NLP/sms.tsv'
#读取表格数据
sms = pd.read_table(path,header=None,names=['label','message'])
#数据的大小,行,列
print(sms.shape)
#前10条数据
print(sms.head(10))
#非垃圾邮件和垃圾邮件的数目
print(sms.label.value_counts())

输出:共有5572份邮件,每个邮件由label和message组成,非垃圾邮件4825份,垃圾邮件747份

(5572, 2)
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...
ham     4825
spam     747

(3)预处理:将标签转换为数字

#根据类别标签,添加数值类别标签数据,将ham转换为0 spam转换为1
sms['label_num'] = sms.label.map({'ham':0,'spam':1})
#添加数值类别数据后的前10条数据
print(sms.head(10))

输出:

Name: label, dtype: int64
  label                                            message  label_num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...          0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0
5  spam  FreeMsg Hey there darling it's been 3 week's n...          1
6   ham  Even my brother is not like to speak with me. ...          0
7   ham  As per your request 'Melle Melle (Oru Minnamin...          0
8  spam  WINNER!! As a valued network customer you have...          1
9  spam  Had your mobile 11 months or more? U R entitle...          1

(4)数据集分割:将数据集分为训练集和测试集

X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

#将整个数据分为训练集和数据集
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
print('训练数据数目:{0}'.format(X_train.shape))
print('测试数据数目:{0}'.format(X_test.shape))
print('训练数据百分比: {0}'.format(X_train.shape[0]*1.0/X.shape[0]))

输出:

(5572,)
(5572,)
训练数据数目:(4179,)
测试数据数目:(1393,)
训练数据百分比: 0.75
输出矩阵中不为0的数据的情况:

(5)将文字数据,转换为频率矩阵数字数据。

#实例化
vect = CountVectorizer()

#获取所有邮件的单词表,并将邮件信息转换为频率矩阵
X_train_dtm = vect.fit_transform(X_train)
print('输出矩阵中不为0的数据的情况:\n {0}'.format(X_train_dtm))
print('单词表:{0}\n 单词的数目:{1}'.format(vect.get_feature_names(),len(vect.get_feature_names())))
print('频率矩阵大小:{0}'.format(X_train_dtm.shape))

#同理,根据已经获取的单词表,将测试数据转换为频率矩阵
X_test_dtm = vect.transform(X_test)

输出:所有4179份训练邮件中共有7456单词,所以最后训练数据X_train_dtm为7179x7456的矩阵
每一行代表一个邮件,第i行中第j列数据代表:在第i份邮件中单词j出现的次数

  (0, 2022)	1
  (0, 4779)	1
  (0, 4662)	1
  (0, 6892)	1
  (0, 6656)	1
  (0, 50)	1
  (0, 4743)	1
  (0, 4375)	1
  (0, 1552)	1
  (0, 264)	1
  (0, 4983)	1
  (0, 7424)	1
  (0, 3170)	1
  (0, 2864)	2
  (0, 4987)	1
  (0, 1572)	1
  (0, 3880)	1
  (0, 5479)	1
  (0, 3971)	1
  (0, 4781)	1
  (0, 5193)	1
  (0, 3181)	1
  (0, 509)	1
  (1, 6758)	1
  (1, 3316)	1
  :	:
  (4177, 3700)	1
  (4177, 837)	1
  (4177, 307)	1
  (4177, 6662)	1
  (4177, 6034)	1
  (4177, 4508)	1
  (4177, 2556)	1
  (4177, 5490)	1
  (4177, 254)	1
  (4177, 2744)	1
  (4177, 4778)	1
  (4177, 4446)	1
  (4177, 4255)	1
  (4177, 3629)	1
  (4177, 7257)	1
  (4177, 1574)	1
  (4177, 6887)	1
  (4177, 3738)	1
  (4177, 5656)	1
  (4177, 6514)	1
  (4177, 6656)	1
  (4178, 5999)	1
  (4178, 7257)	1
  (4178, 4238)	1
  (4178, 1691)	1
单词表:['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382', '08002888812', '08002986030', '08002986906', '08002988890', '08006344447', '0808', '08081263000', '08081560665', '0825', '083', '0844', '08448714184', '0845', '08450542832', '08452810071', '08452810073', '08452810075over18', '0870', '08700435505150p', '08700469649', '08700621170150p', '08701213186', '08701417012', '08701417012150p', '0870141701216', '087016248', '08701752560', '0870241182716', '08702840625', '08704050406', '08704439680', '08706091795', '0870737910216yrs', '08707509020', '08707808226', '08708034412', '08709222922', '08709501522', '0871', '087104711148', '08712101358', '08712103738', '0871212025016', '08712300220', '087123002209am', '08712317606', '08712400602450p', '08712400603', '08712402050', '08712402578', '08712402779', '08712402902', '08712402972', '08712404000', '08712405020', '08712405022', '08712460324', '0871277810710p', '0871277810810', '0871277810910p', '08714342399', '08714712379', '08714712388', '08714712394', '08714712412', '08715203028', '08715203649', '08715203652', '08715203685', '08715203694', '08715500022', '08715705022', '08717168528', '08717205546', '0871750', '08717898035', '08718711108', '08718720201', '08718723815', '08718725756', '08718726270', '087187262701', '08718726970', '08718726971', '08718726978', '08718727868', '08718727870', '08718727870150ppm', '08718730555', '08718730666', '08718738001', '08718738002', '08719180248', '08719181503', '08719181513', '08719899217', '08719899229', '08719899230', '09', '09050000301', '09050000332', '09050000460', '09050000555', '09050000878', '09050000928', '09050001295', '09050001808', '09050002311', '09050003091', '09050090044', '09050280520', '09053750005', '09056242159', '09057039994', '09058091854', '09058091870', '09058094454', '09058094455', '09058094507', '09058094565', '09058094583', '09058094594', '09058094597', '09058094599', '09058098002', '09058099801', '09061104276', '09061104283', '09061209465', '09061213237', '09061221061', '09061221066', '09061701444', '09061701461', '09061701939', '09061702893', '09061743386', '09061743806', '09061743810', '09061743811', '09061744553', '09061749602', '09061790121', '09061790125', '09061790126', '09063440451', '09063458130', '0906346330', '09064011000', '09064012103', '09064012160', '09064015307', '09064017295', '09064018838', '09064019014', '09064019788', '09065069120', '09065069154', '09065171142', '09065174042', '09065394514', '09065989180', '09065989182', '09066350750', '09066358152', '09066358361', '09066361921', '09066362231', '09066364311', '09066364349', '09066364589', '09066368470', '09066380611', '09066382422', '09066612661', '09066649731from', '09066660100', '09071512433', '09077818151', '09090204448', '09094100151', '09094646631', '09095350301', '09096102316', '09099725823', '09099726395', '09099726429', '09099726481', '09099726553', '09111032124', '09701213186', '0a', '10', '100', '1000', '1000s', '100p', '100percent', '1013', '1030', '10am', '10k', '10p', '10ppm', '10th', '11', '1120', '113', '1131', '114', '116', '118p', '11mths', '11pm', '12', '1205', '120p', '121', '1225', '123', '125', '1250', '125gift', '128', '12hrs', '12mths', '13', '130', '1327', '139', '14', '140', '1405', '140ppm', '145', '1450', '146tf150p', '14thmarch', '15', '150', '1500', '150p', '150p16', '150pm', '150ppermesssubscription', '150ppm', '150pw', '151', '153', '15pm', '16', '165', '1680', '169', '177', '18', '180', '1843', '18p', '18yrs', '195', '1956669', '1apple', '1b6a5ecef91ff9', '1cup', '1er', '1hr', '1im', '1lemon', '1mega', '1million', '1pm', '1st', '1st4terms', '1stchoice', '1thing', '1tulsi', '1win150ppmx3', '1winaweek', '1winawk', '1x150p', '1yf', '20', '200', '2000', '2003', '2004', '2005', '2006', '2007', '2025050', '20m12aq', '20p', '21', '21870000', '21st', '22', '220', '220cm2', '2309', '23f', '23g', '24', '24hrs', '24m', '24th', '25', '250', '250k', '255', '25p', '26', '2667', '26th', '27', '28', '2814032', '28days', '28th', '28thfeb', '29', '2b', '2c', '2channel', '2day', '2docd', '2end', '2ez', '2find', '2getha', '2geva', '2go', '2gthr', '2hrs', '2kbsubject', '2lands', '2marrow', '2moro', '2morow', '2morro', '2morrow', '2morrowxxxx', '2mro', '2mrw', '2mwen', '2nd', '2nite', '2optout', '2px', '2rcv', '2stop', '2stoptxt', '2u', '2watershd', '2waxsto', '2wks', '2wt', '2wu', '2yr', '2yrs', '30', '300', '3000', '300603', '300p', '3030', '30apr', '30ish', '30pm', '30pp', '30s', '31', '3100', '310303', '31p', '32', '32000', '326', '33', '330', '350', '3510i', '3650', '36504', '3680', '3750', '37819', '38', '382', '391784', '3aj', '3d', '3days', '3g', '3gbp', '3hrs', '3lions', '3lp', '3miles', '3mins', '3mobile', '3optical', '3pound', '3qxj9', '3rd', '3ss', '3uz', '3wks', '3x', '3xx', '40', '400', '400mins', '400thousad', '402', '40411', '40533', '40gb', '40mph', '41685', '41782', '420', '4217', '42810', '430', '434', '44', '440', '4403ldnw1a7rw18', '44345', '447797706009', '447801259231', '448712404000', '449050000301', '45', '450p', '450ppw', '45239', '45pm', '47', '4719', '4742', '47per', '48', '4882', '48922', '49', '49557', '4a', '4d', '4eva', '4few', '4fil', '4get', '4got', '4info', '4msgs', '4mths', '4qf2', '4t', '4th', '4txt', '4u', '4utxt', '4w', '4wrd', '4xx26', '4years', '50', '500', '5000', '50award', '50ea', '50gbp', '50p', '50perweeksub', '50perwksub', '50pm', '50ppm', '50rcvd', '50s', '515', '5226', '523', '5249', '526', '528', '530', '54', '542', '5digital', '5free', '5ish', '5k', '5min', '5mls', '5p', '5pm', '5th', '5wb', '5we', '5wkg', '5wq', '5years', '60', '600', '6031', '6089', '60p', '61', '61200', '61610', '62468', '630', '63miles', '645', '65', '66', '6669', '674', '67441233', '68866', '69101', '69669', '69696', '69698', '69855', '69866', '69888', '69888nyt', '69911', '69969', '69988', '6hrs', '6ish', '6missed', '6months', '6ph', '6pm', '6th', '6wu', '6zf', '700', '7250', '7250i', '730', '731', '75', '750', '75max', '762', '7634', '7684', '77', '7732584351', '78', '786', '7876150ppm', '7am', '7cfca1a', '7ish', '7oz', '7pm', '7th', '7ws', '800', '8000930705', '80062', '8007', '80082', '80086', '80122300p', '80155', '80182', '8027', '80488', '80608', '8077', '80878', '81010', '81151', '81303', '81618', '820554ad0a1705572711', '82242', '82277', '82324', '82468', '83049', '83110', '83118', '83222', '83332', '83338', '83355', '83600', '83738', '84', '84025', '84122', '84128', '84199', '84484', '85', '85023', '85069', '85222', '85233', '8552', '86021', '861', '864233', '86688', '86888', '87021', '87066', '87070', '87077', '87121', '87131', '872', '87239', '87575', '88039', '88066', '88088', '88222', '88600', '88800', '8883', '88877', '88888', '89034', '89070', '89080', '89105', '89545', '89555', '89693', '89938', '8am', '8ball', '8lb', '8p', '8pm', '8th', '8wp', '900', '9061100010', '910', '9153', '92h', '930', '9307622', '945', '95', '9755', '9758', '99', '9996', '9ae', '9am', '9pm', '9t', '9th', '9yt', '____', 'a21', 'a30', 'aah', 'aaniye', 'aaooooright', 'aathi', 'abbey', 'abdomen', 'abeg', 'abel', 'aberdeen', 'abi', 'ability', 'abiola', 'abj', 'able', 'abnormally', 'about', 'aboutas', 'abroad', 'absence', 'absolutely', 'absolutly', 'abstract', 'abt', 'abta', 'ac', 'academic', 'acc', 'accent', 'accenture', 'accept', 'access', 'accessible', 'accidant', 'accident', 'accidentally', 'accommodation', 'accomodate', 'accomodations', 'accordin', 'accordingly', 'account', 'accounting', 'accounts', 'achan', 'ache', 'achieve', 'acid', 'acl03530150pm', 'acnt', 'aco', 'across', 'act', 'acted', 'actin', 'acting', 'action', 'activ8', 'activate', 'active', 'activities', 'actor', 'actual', 'actually', 'ad', 'adam', 'add', 'addamsfa', 'added', 'addicted', 'addie', 'adding', 'address', 'adewale', 'adjustable', 'admin', 'administrator', 'admirer', 'admission', 'admit', 'adore', 'adoring', 'adp', 'adress', 'adrian', 'adrink', 'ads', 'adsense', 'adult', 'adults', 'advance', 'adventure', 'advice', 'advise', 'advising', 'advisors', 'aeronautics', 'aeroplane', 'affair', 'affairs', 'affection', 'affections', 'affidavit', 'afford', 'afghanistan', 'afraid', 'africa', 'african', 'aft', 'after', 'afternon', 'afternoon', 'afterwards', 'aftr', 'ag', 'again', 'against', 'age', 'age16', 'age23', 'agency', 'agent', 'agents', 'ages', 'agidhane', 'aging', 'ago', 'agree', 'ah', 'aha', 'ahead', 'ahhh', 'ahhhh', 'ahmad', 'aids', 'aig', 'aight', 'ain', 'aint', 'air', 'air1', 'airport', 'airtel', 'aiya', 'aiyah', 'aiyar', 'aiyo', 'ajith', 'ak', 'aka', 'akon', 'al', 'alaipayuthe', 'album', 'alcohol', 'aldrine', 'alert', 'alertfrom', 'alerts', 'alex', 'alfie', 'algarve', 'algebra', 'ali', 'alian', 'alibi', 'alive', 'all', 'allah', 'allday', 'allow', 'allowed', 'allows', 'almost', 'alone', 'along', 'alot', 'already', 'alright', 'alrite', 'also', 'alto18', 'alwa', 'always', 'alwys', 'am', 'amanda', 'amazing', 'ambitious', 'american', 'ami', 'amk', 'amla', 'amma', 'ammae', 'ammo', 'among', 'amongst', 'amore', 'amount', 'amp', 'amplikater', 'amrita', 'ams', 'amt', 'amused', 'an', 'analysis', 'anand', 'and', 'anderson', 'andre', 'andres', 'andrews', 'angry', 'animal', 'animation', 'anna', 'annie', 'anniversary', 'announced', 'announcement', 'annoyin', 'annoying', 'anot', 'another', 'ans', 'ansr', 'answer', 'answered', 'answerin', 'answering', 'answers', 'answr', 'antelope', 'antha', 'anthony', 'anti', 'antibiotic', 'any', 'anybody', 'anymore', 'anyone', 'anyones', 'anyplaces', 'anythiing', 'anythin', 'anything', 'anythingtomorrow', 'anytime', 'anyway', 'anyways', 'anywhere', 'aom', 'apart', 'apartment', 'apes', 'apeshit', 'aphex', 'apnt', 'apo', 'apologetic', 'apologise', 'apologize', 'apology', 'app', 'apparently', 'appeal', 'appear', 'appendix', 'applausestore', 'applebees', 'apples', 'application', 'apply', 'applying', 'appointment', 'appointments', 'appreciate', 'appropriate', 'approve', 'approved', 'approx', 'apps', 'appt', 'appy', 'april', 'aproach', 'aptitude', 'aquarius', 'ar', 'arab', 'arabian', 'arcade', 'archive', 'ard', 'are', 'area', 'aren', 'arent', 'arestaurant', 'aretaking', 'argentina', 'argh', 'argue', 'arguing', 'argument', 'arguments', 'aries', 'arise', 'arises', 'arm', 'armand', 'armenia', 'arms', 'arng', 'around', 'aroundn', 'arr', 'arrange', 'arranging', 'arrested', 'arrival', 'arrive', 'arrow', 'arsenal', 'art', 'arts', 'arty', 'arul', 'arun', 'as', 'asap', 'asda', 'ashes', 'ashley', 'ashwini', 'asia', 'asian', 'asjesus', 'ask', 'askd', 'asked', 'askin', 'asking', 'asks', 'asleep', 'asp', 'ass', 'assessment', 'asshole', 'assistance', 'associate', 'asssssholeeee', 'assume', 'assumed', 'asthere', 'astne', 'astrology', 'astronomer', 'asusual', 'at', 'ate', 'athletic', 'athome', 'atlanta', 'atlast', 'atleast', 'atm', 'atrocious', 'attached', 'attempt', 'attend', 'attending', 'attention', 'attitude', 'attraction', 'attractive', 'attracts', 'attributed', 'auction', 'audition', 'audrey', 'audrie', 'august', 'aunt', 'aunties', 'aunts', 'aunty', 'aust', 'australia', 'auto', 'autocorrect', 'av', 'availa', 'available', 'avalarr', 'avatar', 'avble', 'ave', 'avent', 'avenue', 'avin', 'avo', 'avoid', 'avoids', 'await', 'awaiting', 'awake', 'award', 'awarded', 'away', 'awesome', 'awkward', 'aww', 'awww', 'ax', 'axel', 'axis', 'ay', 'ayn', 'b4', 'b4190604', 'b4280703', 'b4u', 'b4utele', 'ba', 'ba128nnfwfly150ppm', 'babe', 'babes', 'baby', 'babygoodbye', 'babyjontet', 'babysit', 'babysitting', 'back', 'backdoor', 'backwards', 'bad', 'badass', 'badly', 'badrith', 'bag', 'bags', 'bahamas', 'bailiff', 'bak', 'bakra', 'balance', 'ball', 'baller', 'balloon', 'bambling', 'band', 'bandages', 'bani', 'bank', 'banneduk', 'bar', 'barbie', 'bare', 'barely', 'bari', 'barkleys', 'barmed', 'barred', 'barrel', 'barring', 'barry', 'bars', 'base', 'based', 'basic', 'basically', 'basket', 'basketball', 'bat', 'batch', 'batchlor', 'bath', 'bathe', 'bathing', 'bathroom', 'battery', 'bay', 'bb', 'bbc', 'bbd', 'bbdeluxe', 'bbq', 'bc', 'bcaz', 'bck', 'bcm', 'bcm4284', 'bcmsfwc1n3xx', 'bcoz', 'bcum', 'bcums', 'bcz', 'bday', 'be', 'beads', 'bear', 'bears', 'beatings', 'beauties', 'beautiful', 'beauty', 'bec', 'becaus', 'because', 'becausethey', 'become', 'becoz', 'becz', 'bed', 'bedbut', 'bedrm', 'bedroom', 'beehoon', 'been', 'beendropping', 'beer', 'beers', 'befor', 'before', 'beg', 'beggar', 'begging', 'begin', 'begins', 'begun', 'behalf', 'behave', 'behind', 'bein', 'being', 'believe', 'belive', 'bell', 'bellearlier', 'belligerent', 'belly', 'belong', 'belovd', 'beloved', 'belt', 'ben', 'bend', 'beneficiary', 'benefits', 'bennys', 'bergkamp', 'best', 'best1', 'bet', 'beta', 'betta', 'better', 
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值