Machine Learning in Action: Naive Bayes

Latest recommended article published 2022-09-28 20:10:33

小猫奇点

Category: Machine Learning

Article link: https://blog.csdn.net/sophiezjz/article/details/82220652

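Problem 1
Source: Filtering spam email with naive Bayes
Description: spamTest() and textParse() fail at runtime when reading the email files
Errors: UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence
TypeError: cannot use a string pattern on a bytes-like object
Solution: 1. In textParse(), change listOfTokens = re.split(r'\W*', bigString) to
   regEx = re.compile(r'\W+')
   bigString = bigString.decode('GBK', 'ignore')
   listOfTokens = regEx.split(bigString)
   The pattern is \W+ rather than \W*: since Python 3.7, re.split() also splits on empty matches, so \W* would break the text into single characters.
   2. In spamTest(), change wordList = textParse(open('spam/%d.txt' % i).read()) and wordList = textParse(open('ham/%d.txt' % i).read()) to
   wordList = textParse(open('spam/%d.txt' % i, 'rb').read())
   wordList = textParse(open('ham/%d.txt' % i, 'rb').read())
Cause: (1) The book's code targets Python 2 but is being run under Python 3. In Python 3, open() in text mode decodes the file with the platform default encoding (GBK here), which raises the UnicodeDecodeError. Reading in 'rb' mode returns bytes, and applying a str regex pattern to bytes raises the TypeError, so the bytes must be decoded first. (2) The files contain bytes that cannot be decoded, so 'ignore' is passed as the errors argument of decode() to skip them.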
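
The fix above can be put together as a minimal Python 3 sketch. The names text_parse and read_tokens, the encoding parameter, and the sample bytes are assumptions made for illustration and are not from the book. The length filter (tokens longer than 2 characters, lowercased) follows the usual form of textParse().

```python
import re


def text_parse(raw_bytes, encoding="gbk"):
    """Decode raw bytes, then split into lowercase tokens longer than 2 characters.

    `encoding` is an illustrative parameter; the post uses GBK.
    """
    # errors="ignore" drops bytes that cannot be decoded instead of raising
    text = raw_bytes.decode(encoding, errors="ignore")
    # \W+ splits on runs of non-word characters
    tokens = re.split(r"\W+", text)
    return [tok.lower() for tok in tokens if len(tok) > 2]


def read_tokens(path):
    """Hypothetical helper: read a file as bytes and tokenize it."""
    with open(path, "rb") as f:  # binary mode, so no implicit decoding happens
        return text_parse(f.read())


# Sample input containing a stray 0xae byte, like the one in the error message
sample = b"Hello, this is a TEST \xae, message for the spam filter!"
print(text_parse(sample))
```

Using a with block in read_tokens also closes each file after reading, which the one-line open(...).read() calls do not guarantee.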

Problem 2
Source: Using naive Bayes