Natural Language Processing with Python, Study Notes: Chapter 3 Errata

weixin_39638012

Published 2020-12-10 14:03:14

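Updated 2017-12-06: Many code samples produce results that differ from the book's because the Python version differs. If you run into a problem, consult the English edition.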

Chapter 3, p. 87 contains a snippet that processes HTML:

>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens

Running it, however, fails with the following error:

>>> raw = nltk.clean_html(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/nltk/util.py", line 356, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

The source listing on the official site, http://www.nltk.org/_modules/nltk/util.html, shows why:

def clean_html(html):
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")

def clean_url(url):
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")

According to http://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript, later versions of NLTK no longer appear to support the clean_html() and clean_url() functions:

Support for clean_html and clean_url will be dropped for future versions of nltk. Please use BeautifulSoup for now...it's very unfortunate.

To process HTML, use the Beautiful Soup package from http://www.crummy.com/software/BeautifulSoup/ instead.

Install it with: sudo pip install beautifulsoup4

Then replace the book's code with the following (Python 2, matching the traceback above):

from __future__ import division
import nltk, re, pprint
from urllib import urlopen
from bs4 import BeautifulSoup

def read_html():
    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    text = soup.get_text()
    print text
    tokens = nltk.word_tokenize(text)
    print tokens

def main():
    read_html()

if __name__ == '__main__':
    main()

The script above runs standalone, and its output matches the book's.
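Since version differences are the recurring source of trouble here, a Python 3 variant of the same idea may be useful. This is only a minimal sketch, not code from the book: the helper names html_to_text and fetch_text are made up for illustration, the parser is named explicitly as "html.parser", and script/style elements are removed by hand because, depending on the Beautiful Soup version, get_text() may otherwise return their contents (the issue raised in the Stack Overflow question linked earlier).

```python
# Python 3 sketch; html_to_text and fetch_text are hypothetical helper names.
from urllib.request import urlopen

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    # Name the parser explicitly so Beautiful Soup does not have to guess one.
    soup = BeautifulSoup(html, "html.parser")
    # Depending on the bs4 version, get_text() may include the contents of
    # <script> and <style>, so drop those elements before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the text fragments with single spaces and trim surrounding whitespace.
    return soup.get_text(separator=" ", strip=True)


def fetch_text(url):
    """Download a page and return its visible text (requires network access)."""
    with urlopen(url) as response:
        return html_to_text(response.read())


if __name__ == "__main__":
    # Inline sample so the demo runs without network access.
    sample = (
        "<html><body><p>Hello <b>world</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(html_to_text(sample))
```

The string returned by html_to_text (or by fetch_text for a live URL) can then be passed to nltk.word_tokenize, as in the script above, provided the NLTK tokenizer data is installed.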
