Fixing an encoding problem when doing named entity recognition with NLTK

Latest recommended article published on 2021-06-09 19:05:54

海里的咖啡豆


Categories: python NLP. Tags: NLTK

Article link: https://blog.csdn.net/weixin_41598638/article/details/79599831

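While using NLTK for named entity recognition, I ran into a UnicodeDecodeError: the 'utf8' codec could not decode byte 0xef. The solution is described in detail below.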

The following code, borrowed from another blogger's post, implements basic named entity recognition:

    # -*- coding: utf-8 -*-
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')    # set the default encoding to UTF-8 so that cmd handles the text correctly
    import nltk
    newfile = open('news.txt')
    text = newfile.read()  # read the file
    tokens = nltk.word_tokenize(text)  # tokenize
    tagged = nltk.pos_tag(tokens)  # part-of-speech tagging
    entities = nltk.chunk.ne_chunk(tagged)  # named entity recognition
    a1 = str(entities)  # convert the result to a string
    file_object = open('out.txt', 'w')
    file_object.write(a1)   # write it to the file
    file_object.close()
    print entities

However, running it produced the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xef in position 0: unexpected end of data
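
To unpack the message: 0xef is the lead byte of a three-byte UTF-8 sequence, and "unexpected end of data" is the reason the decoder gives when the bytes that should follow it are missing. The sketch below reproduces that reason in isolation. It uses the byte order mark U+FEFF (bytes EF BB BF) purely as a convenient example, which is an assumption, since the post does not say which character in news.txt triggered the error. The helper name `try_decode` is made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# U+FEFF (the byte order mark) encodes to the three bytes EF BB BF in UTF-8.
# Assumption: chosen only as an example of a sequence that starts with 0xef.
bom_bytes = u'\ufeff'.encode('utf-8')


def try_decode(raw):
    """Return the decoder's error reason, or None if raw is valid UTF-8."""
    try:
        raw.decode('utf-8')
        return None
    except UnicodeDecodeError as err:
        return err.reason


# Only the lead byte 0xef: the decoder runs out of data.
truncated_reason = try_decode(bom_bytes[:1])
# The complete three-byte sequence decodes without error.
complete_reason = try_decode(bom_bytes)

print(truncated_reason)
print(complete_reason)
```

In other words, the error suggests that somewhere a multi-byte character is being decoded one byte at a time, or from a truncated byte string, rather than as a whole.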

The solution is given below:

#!
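
The rest of the author's fix is not available here. As a minimal sketch of one common remedy, and not necessarily the author's, the file can be decoded to Unicode text up front instead of being passed around as raw bytes. This assumes that news.txt is UTF-8 encoded and possibly starts with a byte order mark; neither is confirmed by the post. The helper name `read_text` and the sample sentence are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path):
    """Read a UTF-8 file as Unicode text, dropping a leading BOM if present."""
    # 'utf-8-sig' decodes UTF-8 and skips the BOM (EF BB BF) when the file
    # starts with one; files without a BOM are read unchanged.
    with io.open(path, encoding='utf-8-sig') as handle:
        return handle.read()


# Build a small sample file that starts with a BOM (hypothetical content,
# standing in for news.txt).
sample = u'Barack Obama visited Paris.'
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as handle:
    handle.write(u'\ufeff'.encode('utf-8') + sample.encode('utf-8'))

text = read_text(sample_path)
os.remove(sample_path)
```

The resulting `text` is a Unicode string, so it can be handed to `nltk.word_tokenize` in place of the bytes returned by a plain `open('news.txt').read()`. When writing the result out, opening the output with `io.open('out.txt', 'w', encoding='utf-8')` keeps the encoding explicit on that side as well.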
