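Pitfalls I hit in natural language processing: segmenting Chinese text with Python jieba, filtering the result, and building a stop-word list. A comprehensive list of common Chinese stop words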


腾阳


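Categories: Pitfalls I ran into while learning Python and how I solved them; Natural language processing study notes. Tags: jieba word segmentation, natural language processing, python, stop words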

Article link: https://blog.csdn.net/weixin_41931602/article/details/80430380


Original code:

import codecs
import jieba

def natural_language_processing(self, response):
    # Run natural language processing on the scraped corpus
    title = response.meta['title']
    content = response.meta['content']
    raw_documents = [title, content]
    print(raw_documents[0])
    print(raw_documents[1])

    # Build the stop words once, before the loop, instead of re-reading the file per document
    # Simple alternative: stopwords = {}.fromkeys(['。', '：', '，', ' ', '《', '》', '、', '（', '）', '“', '”', '；', '\n'])
    # Pass the encoding explicitly so the words compare equal to jieba's unicode tokens
    with codecs.open('stop.txt', encoding='utf-8') as fp:
        # strip() removes the line ending and padding; slicing with ln[:-2] cut two characters blindly
        stopwords = set(ln.strip() for ln in fp)

    corpora_documents = []
    # Word segmentation
    for item_text in raw_documents:
        item_seg = list(jieba.cut(item_text))
        for word in item_seg:
            # Keep words that are not stop words and are longer than one character
            if word not in stopwords and len(word) > 1:
                print(word)
                corpora_documents.append(word)
        print(corpora_documents)

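Two common ways to remove stop words:

The first is to simply strip out a handful of unwanted punctuation marks.

Advantage: simple and quick, with very little code.

Disadvantage: it costs time, because every unwanted symbol has to be listed by hand, which gets painful once there are many special symbols to exclude.

The second is to build a stop-word list.

Advantage: comprehensive. It filters out special characters, particles, and all sorts of other noise words.

Disadvantage: more code, and you have to watch out for encoding issues.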
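
The two approaches can be contrasted with a minimal sketch, written for Python 3. The token list is hypothetical and stands in for the output of `list(jieba.cut(text))`; the helper names `filter_punctuation`, `load_stopwords` and `filter_stopwords`, and the temporary `stop.txt`, are made up for illustration. The file is assumed to be saved as UTF-8.

```python
import os
import tempfile

# Method 1: a small hand-written set of punctuation marks to drop.
PUNCTUATION = {'。', '：', '，', ' ', '《', '》', '、', '（', '）', '“', '”', '；', '\n'}


def filter_punctuation(tokens):
    """Drop only the tokens listed in PUNCTUATION."""
    return [t for t in tokens if t not in PUNCTUATION]


def load_stopwords(path, encoding='utf-8'):
    """Method 2: read one stop word per line into a set.

    The encoding is passed explicitly so the words compare equal to the
    text tokens; 'utf-8' is an assumption about how the file was saved.
    """
    with open(path, encoding=encoding) as fp:
        return {line.strip() for line in fp if line.strip()}


def filter_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stopwords]


# Hypothetical tokens, standing in for the output of list(jieba.cut(text)).
tokens = ['我们', '，', '一个', '自然语言', '处理', '。']

# Write a tiny demo stop-word file so the example runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), 'stop.txt')
with open(demo_path, 'w', encoding='utf-8') as fp:
    fp.write('，\n。\n一个\n')

by_punctuation = filter_punctuation(tokens)
by_stopwords = filter_stopwords(tokens, load_stopwords(demo_path))
print(by_punctuation)  # punctuation removed, '一个' still present
print(by_stopwords)    # punctuation and '一个' both removed
```

A set is used for the stop words because membership tests stay fast as the list grows, whereas a Python list is scanned from the start on every lookup.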

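Below are common filter words for Chinese word segmentation. Paste them into a file and save it; when you need them, open the file and load every entry.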

!  
"  
#  
$  
%  
&  
'  
(  
)  
*  
+  
,  
-  
--  
.  
..  
...  
......  
...................  
./  
.一  
.数  
.日  
/  
//  
0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
:  
://  
::  
;  
<  
=  
>  
>>  
?  
@  
A  
Lex  
[  
\  
]  
^  
_  
`  
exp  
sub  
sup  
|  
}  
~  
~~~~  
·  
×  
×××  
Δ  
Ψ  
γ  
μ  
φ  
φ．  
В  
—  
——  
———  
‘  
’  
’‘  
“  
”  
”，  
…  
……  
…………………………………………………③  
′∈  
′｜  
℃  
Ⅲ  
↑  
→  
∈［  
∪φ∈  
≈  
①  
②  
②ｃ  
③  
③］  
④  
⑤  
⑥  
⑦  
⑧  
⑨  
⑩  
──  
■  
▲  
　  
、  
。  
〈  
〉  
《  
》  
》），  
」  
『  
』  
【  
】  
〔  
〕  
〕〔  
㈧  
一  
一.  
一一  
一下  
一个  
一些  
一何  
一切  
一则  
一则通过  
一天  
一定
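
Text pasted from a web page often carries trailing spaces or duplicate lines, so it may be worth normalising the list before saving it. The following is a sketch for Python 3 under that assumption; `clean_stopword_lines`, `save_stopwords` and the temporary output path are hypothetical names, not part of the original code.

```python
import os
import tempfile


def clean_stopword_lines(raw_text):
    """Return the unique stop words in first-seen order.

    Only ASCII spaces, tabs and carriage returns are trimmed, so an entry
    such as the full-width space (U+3000) survives as a stop word.
    """
    seen = set()
    words = []
    for line in raw_text.split('\n'):
        word = line.strip(' \t\r')
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def save_stopwords(words, path):
    """Write one stop word per line, always as UTF-8."""
    with open(path, 'w', encoding='utf-8', newline='\n') as fp:
        fp.write('\n'.join(words) + '\n')


# A few entries copied from the list above, with trailing spaces and one
# duplicate added to imitate text pasted from a web page.
pasted = '!  \n……  \n\u3000  \n一个  \n一些  \n一个  \n'

cleaned = clean_stopword_lines(pasted)
out_path = os.path.join(tempfile.mkdtemp(), 'stop.txt')  # hypothetical location
save_stopwords(cleaned, out_path)

# Read the file back to confirm it round-trips unchanged.
with open(out_path, encoding='utf-8') as fp:
    reloaded = fp.read().split('\n')[:-1]
print(cleaned == reloaded)
```

Trimming with an explicit character set rather than a bare `strip()` is a deliberate choice here: in Python 3 a bare `strip()` also removes the full-width space, which would silently delete that entry from the list.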
