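Jieba is a Chinese word segmentation component. It can be used to segment Chinese sentences into words, tag parts of speech, and recognize out-of-vocabulary words, and it supports user dictionaries. Its segmentation accuracy is stated to be above 97%. This post describes how to download Jieba and install it for Python.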
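Features
- Supports three segmentation modes:
  - Precise mode: tries to cut the sentence as accurately as possible; suited to text analysis.
  - Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity.
  - Search-engine mode: builds on precise mode and splits long words again to improve recall; suited to search-engine tokenization.
- Supports Traditional Chinese segmentation
- Supports custom dictionaries
- MIT license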
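To make the idea of full mode more concrete, here is a minimal toy sketch in Python 3. It is not Jieba's implementation: the dictionary `TOY_DICT`, the function `full_mode`, and the `max_len` parameter are all hypothetical names invented for illustration. It simply reports every dictionary word found at any position, which is why overlapping words appear and ambiguity is left unresolved.

```python
# Toy illustration of "full mode": list every dictionary word in a sentence.
# TOY_DICT and full_mode are hypothetical; this is NOT how jieba is implemented.
TOY_DICT = {"来到", "北京", "清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary, max_len=4):
    """Return every substring (up to max_len characters) found in the dictionary."""
    words = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = sentence[start:end]
            if piece in dictionary:
                words.append(piece)
    return words


result = full_mode("我来到北京清华大学", TOY_DICT)
print("/ ".join(result))  # 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
```

Note that 清华, 清华大学, 华大 and 大学 all overlap. A precise-mode segmenter would instead have to pick one consistent way to cut the sentence.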
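Installation
The code is compatible with both Python 2 and Python 3.
- Fully automatic installation: easy_install jieba, or pip install jieba / pip3 install jieba
- Semi-automatic installation: download the package from https://pypi.python.org/pypi/jieba/ , extract it, and run python setup.py install
- Manual installation: place the jieba directory in the current directory or in the site-packages directory
- Use it in your code with import jieba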
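Whichever installation route is used, it can help to confirm that Python can actually find the package. The following is a small sketch assuming Python 3; the helper name `is_installed` is made up for this example, and the version lookup falls back to "unknown" in case the installed release does not expose a `__version__` attribute.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name is on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("jieba"):
    import jieba
    print("jieba found, version:", getattr(jieba, "__version__", "unknown"))
else:
    print("jieba not found; try: pip install jieba")
```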
Official page: http://pypi.python.org/pypi/jieba/
Author's download link: http://download.csdn.net/detail/sanqima/9470715
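2) Extract it to D:\TDDownload, as shown in Figure (1):
3) Click [Start] in the lower-left corner of the desktop -> Run -> type cmd -> switch to the directory that contains Jieba, for example D:\TDDownload\jieba-0.35, and run the following commands in order: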
```
C:\Users\Administrator>D:
D:\>cd D:\TDDownload\jieba-0.35
D:\TDDownload\jieba-0.35>python setup.py install
```
As shown in Figure (2):
4) In Spyder, write a small Chinese word segmentation program, fenCi.py:
## fenCi.py
```python
# -*- coding: utf-8 -*-
import jieba

# Full mode: list every word that can be formed
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))

# Precise mode, requested explicitly
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

# Precise mode is the default when cut_all is omitted
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

# Search-engine mode
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")
print(", ".join(seg_list))
```
Running fenCi.py produces the following output: