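NLTK is a toolkit for natural language processing that includes tokenizers for splitting text. With a plain Python installation you have to install the package first (for example with pip install nltk). I use Jupyter Notebook from Anaconda3, where it is already available, so the following import works right away: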
import nltk
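This post is about fetching the Punkt sentence tokenizer with nltk.download() and then using it, so the other functions in nltk are not covered. The tokenizer is meant for splitting large amounts of text into sentences.
Before use, you need to download the pickle file for the language in question (the files live under tokenizers/punkt), so download punkt first as shown below. The output shows where the package is installed on your machine, which may come in handy later. Unfortunately, there is no model for Chinese.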
nltk.download('punkt')
[nltk_data] Downloading package punkt to C:\Users\my
[nltk_data] computer\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
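To check from code whether a package is already present, one option is a small wrapper around nltk.data.find, which raises LookupError when a resource cannot be located. This is only a sketch: the helper name has_resource is hypothetical, and it assumes nltk is importable.

```python
import nltk

def has_resource(resource_name: str) -> bool:
    """Hypothetical helper: True if NLTK can locate the resource locally."""
    try:
        nltk.data.find(resource_name)
        return True
    except LookupError:
        return False

# Directories NLTK searches for downloaded data, in order.
print(nltk.data.path)

# Expected to be True once nltk.download('punkt') has succeeded on this machine.
print(has_resource('tokenizers/punkt'))
```

The directory reported by nltk.download() should appear among the entries of nltk.data.path.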
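My first attempt had neither the r prefix nor the encoding argument, and it failed for reasons I could not figure out; the path kept raising errors too. Adding those two made it work, and later I found that it also works without them, so I still do not know why it failed at first. One thing worth noting: according to the help text for nltk.data.load, encoding is only used for text formats, so it should make no difference when loading a pickle. If you have run into something similar, feel free to leave a comment.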
ch=nltk.data.load(r'tokenizers\\punkt\\english.pickle',encoding='utf-8')
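Part of the confusion may come from how Python treats backslashes. The sketch below uses plain strings only (no NLTK call) to show what each spelling of the path actually contains. The helper to_resource_name is hypothetical; it rests on the assumption that the forward-slash form, the one seen in tokenizers/punkt above, is the portable way to name an NLTK resource.

```python
import re

raw_double = r'tokenizers\\punkt\\english.pickle'   # as written in this post
plain_double = 'tokenizers\\punkt\\english.pickle'  # same text without the r prefix
forward = 'tokenizers/punkt/english.pickle'         # forward-slash form

print(raw_double)    # tokenizers\\punkt\\english.pickle
print(plain_double)  # tokenizers\punkt\english.pickle
print(forward)       # tokenizers/punkt/english.pickle

def to_resource_name(path: str) -> str:
    """Hypothetical helper: turn each run of backslashes into a single '/'."""
    return re.sub(r'\\+', '/', path)

print(to_resource_name(raw_double) == forward)  # True
```

With the r prefix every backslash is kept literally, so the raw string holds doubled separators, while the un-prefixed version holds single ones. Writing the path with forward slashes sidesteps the question entirely.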
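As the output below shows, the text is not split at all, which is unsurprising given that it contains no sentence-ending punctuation. Interestingly, in my tests the result changed once I added commas, although Punkt is designed to break on sentence-ending marks such as periods, question marks and exclamation points.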
ch.tokenize('I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people')
['I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people']
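To see how punctuation affects the split, here is a minimal sketch that compares three variants of a short text. It makes one loud assumption: it uses an untrained PunktSentenceTokenizer with default parameters instead of the pre-trained English model loaded above, so it needs no downloaded data but also knows no abbreviations. The sample sentences are made up for illustration.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained tokenizer with default parameters (assumption: NOT the
# pre-trained english.pickle model, so results may differ on real text).
tokenizer = PunktSentenceTokenizer()

no_punct = 'I am Chinese I love my country'
with_commas = 'I am Chinese, I love my country'
with_periods = 'I am Chinese. I love my country.'

# Print the sentence list produced for each variant.
for text in (no_punct, with_commas, with_periods):
    print(tokenizer.tokenize(text))
```

Running the same three strings through ch.tokenize is a quick way to check which punctuation marks the pre-trained model actually treats as sentence boundaries.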
The parameters of nltk.data.load are documented in its help text:
help(nltk.data.load)
Help on function load in module nltk.data:
load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)
Load a given resource from the NLTK data package. The following
resource formats are currently supported:
- ``pickle``
- ``json``
- ``yaml``
- ``cfg`` (context free grammars)
- ``pcfg`` (probabilistic CFGs)
- ``fcfg`` (feature-based CFGs)
- ``fol`` (formulas of First Order Logic)
- ``logic`` (Logical formulas to be parsed by the given logic_parser)
- ``val`` (valuation of First Order Logic model)
- ``text`` (the file contents as a unicode string)
- ``raw`` (the raw file contents as a byte string)
If no format is specified, ``load()`` will attempt to determine a
format based on the resource name's file extension. If that
fails, ``load()`` will raise a ``ValueError`` exception.
For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
it tries to decode the raw contents using UTF-8, and if that doesn't
work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
is specified.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
:type cache: bool
:param cache: If true, add this resource to a cache. If load()
finds a resource in its cache, then it will return it from the
cache rather than loading it.
:type verbose: bool
:param verbose: If true, print a message when loading a resource.
Messages are not displayed when a resource is retrieved from
the cache.
:type logic_parser: LogicParser
:param logic_parser: The parser that will be used to parse logical
expressions.
:type fstruct_reader: FeatStructReader
:param fstruct_reader: The parser that will be used to parse the
feature structure of an fcfg.
:type encoding: str
:param encoding: the encoding of the input; only used for text formats.
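The decoding rule described in the help text (try UTF-8, fall back to ISO-8859-1 unless encoding is given) can be sketched in a few lines. This is only an illustration of the documented behaviour and not NLTK's actual implementation; the function name decode_text is hypothetical.

```python
from typing import Optional

def decode_text(raw: bytes, encoding: Optional[str] = None) -> str:
    """Hypothetical sketch of the fallback described in the help text."""
    if encoding is not None:
        # An explicit encoding is used as-is, with no fallback.
        return raw.decode(encoding)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode('latin-1')

print(decode_text('é'.encode('utf-8')))  # é
print(decode_text(b'\xe9'))              # é (decoded via the Latin-1 fallback)
```

Since this only concerns text formats, it also suggests why passing encoding='utf-8' when loading a pickle should neither help nor hurt.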