Using nltk.data.load() and Things to Watch Out For

Author: 不只是小白

Category: Natural Language Processing. Tags: machine learning, python

Article link: https://blog.csdn.net/misshanbao/article/details/103807348

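NLTK is a natural language processing toolkit for Python. With a plain Python installation you need to install the package before using it. I work in Jupyter Notebook (Anaconda3), where NLTK is already available, so it can be imported directly: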

import nltk

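This post focuses on nltk.data.load(), so the other functions in NLTK are not covered. Here the function is used to load the Punkt sentence tokenizer, which splits large amounts of text into sentences.
Before using it, you need to download the pickle file for the language you want (the files live under tokenizers/punkt), so download punkt first as shown below. The output shows where the data is installed on your machine, which may be needed later. Unfortunately, there is no Chinese model.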

nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\my
[nltk_data]     computer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!





True
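
The download step above can be wrapped so that it only runs when the data is actually missing. The following is a minimal sketch, not part of the original post: the helper name ensure_resource is hypothetical, and it relies on nltk.data.find raising LookupError when a resource cannot be found in any search path.

```python
def ensure_resource(resource_path, package):
    """Return the local path of an NLTK resource, downloading its package if missing.

    ensure_resource is a hypothetical helper written for this post.
    """
    # Imported inside the function so the helper can be defined before NLTK is installed.
    import nltk

    try:
        return str(nltk.data.find(resource_path))
    except LookupError:
        # The resource is not in any NLTK search path yet, so fetch its package.
        nltk.download(package)
        return str(nltk.data.find(resource_path))


# Usage (requires NLTK and, on first run, network access):
# print(ensure_resource("tokenizers/punkt", "punkt"))
```

The returned path should point at the same nltk_data directory that nltk.download reports in its log.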

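My first attempt had neither the r prefix nor the encoding argument, and it raised errors, mostly about the path, for reasons I never pinned down. Adding both made it work, and later I found that it also works without them. According to the help text further down, encoding is only used for text formats, so it has no effect on a pickle file, and the resource name is a URL, so forward slashes can be used in place of backslashes. If you have run into something similar, feel free to leave a comment.

ch = nltk.data.load('tokenizers/punkt/english.pickle')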
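
One possible source of the path confusion is how backslashes behave in Python string literals: in a raw string, a doubled backslash stays doubled. The sketch below uses only the standard library to show this and to convert a Windows-style path into a forward-slash resource name. The helper to_resource_name is a hypothetical name introduced here, not an NLTK function.

```python
import re


def to_resource_name(path):
    """Collapse any run of backslashes or slashes into a single forward slash."""
    return re.sub(r"[\\/]+", "/", path)


windows_style = r"tokenizers\\punkt\\english.pickle"

# A raw string keeps both backslashes, so r"\\" is two characters long.
print(len(r"\\"))

# The normalized name uses forward slashes only.
print(to_resource_name(windows_style))
```

The normalized string can then be passed to nltk.data.load as the resource name.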

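The call below returns the text unsplit, because the input contains no sentence-ending punctuation. Punkt splits text into sentences, so a comma does not create a split, whereas sentence-ending punctuation such as a period does.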

ch.tokenize('I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people')

['I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people']

Below is the help output for nltk.data.load, which lists its parameters:

help(nltk.data.load)

Help on function load in module nltk.data:

load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)
    Load a given resource from the NLTK data package.  The following
    resource formats are currently supported:
    
      - ``pickle``
      - ``json``
      - ``yaml``
      - ``cfg`` (context free grammars)
      - ``pcfg`` (probabilistic CFGs)
      - ``fcfg`` (feature-based CFGs)
      - ``fol`` (formulas of First Order Logic)
      - ``logic`` (Logical formulas to be parsed by the given logic_parser)
      - ``val`` (valuation of First Order Logic model)
      - ``text`` (the file contents as a unicode string)
      - ``raw`` (the raw file contents as a byte string)
    
    If no format is specified, ``load()`` will attempt to determine a
    format based on the resource name's file extension.  If that
    fails, ``load()`` will raise a ``ValueError`` exception.
    
    For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
    it tries to decode the raw contents using UTF-8, and if that doesn't
    work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
    is specified.
    
    :type resource_url: str
    :param resource_url: A URL specifying where the resource should be
        loaded from.  The default protocol is "nltk:", which searches
        for the file in the the NLTK data package.
    :type cache: bool
    :param cache: If true, add this resource to a cache.  If load()
        finds a resource in its cache, then it will return it from the
        cache rather than loading it.
    :type verbose: bool
    :param verbose: If true, print a message when loading a resource.
        Messages are not displayed when a resource is retrieved from
        the cache.
    :type logic_parser: LogicParser
    :param logic_parser: The parser that will be used to parse logical
        expressions.
    :type fstruct_reader: FeatStructReader
    :param fstruct_reader: The parser that will be used to parse the
        feature structure of an fcfg.
    :type encoding: str
    :param encoding: the encoding of the input; only used for text formats.
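
Two rules in the help text above are easy to misread: the format is inferred from the file extension, and text formats are decoded as UTF-8 first with ISO-8859-1 as the fallback unless encoding is given. The following is a standard-library sketch of those two rules as documented. It is not NLTK's implementation; the names guess_format and decode_text are hypothetical, and treating each format name as its own extension is a simplifying assumption.

```python
import os

# Format names copied from the help text; using them directly as extensions is an assumption.
KNOWN_FORMATS = {
    "pickle", "json", "yaml", "cfg", "pcfg", "fcfg",
    "fol", "logic", "val", "text", "raw",
}


def guess_format(resource_name):
    """Infer the format from the extension, raising ValueError if it is unknown."""
    ext = os.path.splitext(resource_name)[1].lstrip(".").lower()
    if ext not in KNOWN_FORMATS:
        raise ValueError(f"Could not determine format for {resource_name!r}")
    return ext


def decode_text(raw, encoding=None):
    """Decode bytes with the given encoding, else try UTF-8 and then ISO-8859-1."""
    if encoding is not None:
        return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this fallback cannot fail.
        return raw.decode("iso-8859-1")


print(guess_format("tokenizers/punkt/english.pickle"))
print(decode_text(b"caf\xe9"))
```

Because pickle is not a text format, the decoding step never applies to english.pickle, which matches the observation that passing encoding made no difference.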
