nltk.data.load(): usage and things to watch out for

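NLTK is a toolkit for natural language processing. With a plain Python installation it has to be installed before use; I work in Jupyter Notebook (Anaconda3), which already includes it, so the import below is all that is needed.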

import nltk
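
Whether that import succeeds depends on the environment. A minimal sketch for checking this up front, using only the standard library; the helper name is_installed is made up for illustration:

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    # find_spec returns None when no top-level module with that name is found
    return importlib.util.find_spec(package_name) is not None


if is_installed("nltk"):
    print("nltk is available")
else:
    # Typical install command in a plain Python environment
    print("nltk is missing; install it with: pip install nltk")
```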

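This post is about using nltk.data.load(), so other NLTK functions are not covered. Here the function is used to load a pre-trained Punkt tokenizer, which splits large amounts of text into sentences.
Before using it, you need the pickle file for the language in question (these files live under tokenizers/punkt), so download punkt first as shown below. The output shows where the package was installed on disk, which may come in handy later. Unfortunately, there is no Chinese model.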

nltk.download('punkt')
[nltk_data] Downloading package punkt to C:\Users\my
[nltk_data]     computer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

True
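
A script that runs repeatedly does not need to call nltk.download() every time. One possible pattern, sketched here on the understanding that nltk.data.find raises LookupError for a missing resource, is to look locally first and download only when needed. The helper ensure_resource and its find/download parameters are hypothetical; they exist so the logic can be exercised with stand-ins, without nltk data or a network connection:

```python
def ensure_resource(resource_name, package, find=None, download=None):
    """Download `package` only if `resource_name` is not found locally.

    Returns True if a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with stand-ins
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_name)
        return False
    except LookupError:
        download(package)
        return True


# Intended use (needs nltk and, on first run, network access):
# ensure_resource('tokenizers/punkt', 'punkt')
```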

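At first my call had neither the r prefix nor the encoding argument, and it kept failing with path errors. Adding both made it work, but I later found that the plain call works too, and I still don't know why the first attempts failed. Two things are worth knowing here: NLTK resource names are POSIX-style paths, so forward slashes work on every platform, and, as the help text further down states, encoding is only used for text formats, so it does nothing for a pickle. If you have run into something similar, feel free to leave a comment.

ch = nltk.data.load('tokenizers/punkt/english.pickle')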
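
Since path spelling was part of the trouble described above, one option is to normalise whatever path is at hand into a forward-slash resource name before passing it to nltk.data.load. This is only a sketch: to_resource_name is a hypothetical helper, not part of NLTK, and it is meant for relative names like the one in this post, not for full URLs.

```python
import re


def to_resource_name(path):
    """Collapse any run of backslashes or slashes into a single '/'."""
    return re.sub(r'[\\/]+', '/', path)


name = to_resource_name(r'tokenizers\\punkt\\english.pickle')
print(name)  # tokenizers/punkt/english.pickle
# The result can then be passed on: nltk.data.load(name)
```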

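Running the tokenizer on the text below shows that nothing gets split: the whole string comes back as one item. Interestingly, the result changes once punctuation is added, because Punkt looks for sentence boundaries at sentence-ending characters such as '.', '?' and '!', and this input contains none.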

ch.tokenize('I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people')
['I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people']
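
To see why punctuation matters, here is a deliberately simplified splitter. It is not the Punkt algorithm (Punkt also relies on learned information, for example about abbreviations); it only splits after '.', '!' or '?' followed by whitespace, which is enough to show that an input without such characters comes back whole:

```python
import re


def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())


print(naive_split('I am Chinese I love my country'))
# ['I am Chinese I love my country']
print(naive_split('I am Chinese. I love my country.'))
# ['I am Chinese.', 'I love my country.']
```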

For reference, here is the signature and parameter documentation of nltk.data.load:

help(nltk.data.load)
Help on function load in module nltk.data:

load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)
    Load a given resource from the NLTK data package.  The following
    resource formats are currently supported:
    
      - ``pickle``
      - ``json``
      - ``yaml``
      - ``cfg`` (context free grammars)
      - ``pcfg`` (probabilistic CFGs)
      - ``fcfg`` (feature-based CFGs)
      - ``fol`` (formulas of First Order Logic)
      - ``logic`` (Logical formulas to be parsed by the given logic_parser)
      - ``val`` (valuation of First Order Logic model)
      - ``text`` (the file contents as a unicode string)
      - ``raw`` (the raw file contents as a byte string)
    
    If no format is specified, ``load()`` will attempt to determine a
    format based on the resource name's file extension.  If that
    fails, ``load()`` will raise a ``ValueError`` exception.
    
    For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
    it tries to decode the raw contents using UTF-8, and if that doesn't
    work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
    is specified.
    
    :type resource_url: str
    :param resource_url: A URL specifying where the resource should be
        loaded from.  The default protocol is "nltk:", which searches
        for the file in the the NLTK data package.
    :type cache: bool
    :param cache: If true, add this resource to a cache.  If load()
        finds a resource in its cache, then it will return it from the
        cache rather than loading it.
    :type verbose: bool
    :param verbose: If true, print a message when loading a resource.
        Messages are not displayed when a resource is retrieved from
        the cache.
    :type logic_parser: LogicParser
    :param logic_parser: The parser that will be used to parse logical
        expressions.
    :type fstruct_reader: FeatStructReader
    :param fstruct_reader: The parser that will be used to parse the
        feature structure of an fcfg.
    :type encoding: str
    :param encoding: the encoding of the input; only used for text formats.
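
The help text describes two behaviours that are easy to picture in code: choosing a format from the file extension (with ValueError when that fails), and decoding text as UTF-8 with a Latin-1 fallback unless an encoding is given. The sketch below mimics those two rules in plain Python. It illustrates the documented behaviour and is not NLTK's implementation; the names EXTENSION_FORMATS, guess_format and decode_text are invented here, and the extension table is a small assumed subset.

```python
import os

# Assumed subset of extension-to-format pairs, for illustration only
EXTENSION_FORMATS = {
    'pickle': 'pickle',
    'json': 'json',
    'yaml': 'yaml',
    'cfg': 'cfg',
    'txt': 'text',
}


def guess_format(resource_name):
    """Pick a format from the file extension, or raise ValueError."""
    ext = os.path.splitext(resource_name)[1].lstrip('.').lower()
    if ext not in EXTENSION_FORMATS:
        raise ValueError(f'cannot determine format of {resource_name!r}')
    return EXTENSION_FORMATS[ext]


def decode_text(raw, encoding=None):
    """Decode bytes with the given encoding, else UTF-8 then Latin-1."""
    if encoding is not None:
        return raw.decode(encoding)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')


print(guess_format('tokenizers/punkt/english.pickle'))  # pickle
print(decode_text(b'caf\xe9'))  # café (UTF-8 fails, Latin-1 succeeds)
```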