NLTK word_tokenize throws IndexError: list index out of range

Warning: do not use NLTK 3.6.6!

The original question, from GitAnswer:

I am working on some NLP experiments in which I want to tokenize texts from users. For that I am currently using NLTK, but I noticed unexpected behavior when tokenizing a raw user-input string. I am not sure whether this is a bug in NLTK or whether I should provide a pre-processed string. Previously I had no problems using NLTK to tokenize pre-processed datasets, but with raw user input I run into problems. Do you have an explanation for the problem, or can you give me a hint on how to pre-process my user input before applying NLTK's word_tokenize?

I have provided a minimal reproducible example. I set up a conda environment with Python 3.9 and nltk==3.6.6:

```bash
conda create -n "example_nltk" python=3.9 -y
conda activate example_nltk
pip install nltk==3.6.6
```

Then I create and run the following Python file:

```python
import nltk
from nltk import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

text = '? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid'
word_tokenize(text, language='german')
```

The script throws an IndexError: list index out of range when running word_tokenize(text, language='german'). The error occurs in punkt.py, in the function _match_potential_end_contexts, at the line before_words[match] = split[-1], because the variable split is empty ([]).

Do you have a suggestion for how to proceed? Am I doing something wrong? Should I process the raw user input before supplying it to NLTK's word_tokenize? Thank you for your support!

Here is the full traceback for details:

```text
Traceback (most recent call last):
  File "nltk_test.py", line 8, in <module>
    word_tokenize(text, language='german')
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/Users/user/anaconda3/envs/example_nltk/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
```

Source: https://gitanswer.com/nltk-word-tokenize-throws-indexerror-list-index-out-of-range-python-1088385118
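For context beyond the quoted thread: the asker's own diagnosis points at a regression in the Punkt tokenizer shipped with NLTK 3.6.6. When a candidate sentence-end match sits at the very start of the string (here, plausibly the leading ?), the text before the match is empty, and rsplit on an empty string returns an empty list rather than [""], so split[-1] has nothing to index. A minimal sketch of that failure mode, reproduced outside NLTK:

```python
# Reproduce the failing expression from punkt.py 3.6.6 in isolation.
# When the sentence-end candidate is at position 0, the slice of text
# before it is the empty string.
before_text = ""

# Whitespace rsplit on "" yields an empty list, not [""].
split = before_text.rsplit(maxsplit=1)
print(split)  # []

# punkt.py then evaluates `split[-1]`, which raises on an empty list:
try:
    split[-1]
except IndexError as exc:
    print(f"IndexError: {exc}")  # IndexError: list index out of range
```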
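As for how to proceed, two workarounds are worth trying. Upgrading past 3.6.6 is the cleanest option, since the regression was reportedly fixed in the next patch release (check the NLTK changelog to confirm); alternatively, word_tokenize accepts preserve_line=True, which, as the first traceback frame shows, skips sent_tokenize and therefore never enters the failing Punkt code path. A sketch of both, not an official NLTK recommendation:

```bash
# Option 1: move off the affected release (the fix reportedly landed
# in the next patch release; verify against the NLTK changelog).
pip install --upgrade "nltk>=3.6.7"
```

```python
# Option 2: stay on 3.6.6 but bypass Punkt sentence splitting.
# preserve_line=True makes word_tokenize treat the input as a single
# line, so sent_tokenize (and _match_potential_end_contexts) is
# never called.
from nltk import word_tokenize

text = '? so ein schwachsinn! rot für: dummes post. salzburg gewinnt öfb-cup gegen rapid'
tokens = word_tokenize(text, language='german', preserve_line=True)
print(tokens)
```

Note that preserve_line=True also means no sentence splitting at all, which matters if downstream code relies on sentence boundaries; normalizing leading punctuation in raw user input before tokenizing would be a third, more invasive option.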
