Code with the error:
from spacy.lang.en import English
from tqdm import tqdm

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def normalize(text):
    text = text.lower().strip()
    doc = nlp(text)
    filtered_sentences = []
    for sentence in tqdm(doc.sents):  # the error is raised here
        filtered_sentences.append(sentence.text)
    return filtered_sentences
Error:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
Cause:
This is currently a limitation of the sentencizer, because the is_sentenced property is based on whether the Token.is_sent_start properties were changed. However, for the first token in a sentence, this will always default to True. So if the sentence only contains one token, there is no way for spaCy to tell whether the sentence boundaries have been set or not.
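Given that explanation, a workaround is to follow the error message's own suggestion and set doc[i].is_sent_start explicitly after processing, so the boundaries count as "set" even for a one-token text. The sketch below assumes this guard is enough for the normalize function above (tqdm is omitted for brevity); the try/except covers both the spaCy v2 API (create_pipe) and the v3 API (add_pipe with a string name):

```python
from spacy.lang.en import English

nlp = English()
# spaCy v3 takes the component name directly; fall back to the v2 API
try:
    nlp.add_pipe("sentencizer")
except ValueError:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

def normalize(text):
    text = text.lower().strip()
    doc = nlp(text)
    # Workaround for E030: explicitly mark the first token as a
    # sentence start, so spaCy can tell the boundaries were set
    # even when the text contains only a single token.
    if len(doc) > 0:
        doc[0].is_sent_start = True
    return [sentence.text for sentence in doc.sents]
```

With this guard, a single-token input such as "Hi" yields a one-sentence list instead of raising E030.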