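This error appears when running event extraction on a Chinese dataset: open() falls back to the platform default encoding, which is gbk on this system, while the Chinese text is encoded in utf-8.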
Original:
with open(sgm_file, 'r') as f:
    soup = BeautifulSoup(f.read(), features='html.parser')
    sgm_text = soup.text
So when this error appears, pass the encoding explicitly:
with open(sgm_file, 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), features='html.parser')
    sgm_text = soup.text
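
To see why the explicit encoding matters, here is a minimal sketch that uses only the standard library. It writes a small UTF-8 file and reads it back once as gbk and once as utf-8. The helper names read_text_utf8 and demo, and the sample content, are hypothetical and not part of the original code. Reading as gbk may either raise UnicodeDecodeError or silently return garbled text, depending on the bytes involved, so the sketch handles both cases.

```python
import locale
import os
import tempfile


def read_text_utf8(path):
    """Read a file as UTF-8 regardless of the platform default encoding."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


def demo():
    # Hypothetical sample content standing in for a real .sgm file.
    sample = '<DOC><TEXT>事件抽取</TEXT></DOC>'
    fd, path = tempfile.mkstemp(suffix='.sgm')
    os.close(fd)
    try:
        with open(path, 'w', encoding='utf-8') as f:
            f.write(sample)
        # Simulate a gbk default: this either fails or yields garbled text.
        try:
            with open(path, 'r', encoding='gbk') as f:
                wrong = f.read()
        except UnicodeDecodeError:
            wrong = None
        right = read_text_utf8(path)
    finally:
        os.remove(path)
    return sample, wrong, right


if __name__ == '__main__':
    # Shows which encoding open() would use when none is given.
    print('platform default encoding:', locale.getpreferredencoding(False))
    sample, wrong, right = demo()
    print('utf-8 read matches original:', right == sample)
    print('gbk read matches original:', wrong == sample)
```

Passing encoding='utf-8' on every open() call keeps the behavior the same across machines, instead of depending on the locale of whoever runs the script.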