Note: langchain version 0.0.352.
Using langchain's UnstructuredCSVLoader to read a CSV file that contains Chinese text:
from langchain.document_loaders import UnstructuredCSVLoader

file_path = "chinese.csv"
loader = UnstructuredCSVLoader(file_path=str(file_path))
docs = loader.load()
An encoding problem makes the load fail with:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xxx in position x: illegal multibyte sequence
To fix it, modify the _get_elements method of the UnstructuredCSVLoader class as follows:
def _get_elements(self) -> List:
    from unstructured.partition.csv import partition_csv

    # ##### debug code #####
    # Fix for UnstructuredCSVLoader failing on CSV files that contain Chinese text:
    # if the filename-based call hits a decode error, retry with a binary file handle.
    try:
        elements = partition_csv(filename=self.file_path, **self.unstructured_kwargs)
    except UnicodeDecodeError:
        with open(self.file_path, "rb") as f:
            elements = partition_csv(file=f, **self.unstructured_kwargs)
    # ##### code end #####
    return elements
With this change the file loads.
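If editing the installed langchain source is not desirable, the same try/except fallback can be kept in your own code. The following is only a sketch under assumptions: partition_with_binary_fallback is a hypothetical helper name, not part of langchain or unstructured, and it only assumes that the function passed in accepts filename= and file= keyword arguments, as partition_csv is called in the patch above.

```python
from typing import Any, Callable, List


def partition_with_binary_fallback(
    partition_fn: Callable[..., List[Any]],
    file_path: str,
    **kwargs: Any,
) -> List[Any]:
    """Call partition_fn by filename; on a decode error, retry with a binary handle.

    Hypothetical helper: partition_fn is expected to accept `filename=` and
    `file=` keyword arguments.
    """
    try:
        return partition_fn(filename=file_path, **kwargs)
    except UnicodeDecodeError:
        # Opening in binary mode avoids decoding with the system default encoding.
        with open(file_path, "rb") as f:
            return partition_fn(file=f, **kwargs)
```

It would be called as partition_with_binary_fallback(partition_csv, "chinese.csv"), with partition_csv imported from unstructured.partition.csv. Unlike the patch, this returns unstructured elements directly and does not go through the loader.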
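The root cause is an encoding problem in how langchain integrates the third-party library unstructured. The 'gbk' in the error message suggests the file was decoded with the system default encoding rather than the encoding it was saved in.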
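The mismatch can be reproduced without langchain. The sketch below assumes the CSV was saved as UTF-8 (the original does not state the file's encoding) and uses a made-up one-row sample. It prints the encoding a text-mode open() would use by default, then checks whether the sample bytes decode as UTF-8 and as GBK.

```python
import locale

# Assumption: the CSV is saved as UTF-8; this row is a made-up sample.
utf8_bytes = "好,1\n".encode("utf-8")

# The encoding a text-mode open() falls back to when none is given.
print("default encoding:", locale.getpreferredencoding(False))


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print("as utf-8:", decodes_as(utf8_bytes, "utf-8"))
print("as gbk:", decodes_as(utf8_bytes, "gbk"))
```

For this sample the GBK decode fails because the last byte of the three-byte UTF-8 character is paired with the comma, which is not a valid GBK trail byte. Not every UTF-8 byte sequence is invalid GBK, so other files may load as garbled text instead of raising an error.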