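Error 1: UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 41: illegal multibyte sequence
While scraping data from Lagou (拉勾网), the following error appeared after some data had been fetched:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 41: illegal multibyte sequence
The error turned out to occur when the scraped data was written to the CSV file. The code at that point was: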
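import csv

# Method of the spider class; position is a dict keyed by the names in title.
def csv_writer(self, position):
    title = ['job_name', 'company', 'salary', 'city', 'education', 'workyear', 'job_des']
    # No encoding given, so open() falls back to the system default.
    with open('lagou.csv', 'a', newline='') as f:
        writer = csv.DictWriter(f, title)
        writer.writerow(position)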
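Analysis: the error is raised while writing the file. On Windows, open() without an encoding argument uses the system locale encoding, which is GBK on a Chinese-language system, and GBK cannot represent some Unicode characters such as '\xa0' (a no-break space). So I changed the code to specify the encoding explicitly: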
import csv

def csv_writer(self, position):
    title = ['job_name', 'company', 'salary', 'city', 'education', 'workyear', 'job_des']
    # Explicit UTF-8 can encode every Unicode character, including '\xa0'.
    with open('lagou.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, title)
        writer.writerow(position)
Reference: https://blog.csdn.net/github_35160620/article/details/53353672
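To see why the encoding argument matters, here is a minimal sketch that reproduces the failure without any scraping. The sample row is made up for illustration, write_row is a hypothetical helper name (not from the original spider), and the files go to a temporary directory instead of lagou.csv. GBK is requested explicitly so the behaviour is the same on any operating system.

```python
import csv
import os
import tempfile

TITLE = ['job_name', 'company', 'salary', 'city', 'education', 'workyear', 'job_des']

# Hypothetical sample row; '\xa0' is a no-break space, common in scraped HTML.
row = dict.fromkeys(TITLE, 'n/a')
row['job_des'] = 'Python\xa0developer'


def write_row(path, position, encoding):
    """Append one row to a CSV file using an explicit encoding."""
    with open(path, 'a', newline='', encoding=encoding) as f:
        csv.DictWriter(f, TITLE).writerow(position)


tmp_dir = tempfile.mkdtemp()

# GBK has no mapping for U+00A0, so this reproduces the error.
try:
    write_row(os.path.join(tmp_dir, 'gbk.csv'), row, 'gbk')
    gbk_failed = False
except UnicodeEncodeError as exc:
    gbk_failed = True
    print('gbk failed:', exc)

# UTF-8 can encode every Unicode character, so the same row is written fine.
utf8_path = os.path.join(tmp_dir, 'utf8.csv')
write_row(utf8_path, row, 'utf-8')
with open(utf8_path, encoding='utf-8', newline='') as f:
    print(next(csv.reader(f)))
```

If the CSV is meant to be opened in Excel on Windows, encoding='utf-8-sig' may be the more practical choice, since the byte order mark it writes helps Excel recognise the file as UTF-8.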
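Addendum:
If you just want to force the write and not care about the characters that cannot be encoded, you can pass errors="ignore" to open(). Note that the offending characters are then silently dropped from the output.
Source of the idea: https://blog.csdn.net/yanjiaxin1996/article/details/80113552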
import csv

def csv_writer(self, position):
    title = ['job_name', 'company', 'salary', 'city', 'education', 'workyear', 'job_des']
    # Still the default encoding, but unencodable characters are discarded.
    with open('lagou.csv', 'a', newline='', errors="ignore") as f:
        writer = csv.DictWriter(f, title)
        writer.writerow(position)
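What errors="ignore" actually does to the data can be checked on a plain string. The sketch below compares it with errors="replace" and with cleaning the value before writing. The clean helper and the sample dict are hypothetical, added only for illustration.

```python
text = 'Python\xa0developer'

# 'ignore' silently drops characters the codec cannot represent.
print(text.encode('gbk', errors='ignore'))   # b'Pythondeveloper'

# 'replace' substitutes '?' so the loss stays visible.
print(text.encode('gbk', errors='replace'))  # b'Python?developer'


def clean(value):
    """Replace no-break spaces with ordinary spaces; leave non-strings alone."""
    return value.replace('\xa0', ' ') if isinstance(value, str) else value


# Hypothetical scraped record.
position = {'job_name': 'Python\xa0developer', 'salary': '15k-25k'}
cleaned = {key: clean(value) for key, value in position.items()}
print(cleaned['job_name'].encode('gbk'))     # b'Python developer'
```

With 'ignore' the two words run together, so if the file has to stay in GBK, normalising '\xa0' to a regular space first is likely the gentler option.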
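Error 2: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 12-13: truncated \xXX escape
This error appeared when reading a file by its path. The cause is that the path contains backslashes, which Python treats as the start of escape sequences in an ordinary string literal (here the \x in \xici_proxy is read as an incomplete hex escape). The backslashes must be escaped, or the string written as a raw string with the r prefix, for the path to be read correctly: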
with open(r'C:\Py\spider\xici_proxy\ip.csv', 'r') as f:
    print(f.read())
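The following sketch shows the failing literal and three equivalent ways of writing the path. It compiles the bad source text with compile() instead of opening any file, so it runs without the file existing. The variable names are made up for the demonstration.

```python
from pathlib import PureWindowsPath

# Source text of an assignment that uses an ordinary (non-raw) string literal.
bad_source = r"path = 'C:\Py\spider\xici_proxy\ip.csv'"
try:
    compile(bad_source, '<demo>', 'exec')
    error_message = None
except SyntaxError as exc:
    # \x must be followed by two hex digits; 'ic' is not hex.
    error_message = str(exc)
    print(error_message)

raw = r'C:\Py\spider\xici_proxy\ip.csv'        # raw string: backslashes kept as-is
escaped = 'C:\\Py\\spider\\xici_proxy\\ip.csv'  # each backslash doubled
forward = 'C:/Py/spider/xici_proxy/ip.csv'      # forward slashes also work on Windows

print(raw == escaped)
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

All three spellings refer to the same file on Windows; the raw string is usually the least error-prone when a path is pasted from Explorer.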
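Error 3: SyntaxError: invalid character in identifier
The cause is that the code contains an invalid character.
Check the code carefully for full-width (Chinese input method) characters that look like their ASCII counterparts, such as a full-width space, equals sign, colon or quotation mark.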
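import random

# proxies is assumed to be a list of proxy URLs defined elsewhere.
proxy = {
    'http': proxies[random.randint(0, len(proxies) - 1)]  # ASCII quotes and colon
}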
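Full-width characters are hard to spot by eye, so a small scan can help narrow the search. This is a rough sketch rather than a real linter: find_non_ascii is a hypothetical helper, and the sample snippet with a full-width equals sign and colon is invented for the demonstration (the characters are written as \uXXXX escapes so they stay visible).

```python
import unicodedata


def find_non_ascii(source):
    """Yield (line_no, column, char, unicode_name) for each non-ASCII character."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 127:
                yield line_no, column, char, unicodedata.name(char, 'UNKNOWN')


# Hypothetical snippet: U+FF1D is a full-width '=', U+FF1A a full-width ':'.
sample = "proxy \uff1d {\n    'http'\uff1a proxies[0]\n}"
hits = list(find_non_ascii(sample))
for hit in hits:
    print(hit)
```

Non-ASCII characters inside string literals and comments are perfectly legal, so a hit there is harmless; only those sitting in the code itself need replacing.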