When scraping a web page with the Python requests library, Chinese text in the page source came back as English in the response.
For example:
import requests
r = requests.get('http://apps.webofknowledge.com', headers={'User-Agent': 'Mozilla/5.0'})
print(r.text[:1000])
Output:
'<!DOCTYPE html> <html> <head><link rel="icon" href="http://images.webofknowledge.com/WOKRS5272R3/images/wok_favicon.ico" type="image/x-icon"/><title>Web of Science [v.5.27.2] - All Databases Home </title><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS5272R3/css/WoKcommon.css" type="text/css" /><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS5272R3/css/WoKcomponents.css" type="text/css" /><link rel="stylesheet" h'
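The page source as viewed in the browser, however, is different: the title there reads “所有数据库主页”.
Clearly, the Chinese text “所有数据库主页” in the source turned into the English “All Databases Home” in the scraped response.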
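To compare the two versions without reading through 1000 characters of markup, it can help to pull out just the <title> text. The following is a minimal sketch, not part of the original post: `extract_title` is a hypothetical helper, and the sample string is a shortened copy of the output shown above.

```python
import re

def extract_title(html):
    """Hypothetical helper: return the text inside <title>...</title>, or None."""
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

# Shortened copy of the scraped output shown above.
sample = '<head><title>Web of Science [v.5.27.2] - All Databases Home </title></head>'
title = extract_title(sample)
print(title)
```

In practice the same helper could be applied to r.text from the request above.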
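Solution:
Add 'Accept-Language': 'zh-CN' to the request headers. requests does not send an Accept-Language header by default, so the server presumably falls back to English. The request becomes: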
import requests
r = requests.get('http://apps.webofknowledge.com',
                 headers={'User-Agent': 'Mozilla/5.0',
                          'Accept-Language': 'zh-CN'})
print(r.text[:1000])
Now the output is as expected:
'<!DOCTYPE html> <html> <head><link rel="icon" href="http://images.webofknowledge.com/WOKRS5272R3/images/zh_CN/wok_favicon.ico" type="image/x-icon"/><title>Web of Science [v.5.27.2] - 所有数据库主页 </title><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS5272R3/css/WoKcommon.css" type="text/css" /><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS5272R3/css/WoKcomponents.css" type="text/css" /><link rel="stylesheet" href="'
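If several pages are fetched from the same site, the header can be set once on a requests.Session instead of being repeated on every call. The following is a minimal sketch under that assumption: `make_session` is a hypothetical helper, not something from the original code. It prepares a request without sending it, so the outgoing headers can be inspected offline.

```python
import requests

def make_session(lang='zh-CN'):
    """Hypothetical helper: a session that asks for pages in the given language."""
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0',
        'Accept-Language': lang,
    })
    return session

# requests' built-in default headers do not include Accept-Language.
has_default_lang = 'Accept-Language' in requests.utils.default_headers()
print(has_default_lang)

session = make_session()

# Prepare (but do not send) a request to see which headers would go out.
request = requests.Request('GET', 'http://apps.webofknowledge.com')
prepared = session.prepare_request(request)
print(prepared.headers['Accept-Language'])
```

Calling session.get('http://apps.webofknowledge.com') would then send both headers on every request made through that session.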