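Much of the Python crawler code found online and in books is written for Python 2.7. Here is a problem I ran into recently: using the urllib library to request a website with a custom set of request headers: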
import urllib

def get_page_source(url):
    headers = {'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.8',
               'Cache-Control': 'max-age=0',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
               'Connection': 'keep-alive',
               'Referer': 'http://www.baidu.com/'
               }
    response = urllib.urlopen(url, None, headers)
    page_source = response.read()
    return page_source

print(get_page_source("http://baidu.com"))
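For comparison, Python 3's urllib.request.urlopen has no headers parameter, so headers are normally attached by wrapping the URL in a urllib.request.Request object first. The following is only a minimal sketch, assuming Python 3; the names DEFAULT_HEADERS, build_request and fetch_page_source are illustrative and do not come from the original code, and the header set is shortened for brevity.

```python
import urllib.request

# Hypothetical, shortened header set used only for illustration.
DEFAULT_HEADERS = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
}


def build_request(url, headers=None):
    """Return a Request object that carries the given headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_page_source(url, headers=None, timeout=10):
    """Send the request and return the raw response body as bytes."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Building the Request separately from sending it makes the headers easy to inspect without any network access. Note that fetch_page_source returns bytes, which still need to be decoded before being handled as text.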
Running the get_page_source code above produces the following error:
Traceback (most recent call last