I've recently been using Python to scrape web pages, but I kept running into errors. In Python 2, the simplest approach looks like this:
import urllib2

req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)  # send the request
the_page = response.read()       # read the response body
print the_page
But it always failed with this error:
Traceback (most recent call last):
  File "D:\Python\WebSpider\frdc\WebSpiderA.py", line 9, in <module>
    response = urllib2.urlopen(req)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python27\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
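Errno 11001 is Windows' WSAHOST_NOT_FOUND: getaddrinfo could not resolve the hostname, which usually means DNS lookups are blocked (for example, on an intranet that forces all traffic through a proxy), not that the request code itself is wrong. A minimal sketch of distinguishing this case, written against Python 3's urllib.request (the successor to urllib2); the `fetch` helper is an illustrative name, not a library function:

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=5):
    """Fetch a URL, telling DNS failures apart from other errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        # getaddrinfo failures surface as a URLError wrapping socket.gaierror
        if isinstance(e.reason, socket.gaierror):
            raise RuntimeError(
                "DNS resolution failed - likely blocked; try a proxy") from e
        raise
```

Calling `fetch("http://nonexistent.invalid/")` raises the RuntimeError, since the reserved `.invalid` TLD can never resolve.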
Since I wasn't very familiar with Python, I initially assumed my code was wrong. Then it occurred to me that the restriction might come from the corporate intranet, so I tried adding a proxy, and the data came back normally:
import urllib2

url = 'http://www.baidu.com/'
proxy = '12.122.22.333:8080'  # the intranet proxy's host:port

# Build an opener that routes HTTP traffic through the proxy,
# then install it as the default used by urllib2.urlopen
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
urllib2.install_opener(opener)

content = urllib2.urlopen(url)
print content.read()
The content retrieved this way matched Baidu's actual page source.
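For reference, the same proxy fix in Python 3, where urllib2 was merged into urllib.request; the proxy address below is a placeholder, not a real server:

```python
import urllib.request

# Placeholder proxy address - replace with your intranet proxy's host:port
PROXY = "10.0.0.1:8080"

# Build an opener that sends both http and https requests through the proxy
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
)
# Install it globally so plain urlopen() calls use the proxy from now on
urllib.request.install_opener(opener)

# content = urllib.request.urlopen("http://www.baidu.com/").read()
```

Alternatively, skip `install_opener` and call `opener.open(url)` directly when only some requests should go through the proxy.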