At the Edge of the Programming Cliff, Turn Back to Shore. Python: A First Try at Fetching Web Pages
Version: Python 2.7.3 |EPD_free 7.3-2 (32-bit)|
#coding:utf-8
import urllib2
My first attempt at fetching a web page worked. The target was the Baidu home page.
#first try,it works----------------------------
def getHtml_1(url):
    # only urllib2 is imported above, so urlopen is called through it
    response = urllib2.urlopen(url)
    html_body = response.read()
    return html_body
Next I tried fetching through a Request object, and that worked too.
#use request,still works------------------------------
def getHtml_2(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    html_body = response.read()
    return html_body
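As a small sketch of what the Request object adds over a bare URL, the snippet below builds two requests without sending them and reads back what they would do. The helper name `describe_request` and the body `b"q=python"` are made up for illustration, and the try/except import is only an assumption so the same lines also run on Python 3, where the class lives in `urllib.request`.

```python
try:
    from urllib2 import Request  # Python 2, as used in this post
except ImportError:
    from urllib.request import Request  # Python 3 location of the same class


def describe_request(url, data=None):
    """Build a Request without sending it and report method and URL.

    Hypothetical helper, not part of urllib2.
    """
    request = Request(url, data)
    return request.get_method(), request.get_full_url()


# No data: the request would be sent as a GET
plain = describe_request("https://www.baidu.com/")
# With a body attached: the same call would be sent as a POST
with_body = describe_request("https://www.baidu.com/", b"q=python")

print(plain)
print(with_body)
```

Nothing goes over the network here; the request is only sent once it is handed to `urlopen`.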
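After switching the address to Qiushibaike, it stopped working. I assumed the request had failed and tried to read the status code, but that failed as well.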
#try to get status,failed-------------------------
def getHtml_3(url):
    response = None
    try:
        response = urllib2.urlopen(url,timeout=5)
        html_body = response.read()
        return html_body
    except urllib2.URLError as e:
        # HTTPError (a subclass of URLError) carries an HTTP status code
        if hasattr(e,'code'):
            print 'Error code:',e.code
        # a plain URLError only carries a reason
        elif hasattr(e,'reason'):
            print 'Reason:',e.reason
    finally:
        if response:
            response.close()
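The two branches in the except block can be tried out without touching the network. This is a minimal sketch: `describe_error` is a hypothetical helper that returns the message instead of printing it, the error objects are constructed by hand with made-up values (`http://example.invalid/`, 404, 'timed out'), and the try/except import is an assumption so the snippet also runs on Python 3.

```python
try:
    from urllib2 import HTTPError, URLError  # Python 2, as used in this post
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 location


def describe_error(e):
    """Return a one-line description of a urllib error (hypothetical helper)."""
    # Check 'code' first: only HTTPError has it
    if hasattr(e, 'code'):
        return 'Error code: %s' % e.code
    if hasattr(e, 'reason'):
        return 'Reason: %s' % e.reason
    return 'Unknown error: %r' % (e,)


# Hand-built errors standing in for what urlopen would raise
http_error = HTTPError('http://example.invalid/', 404, 'Not Found', {}, None)
url_error = URLError('timed out')

print(describe_error(http_error))
print(describe_error(url_error))
```

Because `HTTPError` is a subclass of `URLError`, a single `except urllib2.URLError` clause catches both, and the `hasattr` checks tell them apart.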
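A Wireshark capture showed that the server never replied at all.
Then I noticed that the request's header information said something like "python", which was clearly not going to be accepted. So I tried adding a browser-style header, and the fetch succeeded.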
#add header--------------------------
def getHtml_4(url):
    response = None
    request = None
    # pretend to be a browser instead of the default Python client
    headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    try:
        request = urllib2.Request(url,headers = headers)
        response = urllib2.urlopen(request)
        html_body = response.read()
        return html_body
    except urllib2.URLError as e:
        if hasattr(e,'code'):
            print 'Error code:',e.code
        elif hasattr(e,'reason'):
            print 'Reason:',e.reason
    finally:
        if response:
            response.close()
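To see the "python" header for yourself without sending anything, the following sketch reads the default User-Agent that the library's opener attaches, then builds a Request with the browser string used above and reads the header back. The alias `urlreq` and the variable names are illustrative, and the try/except import is an assumption so the snippet also runs on Python 3.

```python
try:
    import urllib2 as urlreq  # Python 2, as used in this post
except ImportError:
    import urllib.request as urlreq  # Python 3 location of the same API

# Headers the default opener adds to every request that lacks them
default_agent = dict(urlreq.build_opener().addheaders)['User-agent']

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us; '
                         'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
request = urlreq.Request('http://www.qiushibaike.com/', headers=headers)

# Request stores header names capitalized, hence 'User-agent'
custom_agent = request.get_header('User-agent')

print(default_agent)  # starts with 'Python-urllib/'
print(custom_agent)
```

The opener only fills in its default when the request does not already carry that header, so the value passed in `headers` is the one that gets sent.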
#run -----------------------------------
#html = getHtml_1("https://www.baidu.com/")
#html = getHtml_2("https://www.baidu.com/")
#html = getHtml_3("http://www.qiushibaike.com/")
html = getHtml_4("http://www.qiushibaike.com/")
#show me------------------------------
print html
The output on success is as follows:
That is all there is to a simple fetch.