Scraping Qiushibaike
URL: http://www.qiushibaike.com/hot/
First, set the request headers:
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
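The code in these notes uses Python 2's urllib2; in Python 3 that module was merged into urllib.request, but the headers are attached the same way. A minimal sketch (no request is actually sent here, we only build the Request object):

```python
import urllib.request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Build the request with a custom User-Agent; urlopen(req) would send it.
req = urllib.request.Request('http://www.qiushibaike.com/hot/', headers=headers)
print(req.get_header('User-agent'))
```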
Fetch the page source using those headers:
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
content = response.read().decode('utf-8')
Write a regular expression to extract the content we want:
pattern = re.compile('.*?<h2>\n(.*?)\n</h2>.*?<span>\n\n\n(.*?)\n</span>.*?<i class="number">(.*?)</i>(.*?).*?number">(.*?)</i>(.*?)</a>',re.S)
items = re.findall(pattern, content)
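When a pattern contains several groups, re.findall returns a list of tuples, one tuple per match. A small demo on a made-up snippet that mimics the page structure the pattern above targets (the real page markup may differ):

```python
import re

# Made-up HTML fragment shaped like the markup the pattern expects.
html = ('<h2>\nalice\n</h2> <span>\n\n\nhello world\n</span> '
        '<i class="number">42</i>')

# re.S makes '.' also match newlines, so '.*?' can skip across lines.
pattern = re.compile(
    r'<h2>\n(.*?)\n</h2>.*?<span>\n\n\n(.*?)\n</span>.*?<i class="number">(.*?)</i>',
    re.S)
items = re.findall(pattern, html)
print(items)  # [('alice', 'hello world', '42')]
```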
How do we paginate?
HTTP has two common request methods, GET and POST; here the page number is carried in the URL itself, so plain GET requests are enough.
Each page has a different URL:
Page 1: https://www.qiushibaike.com/hot/
Page 2: https://www.qiushibaike.com/hot/page/2
Page 3: https://www.qiushibaike.com/hot/page/3
...
So to scrape multiple pages, just wrap the fetch in a for loop:
for page in range(1, 5):
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
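Note that range(1, 5) covers pages 1 through 4 only; raise the upper bound to scrape more. The URLs the loop produces can be checked offline:

```python
# Build the per-page URLs exactly as the loop above does.
urls = ['http://www.qiushibaike.com/hot/page/' + str(page)
        for page in range(1, 5)]
print(urls[0])   # first page URL
print(urls[-1])  # last page URL (page 4, since range(1, 5) stops at 4)
```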
Full code:
# -*- coding: utf-8 -*-
__author__ = 'kkk'
import urllib2
import re

for page in range(1, 5):
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        pattern = re.compile('.*?<h2>\n(.*?)\n</h2>.*?<span>\n\n\n(.*?)\n</span>.*?<i class="number">(.*?)</i>(.*?).*?number">(.*?)</i>(.*?)</a>', re.S)
        items = re.findall(pattern, content)
        for item in items:
            # Skip posts that contain an image
            have_img = re.search('img', item[3])
            if not have_img:
                text = re.sub('<br/>', '', item[1])  # strip <br/> tags from the body
                print item[0], text, item[2], item[4]
                line = item[0] + ' ' + text + item[2] + ' ' + item[4]
                f1 = open('01.txt', 'a')
                f1.write(line.encode('utf-8'))
                f1.close()
    except urllib2.URLError, e:
        if hasattr(e, 'code'):
            print e.code
        if hasattr(e, 'reason'):
            print e.reason
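The per-item filtering inside the loop (skip posts containing an image, strip <br/> tags) can be exercised offline. A sketch using made-up tuples shaped like the ones re.findall returns above (author, body, vote count, possible-img HTML, comment count, trailing text):

```python
import re

# Made-up sample items; the second one contains an image and should be skipped.
items = [
    ('alice', 'line one<br/>line two', '42', '', '5', ''),
    ('bob', 'a picture post', '7', '<img src="x.jpg"/>', '1', ''),
]

kept = []
for item in items:
    if re.search('img', item[3]):        # skip posts that contain an image
        continue
    body = re.sub('<br/>', '', item[1])  # strip <br/> tags from the body
    kept.append((item[0], body, item[2], item[4]))

print(kept)  # [('alice', 'line oneline two', '42', '5')]
```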