Python web crawler

Fetch all the jokes from http://www.qiushibaike.com/textnew/ and save them to local files, one file per page, 35 pages in all.
Without further ado, here is the code; the regular expression still needs refinement.
A fragment of the site's HTML source (the joke text lives in <div class="content"><span>…</span></div>):
<a href="/users/32215536/" target="_blank" title="吃了两碗又盛">
  <h2>吃了两碗又盛</h2>
</a>
<div class="articleGender manIcon">38</div>
</div>
<a href="/article/119080581" target="_blank" class='contentHerf'>
  <div class="content">
    <span>一声长叹!<br/>二十多年前,简陋的农村小学的某一个班级里,有两个男生坐在同一个长条凳上,这是我和我的同桌。<br/>同桌是冤家,偶尔他回座位,我悄悄地用力,使同桌那一端微微翘起,然后不动声色地向后旋转,同桌一下坐空,摔在地上,我偷着乐!<br/>同一招不能总用,但是小时候哪懂这个。那一次,我故技重施,怎料到同桌早有防备,腿向后踢,凳子翻起,我顺利地坐到地上。<br/>到现在,就怕阴天!</span>
  </div>
</a>
<div class="stats">
  <span class="stats-vote"><i class="number">62</i> 好笑</span>
  <span class="stats-comments"></span>
</div>
<div id="qiushi_counts_119080581" class="stats-buttons bar clearfix">
  <ul class="clearfix">
    <li id="vote-up-119080581" class="up">
      <a href="javascript:voting(119080581,1)" class="voting" data-article="119080581" id="up-119080581" rel="nofollow"> <i></i> <span class="number hidden">68</span> </a>
    </li>
    <li id="vote-dn-119080581" class="down">
      <a href="javascript:voting(119080581,-1)" class="voting" data-article="119080581" id="dn-119080581" rel="nofollow"> <i></i> <span class="number hidden">-6</span> </a>
    </li>
    <li class="comments">
      <a href="/article/119080581" id="c-119080581" class="qiushi_comments" target="_blank"> <i></i> </a>
    </li>
  </ul>
</div>
<div class="single-share">
  <a class="share-wechat" data-type="wechat" title="分享到微信" rel="nofollow">微信</a>
  <a class="share-qq" data-type="qq" title="分享到QQ" rel="nofollow">QQ</a>
  <a class="share-qzone" data-type="qzone" title="分享到QQ空间" rel="nofollow">QQ空间</a>
  <a class="share-weibo" data-type="weibo" title="分享到微博" rel="nofollow">微博</a>
</div>
<div class="single-clear"></div>
</div>
<div class="article block untagged mb15" id='qiushi_tag_119080574'>
  <div class="author clearfix">
    <a href="/users/31546279/" target="_blank" rel="nofollow">
      <img src="//pic.qiushibaike.com/system/avtnew/3154/31546279/medium/20160530180038.jpg" alt="?最远的距离"/>
    </a>
    <a href="/users/31546279/" target="_blank" title="?最远的距离">
      <h2>?最远的距离</h2>
    </a>
    <div class="articleGender manIcon">21</div>
  </div>
  <a href="/article/119080574" target="_blank" class='contentHerf'>
    <div class="content">
      <span>万能的糗友啊 苦舞一生情字谱<br/>这句话 谁能借下一句。</span>
    </div>
  </a>
</div>
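The pattern in the code below grabs each whole content div and then strips the wrappers with chained replace() calls. As a possible refinement (a sketch of my own, not the author's final regex), a capture group around the span text would return only the joke itself; the sample string here is trimmed from the fragment above:

# coding=utf-8
import re

# Trimmed sample taken from the HTML fragment above.
sample = ('<a href="/article/119080574" target="_blank" class=\'contentHerf\'>'
          ' <div class="content"> <span>万能的糗友啊 苦舞一生情字谱<br/>'
          '这句话 谁能借下一句。</span> </div> </a>')

# The capture group returns only the joke text; re.S lets '.' cross newlines.
pattern = re.compile(r'<div class="content">\s*<span>(.*?)</span>', re.S)
for text in pattern.findall(sample):
    print text.replace('<br/>', '\n')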


Python code:


# coding=utf-8
import urllib2
import re
import os
import sys


class Spider(object):

    def __init__(self):
        # Page URL template; %s is replaced by the page number.
        self.url = 'http://www.qiushibaike.com/textnew/page/%s?s=4832452'
        self.user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0'

    # Fetch the HTML source of one page.
    def get_page(self, page_index):
        headers = {'User-Agent': self.user_agent}
        try:
            url = self.url % str(page_index)
            request = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(request)
            content = response.read()
            return content
        except urllib2.HTTPError as e:
            print e
            sys.exit()
        except urllib2.URLError as e:
            print 'URLError: ' + self.url
            sys.exit()
    # Parse the page source and extract each joke's content block.
    def analysis(self, content):
        # Earlier candidate patterns, kept for reference while the regex is refined:
        # pattern = re.compile('<div class="content">(.*?)<!--(.*?)-->.*?</div>', re.S)
        # res_value = r'<span .*?>(.*?)</span>'
        # re.S lets '.' match newlines, because each joke spans several lines.
        pattern = re.compile('<div class="content">.*?</div>', re.S)
        items = re.findall(pattern, content)
        return items

    # Save one page's extracted items to a numbered text file.
    def save(self, items, i, path):
        if not os.path.exists(path):
            os.makedirs(path)
        file_path = path + '/' + str(i) + '.txt'
        f = open(file_path, 'w')
        for item in items:
            # Strip the HTML wrappers and turn <br/> tags into real line breaks.
            item_new = item.replace('\n', '').replace('<br/>', '\n').replace('<div class="content">', '').replace('</div>', '').replace('</span>', '\n').replace('<span>', '\n')
            f.write(item_new)
        f.close()
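    # Worked example of the replace chain above, on an illustrative input:
    #   '<div class="content"><span>line one<br/>line two</span></div>'
    # comes out as '\nline one\nline two\n' -- the HTML wrappers vanish
    # and each <br/> turns into a real newline.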
    # Crawl every page and save it.
    def run(self):
        print 'Start crawling pages'
        for i in range(1, 36):  # pages 1 through 35
            content = self.get_page(i)
            items = self.analysis(content)
            self.save(items, i, '/Users/huangxuelian/Downloads/41527218/pythontest')
        print 'Crawling finished'


if __name__ == '__main__':
    spider = Spider()
    spider.run()
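Note that the script is Python 2 only: it uses urllib2 and print statements. On Python 3, where urllib2 was split into urllib.request and urllib.error, the fetch step would look roughly like this sketch (same URL template and User-Agent assumed):

import urllib.request
import urllib.error

# Python 3 sketch of get_page (same URL template and User-Agent as above).
def get_page(page_index):
    url = 'http://www.qiushibaike.com/textnew/page/%s?s=4832452' % page_index
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; '
                             'rv:53.0) Gecko/20100101 Firefox/53.0'}
    request = urllib.request.Request(url, headers=headers)
    try:
        # read() returns bytes in Python 3; decode before handing it to re.
        return urllib.request.urlopen(request).read().decode('utf-8')
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print('Failed to fetch %s: %s' % (url, e))
        return None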

Reposted from: https://www.cnblogs.com/huangxuelian/p/6913875.html
