Crawler in practice: scraping 百思不得姐 (budejie)

Latest recommended article published on 2020-05-03 22:55:10

weixin_30559481

Tags: crawler, python

Original link: http://www.cnblogs.com/qukingblog/p/7475292.html

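After working through an introduction to web crawlers, I wanted some hands-on practice, so I picked a joke site, 百思不得姐 (budejie), and scraped its jokes: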

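First, go to http://www.budejie.com/text/, a page that contains only text jokes. For now we scrape just the jokes and skip the images. Open the page and view its source: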

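The jokes all sit in a structure like <a href="/detail-3242432.html">joke text</a>,
so we can capture them with the regular expression reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>'). The first group captures the link to the detail page and the second captures the joke text.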
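
As a minimal sketch of how this pattern behaves, the snippet below runs it against a hand-written HTML fragment. The fragment, its links, and its joke texts are made up for illustration and are not copied from the real page.

```python
import re

# Hypothetical HTML fragment imitating the structure described above
sample_html = (
    '<a href="/detail-3242432.html">first joke</a>\n'
    '<a href="/detail-3242433.html">second joke</a>\n'
    '<a href="/about.html">not a joke link</a>\n'
)

reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')

# The pattern has two groups, so findall returns one (link, text) tuple per match
matches = reg.findall(sample_html)
for link, text in matches:
    print(link, text)
```

The third link is skipped because its href does not start with /detail-, which is how the pattern separates jokes from other links on the page.
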
The upvote count is extracted by repeating the same process:

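The regular expression is reg = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>'). The two separators are written as the literal entity &nbsp; because the pattern is matched against the raw page source, not the rendered text.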
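
A small sketch of the upvote pattern, again on a made-up fragment (the counts below are invented). It assumes the captured text is a plain integer, which may not hold if the site formats large numbers differently.

```python
import re

# Hypothetical fragment; the counts are invented for illustration
sample_html = (
    '<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>120</span>'
    '<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>45</span>'
)

reg = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>')

# With a single group, findall returns a flat list of strings
counts = reg.findall(sample_html)

# Convert to integers (assumes every captured value is a plain number)
ups = [int(c) for c in counts]
print(ups)
```
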

The code is as follows:

# encoding: utf-8
import re
import urllib.request

URL = 'http://www.budejie.com/text/'
# Send a browser-like User-Agent header (this identifies the client; it is not a proxy)
HEADERS = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}


def fetch():
    # Download the page once and decode it, assuming the page is UTF-8 encoded
    request = urllib.request.Request(URL, headers=HEADERS)
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')


def getduan(res):
    # Return a list of (detail link, joke text) tuples
    reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
    return re.findall(reg, res)


def up(res):
    # Return the upvote count shown for each joke, as strings
    reg = re.compile(r'<i class="icon-up ui-icon-up"></i>&nbsp;&nbsp;<span>(.*?)</span>')
    return re.findall(reg, res)


if __name__ == '__main__':
    res = fetch()
    # Iterate over the pairs directly: a dict would not keep page order
    # reliably and would merge entries that share the same key
    for count, (j, i) in enumerate(zip(getduan(res), up(res)), 1):
        print('Joke', count, j[1])
        print('Upvotes:', i)
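
One point worth checking when pairing jokes with upvote counts is what happens if the pairs are stored in a dict. The sketch below uses invented data, and it assumes, purely for illustration, that the same link and text can be matched more than once on a page. In that case the dict keeps only one entry per key, while a plain list of pairs keeps every match in order.

```python
# Invented data: the first and third entries share the same link and text
jokes = [
    ('/detail-1.html', 'joke A'),
    ('/detail-2.html', 'joke B'),
    ('/detail-1.html', 'joke A'),
]
ups = ['10', '20', '30']

# A dict keyed by the joke tuple collapses the repeated key,
# keeping only the last upvote value seen for it
as_dict = dict(zip(jokes, ups))

# A list of pairs preserves every match in page order
as_list = list(zip(jokes, ups))

print(len(as_dict), len(as_list))
```

If repeated matches can occur, iterating over zip(...) directly, or over a list built from it, avoids silently losing entries.
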