Learning Web Scraping with Gui Ge - 2 - Qiushibaike

Having finished the previous article, the next step is to understand the data on the site we want to scrape, namely its HTML tags.

The heart of bs4 is the workflow of locating, parsing, and extracting data, namely:



    # res is a urllib2.Request for the target page (built further below)
    response = urllib2.urlopen(res)
    html = response.read()
    soup = BeautifulSoup(html, "lxml")
    someData = soup.select("div.content span")


The selector string passed to soup.select() here is the crucial part.
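As a quick reference, select() takes a CSS selector string. Here is a minimal sketch of the common patterns; the tag, class, and id names in this snippet are made up purely for illustration:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Tiny made-up document, only to illustrate selector syntax.
html = '<div class="content" id="main"><span>hello</span></div>'
soup = BeautifulSoup(html, "lxml")

print soup.select("div")               # every <div> tag
print soup.select("div.content")       # <div> tags with class "content"
print soup.select("#main")             # the tag whose id is "main"
print soup.select("div.content span")  # <span> tags nested inside div.content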

The screenshot below shows how to locate the tag to pass to select:






From the area marked with the red line, we can see that this div tag holds the text content of each joke on the page.

So let's start by writing:


 someData = soup.select("div.content")

and run a quick test.

For interactive debugging I recommend the ipython shell; it makes poking at the data very convenient. Rather than screenshots, here is the session pasted as text:


suz@suz9527:~/Pytools$ ipython
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
Type "copyright", "credits" or "license" for more information.


IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.


In [1]: from bs4 import BeautifulSoup


In [2]: import urllib2


In [3]: url = 'http://www.qiushibaike.com/text/page/1/'


In [4]: heads = {
   ...:     'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36',
   ...:     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   ...:     'Connection': 'keep-alive',
   ...:     'Upgrade-Insecure-Requests': '1',
   ...:     'Referer': 'http://www.qiushibaike.com/',
   ...:     'Accept-Language': 'zh-CN,zh;q=0.8',
   ...:     'Cookie': '_xsrf=2|db27040e|6b4ed8d9536590d4ec5d2064cc2bef4f|1474364551; _qqq_uuid_="2|1:0|10:1474364551|10:_qqq_uuid_|56:MzBlNWFkOGE3MWEyMzc1MWIxMTE3MDBlZjM2M2RkZWQxYzU5YTg1Yw==|1dd2a4f4ceacad26b5da9cc295d2965226ea25ee73289855cf032629c4992698"; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1474364592; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1474364595; _ga=GA1.2.1125329542.1474364596',
   ...: }


In [5]: res = urllib2.Request(url, headers=heads)


In [6]: response = urllib2.urlopen(res)


In [7]: html = response.read()
   ...: 


In [8]: soup = BeautifulSoup(html, "lxml")


In [9]: someData = soup.select("div.content")


In [10]: print someData
[<div class="content">\n<span>



In the result we can see the line

[<div class="content">\n<span>

which shows a span tag nested inside the div, so we change the select above to:


    someData = soup.select("div.content span")

and check the output as before:


In [11]: someData = soup.select("div.content span")


In [12]: print someData


someData here is actually a list, so we can write a for loop to print each item, as in the short sketch below:
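The screenshot of that loop is omitted here, but a minimal version, continuing the ipython session above, looks like this:

    for some in someData:
        # each element is a bs4 Tag; .text gives just the joke text
        print some.text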




Good: the content we wanted is now printing out correctly.

Below is the complete code, with a bit of polish added:




# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2


def getContent(n):
    # Build the URL for page n of the text-only section.
    url = 'http://www.qiushibaike.com/text/page/' + str(n) + '/'
    #url = 'http://www.qiushibaike.com/8hr/page/' + str(n) + '/'
    print url

    # Browser-like headers (including a cookie captured from a real visit)
    # so the site serves the normal HTML page.
    heads = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Referer': 'http://www.qiushibaike.com/',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cookie': '_xsrf=2|db27040e|6b4ed8d9536590d4ec5d2064cc2bef4f|1474364551; _qqq_uuid_="2|1:0|10:1474364551|10:_qqq_uuid_|56:MzBlNWFkOGE3MWEyMzc1MWIxMTE3MDBlZjM2M2RkZWQxYzU5YTg1Yw==|1dd2a4f4ceacad26b5da9cc295d2965226ea25ee73289855cf032629c4992698"; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1474364592; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1474364595; _ga=GA1.2.1125329542.1474364596'
    }

    # Fetch and parse the page, then pick out the <span> inside each
    # div.content, which holds the text of one joke.
    res = urllib2.Request(url, headers=heads)
    response = urllib2.urlopen(res)
    html = response.read()
    soup = BeautifulSoup(html, "lxml")
    someData = soup.select("div.content span")

    # Print each joke with a running number.
    num = 0
    for some in someData:
        num = num + 1
        print num
        print some.text + '\n'


if __name__ == "__main__":
    # Scrape pages 1 through 4.
    for i in range(1, 5):
        getContent(i)
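One refinement worth considering (not part of the original script, just a sketch): urllib2.urlopen can raise on timeouts or HTTP errors, and as written a single bad page aborts the whole run. A small fetch helper keeps the loop going:

# -*- coding: utf-8 -*-
# Sketch only: a fetch helper with basic error handling and a timeout.
import urllib2

def fetch(url, heads):
    req = urllib2.Request(url, headers=heads)
    try:
        return urllib2.urlopen(req, timeout=10).read()
    except urllib2.HTTPError as e:
        print 'HTTP error %d for %s' % (e.code, url)
    except urllib2.URLError as e:
        print 'failed to reach %s: %s' % (url, e.reason)
    return None

getContent could then call fetch(url, heads) and simply return when it gets None back.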


Final output:


