Iteratively Parsing Web Page Content

Latest recommended article published on 2020-12-04 08:49:04

栉风沐雨1314

Category: python. Tags: comment flooding, beautifulsoup, web page analysis, iteration, Douban

Article link: https://blog.csdn.net/gt11799/article/details/39552299

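Recently I have been scraping the comment sections of Douban groups. I want to store each comment in a database as one record made up of the user name, the comment text, and the reply URL. My first thought was lxml, but my XPath queries returned everything lumped together. I then tried ElementTree, and after fiddling with it for a long while it only seemed to get more complicated. So I saw the light and went back to where it all began...

BeautifulSoup is one of the best-known modules for processing web pages, and it is the first one I ever used. Back then it felt too complicated, so I switched to lxml, which seemed simpler. Yesterday I read the documentation again, and features I used to find boring now look very handy: next element, next sibling, child nodes, and so on. This post mainly covers iteration and moving to the next element.

A comment block on Douban typically looks like this: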

<li class="clearfix comment-item" id="766574544" data-cid="766574544" >
    <div class="user-face">
        <a href="http://www.douban.com/group/people/57082998/"><img class="pil" src="http://img3.douban.com/icon/u57082998-5.jpg" alt="小怪兽。"/></a>
    </div>
    <div class="reply-doc content" style="padding-left:0px;">
        <div class="bg-img-green">
          <h4>
              <a href="http://www.douban.com/group/people/57082998/" class="">***。</a> (任他们多漂亮，未及你矜贵。)
              <span class="pubtime">2014-09-25 10:55:11</span>
          </h4>
        </div>
        
        <div class="reply-quote">
            <span class="short">前任说他家压力很大，负担重，一定要成功，而我家条件一般，没法帮他啥，然后被分手，我希望他能</span>
            <span class="all">前任说他家压力很大，负担重，一定要成功，而我家条件一般，没法帮他啥，然后被分手，我希望他能如你他所愿，找到能帮到你他的另一半，我现在在大街上看到他，估计都想分分钟砍了他</span>
        <a href="#" class="toggle-reply">
            <span class="expaned">...</span>
        </a><span class="pubdate"><a href="http://www.douban.com/group/people/65152572/">****</a></span></div>
        <p class="">帮他。呵呵。小白脸当我节奏吗</p>

        <div class="operation_div" id="57082998">
            <div class="operation-more">
                <a rel="nofollow" href="javascript:void(0);" data-cid="766574544" class="lnk-delete-comment" title="真的要删除***。的发言?">删除</a>
            </div>
            <a rel="nofollow" href="javascript:void(0);" class="comment-vote lnk-fav">赞</a>
            <a href="http://www.douban.com/group/topic/63187199/?cid=766574544#last" class="lnk-reply">回应</a>
        </div>
    </div>
</li>

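All other comment blocks share this structure, so we can first grab each comment block and then extract the fields from each block separately, which makes it easy to keep the data organized. The first step is to grab the comment blocks (this requires importing BeautifulSoup from bs4):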

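from bs4 import BeautifulSoup

response = self.session.get(each_url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.text, "html.parser")
regions = soup.find_all('li', class_="clearfix comment-item")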

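regions is a list, and every item in it is a Tag object that supports the same search methods as the soup itself. For me this is where BeautifulSoup felt so much nicer than my attempt with lxml.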
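To illustrate why this matters, here is a minimal sketch showing that a search started from one item stays inside that comment block. The markup is a hypothetical, trimmed-down stand-in for the real page, and the ids and names in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup with two comment blocks (not real Douban data)
html = """
<ul>
  <li class="clearfix comment-item" id="1"><h4><a href="/people/a/">alice</a></h4><p>first</p></li>
  <li class="clearfix comment-item" id="2"><h4><a href="/people/b/">bob</a></h4><p>second</p></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
regions = soup.find_all("li", class_="clearfix comment-item")

# Each item is a bs4 Tag, so find() on it only searches inside that block
scoped = [(region["id"], region.find("p").text) for region in regions]
print(scoped)
```

Because each lookup is scoped to its own block, the fields of different comments cannot get mixed together.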

For each region we can keep drilling down to the fields we need, again relying on tag names and attributes:

user_name = region.find('h4').find('a').text
comment = region.find('p').text
reply_url = region.find('a', class_="lnk-reply").attrs['href']

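Conveniently, each of these three fields illustrates a different technique. user_name is the text of the a tag inside the h4 tag; BeautifulSoup lets you chain the find calls directly, which is a joy.

comment is simple: it is the text of the p tag.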

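reply_url has to be selected by attribute. When the attribute name is a Python keyword, such as class, use class_ instead. The link itself is stored in an attribute; .attrs exposes the tag's attributes as a dictionary, so you just look up the value by key.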
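The three lookups above raise an exception as soon as one block lacks an h4, a p, or a reply link. A minimal sketch of a more defensive version follows; the function name extract_comments and the sample markup are hypothetical, and whether real pages contain such incomplete blocks is an assumption.

```python
from bs4 import BeautifulSoup

def extract_comments(html):
    """Return one dict per comment block; missing pieces become None."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for region in soup.find_all("li", class_="comment-item"):
        name_tag = region.select_one("h4 a")            # a tag inside the h4
        comment_tag = region.find("p")                  # first p in the block
        reply_tag = region.find("a", class_="lnk-reply")
        records.append({
            "user_name": name_tag.get_text(strip=True) if name_tag else None,
            "comment": comment_tag.get_text(strip=True) if comment_tag else None,
            # .get() returns None instead of raising KeyError when href is absent
            "reply_url": reply_tag.get("href") if reply_tag else None,
        })
    return records

# Hypothetical sample: the second block has no reply link
sample = """
<li class="clearfix comment-item"><h4><a href="/people/1/">alice</a></h4>
<p class="">hello</p><a href="/topic/1/?cid=1#last" class="lnk-reply">reply</a></li>
<li class="clearfix comment-item"><h4><a href="/people/2/">bob</a></h4><p>no reply link here</p></li>
"""
records = extract_comments(sample)
print(records)
```

Each dict corresponds to one database row with the three fields described above.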

Feels good, doesn't it? BeautifulSoup's child-node and parent-node navigation can do the same job; I just prefer my own way.
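For comparison, a rough sketch of the parent and child navigation mentioned above: it starts from the reply link and climbs up to the enclosing comment block. The markup is a hypothetical, shortened version of the comment block shown earlier.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened comment block
html = """
<li class="clearfix comment-item" data-cid="766574544">
  <div class="reply-doc content">
    <h4><a href="/people/1/">alice</a></h4>
    <p>hello</p>
    <a href="/topic/1/?cid=766574544#last" class="lnk-reply">reply</a>
  </div>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

# Start from the reply link and climb to the enclosing comment block
link = soup.find("a", class_="lnk-reply")
block = link.find_parent("li")
cid = block["data-cid"]

# .parent is the immediate parent; .children yields direct children,
# including whitespace text nodes, so keep only the Tag objects
container = link.parent
child_tags = [child.name for child in container.children if isinstance(child, Tag)]
print(cid, container["class"], child_tags)
```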

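I also played with the next-object navigation. For content spread over several pages, we usually find the next-page link on each page and keep iterating until there is no next page. Douban's pagination usually looks like this: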

<span class="thispage" data-total-page="15">1</span>
                
            <a href="http://www.douban.com/group/topic/63187199/?start=100" >2</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=200" >3</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=300" >4</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=400" >5</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=500" >6</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=600" >7</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=700" >8</a>
        
                
            <a href="http://www.douban.com/group/topic/63187199/?start=800" >9</a>
        
            <span class="break">...</span>
                
            <a href="http://www.douban.com/group/topic/63187199/?start=1300" >14</a>
        
            <a href="http://www.douban.com/group/topic/63187199/?start=1400" >15</a>
        
        <span class="next">
            <link rel="next" href="http://www.douban.com/group/topic/63187199/?start=100"/>
            <a href="http://www.douban.com/group/topic/63187199/?start=100" >后页></a>
        </span>

What I care about most is the first few pages, so why not collect the page links up front? Just find the "thispage" span and keep moving to the next element.

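all_pages_urls = []
nextpage = soup.find('span', class_="thispage")
while True:
    try:
        # A whitespace text node sits between two page links, so step twice
        nextpage = nextpage.next_sibling.next_sibling
        all_pages_urls.append(nextpage.attrs['href'])
    except (KeyError, AttributeError):
        # Hit an element without href (e.g. <span class="break">) or ran out of siblings
        break
print("collected %s page links" % len(all_pages_urls))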
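As an alternative sketch, find_next_siblings can collect the following a tags in one call and skips the whitespace text nodes on its own. Unlike the loop above, it does not stop at the "break" span, so it also picks up the last pages. The markup is a trimmed copy of the pagination shown earlier; the wrapping div is an assumption.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the pagination markup; the outer <div> is assumed
html = """
<div>
<span class="thispage" data-total-page="15">1</span>

    <a href="http://www.douban.com/group/topic/63187199/?start=100" >2</a>

    <a href="http://www.douban.com/group/topic/63187199/?start=200" >3</a>

    <span class="break">...</span>

    <a href="http://www.douban.com/group/topic/63187199/?start=1400" >15</a>

<span class="next">
    <a href="http://www.douban.com/group/topic/63187199/?start=100" >next</a>
</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
current = soup.find("span", class_="thispage")

# Only sibling <a> tags with an href are returned; the link nested inside
# <span class="next"> is not a sibling, so it is left out
page_urls = [a["href"] for a in current.find_next_siblings("a", href=True)]
print(page_urls)
```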

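Of course, Douban's next-page links follow a regular pattern: you only need to keep incrementing the start parameter (by 100 in the example above). Still, it was fun to try out the next-element feature.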
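A minimal sketch of that increment approach, assuming the step of 100 seen in the sample markup holds for every page and that the page count comes from the data-total-page attribute of the "thispage" span. The helper name build_page_urls is hypothetical.

```python
def build_page_urls(topic_url, total_pages, page_size=100):
    """Build the URLs for pages 2..total_pages by incrementing the start offset."""
    return ["%s?start=%d" % (topic_url, page_size * i) for i in range(1, total_pages)]

# 15 is the data-total-page value from the sample markup
urls = build_page_urls("http://www.douban.com/group/topic/63187199/", 15)
print(len(urls), urls[0], urls[-1])
```

The first page is the topic URL itself, so the list starts at page 2.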

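Finally, here is the result (the images in the four posts I wrote the day before yesterday are all broken; I will fix them later ==):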