Scraping the comments for a book on Douban

lslsyqyq

Article link: https://blog.csdn.net/lslsyqyq/article/details/75413053

I learned this by following the video course 用Python玩转数据, taught by 张莉 (Zhang Li) of Nanjing University.

The code comes first:

import re
import time

import requests
from bs4 import BeautifulSoup

def getAuthor(data):
    """Return (user name, rating) for one comment; rating is 0 if no stars were given."""
    rate = 0
    soup = BeautifulSoup(data, 'lxml')
    comments = soup.find_all('span', "comment-info")  # holds the name and the rating
    if not comments:
        return None, rate  # no comment-info block in this fragment
    comment = comments[0]
    # The class looks like "user-stars allstar30 rating"; capture the "30".
    pattern = re.compile('span class="user-stars allstar(.*?) rating"')
    p = re.findall(pattern, str(comment))
    if p:
        rate = int(p[0])
    # The first <a> inside the block carries the user name.
    links = comment.find_all('a')
    name = links[0].string if links else None
    return name, rate

def getContext(data):
    """Return the text of the first comment-content paragraph, or None."""
    soup = BeautifulSoup(data, 'lxml')
    comments = soup.find_all('p', "comment-content")
    for comment in comments:
        return comment.string
    return None

index = 0
count = 0
NUM = 50  # number of comments to collect
while count < NUM:
    r = requests.get('https://book.douban.com/subject/1021056/comments/hot?p='+str(index+1))
    index += 1
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('li', 'comment-item')
    if not comments:
        break  # empty page: stop instead of looping forever
    for comment in comments:
        name, rate = getAuthor(str(comment))
        context = getContext(str(comment))
        print(str(count+1))
        print("\twriter:  ", name, " - ", rate)
        print("\tcontext: ", context)
        count += 1
        if count >= NUM:
            break
    time.sleep(2)  # pause between page requests
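
A paging loop like the one above has two stopping conditions to handle: enough comments have been collected, or a page comes back with nothing on it. The following is only a sketch of that control flow with the network taken out. The names collect, fetch_page and max_pages are made up for illustration and are not part of the original script.

```python
def collect(fetch_page, limit, max_pages=100):
    """Gather up to `limit` items from successive pages, stopping on an empty page."""
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # empty page: nothing more to read
        # Take only as many items as are still needed to reach the limit.
        results.extend(items[:limit - len(results)])
        if len(results) >= limit:
            break
    return results

# Stand-in for real requests: three pages of two fake comments each, then nothing.
fake_pages = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f']}
print(collect(lambda p: fake_pages.get(p, []), limit=5))
print(collect(lambda p: fake_pages.get(p, []), limit=50))
```

In a real run, fetch_page would request one page, parse out the comment-item blocks, and sleep between calls.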

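URLs

https://book.douban.com/subject/bookid/comments/hot

Replace bookid with the ID of the book you want to scrape.

Page 1 of the hottest comments: https://book.douban.com/subject/1021056/comments/hot (or the same URL with ?p=1 appended)

Page 2 of the hottest comments: https://book.douban.com/subject/1021056/comments/hot?p=2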
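
The URL scheme can be wrapped in a small helper. This is a sketch only: build_comments_url and BASE are names invented here, and it assumes the ?p= parameter keeps behaving as described in this article.

```python
BASE = 'https://book.douban.com/subject/{book_id}/comments/hot'

def build_comments_url(book_id, page=1):
    """Return the hot-comments URL for a book; page numbering starts at 1."""
    url = BASE.format(book_id=book_id)
    # Page 1 is reachable without a query string, so only add ?p= from page 2 on.
    return url if page <= 1 else url + '?p=' + str(page)

print(build_comments_url('1021056'))
print(build_comments_url('1021056', 2))
```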

Viewing the page source shows that each comment is structured as follows:

<li class="comment-item" data-cid="274376913">
            <div class="avatar">
                <a title="夕雾" href="https://www.douban.com/people/1299702/">
                    <img src="https://img3.doubanio.com/icon/u1299702-71.jpg">
                </a>
            </div>
        <div class="comment">
            <h3>
                <span class="comment-vote">
                    <span id="c-274376913" class="vote-count">0</span>
                        <a href="javascript:;" id="btn-274376913" class="j a_show_login" data-cid="274376913">有用</a>
                </span>
                <span class="comment-info">
                    <a href="https://www.douban.com/people/1299702/">夕雾</a>
                        <span class="user-stars allstar30 rating" title="还行"></span>
                    <span>2010-07-24</span>
                </span>
            </h3>
            <p class="comment-content">哎哟喂我也看过</p>
        </div>
    </li>

Each block like the one above can be obtained with:

soup = BeautifulSoup(r.text, 'lxml')
comments = soup.find_all('li', 'comment-item')

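Then:

The user name is in the a tag inside the span comment-info block:

          soup.find_all('span', "comment-info")
          comments = soup.find_all('a')   # searched within the comment-info block
          comment.string

The comment text is in the p comment-content block:

          comments = soup.find_all('p', "comment-content")
          comment.string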
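
To see these lookups work without touching the network, they can be run against a trimmed copy of the sample comment shown earlier. This is a sketch, not the article's script: it uses find to grab the first match directly, and it assumes the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample comment from the page source.
SAMPLE = '''
<li class="comment-item" data-cid="274376913">
  <div class="comment">
    <h3>
      <span class="comment-info">
        <a href="https://www.douban.com/people/1299702/">夕雾</a>
        <span class="user-stars allstar30 rating" title="还行"></span>
        <span>2010-07-24</span>
      </span>
    </h3>
    <p class="comment-content">哎哟喂我也看过</p>
  </div>
</li>
'''

# 'html.parser' is an assumption made here; the article itself uses 'lxml'.
soup = BeautifulSoup(SAMPLE, 'html.parser')
item = soup.find_all('li', 'comment-item')[0]

info = item.find('span', 'comment-info')         # block holding name, stars, date
name = info.find('a').string                     # text of the first <a> inside it
text = item.find('p', 'comment-content').string  # the comment itself

print(name, text)
```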

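The user's rating is in a span tag inside the span comment-info block. A regular expression matches the class user-stars allstarNN rating, where NN is the two-digit number to capture (30 in the sample above):

pattern = re.compile('span class="user-stars allstar(.*?) rating"')

re.findall(pattern, str(comment))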

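I did not fully understand this regular expression at first: does (.*?) in 'span class="user-stars allstar(.*?) rating"' stand for the two digits? Not exactly. The parentheses form a capture group, . matches any character, and *? repeats it as few times as possible, so the group captures whatever sits between allstar and the following  rating". In this markup that happens to be the two digits. A pattern such as (\d+) would restrict the match to digits only.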
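
The behaviour of the capture group can be checked with the standard re module alone. In this sketch the stricter \d+ pattern and the parse_rate helper are illustrative additions, not something the original script contains.

```python
import re

html = '<span class="user-stars allstar30 rating" title="还行"></span>'

# Pattern from the article: (.*?) lazily captures whatever precedes ' rating"'.
loose = re.compile(r'span class="user-stars allstar(.*?) rating"')
# Stricter variant (an assumption, not from the article): digits only.
strict = re.compile(r'span class="user-stars allstar(\d+) rating"')

print(loose.findall(html))
print(strict.findall(html))

def parse_rate(fragment):
    """Return the star value as an int, or 0 when the fragment has no rating."""
    found = strict.findall(fragment)
    return int(found[0]) if found else 0

print(parse_rate(html))
print(parse_rate('<span>2010-07-24</span>'))
```

Returning 0 for a missing rating mirrors the rate = 0 default in getAuthor.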

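Also, comments = soup.find_all( ... ) returns a bs4.element.ResultSet, which can be used directly as a list.

In for item in comments, each item is a bs4.element.Tag.

item.string returns the content between the tags, for example 哎哟喂我也看过

str(item) is the whole element, for example <p class="comment-content">哎哟喂我也看过</p>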
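
These types and the difference between .string and str() can be confirmed on a tiny document. This sketch again assumes 'html.parser', and its second paragraph is invented to show one caveat: .string is None when a tag has more than one child, in which case get_text() still returns the text.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

html = ('<p class="comment-content">哎哟喂我也看过</p>'
        '<p class="comment-content">first <b>second</b></p>')  # second <p> is made up
soup = BeautifulSoup(html, 'html.parser')  # parser choice is an assumption

found = soup.find_all('p', 'comment-content')
print(type(found).__name__, isinstance(found, list))  # ResultSet is a list subclass
print(type(found[0]).__name__)

print(found[0].string)      # text between the tags
print(str(found[0]))        # the whole element, tags included
print(found[1].string)      # None: this <p> has more than one child
print(found[1].get_text())  # concatenated text of all descendants
```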
