Learned by following a video course.
Zhang Li, Nanjing University
Here is the code first:
import re
import time

import requests
from bs4 import BeautifulSoup


def getAuthor(data):
    """Return (user name, rating) for one comment; rating stays 0 if no stars are found."""
    rate = 0
    soup = BeautifulSoup(data, 'lxml')
    # span.comment-info holds both the user link and the star-rating span
    info = soup.find('span', 'comment-info')
    if info is None:
        return None, rate
    # the star class looks like "user-stars allstar30 rating"; capture the number
    pattern = re.compile('span class="user-stars allstar(.*?) rating"')
    stars = re.findall(pattern, str(info))
    if stars and stars[0].isdigit():
        rate = int(stars[0])
    # the first <a> inside the block is the user's profile link
    link = info.find('a')
    name = link.string if link is not None else None
    return name, rate


def getContext(data):
    """Return the comment text, or None if there is no comment-content paragraph."""
    soup = BeautifulSoup(data, 'lxml')
    content = soup.find('p', 'comment-content')
    return content.string if content is not None else None

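
index = 0   # number of pages fetched so far
count = 0   # number of comments printed so far
NUM = 50    # stop after this many comments
while count < NUM:
    r = requests.get('https://book.douban.com/subject/1021056/comments/hot?p=' + str(index + 1))
    index += 1
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('li', 'comment-item')
    if not comments:
        # nothing on this page (past the last page, or the request failed): stop instead of looping forever
        break
    for comment in comments:
        name, rate = getAuthor(str(comment))
        context = getContext(str(comment))
        print(str(count + 1))
        print("\twriter: ", name, " - ", rate)
        print("\tcontext: ", context)
        count += 1
        if count >= NUM:
            break
    time.sleep(2)  # pause between page requests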
URLs
https://book.douban.com/subject/bookid/comments/hot
Replace bookid with the id of the book you want to scrape.
Page 1 of the hottest comments: https://book.douban.com/subject/1021056/comments/hot (or append ?p=1)
Page 2 of the hottest comments: https://book.douban.com/subject/1021056/comments/hot?p=2
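The URL pattern above can be wrapped in a small helper. This is only a sketch: the name build_comments_url and its parameters are made up for illustration, and it assumes the ?p= page numbering described above.

```python
from urllib.parse import urlencode

BASE = "https://book.douban.com/subject/{book_id}/comments/hot"


def build_comments_url(book_id, page=1):
    """Return the hot-comments URL for a book; pages are numbered from 1."""
    if page < 1:
        raise ValueError("page must be >= 1")
    url = BASE.format(book_id=book_id)
    # Page 1 is the bare URL, as listed above; later pages add ?p=N.
    if page == 1:
        return url
    return url + "?" + urlencode({"p": page})


print(build_comments_url(1021056))
print(build_comments_url(1021056, 2))
```

Building the URL in one place keeps the book id and the page number out of string concatenations scattered through the loop.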
Viewing the page source shows that each comment has the following structure:
<li class="comment-item" data-cid="274376913">
    <div class="avatar">
        <a title="夕雾" href="https://www.douban.com/people/1299702/">
            <img src="https://img3.doubanio.com/icon/u1299702-71.jpg">
        </a>
    </div>
    <div class="comment">
        <h3>
            <span class="comment-vote">
                <span id="c-274376913" class="vote-count">0</span>
                <a href="javascript:;" id="btn-274376913" class="j a_show_login" data-cid="274376913">有用</a>
            </span>
            <span class="comment-info">
                <a href="https://www.douban.com/people/1299702/">夕雾</a>
                <span class="user-stars allstar30 rating" title="还行"></span>
                <span>2010-07-24</span>
            </span>
        </h3>
        <p class="comment-content">哎哟喂我也看过</p>
    </div>
</li>
Each block like the one above can be obtained with:
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('li', 'comment-item')
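As a quick check of that call, here is a minimal sketch on a trimmed-down sample. The second positional argument of find_all is treated as a CSS class filter. The second data-cid and its text are made up for illustration, and the built-in "html.parser" is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

SAMPLE = """
<ul>
  <li class="comment-item" data-cid="274376913">
    <p class="comment-content">哎哟喂我也看过</p>
  </li>
  <li class="comment-item" data-cid="100000001">
    <p class="comment-content">second comment (made up)</p>
  </li>
  <li class="other">not a comment</li>
</ul>
"""


def comment_ids(html):
    """Return the data-cid of every <li class="comment-item"> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    # A string as the second argument filters on the class attribute.
    items = soup.find_all("li", "comment-item")
    return [li["data-cid"] for li in items]


print(comment_ids(SAMPLE))  # ['274376913', '100000001']
```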
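Then, for each comment:
User name: in the a tag inside the span comment-info block
    comments = soup.find_all('span', "comment-info")
    comments = soup.find_all('a')   # on a soup re-parsed from the comment-info block
    item.string
User comment: in the p comment-content block
    comments = soup.find_all('p', "comment-content")
    comment.string
User rating: in the span tag inside the span comment-info block; match the class user-stars allstar[00] rating with a regular expression, capturing the two-digit part
    pattern = re.compile('span class="user-stars allstar(.*?) rating"')
    re.findall(pattern, str(comment))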
I don't fully understand the regex here: in 'span class="user-stars allstar(.*?) rating"', does (.*?) stand for the two digits?
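On that question: (.*?) is a capture group that matches any characters, as few as possible, up to the following " rating". It returns the two digits here only because that is what sits between allstar and rating in the page. The sketch below compares it with a stricter (\d+) group; the names lazy, strict and star_rating are made up for illustration.

```python
import re

SNIPPET = '<span class="user-stars allstar30 rating" title="还行"></span>'

# Pattern from the notes: lazy "anything" group.
lazy = re.compile(r'span class="user-stars allstar(.*?) rating"')
# Stricter variant: one or more digits only.
strict = re.compile(r'span class="user-stars allstar(\d+) rating"')


def star_rating(html, pattern=strict):
    """Return the allstar number as an int, or 0 when no rating span matches."""
    m = pattern.search(html)
    return int(m.group(1)) if m else 0


print(lazy.findall(SNIPPET))                    # ['30']
print(star_rating(SNIPPET))                     # 30
print(star_rating('<span>2010-07-24</span>'))   # 0

# The lazy group also accepts an empty string, which int() would reject.
print(lazy.findall('span class="user-stars allstar rating"'))  # ['']
```

With the digit-only group, a comment without a well-formed star class falls back to 0 instead of raising in int().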
Also, comments = soup.find_all( ... ) returns a bs4.element.ResultSet, which can be used directly as a list,
and in for item in comments, each item is a bs4.element.Tag.
item.string returns the content between the tags, e.g. 哎哟喂我也看过
str(item) is the whole element, e.g. <p class="comment-content">哎哟喂我也看过</p>
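These points can be checked on a tiny sample. This is a minimal sketch: the second paragraph is made up to show one caveat, namely that .string is None when a tag has more than one child, in which case get_text() still returns the text. The built-in "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = (
    '<p class="comment-content">哎哟喂我也看过</p>'
    '<p class="comment-content">a <b>b</b></p>'  # made-up second comment
)

soup = BeautifulSoup(HTML, "html.parser")
found = soup.find_all("p", "comment-content")
first = found[0]

print(type(found).__name__, isinstance(found, list))  # ResultSet True
print(isinstance(first, Tag))                         # True
print(first.string)   # 哎哟喂我也看过
print(str(first))     # <p class="comment-content">哎哟喂我也看过</p>

# A tag with mixed children has no single string.
print(found[1].string)      # None
print(found[1].get_text())  # a b
```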