Python Web Scraping: Douban Short Reviews of 天龙八部 (Demi-Gods and Semi-Devils)

〔晴【天】º〕

Category: Python. Tags: python, web scraping, visualization

Article link: https://blog.csdn.net/qq_43625134/article/details/119242175

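This post scrapes the short reviews of the novel 天龙八部 from Douban at https://book.douban.com/subject/1255625/comments/. Requirements:

(1) Fetch all short reviews and save them to a text file;

(2) Generate a word cloud image from the reviews. The result looks like this:
[Image: the generated word cloud]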

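I. Analysis

1. The Douban short review page for 天龙八部, https://book.douban.com/subject/1255625/comments/, looks like this:
[Image: screenshot of the short review page]
We need to fetch the text of every short review and save it.

2. Inspecting the page elements shows that each short review sits in a <span> inside an <li> tag; the code in section II reads it through the enclosing <p class="comment-content">.
[Image: screenshot of the element inspector]
3. Store the reviews in a text file.
Plain file reading and writing is enough for this.

4. Generate a word cloud image from the reviews.
1) Segment all reviews into words with the jieba library.
2) Load the source image into an array with NumPy.
3) Generate the image with the WordCloud library.
Note: pick a source image with a single-color background, for example:
[Image: example source image]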
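
To make step 2 concrete, here is a minimal sketch that parses a hand-written HTML fragment imitating the structure described above. The fragment itself, the `comment-item` and `short` class names, and the helper name `extract_comments` are assumptions for illustration only; the live page may differ, so inspect it before relying on these selectors.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the assumed layout of the comments page.
SAMPLE_HTML = """
<div id="comments">
  <ul>
    <li class="comment-item">
      <p class="comment-content"><span class="short">First review</span></p>
    </li>
    <li class="comment-item">
      <p class="comment-content"><span class="short">Second review</span></p>
    </li>
    <li class="paginator">not a review</li>
  </ul>
</div>
"""


def extract_comments(html):
    """Return the text of every <p class="comment-content"> inside div#comments."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', attrs={'id': 'comments'})
    if container is None:  # layout changed or the request was blocked
        return []
    comments = []
    for li in container.find_all('li'):
        p = li.find('p', attrs={'class': 'comment-content'})
        if p is not None:  # skip list items that hold no review
            comments.append(p.get_text(strip=True))
    return comments


print(extract_comments(SAMPLE_HTML))
```

Checking for `None` at both levels keeps the loop from crashing on list items that hold no review, which is the main difference from calling `.text` directly on the result of `find`.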

II. Scraping the Short Reviews

Scrape the short reviews and save them to a text file.

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent; the site may reject requests without one.
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
# %d is replaced by the offset of the first review on each page (0, 20, 40, ...).
url = 'https://book.douban.com/subject/1255625/comments/?start=%d&limit=20&status=P&sort=new_score'
pl_list = []  # one entry per short review
for pageNum in range(0, 10):
    new_url = url % (pageNum * 20)
    page_text = requests.get(url=new_url, headers=headers, timeout=10).text
    soup = BeautifulSoup(page_text, 'html.parser')
    comments_div = soup.find('div', attrs={"id": "comments"})
    if comments_div is None:  # blocked, or the page layout changed
        break
    for li in comments_div.find_all('li'):
        p = li.find('p', attrs={"class": "comment-content"})
        if p is not None:  # skip <li> items that hold no review
            pl_list.append(p.text.strip())

# Write the reviews to a text file, one numbered review per line.
with open('天龙八部.txt', 'w', encoding='utf-8') as fw:
    for count, comment in enumerate(pl_list, start=1):
        fw.write("{}.{}\n".format(count, comment))

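Note:
1) To page through the reviews, the URL changes in a regular way: the start parameter grows by 20 per page. Only the first 200 short reviews could be fetched, probably because of a limit on the site itself, but that is enough here.
2) pl_list is a list that holds one short review per element.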
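
The paging rule in note 1 and the numbered output format can be tried without touching the network. The sketch below is only an illustration: `page_urls` and `save_comments` are hypothetical helper names, the output file name is a placeholder, and the default of 10 pages mirrors the roughly 200 reviews mentioned above.

```python
import os
import tempfile

# URL template from the scraper; %d is the offset of the first review on a page.
URL_TEMPLATE = ('https://book.douban.com/subject/1255625/comments/'
                '?start=%d&limit=20&status=P&sort=new_score')


def page_urls(pages=10, page_size=20):
    """Build one URL per page: start=0, 20, 40, ..."""
    return [URL_TEMPLATE % (page * page_size) for page in range(pages)]


def save_comments(comments, path):
    """Write one numbered review per line, in the form '1.text'."""
    with open(path, 'w', encoding='utf-8') as fw:
        for count, comment in enumerate(comments, start=1):
            fw.write('{}.{}\n'.format(count, comment))


urls = page_urls()
print(urls[0])
print(urls[-1])

# Show the file format with placeholder reviews in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'comments.txt')
    save_comments(['first review', 'second review'], out_path)
    with open(out_path, encoding='utf-8') as fr:
        print(fr.read())
```

Keeping URL construction and file output in small functions makes both easy to check before any request is sent.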

III. Building the Word Cloud

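import jieba
import numpy
from PIL import Image
from wordcloud import WordCloud

def generateWordCloud():
    # Join all reviews collected in pl_list (built in section II) into one string.
    finalComment = ''.join(pl_list)

    # WordCloud splits text on whitespace, so segment the Chinese text with jieba first.
    finalComment = ' '.join(jieba.cut(finalComment))
    # Load the source image as an array to use as the mask.
    image = numpy.array(Image.open('1.jpg'))

    word = WordCloud(
        font_path="msyh.ttc",  # a font with Chinese glyphs (Microsoft YaHei on Windows)
        background_color='white',
        mask=image
    ).generate(finalComment)
    word.to_file('天龙八部.jpg')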

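Note:
1) Choose a source image with a single-color background so that the shape is easy to pick out. WordCloud treats pure-white pixels of the mask as masked out, so a white background works well; background_color='white' sets the canvas color of the output and can be changed to suit the image.
2) 1.jpg is the source image.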
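
To see why a plain background matters, recall that WordCloud draws only on mask pixels that are not pure white. The sketch below builds a synthetic grayscale mask with NumPy instead of loading 1.jpg, so the array size, the disc shape, and the helper name `drawable_fraction` are all illustrative assumptions.

```python
import numpy

# Synthetic 100x100 grayscale mask: white (255) background, black (0) disc in the centre.
height, width = 100, 100
yy, xx = numpy.ogrid[:height, :width]
inside_disc = (yy - 50) ** 2 + (xx - 50) ** 2 <= 40 ** 2
mask = numpy.full((height, width), 255, dtype=numpy.uint8)
mask[inside_disc] = 0


def drawable_fraction(mask_array):
    """Share of pixels that are not pure white, i.e. where words may be placed."""
    return float((mask_array != 255).mean())


print(round(drawable_fraction(mask), 2))
```

Running the same check on `numpy.array(Image.open('1.jpg'))` is a quick way to tell whether the background of a source image is white enough to be ignored; a value close to 1.0 suggests the background is not being masked out.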

IV. Complete Code

import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud
import numpy
from PIL import Image

# A browser-like User-Agent; the site may reject requests without one.
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
# %d is replaced by the offset of the first review on each page (0, 20, 40, ...).
url = 'https://book.douban.com/subject/1255625/comments/?start=%d&limit=20&status=P&sort=new_score'
pl_list = []  # one entry per short review
for pageNum in range(0, 10):
    new_url = url % (pageNum * 20)
    page_text = requests.get(url=new_url, headers=headers, timeout=10).text
    soup = BeautifulSoup(page_text, 'html.parser')
    comments_div = soup.find('div', attrs={"id": "comments"})
    if comments_div is None:  # blocked, or the page layout changed
        break
    for li in comments_div.find_all('li'):
        p = li.find('p', attrs={"class": "comment-content"})
        if p is not None:  # skip <li> items that hold no review
            pl_list.append(p.text.strip())

# Write the reviews to a text file, one numbered review per line.
with open('天龙八部.txt', 'w', encoding='utf-8') as fw:
    for count, comment in enumerate(pl_list, start=1):
        fw.write("{}.{}\n".format(count, comment))

def generateWordCloud():
    # Join all reviews into one string.
    finalComment = ''.join(pl_list)

    # WordCloud splits text on whitespace, so segment the Chinese text with jieba first.
    finalComment = ' '.join(jieba.cut(finalComment))
    # Load the source image as an array to use as the mask.
    image = numpy.array(Image.open('1.jpg'))

    word = WordCloud(
        font_path="msyh.ttc",  # a font with Chinese glyphs (Microsoft YaHei on Windows)
        background_color='white',
        mask=image
    ).generate(finalComment)
    word.to_file('天龙八部.jpg')

generateWordCloud()
