Scraping Douban reviews of Avengers: Infinity War with Python and generating an adorable Groot

Latest recommended article published on 2020-12-11 23:50:24

weixin_39781945

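### 1. Requirements

This project uses a Python crawler to scrape reviews of Avengers: Infinity War from Douban Movie (the code below fetches the first 8 pages, 20 reviews per page) and saves them to a local file. The reviews are then segmented into words, and a word cloud is generated in the shape of Groot, the tree-like character.

### 2. Implementation

This section explains the crawler code and the code that generates the word cloud image.

###### Python crawler

First determine the URL of the page to scrape, then fetch the page with requests.get(). The code is as follows: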

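```python
import requests

# Fetch a page and return its HTML, or an empty string on failure
def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        return ''
```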

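Once the page has been fetched, parse it with BeautifulSoup and use a CSS selector to locate the elements that hold the review text, then extract that text.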

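```python
from bs4 import BeautifulSoup

# Extract the review texts from one page
def getComment(html):
    soup = BeautifulSoup(html, 'html.parser')
    comments_list = []
    # Each review is a <p> that is a direct child of an element with class "comment"
    comment_nodes = soup.select('.comment > p')
    for node in comment_nodes:
        comments_list.append(node.get_text().strip().replace("\n", "") + '\n')
    return comments_list
```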
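To see what the selector `.comment > p` picks up, here is a minimal sketch that runs the same extraction on a small inline HTML snippet. The markup in `SAMPLE_HTML` and the name `extract_comments` are invented for illustration; the real Douban page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from Douban.
SAMPLE_HTML = """
<div class="comment-item">
  <div class="comment">
    <h3>user_a</h3>
    <p>
      Great movie
    </p>
  </div>
</div>
<div class="comment-item">
  <div class="comment">
    <p> Groot steals the show </p>
  </div>
</div>
<p>Footer text that is not a review</p>
"""

def extract_comments(html):
    """Return the text of every <p> that is a direct child of a .comment element."""
    soup = BeautifulSoup(html, "html.parser")
    nodes = soup.select(".comment > p")
    # Same cleanup as getComment: trim, drop inner newlines, end with one newline
    return [node.get_text().strip().replace("\n", "") + "\n" for node in nodes]

comments = extract_comments(SAMPLE_HTML)
print(comments)
```

The footer paragraph is skipped because its parent does not carry the class `comment`, and the `<h3>` is skipped because the selector only asks for `<p>` children.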

Save the scraped reviews to a text file for later analysis.

```python
import random
import time

def saveCommentText(fpath):
    pre_url = "https://movie.douban.com/subject/24773958/comments?"
    depth = 8  # number of pages to fetch, 20 reviews per page
    with open(fpath, 'a', encoding='utf-8') as f:
        for i in range(depth):
            url = pre_url + 'start=' + str(20 * i) + '&limit=20&sort=new_score&' + 'status=P'
            html = getHtml(url)
            f.writelines(getComment(html))
            # Pause between 1.05 and 2 seconds before the next request
            time.sleep(1 + float(random.randint(1, 20)) / 20)
```
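The paging logic in `saveCommentText` can be checked without touching the network. The sketch below builds the same URLs with `urllib.parse.urlencode` instead of string concatenation and computes the same random pause. The names `build_page_urls`, `polite_delay`, `BASE_URL` and `PAGE_SIZE` are hypothetical helpers introduced here, not part of the original script.

```python
import random
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/24773958/comments"
PAGE_SIZE = 20  # reviews per page, as in saveCommentText

def build_page_urls(depth):
    """Return the URLs of the first `depth` review pages."""
    urls = []
    for page in range(depth):
        query = urlencode({
            "start": PAGE_SIZE * page,  # offset of the first review on the page
            "limit": PAGE_SIZE,
            "sort": "new_score",
            "status": "P",
        })
        urls.append(BASE_URL + "?" + query)
    return urls

def polite_delay():
    """Return a pause in seconds between 1.05 and 2.0, in steps of 0.05."""
    return 1 + random.randint(1, 20) / 20

urls = build_page_urls(8)
for url in urls:
    print(url)
print("next pause:", polite_delay(), "seconds")
```

With `depth = 8` the offsets run from 0 to 140, so at most 160 reviews are requested.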

###### Generating the image with a word cloud

> The code is commented in detail; see the comments for explanation

```python
import codecs

import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

def drawWordcloud():
    with codecs.open('text.txt', encoding='utf-8') as f:
        comment_text = f.read()

    # Mask image; any picture in the img directory can be used instead
    color_mask = plt.imread("img/Groot4.jpeg")

    # Stop words to exclude from the cloud
    Stopwords = ['就是', '电影', '你们', '这么', '不过', '但是',
                 '除了', '时候', '已经', '可以', '只是', '还是', '只有', '不要', '觉得', '，', '。']

    # Word cloud settings
    cloud = WordCloud(font_path="simhei.ttf",
                      background_color='white',
                      max_words=260,
                      max_font_size=150,
                      min_font_size=4,
                      mask=color_mask,
                      stopwords=Stopwords)

    # Build the cloud. generate() takes the full text; alternatively, compute
    # the word frequencies yourself and call generate_from_frequencies()
    word_cloud = cloud.generate(comment_text)

    # Derive the colors from the mask image (mind the image size)
    image_colors = ImageColorGenerator(color_mask)

    # Show the cloud with its default colors
    plt.imshow(cloud)
    plt.axis("off")

    # Show the cloud recolored from the mask image
    plt.figure()
    plt.imshow(cloud.recolor(color_func=image_colors))
    plt.axis("off")

    # Show the mask image itself
    plt.figure()
    plt.imshow(color_mask, cmap=plt.cm.gray)
    plt.axis("off")
    plt.show()

    # Save the image
    word_cloud.to_file("img/comment_cloud.jpg")
```
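The code comment above mentions `generate_from_frequencies` as an alternative to `generate`. A minimal sketch of the counting step it would need is shown below, using only the standard library. The helper `count_words`, the shortened `STOPWORDS` set and the sample sentence are made up for illustration. Note that, per the wordcloud documentation, the `stopwords` setting is ignored by `generate_from_frequencies`, so the filtering has to happen while counting.

```python
from collections import Counter

# Shortened, illustrative stop-word set; the article's full list is longer.
STOPWORDS = {"就是", "电影", "，", "。"}

def count_words(segmented_text, stopwords=STOPWORDS):
    """Count space-separated tokens, skipping stop words."""
    tokens = segmented_text.split()
    return Counter(token for token in tokens if token not in stopwords)

# Invented sample, already segmented with spaces the way cutWords writes text.txt
sample = "格鲁特 就是 可爱 ， 格鲁特 电影 。 灭霸 格鲁特"
frequencies = count_words(sample)
print(frequencies.most_common(3))
```

The resulting mapping of word to count could then be passed to `cloud.generate_from_frequencies(frequencies)` in place of `cloud.generate(comment_text)`.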

###### For easier reading, here is the complete code:

```python
import codecs
import random
import time

import jieba
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud, ImageColorGenerator

# Fetch a page and return its HTML, or an empty string on failure
def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        return ''

# Extract the review texts from one page
def getComment(html):
    soup = BeautifulSoup(html, 'html.parser')
    comments_list = []
    comment_nodes = soup.select('.comment > p')
    for node in comment_nodes:
        comments_list.append(node.get_text().strip().replace("\n", "") + '\n')
    return comments_list

# Scrape the first `depth` pages of reviews and append them to fpath
def saveCommentText(fpath):
    pre_url = "https://movie.douban.com/subject/24773958/comments?"
    depth = 8
    with open(fpath, 'a', encoding='utf-8') as f:
        for i in range(depth):
            url = pre_url + 'start=' + str(20 * i) + '&limit=20&sort=new_score&' + 'status=P'
            html = getHtml(url)
            f.writelines(getComment(html))
            time.sleep(1 + float(random.randint(1, 20)) / 20)

# Segment the saved reviews with jieba and append the space-separated words to text.txt
def cutWords(fpath):
    text = ''
    with open(fpath, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip('\n')
            text += ' '.join(jieba.cut(line))
            text += ' '
    with codecs.open('text.txt', 'a', encoding='utf-8') as f:
        f.write(text)

def drawWordcloud():
    with codecs.open('text.txt', encoding='utf-8') as f:
        comment_text = f.read()

    # Mask image
    color_mask = plt.imread("img/Groot4.jpeg")

    # Stop words to exclude from the cloud
    Stopwords = ['就是', '电影', '你们', '这么', '不过', '但是',
                 '除了', '时候', '已经', '可以', '只是', '还是', '只有', '不要', '觉得', '，', '。']

    # Word cloud settings
    cloud = WordCloud(font_path="simhei.ttf",
                      background_color='white',
                      max_words=260,
                      max_font_size=150,
                      min_font_size=4,
                      mask=color_mask,
                      stopwords=Stopwords)

    # Build the cloud. generate() takes the full text; alternatively, compute
    # the word frequencies yourself and call generate_from_frequencies()
    word_cloud = cloud.generate(comment_text)

    # Derive the colors from the mask image (mind the image size)
    image_colors = ImageColorGenerator(color_mask)

    # Show the cloud with its default colors
    plt.imshow(cloud)
    plt.axis("off")

    # Show the cloud recolored from the mask image
    plt.figure()
    plt.imshow(cloud.recolor(color_func=image_colors))
    plt.axis("off")

    # Show the mask image itself
    plt.figure()
    plt.imshow(color_mask, cmap=plt.cm.gray)
    plt.axis("off")
    plt.show()

    # Save the image
    word_cloud.to_file("img/comment_cloud.jpg")
```

### 3. Project structure

> Project structure

![](/contentImages/image/20180521/ecX8WtDp4hbtxshyr5S.jpg)

> Note that the whole project has a single source file; the other files are images

### 4. Results

A whole wave of Groots incoming

> Groot No. 1

![](http://p6v1c8fgh.bkt.clouddn.com/groot1.jpg)

> Groot No. 2

![](http://p6v1c8fgh.bkt.clouddn.com/groot3.jpg)

> Groot No. 3

![](http://p6v1c8fgh.bkt.clouddn.com/groot4.jpg)

> Groot No. 4

![](http://p6v1c8fgh.bkt.clouddn.com/groot5.jpg)
