Scraping the answers under a Zhihu question and building a keyword word cloud

Latest recommended article published 2024-06-28 13:06:16

天下第一小白

Category: Python开发日记 (Python Development Diary). Tags: python, web crawler, dynamically loaded pages, selenium

Article link: https://blog.csdn.net/sinat_36899414/article/details/78117286

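Ever since I started learning web scraping I have wanted to scrape Zhihu, but I only got around to it now, which is a little embarrassing.

This project is implemented with Scrapy on Python 2.7.

My original plan was to build a proxy pool with Tor + Scrapy, but that also required getting around the firewall, which was too much trouble. Instead I simply rotated the User-Agent, and it turned out Zhihu did not ban my IP, so I could scrape freely. I also considered scraping the authors and the replies nested under comments, but decided it was unnecessary because it is just the same process repeated.

The Zhihu question to scrape today is #如何评价王尼玛？ (How would you evaluate Wang Nima?)

Target URL: https://www.zhihu.com/question/26049726

This is a dynamically loaded page, so the JavaScript and AJAX content has to be handled. Each time you scroll down, a "查看更多回答" (view more answers) button appears, and it has to be clicked repeatedly before all the answers are shown.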

I use selenium + PhantomJS to simulate the repeated clicks until the whole page has loaded, as noted in the code comments.
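
The click-until-the-button-disappears idea can be sketched without any browser at all. This is only an illustration under stated assumptions: `click_until_gone`, `FakePage`, and the `max_rounds` cap are hypothetical names that do not appear in the original project, and the fake page stands in for a real Selenium driver.

```python
import time


def click_until_gone(find_buttons, pause=0.0, max_rounds=1000):
    """Click every button returned by find_buttons() until none are left.

    find_buttons is a hypothetical callable that returns a list of objects
    with a click() method (for example, Selenium elements).
    Returns the number of clicks performed.
    """
    clicks = 0
    for _ in range(max_rounds):  # cap the rounds so a stuck page cannot loop forever
        buttons = find_buttons()
        if not buttons:
            break  # no button left: everything has been loaded
        for button in buttons:
            button.click()
            clicks += 1
            time.sleep(pause)  # give the newly requested content time to load
    return clicks


class FakePage(object):
    """Stand-in for a page that needs `remaining` clicks before the button disappears."""

    def __init__(self, remaining):
        self.remaining = remaining

    def click(self):
        self.remaining -= 1

    def find_buttons(self):
        return [self] if self.remaining > 0 else []


page = FakePage(3)
total_clicks = click_until_gone(page.find_buttons)
```

With a real driver, `find_buttons` would presumably be something like `lambda: driver.find_elements_by_xpath(...)` and `pause` would be a few seconds. The round cap is a design choice added here so that a button that never disappears cannot hang the spider.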

Here is the code first:
ZhiHu.py

# encoding:utf-8

import sys
import time

import scrapy
from bs4 import BeautifulSoup
from selenium import webdriver
from SpderSpy.items import SpderspyItem

# Python 2 only: make utf-8 the default encoding for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf8')

class ZH(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']  # domain names only, not full URLs

    def start_requests(self):
        coo = {}
        cookie = 'your cookie'  # copy your own cookie string from the browser
        for seg in cookie.split(';'):  # turn the cookie string into a dict
            if '=' not in seg:
                continue  # skip empty or malformed segments
            key, value = seg.strip().split('=', 1)
            coo[key] = value
        # Send the cookie to simulate a logged-in session. The page can actually be
        # scraped without logging in; I added this in case it turned out to be needed.
        return [scrapy.Request('https://www.zhihu.com/question/26049726', cookies=coo, callback=self.parse)]

    def parse(self, response):
        print "spider beginning"
        url = response.url
        driver = webdriver.PhantomJS()
        driver.get(url)
        more_xpath = u'//button[text()="查看更多回答"]'
        clk = driver.find_elements_by_xpath(more_xpath)  # find the "view more answers" button
        while clk:  # once the button is gone, every answer has been loaded
            for i in clk:
                i.click()  # simulate a click
                time.sleep(3)  # wait 3s, because some content has not loaded right after the click
            clk = driver.find_elements_by_xpath(more_xpath)
        # page_source is already unicode, so no from_encoding argument is needed
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()
        item = SpderspyItem()
        item['content'] = soup.find_all('span', class_="RichText ztext CopyrightRichText-richText")
        count = 0
        with open('clould.txt', 'w') as words:  # file the answers are written to
            for i in item['content']:
                text = i.get_text().encode('utf-8')  # encode the unicode text as utf-8
                print text
                print '\n\n'
                words.write(text + '\n')
                count = count + 1
        print 'Total: %d answers' % count
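
The cookie handling in start_requests can also be tried on its own. The following is a minimal sketch, assuming a browser-style cookie string; the function name `parse_cookie_string` and the cookie names in the example are made up for illustration and are not real Zhihu cookies.

```python
def parse_cookie_string(cookie):
    """Convert 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for segment in cookie.split(';'):
        segment = segment.strip()
        if '=' not in segment:
            continue  # skip empty or malformed segments
        key, value = segment.split('=', 1)  # split on the first '=' only
        cookies[key.strip()] = value.strip()
    return cookies


# Hypothetical cookie string: one value contains '=', one segment is empty,
# and one segment has no '=' at all.
example = parse_cookie_string('sid=abc123; token=x=y; ; broken')
```

Splitting on the first '=' only matters because cookie values may themselves contain '=' characters, and stripping each segment removes the space that browsers put after every ';'.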

settings.py (uncomment these settings and set them to the values below):

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = True

middlewares.py

import random

class SpderspySpiderMiddleware(object):
    # pool of User-Agent strings to choose from
    useragent = [
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36',
        'Mozilla/5.0 (Linux; U; Android 1.5; de-de; Galaxy Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1',
        'Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
        'Opera/9.80 (Android 2.3.3; Linux; Opera Mobi/ADR-1111101157; U; es-ES) Presto/2.9.201 Version/11.50',
        'Opera/9.80 (Android 3.2.1; Linux; Opera Tablet/ADR-1109081720; U; ja) Presto/2.8.149 Version/11.10',
        'BlackBerry7100i/4.1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/103',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; FujitsuToshibaMobileCommun; IS12T; KDDI)',
        'Opera/9.80 (J2ME/MIDP; Opera Mini/5.1.22296; BlackBerry9800; U; AppleWebKit/23.370; U; en) Presto/2.5.25 Version/10.54',
    ]

    def process_request(self, request, spider):
        user_agent_random = random.choice(self.useragent)
        # Assign directly instead of using setdefault, so a User-Agent that is already
        # set on the request gets replaced; this makes the User-Agent change randomly.
        request.headers['User-Agent'] = user_agent_random
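
Because this class implements process_request(request, spider), Scrapy only calls it if it is registered as a downloader middleware. The settings.py fragment below is a sketch of that registration: the module path assumes the project package is named SpderSpy (as the import in ZhiHu.py suggests), and the priority 543 is just a commonly used value, not something taken from the original project.

```python
# settings.py (sketch): register the random User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    # assumed module path, based on "from SpderSpy.items import SpderspyItem"
    'SpderSpy.middlewares.SpderspySpiderMiddleware': 543,
    # optional: turn off Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```

Setting a built-in middleware to None is the documented way to disable it; doing so here is a choice rather than a requirement.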

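The crawl produces the file clould.txt. Next we process its contents to build the word cloud we want.
Building a word cloud from English text is relatively simple, because the words are separated by spaces. Chinese is different: the words run together and have to be split apart explicitly. For this we use the jieba module (pip install jieba), which segments sentences into words automatically. We also need a Chinese font file, simsun.ttf. The code refers to it by a relative path, so it must sit in the same directory as clould.txt, otherwise an error is raised.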
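
To make the difference concrete, here is a small sketch showing why whitespace splitting is not enough for Chinese. The sample sentences and the helper name `to_space_separated` are illustrative assumptions; the helper imports jieba lazily so the first part runs even where jieba is not installed.

```python
english = "we need word segmentation"
chinese = u"我们需要分词"

# Splitting on whitespace works for English ...
english_tokens = english.split()
# ... but a Chinese sentence has no spaces, so it comes back as a single token.
chinese_tokens = chinese.split()


def to_space_separated(text):
    """Segment Chinese text with jieba and join the words with spaces."""
    import jieba  # imported here so the lines above run without jieba installed
    return u" ".join(jieba.cut(text))
```

Calling `to_space_separated(chinese)` would return the same sentence with spaces between the words jieba recognizes, which is the form the word cloud step expects.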
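Next, set up the tool used to generate the word cloud: Jupyter Notebook. It can be installed with pip install jupyter notebook. Once installed, running jupyter notebook in a terminal opens it directly. The code below also needs the wordcloud and matplotlib packages (pip install wordcloud matplotlib).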

Create a new notebook from the top-right corner and enter:

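%matplotlib inline
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

filename = 'clould.txt'
with open(filename) as f:
    mytext = f.read()
# jieba splits the text into words; join them with spaces so WordCloud can tokenize them
mytext = " ".join(jieba.cut(mytext))
# font_path must point to a font with Chinese glyphs, otherwise the words render as boxes
wordcloud = WordCloud(font_path="simsun.ttf").generate(mytext)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")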
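
To check which keywords will dominate the picture, the segmented tokens can also be counted directly. This is a rough sketch rather than part of the original notebook: `keyword_counts` and `min_length` are hypothetical names, and the token list is made up to stand in for the output of jieba.cut(mytext).

```python
from collections import Counter


def keyword_counts(tokens, min_length=2):
    """Count tokens, ignoring whitespace and tokens shorter than min_length."""
    cleaned = (t.strip() for t in tokens)
    return Counter(t for t in cleaned if len(t) >= min_length)


# Made-up tokens; in the notebook they would come from jieba.cut(mytext).
sample_tokens = [u"知乎", u"的", u"回答", u"词云", u"回答", u" ", u"知乎", u"回答"]
counts = keyword_counts(sample_tokens)
top_two = counts.most_common(2)
```

Dropping one-character tokens filters out most particles such as "的". Recent versions of wordcloud should accept such a mapping through WordCloud(font_path="simsun.ttf").generate_from_frequencies(counts), though that is worth checking against the installed version.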