爬虫练习-爬取简书网热评文章

最新推荐文章于 2021-04-05 11:11:28 发布

莫莫先生

最新推荐文章于 2021-04-05 11:11:28 发布

阅读量1.1k

点赞数 3

分类专栏： # Python爬虫学习文章标签： python mongodb 爬虫多进程

本文链接：https://blog.csdn.net/weixin_44835732/article/details/104016373

版权

Python爬虫学习专栏收录该内容

27 篇文章 26 订阅

订阅专栏

前言：

使用多进程爬虫方法爬取简书网热评文章，并将爬取的数据存储于MongoDB数据库中

本文为整理代码，梳理思路，验证代码有效性——2020.1.17

环境：
Python3（Anaconda3）
PyCharm
Chrome浏览器

主要模块： 后跟括号内的为在cmd窗口安装的指令
requests（pip install requests）
lxml（pip install lxml）
re
pymongo（pip install pymongo ）
multiprocessing
proxiesIp（自定义的IP代理池库，可以不要）

1

爬取简述网热门文章信息（用户ID、标题、文章内容、评论数、点赞数、打赏数）
在这里插入图片描述

2

2.1.通过观察可以发现该网页没有具体的分页，一直滑动页面可以一直浏览，由此可以判断该页面为异步加载（即不刷新整个网页，只进行局部的更新加载）。

2.2.打开开发者工具 F12 按Network选项，并滚动页面，我们发现请求返回了一些数据，而这些数据里正有我们需要访问的真实URL，由此我可以构造相应的url列表解析式。
在这里插入图片描述

3

我们需要以下文章信息：用户ID、标题、文章内容、评论数、点赞数、打赏数，使用XPath我们对该网页进行解析。

在这里插入图片描述

# 用户ID
author = info.xpath('div/div/a[1]/text()')[0]
# 文章标题
title = info.xpath('div/a/text()')[0]
# 文章内容
content = info.xpath('div/p/text()')[0].strip()
# 评论数
comment = info.xpath('div/div/a[2]/text()')[1].strip()
# 点赞数
like = info.xpath('div/div/span[2]/text()')[0].strip()
# 打赏数
rewards = info.xpath('div/div/span[3]/text()')
if len(rewards) == 0:
    reward = '无'
else:
    reward = rewards[0].strip()

因为并非所有文章都有打赏，所以，我们对其进行判断

4

使用多进程，首先导入使用多进程所需的库

from multiprocessing import Pool

其次，创建进程池，这里的processes是指同步发生的进程数量

测试发现，在爬取简书网热评文章时，processes越大，返回的HTTP状态为429（请求次数过多）的情况越多，这里为2时最佳。

pool = Pool(processes=2)

最后，调用进程爬虫

调用进程爬虫，这里使用的是一个map函数，第一个参数是一个函数，第二个参数是一个列表，这里的意思是将列表中的每个元素作用到前面这个函数中。

pool.map(get_jianshu_info, urls)

5

将数据插入到MongoDB数据库中
5.1.首先，导入必要的pymongo库

import pymongo

5.2.其次，连接并创建数据库即数据集合

# 连接数据库
client = pymongo.MongoClient('localhost', 27017)

# 创建数据库和数据集合
mydb = client['mydb']
jianshu_shouye = mydb['jianshu_shouye']

5.3.最后插入数据到数据库’jianshu_shouye‘中

data = { 'author':author,
         'title':title,
         'content':content,
         'comment':comment,
         'like':like,
         'reward':reward
         }
# 插入数据库
jianshu_shouye.insert_one(data)

完整代码

说明：该代码仅为了学习一下爬虫技术，在url构造时只爬取两页，可自行修改。另，代理IP池（from proxiesIP_1_1 import proxiesIp）可以不要，测试时为了防止被封IP加入的。还有有什么问题可以在评论里说明一下。

# 导入库
import requests
from lxml import etree
import pymongo
from multiprocessing import Pool

from proxiesIP_1_1 import proxiesIp

# 连接数据库
client = pymongo.MongoClient('localhost', 27017)

# 创建数据库和数据集合
mydb = client['mydb']
jianshu_shouye = mydb['jianshu_shouye']

headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}


# 定义获取信息的函数
def get_jianshu_info(url):
    ip = proxiesIp.selectIp()
    html = requests.get(url=url, headers=headers, proxies=ip)
    print(url, html.status_code)

    selector = etree.HTML(html.text)
    # 获取大标签，以此循环
    infos = selector.xpath('//ul[@class="note-list"]/li')
    for info in infos:
        try:
            # 用户ID
            author = info.xpath('div/div/a[1]/text()')[0]
            # 文章标题
            title = info.xpath('div/a/text()')[0]
            # 文章内容
            content = info.xpath('div/p/text()')[0].strip()
            # 评论数
            comment = info.xpath('div/div/a[2]/text()')[1].strip()
            # 点赞数
            like = info.xpath('div/div/span[2]/text()')[0].strip()
            # 打赏数
            rewards = info.xpath('div/div/span[3]/text()')
            if len(rewards) == 0:
                reward = '无'
            else:
                reward = rewards[0].strip()

            data = { 'author':author,
                     'title':title,
                     'content':content,
                     'comment':comment,
                     'like':like,
                     'reward':reward
                     }
            print(data)
            # 插入数据库
            jianshu_shouye.insert_one(data)

        except IndexError:
            # pass掉IndexError错误
            pass


# 程序主入口
if __name__ == '__main__':
    urls = ['https://www.jianshu.com/c/bDHhpK?order_by=top&page={}'.format(str(i)) for i in range(1, 3)]
    # 创建进程池
    pool = Pool(processes=2)
    # 调用进程爬虫
    pool.map(get_jianshu_info, urls)