Crawler Diary (4): A Targeted Crawler Combined with MongoDB and Redis

Latest recommended article published 2022-06-09 13:52:36

WhareSong

Category: Targeted Crawlers. Tags: redis, xpath, mongodb, database, data mining

Article link: https://blog.csdn.net/qq_44037783/article/details/108082351

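In my view, the most important part of a targeted crawler is writing the XPath expressions; everything else is fairly easy to understand.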
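To illustrate why the XPath expressions deserve the most care, here is a minimal sketch that runs two styles of expression against a small inline page. The markup, the class names `book` and `chapters`, and the `href` values are made up for illustration and are not taken from the real site. The absolute path counts every step from the document root, so it stops matching as soon as the page gains or loses a wrapper element; the relative path anchored on an attribute is usually less fragile.

```python
import lxml.etree

# Hypothetical markup for illustration only; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <div class="header">site header</div>
  <div class="book">
    <ul class="chapters">
      <li><a href="/5525/1.html">Chapter 1</a></li>
      <li><a href="/5525/2.html">Chapter 2</a></li>
    </ul>
  </div>
</body></html>
"""

# Parse the HTML text into an element tree that supports XPath queries.
tree = lxml.etree.HTML(SAMPLE_HTML)

# Absolute path: every step is counted from the document root.
absolute = tree.xpath('/html/body/div[2]/ul/li/a/@href')

# Relative path anchored on a class attribute: it keeps matching
# even if extra wrapper elements are added above the list.
relative = tree.xpath('//ul[@class="chapters"]/li/a/@href')

# text() selects the text nodes of the matched elements.
titles = tree.xpath('//ul[@class="chapters"]/li/a/text()')

print(absolute, relative, titles)
```

Both expressions select the same two links on this sample page; they only diverge once the layout changes.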
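I won't go over the basic syntax of MongoDB and Redis here; there are plenty of blog posts on the subject that you can consult.
Let's go straight to the code; the comments are fairly clear and should be easy to follow.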

import requests
import lxml.etree
import redis
import pymongo


# Connect to Redis (defaults to localhost:6379)
client = redis.StrictRedis()
# Connect to MongoDB (defaults to localhost:27017)
db = pymongo.MongoClient()
database = db.chapter_5
handle = database.book


def get_url():
    # Request the book's index page to collect the URL of every chapter
    url = 'http://dongyeguiwu.zuopinj.com/5525/'
    html = requests.get(url).content.decode('utf-8')
    # lxml.etree.HTML parses the page into an element tree that supports XPath queries
    source = lxml.etree.HTML(html)
    url_list = source.xpath('/html/body/div[2]/div[2]/div/ul/li/a/@href')
    print(url_list)
    # Push each chapter URL onto the Redis list
    for chapter_url in url_list:
        client.lpush('url_queue', chapter_url)


def req():
    content_list = []
    # Pop URLs from Redis one at a time and fetch each chapter page with requests
    while client.llen('url_queue') > 0:
        url = client.lpop('url_queue').decode()
        source = requests.get(url).content
        selector = lxml.etree.HTML(source)
        chapter_name = selector.xpath('/html/body/div[2]/div/div[2]/div[1]/h1/text()')[0]
        content = selector.xpath('/html/body/div[2]/div/div[2]/div[3]/p/text()')
        content_list.append({'title': chapter_name, 'content': '\n'.join(content)})
    # Write the scraped chapters to MongoDB. insert_many replaces Collection.insert,
    # which was removed in PyMongo 4; it rejects an empty list, hence the guard.
    if content_list:
        handle.insert_many(content_list)

if __name__ == '__main__':
    get_url()
    req()
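
One detail of the Redis list is worth a closer look. LPUSH inserts at the head of the list and LPOP removes from the head, so pairing them pops the URLs in the reverse of the order they were pushed; pairing RPUSH with LPOP gives first-in, first-out order instead. The sketch below shows both orders and a loop that stops when `lpop` returns `None`, which avoids a separate `llen` check. `FakeRedisList` and `drain_queue` are hypothetical names introduced here; the fake class is an in-memory stand-in so the example runs without a Redis server, and it only mimics the three calls used.

```python
from collections import deque


class FakeRedisList:
    """Hypothetical in-memory stand-in for the few Redis list calls used here."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        # Like LPUSH: insert at the head of the list.
        self._lists.setdefault(key, deque()).appendleft(value.encode())

    def rpush(self, key, value):
        # Like RPUSH: append at the tail of the list.
        self._lists.setdefault(key, deque()).append(value.encode())

    def lpop(self, key):
        # Like LPOP: remove from the head, or return None if nothing is left.
        items = self._lists.get(key)
        return items.popleft() if items else None


def drain_queue(client, key):
    """Yield items from a list until lpop reports that it is empty."""
    while True:
        raw = client.lpop(key)
        if raw is None:
            break
        # redis-py returns bytes unless decode_responses=True was set.
        yield raw.decode() if isinstance(raw, bytes) else raw


fake = FakeRedisList()
for name in ['ch1', 'ch2', 'ch3']:
    fake.lpush('stack', name)  # head insert + head pop: last in, first out
    fake.rpush('fifo', name)   # tail insert + head pop: first in, first out

lifo_order = list(drain_queue(fake, 'stack'))
fifo_order = list(drain_queue(fake, 'fifo'))
print(lifo_order, fifo_order)
```

With a real connection, the same loop could be written as `for url in drain_queue(client, 'url_queue'): ...`, assuming `client` is the `redis.StrictRedis()` instance from the script above.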