2.1 Case 3: Scraping a Reading Website

YiHong_Li

Category columns: I. Crawler Basics Framework urllib; Python Crawlers from 0 to Mastery

Article link: https://blog.csdn.net/YiHong_Li/article/details/86359600

Note before we start: part of the request URL in this chapter is masked with **.

Key points to learn in this chapter:

1. Using a headless browser:

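from selenium import webdriver

# Path to the chromedriver binary (part of the path is masked with **)
driver = r"/home/**/Downloads/chromedriver"
opt = webdriver.ChromeOptions()
# Run Chrome without a visible window
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')
# executable_path is the Selenium 3 form; newer Selenium 4 releases expect service=Service(driver)
browser = webdriver.Chrome(executable_path=driver, options=opt)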

2. How to store data in a database
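
To make the database point concrete, here is a minimal sketch of storing scraped rows with a parameterized insert. It is an illustration under stated assumptions, not the code used in this case: it swaps in the standard-library sqlite3 module (in-memory) so that it runs without a MySQL server, and the `book` table, the `save_books` helper, and the sample rows are all hypothetical. Note that sqlite3 uses `?` placeholders, while pymysql uses `%s`.

```python
import sqlite3

# Hypothetical sample rows; the field names mirror some of the fields scraped in this case.
books = [
    {"name": "Book A", "author": "Author A", "price": "9.99"},
    {"name": 'Book "B"', "author": "Author B", "price": "19.99"},
]


def save_books(connection, rows):
    """Insert rows with a parameterized statement and return the table's row count."""
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS book (name TEXT, author TEXT, price TEXT)"
    )
    # Placeholders let the driver handle quoting, so a title that
    # contains quotes cannot break the SQL statement.
    cursor.executemany(
        "INSERT INTO book (name, author, price) VALUES (?, ?, ?)",
        [(row["name"], row["author"], row["price"]) for row in rows],
    )
    connection.commit()
    cursor.execute("SELECT COUNT(*) FROM book")
    count = cursor.fetchone()[0]
    cursor.close()
    return count


# In-memory database: nothing is written to disk.
connection = sqlite3.connect(":memory:")
saved = save_books(connection, books)
print(saved)
```

The second sample title contains double quotes on purpose: building the SQL string by hand with `"%s"` would produce a broken statement for that row, while the placeholder form stores it unchanged.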

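The crawler is split into three modules:

1. Request module: builds the request and passes the fetched pages (data) to the parsing module;

2. Parsing module: extracts the data (this chapter uses xpath to extract data from the pages) and passes it to the storage module;

3. Storage module: stores the data in a database.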
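
The three modules can be chained as generators, so each page is parsed and stored as soon as it has been fetched. The sketch below only illustrates that data flow: `request_pages`, `parse_pages`, and `store_items` are hypothetical stand-ins that use fake HTML strings and a plain list instead of a browser and a database.

```python
def request_pages(pages):
    """Request module stand-in: yield one fake HTML string per page number."""
    for page in pages:
        yield "<li>book-%d</li>" % page


def parse_pages(html_pages):
    """Parsing module stand-in: turn each page into a dict of extracted fields."""
    for html in html_pages:
        name = html.replace("<li>", "").replace("</li>", "")
        yield {"name": name}


def store_items(items, storage):
    """Storage module stand-in: append items to a list instead of a database."""
    for item in items:
        storage.append(item)
    return storage


# Nothing is requested or parsed until store_items starts consuming the chain.
stored = store_items(parse_pages(request_pages(range(1, 4))), [])
print(stored)
```

Because every stage is lazy, only one page needs to be held in memory at a time, which is presumably why the real request and parsing functions in this case use `yield` as well.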

Case overview:

We scrape the pages at https://read.doub**.com/category/?page=%d&kind=105 for each book's name, author, price, category, and other details.
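
The `%d` in the URL is a placeholder for the page number. As a small sketch of how it expands, the hypothetical helper `build_page_urls` below fills in an inclusive page range; the masked `**` part of the host is left exactly as it appears in the article, so these URLs are not meant to be requested.

```python
URL_TEMPLATE = "https://read.doub**.com/category/?page=%d&kind=105"


def build_page_urls(template, start, end):
    """Return the URLs for pages start..end (inclusive)."""
    if start > end:
        raise ValueError("start must not be greater than end")
    # The % operator substitutes the page number for %d.
    return [template % page for page in range(start, end + 1)]


urls = build_page_urls(URL_TEMPLATE, 1, 3)
for url in urls:
    print(url)
```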

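A brief walk-through of the crawler's approach:

1. Visit https://read.doub**.com/category/?page=%d&kind=105 and check whether the site loads its content dynamically (this site does).

Scraping dynamically loaded pages requires a tool such as selenium;

2. Observe how the page URL changes. The pattern turns out to be:

https://read.doub**.com/category/?page=1&kind=105

https://read.doub**.com/category/?page=2&kind=105

https://read.doub**.com/category/?page=3&kind=105

3. Parse the page content with xpath.

4. Write the crawler code. The full code is as follows: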

from time import sleep

import pymysql
from lxml import etree
from selenium import webdriver


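# Request module
def request_handle(url, start, end):
    '''
    Request the pages
    :param url: URL template of the pages to request
    :param start: first page number
    :param end: last page number
    :return: generator yielding the HTML source of each page
    '''
    driver = r"/home/**/Downloads/chromedriver"
    # Run the browser in headless mode
    opt = webdriver.ChromeOptions()
    opt.add_argument('--headless')
    opt.add_argument('--disable-gpu')
    # executable_path is the Selenium 3 form; newer Selenium 4 releases expect service=Service(driver)
    browser = webdriver.Chrome(executable_path=driver, options=opt)
    try:
        for page in range(start, end + 1):
            new_url = url % page
            browser.get(new_url)
            # Give the dynamic content time to load
            sleep(2)
            yield browser.page_source
    finally:
        # Close the browser even if a request fails
        browser.quit()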


# Parsing module
def analysis_html(html_list):
    '''
    Parse the data
    :param html_list: iterable of page sources
    :return: generator yielding one dict per book
    '''
    for html in html_list:
        html_tree = etree.HTML(html)
        book_list = html_tree.xpath('//div[@id="react-root"]/div/section[2]/div[1]/ul/li')
        for book in book_list:
            item = {}
            item['name'] = book.xpath(".//div/h4//text()")[0]
            item['author'] = ' '.join(book.xpath(".//div/div[@class='author']//text()"))
            item['comment'] = book.xpath(".//div/div[@class='extra-info']//div[@class='rating']/span/text()")[0]
            item['number'] = book.xpath(".//div/div[@class='extra-info']//span[2]/text()")[0]
            item['type'] = book.xpath(".//div/div[@class='extra-info']//a[@class='kind-link']/text()")[0]
            try:
                item['price'] = book.xpath(".//div/span[@class='sale']//span[2]/text()")[0]
            except IndexError:
                # No second span: fall back to the text of the first span
                item['price'] = book.xpath(".//div/span[@class='sale']//span[1]/text()")[1]
            yield item


# Storage module
def save_mysql(data):
    # Connect to the database
    connect = pymysql.connect(host='localhost', port=3306, password='muzili', user='root', database='douban',
                              charset='utf8')
    # Create a cursor
    cursor = connect.cursor()
    # Parameterized insert statement: the driver escapes the values, so quotes in a title cannot break the SQL
    sql = 'INSERT INTO computer(name,author,comment,number,type,price) VALUES (%s,%s,%s,%s,%s,%s)'
    try:
        for item in data:
            # Execute the statement
            cursor.execute(sql, (item['name'], item['author'], item['comment'], item['number'],
                                 item['type'], item['price']))
            # Commit to the database
            connect.commit()
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        connect.close()


def main():
    url = 'https://read.doub**.com/category/?page=%d&kind=105'
    start = int(input('Enter the first page: '))
    end = int(input('Enter the last page: '))
    # Request
    html_list = request_handle(url, start, end)
    # Parse
    data = analysis_html(html_list)
    # Store
    save_mysql(data)


if __name__ == '__main__':
    main()
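
The price lookup in the parsing module falls back to a different element when the first xpath finds nothing. Below is a minimal sketch of the same fallback idea written without exceptions. It rests on several assumptions: the markup in `SAMPLE` is hypothetical and much simpler than the real page, the helpers `text_at` and `parse_books` are made up for illustration, and the standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) is swapped in for lxml so that the example runs without third-party packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is nested more deeply.
SAMPLE = """
<ul>
  <li>
    <h4>Book A</h4>
    <div class="author">Author A</div>
    <span class="sale"><span>20.00</span><span>9.99</span></span>
  </li>
  <li>
    <h4>Book B</h4>
    <div class="author">Author B</div>
    <span class="sale"><span>19.99</span></span>
  </li>
</ul>
"""


def text_at(nodes, index, default=""):
    """Return the stripped text of nodes[index], or default if it is missing."""
    if index < len(nodes) and nodes[index].text:
        return nodes[index].text.strip()
    return default


def parse_books(xml_text):
    """Yield one dict per <li>, tolerating a missing second price span."""
    root = ET.fromstring(xml_text)
    for book in root.findall("./li"):
        prices = book.findall("./span[@class='sale']/span")
        yield {
            "name": text_at(book.findall("./h4"), 0),
            "author": text_at(book.findall("./div[@class='author']"), 0),
            # Prefer the second span; fall back to the first when there is only one.
            "price": text_at(prices, 1) or text_at(prices, 0),
        }


books = list(parse_books(SAMPLE))
for book in books:
    print(book)
```

Checking the length of the result list before indexing keeps a single missing element from aborting the whole run, which is a risk wherever the crawler indexes an xpath result with `[0]`.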

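Reading not only cultivates the mind, it also makes you happy n(*≧▽≦*)n!!!

Friendly reminder: that's it for this case. Remember to type the code out yourself n(*≧▽≦*)n! If you have forgotten xpath, review section 1.7 on web page parsing tools, and brush up on selenium too, hehe~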
