Crawling Douban New Book Express Data and Simple Data Processing

Article link: https://blog.csdn.net/OnedayIlove/article/details/106899141

Overview


Data crawling

Crawl the data that the Douban platform exposes and store it in a local JSON file.

Data	URL
Douban New Book Express HTML	https://book.douban.com/latest?icn=index-latestbook-all
Douban single-book lookup RESTful API	https://api.douban.com/v2/book/:id?apikey=0df993c66c0c636e29ecbb5344252a4a

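Use urllib3 to fetch the HTML response of the Douban New Book Express page, then use the bs4 library to parse the ids of all new books out of the HTML DOM and save them in a list object (the script keeps one list per category, fiction and non-fiction).
Iterate over the list: for each book id, call the Douban single-book lookup RESTful API to get the details for that id (a JSON string, converted to a dict object) and append the result to a list.
Once all book data has been crawled, use the json library to persist the list of book data to the file system.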
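
The id-extraction step can be illustrated offline. The following is a minimal sketch, assuming every book link has the form https://book.douban.com/subject/<id>/ with a numeric id; the helper names `extract_book_id` and `collect_ids` and the sample links are hypothetical, and the real page markup may differ. Unlike a plain `split("/")`, it ignores a missing trailing slash and skips links that do not point at a book.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse

# Assumed link shape: https://book.douban.com/subject/<numeric id>/
_SUBJECT_PATH = re.compile(r"^/subject/(\d+)/?$")


def extract_book_id(href: str) -> Optional[str]:
    """Return the numeric id from a subject link, or None if the link does not match."""
    match = _SUBJECT_PATH.match(urlparse(href).path)
    return match.group(1) if match else None


def collect_ids(hrefs: List[str]) -> List[str]:
    """Extract ids in page order, skipping non-book links and duplicates."""
    seen = set()
    ids = []
    for href in hrefs:
        book_id = extract_book_id(href)
        if book_id is not None and book_id not in seen:
            seen.add(book_id)
            ids.append(book_id)
    return ids


if __name__ == "__main__":
    # Hypothetical links standing in for the href values parsed from the page.
    sample = [
        "https://book.douban.com/subject/1000001/",
        "https://book.douban.com/subject/1000002",
        "https://book.douban.com/latest",
    ]
    print(collect_ids(sample))
```

Returning None for links that do not match lets the caller decide whether to skip or log them, instead of silently storing a wrong path segment as an id.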

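Data preprocessing

Apply simple preprocessing to the crawled data: data cleaning, data transformation, data reduction, and data export.

The preprocessing runs in Jupyter Notebook and mainly uses Python and Pandas. It covers handling null values and data formats, followed by data reduction, sorting, and grouping; the processed data is then persisted locally for later use.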
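
The notebook itself is not shown here, so the following is only a sketch of what such a pipeline might look like. The records are made up, and the field names (`title`, `publisher`, `price`) and the price format (a number followed by a currency suffix) are assumptions about the API response, not confirmed by the text above.

```python
import pandas as pd

# Hypothetical records; the field names are assumptions about the API response.
records = [
    {"id": "1", "title": "Book A", "publisher": "Press X", "price": "45.00元", "pubdate": "2020-6"},
    {"id": "2", "title": "Book B", "publisher": "Press X", "price": "", "pubdate": "2020-5"},
    {"id": "3", "title": "Book C", "publisher": "Press Y", "price": "39.80元", "pubdate": "2020-6"},
    {"id": "4", "title": None, "publisher": "Press Y", "price": "28元", "pubdate": "2020-4"},
]

df = pd.DataFrame(records)

# Cleaning: drop rows that have no title.
df = df.dropna(subset=["title"])

# Transformation: keep the numeric part of the price string; an empty price becomes NaN.
df["price"] = pd.to_numeric(
    df["price"].str.extract(r"(\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)

# Reduction: keep only the columns needed later.
df = df[["id", "title", "publisher", "price"]]

# Sorting and grouping: most expensive first, then the mean price per publisher.
df = df.sort_values("price", ascending=False, na_position="last")
mean_price = df.groupby("publisher")["price"].mean()

# Export: serialize to a JSON string without escaping non-ASCII characters.
exported = df.to_json(orient="records", force_ascii=False)
```

Passing a file path as the first argument of `to_json` writes the result to disk instead of returning a string.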

Source code

The source code includes type annotations and comments.

Data crawling

"""
Collect book data from Douban's New Book Express listing.

- New Book Express HTML: https://book.douban.com/latest?icn=index-latestbook-all
- Single-book lookup RESTful API: https://api.douban.com/v2/book/:id?apikey=0df993c66c0c636e29ecbb5344252a4a

"""

import datetime
import json
from typing import Dict, List

import urllib3
from bs4 import BeautifulSoup, PageElement, ResultSet
from urllib3.response import HTTPResponse

__author__ = "onemsg ([email protected])"

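DEFAULT_HEADERS = {
    # Adjacent string literals are concatenated, so the header value carries no stray indentation.
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/83.0.4103.97 Safari/537.36 Edg/83.0.478.45"
    )
}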

def get_new_books() -> Dict[str, List[str]]:
    """
    Scrape the book ids from https://book.douban.com/latest?icn=index-latestbook-all

    return: a dict mapping each category ("fiction", "non_fiction") to a list of book ids
    """
    url = "https://book.douban.com/latest?icn=index-latestbook-all"

    r: HTTPResponse = urllib3.PoolManager().request('GET', url, headers=DEFAULT_HEADERS)
    bs = BeautifulSoup(r.data, features="lxml")
    fiction: ResultSet[PageElement] = bs.select("#content > div > div.article > ul > li")
    non_fiction: ResultSet[PageElement] = bs.select("#content > div > div.aside > ul > li")

    books = {
        "fiction": [],
        "non_fiction": []
    }

    # Each href is expected to end with "<id>/", so the id is the second-to-last segment.
    for li in fiction:
        li: PageElement
        book_id: str = li.find_next("a")["href"].split("/")[-2]
        books['fiction'].append(book_id)

    for li in non_fiction:
        li: PageElement
        book_id: str = li.find_next("a")["href"].split("/")[-2]
        books['non_fiction'].append(book_id)

    return books

def get_book_info(book_ids: List[str]) -> List[Dict]:
    """
    Fetch the details of every book in a list of book ids

    Calls the API: https://api.douban.com/v2/book/:id?apikey=0df993c66c0c636e29ecbb5344252a4a

    return: a list[dict] holding the details of every book
    """

    http = urllib3.PoolManager()

    url_template = "https://api.douban.com/v2/book/{}?apikey=0df993c66c0c636e29ecbb5344252a4a"

    books = []
    for book_id in book_ids:
        r: HTTPResponse = http.request("GET", url_template.format(book_id))
        # The API returns a JSON string; convert it to a dict.
        book = json.loads(r.data.decode("utf-8"))
        books.append(book)
        print("id {} book collected -".format(book_id))
    return books

if __name__ == "__main__":
    
    print(" === Douban New Book Express: data collection started ===")

    book_ids = get_new_books()
    books = []
    for ids in book_ids.values():
        _books = get_book_info(ids)
        books.extend(_books)
    
    now = datetime.datetime.now()
    date = now.strftime('%Y-%m-%d')

    # The data/ directory must already exist; open() does not create it.
    outpath = "data/books-{}.json".format(date)

    with open(outpath, "w", encoding="utf-8") as f:
        json.dump(books, f