Scraping Dangdang book data: paginated crawling across multiple categories

Article link: https://blog.csdn.net/2401_86953479/article/details/142995395

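Request setup and target URLs: The script uses requests to send GET requests with browser-like request headers and crawls the bestseller pages of several book categories (fiction, literature, children's books, and so on).
Data crawling and parsing: BeautifulSoup parses each page and extracts every book's link, cover image, title, comments, recommendation text, author, publication date, publisher, price, and related fields. For each category the script draws a random upper bound between 20 and 26; because range(1, top) excludes that bound, it crawls 19 to 25 pages per category.
Book detail page: The script then visits each book's detail page to fetch further information such as the book description (the code reads the .head_title_name element).
Saving data and handling errors: Each record is formatted as a single line with fields separated by # and appended to book.txt. If parsing a record fails, the exception is caught, that record is skipped, and processing continues with the next one.
The code is as follows: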
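As a small illustration of the pagination step described above, here is a minimal sketch of how the page URLs for one category are generated. The helper name `page_urls` and the variable `urls_for_category` are hypothetical and do not appear in the original script; the URL template is the fiction entry from the script's own list. The sketch only builds URLs and sends no requests.

```python
import random


# Hypothetical helper for illustration; the original script inlines this logic.
def page_urls(template, top):
    """Return the URLs for pages 1 .. top - 1, mirroring range(1, top)."""
    return [template.format(page) for page in range(1, top)]


template = "http://bang.dangdang.com/books/bestsellers/01.03.00.00.00.00-recent30-0-0-1-{}"

# randint is inclusive on both ends, so top is one of 20, 21, ..., 26.
top = random.randint(20, 26)
urls_for_category = page_urls(template, top)

# range(1, top) excludes top, so the count is between 19 and 25.
print(len(urls_for_category))
print(urls_for_category[0])
```

If exactly `top` pages are wanted instead, `range(1, top + 1)` would include the last page; whether the site serves that many pages per category is not something the original post states.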

import random

import requests
from bs4 import BeautifulSoup

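# Request headers that make the request look like it comes from a browser.
# The values are left blank in the post; fill in the ones from your own browser session.
headers = {
    "Cookie": "",
    "Referer": "",
    "User-Agent": "",
}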

# Bestseller URL templates ({} is the page number), each paired with its category label
urls = [
    ("http://bang.dangdang.com/books/bestsellers/01.41.00.00.00.00-recent30-0-0-1-{}", "童书"),  # children's books
    ("http://bang.dangdang.com/books/bestsellers/01.03.00.00.00.00-recent30-0-0-1-{}", "小说"),  # fiction
    ("http://bang.dangdang.com/books/bestsellers/01.05.00.00.00.00-recent30-0-0-1-{}", "文学"),  # literature
    ("http://bang.dangdang.com/books/bestsellers/01.45.00.00.00.00-recent30-0-0-1-{}", "外语"),  # foreign languages
    ("http://bang.dangdang.com/books/bestsellers/01.21.00.00.00.00-recent30-0-0-1-{}", "励志"),  # self-improvement
    ("http://bang.dangdang.com/books/bestsellers/01.36.00.00.00.00-recent30-0-0-1-{}", "历史"),  # history
    ("http://bang.dangdang.com/books/bestsellers/01.28.00.00.00.00-recent30-0-0-1-{}", "宗教"),  # religion
    ("http://bang.dangdang.com/books/bestsellers/01.15.00.00.00.00-recent30-0-0-1-{}", "亲子"),  # parenting
    ("http://bang.dangdang.com/books/bestsellers/01.31.00.00.00.00-recent30-0-0-1-{}", "心理"),  # psychology
    ("http://bang.dangdang.com/books/bestsellers/01.07.00.00.00.00-recent30-0-0-1-{}", "艺术"),  # art
    ("http://bang.dangdang.com/books/bestsellers/01.54.00.00.00.00-recent30-0-0-1-{}", "计算机")  # computers
]

for url, category in urls:
    # randint(20, 26) is inclusive, but range(1, top) stops at top - 1,
    # so each category is crawled for 19 to 25 pages.
    top = random.randint(20, 26)
    for page in range(1, top):
        new_url = url.format(page)
        print(new_url)
        # A timeout keeps a stalled connection from hanging the whole run.
        res = requests.get(new_url, headers=headers, timeout=10)
        soup = BeautifulSoup(res.text, "html.parser")
        # Each <li> under .bang_list is one book on the bestseller page.
        items = soup.select('.bang_list li')
        for item in items:  # renamed from `list` to avoid shadowing the built-in
            try:
                home_url = item.select_one('.pic a')['href']
                img_url = item.select_one('.pic img')['src']
                name = item.select_one('.name a').text
                comment = item.select_one('.star a').text
                recommend = item.select_one('.tuijian').text
                # First .publisher_info block holds the author,
                # the second holds the publication date and publisher.
                publisher_info = item.select('.publisher_info')
                author = publisher_info[0].select('a')[0]['title']
                publication_time = publisher_info[1].select_one('span').text
                publishing_house = publisher_info[1].select_one('a').text
                price = item.select_one('.price_n').text
                ori_price = item.select_one('.price_r').text
                discount = item.select_one('.price_s').text
                # Fetch the book's detail page for the extra field.
                child_res = requests.get(home_url, headers=headers, timeout=10)
                child_soup = BeautifulSoup(child_res.text, "html.parser")
                descr = child_soup.select_one('.head_title_name').text.strip()

                # One record per line, fields separated by '#'.
                line = '#'.join([category, home_url, img_url, name, comment, recommend,
                                 author, publication_time, publishing_house,
                                 price, ori_price, discount, descr])

                # Open with a context manager so the file is closed after each write.
                with open('book.txt', 'a', encoding='utf-8') as f:
                    f.write(line + '\n')
            except Exception as e:
                # A missing element or a failed detail request drops only this record.
                print(f"Failed to parse one record, skipping it: {e}")
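
The script stores each record as one line with fields joined by `#`, which becomes ambiguous when a field itself contains `#` (a title mentioning C#, for example). As a hedged alternative, and not the method used in this post, the sketch below writes the same fields with Python's standard `csv` module, which quotes such values automatically. The helper names `save_record` and `load_records`, the `FIELDS` list, the `book.csv` file name, and the sample record are all assumptions made for illustration.

```python
import csv
import os
import tempfile

# Column order mirrors the fields the script joins with '#'.
FIELDS = ["category", "home_url", "img_url", "name", "comment", "recommend",
          "author", "publication_time", "publishing_house",
          "price", "ori_price", "discount", "descr"]


def save_record(path, record):
    """Append one record (a dict keyed by FIELDS) as a single CSV row."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([record[field] for field in FIELDS])


def load_records(path):
    """Read every row back as a dict keyed by FIELDS."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(zip(FIELDS, row)) for row in csv.reader(f)]


# Made-up sample record, used only to show the round trip.
sample = {field: "" for field in FIELDS}
sample.update(category="计算机", name="Learning C#, 2nd edition", price="¥99.00")

# Write to a temporary directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "book.csv")
save_record(path, sample)
loaded = load_records(path)
print(loaded[0]["name"])
```

Reading the file back with `csv.reader` recovers the title intact, including the `#` and the comma, which a plain `line.split('#')` on the original format would not.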