Scraping Python book information from Dangdang (stored in MySQL)

Article link: https://blog.csdn.net/SGTXS/article/details/88072692

The data I scrape is stored in a MySQL database; if you do not have MySQL yet, install it first.

The Python libraries used in this article are bs4, scrapy, and pymysql.

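First, run the command shown in figure (1) in a cmd prompt. When the output shown in figure (2) appears, the project has been created; then open the project in PyCharm.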

If you do not have the Scrapy framework installed, I recommend installing it with Anaconda and then running command (1) in the Anaconda Prompt.

After opening the project in PyCharm, the directory structure looks like this (create mySpider.py inside spiders, and run.py under book):

Next, analyze Dangdang's book data: open Dangdang, type python into the search box, and you will see the following page:

The URL at this point is: http://search.dangdang.com/?key=python&act=input
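The search URL is a base address plus two query parameters. As a minimal sketch (the helper name `build_search_url` is my own, not part of the project), the same address can be assembled with the standard library's `urlencode`, which also takes care of escaping if the keyword contains spaces or other special characters:

```python
from urllib.parse import urlencode

SOURCE_URL = "http://search.dangdang.com/"

def build_search_url(keyword):
    """Build the search URL for a keyword (hypothetical helper for illustration)."""
    query = urlencode({"key": keyword, "act": "input"})
    return SOURCE_URL + "?" + query

url = build_search_url("python")
print(url)
```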

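Looking at the structure of the HTML, each book is an <li> item, and all of these items have exactly the same structure. The <li> elements are contained in a <ul>. We only care about each book's title, author, publication date (date), publisher, price, and summary (detail). The code that extracts each of these fields is explained below: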

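1. title = li.xpath("./a[position()=1]/@title").extract_first()

An <li> contains several <a> tags. The HTML shows that the book title is held in the title attribute of one of them, so position()=1 selects the first <a>, and the value of its title attribute is the book title.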

2. price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()

The price is the text of the <span> with class='search_now_price', which sits under the <p> with class='price' inside the <li>.

3. author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()

The author is held in the title attribute of the <a> tag in the first <span> under the <p> with class='search_book_author' inside the <li>; span[position()=1] restricts the match to that first <span>.

4. date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()

The publication date is the text of the second-to-last <span> under the <p> with class='search_book_author' inside the <li>.

5. publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()

The publisher is held in the title attribute of the <a> tag in the last <span> under the <p> with class='search_book_author' inside the <li>.

6. detail = li.xpath("./p[@class='detail']/text()").extract_first()

The text of the <p> with class='detail' inside the <li> is the book's summary.
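To make the six lookups concrete without touching the network, here is a rough sketch that runs equivalent path expressions against a hand-written fragment. Two assumptions to note: the sample markup is hypothetical (the real page may differ), and the standard library's ElementTree stands in for Scrapy's selector. ElementTree supports only a subset of XPath, so attributes and text are read with `.get()` and `.text` instead of `/@title` and `/text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for one search-result <li>.
SAMPLE_LI = """
<li class="line1" ddt-pit="1">
  <a title="Sample Python Book" href="#">cover</a>
  <p class="detail">A short summary.</p>
  <p class="price"><span class="search_now_price">59.00</span></p>
  <p class="search_book_author">
    <span><a title="Some Author">Some Author</a></span>
    <span> /2019-01-01</span>
    <span> /<a title="Some Press">Some Press</a></span>
  </p>
</li>
""".strip()

li = ET.fromstring(SAMPLE_LI)
AUTHOR_P = "./p[@class='search_book_author']"

def text_or_empty(value):
    """Strip a value, mapping None to an empty string."""
    return value.strip() if value else ""

def attr(path, name):
    """Attribute of the first node matching path, or None if nothing matches."""
    node = li.find(path)
    return node.get(name) if node is not None else None

def text(path):
    """Leading text of the first node matching path, or None if nothing matches."""
    node = li.find(path)
    return node.text if node is not None else None

book = {
    "title": text_or_empty(attr("./a[1]", "title")),
    "price": text_or_empty(text("./p[@class='price']/span[@class='search_now_price']")),
    "author": text_or_empty(attr(AUTHOR_P + "/span[1]/a", "title")),
    # The date text starts with "/", so drop the first character after stripping.
    "date": text_or_empty(text(AUTHOR_P + "/span[last()-1]"))[1:],
    "publisher": text_or_empty(attr(AUTHOR_P + "/span[last()]/a", "title")),
    "detail": text_or_empty(text("./p[@class='detail']")),
}
print(book)
```

The `text_or_empty` helper mirrors the `x.strip() if x else ""` guard: any field that is missing from the markup ends up as an empty string rather than raising an error.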

Now let's look at the actual code:

1. Write the item class in items.py:

import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()
    price = scrapy.Field()

2. Write the pipeline class in pipelines.py:

import pymysql

class BookPipeline(object):
    # Called once when the spider starts
    def open_spider(self, spider):
        print("opened")
        try:
            # Connect to the database; replace passwd with your own password
            self.con = pymysql.connect(host="127.0.0.1",port=3306, user="root",passwd="123456", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            try:
                self.cursor.execute("create database mydb")
            except:
                pass
            self.con.select_db("mydb")
            try:
                self.cursor.execute("drop table books")
            except:
                pass
            try:
                # Give the table the same charset as the connection and database
                sql = """
                        create table books(
                            bID varchar(8) primary key,
                            bTitle varchar(512),
                            bAuthor varchar(256),
                            bPublisher varchar(256),
                            bDate varchar(32),
                            bPrice varchar(16),
                            bDetail text
                        )ENGINE=InnoDB DEFAULT CHARSET=utf8
                    """
                self.cursor.execute(sql)
            except:
                self.cursor.execute("delete from books")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")
        # count is only set when the connection succeeded, so fall back to 0
        print("Scraped", getattr(self, "count", 0), "books in total")

    def process_item(self, item, spider):
        try:
            print(item["title"])
            print(item["author"])
            print(item["publisher"])
            print(item["date"])
            print(item["price"])
            print(item["detail"])
            print()
            if self.opened:
                self.count += 1     # used to build bID
                ID = str(self.count).zfill(8)   # left-pad with zeros to 8 characters
                # Insert the record into the table
                self.cursor.execute("insert into books(bID,bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) values(%s,%s,%s,%s,%s,%s,%s)",(ID, item["title"], item["author"], item["publisher"], item["date"], item["price"],item["detail"]))

        except Exception as err:
            print("wrong error:"+str(err))
        return item

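  While Scrapy is running, the open_spider(self, spider) method of this class is called as soon as a spider is opened, and close_spider(self, spider) is called as soon as that spider is closed. The program therefore connects to the MySQL database and creates the cursor self.cursor in open_spider, and commits the data and closes the connection in close_spider. The count variable keeps track of how many books have been scraped.
  Each time an item arrives in the processing method process_item, the program prints its contents and inserts it into the database with an SQL insert statement.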
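The open / process / close lifecycle can be tried out without a MySQL server. The sketch below is an illustration under assumptions, not the pipeline itself: it swaps in the standard library's sqlite3 with an in-memory database, keeps only three of the columns, and uses a made-up class name `SqliteBookPipeline`. Note that sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

class SqliteBookPipeline:
    """Same three-hook shape as BookPipeline, with sqlite3 standing in for MySQL."""

    def open_spider(self, spider):
        # Connect once and prepare the table before any item arrives.
        self.con = sqlite3.connect(":memory:")
        self.con.execute(
            "create table books(bID text primary key, bTitle text, bPrice text)"
        )
        self.count = 0

    def process_item(self, item, spider):
        # Build a zero-padded 8-character ID from the running count.
        self.count += 1
        book_id = str(self.count).zfill(8)
        self.con.execute(
            "insert into books(bID, bTitle, bPrice) values(?, ?, ?)",
            (book_id, item["title"], item["price"]),
        )
        return item

    def close_spider(self, spider):
        # Commit once at the end, read the rows back for inspection, then close.
        self.con.commit()
        self.rows = self.con.execute(
            "select bID, bTitle from books order by bID"
        ).fetchall()
        self.con.close()

# Drive the hooks by hand in the order Scrapy would call them.
pipeline = SqliteBookPipeline()
pipeline.open_spider(spider=None)
for sample in [{"title": "Book A", "price": "10.00"}, {"title": "Book B", "price": "20.00"}]:
    pipeline.process_item(sample, spider=None)
pipeline.close_spider(spider=None)
print(pipeline.count, pipeline.rows)
```

Passing the values as a parameter tuple, as the original insert statement also does, lets the driver handle quoting, so titles containing quotes do not break the SQL.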

3. Edit Scrapy's configuration file settings.py:

Find the following two settings; uncomment the first and modify the second.

DEFAULT_REQUEST_HEADERS = {     # mimic a browser
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

ITEM_PIPELINES = {
   'book.pipelines.BookPipeline': 300,
}

With this in place, scraped items are pushed to the BookPipeline class in pipelines.py.

4. Write the Scrapy spider mySpider.py:
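The number 300 is the pipeline's priority: Scrapy passes each item through the enabled pipelines from the lowest value to the highest, and values are customarily chosen in the 0 to 1000 range. With a single pipeline the exact number does not matter. The small sketch below only illustrates the ordering rule in plain Python; `CleanupPipeline` is a hypothetical second pipeline that does not exist in this project.

```python
ITEM_PIPELINES = {
    "book.pipelines.BookPipeline": 300,
    "book.pipelines.CleanupPipeline": 100,  # hypothetical, for illustration only
}

# Lower values run first, so sort the class paths by their priority number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```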

import scrapy
from book.items import BookItem # adjust this import if your project layout differs from mine
from bs4 import UnicodeDammit
class MySpider(scrapy.Spider):
    name = "mySpider"
    key = 'python'
    source_url = 'http://search.dangdang.com/'
    def start_requests(self):   # called by Scrapy when the spider starts
        url = MySpider.source_url+"?key="+MySpider.key+"&act=input"
        yield scrapy.Request(url=url, callback=self.parse)  # parse() handles the response
    def parse(self, response):
        try:
            dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            selector = scrapy.Selector(text=data)
            lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")    # every <li> on the page that has a ddt-pit attribute, i.e. one per book
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()
                # detail is sometimes missing, in which case the result is None
                item = BookItem()
                item["title"] = title.strip() if title else ""
                item["author"] = author.strip() if author else ""
                item["date"] = date.strip()[1:] if date else ""  # the date text starts with a "/", so slice from index 1
                item["publisher"] = publisher.strip() if publisher else ""
                item["price"] = price.strip() if price else ""
                item["detail"] = detail.strip() if detail else ""
                yield item  # push one scraped record to process_item in pipelines.py
            # On the last page, link is None
            link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
            if link:
                url = response.urljoin(link)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)
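The spider passes the raw response bytes to UnicodeDammit with utf-8 and gbk as candidate encodings. As a rough approximation of that idea (not a description of how UnicodeDammit works internally), the sketch below tries each candidate in turn with plain `bytes.decode`; the function name `decode_with_fallback` is my own.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Decode bytes with the first candidate encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched cleanly: decode with the first candidate, replacing bad bytes.
    return raw.decode(encodings[0], errors="replace")

gbk_bytes = "当当网".encode("gbk")     # not valid UTF-8, so the gbk candidate is used
utf8_bytes = "python".encode("utf-8")
print(decode_with_fallback(gbk_bytes), decode_with_fallback(utf8_bytes))
```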

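A closer look at the site's HTML shows that the pagination information sits in a <div class='paging'> element. Under that <div>, inside <ul name='Fy'>, the <a> within <li class='next'> is the link to the next page. The spider takes this link, turns it into an absolute URL with response.urljoin, and issues another scrapy.Request whose callback is again parse. The parse function is thus invoked repeatedly, once per page, until the last page, where no next link exists.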
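The step from a relative href to an absolute URL can be sketched with the standard library's `urljoin`, which resolves a link against the URL of the page it was found on. The href value below is an assumption made up for illustration, and `next_page_url` is a hypothetical helper rather than part of the spider.

```python
from urllib.parse import urljoin

def next_page_url(current_url, next_href):
    """Return the absolute URL of the next page, or None on the last page."""
    if not next_href:
        return None  # no "next" link was found, so crawling stops here
    return urljoin(current_url, next_href)

current = "http://search.dangdang.com/?key=python&act=input"
# Hypothetical relative href, as it might appear in the "next" link.
following = next_page_url(current, "/?key=python&act=input&page_index=2")
print(following)
print(next_page_url(current, None))
```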

5. Write run.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl mySpider -s LOG_ENABLED=False".split())

Run run.py and check the MySQL database; the result is shown in the figure below: