Scraping the Douban Books Top 250 list with Python and saving it to a CSV file

1. Page analysis

Douban Books Top 250

Here are the URLs of the first four pages:

https://book.douban.com/top250
https://book.douban.com/top250?start=25
https://book.douban.com/top250?start=50
https://book.douban.com/top250?start=75

The first page is also accessible as https://book.douban.com/top250?start=0.

So all 10 page URLs can be constructed simply by changing the number after start= (it increases by 25 per page).
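The URL construction described above can be sketched as a one-line list comprehension:

```python
# Build the 10 page URLs: start goes 0, 25, 50, ..., 225 (25 books per page)
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
print(len(urls))
```

`range(0, 250, 25)` stops before 250, giving exactly the 10 values needed; note that `range(0, 251, 25)` would produce an eleventh, empty page at start=250.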

The fields to scrape are: book title, the book's URL, author, publisher and publication date, price, rating, and the one-line review.



2. Inspect the page source to work out its structure
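In the page source, each book occupies its own `<tr class="item">` row, which is why the script later loops over `//tr[@class="item"]`. A minimal sketch of extracting one such row with lxml (the HTML snippet is a hand-written approximation of Douban's markup, not a live page):

```python
from lxml import etree

# Hand-written stand-in for one row of the Douban Top 250 table
sample = '''
<table>
  <tr class="item">
    <td>
      <div class="pl2">
        <a href="https://book.douban.com/subject/1084336/" title="小王子">小王子</a>
      </div>
      <p class="pl">[法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元</p>
      <div class="star">
        <span class="allstar45"></span>
        <span class="rating_nums">9.0</span>
      </div>
      <p class="quote"><span class="inq">献给长成了大人的孩子们</span></p>
    </td>
  </tr>
</table>
'''

selector = etree.HTML(sample)
info = selector.xpath('//tr[@class="item"]')[0]
name = info.xpath('td/div/a/@title')[0]        # book title from the link's title attribute
rate = info.xpath('td/div/span[2]/text()')[0]  # second span inside div.star holds the rating
print(name, rate)
```

The relative XPaths (`td/div/a/@title`, `td/div/span[2]/text()`, and so on) are evaluated against each row, so the same expressions work for every book on every page.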



3. Required libraries

requests fetches the pages, lxml parses and extracts the data, and csv stores it:

import requests
from lxml import etree
import csv


4. Source code

# Import the required libraries
import requests
from lxml import etree
import csv

# Create the CSV file and write the header row
# (raw string so the backslashes in the Windows path are not treated as escapes)
fp = open(r'D:\Code\doubanbook2.csv', 'wt', newline='', encoding='utf-8')
writer = csv.writer(fp)
writer.writerow(('书名','地址','作者','出版社','出版日期','价格','评分','评价'))

# Build the list of page URLs (10 pages, 25 books each; stop before start=250)
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]

# Request headers, so the site sees an ordinary browser
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}


# Loop over the pages
for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # Each book sits in its own <tr class="item"> row; loop over those
    infos = selector.xpath('//tr[@class="item"]')

    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]  # renamed so it does not shadow the loop variable url
        book_infos = info.xpath('td/p/text()')[0]
        parts = [p.strip() for p in book_infos.split('/')]
        author = parts[0]
        publisher = parts[-3]  # index from the end: the number of author/translator fields varies
        date = parts[-2]
        price = parts[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else "空"

        # Write one row per book
        writer.writerow((name,book_url,author,publisher,date,price,rate,comment))

# Close the file
fp.close()
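The trickiest step above is splitting the `td/p/text()` info line. A worked example with a hypothetical info string shows why publisher, date, and price are indexed from the end: the number of name fields at the front varies (translated books list a translator, others do not), but the last three fields are always publisher, date, and price.

```python
# Hypothetical info line in the format "author / [translator /] publisher / date / price"
book_infos = '[法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元'
parts = [p.strip() for p in book_infos.split('/')]  # strip the spaces around each " / "
author = parts[0]      # the first field is always the author
publisher = parts[-3]  # counting from the end skips the optional translator
date = parts[-2]
price = parts[-1]
print(author, publisher, date, price)
```

For a book with no translator the list is one element shorter, but the negative indices still land on the right fields.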


5. Results

