Building a Reading List from Douban Books Data with a Python Crawler

doUle_sUn

Tags: python

Article link: https://blog.csdn.net/doUle_sUn/article/details/79727168

0. Preface

Web crawler:

A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering)

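Drawing on tutorials, references, and code found online, I wrote a small crawler that collects book information from Douban Books. The end goal is to build a reading list based on tag information.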

1. Preparation

1.1 Dependencies

Python 3.6
Third-party library:
    requests
Standard-library modules:
    urllib
    re
    time
    random

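1.2 Page analysis

1.2.1 Target site

Target site: the Douban Books home page (https://book.douban.com/)

Book tag information: the Douban book tag index (https://book.douban.com/tag/?view=type&icn=index-sorttags-all)

Example data page: the Douban book tag page for 小说 (fiction)

Consider the URL of the second page of the 小说 tag: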

https://book.douban.com/tag/小说?start=20&type=S

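Here the path segment after tag/ is the tag name, type=S sorts the books by rating, and start=20 is the zero-based offset of the first book on the page (each page lists 20 books).
So the third page of the 村上春树 (Haruki Murakami) tag has the URL: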

https://book.douban.com/tag/村上春树?start=40&type=S
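
The URL pattern above can be captured in a small helper. This is a minimal sketch and not part of the original project: the name build_tag_url and the page_size default of 20 (inferred from start=20 on the second page) are assumptions made for illustration.

```python
from urllib.parse import quote

BASE_URL = 'https://book.douban.com/tag/'


def build_tag_url(tag, page, page_size=20):
    """Return the listing URL for a 1-based page of a tag, sorted by rating.

    build_tag_url and page_size are illustrative names, not from the article.
    """
    start = (page - 1) * page_size  # offset of the first book on the page
    # quote() percent-encodes the non-ASCII tag name as UTF-8
    return BASE_URL + quote(tag) + '?start=' + str(start) + '&type=S'


print(build_tag_url('小说', 2))
print(build_tag_url('村上春树', 3))
```

Percent-encoding the tag yields the same form of URL that the robots check in the next section uses for the 小说 tag.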

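1.2.2 Robots protocol

Before writing the crawler, check whether the target site allows the relevant pages to be crawled.
Use the robotparser module of the urllib library to read https://book.douban.com/robots.txt, the robots protocol for Douban Books.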

from urllib.robotparser import RobotFileParser

UrlRobots = 'https://book.douban.com/robots.txt'

def GetRobotsTxt(url):
    rp = RobotFileParser()
    rp.set_url(url)
    rp.read()  # download and parse robots.txt
    # can_fetch(useragent, url) returns True if that user agent may fetch the URL
    print(rp.can_fetch('*', 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'))
    print(rp.can_fetch('*', 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'))
    print(rp.can_fetch('*', 'https://book.douban.com/'))

GetRobotsTxt(UrlRobots)

The robots.txt file reads:

User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Sitemap: http://www.douban.com/sitemap_index.xml
Sitemap: http://www.douban.com/sitemap_updated_index.xml

User-agent: Wandoujia Spider
Disallow: /

The program prints:

True
True
True

So all of the pages above may be crawled.
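
The same rules can also be checked without a network request. The sketch below feeds the robots.txt text quoted above into RobotFileParser.parse(); the helper name make_parser and the two extra test URLs (a /subject_search query and the Wandoujia Spider user agent) are my own additions for illustration.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as quoted in the article
ROBOTS_TXT = """\
User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Sitemap: http://www.douban.com/sitemap_index.xml
Sitemap: http://www.douban.com/sitemap_updated_index.xml

User-agent: Wandoujia Spider
Disallow: /
"""


def make_parser(text):
    """Build a parser from robots.txt text instead of downloading it."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


rp = make_parser(ROBOTS_TXT)
# tag pages are not listed under Disallow, so they are allowed
print(rp.can_fetch('*', 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'))
# the search endpoint is disallowed for every user agent
print(rp.can_fetch('*', 'https://book.douban.com/subject_search?search_text=python'))
# Wandoujia Spider is disallowed everywhere
print(rp.can_fetch('Wandoujia Spider', 'https://book.douban.com/'))
```

Parsing a saved copy this way is handy for experimenting with the rules, but a live crawler should still read the current file from the site.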

1.2.3 Analyzing the page source

First, extract the relevant block from the source of the page to be scraped:

<li class="subject-item">
    <div class="pic">
      <a class="nbg" href="https://book.douban.com/subject/1057244/" 
  onclick="moreurl(this,{i:'0',query:'',subject_id:'1057244',from:'book_subject_search'})">
        <img class="" src="https://img1.doubanio.com/mpic/s1595557.jpg"
          width="90">
      </a>
    </div>
    <div class="info">
      <h2 class="">

  <a href="https://book.douban.com/subject/1057244/" title="边城" 
  onclick="moreurl(this,{i:'0',query:'',subject_id:'1057244',from:'book_subject_search'})">
    边城
  </a>

      </h2>
      <div class="pub">

  沈从文、黄永玉 卓雅 插图. / 北岳文艺出版社 / 2002-4 / 12.00元

      </div>

  <div class="star clearfix">
        <span class="allstar45"></span>
        <span class="rating_nums">8.6</span>

    <span class="pl">
        (73914人评价)
    </span>
  </div>

    <p>《边城》是沈从文的代表作，写于一九三三年至一九三四年初。这篇作品如沈从文的其他湘西作品，着眼于普通人、善良人的命运变迁，描摹了湘女翠翠阴差阳错的生活悲剧，诚... </p>

      <div class="ft">
  <div class="collect-info">
  </div>
        <div class="cart-actions">
    <span class="buy-info">
      <a href="https://book.douban.com/subject/1057244/buylinks">
        纸质版 7.80 元起
      </a>
    </span>
          </div>
      </div>
    </div>
  </li>

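The information we can extract from it includes:

Start of the block: <li class="subject-item">
End of the block: </li>
Book information: Douban page URL of the book, cover image URL, title, author and translator, publication details, price, rating, number of ratings, and summary

From this we can write a regular expression that extracts the required data. It captures six of these fields: page URL, cover URL, title, the author text before the first "/", rating, and summary: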

class="nbg" href="(.*?)".*?src="(.*?)".*?title="(.*?)".*?<div class="pub">\s*(.*?)\/.*?nums">(.*?)</span>.*?<p>(.*?)</p>
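
To see what the expression captures, it can be run against a shortened copy of the block shown earlier. This is only a sketch: SAMPLE_HTML is a trimmed version of the 边城 item with a shortened summary, and PATTERN is the same expression split across two string literals.

```python
import re

# The article's regular expression, unchanged, split over two raw strings
PATTERN = re.compile(
    r'class="nbg" href="(.*?)".*?src="(.*?)".*?title="(.*?)".*?'
    r'<div class="pub">\s*(.*?)\/.*?nums">(.*?)</span>.*?<p>(.*?)</p>',
    re.S,  # let '.' match newlines, because one item spans many lines
)

# Trimmed copy of the sample item (assumption: summary shortened)
SAMPLE_HTML = '''
<li class="subject-item">
  <a class="nbg" href="https://book.douban.com/subject/1057244/">
    <img class="" src="https://img1.doubanio.com/mpic/s1595557.jpg" width="90">
  </a>
  <a href="https://book.douban.com/subject/1057244/" title="边城">边城</a>
  <div class="pub">
    沈从文、黄永玉 卓雅 插图. / 北岳文艺出版社 / 2002-4 / 12.00元
  </div>
  <span class="rating_nums">8.6</span>
  <p>《边城》是沈从文的代表作。</p>
</li>
'''

matches = PATTERN.findall(SAMPLE_HTML)
for url, cover, title, author, rating, summary in matches:
    # the author group keeps the space before the first '/', so strip it
    print(title, author.strip(), rating, url)
```

Each match is a tuple of six strings, which is the shape of the rows shown in the results section.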

2. Code

2.1 Fetching pages

In this project I arbitrarily fetch 5 pages per tag, i.e. 100 books, and save the source of each page to HtmlCode.txt.

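import random
import time

import requests

def GetOneType(UrlLabel, Headers, Num):
    for i in range(5):
        print('Fetching page ' + str(i+1) + ' of tag ' + labels[Num])
        # each page lists 20 books, so page i starts at offset i*20
        url = UrlLabel + '?start=' + str(i*20) + '&type=S'

        rp = requests.get(url, headers = Headers)

        # overwrite the file on every page; ReEx reads it back immediately
        with open("HtmlCode.txt", 'w', encoding = 'utf-8') as f:
            f.write(rp.text)
        ReEx(Num)
        # wait 3 to 4 seconds between requests to reduce the risk of being blocked
        time.sleep(3 + random.random())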

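2.2 Regular expression

After fetching the page source, match it against the regular expression to extract the useful data and append it to the corresponding txt file.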

import re

def ReEx(Num):
    FileName = 'result' + str(Num) + '.txt'
    with open('HtmlCode.txt', 'r', encoding = 'utf-8') as file_re:
        content = file_re.read()
        # captures: page URL, cover URL, title, author, rating, summary
        STR = r'class="nbg" href="(.*?)".*?src="(.*?)".*?title="(.*?)".*?<div class="pub">\s*(.*?)\/.*?nums">(.*?)</span>.*?<p>(.*?)</p>'

        # re.S lets '.' match newlines, since each item spans several lines
        result = re.findall(STR, content, re.S|re.M)
        #print(result)

        # append the list of tuples for this page to the tag's result file
        with open(FileName, 'a', encoding = 'utf-8') as file_result:
            file_result.write(str(result))
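
One weakness of matching the whole page at once is that an item lacking a field, for example a book without a rating span, lets the lazy .*? run on into the next item and mix the two. A possible variation, which is not what the original code does, is to split the page on the item start tag and match each chunk separately, then store one JSON object per line. In the sketch below, parse_items, to_jsonl, FIELDS, and the made-up TWO_ITEMS sample (with example.com URLs) are all hypothetical names and data introduced for illustration.

```python
import json
import re

ITEM_START = '<li class="subject-item">'
PATTERN = re.compile(
    r'class="nbg" href="(.*?)".*?src="(.*?)".*?title="(.*?)".*?'
    r'<div class="pub">\s*(.*?)\/.*?nums">(.*?)</span>.*?<p>(.*?)</p>',
    re.S,
)
FIELDS = ('url', 'cover', 'title', 'author', 'rating', 'summary')


def parse_items(html):
    """Match each item chunk on its own so items cannot bleed into each other."""
    books = []
    for chunk in html.split(ITEM_START)[1:]:
        m = PATTERN.search(chunk)
        if m:  # chunks missing a field are skipped instead of mis-parsed
            books.append(dict(zip(FIELDS, (g.strip() for g in m.groups()))))
    return books


def to_jsonl(books):
    """Serialize books as JSON Lines, keeping Chinese text readable."""
    return ''.join(json.dumps(b, ensure_ascii=False) + '\n' for b in books)


# Made-up sample: the first item has no rating span, the second is complete
TWO_ITEMS = '''
<li class="subject-item">
  <a class="nbg" href="https://example.com/book/1"><img src="https://example.com/1.jpg"></a>
  <a href="https://example.com/book/1" title="Unrated Book">Unrated Book</a>
  <div class="pub"> Author A / Publisher / 2001 </div>
  <p>No rating span in this item.</p>
</li>
<li class="subject-item">
  <a class="nbg" href="https://example.com/book/2"><img src="https://example.com/2.jpg"></a>
  <a href="https://example.com/book/2" title="Rated Book">Rated Book</a>
  <div class="pub"> Author B / Publisher / 2002 </div>
  <span class="rating_nums">8.0</span>
  <p>Has every field.</p>
</li>
'''

books = parse_items(TWO_ITEMS)
print(to_jsonl(books), end='')
```

JSON Lines output can be appended page after page and read back one line at a time, whereas repeatedly appending str(result) leaves several list literals back to back in one file.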

2.3 main

Crawl all the data, tag by tag

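# The tags can be changed to any of those listed on the Douban tag page
labels = ['小说', '外国文学', '文学', '随笔', '中国文学', '经典', '日本文学', '散文', '村上春树']

# Headers (the request headers passed to requests.get) is not shown in this article
def GetAllPages():
    for i in range(len(labels)):
        UrlLabel = 'https://book.douban.com/tag/' + labels[i]
        GetOneType(UrlLabel,Headers,i)
    print('Crawl finished')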

The full code is on GitHub:
https://github.com/Doublesdb/WebSpiderDouBanBook

2.4 Crawl results

Part of the data in the 小说.txt file:

    *   ('https://book.douban.com/subject/1770782/', 'https://img3.doubanio.com/mpic/s1727290.jpg', '追风筝的人', '[美] 卡勒德·胡赛尼 ', '8.9', '12岁的阿富汗富家少爷阿米尔与仆人哈桑情同手足。然而，在一场风筝比赛后，发生了一件悲惨不堪的事，阿米尔为自己的懦弱感到自责和痛苦，逼走了哈桑，不久，自己也跟... '),
    *   ('https://book.douban.com/subject/1008145/', 'https://img3.doubanio.com/mpic/s1070222.jpg', '围城', '钱锺书 ', '8.9', '《围城》是钱钟书所著的长篇小说。第一版于1947年由上海晨光出版公司出版。1949年之后，由于政治等方面的原因，本书长期无法在中国大陆和台湾重印，仅在香港出... ')
    *   ('https://book.douban.com/subject/1082154/', 'https://img3.doubanio.com/mpic/s23836852.jpg', '活着', '余华 ', '9.1', '地主少爷福贵嗜赌成性，终于赌光了家业一贫如洗，穷困之中的福贵因为母亲生病前去求医，没想到半路上被国民党部队抓了壮丁，后被解放军所俘虏，回到家乡他才知道母亲已... ')
    *   ('https://book.douban.com/subject/1200840/', 'https://img3.doubanio.com/mpic/s2335693.jpg', '平凡的世界（全三部）', '路遥 ', '9.0', '《平凡的世界》是一部现实主义小说，也是一部小说形式的家族史。作者浓缩了中国西北农村的历史变迁过程，在小说中全景式地表现了中国当代城乡的社会生活。在近十年的广... ')
    *   ('https://book.douban.com/subject/25862578/', 'https://img3.doubanio.com/mpic/s27264181.jpg', '解忧杂货店', '[日] 东野圭吾 ', '8.6', '现代人内心流失的东西，这家杂货店能帮你找回——\n僻静的街道旁有一家杂货店，只要写下烦恼投进卷帘门的投信口，第二天就会在店后的牛奶箱里得到回答。\n因男友身患绝... ')
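
As a step toward the reading list itself, rows like these can be ranked by rating. The following is only a sketch of one way to do it: build_reading_list and top_n are hypothetical names, and the summaries in the sample rows are replaced with empty strings to keep the example short.

```python
def build_reading_list(books, top_n=3):
    """Rank (url, cover, title, author, rating, summary) tuples by rating."""
    rated = []
    for url, cover, title, author, rating, summary in books:
        try:
            score = float(rating)
        except ValueError:
            continue  # skip rows whose rating is not a number
        rated.append((score, title.strip(), author.strip(), url))
    # sort() is stable, so equal ratings keep their crawl order
    rated.sort(key=lambda row: row[0], reverse=True)
    return rated[:top_n]


# The five rows shown above, with summaries left empty for brevity
books = [
    ('https://book.douban.com/subject/1770782/', 'https://img3.doubanio.com/mpic/s1727290.jpg', '追风筝的人', '[美] 卡勒德·胡赛尼 ', '8.9', ''),
    ('https://book.douban.com/subject/1008145/', 'https://img3.doubanio.com/mpic/s1070222.jpg', '围城', '钱锺书 ', '8.9', ''),
    ('https://book.douban.com/subject/1082154/', 'https://img3.doubanio.com/mpic/s23836852.jpg', '活着', '余华 ', '9.1', ''),
    ('https://book.douban.com/subject/1200840/', 'https://img3.doubanio.com/mpic/s2335693.jpg', '平凡的世界（全三部）', '路遥 ', '9.0', ''),
    ('https://book.douban.com/subject/25862578/', 'https://img3.doubanio.com/mpic/s27264181.jpg', '解忧杂货店', '[日] 东野圭吾 ', '8.6', ''),
]

for score, title, author, url in build_reading_list(books):
    print(score, title, author, url)
```

Ranking within one tag is the simplest case; combining several tags would need the rows to be deduplicated by URL first.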

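3. Summary

This article uses only the most basic crawling operations to extract and store data from Douban Books.
Its shortcomings are that it uses no counter-anti-crawler measures such as IP proxies or cookies, relying only on delays to avoid having the IP address banned,
and that it only extracts the data without visualizing it. These are the areas most in need of improvement.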
