Scraping e-book download links from the allitebooks website

Latest recommended article published on 2024-03-29 09:47:06

redwingz

Category: Network Applications. Tags: python, spider

Article link: https://blog.csdn.net/sinat_20184565/article/details/82765578

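allitebooks is one of the few free e-book download sites that is still being updated. The site I used to visit, http://it-ebooks.info, has removed all of its download links for copyright reasons and turned into an e-book shopping guide. While allitebooks is still reachable, it makes sense to collect its e-books now. A quick search showed that someone had already written a Python crawler that grabs the allitebooks download links, so I took it and used it directly.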

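The site structure is very simple, with two levels:

1) The first level is the book listing pages, 10 books per page. The URL format is http://www.allitebooks.com/page/x, where x is the page number. Each listing page contains the URL of every book's detail page: in the page's HTML, the value of the href attribute is the detail-page URL. This hyperlink wraps the book's cover image, so clicking the cover opens the detail page;

2) The second level is each book's detail page, which contains the download link, for example: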

<a href="http://file.allitebooks.com/20180913/Monetizing Machine Learning.pdf" target="_blank"><i class="fa fa-download" aria-hidden="true"></i> Download PDF <span class="download-size">(22.3 MB)</span></a>
</span>
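
The first-level URL scheme described above can be sketched as a small generator. This is only an illustration: the function name `listing_page_urls` and its parameters are hypothetical, and the total number of pages is not stated here, so the caller has to supply the range.

```python
# URL format of the listing pages, as described in the text.
LISTING_URL_FORMAT = "http://www.allitebooks.com/page/{}"


def listing_page_urls(first_page, last_page):
    """Yield the listing-page URLs for first_page..last_page (inclusive)."""
    for page in range(first_page, last_page + 1):
        yield LISTING_URL_FORMAT.format(page)


if __name__ == "__main__":
    for url in listing_page_urls(1, 3):
        print(url)
```

Using a generator keeps the crawl loop simple: each URL is produced only when the crawler is ready to fetch the next page.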

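The two regular expressions below, BOOK_LINK_PATTERN and DOWNLOAD_LINK_PATTERN, extract the detail-page URLs from a listing page and the download link from a detail page, respectively.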

# Raw strings; [^"]* stops at the closing quote of the href value
BOOK_LINK_PATTERN = r'href="([^"]*)" rel="bookmark">'
DOWNLOAD_LINK_PATTERN = r'<a href="(http://file[^"]*)" target="_blank">'
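
As a rough illustration of how the two patterns might be applied with Python's `re` module, here is a minimal offline sketch. The patterns are restated as raw strings with `[^"]*` so that the block is self-contained. The listing fragment and its `example-book` URL are made up for the demonstration (the real listing markup may differ), and the helper names `extract_book_links` and `extract_download_link` are hypothetical.

```python
import re

BOOK_LINK_PATTERN = r'href="([^"]*)" rel="bookmark">'
DOWNLOAD_LINK_PATTERN = r'<a href="(http://file[^"]*)" target="_blank">'

# Hypothetical listing-page fragment: the same detail URL appears twice,
# once around the cover image and once around the title.
LISTING_HTML = """
<a href="http://www.allitebooks.com/example-book/" rel="bookmark"><img src="cover.jpg"></a>
<h2><a href="http://www.allitebooks.com/example-book/" rel="bookmark">Example Book</a></h2>
"""

# Detail-page fragment taken from the download link shown earlier.
DETAIL_HTML = (
    '<a href="http://file.allitebooks.com/20180913/Monetizing Machine Learning.pdf" '
    'target="_blank"><i class="fa fa-download" aria-hidden="true"></i> Download PDF</a>'
)


def extract_book_links(listing_html):
    """Return the detail-page URLs found in a listing page, without duplicates."""
    found = re.findall(BOOK_LINK_PATTERN, listing_html)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


def extract_download_link(detail_html):
    """Return the first download link in a detail page, or None if absent."""
    match = re.search(DOWNLOAD_LINK_PATTERN, detail_html)
    return match.group(1) if match else None
```

Returning `None` rather than indexing into an empty result lets the caller decide what to do with a detail page that has no download link.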

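The crawler I found online exited with an error after running for a few minutes. Looking at the code, the cause was that some book detail pages have no download link, which raised an exception. I modified the program to add exception handling. The complete code is at https://github.com/zhangkaiheb/allitebooksSpider.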
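
The kind of handling described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the repository: `collect_links` and `save_lines` are hypothetical names, and `fetch` stands for any caller-supplied function that maps a URL to its HTML text, which keeps the example runnable without network access.

```python
import re

DOWNLOAD_LINK_PATTERN = r'<a href="(http://file[^"]*)" target="_blank">'


def collect_links(detail_urls, fetch):
    """Return (download_links, failed_urls) for the given detail-page URLs.

    A page that cannot be fetched, or that has no download link, is
    recorded in failed_urls instead of aborting the whole run.
    """
    links, failed = [], []
    for url in detail_urls:
        try:
            html = fetch(url)
        except OSError:
            # Network-level failure: remember the page and keep going.
            failed.append(url)
            continue
        match = re.search(DOWNLOAD_LINK_PATTERN, html)
        if match is None:
            # The detail page exists but carries no download link.
            failed.append(url)
        else:
            links.append(match.group(1))
    return links, failed


def save_lines(path, lines):
    """Write one entry per line, e.g. to result.txt or error.txt."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

Collecting failures in a separate list means one bad page costs a single entry in the error file rather than the whole crawl.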

Run the crawler:

$ python3 spider.py

page 1:
http://file.allitebooks.com/20180916/Troubleshooting and Maintaining Your PC All-in-One For Dummies, 3rd Edition.pdf
http://file.allitebooks.com/20180912/Applied Natural Language Processing with Python.pdf
http://file.allitebooks.com/20180911/Beginning Reactive Programming with Swift.pdf
http://file.allitebooks.com/20180915/Website Scraping with Python.pdf
http://file.allitebooks.com/20180916/Hacking For Dummies, 6th Edition.epub
http://file.allitebooks.com/20180917/Introducing Microsoft Flow.pdf
http://file.allitebooks.com/20180913/Monetizing Machine Learning.pdf
http://file.allitebooks.com/20180912/Pro Vuejs 2.pdf
http://file.allitebooks.com/20180913/iPhone For Dummies, 11th Edition.pdf
http://file.allitebooks.com/20180914/Designing Web APIs.pdf

page 2:
http://file.allitebooks.com/20180910/Minecraft Recipes For Dummies.pdf
http://file.allitebooks.com/20180904/Pro Android with Kotlin.pdf
http://file.allitebooks.com/20180909/QuickBooks 2018 For Dummies.pdf
http://file.allitebooks.com/20180906/Backup - Recovery.pdf
http://file.allitebooks.com/20180908/Applied Deep Learning.pdf
http://file.allitebooks.com/20180907/Beginning SVG.pdf
http://file.allitebooks.com/20180904/SQL Server 2017 Query Performance Tuning, 5th Edition.pdf
http://file.allitebooks.com/20180908/Introducing InnoDB Cluster.pdf
http://file.allitebooks.com/20180909/Visual Design of GraphQL Data.pdf
http://file.allitebooks.com/20180906/iPad All-in-One For Dummies, 7th Edition.pdf
......

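When the run finishes, all download links are saved in result.txt. They can be imported into Xunlei (Thunder) for batch downloading; note that Xunlei allows adding at most 5000 download links. For books that have no download link, the detail-page URL is saved in error.txt.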
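
Because of the 5000-link limit, it may help to split result.txt into smaller files before importing. The sketch below is one possible way to do that. The function names and the `result_part` output prefix are assumptions made for illustration, and the file is assumed to hold one link per line.

```python
def chunk_lines(lines, size=5000):
    """Split a list into consecutive chunks of at most `size` items."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def split_result_file(path="result.txt", size=5000, prefix="result_part"):
    """Write each chunk of links to its own file and return the file names."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so they do not count toward the limit.
        links = [line.strip() for line in f if line.strip()]
    out_paths = []
    for n, chunk in enumerate(chunk_lines(links, size), start=1):
        out_path = f"{prefix}{n}.txt"
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("\n".join(chunk) + "\n")
        out_paths.append(out_path)
    return out_paths
```

Each resulting file can then be imported as a separate batch.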