Scraping the Douban Books Top 250

少壮Strive

Category: python. Tags: python

Article link: https://blog.csdn.net/qq_37520561/article/details/106899221

Abstract

The main goal of this project is to design a web crawler for a designated website that also meets different performance requirements.

Search engines are tools that help people retrieve information, but general-purpose search engines have limitations. Users in different fields and with different backgrounds often have different retrieval goals and needs, and the results a search engine returns contain many pages the user does not care about. A flexible, purpose-built crawler addresses this problem.

This Python crawler for the Douban Books Top 250 constructs the URL of each list page automatically, parses the pages, and collects information on the top 250 books.

Keywords: python; crawler

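1. Introduction

The Internet consists of a vast amount of data, and retrieving, organizing, and presenting that data effectively has great practical potential. Search engines, as tools that help people retrieve information, have become the entry point and guide for users of the World Wide Web. However, general-purpose search engines have limitations. Users in different fields and with different backgrounds often have different retrieval goals and needs. For example, if you search Baidu for the 250 most popular books, you may not find results that match what you want, because the results a search engine returns contain many pages you do not care about. A flexible crawler aimed at one site is a practical way to solve this problem.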

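2. System Structure

A web crawler (also called a web spider or web robot, and more often a "web scutter" in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.

The system for crawling the Douban Top 250 books is written in Python. It uses the urllib.request module to open and read a given URL and catches the exceptions defined in urllib.error; the re module to match regular expressions and pull out the content we want; the BeautifulSoup class from bs4 to parse the page; and the xlwt module to write the collected data to an Excel spreadsheet.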

3. Implementation

import re  # regular expressions for text matching
import urllib.error  # exceptions raised by urllib.request
import urllib.request  # open a URL and read the page

import xlwt  # write Excel (.xls) files
from bs4 import BeautifulSoup  # parse HTML and extract data


def main():
    baseurl = "https://book.douban.com/top250?start="
    # 1. Crawl and parse the pages
    datalist = getData(baseurl)
    savepath = ".\\豆瓣读书Top250.xls"
    # 2. Save the data
    saveData(datalist, savepath)

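# Link to the book's detail page
findLink = re.compile(r'<a class="nbg" href="(.*?)"')  # compiled pattern object that describes the rule
# Cover image
findImgSrc = re.compile(r'<img src="(.*?)"')
# Chinese title
findTitle = re.compile(r'<a.*title="(.*)">')
# Number of ratings
# \s matches whitespace, \D matches any non-digit; re.S lets . match newlines as well
findJudge = re.compile(r'[\s\D]*(\d*)人评价.*?', re.S)

# NOTE: the HTML tags in the four patterns below were stripped from the published
# copy of this post, which showed only r'(.*)'. The tags used here are an ASSUMPTION
# based on the markup of the Douban list page; check them against the live HTML.
# Foreign-language title
findForeignName = re.compile(r'<span style="font-size:12px;">(.*)</span>')
# Rating
findRating = re.compile(r'<span class="rating_nums">(.*)</span>')
# One-line summary
findInq = re.compile(r'<span class="inq">(.*)</span>')
# Publication details (author, publisher, date, price)
findBd = re.compile(r'<p class="pl">(.*?)</p>')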
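
To see how these rules behave, the sketch below applies a few of them to a small hand-written HTML fragment. The fragment is a hypothetical stand-in modelled on one entry of the list page, not data captured from Douban, and `first_match` is an illustrative helper that is not part of the original script. It returns a default value instead of raising IndexError when a rule finds nothing, which `re.findall(...)[0]` would do.

```python
import re

# Hypothetical fragment modelled on one list entry; not captured from the live site.
SAMPLE_ITEM = '''
<table width="100%">
<tr class="item">
<td><a class="nbg" href="https://book.douban.com/subject/0000000/">
<img src="https://example.com/cover.jpg" width="90"/></a></td>
<td><div class="pl2">
<a href="https://book.douban.com/subject/0000000/" title="示例书名">示例书名</a>
</div>
<div class="star clearfix">
<span class="rating_nums">9.0</span>
<span class="pl">(
    12345人评价
)</span>
</div></td>
</tr>
</table>
'''

find_link = re.compile(r'<a class="nbg" href="(.*?)"')
find_img_src = re.compile(r'<img src="(.*?)"')
find_title = re.compile(r'<a.*title="(.*)">')
find_judge = re.compile(r'[\s\D]*(\d*)人评价.*?', re.S)


def first_match(pattern, text, default=" "):
    """Return the first captured group, or `default` when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else default


link = first_match(find_link, SAMPLE_ITEM)
img_src = first_match(find_img_src, SAMPLE_ITEM)
title = first_match(find_title, SAMPLE_ITEM)
judge_num = first_match(find_judge, SAMPLE_ITEM)
missing = first_match(find_title, "")  # no match, so the default is returned

print(link, img_src, title, judge_num, repr(missing))
```

Because `.` does not match a newline unless re.S is set, find_title only matches when the whole `<a ... title="...">` tag sits on one line, as it does in this fragment.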

3.1 Fetching the page at a given URL

Simulate a browser's request headers and send the request to the Douban server.

def askURL(url):
    # The User-Agent header tells the Douban server what kind of client (browser and
    # operating system) is making the request, so it can decide what content to return.
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # hasattr(object, name) checks whether the object has an attribute called name
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html
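
The request and the error handling can both be inspected without touching the network. The sketch below uses only the standard library. It builds the same kind of Request object and shows why askURL tests for `code` with hasattr: an HTTPError (a subclass of URLError) carries a status code, while a plain URLError carries only a reason. `build_request` and `describe_error` are hypothetical helper names introduced for this illustration.

```python
import urllib.error
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
)


def build_request(url):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def describe_error(err):
    """Collect whichever of `code` and `reason` the exception provides."""
    parts = []
    if hasattr(err, "code"):
        parts.append("code=%s" % err.code)
    if hasattr(err, "reason"):
        parts.append("reason=%s" % err.reason)
    return ", ".join(parts)


url = "https://book.douban.com/top250?start=0"
req = build_request(url)
# Request stores header names via str.capitalize(), hence "User-agent".
print(req.get_method(), req.full_url, req.get_header("User-agent"))

# Exceptions constructed by hand, purely to show which attributes exist.
http_err = urllib.error.HTTPError(url, 404, "Not Found", None, None)
url_err = urllib.error.URLError("timed out")
print(describe_error(http_err))
print(describe_error(url_err))
```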

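3.2 Crawling the pages

Visiting the Douban Books list at "https://book.douban.com/top250" shows that each page holds 25 records. The first page is "https://book.douban.com/top250?start=0", each click on "next page" increases start by 25, and the last page is "https://book.douban.com/top250?start=225". We can therefore use BeautifulSoup to get the entries on each page, then use regular expressions (re) to extract the fields we need.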
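
The paging rule can be written down directly. This is a minimal sketch, assuming 10 pages of 25 records as described above; `page_urls` and its parameters are hypothetical names, not part of the original script.

```python
BASE_URL = "https://book.douban.com/top250?start="


def page_urls(base_url=BASE_URL, pages=10, page_size=25):
    """Return the URL of every list page, in order."""
    return [base_url + str(i * page_size) for i in range(pages)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```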

def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # fetch the 10 list pages
        url = baseurl + str(i * 25)
        html = askURL(url)  # page source

        # Parse the entries one by one
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('table'):
            # print(item)
            data = []  # all fields for one book
            item = str(item)

            # Link to the book's detail page
            link = re.findall(findLink, item)[0]  # re finds the text that matches the pattern
            data.append(link)

            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)  # cover image

            titles = re.findall(findTitle, item)[0]
            data.append(titles)  # Chinese title

            foreignName = re.findall(findForeignName, item)
            if len(foreignName) == 1:
                name = re.sub(':', "", foreignName[0]).strip()
                data.append(name)  # foreign-language title
            else:
                data.append(' ')  # leave the foreign-language title blank

            rating = re.findall(findRating, item)[0]
            data.append(rating)  # rating

            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)  # number of ratings

            inq = re.findall(findInq, item)
            if len(inq) != 0:
                inq = inq[0].replace("。", "")  # drop the full stop
                data.append(inq)  # one-line summary
            else:
                data.append(" ")  # leave blank

            bd = re.findall(findBd, item)[0]
            # bd = re.sub('<br(\s+)?/>(\s+)?', " ", bd)  # strip <br/>
            bd = re.sub('/', " ", bd)  # replace the slashes
            bd = " ".join(bd.split())  # split() with no argument splits on any whitespace (spaces, \n, \t)
            data.append(bd.strip())  # strip() removes leading and trailing whitespace

            datalist.append(data)  # store the finished record for this book
    # print(datalist)
    return datalist
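
The clean-up of the `bd` field at the end of the loop can be tried on its own. In the sketch below the sample string is made up for illustration, `clean_bd` repeats the two steps used in getData, and `split_bd` is a hypothetical variant that keeps the slash-separated parts as separate fields instead of merging them.

```python
import re

# Made-up publication line in the "author / translator / publisher / date / price" shape.
RAW_BD = "  Author Name / Translator Name /\n Example Press / 2006-5 / 29.00元 "


def clean_bd(bd):
    """Replace slashes with spaces, then collapse every run of whitespace."""
    bd = re.sub('/', " ", bd)
    return " ".join(bd.split())


def split_bd(bd):
    """Keep the slash-separated parts as individual, trimmed fields."""
    return [part.strip() for part in bd.split('/')]


cleaned = clean_bd(RAW_BD)
fields = split_bd(RAW_BD)
print(cleaned)
print(fields)
```

Splitting keeps the boundaries between author, publisher, date, and price, which the merged string loses. The number of parts is not guaranteed to be the same for every book, so any code that indexes into `fields` should check its length first.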

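3.3 Saving the data

Write the information collected by the crawler into an Excel spreadsheet.

def saveData(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # create the workbook; style_compression controls style compression (0 = off)
    sheet = book.add_sheet('豆瓣图书Top250', cell_overwrite_ok=True)  # add a sheet; cell_overwrite_ok allows a cell to be written more than once
    # Columns: detail link, cover image link, Chinese title, foreign-language title,
    # rating, number of ratings, summary, publication details
    col = ('图书详情链接', "图片链接", "图书中文名", "图书外国名", "评分", "评价数", "概况", "相关信息")
    for i in range(0, 8):
        sheet.write(0, i, col[i])  # header row
    # Iterate over what was actually collected, so fewer than 250 records does not raise IndexError
    for i, data in enumerate(datalist):
        print("第%d条" % (i + 1))
        for j in range(0, 8):
            sheet.write(i + 1, j, data[j])
    book.save(savepath)


if __name__ == "__main__":  # runs only when the file is executed as a script
    main()
    print("爬取完毕")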
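
The row and column layout that saveData produces can be checked without xlwt. This sketch is an illustration only, and the sample header and rows are invented. `cells_to_write` is a hypothetical helper that yields (row, column, value) triples in the order they would be passed to sheet.write, with the header in row 0 and record i in row i + 1.

```python
def cells_to_write(header, rows):
    """Yield (row, col, value) triples: header first, then one row per record."""
    for j, name in enumerate(header):
        yield 0, j, name
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            yield i + 1, j, value


# Invented two-column data, just to show the layout.
header = ("title", "rating")
rows = [["Book A", "9.0"], ["Book B", "8.5"]]

cells = list(cells_to_write(header, rows))
for cell in cells:
    print(cell)
```

With a real worksheet, each triple could be unpacked into a call such as `sheet.write(*cell)`. Because the loop runs over `rows` itself, it writes however many records were collected rather than assuming exactly 250.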