1. Background
This semester we have a hands-on software development course, and our group is building a book trading platform. For a book trading platform, acquiring book data is obviously essential, and Douban's rich book listings are an important source for it. The later stages of the project need a large amount of book data for the knowledge graph and recommendation features, so collecting that data became a priority. I had never learned Python before, so I picked up most of what I needed online, ran into plenty of errors along the way, and eventually found an approach that works for a crawler beginner like me. I started by scraping the Top 250 books page for practice, but the information there is incomplete; details such as the ISBN only appear on each book's detail page. So after gathering some more material, I tried writing the code myself. I think the code below is a reasonable introduction to crawling for beginners, and the important points are explained in the comments.
2. How the crawler code is organized
Douban does have anti-scraping measures, and after a lot of searching I never found a way to bypass them completely, but collecting close to ten thousand book records per day is currently no problem. The code is split into the following three parts:
- user.py
  This file holds the User-Agent strings. The User-Agent is the user agent string every browser sends; if a crawler omits it, the site assumes the request is not coming from a browser, treats it as a bot, and restricts access. Ideally the request headers should carry more complete information; here I only set the User-Agent (a fuller-headers sketch follows the user.py listing in section 3).
- proxy.py
  This file stores the rotating proxy IPs. When a single IP pulls a lot of data at high speed, some sites decide it is a machine rather than a normal visitor and ban the IP, after which even ordinary page visits fail. Rotating proxies makes the requests look like they come from different IPs. There are several free proxy sites; the two I use are the ones referenced in the code (a quick proxy-check sketch follows the proxy.py listing).
- searchBook.py
  This file contains the crawler itself; I won't go through it line by line here since the code is commented.
3. Implementation
user.py
import random


def getuser():
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400",
        # Opera
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        # Firefox
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        # Safari
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        # Chrome
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        # 360 browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        # Taobao browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        # Liebao browser (LBBROWSER)
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        # QQ browser
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        # Sogou browser
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        # Maxthon browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        # UC browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]
    # Pick a random User-Agent for each request
    user_agent = random.choice(USER_AGENTS)
    return user_agent
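As noted in section 2, the headers used in this post only carry a User-Agent; a fuller header set tends to look more like a real browser. Below is a minimal sketch of what that could look like on top of getuser(). The extra fields and their values are common browser defaults chosen for illustration, not something the original code sends.

import user


def build_headers(referer="https://book.douban.com/"):
    # Illustrative extra headers; only User-Agent is actually used by the code in this post
    return {
        "User-Agent": user.getuser(),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Referer": referer,
        "Connection": "keep-alive",
    }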
proxy.py (either proxy source works; pick one)
# kuaidaili free proxies
import requests
from lxml import etree

url = 'http://www.kuaidaili.com/free/inha/6'  # kuaidaili free proxy list
data = requests.get(url)
html = etree.HTML(data.text)
# XPath expressions for the table columns
ip_xpath = '//*[@id="list"]/table/tbody/tr/td[1]/text()'
port_xpath = '//*[@id="list"]/table/tbody/tr/td[2]/text()'
http_or_https_xpath = '//*[@id="list"]/table/tbody/tr/td[4]/text()'
# Extract the matching text nodes
ip_list = html.xpath(ip_xpath)
port_list = html.xpath(port_xpath)
http_or_https_list = html.xpath(http_or_https_xpath)
# Combine into proxy dicts usable by requests
list_zip = zip(ip_list, port_list, http_or_https_list)
proxy_dict = {}
proxy_list = []
for ip, port, http_or_https in list_zip:
    # requests expects lowercase scheme keys ('http' / 'https')
    proxy_dict[http_or_https.lower()] = f'{ip}:{port}'
    proxy_list.append(proxy_dict)
    proxy_dict = {}
# Xici proxies (alternative source; uncomment to use)
# import re
# import random
#
# import requests
# from bs4 import BeautifulSoup
#
# import user
#
#
# def getListProxies():
#     session = requests.session()
#     headers = {'User-Agent': user.getuser()}
#     # Note: this reuses proxy_list from the kuaidaili section above to fetch the listing page
#     proxies = random.choice(proxy_list)
#     page = session.get("http://www.xicidaili.com/nn/2", headers=headers, proxies=proxies)  # Xici proxy list
#     soup = BeautifulSoup(page.text, 'lxml')
#
#     proxyList = []
#     taglist = soup.find_all('tr', attrs={'class': re.compile("(odd)|()")})
#     for trtag in taglist:
#         tdlist = trtag.find_all('td')
#         proxy = {'http': tdlist[1].string + ':' + tdlist[2].string}
#         proxyList.append(proxy)
#         # Cap the number of proxy IPs collected
#         if len(proxyList) >= 20:
#             break
#
#     return proxyList
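Free proxies die quickly, so it can help to filter proxy.proxy_list down to the entries that actually respond before the crawl starts. The following is a minimal sketch under that assumption; the test URL, timeout, and function name are my own choices, not part of the original code.

import requests

import proxy
import user


def check_proxies(test_url="https://book.douban.com/", timeout=5):
    working = []
    for proxies in proxy.proxy_list:
        try:
            # A proxy counts as usable if the test request returns HTTP 200 within the timeout
            r = requests.get(test_url, headers={'User-Agent': user.getuser()},
                             proxies=proxies, timeout=timeout)
            if r.status_code == 200:
                working.append(proxies)
        except requests.RequestException:
            # Dead, slow, or blocked proxy: skip it
            continue
    return working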
searchBook.py
from lxml import etree
import json
import random
import requests
import time

import proxy
import user


def getResqutes():
    # Request pages for the "诗词" (poetry) book tag.
    # Douban tag pages list 20 books each; stepping start from 0 to 1000 by 20 covers about 1000 books.
    urls = ["https://book.douban.com/tag/%E8%AF%97%E8%AF%8D?start={}".format(str(i)) for i in
            range(0, 1000, 20)]
    for url in urls:
        # Rotate the User-Agent and proxy IP once per page (20 books)
        # Pick a random User-Agent
        headers = {'User-Agent': user.getuser()}
        # Pick a random proxy IP
        List = proxy.proxy_list
        proxies = random.choice(List)
        # Print the proxy in use so the crawl can be monitored
        print(proxies)
        data = requests.get(url, headers=headers, proxies=proxies)  # request the tag page
        html = etree.HTML(data.text)  # parse the page
        count = html.xpath("//li[@class='subject-item']")
        for info in count:
            # Join the detail-page href into a string to use as the next request URL;
            # some sites (e.g. JD) need "https://" prepended at this point
            link = ''.join(info.xpath("div[2]/h2/a/@href"))
            # Sleep a random time per book to mimic human behaviour
            time.sleep(random.random())
            # Print the detail-page URL so problems during the crawl are easier to spot
            print(link)
            # author_name could be taken from the tag page instead, because the author block sits in
            # different places on different detail pages and is sometimes missed, which makes the record fail
            # author_name = ''.join(info.xpath("div[2]/div[1]/text()")[0].split('/')[0]).replace(" ", "")
            # print(author_name)
            # author_name = author_name.split()
            link_data = requests.get(link, headers=headers, proxies=proxies)
            html = etree.HTML(link_data.text)
            # Title
            book_name = html.xpath("//*[@id='mainpic']/a/@title")
            # Cover image URL
            book_img = html.xpath("//*[@id='mainpic']/a/img/@src")
            # Author; the block differs between pages, so fall back to a second XPath when the first is empty
            author_name = html.xpath("//*[@id='info']/span[1]/a/text()")
            temp = ''.join(html.xpath("//*[@id='info']/span[1]/a/text()"))
            if temp is None or len(temp) == 0:
                author_name = html.xpath("//*[@id='info']/a[1]/text()")
            # Join multiple authors with "/" and strip extra spaces and newlines
            sum = ""
            if len(author_name) > 1:
                for item in author_name:
                    sum += (str(item) + "/")
                author_name = sum
            author_name = "".join(author_name)
            author_name = author_name.replace(" ", "")
            author_name = author_name.replace("\n", "")
            author_name = author_name.split()
            # Publisher
            press = html.xpath(u'//span[./text()="出版社:"]/following::text()[1]')
            # Publication year
            press_year = html.xpath(u'//span[./text()="出版年:"]/following::text()[1]')
            # Page count
            pages = html.xpath(u'//span[./text()="页数:"]/following::text()[1]')
            # Price
            price = html.xpath(u'//span[./text()="定价:"]/following::text()[1]')
            # ISBN
            ISBN = html.xpath(u'//span[./text()="ISBN:"]/following::text()[1]')
            # Rating
            score = html.xpath("//*[@id='interest_sectl']/div/div[2]/strong/text()")
            # Number of ratings
            number_reviewers = html.xpath("//*[@id='interest_sectl']/div/div[2]/div/div[2]/span/a/span/text()")
            # Book description
            introduction = html.xpath("//*[@class='intro']/p/text()")
            for book_name, book_img, author_name, press, press_year, pages, price, ISBN, score, number_reviewers, introduction in zip(
                    book_name, book_img, author_name, press, press_year, pages, price, ISBN, score, number_reviewers,
                    introduction):
                result = {
                    "book_name": book_name,
                    "book_img": book_img,
                    "author_name": author_name,
                    "press": press,
                    "press_year": press_year,
                    "pages": pages,
                    "price": price,
                    "ISBN": ISBN,
                    "score": score,
                    "number_reviewers": number_reviewers,
                    "introduction": introduction
                }
                print(result)
                # Append each record as one JSON line
                with open('诗词.json', 'a', encoding='utf-8') as file:
                    file.write(json.dumps(result, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    getResqutes()
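The URL list above is hard-coded to the 诗词 (poetry) tag via its URL-encoded form. If other categories are needed, the same crawl can be parameterized by tag name. A small sketch, assuming every Douban tag page uses the same layout (the helper name is mine):

from urllib.parse import quote


def tag_urls(tag, total=1000, per_page=20):
    # Douban tag pages list 20 books each; start jumps by 20 to move to the next page
    return ["https://book.douban.com/tag/{}?start={}".format(quote(tag), i)
            for i in range(0, total, per_page)]


# Example: tag_urls("诗词")[0] yields the same first page as the hard-coded list above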
4. Results
Printing each record to the console makes it easy to notice if the crawler stops unexpectedly and doubles as a simple progress check; at the end the records are saved in JSON format. Saving the cover images or writing the data into a database is also perfectly doable (a small sketch follows). This is my first crawler written in Python, so I'm writing this post to keep a record of it.
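For example, here is a minimal sketch of both ideas under my own assumptions: covers are downloaded from the book_img field and named after the ISBN, and the records from 诗词.json are inserted into a SQLite table whose name and columns I picked for illustration; none of this is part of the crawler above.

import json
import os
import sqlite3

import requests


def save_cover(book, folder="covers"):
    # Download the cover referenced by book_img and name the file after the ISBN
    os.makedirs(folder, exist_ok=True)
    resp = requests.get(book["book_img"], timeout=10)
    path = os.path.join(folder, book["ISBN"] + ".jpg")
    with open(path, "wb") as f:
        f.write(resp.content)
    return path


def save_to_db(books, db_path="books.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS book (
        ISBN TEXT PRIMARY KEY, book_name TEXT, author_name TEXT,
        press TEXT, press_year TEXT, price TEXT, score TEXT)""")
    for b in books:
        conn.execute("INSERT OR IGNORE INTO book VALUES (?,?,?,?,?,?,?)",
                     (b["ISBN"], b["book_name"], b["author_name"],
                      b["press"], b["press_year"], b["price"], b["score"]))
    conn.commit()
    conn.close()


if __name__ == '__main__':
    with open('诗词.json', encoding='utf-8') as f:
        books = [json.loads(line) for line in f if line.strip()]
    save_to_db(books)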