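# Output the books listed under the "管理" (management) tag on Douban Books
# Crawl the pages in a procedural style
import requests
import time
from lxml import html

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'
}
etree = html.etree

# Each listing page shows 20 books, so step the start offset by 20
for start in range(0, 990, 20):
    v = 0  # number of books written for the current page
    # The original address contains single % signs. Because %-formatting also uses %,
    # Python would read them as the start of a format specifier,
    # so every literal % is doubled to %% to escape it.
    url = 'https://book.douban.com/tag/%%E7%%AE%%A1%%E7%%90%%86?start=%s&type=T' % start
    # print(url)
    res = requests.get(url=url, headers=headers)
    cont = etree.HTML(res.text)

    # Book titles
    s1 = cont.xpath("//div[@class='info']/h2/a/text()")
    s1_con = [t.strip() for t in s1 if t.strip() != '']
    # Author and publisher line
    s2 = cont.xpath("//div[@class='info']/div[@class='pub']/text()")
    s2_con = [t.strip() for t in s2 if t.strip() != '']
    # Rating
    s3 = cont.xpath("//div[@class='star clearfix']/span[@class='rating_nums']/text()")
    # Number of ratings
    s4 = cont.xpath("//div[@class='star clearfix']/span[@class='pl']/text()")
    s4_con = [t.strip() for t in s4 if t.strip() != '']
    # Summaries: take every <p> text node, then drop the first two and the last two,
    # which the original script discards as not belonging to any book
    s5 = cont.xpath("//p/text()")
    del s5[:2]
    del s5[-2:]

    # Note: zip() pairs the lists by position and stops at the shortest one,
    # so a book with a missing field shifts the fields of the books after it.
    with open('doubantotal_codes.txt', 'a', encoding='utf8') as files:
        for i1, i2, i3, i4, i5 in zip(s1_con, s2_con, s3, s4_con, s5):
            # Fields: title, author and publisher, Douban rating, number of ratings, summary
            content = '书名:%s\n作者及出版社:%s\n豆瓣评分:%s\n评价数:%s\n作品简介:%s\n\n' % (i1, i2, i3, i4, i5)
            files.write(content)
            v += 1
            print(v)
    # Pause briefly between page requests
    time.sleep(0.1)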
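The comment in the script about doubling each `%` points at a small trap: the tag is already percent-encoded in the URL, and `%`-formatting then needs every literal `%` escaped. One way to sidestep this, sketched below, is to keep the tag as plain text and let the standard library encode it. The helper name `build_tag_url` and the constant `BASE` are hypothetical names chosen for this illustration; they are not part of the original script.

```python
from urllib.parse import quote

# Assumed base address, taken from the URL used in the script
BASE = 'https://book.douban.com/tag/'


def build_tag_url(tag: str, start: int) -> str:
    """Return the listing URL for `tag`, beginning at result offset `start`."""
    # quote() percent-encodes the UTF-8 bytes of the tag,
    # so no manual %% escaping is needed
    return f"{BASE}{quote(tag)}?start={start}&type=T"


# The hand-escaped form from the script, for comparison
escaped = 'https://book.douban.com/tag/%%E7%%AE%%A1%%E7%%90%%86?start=%s&type=T' % 20

print(build_tag_url('管理', 20))
print(build_tag_url('管理', 20) == escaped)
```

Both forms yield the same address. The f-string version keeps the tag readable and makes it easy to switch to another tag without re-encoding it by hand.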
A small web crawler project