A Small Web Scraper Project

Latest recommended article published on 2022-11-19 11:15:22

PT、小小马

Article link: https://blog.csdn.net/qq_44862918/article/details/90084450

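# Scrape the books listed under the "管理" (management) tag on Douban Books and write them out
# The scraper is written in a procedural style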
import requests
import time
from lxml import html
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'
}
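# `start` is the offset of the first book on each listing page; it advances in steps of 20
for i in range(0,990,20):
    v = 0  # number of records written for the current page
    # The original URL contains single % signs (percent-encoding). The % formatting operator
    # would read them as the start of a placeholder, so each literal % is doubled (%%) to escape it.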
    url = 'https://book.douban.com/tag/%%E7%%AE%%A1%%E7%%90%%86?start=%s&type=T'%i
    # print(url)
    res = requests.get(url=url,headers=headers)
    etree = html.etree
    cont = etree.HTML(res.text)
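    # Book titles; drop whitespace-only text nodes
    s1 = cont.xpath("//div[@class='info']/h2/a/text()")
    s1_con = [t.strip() for t in s1 if t.strip() != '']
    # Author, publisher, and other publication details
    s2 = cont.xpath("//div[@class='info']/div[@class='pub']/text()")
    s2_con = [j.strip() for j in s2 if j.strip() != '']
    # Rating score
    s3 = cont.xpath("//div[@class='star clearfix']/span[@class='rating_nums']/text()")
    # Number of ratings
    s4 = cont.xpath("//div[@class='star clearfix']/span[@class='pl']/text()")
    s4_con = [z.strip() for z in s4 if z.strip() != '']
    # Book summaries: collect every <p> text node, then drop the first two and the last two,
    # which are presumably page text rather than summaries
    s5 = cont.xpath("//p/text()")
    del s5[:2]
    del s5[-2:]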
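    # zip() pairs the lists by position and stops at the shortest one, so a book that is
    # missing any field shifts the records after it out of alignment
    for i1, i2, i3, i4, i5 in zip(s1_con, s2_con, s3, s4_con, s5):
        # Output labels: title, author and publisher, Douban rating, number of ratings, summary
        content = '书名：%s\n作者及出版社：%s\n豆瓣评分：%s\n评价数：%s\n作品简介：%s\n\n' % (i1, i2, i3, i4, i5)
        #print('书名：%s\n作者及出版社：%s\n豆瓣评分：%s\n评价数：%s\n作品简介：%s\n\n' % (i1, i2, i3, i4, i5))
        # Append the record; the with block closes the file even if write() raises
        with open('doubantotal_codes.txt', 'a', encoding='utf8') as files:
            files.write(content)
        # print('writing...')
        v += 1
        print(v)
        time.sleep(0.1)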
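
The doubled %% in the URL above is needed only because the tag is pasted in already percent-encoded. As a rough alternative sketch, the standard library can do the encoding itself, so the tag can be written as plain text. The names BASE and build_tag_url are hypothetical helpers introduced here for illustration; they are not part of the original script.

```python
from urllib.parse import quote, urlencode

# Hypothetical constant: the listing URL prefix taken from the script above
BASE = 'https://book.douban.com/tag/'


def build_tag_url(tag, start):
    """Return the listing URL for `tag`, beginning at result offset `start`."""
    # quote() percent-encodes the UTF-8 bytes of the tag, so the literal
    # % signs never pass through the % formatting operator.
    query = urlencode({'start': start, 'type': 'T'})
    return f'{BASE}{quote(tag)}?{query}'


if __name__ == '__main__':
    # Print the URL of the first listing page for the "管理" tag
    print(build_tag_url('管理', 0))
```

Inside the loop, url = build_tag_url('管理', i) would then replace the hand-escaped string, and changing the tag no longer requires re-encoding it by hand.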
