[Python] Tujigu (图集谷) crawler using XPath

Latest recommended article published on 2022-06-01 16:21:42

5G云网络

Category column: Technical Tutorials

Article link: https://blog.csdn.net/qq_30801093/article/details/108074859

import requests
from lxml import etree
import re
import os
# This script is mainly for practicing XPath. For a pure regular-expression version, see https://www.52pojie.cn/thread-1244292-1-1.html
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
search = input('Enter a search keyword: ')
searchpage = requests.get('https://www.tujigu.com/search/' + search, headers=headers).text
searchpage = etree.HTML(searchpage)  # Parse the HTML into an element tree so that XPath can be used
pageurls = searchpage.xpath('//div[@class="hezi"]//li/a/@href')  # Use XPath to collect the URL of every gallery on the search page
 
for totle_url in pageurls:
    totle = requests.get(totle_url, headers=headers).content.decode('utf-8')
    picnum = int(re.findall("<p>图片数量： (.*?)P</p>", totle)[0])  # Use a regex on the page's "image count" line to get the total number of images; XPath felt less reliable here
    ID = totle_url.split("/")[4]  # Split the URL on slashes and take the gallery ID
    totle = etree.HTML(totle)  # This is mainly XPath practice, so parse the page into an element tree as well
    title = totle.xpath('//div[@class="weizhi"]/h1/text()')[0]  # Extract the gallery title
 
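    # Create the gallery directory
    path = os.path.join('图集谷', title)  # Output folder, built portably instead of with a hard-coded backslash
    if not os.path.exists(path):  # The folder does not exist yet
        os.makedirs(path)  # Create it
        print('Directory created (*^v^*), remember to mark it as private ^_^!')
    else:
        print('Directory already exists (－o⌒)=3!! Looks like you are a regular.')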
 
    # Start downloading the images
    print(title)
    for i in range(1, picnum + 1):
        picurl = "https://lns.hywly.com/a/1/" + ID + '/' + str(i) + '.jpg'  # Build the URL of a single image
        print('(≧^.^≦) meow~~~ downloading: ' + picurl)
        pic = requests.get(picurl, headers=headers).content
        with open(os.path.join(path, '%s.jpg' % i), 'wb') as f:
            f.write(pic)
    print(title + '\nDownload complete!\n\n')
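
The two XPath expressions in the script can be tried offline before pointing them at the live site. The snippet below is only a sketch: `SAMPLE_HTML` is an invented page fragment shaped to match what the expressions expect, not a copy of the real site's markup, and the gallery URLs in it are placeholders.

```python
from lxml import etree

# Invented sample markup (an assumption), shaped to match the XPath expressions in the script
SAMPLE_HTML = """
<html><body>
  <div class="hezi">
    <ul>
      <li><a href="https://www.tujigu.com/a/1/">first</a></li>
      <li><a href="https://www.tujigu.com/a/2/">second</a></li>
    </ul>
  </div>
  <div class="weizhi"><h1>Sample title</h1></div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)  # parse the string into an element tree

# "/@href" selects the attribute value, so the result is a list of strings
links = tree.xpath('//div[@class="hezi"]//li/a/@href')

# "/text()" selects text nodes; xpath() always returns a list, so guard the [0]
titles = tree.xpath('//div[@class="weizhi"]/h1/text()')
title = titles[0] if titles else None

print(links)
print(title)
```

Checking that the list is non-empty before indexing avoids an `IndexError` when a page does not contain the expected element.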

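The image-count regex, the ID extraction, and the image URL construction from the script can also be pulled into small helpers that run without any network access. This is a rough sketch rather than the author's code: the function names `parse_pic_count`, `gallery_id`, `build_pic_urls`, and `safe_dirname` are hypothetical, the regex is a slightly stricter variant of the one in the script, and the gallery URL in the usage lines is a made-up example.

```python
import re

PIC_BASE = "https://lns.hywly.com/a/1/"  # image host prefix taken from the script


def parse_pic_count(html):
    """Return the image count from a gallery page, or 0 if the marker is missing."""
    # Stricter variant of the script's pattern (assumption): digits only
    match = re.search(r"<p>图片数量：\s*(\d+)P</p>", html)
    return int(match.group(1)) if match else 0


def gallery_id(gallery_url):
    """Take the fifth '/'-separated piece of the URL, as the script does."""
    return gallery_url.split("/")[4]


def build_pic_urls(gallery_url, picnum):
    """Build the image URLs 1.jpg .. picnum.jpg for one gallery."""
    gid = gallery_id(gallery_url)
    return [f"{PIC_BASE}{gid}/{i}.jpg" for i in range(1, picnum + 1)]


def safe_dirname(title):
    """Replace characters that Windows forbids in file and folder names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


# Usage with a made-up gallery URL and page fragment
count = parse_pic_count("<p>图片数量： 3P</p>")
for url in build_pic_urls("https://www.tujigu.com/a/12345/", count):
    print(url)
```

Sanitizing the title before using it as a folder name is an addition not present in the script; it only matters when a gallery title contains characters such as `?` or `:`.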