Python crawler: downloading avatars from 我要个性网 (woyaogexing.com)

遗憾专家

Category: Python crawlers. Tags: python, regular expressions

Article link: https://blog.csdn.net/nyy66/article/details/106150856


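Learning Python web scraping
A note up front: this is for personal study only; please do not use it for anything else.
The modules used are: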

import requests
import re
import os

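These are all fairly standard and suitable for beginners. lxml and bs4 are also fundamentals worth knowing. Let's get straight to it.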

# Request headers: send a browser-like User-Agent
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 Edg/81.0.416.72'}
# Address of the list page to scrape
link = "https://www.woyaogexing.com/touxiang/qinglv/"
r = requests.get(link, headers=headers)
r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
html = r.text
# print(html)

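Set up the request headers and the address of the page to fetch. I named the variable link; most people use url, but it is only a short name, so that is personal preference.
Then start analysing the page. This time the tool is re.

Use regular expressions to locate the relevant text: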

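# findall returns a list, so take the first match as the page title
titles = re.findall('<div class="h1-title z"><h1>(.*?)</h1><i></i><span>></span></div>', html)
title = titles[0] if titles else 'untitled'  # fallback name if nothing matched
# Relative links to the detail page of each avatar set
divs = re.findall('<a href="(.*?)" class="img" target="_blank" title=".*?">', html)
# print(divs)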
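To see what the two patterns capture without hitting the site, here is a minimal sketch that runs them on a made-up HTML fragment. The fragment, the shortened title pattern, and the names `sample_html`, `title_pattern`, and `link_pattern` are illustrative assumptions; the real page markup may differ.

```python
import re

# Made-up fragment imitating the list page; the real markup may differ.
sample_html = (
    '<div class="h1-title z"><h1>Couple avatars</h1><i></i><span>></span></div>'
    '<a href="/touxiang/qinglv/2020/1001.html" class="img" target="_blank" title="first set">'
    '<a href="/touxiang/qinglv/2020/1002.html" class="img" target="_blank" title="second set">'
)

# Shortened title pattern (assumption): only the part up to </h1> is matched.
title_pattern = re.compile(r'<div class="h1-title z"><h1>(.*?)</h1>')
link_pattern = re.compile(r'<a href="(.*?)" class="img" target="_blank" title=".*?">')

titles = title_pattern.findall(sample_html)   # always a list, possibly empty
title = titles[0] if titles else "untitled"   # pick one string for later use
detail_paths = link_pattern.findall(sample_html)

print(title)
print(detail_paths)
```

Because `findall` always returns a list, formatting the list itself into a folder name would put the brackets and quotes into the path, which is why a single string is picked out here.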

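After a quick test print, loop over the matches and request each detail page, which is where the image addresses we actually want are found: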

for div in divs:
    links='https://www.woyaogexing.com'+div
    resp=requests.get(links,headers=headers)
    resp.encoding=resp.apparent_encoding
    htmls=resp.text
    # print(htmls)

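Each match is only a path, so the site's domain is prepended to complete the address (links) before sending the second request.
From that response, regular expressions are used again to extract the image addresses we want to download: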

    # Image addresses on the detail page
    hrefs = re.findall('<a href="(.*?)" class="swipebox">', htmls)
    # Title(s) of the avatar set, used below as the folder name
    ids = re.findall('<h1>(.*?)</h1>', htmls)
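The address-completion step can also be done with the standard library instead of string concatenation. The sketch below uses `urllib.parse.urljoin`, which handles both site-relative paths and protocol-relative addresses (those starting with `//`). The helper name `absolute_url` and the host `img.example.com` are placeholders, not taken from the site.

```python
from urllib.parse import urljoin

# The list page from this article serves as the base for resolving links.
BASE_URL = "https://www.woyaogexing.com/touxiang/qinglv/"


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# Site-relative path: the scheme and host come from the base.
print(absolute_url("/touxiang/qinglv/2020/1001.html"))
# Protocol-relative address (placeholder host): only the scheme is added.
print(absolute_url("//img.example.com/2020/05/15/abc.jpeg"))
# Already absolute: returned unchanged.
print(absolute_url("https://www.example.com/page.html"))
```

This avoids producing a broken address if the site ever mixes relative and absolute links on the same page.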

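At the same time, prepare the storage path with the os module. The titles can contain '/', which would interfere with the path, so it is replaced: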

    base_path = 'F://我要个性网/%s' % title
    for id in ids:
        id = re.sub('[/]+', '--', id)  # a '/' in the title would break the path, so replace it
        path = os.path.join(base_path, id)  # build the folder path
        if not os.path.exists(path):
            os.makedirs(path)
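The folder step above only replaces '/'. As a rough sketch, the same idea can be extended to the other characters Windows rejects in file names, and `os.makedirs(..., exist_ok=True)` removes the need for the separate existence check. The helper names `safe_folder_name` and `make_album_dir` are hypothetical, and a temporary directory stands in for the real base path.

```python
import os
import re
import tempfile


def safe_folder_name(name):
    """Replace runs of characters that Windows forbids in file names with '--'."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', "--", name).strip()
    return cleaned or "untitled"


def make_album_dir(base_path, album_title):
    """Create base_path/<sanitised title> if needed and return the path."""
    path = os.path.join(base_path, safe_folder_name(album_title))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# Usage with a temporary directory instead of a real drive path.
base = tempfile.mkdtemp()
album_dir = make_album_dir(base, "boy/girl avatars")
print(os.path.basename(album_dir))
```

Returning the path from a function also makes it explicit which folder the later download step writes into.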

Last step: request each href and save the response's content (the raw bytes) to disk:

    for href in hrefs:
        href = 'https:' + href  # the extracted address has no scheme, so add one
        # print(href)
        tupian = requests.get(href, headers=headers)
        with open(str(path) + '/' + href.split('/')[-1] + '.jpeg', 'wb') as f:
            f.write(tupian.content)
            print('Downloading {}'.format(href.split('/')[-1]))
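The save step above always appends '.jpeg' to the last segment of the address, so an address that already ends in an extension would yield a doubled one. A possible alternative, sketched below, derives the file name from the URL path and adds an extension only when none is present. The helpers `image_filename` and `save_bytes` are hypothetical, and the URL and bytes are placeholders rather than a real download.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlparse


def image_filename(url, default_ext=".jpeg"):
    """Derive a file name from the URL path, adding an extension only if missing."""
    name = posixpath.basename(urlparse(url).path) or "image"
    _, ext = posixpath.splitext(name)
    return name if ext else name + default_ext


def save_bytes(folder, url, data):
    """Write raw bytes (e.g. a response's content) into folder and return the file path."""
    target = os.path.join(folder, image_filename(url))
    with open(target, "wb") as f:
        f.write(data)
    return target


# Usage with placeholder data instead of a real download.
demo_dir = tempfile.mkdtemp()
saved = save_bytes(demo_dir, "https://img.example.com/2020/05/abc.jpeg", b"fake image bytes")
print(os.path.basename(saved))
```

Parsing the URL first also drops any query string, so it does not end up in the file name.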

And that wraps it up.
Full code:

import requests
import re
import os

# Request headers: send a browser-like User-Agent
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 Edg/81.0.416.72'}
# Address of the list page to scrape
link = "https://www.woyaogexing.com/touxiang/qinglv/"
r = requests.get(link, headers=headers)
r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
html = r.text
# print(html)
# findall returns a list, so take the first match as the page title
titles = re.findall('<div class="h1-title z"><h1>(.*?)</h1><i></i><span>></span></div>', html)
title = titles[0] if titles else 'untitled'  # fallback name if nothing matched
# Relative links to the detail page of each avatar set
divs = re.findall('<a href="(.*?)" class="img" target="_blank" title=".*?">', html)
# print(divs)
for div in divs:
    links = 'https://www.woyaogexing.com' + div  # complete the address
    resp = requests.get(links, headers=headers)
    resp.encoding = resp.apparent_encoding
    htmls = resp.text
    # print(htmls)
    # Image addresses and set title(s) on the detail page
    hrefs = re.findall('<a href="(.*?)" class="swipebox">', htmls)
    ids = re.findall('<h1>(.*?)</h1>', htmls)
    base_path = 'F://我要个性网/%s' % title
    for id in ids:
        id = re.sub('[/]+', '--', id)  # a '/' in the title would break the path, so replace it
        path = os.path.join(base_path, id)  # build the folder path
        if not os.path.exists(path):
            os.makedirs(path)
    for href in hrefs:
        href = 'https:' + href  # the extracted address has no scheme, so add one
        # print(href)
        tupian = requests.get(href, headers=headers)
        with open(str(path) + '/' + href.split('/')[-1] + '.jpeg', 'wb') as f:
            f.write(tupian.content)
            print('Downloading {}'.format(href.split('/')[-1]))

Nice!
Give it a try.
