Hands-On: Scraping High-Resolution Wallpapers with a Python Crawler

Latest recommended article published 2022-10-25 21:47:39

四叶草茶艺师


Category: Web crawlers. Tags: python

Article link: https://blog.csdn.net/weixin_46011275/article/details/107891477


Libraries used

import os
import requests
import re
import time
from bs4 import BeautifulSoup
# os: create directories and files
# requests and BeautifulSoup: fetch and parse the web pages
# re: extract the links

Analyzing the target site
Page 1:
http://www.netbian.com/
Page 2:
http://www.netbian.com/index_2.htm
Page 3:
http://www.netbian.com/index_3.htm
…
So to crawl several pages, a for loop can build each page's URL in turn and fetch it.
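The URL pattern above can be turned into a loop along these lines. This is only a minimal sketch: `build_page_urls` is a hypothetical helper name, and it assumes that page 1 is the bare base URL while every later page follows the `index_N.htm` pattern shown in the examples.

```python
def build_page_urls(base_url, last_page):
    """Return the listing-page URLs for pages 1..last_page."""
    urls = []
    for page in range(1, last_page + 1):
        if page == 1:
            # Page 1 has no index_N suffix in the examples above.
            urls.append(base_url)
        else:
            urls.append(f"{base_url}index_{page}.htm")
    return urls


page_urls = build_page_urls("http://www.netbian.com/", 3)
for url in page_urls:
    print(url)
```

Each URL in `page_urls` can then be passed to `requests.get` inside the same loop.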
Next, inspect the structure of the listing page.
The wallpaper entries can be selected with BeautifulSoup's select:

    lis=soup.select('div .list >ul >li >a')

Iterate over the lis list with a for loop.
The href of each element can be read directly with ['href'].
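To see the selector and the ['href'] lookup working together, here is a small sketch on made-up HTML. The markup in `SAMPLE_HTML` is an assumption that only imitates the nesting the selector expects, not a copy of the real page. `extract_detail_urls` is a hypothetical helper, `html.parser` is used so the sketch needs no lxml install, and `urljoin` stands in for plain string concatenation when building the absolute link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup that mimics the nesting the selector expects (an assumption).
SAMPLE_HTML = """
<div class="wrap">
  <div class="list">
    <ul>
      <li><a href="/desk/22801.htm"><img src="thumb1.jpg" alt="sample one"></a></li>
      <li><a href="/desk/22802.htm"><img src="thumb2.jpg" alt="sample two"></a></li>
    </ul>
  </div>
</div>
"""


def extract_detail_urls(html, base_url):
    """Select the listing links and turn each href into an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("div .list > ul > li > a")
    return [urljoin(base_url, a["href"]) for a in links]


detail_urls = extract_detail_urls(SAMPLE_HTML, "http://www.netbian.com/")
for url in detail_urls:
    print(url)
```

Because `urljoin` accepts both relative and absolute hrefs, it avoids the doubled slash that simple concatenation can produce.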
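Joined with the site root, the href gives a detail-page link such as
http://www.netbian.com/desk/22801.htm
Inspecting that page in turn shows where the URL of the full-size image sits.
The image URL and its alt text can be fetched with another select call or with a regular expression: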

        s=re.compile('alt="(.*?)".*?src="(.*?)"')
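A quick way to check what this pattern captures is to run it on a single tag. The tag below is a made-up sample (only the image URL is taken from this article), and the pattern works only when alt appears before src in the tag, so treat this as a sketch rather than a robust parser.

```python
import re

# Same pattern as above: group 1 is the alt text, group 2 is the image URL.
PATTERN = re.compile(r'alt="(.*?)".*?src="(.*?)"')

# Made-up tag for illustration; alt must come before src for the pattern to match.
sample_tag = (
    '<a href="/desk/22801-1920x1080.htm">'
    '<img alt="sample wallpaper title" '
    'src="http://img.netbian.com/file/2020/0807/1871233011c1435935405e4f52f53ba1.jpg">'
    '</a>'
)

matches = PATTERN.findall(sample_tag)
for alt_text, image_url in matches:
    folder_name = alt_text.split(" ")[0]   # first word of the alt text
    file_name = image_url.split("/")[-1]   # last path segment of the URL
    print(folder_name, file_name)
```

If the attribute order on the real page ever changes, reading `img["alt"]` and `img["src"]` from the selected tag would be the safer route.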

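Send one more request to the image URL obtained this way, for example
http://img.netbian.com/file/2020/0807/1871233011c1435935405e4f52f53ba1.jpg
and write the response body to a file in binary mode to download the image.
That is the overall approach.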
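The download step can be sketched as two small functions. Both `fetch_bytes` and `save_image` are hypothetical helper names, and the timeout value is an assumption. The demonstration at the bottom writes stand-in bytes to a temporary folder, so it runs without touching the network.

```python
import os
import tempfile

import requests


def fetch_bytes(url, headers=None, timeout=10):
    """Request a URL and return the raw response body."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return response.content


def save_image(content, folder, name):
    """Write bytes to folder/name, creating the folder if it does not exist."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name)
    with open(path, "wb") as fh:  # binary mode, since image data is not text
        fh.write(content)
    return path


# Offline demonstration of the write step with stand-in bytes.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_image(b"not a real image", os.path.join(tmp, "demo"), "sample.jpg")
    saved_size = os.path.getsize(saved_path)
    print(saved_path, saved_size)
```

For a real download, the two calls would be chained, for example `save_image(fetch_bytes(image_url), folder, image_url.split("/")[-1])`.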
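My code is below. There is still plenty of room for improvement:
Use multiple threads to speed things up
Wrap the request logic in a function
Generate the URLs in a different way so they do not sit in memory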

import os
import re
import sys
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

tj = 0  # running count of images written
head = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0'
}
title = 'http://www.netbian.com/'
url1 = 'http://www.netbian.com/dongman/'  # category whose listing pages are crawled
download_dir = r'download_dir'  # placeholder: replace with your download directory
s = re.compile('alt="(.*?)".*?src="(.*?)"')  # captures (alt text, image URL)
os.chdir(r'C:\Users\cys\Desktop\bz')
for i in range(1, 10):
    # The first listing page is index.htm; later ones are index_2.htm, index_3.htm, ...
    if i == 1:
        page = 'index.htm'
    else:
        page = 'index_' + str(i) + '.htm'
    url = url1 + page
    req = requests.get(url, headers=head)
    req.encoding = 'GBK'
    soup = BeautifulSoup(req.text, 'lxml')
    lis = soup.select('div .list >ul >li >a')
    for link in lis:
        # Each thumbnail links to a detail page that holds the full-size image
        urll = urljoin(title, link['href'])
        req = requests.get(urll, headers=head)
        req.encoding = 'GBK'
        detail = BeautifulSoup(req.text, 'lxml')
        pics = detail.select('div .pic > p > a')
        ss = re.findall(s, str(pics))
        for a, b in ss:
            filename = a.split(' ')[0]  # first word of the alt text names the folder
            name = b.split('/')[-1]     # image file name taken from its URL
            html = requests.get(b, headers=head).content
            try:
                # Create the folder in the same place the file is written
                folder = os.path.join(download_dir, filename)
                os.makedirs(folder, exist_ok=True)
                with open(os.path.join(folder, name), 'wb') as file:
                    file.write(html)
                print(name + ' written successfully')
                tj = tj + 1
            except IOError as e:
                print(e)
                sys.exit(1)

print('Wrote ' + str(tj) + ' images in total')
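Two of the improvements listed before the code, lazy URL generation and multithreading, could look roughly like this. It is a sketch under assumptions: `iter_page_urls` and `run_in_threads` are hypothetical names, the worker count is arbitrary, and `fake_fetch` is a stand-in for a real request function so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_page_urls(category_url, last_page):
    """Yield listing-page URLs one at a time instead of building a list."""
    for page in range(1, last_page + 1):
        name = "index.htm" if page == 1 else f"index_{page}.htm"
        yield category_url + name


def run_in_threads(worker, items, max_workers=4):
    """Apply worker to every item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Note: map submits every item up front, so the iterable is consumed here.
        return list(pool.map(worker, items))


def fake_fetch(url):
    """Stand-in for a real download function; returns the URL length."""
    return len(url)


category = "http://www.netbian.com/dongman/"
results = run_in_threads(fake_fetch, iter_page_urls(category, 3))
print(results)
```

In the real crawler, the worker would be a function that requests one page or image and saves it. Threads suit this job because most of the time is spent waiting on the network.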
