Beginner Scraping Project 1: Scraping Pictures of Beautiful Women

Latest recommended article published on 2024-05-10 12:27:58

stupid_andwinner

Tags: multithreading, python, web crawler, requests

Article link: https://blog.csdn.net/stupid_andwinner/article/details/114680009

## Preparation

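Libraries used
requests
threading (part of the standard library)
BeautifulSoup (imported from bs4, installed as beautifulsoup4)
lxml (the parser passed to BeautifulSoup in the code below)
Anything that is not installed yet can be installed with pip; search online if you are not sure how.
The site being scraped is https://tu.cnzol.com/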
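
Before running the script, it can help to check which of these packages are actually missing. The snippet below is a minimal sketch: the mapping from import name to pip package name is written out by hand, and lxml is listed only because the code later passes 'lxml' to BeautifulSoup.

```python
import importlib.util

# Import name -> name to pass to "pip install" (hand-written mapping)
required = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
}

# find_spec returns None when a top-level module cannot be found
missing = [pip_name for module, pip_name in required.items()
           if importlib.util.find_spec(module) is None]

if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

threading needs no check because it ships with Python.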

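Code approach

Use the requests library to fetch the page.
Work out which elements are needed, i.e. where the image URLs are stored.
Use the BeautifulSoup library to parse the page and locate that data.
Use multithreading (the threading module) to improve efficiency.
Finally, save the images to local disk.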


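1. Import the libraries

import requests
import threading
from bs4 import BeautifulSoup
header={'User-Agent':''}  # fill in your own User-Agent
threads=[]  # worker threads, used by the multithreading code later
count=0  # number of images downloaded so far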

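2. Analyze the site

The second page of the listing has the URL https://tu.cnzol.com/meinv/index_2.html
Following the same pattern, the third page is https://tu.cnzol.com/meinv/index_3.html
so the later code can simply loop over the page number:

for i in range(2,100):
    url='https://tu.cnzol.com/meinv/index_'+str(i)+'.html'
    imagespider(url)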
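
The URL pattern can also be wrapped in a small helper so the page number is the only thing that varies. This is a sketch only: page_url and BASE are names introduced here for illustration, and the pattern is the one observed above for pages 2 and 3.

```python
BASE = "https://tu.cnzol.com/meinv/"

def page_url(page: int) -> str:
    """Build the listing URL for a page number, following the index_N.html pattern."""
    return f"{BASE}index_{page}.html"

# Preview the first few URLs without making any request
urls = [page_url(i) for i in range(2, 5)]
for u in urls:
    print(u)
```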

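The image information is stored in the elements matched by the selector li div a img, which the code below uses.
The URL of one image is https://tu.cnzol.com/d/file/2020/1223/856bcbe8d8d102e36040f62ea8da3442_250_350.jpg
The prefix https://tu.cnzol.com/d/ is fixed; the rest is stored in the image's src attribute.
The name is stored in the alt attribute, for example <img class="lazy" alt="性感美女前凸后翘妩媚撩人诱惑写真">, so the file can be saved under that name.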
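
Two details are easy to get wrong here: joining the site root with the src value, and using the alt text as a file name. The sketch below uses only the standard library. build_image_url and safe_name are hypothetical helpers, and whether src begins with a slash on the real site is an assumption, which is why urljoin is used rather than plain string concatenation.

```python
import re
from urllib.parse import urljoin

SITE = "https://tu.cnzol.com/"

def build_image_url(src: str) -> str:
    """Join the site root and a src value; handles src with or without a leading slash."""
    return urljoin(SITE, src)

def safe_name(alt: str) -> str:
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", alt).strip()

print(build_image_url("/d/file/2020/1223/example_250_350.jpg"))
print(build_image_url("d/file/2020/1223/example_250_350.jpg"))
print(safe_name('cover: part 1/2'))
```

Both calls to build_image_url produce the same absolute URL, so the code does not depend on how the site writes the attribute.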

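3. Fetch the page information

The code is as follows (example):

def imagespider(url):
    global threads  # shared list of worker threads
    try:
        urls=[]  # image URLs already queued from this page
        r=requests.get(url,headers=header)
        r.encoding='utf-8'  # must be 'utf-8', otherwise the names come out garbled
        soup=BeautifulSoup(r.text,'lxml')
        lis=soup.select("li div a img")  # BeautifulSoup CSS selector syntax
        for li in lis:
            try:
                img=li['src']  # the trailing part of the image URL
                name=li['alt']  # the image name
                url1='https://tu.cnzol.com/'+img  # full image URL
                if url1 not in urls:
                    urls.append(url1)  # record it so the duplicate check above has something to compare against
                    # pass the function and its arguments separately;
                    # target=download(url1,name) would run the download in the calling thread
                    T=threading.Thread(target=download,args=(url1,name))
                    T.daemon=False  # non-daemon thread: the program waits for it to finish
                    T.start()
                    threads.append(T)
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)
        return ''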
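
When a function is handed to threading.Thread, it matters whether the function object is passed or the function is called. The sketch below illustrates the difference with a stand-in function named work (hypothetical, not part of the crawler): writing target=work(...) runs the function immediately in the calling thread and gives the thread None as its target, while target=work with args=(...) runs it in the new thread.

```python
import threading

results = []

def work(tag):
    """Record which thread actually executed this call."""
    results.append((tag, threading.current_thread().name))

caller = threading.current_thread().name

# Pitfall: work("eager") is called here, in the calling thread;
# the Thread receives its return value (None) and does nothing.
t1 = threading.Thread(target=work("eager"), name="worker-0")
t1.start()
t1.join()

# Correct form: pass the function and its arguments separately.
t2 = threading.Thread(target=work, args=("threaded",), name="worker-1")
t2.start()
t2.join()

for tag, thread_name in results:
    print(tag, "ran in", thread_name)
```

The first entry reports the calling thread, the second reports worker-1, which is the behavior wanted for parallel downloads.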

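4. Save to local disk

count_lock=threading.Lock()  # protects count when several threads download at once

def download(url1,name):  # saves one image to local disk
    global count  # use the module-level counter
    with count_lock:
        count=count+1
        number=count
    r=requests.get(url1,timeout=100)  # timeout in seconds
    data=r.content
    # raw string so the backslash is taken literally; files land in D:\ with the prefix 图片
    with open(r"D:\图片"+str(name)+str(number)+'.jpg','wb') as f:
        f.write(data)
    print('Downloaded {} images'.format(number))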
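
Building the output path by string concatenation is fragile, since a missing separator silently changes where the file ends up. A minimal sketch with pathlib follows. save_bytes is a hypothetical helper, and the demo writes a few placeholder bytes to a temporary directory rather than to D:\, so it can be run anywhere.

```python
import tempfile
from pathlib import Path

def save_bytes(folder, name, index, data):
    """Write data to <folder>/<name><index>.jpg, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{name}{index}.jpg"
    target.write_bytes(data)
    return target

# Demo with placeholder bytes in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(Path(tmp) / "pics", "demo", 1, b"\xff\xd8fake")
    saved_name = saved.name
    saved_size = saved.stat().st_size

print(saved_name, saved_size)
```

In the crawler, data would be r.content and folder would be the chosen image directory.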

5. Running the code

for i in range(2,100):
    url='https://tu.cnzol.com/meinv/index_'+str(i)+'.html'
    imagespider(url)
for t in threads:  # wait for every download thread to finish
    t.join()
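
Starting one thread per image puts no upper bound on the number of simultaneous requests. One alternative, swapped in here for illustration and not taken from the original script, is concurrent.futures.ThreadPoolExecutor, which caps the number of workers and waits for them on exit. In this sketch fetch is a stand-in for download(url1, name) and the example.invalid URLs are placeholders, so nothing is actually requested.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(item):
    """Stand-in for download(url1, name); it only echoes its input."""
    url, name = item
    return f"{name} <- {url}"

# Placeholder jobs: (url, name) pairs as the crawler would collect them
jobs = [(f"https://example.invalid/{i}.jpg", f"img{i}") for i in range(5)]

# At most 3 jobs run at the same time; leaving the with-block waits for all of them
with ThreadPoolExecutor(max_workers=3) as pool:
    done = list(pool.map(fetch, jobs))

for line in done:
    print(line)
```

pool.map returns results in the order of jobs, regardless of which worker finishes first.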

6. Results

Several hundred images have been downloaded. Do not get greedy, though: too many requests can bring the site down, so know when to stop.
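
One simple way to stay within reasonable limits is to enforce a minimum gap between requests. The class below is a sketch under assumptions: Throttle is a hypothetical helper, it is meant for single-threaded use, and the 0.05 second interval is only there to keep the demo fast. A real crawl would use a much larger value.

```python
import time

class Throttle:
    """Make successive calls to wait() at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(0.05)
start = time.monotonic()
for page in range(3):
    throttle.wait()  # in the crawler, this would precede each requests.get call
elapsed = time.monotonic() - start
print(f"3 paced calls took {elapsed:.2f}s")
```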

7. Complete code

import requests
import threading
from bs4 import BeautifulSoup
header={'User-Agent':'your own UA'}
threads=[]
count=0
count_lock=threading.Lock()  # protects count across download threads
def imagespider(url):
    global threads
    try:
        urls=[]
        r=requests.get(url,headers=header)
        r.encoding='utf-8'
        soup=BeautifulSoup(r.text,'lxml')
        lis=soup.select("li div a img")
        for li in lis:
            try:
                img=li['src']
                name=li['alt']
                url1='https://tu.cnzol.com/'+img
                if url1 not in urls:
                    urls.append(url1)
                    # pass the function and its arguments separately so the download runs in the new thread
                    T=threading.Thread(target=download,args=(url1,name))
                    T.daemon=False
                    T.start()
                    threads.append(T)
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)
        return ''
def download(url1,name):
    global count
    with count_lock:
        count=count+1
        number=count
    r=requests.get(url1,timeout=100)
    data=r.content
    with open(r"D:\图片"+str(name)+str(number)+'.jpg','wb') as f:
        f.write(data)
    print('Downloaded {} images'.format(number))
for i in range(2,100):
    url='https://tu.cnzol.com/meinv/index_'+str(i)+'.html'
    imagespider(url)
for t in threads:  # wait for every download thread to finish
    t.join()

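Summary and ideas for improvement

1. This crawl only downloads the cover image of each album. Technically it would be possible to follow the URL behind each album on every page and download each album in full, but time was short and this was not done. It may be added later if time allows.

2. The core part of a crawler is analyzing the web page.

## Closing remarks

Use crawlers with restraint. Used sensibly they save a lot of effort; used recklessly, the work turns into "prison-oriented programming". This blog post is for learning and discussion only.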
