My First Python Crawler

Latest recommended article published on 2022-08-11 09:21:17

肖哥威武

Tags: python, crawler

Article link: https://blog.csdn.net/qq_35862309/article/details/109692593

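As an excellent (read: rookie) Python programmer, how could I not know how to write a crawler? I have decided to start learning today. Crawlers have little to do with my current job, but they can download large amounts of study material (read: pictures of pretty girls) from the web, so they are useful all the same. With that settled, let's happily learn Python crawling.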

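Define the goal

I want to crawl beautiful wallpapers from a website (I would never crawl any strange pictures, of course) and save them to a specified folder.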

Tools used:
python3.6

Dependencies:
requests
bs4
lxml (the parser that the code below passes to BeautifulSoup)

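Getting to work

First, pick a public, free wallpaper site. I chose:
彼岸桌面: http://www.netbian.com/weimei/index.htm

彼岸图网 offers beautiful 4K HD wallpapers for free download, along with 4K, 5K, 6K, 7K, and 8K wallpaper image assets; commercial use is prohibited. The wallpapers come from the internet and from user submissions, and copyright belongs to the original authors. Because of a site restriction, a logged-in user can download only one high-resolution image per day, so my crawler fetches the images displayed on the page, not the originals.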

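# Import the required packages
import requests
from bs4 import BeautifulSoup
import os

# Request a page by URL and return its HTML text, or None on failure
def request_page(url):
    try:
        # The timeout keeps a stalled connection from hanging the script
        response = requests.get(url, timeout=10)
        # Decode the page as GBK; set this before reading response.text
        response.encoding = 'gbk'
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    # Reached on a non-200 status or a network error
    return None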
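
Why bother with `response.encoding = 'gbk'`? Here is a quick offline sketch, assuming the pages really are GBK-encoded as the code above implies: the same bytes read back cleanly with the right codec and turn into gibberish with the wrong one. The sample string and the `decode_page` helper are made up for illustration and are not part of the crawler.

```python
# Hypothetical sample text standing in for a page title
SAMPLE = "唯美壁纸"

# What a GBK-encoded server would send over the wire
RAW = SAMPLE.encode("gbk")


def decode_page(raw, encoding="gbk"):
    """Decode raw page bytes, replacing any bytes that do not fit the codec."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(decode_page(RAW))            # readable text
    print(decode_page(RAW, "latin-1")) # mojibake: wrong codec, same bytes
```

Setting the encoding on the response before reading `.text` does the same job inside requests.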

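Build our own request headers, modeled on the headers the browser sends for the page, so that the server treats us as an ordinary visitor: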

def header(referer):
    # Browser-like headers for requests to www.netbian.com;
    # referer is the page the request claims to come from
    headers = {
        'Host': 'www.netbian.com',
        'Pragma': 'no-cache',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
        'Accept': 'image/avif,image/webp,image/apng,image/*,*/*;q=0.8',
        'Referer': referer,
    }
    return headers
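
Pages and images live on different hosts, so the Host value has to change between the two kinds of request. One way to avoid keeping two nearly identical dictionaries is to derive Host from the URL being requested. The sketch below is only an illustration: `build_headers` is a name I made up, the image path is hypothetical, and the header set is trimmed to a few fields.

```python
from urllib.parse import urlsplit

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"
)


def build_headers(url, referer):
    """Return browser-like headers whose Host matches the URL being requested."""
    return {
        "Host": urlsplit(url).netloc,  # e.g. www.netbian.com or img.netbian.com
        "User-Agent": USER_AGENT,
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Referer": referer,
    }


if __name__ == "__main__":
    page = "http://www.netbian.com/weimei/index.htm"
    image = "http://img.netbian.com/example.jpg"  # hypothetical image path
    print(build_headers(page, page)["Host"])
    print(build_headers(image, page)["Host"])
```

requests also fills in Host from the URL on its own, so setting it by hand is optional; the sketch just makes the difference between the two hosts visible.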


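Inspecting the page shows that every image sits under an <li> tag. We parse out each image's name and URL, then download the image from that address and save it.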

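def download_pic(path, name, url):
    # Images are served from a different host than the pages, so the Host header changes here
    head = {
        'Host': 'img.netbian.com',
        'Pragma': 'no-cache',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
        'Accept': 'image/avif,image/webp,image/apng,image/*,*/*;q=0.8',
        'Referer': 'http://www.netbian.com/weimei/index.htm',
    }

    print("download " + name)
    # Create the target folder if needed, then name the file after the last URL segment
    os.makedirs(path, exist_ok=True)
    filename = os.path.join(path, url.split('/')[-1])
    img = requests.get(url, headers=head, timeout=10)
    # Skip error responses so an error page is not saved as an image
    if img.status_code != 200:
        print("Failed: HTTP " + str(img.status_code))
        return
    with open(filename, 'wb') as f:
        f.write(img.content)
    print("Finish")

def download(url, path):
    html = request_page(url)
    # request_page returns None when the request fails
    if html is None:
        print("Could not fetch " + url)
        return
    soup = BeautifulSoup(html, 'lxml')
    # Each wallpaper sits in an <li> inside the element with class "list"
    total = soup.find(class_='list').find_all('li')
    for item in total:
        link = item.find('a')
        img = link.find('img') if link else None
        # Skip list items without a linked image
        if img is None or not img.get('src'):
            continue
        # Fall back to the URL when the link has no title attribute
        img_name = link.get('title') or img.get('src')
        download_pic(path, img_name, img.get('src'))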
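
To try the extraction logic without touching the network, here is a small sketch that runs on a hand-written HTML snippet. Two caveats: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra packages, and the sample markup (including the image paths) is invented to imitate the `<li>` / `<a title>` / `<img src>` layout described above, not copied from the site. Unlike the real code it also does not restrict itself to the element with class `list`.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the listing structure
SAMPLE_HTML = """
<div class="list">
  <ul>
    <li><a href="/desk/1.htm" title="Sample wallpaper A"><img src="http://img.netbian.com/a.jpg"></a></li>
    <li><a href="/desk/2.htm" title="Sample wallpaper B"><img src="http://img.netbian.com/b.jpg"></a></li>
    <li><a href="/desk/3.htm">a list item without an image</a></li>
  </ul>
</div>
"""


class WallpaperParser(HTMLParser):
    """Collect (title, src) pairs for every <img> found inside an <li>."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.current_title = None
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.in_li = True
            self.current_title = None
        elif tag == "a" and self.in_li:
            # The image name comes from the link's title attribute
            self.current_title = attrs.get("title")
        elif tag == "img" and self.in_li and attrs.get("src"):
            self.images.append((self.current_title, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False


def extract_images(html):
    parser = WallpaperParser()
    parser.feed(html)
    return parser.images


if __name__ == "__main__":
    for title, src in extract_images(SAMPLE_HTML):
        print(title, src)
```

The third list item has no image, so it is skipped, which is the same case the null checks in a crawler need to cover.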


Next, look at how the site paginates. Clicking through to the second page changes the URL from http://www.netbian.com/weimei/index.htm to http://www.netbian.com/weimei/index_2.htm
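
The rule can be pinned down in a small helper, assuming that pages beyond the second follow the same `index_<n>.htm` pattern (only pages 1 and 2 were actually observed). The names `page_url` and `page_urls` are mine, chosen for illustration.

```python
BASE = "http://www.netbian.com/weimei/"


def page_url(num):
    """Return the listing URL for a 1-based page number."""
    if num < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 has no numeric suffix; later pages are index_<n>.htm
    return BASE + ("index.htm" if num == 1 else f"index_{num}.htm")


def page_urls(first, last):
    """Return URLs for pages first..last, both ends included."""
    # range() excludes its upper bound, hence last + 1
    return [page_url(n) for n in range(first, last + 1)]


if __name__ == "__main__":
    for url in page_urls(1, 5):
        print(url)
```

Making both ends inclusive avoids the usual off-by-one trap with `range`, where `range(1, 5)` stops at page 4.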

Following this pattern, a final main function downloads the images on pages 1 to 5:

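def main(num, path):
    # Page 1 is index.htm; later pages are index_<n>.htm
    if num == 1:
        URL = "http://www.netbian.com/weimei/index.htm"
    else:
        URL = "http://www.netbian.com/weimei/index_" + str(num) + ".htm"
    download(URL, path)

# range(1, 6) covers pages 1 through 5; the raw string keeps the backslash in the Windows path literal
for i in range(1, 6):
    main(i, r"G:\pic")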

After the download finishes, open the pic folder on drive G: the images have been downloaded there.
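
If you would rather check from Python than open the folder by hand, something like the sketch below could work. The `list_images` helper and the set of file suffixes are my own assumptions, not something the site or the code above guarantees.

```python
from pathlib import Path

# Assumed set of image suffixes; adjust to whatever the site actually serves
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def list_images(folder):
    """Return the sorted names of image files directly inside folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    names = list_images(r"G:\pic")
    print(len(names), "images found")
```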