A Simple Python Crawler Project over Winter Break

Aldriich

Article link: https://blog.csdn.net/weixin_42829741/article/details/88078349

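With nothing much to do at home over winter break, I was scrolling through Weibo and came across something the veteran programmer EASY had said:

Over the New Year holiday, a programmer should write one program that is just for themselves.

I couldn't agree more, so I started thinking about what I could build for myself at my current skill level. Nothing came to mind.

Then, while scrolling Weibo again, I noticed that my favorites were full of pictures of idols. Why the favorites? Because I'm too lazy to save them one by one, so I just leave them there. Keeping them in the favorites forever obviously isn't a solution either, and that's when it hit me: I could write a Python crawler to pull the images down from Weibo.

Requirement: a Python crawler that takes a search term from the user and collects the images on the Weibo search result pages for that term.

Let's start.

A small crawler needs at least three steps: fetch, parse, and extract. I use the requests library for fetching and bs4 for parsing.

Regular expressions are needed, so re is imported for re.compile.

Images have to be saved locally, so os and shutil are imported as well.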

# -*- coding:utf-8 -*-
# powered by Midorii
# release date:19.2.4
import requests
from bs4 import BeautifulSoup
import re
import os
import shutil
import time
import datetime

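Next, let's look at the source of the Weibo search page. From other blog posts I learned that weibo.com relies heavily on JavaScript and is hard to crawl with simple requests. So what now?
Digging further, I found that weibo.cn, Weibo's low-bandwidth entry point, is still up and running.

Viewing the source shows that the page is served as plain HTML, which makes it well suited to crawling.

As the screenshot shows, the image links appear in the source as plain .jpg URLs, so weibo.cn is the way in.
Now look at the URL format of the search page.

The text after "keyword=" is the search term and the number after "page=" is the page number. Note that Chrome displays the Chinese characters in the address bar for readability; the actual URL is

https://weibo.cn/search/mblog?hideSearchFrame=&keyword=%E5%A0%80%E6%9C%AA%E5%A4%AE%E5%A5%88&page=2

The Chinese characters are percent-encoded. To do this encoding, import urllib.parse: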

import urllib.parse
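
To make the encoding step concrete, here is a minimal sketch of how urllib.parse.quote turns a search term into the percent-encoded form seen in the URL above. build_search_url is a hypothetical helper introduced only for illustration; it is not part of the original script.

```python
import urllib.parse

# Same base URL as the search page shown above
SEARCH_URL = "https://weibo.cn/search/mblog?hideSearchFrame=&keyword={keyword}&page={page}"


def build_search_url(keyword, page):
    """Hypothetical helper: percent-encode the keyword and fill in the page number."""
    # quote() encodes the string as UTF-8 and escapes each non-ASCII byte as %XX
    return SEARCH_URL.format(keyword=urllib.parse.quote(keyword), page=page)


if __name__ == "__main__":
    print(build_search_url("堀未央奈", 2))
```

Running it prints the same address as the encoded URL quoted above.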

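Write the functions that read the search term and the page range.

# Read the search term and percent-encode it for use in the URL
def get_search_value():
    searchVal = input("Search term: ")
    return urllib.parse.quote(searchVal)


# Read the page range; returns the inclusive range of page numbers
def get_page_number():
    startPage = int(input("Start page: "))
    endPage = int(input("End page: "))
    return range(startPage, endPage + 1)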

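Weibo requires the cookie of a logged-in session, so write the helpers that handle it.

# Load the cookie from a local file, or ask the user for it
def read_cookie():
    localCookiePath = os.path.join(os.getcwd(), "cookie.txt")
    print(f"Trying to read cookie from {localCookiePath}")
    if not os.path.exists(localCookiePath):
        print("No local cookie found")
        tempCookies = input("Enter the cookies from a logged-in Weibo session: ")
    else:
        with open(localCookiePath, 'r') as cookiesFile:
            # strip() drops a trailing newline left by a text editor
            tempCookies = cookiesFile.read().strip()
        print("Local cookie loaded")
    dictCookie = {"Cookie": tempCookies}
    print(f"Using cookie: {tempCookies}")
    return dictCookie


# Save a cookie that has been verified as valid
def save_cookie(tempCookies):
    with open(os.path.join(os.getcwd(), "cookie.txt"), 'w') as cookiesFile:
        cookiesFile.write(tempCookies)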
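
The helpers above keep the cookie as one raw string. The cookies= parameter of requests.get takes a mapping of cookie names to values instead, so if you want to use it, the string has to be split first. A minimal sketch; parse_cookie_string is a hypothetical helper and the sample cookie names and values are made up.

```python
def parse_cookie_string(raw):
    """Hypothetical helper: split "a=1; b=2" into {"a": "1", "b": "2"}."""
    cookies = {}
    for pair in raw.split(";"):
        # Skip empty fragments such as the one after a trailing semicolon
        if "=" not in pair:
            continue
        # Split on the first "=" only, since values may contain "=" themselves
        name, value = pair.strip().split("=", 1)
        cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Made-up values, only to show the shape of the result
    print(parse_cookie_string("SUB=abc123; _T_WM=456=="))
```

The resulting dict can then be passed as requests.get(url, cookies=parse_cookie_string(raw)).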

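In practice, an expired cookie makes the site redirect to the login page, and the login page contains an element named "loginAction". That gives a simple way to validate the cookie.

# Check whether the cookie is still valid
def check_cookie(soup, cookies):
    cookiePath = os.path.join(os.getcwd(), "cookie.txt")
    # An <a> whose id matches loginAction means we were sent to the login page
    check = soup.find_all('a', id=re.compile("loginAction"))
    if check:
        print("The cookie is invalid, please check it")
        # The file may not exist yet if the cookie was typed in by hand
        if os.path.exists(cookiePath):
            os.remove(cookiePath)
        os.system("pause")  # Windows only: wait for a key press
        raise SystemExit(1)
    else:
        save_cookie(cookies['Cookie'])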
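
The detection idea can be tried offline on a small HTML string. The sketch below uses the standard library's html.parser instead of BeautifulSoup, purely so that it runs without extra dependencies; LoginLinkFinder, looks_like_login_page, and the sample HTML are assumptions for illustration, not the real weibo.cn markup.

```python
from html.parser import HTMLParser


class LoginLinkFinder(HTMLParser):
    """Hypothetical helper: records whether an <a> tag with a loginAction id was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        if tag == "a" and "loginAction" in (dict(attrs).get("id") or ""):
            self.found = True


def looks_like_login_page(html):
    finder = LoginLinkFinder()
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    print(looks_like_login_page('<a id="loginAction" href="#">Log in</a>'))
    print(looks_like_login_page('<a href="/search">Search</a>'))
```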

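Now for the actual crawling.

Looking at the weibo.cn pages, a post with more than one image gets a "组图" (image set) link, and the URL of that link contains the string picAll. That is the hook for collecting the image-set URLs.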

# Fetch the search result pages and return the list of image URLs
def get_anl_site(pages, searchVal, cookies):
    imgUrlList = []
    for page in pages:
        print(f"Fetching page {page} (up to page {pages[-1]})")
        # Build the search URL
        url = "https://weibo.cn/search/mblog?hideSearchFrame=&keyword=" + searchVal + "&page=" + str(page)
        print(f"Generated URL {url}")
        # Pause so the site is not hit too often
        time.sleep(5)
        # cookies is {"Cookie": "<raw cookie string>"}, i.e. a request header,
        # so it is sent through headers= rather than cookies=
        r = requests.get(url, headers=cookies)
        soup = BeautifulSoup(r.text, "lxml")
        # Validate the cookie
        check_cookie(soup, cookies)
        # Collect every image-set link on this page
        imgList = soup.find_all('a', href=re.compile("picAll"))
        # Visit each image-set page; its thumbnails all contain "thumb" in src
        for result in imgList:
            albumLink = result.get("href")
            imgr = requests.get(albumLink, headers=cookies)
            imgSoup = BeautifulSoup(imgr.text, "lxml")
            linkList = imgSoup.find_all('img', src=re.compile("thumb"))
            # Thumbnail and full-size URLs differ only in "thumb180" vs "large"
            for link in linkList:
                link = link.get("src")
                link = link.replace("thumb180", "large")
                imgUrlList.append(link)
        # Single-image posts: "wap180" only appears in their thumbnails
        imgSingle = soup.find_all('img', src=re.compile("wap180"))
        for link in imgSingle:
            link = link.get("src")
            link = link.replace("wap180", "large")
            imgUrlList.append(link)
    print("Valid cookie saved locally")
    print(f"Crawling finished, {len(imgUrlList)} image URLs collected")
    # Remove duplicates while keeping the original order
    imgUrlListRefine = list(dict.fromkeys(imgUrlList))
    print(f"{len(imgUrlListRefine)} image URLs left after deduplication")
    return imgUrlListRefine
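
The URL rewrite and the deduplication step can be tested in isolation, without touching the network. A minimal sketch, assuming thumbnail URLs follow the thumb180/wap180 pattern noted in the comments above; to_large_url and dedupe_keep_order are hypothetical names and the sample URLs are invented.

```python
def to_large_url(thumb_url):
    """Hypothetical helper: swap the thumbnail size segment for "large"."""
    for size in ("thumb180", "wap180"):
        thumb_url = thumb_url.replace(size, "large")
    return thumb_url


def dedupe_keep_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    # Invented URLs that only mimic the pattern
    thumbs = [
        "https://example.com/thumb180/a.jpg",
        "https://example.com/wap180/b.jpg",
        "https://example.com/thumb180/a.jpg",
    ]
    print(dedupe_keep_order([to_large_url(u) for u in thumbs]))
```

Using dict.fromkeys keeps the first occurrence of each URL in a single pass, so no separate sort is needed to restore the order.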

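Saving the images:

# Download the images
def img_download(imgUrlList):
    tempPath = input("Folder name for the images (leave blank to use the current time): ")
    if tempPath.strip() == '':
        # Colons are not allowed in Windows folder names, so strip them
        tempPath = str(datetime.datetime.now())
        tempPath = tempPath.replace(':', '')
        tempPath = tempPath.replace(" ", '')

    imgPath = os.path.join(os.getcwd(), tempPath)
    os.makedirs(imgPath, exist_ok=True)
    x = 1
    for imgUrl in imgUrlList:
        # Files are numbered from 1 and keep the extension from the URL (e.g. ".jpg")
        temp = os.path.join(imgPath, str(x) + os.path.splitext(imgUrl)[1])
        print(f"Downloading image {x}")
        print(f"Image URL: {imgUrl}")
        try:
            r = requests.get(imgUrl, stream=True)
            if r.status_code == 200:
                with open(temp, 'wb') as f:
                    # Decode gzip/deflate content before copying the raw stream
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
                x += 1
        except Exception as e:
            print(f"Failed to download {imgUrl}: {e}")

    print(f"Download finished, {x - 1} images saved")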
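
One detail worth isolating is how the local file name is derived from the image URL. The sketch below parses the URL first, so that a query string would not leak into the extension. local_image_name is a hypothetical helper, the sample URLs are invented, and the ".jpg" fallback is an assumption.

```python
import os
import urllib.parse


def local_image_name(index, img_url, default_ext=".jpg"):
    """Hypothetical helper: build "<index><ext>" from an image URL."""
    # Only look at the path part, ignoring any ?query or #fragment
    path = urllib.parse.urlparse(img_url).path
    ext = os.path.splitext(path)[1]
    return f"{index}{ext or default_ext}"


if __name__ == "__main__":
    print(local_image_name(1, "https://example.com/large/a.jpg"))
    print(local_image_name(2, "https://example.com/large/b.jpeg?x=1"))
    print(local_image_name(3, "https://example.com/large/c"))
```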

Write the entry point:

if __name__ == '__main__':
    cookie=read_cookie()
    searchVal = get_search_value()
    pages = get_page_number()
    imgUrlList = get_anl_site(pages, searchVal,cookie)
    img_download(imgUrlList)
    os.system("pause")

With that, a simple crawler is done.

Time to test it.

It basically does what I wanted.

The complete source code follows.

# -*- coding:utf-8 -*-
# powered by Midorii
# release date:19.2.4
import requests
from bs4 import BeautifulSoup
import re
import os
import shutil
import urllib.parse
import time
import datetime


# Load the cookie from a local file, or ask the user for it
def read_cookie():
    localCookiePath = os.path.join(os.getcwd(), "cookie.txt")
    print(f"Trying to read cookie from {localCookiePath}")
    if not os.path.exists(localCookiePath):
        print("No local cookie found")
        tempCookies = input("Enter the cookies from a logged-in Weibo session: ")
    else:
        with open(localCookiePath, 'r') as cookiesFile:
            # strip() drops a trailing newline left by a text editor
            tempCookies = cookiesFile.read().strip()
        print("Local cookie loaded")
    dictCookie = {"Cookie": tempCookies}
    print(f"Using cookie: {tempCookies}")
    return dictCookie


# Save a cookie that has been verified as valid
def save_cookie(tempCookies):
    with open(os.path.join(os.getcwd(), "cookie.txt"), 'w') as cookiesFile:
        cookiesFile.write(tempCookies)


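# Check whether the cookie is still valid
def check_cookie(soup, cookies):
    cookiePath = os.path.join(os.getcwd(), "cookie.txt")
    # An <a> whose id matches loginAction means we were sent to the login page
    check = soup.find_all('a', id=re.compile("loginAction"))
    if check:
        print("The cookie is invalid, please check it")
        # The file may not exist yet if the cookie was typed in by hand
        if os.path.exists(cookiePath):
            os.remove(cookiePath)
        os.system("pause")  # Windows only: wait for a key press
        raise SystemExit(1)
    else:
        save_cookie(cookies['Cookie'])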


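# Read the search term and percent-encode it for use in the URL
def get_search_value():
    searchVal = input("Search term: ")
    return urllib.parse.quote(searchVal)


# Read the page range; returns the inclusive range of page numbers
def get_page_number():
    startPage = int(input("Start page: "))
    endPage = int(input("End page: "))
    return range(startPage, endPage + 1)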


# Fetch the search result pages and return the list of image URLs
def get_anl_site(pages, searchVal, cookies):
    imgUrlList = []
    for page in pages:
        print(f"Fetching page {page} (up to page {pages[-1]})")
        # Build the search URL
        url = "https://weibo.cn/search/mblog?hideSearchFrame=&keyword=" + searchVal + "&page=" + str(page)
        print(f"Generated URL {url}")
        # Pause so the site is not hit too often
        time.sleep(5)
        # cookies is {"Cookie": "<raw cookie string>"}, i.e. a request header,
        # so it is sent through headers= rather than cookies=
        r = requests.get(url, headers=cookies)
        soup = BeautifulSoup(r.text, "lxml")
        # Validate the cookie
        check_cookie(soup, cookies)
        # Collect every image-set link on this page
        imgList = soup.find_all('a', href=re.compile("picAll"))
        # Visit each image-set page; its thumbnails all contain "thumb" in src
        for result in imgList:
            albumLink = result.get("href")
            imgr = requests.get(albumLink, headers=cookies)
            imgSoup = BeautifulSoup(imgr.text, "lxml")
            linkList = imgSoup.find_all('img', src=re.compile("thumb"))
            # Thumbnail and full-size URLs differ only in "thumb180" vs "large"
            for link in linkList:
                link = link.get("src")
                link = link.replace("thumb180", "large")
                imgUrlList.append(link)
        # Single-image posts: "wap180" only appears in their thumbnails
        imgSingle = soup.find_all('img', src=re.compile("wap180"))
        for link in imgSingle:
            link = link.get("src")
            link = link.replace("wap180", "large")
            imgUrlList.append(link)
    print("Valid cookie saved locally")
    print(f"Crawling finished, {len(imgUrlList)} image URLs collected")
    # Remove duplicates while keeping the original order
    imgUrlListRefine = list(dict.fromkeys(imgUrlList))
    print(f"{len(imgUrlListRefine)} image URLs left after deduplication")
    return imgUrlListRefine


# Download the images
def img_download(imgUrlList):
    tempPath = input("Folder name for the images (leave blank to use the current time): ")
    if tempPath.strip() == '':
        # Colons are not allowed in Windows folder names, so strip them
        tempPath = str(datetime.datetime.now())
        tempPath = tempPath.replace(':', '')
        tempPath = tempPath.replace(" ", '')

    imgPath = os.path.join(os.getcwd(), tempPath)
    os.makedirs(imgPath, exist_ok=True)
    x = 1
    for imgUrl in imgUrlList:
        # Files are numbered from 1 and keep the extension from the URL (e.g. ".jpg")
        temp = os.path.join(imgPath, str(x) + os.path.splitext(imgUrl)[1])
        print(f"Downloading image {x}")
        print(f"Image URL: {imgUrl}")
        try:
            r = requests.get(imgUrl, stream=True)
            if r.status_code == 200:
                with open(temp, 'wb') as f:
                    # Decode gzip/deflate content before copying the raw stream
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
                x += 1
        except Exception as e:
            print(f"Failed to download {imgUrl}: {e}")

    print(f"Download finished, {x - 1} images saved")


if __name__ == '__main__':
    cookie=read_cookie()
    searchVal = get_search_value()
    pages = get_page_number()
    imgUrlList = get_anl_site(pages, searchVal,cookie)
    img_download(imgUrlList)
    os.system("pause")

Current shortcomings:

I haven't found a way to crawl super topics (超级话题).

The cookie can expire.