Breaking the Dimensional Barrier: Getting a 2D Girlfriend with a Python Crawler

APP源码解析

Published 2024-05-02 09:19:18


Category column: Programmers. Article tags: python, web crawler, programming language

Article link: https://blog.csdn.net/m0_61331367/article/details/138387476



### Parsing the page


Use `beautifulsoup` to parse the page and pick out the `JS` block that holds the data we need:

    results = soup.find_all('script')[1]


To search the content with `re`, first convert it to a string:

    image_dirty = str(results)
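To illustrate what "take the second `script` element and treat it as text" means, here is a dependency-free sketch that uses the standard library's `html.parser` instead of `beautifulsoup`. The class name `ScriptCollector` and the sample HTML are assumptions made up for this illustration; the real page is far larger, and `str(results)` in the article also keeps the surrounding `<script>` tags, which this sketch drops.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Only keep text that sits inside a <script> element.
        if self._in_script:
            self.scripts[-1] += data


# Made-up page with two script blocks; the second one carries the data.
html = (
    '<html><head><script>var a = 1;</script></head>'
    '<body><script>window.data = {"path":"images/x.jpg"};</script></body></html>'
)

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Same idea as soup.find_all('script')[1]: index 1 is the second block.
image_dirty = collector.scripts[1]
print(image_dirty)
```

Picking a script by position is fragile: if the site adds or removes a `script` element, index `1` will point at the wrong block.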


Next, compile a regular expression for the image URLs:

    # item is the regex pattern string
    pattern = re.compile(item, re.I | re.M)


Then find all the image URLs:

    result_list = pattern.findall(image_dirty)


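To make it easy to pull out any field we need, wrap these steps in a parsing function:

    def analysis(item, results):
        # item: regex pattern string; results: text to search
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list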
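To make the behaviour of `analysis` concrete, here is a minimal sketch that runs it against a made-up string imitating the page's embedded JS. The variable `sample_js` and its contents are assumptions for illustration, not data from the real site.

```python
import re


def analysis(item, results):
    # Compile the pattern, then return every captured group.
    pattern = re.compile(item, re.I | re.M)
    return pattern.findall(results)


# Hypothetical stand-in for the stringified <script> tag.
sample_js = '{"id":1,"path":"images/a.jpg"},{"id":2,"path":"images/b.png"}'

# The non-greedy (.*?) stops at the first closing quote, so each
# "path" value is captured on its own.
urls = analysis(r'"path":"(.*?)"', sample_js)
print(urls)
```

With a greedy `(.*)` the match would run from the first `"path":"` to the last double quote in the string, which is why the non-greedy form matters here.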


Print the extracted image URLs:

    urls = analysis(r'"path":"(.*?)"', image_dirty)
    urls[0:1]


The output is full of odd-looking characters:

    'images\u002Fresource\u002F2021\u002F06\u002F20\u002F906h89635p0.jpg',


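This comes from how the data is encoded in the page: the JS contains literal `\u002F` escape sequences, and decoding the page as `utf-8`, as we did at the start, leaves those `Unicode` escapes untouched:

    response.encoding = 'utf-8'
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')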


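The original URL can be recovered by decoding the escapes explicitly:

    # Raw string, so that \u002F stays as literal text like in the page source
    url = r'images\u002Fresource\u002F2021\u002F05\u002F22\u002F90h013034p0.jpg'
    decodeunichars = url.encode('utf-8').decode('unicode-escape')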


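A simpler option is to set `response.encoding = 'unicode-escape'`, which decodes the whole page in one step. The drawback is that many of the page's Chinese characters turn into garbled text, but the text isn't what matters here, is it? Look at the pictures!


![Image](https://img-blog.csdnimg.cn/20210707104529848.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0xPVkVteTEzNDYxMQ==,size_16,color_FFFFFF,t_70#pic_center)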
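The garbling mentioned above can be reproduced on a small string. The sketch below also shows one possible alternative, `json.loads`, which resolves `\uXXXX` escapes without touching other characters. This is a swapped-in technique and not the article's method; the sample string `raw` is made up, and the approach assumes the captured text contains no unescaped double quotes or control characters.

```python
import json

# Hypothetical sample that mixes Chinese text with an escaped path.
raw = '画师\\u002Fimages\\u002Fresource\\u002F2021\\u002F06\\u002F20\\u002F906h89635p0.jpg'
expected = '画师/images/resource/2021/06/20/906h89635p0.jpg'

# unicode-escape reads the UTF-8 bytes as Latin-1, so the Chinese
# characters are corrupted even though \u002F becomes "/".
via_codec = raw.encode('utf-8').decode('unicode-escape')

# json.loads treats the text as a JSON string literal: it resolves the
# \uXXXX escapes and leaves every other character alone.
via_json = json.loads('"' + raw + '"')

print(via_codec == expected)  # False: the Chinese part is garbled
print(via_json == expected)   # True
```

If only the image paths are needed, either approach works, because the paths themselves are plain ASCII.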


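#### Creating the image save directories


Before downloading anything, create the directories the images will be saved in:

    # Create the image save directories.
    # exist_ok=True already covers the case where the directory exists,
    # so a separate os.path.exists check is not needed.
    os.makedirs(webp_file, exist_ok=True)
    os.makedirs(png_file, exist_ok=True)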
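As a quick check of the directory step, the following sketch creates both folders inside a temporary directory and calls `os.makedirs` twice to show that `exist_ok=True` makes the call safe to repeat. The folder names are taken from the complete program; the temporary base directory is an assumption added so that the sketch leaves nothing behind.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as base:
    webp_file = os.path.join(base, 'girlfriends_webp')
    png_file = os.path.join(base, 'girlfriends_png')

    for folder in (webp_file, png_file):
        os.makedirs(folder, exist_ok=True)
        # The second call is a no-op instead of raising FileExistsError.
        os.makedirs(folder, exist_ok=True)

    created = sorted(os.listdir(base))

print(created)
```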


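#### Downloading the images


When we use the browser's `Save as` option, the image is offered in `webp` format, yet the URLs extracted above end in `jpg` or `png`. Saving the data directly under a `jpg` or `png` name would give the file an extension that does not match its contents.

![Image format](https://img-blog.csdnimg.cn/20210707104731803.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0xPVkVteTEzNDYxMQ==,size_16,color_FFFFFF,t_70#pic_center)

So we rebuild the file name with a `webp` extension:

    name = img.split('/')[-1]
    name = name.split('.')[0]
    name_webp = name + '.webp'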
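The renaming step can also be sketched with `pathlib`, which swaps only the final extension. The helper name `webp_name` and the second sample path are hypothetical; the difference from `name.split('.')[0]` only shows up when a file name contains more than one dot.

```python
from pathlib import PurePosixPath


def webp_name(img_path):
    # Keep the last path segment and replace only its final extension.
    return PurePosixPath(img_path).with_suffix('.webp').name


first = webp_name('images/resource/2021/06/20/906h89635p0.jpg')
second = webp_name('images/resource/a.b.c.png')  # made-up multi-dot name

print(first)
print(second)
```

For `a.b.c.png`, `split('.')[0]` would yield just `a`, while `with_suffix` keeps `a.b.c` and changes only `.png`.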


The extracted image paths are relative, so join them with the image host to build complete URLs:

    from urllib.parse import urljoin
    domain = 'https://img2.huashi6.com'
    img_url = urljoin(domain, img)
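A short sketch of how `urljoin` combines the host with a relative path, using the decoded path shown earlier in the article. The `https://example.com/x.png` value is a made-up input that demonstrates what happens when the path is already a full URL.

```python
from urllib.parse import urljoin

domain = 'https://img2.huashi6.com'
img = 'images/resource/2021/06/20/906h89635p0.jpg'

# A relative path is resolved against the host.
img_url = urljoin(domain, img)
print(img_url)

# An absolute URL is returned unchanged, so mixed input is safe.
already_full = urljoin(domain, 'https://example.com/x.png')
print(already_full)
```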


Then download the image:

    r = requests.get(img_url, headers=headers)
    if r.status_code == 200:
        # Write the raw bytes; the server returns webp data
        with open(name_webp, 'wb') as f:
            f.write(r.content)
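Because the URL extension and the actual file format can disagree, one option is to inspect the first bytes of the payload before choosing a file name. The sketch below is an assumption layered on top of the article's flow: `sniff_image_format` is a hypothetical helper, and the byte strings are hand-made headers, not real downloads. In real use the argument would be `r.content`.

```python
def sniff_image_format(data):
    """Guess the image format from the leading bytes of the payload."""
    # WebP: a RIFF container whose form type at offset 8 is "WEBP".
    if data[:4] == b'RIFF' and data[8:12] == b'WEBP':
        return 'webp'
    # PNG: fixed 8-byte signature.
    if data[:8] == b'\x89PNG\r\n\x1a\n':
        return 'png'
    # JPEG: starts with the SOI marker followed by another marker byte.
    if data[:3] == b'\xff\xd8\xff':
        return 'jpg'
    return None


# Hand-made headers for illustration only.
fake_webp = b'RIFF' + b'\x00\x00\x00\x00' + b'WEBP' + b'VP8 '
fake_png = b'\x89PNG\r\n\x1a\n' + b'\x00' * 8

print(sniff_image_format(fake_webp))
print(sniff_image_format(fake_png))
print(sniff_image_format(b'<html>not an image</html>'))
```

A `None` result is a useful signal that the server returned an error page or some other non-image body even though the status code was 200.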


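#### Format conversion


Finally, the downloaded images are in `webp` format. To get the more common `png` format, convert them with the `PIL` library:

    image_webp = Image.open(name_webp)
    # The output format is inferred from the extension of name_png
    image_webp.save(name_png)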


#### Crawl results


![Crawl results](https://img-blog.csdnimg.cn/20210707105342875.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0xPVkVteTEzNDYxMQ==,size_16,color_FFFFFF,t_70#pic_center)


### Complete program

    import time
    import requests
    from bs4 import BeautifulSoup
    import os
    import re
    from urllib.parse import urljoin
    from PIL import Image

    webp_file = 'girlfriends_webp'
    png_file = 'girlfriends_png'

    print(os.getcwd())

    # Create the image save directories
    os.makedirs(webp_file, exist_ok=True)
    os.makedirs(png_file, exist_ok=True)

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0',
        # 'Cookie': '',
        'Connection': 'keep-alive'
    }
    url_pattern = "https://www.huashi6.com/tags/161?p={}"

    domain = 'https://img2.huashi6.com'

    # Image URL extraction function
    def analysis(item, results):
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list

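    # Image format conversion function
    def change_webp2png(name_webp, name_png, img_url):
        try:
            image_webp = Image.open(name_webp)
            image_webp.save(name_png)
        except Exception:
            # Opening or converting the file failed: fall back to download_image
            download_image(name_webp, name_png, img_url)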