Crawling every product from a women's shoe site and uploading the packages to OSS


Preface

This project targets http://www.go2.cn, a marketplace that sells women's shoes. The goal is to crawl the image packages of every product on the site and upload them to OSS. I ran into quite a few interesting problems along the way, so I'm writing them down here. There are also plenty of handy little utilities in the code; feel free to take them if they're useful to you!


I. Site analysis

First, analyze the structure of the site. To crawl every product we need a plan: find all the merchants first, then collect every product from each merchant.

1. Finding all the merchants


On the right side of the home page there is a "源头厂商" (source manufacturers) entry; clicking it looks like this:
(screenshot)
That gives us the merchants. To get all of them, notice that more merchants are loaded as you scroll down, so look at the network requests:
(screenshot)
It is an ajax POST request whose parameter is just page; testing showed the url seen in the request payload has no effect, so we can build the requests in a for loop:

def business():
    business_list = []
    business_url = 'http://www.go2.cn/ajax/welcome/getSupplier'
    for page in range(1, 100):
        data = {
            'page': page
        }
        resp = httpx.post(business_url, headers=headers, data=data).text
        if len(resp) < 10:
            break
        else:
            business_url_list = re.findall(r'left: 0;\\" href=\\"(http.*?cn)', resp)
            business_url_list = [x.replace('\\', '') for x in business_url_list]
            for i in business_url_list:
                business_list.append([i])
    # Save the results to a CSV file
    with open('business.csv', 'w', newline='') as f:
        w = csv.writer(f)
        w.writerows(business_list)

Since we don't know how many pages there are, we just request up to 100 and break out of the loop once the response body is basically empty. The response looks like this, and we need to pull the merchant URLs out of it; a regular expression fits well here and keeps the extraction simple.
(screenshot)
Careful readers will notice there are two tabs, but I tried both and they return the same data, so a single request is enough. Once the URLs are extracted, save them to a CSV file.
(screenshot)

2. Finding every product of each merchant

Once inside a merchant's page you can see all of its products:
(screenshot)
Looking at the network traffic, it is, unsurprisingly, another ajax request:
(screenshot)
The response is interesting: it is a fragment of HTML. There is a catch, though. Merchants on this site use two different page templates; both load their products via ajax, but the HTML fragments they return are different.

Here is the other template first:
(screenshot)

Now compare the responses of the two templates:
Template two (screenshot)
Template one (screenshot)
So we need a fetch function for each template (since we don't know how many pages there are, the loop bound is generous; if the response is too short, there is nothing left and we break out of the loop):

def commodity_one(url1):
    commodity_url_list = []
    for page in range(1, 500):
        url2 = url1 + f'/welcome/product_list?pn={page}&channel=index&customCid=&state=&filter=&mcount=&hide=&q='
        resp = httpx.get(url2, headers=headers)
        time.sleep(0.5)
        if len(resp.text) < 10:
            commodity_url_list.append('end')
            break
        else:
            soup = etree.HTML(resp.text)
            try:
                a_list = soup.xpath('//a[@class="list-item-big-img"]/@href')
                for a in a_list:
                    commodity_url_list.append(a)
            except:
                print(url2 + ' has an error')
    return commodity_url_list
def commodity_two(url1):
    commodity_url_list = []
    for page in range(1, 500):
        url2 = url1 + f'/welcome/product_list?pn={page}'
        resp = httpx.get(url2, headers=headers)
        time.sleep(0.5)
        if len(resp.text) < 10:
            commodity_url_list.append('end')
            break
        else:
            soup = etree.HTML(resp.text)
            try:
                div = soup.xpath('/html/body/div')
                for a in div:
                    url = a.xpath('./a/@href')[0]
                    commodity_url_list.append(url)
            except:
                print(url2 + ' has an error')
    return commodity_url_list

def run():
    business()
    with open('business.csv') as f:
        r = csv.reader(f)
        for business_url in tqdm(r):
            business_url = business_url[0]
            resp = httpx.get(business_url, headers=headers).text
            time.sleep(0.5)
            if 'main-list-item find-similar-item' in resp or 'top-wrap' in resp:
                commodity_url_list = commodity_one(business_url)
            elif '<div class="item-shop">' in resp:
                commodity_url_list = commodity_two(business_url)
            else:
                commodity_url_list = []
                print('unknown page template -- ' + business_url)
            if len(commodity_url_list) == 0:
                print(business_url + ' -- not a single product, something is fishy')
                print(commodity_url_list)

We check each template's distinctive class attribute value to decide which parsing function to use. With that, every product link is collected, and we save them all to the database so they can be read back later.
(The complete code, including the dedicated database code, comes later, so please be patient.)
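
As a minimal sketch of the "save to database" step, here is some hypothetical glue code that feeds the collected URLs into the DataManager class shown later (the helper name is mine):

from mysql import DataManager

def save_commodities(commodity_url_list):
    # Sketch: store the collected product URLs in the urls table.
    # save_data uses executemany with a single %s placeholder, so pass one-element tuples.
    db = DataManager()
    rows = [(url,) for url in commodity_url_list if url != 'end']  # drop the end-of-pages marker
    db.save_data(rows)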

3. Finding where the image package lives

(screenshot)
F12 (DevTools) cannot be opened directly on this page, so we get in another way, and then it works:
(screenshot)
The link is http://www.go2.cn/main/product/download/qsggoms?t=1641181482. Looking at it, http://www.go2.cn/main/product/download/ is fixed and qsggoms identifies the product; the product URLs we collected earlier already contain it:
(screenshot)
Extract that id with a regex and build the download-page URL. As for t, it is just a timestamp; in practice it makes no difference whether you include it.
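
A minimal sketch of building the download-page URL (the example product URL at the bottom is made up, just to show the format):

import re
import time

def build_download_url(product_url):
    # The product id sits between "product/" and the next dot in the product URL
    product_id = re.findall(r'product/(.*?)\.', product_url)[0]
    # t is only a timestamp and has no real effect, so it could be dropped entirely
    return f'http://www.go2.cn/main/product/download/{product_id}?t={int(time.time())}'

print(build_download_url('http://www.go2.cn/product/qsggoms.html'))  # hypothetical URL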

Clicking through brings up an image captcha. You could use the Chaojiying (超级鹰) captcha-solving service here, but it is a bit pricey. I didn't use it; instead I trained a YOLO (yolov5) image recognizer with roughly a 90% success rate. I won't include the YOLO training code, since you have to train it for your own scenario, but the code that calls it is included below!

(screenshot)
Clicking one of the images sends a POST request:
(screenshot)
url:http://www.go2.cn/ajax/product_download/verify_img/3823230195.0613/qsggoms/2df025428f7b4a7a11679642b47fa0a3?t=0.848512361092481

There are a couple of values to note here: 3823230195.0613 and 2df025428f7b4a7a11679642b47fa0a3. Fortunately, they are easy to get:
both appear right in the HTML of the download page itself.
The POST parameters are:
(screenshot)
index is which of the six images above was picked, counting from 0; t is again a timestamp and works with or without it.
The response is: (screenshot)
All we need to do is extract the download link from it.
I did consider reverse-engineering this response; after a few days of trying I gave up. Way too hard.
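
Pulling that together, here is a condensed sketch of the verification request; it reads s, p and h out of the download page's HTML and posts the chosen image index (the helper name is mine; the full version, with headers and retries, is in the selenium code below):

import re
import httpx

def request_zip_url(page_source, index, headers):
    # s, p and h are embedded as JS variables in the download page
    s = re.findall('var s = "(.*?)"', page_source)[0]
    p = re.findall('var p = "(.*?)"', page_source)[0]
    h = re.findall('var h = "(.*?)"', page_source)[0]
    ajax_url = f'http://www.go2.cn/ajax/product_download/verify_img/{s}/{p}/{h}'
    resp = httpx.post(ajax_url, data={'index': index}, headers=headers).json()
    if resp['status'] == 1:
        return resp['data']['source_file']['file_link_source']
    return None  # wrong image picked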

At this point image recognition gives us the coordinates, we map them to an index, pass that as the parameter, and get the download link. There is one more trap, though. The obvious approach is to request the captcha image URL, save it locally, and run recognition on it. But the image is randomized: as soon as you request the image URL again, a new image is served and it no longer matches the one on the current page. So we have to log in with selenium, open the page, and take a screenshot instead.
(screenshot)

II. Code plan

1. Crawl all the merchants
2. Crawl every product of each merchant and save the URLs to the database
3. Visit each product page and extract the product name
4. Log in with selenium, then request the product's download URL
5. Take a screenshot locally, recognize it with YOLO, and get the coordinates back
6. Build the POST request and get zip_url from the response
7. Post-process the downloaded zip
8. Upload it to OSS as a multipart upload

III. Code

1. Utility code: main.py

# main.py
import shutil
import threading
import time
import csv
import httpx
from tqdm import tqdm
import re
from lxml import etree
from retrying import retry
import queue


q = queue.Queue()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.57'
}


def business():
    business_list = []
    business_url = 'http://www.go2.cn/ajax/welcome/getSupplier'
    for page in range(1, 100):
        data = {
            'page': page
        }
        resp = httpx.post(business_url, headers=headers, data=data).text
        if len(resp) < 10:
            break
        else:
            business_url_list = re.findall(r'left: 0;\\" href=\\"(http.*?cn)', resp)
            business_url_list = [x.replace('\\', '') for x in business_url_list]
            for i in business_url_list:
                business_list.append([i])
    with open('business.csv', 'w', newline='') as f:
        w = csv.writer(f)
        w.writerows(business_list)


def commodity_one(url1):
    commodity_url_list = []
    for page in range(1, 500):
        url2 = url1 + f'/welcome/product_list?pn={page}&channel=index&customCid=&state=&filter=&mcount=&hide=&q='
        resp = httpx.get(url2, headers=headers)
        time.sleep(0.5)
        if len(resp.text) < 10:
            commodity_url_list.append('end')
            break
        else:
            soup = etree.HTML(resp.text)
            try:
                a_list = soup.xpath('//a[@class="list-item-big-img"]/@href')
                for a in a_list:
                    commodity_url_list.append(a)
            except:
                print(url2 + ' has an error')
    return commodity_url_list


def commodity_two(url1):
    commodity_url_list = []
    for page in range(1, 500):
        url2 = url1 + f'/welcome/product_list?pn={page}'
        resp = httpx.get(url2, headers=headers)
        time.sleep(0.5)
        if len(resp.text) < 10:
            commodity_url_list.append('end')
            break
        else:
            soup = etree.HTML(resp.text)
            try:
                div = soup.xpath('/html/body/div')
                for a in div:
                    url = a.xpath('./a/@href')[0]
                    commodity_url_list.append(url)
            except:
                print(url2 + ' has an error')
    return commodity_url_list


def run():
    business()
    with open('business.csv') as f:
        r = csv.reader(f)
        for business_url in tqdm(r):
            business_url = business_url[0]
            resp = httpx.get(business_url, headers=headers).text
            time.sleep(0.5)
            if 'main-list-item find-similar-item' in resp or 'top-wrap' in resp:
                commodity_url_list = commodity_one(business_url)
            elif '<div class="item-shop">' in resp:
                commodity_url_list = commodity_two(business_url)
            else:
                commodity_url_list = []
                print('unknown page template -- ' + business_url)
            if len(commodity_url_list) == 0:
                print(business_url + ' -- not a single product, something is fishy')
                print(commodity_url_list)


# Function that extracts the product name
@retry(stop_max_attempt_number=3, wait_fixed=1000)
def commodity(url):
    resp = httpx.get(url, headers=headers).text
    soup = etree.HTML(resp)
    title = soup.xpath('//title/text()')[0]
    title = title.replace('&', '-').replace(re.findall('(批发.*?网)', title)[0], '')
    return title


import zipfile
from up import upload
import requests


# Zip downloader
# Retry decorator: up to 3 attempts, 1 s apart
@retry(stop_max_attempt_number=3, wait_fixed=1000)
def download_source(url, output_path, index, num, chunk_size=5120):
    # Stream the download in chunks because the files are large
    response = requests.get(url, stream=True)
    with open(output_path, mode='wb') as f:
        for chunk in response.iter_content(chunk_size):
            f.write(chunk)

    print(f'{num} -- item {index} downloaded')
    time.sleep(0.25)
    # Post-process the zip archive
    try:
        with zipfile.ZipFile(output_path) as z:
            r = z.namelist()
            for i in r:
                if '商家图片说明' in i:
                    shangjia = re.findall('/(GO2-商家.*?txt)', i)[0]
                elif 'go2.cn' in i:
                    cn = re.findall(r'/(.*?)$', i)[0]
        try:
            # Delete those files from the archive without extracting it, using 7-Zip
            a = popen(fr'D:\7-Zip\7z.exe d -tzip -y {output_path} {shangjia} {cn} -r').read()
            if 'Error' in a:
                print(a)
            # Once processed, put the archive on the upload queue
            q.put(output_path)
        except Exception as e:
            print(e)
    except Exception as e:
        # If the zip turns out to be broken (download failed), move it to this folder
        shutil.move(output_path, r'D:\桌面\女鞋网\失败的压缩包')


from os import popen
import os


# Function that calls YOLO; the model returns coordinates, but the request needs an index, so map the coordinates to one of the six positions
def shibie(name, num):
    os.chdir(r'E:\yolov5-master')
    text = popen(
        f"python detect.py --weights runs/train/exp/weights/best.pt --source datasets/images/test/save{num}.png").read()
    address = re.findall(f'{name}\nbounding box is (.*?)\n', text)[0].split()
    x = int(address[0])
    y = int(address[1])
    w = 0
    if 30 > x > 0 and 80 > y > 30:
        w = 0
    elif 210 > x > 165 and 80 > y > 30:
        w = 1
    elif 390 > x > 350 and 80 > y > 30:
        w = 2
    elif 30 > x > 0 and 265 > y > 225:
        w = 3
    elif 210 > x > 165 and 265 > y > 225:
        w = 4
    elif 390 > x > 350 and 265 > y > 225:
        w = 5

    os.chdir(r'D:\桌面\女鞋网')
    return w


# Upload to OSS: once 10 archives are ready, upload them as a batch
def upp():
    while True:
        if q.qsize() >= 10:
            ths = [threading.Thread(target=upload, args=(q.get(),)) for _ in range(10)]
            for t in ths:
                t.start()
            for t in ths:
                t.join()
        else:
            # avoid busy-waiting while the queue fills up
            time.sleep(1)

2. Database code: mysql.py

import pymysql
import threading


class DataManager:
    # Singleton: every instantiation returns the same object.
    _instance_lock = threading.Lock()

    def __new__(cls, *args, **kwargs):
        if not hasattr(DataManager, "_instance"):
            with DataManager._instance_lock:
                DataManager._instance = object.__new__(cls)
                return DataManager._instance

        return DataManager._instance

    def __init__(self):
        # Open the database connection
        self.conn = pymysql.connect(
            host='127.0.0.1',
            database='data',
            user='root',
            password='123456'
        )

        # Create a cursor
        self.cursor = self.conn.cursor()

    def save_data(self, data):
        # Insert statement; data should be a sequence of one-element tuples, e.g. [(url,), ...]
        sql = 'insert into urls(url) values(%s) '
        try:
            self.cursor.executemany(sql, data)
            self.conn.commit()
        except Exception as e:
            print('failed to insert data', e)
            print(data)
            self.conn.rollback()  # roll back

    def select_data(self, start: int, end: int):
        # SELECT statement: rows start..end (LIMIT offset, count)
        select_sql = f'SELECT url FROM urls LIMIT {start - 1}, {end - start}'
        # DELETE statement (not used at the moment)
        delete_sql = 'DELETE FROM urls LIMIT 1000'
        try:
            self.cursor.execute(select_sql)
            all_data = self.cursor.fetchall()  # all rows
            return all_data
        except Exception as e:
            print('failed to read data', e)
            self.conn.rollback()

    def __del__(self):
        # Close the cursor
        self.cursor.close()
        # Close the connection
        self.conn.close()

3. OSS upload code: up.py

# -*- coding: utf-8 -*-
import os
import re
from oss2 import SizedFileAdapter, determine_part_size
from oss2.models import PartInfo
import oss2

# An Alibaba Cloud primary-account AccessKey grants access to every API and is high risk. It is strongly recommended to create and use a RAM user for API access and daily operations (see the RAM console).
auth = oss2.Auth('****', '*****')
# The Endpoint here uses Hangzhou as an example; fill in your actual region.
bucket = oss2.Bucket(auth, '****', '*****')


def percentage(consumed_bytes, total_bytes):
    """Progress callback: print the percentage completed.

    :param consumed_bytes: bytes uploaded/downloaded so far
    :param total_bytes: total bytes
    """
    if total_bytes:
        rate = int(100 * (float(consumed_bytes) / float(total_bytes)))
        print('\r{0}% '.format(rate), end='')


def upload(filename):
    key = re.findall(r'\\(.*?)\.zip', filename)[0]
    key = r'go2/' + key
    total_size = os.path.getsize(filename)
    # determine_part_size determines the part size.
    part_size = determine_part_size(total_size, preferred_size=1024 * 1024)

    # Initialize the multipart upload.
    # To set the storage class when initializing, pass the relevant headers to init_multipart_upload, for example:
    # headers = dict()
    # headers["x-oss-storage-class"] = "Standard"
    # upload_id = bucket.init_multipart_upload(key, headers=headers).upload_id
    upload_id = bucket.init_multipart_upload(key).upload_id
    parts = []


    # Upload the parts one by one.
    with open(filename, 'rb') as fileobj:
        part_number = 1
        offset = 0
        while offset < total_size:
            num_to_upload = min(part_size, total_size - offset)
            # SizedFileAdapter(fileobj, size) wraps the file object and limits how much is read for this part.
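            # Optionally pass progress_callback=percentage to upload_part to print upload progress.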
            result = bucket.upload_part(key, upload_id, part_number,
                                        SizedFileAdapter(fileobj, num_to_upload))
            parts.append(PartInfo(part_number, result.etag))

            offset += num_to_upload
            part_number += 1

    # Complete the multipart upload.
    # To set the object ACL when completing the upload, pass the relevant headers to complete_multipart_upload, for example:
    # headers = dict()
    # headers["x-oss-object-acl"] = oss2.OBJECT_ACL_PRIVATE
    # bucket.complete_multipart_upload(key, upload_id, parts, headers=headers)
    bucket.complete_multipart_upload(key, upload_id, parts)

    # Verify the multipart upload.
    with open(filename, 'rb') as fileobj:
        assert bucket.get_object(key).read() == fileobj.read()
	
    # Delete the local file once the upload has been verified
    os.remove(filename)



4. Main selenium code

import csv
import json
import time
import httpx
import shutil
from selenium import webdriver
from mysql import DataManager
import re
from PIL import Image
import pytesseract
from main import commodity, download_source, shibie, upp
import threading


def work(url_list, num, count, c):
    opts = webdriver.ChromeOptions()
    opts.headless = True
    driver = webdriver.Chrome(options=opts)
    driver.get('http://www.go2.cn')
	
    # Get cookies (run this once interactively and save them)
    # input(':')
    # cookies = driver.get_cookies()
    # json.dump(cookies, open(f'{num}.pkl', 'w'))
    # print(f'{num} -- cookies saved')

    # Log back in with the saved cookies
    cookies = json.load(open(f'./cookies/{num}.pkl', 'r'))
    for cookie in cookies:
        driver.add_cookie(cookie)
    driver.refresh()

    time.sleep(1)
    shibai_list = []

    for index, url in enumerate(url_list):
        try:
            title = commodity(url[0])
            # Local path for the downloaded zip
            path = fr'J:\{title}.zip'        # change to your own path
            id = re.findall(r'product/(.*?)\.', url[0])[0]
            date = int(time.time())
            url = f'http://www.go2.cn/main/product/download/{id}?t={date}'
            headers = {
                "Accept": "application/json, text/javascript, */*; q=0.01",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,de;q=0.5,ja;q=0.4",
                "Cache-Control": "no-cache",
                "Connection": "close",
                "Content-Length": "7",
                "Content-Type": "application/x-www-form-urlencoded",
                "Cookie": c + str(date),
                "Host": "www.go2.cn",
                "Origin": "http://www.go2.cn",
                "Pragma": "no-cache",
                "Referer": url,
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62",
                "X-Requested-With": "XMLHttpRequest"
            }
            driver.get(url)
            # Only about 20 consecutive visits are allowed before the site shows a "too frequent" (频繁) notice and makes you wait a minute
            if driver.page_source.find('频繁') != -1:
                print('waiting...')
                time.sleep(65)
                driver.get(url)
            cuo = 0
            while True:
                driver.save_screenshot(f'./图片/bg{num}.png')
                img_rangle = (135, 50, 660, 450)  # coordinates of the region we need to crop
                i = Image.open(f"./图片/bg{num}.png")  # open the full screenshot
                img = i.crop(img_rangle)  # use Image.crop to cut the captcha area out of the screenshot
                img = img.convert('RGB')
                img.save(fr'E:\yolov5-master\datasets\images\test\save{num}.png')  # save the captcha image for recognition

                wenzi_rangle = (345, 50, 400, 80)
                i = Image.open(f'./图片/bg{num}.png')
                wenzi_img = i.crop(wenzi_rangle)
                wenzi_img = wenzi_img.convert('RGB')
                wenzi_img.save(f'./图片/wenzi{num}.jpg')
				
                # OCR of the prompt text; I trained a tesseract language file for this, not very accurate but usable
                wenzi = pytesseract.image_to_string(Image.open(f'./图片/wenzi{num}.jpg'), lang='normal')
                time.sleep(0.5)
                if '服服服' in wenzi:
                    w = shibie('shoes', num)
                elif '服' in wenzi:
                    w = shibie('bag', num)
                else:
                    w = shibie('clothes', num)

                y = driver.page_source
                s = re.findall('var s = "(.*?)"', y)[0]
                p = re.findall('var p = "(.*?)"', y)[0]
                h = re.findall('var h = "(.*?)"', y)[0]

                ajax_url = f'http://www.go2.cn/ajax/product_download/verify_img/{s}/{p}/{h}'
                data = {
                    'index': w
                }
                resp = httpx.post(ajax_url, data=data, headers=headers).json()
                if resp['status'] == 1:
                    zip_url = resp['data']['source_file']['file_link_source']
                    # Download in a new thread so the main thread isn't blocked
                    threading.Thread(target=download_source, args=(zip_url, path, index, num)).start()
                    count += 1
                    break
                else:
                    if cuo == 3:
                        shibai_list.append(url)
                        with open(f'失败的商品{num}.txt', 'a') as f:
                            f.write(url + '\n')
                        break
                    cuo += 1
                    print(f'{num} -- wrong pick #{cuo}')
            time.sleep(0.5)
        except Exception as e:
            print(f'{num}--{e}')
            shibai_list.append(url)
            with open(f'失败的商品{num}.txt', 'a') as f:
                f.write(f'{url}\n')
	
    print(f'downloaded {count} items')
    with open(f'失败的商品{num}.csv', 'a', newline='') as f:
        w = csv.writer(f)
        w.writerows([[u] for u in shibai_list])

    driver.close()
    driver.quit()


if __name__ == '__main__':
    db = DataManager()
	
    # Select which rows of product URLs to read from the database
    url_list1 = db.select_data(7171, 10000)
    

    c1 = 'your cookie string here'
    

    threading.Thread(target=work, args=(url_list1, 1, 100, c1)).start()

    # Start the upload worker first
    threading.Thread(target=upp).start()

Summary

This is the second complete project I have done (I never wrote up the first one). I ran into plenty of problems along the way and solved them by searching online and asking my teacher. Doing real projects is the best way to grow, so I'll keep at it. I also plan to start a tools column with useful utility code. If you have any questions or suggestions, leave a comment and I will definitely reply. Thanks for your support, and let's improve together.
