Introduction
After learning the basics of web scraping, many beginners discover that no single method can extract the data they want on its own; techniques have to be combined, much like a hard exam problem that a single formula or fact cannot solve. In my own work, find_all, Tag navigation, and find cover well over 95% of scraping needs; regular expressions handle most of the rest.
Explanation
Below is the complete code for scraping product information and stock levels from one site. (As an aside, this site brings the company over 3 million in revenue; leave a comment if you are curious about the logic behind that.) If the basics are not solid yet, look them up first; they are not explained here.
Code example 1:
type_item = soup.find_all('li', class_='breadcrumb-item')
type1_item = type_item[0].a.get_text()  # 3. primary category
type2_item = type_item[1].a.get_text()  # 4. secondary category
Logic: find_all + Tag
Explanation: find_all returns every <li> with class='breadcrumb-item'; indexing into the result and navigating through the .a Tag then yields each link's text. (With a fixed two breadcrumb levels, direct indexing is enough; when the number of elements varies, loop over the list as in example 2.)
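A minimal, self-contained sketch of the find_all + Tag pattern from example 1. The breadcrumb HTML below is invented for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical breadcrumb markup, mirroring class='breadcrumb-item' above.
html = """
<ol>
  <li class="breadcrumb-item"><a href="/women">Women</a></li>
  <li class="breadcrumb-item"><a href="/women/bags">Bags</a></li>
</ol>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of matching <li> Tags; indexing into it and
# using the .a attribute drills down to the nested <a> element.
items = soup.find_all('li', class_='breadcrumb-item')
type1 = items[0].a.get_text()  # primary category
type2 = items[1].a.get_text()  # secondary category
print(type1, type2)
```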
Code example 2:
chicun = soup.table.tbody.find_all('tr')  # 16-17. stock for each size
for row in chicun:
    size = row.th.get_text()
    size_number = row.td.get_text()
Logic: find_all + Tag
Explanation: find_all collects every <tr> in the table body; a for loop then walks the rows, and Tag navigation (.th, .td) extracts each row's size and stock text. (A for loop is required here because the number of rows varies.)
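The looped find_all + Tag pattern from example 2 can be sketched like this. The table markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical size/stock table, mirroring the <tr><th><td> layout above.
html = """
<table><tbody>
  <tr><th>S</th><td>3</td></tr>
  <tr><th>M</th><td>0</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

stock = {}
# soup.table.tbody navigates by Tag name; find_all('tr') returns the rows.
for row in soup.table.tbody.find_all('tr'):
    size = row.th.get_text()      # size label from the <th>
    qty = int(row.td.get_text())  # stock count from the first <td>
    stock[size] = qty
print(stock)
```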
Code example 3:
ue_price = int(soup.find('div', class_='tprice').span.get_text().replace('€', ''))  # 10. European price in €
Logic: find + Tag
Explanation: find returns the first <div> with class='tprice'; Tag navigation (.span) then reaches the nested price text.
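Example 3 in isolation, with invented price markup, looks like this:

```python
from bs4 import BeautifulSoup

# Hypothetical price markup, mirroring class='tprice' above.
html = '<div class="tprice">EU price: <span>€129</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching Tag (or None if nothing matches);
# .span then selects the nested <span>, and the currency symbol is
# stripped before converting to int.
div = soup.find('div', class_='tprice')
ue_price = int(div.span.get_text().replace('€', ''))
```

Note that find returns None on no match, so a missing element raises AttributeError on `.span`; the full script below relies on try/except for that case.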
Summary
In practice, most hard-to-locate elements yield to some combination of find_all, Tag navigation, and find. (No further examples here; leave a comment if you need more or if anything is unclear.)
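For the remaining few percent the introduction mentions, BeautifulSoup also accepts a compiled regular expression as a filter, which combines naturally with find_all. The class names here are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with a family of related class names.
html = """
<span class="price-usd">$35</span>
<span class="price-eur">€30</span>
"""
soup = BeautifulSoup(html, 'html.parser')

# A compiled regex as the class_ filter matches every <span> whose
# class starts with "price-", without listing each class by hand.
tags = soup.find_all('span', class_=re.compile(r'^price-'))
prices = [t.get_text() for t in tags]
```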
The complete code follows:
import random
import requests
from bs4 import BeautifulSoup
import chardet
import re
import pymysql
from datetime import datetime
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Opera/8.0 (Windows NT 5.1; U; en)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
url_list = []
# Section1
# Crawl the 013 product URLs and put them into url_list first
# begin------------------------------------------------------------------------------------------------------------------
for page in range(1, 80):
    url = 'http://www.yvogue.com/product/search?pageNo={}'.format(page)
    print('Crawling url:', url)
    try:
        headers = {
            'user-agent': random.choice(USER_AGENTS)}  # pick a random User-Agent
        response = requests.get(url=url, headers=headers, timeout=30)
        response.encoding = chardet.detect(response.content)['encoding']
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        url_group = soup.find_all('h6')
        for i in range(len(url_group)):
            url = url_group[i].a.get('href').replace('detail', 'http://www.yvogue.com/product/detail')
            url_list.append(url)
    except Exception:
        print('Extraction failed on page:', url)
url_list = list(set(url_list))
print(url_list)
print('Total number of SKUs crawled:', len(url_list))
# finished crawling the SKU links
#--------------------------------------------------------------------------------------------------------------------end
# Section2
# Crawl the brand list -----------------------------------------------------------------------------------------begin
url_brand ='http://www.yvogue.com/product/search'
brand_list_result = []
headers = {
    'user-agent': random.choice(USER_AGENTS)}  # pick a random User-Agent
response = requests.get(url=url_brand, headers=headers, timeout=30)
response.encoding = chardet.detect(response.content)['encoding']
text = response.text
soup = BeautifulSoup(text, 'lxml')
brand_list = soup.find_all('a', attrvaltarget='brandId')
for i in range(len(brand_list)):
    brand_name = brand_list[i].span.get_text()
    brand_list_result.append(brand_name)          # brand as displayed
    brand_list_result.append(brand_name.lower())  # lowercase variant for matching
brand_list_result = list(set(brand_list_result))  # deduplicate
print(brand_list_result)
print(len(brand_list_result))
# Brand list crawled -------------------------------------------------------------------------------------------end
# Section3: with the URL and brand lists in hand, crawl and clean the data, then write it into the database.
url_fail=[]
for url in url_list:
    try:
        headers = {
            'user-agent': random.choice(USER_AGENTS)}  # pick a random User-Agent
        response = requests.get(url=url, headers=headers, timeout=60)
        response.encoding = chardet.detect(response.content)['encoding']
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        # 1. url: the product link itself
        type_of_size = soup.table.th.get_text()  # 2. sizing standard; 'TU' means one-size
        try:
            type_item = soup.find_all('li', class_='breadcrumb-item')
            type1_item = type_item[0].a.get_text()  # 3. primary category
            type2_item = type_item[1].a.get_text()  # 4. secondary category
        except Exception:
            type1_item = 'no data on site'
            type2_item = 'no data on site'
        product_name = soup.find('div', class_='mainname heading-block topmargin-sm').h4.get_text()  # 5. product name: brand plus article number
        for brand_example in brand_list_result:
            if brand_example in product_name:
                brand = brand_example  # 6. brand
                sku_code = product_name.replace(brand_example, '')  # 7. article number; may be incomplete for some items
        retail_price = int(
            soup.find('div', class_='mprice').span.get_text().replace('$', '').replace(' ', ''))  # 8. 013 retail price
        supply_price = int(retail_price * 0.7)  # 9. approximate 013 supply price
        ue_price = int(soup.find('div', class_='tprice').span.get_text().replace('€', ''))  # 10. European price in €
        supplier = '013'  # 11. supplier 013
        repl = [' ', '.', '_', '£', '#', '%', '&', '!', '/', '@', '$', '^', '`', '~', '+', '=', '(', ')', '?', '-']
        filter_sku_code = sku_code.upper().replace('O', '0').replace('I', '1').replace('Z', '2')
        for f in range(len(repl)):
            filter_sku_code = filter_sku_code.replace(repl[f], '')  # 12. primary-key article number (filtered)
        picture_org = soup.find('div', class_='selectors').find_all('a')
        # extract images -----------------------------------------------------------------------begin  # 13. pictures 1-9
        picture = ['pic1', 'pic2', 'pic3', 'pic4', 'pic5', 'pic6', 'pic7', 'pic8', 'pic9']
        for i in range(9):
            try:
                picture[i] = picture_org[i].get('href')
            except IndexError:
                picture[i] = ''  # fewer than 9 images: pad with empty strings
        # extract images -------------------------------------------------------------------------end
        crawl_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # 14. crawl time
        task_id = datetime.now().strftime('%Y-%m-%d')  # 15. task_id
        chicun = soup.table.tbody.find_all('tr')  # 16-17. stock for each size
        for row in chicun:
            size = row.th.get_text()
            size_number = row.td.get_text()
            # write one row per size into the database
            # begin----------------------------------------------------------------------------------------------------
            conn = pymysql.connect(host='rm-bp196rexhw26efmo.mysql.rds.aliyuncs.com', user='evan_junhui',
                                   password='mima',
                                   db='supplier', port=3306, charset='utf8')
            cursor = conn.cursor()
            sql = '''insert into supplier_013 (url,type_of_size,type1_item,type2_item,product_name,brand,sku_code,retail_price,supply_price,ue_price,supplier,filter_sku_code,pic1,pic2,pic3,pic4,pic5,pic6,pic7,pic8,pic9,crawl_time,task_id,size,size_number)
                     values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
            params = (url, type_of_size, type1_item, type2_item, product_name, brand, sku_code, retail_price,
                      supply_price, ue_price, supplier, filter_sku_code, picture[0], picture[1], picture[2],
                      picture[3], picture[4], picture[5], picture[6], picture[7], picture[8], crawl_time,
                      task_id, size, size_number)
            try:
                # passing params separately lets pymysql quote the values;
                # formatting them into the SQL string breaks on quotes in the data
                cursor.execute(sql, params)
                conn.commit()
                print('Row written')
            except Exception:
                conn.rollback()  # rollback belongs to the connection, not the cursor
                print('Write failed')
            cursor.close()
            conn.close()
            # ------------------------------------------------------------------------------------------------------end
    except Exception:
        print('Failed to crawl:', url)
        url_fail.append(url)
print('Number of SKUs that failed:', len(url_fail))
print(url_fail)