Introduction
After learning the basics of web scraping, many beginners discover that no single method can extract the data they want on its own; techniques have to be combined, much like a hard exam problem that a single formula or fact cannot solve. In my own work, find_all, Tag navigation, and find cover well over 95% of scraping needs; regular expressions handle most of the rest.
Explanation
Below is the complete code for scraping product information and stock levels from one site. (As an aside, this site brings the company over 3 million in revenue; leave a comment if you are curious about the logic behind that.) If the basics are not solid yet, look them up first; they are not explained here.
Code example 1:
type_item = soup.find_all('li', class_='breadcrumb-item')
type1_item = type_item[0].a.get_text()  # 3. primary category
type2_item = type_item[1].a.get_text()  # 4. secondary category
Logic: find_all + Tag
Explanation: find_all returns every <li> with class='breadcrumb-item'; indexing into the result and navigating through the .a Tag then yields each link's text. (With a fixed two breadcrumb levels, direct indexing is enough; when the number of elements varies, loop over the list as in example 2.)
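A minimal, self-contained sketch of the find_all + Tag pattern from example 1. The breadcrumb HTML below is invented for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical breadcrumb markup, mirroring class='breadcrumb-item' above.
html = """
<ol>
  <li class="breadcrumb-item"><a href="/women">Women</a></li>
  <li class="breadcrumb-item"><a href="/women/bags">Bags</a></li>
</ol>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of matching <li> Tags; indexing into it and
# using the .a attribute drills down to the nested <a> element.
items = soup.find_all('li', class_='breadcrumb-item')
type1 = items[0].a.get_text()  # primary category
type2 = items[1].a.get_text()  # secondary category
print(type1, type2)
```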
Code example 2:
chicun = soup.table.tbody.find_all('tr')  # 16-17. stock for each size
for row in chicun:
    size = row.th.get_text()
    size_number = row.td.get_text()
Logic: find_all + Tag
Explanation: find_all collects every <tr> in the table body; a for loop then walks the rows, and Tag navigation (.th, .td) extracts each row's size and stock text. (A for loop is required here because the number of rows varies.)
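The looped find_all + Tag pattern from example 2 can be sketched like this. The table markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical size/stock table, mirroring the <tr><th><td> layout above.
html = """
<table><tbody>
  <tr><th>S</th><td>3</td></tr>
  <tr><th>M</th><td>0</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

stock = {}
# soup.table.tbody navigates by Tag name; find_all('tr') returns the rows.
for row in soup.table.tbody.find_all('tr'):
    size = row.th.get_text()      # size label from the <th>
    qty = int(row.td.get_text())  # stock count from the first <td>
    stock[size] = qty
print(stock)
```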
Code example 3:
ue_price = int(soup.find('div', class_='tprice').span.get_text().replace('€', ''))  # 10. European price in €
Logic: find + Tag
Explanation: find returns the first <div> with class='tprice'; Tag navigation (.span) then reaches the nested price text.
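Example 3 in isolation, with invented price markup, looks like this:

```python
from bs4 import BeautifulSoup

# Hypothetical price markup, mirroring class='tprice' above.
html = '<div class="tprice">EU price: <span>€129</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching Tag (or None if nothing matches);
# .span then selects the nested <span>, and the currency symbol is
# stripped before converting to int.
div = soup.find('div', class_='tprice')
ue_price = int(div.span.get_text().replace('€', ''))
```

Note that find returns None on no match, so a missing element raises AttributeError on `.span`; the full script below relies on try/except for that case.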
Summary
In practice, most hard-to-locate elements yield to some combination of find_all, Tag navigation, and find. (No further examples here; leave a comment if you need more or if anything is unclear.)
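For the remaining few percent the introduction mentions, BeautifulSoup also accepts a compiled regular expression as a filter, which combines naturally with find_all. The class names here are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with a family of related class names.
html = """
<span class="price-usd">$35</span>
<span class="price-eur">€30</span>
"""
soup = BeautifulSoup(html, 'html.parser')

# A compiled regex as the class_ filter matches every <span> whose
# class starts with "price-", without listing each class by hand.
tags = soup.find_all('span', class_=re.compile(r'^price-'))
prices = [t.get_text() for t in tags]
```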
The complete code follows:
import random
import requests
from bs4 import BeautifulSoup
import chardet
import re
import pymysql
from datetime import datetime
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Opera/8.0 (Windows NT 5.1; U; en)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
url_list = []
# Section1
# Crawl the 013 product URLs and put them into url_list first
# begin------------------------------------------------------------------------------------------------------------------
for page in range(1, 80):
    url = 'http://www.yvogue.com/product/search?pageNo={}'.format(page)
    print('Crawling url:', url)
    try:
        headers = {
            'user-agent': random.choice(USER_AGENTS)}  # pick a random User-Agent
        response = requests.get(url=url, headers=headers, timeout=30)
        response.encoding = chardet.detect(response.content)['encoding']
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        url_group = soup.find_all('h6')
        for i in range(len(url_group)):
            url = url_group[i].a.get('href').replace('detail', 'http://www.yvogue.com/product/detail')
            url_list.append(url)
    except Exception:
        print('Extraction failed on page:', url)
url_list = list(set(url_list))
print(url_list)
print('Total number of SKUs crawled:', len(url_list))
# finished crawling the SKU links
#--------------------------------------------------------------------------------------------------------------------end
# Section2
# Crawl the brand list -----------------------------------------------------------------------------------------begin
url_brand ='http://www.yvogue.com/product/search'
brand_list_result = []
headers = {
    'user-agent': random.choice(USER_AGENTS)}  # pick a random User-Agent
response = requests.get(url=url_brand, headers=headers, timeout=30)
response.encoding = chardet.detect(response.content)['encoding']
text = response.text
soup = BeautifulSoup(text, 'lxml')
brand_list = soup.find_all('a', attrvaltarget='brandId')
for i in range(len(brand_list)):
    brand_name = brand_list[i].span.get_text()
    brand_list_result.append(brand_name)          # brand as displayed
    brand_list_result.append(brand_name.lower())  # lowercase variant for matching
brand_list_result = list(set(brand_list_result))  # deduplicate
print(brand_list_result)
print(len(brand_list_result))
# Brand list crawled -------------------------------------------------------------------------------------------end
# Section3: with the URL and brand lists in hand, crawl and clean the data, then write it into the database.
url_fail=[]
for url in url_list:
    try:
        headers = {
            'user-agent': random.choice(USER_AGENTS)}  # pick a random User-Agent
        response = requests.get(url=url, headers=headers, timeout=60)
        response.encoding = chardet.detect(response.content)['encoding']
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        # 1. url: the product link itself
        type_of_size = soup.table.th.get_text()  # 2. sizing standard; 'TU' means one-size
        try:
            type_item = soup.find_all('li', class_='breadcrumb-item')
            type1_item = type_item[0].a.get_text()  # 3. primary category
            type2_item = type_item[1].a.get_text()  # 4. secondary category
        except Exception:
            type1_item = 'no data on site'
            type2_item = 'no data on site'
        product_name = soup.find('div', class_='mainname heading-block topmargin-sm').h4.get_text()  # 5. product name: brand plus article number
        for brand_example in brand_list_result:
            if brand_example in product_name:
                brand = brand_example  # 6. brand
                sku_code = product_name.replace(brand_example, '')  # 7. article number; may be incomplete for some items
        retail_price = int(
            soup.find('div', class_='mprice').span.get_text().replace('$', '').replace(' ', ''))  # 8. 013 retail price
        supply_price = int(retail_price * 0.7)  # 9. approximate 013 supply price
        ue_price = int(soup.find('div', class_='tprice').span.get_text().replace('€', ''))  # 10. European price in €
        supplier = '013'  # 11. supplier 013
        repl = [' ', '.', '_', '£', '#', '%', '&', '!', '/', '@', '$', '^', '`', '~', '+', '=', '(', ')', '?', '-']
        filter_sku_code = sku_code.upper().replace('O', '0').replace('I', '1').replace('Z', '2')
        for f in range(len(repl)):
            filter_sku_code = filter_sku_code.replace(repl[f], '')  # 12. primary-key article number (filtered)
        picture_org = soup.find('div', class_='selectors').find_all('a')
        # extract images -----------------------------------------------------------------------begin  # 13. pictures 1-9
        picture = ['pic1', 'pic2', 'pic3', 'pic4', 'pic5', 'pic6', 'pic7', 'pic8', 'pic9']
        for i in range(9):
            try:
                picture[i] = picture_org[i].get('href')
            except IndexError:
                picture[i] = ''  # fewer than 9 images: pad with empty strings
        # extract images -------------------------------------------------------------------------end
        crawl_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # 14. crawl time
        task_id = datetime.now().strftime('%Y-%m-%d')  # 15. task_id
        chicun = soup.table.tbody.find_all('tr')  # 16-17. stock for each size
        for row in chicun:
            size = row.th.get_text()
            size_number = row.td.get_text()
            # write one row per size into the database
            # begin----------------------------------------------------------------------------------------------------
            conn = pymysql.connect(host='rm-bp196rexhw26efmo.mysql.rds.aliyuncs.com', user='evan_junhui',
                                   password='mima',
                                   db='supplier', port=3306, charset='utf8')
            cursor = conn.cursor()
            sql = '''insert into supplier_013 (url,type_of_size,type1_item,type2_item,product_name,brand,sku_code,retail_price,supply_price,ue_price,supplier,filter_sku_code,pic1,pic2,pic3,pic4,pic5,pic6,pic7,pic8,pic9,crawl_time,task_id,size,size_number)
                     values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
            params = (url, type_of_size, type1_item, type2_item, product_name, brand, sku_code, retail_price,
                      supply_price, ue_price, supplier, filter_sku_code, picture[0], picture[1], picture[2],
                      picture[3], picture[4], picture[5], picture[6], picture[7], picture[8], crawl_time,
                      task_id, size, size_number)
            try:
                # passing params separately lets pymysql quote the values;
                # formatting them into the SQL string breaks on quotes in the data
                cursor.execute(sql, params)
                conn.commit()
                print('Row written')
            except Exception:
                conn.rollback()  # rollback belongs to the connection, not the cursor
                print('Write failed')
            cursor.close()
            conn.close()
            # ------------------------------------------------------------------------------------------------------end
    except Exception:
        print('Failed to crawl:', url)
        url_fail.append(url)
print('Number of SKUs that failed:', len(url_fail))
print(url_fail)