Crawler Notes: A Worked Example of a Targeted Crawler for Taobao Product Prices

Latest recommended article published on 2024-07-25 19:59:32

嘭啦啦啦啦塵

Category: Python Crawler Study Notes

Article link: https://blog.csdn.net/qq_40405370/article/details/81005212

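Functional description:

Goal: fetch Taobao search result pages and extract the product names and prices from them.

Key points: how Taobao's search interface works and how pagination is handled.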

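How the search interface and pagination map to URL parameters:

In Google Chrome, open Taobao, search for 书包 (schoolbag) to reach the search results page, then click "next page".

First results page for 书包: https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180710&ie=utf8

Second page: https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180710&ie=utf8&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44

Third page: https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180710&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88

Comparing the three URLs: the q parameter carries the search keyword in percent-encoded UTF-8 (%E4%B9%A6%E5%8C%85 is 书包), and the s parameter grows by 44 per page (absent on the first page, then 44, then 88). So s is the offset of the first item on the page, and each page holds 44 items.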
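The URL pattern above can be sketched with the standard library alone. This is a minimal illustration, not the code used later in the post: the helper name build_page_url and the constant names are made up here, and only the q and s parameters are kept, on the assumption that the other parameters in the captured URLs are not needed.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://s.taobao.com/search"  # base URL taken from the captured links
ITEMS_PER_PAGE = 44  # inferred from s=44 and s=88 in the captured URLs


def build_page_url(keyword, page_index):
    """Return the search URL for a zero-based page index (hypothetical helper)."""
    params = {"q": keyword, "s": page_index * ITEMS_PER_PAGE}
    # urlencode percent-encodes the keyword as UTF-8, matching the q value seen above.
    return SEARCH_URL + "?" + urlencode(params)


for i in range(3):
    print(build_page_url("书包", i))
```

Letting urlencode handle the keyword avoids hand-building the percent-encoded string for non-ASCII search terms.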

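Program structure

Step 1: submit the product search request and fetch the result pages in a loop.

Step 2: for each page, extract the product names and prices.

Step 3: print the information to the screen.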

Writing the targeted crawler for Taobao price comparison

#CrowTaobaoPrice.py
import requests
import re


def getHTMLText(url):
    # Fetch a page and return its text; return "" if anything goes wrong.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
        return r.text
    except:
        return ""


def parsePage(ilt, html):
    # Append a [price, title] pair to ilt for every product found in html.
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Each match looks like "key":"value"; eval removes the quotes around the value.
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")


def printGoodsList(ilt):
    # Print the collected products as a numbered table.
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '书包'  # search keyword
    depth = 3  # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)  # 44 items per page
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)


main()

Analysis of the example

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

r = requests.get(url,timeout=30)

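requests.get: the main function for fetching an HTML page (part of the requests library).

r = requests.get(url, params=None, **kwargs)

url: the address of the page to fetch.
params: extra query parameters to append to the url, given as a dictionary or a byte sequence; optional.
**kwargs: 12 optional keyword arguments that control the request.

One of the **kwargs arguments:

timeout: the timeout in seconds. If the server does not respond within that time after the GET request is sent, requests raises a timeout exception (requests.exceptions.Timeout).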

r.raise_for_status()

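r.status_code: the status code of the HTTP response; 200 means the request succeeded. r.raise_for_status() inspects r.status_code inside the method and raises requests.exceptions.HTTPError when the code is a 4xx or 5xx error. A failed request therefore lands in the except branch instead of being parsed as if it were a normal page.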
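The behaviour of raise_for_status() can be checked without touching the network. The sketch below builds Response objects by hand and sets their status code, which is only a testing shortcut and not how responses normally come into being; the helper name outcome is made up for this illustration.

```python
import requests


def outcome(status_code):
    """Report what raise_for_status() does for a given status code."""
    r = requests.Response()  # built by hand, so no network access is needed
    r.status_code = status_code
    try:
        r.raise_for_status()
        return "no exception"
    except requests.HTTPError:
        return "HTTPError"


for code in (200, 304, 404, 500):
    print(code, outcome(code))
```

Only the 4xx and 5xx codes raise; a 304 passes through silently, so the check is "is this an error status", not "is this exactly 200".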

r.encoding = r.apparent_encoding

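r.encoding: the encoding of the response body as guessed from the HTTP headers.
r.apparent_encoding: the encoding detected by analysing the body itself (the fallback encoding).
Assigning r.apparent_encoding to r.encoding makes r.text decode the body with the detected encoding, which avoids garbled text when the headers declare no charset or the wrong one.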
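Why the chosen encoding matters can be seen with plain bytes, independent of requests. In this small sketch the byte string stands in for a response body, and ISO-8859-1 stands in for a wrong guess (as far as I know, it is the default requests falls back to for text responses whose headers name no charset); the helper name decode_body is made up.

```python
def decode_body(raw, encoding):
    """Decode response bytes with the given encoding (hypothetical helper)."""
    return raw.decode(encoding)


raw = "书包".encode("utf-8")  # bytes as a server might send them

wrong = decode_body(raw, "iso-8859-1")  # wrong guess: every byte becomes one character
right = decode_body(raw, "utf-8")  # correct encoding: the original text comes back

print(repr(wrong))
print(right)
```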

def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)

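re.findall(): scans the string and returns every non-overlapping match as a list. The names view_price and raw_title come from the page source: right-click the search results page and choose "view source"; the product price is stored under the key view_price and the product name under the key raw_title. The pieces of the pattern are:

'\"': matches a double quote.
'view_price': matches the literal text view_price.
'\:': matches a colon.
'[\d\.]': a character class that matches one digit or one dot.
'*': repeats the preceding character class zero or more times, so '[\d\.]*' matches a price such as 128.00.

The product name pattern works the same way. Its '.*?' is the non-greedy form of '.*': it matches as few characters as possible, so the match stops at the first closing double quote instead of running on to a later one (see the figure in the original post).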
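The patterns can be tried on a small string. The sample below is made up to resemble the key/value layout described above and is not real Taobao output. As a variation on the post's patterns, the sketch uses capture groups so that findall returns only the values, and drops the optional backslashes before the quotes and colon.

```python
import re

# Made-up fragment shaped like the data described above; not real Taobao output.
sample = ('{"raw_title":"双肩书包 学生","view_price":"128.00"},'
          '{"raw_title":"旅行包","view_price":"59.90"}')

# With one capture group, findall returns the captured text of each match.
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)

# Greedy version for comparison: .* runs on to the last double quote in the string.
greedy = re.findall(r'"raw_title":"(.*)"', sample)

print(prices)
print(titles)
print(len(greedy))
```

The greedy pattern yields a single oversized match that swallows both products, which is why the non-greedy form is needed for the title.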

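r'...': the r prefix marks a raw string. Python leaves the backslashes in a raw string untouched, so the pattern reaches the re module exactly as written. Without the prefix, every backslash meant for the regular expression has to be doubled: to pass \n to the regex engine you would write '\\n', whereas with the prefix you can write r'\n' directly.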
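A short check makes the difference between raw and ordinary string literals concrete. The variable names are arbitrary.

```python
import re

plain = '\n'  # ordinary literal: a single newline character
raw = r'\n'  # raw literal: two characters, a backslash and the letter n

print(len(plain), len(raw))

# A raw literal is the same string as the ordinary literal with doubled backslashes.
same_pattern = (r'\d+' == '\\d+')
print(same_pattern)

digits = re.findall(r'\d+', 'a12b3')
print(digits)
```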

eval(plt[i].split(':')[1])

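eval(): evaluates a string as a Python expression and returns the result. Here the string is a quoted literal such as "128.00", so evaluating it simply yields the text without the surrounding double quotes.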

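plt[i].split(':')[1]: split(':') splits a string at every occurrence of the literal separator ":" (it is not a regular expression) and returns the pieces as a list. Here it splits the i-th string of the list plt, and index [1] picks the piece after the first colon, which is the quoted value.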
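The split step can be inspected on two sample matches. Both strings are made up; the second has a colon inside the title to show where split(':')[1] falls short. The helper parse_pair is a hypothetical alternative that reads the match as JSON instead of calling eval; it is not the method used in the post, and it may be preferable because eval executes whatever expression the downloaded page happens to contain.

```python
import json


def parse_pair(match):
    """Turn a '"key":"value"' match into (key, value) without eval (hypothetical helper)."""
    obj = json.loads("{" + match + "}")  # wrap the pair in braces to form a JSON object
    (key, value), = obj.items()
    return key, value


price_match = '"view_price":"128.00"'
title_match = '"raw_title":"书包:大容量"'  # made-up title containing a colon

print(price_match.split(':'))  # two pieces: the key and the quoted value
print(title_match.split(':'))  # three pieces: index [1] is only part of the title

print(parse_pair(price_match))
print(parse_pair(title_match))
```

With the second sample, index [1] holds a string that opens a quote and never closes it, so eval would raise a SyntaxError there, while the JSON route returns the full title.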

append(): adds the object passed in to the end of an existing list.