Python in Practice 6
Goal: scrape the sales volume of a given Taobao item.
Approach: open Taobao, search for the target item's keyword,
then look at the resulting URL and strip off the useless suffix:
https://s.taobao.com/search?q=零基础入门学python&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20181115&ie=utf8
After removing the useless suffix, the link becomes:
https://s.taobao.com/search?q=零基础入门学python
Then click to sort by sales volume:
https://s.taobao.com/search?q=零基础入门学python&sort=sale-desc
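Incidentally, requests builds this kind of query string for us from a plain dict, so we never have to paste the encoded URL by hand. A quick offline sketch with only the standard library (no request is sent; the keyword is the book title from the URL above):

```python
from urllib.parse import urlencode

# This mirrors what requests does internally when you pass a params dict:
# each key/value pair is percent-encoded and joined with '&'.
payload = {'q': '零基础入门学python', 'sort': 'sale-desc'}
url = 'https://s.taobao.com/search?' + urlencode(payload)
print(url)
```

Note that the Chinese keyword comes out percent-encoded, which is what the server actually receives.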
Then download and save the contents of this page:
import requests

def open_url(keywords):
    payload = {'q': keywords, 'sort': 'sale-desc'}
    url = "https://s.taobao.com/search"
    res = requests.get(url, params=payload)
    return res

def main():
    keywords = input("Please enter a keyword: ")
    res = open_url(keywords)
    # Save the raw page so we can inspect it offline
    with open('item.txt', 'w', encoding='utf-8') as file:
        file.write(res.text)

if __name__ == '__main__':
    main()
Inspecting the downloaded page, we find that the contents of one field (g_page_config) are exactly what we need.
What are we waiting for? Let's use the re module to extract that big chunk:
import re

def main():
    # First extract the data block from the page we saved earlier
    with open("item.txt", 'r', encoding="utf-8") as file1:
        g_page_config = re.search(r'g_page_config = (.*?);\n', file1.read())
    with open("g_page_config.txt", 'w', encoding='utf-8') as file2:
        file2.write(g_page_config.group(1))

if __name__ == "__main__":
    main()
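The non-greedy `(.*?)` stops at the first `;` followed by a newline, which cleanly slices the JSON literal out of the surrounding JavaScript. A minimal offline check of the pattern on an invented snippet that mimics the real page's structure:

```python
import re
import json

# Toy stand-in for the real response body (the actual g_page_config
# block is far larger, but has the same "name = value;" shape).
sample = 'var x = 1;\ng_page_config = {"mods": {"itemlist": {}}};\nvar y = 2;\n'

match = re.search(r'g_page_config = (.*?);\n', sample)
data = json.loads(match.group(1))  # the captured text is valid JSON
print(data)
```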
Next, let's print out the keys of the extracted content, which is really a dictionary:
import re
import json

def find_keys(targets):
    for each in targets:  # iterating a dict yields its keys
        print(each)
        if type(targets[each]) is dict:
            find_keys(targets[each])  # recurse into nested dictionaries

def main():
    with open("item.txt", "r", encoding="utf-8") as file:
        # Grab the big g_page_config block
        g_page_config = re.search(r'g_page_config = (.*?);\n', file.read())
    # Convert the captured JSON into a Python data structure
    page_config_json = json.loads(g_page_config.group(1))
    find_keys(page_config_json)

if __name__ == "__main__":
    main()
The output is shown in the figure below:
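As a quick offline sanity check of the traversal, here is the same idea with keys collected into a list instead of printed (`collect_keys` and the sample dict are invented for illustration):

```python
def collect_keys(targets, found=None):
    # Same traversal as find_keys above, but returning the keys
    # so the result is easy to inspect programmatically.
    if found is None:
        found = []
    for each in targets:
        found.append(each)
        if isinstance(targets[each], dict):
            collect_keys(targets[each], found)
    return found

sample = {'mods': {'itemlist': {'data': {'auctions': []}}}, 'pageName': 'mainsrp'}
print(collect_keys(sample))
```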
To make this easier to read, let's print the keys with their hierarchy:
import re
import json

def get_space_end(level):
    # Leaf key: indent and mark with '-'
    return ' ' * level + '-'

def get_space_expand(level):
    # Expandable key (nested dict): indent and mark with '+'
    return ' ' * level + '+'

def find_keys(targets, level):
    for each in targets:
        if type(targets[each]) is not dict:
            print(get_space_end(level) + each)
        else:
            print(get_space_expand(level) + each)
            find_keys(targets[each], level + 1)

def main():
    with open("item.txt", "r", encoding="utf-8") as file:
        g_page_config = re.search(r'g_page_config = (.*?);\n', file.read())
    page_config_json = json.loads(g_page_config.group(1))
    find_keys(page_config_json, 1)

if __name__ == "__main__":
    main()
The output is shown in the figure below:
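One caveat: find_keys only recurses into dicts, so keys hidden inside list values never show up in the printout. A hedged variant that also descends into lists of dicts (`outline` and the sample dict are invented for illustration):

```python
def outline(targets, level=1, lines=None):
    # Like find_keys above, but also walks into lists, so keys nested
    # inside list values are included in the outline.
    if lines is None:
        lines = []
    if isinstance(targets, dict):
        for key, value in targets.items():
            marker = '+' if isinstance(value, (dict, list)) else '-'
            lines.append(' ' * level + marker + key)
            outline(value, level + 1, lines)
    elif isinstance(targets, list):
        for item in targets:
            outline(item, level, lines)
    return lines

sample = {'data': {'auctions': [{'nid': '1', 'title': 't'}]}}
print('\n'.join(outline(sample)))
```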
Carefully analyzing the downloaded item.txt, we find that our target data lives under the "auctions" key. Looking closer, the value of "auctions" is actually a list, and each of its elements is itself a nested dictionary! From the hierarchy printed above, the nesting path is "mods" -> "itemlist" -> "data" -> "auctions". So we first extract the value of "auctions", then pull out its fields and work out what each one actually means. It is not hard to see that:
"nid" – the item's ID
"title" – the item's title
"detail_url" – the item's link
"view_price" – the item's price
"view_sales" – the item's sales volume
"nick" – the seller's name
And with that we've found our final target!
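Note that "view_sales" comes back as a display string (something like "1200人付款") rather than a bare number. One hedged way to pull out the count, matching the regex approach used in the final script (the sample strings and the helper name `parse_sales` are invented for illustration):

```python
import re

def parse_sales(view_sales):
    # Extract the leading digits from a display string such as "1200人付款".
    # Caveat: values like "1.2万人付款" would need extra handling, since
    # this pattern would only capture the "1".
    match = re.search(r'\d+', view_sales)
    return int(match.group()) if match else 0

print(parse_sales('1200人付款'))
print(parse_sales('96人付款'))
```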
import requests
import re
import json

def open_url(keyword, page=1):
    # &s=0 means start from the 1st item; each page holds 44 items,
    # so &s=44 is the second page
    # &sort=sale-desc sorts by sales volume, descending
    payload = {'q': keyword, 's': str((page - 1) * 44), "sort": "sale-desc"}
    url = "https://s.taobao.com/search"
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
    res = requests.get(url, params=payload, headers=headers)
    return res

# Collect every item on one results page
def get_items(res):
    g_page_config = re.search(r'g_page_config = (.*?);\n', res.text)
    page_config_json = json.loads(g_page_config.group(1))
    page_items = page_config_json['mods']['itemlist']['data']['auctions']

    # Keep only the fields we care about (ID, title, link, price, sales, seller)
    results = []
    for each_item in page_items:
        dict1 = dict.fromkeys(('nid', 'title', 'detail_url', 'view_price', 'view_sales', 'nick'))
        dict1['nid'] = each_item['nid']
        dict1['title'] = each_item['title']
        dict1['detail_url'] = each_item['detail_url']
        dict1['view_price'] = each_item['view_price']
        dict1['view_sales'] = each_item['view_sales']
        dict1['nick'] = each_item['nick']
        results.append(dict1)
    return results

# Total up the sales of the matching items on one page
def count_sales(items):
    count = 0
    for each in items:
        if '小甲鱼' in each['title']:  # only count listings for this author's book
            count += int(re.search(r'\d+', each['view_sales']).group())
    return count

def main():
    keyword = input("Please enter a search keyword: ")
    length = 3  # number of result pages to scan
    total = 0
    for each in range(length):
        res = open_url(keyword, each + 1)
        items = get_items(res)
        total += count_sales(items)
    print("Total sales:", total)

if __name__ == "__main__":
    main()
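The network calls above can't be checked offline, but the aggregation step can. Feeding invented items through the same filter-and-sum logic that count_sales uses (all titles and sales figures below are made up):

```python
import re

# Offline check of the count_sales aggregation, on invented data.
items = [
    {'title': '零基础入门学Python 小甲鱼', 'view_sales': '1200人付款'},
    {'title': '某本无关的书', 'view_sales': '999人付款'},
    {'title': '小甲鱼 Python 视频教程', 'view_sales': '88人付款'},
]

total = 0
for each in items:
    if '小甲鱼' in each['title']:  # same filter as count_sales
        total += int(re.search(r'\d+', each['view_sales']).group())

print(total)  # 1200 + 88; the unrelated listing is skipped
```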