Scraping the Steam Specials Top-Sellers List with Python
Use Python to scrape Steam's special-offer listings from https://store.steampowered.com/search/?os=win&specials=1&filter=topsellers
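import requests
from bs4 import BeautifulSoup


def Get_html(url):
    """Download the page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def Fill_html(gList, html):
    """Append titles, discounts, original prices, current prices and links to gList, in that order."""
    soup = BeautifulSoup(html, "html.parser")
    a = soup.find_all(name="a", attrs={"data-search-page": "1"})      # result rows (carry the link)
    b = soup.find_all(name="span", attrs={"class": "title"})          # game titles
    c = soup.find_all(name="span", attrs={"style": "color: #888888;"})  # struck-through original prices
    d = soup.find_all(name="div", attrs={"class": "col search_discount responsive_secondrow"})  # discounts
    for i in b:
        gList.append(i.string)
    for i in d:
        gList.append(i.text)
    for i in c:
        gList.append(i.string)
    for i in c:
        # span -> <br> -> text node holding the current price
        gList.append(i.next_sibling.next_sibling)
    for i in a:
        gList.append(i["href"])


def Print_html(gList):
    # Field 6 is the fill character; chr(12288) is the full-width space, so CJK text lines up.
    d = "{0:{6}^2}\t{1:{6}^3}\t{2:{6}^3}\t{3:{6}^3}\t{4:{6}<50}\t{5:{6}>15}"
    print(d.format("Rank", "Discount", "Original", "Current", "Game", "Link", chr(12288)))
    a = len(gList) // 5  # gList is assumed to hold five runs of equal length
    for i in range(a):
        print(d.format(
            i + 1,
            gList[a + i].strip(),       # discount
            gList[a * 2 + i],           # original price
            gList[a * 3 + i].strip(),   # current price
            gList[i].strip(),           # title
            gList[a * 4 + i].strip(),   # link
            chr(12288),
        ))


def main():
    url = "https://store.steampowered.com/search/?os=win&specials=1&filter=topsellers"
    getinfo = []
    html = Get_html(url)
    Fill_html(getinfo, html)
    Print_html(getinfo)


if __name__ == "__main__":
    main()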
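The crawler is built on the requests library and BeautifulSoup. It has three parts: a function that fetches the page and returns its text (Get_html), a function that extracts the chosen fields (Fill_html), and a function that prints them (Print_html).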
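Print_html relies on the flat list holding five runs of exactly the same length. If any result lacked one of the fields, for example a row with no struck-through price, the later columns would shift against the titles. A minimal sketch of one alternative, which builds one record per result row, is given here. The function name parse_rows, the dictionary keys, and the second sample row are hypothetical; the class names come from Steam's markup as used in the script.

```python
from bs4 import BeautifulSoup

# Trimmed sample; the second row is invented to show a result without a discount.
SAMPLE = """
<a href="https://store.steampowered.com/app/1118010/" data-search-page="1">
<span class="title">Monster Hunter World: Iceborne</span>
<div class="col search_discount responsive_secondrow"><span>-38%</span></div>
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>¥ 271</strike></span><br>¥ 168 </div>
</a>
<a href="https://example.com/full-price" data-search-page="1">
<span class="title">Hypothetical Full-Price Game</span>
<div class="col search_discount responsive_secondrow"></div>
<div class="col search_price responsive_secondrow">¥ 50</div>
</a>
"""


def parse_rows(html):
    """Return one dict per search result (hypothetical helper, not part of the script)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.find_all("a", attrs={"data-search-page": "1"}):
        title = a.find("span", class_="title")
        discount = a.find("div", class_="search_discount")
        price = a.find("div", class_="search_price")
        # Non-empty text pieces inside the price cell: [original, current] or just [current].
        pieces = list(price.stripped_strings) if price is not None else []
        rows.append({
            "title": title.get_text(strip=True) if title is not None else "",
            "discount": discount.get_text(strip=True) if discount is not None else "",
            "original": pieces[0] if len(pieces) > 1 else "",
            "current": pieces[-1] if pieces else "",
            "link": a.get("href", ""),
        })
    return rows


rows = parse_rows(SAMPLE)
for rank, row in enumerate(rows, start=1):
    print(rank, row["discount"], row["original"], row["current"], row["title"], row["link"])
```

Because every field is looked up inside its own row, a missing field becomes an empty string in that row only and cannot shift the other rows.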
Source analysis
The HTML source of one Steam special-offer item looks like this:
<a href="https://store.steampowered.com/app/1118010/Monster_Hunter_World_Iceborne/?snr=1_7_7_2300_150_1"
data-ds-appid="1118010" data-ds-itemkey="App_1118010" data-ds-tagids="[19,3859,1695,1685,9564,4026,1697]" data-ds-crtrids="[33273264,34827959]" onmouseover="GameHover( this, event, 'global_hover', {&quot;type&quot;:&quot;app&quot;,&quot;id&quot;:1118010,&quot;public&quot;:1,&quot;v6&quot;:1} );" onmouseout="HideGameHover( this, event, 'global_hover' )" class="search_result_row ds_collapse_flag "
data-search-page="1" data-gpnav="item">
<div class="col search_capsule"><img src="https://media.st.dl.pinyuncloud.com/steam/apps/1118010/capsule_sm_120.jpg?t=1605143784" srcset="https://media.st.dl.pinyuncloud.com/steam/apps/1118010/capsule_sm_120.jpg?t=1605143784 1x, https://media.st.dl.pinyuncloud.com/steam/apps/1118010/capsule_231x87.jpg?t=1605143784 2x"></div>
<div class="responsive_search_name_combined">
<div class="col search_name ellipsis">
<span class="title">Monster Hunter World: Iceborne</span>
<p>
<span class="platform_img win"></span> </p>
</div>
<div class="col search_released responsive_secondrow">2020年1月9日</div>
<div class="col search_reviewscore responsive_secondrow">
<span class="search_review_summary mixed" data-tooltip-html="褒贬不一<br>13,454 篇用户的游戏评测中有 52% 为好评。<br><br>此产品在一个或多个时间段内出现跑题评测活动。这些时间段内的评测已按您的偏好设置不计入此产品的评测分数。">
</span>
</div>
<div class="col search_price_discount_combined responsive_secondrow" data-price-final="16800">
<div class="col search_discount responsive_secondrow">
<span>-38%</span>
</div>
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>¥ 271</strike></span><br>¥ 168 </div>
</div>
</div>
<div style="clear: left;"></div>
</a>
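By analyzing this source we can work out how the parsing function should pick out each field.
We use soup.find_all(name, attrs={}) to search for specific HTML elements, where name is the tag name and attrs holds the attributes that distinguish the element. Take the link as an example: we look for tags whose name is a and whose attributes include {"data-search-page": "1"}, and the href attribute of each match is the link we want. The other fields are found the same way.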
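As a small offline illustration of find_all with name and attrs, the sketch below runs on a trimmed two-item sample instead of the live page. The second item, including its URL and title, is made up so the attribute filter has something to exclude.

```python
from bs4 import BeautifulSoup

# Trimmed sample; the second <a> is hypothetical and sits on "page 2".
SAMPLE = """
<a href="https://store.steampowered.com/app/1118010/" data-search-page="1">
  <span class="title">Monster Hunter World: Iceborne</span>
</a>
<a href="https://example.com/other" data-search-page="2">
  <span class="title">Hypothetical Game</span>
</a>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# name selects the tag; attrs keeps only tags carrying that attribute value.
page_one = soup.find_all(name="a", attrs={"data-search-page": "1"})
links = [a["href"] for a in page_one]

# Title spans are matched by class, regardless of which <a> they sit in.
titles = [s.string for s in soup.find_all(name="span", attrs={"class": "title"})]

print(links)
print(titles)
```

Only the first link passes the data-search-page filter, while both titles are returned because the class filter does not look at the parent tag.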
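When one tag contains several pieces we want to extract and those pieces are separated by tags, that is, when the nodes are siblings, we can move between them with the sibling attributes, for example:
i.next_sibling the next sibling node
i.previous_sibling the previous sibling node
i.next_siblings an iterator over all following sibling nodes
i.previous_siblings an iterator over all preceding sibling nodes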
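The price cell from the source above is a convenient place to try these attributes. The sketch below parses only that cell and steps from the struck-through span to the current price. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The price cell copied from the Steam sample: <span> original </span><br> current
PRICE_HTML = (
    '<div class="col search_price discounted responsive_secondrow">'
    '<span style="color: #888888;"><strike>¥ 271</strike></span><br>¥ 168 </div>'
)

soup = BeautifulSoup(PRICE_HTML, "html.parser")
span = soup.find("span", attrs={"style": "color: #888888;"})

# The span's only child is <strike>, so .string reaches through to its text.
original_price = span.string

# First sibling after the span is the <br> tag, the one after that is the text node.
br = span.next_sibling
current_price = br.next_sibling.strip()

# next_siblings walks every node after the span: here the <br> and the text.
following = list(span.next_siblings)

print(original_price, current_price, br.name, len(following))
```

Text nodes count as siblings too. Two hops are enough here only because there is no whitespace between </span> and <br>; if the markup had a line break there, next_sibling would return that whitespace string first.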