The approach is the same as for the Sohu car library: first collect the link for each model, then iterate over those links to scrape the data we need. Crawling the NetEase car database, however, takes far less time than Sohu Auto, because NetEase's data is already present in the raw HTML and can be scraped without rendering, whereas Sohu's data only appears after the page has been rendered.
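Because the series ids sit directly in the static HTML, a plain regex is enough to pull them out. A self-contained sketch against a made-up snippet (the tag shape mirrors what step 1 matches; the id value here is illustrative):

```python
import re

# Hypothetical fragment of the brand-index HTML (invented for illustration);
# the series id appears in both the id attribute and the _seriseId attribute.
sample = '<a href="/series/16979" id="16979" _seriseId="16979">ix35</a>'

# Same pattern the step-1 code uses to extract series ids
pattern = re.compile('<a.*?id="(.*?)".*?_seriseId=.*?</a>')
print(re.findall(pattern, sample))  # expect ['16979']
```

No browser or JavaScript engine is involved, which is why this crawl is so much faster than the rendered Sohu pages.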
Step 1: Get the brand links
import requests
import re

url = 'http://product.auto.163.com/'

def getHtml(url):
    # Fetch a page; the site serves GBK-encoded HTML
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    html = requests.get(url, headers=headers)
    html.encoding = 'GBK'
    return html.text

def cutstr(html):
    # Extract each series id from <a> tags carrying a _seriseId attribute
    pattern = re.compile('<a.*?id="(.*?)".*?_seriseId=.*?</a>')
    return re.findall(pattern, html)

def gotoFile():
    # Write one series URL per line for step 2 to consume
    html = getHtml(url)
    with open('wangyicar3.txt', 'w', encoding='utf-8') as f:
        for i in cutstr(html):
            link = 'http://product.auto.163.com/series/' + i + '.html#008B00'
            f.write(link + '\n')

gotoFile()
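Step 2 then consumes the file written above. A minimal sketch of reading the saved links back (`load_links` is a helper name of my own, not from the original code; the file name assumes the `wangyicar3.txt` written in step 1):

```python
def load_links(path='wangyicar3.txt'):
    # One series URL per line; skip any blank lines
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]
```

Iterating over `load_links()` gives exactly the per-series URLs that the step-2 crawler needs to visit.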
Step 2: Get the link for each model
import requests
import re
# url = 'http://product.auto.163.com/series/16979.html#008B00'
def getHtml(url):
    data = {'test': 'data'}
    headers = {'User-Agent'