Analyzing the URL structure
1. Original URL structure
Page | URL |
---|---|
1 | https://www.kugou.com/yy/rank/home/1-8888.html?from=rank |
2 | https://www.kugou.com/yy/rank/home/2-8888.html?from=rank |
3 | https://www.kugou.com/yy/rank/home/3-8888.html?from=rank |
… | … |
23 | https://www.kugou.com/yy/rank/home/23-8888.html?from=rank |
There are 23 pages in total, holding 500 records.
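The page count can be sanity-checked with a little arithmetic. The sketch below assumes 22 songs per page; that figure is an assumption and is not stated in this article, so adjust it if the site shows a different number.

```python
import math

TOTAL_SONGS = 500     # from the article
SONGS_PER_PAGE = 22   # assumption: not stated in the article

# Number of pages needed to hold every song
pages = math.ceil(TOTAL_SONGS / SONGS_PER_PAGE)
# Songs left over for the final, partially filled page
last_page_count = TOTAL_SONGS - (pages - 1) * SONGS_PER_PAGE

print(pages, last_page_count)
```

Under that assumption the result agrees with the 23 pages listed in the table.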
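2. Analyzing the URL
The page URLs all follow the same pattern: https://www.kugou.com/yy/rank/home/{?}-8888.html?from=rank
{?}: the variable part, which is the page number
3. Building the list of URLs
Build the list with a for loop (written here as a list comprehension):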

    urls = ['https://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'.format(number) for number in range(1, 24)]  # 23 URLs in total
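The same list can also be produced by a small helper, so that the page count becomes a parameter. This is a minimal sketch: the names `URL_TEMPLATE` and `build_urls` are illustrative and do not appear in the original script. Only the URL pattern comes from the article.

```python
# Hypothetical helper names; only the URL pattern is taken from the article.
URL_TEMPLATE = 'https://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'


def build_urls(last_page=23):
    """Return the ranking-page URLs for pages 1..last_page (inclusive)."""
    # range() excludes its stop value, hence last_page + 1
    return [URL_TEMPLATE.format(page) for page in range(1, last_page + 1)]


urls = build_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Keeping the template in one constant means that a change to the pattern only has to be made in one place.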
Writing the code
The script needs the requests, BeautifulSoup and time libraries:

    import time

    import requests
    from bs4 import BeautifulSoup

    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}


    # Fetch one ranking page and print every song found on it
    def get_info(url):
        result = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(result.text, 'html.parser')
        ranks = soup.select('span.pc_temp_num')
        # Each <a> element carries both the song link and the "singer - song" title
        songs = soup.select('div.pc_temp_songlist > ul > li > a')
        durations = soup.select('div.pc_temp_songlist > ul > li > span.pc_temp_tips_r > span')
        # Walk the lists in parallel; the loop variable is not named `time`,
        # so it cannot be confused with the time module
        for rank, song, duration in zip(ranks, songs, durations):
            # Split on the first hyphen only; unlike split('-')[1], this does
            # not raise IndexError when a title contains no hyphen
            singer, _, name = song.get_text().strip().partition('-')
            data = {
                'rank': rank.get_text().strip(),
                'link': song.get('href'),  # song link
                'singer': singer.strip(),
                'song': name.strip(),
                'time': duration.get_text().strip(),
            }
            print(data)


    if __name__ == '__main__':
        urls = ['https://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'.format(number) for number in range(1, 24)]  # 23 URLs in total
        for page, url in enumerate(urls, start=1):
            print('Scraping page {}'.format(page))
            get_info(url)
            time.sleep(1)  # pause for 1 second between requests
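The script takes the singer and the song name from a single title string by splitting on a hyphen. Indexing `[1]` on the result of `split('-')` raises `IndexError` when a title has no hyphen, and it drops part of a song name that contains a hyphen itself. The sketch below shows the step in isolation using `str.partition`. The helper name `split_title` and the sample titles are made up for illustration and are not taken from the site.

```python
def split_title(text):
    """Split 'Singer - Song' into (singer, song); song is '' if no hyphen."""
    # partition() splits on the first hyphen only and never raises
    singer, _, song = text.strip().partition('-')
    return singer.strip(), song.strip()


# Made-up sample titles, for illustration only
print(split_title('Singer A - Song B'))
print(split_title('Singer A - Song-B (Live)'))
print(split_title('No separator'))
```

A singer name that itself contains a hyphen would still be split in the wrong place, so treat this as a heuristic.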
Run results
END