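Goal
Scrape four fields for each song on the chart page: rank, singer, song title, and duration. The fields are extracted with CSS selectors.
Site: https://www.kugou.com/yy/rank/home/1-23784.html?from=rank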
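As a rough illustration of how CSS selectors pick out elements, here is a minimal sketch using BeautifulSoup's select(). The HTML snippet is hypothetical: it only imitates the class names used in this post, and the real page may be structured differently. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the class names used in this post;
# the real Kugou page may be structured differently.
sample_html = """
<div class="pc_temp_songlist">
  <ul>
    <li>
      <span class="pc_temp_num">1</span>
      <a href="#">Singer A - Song A</a>
      <span class="pc_temp_time">3:30</span>
    </li>
    <li>
      <span class="pc_temp_num">2</span>
      <a href="#">Singer B - Song B</a>
      <span class="pc_temp_time">4:05</span>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "tag.class" matches elements by tag name and class
ranks = [tag.get_text().strip() for tag in soup.select("span.pc_temp_num")]

# "A > B" matches B only when it is a direct child of A
links = [
    tag.get_text().strip()
    for tag in soup.select("div.pc_temp_songlist > ul > li > a")
]

print(ranks)
print(links)
```

select() always returns a list, in document order, so an empty list simply means the selector matched nothing.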
Code
# Import libraries
import json

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/63.0.3239.132 Safari/537.36'
}


# Fetch the page HTML; return None if the request fails
def GetHtmlText(url):
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


# Parse the HTML into a list of [rank, title, singer, duration] rows
def GetSongListInfo(html):
    songslist = []
    soup = BeautifulSoup(html, 'lxml')
    # Extract the ranks
    ranks = soup.select('span.pc_temp_num')
    # Extract the link text, which holds both singer and song title
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    # Extract the durations
    times = soup.select('span.pc_temp_time')
    for rank_tag, title_tag, time_tag in zip(ranks, titles, times):
        rank = rank_tag.get_text().strip()
        # The link text looks like "singer - title"; split on the first hyphen only
        singer, _, title = title_tag.get_text().partition('-')
        duration = time_tag.get_text().strip()
        songslist.append([rank, title.strip(), singer.strip(), duration])
    return songslist


# Append each song to kugou.txt as one JSON list per line
def SaveFile(songslist):
    with open('kugou.txt', 'a', encoding='utf-8') as f:
        for song in songslist:
            f.write(json.dumps(song, ensure_ascii=False) + '\n')


def main():
    url = 'https://www.kugou.com/yy/rank/home/1-23784.html?from=rank'
    html = GetHtmlText(url)
    if html is None:
        print('Request failed')
        return
    songslist = GetSongListInfo(html)
    SaveFile(songslist)


if __name__ == '__main__':
    main()
Result
After the script runs, the results are saved in kugou.txt, one JSON list per line.
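To check what was saved, each line can be parsed back with json.loads. The sketch below writes a couple of placeholder rows (not real chart data) in the same [rank, title, singer, duration] order that SaveFile uses, then reads them back. The file name kugou_sample.txt and the temporary directory are assumptions made for the example.

```python
import json
import os
import tempfile

# Placeholder rows in [rank, title, singer, duration] order; not real chart data
sample_rows = [
    ["1", "Song A", "Singer A", "3:30"],
    ["2", "Song B", "Singer B", "4:05"],
]

# Hypothetical file in a temporary directory, so kugou.txt is left untouched
path = os.path.join(tempfile.mkdtemp(), "kugou_sample.txt")

# Write one JSON list per line, the same way SaveFile does
with open(path, "a", encoding="utf-8") as f:
    for row in sample_rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read the file back, parsing every non-empty line
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

Because the file is opened in append mode ('a'), running the scraper more than once adds the same chart again instead of overwriting it.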
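Notes
The zip() function
As I understand it, each select() call returns its own list (ranks, link texts holding singer and title, and durations), and all three lists follow the order of the songs on the page. zip() pairs them up position by position, so each loop iteration handles the rank, title, singer, and duration of one song.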
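A small sketch of this pairing, using made-up placeholder values rather than scraped data. One list is deliberately a single entry short to show what happens when a selector matches fewer elements than the others.

```python
# Placeholder values standing in for the text of the selected elements
ranks = ["1", "2", "3"]
titles = ["Singer A - Song A", "Singer B - Song B", "Singer C - Song C"]
times = ["3:30", "4:05"]  # one entry short on purpose

# zip() yields one tuple per position across all three lists
rows = list(zip(ranks, titles, times))
for row in rows:
    print(row)

# zip() stops at the shortest input, so the third song is dropped silently
print(len(rows))
```

This means a selector that misses an element does not raise an error; the output just gets shorter. On Python 3.10 and later, zip(ranks, titles, times, strict=True) raises ValueError when the lengths differ, which makes that kind of mismatch visible.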