1. Page analysis
URL: https://music.163.com/#/search/m/?s=许嵩&type=1
Inspecting the page shows that all of the song information sits under the div tag with class="srchsongst".
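The Chinese keyword in the address above has to be percent-encoded before it can be used as a plain ASCII URL. Here is a minimal sketch using only the standard library. `build_search_url` and `SEARCH_URL` are hypothetical names introduced for illustration; they are not part of the site or of selenium.

```python
from urllib.parse import quote

# Search URL template copied from the address above; {} marks the keyword slot.
SEARCH_URL = "https://music.163.com/#/search/m/?s={}&type=1"


def build_search_url(keyword):
    """Percent-encode the keyword as UTF-8 and place it in the search URL."""
    return SEARCH_URL.format(quote(keyword))


print(build_search_url("许嵩"))
# https://music.163.com/#/search/m/?s=%E8%AE%B8%E5%B5%A9&type=1
```

The same helper can be reused for other artists by changing the keyword.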
2. Scraping the song information
If installing selenium raises an error, see: https://blog.csdn.net/weixin_43746433/article/details/95237254
from selenium import webdriver
from lxml import etree
import csv


def get_info(url, writer):
    # Path to your chromedriver.exe
    chrome_driver = r"D:\Python\Anaconda\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe"
    # Selenium 3 style API. Selenium 4 replaces executable_path with a Service
    # object and find_elements_by_tag_name with find_elements(By.TAG_NAME, ...).
    driver = webdriver.Chrome(executable_path=chrome_driver)
    try:
        driver.maximize_window()
        driver.get(url)
        driver.implicitly_wait(10)
        # The search results are rendered inside the first iframe on the page
        iframe = driver.find_elements_by_tag_name('iframe')[0]
        driver.switch_to.frame(iframe)
        html = etree.HTML(driver.page_source)
    finally:
        # Close the browser even if the page fails to load
        driver.quit()
    # Each child div of "srchsongst" is one song row
    infos = html.xpath('//div[@class="srchsongst"]/div')
    for info in infos:
        song_id = info.xpath('div[2]/div/div/a/@href')[0].split('=')[-1]
        song = info.xpath('div[2]/div/div/a/b/text()')[0]
        singer1 = info.xpath('div[4]/div/a')[0]
        singer = singer1.xpath('string(.)')
        album = info.xpath('div[5]/div/a/@title')[0]
        print(song_id, song, singer, album)
        writer.writerow([song_id, song, singer, album])


if __name__ == '__main__':
    # The with block closes music.csv once all rows are written
    with open('music.csv', 'w', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(['song_id', 'song', 'singer', 'album'])
        url = 'https://music.163.com/#/search/m/?s=%E8%AE%B8%E5%B5%A9&type=1'
        get_info(url, writer)
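The `split('=')[-1]` step in the loop pulls the numeric id out of each song link. The sketch below shows that step on a sample href and, as an alternative, parses the query string, which stays correct if the link ever carries extra parameters. The href values and the helper name `song_id_from_href` are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def song_id_from_href(href):
    """Return the value of the `id` query parameter of a song link."""
    query = parse_qs(urlparse(href).query)
    return query["id"][0]


sample_href = "/song?id=123456"  # hypothetical link, not real scraped data

# The approach used in the scraper: take whatever follows the last '='.
print(sample_href.split("=")[-1])      # 123456

# Query-string parsing gives the same result here ...
print(song_id_from_href(sample_href))  # 123456

# ... and still finds the id when another parameter follows it.
print(song_id_from_href("/song?id=123456&from=search"))  # 123456
```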
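The resulting file
3. Scraping the lyrics
Each song's lyrics are fetched through the lyrics API URL. The song id and name are read from the CSV file scraped in the previous step.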
import os
import re
import json

import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}


def get_info(song_id):
    # Request the lyrics of one song from the lyrics API
    res = requests.get('http://music.163.com/api/song/lyric?id={}&lv=1&kv=1&tv=-1'.format(song_id), headers=headers)
    json_data = json.loads(res.text)
    lyric = json_data['lrc']['lyric']
    # Strip the [mm:ss.xx] timestamps at the start of each line
    lyric = re.sub(r'\[.*\]', '', lyric)
    return str(lyric)


def txt():
    data = pd.read_csv('music.csv')
    # Make sure the output folder exists before writing into it
    os.makedirs('歌词', exist_ok=True)
    for i in range(len(data['song_id'])):
        # One text file per song, named after the song
        with open(r'歌词/{}.txt'.format(data['song'][i]), 'w', encoding='utf-8') as fp:
            fp.write(get_info(data['song_id'][i]))


txt()
Scraping succeeded!
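To see what the timestamp-stripping step does without calling the API, here is a small sketch on a hand-written response. The sample payload and the helper names `strip_timestamps` and `safe_filename` are assumptions for illustration; only the `lrc` → `lyric` nesting is taken from the code above. Note that this sketch uses a non-greedy pattern, which differs slightly from the greedy one in the scraper, and that the filename helper is an optional extra for song titles containing characters that Windows does not allow in file names.

```python
import json
import re

# Hypothetical response body in the shape the scraper expects.
sample_response = json.dumps(
    {"lrc": {"lyric": "[00:00.00]line one\n[00:05.12]line two\n"}}
)


def strip_timestamps(lyric):
    """Remove every [..] tag; non-greedy so each tag is matched on its own."""
    return re.sub(r"\[.*?\]", "", lyric)


def safe_filename(name):
    """Replace characters that are invalid in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", name)


lyric = json.loads(sample_response)["lrc"]["lyric"]
print(strip_timestamps(lyric))
print(safe_filename("A/B?"))  # A_B_
```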
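4. Data analysis
4.1 Basic overview of the data
许嵩 (Xu Song) has 175 songs in total, which makes him a prolific original singer-songwriter.
4.2 Number of songs per album
In his early days 许嵩 was an internet singer, so those songs are all grouped under 许嵩单曲集. The albums he released later, 苏格拉没有底 and 寻雾启示, are both excellent.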
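The total and the per-album counts can be obtained from music.csv with a simple tally. The following is a sketch using only the standard library. The sample rows are invented for illustration and are not the scraped data, and `count_by_album` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Invented rows in the same column layout as music.csv.
sample_csv = """song_id,song,singer,album
1,song a,许嵩,album x
2,song b,许嵩,album x
3,song c,许嵩,album y
"""


def count_by_album(csv_file):
    """Count how many rows (songs) belong to each album."""
    reader = csv.DictReader(csv_file)
    return Counter(row["album"] for row in reader)


counts = count_by_album(io.StringIO(sample_csv))
print("total songs:", sum(counts.values()))
for album, n in counts.most_common():
    print(album, n)

# With the real file, something like:
# with open('music.csv', encoding='utf-8', newline='') as f:
#     counts = count_by_album(f)
```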
4.3 Word cloud
For how to draw the word cloud, see: https://blog.csdn.net/weixin_43746433/article/details/89856014