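Scraping QQ Music
Part 1: Analyzing the search page
How to find the song information
Searching for different singers or song titles shows that only the song list changes!
Watching the page URL, the only thing that changes as the page loads new requests is the value at the end of the address.
Copying the URL from the browser gives: y.qq.com/portal/sear…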
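The Chinese keyword is presumably percent-encoded before it is put into the URL, so now we know how to build the URL of the search page!! Hahaha! That only covers one page of results, though. Changing the page value in the URL changes the songs shown on the page!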
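To see the encoding step on its own, here is a minimal sketch using the standard library. It assumes the keyword is encoded as UTF-8 and then percent-encoded, which matches the w= value that appears in the captured URLs in this post.

```python
from urllib.parse import quote, unquote

keyword = "张国荣"

# quote() encodes the text as UTF-8 and percent-encodes each byte
encoded = quote(keyword)
print(encoded)

# unquote() reverses the step, which is handy for reading captured URLs
print(unquote(encoded))
```

Pasting a w= value from the browser into unquote() is a quick way to check which keyword a captured request was searching for.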
OK, time to scrape!
import requests
import json
url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.center&searchid=47789770466433535&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=10&w=%E5%BC%A0%E5%9B%BD%E8%8D%A3&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0'
resp = requests.get(url)
resp.encoding = 'utf-8'
a = json.loads(resp.text)
print(a)
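But! Argh!
The scraped content has no song information at all!? My first guess is that the songs are loaded by JavaScript, so the next step is to analyze the requests the page makes: open the browser developer tools (F12), refresh the page, and use Ctrl+F to search the captured requests for the relevant one!
One of the requests returns a JSON file. Expanding it level by level reveals the information we need! So the plan is to request that URL, take the JSON string in the response, and parse it with Python.
Raw JSON is too messy to read, so I found a wonderful site for viewing it! See www.bejson.com/jsonviewern…
Now the song information is easy to find. The overview on the right side of the site shows that it sits under song, in a list named list, where every song is a JSON dictionary. The song id and song name are directly under 'id' and 'title', and the singer name is in the first dictionary of the list named singer. Time to scrape!
First parse the JSON string into a Python object with jsondata = json.loads(). Here is the code: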
import requests
import json
# Search API URL captured in the browser; w is the percent-encoded keyword
url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.center&searchid=47789770466433535&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=10&w=%E5%BC%A0%E5%9B%BD%E8%8D%A3&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0'
resp = requests.get(url)
resp.encoding = 'utf-8'
a = json.loads(resp.text)
# The song entries are under data -> song -> list
b = a.get('data').get('song').get('list')
#print(b)
songs_list = []
for i in b:
    result = {}
    result['id'] = i.get('id')
    result['title'] = i.get('title')
    # The singer name is in the first dict of the singer list
    result['singer'] = i.get('singer')[0].get('title')
    songs_list.append(result)
    print(result)
Result:
{'id': 105602369, 'title': '风继续吹', 'singer': '张国荣'}
{'id': 471461, 'title': '当爱已成往事', 'singer': '张国荣'}
{'id': 4899362, 'title': '当年情', 'singer': '张国荣'}
{'id': 1375623, 'title': '沉默是金', 'singer': '张国荣'}
{'id': 3961, 'title': '玻璃之情', 'singer': '张国荣'}
{'id': 106731742, 'title': '倩女幽魂', 'singer': '张国荣'}
{'id': 4787727, 'title': '千千阙歌 (90 Live)', 'singer': '张国荣'}
{'id': 1377649, 'title': '风再起时', 'singer': '张国荣'}
{'id': 7132726, 'title': '我 (国语)', 'singer': '张国荣'}
{'id': 163233, 'title': '共同渡过', 'singer': '张国荣'}
Hahaha, found the song information!
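The chained .get() calls used so far raise an exception as soon as one level is missing or a song has an empty singer list. Below is a slightly more defensive sketch of the same extraction. The function name extract_songs and the trimmed sample response are made up for illustration; only the key names (data, song, list, id, title, singer) come from the response described in this post.

```python
def extract_songs(payload):
    """Return a list of {'id', 'title', 'singer'} dicts from a search response."""
    data = payload.get("data") or {}
    song = data.get("song") or {}
    items = song.get("list") or []
    songs = []
    for item in items:
        singers = item.get("singer") or []
        songs.append({
            "id": item.get("id"),
            "title": item.get("title"),
            # Fall back to None when no singer is listed
            "singer": singers[0].get("title") if singers else None,
        })
    return songs


# Hypothetical, trimmed-down response used only to exercise the function
sample = {"data": {"song": {"list": [
    {"id": 105602369, "title": "风继续吹", "singer": [{"title": "张国荣"}]},
    {"id": 1, "title": "no singer listed", "singer": []},
]}}}

songs = extract_songs(sample)
for entry in songs:
    print(entry)
```

Because the function only takes a parsed dict, it can be tried on a saved response without sending any request.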
Now let's build the search URL. The URL of the JSON request found above is
https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.center&searchid=38127408304238659&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=10&w=%E5%BC%A0%E5%9B%BD%E8%8D%A3&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0
Compare it with another URL
https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=56386297828639744&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&p=1&n=10&w=%E6%88%91%E6%9B%BE&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0
The value after w is the search keyword! So all we need to do is encode the Chinese keyword and pass it in. This time I also add a browser User-Agent header: I'm a browser, not a crawler, honest!
import requests
from urllib import parse
import json
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
# {} is the placeholder for the encoded w=<keyword> pair
url = ('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=64768420417553403&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&'
       'p=1&n=10&{}&g_tk=1531112714&loginUin=3237707674&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
word = '陈奕迅'
query = {'w': word}  # renamed from dict so the built-in is not shadowed
url_data = parse.urlencode(query)  # percent-encode the keyword as w=...
resp = requests.get(url.format(url_data), headers=headers)
resp.encoding = 'utf-8'
a = json.loads(resp.text)
b = a.get('data').get('song').get('list')
#print(b)
songs_list = []
for i in b:
    result = {}
    result['id'] = i.get('id')
    result['title'] = i.get('title')
    result['singer'] = i.get('singer')[0].get('title')
    songs_list.append(result)
    print(result)
Result:
{'id': 1313990, 'title': '红玫瑰', 'singer': '陈奕迅'}
{'id': 1249550, 'title': '富士山下', 'singer': '陈奕迅'}
{'id': 4830342, 'title': '十年', 'singer': '陈奕迅'}
{'id': 1313993, 'title': '好久不见', 'singer': '陈奕迅'}
{'id': 9059607, 'title': '不要说话', 'singer': '陈奕迅'}
{'id': 1313988, 'title': '淘汰', 'singer': '陈奕迅'}
{'id': 4907894, 'title': '单车', 'singer': '陈奕迅'}
{'id': 1251166, 'title': '浮夸', 'singer': '陈奕迅'}
{'id': 1313992, 'title': '爱情转移', 'singer': '陈奕迅'}
{'id': 4907901, 'title': 'K歌之王 (粤语)', 'singer': '陈奕迅'}
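Success, the song search finally works! Along the way I also noticed that the number of songs returned is controlled by the value of n in the URL: changing n=10 to n=100 returns a file with the information for 100 songs!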
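Since p, n and w are the only values that need to vary, the search URL can also be assembled from a dict instead of a hand-edited string. This is only a sketch: build_search_url is a made-up helper, the fixed parameters are copied as-is from one of the captured URLs above, and whether the server keeps accepting a reused searchid is an untested assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = "https://c.y.qq.com/soso/fcgi-bin/client_search_cp"

# Fixed parameters copied verbatim from a captured search URL
BASE_PARAMS = {
    "ct": 24, "qqmusic_ver": 1298, "new_json": 1,
    "remoteplace": "txt.yqq.song", "searchid": "56386297828639744",
    "t": 0, "aggr": 1, "cr": 1, "catZhida": 1, "lossless": 0, "flag_qc": 0,
    "g_tk": 5381, "loginUin": 0, "hostUin": 0, "format": "json",
    "inCharset": "utf8", "outCharset": "utf-8", "notice": 0,
    "platform": "yqq.json", "needNewCode": 0,
}


def build_search_url(keyword, page=1, per_page=10):
    """Build a search URL for the given keyword, page (p) and page size (n)."""
    params = dict(BASE_PARAMS, p=page, n=per_page, w=keyword)
    # urlencode() percent-encodes the Chinese keyword as UTF-8
    return SEARCH_ENDPOINT + "?" + urlencode(params)


search_url = build_search_url("陈奕迅", page=2, per_page=100)
print(search_url)

# Decode the query string again to double-check what was encoded
decoded = parse_qs(urlsplit(search_url).query)
print(decoded["w"], decoded["p"], decoded["n"])
```

The resulting string can be passed to requests.get() together with the headers dict in the same way as the hard-coded URL.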
So how do we find the song file to download?
Open the song playback page and try to locate the audio file. Audio files on a web page can usually be found under Media in the developer tools, so press F12, switch to it, and refresh the page.
Open the files one by one; only the last one is the actual song file.
Haha, found the song file too. Now look at its URL and try to construct it ourselves.
Comparing two such URLs, the parts that differ are vkey and the file name, which is C400 followed by a song-specific string. If we can find those two values we are done, so Ctrl+F again!
Several responses contain it. Paste them into the JSON viewer as before. Target found! It is the thing called mid. The earlier song search results also had a field called mid, and a quick comparison (Ctrl+F) shows the same value there, so we can pick it up directly while scraping the search results!
Add mid to the scraped information. Next up is vkey!! Ctrl+F again, same method, same recipe. You know the drill!
Found a JSON file! Happy days, let's parse it right away! Enemy spotted! The vkey is under req -> data, so extract it:
import requests
import json
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey2954502924310327&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22songmid%22%3A%5B%22000bSg2U4GcrUi%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D'
# The URL is already complete, so no .format() call is needed here
resp = requests.get(url, headers=headers)
resp.encoding = 'utf-8'
a = json.loads(resp.text)
b = a.get('req').get('data').get('vkey')
print(b)
Output:
C3BDBA226168243D649A67EF479BF6C2F0CA827800422E7590E7F6B3E551DA853E74227E4B6550D9D9F0066124A8F0D3CAFA0499C329D25D
That's the vkey we wanted!
Except it turned out to be the wrong one......... The right vkey is in a different dictionary.
It is under req_0 -> data -> midurlinfo!
That same place also has a field called purl, which already combines mid and vkey into one string. How kind is that!! So no need to be polite, just use it directly:
import requests
import json
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey8774412539618848&g_tk=360481176&loginUin=3237707674&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22songmid%22%3A%5B%220032TY8H2bEqEP%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%223237707674%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A3237707674%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D'
resp = requests.get(url, headers=headers)
resp.encoding = 'utf-8'
a = json.loads(resp.text)
# The per-song info is the first entry of req_0 -> data -> midurlinfo
b = a.get('req_0').get('data').get('midurlinfo')[0].get('vkey')
c = a.get('req_0').get('data').get('midurlinfo')[0].get('purl')
print(c)
# purl is a relative path; prepend the streaming host to get the full URL
url_2 = 'http://dl.stream.qqmusic.qq.com/'
print(url_2+c)
Result:
C4000032TY8H2bEqEP.m4a?guid=3719823069&vkey=1C4B824609DF35C9D27A89E0F323F5EA5D3CAA55FA70514F6831E2AAC0B27B8D4D6463DE9E2ED7CB2006BCE7A1A08C38034F4A0838B2EABF&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/C4000032TY8H2bEqEP.m4a?guid=3719823069&vkey=1C4B824609DF35C9D27A89E0F323F5EA5D3CAA55FA70514F6831E2AAC0B27B8D4D6463DE9E2ED7CB2006BCE7A1A08C38034F4A0838B2EABF&uin=0&fromtag=66
Opening the URL confirms that it is the target audio file.
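To check the claim that purl already contains both the mid and the vkey, the purl printed above can be taken apart with the standard library. This is just an illustrative sketch. The C400 prefix and .m4a suffix are read off this one sample, so treating them as fixed is an assumption.

```python
from urllib.parse import urlsplit, parse_qs

# purl exactly as printed in the result above
purl = ("C4000032TY8H2bEqEP.m4a?guid=3719823069"
        "&vkey=1C4B824609DF35C9D27A89E0F323F5EA5D3CAA55FA70514F6831E2AAC0B27B8D"
        "4D6463DE9E2ED7CB2006BCE7A1A08C38034F4A0838B2EABF&uin=0&fromtag=66")

parts = urlsplit(purl)
filename = parts.path              # file name part before the '?'
query = parse_qs(parts.query)      # dict of query parameters

# Assumption based on this sample: file name = 'C400' + mid + '.m4a'
mid = filename[len("C400"):-len(".m4a")]

print(filename)
print(mid)
print(query["vkey"][0])
```

The mid recovered this way is the same songmid that was sent in the request, which is why no separate URL construction is needed.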
The next step is to analyze and construct the request URL that returns purl:
https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey2954502924310327&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22songmid%22%3A%5B%22000bSg2U4GcrUi%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D
# Its parameters:
-: getplaysongvkey2954502924310327
g_tk: 5381
loginUin: 0
hostUin: 0
format: json
inCharset: utf8
outCharset: utf-8
notice: 0
platform: yqq.json
needNewCode: 0
data: {"req":{"module":"CDN.SrfCdnDispatchServer","method":"GetCdnDispatch","param":{"guid":"3719823069","calltype":0,"userip":""}},"req_0":{"module":"vkey.GetVkeyServer","method":"CgiGetVkey","param":{"guid":"3719823069","songmid":["000bSg2U4GcrUi"],"songtype":[0],"uin":"0","loginflag":1,"platform":"20"}},"comm":{"uin":0,"format":"json","ct":24,"cv":0}}
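Comparing different URLs, only the getplaysongvkey value and the songmid inside data change. I tried to work out how to obtain the getplaysongvkey value and simply could not find it (sob). Then I tried changing only songmid, and it still returned the vkey! Hahaha, simply great! So passing in songmid is enough. One more Ctrl+F shows that this songmid is the same as the mid found earlier! That makes things much simpler! Now let's try to get the URL of every song!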
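Because songmid is the only value that has to change, the long percent-encoded data parameter can also be generated from a plain dict. The sketch below mirrors the parameter list shown above. build_vkey_url is a made-up name, the "-" value is simply reused from one captured request (as in the experiment just described), and whether the server will keep accepting such requests is an assumption.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

VKEY_ENDPOINT = "https://u.y.qq.com/cgi-bin/musicu.fcg"


def build_vkey_url(songmid, guid="3719823069"):
    """Build the musicu.fcg URL for one songmid; all other values are fixed."""
    data = {
        "req": {"module": "CDN.SrfCdnDispatchServer", "method": "GetCdnDispatch",
                "param": {"guid": guid, "calltype": 0, "userip": ""}},
        "req_0": {"module": "vkey.GetVkeyServer", "method": "CgiGetVkey",
                  "param": {"guid": guid, "songmid": [songmid], "songtype": [0],
                            "uin": "0", "loginflag": 1, "platform": "20"}},
        "comm": {"uin": 0, "format": "json", "ct": 24, "cv": 0},
    }
    params = {
        "-": "getplaysongvkey2954502924310327",  # reused from a captured request
        "g_tk": 5381, "loginUin": 0, "hostUin": 0, "format": "json",
        "inCharset": "utf8", "outCharset": "utf-8", "notice": 0,
        "platform": "yqq.json", "needNewCode": 0,
        # Compact separators give the same JSON text the browser sent
        "data": json.dumps(data, separators=(",", ":")),
    }
    return VKEY_ENDPOINT + "?" + urlencode(params)


vkey_url = build_vkey_url("000bSg2U4GcrUi")
print(vkey_url)

# Round-trip check: decode the data parameter back into a dict
round_trip = json.loads(parse_qs(urlsplit(vkey_url).query)["data"][0])
print(round_trip["req_0"]["param"]["songmid"])
```

Compared with a format string, this keeps the JSON readable and avoids mixing literal braces with percent-encoded ones.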
import requests
from urllib import parse
import json
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
# Search URL; {} is the placeholder for the encoded w=<keyword> pair
url = ('https://c.y.qq.com/soso/fcgi-bin/client_search_cp?ct=24&qqmusic_ver=1298&new_json=1&remoteplace=txt.yqq.song&searchid=64768420417553403&t=0&aggr=1&cr=1&catZhida=1&lossless=0&flag_qc=0&'
       'p=1&n=10&{}&g_tk=1531112714&loginUin=3237707674&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0')
# vkey/purl URL; {} is the placeholder for the songmid
url_1 = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey6989843649964012&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%223719823069%22%2C%22songmid%22%3A%5B%22{}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D'
url_2 = 'http://dl.stream.qqmusic.qq.com/'
word = '陈奕迅'
query = {'w': word}  # renamed from dict so the built-in is not shadowed
url_data = parse.urlencode(query)  # percent-encode the keyword as w=...
resp = requests.get(url.format(url_data), headers=headers)
resp.encoding = 'utf-8'
a = json.loads(resp.text)
b = a.get('data').get('song').get('list')
#print(b)
songs_list = []
for i in b:
    result = {}
    result['id'] = i.get('id')
    result['title'] = i.get('title')
    result['singer'] = i.get('singer')[0].get('title')
    result['mid'] = i.get('mid')
    songs_list.append(result['mid'])
    mid = result['mid']
    # Ask for this song's purl using its mid as songmid
    resp_1 = requests.get(url_1.format(mid), headers=headers)
    resp_1.encoding = 'utf-8'
    a_1 = json.loads(resp_1.text)
    c = a_1.get('req_0').get('data').get('midurlinfo')[0].get('purl')
    # purl is empty for some songs, which prints the bare host
    print(url_2+c)
    print('\n')
Result:
http://dl.stream.qqmusic.qq.com/C40000481cWs2JgWe0.m4a?guid=3719823069&vkey=D23EC91C0CE54E56315F8F837CDE5CCCE9A830592DC2EB06BF9D782A8811B52B3544863AFE67ACD9BE7DDE4248C57F4BE0644B75D3B36C20&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/C400000Hv0Nh0m4ye8.m4a?guid=3719823069&vkey=AF477E991571D7118E921935B26E5009FC000278EC49A7FD02E66113070E28E448FCAA2DD3FFDC8C9704114E018AF87ABA0BD2C8970D1C25&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/C400003Idtm746YJCM.m4a?guid=3719823069&vkey=17264244C83309BEA525942B4EBBB99D3D55D47E9A7E319641ACA65CEA9134AEDD40E096BC09A74237121D91DE65EC0587C4FB68336E0CDD&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/
http://dl.stream.qqmusic.qq.com/C400002B2EAA3brD5b.m4a?guid=3719823069&vkey=0E30A1E14E29B3698FA67BCF91DC1BDB3B0F6E6C4E56F14975AA6808BB80C0CA87879C1F8EFDB210E8C64B44B831E659D88371169D78D0BA&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/C400002BuJzd3ye6uP.m4a?guid=3719823069&vkey=E3D6360F04C79FD8A57D5FFFD206F188D650352B33D8049A8F5D844A6B0B01DB52406A50BA9F0A3D41F38BCB4E705AE4F545916157B3CBCA&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/
http://dl.stream.qqmusic.qq.com/C400003wRtRu3w2W62.m4a?guid=3719823069&vkey=93B9233DDA7383C7E478D1028DA61E417861AE5D41B238B0A549906E91608A0CA01649707286E70826299B2F9A0D9605DD490F29907B5A59&uin=0&fromtag=66
http://dl.stream.qqmusic.qq.com/C400003kCfyN2zp9AW.m4a?guid=3719823069&vkey=AF4B635D20A30F81E60077349CBA52C41FAFF9F916E83F76435E6F9A8E7201E52584529D30BA71DC83AF860A14DCAC180F7BDEF5216A6A84&uin=0&fromtag=66
Success!!! The entries without a full URL are probably songs without licensing, I guess!? Well, that's basically it!
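As the result above shows, an empty purl leaves only the bare host in the output. A small sketch of skipping those entries follows. The helper name to_stream_url and the shortened placeholder purl values are hypothetical, and why purl comes back empty is not confirmed here.

```python
STREAM_HOST = "http://dl.stream.qqmusic.qq.com/"


def to_stream_url(purl):
    """Return the full audio URL, or None when purl is empty."""
    return STREAM_HOST + purl if purl else None


# Hypothetical purl values: two shortened placeholders and one empty string
purls = ["C400aaaa.m4a?vkey=AAAA", "", "C400bbbb.m4a?vkey=BBBB"]

urls = [to_stream_url(p) for p in purls]
available = [u for u in urls if u is not None]

print(len(available), "of", len(purls), "songs have a playable URL")
for u in available:
    print(u)
```

In the loop from the previous listing, the same check could replace the unconditional print(url_2+c).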