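Scraping song names and links the simple way
A simple crawler
The example scrapes the 云音乐飙升榜 (the "soaring" chart) on NetEase Cloud Music.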
from urllib.request import Request, urlopen
import random
from bs4 import BeautifulSoup
import xlwt


def main():
    html = gethtml('https://music.163.com/discover/toplist?id=19723756')
    music_list = getMusic(html)
    saveMusic(music_list)


def gethtml(url):
    # Put your own User-Agent strings here; the more, the better.
    # The single entry below is only a placeholder so the list is not empty.
    ua_list = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64)']
    ua = random.choice(ua_list)  # pick a random User-Agent for this request
    req = Request(url=url)
    req.add_header('user-agent', ua)
    with urlopen(req) as response:
        html = response.read().decode('utf-8')
    bs = BeautifulSoup(html, "html.parser")
    # The song links are read from the first <ul class="f-hide"> element
    item = bs.find_all("ul", class_="f-hide")[0]
    return item


def getMusic(html):
    html = str(html)  # gethtml returns a Tag; turn it back into markup
    bs = BeautifulSoup(html, "html.parser")
    music_list = []
    for item in bs.select('ul > li > a'):
        music = []
        music.append(item.string)  # song name
        music.append('https://music.163.com/#' + item['href'])  # song link
        music_list.append(music)
    return music_list


def saveMusic(music_list):
    workbook = xlwt.Workbook(encoding='utf-8')  # create the workbook object
    worksheet = workbook.add_sheet('sheet1')  # create the worksheet
    # write(row, column, content)
    worksheet.write(0, 0, 'Song name')
    worksheet.write(0, 1, 'Song link')
    for i in range(len(music_list)):
        worksheet.write(i + 1, 0, music_list[i][0])
        worksheet.write(i + 1, 1, music_list[i][1])
    workbook.save('music.xls')


if __name__ == '__main__':
    main()
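The parsing step can be checked offline before hitting the real site. The sketch below assumes, based only on the selectors the script uses, that the page contains a ul with class f-hide whose li items each wrap one a link. The sample HTML, the song names and the ids are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the structure the script expects.
SAMPLE_HTML = """
<html><body>
<ul class="f-hide">
<li><a href="/song?id=1">Song A</a></li>
<li><a href="/song?id=2">Song B</a></li>
</ul>
</body></html>
"""

bs = BeautifulSoup(SAMPLE_HTML, "html.parser")
song_list = bs.find_all("ul", class_="f-hide")[0]

music_list = []
for a in song_list.select("li > a"):
    # Each entry is [song name, full link], the same shape the script builds.
    music_list.append([str(a.string), "https://music.163.com/#" + a["href"]])

print(music_list)
```

If the real page ever changes its markup, find_all would return an empty list and the [0] index would fail, so this is the first place I would look when the script stops working.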
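A few remarks on the code above.
1. When scraping, try to build a pool of User-Agent strings and pick one at random for each request.
2. The response returned by urlopen in urllib.request can be handled much like a file object. I did not know this at first, but after noticing its read method I looked at the library source and confirmed it, which is why the code uses with.
3. select can also be used in many different ways; the CSS selector examples further down show several.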
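Points 1 and 2 can be tried without touching the real site. The sketch below is only an illustration under a few assumptions: the two User-Agent strings are made-up placeholders, build_request is a hypothetical helper name, and a data: URL stands in for a real page so that the example runs offline.

```python
import random
from urllib.request import Request, urlopen

# Placeholder pool; replace with real browser User-Agent strings.
UA_POOL = [
    "example-agent/1.0",
    "example-agent/2.0",
]


def build_request(url):
    """Hypothetical helper: attach a randomly chosen User-Agent to a request."""
    req = Request(url=url)
    req.add_header("user-agent", random.choice(UA_POOL))
    return req


# urllib handles data: URLs itself, so no network access is needed here.
req = build_request("data:text/plain;charset=utf-8,hello")

# Request stores header names capitalized, hence 'User-agent'.
chosen_ua = req.get_header("User-agent")

# The object returned by urlopen works as a context manager, like a file.
with urlopen(req) as response:
    body = response.read().decode("utf-8")

print(chosen_ua)
print(body)
```

Swapping the data: URL for the chart URL gives the same request flow as gethtml, with the response closed automatically when the with block ends.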
# CSS selectors
# (bs is assumed to be a BeautifulSoup object for an already parsed page)
print("CSS selectors")
for item in bs.select("title"):  # find by tag name
    print(item)
for item in bs.select(".mnav"):  # find by class name
    print(item)
for item in bs.select("#u1"):  # find by id
    print(item)
for item in bs.select("a[name='tj_trvideo']"):  # find by attribute
    print(item)
for item in bs.select("head > title"):  # find a child tag
    print(item)
for item in bs.select(".mnav ~ .bri"):  # find sibling tags
    print(item.get_text())
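To see what each selector actually matches, the same calls can be run against a small inline page. This is a minimal sketch: the HTML below is invented, and its class, id and name values were chosen only so that every selector from the snippet above has something to match.

```python
from bs4 import BeautifulSoup

# Invented sample page; not taken from any real site.
SAMPLE_HTML = """
<html>
<head><title>Demo page</title></head>
<body>
<div id="u1">
<a class="mnav" name="tj_trnews" href="/news">News</a>
<a class="mnav" name="tj_trvideo" href="/video">Video</a>
<a class="bri" href="/more">More</a>
</div>
</body>
</html>
"""

bs = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_tag = [t.get_text() for t in bs.select("title")]          # tag name
by_class = [t.get_text() for t in bs.select(".mnav")]        # class name
by_id = [t["id"] for t in bs.select("#u1")]                  # id
by_attr = [t.get_text() for t in bs.select("a[name='tj_trvideo']")]  # attribute
by_child = [t.get_text() for t in bs.select("head > title")]  # direct child
by_sibling = [t.get_text() for t in bs.select(".mnav ~ .bri")]  # later sibling

for name, result in [("tag", by_tag), ("class", by_class), ("id", by_id),
                     ("attribute", by_attr), ("child", by_child),
                     ("sibling", by_sibling)]:
    print(name, result)
```

select always returns a list, even for a single match, which is why every result here is collected with a list comprehension.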