Qianqian Music Crawler
1. Preface
------------------- 0xFFFF words omitted -------------------
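This article uses the Scrapy framework. While crawling, I found that pagination on the song list page is loaded via AJAX. The `.r` parameter in the query string is randomly generated, and with my limited ability I did not work out the pattern behind it, so this crawler only covers the first page of each song list. If you need a full-site crawl, you can add a Selenium middleware that simulates browser clicks to turn the pages.
For reference, see the Chinese Selenium documentation:
https://selenium-python-zh.readthedocs.io/en/latest/index.html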
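To make the Selenium idea a little more concrete, here is a minimal sketch of paging through an AJAX-loaded list by clicking a "next page" control. It is an illustration under assumptions, not the crawler used in this article: the helper names `collect_pages` and `make_driver`, the `next_selector` argument, and the fixed pause are all hypothetical, and the actual selector of the site's pagination control is not given here. Wiring the helper into a Scrapy downloader middleware is left out.

```python
import time

# "css selector" is the string value of selenium's By.CSS_SELECTOR strategy.
CSS = "css selector"


def collect_pages(driver, url, next_selector, max_pages=5, pause=1.0):
    """Open `url`, then click the "next page" control until it disappears
    or `max_pages` pages have been collected.

    Returns the HTML of every page visited, in order.
    """
    driver.get(url)
    pages = [driver.page_source]
    while len(pages) < max_pages:
        buttons = driver.find_elements(CSS, next_selector)
        if not buttons:
            break  # no "next page" control left: last page reached
        buttons[0].click()
        time.sleep(pause)  # crude wait for the AJAX content to render
        pages.append(driver.page_source)
    return pages


def make_driver():
    """Create a headless Chrome driver (needs selenium and Chrome installed)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)
```

A call would look like `collect_pages(make_driver(), artist_url, next_selector)`, where `next_selector` is whatever CSS selector matches the page's "next" button. Each returned HTML string could then be parsed the same way as a normal response. An explicit wait on the new content would likely be more robust than the fixed pause.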
2. Implementation
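First, analyze the page layout. Following the principle of being lazy wherever possible, I found that starting from the artist page is the fastest and simplest way to reach all the resources.
Collect the URL of every artist, then parse each artist page, and you have all the songs. Simple, isn't it? Acting beats just being tempted, so let's start typing.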
import scrapy

from baidu_music.items import BaiduMusicItem


class MusicSpider(scrapy.Spider):
    name = 'music'
    allowed_domains = ['music.taihe.com']
    start_urls = ['http://music.taihe.com/artist']

    def parse(self, response):
        """Collect the URL of every artist listed on the artist page."""
        url_list = []
        artists = response.xpath("//ul[@class='container']//li")
        for li in artists:
            # URLs of the hottest artists
            hot_artist = li.xpath(".//a[@class='cover']/@href").getall()
            if hot_artist:
                url_list.extend(hot_artist)
            # URLs of the remaining artists
            others = li.xpath(".//ul[@class='clearfix']/li/a[not(@class)]/@href").getall()
            url_list.extend(others)
        for url in url_list:
            # hrefs may be relative, so resolve them against the page URL
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_artist)

    def parse_artist(self, response):
        """Yield one item per song on the first page of an artist's song list."""
        artist_name = response.xpath("//span[@class='artist-name']//text()").get()
        song_list = response.xpath("//div[@class='song-list-wrap']//ul/li")
        for song in song_list:
            music_name = song.xpath(".//span[@class='songname']//a/@title").get()
            music_url = song.xpath(".//span[@class='songname']//a/@href").get()
            # Skip list entries that lack a title or a link
            if music_name and music_url:
                music_url = response.urljoin(music_url)
                item = BaiduMusicItem(song=music_name, author=artist_name, url=music_url)
                yield item
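The spider resolves every `href` with `response.urljoin` because links on the page may be relative. As a rough illustration of what that resolution does, here is a standalone sketch using the standard library's `urljoin`. The helper name `absolute_unique` and the sample hrefs are made up for the example, and the duplicate removal is an extra step that the spider above does not perform itself.

```python
from urllib.parse import urljoin


def absolute_unique(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(urljoin(base_url, href) for href in hrefs))


base = "http://music.taihe.com/artist"
# Hypothetical hrefs: two relative, one repeated, one already absolute
hrefs = ["/artist/1", "/artist/2", "/artist/1", "http://music.taihe.com/artist/3"]
urls = absolute_unique(base, hrefs)
print(urls)
```

Relative paths are joined to the site root and absolute URLs pass through unchanged, so the callback always receives a full URL.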
With the data in hand, the next step is storage.
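import logging

from pymysql import cursors
from twisted.enterprise import adbapi

logger = logging.getLogger(__name__)


class BaiduMusicPipeline(object):
    def __init__(self):
        params = {
            'host': '***',
            'user': '***',
            'password': '******',
            'database': 'baidu_music',
            'port': 3306,
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        # Create an asynchronous connection pool backed by pymysql
        self.dbpool = adbapi.ConnectionPool('pymysql', **params)
        self._sql = None

    def process_item(self, item, spider):
        # Run the insert on a pool thread so the crawl is not blocked
        deferred = self.dbpool.runInteraction(self.insert_item, item)
        deferred.addErrback(self.handle_error, item)
        # Return the item so that any later pipeline still receives it
        return item

    def insert_item(self, cursor, item):
        # Insert one song row into the database
        cursor.execute(self.sql, (item['song'], item['author'], item['url']))

    def handle_error(self, failure, item):
        # Log failed inserts instead of dropping them silently
        logger.error('Insert failed for %s: %s', item['url'], failure.getErrorMessage())

    @property
    def sql(self):
        # Build the statement once and reuse it for every item
        if not self._sql:
            self._sql = '''
                INSERT INTO music(id, song, singer, url)
                VALUES (NULL, %s, %s, %s)
                '''
        return self._sql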
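The pipeline assumes a `music` table already exists, but the table definition is not shown in this article. The sketch below guesses at a matching layout from the INSERT statement (an auto-incrementing `id` plus `song`, `singer`, and `url`) and runs the same parameterized insert. To stay runnable without a MySQL server it swaps in the standard library's `sqlite3`, so the placeholders are `?` rather than pymysql's `%s`. The column types and the `insert_item` helper here are assumptions.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `baidu_music` database
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE music (
        id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed auto-increment key
        song   TEXT NOT NULL,
        singer TEXT NOT NULL,
        url    TEXT NOT NULL
    )
    """
)


def insert_item(conn, item):
    """Insert one scraped item; the item's `author` field maps to the `singer` column."""
    conn.execute(
        "INSERT INTO music (song, singer, url) VALUES (?, ?, ?)",
        (item["song"], item["author"], item["url"]),
    )


item = {"song": "晴天", "author": "周杰伦", "url": "http://music.taihe.com/song/816477"}
insert_item(conn, item)
rows = conn.execute("SELECT id, song, singer, url FROM music").fetchall()
print(rows)
```

Passing the values as parameters, rather than formatting them into the SQL string, lets the driver handle quoting, which matters for song titles that contain quotes.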
Query the database to check that the rows are there:
mysql> select * from music where singer like '周杰伦';
+------+--------------------+-----------+---------------------------------------+
| id | song | singer | url |
+------+--------------------+-----------+---------------------------------------+
| 3948 | 告白气球 | 周杰伦 | http://music.taihe.com/song/266322598 |
| 3949 | 青花瓷 | 周杰伦 | http://music.taihe.com/song/354387 |
| 3950 | 夜曲 | 周杰伦 | http://music.taihe.com/song/1191265 |
| 3951 | 简单爱 | 周杰伦 | http://music.taihe.com/song/10736444 |
| 3952 | 说好的幸福呢 | 周杰伦 | http://music.taihe.com/song/1392875 |
| 3953 | 晴天 | 周杰伦 | http://music.taihe.com/song/816477 |
| 3954 | 烟花易冷 | 周杰伦 | http://music.taihe.com/song/228393 |
| 3955 | 可爱女人 | 周杰伦 | http://music.taihe.com/song/10736421 |
| 3956 | 菊花台 | 周杰伦 | http://music.taihe.com/song/252832 |
| 3957 | 青花瓷 | 周杰伦 | http://music.taihe.com/song/34543910 |
| 3958 | 七里香 | 周杰伦 | http://music.taihe.com/song/274085 |
| 3959 | 红尘客栈 | 周杰伦 | http://music.taihe.com/song/31496563 |
| 3960 | 退后 | 周杰伦 | http://music.taihe.com/song/305552 |
| 3961 | 最长的电影 | 周杰伦 | http://music.taihe.com/song/354877 |
| 3962 | 搁浅 | 周杰伦 | http://music.taihe.com/song/3451498 |
+------+--------------------+-----------+---------------------------------------+
15 rows in set (0.01 sec)
That's about it. Time for me to go listen to some music!