1. Create a project
- scrapy startproject <project name>
My project is called Neteasy_music, so the command is scrapy startproject Neteasy_music
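2. Create a spider
First change into the project directory:
- cd <project name>
- scrapy genspider <spider name> <site address>
I named the spider neteasy_music and the page to crawl is music.163.com/discover/artist, so the command is scrapy genspider neteasy_music music.163.com/discover/artist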
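In the spider file in step 3, allowed_domains holds only the bare domain (music.163.com), while start_urls holds the full page address. Scrapy expects allowed_domains to contain domains rather than URLs with paths, so it is worth checking the generated file after running genspider. As a rough illustration of how the address splits into those two values, here is a minimal standard-library sketch; the helper name split_site_address is hypothetical and not part of Scrapy, and the https scheme is an assumption.

```python
from urllib.parse import urlsplit


def split_site_address(address):
    """Return (domain, start_url) for an address such as the one given to genspider.

    Hypothetical helper for illustration only; it is not a Scrapy API.
    """
    # urlsplit only recognises the host part when a scheme is present,
    # so assume https when the address has none.
    if "://" not in address:
        address = "https://" + address
    parts = urlsplit(address)
    return parts.netloc, address


domain, start_url = split_site_address("music.163.com/discover/artist")
allowed_domains = [domain]   # bare domain, no path
start_urls = [start_url]     # full page address
print(allowed_domains, start_urls)
```

If the generated spider ends up with a path inside allowed_domains, editing it by hand to the bare domain brings it in line with the spider file shown in step 3.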
3. Write the spider file
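One step of the spider's parse method can be tried on its own before reading the whole file: it drops the first category link (the recommended artists) and turns the remaining relative links into absolute URLs. The sketch below reproduces just that step with the standard library, using urljoin instead of string concatenation. The function name build_category_urls and the sample href values are assumptions made for illustration, not data taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://music.163.com"


def build_category_urls(hrefs, base_url=BASE_URL):
    """Skip the first href (recommended artists) and make the rest absolute."""
    return [urljoin(base_url, href) for href in hrefs[1:]]


# Hypothetical sample values, for illustration only.
sample_hrefs = [
    "/discover/artist",
    "/discover/artist/cat?id=1001",
    "/discover/artist/cat?id=2002",
]
category_urls = build_category_urls(sample_hrefs)
print(category_urls)
```

Inside a Scrapy callback, response.urljoin(href) does the same job relative to the page being parsed. The spider file itself follows.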
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from Neteasy_music.items import SingerItem


class NeteasyMusicSpider(scrapy.Spider):
    name = 'neteasy_music'
    allowed_domains = ['music.163.com']
    start_urls = ['https://music.163.com/discover/artist']
    base_url = 'https://music.163.com'

    def parse(self, response):
        # Collect the artist category links, e.g. Chinese male artists, Western female artists
        singer_type_href = response.xpath('//a[@class="cat-flag"]/@href').extract()
        del singer_type_href[0]  # drop the first link (recommended artists)
        for url in singer_type_href:
            full_url = self.base_url + url
            # print(url)
            yield Request(url&#