Hands-on with Scrapy, the Python crawler framework (original write-up)
This project crawls film links from a target movie site, follows each detail link to scrape the film's Thunder (Xunlei) download links, and saves them to a local txt file; feeding a saved link to the Thunder client then downloads the movie.
1. Open cmd in the directory where the project should live and run scrapy startproject followed by the project name:

```bash
scrapy startproject film
```
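The command generates the standard Scrapy project skeleton; the files referenced in the later steps all live here:

```
film/
├── scrapy.cfg           # deploy configuration
└── film/
    ├── __init__.py
    ├── items.py         # item model (step 4)
    ├── middlewares.py
    ├── pipelines.py     # item pipeline (step 5)
    ├── settings.py      # project settings (step 6)
    └── spiders/         # spiders go here (step 3)
```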
2. Optionally, open the project in PyCharm.
3. Create a new spider in the spiders folder. Its parse callback walks the action-film listing pages and hands every detail page to detail_parse (sketched after the code block):
```python
# Spider: collects detail-page links from the listing pages
import scrapy


class FilmSpider(scrapy.Spider):
    name = 'film'
    index = 2  # next listing page to request (index_2.html onward)
    start_urls = ['https://www.66ys.tv/dongzuopian/index.html']

    def parse(self, response):
        # Each <a> in the listing carries a film title and its detail-page URL
        films = response.xpath('//div[@class="listBox"]//ul/li/div[2]//a')
        for film in films:
            name = film.xpath('./text()').extract()[0]
            href = film.xpath('./@href').extract()[0]
            # Forward the title via meta so the detail callback can keep it
            yield scrapy.Request(href, callback=self.detail_parse,
                                 meta={'name': name}, dont_filter=True)
        # Follow the paginated listing up to index_99.html
        if self.index < 100:
            url = 'https://www.66ys.tv/dongzuopian/index_' + str(self.index) + '.html'
            yield scrapy.Request(url, callback=self.parse)
            self.index += 1
```
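The requests above are routed to a detail_parse callback that the original snippet does not show. Here is a minimal sketch of it; the thunder:// XPath is an assumption about how 66ys.tv marks up its download links, so check it against the real detail pages:

```python
# Add at the top of the spider file:
from film.items import FilmItem

# ...and inside FilmSpider:
    def detail_parse(self, response):
        item = FilmItem()
        item['name'] = response.meta['name']  # title forwarded from parse
        # Assumed selector: every thunder:// anchor on the detail page
        item['href'] = response.xpath(
            '//a[starts-with(@href, "thunder://")]/@href').extract()
        yield item
```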
4. Write the project's items.py. The Item is the model that wraps the scraped fields so the pipeline can consume them:
```python
import scrapy


class FilmItem(scrapy.Item):
    href = scrapy.Field()  # list of thunder:// download links
    name = scrapy.Field()  # film title
```
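For reference (not from the original post), an Item behaves like a dict whose keys are limited to the declared Fields:

```python
from film.items import FilmItem

item = FilmItem()
item['name'] = 'some title'       # fine: name is a declared Field
item['href'] = ['thunder://...']  # extract() returns a list, so href holds a list
item['year'] = 2020               # raises KeyError: year is not a declared Field
```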
5. Write the pipeline file pipelines.py. It serializes every item to one line of JSON in test.txt:
```python
import json


class FilmPipeline(object):
    def __init__(self):
        # One shared output file for the whole crawl
        self.f = open('test.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        try:
            # Skip items whose detail page yielded no download links
            if item['href']:
                content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
                self.f.write(content)
        except Exception as e:
            print(e)
        return item

    def close_spider(self, spider):  # Scrapy passes the spider instance here
        self.f.close()
```
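With this pipeline, every saved item becomes one JSON object per line in test.txt, along these lines (placeholder values):

```
{"name": "Example Film", "href": ["thunder://..."]},
```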
6. Edit settings.py to enable the pipeline. The value 300 is an ordering key between 0 and 1000; when several pipelines are enabled, items pass through lower numbers first:
```python
ITEM_PIPELINES = {
    'film.pipelines.FilmPipeline': 300,
}
```
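If more pipelines are added later, the numbers decide the order. For example, a hypothetical CleanupPipeline registered at 200 would see every item before FilmPipeline does:

```python
ITEM_PIPELINES = {
    'film.pipelines.FilmPipeline': 300,
    # Hypothetical second pipeline: 200 < 300, so it runs first
    'film.pipelines.CleanupPipeline': 200,
}
```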
7. To launch the crawl, create a new Python file (e.g. in the project root) and run it; the --nolog flag suppresses Scrapy's log output, so drop it when debugging:
```python
from scrapy import cmdline

# Same as running `scrapy crawl film --nolog` in the project directory
cmdline.execute("scrapy crawl film --nolog".split())
```