Hands-On Practice with Scrapy, the Python Crawler Framework

Latest recommended article published on 2024-04-28 16:43:46

lmflmyvm

Categories: Personal Summaries, Tech Sharing. Tags: scrapy crawling, loops, movie links, Thunder link crawling, python crawler

Article link: https://blog.csdn.net/lmflmyvm/article/details/84878649

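Hands-On Practice with Scrapy, the Python Crawler Framework (original, typed by hand)
What this walkthrough does: it crawls the movie links from the target movie site, follows each detail link to crawl the Thunder (迅雷) download links, and saves them to a local txt file. The saved Thunder links can then be used directly to download the movies with Thunder.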
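
The "keep only the Thunder links" step can be sketched in plain Python. This is a minimal illustration, not the author's code: the helper name `pick_thunder_links`, the sample hrefs, and the rule that a Thunder link is any href starting with `thunder://` are all assumptions made for the example.

```python
THUNDER_PREFIX = 'thunder://'


def pick_thunder_links(hrefs):
    """Return hrefs that start with thunder://, in order, without duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        href = href.strip()
        if href.lower().startswith(THUNDER_PREFIX) and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Hypothetical hrefs, as they might be extracted from one detail page.
sample_hrefs = [
    '/dongzuopian/index.html',
    'thunder://QUFodHRwOi8vZXhhbXBsZS5jb20vYS5tcDRaWg==',
    'thunder://QUFodHRwOi8vZXhhbXBsZS5jb20vYS5tcDRaWg==',
    'ftp://example.com/b.mkv',
]
thunder_links = pick_thunder_links(sample_hrefs)
print(thunder_links)
```

In a spider callback, the list passed in would come from the hrefs extracted on the detail page.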

1. Open cmd in the directory where the project should live and run scrapy startproject followed by the project name

scrapy startproject film

2. Open the project in an editor such as PyCharm

3. Create a new spider in the spiders folder

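# Spider
import scrapy
from film.items import FilmItem

class FilmSpider(scrapy.Spider):
    name = 'film'
    index = 2  # number of the next list page to request
    start_urls = ['https://www.66ys.tv/dongzuopian/index.html']

    def parse(self, response):
        # Every matched <a> points to one movie's detail page.
        films = response.xpath('//div[@class="listBox"]//ul/li/div[2]//a')
        for film in films:
            name = film.xpath('./text()').get()
            href = film.xpath('./@href').get()
            if href:
                # urljoin also handles relative links; meta carries the title along.
                yield scrapy.Request(response.urljoin(href), callback=self.detail_parse,
                                     meta={'name': name}, dont_filter=True)
        if self.index < 100:  # follow index_2.html ... index_99.html
            url = 'https://www.66ys.tv/dongzuopian/index_' + str(self.index) + '.html'
            self.index += 1
            yield scrapy.Request(url, callback=self.parse)

    def detail_parse(self, response):
        # This callback is missing from the original post. ASSUMPTION: the Thunder
        # links are <a> elements whose href starts with "thunder://".
        item = FilmItem()
        item['name'] = response.meta['name']
        item['href'] = response.xpath('//a[starts-with(@href, "thunder://")]/@href').getall()
        yield item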
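
The pagination logic in the spider (start at index.html, then request index_2.html, index_3.html and so on while the counter stays below 100) can be checked in isolation. The sketch below only rebuilds the URL sequence; the helper name `list_page_urls` and its `stop` parameter are hypothetical, and whether the site really has that many list pages is not verified here.

```python
BASE = 'https://www.66ys.tv/dongzuopian/'


def list_page_urls(stop=100):
    """Return the list-page URLs the spider would visit, in order.

    The first page is index.html; later pages are index_<n>.html for
    n = 2 .. stop - 1, mirroring the `self.index < 100` check in the spider.
    """
    urls = [BASE + 'index.html']
    for n in range(2, stop):
        urls.append(BASE + 'index_' + str(n) + '.html')
    return urls


pages = list_page_urls()
print(len(pages), pages[0], pages[1], pages[-1])
```

Because each call to parse schedules exactly one further list page, the crawl walks these pages one after another rather than requesting them all up front.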

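4. Write the project's items.py. An item acts as the data model that wraps the scraped fields, which makes them convenient to use in the pipeline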

import scrapy

class FilmItem(scrapy.Item):
    href = scrapy.Field()
    name = scrapy.Field()

5. Write the pipeline file pipelines.py

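import json

class FilmPipeline(object):
    def __init__(self):
        # Output file for the scraped items, one JSON object per line.
        self.f = open('test.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        try:
            # Skip movies for which no Thunder link was found.
            if item['href']:
                content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
                self.f.write(content)
        except Exception as e:
            print(e)
        return item

    def close_spider(self, spider):
        # Scrapy passes the spider to this hook, so the parameter is required.
        self.f.close()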
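
Since the pipeline writes each item as a JSON object followed by a comma and a newline, test.txt as a whole is not valid JSON, but every line can be parsed on its own. A minimal sketch for reading the file back, assuming that line format; the helper name `load_items` is hypothetical:

```python
import json


def load_items(path):
    """Parse a file written by the pipeline: one JSON object per line, each ending in a comma."""
    items = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Drop the newline and the trailing comma added by the pipeline.
            line = line.strip().rstrip(',')
            if line:
                items.append(json.loads(line))
    return items
```

Calling `load_items('test.txt')` after a crawl should then give a list of dicts with the `name` and `href` keys defined in items.py.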

6. Edit settings.py and enable the pipeline setting

ITEM_PIPELINES = {
   'film.pipelines.FilmPipeline': 300,
}
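
The number after the pipeline path is its order: Scrapy runs enabled pipelines from the lowest value to the highest, and values are conventionally kept in the 0 to 1000 range. The excerpt below shows the setting next to two other standard Scrapy settings that are often adjusted for a crawl like this one. The chosen values are assumptions for illustration, not part of the original project.

```python
# settings.py (excerpt)

# Enable the pipeline; lower numbers run earlier when several are enabled.
ITEM_PIPELINES = {
    'film.pipelines.FilmPipeline': 300,
}

# Assumed value: wait about one second between requests to go easy on the site.
DOWNLOAD_DELAY = 1

# Assumed value: show only warnings and errors instead of the full debug log.
LOG_LEVEL = 'WARNING'
```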

7. Start the crawl: create a new Python file with the content below and run it

from scrapy import cmdline

cmdline.execute("scrapy crawl film --nolog".split())
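
Note that `--nolog` turns off all log output, including error messages from failing callbacks, so it can hide the reason why test.txt stays empty. A small sketch of building the argument list with the flag as an option; the helper name `build_crawl_args` and its `quiet` parameter are hypothetical:

```python
def build_crawl_args(spider_name, quiet=True):
    """Build the argv list for `scrapy crawl`, optionally with --nolog."""
    args = ['scrapy', 'crawl', spider_name]
    if quiet:
        args.append('--nolog')
    return args


# Same list as "scrapy crawl film --nolog".split()
print(build_crawl_args('film'))
# Without --nolog, useful while debugging
print(build_crawl_args('film', quiet=False))
```

The resulting list can be passed to `cmdline.execute(...)` in place of the split string.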
