Environment:
Windows 10; Python 3.6.2; Scrapy 1.4.0
The system has Python 2 and Python 3 installed side by side; see my earlier post:
Notes on running Python 2 and Python 3 side by side and the issues that come up
- Create the project
Since dmoz.org, the example site in the official Scrapy tutorial, now returns 403, I practiced on the meijutt (美剧天堂) site instead.
My project workspace is D:\workspaces\python\scrapy. Open a cmd prompt and run:

```shell
cd /d D:\workspaces\python\scrapy
python3 -m scrapy startproject tutorial
cd tutorial
python3 -m scrapy genspider meijutt meijutt.com
```
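For reference, `startproject` plus `genspider` should leave a skeleton roughly like this (the standard Scrapy 1.4 layout):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item field definitions
        middlewares.py
        pipelines.py      # item processing
        settings.py       # project settings
        spiders/
            __init__.py
            meijutt.py    # created by genspider
```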
- Write the spider script. `genspider` has already created the file under the project path:
D:\workspaces\python\scrapy\tutorial\tutorial\spiders\meijutt.py
```python
import scrapy
from tutorial.items import MeijuttItem


class MeijuttSpider(scrapy.Spider):
    name = 'meijutt'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        items = []
        for sel in response.xpath('//ul[@class="top-list fn-clear"]/li'):
            item = MeijuttItem()
            item['storyName'] = sel.xpath('./h5/a/text()').extract()
            # the state is sometimes wrapped in a <font> tag;
            # fall back to the plain <span> text when it is not
            item['storyState'] = sel.xpath('./span[1]/font/text()').extract()
            if not item['storyState']:
                item['storyState'] = sel.xpath('./span[1]/text()').extract()
            item['tvStation'] = sel.xpath('./span[2]/text()').extract()
            if not item['tvStation']:
                item['tvStation'] = ['未知']  # "unknown"
            item['updateTime'] = sel.xpath('./div[2]/text()').extract()
            if not item['updateTime']:
                item['updateTime'] = sel.xpath('./div[2]/font/text()').extract()
            items.append(item)
        return items
```
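The repeated "try one XPath, fall back to another" pattern in `parse()` can be factored into a small helper. This is a hypothetical refactor, not part of the original spider; `first_nonempty` is a name I made up:

```python
def first_nonempty(*candidates, default=None):
    """Return the first non-empty list among candidates.

    Mirrors the fallback logic in parse(): .extract() returns []
    when an XPath matches nothing, so an empty list means
    "try the next expression".
    """
    for lst in candidates:
        if lst:
            return lst
    return [default] if default is not None else []


# the storyState lookup then collapses to one call:
# item['storyState'] = first_nonempty(
#     sel.xpath('./span[1]/font/text()').extract(),
#     sel.xpath('./span[1]/text()').extract(),
# )
print(first_nonempty([], ['连载中']))           # → ['连载中']
print(first_nonempty([], [], default='未知'))   # → ['未知']
```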
- Define the fields to scrape in
D:\workspaces\python\scrapy\tutorial\tutorial\items.py
```python
import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class MeijuttItem(scrapy.Item):
    # the fields our spider fills in
    storyName = scrapy.Field()
    storyState = scrapy.Field()
    tvStation = scrapy.Field()
    updateTime = scrapy.Field()
```
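To see why declaring the fields helps, here is a toy stand-in (not Scrapy's actual implementation) that reproduces the behavior you get from `scrapy.Item`: assigning to an undeclared field raises `KeyError`, which catches typos early:

```python
class FieldCheckedItem(dict):
    """Toy illustration of scrapy.Item's field checking."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key!r} is not a declared field')
        super().__setitem__(key, value)


class ToyMeijuttItem(FieldCheckedItem):
    fields = ('storyName', 'storyState', 'tvStation', 'updateTime')


item = ToyMeijuttItem()
item['storyName'] = ['some show']   # declared field: accepted
try:
    item['storyname'] = ['typo']    # misspelled field: rejected
except KeyError as e:
    print('rejected:', e)
```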
- Process the scraped data in
D:\workspaces\python\scrapy\tutorial\tutorial\pipelines.py
```python
import time


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item


class MeijuttPipeline(object):
    def process_item(self, item, spider):
        # one output file per day, named <YYYYMMDD>movie.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'movie.txt'
        # open with an explicit encoding; the Python 2 trick of
        # reload(sys) + sys.setdefaultencoding is unnecessary
        # (and ineffective) on Python 3
        with open(fileName, 'a', encoding='utf-8') as fp:
            fp.write(item['storyName'][0] + '\t'
                     + item['storyState'][0] + '\t'
                     + item['tvStation'][0] + '\t'
                     + item['updateTime'][0] + '\n')
        return item
```
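One step the walkthrough leaves implicit: Scrapy only runs a pipeline that is enabled via the `ITEM_PIPELINES` setting. Something like the following needs to be present (300 is an arbitrary priority in the 0-1000 range):

```python
# D:\workspaces\python\scrapy\tutorial\tutorial\settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.MeijuttPipeline': 300,
}
```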
- Run the spider:

```shell
D:\workspaces\python\scrapy\tutorial>python3 -m scrapy crawl meijutt
```

Check the scraped results in the date-stamped output file.
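Because the pipeline stamps the file with today's date, a quick way to locate and inspect the output is to reproduce its naming scheme (a hypothetical snippet; it assumes you run it from the directory where the crawl ran):

```python
import os
import time

# same naming scheme as the pipeline: <YYYYMMDD>movie.txt
fileName = time.strftime('%Y%m%d', time.localtime()) + 'movie.txt'

if os.path.exists(fileName):
    with open(fileName, encoding='utf-8') as fp:
        for line in fp:
            # each line: name \t state \t station \t update time
            print(line.rstrip('\n'))
else:
    print(fileName, 'not found - did the crawl run today?')
```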
References
- scrapy实战–爬取最新美剧–python2版本 (Scrapy in practice: scraping the latest US TV shows, Python 2 version)
- Scrapy入门教程 (the Scrapy introductory tutorial)
Questions
If you have further questions, leave a comment or follow my WeChat public account, where you can get the full code for this project; I will follow up with more tutorials in this Scrapy spider series.