Scrapy Framework: A Beginner's Learning Notes

蜗牛壳上的小潘同志

Tags: beginner, getting started, example, web crawler, data, scrapy

Article link: https://blog.csdn.net/qq_40235133/article/details/102540518


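Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs such as data mining, information processing, and archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs or as a general-purpose web crawler. Beyond data mining, it is also used for monitoring and automated testing.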

1. Download and installation

I use PyCharm with Python 3.5. Scrapy itself can be installed with pip install scrapy.

2. Example

Scrape the titles and summaries of all news items on the Hubei Radio news site:
Steps:
(1) In the PyCharm terminal, run scrapy startproject myfirst to create my first project.

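File overview:

scrapy.cfg holds the project configuration, mainly basic settings for the Scrapy command-line tool. (The settings that actually control the crawler live in settings.py.)
items.py defines the data storage templates used to structure scraped data, similar to a Django Model.
pipelines.py defines item processing, typically persisting the structured data.
settings.py is the configuration file, covering things like crawl depth, concurrency, and download delay.
spiders is the spider directory, where you create spider files and write the crawl rules.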
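
As a rough illustration of the three settings.py options mentioned above (crawl depth, concurrency, download delay), a fragment might look like the following. The setting names are standard Scrapy settings; the values are only illustrative assumptions, not what this project used.

```python
# settings.py (fragment); values are illustrative assumptions, not recommendations
DEPTH_LIMIT = 3            # maximum link depth to follow; 0 (the default) means no limit
CONCURRENT_REQUESTS = 16   # maximum number of requests performed at the same time
DOWNLOAD_DELAY = 1         # seconds to wait between consecutive requests to the same site
```

A small delay and a modest concurrency value are a reasonable starting point when crawling a single news site, since they keep the load on the server low.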

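(2) Create the spider
In the terminal, run the genspider command to create a spider file named hbdt (for a CrawlSpider like the one below, this is typically scrapy genspider -t crawl hbdt news.hbtv.com.cn).
(3) Set the start URL and the extraction rules in the spider class. (The job of rules is to move on to the next page once the current page has been crawled.)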

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HbdtSpider(CrawlSpider):
    name = 'hbdt'
    allowed_domains = ['news.hbtv.com.cn']
    start_urls = ['http://news.hbtv.com.cn/hbxw1072?page=1']

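    rules = (
        # Follow the pager link selected by this XPath (used here as the "next page"
        # link) and parse every page reached this way with parse_item.
        Rule(LinkExtractor(restrict_xpaths=r"//div[@class ='page ov']/a[9]"), callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not pass the start URL's response to the rule callback,
        # so parse the first page explicitly as well.
        return self.parse_item(response)

    def parse_item(self, response):
        # Titles and summaries come back as two parallel lists and are paired by position.
        titles = response.xpath('//h3/a/text()').extract()
        contents = response.xpath("//div[@class='left-cont']//li//p/text()").extract()
        for title, content in zip(titles, contents):
            yield {
                "title": title,
                "content": content
            }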
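
One thing to watch in parse_item: zip stops at the shorter list, so if a news entry has no summary paragraph, entries are dropped silently. The sketch below is plain Python with made-up sample lists standing in for the two XPath results; pair_items is a hypothetical helper, not part of the original spider. It pads the shorter list so the mismatch becomes visible. Padding cannot tell which entry was missing its summary, so it is only a diagnostic aid.

```python
from itertools import zip_longest


def pair_items(titles, contents, missing=""):
    """Pair titles with summaries, padding the shorter list instead of dropping entries."""
    items = []
    for title, content in zip_longest(titles, contents, fillvalue=missing):
        items.append({"title": title.strip(), "content": content.strip()})
    return items


# Hypothetical sample data standing in for the two XPath result lists.
titles = ["Headline A", "Headline B", "Headline C"]
contents = ["Summary A", "Summary B"]

# Plain zip yields only as many pairs as the shorter list has entries.
print(len(list(zip(titles, contents))))
# zip_longest keeps every title; the last one gets an empty summary.
print(pair_items(titles, contents))
```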

(4) Set up the item pipeline
Save the scraped content to a txt file.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


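class MyfirstPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('hbdt.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item: write the title and the summary on separate lines.
        title = item['title']
        content = item['content']
        info = title + '\n' + content + '\n'
        self.file.write(info)
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file.
        self.file.close()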
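
As the template comment at the top of pipelines.py says, the pipeline only runs if it is listed in ITEM_PIPELINES in settings.py. A minimal fragment, assuming the project is named myfirst (as created by startproject above) so that the class lives at myfirst.pipelines.MyfirstPipeline:

```python
# settings.py (fragment)
# The dotted path assumes the project package is called "myfirst".
# The number is the pipeline's order (conventionally 0-1000); lower values run first.
ITEM_PIPELINES = {
    "myfirst.pipelines.MyfirstPipeline": 300,
}
```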

(5) Create a run script (equivalent to running scrapy crawl hbdt in the terminal from the project root)

from scrapy.cmdline import execute
execute(['scrapy','crawl','hbdt'])

(6) Run the script; the results are written to hbdt.txt
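
To check the output, the file can be read back into records. This is a minimal sketch that assumes every title and every summary occupies exactly one line, which is how the pipeline above writes them as long as the extracted text contains no line breaks. read_records is a hypothetical helper, and the demo uses a small stand-in file rather than real crawl output.

```python
import os
import tempfile


def read_records(path):
    """Read alternating title/summary lines back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # Lines 0, 2, 4, ... are titles; lines 1, 3, 5, ... are summaries.
    return [{"title": t, "content": c} for t, c in zip(lines[0::2], lines[1::2])]


# Demo with a stand-in file; after a real crawl, pass 'hbdt.txt' instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "hbdt_sample.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Headline A\nSummary A\nHeadline B\nSummary B\n")
    records = read_records(sample)

print(records)
```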
