Oracle SQL Practice: Crawling with Scrapy and Storing the Data in Oracle

Latest recommended article published on 2022-11-22 23:34:28

weixin_39827850

Tags: oracle sql practice

Article link: https://blog.csdn.net/weixin_39827850/article/details/111285336

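This article shows how to use the Scrapy crawling framework to scrape web page data and store it in an Oracle database. It creates a crawler project and defines a spider that extracts tags, attributes, and titles from the page. This is a basic tutorial; more complex scenarios will be explored later.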

Summary generated by CSDN using intelligent technology.

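Today we use Scrapy, a dedicated asynchronous crawling framework. The concurrency and exception handling that you would otherwise have to add to ordinary code by hand are already built into the framework and can be used directly. All we need to do is extract the tags, attributes, or titles we want inside the spider. The steps and code are as follows: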

Create a practice directory named test, then: cd test
Create the crawler project with: scrapy startproject shares, then: cd shares
Generate the initial spider code with: scrapy genspider share books.toscrape.com

# -*- coding: utf-8 -*-
import scrapy

from ..items import SharesItem


class ShareSpider(scrapy.Spider):
    name = 'share'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on books.toscrape.com is an <article class="product_pod"> element
        for book in response.xpath('//article[@class="product_pod"]'):
            item = SharesItem()  # a fresh item for every book
            name = book.css('h3>a::attr(title)').extract_first() or ''
            item['name'] = name[:175]  # keep the title within 175 characters
            # Drop the leading currency symbol from the price text
            item['price'] = book.css('div.product_price>p::text').extract_first()[1:]
            yield item
        # Follow the "next" link until the last page is reached
        next_url = response.css('ul.pager>li.next>a::attr(href)').extract_first()
        print(next_url)
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
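The per-book cleanup and the pagination step in parse can be tried out without running a crawl. The following is a minimal sketch, not part of the project: the helper names clean_name, clean_price, and next_page are hypothetical, and the standard library's urljoin stands in for response.urljoin, which resolves a relative link against the URL of the page it was found on.

```python
from urllib.parse import urljoin

MAX_NAME_LEN = 175  # same limit the spider applies to item['name']


def clean_name(raw, max_len=MAX_NAME_LEN):
    """Return the title cut to max_len characters, or '' if nothing was extracted."""
    return (raw or '')[:max_len]


def clean_price(raw):
    """Drop the leading currency symbol, as the spider's [1:] slice does."""
    return raw[1:] if raw else None


def next_page(current_url, href):
    """Resolve a relative 'next' link against the page it was found on."""
    return urljoin(current_url, href) if href else None
```

Note that the relative link is resolved against the current page, not the site root, which is why the spider calls response.urljoin on every page instead of concatenating strings.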

Create the item container in the items file (items.py)

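# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.loader import ItemLoader
# Newer Scrapy releases provide TakeFirst from itemloaders.processors instead
from scrapy.loader.processors import TakeFirst


class SharesItemLoader(ItemLoader):
    # Reduce each field's list of extracted values to its first usable entry
    default_output_processor = TakeFirst()


class SharesItem(scrapy.Item):
    name = scrapy.Field()   # book title
    price = scrapy.Field()  # price without the currency symbol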
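An item loader collects a list of extracted values per field, and TakeFirst reduces that list to a single value. The function below is a plain-Python stand-in that mimics the documented behavior (return the first value that is neither None nor an empty string). It is not the library's own code, and the name take_first is hypothetical.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != '':
            return value
    return None  # nothing usable was collected
```

With this output processor in place, a field ends up holding a single string such as a title, not a one-element list.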

Set up the database connection in the item pipeline (pipelines.py)

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import cx_Oracle


def connect_oracle():
    # Arguments: user, password, host:port/service_name
    conn = cx_Oracle.connect('c##scott', '123456', '192.168.31.210:1521/orcl')
    return conn


class SharesPipeline(object):
    def process_item(self, item, spider):
        conn = connect_oracle()
        cur = conn.cursor()
        # Bind variables keep quotes in book titles from breaking the statement
        sql = "insert into shares values(seq_shid.nextval, :name, :price)"
        print('SQL statement:', sql)
        try:
            cur.execute(sql, {'name': item['name'], 'price': item['price']})
            conn.commit()
        except Exception as e:
            print('Database error...[%r]' % e)
            conn.rollback()
        finally:
            conn.close()  # always release the connection, even after an error
        return item
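The pipeline above opens a new connection for every item. Scrapy also calls open_spider and close_spider on a pipeline, so one connection can serve the whole crawl. The sketch below illustrates that pattern under stated assumptions: the class name SingleConnectionPipeline and the connect parameter are hypothetical, the table layout is simplified, and sqlite3 stands in for Oracle only so the example runs without a database server. In a real project the connection factory would typically be wired in through the pipeline's from_crawler class method or replaced by connect_oracle.

```python
import sqlite3


class SingleConnectionPipeline:
    """Keep one DB-API connection open for the whole crawl."""

    # Named binds (:name, :price) are accepted by both sqlite3 and cx_Oracle
    insert_sql = "insert into shares (name, price) values (:name, :price)"

    def __init__(self, connect):
        self.connect = connect  # zero-argument callable returning a connection
        self.conn = None

    def open_spider(self, spider):
        self.conn = self.connect()  # called once when the crawl starts

    def close_spider(self, spider):
        self.conn.close()  # called once when the crawl ends

    def process_item(self, item, spider):
        cur = self.conn.cursor()
        try:
            cur.execute(self.insert_sql,
                        {'name': item['name'], 'price': item['price']})
            self.conn.commit()
        except Exception as exc:
            print('Database error: %r' % exc)
            self.conn.rollback()
        finally:
            cur.close()
        return item


def demo():
    """Run one item through the pipeline against an in-memory stand-in table."""
    def connect():
        conn = sqlite3.connect(':memory:')
        conn.execute(
            "create table shares (id integer primary key, name text, price text)")
        return conn

    pipeline = SingleConnectionPipeline(connect)
    pipeline.open_spider(spider=None)
    pipeline.process_item(
        {'name': "It's Only the Himalayas", 'price': '45.17'}, spider=None)
    rows = pipeline.conn.execute("select name, price from shares").fetchall()
    pipeline.close_spider(spider=None)
    return rows
```

The demo title contains an apostrophe on purpose: with bind variables it is stored as-is, whereas pasting it into the SQL text with string formatting would end the quoted literal early.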

In the settings file (settings.py), configure the download delay and related options by adding or modifying the following:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
ITEM_PIPELINES = {
    'shares.pipelines.SharesPipeline': 300,
}
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.25
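With DOWNLOAD_DELAY = 0.25, the wait between requests is not exactly a quarter of a second: Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting is enabled by default, and the documentation describes the wait as a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. A small sketch of that arithmetic, where delay_bounds is a hypothetical helper and not a Scrapy function:

```python
def delay_bounds(download_delay, randomize=True):
    """Return the (shortest, longest) wait in seconds between two requests."""
    if not randomize:
        return (download_delay, download_delay)
    return (0.5 * download_delay, 1.5 * download_delay)
```

For the value used here, the wait therefore falls between 0.125 and 0.375 seconds.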

This article covers only a simple case as an introduction. More complex scenarios will follow in later updates, so stay tuned.
