Python Scrapy beginner tutorial: Python learning, Scrapy study notes


weixin_39843338

Article link: https://blog.csdn.net/weixin_39843338/article/details/111760624

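This article describes how to use an ORM (object-relational mapping) with SQLAlchemy for database operations in a Python Scrapy project. It first explains how to write a decorator that checks whether a spider has authorized a pipeline, and how to set the pipelines on the spider. It then creates a mapping class for a database table and uses sessionmaker and the Base class to talk to the database. Finally, it shows how to instantiate the Topic class in a pipeline and insert data into the database.

(Summary generated automatically by CSDN.)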

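python-scrapy study notes

1. You can specify which pipelines process the data of each spider, though this takes a bit of code

First we need a decorator. It goes in the pipelines file, outside of any class, because several pipelines need to use it.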

import functools
import logging


def check_spider_pipeline(process_item_method):
    """Decorator for a pipeline's process_item method.

    :param process_item_method: the process_item method to wrap
    :return: a wrapper that runs the method only for spiders that list this pipeline
    """

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # message template for debugging; "{{}}" leaves a slot for the action
        msg = "{{}} {} pipeline step".format(self.__class__.__name__)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            logging.info(msg.format("executing"))
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            logging.info(msg.format("skipping"))
            return item

    return wrapper

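The decorator checks whether the spider has listed this pipeline. The key line is

if self.__class__ in spider.pipeline:

Because of this check, we need to set our pipelines in the spider:

pipeline = set([
    pipelines.RentMySQLPipeline,
])

Add this code to the spider class to link the two pieces together. Once a pipeline uses the decorator, it checks whether the spider has authorized it to operate on the item.

Of course, the pipelines module must be imported into the spider file before it is used.

Once the two are linked, apply the decorator like this:

@check_spider_pipeline
def process_item(self, item, spider):
    ...

That is all it takes. Put this decorator on the process_item method of every pipeline, then authorize the pipelines each spider should use in the spider itself.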
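To see the effect of the check without running a crawl, here is a minimal sketch that wires the pieces together with plain Python objects. `RentSpider` and `OtherPipeline` are hypothetical stand-ins introduced for illustration (a real spider would subclass `scrapy.Spider`), and the loop at the end only imitates how Scrapy passes an item through each registered pipeline in turn.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)


def check_spider_pipeline(process_item_method):
    """Run process_item only if the spider lists this pipeline class."""

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        name = self.__class__.__name__
        if self.__class__ in spider.pipeline:
            logging.info("executing %s pipeline step", name)
            return process_item_method(self, item, spider)
        logging.info("skipping %s pipeline step", name)
        return item

    return wrapper


class RentMySQLPipeline:
    @check_spider_pipeline
    def process_item(self, item, spider):
        item["seen_by"].append("RentMySQLPipeline")
        return item


class OtherPipeline:  # hypothetical second pipeline, not authorized below
    @check_spider_pipeline
    def process_item(self, item, spider):
        item["seen_by"].append("OtherPipeline")
        return item


class RentSpider:  # hypothetical stand-in for a scrapy.Spider subclass
    name = "rent"
    pipeline = {RentMySQLPipeline}


item = {"seen_by": []}
spider = RentSpider()

# Imitate Scrapy calling every registered pipeline one after another.
for step in (RentMySQLPipeline(), OtherPipeline()):
    item = step.process_item(item, spider)

print(item["seen_by"])
```

Only the pipeline listed in `RentSpider.pipeline` touches the item; the other one returns it unchanged. Note that the decorator reads `spider.pipeline` directly, so a spider that does not define the attribute would raise an `AttributeError`.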

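2. With an ORM, we can get started with database operations quickly

ORM stands for object-relational mapping.

First we need some background knowledge. I use MySQL and SQLAlchemy myself; if you are not familiar with them, see the MySQL beginner tutorial and the SQLAlchemy tutorial.

After the spider has scraped the data, all of it is handed to the pipelines for processing. Pipelines must be registered in settings:

# Lower numbers run first; values are conventionally in the 0-1000 range.
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
    'spider.pipelines.SpiderDetailPipeline': 300,
}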

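Then we need to create our own database and table in MySQL:

mysql -u root -p
create database xxx;
use xxx;
create table spider(id integer not null, primary key (id));

Once the database and the tables you need are in place, we create a mapping class for the table in the program: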

from sqlalchemy import Column, Integer, String
# SQLAlchemy 1.4+; older versions import declarative_base
# from sqlalchemy.ext.declarative instead.
from sqlalchemy.orm import declarative_base, sessionmaker

import settings  # project settings module; must expose an `engine`

Base = declarative_base()


class Topic(Base):
    __tablename__ = 'topic'

    id = Column(Integer, primary_key=True, autoincrement=True)
    topic_title = Column(String(256))
    topic_author = Column(String(256))
    topic_author_img = Column(String(256))
    topic_class = Column(String(256))
    topic_reply_num = Column(Integer)
    spider_time = Column(String(256))

    def __init__(self, topic_title, topic_author, topic_class, topic_reply_num, spider_time, topic_author_img):
        self.topic_title = topic_title
        self.topic_author = topic_author
        self.topic_author_img = topic_author_img
        self.topic_class = topic_class
        self.topic_reply_num = topic_reply_num
        self.spider_time = spider_time


# Session factory bound to the engine defined in settings.
DBSession = sessionmaker(bind=settings.engine)
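The mapping module reads `settings.engine`, which the post does not show. As a rough sketch of how such an engine could be created and used, the example below builds one with `create_engine`, creates the table from the mapping, and inserts one row. It assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite database so that it runs without a MySQL server; the trimmed-down `Topic` class, the sample values, and the MySQL URL in the comment (including the `pymysql` driver) are illustrative assumptions, not the author's configuration.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# Assumption: in the post this object lives in settings.py as `settings.engine`.
# For MySQL the URL could look like
# "mysql+pymysql://user:password@localhost/xxx" (driver choice is an assumption).
engine = create_engine("sqlite:///:memory:")

Base = declarative_base()


class Topic(Base):
    """Trimmed-down version of the mapping class, for illustration only."""

    __tablename__ = "topic"

    id = Column(Integer, primary_key=True, autoincrement=True)
    topic_title = Column(String(256))
    topic_author = Column(String(256))
    topic_reply_num = Column(Integer)


# Create every table known to Base that does not exist yet.
Base.metadata.create_all(engine)

# Session factory bound to the engine, as in the mapping module.
DBSession = sessionmaker(bind=engine)

session = DBSession()
try:
    session.add(Topic(topic_title="hello", topic_author="alice", topic_reply_num=3))
    session.commit()
    # Read the row back to confirm the insert.
    rows = [
        (t.id, t.topic_title, t.topic_reply_num)
        for t in session.query(Topic).all()
    ]
finally:
    session.close()

print(rows)
```

`Base.metadata.create_all` is one way to create the `topic` table directly from the mapping class instead of writing the `create table` statement by hand.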

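Base is the base class that all mapped classes inherit from. DBSession is a session factory created with sessionmaker; each session it produces is a conversation with the database and makes database operations convenient. Next comes inserting the data. Since this is a crawler, we do not need to care about deleting or updating. Here is the code: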

import datetime

# Topic and DBSession are the mapping class and session factory defined above.


class TesterhomeSpiderPipeline(object):
    def __init__(self):
        self.session = DBSession()

    @check_spider_pipeline
    def process_item(self, item, spider):
        # Each field holds a list of extracted strings; take the first match.
        # In Python 3 the values are already str, so no .encode() is needed.
        my_topic = Topic(
            topic_title=item['topic_title'][0],
            topic_author=item['topic_author'][0],
            topic_author_img=item['topic_author_img'][0],
            topic_class=item['topic_class'][0],
            # the column is an Integer, so convert the scraped text
            topic_reply_num=int(item['topic_reply_num'][0]),
            spider_time=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        )
        try:
            self.session.add(my_topic)
            self.session.commit()
        except Exception:
            # undo the failed transaction, then let Scrapy see the error
            self.session.rollback()
            raise
        finally:
            self.session.close()
        return item

Instantiating the Topic class and then calling the session's methods to insert the data into the database completes one database operation.
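The pipeline above closes its session after every item. One possible alternative, sketched here under assumptions, is to open the session in Scrapy's `open_spider` hook and close it once in `close_spider`, keeping only the commit-or-rollback logic in `process_item`. To keep the sketch runnable without a database, `FakeSession` is a hypothetical stand-in that just records which methods were called; in a real project the factory would be the `DBSession` from the mapping module, and passing it through the constructor is only done here to make the flow easy to check.

```python
class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy session; records calls only."""

    def __init__(self, fail_on=None):
        self.calls = []
        self.fail_on = fail_on

    def add(self, obj):
        self.calls.append("add")
        if obj == self.fail_on:
            raise ValueError("simulated database error")

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


class TopicPipeline:
    def __init__(self, session_factory):
        # Assumption: injected for testability; a Scrapy pipeline would
        # normally import DBSession from the mapping module instead.
        self.session_factory = session_factory
        self.session = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.session = self.session_factory()

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.session.close()

    def process_item(self, item, spider):
        try:
            self.session.add(item)
            self.session.commit()
        except Exception:
            # Undo the failed transaction and let the error propagate.
            self.session.rollback()
            raise
        return item


session = FakeSession(fail_on="bad")
pipeline = TopicPipeline(lambda: session)

pipeline.open_spider(spider=None)
pipeline.process_item("good", spider=None)
try:
    pipeline.process_item("bad", spider=None)
except ValueError:
    pass  # the pipeline re-raises after rolling back
pipeline.close_spider(spider=None)

print(session.calls)
```

The recorded calls show one commit for the good item, one rollback for the failing item, and a single close at the end of the run.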

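I had originally planned to publish two blog posts a week, whether or not they had any substance, haha! But I got lazy again over the weekend, so I am catching up on Monday!

The first note covers a problem I actually ran into while using the crawler. The second... well... is there to make up the numbers! Since I have built websites with flask before, SQLAlchemy is easy for me to use!

This article is my original work and took effort to write. Please credit the source when reposting! Thank you.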
