Can a Python Scrapy spider store its scraped data in a database? Scrapy-based data collection in Python (writing to a database)...

Latest recommended article published on 2023-07-10 11:20:04

行走的饺子


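Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_42505792/article/details/114451227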


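The previous section covered how to filter data out of a page's source inside a spider.

This section moves on to another Scrapy component, the pipeline, which post-processes the scraped data.

(Storing the data in a MySQL database serves as the running example.)

Although Scrapy's architecture has many customizable modules, a complete Scrapy crawler only requires us to write a spider and a pipeline: one collects the data, the other processes it.

Other parts, such as the downloader middleware and the engine core, run automatically.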

Environment setup:

To write to MySQL, Python first needs a MySQL driver, so install pymysql:

pip install pymysql
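
The pipeline later in this article inserts into a table named quotes with the columns content, author, and tags, but the article never shows that table being created. The sketch below is one possible way to create it. Only the table and column names come from the article; the id column, the column types, and the helper name create_quotes_table are assumptions made for illustration.

```python
# Hypothetical schema: only the table and column names come from the article's
# INSERT statement; the id column and all column types are assumptions.
CREATE_QUOTES_TABLE = """
CREATE TABLE IF NOT EXISTS quotes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    content TEXT,
    author VARCHAR(255),
    tags VARCHAR(255)
) DEFAULT CHARSET = utf8mb4
"""


def create_quotes_table(conn):
    """Run the DDL on an open DB-API connection, e.g. one from pymysql.connect()."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_QUOTES_TABLE)
    finally:
        cursor.close()
    conn.commit()
```

Calling create_quotes_table(conn) once before the first crawl should be enough, since IF NOT EXISTS makes repeated calls harmless.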

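The item is the bridge that connects the spider and the pipeline in Scrapy.

The spider scrapes the data, filters it, and writes it into an item,

then yields the item back to the engine, which hands it to the pipeline.

The pipeline opens the database connection and writes the data.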

Item:

Declare the fields to be passed along in items.py:

import scrapy


class MyItem(scrapy.Item):
    content = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

That's right: these few lines, which only declare the field names, are all it takes.

Spider:

Modify the parse() function of Spider01.py from the previous section as follows:

import scrapy

from MyScrapyProject.items import MyItem  # module and class names are your own; they must match items.py


class Spider01(scrapy.Spider):
    name = 'MyMainSpider'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        quote_list = response.css('div.quote')
        item = MyItem()  # default constructor; this one object is reused for every quote
        for quote in quote_list:
            print('Now loading a quote...')
            content = quote.css('span.text::text').get()
            item['content'] = content
            author = quote.css('small.author::text').get()
            item['author'] = author
            tags = quote.css('a.tag::text').getall()
            # tags is a list of strings; join() concatenates them into one comma-separated string
            item['tags'] = ",".join(tags)
            # getall() is used here instead of looping over the tags as the old code below did; pick whichever suits your needs
            yield item

# with open('text.txt','w') as f:
#     for oneSentence in quote_list:
#         f.write(oneSentence.css('span.text::text').get()+'\n')
#         f.write(oneSentence.css('small.author::text').get()+'\n')
#         tag_list=oneSentence.css('a.tag');
#         for tag in tag_list:
#             f.write(tag.css('::text').get()+' ')
#         f.write('\n')

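A small note for the record:

commenting out a large chunk of code in the middle of a code block can lead to odd indentation errors,

so the large commented-out block was moved to the end.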

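Some notes on yield, which may not be entirely accurate:

By the time yield runs, the item has been fully populated; yield hands it to the engine, which passes it on to the pipeline.

Once the pipeline has finished processing the item, execution returns to parse and resumes at the statement after yield,

that is, at the next iteration of the for loop.

The benefit is that only one item object ever exists, which saves memory and construction time. This relies on each item being fully processed before the next one is filled in; if anything downstream keeps a reference to the item, create a new MyItem() inside the loop instead.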
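
To make the trade-off of reusing one item concrete, here is a minimal pure-Python sketch. It does not use Scrapy: plain dicts stand in for MyItem, and the two generator functions are hypothetical names invented for this illustration.

```python
def reuse_one_item(rows):
    item = {}                      # a single dict shared by every iteration
    for row in rows:
        item['content'] = row
        yield item


def fresh_item_each_time(rows):
    for row in rows:
        yield {'content': row}     # a new dict per iteration


rows = ['first quote', 'second quote']

# A consumer that handles each item before asking for the next sees correct values.
seen_immediately = [item['content'] for item in reuse_one_item(rows)]

# A consumer that stores the items and reads them later gets the same object twice,
# so every entry shows the last value written.
stored_shared = [item['content'] for item in list(reuse_one_item(rows))]
stored_fresh = [item['content'] for item in list(fresh_item_each_time(rows))]

print(seen_immediately)
print(stored_shared)
print(stored_fresh)
```

The shared object is only safe in the first case, which is why building a fresh item per loop iteration is the more defensive choice.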

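A spider whose items should reach a pipeline must yield them from parse (or return them as an iterable).

The main reason is that the Scrapy core checks the type of every object the spider yields: a Request goes into a request queue (maintained by Scrapy itself, so we need not care about it), while only an item is passed to the pipeline.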
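
The type-based routing described above can be sketched roughly as follows. This is a simplified illustration and not Scrapy's actual implementation: FakeRequest, parse, and dispatch are hypothetical stand-ins, and a plain list plays the role of the pipeline.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only carries a URL."""

    def __init__(self, url):
        self.url = url


def parse(page):
    """Hypothetical callback: yields one item per quote, then a follow-up request."""
    for text in page['quotes']:
        yield {'content': text}
    if page.get('next'):
        yield FakeRequest(page['next'])


def dispatch(results, request_queue, pipeline):
    """Route each yielded object by its type."""
    for obj in results:
        if isinstance(obj, FakeRequest):
            request_queue.append(obj)   # requests wait in a queue to be downloaded
        else:
            pipeline(obj)               # everything else is treated as an item


request_queue = deque()
stored_items = []
page = {'quotes': ['q1', 'q2'], 'next': '/page/2/'}
dispatch(parse(page), request_queue, stored_items.append)
```

After the call, stored_items holds the two items and request_queue holds the single follow-up request.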

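pipeline:

The following can basically be used as a template.

Do not forget to enable the pipeline in settings.py!

The ITEM_PIPELINES setting there is commented out by default.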
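
As a rough guide, enabling the pipeline could look like the fragment below. The dotted path is an assumption: it supposes the MysqlPipeline class from this section lives in pipelines.py inside the MyScrapyProject package used in the spider's import. The number sets the order in which pipelines run (lower values run first); 300 is just a conventional choice.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Assumed location of the pipeline class: MyScrapyProject/pipelines.py
    'MyScrapyProject.pipelines.MysqlPipeline': 300,
}
```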

# from itemadapter import ItemAdapter
import pymysql
import pymysql.cursors


class MysqlPipeline:
    def __init__(self):
        self.mysql_url = 'localhost'
        self.mysql_db = 'mydatabase'

    def open_spider(self, spider):
        # Called once when the spider starts: open the database connection
        self.mysql_conn = pymysql.connect(
            host=self.mysql_url,
            user='root',
            password='xxxxxxxxx',  # fill in your MySQL password
            database=self.mysql_db,
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection
        self.mysql_conn.close()

    def process_item(self, item, spider):
        print('process the item')
        try:
            cursor = self.mysql_conn.cursor()
            try:
                sql_write = "INSERT INTO quotes (content, author, tags) VALUES (%s, %s, %s);"
                cursor.execute(sql_write, (item.get("content", ""), item.get("author", ""), item.get("tags", "")))
                cursor.connection.commit()
            except Exception as e:
                print('Something wrong with Table INSERT')
                print(e)
            finally:
                cursor.close()
        except Exception as e:
            print('Something wrong with MYSQL')
            print(e)
        return item

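The core is cursor.execute() and cursor.connection.commit().

Always, always call cursor.connection.commit(): PyMySQL does not autocommit by default, so without the commit the inserted rows are never saved to the database.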
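
The effect of a missing commit can be observed without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module purely as a stand-in, so it is not the article's PyMySQL setup, but the principle is the same: other connections only see the row after the writer commits. The file name and the helper count_quotes are made up for this illustration.

```python
import os
import sqlite3
import tempfile


def count_quotes(db_path):
    """Open a separate connection, as another program would, and count the rows."""
    reader = sqlite3.connect(db_path)
    try:
        return reader.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
    finally:
        reader.close()


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE quotes (content TEXT, author TEXT, tags TEXT)")
writer.commit()

writer.execute(
    "INSERT INTO quotes (content, author, tags) VALUES (?, ?, ?)",
    ("a quote", "an author", "tag1,tag2"),
)
rows_before_commit = count_quotes(db_path)   # the insert is still pending
writer.commit()
rows_after_commit = count_quotes(db_path)    # the insert is now saved
writer.close()

print(rows_before_commit, rows_after_commit)
```

If the writer closed its connection without committing, the pending insert would be discarded, which matches the symptom of a crawl that runs without errors yet leaves the table empty.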

Original article: https://blog.csdn.net/Cake_C/article/details/107135741
