Reading URLs from MySQL with Scrapy (Python): saving Scrapy data to the corresponding URL in MySQL

Latest recommended article published 2022-05-08 16:39:25

weixin_39714164

Tags: reading URLs from MySQL with Scrapy

Article link: https://blog.csdn.net/weixin_39714164/article/details/113540079

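This post describes using a Scrapy spider to read a list of URLs from a MySQL database and scrape two values (a review score and a review count) from each page. The author's problem is that the scraped values cannot be tied back to the original URL when they are saved. The solution is to add a `reviewURL` field to the `crawledScore` item and store the response URL in it inside the `parse` method. In the pipeline file, the SQL statement is then changed to an INSERT or UPDATE that includes the URL, so each result is associated with its source URL.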

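I am currently working with Scrapy.

I have a list of URLs stored in a MySQL database. A spider visits these URLs and captures two pieces of information (a score and a count). My goal is that when Scrapy finishes scraping a page, it automatically fills in the corresponding columns for that URL before moving on to the next one.

I'm a beginner, and I can't seem to get the saving part to work. The score and count are passed to the database successfully, but they are saved as a new row instead of being associated with the source URL.

Here is my code: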

amazon_spider.py

import MySQLdb
import scrapy

from whatoplaybot.items import crawledScore


class amazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]
    start_urls = []

    def start_requests(self):
        # Load the URLs to crawl from the `scraped` table.
        conn = MySQLdb.connect(
            user='root',
            passwd='',
            db='scraper',
            host='127.0.0.1',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM scraped;')
        rows = cursor.fetchall()
        conn.close()

        for row in rows:
            # dont_filter=True keeps duplicate URLs from being dropped.
            yield scrapy.Request(row[0], dont_filter=True)

    def parse(self, response):
        item = crawledScore()
        # First numeric match inside the average-rating element.
        item['reviewScore'] = response.xpath('//*[@id="avgRating"]/span/a/span/text()').re("[0-9,.]+")[0]
        # All numeric matches inside the review-count link.
        item['reviewCount'] = response.xpath('//*[@id="summaryStars"]/a/text()').re("[0-9,]+")
        yield item

pipelines.py

import MySQLdb


class storeScore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(
            user='root',
            passwd='',
            db='scraper',
            host='127.0.0.1',
            charset="utf8",
            use_unicode=True
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Adds a new row for every item; the source URL is not part of the statement.
            self.cursor.execute("""INSERT INTO scraped(score, count) VALUES (%s, %s)""", (item['reviewScore'], item['reviewCount']))
            self.conn.commit()
        except MySQLdb.Error as e:
            print("Error %d: %s" % (e.args[0], e.args[1]))
        return item

Any help and guidance would be greatly appreciated.

Thank you.

Best answer

Follow these steps:

Add a reviewURL field to the crawledScore item:

import scrapy


class crawledScore(scrapy.Item):
    reviewScore = scrapy.Field()
    reviewCount = scrapy.Field()
    reviewURL = scrapy.Field()

Store the response URL in item['reviewURL']:

def parse(self, response):
    item = crawledScore()
    item['reviewScore'] = response.xpath('//*[@id="avgRating"]/span/a/span/text()').re("[0-9,.]+")[0]
    item['reviewCount'] = response.xpath('//*[@id="summaryStars"]/a/text()').re("[0-9,]+")
    # Remember which page these values came from.
    item['reviewURL'] = response.url
    yield item
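One caveat with response.url: if the site redirects, it holds the final address, which may no longer match the url stored in the scraped table. Below is a minimal sketch of one way around this. It assumes Scrapy's redirect middleware is enabled, since that middleware records the chain of requested URLs under the redirect_urls key of response.meta. The helper name source_url and the example URLs are hypothetical.

```python
def source_url(response_url, meta):
    """Return the URL that was originally requested.

    `meta` is assumed to be the dict Scrapy exposes as response.meta.
    When a redirect happened, its 'redirect_urls' entry lists the URLs
    the request went through, starting with the first one requested.
    """
    redirect_urls = meta.get("redirect_urls")
    if redirect_urls:
        return redirect_urls[0]
    # No redirect recorded: the response URL is the requested URL.
    return response_url


# Hypothetical values standing in for response.url and response.meta.
redirected = source_url(
    "https://www.amazon.com/dp/EXAMPLE",
    {"redirect_urls": ["http://amazon.com/dp/EXAMPLE"]},
)
direct = source_url("https://www.amazon.com/dp/EXAMPLE", {})
```

Inside parse, this could be called as item['reviewURL'] = source_url(response.url, response.meta).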

Finally, in your pipeline file, insert or update depending on your logic:

Insert:

def process_item(self, item, spider):
    try:
        # Adds a new row that includes the source URL.
        self.cursor.execute("""INSERT INTO scraped(score, count, url) VALUES (%s, %s, %s)""", (item['reviewScore'], item['reviewCount'], item['reviewURL']))
        self.conn.commit()
    except MySQLdb.Error as e:
        print("Error %d: %s" % (e.args[0], e.args[1]))
    return item
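If rows for some URLs already exist, a plain INSERT would duplicate them. One possible alternative is MySQL's INSERT ... ON DUPLICATE KEY UPDATE, sketched below under the assumption that the url column carries a UNIQUE index (the original post does not show the table definition). The names UPSERT_SQL and upsert_args are hypothetical.

```python
# Assumes a UNIQUE index on scraped.url; without one, MySQL never sees a
# duplicate key and this statement behaves like a plain INSERT.
UPSERT_SQL = (
    "INSERT INTO scraped (url, score, count) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE score = VALUES(score), count = VALUES(count)"
)


def upsert_args(item):
    """Order the item's values to match the placeholders in UPSERT_SQL."""
    return (item["reviewURL"], item["reviewScore"], item["reviewCount"])


# Hypothetical item values, for illustration only.
example_args = upsert_args(
    {
        "reviewScore": "4.5",
        "reviewCount": "1,234",
        "reviewURL": "https://www.amazon.com/dp/EXAMPLE",
    }
)
```

In the pipeline, the call would then be self.cursor.execute(UPSERT_SQL, upsert_args(item)).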

Update:

def process_item(self, item, spider):
    try:
        # Fills in score and count on the existing row for this URL.
        self.cursor.execute("""UPDATE scraped SET score=%s, count=%s WHERE url=%s""", (item['reviewScore'], item['reviewCount'], item['reviewURL']))
        self.conn.commit()
    except MySQLdb.Error as e:
        print("Error %d: %s" % (e.args[0], e.args[1]))
    return item
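An UPDATE whose WHERE clause matches no row succeeds without changing anything, so it can help to check how many rows were affected. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in for MySQL, an assumption made only so the example runs without a server. Note that sqlite3 uses ? placeholders where MySQLdb uses %s. The function name store_score and the example URLs are hypothetical.

```python
import sqlite3


def store_score(conn, item):
    """Fill score and count on the row whose url matches the item.

    Returns the number of rows changed (0 means the URL was not found).
    """
    cur = conn.execute(
        "UPDATE scraped SET score = ?, count = ? WHERE url = ?",
        (item["reviewScore"], item["reviewCount"], item["reviewURL"]),
    )
    conn.commit()
    return cur.rowcount


# In-memory table holding one URL that is waiting to be scraped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped (url TEXT, score TEXT, count TEXT)")
conn.execute(
    "INSERT INTO scraped (url) VALUES (?)",
    ("https://www.amazon.com/dp/EXAMPLE",),
)

matched = store_score(
    conn,
    {
        "reviewScore": "4.5",
        "reviewCount": "1,234",
        "reviewURL": "https://www.amazon.com/dp/EXAMPLE",
    },
)
missed = store_score(
    conn,
    {
        "reviewScore": "3.0",
        "reviewCount": "10",
        "reviewURL": "https://www.amazon.com/dp/UNKNOWN",
    },
)
rows = conn.execute("SELECT url, score, count FROM scraped").fetchall()
```

Here the first call changes the existing row and the second changes nothing, so the table still holds a single row. With MySQLdb, the cursor's rowcount attribute can be read after execute() in a similar way.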
