Python Crawler (16): Scraping Bank Wealth-Management Product Data with Scrapy (Over 120,000 Records)

Latest recommended article published on 2023-05-23 23:42:49

Eric (埃里克)


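Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.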

Article link: https://blog.csdn.net/weixin_28704565/article/details/114910894


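The goal of this Scrapy crawler is to scrape the information for every bank wealth-management product listed on the "融360" (Rong360) website and store it in MongoDB. A screenshot of the page is shown below; the full data set contains more than 120,000 records.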

[Figure: Bank wealth-management products]

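We will not go into how to create and run a Scrapy project here and only give the relevant code. Readers interested in those steps can refer to: Scrapy Crawler (4): Scraping the Douban Movie Top 250 Images.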

Modify items.py as follows. It stores the information for each wealth-management product, such as the product name and the issuing bank.

import scrapy


class BankItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    bank = scrapy.Field()
    currency = scrapy.Field()
    startDate = scrapy.Field()
    endDate = scrapy.Field()
    period = scrapy.Field()
    proType = scrapy.Field()
    profit = scrapy.Field()
    amount = scrapy.Field()

Create the spider file bankSpider.py with the following code, which scrapes the details of each wealth-management product from the page.

import scrapy

from bank.items import BankItem


class bankSpider(scrapy.Spider):
    name = 'bank'
    start_urls = ['https://www.rong360.com/licai-bank/list/p1']

    def parse(self, response):
        # Skip the first <tr> (presumably the table header row).
        trs = response.css('tr')[1:]
        for tr in trs:
            # Create a fresh item per row so yielded items do not share state.
            item = BankItem()
            item['name'] = tr.xpath('td[1]/a/text()').extract_first()
            item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
            item['currency'] = tr.xpath('td[3]/text()').extract_first()
            item['startDate'] = tr.xpath('td[4]/text()').extract_first()
            item['endDate'] = tr.xpath('td[5]/text()').extract_first()
            item['period'] = tr.xpath('td[6]/text()').extract_first()
            item['proType'] = tr.xpath('td[7]/text()').extract_first()
            item['profit'] = tr.xpath('td[8]/text()').extract_first()
            item['amount'] = tr.xpath('td[9]/text()').extract_first()
            yield item

        # Pick the pagination link: the only match, or the second of several.
        next_pages = response.css('a.next-page')
        if len(next_pages) == 1:
            next_page_link = next_pages.xpath('@href').extract_first()
        elif len(next_pages) > 1:
            next_page_link = next_pages[1].xpath('@href').extract_first()
        else:
            next_page_link = None  # no pagination link on this page

        if next_page_link:
            next_page = "https://www.rong360.com" + next_page_link
            yield scrapy.Request(next_page, callback=self.parse)
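
The pagination rule at the end of parse can be tried out without running a crawl. The following is a minimal sketch, not part of the original project: the helper names pick_next_href and build_next_url are hypothetical, the sample hrefs are made up for illustration, and reading the second a.next-page link as "next" is an assumption about the page layout. It also adds an explicit guard for a page that has no such link.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.rong360.com"  # site root used by the spider


def pick_next_href(hrefs):
    """Return the href to follow, or None when there is nothing to follow.

    Mirrors the spider's rule: one matching link -> use it; two or more ->
    use the second one (assumed to be the "next" link).
    """
    if not hrefs:
        return None
    return hrefs[0] if len(hrefs) == 1 else hrefs[1]


def build_next_url(hrefs, base=BASE_URL):
    """Turn the chosen href into an absolute URL (None if there is no link)."""
    href = pick_next_href(hrefs)
    return urljoin(base, href) if href else None


# Hypothetical hrefs, shaped like the start URL of the spider.
print(build_next_url(["/licai-bank/list/p2"]))
print(build_next_url(["/licai-bank/list/p1", "/licai-bank/list/p3"]))
print(build_next_url([]))
```

Using urljoin instead of string concatenation also copes with hrefs that are already absolute URLs.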

To store the scraped data in MongoDB, we need to modify pipelines.py as follows:

# pipelines to insert the data into mongodb
import pymongo
from scrapy.utils.project import get_project_settings

# Load the project settings (replaces the old `scrapy.conf.settings` import,
# which newer Scrapy releases no longer provide).
settings = get_project_settings()


class BankPipeline(object):

    def __init__(self):
        # connect database
        self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
        # using name and password to login mongodb (older PyMongo API)
        # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])
        # handle of the database and collection of mongodb
        self.db = self.client[settings['MONGO_DB']]
        self.coll = self.db[settings['MONGO_COLL']]

    def process_item(self, item, spider):
        postItem = dict(item)
        # insert_one() replaces the deprecated Collection.insert()
        self.coll.insert_one(postItem)
        return item
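
As an alternative layout, a pipeline can receive its settings through Scrapy's from_crawler hook and open and close the MongoDB connection in open_spider and close_spider. The sketch below is only an illustration under assumptions: the class name MongoPipelineSketch is hypothetical, the defaults for host and port are made up, and pymongo is imported inside open_spider so that the class can be constructed without touching the driver.

```python
class MongoPipelineSketch:
    """Hypothetical variant of BankPipeline that reads settings via from_crawler."""

    def __init__(self, host, port, db_name, coll_name):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.coll_name = coll_name
        self.client = None
        self.coll = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when it is defined and passes the crawler,
        # whose .settings object exposes the values from settings.py.
        s = crawler.settings
        return cls(
            host=s.get("MONGO_HOST", "localhost"),
            port=s.get("MONGO_PORT", 27017),
            db_name=s.get("MONGO_DB"),
            coll_name=s.get("MONGO_COLL"),
        )

    def open_spider(self, spider):
        # Imported here so building the pipeline does not need the driver.
        import pymongo
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        self.coll = self.client[self.db_name][self.coll_name]

    def close_spider(self, spider):
        # Release the connection when the crawl ends.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        self.coll.insert_one(dict(item))
        return item
```

To try it, the ITEM_PIPELINES entry would point at this class instead of BankPipeline; the rest of the project stays the same.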

The MongoDB parameters used above, such as MONGO_HOST and MONGO_PORT, are set in settings.py. Modify settings.py as follows:

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}

# MongoDB connection parameters
MONGO_HOST = "localhost"  # host IP
MONGO_PORT = 27017  # port number
MONGO_DB = "Spider"  # database name
MONGO_COLL = "bank"  # collection name
# MONGO_USER = ""
# MONGO_PSW = ""

The username and password can be added as needed.
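
If a username and password are needed, one common approach is to pass them to MongoDB inside a connection URI. The following is a rough sketch using only the standard library; the helper name build_mongo_uri and the sample credentials are hypothetical and not part of the original project.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user="", password=""):
    """Build a MongoDB connection URI; credentials are included only if given."""
    if user and password:
        # Percent-encode so characters such as '@' or ':' do not break the URI.
        return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"
    return f"mongodb://{host}:{port}/"


# Without credentials (matches the commented-out MONGO_USER / MONGO_PSW case).
print(build_mongo_uri("localhost", 27017))
# With made-up credentials that contain reserved characters.
print(build_mongo_uri("localhost", 27017, "crawler", "p@ss:word"))
```

The resulting string can be passed as the first argument to pymongo.MongoClient in place of separate host and port values.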

Now we can run the crawler. The result is shown below:

[Figure: Run results]

The crawl took 3 hours in total and collected more than 120,000 records, which is impressively efficient!
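
To put those figures in perspective, here is a quick back-of-the-envelope calculation. It treats "more than 120,000" as exactly 120,000 and "3 hours" as exactly 3 hours, so the result is only a rough lower bound on the average throughput.

```python
records = 120_000            # lower bound: the post reports "more than 120,000"
elapsed_seconds = 3 * 3600   # 3 hours expressed in seconds

records_per_second = records / elapsed_seconds
records_per_minute = records_per_second * 60

print(f"{records_per_second:.1f} records/s")    # about 11.1
print(f"{records_per_minute:.0f} records/min")  # about 667
```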

Finally, let's take a look at the data in MongoDB:

[Figure: MongoDB database]

Perfect! That's all for this post. Questions and discussion are welcome.
