For an experiment I needed to crawl holdings of specific asset types. After weighing the options I decided to give scrapy a try; it turned out to be quite satisfying.
Below is a low-budget version. With the 火车 collector (LocoySpider), that file took five hours to run: a hundred thousand URLs, some returning 200 and some 404, with a 100 ms delay between requests.
Here's a small XML snippet to give an idea of the data:
<DataBreakdown _SalePosition="S"/>
<Holding>
<HoldingDetail _DetailHoldingTypeId="BQ" _ExternalId="XS1257957222" ExternalName="CGMSE 2015-2X A1 1.35 29" _StorageId="90">
<Country _Id="IRL">Ireland</Country>
<CUSIP>G19032AK0</CUSIP>
<SEDOL>BYXXF19</SEDOL>
<ISIN>XS1257957222</ISIN>
<Currency _Id="EUR">Euro</Currency>
<SecurityName>Carlyle C 15-2 Dac 1.35%</SecurityName>
<Weighting>12.9579</Weighting>
<NumberOfShare>39620000</NumberOfShare>
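Before the spider code, a quick sanity check of the two fields we care about: `_DetailHoldingTypeId` is an attribute on `HoldingDetail`, while `ISIN` is child-element text. A minimal sketch with the standard-library ElementTree (closing tags added here so the fragment parses; the real files are complete documents):

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the snippet above
sample = """
<Holding>
  <HoldingDetail _DetailHoldingTypeId="BQ" _ExternalId="XS1257957222">
    <ISIN>XS1257957222</ISIN>
    <Weighting>12.9579</Weighting>
  </HoldingDetail>
</Holding>
"""

root = ET.fromstring(sample)
for detail in root.findall('.//HoldingDetail'):
    type_id = detail.get('_DetailHoldingTypeId')  # attribute access -> 'BQ'
    isin = detail.findtext('ISIN')                # child element text
    print(type_id + " " + isin)
```

The scrapy XPath expressions below (`@_DetailHoldingTypeId`, `ISIN/text()`) target exactly these two spots.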
# -*- coding: utf-8 -*-
import scrapy
import sys

# Python 2 hack so Chinese characters can be exported without encoding errors
reload(sys)
sys.setdefaultencoding('utf8')

class MyxmlSpider(scrapy.Spider):
    name = "PortfolioXML"
    # One URL per line in Batch.txt
    with open("Batch.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        # Extract every _DetailHoldingTypeId on the page in one go;
        # no per-<Holding> loop is needed for this check
        DetailType = response.xpath('//Holding/HoldingDetail/@_DetailHoldingTypeId').extract()
        # Yield the URL keyed by the detail type it contains
        if 'CL' in DetailType:
            yield {'CL': response.url}
        if 'SJ' in DetailType:
            yield {'SJ': response.url}
In the shell, run the command below. quotes_spider.py is the spider above; quotes.json is the output file.
scrapy runspider quotes_spider.py -o quotes.json -s LOG_FILE=sPiderlOg.log
On the same file, scrapy took just under two hours.
That makes sense: scrapy defaults to about 10 threads, while the 火车 collector seems to run 3.
Then again, if the collector's waiting time had been left unset, in theory it too could have saved two hours or more.
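For what it's worth, both knobs can be set explicitly on scrapy's side rather than relying on defaults. This is the same runspider call, with concurrency and a 100 ms per-request delay passed via -s; CONCURRENT_REQUESTS and DOWNLOAD_DELAY are built-in scrapy settings, and the values here are illustrative, not the ones I actually ran with:

```shell
scrapy runspider quotes_spider.py -o quotes.json \
    -s CONCURRENT_REQUESTS=10 \
    -s DOWNLOAD_DELAY=0.1 \
    -s LOG_FILE=sPiderlOg.log
```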
Another low-budget version, this time with a more common crawling pattern:
import scrapy

class MyxmlSpider(scrapy.Spider):
    name = "PortfolioXML"
    with open("Batch.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        myItems = ['B', 'BD', 'BG', 'BH', 'BQ', 'BR', 'BT', 'BU', 'BY', 'BZ',
                   'DA', 'IP', 'NB', 'NC', 'ND', 'NE', 'SD', 'SI', 'SJ', 'SK',
                   'SR', 'TF', 'TP', '0', '1', '12']
        for investments in response.xpath('//Holding'):
            for holdingType in investments.xpath('.//HoldingDetail'):
                testingdata = holdingType.xpath('./@_DetailHoldingTypeId').extract()
                for myItem in myItems:
                    if myItem in testingdata:
                        itemType = holdingType.xpath('.//ISIN/text()').extract()
                        # Join the list into one string; printing the list directly
                        # would show the u'...' repr instead of plain text
                        strHolding = ''.join(str(e) for e in itemType)
                        print response.url + ":" + strHolding
                        break
The printing part can also be written like this:
if myItem in testingdata:
    itemTypes = holdingType.xpath('.//ISIN/text()').extract()
    group = [itemType + ";" + response.url for itemType in itemTypes]
    print group
    break
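The match-then-break logic above is easy to get subtly wrong, so here it is factored into a pure function that can be checked without a live crawl. The names (`extract_holding` and its arguments) are mine, not scrapy's; inside the spider you would feed it the `extract()` results and `response.url`:

```python
def extract_holding(detail_type_ids, isins, url, wanted):
    """Mimic the inner loop of parse(): for the first wanted type that
    appears among a holding's _DetailHoldingTypeId values, pair every
    ISIN with the page URL, then stop (the `break` above)."""
    for my_item in wanted:
        if my_item in detail_type_ids:
            return [isin + ";" + url for isin in isins]
    return []  # no wanted type matched this holding

# Hypothetical values standing in for the scrapy extract() results
print(extract_holding(['BQ'], ['XS1257957222'],
                      'http://example.com/p.xml', ['B', 'BQ', 'CL']))
# -> ['XS1257957222;http://example.com/p.xml']
```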