For an experiment I needed to crawl holdings of specific asset types. After weighing the options I decided to give scrapy a try; it turned out to be quite satisfying.
Below is a low-budget version. With the 火车 collector (LocoySpider), that file took five hours to run: a hundred thousand URLs, some returning 200 and some 404, with a 100 ms delay between requests.
Here's a small XML snippet to give an idea of the data:
<DataBreakdown _SalePosition="S"/>
<Holding>
<HoldingDetail _DetailHoldingTypeId="BQ" _ExternalId="XS1257957222" ExternalName="CGMSE 2015-2X A1 1.35 29" _StorageId="90">
<Country _Id="IRL">Ireland</Country>
<CUSIP>G19032AK0</CUSIP>
<SEDOL>BYXXF19</SEDOL>
<ISIN>XS1257957222</ISIN>
<Currency _Id="EUR">Euro</Currency>
<SecurityName>Carlyle C 15-2 Dac 1.35%</SecurityName>
<Weighting>12.9579</Weighting>
<NumberOfShare>39620000</NumberOfShare>
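Before the spider code, a quick sanity check of the two fields we care about: `_DetailHoldingTypeId` is an attribute on `HoldingDetail`, while `ISIN` is child-element text. A minimal sketch with the standard-library ElementTree (closing tags added here so the fragment parses; the real files are complete documents):

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the snippet above
sample = """
<Holding>
  <HoldingDetail _DetailHoldingTypeId="BQ" _ExternalId="XS1257957222">
    <ISIN>XS1257957222</ISIN>
    <Weighting>12.9579</Weighting>
  </HoldingDetail>
</Holding>
"""

root = ET.fromstring(sample)
for detail in root.findall('.//HoldingDetail'):
    type_id = detail.get('_DetailHoldingTypeId')  # attribute access -> 'BQ'
    isin = detail.findtext('ISIN')                # child element text
    print(type_id + " " + isin)
```

The scrapy XPath expressions below (`@_DetailHoldingTypeId`, `ISIN/text()`) target exactly these two spots.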
# -*- coding: utf-8 -*-
import scrapy
import sys

# Python 2 hack so Chinese characters can be exported without encoding errors
reload(sys)
sys.setdefaultencoding('utf8')

class MyxmlSpider(scrapy.Spider):
    name = "PortfolioXML"
    # One URL per line in Batch.txt
    with open("Batch.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        # Extract every _DetailHoldingTypeId on the page in one go;
        # no per-<Holding> loop is needed for this check
        DetailType = response.xpath('//Holding/HoldingDetail/@_DetailHoldingTypeId').extract()
        # Yield the URL keyed by the detail type it contains
        if 'CL' in DetailType:
            yield {'CL': response.url}
        if 'SJ' in DetailType:
            yield {'SJ': response.url}
In the shell, run the command below. quotes_spider.py is the spider above; quotes.json is the output file.
scrapy runspider quotes_spider.py -o quotes.json -s LOG_FILE=sPiderlOg.log
On the same file, scrapy took just under two hours.
That makes sense: scrapy defaults to about 10 threads, while the 火车 collector seems to run 3.
Then again, if the collector's waiting time had been left unset, in theory it too could have saved two hours or more.
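For what it's worth, both knobs can be set explicitly on scrapy's side rather than relying on defaults. This is the same runspider call, with concurrency and a 100 ms per-request delay passed via -s; CONCURRENT_REQUESTS and DOWNLOAD_DELAY are built-in scrapy settings, and the values here are illustrative, not the ones I actually ran with:

```shell
scrapy runspider quotes_spider.py -o quotes.json \
    -s CONCURRENT_REQUESTS=10 \
    -s DOWNLOAD_DELAY=0.1 \
    -s LOG_FILE=sPiderlOg.log
```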
Another low-budget version, this time with a more common crawling pattern:
import scrapy

class MyxmlSpider(scrapy.Spider):
    name = "PortfolioXML"
    with open("Batch.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        myItems = ['B', 'BD', 'BG', 'BH', 'BQ', 'BR', 'BT', 'BU', 'BY', 'BZ',
                   'DA', 'IP', 'NB', 'NC', 'ND', 'NE', 'SD', 'SI', 'SJ', 'SK',
                   'SR', 'TF', 'TP', '0', '1', '12']
        for investments in response.xpath('//Holding'):
            for holdingType in investments.xpath('.//HoldingDetail'):
                testingdata = holdingType.xpath('./@_DetailHoldingTypeId').extract()
                for myItem in myItems:
                    if myItem in testingdata:
                        itemType = holdingType.xpath('.//ISIN/text()').extract()
                        # Join the list into one string; printing the list directly
                        # would show the u'...' repr instead of plain text
                        strHolding = ''.join(str(e) for e in itemType)
                        print response.url + ":" + strHolding
                        break
The printing part can also be written like this:
if myItem in testingdata:
    itemTypes = holdingType.xpath('.//ISIN/text()').extract()
    group = [itemType + ";" + response.url for itemType in itemTypes]
    print group
    break
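The match-then-break logic above is easy to get subtly wrong, so here it is factored into a pure function that can be checked without a live crawl. The names (`extract_holding` and its arguments) are mine, not scrapy's; inside the spider you would feed it the `extract()` results and `response.url`:

```python
def extract_holding(detail_type_ids, isins, url, wanted):
    """Mimic the inner loop of parse(): for the first wanted type that
    appears among a holding's _DetailHoldingTypeId values, pair every
    ISIN with the page URL, then stop (the `break` above)."""
    for my_item in wanted:
        if my_item in detail_type_ids:
            return [isin + ";" + url for isin in isins]
    return []  # no wanted type matched this holding

# Hypothetical values standing in for the scrapy extract() results
print(extract_holding(['BQ'], ['XS1257957222'],
                      'http://example.com/p.xml', ['B', 'BQ', 'CL']))
# -> ['XS1257957222;http://example.com/p.xml']
```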