CTS Flow Usage Tips

#Run CTS Flow
#Try the Sample flow ReadDocumentBasic.flow
#Develop a CTS flow using ESP crawler
Callbacks need to be enabled for both the nullWriter and the ESPWriter.
 
#Configure a pipeline to work with the CTS flow
Get the CTS stages ctsAnnotationsImporter and ctsParser from the feedingOverlay package.
Don't use ctsAnnotationsImporter or the related scope-processing stages (scopifier and xmlifier), especially for CJK content.
They do not seem to work at the moment; with them enabled you get the error "FIXML has illegal UTF-8 byte sequences".

Create a pipeline named CrawlerCts using the sitesearch pipeline as a template
1.Create a stage instance named CtsParserCrawler based on CtsParser, with this parameter:
GenerateScopesFromAnnotations 0

2.Put CtsParserCrawler after Docinit
3.Remove the following stages:
DocumentRetriever
URLProcessor
Decompressor
FormatDetector
SimpleConverter
FlashConverter
PDFConverter
XPSConverter
SearchExportConverter
FastHTMLParser
 
#For CJK content, don't remove
LanguageAndEncodingDetector
EncodingNormalizer
 
#If you need WebAnalyser, don't remove
WAAttributeLookup
WALinkRankAnchorTextFormatter
WACrawlerLinkFilter
WARankDocument
4.CtsAnnotationsImporter is not necessary if you don't need scope searching
 
Tips: define your collection name in the Mapper operator
 
#Using ESP Crawler with CTS
c:/esp/etc/CrawlerGlobalDefaults.xml
...
<section name="cde"> 
  <attrib name="contentdistributors" type="list-string">
    <member> localhost:17078 </member>
  </attrib>
</section>
...
nctrl stop crawler
nctrl start crawler
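
Before restarting the crawler, it can help to double-check what the cde section actually contains. The following is a minimal Python sketch, not part of ESP or FSIS, that reads the host:port entries from the fragment shown above. The function name content_distributors and the SAMPLE string are illustrative assumptions; the rest of CrawlerGlobalDefaults.xml is omitted here.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the cde section above; the rest of the file is omitted.
SAMPLE = """
<section name="cde">
  <attrib name="contentdistributors" type="list-string">
    <member> localhost:17078 </member>
  </attrib>
</section>
"""

def content_distributors(xml_text):
    """Return (host, port) pairs listed under cde/contentdistributors."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() also visits the root element, so a bare <section> fragment works.
    for section in root.iter("section"):
        if section.get("name") != "cde":
            continue
        for attrib in section.findall("attrib"):
            if attrib.get("name") != "contentdistributors":
                continue
            for member in attrib.findall("member"):
                # Values are padded with spaces in the file, so strip them.
                host, _, port = (member.text or "").strip().rpartition(":")
                pairs.append((host, int(port)))
    return pairs

print(content_distributors(SAMPLE))
```

To check the real file, read its text and pass it to content_distributors instead of SAMPLE.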
Configure the crawler's feeding destination parameters in the Admin GUI
(The procedure described in the FSIS documentation will not work: if no feeding destination is defined, the
exported config file will be empty for this parameter group.)
name:cde
Target Collection:cntv1;fsistraining.crawlingvideo
Destination:cde
Pause ESP feeding:no
Primary:yes
 

crawleradmin.exe -G cntv1 > crawler_cntv1.xml
notepad ./crawler_cntv1.xml
#Confirm the feeding destination parameters
<section name="feeding">
            <section name="cde">
                <attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
                <attrib name="destination" type="string"> cde </attrib>
                <attrib name="paused" type="boolean"> no </attrib>
                <attrib name="primary" type="boolean"> yes </attrib>
            </section>
</section>
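
The same confirmation can be scripted instead of done by eye in notepad. This is a rough Python sketch under the assumption that the exported file contains a section named feeding shaped like the fragment above; the helper name feeding_destinations is hypothetical, and the elements surrounding this fragment in the real export are not shown here, so the function simply searches at any depth.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the export shown above.
SAMPLE = """
<section name="feeding">
  <section name="cde">
    <attrib name="collection" type="string"> cntv1;fsistraining.crawlingvideo </attrib>
    <attrib name="destination" type="string"> cde </attrib>
    <attrib name="paused" type="boolean"> no </attrib>
    <attrib name="primary" type="boolean"> yes </attrib>
  </section>
</section>
"""

def feeding_destinations(xml_text):
    """Return {destination name: {attrib name: value}} for the feeding section."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in root.iter("section"):
        if section.get("name") != "feeding":
            continue
        # Each child section is one feeding destination.
        for dest in section.findall("section"):
            result[dest.get("name")] = {
                a.get("name"): (a.text or "").strip()
                for a in dest.findall("attrib")
            }
    return result

dests = feeding_destinations(SAMPLE)
print(dests["cde"]["collection"])
```

An empty result would match the problem described earlier, where the exported parameter group is empty because no feeding destination was defined.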
#Change start_uris and include_uris to define what you are going to crawl
<attrib name="start_uris" type="list-string">
    <member> http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
</attrib>
<section name="include_uris">
            <attrib name="exact" type="list-string">
                <member> http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml </member>
            </attrib>
</section>
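
A start URI that no include rule matches is an easy mistake to make when editing this file by hand. The sketch below, in Python, lists start URIs missing from the exact include list. It only looks at exact rules, the only kind shown in this example, and the wrapping config element is a hypothetical root added so the two fragments parse as one document.

```python
import xml.etree.ElementTree as ET

URI = "http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml"

# The <config> wrapper is an assumption; the inner elements follow the
# fragments shown above.
SAMPLE = f"""
<config>
  <attrib name="start_uris" type="list-string">
    <member> {URI} </member>
  </attrib>
  <section name="include_uris">
    <attrib name="exact" type="list-string">
      <member> {URI} </member>
    </attrib>
  </section>
</config>
"""

def members(parent, attrib_name):
    """Return the stripped <member> values of the first matching attrib."""
    for attrib in parent.iter("attrib"):
        if attrib.get("name") == attrib_name:
            return [(m.text or "").strip() for m in attrib.findall("member")]
    return []

def uncovered_start_uris(xml_text):
    """Return start URIs that are not listed under include_uris/exact."""
    root = ET.fromstring(xml_text)
    start = members(root, "start_uris")
    exact = []
    for section in root.iter("section"):
        if section.get("name") == "include_uris":
            exact = members(section, "exact")
    return [uri for uri in start if uri not in exact]

print(uncovered_start_uris(SAMPLE))
```

An empty list means every start URI is covered by an exact include rule.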
Run the flow from VS or FSIS Admin GUI

Remove the crawler datasource definition from the collection cntv1
crawleradmin -f ./crawler_cntv1.xml
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Added collection config(s): Scheduled collection for crawling
#Watching the crawler from a command window
crawleradmin --status
Enterprise Crawler 6.7.8 - Admin Client
Copyright (C) 2008 FAST, A Microsoft(R) Subsidiary
Collection       Status     Feed Status  Active Sites  Stored Docs     Doc Rate
-------------------------------------------------------------------------------
cntv10           Idle       Feeding      0             1               N/A
cntv11           Idle       Feeding      0             1               N/A
cntv12           Idle       Feeding      0             1               N/A
cntv13           Idle       Feeding      0             1               N/A
cntv14           Idle       Feeding      0             1               N/A
cntv8            Idle       Feeding      0             2               N/A
cntv9            Idle       Feeding      0             5               N/A
                                         0             12              0.0 dps
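
If the status table has to be checked repeatedly, it can be turned into structured rows. This is a small Python sketch that parses text laid out like the output above. It assumes every collection row has single-word values in the Status and Feed Status columns, as in this sample; other crawler versions or states may print something different. The name parse_status is illustrative.

```python
# Table copied from the crawleradmin --status output above.
STATUS_OUTPUT = """\
Collection       Status     Feed Status  Active Sites  Stored Docs     Doc Rate
-------------------------------------------------------------------------------
cntv10           Idle       Feeding      0             1               N/A
cntv11           Idle       Feeding      0             1               N/A
cntv12           Idle       Feeding      0             1               N/A
cntv13           Idle       Feeding      0             1               N/A
cntv14           Idle       Feeding      0             1               N/A
cntv8            Idle       Feeding      0             2               N/A
cntv9            Idle       Feeding      0             5               N/A
                                         0             12              0.0 dps
"""

def parse_status(text):
    """Return one dict per collection row found below the dashed separator."""
    rows = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("-----"):
            in_table = True
            continue
        if not in_table:
            continue
        parts = line.split()
        # Collection rows have at least six fields; the totals row has fewer.
        if len(parts) < 6:
            continue
        rows.append({
            "collection": parts[0],
            "status": parts[1],
            "feed_status": parts[2],
            "active_sites": int(parts[3]),
            "stored_docs": int(parts[4]),
            "doc_rate": " ".join(parts[5:]),
        })
    return rows

rows = parse_status(STATUS_OUTPUT)
print(sum(row["stored_docs"] for row in rows))
```

The summed stored-document count can be compared with the totals row at the bottom of the table as a quick sanity check.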
          
          
          
#Watching doclog
doclog -l
doclog -a http://xxx/xxxx/

#Watching CTS Flow log from
C:/Users/FSIS Service/AppData/Local/FSIS/Nodes/Fsis/ContentEngineNode1/Logs/ContentProcessing
#Watching Crawler log from
C:/esp/var/log/crawler
#Adding Spy Stage into ESP pipeline to monitor