A Ruby program that simulates page interaction, parses the XML output, and generates a CSV file

python爱好部落

Category: Ruby. Tags: csv xml ruby query accessor attributes

Article link: https://blog.csdn.net/passionboyxie/article/details/6734156

This program uses Ruby's mixin feature together with the WebUnit, REXML (SAX-style stream parsing), and CSV libraries. The code is as follows:

require 'webunit/webunit'
require 'csv'
require 'rexml/document'
require 'rexml/streamlistener'

include REXML

Response = WebUnit::TestCase::Response

# ################################
# Settings you need to configure
# ################################

# Path of the generated CSV file
$csv_file_path = 'c:\test\ruby\data.csv'
# Data to query
#$query_data = 'GQNGSGKSSLLD'
$query_data = 'tttggagtcctgtgtgcttctctctcttatggctgtgtccc'
# URL of the first page to visit
$url = 'http://www.test.com/test.cgi'

# #############################################################
# Patch WebUnit: when it meets a malformed tag, the original
# implementation raises an exception; here the tag is simply ignored.
# The endtag and feed methods are overridden.
# Installing WebUnit:
# Download URL:
# After downloading, unpack it and run in order:
# cd <unpacked directory>
# ruby install.rb config
# ruby install.rb setup
# ruby install.rb install
# #############################################################
module WebUnit
  class Parser < SGMLParser
    # Keep the original method reachable under a new name, then wrap it
    alias_method :origin_endtag, :endtag
    def endtag(tag)
      origin_endtag(tag)
    rescue
      nil
    end

    alias_method :origin_feed, :feed
    def feed(response)
      origin_feed(response)
    rescue
      nil
    end
  end
end

# #############################################################
# Use the WebUnit library to simulate what a user does in the browser:
# 1. Open $url, type the query data (_query) into the Search field,
#    and click the submit button to go to the next page.
# 2. Set the FORMAT_TYPE drop-down to "XML" and click the Format button
#    to go to the next page.
# 3. If an HTML page comes back and its source contains the string
#    Status=WAITING, resubmit the form after a pause, until the
#    wanted XML content is returned.
# 4. Return the XML as a string.
# #############################################################
def retrieve_content(_query)
  result = nil
  retry_interval = 3 # seconds between retries
  Response.reset
  # Open the first page
  response = Response::get($url)
  form = response.form
  # Fill in the Search field
  form.params['QUERY'].value = _query
  # Click the button to go to the next page
  response = form.submit
  # This page shows the request ID; set the FORMAT_TYPE drop-down to XML
  form = response.form
  form.params['FORMAT_TYPE'].value = 'XML'
  # Submit the form; if the content contains Status=WAITING, wait and
  # try again until the real XML content is returned
  content_ready = false
  while !content_ready do
    response = form.submit
    content_ready = response.body.index('Status=WAITING').nil?
    if !content_ready
      sleep retry_interval
    else
      result = response.body
    end
  end

  #puts result
  puts '### retrieve content ok ###'
  return result
end

# #############################################################
# Parse the XML SAX-style (for speed: REXML's DOM + XPath approach
# was far too slow).
# parse_xml_content takes the XML string and returns the parsed result
# in the form [[Hit_id_value, Hit_def_value, ...], [...], ...]
# #############################################################
Hit_attrs = ['Hit_id','Hit_def','Hit_len','Hsp_align-len',
  'Hsp_identity','Hsp_positive','Hsp_gaps','Hsp_query-from','Hsp_query-to',
  'Hsp_hit-from','Hsp_hit-to','Hsp_qseq','Hsp_hseq','Hsp_midline']

class ContentHandler
  include StreamListener
  attr_accessor :infos

  def initialize
    @infos = []
    @current_hit = nil
    @current_value = nil
    @hsp_num = 0
  end

  def tag_start(name, attributes)
    if name == 'Hit'
      @current_hit = {}
      @hsp_num = 0
    end

    @hsp_num += 1 if name == 'Hsp'
  end

  def tag_end(name)
    if name == 'Hit'
      # Emit one row per Hit, in the column order given by Hit_attrs
      info = []
      Hit_attrs.each { |key| info << @current_hit[key] }
      @infos << info
      @current_hit = nil
    end

    # Only the first Hsp of each Hit is recorded
    return if @hsp_num > 1
    @current_hit[name] = @current_value if @current_hit && Hit_attrs.include?(name)
  end

  def text(text)
    @current_value = text
  end
end

def parse_xml_content(content)
  handler = ContentHandler.new
  Document.parse_stream(content, handler)
  # puts handler.infos
  puts '### parse xml ok ###'
  return handler.infos
end

# #############################################################
# Generate the CSV file
# How the snippet works:
# writer << ['a','b'] writes one CSV row: a,b
# #############################################################

def generate_csv_file(csvFilePath, infos)
  CSV.open(csvFilePath, 'w') { |writer|
    infos.each { |info| writer << info }
  }
  puts '### generate CSV file ok ###'
end

# #############################################################
# Define the main method
# #############################################################

def __main__
  xml_content = retrieve_content($query_data)
  infos = parse_xml_content(xml_content)
  generate_csv_file($csv_file_path, infos)
  puts '### all completed ###'
end

# Run the main method
__main__()
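
The parsing and CSV steps can be tried on their own, without WebUnit or a network connection. The following is only a minimal sketch: the sample XML, the shortened field list `SAMPLE_FIELDS`, and the class name `SampleHitListener` are made up for illustration and are not part of the program above. It assumes the input has the same general Hit/Hsp nesting the script expects, and it writes the CSV to a string with `CSV.generate` instead of to a file.

```ruby
require 'csv'
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical sample input, loosely shaped like the XML the script expects
SAMPLE_XML = <<~XML
  <Hits>
    <Hit>
      <Hit_id>id1</Hit_id>
      <Hit_len>120</Hit_len>
      <Hsp><Hsp_identity>40</Hsp_identity></Hsp>
      <Hsp><Hsp_identity>35</Hsp_identity></Hsp>
    </Hit>
    <Hit>
      <Hit_id>id2</Hit_id>
      <Hit_len>80</Hit_len>
      <Hsp><Hsp_identity>22</Hsp_identity></Hsp>
    </Hit>
  </Hits>
XML

# Shortened, illustrative column list (the real script uses Hit_attrs)
SAMPLE_FIELDS = %w[Hit_id Hit_len Hsp_identity].freeze

class SampleHitListener
  include REXML::StreamListener
  attr_reader :rows

  def initialize
    @rows = []
    @hit = nil
    @hsp_count = 0
    @text = nil
  end

  def tag_start(name, _attrs)
    if name == 'Hit'
      @hit = {}
      @hsp_count = 0
    elsif name == 'Hsp'
      @hsp_count += 1
    end
  end

  # Remember the most recent text node; a leaf element's text
  # arrives right before its tag_end callback
  def text(value)
    @text = value
  end

  def tag_end(name)
    if name == 'Hit'
      @rows << SAMPLE_FIELDS.map { |key| @hit[key] }
      @hit = nil
    elsif @hit && @hsp_count <= 1 && SAMPLE_FIELDS.include?(name)
      # Fields from the second and later Hsp of a Hit are skipped
      @hit[name] = @text
    end
  end
end

listener = SampleHitListener.new
REXML::Document.parse_stream(SAMPLE_XML, listener)
csv_text = CSV.generate { |csv| listener.rows.each { |row| csv << row } }
puts csv_text
```

With this sample input the sketch prints two rows, `id1,120,40` and `id2,80,22`; the second Hsp of the first Hit is ignored, which mirrors the `@hsp_num` check in `ContentHandler`.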
