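Now let's talk about scraping the content of a web page. When I first started with Ruby on Rails I did not really understand how concise it could be, and I had not studied its plugins in any depth, so I ended up writing the code below to extract information from HTML. With open-uri and hpricot the same job is actually very simple. I am posting both versions here to share with everyone.
This is entirely a beginner's approach, and looking at it now it seems rather laughable.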
url = <url>
url.gsub!(/[\s|)!]/, '')  # strip whitespace and stray "|", ")" and "!" characters
if url =~ /http:\/\/\w+\.\S+[A-Za-z0-9]\.\S+[A-Za-z]/
  after_url = url.gsub(/^http:\/\/|[\s|)!]/, '')
else
  after_url = url
end
#p after_url
address = after_url.split('/')
wget_url = url.gsub(/\$/, '\$')  # escape "$" for the shell
#p wget_url
# use wget --spider to check that the page is reachable before fetching it
if system('wget --timeout=10 --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)" --spider ' + wget_url)
  Net::HTTP.start(address[0]) do |http|
    if address.length <= 1
      path = '/'
    else
      path = after_url.sub(address[0], '')
    end
    resp = http.get(path)
    html_get = resp.body
    if resp.code == '200' || resp.message == 'OK' || resp['content-length'].to_i > 0
      @final_array = Hash.new
      url = Array.new
      second_time = Array.new
      third_time = Array.new
      # first pass: grab every <img ...> tag from the page
      first_time = html_get.scan(/<img [^~]*?>/)
      first_time.each do |t|
        # second pass: pull an absolute image URL out of the tag
        second_time = t.scan(/(http:.*)\.(gif|jpg|jpeg|bmp|png)/)
        # second_time = t.scan(/<img[^>]+?src=(\"|\')([^\'\"]+)\1/)
        if second_time[0] != nil
          # localimage = second_time[0][1].split('/')
          # if localimage[0] == '.' || localimage[0] == '..'
          #   third_time = second_time[0][1].gsub(localimage[0], 'http://' + address[0])
          # else
          #   third_time = second_time[0][1]
          # end
          third_time = second_time[0].join('.')
        end
        if third_time[0] != nil
          base_address = third_time.split('/')
          # request the image itself and keep the URL only if it answers with 200
          Net::HTTP.start(base_address[2]) do |img|
            forth_time = 'http://' + base_address[2]
            exec_address = third_time.gsub(forth_time, '')
            imgresp = img.get(exec_address)
            if imgresp.code == '200'
              url << third_time
            end
          end
        end
      end
      @final_array[:url] = url.uniq
      return @final_array
    else
      return false
    end
  end
else
  return false
end
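For comparison, the manual host and path splitting at the top of that code can be left to Ruby's standard `uri` library. This is only a minimal sketch: `host_and_path` is a hypothetical helper name of my own, and it assumes that a URL without a scheme should be treated as plain `http`.

```ruby
require 'uri'

# Hypothetical helper (not part of the original code): split a page URL
# into the host and the request path that Net::HTTP needs.
def host_and_path(url)
  # Same cleanup idea as the original: drop whitespace, ")" and "!".
  cleaned = url.gsub(/[\s\)!]/, '')
  # Assumption: a URL with no scheme is meant to be plain http.
  cleaned = "http://#{cleaned}" unless cleaned =~ %r{\Ahttps?://}i
  uri = URI.parse(cleaned)
  # request_uri is "/" when the URL has no path, and keeps any query string.
  [uri.host, uri.request_uri]
end
```

Calling `host_and_path('http://www.example.com')` gives the host `www.example.com` and the path `/`, which is what the `address.length <= 1` branch was working out by hand.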
Written with hpricot, the same thing can be done almost instantly, which also shows how powerful hpricot is.
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('<url>'))  # on Ruby 3.0 and later, open-uri needs URI.open here instead of open
p doc
img = doc.search('img')  # every <img> element on the page
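The old version gave up on relative image paths (that part is commented out). Here is a minimal sketch of how they could be resolved with the standard `uri` library, assuming the `src` values have already been read from the `img` elements into a plain array of strings. `absolute_image_urls` is a hypothetical helper name, and the extension filter simply mirrors the list used in the old regex.

```ruby
require 'uri'

# Hypothetical helper: turn raw src attribute values into absolute image URLs.
# page_url is the address of the page the <img> tags came from.
def absolute_image_urls(page_url, srcs)
  resolved = srcs.compact.map do |src|
    begin
      # URI.join resolves "../a.png" or "/a.png" against the page URL
      # and leaves already absolute URLs unchanged.
      URI.join(page_url, src.strip).to_s
    rescue URI::Error
      nil # skip values that are not valid URIs
    end
  end
  # Keep only the image extensions the old regex looked for, without duplicates.
  resolved.compact.select { |u| u =~ /\.(gif|jpe?g|bmp|png)\z/i }.uniq
end
```

This is deliberately strict: a URL such as `a.png?size=2` is dropped by the extension check, so loosen the pattern if your pages use query strings on images.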
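After that, use an HTTP GET to check whether each of these links is actually reachable, and collect the working ones into an array.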
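A rough sketch of that check, using only `net/http` from the standard library. The names `reachable?` and `usable_images` are hypothetical, and I am assuming a HEAD request is enough to tell whether an image exists; some servers answer HEAD differently from GET, so swap in `http.get` if that matters for you.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true when the server answers a HEAD request with a 2xx status.
def reachable?(url, timeout: 10)
  uri = URI.parse(url)
  # Only http and https URLs with a host can be checked this way.
  return false unless uri.is_a?(URI::HTTP) && uri.host
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.head(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  # Invalid URLs, DNS failures, timeouts and refused connections all count as unreachable.
  false
end

# Hypothetical helper: pack the reachable, de-duplicated URLs into the same
# { :url => [...] } hash shape the old code returned.
def usable_images(urls, checker: method(:reachable?))
  { url: urls.uniq.select { |u| checker.call(u) } }
end
```

The `checker` parameter is only there so the filtering can be tried without touching the network, for example `usable_images(list, checker: ->(u) { true })`.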
Hpricot can be used to parse XML in the same way, and it is also much faster than REXML.