Scraping web resources with Nokogiri

Quote:
Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.
XML is like violence - if it doesn’t solve your problems, you are not using enough of it.
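  Combining Nokogiri's parsing with open-uri's network access is enough to scrape resources from the web. The code below scrapes the WordPress blog 清杯浅酌 (http://xuzhuoer.com/). My Ruby still reads rather like Java, and I can only improve my sense of style by writing and reading more code, so experts may skip ahead.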

require 'nokogiri'
require 'open-uri'

desc "Fetch articles from http://xuzhuoer.com/"
task :fetch => :environment do
  # URI.open is used because Kernel#open no longer fetches URLs as of Ruby 3.0
  archive = Nokogiri::HTML(URI.open("http://xuzhuoer.com/archives/"))
  archive.css('.post li a').each do |link|
    doc = Nokogiri::HTML(URI.open(link['href']))
    # get the article's content, title and tag list
    content = doc.css('.post > .content').inner_html
    title = doc.css('h1').text
    tags = doc.css('.post_info a').map(&:text).join(' ')
    # create the post and save it
    post = Post.create!(:body => content, :tag_list => tags, :title => title)
    # get the article's comments
    doc.css('#comments > .comment').each do |comment|
      author = comment.css('.author').text
      # the author link is optional; author_url stays nil without it
      author_link = comment.at_css('a[class="author"]')
      author_url = author_link && author_link['href']
      body = comment.css('.content p').text
      # characters 31...63 of the avatar URL hold md5(email), used for the gravatar
      md5 = comment.at_css('img')['src'][31...63]
      # create and save the comment
      Comment.create!(:author => author, :author_url => author_url,
          :body => body, :avatar_md5 => md5,
          :commentable_type => "Post", :commentable_id => post.id)
      sleep(5)
    end
    # pause for a random 0-4 seconds before requesting the next article
    sleep(rand(5))
  end
end
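
The fixed slice `[31...63]` in the task above presumably works because the avatar URL starts with a 31-character prefix such as `http://www.gravatar.com/avatar/`. If the prefix ever differs (https, another host), the slice silently returns the wrong characters. A slightly more tolerant sketch follows. The helper name `gravatar_hash` is hypothetical and not part of the original post; it uses only the standard library and looks for a 32-hex-digit path segment instead of a fixed offset.

```ruby
require 'uri'

# Hypothetical helper (not in the original post): returns the 32-character
# hex digest found in the path of an avatar URL, or nil if there is none.
def gravatar_hash(src)
  path = URI.parse(src.to_s).path.to_s
  # a path segment of exactly 32 hex digits, e.g. /avatar/<md5>
  path[%r{/(\h{32})(?!\h)}, 1]
rescue URI::InvalidURIError
  nil
end
```

Inside the comment loop it could replace the slice, for example `md5 = gravatar_hash(comment.at_css('img')['src'])`, assuming every comment contains an `img` element.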
 

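If the resources you want to scrape are only visible after logging in, this approach is powerless. Adding Mechanize can change that: Mechanize can simulate form submission and automatically sets the cookies on subsequent form operations.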
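
A minimal sketch of that idea, under stated assumptions: the URLs and credentials are placeholders, the field names `log` and `pwd` are guesses at what a WordPress-style login form uses, and the helper names `build_agent` and `fetch_behind_login` are hypothetical. Inspect the real login form before relying on any of it.

```ruby
# Sketch only: the URLs and the field names 'log' / 'pwd' are assumptions;
# inspect the real login form and adjust them.

# Builds a Mechanize agent. The gem is required here so that
# fetch_behind_login can also be tried with a stand-in agent.
def build_agent
  require 'mechanize'
  Mechanize.new
end

# Logs in through the first form on login_url, then fetches target_url
# with the same agent, which resends the session cookies it received.
def fetch_behind_login(agent, login_url, target_url, username, password)
  login_page = agent.get(login_url)
  form = login_page.forms.first
  raise "no form found at #{login_url}" if form.nil?
  form['log'] = username
  form['pwd'] = password
  agent.submit(form)
  agent.get(target_url)
end

# Usage (hypothetical URLs and credentials):
#   page = fetch_behind_login(build_agent,
#                             'http://example.com/wp-login.php',
#                             'http://example.com/private-post/',
#                             'me', 'secret')
#   puts page.search('.post > .content').inner_html
```

The page returned by Mechanize wraps a Nokogiri document, so the same CSS selectors used in the rake task can be applied to it through `page.search`.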
