Scraping with Typhoeus and Nokogiri

June 12th, 2009 · 1 Comment

I’ve been working on some cool new functionality at OneSpot. We want to provide a widget that gives the reader more context about a given article. Zemanta takes the article text and hands us back a set of semantic entities, including links to their Wikipedia pages. We also wanted a nice blurb about each entity, and figured that the opening paragraph of its Wikipedia page would be a reasonable choice.

To do this, we use Typhoeus to fetch the Wikipedia pages in parallel and Nokogiri to pull the relevant content using a custom XPath expression for Wikipedia’s page layout.

Some notes:

  • We configure Typhoeus to use Rails’s cache store as its own. We cache each Wikipedia response for 7 days in order to be good Netizens and not overburden their servers.
  • Wikipedia’s internal links are relative (they omit the hostname), so we rewrite them as absolute URLs so that they still work when embedded in another page.
  • We tried Curl::Multi, but it was giving us occasional bus errors.
  • My WordPress syntax highlighter is obviously subpar when it comes to regular expressions.
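
    require 'typhoeus'
    require 'nokogiri'

    class Wikipedia
      include Typhoeus
      # Uncomment inside a Rails app to share Rails's cache store with Typhoeus.
      #self.cache = Rails.cache.instance_variable_get(:@data)

      # Cache each response for 7 days, expressed in seconds.
      remote_defaults :cache_responses => 7*24*60*60,
          :user_agent => 'typhoeus crawler',
          :timeout => 5

      # On success, hand the response body to the paragraph extractor.
      define_remote_method :extract,
          :on_success => lambda {|response| Wikipedia.extract_first_paragraph(response.body) }

      def self.extract_first_paragraph(content)
        nh = Nokogiri::HTML(content)
        # First <p> inside Wikipedia's main content div.
        str = nh.xpath("//div[@id='bodyContent']/p[1]").inner_html
        # Turn relative /wiki links into absolute ones.
        str.gsub(/href="\/wiki/, 'href="http://en.wikipedia.org/wiki')
      end
    end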
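
To see the link rewriting from extract_first_paragraph in isolation, here is a minimal sketch that needs only the Ruby standard library. The sample HTML fragment and the absolutize_wiki_links helper name are made up for illustration; only the regular expression and the replacement string come from the class above.

```ruby
# Hypothetical helper: the same gsub used in Wikipedia.extract_first_paragraph.
def absolutize_wiki_links(html)
  html.gsub(/href="\/wiki/, 'href="http://en.wikipedia.org/wiki')
end

# Made-up fragment resembling the start of a Wikipedia article.
sample = '<p>A <a href="/wiki/Bus_error">bus error</a> is a fault; see ' \
         '<a href="http://example.com/">this external page</a>.</p>'

puts absolutize_wiki_links(sample)
```

Only href values that start with /wiki are touched, so links that are already absolute pass through unchanged.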

And here’s how you use it.

    entities = %w(
      http://en.wikipedia.org/wiki/Garth_Marenghi's_Darkplace
      http://en.wikipedia.org/wiki/Bus_error
      http://en.wikipedia.org/wiki/Washington
    )
    content = entities.map do |url|
      Wikipedia.extract(:base_uri => url)
    end
    p content
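
The post leans on Typhoeus to run these fetches in parallel. For readers who want to see the fan-out shape without that gem, here is a rough sketch that uses plain Ruby threads instead. This is a swapped-in technique, not how Typhoeus works, and fetch_all and the stubbed fetcher lambda are hypothetical names invented for this example. A real fetcher would make an HTTP request and pass the body to something like extract_first_paragraph.

```ruby
# Hypothetical helper: run one fetch per URL in its own thread and
# return the results in the same order as the input URLs.
def fetch_all(urls, fetcher)
  threads = urls.map { |url| Thread.new { fetcher.call(url) } }
  threads.map(&:value) # value waits for each thread and returns its result
end

# Stub standing in for a real HTTP call so the sketch runs offline.
fetcher = lambda do |url|
  sleep 0.05 # pretend network latency
  "blurb for #{url.split('/').last}"
end

urls = %w(
  http://en.wikipedia.org/wiki/Bus_error
  http://en.wikipedia.org/wiki/Washington
)

p fetch_all(urls, fetcher)
```

Because every thread is started before any result is collected, the simulated waits overlap rather than run one after another.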

Tags: Ruby

Posted on 2011-01-13 21:11 by lexus

Reposted from: https://www.cnblogs.com/lexus/archive/2011/01/13/1934915.html
