Hpricot Notes

Article link: https://blog.csdn.net/iteye_20966/article/details/81803261

The [b][color=green]search method[/color][/b] of Hpricot::Doc returns an Hpricot::Elements object (a collection of Hpricot::Elem objects). Its argument can be either an XPath expression or a CSS selector.

require 'open-uri'
require 'hpricot'
doc=Hpricot(open('http://www.tianya.cn/publicforum/content/free/1/1455739.shtml'))
content = doc.search("#pageDivTop")
puts content

[b][color=green]The search method can also be written with the division operator (/)[/color][/b]:

require 'open-uri'
require 'hpricot'
doc=Hpricot(open('http://www.tianya.cn/publicforum/content/free/1/1455739.shtml'))
content = doc/"#pageDivTop"
puts content

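The more I look at it, the more this thing resembles jQuery.
The [b][color=green]at method[/color][/b] of Hpricot::Doc returns the first element that matches the selector, wrapped as an Hpricot::Elem object. Elements and Elem objects have this method too.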

Hpricot::Elem objects have a [b][color=green][] method[/color][/b] that takes a string argument and returns the value of the attribute with that name, for example:

img=doc.at('img')
img['src'] # src value of the first img element

Elem objects also have an [b][color=green]attributes method[/color][/b] that returns an Hpricot::Attributes object, which has its own [b][color=green][] method[/color][/b]. The code above can therefore be rewritten as:

img=doc.at('img')
img.attributes['src'] # src value of the first img element

Doc, Elem, and Elements objects all have [b][color=green]inner_html and to_html methods[/color][/b].

The [b][color=green]first method[/color][/b] of an Elements object returns the first Elem in the collection; elements[0] does the same thing.

You can also keep searching within an Elements object, for example:

imgs=doc/"img"
google_logo=imgs/"#googleLogo"
divs=doc/"div"
spans=divs/"span"
# a simpler way
spans=doc/"div span"

You can even search within a single Elem object:

divs=doc/"div"
spans=divs[0]/'span'

Elements has an [b][color=green]any? method[/color][/b]: if [b][color=green]Elements#size[/color][/b] is 0, Elements#any? returns false; otherwise it returns true.

Elem objects have a [b][color=green]set_attribute method[/color][/b] that changes the value of one of the element's attributes:

require 'open-uri'
require 'hpricot'
doc=Hpricot(open('http://www.tianya.cn/publicforum/content/free/1/1455739.shtml'))
divs=doc/'div'
divs.each do |div|
  div.set_attribute :class, "newClass"
  puts div['class']
end

The following code has the same effect:

require 'open-uri'
require 'hpricot'
doc=Hpricot(open('http://www.tianya.cn/publicforum/content/free/1/1455739.shtml'))
divs=doc/'div'
divs.each do |div|
  div['class']= "newClass"
  puts div['class']
end

There is an even simpler way: the [b][color=green]set method[/color][/b] of Elements sets an attribute value on every Elem it contains:

require 'open-uri'
require 'hpricot'
doc=Hpricot(open('http://www.tianya.cn/publicforum/content/free/1/1455739.shtml'))
divs=doc/'div'
divs.set(:class=>"newClass")
divs.each do |div|
  puts div['class']
end

Note, however: [b]only an Elem can change attribute values with set_attribute and []=, and only an Elements can change them with set.[/b]

Every Elem object has [b][color=green]css_path and xpath methods[/color][/b], which build a CSS selector string and an XPath string for that element.

Next, a closer look at several commonly used methods of Elements objects:
[quote][b][color=green]at( expression, &block )[/color][/b]

Find a single element which matches the CSS or XPath expression. If a block is given, the matching element is passed to the block and the original set of Elements is returned.[/quote]

doc=Hpricot(<<html
<div id="div1">
  <span id="span1">1</span>
  <span>2</span>
</div>
<div id="div2">
  <span>3</span>
  <input type="text" id="input1" value="haliluya"/>
</div>
html
)

t=doc.search('div').at('span') do |span|
  puts span.inner_text
end
puts t
# Output:
#1
#2
#3
#<div id="div1">
#  <span id="span1">1</span>
#  <span>2</span>
#</div>

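[b][color=red]Judging from the result, the return value of at with a block does not seem to be "the original set of Elements". Is this a bug in this version, a mistake in the documentation, or a misunderstanding on my part?[/color][/b] Also note that [b]at with a block only works on Elements objects; Doc and Elem objects can only call at without a block, and passing a block there has no effect.[/b]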
[quote][b][color=green]search( expression, &block )[/color][/b]

Finds all elements which match the CSS or XPath expression. If a block is given, the matching elements are passed to the block, in turn, and the original set of Elements is returned.[/quote]
With a block, the search method of an Elements object does return "the original set of Elements".
[quote][b][color=green]append( html_string )[/color][/b]

Add HTML from html_string within each element, to the end of each element’s content.[/quote]
[quote][b][color=green]prepend( html_string )[/color][/b]

Add HTML from html_string with each element, to the beginning of each element’s content.[/quote]
[quote][b][color=green]wrap( html_string )[/color][/b]

Wraps each element in the set inside the element created by html_string. If more than one element is found in the html_string, Hpricot locates the deepest spot inside the first element.

doc.search("a[@href]").wrap(%{<div class="link"><div class="link_inner"></div></div>})

This code wraps every link on the page inside a div.link and a div.link_inner nest.[/quote]
A special note here: at first I thought wrap had no effect, but it turned out I had made a mistake. My original code looked like this:

doc=Hpricot(<<html
<div id="div1">
  <span id="span1">1</span>
  <span>2</span>
</div>
<div id="div2">
  <span>3</span>
  <input type="text" id="input1" value="haliluya"/>
</div>
html
)

divs=doc/'div'
divs.wrap "<p></p>" # can be shortened to divs.wrap "<p>"; also try divs.wrap "<p><h>"
puts divs

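The output showed no change at all. I later found that the [b][color=green]after method[/color][/b] has the same issue. Thinking it over, the divs variable holds the two div elements, and whatever wrap adds sits outside those divs, so printing divs naturally cannot show a change outside them. Printing the doc variable at this point does show the change.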

Elem objects have a [b]swap[/b] method, which replaces the element, together with its contents, with a new string (either HTML or plain text):

doc=Hpricot(<<html
<div id="div1">
  <span id="span1">1</span>
  <span>2</span>
</div>
<div id="div2">
  <span>3</span>
  <input type="text" id="input1" value="haliluya"/>
</div>
html
)

divs=doc/'div'
divs.first.swap("<p>p</p>")
puts divs
puts "======="
puts doc

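The result shows that after swap, printing the divs variable directly shows no change, while printing doc does. Presumably this is because divs still refers to the old element, which has been replaced in the document; my guess is that swap is normally a one-off operation anyway.
The [b][color=green]remove method[/color][/b] of Elements removes every element in the collection from the document.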

# Remove all elements in this list from the document which contains them.
doc = Hpricot("<html>Remove this: <b>here</b></html>")
doc.search("b").remove
doc.to_html
# => "<html>Remove this: </html>"

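Both Elements and Elem have an after method (I tried it, and there is also a before method, which of course does the opposite of after). It inserts a string (either HTML or plain text) after the given element.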

doc=Hpricot(<<html
<div id="div1">
  <span id="span1">1</span>
  <span>2</span>
</div>
<div id="div2">
  <span>3</span>
  <input type="text" id="input1" value="haliluya"/>
</div>
html
)

divs=doc/'div'
divs.after "<a href='#'/>"
puts doc

Elem's [b][color=green]next method[/color][/b] returns the next node and does not skip text nodes.
[b][color=green]next_sibling[/color][/b] returns the next node, skipping text nodes.
Their counterparts are [b][color=green]previous and previous_sibling[/color][/b].

require 'hpricot'

doc=Hpricot(<<HTML
<TABLE cellspacing=0 border=0 width=100% >
  <TR>
    <TD>
      <font size=-1 color=green><br>
      <center>
        <div id="tianyaBrandSpan1"></div>
        <div id="adsp_content_banner_3" style="background-color:#F5F9FA;padding:10px 0 0 0;"></div>
        <div id="adsp_content_adtopic"></div>
        <div id="adsp_content_banner_1" style="background-color:#F5F9FA"></div>
      </center>
    </TD>
  </TR>
</table>
HTML
)

puts doc.at('#tianyaBrandSpan1').next # a text node (Hpricot::Text) containing a blank line
puts doc.at('#tianyaBrandSpan1').next_sibling # the div node whose id is adsp_content_banner_3

I already know CSS selectors reasonably well, so I won't take notes on them here. For details, see: [url]http://wiki.github.com/hpricot/hpricot/hpricot-css-search[/url]
[url]http://wiki.github.com/hpricot/hpricot/supported-css-selectors[/url]
Finally:
[quote]The CSS selectors are almost always easier to write and cleaner to combine. In addition, XPath support is pretty limited at present.[/quote]
Time for bed.