How whitespace text nodes arise, and how to avoid creating insignificant whitespace text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object

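When indented XML is parsed, insignificant whitespace text nodes are created from the whitespace between closing and opening tags. For example, from the following XML:

    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

whose string representation is as follows,

    "<note>\n  <to>Tove</to>\n  <from>Jani</from>\n  <heading>Reminder</heading>\n  <body>Don't forget me this weekend!</body>\n</note>\n"

the following Document is created: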

    #(Document:0x3fc07e4540d8 {
      name = "document",
      children = [
        #(Element:0x3fc07ec8629c {
          name = "note",
          children = [
            #(Text "\n  "),
            #(Element:0x3fc07ec8089c {
              name = "to",
              children = [ #(Text "Tove")]
              }),
            #(Text "\n  "),
            #(Element:0x3fc07e8d8064 {
              name = "from",
              children = [ #(Text "Jani")]
              }),
            #(Text "\n  "),
            #(Element:0x3fc07e8d588c {
              name = "heading",
              children = [ #(Text "Reminder")]
              }),
            #(Text "\n  "),
            #(Element:0x3fc07e8cf590 {
              name = "body",
              children = [ #(Text "Don't forget me this weekend!")]
              }),
            #(Text "\n")]
          })]
      })

There are a lot of whitespace nodes of type Nokogiri::XML::Text here.

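I would like to count the children of each node in a Nokogiri XML Document, and to access the first or last child, excluding insignificant whitespace. I would like either not to parse those nodes at all, or to distinguish them from significant text nodes, such as those inside an element, like "Tove". Here is an rspec for what I am looking for:

    require 'nokogiri'
    require_relative 'spec_helper'

    xml_text = <<-XML
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>
    XML

    xml = Nokogiri::XML(xml_text)

    # Placeholder: this is the method I want to implement.
    def significant_nodes(node)
      return 0
    end

    describe "Stackoverflow Question" do
      it "should return the number of significant nodes in nokogiri." do
        expect(significant_nodes(xml.css('note'))).to eq 4
      end
    end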

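I would like to know how to write the significant_nodes function.

If I change the XML to:

    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
      <footer></footer>
    </note>

then, when I create the Document, I still want the footer to be represented; using config.noblanks is not an option.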

Answer

You can parse the XML string with the NOBLANKS option. Consider the following example:

    require 'nokogiri'

    # (root element name assumed)
    string = "<foo>\n  <bar>bar</bar>\n</foo>"
    puts string
    # <foo>
    #   <bar>bar</bar>
    # </foo>

    document_with_blanks = Nokogiri::XML.parse(string)

    document_without_blanks = Nokogiri::XML.parse(string) do |config|
      config.noblanks
    end

    document_with_blanks.root.children.each { |child| p child }
    #<Nokogiri::XML::Text:0x3ffa4e153dac "\n  ">
    #<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
    #<Nokogiri::XML::Text:0x3ffa4e15335c "\n">

    document_without_blanks.root.children.each { |child| p child }
    #<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>

NOBLANKS does not remove empty nodes:

    # (root element name assumed)
    doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
      config.noblanks
    end

    doc.root.children.each { |child| p child }
    #<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">

As the OP points out, the documentation on parser options on the Nokogiri website (and on the libxml website) is rather cryptic. The following spec describes the behavior of the NOBLANKS option:

    require 'rspec/autorun'
    require 'nokogiri'

    def parse_xml(xml_string)
      Nokogiri.XML(xml_string) { |config| config.noblanks }
    end

    # (element names in the XML strings below are assumed)
    describe "Nokogiri NOBLANKS parser option" do
      it "removes whitespace nodes if they have siblings" do
        doc = parse_xml("<root>\n <child></child></root>")
        expect(doc.root.children.size).to eq(1)
        expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
      end

      it "doesn't remove whitespaces nodes if they have no siblings" do
        doc = parse_xml("<root>\n </root>")
        expect(doc.root.children.size).to eq(1)
        expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
      end

      it "doesn't remove empty nodes" do
        doc = parse_xml('<root><child></child></root>')
        expect(doc.root.children.size).to eq(1)
        expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
      end
    end

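Another answer

You can write a query that returns only element nodes and ignores text nodes. In XPath, * matches only elements, so the query could look like this (searching the whole doc):

    doc.xpath('//note/*')

Or, if you want to use CSS:

    doc.css('note > *')

If you want to implement the significant_nodes method, the query needs to be relative to the node passed in:

    def significant_nodes(node)
      node.xpath('./*').size
    end

I don't know how to do a relative query with CSS, so you may need to stick with XPath.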

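Another answer

Nokogiri's noblanks config option does not remove every whitespace Text node that has siblings:

    describe "Nokogiri NOBLANKS parser option" do
      it "doesn't remove whitespace Text nodes if they're surrounded by non-whitespace Text node siblings" do
        # (element names are illustrative)
        doc = parse_xml("<root>1<a></a>\n<b></b>\n<c></c>5</root>")
        children = doc.root.children
        expect(children.size).to_not eq(5)
        expect(children.size).to eq(7) # Because the two newline Text nodes are not ignored
        expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
      end
    end

I'm not sure why Nokogiri was programmed to work this way. I think it would be better either to ignore all whitespace Text nodes or not to ignore any Text nodes.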