XPath usage demo

This is a translated article. The translation is not very good, so please bear with me.

Original article: http://manual.calibre-ebook.com/xpath.html


This tutorial gives a gentle introduction to XPath, a query language that can be used to select arbitrary parts of HTML documents in calibre. XPath is a widely used standard, and searching for it online will turn up plenty of information. This tutorial, however, focuses on using XPath for ebook-related tasks such as finding chapter headings in an unstructured HTML document.


The simplest form of selection is to select tags by name. For example, suppose you want to select all the <h2> tags in a document. The XPath query for this is simply:

//h:h2        (Selects all <h2> tags)

The prefix // means "search at any level of the document". Now suppose you want to search for <span> tags that are inside <a> tags. That can be achieved with:

//h:a/h:span    (Selects <span> tags inside <a> tags)
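
As a rough illustration of the two queries above, here is a minimal sketch using Python's standard library module xml.etree.ElementTree instead of calibre itself. The XHTML fragment is invented for the example, and ElementTree supports only a subset of XPath: it needs the relative form .// where the queries above use //, and the h: prefix has to be mapped to the XHTML namespace by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment, made up for this illustration.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h2>Chapter One</h2>
    <p><a href="#n1"><span>note 1</span></a></p>
    <h2>Chapter Two</h2>
    <p><span>not inside a link</span></p>
  </body>
</html>"""

# Map the h: prefix to the XHTML namespace, as the queries in the text assume.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# ElementTree needs ".//" where full XPath accepts "//".
headings = [el.text for el in root.findall(".//h:h2", NS)]

# <span> tags that are children of <a> tags, at any depth in the document.
link_spans = [el.text for el in root.findall(".//h:a/h:span", NS)]

print(headings)    # ['Chapter One', 'Chapter Two']
print(link_spans)  # ['note 1']
```

The second <span> is not reported because it does not sit inside an <a> tag.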


If you want to search for tags at a particular level in the document, change the prefix:

/h:body/h:div/h:p (Selects <p> tags that are children of <div> tags that are
             children of the <body> tag)


This will match only <p>A very short ebook to demonstrate the use of XPath.</p> in the Sample ebook, but not any of the other <p> tags. The h: prefix in the above examples is needed to match XHTML tags. This is because, internally, calibre represents all content as XHTML. In XHTML, tags have a namespace, and h: is the namespace prefix for HTML tags.

Now suppose you want to select both <h1> and <h2> tags. To do that, we need an XPath construct called a predicate. A predicate is simply a test that is used to select tags. Tests can be arbitrarily powerful, and as this tutorial progresses you will see more powerful examples. A predicate is created by enclosing the test expression in square brackets:

//*[name()='h1' or name()='h2']


There are several new features in this XPath expression. The first is the use of the wildcard *. It means "match any tag". Now look at the test expression name()='h1' or name()='h2'. name() is an example of a built-in function. It simply evaluates to the name of the tag. So by using it, we can select tags whose names are either h1 or h2. Note that the name() function ignores namespaces, so there is no need for the h: prefix. XPath has several useful built-in functions. A few more will be introduced in this tutorial.
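
The effect of the wildcard plus predicate can be sketched in plain Python. This is an emulation rather than a real XPath evaluation: the standard library's xml.etree.ElementTree does not support name() or the or operator, so the helper local_name() below (a hypothetical name introduced only for this example) strips the namespace by hand, and the XHTML fragment is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment, made up for this illustration.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>A very short ebook</h1>
    <h2>Chapter One</h2>
    <p>This is a truly fascinating chapter.</p>
    <h2>Chapter Two</h2>
  </body>
</html>"""


def local_name(element):
    """Return the tag name without its '{namespace}' part (hypothetical helper)."""
    return element.tag.rsplit("}", 1)[-1]


root = ET.fromstring(XHTML)

# root.iter() visits every element, like //* ; the condition acts as the predicate.
matches = [el for el in root.iter() if local_name(el) in ("h1", "h2")]
titles = [el.text for el in matches]

print(titles)  # ['A very short ebook', 'Chapter One', 'Chapter Two']
```

Each element is tested once, in document order, which is how a predicate narrows down the set of tags matched by the part of the expression before the square brackets.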


Selecting by attributes

To select tags based on their attributes, the use of predicates is required:

//*[@style]              (Select all tags that have a style attribute)
//*[@class="chapter"]    (Select all tags that have class="chapter")
//h:h1[@class="bookTitle"] (Select all h1 tags that have class="bookTitle")

Here, the @ operator refers to the attributes of the tag. You can use some of the XPath built-in functions to perform more sophisticated matching on attribute values.
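
The three attribute queries above happen to fall inside the XPath subset that Python's standard library xml.etree.ElementTree understands, so they can be tried outside calibre. The following is a minimal sketch under that assumption; the document is a trimmed copy of the Sample ebook, and the leading // is again written as .// because ElementTree requires it.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Sample ebook, with the XHTML namespace declared explicitly.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1 class="bookTitle">A very short ebook</h1>
    <p style="text-align:right">Written by Kovid Goyal</p>
    <h2 class="chapter">Chapter One</h2>
    <h2 class="chapter">Chapter Two</h2>
  </body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)

# Any tag that has a style attribute, whatever its value.
styled = [el.text for el in root.findall(".//*[@style]")]

# Any tag whose class attribute is exactly "chapter".
chapters = [el.text for el in root.findall(".//*[@class='chapter']")]

# Only <h1> tags whose class attribute is exactly "bookTitle".
book_titles = [el.text for el in root.findall(".//h:h1[@class='bookTitle']", NS)]

print(styled)       # ['Written by Kovid Goyal']
print(chapters)     # ['Chapter One', 'Chapter Two']
print(book_titles)  # ['A very short ebook']
```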


Selecting by tag content

Using XPath, you can even select tags based on the text they contain. The best way to do this is to use the power of regular expressions via the built-in function re:test():

//h:h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or
                                          section)

Here the . operator refers to the contents of the tag, just as the @ operator referred to its attributes.
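
The re:test() query above can be approximated with Python's standard library. This is an emulation and not an XPath evaluation, since xml.etree.ElementTree has no re:test() function. It assumes that re:test() succeeds when the pattern is found anywhere in the string, and that . stands for the concatenated text of the tag. The fragment and the helper text_of() are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment, made up for this illustration.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h2>Chapter One</h2>
    <h2>Acknowledgements</h2>
    <h2>SECTION <i>two</i></h2>
  </body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

# The 'i' flag in the query corresponds to re.IGNORECASE here.
PATTERN = re.compile(r"chapter|section", re.IGNORECASE)


def text_of(element):
    """Concatenate all text inside the element, roughly what '.' stands for."""
    return "".join(element.itertext())


root = ET.fromstring(XHTML)

matching = [
    text_of(h2)
    for h2 in root.findall(".//h:h2", NS)
    if PATTERN.search(text_of(h2))
]

print(matching)  # ['Chapter One', 'SECTION two']
```

The heading "Acknowledgements" is skipped because it contains neither word, while "SECTION two" matches despite the different case and the nested <i> tag.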


Sample ebook

<html>
    <head>
        <title>A very short ebook</title>
        <meta name="charset" value="utf-8" />
    </head>
    <body>
        <h1 class="bookTitle">A very short ebook</h1>
        <p style="text-align:right">Written by Kovid Goyal</p>
        <div class="introduction">
            <p>A very short ebook to demonstrate the use of XPath.</p>
        </div>

        <h2 class="chapter">Chapter One</h2>
        <p>This is a truly fascinating chapter.</p>

        <h2 class="chapter">Chapter Two</h2>
        <p>A worthy continuation of a fine tradition.</p>
    </body>
</html>

XPath built-in functions

name()
The name of the current tag.
contains()
contains(s1, s2) returns true if s1 contains s2.
re:test()
re:test(src, pattern, flags) returns true if the string src matches the regular expression pattern. A particularly useful flag is i, which makes matching case insensitive. A good primer on the syntax for regular expressions can be found at regexp syntax.
