Python 正则表达式 Howto（1）

最新推荐文章于 2024-09-22 17:12:26 发布

xjtufjj

最新推荐文章于 2024-09-22 17:12:26 发布

阅读量845

点赞数

文章标签： python Python 正则表达式语言

很久没有写博客了，都有点忘记如何操作了。

最近闲暇时学习了一下python，觉得中文的资料实在是太.... 所以看了一些官网上的HOWTO，觉得不错，翻译一下给大家共享。希望能对喜欢python 的有所帮助。为了督促自己能完成这个系列的翻译，所以在这里标记一下。为了不误导读者，我把原文也贴上了~

原文地址：http://docs.python.org/dev/howto/regex.html

Introduction

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.

The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

开篇

正则表达式（RE 模块）是python中内嵌的一个微型的，定制化程度极高的编程语言，在python中通过re模块来实现的。利用正则表达式，你可以为各种你想匹配的文本设计不同的规则，你可以用来匹配英文句子，email地址， tex 命令，甚至于中文，反正你喜欢的东西就可以匹配，找出你需要的东西。 “这个字符串符合这个规则么？”　或者是“这个字符串中有我想要的匹配么？”这些问题你都可以在正则匹配中找到答案。　此外，　正则表达式还能修改字符串，　还能做字符串切割。　总而言之，正则表达式是一个相当相当强大的工具:)

在python中，正则表达式的引擎是使用C写的，所以你不用怀疑其效率。正则表达式会被编译成字节码，这些字节码使用C编译的引擎执行。在一些高级的应用中，我们需要注意RE的引擎室如何实现的，或者是我们需要理解python的匹配引擎，这样我们才能写出运行快的正则表达式。这个文档只是一个howto文档，至于优化方面的内容，除了学好python的正则表达式之外，你还得花点时间学学正则表达式的原理，推荐《精通正则表达式》我在看这本书，不过太忙了，只能囫囵吞枣。

正则表达式是一门及其精简，定制化程度极高的语言，这句话另外一层含义就是不是所有的工作都可以使用正则表达式完成。（以前在I司工作的使用，有个项目使用正则表达式抓取网页中的注入代码，就遇到一些不能处理的情况。）另外有些工作可以使用正则表达式完成，但是耗费的精力远远超过使用普通python语言，这种情况下就抛弃RE吧，直接python搞定，这样做的后果是python的运行速度远低于正则表达式，并且可读性肯定不如正则表达式。

Simple Patterns

We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.

For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.

从简单的模式开始

让我们先从匹配字符开始吧。字符匹配是最简单的匹配方式，这个毋庸置疑, 但是学好字符匹配也得记住一些东西。再次重申，本文只是一个HOWTO，至于一些正则匹配背后的浩大的计算机科学原理，翻一翻大学里学习的编译原理吧（当年觉得这门课奇难，花了很多的新思，最后成绩都不理想，偶有一同学，奇猛，看了几天书，就可以融会贯通，佩服之！）

Matching Characters¶

Most letters and characters will simply match themselves. For example, the regular expression test will match the string test exactly. (You can enable a case-insensitive mode that would let this RE match Test or TEST as well; more about this later.)

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing various metacharacters and what they do.

字符匹配

对于大多数字符而言，匹配字符就是他们自己。比如说，我打算精确匹配’test‘，我就是简单的使用模式’test‘就可以了。（学英语的人可要注意了，英语是有大写小写的， re模块已经考虑到这点了，谁让是老外写的呢？你可以使能大小写敏感模式，这样test就可以匹配test 或者TEST了，后面详论）。

我最恨意外了，但是凡事总有意外。一些字符回座位特殊匹配字符。这些字符不能匹配他们自己，而是具有特异功能。这些字符通过重复，转义等特殊功能影响着re，所以得刮目相看啊~ 平凡的东西大家都知道，特殊的东西得好好学。本文主要讨论这些字符以及他们的在正则匹配中的含义。

Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.
. ^ $ * + ? { } [ ] \ | ( )
让我们先来看一下这些特殊字符的清单吧，具体含义我们会在下文中加以重点关注。
. ^ $ * + ? { } [ ] \ | ( )
The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc]will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class;'^' outside a character class will simply match the '^' character. For example, [^5] will match any character except '5'.

我们首先来看[]，这个中括号可不简单，在正则匹配中，他们用来标示一个用来匹配的字符集。我们可以在[]中单列，也可以给出一个范围，如果是给你范围的话需要用'-'隔开。来个例子吧，更容易理解。比如，我们想匹配a,b,c,或者是d，我们可以使用[abcd]也可以简单的写成[a-d]。给个更加有用的例子，我们想匹配所有的小写字符， [a-z] 就可以了，不用[abcdefgh。。。]累死了。这规则表达式的作者替你考虑到这些了 :)
特殊字符在[]中就变的不特殊了，他们就会蜕变成平头百姓，不再具有特殊功能。来个例子吧。[akm$]会匹配'a', 'k', 'm', 或者'$'中的任意字符。'$' 本来是特殊字符，但是在[]中就无能为力了。

另外一种操作方式是求补集，这种方式可以让你匹配不再这个集合中的字符。求补集（取反）通过在[]的开始出使用字符‘^’实现。在[]使用'^', 只会简单的匹配'^'字符，并不具有取反的功能。权利是有范围的:) 例子[^5]会匹配所有不是5的字符。

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or\, you can precede them with a backslash to remove their special meaning: \[ or \\.

Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace. The following predefined special sequences are a subset of those available. The equivalent classes are for bytes patterns. For a complete list of sequences and expanded class definitions for Unicode string patterns, see the last part of Regular Expression Syntax.

最麻烦的同时也是最重要的可能就是反斜杠了（从大学中学习C就开始学习这个特殊字符，功能太强大）。在python的字符串中，反斜杠可以作用在很多的字符上完成特殊的功能。另外他也可以让特殊字符回归其本来面目。来个例子， []是特殊字符，你想匹配'[', 你就得用'\['（当然你也可以使用[[]）。还有一个例子就是\, 我们要匹配这个字符，我们就得使用\\。我们想匹配\\ 我们就得用\\\\。太麻烦了~

一些预先定义的以\开始的序列非常有用，这些序列往往代表一类字符。比如数字，字母，非空字串等等。下面列出了常用一些序列，记住他们，非常有用。至于一个非常全的，包含扩展类定义以及unicode的规则，请参见《正则表达式语法》

\d

Matches any decimal digit; this is equivalent to the class [0-9].

匹配数字，等价于[0-9]

\D

Matches any non-digit character; this is equivalent to the class [^0-9].

匹配非数字，等价于[^0-9]

\s

Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

匹配空白字符，比如空格，tab，回车等等。等价于[ \t\n\r\f\v] （\f\v）什么意思？

\S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

匹配非空白字符，等价于[^ \t\n\r\f\v]

\w

Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

匹配字母或者数字，等价于[ a-zA-Z0-9]

\W

Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

匹配非（字母或数字），等价于[ ^a-zA-Z0-9]

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or'.'.

[]不能剥夺者序列的特殊功能。比如[\s,.]能够匹配任意空白字符， , 或者.

The final metacharacter in this section is .. It matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. '.' is often used where you want to match “any character”.

最后一个特殊字符时.，这个字符能匹配任意字符，但是不能夸行，换句话说这个字符不能匹配\n. （这一点得注意）。如果你想匹配所有的字符，不能使用. 。有两种方法解决这个问题，使用[\s\S]或者使用re.DOTALL这种模式。