Does Linux match filename case in a strange way?

Latest recommended article published on 2021-05-14 11:03:07

辰岡墨竹

Category: Software usage. Tags: shell, linux, grep, localization, pattern matching

Article link: https://blog.csdn.net/bokutake/article/details/50371098


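Today a colleague asked me a strange question: he wanted to use [a-z]* in the shell to match filenames that start with a lowercase letter, but the results included files that start with an uppercase letter.
I tried it myself: I created a directory and touched a few files: abc AB 123
Then I ran echo [a-z]*, and the result was: abc AB!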
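The experiment can be reproduced in a scratch directory. This is a minimal sketch, assuming a POSIX shell with mktemp available; the helper name glob_lower is made up for illustration. It expands the pattern in a child shell so that the locale given as the first argument is in effect during pathname expansion. Only the C locale result is shown, because under a locale such as en_US.UTF-8 the result depends on the system's collation data and on the shell version.

```shell
#!/bin/sh
# glob_lower LOCALE (hypothetical helper): expand [a-z]* under the given
# locale inside a throwaway directory containing abc, AB and 123.
glob_lower() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        touch abc AB 123
        # The child shell starts with LC_ALL already set, so its
        # pathname expansion uses that locale's collating order.
        LC_ALL=$1 sh -c 'echo [a-z]*'
    )
    rm -rf "$dir"
}

glob_lower C    # prints: abc
# glob_lower en_US.UTF-8    # locale- and shell-dependent; may also list AB
```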
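I had rarely used bracket matching before, so I looked it up. This kind of pattern is called a character class. Section 5.4.3 of the book 《Linux 命令、编辑器与 shell 编程（第 2 版）》 (Linux Commands, Editors, and Shell Programming, 2nd edition), which covers the shell special characters "[]", states explicitly: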

[a-z] stands for all lowercase English letters, and [a-zA-Z] stands for all English letters, both uppercase and lowercase.

Even the GNU findutils manual says the same in its Shell-Pattern-Matching section:

[string]
Matches exactly one character that is a member of the string string. This is called a character class. As a shorthand, string may contain ranges, which consist of two characters with a dash between them. For example, the class ‘[a-z0-9_]’ matches a lowercase letter, a number, or an underscore. You can negate a class by placing a ‘!’ or ‘^’ immediately after the opening bracket. Thus, ‘[^A-Z@]’ matches any character except an uppercase letter or an at sign.

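How odd. Had he found a bug in CentOS 7? I tried CentOS 6.5, Fedora 19 and other systems, and the result was the same.
After some Googling I finally found the answer. Someone on Stack Exchange wanted to match filenames case-insensitively in the shell: stackexchange - How to match case insensitive patterns with ls?
One answer mentioned Bash's nocaseglob option, and I also found a similar option online, nocasematch. The Bash-Hackers Wiki explains both under Shell Options: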

nocaseglob
If set, Bash matches filenames in a case-insensitive fashion when performing pathname expansion.

nocasematch
If set, Bash matches patterns in a case-insensitive fashion when performing matching while executing case or [[ conditional commands.

But when I checked with shopt on my machine, both options were off, so they should not be the cause.
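To see what nocaseglob changes when it is enabled, here is a small sketch. It assumes bash is installed and runs it as an explicit child process, since shopt is a bash builtin; LC_ALL=C keeps the collating order out of the picture. The function name nocaseglob_demo is made up for illustration.

```shell
#!/bin/sh
# nocaseglob_demo (hypothetical helper): show pathname expansion with
# nocaseglob off and then on, inside a throwaway directory.
nocaseglob_demo() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        touch abc AB 123
        LC_ALL=C bash -c '
            shopt nocaseglob nocasematch   # list the current state of both options
            echo a*                        # option off: abc
            shopt -s nocaseglob
            echo a*                        # option on:  AB abc
        '
    )
    rm -rf "$dir"
}

nocaseglob_demo
```

With both options off, as on the machine described above, a range that matches uppercase names has to come from somewhere else.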
Another answer to that question mentioned LC_COLLATE=en_US and LC_COLLATE=C and linked to an article that explains them: Collate Order and Character Set - GLOB patterns and accents. That article goes straight to the point:

$ LC_COLLATE=C ; export LC_COLLATE # collate in strict numeric order
$ touch a A b B c C x X y Y z Z
$ ls
A B C X Y Z a b c x y z # expected sorted output
$ ls | sort | fmt
A B C X Y Z a b c x y z
$ echo [a-z]
a b c x y z
$ echo [A-Z]
A B C X Y Z
……
$ LC_COLLATE=en_US ; export LC_COLLATE # many Linux distros set this!
$ ls
a A b B c C x X y Y z Z # note the new collate order!
$ ls | sort | fmt
a A b B c C x X y Y z Z
$ echo [a-z]
a A b B c C x X y Y z # note how ‘Z’ is outside the range!
$ echo [A-Z]
A b B c C x X y Y z Z # note how ‘a’ is outside the range!

With many modern Linux locale settings, such as en_US, en_CA, or even en_CA.utf8, the character set is not laid out in strict numeric order; the collating order places upper and lower case together, in this order:

a A b B c C …. x X y Y z Z

and so the GLOB pattern [a-z] (which we expect to match only lower-case letters) actually matches all the lower-case and all but one of the upper-case letters (everything from ‘a’ to ‘z’) which means a A b B c C …. x X y Y z (and not ‘Z’)! The GLOB pattern [A-Z] (which we expect to match only upper-case letters) actually matches all the upper-case letters and all but one of the lower-case letters (everything from ‘A’ to ‘Z’) which means A b B c C …. x X y Y z Z (and not ‘a’)!

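The article is interesting and gives the pattern a more precise name: GLOB pattern. glob(7) - Linux man page explains that long ago /etc/glob was used to expand filename patterns, and that the feature was later built into the shell. Both articles note that the C collating order is case-sensitive, but it sorts uppercase letters separately from lowercase ones, which feels unnatural, and it also handles accented characters poorly: a A á Á b B c C é É x X y Y z Z becomes A B C X Y Z a b c x y z Á É á é. The article's advice is: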
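The C ordering described here is easy to check with sort. A minimal sketch, assuming standard printf, sort and paste utilities; the function name c_order is made up for illustration. Replacing C with a locale such as en_US.UTF-8 should show the interleaved order on systems that have that locale installed, but that result is not asserted here.

```shell
#!/bin/sh
# c_order (hypothetical helper): sort a few mixed-case letters in the
# C locale, where comparison is by byte value, and join them on one line.
c_order() {
    printf '%s\n' a A b B z Z | LC_ALL=C sort | paste -s -d ' ' -
}

c_order    # prints: A B Z a b z
```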

In multi-lingual countries such as Canada, pathnames will often contain accents. Your programs need to handle them correctly. Avoid character ranges containing dashes, and use the POSIX character classes that aren’t affected by the character collating sequence being used:

$ rm [a-z]* # WRONG - dependent on collating order
$ rm [[:lower:]]* # RIGHT - use the POSIX class that always works

To be safe, always start your scripts with a correct setting of LC_COLLATE:

#!/bin/sh -u
PATH=/bin:/usr/bin ; export PATH
umask 022
LC_COLLATE=C ; export LC_COLLATE # collate in strict numeric order
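The advice to prefer POSIX classes can be sketched as a small helper. This assumes a POSIX shell whose pattern matching supports bracket classes; the function name list_lower_first and the demo directory are made up for illustration. Note that [[:lower:]] is governed by the character classification of the locale (LC_CTYPE) rather than by its collating order.

```shell
#!/bin/sh
# list_lower_first DIR (hypothetical helper): print the names in DIR that
# start with a lowercase letter, using the POSIX class instead of a range.
list_lower_first() {
    (
        cd "$1" || exit 1
        for f in [[:lower:]]*; do
            [ -e "$f" ] || continue   # an unmatched pattern stays literal; skip it
            printf '%s\n' "$f"
        done
    )
}

demo_dir=$(mktemp -d) || exit 1
touch "$demo_dir/abc" "$demo_dir/AB" "$demo_dir/123"
list_lower_first "$demo_dir"    # prints: abc
rm -rf "$demo_dir"
```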

Digging further, I found that this had already been reported back in the CentOS 4 days: 0001511: bash shell produces incorrect results for regular expression [A-Z]*. A reply to that bug points out that Chapter 4, Regular expressions, of the "Bash Beginner's Guide" explains it:

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale’s collating sequence and character set. For example, in the default C locale, “[a-d]” is equivalent to “[abcd]”. Many locales sort characters in dictionary order, and in these locales “[a-d]” is typically not equivalent to “[abcd]”; it might be equivalent to “[aBbCcDd]”, for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value “C”.

That section is about regular expressions and uses grep as its example, so does the problem exist in grep as well? That led me to stackoverflow - grep case sensitive [A-Z]?

I cannot get grep to case sensitive search with this pattern

$ grep 'T[A-Z]' test.txt
The Quick Brown Fox Jumps Over The Lazy Dog
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG

However, when I tried it on CentOS 7 (LANG=zh_CN), grep still treated a range such as [a-z] in a regular expression as case-sensitive even with LC_COLLATE=zh_CN; no uppercase letters were matched.
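The grep check can be repeated with a sketch like the following. It assumes a standard grep and reuses the two sample lines from the Stack Overflow question in shortened form; the function name grep_upper_pair is made up for illustration. Only the C locale result is shown, since other locales may or may not behave differently depending on the grep build and the locale data.

```shell
#!/bin/sh
# grep_upper_pair LOCALE (hypothetical helper): look for a T followed by a
# character in the range A-Z, with the given locale in effect for grep.
grep_upper_pair() {
    printf '%s\n' 'The Quick Brown Fox' 'THE QUICK BROWN FOX' |
        LC_ALL=$1 grep 'T[A-Z]'
}

grep_upper_pair C    # prints only: THE QUICK BROWN FOX
```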
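Does (should) LC_COLLATE affect character ranges? notes that on some systems grep's handling of GLOB-like character ranges is affected by the collation setting and on others it is not, and that different collation settings give different results. en_US is actually not the same as en_US.utf-8; the former is equivalent to en_US.iso-8859-1. One answer cites the Regular Expressions chapter of the POSIX specification, which states: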

Range expressions must not be used in portable applications because their behaviour is dependent on the collating sequence. Ranges will be treated according to the current collating sequence, and include such characters that fall within the range based on that collating sequence, regardless of character values. This, however, means that the interpretation will differ depending on collating sequence. If, for instance, one collating sequence defines ä as a variant of a, while another defines it as a letter following z, then the expression [ä-z] is valid in the first language and invalid in the second.
In the following, all examples assume the collation sequence specified for the POSIX locale, unless another collation sequence is specifically defined.
The starting range point and the ending range point must be a collating element or collating symbol. An equivalence class expression used as a starting or ending point of a range expression produces unspecified results. An equivalence class can be used portably within a bracket expression, but only outside the range. For example, the unspecified expression [[=e=]-f] should be given as [[=e=]e-f]. The ending range point must collate equal to or higher than the starting range point; otherwise, the expression will be treated as invalid. The order used is the order in which the collating elements are specified in the current collation definition. One-to-many mappings (see the description of LC_COLLATE in Locale ) will not be performed. For example, assuming that the character eszet (ß) is placed in the collation sequence after r and s, but before t and that it maps to the sequence ss for collation purposes, then the expression [r-s] matches only r and s, but the expression [s-t] matches s, ß or t.
The interpretation of range expressions where the ending range point is also the starting range point of a subsequent range expression (for instance [a-m-o]) is undefined.

That may be what the standard says, but grep matches differently under en_US.utf-8 and en_US.iso-8859-1. Isn't that a bug?
awk can show a similar problem as well: Why are capital letters included in a range of lower-case letters in an awk regex?
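For awk, pinning the locale on the command line gives the traditional interpretation of the range. A minimal sketch, assuming any POSIX awk; the function name awk_lower_first is made up for illustration.

```shell
#!/bin/sh
# awk_lower_first (hypothetical helper): print the input lines that start
# with a lowercase ASCII letter, forcing the C locale for awk only.
awk_lower_first() {
    printf '%s\n' abc AB 123 | LC_ALL=C awk '/^[a-z]/'
}

awk_lower_first    # prints: abc
```

Setting LC_ALL only for the one command leaves the rest of the script, such as message language and sort order elsewhere, untouched.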

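Conclusion
Avoid range patterns such as [a-z], especially in the shell, and preferably in regular expressions and awk as well. These patterns are likely to be affected by the system language and collation settings, and different software versions may behave differently (or have different bugs). Prefer POSIX named character classes such as [[:lower:]] or [[:alpha:]], which are not affected by the collating sequence; patterns that match Unicode code point ranges are another option and should also be deterministic. If you must use [a-z] to match lowercase English letters, make sure the effective collation locale is C or POSIX, whether it is set through LC_ALL, LC_COLLATE, or LANG (LC_ALL overrides LC_COLLATE, which overrides LANG).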