Python glob usage: using glob.glob and regular expressions with Unicode filenames in a filesystem-independent way

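This article discusses how to match filenames across platforms and filesystems on different operating systems, and offers a practical solution for handling differences in Unicode normalization in particular.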

I am working on a library that I want to keep independent of platform, filesystem, and Python version (2.x/3.x). However, I don't know how to glob for files and match the filenames against regular expressions in a platform- and filesystem-independent way.

E.g. (on Mac, using IPython, Python 2.7):

In [7]: from glob import glob
In [8]: !touch 'ü-0.é'  # Create the file in the current folder
In [9]: glob(u'ü-*.é')
Out[9]: []
In [10]: import unicodedata as U
In [11]: glob(U.normalize('NFD', u'ü-*.é'))
Out[11]: [u'u\u0308-0.e\u0301']

However, this doesn't work on Linux or Windows, where I would need unicodedata.normalize('NFC', u'ü-*.é'). The same problem arises when I try to match the filename against a regular expression: on Mac, only a Unicode regular expression normalized as NFD matches the filename, whereas on Linux/Windows only an NFC-normalized one does (I use the re.UNICODE flag in both cases).
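To make the mismatch concrete, here is a small sketch that does not touch the filesystem. It assumes the name comes back decomposed (as in the Mac session above) and shows that a composed pattern only matches once the name is normalized to the same form. The helper name `match_normalized` is made up for this illustration.

```python
import re
import unicodedata

# The same visible name in two normalization forms.
composed = u'\u00fc-0.\u00e9'                        # NFC: 5 code points
decomposed = unicodedata.normalize('NFD', composed)  # NFD: 7 code points

# A regular expression written with composed (NFC) characters.
pattern = re.compile(u'\u00fc-\\d+\\.\u00e9$', re.UNICODE)


def match_normalized(regex, name):
    """Match `name` against `regex` after converting the name to NFC."""
    return regex.match(unicodedata.normalize('NFC', name))


print(composed == decomposed)                            # the raw strings differ
print(pattern.match(decomposed) is None)                 # raw NFD name does not match
print(match_normalized(pattern, decomposed) is not None) # normalized name matches
```

Normalizing the name rather than the pattern keeps the regular expression itself untouched, which matters if the pattern contains anything beyond literal characters.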

Is there a standard way of handling this problem?

My hope is that just like sys.getfilesystemencoding() returns the encoding for the file system, there would exist a function which returns the Unicode normalization used by the underlying filesystem.

However, I could find neither such a function nor a safe/standard way to feature-test for it.
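A non-standard probe is possible, though. The sketch below assumes Python 3 and a writable temporary directory: it creates a file with a composed name and checks how the directory listing hands the name back. `probe_name_handling` is a hypothetical helper, not an existing API, and its answer only describes the filesystem that holds the probe directory.

```python
import os
import shutil
import tempfile
import unicodedata


def probe_name_handling(parent=None):
    """Report how a composed (NFC) filename comes back from os.listdir.

    Hypothetical helper. Returns 'preserved', 'decomposed' or 'other'.
    """
    composed = unicodedata.normalize('NFC', 'probe-\u00fc')
    decomposed = unicodedata.normalize('NFD', composed)
    tmpdir = tempfile.mkdtemp(dir=parent)
    try:
        # Create an empty file whose name is in composed form.
        with open(os.path.join(tmpdir, composed), 'w'):
            pass
        listed = os.listdir(tmpdir)[0]
    finally:
        shutil.rmtree(tmpdir)
    if listed == composed:
        return 'preserved'
    if listed == decomposed:
        return 'decomposed'
    return 'other'


print(probe_name_handling())
```

Because different mount points can behave differently, a result like this is a hint about one directory rather than a property of the whole system.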

Solution

I'm assuming you want to match unicode equivalent filenames, e.g. you expect an input pattern of u'\xE9*' to match both filenames u'\xE9qui' and u'e\u0301qui' on any operating system, i.e. character-level pattern matching.
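As a minimal illustration of that goal, the sketch below normalizes both the name and the pattern to NFC before comparing them. It uses `fnmatch.fnmatchcase` from the standard library for the wildcard match; the helper name `equivalent_match` is invented for this example.

```python
import fnmatch
import unicodedata


def equivalent_match(name, pattern):
    """Wildcard match that treats canonically equivalent strings as equal."""
    nfc_name = unicodedata.normalize('NFC', name)
    nfc_pattern = unicodedata.normalize('NFC', pattern)
    return fnmatch.fnmatchcase(nfc_name, nfc_pattern)


pattern = u'\xe9*'
for name in (u'\xe9qui', u'e\u0301qui'):
    # Without normalization only the composed name matches the pattern.
    print(fnmatch.fnmatchcase(name, pattern), equivalent_match(name, pattern))
```

`fnmatchcase` is used here instead of `fnmatch.fnmatch` so that the comparison is not affected by the platform's case-folding rules.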

You have to understand that this is not the default on Linux, where bytes are taken as bytes, and where not every filename is a valid unicode string in the current system encoding (although Python 3 uses the 'surrogateescape' error handler to represent these as str anyway).
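The 'surrogateescape' behaviour can be seen without creating any files. This Python 3 sketch assumes a UTF-8 filesystem encoding and uses a made-up byte string that is valid Latin-1 but not valid UTF-8. The helper `is_proper_unicode` is hypothetical.

```python
raw = b'caf\xe9.txt'  # Latin-1 bytes for a name; not valid UTF-8

# Strict decoding fails, as it would for name.decode(enc) on Python 2.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogateescape' maps each undecodable byte to a lone surrogate instead.
name = raw.decode('utf-8', 'surrogateescape')


def is_proper_unicode(text):
    """Return False if `text` carries escaped bytes (lone surrogates)."""
    try:
        text.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False


print(strict_ok)                # False
print(is_proper_unicode(name))  # False: the name contains '\udce9'
print(name.encode('utf-8', 'surrogateescape') == raw)  # True: round-trips
```

A str produced this way round-trips back to the original bytes, so such names remain usable for opening files even though they cannot match a pattern at the character level.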

With that in mind, this is my solution:

import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    # Normalize the pattern to NFC (composed form)
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        # Byte-string names (e.g. on Python 2) must be decoded first
        if isinstance(name, bytes):
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match any pattern
                continue