Python glob usage: a filesystem-independent way to use glob.glob and regular expressions with Unicode filenames in Python...

I am working on a library which I want to keep platform, filesystem and Python2.x/3.x independent. However, I don't know how to glob for files and match the filenames against regular expressions in a platform/file-system independent way.

E.g. (on Mac, using IPython, Python 2.7):

In [7]: from glob import glob
In [8]: !touch 'ü-0.é'  # Create the file in the current folder
In [9]: glob(u'ü-*.é')
Out[9]: []
In [10]: import unicodedata as U
In [11]: glob(U.normalize('NFD', u'ü-*.é'))
Out[11]: [u'u\u0308-0.e\u0301']

However, this doesn't work on Linux or Windows, where I would need unicodedata.normalize('NFC', u'ü-*.é'). The same problem arises when I try to match the filename against a regular expression: on Mac, only a unicode regular expression normalized as NFD matches the filename, whereas on Linux/Windows only an NFC regular expression matches the filenames read from disk (I use the re.UNICODE flag in both cases).
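To make the mismatch concrete without touching the filesystem, here is a minimal sketch using only `unicodedata` and `re`. The variable names and the sample pattern are illustrative assumptions, not part of the original session.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# The same visible name, u'ü-0.é', in composed (NFC) and decomposed (NFD) form.
composed = unicodedata.normalize('NFC', u'\xfc-0.\xe9')
decomposed = unicodedata.normalize('NFD', composed)

# The two strings render identically but differ code point by code point.
same_string = (composed == decomposed)
lengths = (len(composed), len(decomposed))

# A regular expression written with composed characters.
pattern = re.compile(u'^\xfc-\\d\\.\xe9$', re.UNICODE)

matches_composed = bool(pattern.match(composed))
matches_decomposed = bool(pattern.match(decomposed))

# Normalizing the candidate to the pattern's form restores the match.
matches_after_nfc = bool(pattern.match(unicodedata.normalize('NFC', decomposed)))
```

The regular expression only matches when the pattern and the candidate string share a normalization form, which is the behaviour described above for both `glob` and `re`.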

Is there a standard way of handling this problem?

My hope is that just like sys.getfilesystemencoding() returns the encoding for the file system, there would exist a function which returns the Unicode normalization used by the underlying filesystem.

However, I could find neither such a function nor a safe/standard way to feature-test for it.
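One possible way to feature-test this is to create a probe file and see how its name comes back from `os.listdir`. This is only a sketch under assumptions: it targets Python 3, `probe_listed_form` is a hypothetical helper name, and it only reports what happens to a composed name in one particular directory (a filesystem that stores names unchanged will report 'NFC' simply because the probe was written that way).

```python
import os
import shutil
import tempfile
import unicodedata


def probe_listed_form(directory=None):
    """Hypothetical probe: create a file with a composed (NFC) name and
    report the form in which os.listdir returns it.

    Returns 'NFC', 'NFD', or 'other'.
    """
    probe = u'\xe9'  # 'é' as a single composed code point
    tmp = tempfile.mkdtemp(dir=directory)
    try:
        # Create an empty file whose name is the composed probe string.
        open(os.path.join(tmp, probe), 'w').close()
        listed = os.listdir(tmp)[0]
    finally:
        shutil.rmtree(tmp)

    if listed == unicodedata.normalize('NFC', probe):
        return 'NFC'
    if listed == unicodedata.normalize('NFD', probe):
        return 'NFD'
    return 'other'
```

Because different mounted volumes can behave differently, a result from one directory should not be assumed to hold for another.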

Solution

I'm assuming you want to match unicode equivalent filenames, e.g. you expect an input pattern of u'\xE9*' to match both filenames u'\xE9qui' and u'e\u0301qui' on any operating system, i.e. character-level pattern matching.

You have to understand that this is not the default on Linux, where bytes are taken as bytes, and where not every filename is a valid unicode string in the current system encoding (although Python 3 uses the 'surrogateescape' error handler to represent these as str anyway).
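As a small illustration of the 'surrogateescape' point, the sketch below decodes a byte string that is not valid UTF-8. It assumes Python 3 (the error handler does not exist in Python 2), and the sample bytes are made up for the example.

```python
# A filename as raw bytes: Latin-1 encoded 'café.txt', which is not valid UTF-8.
raw = b'caf\xe9.txt'

# Strict decoding fails, so this name has no ordinary unicode representation.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With 'surrogateescape', the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9, so the name can still be carried around as str.
name = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler recovers the original bytes exactly.
round_trip = name.encode('utf-8', 'surrogateescape')
```

Such a name is a `str` but contains a lone surrogate rather than real text, so character-level pattern matching against it is not meaningful.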

With that in mind, this is my solution:

import fnmatch
import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    # Compare in NFC so that composed and decomposed spellings are treated as equal
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        # Python 2 returns undecodable names as byte strings
        if isinstance(name, bytes):
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match any pattern
                continue
        if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
            results.append(name)
    return results
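The question also asks about regular expressions. The same idea of normalizing both sides to NFC before comparing could be applied there. The following is a sketch, not a tested library function: `regex_listdir` is a hypothetical name, it assumes Python 3 (where `os.listdir` returns `str` for a `str` directory), and it assumes that NFC-normalizing the pattern text does not change its meaning as a regular expression.

```python
import os
import re
import unicodedata


def regex_listdir(regex, directory=u'.', flags=re.UNICODE):
    """Hypothetical helper: return the names in `directory` whose
    NFC-normalized form matches `regex` (searched with re.search)."""
    # Normalize the pattern text itself so composed and decomposed
    # spellings of the same pattern behave identically.
    compiled = re.compile(unicodedata.normalize('NFC', regex), flags)
    matches = []
    for name in os.listdir(directory):
        # Compare on the NFC form, but return the name as the OS reported it,
        # so it can still be passed back to open() or os.stat().
        if compiled.search(unicodedata.normalize('NFC', name)):
            matches.append(name)
    return matches
```

For example, `regex_listdir(u'^\xfc-\\d\\.\xe9$')` is intended to find the file 'ü-0.é' whether the filesystem reports its name in composed or decomposed form.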
