Python Cookbook (1): Text

待飞的毛毛虫

Article link: https://blog.csdn.net/u011465808/article/details/24917965

The Python Cookbook series of posts records my progress through the book Python Cookbook. The posts also serve as study notes that I can look things up in later.

Some preliminaries

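Python's main tool for text processing is the string, an immutable sequence of characters. In Python 2, which these notes use, there are actually two kinds of string: plain strings (str), whose characters are 8 bits wide, and Unicode strings (unicode), which hold Unicode characters. Put simply, a character in a plain string can take only 256 distinct values. That is enough for English and some other non-Asian languages, but languages such as Chinese and Japanese have far more characters than that, so they need Unicode strings.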

Writing a text string literal (single and double quotes are equivalent)

'This is a literal string'
"this is another string"
'isn\'t that grand'
"isn't that grand"

Continuing a literal across several lines (with a backslash)

>>> big = "This is a long string\
that spans two lines."

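The backslash and the line break after it are dropped, so the string itself contains no newline. To embed a newline, write \n.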

You can also enclose the string in a pair of triple quotes:

bigger = """
This is an even
bigger string that
spans three lines.
"""

The text is stored exactly as written, line breaks included.

Prefixing a string literal with r or R makes it a raw string, in which backslashes are kept literally

>>> big = r"This is a long string\
with a backslash and a newline in it"
>>> big
'This is a long string\\\nwith a backslash and a newline in it'

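The echoed value shows that the backslash and the newline are both really in the string, written as the escape sequences \\ and \n. The r stands for raw.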
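
As a small sketch of the difference, the snippet below compares an ordinary literal with a raw one. It assumes Python 3 (the sessions in these notes are Python 2), and the variable names are only illustrative.

```python
# Ordinary literal: \n is an escape sequence for a single newline character.
plain = "a\nb"
# Raw literal: the backslash and the letter n are kept as two characters.
raw = r"a\nb"

print(len(plain))        # 3
print(len(raw))          # 4
print(raw == "a\\nb")    # True
```

Raw strings are mostly a convenience for text that is full of backslashes, such as regular expressions or Windows paths.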

Prefixing with u makes it a Unicode string

>>> hello = u'Hello\u0020World'
>>> hello
u'Hello World'

Accessing characters by index and by slicing

>>> mystr = "my string"
>>> mystr[0]
'm'
>>> mystr[-2]
'n'
>>> mystr[1:4]
'y s'
>>> mystr[3:]
'string'
>>> mystr[-3:]
'ing'
>>> mystr[:3:-1]
'gnirt'
>>> mystr[1::2]
'ysrn'
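
The slice forms above all follow the pattern s[start:stop:step]. The following is a minimal sketch of a few more cases, assuming Python 3 (slicing behaves the same way in Python 2); the variable names are made up for illustration.

```python
mystr = "my string"

# s[start:stop:step]: start is included, stop is excluded.
first_word = mystr[:2]
# A negative step walks backwards; omitted bounds cover the whole string.
reversed_str = mystr[::-1]
# Out-of-range slice bounds are clipped instead of raising IndexError.
clipped = mystr[3:100]

print(first_word)     # my
print(reversed_str)   # gnirts ym
print(clipped)        # string
```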

You can loop over the whole string

for c in mystr:
    print c

This binds c to each character of mystr in turn.

Building another sequence from it (a list in this example)

>>> list(mystr)
['m', 'y', ' ', 's', 't', 'r', 'i', 'n', 'g']

String concatenation

>>> mystr + 'oid'
'my stringoid'

String repetition

>>> mystr * 2
'my stringmy string'

1.1 Processing a string one character at a time

Method 1:

Call list

>>> thestring = 'abcabcabc'
>>> thelist = list(thestring)
>>> thelist
['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
>>> anotherlist = []
>>> for i in range(0, len(thelist)):
	anotherlist.append(thelist[i].upper())
>>> anotherlist
['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']

Method 2:

Use a for statement directly

>>> anotherstring = ''
>>> for c in thestring:
	anotherstring += c.upper()
>>> anotherstring
'ABCABCABC'

Method 3:

Use a list comprehension

>>> results = [c.upper() for c in thestring]
>>> results
['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']

Method 4:

Use the built-in map function

>>> def myupper(c):
	return c.upper()
>>> results = map(myupper, thestring)
>>> results
['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C']

For a short description of how map is used, see the documentation of the built-in map function.

Here is an interesting effect:

>>> def abc(a, b, c):
	return a * 10000 + b * 100 + c

>>> list1 = [11, 22, 33]
>>> list2 = [44, 55, 66]
>>> list3 = [77, 88, 99]
>>> map(abc, list1, list2, list3)
[114477, 225588, 336699]

If additional iterable arguments are given, function is applied to the elements of all the iterables "in parallel".
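
The parallel behaviour can be sketched as follows, assuming Python 3, where map returns a lazy iterator rather than a list. The names via_map and via_zip are illustrative only; the zip version is an equivalent spelling, not something taken from the book.

```python
def abc(a, b, c):
    return a * 10000 + b * 100 + c

list1 = [11, 22, 33]
list2 = [44, 55, 66]
list3 = [77, 88, 99]

# In Python 3, map is lazy, so list() is needed to see the values.
via_map = list(map(abc, list1, list2, list3))
# Equivalent spelling with zip and a list comprehension.
via_zip = [abc(a, b, c) for a, b, c in zip(list1, list2, list3)]

print(via_map)   # [114477, 225588, 336699]
print(via_zip)   # [114477, 225588, 336699]
```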

If what you want is the set of all characters in the string, call the built-in set directly.

>>> magic_chars = set('abracadabra')
>>> poppins_chars = set('supercalifragilisticexpialidocious')
>>> print ''.join(magic_chars & poppins_chars)
acrd
>>> magic_chars
set(['a', 'r', 'b', 'c', 'd'])

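A set is a collection with no duplicate elements and no defined order, as the value of magic_chars shows. Sets support the usual set operations, such as the intersection (&) used above.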
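
A minimal sketch of the common set operations on the same two strings, assuming Python 3. Sorting before joining is a choice made here so that the printed result does not depend on set ordering.

```python
magic_chars = set('abracadabra')
poppins_chars = set('supercalifragilisticexpialidocious')

common = magic_chars & poppins_chars       # intersection
either = magic_chars | poppins_chars       # union
only_magic = magic_chars - poppins_chars   # difference

# Sets are unordered, so sort before joining to get a repeatable result.
print(''.join(sorted(common)))       # acdr
print(''.join(sorted(only_magic)))   # b
print(len(either))                   # 16
```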

1.2 Converting between characters and character values

How do you turn a character into its ASCII or Unicode code, and back again?

>>> print ord('a')
97
>>> print chr(97)
a

ord also accepts a Unicode string of length 1, but the reverse conversion then needs unichr instead of chr.
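
For comparison, here is a sketch of the same round trip under Python 3 (an assumption; the sessions in these notes are Python 2). There every str is Unicode, unichr no longer exists, and chr handles all code points. The variable names are illustrative.

```python
# In Python 3 every str is a Unicode string, so chr covers what unichr did.
zhong = '\u4e2d'            # a Chinese character, written as an escape
code = ord(zhong)

print(code)                 # 20013
print(chr(code) == zhong)   # True
print(ord('a'), chr(97))    # 97 a
```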

The difference between chr(n) and str(n):

>>> print repr(chr(97))
'a'
>>> print repr(str(97))
'97'

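str accepts any integer and returns the string holding its decimal text form, so it can be viewed as a type conversion. chr(n), by contrast, returns the single character whose code is n.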

>>> print map(ord, 'ciao')
[99, 105, 97, 111]

>>> print ''.join(map(chr, range(97, 100)))
abc

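Why is join called on ''?

join is a method of string objects. The string it is called on is the separator placed between the items, so the empty string '' joins them with nothing in between. To call join without a separator string, import the join function from the string module (Python 2 only); its default separator is a single space.

A few examples: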

>>> from string import join
>>> join(map(chr, range(97, 100)))
'a b c'
>>> ' '.join(map(chr, range(97, 100)))
'a b c'
>>> '?'.join(map(chr, range(97, 100)))
'a?b?c'
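
The separator behaviour can be sketched as below, assuming Python 3, where only the method form of join exists. The lists used here are made-up sample data.

```python
letters = [chr(n) for n in range(97, 100)]   # ['a', 'b', 'c']

print(''.join(letters))     # abc
print(', '.join(letters))   # a, b, c

# join only accepts strings, so other items must be converted first.
numbers = [1, 2, 3]
dashed = '-'.join(str(n) for n in numbers)
print(dashed)               # 1-2-3
```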

1.3 Testing whether an object is string-like

When writing a function or method, you often need to test whether an argument passed in is a string (or, more precisely, behaves like one).

def isAString(anobj):
    return isinstance(anobj, basestring)

The key names used here are explained below:

isinstance:

isinstance(...)
    isinstance(object, class-or-type-or-tuple) -> bool
    
    Return whether an object is an instance of a class or of a subclass thereof.
    With a type as second argument, return whether that is the object's type.
    The form using a tuple, isinstance(x, (A, B, ...)), is a shortcut for
    isinstance(x, A) or isinstance(x, B) or ... (etc.).

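As the help text shows, isinstance returns True when the first argument is an instance of the second argument (a type) or of one of its subclasses.

Our function is meant to check for a string, so the second argument is basestring:

basestring: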

class basestring(object)
 |  Type basestring cannot be instantiated; it is the base for str and unicode.

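basestring is the common base class of str and unicode, and any user-defined string-like type should derive from basestring.

However, this check is completely helpless against instances of the UserString class provided by the UserString module in the Python standard library.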

UserString module

This module contains two classes, UserString and MutableString. The former is a wrapper for the standard string type which can be subclassed, the latter is a variation that allows you to modify the string in place.

Note that MutableString is not very efficient. Most operations are implemented using slicing and string concatenation. If performance is important, use lists of string fragments, or the array module.

A UserString object is clearly string-like, yet it does not derive from basestring. To support such types, check directly whether the object actually behaves like a string.

def isStringLike(anobj):
    try: anobj + ''
    except: return False
    else: return True

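isStringLike is much slower and more complicated than isAString, but it works for UserString instances as well as for str and unicode.

This is the usual Python approach to type checking, known as duck typing: if it walks like a duck and quacks like a duck, then for our purposes it can be treated as a duck.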
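
The two checks can be compared in a short sketch. It assumes Python 3, where basestring no longer exists and UserString lives in the collections module. The snake_case function names are illustrative, and catching only TypeError is a narrowing chosen here, not the book's bare except.

```python
from collections import UserString

def is_a_string(anobj):
    # Python 3 has no basestring; str is the only built-in text type.
    return isinstance(anobj, str)

def is_string_like(anobj):
    # Duck-typing test: can the object be concatenated with a string?
    try:
        anobj + ''
    except TypeError:
        return False
    return True

wrapped = UserString('hello')
print(is_a_string(wrapped))      # False
print(is_string_like(wrapped))   # True
print(is_string_like('hello'))   # True
print(is_string_like(42))        # False
```

An object whose + operator raises something other than TypeError would escape this narrower handler, which is one reason the original version catches everything.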

1.4 Aligning strings

Left-align with ljust, right-align with rjust, center with center

>>> print '|', 'hi'.ljust(10), '|', 'hi'.rjust(10), '|', 'hi'.center(10)
| hi         |         hi |     hi    
>>> print 'hi'.center(10, '*')
****hi****
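
One typical use is lining up columns of text. The sketch below assumes Python 3; the rows and the column widths are made-up sample values.

```python
rows = [('apple', 3), ('kiwi', 12), ('banana', 7)]

lines = []
for name, count in rows:
    # Left-align the name in 8 columns, right-align the count in 4.
    lines.append(name.ljust(8) + str(count).rjust(4))

print('\n'.join(lines))
print('hi'.center(10, '*'))   # ****hi****
```

If a string is already longer than the requested width, these methods return it unchanged rather than truncating it.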

1.5 Trimming whitespace from the ends of a string

Use the lstrip, rstrip and strip methods:

>>> x = '   hi   '
>>> print '|', x.lstrip(), '|', x.rstrip(), '|', x.strip()
| hi    |    hi | hi
>>> x = 'xyxxyy hejyx yyx'
>>> print '|' + x.strip('xy') + '|'
| hejyx |

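The leading and trailing spaces of the resulting string are kept, because the space is not among the characters to strip. Note that the argument 'xy' does not mean the block 'xy': it means the individual characters 'x' and 'y'. Only the ends are trimmed, so the 'yx' inside 'hejyx' stays.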
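
A small sketch of how the character-set argument behaves on the same string, assuming Python 3 (the strip methods work the same way in Python 2). The variable names are illustrative.

```python
x = 'xyxxyy hejyx yyx'

# The argument is a set of characters, so any mix of 'x' and 'y' is trimmed.
both = x.strip('xy')
# Stripping stops at the first character not in the set (the space here).
left = x.lstrip('xy')
right = x.rstrip('xy')
# Adding the space to the set lets stripping continue into 'hejyx'.
with_space = x.strip('xy ')

print(repr(both))         # ' hejyx '
print(repr(left))         # ' hejyx yyx'
print(repr(right))        # 'xyxxyy hejyx '
print(repr(with_space))   # 'hej'
```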