(4) Strings and Conversions
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"# 1. 字符串基础 \n",
"Python字符串是一个有序的字符的集合,被划分为不可变序列(immutable sequence)这一类别,用来存储和表现基于文本的信息。 \n",
"在Python 3.x中有三种字符串类型: \n",
"- ***str***:用于Unicode文本(包括ASCII)\n",
"- ***bytes***:用于二进制数据(包括已编码文本)\n",
"- ***bytearray***:是bytes的一种可变的变体 \n",
"\n",
"文件在两种模式下工作: \n",
"- ***text***:内容为str格式,执行Unicode编码\n",
"- ***binary***:处理原始bytes数据,不进行数据编译 \n",
"\n",
"**常见字符串常量和表达式**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"S = '' #空字符串\n",
"S = \"spam's\" #双引号和单\n",
"S = 's\\np\\ta\\x00m' #转义序列\n",
"S = '''...多行字符串...''' #三重引号字符串块\n",
"S = r'\\temp\\spam' #Raw字符串引号相同(无转义)\n",
"B = b'sp\\xc4m' #字节字符串\n",
"U = u'sp\\u00c4m' #Unicode字符串\n",
"S1 + S2 #合并(concatenate)\n",
"S * 3 #重复\n",
"S[i] #索引\n",
"S[i:j] #分片(slice)\n",
"len(S) #求长度\n",
"\"a %s parrot\" % kind #字符串格式表达式\n",
"\"a {0} parrot\".format(kind) #字符串格式化方法\n",
"S.find('pa') #字符串方法调用:搜索\n",
"S.rstrip() #移除空格\n",
"S.replace('pa', 'xx') #替换\n",
"S.split(',') #用分隔符(delimiter)分割\n",
"S.isdigit() #内容测试\n",
"S.lower() #字体转换\n",
"S.endswith('spam') #结束测试\n",
"'spam'.join(strlist) #插入分隔符\n",
"S.encode('latin-1') #Unicode编码\n",
"B.decode('utf8') #Unicode解码\n",
"for x in S: print(x) #迭代\n",
"'spam' in S\n",
"[c * 2 for c in S]\n",
"map(ord, S) #成员关系"
]
},
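{
"cell_type": "markdown",
"metadata": {
},
"source": [
"The three string types from section 1 can be compared with a short round trip. This is a minimal sketch added for illustration; the names `text`, `data`, and `buf` and the sample string are assumptions, not part of the original notes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
},
"outputs": [],
"source": [
"text = 'sp\\xc4m'  # str: Unicode text, 4 characters\n",
"data = text.encode('utf-8')  # bytes: the non-ASCII character takes 2 bytes in UTF-8\n",
"print(len(text), len(data))  # 4 5\n",
"print(data.decode('utf-8') == text)  # True\n",
"buf = bytearray(data)  # bytearray: a mutable copy of the bytes\n",
"buf[0] = ord('S')  # in-place change, which str and bytes do not allow\n",
"print(buf.decode('utf-8'))"
]
},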
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"# 2. 字符串常量 \n",
"字符串的编写方式有很多: \n",
"- 单引号:`'spa\"m'`\n",
"- 双引号:`\"spa'm\"`\n",
"- 三引号:`'''...spam...''',\"\"\"...spam...\"\"\"`\n",
"- 转义字符:`\"s\\tp\\na\\0m\"`\n",
"- Raw字符串:`r\"C:\\new\\test.spm\"`\n",
"- bytes字符串:`b'sp\\x01am'`\n",
"- Unicode字符串:`u'eggs\\u0020spam'` \n",
"\n",
"最常见的是单引号和双引号,使用两种引号可以不使用反斜杠转义字符就可以实现在一个字符串中包含其余种类的引号。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"(\"knight's\", 'knight\"s')"
]
},
"execution_count": 1,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"\"knight's\",'knight\"s'"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"Python会自动在任意的表达式中合并相邻的字符串常量,也可以简单地在它们之间增加+操作符来明确地表示这是一个合并操作。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"'Meaning of Life'"
]
},
"execution_count": 2,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"\"Meaning \"'of'\" Life\""
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"在字符串中间增加逗号会创建一个元组,而不是一个字符串。Python 倾向于以单引号打印字符串,除非字符串内已有单引号。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"(\"knight's\", 'knight\"s')"
]
},
"execution_count": 3,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"'knight\\'s',\"knight\\\"s\""
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"## 2.1 转义序列(Escape sequences) \n",
"转义序列可以在字符串中嵌入不容易通过键盘输入的字节。 \n",
"转义序列以反斜杠 \\ 开头,后面接一个或多个字符,在最终的字符串对象中会被一个单个字符替代。 \n",
"\n",
"**字符串反斜杠字符**:"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"**转义**|**意义**\n",
":-------|:------\n",
"\\newline|忽视(连续换行)\n",
"\\\\|反斜杠(保留\\)\n",
"\\'|单引号(保留')\n",
"\\\"|双引号(保留\")\n",
"\\a|响铃\n",
"\\b|退格\n",
"\\f|换页(formfeed)\n",
"\\n|换行 Newline(linefeed)\n",
"\\r|返回\n",
"\\t|水平制表符\n",
"\\v|垂直制表符\n",
"\\N{id}|Unicode数据库ID\n",
"\\uhhhh|Unicode 16位的十六进制值\n",
"\\Uhhhhhhhh|Unicode 32位的十六进制值\n",
"\\xhh|十六进制值hh\n",
"\\ooo|八进制值ooo\n",
"\\0|Null:二进制0字符(不是字符串结尾)\n",
"\\other|不转义(保留 \\ 和other)"
]
},
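{
"cell_type": "markdown",
"metadata": {
},
"source": [
"As a quick check of the table above, each escape sequence becomes a single character in the resulting string. The sample strings and the names `esc` and `mixed` below are assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
},
"outputs": [],
"source": [
"esc = 'a\\nb\\tc'  # \\n and \\t are one character each\n",
"print(len(esc))  # 5\n",
"mixed = 'a\\0b\\x41'  # \\0 is a null character, \\x41 is 'A'\n",
"print(len(mixed), mixed[3])  # 4 A\n",
"# Three spellings of the same character (U+00C4)\n",
"print('\\xc4' == '\\u00c4' == '\\N{LATIN CAPITAL LETTER A WITH DIAERESIS}')  # True"
]
},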
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"Python以十六进制显示非打印的字符。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"'\\x01\\x01\\x03'"
]
},
"execution_count": 4,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"s = '\\001\\001\\x03'\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"## 2.2 raw 字符串抑制转义 \n",
"当字母 r(大写或小写)出现在字符串的第一个引号前则该字符串为一个 raw 字符串,raw 字符串会关闭转义机制,raw 字符串还可用于正则表达式。 \n",
"\n",
"raw 字符串不能以单个的反斜杠结尾,若要用单个反斜杠结束一个 raw 字符串,有如下两个办法:\n",
"1. 用两个反斜杠并切片掉第二个反斜杠:`r'1\\nb\\tc\\\\'[:-1]`\n",
"2. 手动添加一个反斜杠:`r'1\\nb\\tc' + '\\\\'`"
]
},
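{
"cell_type": "markdown",
"metadata": {
},
"source": [
"The effect of the r prefix and the two trailing-backslash workarounds can be sketched as follows. This cell is an added illustration that reuses the path from section 2; the names `plain`, `rawpath`, `a`, and `b` are assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
},
"outputs": [],
"source": [
"plain = 'C:\\new\\test.spm'  # \\n and \\t are interpreted as escapes\n",
"rawpath = r'C:\\new\\test.spm'  # backslashes are kept literally\n",
"print(len(plain), len(rawpath))  # 13 15\n",
"a = r'1\\nb\\tc\\\\'[:-1]  # slice off the extra backslash\n",
"b = r'1\\nb\\tc' + '\\\\'  # append an escaped backslash\n",
"print(a == b, len(a))  # True 8"
]
},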
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"## 2.3 三重引号编写多行字符串块 \n",
"又称块字符串,可以编写多行文本数据,以三重引号开始(单引号和双引号都可以),并紧跟任意行数的文本,再以开始时同样的三重引号结尾。 "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"'Always look\\n on the bright\\nside of life.'"
]
},
"execution_count": 3,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"mantra = \"\"\"Always look\n",
" on the bright\n",
"side of life.\"\"\"\n",
"mantra"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Always look\n",
" on the bright\n",
"side of life.\n"
]
}
],
"source": [
"print(mantra)"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"三重引号字符串会保留所有引号内的字符串,包括代码右侧的注释,所以不要在引号内添加注释。 \n",
"\n",
"三重引号字符串常用于文档字符串,当它出现在文件的特定地点时,将被当作注释一样的字符串常量。 \n",
"\n",
"三重引号字符串还可用于临时废除一些代码,对于大段的代码,这比手动在每一行之前加入 # 号,之后再删除它们要容易得多。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x = 1\n",
"\"\"\"\n",
"import os\n",
"print(os.getcwd())\n",
"\"\"\"\n",
"y = 2"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"# 3. 实际应用中的字符串 \n",
"## 3.1 基本操作 "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"spameggsfood\n"
]
}
],
"source": [
"s = 'spam''eggs''food'\n",
"print(s)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"spameggsfood\n"
]
}
],
"source": [
"s = 'spam' + 'eggs' + 'food'\n",
"print(s)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'Ni!Ni!Ni!Ni!'"
]
},
"execution_count": 7,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"'Ni!' * 4"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"h a c k e r "
]
}
],
"source": [
"myjob = 'hacker'\n",
"for c in myjob:\n",
" print(c, end=' ')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"'k' in myjob"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 10,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"'spam' in 'abcspamdef' # 子字符串搜索"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"## 3.2 索引和分片 \n",
"字符串中的字符通过索引提取。 \n",
"\n",
"Python 支持使用负偏移获取元素。"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"('s', 'a')"
]
},
"execution_count": 11,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"s = 'spam'\n",
"s[0], s[-2] # 索引"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"('pa', 'pm', 'maps')"
]
},
"execution_count": 12,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"s[1:3], s[1::2], s[::-1] # 分片"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"分片左边偏移作为下边界(包含),右边偏移作为上边界(不包含)。"
]
},
{
"cell_type": "markdown",
"metadata": {
},
"source": [
"**扩展分片:第三个限制值** \n",
"分片表达式的第三个索引用作步进,默认为 1。"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
},
"outputs": [
{
"data": {
"text/plain": [
"'bdfhj'"
]
},
"execution_count": 13,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"S = 'abcdefghijklmnop'\n",
"S[1:10:2]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},