Python Cookbook: Chapter 5 (Personal Notes)

Gozen Sanji

Article link: https://blog.csdn.net/jaychang9/article/details/108661892

Chap 5 Files and I/O

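5.1 Reading and writing text data: open() modes, the encoding & errors parameters, sys.getdefaultencoding()

Read and write text data in different encodings, such as ASCII, UTF-8 or UTF-16.

# 1. Use open() in 'rt' mode to read a text file:
file_name = r'./myfile.txt'

# 1.1 Read the entire file as a single string
with open(file_name, 'rt') as f:
    data = f.read()

# 1.2 Iterate over the lines of the file
with open(file_name, 'rt') as f:
    for line in f:
        # process line
        pass


# 2. Use open() in 'wt' mode to write a text file; an existing file is truncated and overwritten:
file_name = r'./Marie.txt'

# 2.1 Write chunks of text data
with open(file_name, 'wt') as f:
    f.write('hello Python world')
    f.write('goodbye C++ world')    # write() does not add a newline

# 2.2 Redirected print statement
with open(file_name, 'wt') as f:
    print('hello Python world', file=f)    # for this syntax see section 5.2
    print('goodbye Java world', file=f)    # each print() call writes one line


# 3. To append to an existing file, use open() in 'at' mode:
with open(file_name, 'at') as f:
    f.write('this line was appended to the file. 科科')

    
# >>>>>> When no encoding is given, open() uses the platform's preferred encoding,
# reported by locale.getpreferredencoding(False). sys.getdefaultencoding() reports
# the default str encoding instead, which is always 'utf-8' in Python 3:
import locale
import sys
print(locale.getpreferredencoding(False))
print(sys.getdefaultencoding())


# 4. If the text uses a different encoding, pass the encoding argument to open():
with open(file_name, 'rt', encoding='latin-1') as f:    # latin-1 maps every byte to a character, so decoding never fails (the text may still come out wrong)
    data = f.read()
    print(data)


# 5. Without a with statement, the file must be closed manually:
f = open(file_name, 'r')
data = f.read()
f.close()


# 6. Pass the optional errors argument to open() to deal with "stubborn" encoding errors:
with open(file_name, 'r', encoding='utf-8', errors='replace') as f:    # undecodable bytes become U+FFFD
    contents = f.read()
    print(contents)

with open(file_name, 'r', encoding='utf-8', errors='ignore') as f:    # undecodable bytes are silently dropped
    contents = f.read()
    print(contents)

Using the errors argument to suppress encoding errors is a last resort and should not be relied on. The right approach is to make sure the correct encoding is used.

What exactly do replace and ignore do? With errors='replace', every byte that cannot be decoded is replaced by U+FFFD, the Unicode replacement character; with errors='ignore', such bytes are dropped without any notice.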
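To see the two error handlers side by side, here is a minimal sketch. It decodes a byte string directly instead of reading a file, on the assumption that bytes.decode() and open() share the same error handlers; the sample bytes are made up for illustration.

```python
# 0xff can never occur in valid UTF-8, so this byte string is deliberately broken.
raw = b'caf\xff latte'

replaced = raw.decode('utf-8', errors='replace')   # bad byte -> U+FFFD
ignored = raw.decode('utf-8', errors='ignore')     # bad byte removed

# The default handler ('strict') raises instead of guessing.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced))
print(repr(ignored))
print(strict_failed)
```

Both handlers lose information: 'replace' at least leaves a visible marker where the damage was, while 'ignore' hides it completely.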

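5.2 Printing to a file: the file keyword argument of print()

How do you redirect the output of print() to a file?

# Pass the file keyword argument to print():
with open('./test.txt', 'wt', encoding='gbk') as f:    # the file must be open in text mode; 'wb' raises an error here
    print('Servant 哟, 我就是你的 Master 吗？', file=f)

""" print() writes its output straight into the file given by the file argument;
    file defaults to sys.stdout, the standard output stream, which is normally the screen. """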
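The remark about 'wb' can be checked with a small experiment. The helper name print_to_file is hypothetical, and the temporary directory only keeps the sketch from leaving files behind.

```python
import os
import tempfile

def print_to_file(path, mode, **open_kwargs):
    """Hypothetical helper: try print(..., file=f) and report what happened."""
    try:
        with open(path, mode, **open_kwargs) as f:
            print('Hello', 'World', file=f)
        return 'ok'
    except (TypeError, ValueError) as exc:
        return type(exc).__name__

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'test.txt')
    text_result = print_to_file(target, 'wt', encoding='utf-8')
    # print() hands str objects to f.write(), which a binary file rejects.
    binary_result = print_to_file(target, 'wb')
    # Binary mode does not accept an encoding argument at all.
    bad_combo = print_to_file(target, 'wb', encoding='utf-8')

print(text_result, binary_result, bad_combo)
```

So there are two distinct failures: open() itself refuses 'wb' combined with encoding, and even without encoding, print() cannot write text into a binary file.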

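5.3 Printing with a different separator or line ending: the sep & end parameters of print()

Output data with print(), but with a separator or line ending other than the default:

# 1. Use the sep and end keyword arguments of print() to shape the output:
print('Hello', 'Python', 'World')                        # Hello Python World
print('Hello', 'Python', 'World', sep=',')               # Hello,Python,World
print('Hello', 'Python', 'World', sep=',', end='!!\n')   # Hello,Python,World!!


# 2. Use end to suppress the newline:
for i in range(4):
    print(i)                # one number per line

for i in range(4):
    print(i, end='\t')      # all numbers on one line, each followed by a tab


# 3. str.join() can also produce output with a chosen separator:
row = ('Hello', 'Python', 'World')
print(','.join(row))

# join() only works on strings, but a conversion step gets around that:
row = ('Hello', 'Python', 'World', 123456789)
print(','.join(str(x) for x in row))    # the argument to join() is a generator expression

# Still, passing sep to print() is more convenient:
print(*row, sep=',')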

5.4 Reading and writing binary data: the 'rb' / 'wb' modes of open(), encode, decode

To read or write binary files such as images or sound files, do the following:

# 1. Use open() in 'rb' / 'wb' mode to read or write binary data:

# 1.1 Read the entire file as a single byte string
with open('somefile.bin', 'rb') as f:
    data = f.read()

# 1.2 Write binary data to a file
with open('somefile.bin', 'wb') as f:
    f.write(b'Hello World')
    
 
# 2. Indexing and iterating over a byte string yield integer byte values, not one-byte strings:

# 2.1 Text String
t = 'Hello World'
print(t[0])    # 'H': a one-character text string

for c in t:
    print(c, end=' ')

# 2.2 Byte string
b = b'Hello World'
print(b[0])    # 72: the integer value of the byte

for c in b:
    print(c, end=' ')

    
# 3. To read or write text through a file opened in binary mode, you must decode & encode explicitly:
with open('somefile.bin', 'rb') as f:
    data = f.read()
    print(data.decode('utf-8'))    # decode when reading

with open('somefile.bin', 'wb') as f:
    f.write('hello world'.encode('utf-8'))    # encode when writing

I don't understand the last two examples on pages 148-149 yet.

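5.5 Writing only if the file does not exist: the 'x' mode of open(), or checking with os.path.exists() first

You want to write data to a file, but only if that file does not already exist on the filesystem:

# 1. Use the 'x' mode of open() to allow writing only when the file does not exist:

# 1.1 First create a file by opening it in append mode
with open('somefile', 'a') as f:
    f.write('1Q84')

# 1.2 Then try to write to the same file in 'xt' mode (the file exists, so FileExistsError is raised)
with open('somefile', 'xt') as f:
    f.write('1Q84')

""" For binary files, use 'xb' instead of 'xt' """


# 2. The 'x' mode above is a clean fix for accidentally overwriting a file. Here is an alternative.
#    Note that the check and the open are two separate steps, so another process could
#    create the file in between; the 'x' mode does not have that gap.
import os

# Test whether the file exists before writing
if not os.path.exists('somefile'):
    with open('somefile', 'wt') as f:
        f.write('1Q84')
else:
    print('File already exists!')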
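In practice the 'x' mode is usually combined with a try/except so the program can carry on when the file is already there. A minimal sketch, assuming a hypothetical helper named write_if_absent and a throwaway temporary directory:

```python
import os
import tempfile

def write_if_absent(path, text):
    """Hypothetical helper: create `path` with `text` only if it does not exist yet."""
    try:
        with open(path, 'xt', encoding='utf-8') as f:
            f.write(text)
    except FileExistsError:
        return False    # somebody (maybe us) created it earlier; leave it alone
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'somefile')
    first = write_if_absent(target, '1Q84')
    second = write_if_absent(target, 'would overwrite')
    with open(target, 'rt', encoding='utf-8') as f:
        content = f.read()

print(first, second, content)
```

The second call leaves the original content untouched, which is exactly the guarantee the 'x' mode is meant to give.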

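5.6 I/O operations on strings: the io.StringIO & io.BytesIO classes and their methods

Feed a text or byte string to code that was written to operate on file-like objects:

import io
# 1. Use the io.StringIO() & io.BytesIO() classes to create file-like objects that operate on string data:

# 1.1 Create a StringIO instance to read and write a string in memory
s = io.StringIO()
# 1.2 Writing works the same way as with a file
s.write('Hello World\n')
# 1.3 Redirect the output of print() into the StringIO instance, again just like with a file
print('This is a test', file=s)
# 1.4 getvalue() returns everything written to the StringIO instance so far
print(s.getvalue())
# 1.5 A StringIO instance can also be created from an initial string
s = io.StringIO('Hello\nWorld\n')
# 1.6 and then read like a file
print(s.read())

""" io.StringIO is for text only. For binary data, use the io.BytesIO class instead """

# 1.7 Create a BytesIO instance
s = io.BytesIO()
# 1.8 Write binary data to the BytesIO instance
s.write(b'binary data')
# 1.9 Get the binary data held by the BytesIO instance
print(s.getvalue())

StringIO and BytesIO are most useful when you need to stand in for a regular file. In a unit test, for example, you can use StringIO to build a file-like object containing test data and pass it to a function that expects an ordinary file object.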
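The unit-test use case might look like the following sketch. The function count_nonblank_lines is a made-up example of code that expects an open text file; nothing in it knows that it is being handed a StringIO.

```python
import io

def count_nonblank_lines(f):
    """Made-up function under test: expects any iterable text file object."""
    return sum(1 for line in f if line.strip())

# Test data lives in memory; no file on disk is needed.
fake_file = io.StringIO('alpha\n\nbeta\n   \ngamma\n')
result = count_nonblank_lines(fake_file)
print(result)
```

Because the function only iterates over lines, the same call works unchanged with a real file returned by open().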

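5.7 Reading and writing compressed files: gzip.open() & bz2.open(), the compresslevel parameter

You want to read or write a file compressed with gzip or bz2:

import gzip
import bz2
# The open() functions of the gzip & bz2 modules read and write compressed files:

# 1. Read & write compressed files as text
with gzip.open('somefile.gz', 'rt') as f:    # read
    t1 = f.read()
with gzip.open('somefile.gz', 'wt') as f:    # write
    f.write(t1)

with bz2.open('somefile.bz2', 'rt') as f:    # read
    t2 = f.read()
with bz2.open('somefile.bz2', 'wt') as f:    # write
    f.write(t2)

""" 
	For binary data use 'rb' / 'wb' instead; unlike the built-in open(), a bare 'r' / 'w' also means binary mode here
	In text mode, gzip.open() & bz2.open() accept the same arguments as the built-in open(), including encoding, errors, newline 
"""    


# 2. When writing compressed data, the compresslevel keyword argument sets the compression level
# (the default is 9, the highest level: best compression but slowest)
text = 'some text to compress'    # placeholder data
with gzip.open('somefile.gz', 'wt', compresslevel=5) as f:
    f.write(text)

gzip.open() & bz2.open() also have a little-known feature (p. 152) that I don't understand yet.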
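The feature in question is presumably that gzip.open() and bz2.open() accept an already open binary file object in place of a filename, so they can be layered on top of it. A minimal sketch under that assumption, with a temporary file standing in for real data:

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'somefile.gz')

    # Create a small compressed file to read back.
    with gzip.open(path, 'wt', encoding='utf-8') as g:
        g.write('hello gzip\n')

    # Open the raw file in binary mode, then layer gzip on top of it.
    with open(path, 'rb') as raw:
        with gzip.open(raw, 'rt', encoding='utf-8') as g:
            text = g.read()

print(repr(text))
```

This matters when the compressed bytes do not come from a plain file on disk: any file-like object opened in binary mode, such as an io.BytesIO, can take the place of raw.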

5.8 Iterating over fixed-size records: ???

I don't understand this yet.
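For future reference, one common way to iterate over fixed-size chunks is the two-argument form of iter() combined with functools.partial. This is only a sketch of that idiom, not necessarily the book's exact code; the record size and the in-memory data are made up.

```python
import io
from functools import partial

RECORD_SIZE = 4    # assumed record length in bytes

# BytesIO stands in for a file opened with open(..., 'rb').
data = io.BytesIO(b'AAAABBBBCCCCDD')

# iter(callable, sentinel) keeps calling data.read(RECORD_SIZE)
# until it returns the sentinel b'', which signals end of file.
records = list(iter(partial(data.read, RECORD_SIZE), b''))
print(records)
```

Note that the last record can be shorter than RECORD_SIZE when the data length is not an exact multiple of it.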

5.9 Reading binary data into a mutable buffer: ???

I don't understand this yet.
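As a starting point, the readinto() method of binary file objects fills an existing bytearray in place instead of allocating a new bytes object. The sketch below assumes that is what the section is about; the helper name read_into_buffer is hypothetical and BytesIO stands in for a real binary file.

```python
import io

def read_into_buffer(f, size):
    """Hypothetical helper: read up to `size` bytes from f into a fresh bytearray."""
    buf = bytearray(size)
    nread = f.readinto(buf)    # fills buf in place, returns the number of bytes read
    return buf, nread

source = io.BytesIO(b'Hello World')
buf, nread = read_into_buffer(source, 16)

# The buffer is mutable; a memoryview lets us edit a slice without copying.
view = memoryview(buf)[:nread]
view[0:5] = b'HELLO'
result = bytes(view)
print(nread, result)
```

Because readinto() may return fewer bytes than the buffer holds, the return value has to be checked; the tail of buf beyond nread is still zero-filled.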

5.10 Memory-mapping binary files: ???

I don't understand this yet.
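For later study, here is a minimal sketch of the mmap module, which maps a file into memory so it can be sliced and assigned to like a bytearray. The file name and size are made up, and a temporary directory keeps the example self-contained.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.bin')

    # An empty file cannot be mapped, so create 64 zero bytes first.
    with open(path, 'wb') as f:
        f.write(b'\x00' * 64)

    # Map the whole file (length 0 means "entire file") for reading and writing.
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
            size = len(m)
            m[0:11] = b'Hello World'    # slice assignment writes through to the file
            first = m[0:11]

    # Re-read the file normally to confirm the change reached it.
    with open(path, 'rb') as f:
        on_disk = f.read(11)

print(size, first, on_disk)
```

Slice assignment must keep the length unchanged: a mapping cannot grow or shrink the file this way.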

5.11 Manipulating pathnames: basename(), dirname(), join(), expanduser(), splitext(), split()

To extract the file name, directory name, absolute path and so on from a pathname, use the functions in the os.path module:

import os
path = '/Users/beazley/Data/data.csv'

# 1. Get the last component of the path (the part after the last /)
print(os.path.basename(path))    # data.csv

# 2. Get the directory name (the part before the last /)
print(os.path.dirname(path))    # /Users/beazley/Data

# 3. Join strings into a path (the separator is \ on Windows and / on Unix)
print(os.path.join('Gozen Sanji', 'tmp', 'data', os.path.basename(path)))

# 4. Expand the user's home directory
path = '~/Data/data.csv'
print(os.path.expanduser(path))    # expanduser() replaces a leading "~" or "~user" with that user's home directory

# 5. Split the path
print(os.path.splitext(path))    # splitext() returns the tuple (path without extension, extension)
print(os.path.split(path))    # split() returns the tuple (dirname, basename)

For any manipulation of file names, use the os.path module rather than rolling your own code with plain string operations. This matters most for portability: os.path knows the differences between Unix and Windows and can reliably handle names such as Data/data.csv and Data\data.csv.
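The platform difference can be observed on any machine, because os.path is just an alias for either posixpath (Unix) or ntpath (Windows), and both modules can be imported directly. A small sketch:

```python
import ntpath       # the implementation behind os.path on Windows
import posixpath    # the implementation behind os.path on Unix-like systems

# Joining uses the platform's own separator.
posix_joined = posixpath.join('Data', 'data.csv')
nt_joined = ntpath.join('Data', 'data.csv')

# Each flavour recognises its own separator when splitting.
posix_base = posixpath.basename('Data/data.csv')
nt_base = ntpath.basename('Data\\data.csv')

# The Windows flavour also accepts forward slashes.
nt_base_fwd = ntpath.basename('Data/data.csv')

print(posix_joined, nt_joined, posix_base, nt_base, nt_base_fwd)
```

Hand-written string code such as path.split('/')[-1] would get the Windows-style name wrong, which is the kind of bug os.path avoids.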

5.12 Testing for the existence of a file: exists(), isfile(), isdir(), getsize(), getmtime()

How do you test whether a file or directory exists?

import os

# 1. Use the os.path module to test whether a file or directory exists:
print(os.path.exists(r'D:\STUDY\6 Minute English\6-Minute-English-2020'))
print(os.path.exists(r'D:\HSO\japanese_beautiful_girls.avi'))


# 2. Test further what kind of file it is: (these return False if the path does not exist)

# 2.1 Is a regular file
print(os.path.isfile(r'D:\STUDY\Mathematical Analysis I by Zorich.pdf'))
print(os.path.isfile(r'D:\STUDY\6 Minute English'))    # a directory, not a file

# 2.2 Is a directory
print(os.path.isdir(r'D:\STUDY\6 Minute English'))
print(os.path.isdir(r'D:\STUDY\Mathematical Analysis I by Zorich.pdf'))    # a file, not a directory

''' There are also os.path.islink(), which tests for a symbolic link, and os.path.realpath(), which resolves
    symbolic links to the real path. I don't fully understand them yet; they seem to matter mostly on Linux 0.0 '''


# 3. To get metadata as well (such as file size or modification date), os.path still does the job:

# 3.1 getsize() returns the size of path in bytes. If the file does not exist or is inaccessible, it raises OSError (for example FileNotFoundError)
print(os.path.getsize(r'D:\STUDY\Mathematical Analysis I by Zorich.pdf'))

import time
# 3.2 getmtime() returns the time of last modification of path as a float, in seconds since the epoch. If the file does not exist or is inaccessible, it raises OSError (for example FileNotFoundError)
print(os.path.getmtime(r'D:\STUDY\Mathematical Analysis I by Zorich.pdf'))
print(time.ctime(os.path.getmtime(r'D:\STUDY\Mathematical Analysis I by Zorich.pdf')))

""" When using os.path, keep file permissions in mind: especially when fetching metadata, the program may raise PermissionError """

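Since a missing file and a permission problem both surface as subclasses of OSError, one except clause can cover them. A minimal sketch, assuming a hypothetical helper named describe and temporary files in place of real paths:

```python
import os
import tempfile
import time

def describe(path):
    """Hypothetical helper: summarise size and mtime, or say why they are unavailable."""
    try:
        size = os.path.getsize(path)
        mtime = os.path.getmtime(path)
    except OSError as exc:    # covers FileNotFoundError and PermissionError
        return f'{path}: unavailable ({type(exc).__name__})'
    return f'{path}: {size} bytes, modified {time.ctime(mtime)}'

with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, 'notes.txt')
    with open(existing, 'wb') as f:
        f.write(b'12345')

    found = describe(existing)
    missing = describe(os.path.join(tmp, 'no_such_file.txt'))

print(found)
print(missing)
```

Asking for the metadata and handling the failure is generally more robust than calling os.path.exists() first, because the file can disappear between the two calls.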
5.13 Getting a directory listing: os.listdir(), glob.glob(), fnmatch(), os.stat() and the attributes of its result

How do you get a list of all the files in a directory on the filesystem?

import os
# 1. listdir() returns the names of all entries in the given directory, including files, subdirectories and so on:
names = os.listdir(r'D:\STUDY')
print(names)


import os.path
# 2. To filter the data in some way, combine a list comprehension with functions from os.path:

# 2.1 Get all regular files
names = [name for name in os.listdir(r'D:\STUDY')
         if os.path.isfile(os.path.join(r'D:\STUDY', name))]
print(names)

# 2.2 Get all dirs
names = [name for name in os.listdir(r'D:\STUDY')
         if os.path.isdir(os.path.join(r'D:\STUDY', name))]
print(names)


# 3. The string methods startswith & endswith are also handy for filtering a directory's contents:
pdf_files = [name for name in os.listdir(r'D:\STUDY')
             if name.endswith('.pdf')]
print(pdf_files)


# 4. For file name matching, the glob / fnmatch modules are another option:
import glob
pdf_files2 = glob.glob(r'D:\STUDY\*.pdf')    # glob.glob() returns a list of matching paths, including the directory part of the pattern
print(pdf_files2)

from fnmatch import fnmatch
# fnmatch() returns True / False
pdf_files3 = [name for name in os.listdir(r'D:\STUDY') if fnmatch(name, '*.pdf')]
print(pdf_files3)


# 5. So far we only have the names of the entries; now let's get more metadata:
import os
import os.path
import glob

pdfs = glob.glob(r'D:\STUDY\*.pdf')

# 5.1 Get file sizes and modification dates
name_sz_date = [(name, os.path.getsize(name), os.path.getmtime(name)) for name in pdfs]

for name, size, mtime in name_sz_date:
    print(name, size, mtime, sep=' 🐬 ')

# 5.2 Alternative: Get file metadata
pdf_metadata = [(name, os.stat(name)) for name in pdfs]    # see the Remark for os.stat()

for name, meta in pdf_metadata:
    print(name, meta.st_size, meta.st_mtime, sep=' 🐬 ')

Remark: 1. os.stat() performs a stat system call on the given path; 2. result.st_size: the size of a regular file in bytes; 3. result.st_mtime: the time of last modification, in seconds since the epoch.
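The raw st_mtime value is a float and not very readable. The sketch below formats it with time.strftime; the file names and contents are invented, and a temporary directory replaces D:\STUDY so the example runs anywhere.

```python
import glob
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    # Create a few sample files: two PDFs (by name only) and one text file.
    for name, payload in [('a.pdf', b'x' * 10), ('b.pdf', b'y' * 20), ('c.txt', b'z')]:
        with open(os.path.join(tmp, name), 'wb') as f:
            f.write(payload)

    report = []
    # glob.escape() protects the directory part in case it contains * ? or [
    pattern = os.path.join(glob.escape(tmp), '*.pdf')
    for path in sorted(glob.glob(pattern)):
        meta = os.stat(path)
        stamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(meta.st_mtime))
        report.append((os.path.basename(path), meta.st_size, stamp))

for name, size, stamp in report:
    print(name, size, stamp, sep=' | ')
```

A single os.stat() call per file returns size and timestamps together, so it avoids the two separate lookups made by getsize() and getmtime().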

5.14 Bypassing filename encoding: ???

I don't understand this yet.

5.15 Printing bad filenames: ???

I don't understand this yet.

5.16 Adding or changing the encoding of an already open file: ???

Still not quite clear to me.
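As a note to come back to: the usual tool here appears to be io.TextIOWrapper, which puts a text layer with a chosen encoding on top of a binary stream, together with its detach() method, which removes that layer again. A sketch under that assumption, using BytesIO in place of a real file:

```python
import io

# UTF-8 bytes in a binary stream (a stand-in for open(..., 'rb')).
raw = io.BytesIO('café\n'.encode('utf-8'))

# Add a text layer with the WRONG encoding on purpose.
wrong = io.TextIOWrapper(raw, encoding='latin-1')
mojibake = wrong.read()

# Remove the text layer, rewind, and wrap again with the right encoding.
binary = wrong.detach()    # `wrong` must not be used after this point
binary.seek(0)
fixed_wrapper = io.TextIOWrapper(binary, encoding='utf-8')
fixed = fixed_wrapper.read()

print(repr(mojibake), repr(fixed))
```

The underlying bytes never change; only the layer that interprets them does, which is why the encoding can be swapped without reopening the file.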

5.17 Writing bytes to a text file: ???

I don't get this one.

5.18 Wrapping an existing file descriptor as a file object: ???

I'll sort this out later...
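For when I return to this: the built-in open() accepts an integer file descriptor in place of a file name, which is probably the core of this section. A minimal sketch, with os.open() supplying the descriptor and an invented file name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fd_demo.txt')

    # Obtain a low-level descriptor (just an integer) from the OS.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)

    # Wrap it in a regular Python file object; closing f also closes fd
    # because closefd defaults to True.
    with open(fd, 'wt', encoding='utf-8') as f:
        f.write('hello via descriptor\n')

    with open(path, 'rt', encoding='utf-8') as f:
        content = f.read()

print(repr(content))
```

Passing closefd=False to open() would leave the descriptor open after the file object is closed, which is useful when some other code owns it.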

5.19 Creating temporary files and directories: TemporaryFile, NamedTemporaryFile, TemporaryDirectory() and their parameters

You need to create a temporary file or directory while the program runs and want it destroyed automatically afterwards; the tempfile module has several functions for this.

# 1. To create an unnamed temporary file, use tempfile.TemporaryFile:
# ------------------------------------------------------->
from tempfile import TemporaryFile

with TemporaryFile('w+t') as f:
    # 1.1 Read / write to the file
    f.write('Hello World\n')
    f.write('Testing\n')    # Remark: write() adds no newline; append \n yourself where needed

    # 1.2 Seek back to beginning and read the data
    f.seek(0)    # seek(0) moves the file position back to the start of the file
    data = f.read()
    print(data)

# 1.3 Temporary file is destroyed


# 2. Another way to use a temporary file:
# ------------------------------------------------------->
f = TemporaryFile('w+t')
# use the temporary file here (write / read)
pass
f.close()
# the temporary file is destroyed once it is closed

"""
    Remark: the first argument of TemporaryFile() is the file mode, usually 'w+t' for text and 'w+b' for binary.
            This mode supports both reading and writing, which is useful here: if you closed the file to change modes, it would already be gone
"""


# 3. TemporaryFile() also accepts the same arguments as the built-in open() (errors is supported since Python 3.8):
# ------------------------------------------------------->
with TemporaryFile('w+t', encoding='utf-8') as f:
    pass

# 4. On most Unix systems, a file created by TemporaryFile() is unnamed and does not even have a directory entry.
# To lift this restriction, use NamedTemporaryFile() instead:
# ------------------------------------------------------->
from tempfile import NamedTemporaryFile

with NamedTemporaryFile('w+t') as f_obj:
    print(f"filename is {f_obj.name}.")    # the f_obj.name attribute holds the temporary file's name

# File automatically destroyed


# 5. Like TemporaryFile, a NamedTemporaryFile is deleted automatically when it is closed.
# To prevent that, use the delete keyword argument:
# ------------------------------------------------------->
with NamedTemporaryFile('w+t', delete=False) as f_obj:    # delete=False turns off automatic deletion
    print(f"filename is {f_obj.name}.")


# 6. To create a temporary directory, use tempfile.TemporaryDirectory():
# ------------------------------------------------------->
from tempfile import TemporaryDirectory

with TemporaryDirectory() as dirname:
    print(f"dirname is {dirname}")
    # use the temporary directory here
    pass

# the temporary directory and everything in it are deleted automatically

"""
    Remark: 
        TemporaryFile(), NamedTemporaryFile() & TemporaryDirectory() are probably the simplest way to work with temporary files and directories, because they take care of all the creation and cleanup steps
"""


# 7. All the temporary-file functions accept the keyword arguments prefix, suffix & dir to customise the location & naming:
g = NamedTemporaryFile(prefix='jays_temp', suffix='.txt', 
                       dir='../Spider')    # dir must be an existing directory
print(g.name)
g.close()    # closing the file deletes it
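Because dir has to exist already, a convenient way to try prefix, suffix & dir without depending on a folder such as ../Spider is to create the file inside a TemporaryDirectory. A minimal sketch; the prefix strings are arbitrary:

```python
import os
from tempfile import NamedTemporaryFile, TemporaryDirectory

with TemporaryDirectory(prefix='jays_') as workdir:
    with NamedTemporaryFile('w+t', prefix='jays_temp', suffix='.txt', dir=workdir) as g:
        full_path = g.name
        name = os.path.basename(full_path)
        existed_while_open = os.path.exists(full_path)
    # Leaving the inner with block closes the file, which deletes it.
    gone_after_close = not os.path.exists(full_path)

# Leaving the outer with block removes the directory itself.
dir_gone = not os.path.exists(workdir)

print(name, existed_while_open, gone_after_close, dir_gone)
```

The random part of the name sits between the prefix and the suffix, so only those two ends of the file name are predictable.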