File Reading and Writing: A Summary

学习飞行的山药

Last modified 2022-08-26 11:23:41

First published 2019-11-15 11:38:08

Article link: https://blog.csdn.net/rosalind_xu/article/details/103080656

This post collects common file read and write operations in one place, to serve as a memo and quick reference.

Python

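Basic text reading and writing

Reference links:
Reading and Writing Text Files (Python之旅)
Common Python Text-Processing Functions
Python Text Processing: Regular Expressions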

Functions

Reading functions:
read() # read the entire file into a single string
readline() # read one line per call
readlines() # read the whole file and return a list in which each element is one line

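Opening a file stream:

f = open(file_name, 'r')  # open before the try block, so f always exists in finally
try:
    data = f.read()
finally:
    f.close()

# Equivalent and preferred: the with statement closes the file automatically
with open(file_name, 'r') as f:
    data = f.read()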
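
To make the difference between the three reading functions concrete, here is a minimal sketch. The file name sample.txt and its contents are made up for illustration; the file is created in a temporary directory so the example runs on its own.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')  # hypothetical file name
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('first\nsecond\nthird\n')

with open(sample_path, 'r', encoding='utf-8') as f:
    whole = f.read()            # one string holding the entire file

with open(sample_path, 'r', encoding='utf-8') as f:
    first_line = f.readline()   # one line, trailing newline included
    second_line = f.readline()  # continues from where the previous call stopped

with open(sample_path, 'r', encoding='utf-8') as f:
    lines = f.readlines()       # list of lines, each still ending in '\n'

with open(sample_path, 'r', encoding='utf-8') as f:
    # Iterating over the file object reads line by line without loading it all at once.
    stripped = [line.rstrip('\n') for line in f]
```

For large files, iterating over the file object as in the last step is usually the gentlest on memory, since read() and readlines() both load everything at once.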

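Basic text processing

Split a string on a separator:
str.split(sep=None, maxsplit=-1)
:param sep: the separator string; if omitted, the string is split on runs of whitespace
:param maxsplit: the maximum number of splits; the default -1 means no limit
:return: a list of the resulting substrings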
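
A short sketch of how sep and maxsplit change the result. The sample strings are invented for illustration.

```python
line = 'name,age,city,country'

parts = line.split(',')         # split on every comma
limited = line.split(',', 1)    # at most one split, so at most two pieces
words = '  a  b   c '.split()   # no sep: split on runs of whitespace, no empty strings
empties = 'a,,b'.split(',')     # an explicit sep keeps empty fields
```

The last two lines show the behaviour that most often surprises people: with an explicit separator, consecutive separators produce empty strings, whereas the whitespace default collapses them.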

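Binary file reading and writing

What is a binary file

.bin binary files
A file with the .bin extension is a binary file. Unlike a text file, it shows up as garbage when opened in Notepad, Notepad++, or a similar text editor.

A .bin file can, however, be opened with tools such as WinHex, which displays the contents in hexadecimal. A binary file holds raw data, and the numbers only make sense when they are read back according to a layout agreed on in advance.

Advantages of binary files over text files: they take less storage space, they are faster to read and write, and they offer a degree of protection because the contents are not directly human-readable (this is obscurity, not real encryption).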
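
The idea of reading bytes back according to a layout agreed on in advance can be sketched with the standard struct module. The record layout below (an id, a temperature, and a 4-byte tag) is purely hypothetical and only serves to show the principle.

```python
import struct

# Hypothetical record layout: little-endian unsigned 16-bit id,
# 32-bit float temperature, 4-byte tag.
RECORD_FORMAT = '<Hf4s'

raw = struct.pack(RECORD_FORMAT, 7, 21.5, b'ABCD')
record_size = struct.calcsize(RECORD_FORMAT)  # 2 + 4 + 4 bytes, '<' disables padding

# Without the format string, raw is just ten opaque bytes.
hex_view = raw.hex()

# With the format string, the same bytes turn back into meaningful values.
sensor_id, temperature, tag = struct.unpack(RECORD_FORMAT, raw)
```

hex_view is roughly what a hex editor would display for such a record, while struct.unpack plays the role of the agreed rule.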

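Reading and writing binary files in Python

Use open() with mode 'rb' to read binary data, or mode 'wb' to write it, for example:

with open('somefile.bin', 'rb') as f:  # use 'wb' instead to write
    data = f.read()

Data handling

Indexing or iterating over a byte string returns integer byte values, not byte strings.
For example: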

b = b'Hello World'
print(b[0])  # 72
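
A slightly longer sketch of the same point, contrasting indexing with slicing. It reuses the byte string from the example above.

```python
b = b'Hello World'

first_value = b[0]       # indexing gives an int
first_byte = b[0:1]      # slicing keeps the bytes type
values = list(b[:5])     # iteration yields ints
rebuilt = bytes(values)  # a list of ints converts back to bytes
```

When a one-byte bytes object is needed rather than an int, slicing (b[0:1]) is the usual way to get it.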

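To read or write text through a file opened in binary mode, you must decode the bytes after reading and encode the text before writing.
For example: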

with open('somefile.bin', 'rb') as f:
    data = f.read(16)
    text = data.decode('utf-8')

with open('somefile.bin', 'wb') as f:
    text = 'Hello World'
    f.write(text.encode('utf-8'))
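
One caveat with reading a fixed number of bytes and decoding them: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, in which case decode() raises an error. A possible way around this, sketched below, is the incremental decoder from the standard codecs module. The sample text and the io.BytesIO object standing in for a file opened with 'rb' are assumptions made so the example runs on its own.

```python
import codecs
import io

text = 'Hello 世界'                        # sample text with multi-byte characters
stream = io.BytesIO(text.encode('utf-8'))  # stands in for a file opened with 'rb'

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
while True:
    chunk = stream.read(4)                 # deliberately small, so a character gets split
    if not chunk:
        break
    pieces.append(decoder.decode(chunk))   # incomplete trailing bytes are buffered
pieces.append(decoder.decode(b'', final=True))  # flush; raises if bytes are left over
decoded = ''.join(pieces)
```

If the whole file fits in memory, reading everything and calling decode() once is simpler; the incremental decoder only matters when reading in chunks.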

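A lesser-known feature of binary I/O is that objects such as arrays and C structures can be written directly, without first converting them to a bytes object.
Writing: f.write(obj)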

import array
nums = array.array('i', [1, 2, 3, 4])
with open('data.bin','wb') as f:
    f.write(nums)

Reading: f.readinto(obj)

import array
a = array.array('i', [0, 0, 0, 0, 0, 0, 0, 0])
with open('data.bin', 'rb') as f:
	f.readinto(a)
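
The two snippets above can be combined into one round trip. The sketch below writes to a temporary directory and also uses the return value of readinto(), which is the number of bytes actually read; that matters when the pre-allocated buffer is larger than the file, as it is here.

```python
import array
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.bin')

nums = array.array('i', [1, 2, 3, 4])
with open(path, 'wb') as f:
    f.write(nums)                  # writes the raw machine representation

buf = array.array('i', [0] * 8)    # pre-allocated, larger than the file
with open(path, 'rb') as f:
    n_bytes = f.readinto(buf)      # number of bytes actually read

n_items = n_bytes // buf.itemsize  # how many array elements were filled
values = buf[:n_items].tolist()
```

Because the bytes are written in the machine's native size and byte order, a file produced this way is not necessarily portable to a different platform.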

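HDF5 file reading and writing

What is an HDF5 file

HDF5 is a hierarchical format that is often used to store complex scientific data.

It is particularly useful for complex hierarchical data that carries associated metadata, such as the output of computer simulations. It also has bindings for many languages, including C, C++, Fortran, Python, and Java, so a file written from one language can be read from another.

An HDF5 file is a container for two kinds of objects:

dataset: an array-like collection of data; in Python a dataset can be used much like a NumPy array;
group: a directory-like container that can hold further datasets and groups; in Python a group can be used much like a directory.

Both datasets and groups can carry descriptive metadata called attributes; in Python, attributes can be used much like a dictionary.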

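Reading and writing HDF5 files in Python

Python reads and writes HDF5 files through the h5py package.

Creating an HDF5 file

import h5py
# name, 'key', and value below are placeholders
with h5py.File('hdf5_file.hdf5', 'w') as hdf5_file:
    g = hdf5_file.create_group(name, track_order=False)  # create a new group, from an open file handle or from another group
    d = hdf5_file.create_dataset(name, shape=None, dtype=None, data=None)  # create a new dataset; shape is a list/tuple, dtype is the data type, and the shape and dtype of data must be compatible with those given
    hdf5_file['label'] = list(range(100))  # dict-style assignment on a file handle or group creates a dataset
    hdf5_file.attrs['key'] = value  # attributes live on .attrs of a file handle (the "/" group), group, or dataset, and are created and read like dictionary entries

A concrete example: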

import os
import numpy as np
import h5py

file_name = 'test.hdf5'
# create a new HDF5 file ('w' creates the file, truncating it if it exists)
f = h5py.File(file_name, 'w')
# create a new group
f.create_group('/grp1') # or f.create_group('grp1')
# create another group inside grp1
f.create_group('/grp1/grp2') # or f.create_group('grp1/grp2')
# create a dataset in group "/"
data = np.arange(6).reshape(2, 3)
f.create_dataset('dset1', data=data) # or f.create_dataset('/dset1', data=data)
# create another dataset in group /grp1
f.create_dataset('grp1/dset2', data=data) # or f.create_dataset('/grp1/dset2', data=data)
# create an attribute of "/"
f.attrs['a'] = 1 # or f['/'].attrs['a'] = 1
# create an attribute of group "/grp1"
f['grp1'].attrs['b'] = 'xyz'
# create an attribute of dataset "/grp1/dset2"
f['grp1/dset2'].attrs['c'] = np.array([1, 2])
# close file
f.close()

# open the existing test.hdf5 for read only
f = h5py.File(file_name, 'r')
# read dataset /dset1
print('/dset1 = %s' % f['dset1'][:])
# read dataset /grp1/dset2
print('/grp1/dset2 = %s' % f['/grp1/dset2'][:])
# get attributes
print(f.attrs['a'])
print(f['grp1'].attrs['b'])
print(f['grp1/dset2'].attrs['c'])
# close the file before deleting it
f.close()

# remove the created file
os.remove(file_name)

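Reading an HDF5 file

hdf5_file = h5py.File(name, mode='r')  # name is the file name as a string; mode 'r' is read-only, 'w' creates a new file for writing (truncating an existing one), and 'a' opens a file for reading and writing, creating it if it does not exist

Writing an HDF5 file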
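
As a minimal sketch of writing into an existing HDF5 file, assuming h5py and NumPy are installed: the file is first created, then reopened in 'a' mode so that existing content is kept while data is modified and added. The file name demo.hdf5 and the dataset, group, and attribute names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'demo.hdf5')  # hypothetical file name

# Create a file holding one dataset.
with h5py.File(path, 'w') as f:
    f.create_dataset('dset1', data=np.arange(6).reshape(2, 3))

# Reopen in append mode: existing content is kept.
with h5py.File(path, 'a') as f:
    f['dset1'][0, 0] = 100                           # overwrite one element in place
    f.create_dataset('grp1/dset2', data=[1.5, 2.5])  # the intermediate group grp1 is created as well
    f['grp1'].attrs['note'] = 'added later'          # attach an attribute to the group

# Read everything back to check the result.
with h5py.File(path, 'r') as f:
    dset1 = f['dset1'][:]          # [:] copies the dataset into a NumPy array
    dset2 = f['grp1/dset2'][:]
    note = f['grp1'].attrs['note']
    top_level = sorted(f.keys())
```

Opening with 'w' instead of 'a' in the second step would truncate the file and lose dset1, which is the main thing to watch for when writing to a file that already exists.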

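JSON file reading and writing

What is a JSON file

JSON is a lightweight data-interchange format.
It stores and represents data as text in a form that is completely independent of any programming language.
Its concise, clear hierarchical structure makes JSON an ideal data-interchange language: it is easy for people to read and write, easy for machines to parse and generate, and efficient to send over a network.

Reading and writing JSON files in Python

The json module
json.dumps()/dump() # encode data, that is, serialize a Python object to JSON; json.dumps() turns, for example, a dict into a string.
json.loads()/load() # decode data, that is, convert JSON into Python objects; json.loads() turns, for example, a string into a dict.
Note: the functions ending in s convert to and from strings, while the ones without s write to and read from file objects.
Example: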

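import json

# converting between Python objects and JSON strings
test_dict = {'bigberg': [7600, {1: [['iPhone', 6300], ['Bike', 800], ['shirt', 300]]}]}
json_str = json.dumps(test_dict)  # serialize the Python object to a JSON string
new_dict = json.loads(json_str)  # parse the JSON string back into Python objects; the int key 1 comes back as the string '1'

# reading and writing a JSON file
with open("../config/record.json", "w") as f:
    json.dump(new_dict, f)

with open("../config/record.json", 'r') as load_f:
    load_dict = json.load(load_f)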
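
A round trip through JSON does not always give back exactly the object that went in. The sketch below, with invented sample data, shows two common cases (non-string keys and tuples) along with the ensure_ascii and indent options of json.dumps().

```python
import json

original = {1: ('a', 'b'), 'name': '山药', 'ok': True, 'missing': None}

json_str = json.dumps(original)   # non-ASCII characters are escaped by default
readable = json.dumps(original, ensure_ascii=False, indent=2)  # keep them as-is, pretty-print
restored = json.loads(json_str)

# restored differs from original in two ways:
# the int key 1 became the string '1', and the tuple became a list.
```

If the exact Python types matter after loading, they need to be converted back explicitly, since JSON itself only has string keys and a single array type.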

Conversion between Python and JSON data types

Python to JSON (encoding):
Python                                   JSON
dict                                     object
list, tuple                              array
str                                      string
int, float, int- & float-derived Enums   number
True                                     true
False                                    false
None                                     null

JSON to Python (decoding):
JSON             Python
object           dict
array            list
string           str
number (int)     int
number (real)    float
true             True
false            False
null             None

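Reading and writing JSONL files in Python

A JSONL file stores one JSON object per line, so it cannot be read or written directly with a single json.load()/json.dump() call.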

import json
from datasets import load_dataset  # Hugging Face datasets package, only needed for the last step

path = 'output.jsonl'
jsonl_dataset = [{'id': 1, 'num': 'one'}, {'id': 2, 'num': 'two'}]

# Write a JSONL file: one JSON object per line
with open(path, 'w', encoding='utf-8') as f:
    for obj in jsonl_dataset:
        print(json.dumps(obj), file=f)

# Read a JSONL file with the standard library
with open(path, 'r', encoding='utf-8') as f:
    jsonl_dataset = [json.loads(line) for line in f if line.strip()]

# Read the same JSONL file with the datasets library
jsonl_dataset = load_dataset('json', data_files=path)  # returns a DatasetDict; the records are under the 'train' split
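
When the datasets package is not needed, the same logic can be wrapped in two small helpers that use only the standard library. The names write_jsonl and read_jsonl are hypothetical, not part of any library; this is a sketch rather than a complete implementation.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write an iterable of JSON-serializable objects, one per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_jsonl(path):
    """Yield one decoded object per non-empty line."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # tolerate blank lines, e.g. at the end of the file
                yield json.loads(line)


demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
records = [{'id': 1, 'num': 'one'}, {'id': 2, 'num': 'two'}]
write_jsonl(demo_path, records)
loaded = list(read_jsonl(demo_path))
```

read_jsonl is a generator, so a large file can be processed record by record without holding all of it in memory; wrapping it in list() as above loads everything at once.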