背景:
pandas DataFrame中有字段是dict类型,使用to_excel方法直接保存下次读取出来,dict字段会变成字符串,无法识别;
目标:保存dict字段,下次读出来还是dict
方法一:使用json.dums 和 json.loads
import pandas as pd
import json
def save_excel(df, filename):
with pd.ExcelWriter(filename,
engine='xlsxwriter',
engine_kwargs={'options':{'strings_to_urls': False, 'encoding':'utf-8-sig'}})\
as writer:
df.to_excel(excel_writer=writer, header=True, index=False, sheet_name='Sheet_name_1')
# 创建一个包含 JSON 数据的 DataFrame
data = {
'name': ['Alice', 'Bob'],
'json_data': [{"key1": "value"}, {"key2": "value2"}]
}
df = pd.DataFrame(data)
# 将字典字段转换为 JSON 字符串
df['json_data'] = df['json_data'].apply(lambda x: json.dumps(x, ensure_ascii=False))
# 保存到 Excel
save_excel(df, 'data.xlsx')
# 从 Excel 读取
df_read = pd.read_excel('data.xlsx')
# 如果需要将 JSON 字符串转换回 Python 对象
df_read['json_data'] = df_read['json_data'].apply(json.loads)
print(df_read)
print(type(df_read['json_data'].iloc[0]))
输出:
name json_data
0 Alice {'key1': 'value'}
1 Bob {'key2': 'value2'}
<class 'dict'>
要点:
保存到文件之前,需要
df['json_data'] = df['json_data'].apply(lambda x: json.dumps(x, ensure_ascii=False)
从文件读取后,需要
df_read['json_data'] = df_read['json_data'].apply(json.loads)
json.dumps是把python dict类型数据转成json格式的字符串,而json.loads是把json字符串转成python dict,Python的json.loads() 方法与json.dumps()方法。
方法二:保存成json再读成json:
# 保存
def save_json(df, outfile):
json_str = df.to_json(orient='records', force_ascii=False)
with open(outfile, 'w', encoding='utf-8') as f:
f.write(json_str)
# 读取
def read_data(filepath, filename):
with open('data.json', 'r', encoding='utf-8') as file:
data = json.load(file)
df = pd.DataFrame(data)
return df
outfile = 'data2.json'
save_json(df, outfile)
df = read_data(outfile)
df['json_data'] = df['json_data'].apply(json.loads)
print(type(df['json_data'].iloc[0]))
以下两种读法可能会因json文件过大出错
df = pd.read_json(filename, orient='records')
def read_data(filepath, filename):
# 设置每个块的大小,这里以10000行为例
chunk_size = 10 ** 5
chunks = []
for chunk in pd.read_json(filename, chunksize=chunk_size, lines=True):
# 处理每个块的数据
# 例如,可以在这里进行数据清洗或聚合操作
chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
return df