Python Cool Library Tour: Third-Party Library Pandas (007)

Latest recommended article published on 2024-09-03 17:22:28

Author: 神奇夜光杯

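Column: Myelsa's Python Cool Library Tour. Tags: python, pandas, artificial intelligence, programming languages, standard and third-party libraries, excel, learning and growth

Article link: https://blog.csdn.net/ygb_1024/article/details/140223487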

I. Usage in Detail

13. The pandas.ExcelWriter class

13-1. Syntax

# 13. The pandas.ExcelWriter class
ExcelWriter(path: 'FilePath | WriteExcelBuffer | ExcelWriter', engine: 'str | None' = None, date_format: 'str | None' = None, datetime_format: 'str | None' = None, mode: 'str' = 'w', storage_options: 'StorageOptions | None' = None, if_sheet_exists: 'ExcelWriterIfSheetExists | None' = None, engine_kwargs: 'dict | None' = None) -> 'Self'
   
   Class for writing DataFrame objects into excel sheets.
   
   Default is to use:
   
   * `xlsxwriter <https://pypi.org/project/XlsxWriter/>`__ for xlsx files if xlsxwriter
     is installed otherwise `openpyxl <https://pypi.org/project/openpyxl/>`__
   * `odswriter <https://pypi.org/project/odswriter/>`__ for ods files
   
   See ``DataFrame.to_excel`` for typical usage.
   
   The writer should be used as a context manager. Otherwise, call `close()` to save
   and close any opened file handles.
   
   Parameters
   ----------
   path : str or typing.BinaryIO
       Path to xls or xlsx or ods file.
   engine : str (optional)
       Engine to use for writing. If None, defaults to
       ``io.excel.<extension>.writer``.  NOTE: can only be passed as a keyword
       argument.
   date_format : str, default None
       Format string for dates written into Excel files (e.g. 'YYYY-MM-DD').
   datetime_format : str, default None
       Format string for datetime objects written into Excel files.
       (e.g. 'YYYY-MM-DD HH:MM:SS').
   mode : {'w', 'a'}, default 'w'
       File mode to use (write or append). Append does not work with fsspec URLs.
   storage_options : dict, optional
       Extra options that make sense for a particular storage connection, e.g.
       host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
       are forwarded to ``urllib.request.Request`` as header options. For other
       URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are
       forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more
       details, and for more examples on storage options refer `here
       <https://pandas.pydata.org/docs/user_guide/io.html?
       highlight=storage_options#reading-writing-remote-files>`_.
   
   if_sheet_exists : {'error', 'new', 'replace', 'overlay'}, default 'error'
       How to behave when trying to write to a sheet that already
       exists (append mode only).
   
       * error: raise a ValueError.
       * new: Create a new sheet, with a name determined by the engine.
       * replace: Delete the contents of the sheet before writing to it.
       * overlay: Write contents to the existing sheet without first removing,
         but possibly over top of, the existing contents.
   
       .. versionadded:: 1.3.0
   
       .. versionchanged:: 1.4.0
   
          Added ``overlay`` option
   
   engine_kwargs : dict, optional
       Keyword arguments to be passed into the engine. These will be passed to
       the following functions of the respective engines:
   
       * xlsxwriter: ``xlsxwriter.Workbook(file, **engine_kwargs)``
       * openpyxl (write mode): ``openpyxl.Workbook(**engine_kwargs)``
       * openpyxl (append mode): ``openpyxl.load_workbook(file, **engine_kwargs)``
       * odswriter: ``odf.opendocument.OpenDocumentSpreadsheet(**engine_kwargs)``
   
       .. versionadded:: 1.3.0
   
   Notes
   -----
   For compatibility with CSV writers, ExcelWriter serializes lists
   and dicts to strings before writing.

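13-2. Parameters

13-2-1. path (required): a string path (including the file name) or a binary file-like object for the Excel file to write.

13-2-2. engine (optional, default None): a string naming the engine used to write the file. pandas supports xlsxwriter, openpyxl, and odswriter (.ods files only). If omitted, pandas picks an engine from the file extension. It can only be passed as a keyword argument.

13-2-3. date_format (optional, default None): a format string for date values written to the Excel file, e.g. 'YYYY-MM-DD'. It does not change the data in the DataFrame; it only affects how dates are displayed in the Excel file.

13-2-4. datetime_format (optional, default None): like date_format, but for datetime values, e.g. 'YYYY-MM-DD HH:MM:SS'.

13-2-5. mode (optional, default 'w'): the file mode. 'w' writes a new file and overwrites any existing one; 'a' appends to an existing workbook. Append mode does not work with fsspec URLs, and it needs an engine that can open existing workbooks: openpyxl supports it, while xlsxwriter raises an error.

13-2-6. storage_options (optional, default None): a dict of extra options for storage connections that need them (HTTP, S3, and so on). The options are forwarded to the underlying storage layer.

13-2-7. if_sheet_exists (optional, default None, which behaves as 'error'): controls what happens when writing to a sheet that already exists. It applies in append mode only. 'error' raises a ValueError; 'new' creates a new sheet whose name is chosen by the engine; 'replace' deletes the sheet's contents before writing; 'overlay' (added in pandas 1.4.0) writes onto the existing sheet without clearing it first, possibly over existing contents.

13-2-8. engine_kwargs (optional, default None): a dict of extra keyword arguments passed to the Excel writing engine, which gives access to engine-specific features such as workbook properties.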

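13-3. Features

13-3-1. Creating and writing Excel files

13-3-1-1. pandas.ExcelWriter takes the path and name of an Excel file and creates the file if it does not exist, or overwrites it if it exists and is opened in write mode.

13-3-1-2. Several DataFrame objects can be written to different sheets of the same Excel file, or a single DataFrame can be written to a named sheet.

13-3-2. Customizing sheets

13-3-2-1. When writing a DataFrame, DataFrame.to_excel lets you set the sheet name (sheet_name) and choose whether to include the index (index) and the column names (header).

13-3-2-2. The position of the data in the sheet can be controlled through the starting row and column (startrow, startcol).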
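
The placement options in 13-3-2 can be sketched as follows. This is a minimal illustration that assumes the openpyxl package is installed; the file name offset_demo.xlsx and the sample data are made up for the example.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook

df = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 34]})

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "offset_demo.xlsx")

# startrow/startcol are 0-based offsets, so the header row lands on
# Excel row 3 and the first column on Excel column B.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="People", index=False, startrow=2, startcol=1)

# Inspect the raw cells with openpyxl to see where the table landed.
ws = load_workbook(path)["People"]
print(ws["B3"].value, ws["C3"].value)  # header cells
print(ws["B4"].value, ws["C4"].value)  # first data row
```

Checking the cells with openpyxl rather than pd.read_excel keeps the check independent of how the reader treats leading empty rows.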

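13-3-3. Multiple engines

pandas.ExcelWriter supports several writing engines, such as xlsxwriter, openpyxl, and odswriter (for .ods files). They differ in features and performance, so pick the one that fits the task.

13-3-4. Styling and formatting

pandas.ExcelWriter itself focuses on writing data, but combined with an engine (for example xlsxwriter's formatting features) it can also customize the look of the Excel file, including cell fonts, colors, and borders.

13-3-5. Handling existing sheets

Sometimes data must be written to a sheet in an Excel file that already exists. Opening the writer with mode='a' (openpyxl engine) loads the existing workbook, and if_sheet_exists controls what happens when the target sheet is already there. Alternatively, read the file, merge the data, and rewrite the whole file.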
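
Append mode and if_sheet_exists can be sketched like this. It is only an illustration under stated assumptions: openpyxl is installed, and the file name append_demo.xlsx, the sheet names, and the data are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "append_demo.xlsx")

first = pd.DataFrame({"x": [1, 2, 3]})
second = pd.DataFrame({"y": [10, 20]})

# Step 1: create the workbook with a single sheet named "Data".
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    first.to_excel(writer, sheet_name="Data", index=False)

# Step 2: reopen it in append mode. "Data" already exists, so its
# contents are replaced; "Extra" does not exist, so it is added.
with pd.ExcelWriter(
    path, engine="openpyxl", mode="a", if_sheet_exists="replace"
) as writer:
    second.to_excel(writer, sheet_name="Data", index=False)
    first.to_excel(writer, sheet_name="Extra", index=False)

# sheet_name=None reads every sheet into a dict of DataFrames.
sheets = pd.read_excel(path, sheet_name=None)
for name, frame in sheets.items():
    print(name, list(frame.columns), len(frame))
```

With the default if_sheet_exists (None, treated as 'error'), the second write to "Data" would raise a ValueError instead of replacing the sheet.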

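13-4. Return value

13-4-1. Constructing pandas.ExcelWriter returns a writer object rather than data. The writer is meant to be used as a context manager (with statement) and is passed to DataFrame.to_excel to write DataFrames into the Excel file.

13-4-2. With a with statement, the file is saved and closed automatically when the block ends; no explicit close() call is needed.

13-4-3. Without a with statement, call close() after writing to save the file and release the file handle. The older save() method was deprecated in pandas 1.5 and removed in pandas 2.0.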
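
For the case without a with statement, a minimal sketch might look like this. It assumes openpyxl is installed; the file name manual_close.xlsx is hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 34]})

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "manual_close.xlsx")

writer = pd.ExcelWriter(path, engine="openpyxl")
try:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
finally:
    # close() saves the workbook and releases the file handle,
    # even if to_excel raised an exception.
    writer.close()

result = pd.read_excel(path, sheet_name="Sheet1")
print(result)
```

The try/finally pair does by hand what the with statement does automatically, which is why the context-manager form is usually preferred.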

13-5. Notes

None.

13-6. Usage

13-6-1. Data preparation

None.

13-6-2. Code example

# 13. The pandas.ExcelWriter class
# 13-1. Write a single DataFrame to an Excel file
import pandas as pd
# Create a simple DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
})
# Write to an Excel file with ExcelWriter
with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
# Note: the with statement saves and closes the file automatically
# Read Sheet1 back from the Excel file and print it
xls = pd.ExcelFile('output.xlsx', engine='openpyxl')
# Read the sheet through the ExcelFile object
df1 = pd.read_excel(xls, sheet_name='Sheet1')
print(df1)
print()

# 13-2. Write several DataFrames to different sheets of one Excel file
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Age': [28, 34]
})
df2 = pd.DataFrame({
    'City': ['New York', 'Paris', 'Berlin'],
    'Country': ['USA', 'France', 'Germany']
})
# Write several sheets with ExcelWriter
with pd.ExcelWriter('output_multiple_sheets.xlsx', engine='xlsxwriter') as writer:
    df1.to_excel(writer, sheet_name='People', index=False)
    df2.to_excel(writer, sheet_name='Places', index=False)
# Note: the with statement is used here as well
# Read the different sheets back from the Excel file and print them
xls = pd.ExcelFile('output_multiple_sheets.xlsx', engine='openpyxl')
# Read the sheets through the ExcelFile object
df1 = pd.read_excel(xls, sheet_name='People')
df2 = pd.read_excel(xls, sheet_name='Places')
print(df1)
print()
print(df2)
print()

# 13-3. Add styling with the xlsxwriter engine
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
    'Data': [10, 20, 30, 20, 15, 30, 45]
})
# Use ExcelWriter with the xlsxwriter engine and add a format
with pd.ExcelWriter('output_with_styles.xlsx', engine='xlsxwriter') as writer:
    # Write the DataFrame to Excel
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    # Get the xlsxwriter workbook and worksheet objects
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # Create a format: bold with a red font color
    bold = workbook.add_format({'bold': True, 'font_color': 'red'})
    # Apply the format to the first column (width None keeps the default width)
    worksheet.set_column('A:A', None, bold)
# Note: the styling comes from xlsxwriter, not from pandas itself
# Read Sheet1 back from the Excel file and print it
xls = pd.ExcelFile('output_with_styles.xlsx', engine='openpyxl')
# Read the sheet through the ExcelFile object
df1 = pd.read_excel(xls, sheet_name='Sheet1')
print(df1)

13-6-3. Output

# 13. The pandas.ExcelWriter class
# 13-1. Write a single DataFrame to an Excel file
#     Name  Age      City
# 0   John   28  New York
# 1   Anna   34     Paris
# 2  Peter   29    Berlin
# 3  Linda   32    London

# 13-2. Write several DataFrames to different sheets of one Excel file
#    Name  Age
# 0  John   28
# 1  Anna   34
# 
#        City  Country
# 0  New York      USA
# 1     Paris   France
# 2    Berlin  Germany

# 13-3. Add styling with the xlsxwriter engine
#    Data
# 0    10
# 1    20
# 2    30
# 3    20
# 4    15
# 5    30
# 6    45

14. The pandas.read_json function

14-1. Syntax

# 14. The pandas.read_json function
pandas.read_json(path_or_buf, *, orient=None, typ='frame', dtype=None, convert_axes=None, convert_dates=True, keep_default_dates=True, precise_float=False, date_unit=None, encoding=None, encoding_errors='strict', lines=False, chunksize=None, compression='infer', nrows=None, storage_options=None, dtype_backend=_NoDefault.no_default, engine='ujson')
Convert a JSON string to pandas object.

Parameters:
path_or_buf : a valid JSON str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.json.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

Deprecated since version 2.1.0: Passing json literal strings is deprecated.

orient : str, optional
Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

'records' : list like [{column -> value}, ... , {column -> value}]

'index' : dict like {index -> {column -> value}}

'columns' : dict like {column -> {index -> value}}

'values' : just the values array

'table' : dict like {'schema': {schema}, 'data': {data}}

The allowed and default values depend on the value of the typ parameter.

when typ == 'series',

allowed orients are {'split','records','index'}

default is 'index'

The Series index must be unique for orient 'index'.

when typ == 'frame',

allowed orients are {'split','records','index', 'columns','values', 'table'}

default is 'columns'

The DataFrame index must be unique for orients 'index' and 'columns'.

The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.

typ : {‘frame’, ‘series’}, default ‘frame’
The type of object to recover.

dtype : bool or dict, default None
If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don’t infer dtypes at all, applies only to the data.

For all orient values except 'table', default is True.

convert_axes : bool, default None
Try to convert the axes to the proper dtypes.

For all orient values except 'table', default is True.

convert_dates : bool or list of str, default True
If True then default datelike columns may be converted (depending on keep_default_dates). If False, no dates will be converted. If a list of column names, then those columns will be converted and default datelike columns may also be converted (depending on keep_default_dates).

keep_default_dates : bool, default True
If parsing dates (convert_dates is not False), then try to parse the default datelike columns. A column label is datelike if

it ends with '_at',

it ends with '_time',

it begins with 'timestamp',

it is 'modified', or

it is 'date'.

precise_float : bool, default False
Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality.

date_unit : str, default None
The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.

encoding : str, default is ‘utf-8’
The encoding to use to decode py3 bytes.

encoding_errors : str, optional, default “strict”
How encoding errors are treated. List of possible values .

New in version 1.3.0.

lines : bool, default False
Read the file as a json object per line.

chunksize : int, optional
Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

compression : str or dict, default ‘infer’
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

New in version 1.5.0: Added support for .tar files.

Changed in version 1.4.0: Zstandard support.

nrows : int, optional
The number of lines from the line-delimited jsonfile that has to be read. This can only be passed if lines=True. If this is None, all the rows will be returned.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

dtype_backend : {‘numpy_nullable’, ‘pyarrow’}, default ‘numpy_nullable’
Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

"numpy_nullable": returns nullable-dtype-backed DataFrame (default).

"pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

New in version 2.0.

engine : {“ujson”, “pyarrow”}, default “ujson”
Parser engine to use. The "pyarrow" engine is only available when lines=True.

New in version 2.0.

Returns:
Series, DataFrame, or pandas.api.typing.JsonReader
A JsonReader is returned when chunksize is not 0 or None. Otherwise, the type returned depends on the value of typ.
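
The interaction of lines, chunksize, and nrows described above can be sketched as follows. This is an illustration only; the three-record JSON Lines text is invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON Lines input: one JSON object per line.
text = (
    '{"name": "John", "age": 30}\n'
    '{"name": "Anna", "age": 25}\n'
    '{"name": "Peter", "age": 29}\n'
)

# With lines=True and chunksize set, read_json returns a JsonReader
# that yields DataFrames of at most `chunksize` rows each.
with pd.read_json(StringIO(text), lines=True, chunksize=2) as reader:
    chunks = list(reader)

sizes = [len(chunk) for chunk in chunks]
combined = pd.concat(chunks, ignore_index=True)
print(sizes)
print(combined)

# nrows limits how many lines are read; it also requires lines=True.
head = pd.read_json(StringIO(text), lines=True, nrows=2)
print(head)
```

Reading in chunks keeps only one chunk in memory at a time, which is the main reason to prefer JSON Lines for large files.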

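14-2. Parameters

14-2-1. path_or_buf (required): a string path or URL, a path object (os.PathLike), or a file-like object with a read() method that points to the JSON data. Passing a literal JSON string is deprecated since pandas 2.1.0; wrap it in io.StringIO instead.

14-2-2. orient (optional, default None): a string describing the expected layout of the JSON data. If it is not given, the default depends on typ: 'columns' for typ='frame' and 'index' for typ='series'.

14-2-2-1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}.

14-2-2-2. 'records': list like [{column -> value}, ... , {column -> value}].

14-2-2-3. 'index': dict like {index -> {column -> value}}, where the top-level keys are the index labels.

14-2-2-4. 'columns': dict like {column -> {index -> value}}.

14-2-2-5. 'values': just the values array.

14-2-2-6. 'table': dict like {'schema': {schema}, 'data': {data}}.

14-2-3. typ (optional, default 'frame'): a string, one of {'frame', 'series'}, selecting the type of object to return. 'frame' returns a DataFrame; 'series' returns a Series.

14-2-4. dtype (optional, default None): bool or dict. True infers dtypes, a dict maps column names to dtypes, and False disables dtype inference. For every orient except 'table' the default is True.

14-2-5. convert_axes (optional, default None): bool; whether to try converting the axes to the proper dtypes. For every orient except 'table' the default is True.

14-2-6. convert_dates (optional, default True): bool or list of column names. True converts the default date-like columns (subject to keep_default_dates), False converts no dates, and a list names the columns to convert.

14-2-7. keep_default_dates (optional, default True): bool; when dates are being parsed, whether to also parse the default date-like columns, meaning labels that end with '_at' or '_time', begin with 'timestamp', or equal 'modified' or 'date'.

14-2-8. precise_float (optional, default False): bool; whether to use the higher-precision strtod function when decoding strings to floating-point values. The default is a faster but less precise built-in routine.

14-2-9. date_unit (optional, default None): a string giving the timestamp unit to detect when converting dates, one of 's', 'ms', 'us', 'ns'. If omitted, pandas tries to detect the precision.

14-2-10. encoding (optional, default None): a string naming the encoding used to decode bytes. When None, UTF-8 is used.

14-2-11. encoding_errors (optional, default 'strict'): a string specifying how encoding errors are handled.

14-2-12. lines (optional, default False): bool; if True, the file is read as one JSON object per line (JSON Lines).

14-2-13. chunksize (optional, default None): int; only valid with lines=True. When set, read_json returns a JsonReader that yields DataFrames of that many lines, which helps with large files. When None, the whole file is read into memory at once.

14-2-14. compression (optional, default 'infer'): string or dict selecting on-the-fly decompression. 'infer' detects the format from the file extension ('.gz', '.bz2', '.zip', '.xz', '.zst', '.tar' and its compressed variants); None disables decompression.

14-2-15. nrows (optional, default None): int; the number of lines to read from a line-delimited JSON file. Only valid with lines=True.

14-2-16. storage_options (optional, default None): a dict of extra options for the storage connection, such as AWS S3 access keys.

14-2-17. dtype_backend (optional): 'numpy_nullable' or 'pyarrow'; selects nullable NumPy-backed or pyarrow-backed dtypes for the result. Added in pandas 2.0 and still experimental.

14-2-18. engine (optional, default 'ujson'): a string naming the parser engine, 'ujson' or 'pyarrow'. The 'pyarrow' engine is only available when lines=True.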
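
To make the orient layouts concrete, the sketch below serializes one small DataFrame with several orient values and reads each result back. The sample data and index labels are made up for the illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample frame with string index labels.
df = pd.DataFrame(
    {"name": ["John", "Anna"], "age": [30, 25]},
    index=["a", "b"],
)

texts = {}
frames = {}
for orient in ["split", "records", "index", "columns"]:
    # to_json produces the layout that read_json expects for the same orient.
    texts[orient] = df.to_json(orient=orient)
    frames[orient] = pd.read_json(StringIO(texts[orient]), orient=orient)
    print(orient, texts[orient])

# 'records' does not store the index, so it comes back as 0, 1.
print(frames["records"].index.tolist())
print(frames["split"].index.tolist())
```

Choosing the same orient on both sides is what makes the round trip work; 'records' is compact but drops the index labels.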

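14-3. Features

Reads JSON data and converts it into a pandas DataFrame or Series.

14-4. Return value

14-4-1. With typ='frame' (the default), a pandas DataFrame is returned. A DataFrame is a two-dimensional labeled data structure that can hold tabular data with columns of different types.

14-4-2. With typ='series', a pandas Series is returned. A Series is a one-dimensional array that can hold any data type (integers, strings, floats, Python objects, and so on), and every element has a label (the index).

14-4-3. When chunksize is set (together with lines=True), a JsonReader is returned instead, which yields the data in chunks.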
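
A short sketch of how typ changes the returned object, using made-up JSON text wrapped in StringIO:

```python
from io import StringIO

import pandas as pd

# typ='frame' (the default) returns a DataFrame.
frame = pd.read_json(
    StringIO('[{"name": "John", "age": 30}, {"name": "Anna", "age": 25}]'),
    orient="records",
)

# typ='series' returns a Series; with the default orient ('index'),
# the JSON keys become the index labels.
series = pd.read_json(StringIO('{"a": 1, "b": 2, "c": 3}'), typ="series")

print(type(frame).__name__)
print(type(series).__name__)
print(series)
```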

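14-5. Notes

The function is useful for data analysis, data cleaning, and data processing, because the result can be handled with the full set of pandas data operations.

14-6. Usage

14-6-1. Data preparation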

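# Create the JSON file example.json
# Method 1: create it directly with pandas
import pandas as pd
# A list of dicts holding the data
data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Anna", "age": 25, "city": "Paris"}
]
# Convert the list of dicts to a DataFrame
df = pd.DataFrame(data)
# Save the DataFrame as JSON Lines: orient='records' combined with lines=True
df.to_json('example.json', orient='records', lines=True, index=False)
# Notes:
# - orient='records' exports each row as one JSON object (a dict).
# - lines=True writes one record per line (JSON Lines); such a file must be read back
#   with pd.read_json(..., lines=True), optionally in chunks to save memory.
# - index=False leaves the DataFrame index out (orient='records' does not write it anyway).

# Method 2: create it with the standard json module (recommended)
import pandas as pd
import json
# A list of dicts holding the data
data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Anna", "age": 25, "city": "Paris"}
]
# Convert the list of dicts to a DataFrame
df = pd.DataFrame(data)
# Convert the DataFrame to a list of records
records = df.to_dict(orient='records')
# Write the list of records to the file with the json module
# (this overwrites the file produced by Method 1)
with open('example.json', 'w') as f:
    json.dump(records, f)
# example.json now contains a strict JSON array.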

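14-6-2. Code example

# 14. The pandas.read_json function
# 14-1. Read from a JSON string
import pandas as pd
from io import StringIO
# JSON string
json_str = '''
[
  {"name": "John", "age": 30, "city": "New York"},
  {"name": "Anna", "age": 25, "city": "Paris"}
]
'''
# Read the data from the JSON string; it is wrapped in StringIO because
# passing a literal JSON string is deprecated since pandas 2.1.0
df = pd.read_json(StringIO(json_str), orient='records')
print(df, end='\n\n')

# 14-2. Read from a JSON file
import pandas as pd
# Read the data from the JSON file (the strict JSON array written by Method 2)
df = pd.read_json('example.json', orient='records')
print(df, end='\n\n')

# 14-3. Read JSON with a different orient
import pandas as pd
from io import StringIO
# JSON string in 'split' orientation
json_str = '''
{
  "columns": ["name", "age", "city"],
  "index": [0, 1],
  "data": [
    ["John", 30, "New York"],
    ["Anna", 25, "Paris"]
  ]
}
'''
# Read the data from the JSON string with orient='split'
df = pd.read_json(StringIO(json_str), orient='split')
print(df, end='\n\n')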

14-6-3. Output

# 14. The pandas.read_json function
# 14-1. Read from a JSON string
#    name  age      city
# 0  John   30  New York
# 1  Anna   25     Paris

# 14-2. Read from a JSON file
#    name  age      city
# 0  John   30  New York
# 1  Anna   25     Paris

# 14-3. Read JSON with a different orient
#    name  age      city
# 0  John   30  New York
# 1  Anna   25     Paris

15. The pandas.json_normalize function

15-1. Syntax

# 15. The pandas.json_normalize function
pandas.json_normalize(data, record_path=None, meta=None, meta_prefix=None, record_prefix=None, errors='raise', sep='.', max_level=None)
Normalize semi-structured JSON data into a flat table.

Parameters:
data
dict or list of dicts
Unserialized JSON objects.

record_path
str or list of str, default None
Path in each object to list of records. If not passed, data will be assumed to be an array of records.

meta
list of paths (str or list of str), default None
Fields to use as metadata for each record in resulting table.

meta_prefix
str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if meta is [‘foo’, ‘bar’].

record_prefix
str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if path to records is [‘foo’, ‘bar’].

errors
{‘raise’, ‘ignore’}, default ‘raise’
Configures error handling.

‘ignore’ : will ignore KeyError if keys listed in meta are not always present.

‘raise’ : will raise KeyError if keys listed in meta are not always present.

sep
str, default ‘.’
Nested records will generate names separated by sep. e.g., for sep=’.’, {‘foo’: {‘bar’: 0}} -> foo.bar.

max_level
int, default None
Max number of levels(depth of dict) to normalize. if None, normalizes all levels.

Returns:
frame
DataFrame
Normalize semi-structured JSON data into a flat table.

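15-2. Parameters

15-2-1. data (required): a dict or a list of dicts holding the unserialized (already parsed) JSON objects to flatten.

15-2-2. record_path (optional, default None): str or list of str giving the path in each object to the list of records that become the DataFrame rows. If omitted, data itself is assumed to be an array of records.

15-2-3. meta (optional, default None): a list of paths (str or list of str) naming the fields to attach to every record as metadata columns in the result.

15-2-4. meta_prefix (optional, default None): a prefix added to the metadata column names, which helps tell them apart from the record columns.

15-2-5. record_prefix (optional, default None): a prefix added to the columns that come from the records found at record_path.

15-2-6. errors (optional, default 'raise'): what to do when a key listed in meta is missing from some objects. 'raise' (the default) raises a KeyError; 'ignore' ignores the KeyError and continues.

15-2-7. sep (optional, default '.'): the separator used to join nested dict keys when they are flattened into column names.

15-2-8. max_level (optional, default None): the maximum number of nesting levels to flatten. None means all levels are flattened.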
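
The record_path, meta, prefix, and errors parameters can be sketched together as follows. The nested order data, the "region" key, and the prefixes are invented for the illustration.

```python
import pandas as pd

# Hypothetical nested data: each customer carries a list of orders.
# The second customer has no "region" key on purpose.
data = [
    {
        "id": 1,
        "region": "EU",
        "info": {"name": "Alice"},
        "orders": [{"item": "book", "qty": 2}, {"item": "pen", "qty": 5}],
    },
    {
        "id": 2,
        "info": {"name": "Bob"},
        "orders": [{"item": "lamp", "qty": 1}],
    },
]

flat = pd.json_normalize(
    data,
    record_path="orders",                       # one output row per order
    meta=["id", ["info", "name"], "region"],    # repeated for every order
    record_prefix="order_",
    meta_prefix="meta_",
    errors="ignore",                            # missing "region" becomes NaN
)
print(flat)
```

With errors='raise' (the default), the missing "region" key in the second object would raise a KeyError instead.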

15-3. Features

Flattens semi-structured JSON data into a table (DataFrame).
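
As a small illustration of the flattening, the sketch below shows how sep and max_level change the resulting column names. The nested record is made up for the example.

```python
import pandas as pd

# Hypothetical record with two levels of nesting under "info".
data = [
    {
        "id": 1,
        "info": {"name": "Alice", "address": {"city": "Paris", "zip": "75001"}},
    }
]

# Default: every nesting level is flattened, keys joined with '.'.
full = pd.json_normalize(data)
print(list(full.columns))

# A custom separator changes only how the column names are joined.
underscored = pd.json_normalize(data, sep="_")
print(list(underscored.columns))

# max_level=1 flattens one level; deeper dicts stay as dict values.
shallow = pd.json_normalize(data, max_level=1)
print(list(shallow.columns))
print(shallow.loc[0, "info.address"])
```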

15-4. Return value

The return value is a DataFrame.

15-5. Notes

None.

15-6. Usage

15-6-1. Data preparation

# Create the JSON file example.json
# Method 1: create it directly with pandas
import pandas as pd
# A list of dicts holding the data
data = [
    {
        "id": 1,
        "info": {"name": "Alice", "age": 30},
        "hobbies": ["reading", "cycling"]
    },
    {
        "id": 2,
        "info": {"name": "Bob", "age": 25},
        "hobbies": ["swimming", "coding"]
    }
]

# Convert the list of dicts to a DataFrame
df = pd.DataFrame(data)
# Save the DataFrame as JSON Lines: orient='records' combined with lines=True
df.to_json('example.json', orient='records', lines=True, index=False)
# Notes:
# - orient='records' exports each row as one JSON object (a dict).
# - lines=True writes one record per line (JSON Lines); such a file must be read back
#   with pd.read_json(..., lines=True), optionally in chunks to save memory.
# - index=False leaves the DataFrame index out (orient='records' does not write it anyway).

# Method 2: create it with the standard json module (recommended)
import pandas as pd
import json
# A list of dicts holding the data
data = [
    {
        "id": 1,
        "info": {"name": "Alice", "age": 30},
        "hobbies": ["reading", "cycling"]
    },
    {
        "id": 2,
        "info": {"name": "Bob", "age": 25},
        "hobbies": ["swimming", "coding"]
    }
]
# Convert the list of dicts to a DataFrame
df = pd.DataFrame(data)
# Convert the DataFrame to a list of records
records = df.to_dict(orient='records')
# Write the list of records to the file with the json module
# (this overwrites the file produced by Method 1)
with open('example.json', 'w') as f:
    json.dump(records, f)
# example.json now contains a strict JSON array.

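15-6-2. Code example

# 15. The pandas.json_normalize function
import pandas as pd
# Read the JSON file, assuming it is laid out in 'records' orientation
# (read_json keeps the nested 'info' dicts and 'hobbies' lists in single columns)
df = pd.read_json('example.json', orient='records')
# To treat the 'id' column as metadata, either build a new DataFrame or modify df in place
# Option 1: keep the original df and build a new DataFrame
df2 = df.drop('id', axis=1)  # drop 'id' so it can be re-added as metadata
df2['meta_id'] = df['id']  # add 'meta_id' to df2 as the metadata column
# Option 2: if modifying the original df is acceptable, add the prefixed column directly
df['meta_id'] = df['id']
df.drop('id', axis=1, inplace=True)  # note: this modifies the original df
# Print the results
print(df2)  # result of option 1 (new DataFrame)
print(df)  # result of option 2 (df modified in place)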

15-6-3. Output

# 15. The pandas.json_normalize function
#                            info             hobbies  meta_id
# 0  {'name': 'Alice', 'age': 30}  [reading, cycling]        1
# 1    {'name': 'Bob', 'age': 25}  [swimming, coding]        2
#                            info             hobbies  meta_id
# 0  {'name': 'Alice', 'age': 30}  [reading, cycling]        1
# 1    {'name': 'Bob', 'age': 25}  [swimming, coding]        2
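
In the output above, the info column still holds nested dicts. As a hedged sketch, calling pd.json_normalize directly on the same records flattens them into separate columns; the explode step for the hobbies lists is an optional extra and not part of json_normalize itself.

```python
import pandas as pd

# The same records as in the data preparation step.
data = [
    {
        "id": 1,
        "info": {"name": "Alice", "age": 30},
        "hobbies": ["reading", "cycling"],
    },
    {
        "id": 2,
        "info": {"name": "Bob", "age": 25},
        "hobbies": ["swimming", "coding"],
    },
]

# Nested dicts become dotted columns (info.name, info.age);
# lists such as "hobbies" are kept as list values.
flat = pd.json_normalize(data)
print(flat)

# Optional: turn each hobby into its own row with DataFrame.explode.
exploded = flat.explode("hobbies", ignore_index=True)
print(exploded)
```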