Pandas 模块 - 读写(2)-Excel 表格及其他-read_excel/to_excel

江南野栀子

已于 2024-04-18 12:11:49 修改

阅读量3.3k

点赞数 1

分类专栏： # Python 数据分析文章标签： python 数据分析 pandas

于 2021-11-19 16:06:14 首次发布

本文链接：https://blog.csdn.net/u010701274/article/details/121396758

版权

Python 数据分析专栏收录该内容

22 篇文章 28 订阅

订阅专栏

2. 利用 .read_excel() 和 .to_excel() 从外部读取或者写入数据

2.1 .read_excel() 语法

2.2 .read_excel() 范例

2.2.1 最简单的 io 参数

2.2.2 read_csv() 没有的 sheet_name

2.2.3 相似的参数 header，names ，index_col ，usecols

2.2.4 和 .read_csv() 含义不一样的 engine

2.3 .to_excel() 语法

2.4 .to_excel() 范例

2.4.1 最简单的 excel_writer

2.4.2 指定工作表名 sheet_name

2.4.3 类似的参数 columns、header、index、index_label

2.4.4 好理解的 startrow 、startcol

2.4.5 可选的 NaN参数 na_rep

2.4.6 可选的浮点数格式参数 float_format

.read_excel() 和 .to_excel() 是一对，恰如 .read_csv() 和 .to_csv() 一样，甚至很多参数都是雷同的。

2. 利用 .read_excel() 和 .to_excel() 从外部读取或者写入数据

2.1 .read_excel() 语法

网上有很多 .read_excel() 语法结构，但是最权威的还是写在代码里面的。具体每个参数的用法，就在范例里面一一说明吧。

Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, storage_options: Union[Dict[str, Any], NoneType] = None)
    Read an Excel file into a pandas DataFrame.
    
    Supports `xls`, `xlsx`, `xlsm`, `xlsb`, `odf`, `ods` and `odt` file extensions
    read from a local filesystem or URL. Supports an option to read
    a single sheet or a list of sheets.
    
    Parameters
    ----------
    io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: ``file://localhost/path/to/table.xlsx``.
    
        If you want to pass in a path object, pandas accepts any ``os.PathLike``.
    
        By file-like object, we refer to objects with a ``read()`` method,
        such as a file handle (e.g. via builtin ``open`` function)
        or ``StringIO``.
    sheet_name : str, int, list, or None, default 0
        Strings are used for sheet names. Integers are used in zero-indexed
        sheet positions. Lists of strings/integers are used to request
        multiple sheets. Specify None to get all sheets.
    
        Available cases:
    
        * Defaults to ``0``: 1st sheet as a `DataFrame`
        * ``1``: 2nd sheet as a `DataFrame`
        * ``"Sheet1"``: Load sheet with name "Sheet1"
        * ``[0, 1, "Sheet5"]``: Load first, second and sheet named "Sheet5"
          as a dict of `DataFrame`
        * None: All sheets.
    
    header : int, list of int, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. If a list of integers is passed those row positions will
        be combined into a ``MultiIndex``. Use None if there is no header.
    names : array-like, default None
        List of column names to use. If file contains no header row,
        then you should explicitly pass header=None.
    index_col : int, list of int, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``.  If a
        subset of data is selected with ``usecols``, index_col
        is based on the subset.
    usecols : int, str, list-like, or callable default None
        * If None, then parse all columns.
        * If str, then indicates comma separated list of Excel column letters
          and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
          both sides.
        * If list of int, then indicates list of column numbers to be parsed.
        * If list of string, then indicates list of column names to be parsed.
    
          .. versionadded:: 0.24.0
    
        * If callable, then evaluate each column name against it and parse the
          column if the callable returns ``True``.
    
        Returns a subset of the columns according to behavior above.
    
          .. versionadded:: 0.24.0
    
    squeeze : bool, default False
        If the parsed data only contains one column then return a Series.
    dtype : Type name or dict of column -> type, default None
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
        Use `object` to preserve data as stored in Excel and not interpret dtype.
        If converters are specified, they will be applied INSTEAD
        of dtype conversion.
    engine : str, default None
        If io is not a buffer or path, this must be set to identify io.
        Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".
        Engine compatibility :
    
        - "xlrd" supports old-style Excel files (.xls).
        - "openpyxl" supports newer Excel file formats.
        - "odf" supports OpenDocument file formats (.odf, .ods, .odt).
        - "pyxlsb" supports Binary Excel files.
    
        .. versionchanged:: 1.2.0
            The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
            now only supports old-style ``.xls`` files.
            When ``engine=None``, the following logic will be
            used to determine the engine:
    
           - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
             then `odf <https://pypi.org/project/odfpy/>`_ will be used.
           - Otherwise if ``path_or_buffer`` is an xls format,
             ``xlrd`` will be used.
           - Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed,
             then ``openpyxl`` will be used.
           - Otherwise if ``xlrd >= 2.0`` is installed, a ``ValueError`` will be raised.
           - Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised. This
             case will raise a ``ValueError`` in a future version of pandas.
    
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can
        either be integers or column labels, values are functions that take one
        input argument, the Excel cell content, and return the transformed
        content.
    true_values : list, default None
        Values to consider as True.
    false_values : list, default None
        Values to consider as False.
    skiprows : list-like, int, or callable, optional
        Line numbers to skip (0-indexed) or number of lines to skip (int) at the
        start of the file. If callable, the callable function will be evaluated
        against the row indices, returning True if the row should be skipped and
        False otherwise. An example of a valid callable argument would be ``lambda
        x: x in [0, 2]``.
    nrows : int, default None
        Number of rows to parse.
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values. By default the following values are interpreted
        as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
        '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a',
        'nan', 'null'.
    keep_default_na : bool, default True
        Whether or not to include the default NaN values when parsing the data.
        Depending on whether `na_values` is passed in, the behavior is as follows:
    
        * If `keep_default_na` is True, and `na_values` are specified, `na_values`
          is appended to the default NaN values used for parsing.
        * If `keep_default_na` is True, and `na_values` are not specified, only
          the default NaN values are used for parsing.
        * If `keep_default_na` is False, and `na_values` are specified, only
          the NaN values specified `na_values` are used for parsing.
        * If `keep_default_na` is False, and `na_values` are not specified, no
          strings will be parsed as NaN.
    
        Note that if `na_filter` is passed in as False, the `keep_default_na` and
        `na_values` parameters will be ignored.
    na_filter : bool, default True
        Detect missing value markers (empty strings and the value of na_values). In
        data without any NAs, passing na_filter=False can improve the performance
        of reading a large file.
    verbose : bool, default False
        Indicate number of NA values placed in non-numeric columns.
    parse_dates : bool, list-like, or dict, default False
        The behavior is as follows:
    
        * bool. If True -> try parsing the index.
        * list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
          each as a separate date column.
        * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
          a single date column.
        * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call
          result 'foo'
    
        If a column or index contains an unparseable date, the entire column or
        index will be returned unaltered as an object data type. If you don`t want to
        parse some cells as date just change their type in Excel to "Text".
        For non-standard datetime parsing, use ``pd.to_datetime`` after ``pd.read_excel``.
    
        Note: A fast-path exists for iso8601-formatted dates.
    date_parser : function, optional
        Function to use for converting a sequence of string columns to an array of
        datetime instances. The default uses ``dateutil.parser.parser`` to do the
        conversion. Pandas will try to call `date_parser` in three different ways,
        advancing to the next if an exception occurs: 1) Pass one or more arrays
        (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the
        string values from the columns defined by `parse_dates` into a single array
        and pass that; and 3) call `date_parser` once for each row using one or
        more strings (corresponding to the columns defined by `parse_dates`) as
        arguments.
    thousands : str, default None
        Thousands separator for parsing string columns to numeric.  Note that
        this parameter is only necessary for columns stored as TEXT in Excel,
        any numeric columns will automatically be parsed, regardless of display
        format.
    comment : str, default None
        Comments out remainder of line. Pass a character or characters to this
        argument to indicate comments in the input file. Any data between the
        comment string and the end of the current line is ignored.
    skipfooter : int, default 0
        Rows at the end to skip (0-indexed).
    convert_float : bool, default True
        Convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
        data will be read in as floats: Excel stores all numbers as floats
        internally.
    mangle_dupe_cols : bool, default True
        Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than
        'X'...'X'. Passing in False will cause data to be overwritten if there
        are duplicate names in the columns.
    storage_options : dict, optional
        Extra options that make sense for a particular storage connection, e.g.
        host, port, username, password, etc., if using a URL that will
        be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
        will be raised if providing this argument with a local path or
        a file-like buffer. See the fsspec and backend storage implementation
        docs for the set of allowed keys and values.
    
        .. versionadded:: 1.2.0

2.2 .read_excel() 范例

2.2.1 最简单的 io 参数

io：str、bytes、ExcelFile、xlrd.Book、path对象或类似文件的对象任何有效的字符串路径都是可以接受的。字符串可以是URL。有效的URL方案包括http、ftp、s3和文件。对于文件URL，需要一个主机。本地文件可以是：``file://localhost/path/to/table.xlsx``. 如果要传入路径对象，pandas会接受任何``os.PathLike``。对于类文件对象，我们使用“read（）”方法引用对象，例如文件句柄（例如，通过内置的“open”函数）或“StringIO”。

代码如下

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx")
data02

运行结果

2.2.2 read_csv() 没有的 sheet_name

sheet_name ：str、int、list或None，默认值为0；字符串用于图纸名称。整数用于索引

工作表位置，从 0 开始。字符串/整数列表用于请求多个工作表。指定“None”以获取所有sheet。

现有案例：

*默认为``0``：作为`数据帧的第一个工作表`

*``1``:作为`DataFrame`的第二张工作表`

*`“Sheet1``：加载名为“Sheet1”的工作表

*`[0,1，“Sheet5”]``：加载第一张、第二张和名为“Sheet5”的图纸，结果为`DataFrame`类型的字典结构

*None：所有工作表。

以上面的例子为例，其实 "D:\pandas\Excel file.xlsx" 中有两个工作表，但是只读了第一个工作表。

如果我想读第二个工作表，有两种效果等同的代码：

sheet_name=1

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx",sheet_name=1)
data02

或者 sheet_name=“Sheet2”

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx",sheet_name="Sheet2")
data02

如果我想两张表都读取： sheet_name=None，此时返回的就是一个 ”DataFrame“ 结构的字典，

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx",sheet_name=None)
data02

如果想依次读取这两个工作表的信息

for tmp in data02:
    print("*"*30)
    print(tmp)
    print(data02[tmp])

2.2.3 相似的参数 header，names ，index_col ，usecols

对 Sheet1 稍作修改

Sheet2 保留原貌

代码1）

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx",sheet_name=None,
                     index_col=0,# 设置第一列为行名
                     skiprows=[0,None], #设置 sheet1 过滤第一行，sheet 2 不需要
                    names=["id","Name","Birthday"] #名字改为 "id","Name","Birthday"
                    )
data02

运行结果

代码2）

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx",sheet_name="Sheet2",
                    index_col=0,
                    usecols = ['作者']
                    )
data02

运行结果

大家可能会问，为什么我没有把 header，names ，index_col ，usecols 用一个例子来表达

嗯呢，我试过，但是反复出错。

例如下面的代码 3），我只是在代码 1）上加了 usecols=["Name"]，就反复报错

“

ValueError: Number of passed names did not match number of header fields in the file

”

我对此的解释只能是，Pandas 还不完美。

data02=pd.read_excel(r"D:\pandas\Excel file.xlsx",sheet_name=None,
                     index_col=0,# 设置第一列为行名
                     skiprows=[0,None], #设置 sheet1 过滤第一行，sheet 2 不需要
                    names=["id","Name","Birthday"], #名字改为 "id","Name","Birthday"
                     usecols=["Name"]
                    )
data02

结果报错

***********************************************************************************************************

此外，还想给大家一个建议，如果每个工作表的结构是不一致的，建议不要一次性读取所有的表格，而是一张张读取的，毕竟针对不同的工作表结构，代码实现不完全一样，非要整在一行代码里面完全没有必要。

通过对不同结构的工作表的轮询，用不同的代码去实现，准确性会更高。

***********************************************************************************************************

2.2.4 和 .read_csv() 含义不一样的 engine

在 .read_csv() 函数里面支持 engine 为{"C","python"}，但是在.read_excel() 里面，engine 默认为None，可以接受的参数有“ xlrd”，“ openpyxl”或“ odf”，用于使用第三方的库去解析excel文件。

如果io不是缓冲区或路径，则必须将其设置为标识 io。
支持的引擎：“xlrd”、“openpyxl”、“odf”、“pyxlsb”。
engine 兼容性：
-“xlrd”支持旧式Excel文件（.xls）。
-“openpyxl”支持更新的Excel文件格式。
-“odf”支持OpenDocument文件格式（.odf、.ods、.odt）。
-“pyxlsb”支持二进制Excel文件。

例如要支持 OpenDocument文件，engine 使用 odf 时候，要先导入第三方库 odfpy，否则会报错

”

ImportError: Missing optional dependency 'odfpy'.  Use pip or conda to install odfpy.

“

导入第三方库 odfpy 方法如下：

代码

import pandas as pd
data02=pd.read_excel(r"D:\pandas\openDoc.ods",sheet_name="Sheet2",
                     index_col=0,# 设置第一列为行名
                    names=["id","Name","Birthday"],#名字改为 "id","Name","Birthday"
                     engine="odf"
                    )
data02

运行结果

2.3 .to_excel() 语法

还是直接看代码里面的注释吧，这里绝大多数都在 .to_csv() 函数里面聊过，用途是雷同的。

Help on function to_excel in module pandas.core.generic:

to_excel(self, excel_writer, sheet_name: 'str' = 'Sheet1', na_rep: 'str' = '', float_format: 'Optional[str]' = None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None, storage_options: 'StorageOptions' = None) -> 'None'
Write object to an Excel sheet.

To write a single object to an Excel .xlsx file it is only necessary to
specify a target file name. To write to multiple sheets it is necessary to
create an `ExcelWriter` object with a target file name, and specify a sheet
in the file to write to.

Multiple sheets may be written to by specifying unique `sheet_name`.
With all data written to the file it is necessary to save the changes.
Note that creating an `ExcelWriter` object with a file name that already
exists will result in the contents of the existing file being erased.

Parameters
----------
excel_writer : path-like, file-like, or ExcelWriter object
File path or existing ExcelWriter.
sheet_name : str, default 'Sheet1'
Name of sheet which will contain DataFrame.
na_rep : str, default ''
Missing data representation.
float_format : str, optional
Format string for floating point numbers. For example
``float_format="%.2f"`` will format 0.1234 to 0.12.
columns : sequence or list of str, optional
Columns to write.
header : bool or list of str, default True
Write out the column names. If a list of string is given it is
assumed to be aliases for the column names.
index : bool, default True
Write row names (index).
index_label : str or sequence, optional
Column label for index column(s) if desired. If not specified, and
`header` and `index` are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex.
startrow : int, default 0
Upper left cell row to dump data frame.
startcol : int, default 0
Upper left cell column to dump data frame.
engine : str, optional
Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this
via the options ``io.excel.xlsx.writer``, ``io.excel.xls.writer``, and
``io.excel.xlsm.writer``.

.. deprecated:: 1.2.0

As the `xlwt <https://pypi.org/project/xlwt/>`__ package is no longer
maintained, the ``xlwt`` engine will be removed in a future version
of pandas.

merge_cells : bool, default True
Write MultiIndex and Hierarchical Rows as merged cells.
encoding : str, optional
Encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.
inf_rep : str, default 'inf'
Representation for infinity (there is no native representation for
infinity in Excel).
verbose : bool, default True
Display more information in the error logs.
freeze_panes : tuple of int (length 2), optional
Specifies the one-based bottommost row and rightmost column that
is to be frozen.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc., if using a URL that will
be parsed by ``fsspec``, e.g., starting "s3://", "gcs://". An error
will be raised if providing this argument with a non-fsspec URL.
See the fsspec and backend storage implementation docs for the set of
allowed keys and values.

2.4 .to_excel() 范例

先准备数据

2.4.1 最简单的 excel_writer

excel_writer : path-like, file-like, or ExcelWriter object File path or existing ExcelWriter.

代码

data02.to_excel(r"D:\pandas\Save Excel.xlsx")

运行结果

2.4.2 指定工作表名 sheet_name

sheet_name : str, default 'Sheet1' Name of sheet which will contain DataFrame.

在 2.4.1 小节中，生成的默认工作表名就是“Sheet1”，现在我们给它设定我们自己想要的

data02.to_excel(r"D:\pandas\Save Excel.xlsx",sheet_name=r"随笔记录")

2.4.3 类似的参数 columns、header、index、index_label

columns : sequence or list of str, optional Columns to write.序列或者字符串的列表，用于写入的可选择的列

header : bool or list of str, default True Write out the column names. If a list of string is given it is assumed to be aliases for the column names.

布尔值，或者字符串的列表，默认为True，即写入列名字。如果给了一个字符串列表，则被认为是列的别名。

index : bool, default True Write row names (index).

布尔值，默认为True，即写入行名

index_label : str or sequence, optional Column label for index column(s) if desired. If not specified, and `header` and `index` are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

字符串或者序列号，即 index 列的标识。如果没有指定，且`header` 和`index` 都是True,则使用 index 名字。如果 DataFrame 使用了多个Index，那么要给一个序列。

先看第一段代码

data02.to_excel(r"D:\pandas\Save Excel.xlsx",
                sheet_name=r"随笔记录",
                columns=["Birthday"],#只选择 Birthaday
                header=False,#不写入列名
                index_label="IndexLabel"
               )

运行结果

再看第二段代码

data02.to_excel(r"D:\pandas\Save Excel.xlsx",
                sheet_name=r"随笔记录",
                columns=["Birthday"],#只选择 Birthaday
                header=True,#即默认这种形式，只不过我写明白了
                index_label="IndexLabel"
               )

运行结果

大家可以看到，当 header = True 时候，index_label 才起作用。除了header ，index 也会影响index_label 的作用。上面两段代码中 index 都是默认为 True 的情况。

下面看看第三段代码

data02.to_excel(r"D:\pandas\Save Excel.xlsx",
                sheet_name=r"随笔记录",
                columns=["Birthday"],#只选择 Birthaday
                header=True,#即默认这种形式，只不过我写明白了
                index_label="IndexLabel",
                index=False
               )

运行结果

可以看到 index = False 时候，既然都不写入 index 了，index_label 自然也就没有意义了。

2.4.4 好理解的 startrow 、startcol

startrow : int, default 0 Upper left cell row to dump data frame.

startcol : int, default 0 Upper left cell column to dump data frame.

startrow 和 startcol 分别指定数据放置的起始行和起始列，默认是 0，工作表序号从 0 开始。

data02=data02.to_excel(r"D:\pandas\Save Excel.xlsx",
                   sheet_name="NewSheet",
                    startrow=1,
                     startcol=2
                    )
data02

运行结果

2.4.5 可选的 NaN参数 na_rep

na_rep : str, default '' Missing data representation.

当我们的数据存在 NaN时候

如果使用默认的“”，结果如下

如果我们设置 na_rep = "暂时不知道"

代码如下：

data02.to_excel(r"D:\pandas\Save Excel2.xlsx",na_rep = "暂时不知道")

运行结果如下

2.4.6 可选的浮点数格式参数 float_format

float_format : str, optional Format string for floating point numbers. For example ``float_format="%.2f"`` will format 0.1234 to 0.12.

这个也比较好理解，不赘述。

代码如下

data.to_excel(r"D:\pandas\Save Excel4.xlsx",float_format='%.4f')

运行结果

其余的参数就大家自行专研吧，等我有空再写。

'''

要是大家觉得写得还行，麻烦点个赞或者收藏吧，想给博客涨涨人气，非常感谢！

'''

江南野栀子

关注

1
点赞
踩
6

收藏

觉得还不错? 一键收藏
打赏
0
评论
Pandas 模块 - 读写(2)-Excel 表格及其他-read_excel/to_excel

.read_excel() 和 .to_excel() 是一对，恰如 .read_csv() 和 .to_csv() 一样，甚至很多参数都是雷同的。2. 利用.read_excel() 和 .to_excel() 从外部读取或者写入数据2.1 .read_excel()语法网上有很多 .read_excel() 语法结构，但是最权威的还是写在代码里面的。具体每个参数的用法，就在范例里面一一说明吧。Help on function read_excel in modul...
复制链接

扫一扫