Python Missing-Value Filtering: Python Practice on Handling Missing Data

Latest recommended article published on 2022-06-22 15:22:26

伊利心情

Tags: Python missing-value filtering

Article link: https://blog.csdn.net/weixin_34335039/article/details/112430628

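Using a self-built Excel dataset, this article shows how to handle missing values with Python's pandas library. It first reads the file with `read_excel()` and finds that missing values appear as NaN. It then introduces two ways to handle them: deleting rows that contain missing values with `dropna()`, and filling them with `fillna()`, either with the column mean or with the value from the preceding or following record. With `fillna()`, setting `inplace=True` replaces the original data directly, and the `method` parameter selects the fill strategy.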
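The two approaches named in the summary can be sketched on a tiny table. The DataFrame below is hypothetical demo data, not the article's Excel file. The article refers to `fillna(method=...)`; recent pandas releases deprecate that parameter, so this sketch uses the equivalent `ffill()` and `bfill()` methods instead.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data; the article's Excel workbook is not reproduced here.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "score": [90.0, np.nan, 70.0, np.nan],
})

# 1) Drop every row that contains at least one missing value.
dropped = df.dropna()

# 2) Fill missing scores with the column mean (mean() skips NaN by default).
mean_filled = df.fillna({"score": df["score"].mean()})

# 3) Fill from the previous record (forward fill) or the next one (backward fill).
forward_filled = df["score"].ffill()
backward_filled = df["score"].bfill()

print(dropped)
print(mean_filled)
print(forward_filled.tolist())
print(backward_filled.tolist())
```

Each call here returns a new object and leaves `df` untouched. Passing `inplace=True` to `dropna()` or `fillna()` modifies `df` directly instead. Note that a backward fill cannot fill a missing value in the last row, because no later record exists.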

I built a small dataset in Excel to use as demo data, shown below:

As you can see, some values in the data are missing.
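Rather than spotting the gaps by eye, pandas can count them. The following is a minimal sketch on a made-up table; the column names and values are assumptions for illustration and do not come from the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's table.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "height": [170.0, np.nan, 165.0],
    "weight": [60.0, 55.0, np.nan],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = df.isna().sum()

# Keep only the rows that have at least one missing cell.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```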

Step 1: Read the Excel file.

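This step uses the read_excel() function from the pandas library. If you are using this function for the first time, it is worth taking a look at its help documentation.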
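A minimal sketch of the reading step might look like this. The file name `demo_data.xlsx` and the helper `load_sheet` are hypothetical, since the article does not give the actual file name. Reading `.xlsx` files also requires an Excel engine package such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def load_sheet(path, sheet_name=0):
    """Read one worksheet into a DataFrame; empty cells come back as NaN."""
    return pd.read_excel(path, sheet_name=sheet_name)


# "demo_data.xlsx" is a hypothetical file name, not the article's actual file.
demo_path = Path("demo_data.xlsx")
if demo_path.exists():
    df = load_sheet(demo_path)
    print(df.head())
    print(df.isna().sum())  # missing values per column
else:
    print(f"{demo_path} not found; point demo_path at your own workbook")
```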

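A small tip for viewing documentation with help(): first, always qualify the function with the library it belongs to (pd.read_excel, not just read_excel). Second, do not add () after the function name. Parentheses call the function instead of passing it to help(), and because read_excel() requires at least one argument, calling it with none raises an error before help() ever runs.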
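The difference between the two forms can be checked directly. This sketch uses `pydoc.render_doc`, which returns the same documentation text that help() displays, so it can be stored in a variable; the variable names are illustrative only.

```python
import pydoc

import pandas as pd

# Pass the function object itself: no parentheses after the name.
doc_text = pydoc.render_doc(pd.read_excel, renderer=pydoc.plaintext)
print(doc_text.splitlines()[0])

# With parentheses, Python calls read_excel() first. The required `io`
# argument is missing, so a TypeError is raised before help() runs.
error_message = None
try:
    help(pd.read_excel())
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```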

The correct way to view the help documentation for read_excel() is:

import pandas as pd
help(pd.read_excel)
Help on function read_excel in module pandas.io.excel._base:
read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)
    Read an Excel file into a pandas DataFrame.
    
    Supports `xls`, `xlsx`, `xlsm`, `xlsb`, and `odf` file extensions
    read from a local filesystem or URL. Supports an option to read
    a single sheet or a list of sheets.
    
    Parameters
    ----------
    io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: ``file://localhost/path/to/table.xlsx``.
    
        If you want to pass in a path object, pandas accepts any ``os.PathLike``.
    
        By file-like object, we refer to objects with a ``read()`` method,
        such as a file handler (e.g. via builtin ``open`` function)
        or ``StringIO``.
    sheet_name : str, int, list, or None, default 0
        Strings are used for sheet names. Integers are used in zero-indexed
        sheet positions. Lists of strings/integers are used to request
        multiple sheets. Specify None to get all sheets.
    
        Available cases:
    
        * Defaults to ``0``: 1st sheet as a `DataFrame`
        * ``1``: 2nd sheet as a `DataFrame`
        * ``"Sheet1"``: Load sheet with name "Sheet1"
        * ``[0, 1, "Sheet5"]``: Load first, second and sheet named "Sheet5"
          as a dict of `DataFrame`
        * None: All sheets.
    
    header : int, list of int, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. If a list of integers is passed those row positions will
        be combined into a ``MultiIndex``. Use None if there is no header.
    names : array-like, default None
        List of column names to use. If file contains no header row,
        then you should explicitly pass header=None.
    index_col : int, list of int, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``.  If a
        subset of data is selected with ``usecols``, index_col
        is based on the subset.
    usecols : int, str, list-like, or callable default None
        * If