Common pandas methods

  • Read a CSV file
In [1]: import pandas as pd

In [2]: food_info = pd.read_csv("food_info.csv")
  • Inspect the data type of each column in the DataFrame
In [5]: print(food_info.dtypes)
NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64
Carbohydrt_(g)     float64
Fiber_TD_(g)       float64
Sugar_Tot_(g)      float64
Calcium_(mg)       float64
Iron_(mg)          float64
Magnesium_(mg)     float64
Phosphorus_(mg)    float64
Potassium_(mg)     float64
Sodium_(mg)        float64
Zinc_(mg)          float64
Copper_(mg)        float64
Manganese_(mg)     float64
Selenium_(mcg)     float64
Vit_C_(mg)         float64
Thiamin_(mg)       float64
Riboflavin_(mg)    float64
Niacin_(mg)        float64
Vit_B6_(mg)        float64
Vit_B12_(mcg)      float64
Vit_A_IU           float64
Vit_A_RAE          float64
Vit_E_(mg)         float64
Vit_D_mcg          float64
Vit_D_IU           float64
Vit_K_(mcg)        float64
FA_Sat_(g)         float64
FA_Mono_(g)        float64
FA_Poly_(g)        float64
Cholestrl_(mg)     float64
dtype: object
  • View the help documentation for read_csv
In [7]: help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer:Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into DataFrame.

    Also supports optionally iterating or breaking of the file
    into chunks.

    Additional help can be found in the online docs for
    `IO Tools <http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>`_.

    Parameters
    ----------
    filepath_or_buffer : str, path object or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: file://localhost/path/to/table.csv.

        If you want to pass in a path object, pandas accepts any ``os.PathLike``.

        By file-like object, we refer to objects with a ``read()`` method, such as
        a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
    sep : str, default ','
        Delimiter to use. If sep is None, the C engine cannot automatically detect
        the separator, but the Python parsing engine can, meaning the latter will
        be used and automatically detect the separator by Python's builtin sniffer
        tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
        different from ``'\s+'`` will be interpreted as regular expressions and
        will also force the use of the Python parsing engine. Note that regex
        delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
    delimiter : str, default ``None``
        Alias for sep.
    header : int, list of int, default 'infer'
        Row number(s) to use as the column names, and the start of the
        data.  Default behavior is to infer the column names: if no names
        are passed the behavior is identical to ``header=0`` and column
        names are inferred from the first line of the file, if column
        names are passed explicitly then the behavior is identical to
        ``header=None``. Explicitly pass ``header=0`` to be able to
        replace existing names. The header can be a list of integers that
        specify row locations for a multi-index on the columns
        e.g. [0,1,3]. Intervening rows that are not specified will be
        skipped (e.g. 2 in this example is skipped). Note that this
        parameter ignores commented lines and empty lines if
        ``skip_blank_lines=True``, so ``header=0`` denotes the first line of
        data rather than the first line of the file.
    names : array-like, optional
        List of column names to use. If file contains no header row, then you
        should explicitly pass ``header=None``. Duplicates in this list are not
        allowed.
    index_col : int, str, sequence of int / str, or False, default ``None``
      Column(s) to use as the row labels of the ``DataFrame``, either given as
      string name or column index. If a sequence of int / str is given, a
      MultiIndex is used.

      Note: ``index_col=False`` can be used to force pandas to *not* use the first
      column as the index, e.g. when you have a malformed file with delimiters at
      the end of each line.
    usecols : list-like or callable, optional
        Return a subset of the columns. If list-like, all elements must either
        be positional (i.e. integer indices into the document columns) or strings
        that correspond to column names provided either by the user in `names` or
        inferred from the document header row(s). For example, a valid list-like
        `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
        Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
        To instantiate a DataFrame from ``data`` with element order preserved use
        ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
        in ``['foo', 'bar']`` order or
        ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
        for ``['bar', 'foo']`` order.
    ... (remaining help output omitted)
  • View the first three rows of the DataFrame
In [9]: food_info.head(3)
Out[9]:
   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  ...  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
0    1001          BUTTER WITH SALT      15.87         717  ...      51.368       21.021        3.043           215.0
1    1002  BUTTER WHIPPED WITH SALT      15.87         717  ...      50.489       23.426        3.012           219.0
2    1003      BUTTER OIL ANHYDROUS       0.24         876  ...      61.924       28.732        3.694           256.0

[3 rows x 36 columns]
  • View the last three rows of the DataFrame
In [10]: food_info.tail(3)
Out[10]:
      NDB_No         Shrt_Desc  Water_(g)  Energ_Kcal  ...  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
8615   90480        SYRUP CANE       26.0         269  ...       0.000        0.000        0.000             0.0
8616   90560         SNAIL RAW       79.2          90  ...       0.361        0.259        0.252            50.0
8617   93600  TURTLE GREEN RAW       78.5          89  ...       0.127        0.088        0.170            50.0

[3 rows x 36 columns]
  • View the column names of the DataFrame
In [13]: food_info.columns
Out[13]:
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')
  • View the shape of the DataFrame
In [14]: food_info.shape
Out[14]: (8618, 36)
  • View a range of rows (note: ``.loc`` slicing is label-based and inclusive of both endpoints, so ``2:5`` returns 4 rows)
In [18]: food_info.loc[2:5]
Out[18]:
   NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  ...  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
2    1003  BUTTER OIL ANHYDROUS       0.24         876  ...      61.924       28.732        3.694           256.0
3    1004           CHEESE BLUE      42.41         353  ...      18.669        7.778        0.800            75.0
4    1005          CHEESE BRICK      41.11         371  ...      18.764        8.598        0.784            94.0
5    1006           CHEESE BRIE      48.42         334  ...      17.410        8.013        0.826           100.0

[4 rows x 36 columns]
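The inclusive behavior of ``.loc`` above contrasts with the position-based, right-exclusive ``.iloc``. A minimal sketch using a small hypothetical DataFrame (not food_info.csv):

```python
import pandas as pd

# Small illustrative DataFrame with non-default index labels (hypothetical data)
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[2, 3, 4, 5])

# .loc slices by label and INCLUDES both endpoints: labels 2 through 4 (3 rows)
by_label = df.loc[2:4]

# .iloc slices by position and EXCLUDES the right endpoint: positions 0 and 1 (2 rows)
by_position = df.iloc[0:2]
```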
  • Select a single column
In [19]: food_info["Water_(g)"]
Out[19]:
0       15.87
1       15.87
2        0.24
3       42.41
4       41.11
        ...
8613    43.00
8614    70.25
8615    26.00
8616    79.20
8617    78.50
Name: Water_(g), Length: 8618, dtype: float64
  • Select several columns
In [22]: food_info[["Water_(g)","Energ_Kcal"]]
Out[22]:
      Water_(g)  Energ_Kcal
0         15.87         717
1         15.87         717
2          0.24         876
3         42.41         353
4         41.11         371
...         ...         ...
8613      43.00         305
8614      70.25         111
8615      26.00         269
8616      79.20          90
8617      78.50          89

[8618 rows x 2 columns]
  • Convert the DataFrame's column names to a list
In [24]: food_info.columns.tolist()
Out[24]:
['NDB_No',
 'Shrt_Desc',
 'Water_(g)',
 'Energ_Kcal',
 'Protein_(g)',
 'Lipid_Tot_(g)',
 'Ash_(g)',
 'Carbohydrt_(g)',
 'Fiber_TD_(g)',
 'Sugar_Tot_(g)',
 'Calcium_(mg)',
 'Iron_(mg)',
 'Magnesium_(mg)',
 'Phosphorus_(mg)',
 'Potassium_(mg)',
 'Sodium_(mg)',
 'Zinc_(mg)',
 'Copper_(mg)',
 'Manganese_(mg)',
 'Selenium_(mcg)',
 'Vit_C_(mg)',
 'Thiamin_(mg)',
 'Riboflavin_(mg)',
 'Niacin_(mg)',
 'Vit_B6_(mg)',
 'Vit_B12_(mcg)',
 'Vit_A_IU',
 'Vit_A_RAE',
 'Vit_E_(mg)',
 'Vit_D_mcg',
 'Vit_D_IU',
 'Vit_K_(mcg)',
 'FA_Sat_(g)',
 'FA_Mono_(g)',
 'FA_Poly_(g)',
 'Cholestrl_(mg)']
  • Sort by a column
# inplace determines whether to sort the original DataFrame in place or return a new sorted object
# ascending=False changes the default ascending sort to descending
In [33]: food_info.sort_values("Water_(g)", inplace=True, ascending=False)

In [34]: food_info["Water_(g)"]
Out[34]:
4209    100.0
4378    100.0
4348    100.0
4377    100.0
4376    100.0
        ...
6067      NaN
6113      NaN
1983      NaN
7776      NaN
6095      NaN
Name: Water_(g), Length: 8618, dtype: float64
  • Check a column for missing values
# water refers to the column extracted earlier: water = food_info["Water_(g)"]
In [38]: pd.isnull(water)
Out[38]:
4209    False
4378    False
4348    False
4377    False
4376    False
        ...
6067     True
6113     True
1983     True
7776     True
6095     True
Name: Water_(g), Length: 8618, dtype: bool
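Since the boolean mask treats ``True`` as 1, summing it counts the missing values. A small sketch with a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Water_(g) column, with one missing value (hypothetical data)
water = pd.Series([15.87, np.nan, 42.41])

mask = pd.isnull(water)   # element-wise boolean mask of missing values
n_missing = mask.sum()    # True counts as 1, so this counts the NaNs
```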
  • Compute the mean of a column (missing values are skipped automatically)
In [41]: water.mean()
Out[41]: 54.16370993961914
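That NaNs are excluded from the mean (rather than propagating) can be verified on a tiny hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Series.mean() skips NaN by default (skipna=True): (1.0 + 3.0) / 2
m = s.mean()
```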
  • Grouped statistics
    index is the field to group by, values is the field to aggregate, and aggfunc is the aggregation function (this example uses the Titanic dataset, not food_info, and requires ``import numpy as np``)
dataframe.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
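A runnable sketch of the pivot_table call above, using a tiny hypothetical stand-in for the Titanic data:

```python
import pandas as pd

# Four hypothetical passengers: class and survival flag
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2],
    "Survived": [1, 0, 1, 1],
})

# Mean survival rate per class; "mean" is also pivot_table's default aggfunc
table = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
```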
  • Drop rows or columns containing missing values
# drop columns that contain missing values
food_info.dropna(axis=1)
# drop rows with missing values in the given columns (this subset example uses Titanic-style columns)
food_info.dropna(axis=0, subset=["Age", "Sex"])
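The axis semantics can be checked on a small hypothetical DataFrame: ``axis=0`` drops rows, ``axis=1`` drops columns.

```python
import numpy as np
import pandas as pd

# Two hypothetical rows; Age is missing in the second
df = pd.DataFrame({
    "Age":  [22.0, np.nan],
    "Sex":  ["m", "f"],
    "Fare": [7.25, 8.05],
})

# axis=0: drop rows that have NaN in the subset columns -> only the first row survives
rows_kept = df.dropna(axis=0, subset=["Age", "Sex"])

# axis=1: drop any column that contains a NaN -> the Age column is removed
cols_kept = df.dropna(axis=1)
```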
  • Locate a single value
# view the Water_(g) value of the row with index label 83
food_info.loc[83, "Water_(g)"]
  • Reset the index
food_info.reset_index(drop=True)
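After a sort the index keeps its old labels, so resetting is a common follow-up. A minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2]})

# Sorting reorders rows but keeps the original index labels (1, 2, 0)
df = df.sort_values("x")

# drop=True discards the old index instead of keeping it as a new column
df = df.reset_index(drop=True)
```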
  • Apply a custom function
# apply() calls fun once per column by default (axis=0), passing each column as a Series;
# here it returns the value at index label 99 from every column
def fun(column):
    item = column.loc[99]
    return item

food_info.apply(fun)
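The column-wise behavior of ``apply`` can be seen on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# apply() with the default axis=0 receives one whole column (a Series) per call,
# so this computes the maximum of each column
maxima = df.apply(lambda col: col.max())
```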