Python: pandas read_csv

qq_39149099

Tags: Python, pandas

Article link: https://blog.csdn.net/qq_39149099/article/details/103983795

References:

https://www.cnblogs.com/happymeng/p/10481293.html

https://pandas.pydata.org/pandas-docs/version/0.24/reference/api/pandas.read_csv.html#pandas.read_csv

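This section explains the parameters of the read_csv function in the pandas module. read_csv is mainly used to read CSV files and delimited text (.txt) files. Its only required argument is the file path; all other parameters are optional. The most commonly used ones are: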

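1, sep: the column delimiter. The default is "," (it is read_table, not read_csv, that defaults to "\t").

2, header: the row used as column titles. By default row 0 is the header; header=n uses row n (counting from 0), and header=None means the file has no header row.

3, prefix: a prefix for the auto-generated column titles, effective only when there is no header (header=None). Note that prefix was deprecated in pandas 1.4 and removed in pandas 2.0, so it applies only to older versions such as the 0.24 release referenced above.

4, skiprows: skip the first n lines of the file

5, skipfooter: skip the last n lines of the file (this requires the Python parsing engine, engine='python')

6, nrows: read only the first n rows of data

7, encoding: the text encoding of the file, usually "utf-8"

8, converters: a dict mapping columns to functions that process their values; a column can be identified either by its position or by its name

9, names: the names to assign to the columns

10, usecols: read only the specified columns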
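To make the row-selection parameters above more concrete, here is a minimal sketch. The in-memory text, its ";" delimiter, and the column names id, date and note are invented for illustration and do not come from the original article; io.StringIO stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical data: two junk lines, a header row, four records, one footer line.
raw = (
    "exported by some tool\n"
    "generated 2019-03-22\n"
    "id;date;note\n"
    "1;2019-03-22;a\n"
    "2;2019-03-22;b\n"
    "3;2019-03-22;c\n"
    "4;2017-03-22;d\n"
    "total: 4 rows"
)

# skiprows=2 drops the two junk lines, so the header is the first remaining line.
# skipfooter=1 drops the trailing summary line; it needs the Python engine.
df_all = pd.read_csv(io.StringIO(raw), sep=";", skiprows=2,
                     skipfooter=1, engine="python")

# nrows=2 reads only the first two data rows after the header.
df_head = pd.read_csv(io.StringIO(raw), sep=";", skiprows=2, nrows=2)

# header=None treats every remaining line as data; names supplies the labels.
# Here the original header line is skipped as well (skiprows=3).
df_named = pd.read_csv(io.StringIO(raw), sep=";", skiprows=3, skipfooter=1,
                       header=None, names=["id", "date", "note"],
                       engine="python")

print(df_all)
print(df_head)
print(df_named)
```

The first and third calls should yield the same four records; they differ only in whether the column labels come from the file or from names.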

These are the main parameters covered here; for the rest, see the two links above. A few small examples follow.

The converters parameter. The contents of the txt file are as follows:

thing.txt
1 2019-03-22 00:06:24.4463094 中文测试
2 2019-03-22 00:06:32.4565680 需要编辑encoding
3 2019-03-22 00:06:32.4565680 ashshsh
4 2017-03-22 00:06:32.8041945 eggg

Requirement: read this txt file into pandas and drop the fractional part of the seconds, e.g. 00:06:24.4463094 becomes 00:06:24. The script is as follows:

import pandas as pd


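def func(strings):
    # Keep only the text before the first '.', dropping the fractional seconds
    return strings.split('.')[0]


filepath = r'D:\测试目录\thing.txt'
# Column 2 (the time field) is passed through func while the file is parsed
df = pd.read_csv(filepath, sep=' ', header=None, converters={2: func})
print(df)
print(df.shape)
"""
Output: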
   0           1         2             3
0  1  2019-03-22  00:06:24          中文测试
1  2  2019-03-22  00:06:32  需要编辑encoding
2  3  2019-03-22  00:06:32       ashshsh
3  4  2017-03-22  00:06:32          eggg
(4, 4)
"""

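converters may contain several key-value pairs, so different columns can be processed with the same function or with different ones. For example, the following script applies a function to the date column and another to the time column: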

import pandas as pd


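def func(strings):
    # Keep only the text before the first '.', dropping the fractional seconds
    return strings.split('.')[0]


def fun(dates):
    # Keep only the year, i.e. the text before the first '-'
    return dates.split('-')[0]


filepath = r'D:\工作\逾重行李管理系统\rpex_auto_test\rpex_i_automatic_test\测试目录\thing.txt'
# Column names: 序列 = sequence, 日期 = date, 时间 = time, 备注 = note
df = pd.read_csv(filepath, sep=' ', header=None, names=['序列', '日期', '时间', '备注'], converters={1: fun, 2: func})
print(df)
print(df.shape)
"""
Output: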
   序列    日期        时间            备注
0   1  2019  00:06:24          中文测试
1   2  2019  00:06:32  需要编辑encoding
2   3  2019  00:06:32       ashshsh
3   4  2017  00:06:32          eggg
(4, 4)
"""
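As noted in the parameter list, converters can be keyed by column name as well as by position. The following is a minimal sketch of the name-keyed form. It reads the same four sample rows from an in-memory string instead of a file on disk, and the English column names seq, date, time and note are assumptions made for this illustration.

```python
import io

import pandas as pd

# The sample rows from thing.txt, held in memory so the sketch is self-contained.
raw = (
    "1 2019-03-22 00:06:24.4463094 中文测试\n"
    "2 2019-03-22 00:06:32.4565680 需要编辑encoding\n"
    "3 2019-03-22 00:06:32.4565680 ashshsh\n"
    "4 2017-03-22 00:06:32.8041945 eggg\n"
)

# Hypothetical English column names, used as the converter keys below.
columns = ["seq", "date", "time", "note"]

df = pd.read_csv(
    io.StringIO(raw),
    sep=" ",
    header=None,
    names=columns,
    converters={
        "date": lambda s: s.split("-")[0],  # keep only the year
        "time": lambda s: s.split(".")[0],  # drop the fractional seconds
    },
)
print(df)
print(df.shape)
```

Keying by name tends to be more robust than keying by position when the column order of the input may change.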

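Using names together with usecols: first give every column a name through names, then select the wanted columns with usecols. The script that names the columns is as follows: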

import pandas as pd


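def func(strings):
    # Keep only the text before the first '.', dropping the fractional seconds
    return strings.split('.')[0]


filepath = r'D:\工作\逾重行李管理系统\rpex_auto_test\rpex_i_automatic_test\测试目录\thing.txt'
# names labels all four columns; no column is filtered out yet
df = pd.read_csv(filepath, sep=' ', header=None, names=['序列', '日期', '时间', '备注'], converters={2: func})
print(df)
print(df.shape)
"""
Output: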
   序列          日期        时间            备注
0   1  2019-03-22  00:06:24          中文测试
1   2  2019-03-22  00:06:32  需要编辑encoding
2   3  2019-03-22  00:06:32       ashshsh
3   4  2017-03-22  00:06:32          eggg
(4, 4)
"""

Use usecols to keep only the specified columns, 日期 (date) and 时间 (time):

import pandas as pd


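def func(strings):
    # Keep only the text before the first '.', dropping the fractional seconds
    return strings.split('.')[0]


filepath = r'D:\工作\逾重行李管理系统\rpex_auto_test\rpex_i_automatic_test\测试目录\thing.txt'
# usecols refers to the labels assigned by names, so only those two columns are kept
df = pd.read_csv(filepath, sep=' ', header=None, names=['序列', '日期', '时间', '备注'], converters={2: func},
                 usecols=['日期', '时间'])
print(df)
print(df.shape)
"""
Output: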
           日期        时间
0  2019-03-22  00:06:24
1  2019-03-22  00:06:32
2  2019-03-22  00:06:32
3  2017-03-22  00:06:32
(4, 2)
"""
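Besides a list of column names, usecols also accepts a list of integer positions or a callable that is evaluated against each column name. The sketch below shows both forms on a small comma-separated sample that has its own header row; the data and the column names seq, date, time and note are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical comma-separated data with a header row.
raw = (
    "seq,date,time,note\n"
    "1,2019-03-22,00:06:24,a\n"
    "2,2019-03-22,00:06:32,b\n"
)

# Select columns by zero-based position: 1 is date, 2 is time.
by_position = pd.read_csv(io.StringIO(raw), usecols=[1, 2])

# Select columns with a callable; a column is kept when it returns True.
by_callable = pd.read_csv(
    io.StringIO(raw),
    usecols=lambda col: col in {"date", "time"},
)

print(by_position)
print(by_callable)
```

Both calls should return the same two columns, so the choice is mostly a matter of whether positions or names are more stable in the input files.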