A roundup of pandas read_csv (names, skiprows, nrows, chunksize): details of functions commonly used in competitions

gaozhanfire

Article link: https://blog.csdn.net/gaozhanfire/article/details/95648555

The read_csv function

import pandas as pd

Data files used in this article

head.csv (has a string header row; the id column can also be used as the index for experiments)

id,shuju,label
1,3,postive
2,7,negative
5,7,postive
6,8,postive
3,5,negative

fff.csv

9,6
1,3
2,4
3,5
4,6
5,7

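The header parameter in detail

When the header's type differs from the type of the data below it, e.g. the header is strings and the data is numbers

When header is not set (default)

# As you can see, the row of strings is used as the header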
a=pd.read_csv("head.csv")
a

	id	shuju	label
0	1	3	postive
1	2	7	negative
2	5	7	postive
3	6	8	postive
4	3	5	negative

When header is set to None

# As you can see, even the row of strings is no longer treated as the header; it becomes the first data row
a=pd.read_csv("head.csv",header=None)
a

	0	1	2
0	id	shuju	label
1	1	3	postive
2	2	7	negative
3	5	7	postive
4	6	8	postive
5	3	5	negative

When there is no header, or the header has the same type as the CSV data

With header left at its default

# As you can see, the first row is used directly as the header
a=pd.read_csv("fff.csv")
a

	9	6
0	1	3
1	2	4
2	3	5
3	4	6
4	5	7

With header=None

# As you can see, with header=None the first row stays in the data and is not used as the header
a=pd.read_csv("fff.csv",header=None)
a

	0	1
0	9	6
1	1	3
2	2	4
3	3	5
4	4	6
5	5	7

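As shown, when the first row has the same type as the rest of the CSV, reading the file with default settings still makes the first row the header. pandas does not inspect types here: the default is header='infer', which uses the first row as the header whenever names is not given.
Adding header=None keeps the first row as data and labels the columns with the default integers 0, 1, ...
So header=None tells pandas that no row of the file is a header: the row that would otherwise have become the header stays in the data, and the column labels start from 0.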
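To make the comparison above easy to reproduce, here is a minimal sketch. It assumes the contents of fff.csv are held in an in-memory string (FFF_CSV and the use of io.StringIO are illustrative choices, not part of the original experiments).

```python
import io
import pandas as pd

# In-memory copy of fff.csv, used only to keep the sketch self-contained.
FFF_CSV = "9,6\n1,3\n2,4\n3,5\n4,6\n5,7\n"

# Default: header="infer", so the first row becomes the header.
default_df = pd.read_csv(io.StringIO(FFF_CSV))
# header=0 states the same thing explicitly.
explicit_df = pd.read_csv(io.StringIO(FFF_CSV), header=0)
# header=None: no row is a header, pandas generates integer labels.
no_header_df = pd.read_csv(io.StringIO(FFF_CSV), header=None)

print(list(default_df.columns))    # labels are the strings taken from the first row
print(list(no_header_df.columns))  # labels are integers generated by pandas
print(len(default_df), len(no_header_df))
```

With the default settings the frame has one row fewer, because the first line was consumed as the header.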

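The names parameter

The two code blocks below read fff.csv without and with names: names replaces the default labels 0, 1 with the labels you supply.
Note that once names is set and header is left unset, header is treated as None, so the first row stays in the data. Explicitly passing header=0 together with names instead discards the file's first row and uses names in its place.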

a=pd.read_csv("fff.csv",header=None)
a

	0	1
0	9	6
1	1	3
2	2	4
3	3	5
4	4	6
5	5	7

a=pd.read_csv("fff.csv",header=None,names=['a','b'])
a

	a	b
0	9	6
1	1	3
2	2	4
3	3	5
4	4	6
5	5	7
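Both calls above pass header=None explicitly. The following sketch, which again assumes an in-memory copy of fff.csv (FFF_CSV is an illustrative name), shows what happens when names is given and header is either omitted or set to 0.

```python
import io
import pandas as pd

# In-memory copy of fff.csv, used only to keep the sketch self-contained.
FFF_CSV = "9,6\n1,3\n2,4\n3,5\n4,6\n5,7\n"

# names given, header omitted: behaves like header=None, the first row stays as data.
names_only = pd.read_csv(io.StringIO(FFF_CSV), names=["a", "b"])

# names given, header=0: the first row of the file is dropped and replaced by names.
names_header0 = pd.read_csv(io.StringIO(FFF_CSV), names=["a", "b"], header=0)

print(names_only.head(2))
print(names_header0.head(2))
```

The second form is the usual way to rename the columns of a file that already has a header row.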

The skiprows parameter

head.csv (has a string header row; the id column can also be used as the index for experiments)

id,shuju,label
1,3,postive
2,7,negative
5,7,postive
6,8,postive
3,5,negative

fff.csv

9,6
1,3
2,4
3,5
4,6
5,7

pd.read_csv("head.csv",skiprows=2,header=None)

	0	1	2
0	2	7	negative
1	5	7	postive
2	6	8	postive
3	3	5	negative

pd.read_csv("head.csv",skiprows=2,header=None,names=['a','b','c'])

	a	b	c
0	2	7	negative
1	5	7	postive
2	6	8	postive
3	3	5	negative

pd.read_csv("fff.csv",skiprows=2,header=None)

	0	1
0	2	4
1	3	5
2	4	6
3	5	7

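Comparing the results of the snippets above:
whether or not the file has a header, skiprows=2 starts reading at the third line of the file (that is, it skips the first two lines).
For a file with a header, this works because the first line, id,shuju,label, is counted as a line like any other.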
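If the header should survive while data lines are skipped, skiprows also accepts a list of 0-based line numbers. A minimal sketch, assuming an in-memory copy of head.csv (HEAD_CSV is an illustrative name):

```python
import io
import pandas as pd

# In-memory copy of head.csv, used only to keep the sketch self-contained.
HEAD_CSV = (
    "id,shuju,label\n"
    "1,3,postive\n"
    "2,7,negative\n"
    "5,7,postive\n"
    "6,8,postive\n"
    "3,5,negative\n"
)

# Skip file lines 1 and 2 (0-based); line 0, the header, is kept.
kept_header = pd.read_csv(io.StringIO(HEAD_CSV), skiprows=[1, 2])

print(kept_header)
```

Here the integer form skiprows=2 would have dropped the header line together with the first data line, whereas the list form drops exactly the lines it names.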

The nrows parameter

This parameter is very practical: when a file is very large, you can use it to read just the first few rows.
head.csv

id,shuju,label
1,3,postive
2,7,negative
5,7,postive
6,8,postive
3,5,negative

fff.csv

9,6
1,3
2,4
3,5
4,6
5,7

With a string header

pd.read_csv("head.csv",nrows=2,header=None)

	0	1	2
0	id	shuju	label
1	1	3	postive

The header line is read too: with header=None it is an ordinary row and counts toward nrows.

Without a string header

pd.read_csv("fff.csv",nrows=2,header=None)

	0	1
0	9	6
1	1	3
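Both results above pass header=None. As a complement, this sketch compares that with the default header on an in-memory copy of head.csv (HEAD_CSV is an illustrative name): when the first line is consumed as the header, nrows counts data rows only.

```python
import io
import pandas as pd

# In-memory copy of head.csv, used only to keep the sketch self-contained.
HEAD_CSV = (
    "id,shuju,label\n"
    "1,3,postive\n"
    "2,7,negative\n"
    "5,7,postive\n"
    "6,8,postive\n"
    "3,5,negative\n"
)

# header=None: the header line is an ordinary row and uses up one of the two rows.
raw_rows = pd.read_csv(io.StringIO(HEAD_CSV), nrows=2, header=None)

# Default header: the first line becomes the column labels, then two data rows are read.
data_rows = pd.read_csv(io.StringIO(HEAD_CSV), nrows=2)

print(raw_rows)
print(data_rows)
```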

Combining nrows and skiprows

head.csv

id,shuju,label
1,3,postive
2,7,negative
5,7,postive
6,8,postive
3,5,negative

fff.csv

9,6
1,3
2,4
3,5
4,6
5,7

pd.read_csv("head.csv",nrows=2,skiprows=3,header=None)

	0	1	2
0	5	7	postive
1	6	8	postive

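This shows that pandas first skips the three lines
id,shuju,label
1,3,postive
2,7,negative
and only then reads nrows rows.
So when a file has a header and you want to skip the first 500 records of its content (not counting the header) and then read 5000 records, remember that skiprows counts the header line as well: with header=None you need skiprows=501, not 500.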
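The "skip some records, then take some" pattern can be sketched on a small synthetic file. Everything here is an illustrative assumption: the 20-row data, the column names id and value, and the variables n_skip and n_take, which stand in for 500 and 5000.

```python
import io
import pandas as pd

# Synthetic CSV: a header line followed by 20 records with id = 1..20.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1, 21))

n_skip, n_take = 5, 10  # stand-ins for 500 and 5000

# Option 1: keep the header by skipping only file lines 1..n_skip (0-based).
with_header = pd.read_csv(
    io.StringIO(csv_text), skiprows=range(1, n_skip + 1), nrows=n_take
)

# Option 2: skip the header line too (n_skip + 1 lines) and name the columns by hand.
manual_names = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=n_skip + 1,
    nrows=n_take,
    header=None,
    names=["id", "value"],
)

print(with_header["id"].tolist())
print(manual_names["id"].tolist())
```

Both options should return the same records. The first avoids the off-by-one adjustment because the header line is never among the skipped lines.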

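One last point: header and names are applied after the other parameters have taken effect. For example, once skiprows has skipped its lines, the header is determined from the data that remains.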
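A small sketch of this ordering, assuming an in-memory copy of head.csv (HEAD_CSV is an illustrative name): with the default header and skiprows=2, the header is taken from the first line that remains after skipping, not from the first line of the file.

```python
import io
import pandas as pd

# In-memory copy of head.csv, used only to keep the sketch self-contained.
HEAD_CSV = (
    "id,shuju,label\n"
    "1,3,postive\n"
    "2,7,negative\n"
    "5,7,postive\n"
    "6,8,postive\n"
    "3,5,negative\n"
)

# Two lines are skipped first; the third line of the file then becomes the header.
after_skip = pd.read_csv(io.StringIO(HEAD_CSV), skiprows=2)

print(list(after_skip.columns))
print(len(after_skip))
```

This is why the earlier examples pass header=None (optionally with names) whenever skiprows removes the real header line.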

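The chunksize parameter

When this parameter is set, read_csv returns an iterator instead of a DataFrame, so the data can be read in batches.
Each iteration yields the next chunksize rows of the file's content (the header is not counted); the last batch may be shorter.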
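A minimal sketch of batch reading, assuming an in-memory copy of head.csv (HEAD_CSV, chunk_sizes and ids are illustrative names). Each chunk is an ordinary DataFrame that carries the header's column labels.

```python
import io
import pandas as pd

# In-memory copy of head.csv, used only to keep the sketch self-contained.
HEAD_CSV = (
    "id,shuju,label\n"
    "1,3,postive\n"
    "2,7,negative\n"
    "5,7,postive\n"
    "6,8,postive\n"
    "3,5,negative\n"
)

chunk_sizes = []
ids = []

# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(HEAD_CSV), chunksize=2):
    chunk_sizes.append(len(chunk))   # rows in this batch
    ids.extend(chunk["id"].tolist()) # process the batch, here just collecting ids

print(chunk_sizes)
print(ids)
```

On a large file the loop body would typically filter or aggregate each chunk so that the whole file never has to sit in memory at once.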
