pandas读取与存储操作详解

项目github地址:bitcarmanlee easy-algorithm-interview-and-practice
欢迎大家star,留言,一起学习进步

1.前言

pandas可以将读取到的数据(不一定是csv或者txt)转换成dataframe,然后后面可以方便地对dataframe进行操作,进行各种数据分析工作。下面我们对pandas里常用的一些IO操作进行详细的分析。

2.read_csv

read_csv最常用的方式是从文件中读取数据,read_csv默认的分隔符号是逗号
示例数据:

57647:0.059819,26223:0.048002,100295:0.055268,60232:0.049508
35824:0.04753,57776:0.055802,40677:0.049119,14445:0.040235
102136:0.052933,3736:0.07613,21681:0.10266,44816:0.058018

不指定列名称

def read11():
    data = pd.read_csv("../data/tt3", header=None)
    print data
                 0               1                2               3
0   57647:0.059819  26223:0.048002  100295:0.055268  60232:0.049508
1    35824:0.04753  57776:0.055802   40677:0.049119  14445:0.040235
2  102136:0.052933    3736:0.07613    21681:0.10266  44816:0.058018

指定列名称

def read12():
    data = pd.read_csv("../data/tt3", header=None, names=['c1', 'c2', 'c3', 'c4'])
    print data
                c1              c2               c3              c4
0   57647:0.059819  26223:0.048002  100295:0.055268  60232:0.049508
1    35824:0.04753  57776:0.055802   40677:0.049119  14445:0.040235
2  102136:0.052933    3736:0.07613    21681:0.10266  44816:0.058018

上面的数据都不带表头,所以设置header=None

3.read_table

read_csv默认的分隔符是逗号,如果想改变分隔符,可以用read_table指定分隔符

数据

8803b236442fed8a37a5beb04556f684	54530	1	0
f4afa2b8f50e8aacf967628b3dc11bff	102017	1	0
4077bb0001ba73f4dd966a7f6fb46075	50276	1	0

代码

def read21():
    data = pd.read_table("../data/tt1", header=None, sep='\t')
    print data


def read22():
    data = pd.read_table("../data/tt1", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4'])
    print data


read21()
read22()
 FutureWarning: read_table is deprecated, use read_csv instead.
  data = pd.read_table("../data/tt1", header=None, sep='\t')
                                  0       1  2  3
0  8803b236442fed8a37a5beb04556f684   54530  1  0
1  f4afa2b8f50e8aacf967628b3dc11bff  102017  1  0
2  4077bb0001ba73f4dd966a7f6fb46075   50276  1  0
FutureWarning: read_table is deprecated, use read_csv instead.
  data = pd.read_table("../data/tt1", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4'])
                                 c1      c2  c3  c4
0  8803b236442fed8a37a5beb04556f684   54530   1   0
1  f4afa2b8f50e8aacf967628b3dc11bff  102017   1   0
2  4077bb0001ba73f4dd966a7f6fb46075   50276   1   0

根据提示来看,read_table方法后面会被废弃,统一用read_csv方法。

4.设置索引

def read31():
    data = pd.read_csv("../data/tt3", header=None, names=['c1', 'c2', 'c3', 'c4'], index_col='c2')
    print data


read31()
                             c1               c3              c4
c2                                                              
26223:0.048002   57647:0.059819  100295:0.055268  60232:0.049508
57776:0.055802    35824:0.04753   40677:0.049119  14445:0.040235
3736:0.07613    102136:0.052933    21681:0.10266  44816:0.058018

跟之前的结果对比,如果设置了index_col来设置列索引,原来默认从0开始的整数索引不见了。之前没设置索引的时候,默认是按行号从0开始设置索引的。

def read32():
    data = pd.read_csv("../data/tt3", header=None, names=['c1', 'c2', 'c3', 'c4'], index_col=['c2', 'c1'])
    print data

                                             c3              c4
c2             c1                                              
26223:0.048002 57647:0.059819   100295:0.055268  60232:0.049508
57776:0.055802 35824:0.04753     40677:0.049119  14445:0.040235
3736:0.07613   102136:0.052933    21681:0.10266  44816:0.058018

上面的例子为指定多个列为索引

5.缺失值删除

实际数据肯定会比较脏,不会特别干净,有缺失值的情况很常见,填充缺失值就成了数据预处理中一项很重要的工作。
pandas在读取文件时,默认会将NA, NULL等特殊字符串当成缺失值,默认会使用NaN进行替换。

数据如下

8803b236442fed8a37a5beb04556f684	54530	1	0	abc	NULL
f4afa2b8f50e8aacf967628b3dc11bff	102017	1	0	NULL	NULL
4077bb0001ba73f4dd966a7f6fb46075	50276	1	0	NULL	NULL
4077bb0001ba73f4dd966a7f6fb46075	50276	1	0	456	123
def read41():
    data = pd.read_table("../data/tt2", header=None, sep='\t')
    print data
                                  0       1  2  3    4      5
0  8803b236442fed8a37a5beb04556f684   54530  1  0  abc    NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017  1  0  NaN    NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276  1  0  NaN    NaN
3  4077bb0001ba73f4dd966a7f6fb46075   50276  1  0  456  123.0

5.1删除含有缺失值的行与列

def read42():
    data = pd.read_table("../data/tt2", header=None, sep='\t')
    print data.dropna()
                                  0      1  2  3    4      5
3  4077bb0001ba73f4dd966a7f6fb46075  50276  1  0  456  123.0

dropna删除含有缺失值的行。如果想删除含有缺失值的列,可以指定axis=1

def read42():
    data = pd.read_table("../data/tt2", header=None, sep='\t')
    print data.dropna(axis=1)
                                  0       1  2  3
0  8803b236442fed8a37a5beb04556f684   54530  1  0
1  f4afa2b8f50e8aacf967628b3dc11bff  102017  1  0
2  4077bb0001ba73f4dd966a7f6fb46075   50276  1  0
3  4077bb0001ba73f4dd966a7f6fb46075   50276  1  0

5.2删除全为NaN的行或者列

数据如下

8803b236442fed8a37a5beb04556f684	54530	1	0	abc	NULL
f4afa2b8f50e8aacf967628b3dc11bff	102017	1	0	NULL	NULL
4077bb0001ba73f4dd966a7f6fb46075	50276	1	0	NULL	NULL
NULL	NaN	NULL	NULL	NULL	NULL
def read43():
    data = pd.read_table("../data/tt2", header=None, sep='\t')
    print data.dropna(how='all')
    print data.dropna(how='all', axis=1)
                                  0         1    2    3    4   5
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  NaN NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  NaN NaN
                                  0         1    2    3    4
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  NaN
3                               NaN       NaN  NaN  NaN  NaN

6.缺失值填充

更多的时候,我们是需要对缺失值进行填充而不是删除,下面看看怎么填充缺失值。

数据

8803b236442fed8a37a5beb04556f684	54530	1	0	abc	NULL
f4afa2b8f50e8aacf967628b3dc11bff	102017	1	0	NULL	NULL
4077bb0001ba73f4dd966a7f6fb46075	50276	1	0	NULL	NULL
NULL	NaN	NULL	NULL	NULL	NULL

6.1 所有缺失值按相同值填充

def read51():
    data = pd.read_table("../data/tt2", header=None, sep='\t')
    print data.fillna(0)
                                  0         1    2    3    4    5
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc  0.0
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0    0  0.0
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0    0  0.0
3                                 0       0.0  0.0  0.0    0  0.0

6.2 不同列填充不同值

上面的填充方式太过简单粗暴,实际中我们一般不会这么干,而是会按照不同列填充会比较多。

def read52():
    data = pd.read_table("../data/tt2", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
    print data.fillna({'c3': 'c3default', 'c4': 'c4default'})
                                 c1        c2         c3         c4   c5  c6
0  8803b236442fed8a37a5beb04556f684   54530.0          1          0  abc NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0          1          0  NaN NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0          1          0  NaN NaN
3                               NaN       NaN  c3default  c4default  NaN NaN

6.3 前向填充与后向填充

def read53():
    data = pd.read_table("../data/tt2", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
    print data.fillna(method='ffill')
    print data.fillna(method='bfill')
                                 c1        c2   c3   c4   c5  c6
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  abc NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  abc NaN
3  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  abc NaN
                                 c1        c2   c3   c4   c5  c6
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  NaN NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  NaN NaN
3                               NaN       NaN  NaN  NaN  NaN NaN

ffill为前向填充,使用默认是上一行的值,设置axis=1可以使用列进行填充
bfill为后向填充,使用下一行的值,不存在的时候就不填充

6.4 使用列均值填充

def read54():
    data = pd.read_table("../data/tt2", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
    print data.fillna(data.mean())
                                 c1        c2   c3   c4   c5  c6
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  NaN NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  NaN NaN
3                               NaN   68941.0  1.0  0.0  NaN NaN

7.跳过某些行

8803b236442fed8a37a5beb04556f684	54530	1	0	abc	NULL
test
f4afa2b8f50e8aacf967628b3dc11bff	102017	1	0	NULL	NULL
test
4077bb0001ba73f4dd966a7f6fb46075	50276	1	0	NULL	NULL
NULL	NaN	NULL	NULL	NULL	NULL
def read61():
    data1 = pd.read_table("../data/tt2", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
    print data1
    data2 = pd.read_table("../data/tt2", header=None, sep='\t', names=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'], skiprows=[1, 3])
    print data2
                                 c1        c2   c3   c4   c5  c6
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc NaN
1                              test       NaN  NaN  NaN  NaN NaN
2  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  NaN NaN
3                              test       NaN  NaN  NaN  NaN NaN
4  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  NaN NaN
5                               NaN       NaN  NaN  NaN  NaN NaN

                                 c1        c2   c3   c4   c5  c6
0  8803b236442fed8a37a5beb04556f684   54530.0  1.0  0.0  abc NaN
1  f4afa2b8f50e8aacf967628b3dc11bff  102017.0  1.0  0.0  NaN NaN
2  4077bb0001ba73f4dd966a7f6fb46075   50276.0  1.0  0.0  NaN NaN
3                               NaN       NaN  NaN  NaN  NaN NaN

8.读取json

数据为

{
  "apples": {
      "June": 3,
      "Robert": 2,
      "Lily": 0,
      "David": 1
  },
  "oranges": {
      "June": 0,
      "Robert": 3,
      "Lily": 7,
      "David": 2
  }
}
def read71():
    data = pd.read_json("../data/tt2")
    print data
        apples  oranges
David        1        2
June         3        0
Lily         0        7
Robert       2        3
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值