3.5 Reading and Writing Files
- Objectives:
  - Know the file-reading operations pandas offers
  - Read and store files using the CSV and HDF5 formats
- Contents:
  - 3.5.1 CSV
  - 3.5.2 HDF5
  - 3.5.3 JSON
  - 3.5.4 Further Notes
  - 3.5.5 Summary
pandas supports reading and writing many file formats; this section covers CSV, HDF5, and JSON.
3.5.1 CSV
1. Reading CSV files — read_csv()
- pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None)
  - filepath_or_buffer: path to the file
  - usecols: the columns to read, given as a list of column names
  - index_col: the column to use as the row index
import pandas as pd

pd.read_csv('stock.csv', usecols=['trade_date', 'high', 'low', 'open', 'close'], index_col=0)
| trade_date | close | open | high | low |
|---|---|---|---|---|
| 20200313 | 2887.4265 | 2804.2322 | 2910.8812 | 2799.9841 |
| 20200312 | 2923.4856 | 2936.0163 | 2944.4651 | 2906.2838 |
| 20200311 | 2968.5174 | 3001.7616 | 3010.0286 | 2968.5174 |
| 20200310 | 2996.7618 | 2918.9347 | 3000.2963 | 2904.7989 |
| 20200309 | 2943.2907 | 2987.1805 | 2989.2051 | 2940.7138 |
| ... | ... | ... | ... | ... |
| 19910719 | 136.7000 | 137.6600 | 138.5400 | 136.6600 |
| 19910718 | 137.1700 | 137.1700 | 137.1700 | 135.8100 |
| 19910717 | 135.8100 | 135.8100 | 135.8100 | 135.3900 |
| 19910716 | 134.4700 | 134.3900 | 134.4700 | 133.1400 |
| 19910715 | 133.1400 | 133.9000 | 134.1000 | 131.8700 |
7002 rows × 4 columns
By default read_csv treats the first row of the file as the column index. When the file has no header row, this means the first row of actual data becomes the column index, which is clearly wrong. In that case, pass the names parameter to set the column index manually.
pd.read_csv('stock2.csv')
| | 20200313 | 2887.4265 | 2804.2322 | 2910.8812 | 2799.9841 | 2923.4856 | -36.0591 | -1.2334 | 366450436 | 393019665.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20200312 | 2923.4856 | 2936.0163 | 2944.4651 | 2906.2838 | 2968.5174 | -45.0318 | -1.5170 | 307778457.0 | 3.282092e+08 |
| 1 | 20200311 | 2968.5174 | 3001.7616 | 3010.0286 | 2968.5174 | 2996.7618 | -28.2444 | -0.9425 | 352470970.0 | 3.787666e+08 |
| 2 | 20200310 | 2996.7618 | 2918.9347 | 3000.2963 | 2904.7989 | 2943.2907 | 53.4711 | 1.8167 | 393296648.0 | 4.250172e+08 |
| 3 | 20200309 | 2943.2907 | 2987.1805 | 2989.2051 | 2940.7138 | 3034.5113 | -91.2206 | -3.0061 | 414560736.0 | 4.381439e+08 |
| 4 | 20200306 | 3034.5113 | 3039.9395 | 3052.4439 | 3029.4632 | 3071.6771 | -37.1658 | -1.2100 | 362061533.0 | 3.773885e+08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6996 | 19910719 | 136.7000 | 137.6600 | 138.5400 | 136.6600 | 137.1700 | -0.4700 | -0.3426 | 10823.0 | 5.242826e+03 |
| 6997 | 19910718 | 137.1700 | 137.1700 | 137.1700 | 135.8100 | 135.8100 | 1.3600 | 1.0014 | 847.0 | 4.644160e+02 |
| 6998 | 19910717 | 135.8100 | 135.8100 | 135.8100 | 135.3900 | 134.4700 | 1.3400 | 0.9965 | 660.0 | 3.975240e+02 |
| 6999 | 19910716 | 134.4700 | 134.3900 | 134.4700 | 133.1400 | 133.1400 | 1.3300 | 0.9989 | 2796.0 | 1.328502e+03 |
| 7000 | 19910715 | 133.1400 | 133.9000 | 134.1000 | 131.8700 | 132.8000 | 0.3400 | 0.2560 | 11938.0 | 5.534900e+03 |
7001 rows × 10 columns
pd.read_csv('stock2.csv', names=['trade_date', 'close', 'open', 'high', 'low', 'pre_close', 'change', 'pct_chg', 'vol', 'amount'])
| | trade_date | close | open | high | low | pre_close | change | pct_chg | vol | amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20200313 | 2887.4265 | 2804.2322 | 2910.8812 | 2799.9841 | 2923.4856 | -36.0591 | -1.2334 | 366450436.0 | 3.930197e+08 |
| 1 | 20200312 | 2923.4856 | 2936.0163 | 2944.4651 | 2906.2838 | 2968.5174 | -45.0318 | -1.5170 | 307778457.0 | 3.282092e+08 |
| 2 | 20200311 | 2968.5174 | 3001.7616 | 3010.0286 | 2968.5174 | 2996.7618 | -28.2444 | -0.9425 | 352470970.0 | 3.787666e+08 |
| 3 | 20200310 | 2996.7618 | 2918.9347 | 3000.2963 | 2904.7989 | 2943.2907 | 53.4711 | 1.8167 | 393296648.0 | 4.250172e+08 |
| 4 | 20200309 | 2943.2907 | 2987.1805 | 2989.2051 | 2940.7138 | 3034.5113 | -91.2206 | -3.0061 | 414560736.0 | 4.381439e+08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6997 | 19910719 | 136.7000 | 137.6600 | 138.5400 | 136.6600 | 137.1700 | -0.4700 | -0.3426 | 10823.0 | 5.242826e+03 |
| 6998 | 19910718 | 137.1700 | 137.1700 | 137.1700 | 135.8100 | 135.8100 | 1.3600 | 1.0014 | 847.0 | 4.644160e+02 |
| 6999 | 19910717 | 135.8100 | 135.8100 | 135.8100 | 135.3900 | 134.4700 | 1.3400 | 0.9965 | 660.0 | 3.975240e+02 |
| 7000 | 19910716 | 134.4700 | 134.3900 | 134.4700 | 133.1400 | 133.1400 | 1.3300 | 0.9989 | 2796.0 | 1.328502e+03 |
| 7001 | 19910715 | 133.1400 | 133.9000 | 134.1000 | 131.8700 | 132.8000 | 0.3400 | 0.2560 | 11938.0 | 5.534900e+03 |
7002 rows × 10 columns
2. Writing CSV files — to_csv()
- DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)
  - path_or_buf: string or file handle, default None
  - sep: field delimiter, default ','
  - columns: the columns to write; all columns are written when not specified
  - mode: 'w' overwrites the file, 'a' appends to it
  - index: whether to write the row index
  - header: whether to write the column names
- Series.to_csv(path=None, index=True, sep=',', na_rep='', float_format=None, header=False, index_label=None, mode='w', encoding=None, compression=None, date_format=None, decimal='.')
Save the data in the open column (data is the stock DataFrame loaded above):
data[:10]
| trade_date | close | open | high | low | change | pct_chg | vol | amount |
|---|---|---|---|---|---|---|---|---|
| 20200313 | 2887.4265 | 222 | 2910.8812 | 2799.9841 | -36.0591 | -1.2334 | 366450436.0 | 393019665.2 |
| 20200312 | 2923.4856 | 100 | 2944.4651 | 2906.2838 | -45.0318 | -1.5170 | 307778457.0 | 328209202.4 |
| 20200311 | 2968.5174 | 100 | 3010.0286 | 2968.5174 | -28.2444 | -0.9425 | 352470970.0 | 378766619.0 |
| 20200310 | 2996.7618 | 100 | 3000.2963 | 2904.7989 | 53.4711 | 1.8167 | 393296648.0 | 425017184.8 |
| 20200309 | 2943.2907 | 100 | 2989.2051 | 2940.7138 | -91.2206 | -3.0061 | 414560736.0 | 438143854.6 |
| 20200306 | 3034.5113 | 100 | 3052.4439 | 3029.4632 | -37.1658 | -1.2100 | 362061533.0 | 377388542.7 |
| 20200305 | 3071.6771 | 100 | 3074.2571 | 3022.9262 | 60.0114 | 1.9926 | 445425806.0 | 482770471.4 |
| 20200304 | 3011.6657 | 100 | 3012.0035 | 2974.3583 | 18.7689 | 0.6271 | 353338278.0 | 389893917.5 |
| 20200303 | 2992.8968 | 100 | 3026.8420 | 2976.6230 | 21.9656 | 0.7394 | 410108047.0 | 447053681.5 |
| 20200302 | 2970.9312 | 100 | 2982.5068 | 2899.3100 | 90.6274 | 3.1465 | 367333369.0 | 397244201.2 |
data[:10].to_csv('csv_write_test.csv', columns=['open'])
pd.read_csv('csv_write_test.csv')
| | trade_date | open |
|---|---|---|
| 0 | 20200313 | 222 |
| 1 | 20200312 | 100 |
| 2 | 20200311 | 100 |
| 3 | 20200310 | 100 |
| 4 | 20200309 | 100 |
| 5 | 20200306 | 100 |
| 6 | 20200305 | 100 |
| 7 | 20200304 | 100 |
| 8 | 20200303 | 100 |
| 9 | 20200302 | 100 |
# Do not write the row index
data[:10].to_csv('csv_write_test.csv', columns=['open'], index=False)
pd.read_csv('csv_write_test.csv')
| | open |
|---|---|
| 0 | 222 |
| 1 | 100 |
| 2 | 100 |
| 3 | 100 |
| 4 | 100 |
| 5 | 100 |
| 6 | 100 |
| 7 | 100 |
| 8 | 100 |
| 9 | 100 |
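The mode='a' option is worth a quick demonstration as well. A minimal sketch, assuming we append a second (illustrative) slice of data to the file written above; header=False keeps the column names from being written a second time in the middle of the CSV:

# Append ten more rows to the existing file instead of overwriting it
data[10:20].to_csv('csv_write_test.csv', columns=['open'], index=False, mode='a', header=False)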
3.5.2 HDF5
read_hdf() and to_hdf()
Reading from and writing to an HDF5 file both require specifying a key, whose value is the DataFrame to read or store. A single HDF5 file can hold multiple DataFrames, and these DataFrames make up a third dimension.
- pandas.read_hdf(path_or_buf, key=None, **kwargs)
  - Reads data from an HDF5 file
  - path_or_buf: path to the file
  - key: the key of the object to read
  - mode: mode in which to open the file, default 'r' (read-only)
  - return: the selected object
- DataFrame.to_hdf(path_or_buf, key, **kwargs)
  - Stores a DataFrame under the given key; mode='w' overwrites the file, mode='a' appends to it
  - Can effectively store three-dimensional data: one HDF5 file can hold multiple DataFrames, which form the third dimension (see the sketch below)
    - key1 — DataFrame1 (two-dimensional data)
    - key2 — DataFrame2 (two-dimensional data)
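A minimal sketch of the round trip, assuming the stock DataFrame data from the CSV section and a hypothetical file name test.h5 (HDF5 support requires the optional PyTables package, installed as tables):

# Store two slices under different keys in the same HDF5 file
data[:10].to_hdf('test.h5', key='day_head', mode='w')
data[10:20].to_hdf('test.h5', key='day_tail', mode='a')  # 'a' keeps the existing key

# Read one of them back by key
pd.read_hdf('test.h5', key='day_head')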
3.5.3 JSON
JSON is a common data-interchange format. It is used heavily for communication between front ends and back ends, and is also a frequent choice for storage.
1. read_json()
pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
- Converts JSON into a pandas DataFrame
- orient: the layout of the JSON data
  - 'split': dict like {index -> [index], columns -> [columns], data -> values}
  - 'records': list like [{column -> value}, ..., {column -> value}] — the most commonly used layout
  - 'index': dict like {index -> {column -> value}}
  - 'columns': dict like {column -> {index -> value}} — the default
  - 'values': just the values array
- lines: boolean, default False
  - Whether to read the file as one JSON object per line; usually set to True
- typ: default 'frame'; whether to convert to a Series or a DataFrame
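A minimal sketch, assuming a hypothetical line-delimited file stock.json with one {"column": value, ...} record per line:

# orient='records' with lines=True reads one JSON record per line
json_data = pd.read_json('stock.json', orient='records', lines=True)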
2. to_json()
- DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
  - Stores a pandas object as JSON
  - path_or_buf=None: path to the file
  - orient: the JSON layout to write ('split', 'records', 'index', 'columns', 'values')
  - lines: whether to write one JSON object per line (only valid with orient='records')
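For example, writing the first ten rows back out in the records layout (stock_out.json is a hypothetical file name):

# Note that orient='records' does not preserve the trade_date index
data[:10].to_json('stock_out.json', orient='records', lines=True)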
3.5.4 Further Notes
Prefer HDF5 for storage:
- HDF5 supports compression; with pandas this is typically done with blosc, the fastest of the compressors pandas supports
- Compression improves disk utilization and saves space
- HDF5 is cross-platform and can easily be migrated onto Hadoop
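A hedged sketch of writing compressed HDF5 (stock_compressed.h5 is a hypothetical file name; complevel ranges from 0, no compression, to 9, maximum; complib selects the compressor and is passed through to PyTables):

# Write the stock DataFrame with blosc compression at the highest level
data.to_hdf('stock_compressed.h5', key='data', mode='w', complevel=9, complib='blosc')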
3.5.5 Summary
- Reading and writing CSV, HDF5, and JSON files with pandas