Data Analysis Part 3: Pandas Input/Output Operations

3.5 Reading and Storing Files

  • Goals:
    • Learn pandas' file-reading operations
    • Use the CSV and HDF5 formats to read and store files
  • Contents:
    • 3.5.1 CSV
    • 3.5.2 HDF5
    • 3.5.3 JSON
    • 3.5.4 Extension
    • 3.5.5 Summary

File formats pandas can read and write (the ones covered in this section):

  • CSV: read_csv() / to_csv()
  • HDF5: read_hdf() / to_hdf()
  • JSON: read_json() / to_json()

3.5.1 CSV

1. Reading CSV files: read_csv()
  • pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None)
    • filepath_or_buffer: path to the file
    • usecols: list of the column names to read
    • index_col: column (name or position) to use as the row index
pd.read_csv('stock.csv', usecols=['trade_date', 'high', 'low', 'open', 'close'], index_col=0)
                close       open       high        low
trade_date
20200313    2887.4265  2804.2322  2910.8812  2799.9841
20200312    2923.4856  2936.0163  2944.4651  2906.2838
20200311    2968.5174  3001.7616  3010.0286  2968.5174
20200310    2996.7618  2918.9347  3000.2963  2904.7989
20200309    2943.2907  2987.1805  2989.2051  2940.7138
...               ...        ...        ...        ...
19910719     136.7000   137.6600   138.5400   136.6600
19910718     137.1700   137.1700   137.1700   135.8100
19910717     135.8100   135.8100   135.8100   135.3900
19910716     134.4700   134.3900   134.4700   133.1400
19910715     133.1400   133.9000   134.1000   131.8700

7002 rows × 4 columns

By default, read_csv treats the first line of the file as the column header. When the file has no header row, the first row of data is silently used as the column names, which is clearly wrong; pass the names parameter to set the column names manually.
pd.read_csv('stock2.csv')
      20200313  2887.4265  2804.2322  2910.8812  2799.9841  2923.4856  -36.0591  -1.2334    366450436   393019665.2
0     20200312  2923.4856  2936.0163  2944.4651  2906.2838  2968.5174  -45.0318  -1.5170  307778457.0  3.282092e+08
1     20200311  2968.5174  3001.7616  3010.0286  2968.5174  2996.7618  -28.2444  -0.9425  352470970.0  3.787666e+08
2     20200310  2996.7618  2918.9347  3000.2963  2904.7989  2943.2907   53.4711   1.8167  393296648.0  4.250172e+08
3     20200309  2943.2907  2987.1805  2989.2051  2940.7138  3034.5113  -91.2206  -3.0061  414560736.0  4.381439e+08
4     20200306  3034.5113  3039.9395  3052.4439  3029.4632  3071.6771  -37.1658  -1.2100  362061533.0  3.773885e+08
...        ...        ...        ...        ...        ...        ...       ...      ...          ...           ...
6996  19910719   136.7000   137.6600   138.5400   136.6600   137.1700   -0.4700  -0.3426      10823.0  5.242826e+03
6997  19910718   137.1700   137.1700   137.1700   135.8100   135.8100    1.3600   1.0014        847.0  4.644160e+02
6998  19910717   135.8100   135.8100   135.8100   135.3900   134.4700    1.3400   0.9965        660.0  3.975240e+02
6999  19910716   134.4700   134.3900   134.4700   133.1400   133.1400    1.3300   0.9989       2796.0  1.328502e+03
7000  19910715   133.1400   133.9000   134.1000   131.8700   132.8000    0.3400   0.2560      11938.0  5.534900e+03

7001 rows × 10 columns

pd.read_csv('stock2.csv', names=['trade_date', 'close', 'open', 'high', 'low', 'pre_close', 'change', 'pct_chg', 'vol', 'amount'])
      trade_date      close       open       high        low  pre_close   change  pct_chg          vol        amount
0       20200313  2887.4265  2804.2322  2910.8812  2799.9841  2923.4856 -36.0591  -1.2334  366450436.0  3.930197e+08
1       20200312  2923.4856  2936.0163  2944.4651  2906.2838  2968.5174 -45.0318  -1.5170  307778457.0  3.282092e+08
2       20200311  2968.5174  3001.7616  3010.0286  2968.5174  2996.7618 -28.2444  -0.9425  352470970.0  3.787666e+08
3       20200310  2996.7618  2918.9347  3000.2963  2904.7989  2943.2907  53.4711   1.8167  393296648.0  4.250172e+08
4       20200309  2943.2907  2987.1805  2989.2051  2940.7138  3034.5113 -91.2206  -3.0061  414560736.0  4.381439e+08
...          ...        ...        ...        ...        ...        ...      ...      ...          ...           ...
6997    19910719   136.7000   137.6600   138.5400   136.6600   137.1700  -0.4700  -0.3426      10823.0  5.242826e+03
6998    19910718   137.1700   137.1700   137.1700   135.8100   135.8100   1.3600   1.0014        847.0  4.644160e+02
6999    19910717   135.8100   135.8100   135.8100   135.3900   134.4700   1.3400   0.9965        660.0  3.975240e+02
7000    19910716   134.4700   134.3900   134.4700   133.1400   133.1400   1.3300   0.9989       2796.0  1.328502e+03
7001    19910715   133.1400   133.9000   134.1000   131.8700   132.8000   0.3400   0.2560      11938.0  5.534900e+03

7002 rows × 10 columns

2. Writing CSV files: to_csv()
  • DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)
    • path_or_buf: string or file handle, default None
    • sep: field separator, default ','
    • columns: the columns to write; all columns are written when omitted
    • mode: 'w' overwrites, 'a' appends
    • index: whether to write the row index
    • header: whether to write the column names
  • Series.to_csv(path=None, index=True, sep=',', na_rep='', float_format=None, header=False, index_label=None, mode='w', encoding=None, compression=None, date_format=None, decimal='.')
Save only the open column:
data[:10]
                close  open       high        low   change  pct_chg          vol       amount
trade_date
20200313    2887.4265   222  2910.8812  2799.9841 -36.0591  -1.2334  366450436.0  393019665.2
20200312    2923.4856   100  2944.4651  2906.2838 -45.0318  -1.5170  307778457.0  328209202.4
20200311    2968.5174   100  3010.0286  2968.5174 -28.2444  -0.9425  352470970.0  378766619.0
20200310    2996.7618   100  3000.2963  2904.7989  53.4711   1.8167  393296648.0  425017184.8
20200309    2943.2907   100  2989.2051  2940.7138 -91.2206  -3.0061  414560736.0  438143854.6
20200306    3034.5113   100  3052.4439  3029.4632 -37.1658  -1.2100  362061533.0  377388542.7
20200305    3071.6771   100  3074.2571  3022.9262  60.0114   1.9926  445425806.0  482770471.4
20200304    3011.6657   100  3012.0035  2974.3583  18.7689   0.6271  353338278.0  389893917.5
20200303    2992.8968   100  3026.8420  2976.6230  21.9656   0.7394  410108047.0  447053681.5
20200302    2970.9312   100  2982.5068  2899.3100  90.6274   3.1465  367333369.0  397244201.2
data[:10].to_csv('csv_write_test.csv', columns=['open'])
pd.read_csv('csv_write_test.csv')
   trade_date  open
0    20200313   222
1    20200312   100
2    20200311   100
3    20200310   100
4    20200309   100
5    20200306   100
6    20200305   100
7    20200304   100
8    20200303   100
9    20200302   100
# index=False: do not write the row index
data[:10].to_csv('csv_write_test.csv', columns=['open'], index=False)
pd.read_csv('csv_write_test.csv')
   open
0   222
1   100
2   100
3   100
4   100
5   100
6   100
7   100
8   100
9   100
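The mode='a' option from the parameter list above can be paired with header=False to append rows to an existing CSV without repeating the header line. A minimal sketch (the path and values here are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Hypothetical path and data, just for illustration
path = os.path.join(tempfile.mkdtemp(), 'append_demo.csv')
df = pd.DataFrame({'open': [222, 100]})

# First write: creates the file, including the header row
df.to_csv(path, index=False)

# Second write: mode='a' appends; header=False avoids a duplicate header row
df.to_csv(path, mode='a', index=False, header=False)

print(pd.read_csv(path))  # four rows under a single 'open' column
```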

3.5.2 HDF5

read_hdf() and to_hdf()

Reading from and storing to HDF5 requires specifying a key whose value is the DataFrame to read or store. A single HDF5 file can hold multiple DataFrames, and these keys effectively form a third dimension.

  • pandas.read_hdf(path_or_buf, key=None, **kwargs)

    • Reads data from an HDF5 file
    • path_or_buf: file path
    • key: the key to read
    • mode: how to open the file; read_hdf opens it read-only ('r') by default
    • return: the selected object
  • DataFrame.to_hdf(path_or_buf, key, **kwargs)

    • Stores the DataFrame under the given key; mode='w' overwrites the file, while mode='a' (the default) adds the key to an existing file
    • This is how one file can hold three-dimensional data: several two-dimensional DataFrames side by side

      • key1 -> DataFrame1 (2-D data)
      • key2 -> DataFrame2 (2-D data)
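Putting read_hdf() and to_hdf() together, here is a minimal sketch of the one-file-many-keys idea. It requires the PyTables package; the path, key names, and values are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical path and data (requires the PyTables package)
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')

day1 = pd.DataFrame({'close': [10.0, 11.0], 'open': [9.5, 10.2]})
day2 = pd.DataFrame({'close': [11.5, 12.0], 'open': [11.0, 11.4]})

# One file, two keys: each key holds its own 2-D DataFrame
day1.to_hdf(path, key='day1')
day2.to_hdf(path, key='day2', mode='a')  # 'a' keeps the existing key intact

print(pd.read_hdf(path, key='day2'))
```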

3.5.3 JSON

JSON is a common data-interchange format: it is used all the time between front ends and back ends, and is sometimes chosen as a storage format as well.

1. read_json()

pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)

  • Converts JSON into a pandas DataFrame
    • orient: the layout the JSON data is in
      • 'split': dict like {index -> [index], columns -> [columns], data -> values}
      • 'records': list like [{column -> value}, ..., {column -> value}] (the most commonly used layout)
      • 'index': dict like {index -> {column -> value}}
      • 'columns': dict like {column -> {index -> value}} (the default)
      • 'values': just the values array
    • lines: boolean, default False
      • whether to read one JSON object per line; usually set to True together with orient='records'
    • typ: default 'frame'; whether to build a Series or a DataFrame
2. to_json()
  • DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
    • Stores a pandas object as JSON
    • path_or_buf=None: file path
    • orient: the JSON layout to write ('split', 'records', 'index', 'columns', 'values')
    • lines: whether to write one record per line (only valid with orient='records')
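A minimal round-trip sketch of read_json()/to_json() with the commonly used orient='records', lines=True combination. The column names and values are made up; the JSON text is wrapped in StringIO, which is what newer pandas versions expect for literal strings:

```python
from io import StringIO

import pandas as pd

# Hypothetical line-delimited JSON: one record per line
json_lines = '{"ticker":"A","close":10.5}\n{"ticker":"B","close":20.1}'

df = pd.read_json(StringIO(json_lines), orient='records', lines=True)
print(df)

# Writing back with the same settings produces one JSON object per line
out = df.to_json(orient='records', lines=True)
print(out)
```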

3.5.4 Extension

Prefer HDF5 for file storage:

  • HDF5 supports compression; blosc is the fastest codec and the one pandas supports out of the box
  • Compression improves disk utilization and saves space
  • HDF5 is cross-platform and can easily be migrated onto Hadoop
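As a sketch of the compression options (requires PyTables; the path and data are made up), complib selects the codec and complevel sets the compression strength from 0 to 9:

```python
import os
import tempfile

import pandas as pd

# Hypothetical path and data (requires the PyTables package)
path = os.path.join(tempfile.mkdtemp(), 'compressed.h5')
df = pd.DataFrame({'close': [10.0, 11.0, 12.5], 'vol': [100.0, 150.0, 90.0]})

# Store with blosc compression; the data round-trips unchanged
df.to_hdf(path, key='quotes', complib='blosc', complevel=9)
print(pd.read_hdf(path, key='quotes'))
```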

3.5.5 Summary

  • Reading and writing CSV, HDF5, and JSON files with pandas