3. Data Loading, Storage, and File Formats
Input and output typically fall into a few categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources such as Web APIs.
3.1 Reading and Writing Data in Text Format
Reading tabular data into a DataFrame object is an important feature of pandas. read_csv and read_table are probably the most heavily used functions.
In [35]: import pandas as pd
In [40]: df = pd.read_csv('ex1.csv')
In [41]: df
Out[41]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Use `! type ex1.csv` to print the raw file (this is the Windows shell command; on Unix use `cat`). With read_table you can specify the delimiter:
In [45]: ! type ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [46]: pd.read_table('ex1.csv', sep=',')
FutureWarning: read_table is deprecated, use read_csv instead.
Out[46]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Suppose you want the message column to be the index of the returned DataFrame. You can either indicate that the column at position 4 should be the index, or pass 'message' to the index_col argument:
In [57]: pd.read_csv('ex1.csv', index_col='message')  # message as the index
Out[57]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
(Do not pass names= here: ex1.csv already has a header row, and names is meant for headerless files. Combining names with a file that has a header turns the header into a data row.)
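A self-contained sketch of the same equivalence (io.StringIO stands in for ex1.csv, so nothing here depends on the file being present):

```python
import io
import pandas as pd

csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# index_col accepts either a column name or an integer position
by_name = pd.read_csv(io.StringIO(csv_text), index_col='message')
by_pos = pd.read_csv(io.StringIO(csv_text), index_col=4)

print(by_name.index.tolist())  # ['hello', 'world', 'foo']
print(by_name.equals(by_pos))  # True
```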
If you want to form a hierarchical index from multiple columns, pass a list of column numbers or column names:
d = pd.DataFrame({'key1' : ['one', 'one', 'one', 'two', 'two', 'two'],
'key2' : ['a', 'b' ,'c' ,'a' ,'b' ,'c'],
'v1' : [1,3,5,7,9,11],
'v2' : [2,4,6,8,10,12]})
print(d)
'''
key1 key2 v1 v2
0 one a 1 2
1 one b 3 4
2 one c 5 6
3 two a 7 8
4 two b 9 10
5 two c 11 12
'''
d.to_csv('csv_mindex.csv', index=False) # don't write the row index, so it doesn't reappear as a column on re-read
'''
key1,key2,v1,v2
one,a,1,2
one,b,3,4
one,c,5,6
two,a,7,8
two,b,9,10
two,c,11,12
'''
# form a hierarchical index from multiple columns
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])
print(parsed)
'''
v1 v2
key1 key2
one a 1 2
b 3 4
c 5 6
two a 7 8
b 9 10
c 11 12
'''
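The same hierarchical index can also be requested by column position instead of name; a minimal sketch with the file contents inlined via io.StringIO:

```python
import io
import pandas as pd

# csv_mindex.csv contents inlined so the sketch is self-contained
csv_text = "key1,key2,v1,v2\none,a,1,2\none,b,3,4\ntwo,a,7,8\ntwo,b,9,10\n"

# column positions work the same way as column names
parsed = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

print(list(parsed.index.names))        # ['key1', 'key2']
print(parsed.loc[('one', 'b'), 'v1'])  # 3
```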
In some cases a table may not have a fixed delimiter, and instead uses whitespace or some other pattern to separate fields. Consider a text file ex3.txt with the following data:
A B C
aaa -2 -4 -6
bbb -3 23 21
ccc -1.23 0.2255 9.12
l = list(open('ex3.txt'))
print(l)
'''
['A B C\n', 'aaa -2 -4 -6\n', 'bbb -3 23 21\n', 'ccc -1.23 0.2255 9.12']
'''
When the fields are separated by varying amounts of whitespace, you could handle it manually, but you can also pass a regular expression as the delimiter to read_table. Here the regular expression is \s+, so we get:
In [91]: result = pd.read_table('ex3.txt', sep='\s+')
In [92]: result
Out[92]:
A B C
aaa -2.00 -4.0000 -6.00
bbb -3.00 23.0000 21.00
ccc -1.23 0.2255 9.12
In this case, because there is one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index.
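This inference can be sketched in a self-contained way (io.StringIO stands in for ex3.txt; the raw string r'\s+' avoids an invalid-escape warning in newer Python):

```python
import io
import pandas as pd

txt = "A B C\naaa -2 -4 -6\nbbb -3 23 21\nccc -1.23 0.2255 9.12\n"

# three column names but four data columns: pandas uses the first
# data column as the index
result = pd.read_csv(io.StringIO(txt), sep=r'\s+')
print(result.index.tolist())    # ['aaa', 'bbb', 'ccc']
print(result.columns.tolist())  # ['A', 'B', 'C']
```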
The contents of ex4.txt are:
# hey!
a,b,c,d,message
# just to make things harder for you
# who reads CSV files with a computer, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
You can use skiprows to skip the first, third, and fourth rows of the file:
In [93]: pd.read_csv('ex4.txt', skiprows=[0, 2, 3])
Out[93]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
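When the unwanted lines are all comments with a fixed prefix, read_csv's comment argument is an alternative to counting row numbers by hand. A sketch with ex4.txt's contents inlined via io.StringIO:

```python
import io
import pandas as pd

txt = ("# hey!\n"
       "a,b,c,d,message\n"
       "# just to make things harder for you\n"
       "# who reads CSV files with a computer, anyway?\n"
       "1,2,3,4,hello\n"
       "5,6,7,8,world\n"
       "9,10,11,12,foo\n")

# drop every line starting with '#' instead of listing row numbers
df = pd.read_csv(io.StringIO(txt), comment='#')
print(df.shape)                # (3, 5)
print(df['message'].tolist())  # ['hello', 'world', 'foo']
```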
The contents of ex5.csv are:
something,a,b,c,d,message
one,1,2,3,4
two,5,6,8,world
three,9,10,11,12,foo
Handling missing values is an important and often subtle part of file parsing. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL:
In [97]: result = pd.read_csv('ex5.csv')
In [98]: result
Out[98]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 8 world NaN
2 three 9 10 11 12 foo
In [100]: pd.isnull(result)
Out[100]:
something a b c d message
0 False False False False False True
1 False False False False False True
2 False False False False False False
The na_values option can take a list or set of strings to consider as missing values:
In [102]: pd.read_csv('ex5.csv', na_values=['NULL'])
Out[102]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 8 world NaN
2 three 9 10 11 12 foo
Different NA sentinels can be specified for each column using a dict:
In [103]: sentinels = {'message' : ['foo', 'NA'], 'something' : ['two']}
In [104]: pd.read_csv('ex5.csv', na_values=sentinels)  # the values listed in the dict become NaN
Out[104]:
something a b c d message
0 one 1 2 3 4 NaN
1 NaN 5 6 8 world NaN
2 three 9 10 11 12 foo
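A self-contained sketch of per-column sentinels (ex5.csv's contents inlined via io.StringIO):

```python
import io
import pandas as pd

csv_text = ("something,a,b,c,d,message\n"
            "one,1,2,3,4\n"
            "two,5,6,8,world\n"
            "three,9,10,11,12,foo\n")

# per-column sentinel strings: each value becomes NaN only in its own column
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

print(result['something'].isnull().tolist())  # [False, True, False]
print(result['message'].isnull().tolist())    # [True, True, True]
```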
Commonly used read_csv/read_table arguments include sep, header, index_col, names, skiprows, na_values, comment, nrows, chunksize, and encoding.
3.1.1 Reading Text Files in Pieces
When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of the file or iterate through it in smaller chunks.
In [107]: result = pd.read_csv('users.csv')
In [108]: result
Out[108]:
user_id\titem_id\tbehavior_type\tuser_geohash\titem_category\ttime
0 10001082\t285259775\t1\t97lk14c\t4076\t2014-12...
1 10001082\t4368907\t1\t\t5503\t2014-12-12 12
2 10001082\t4368907\t1\t\t5503\t2014-12-12 12
3 10001082\t53616768\t1\t\t9762\t2014-12-02 15
4 10001082\t151466952\t1\t\t5232\t2014-12-12 11
.. ...
95 10001082\t282816229\t1\t\t9666\t2014-12-14 03
96 10001082\t362845104\t1\t\t5176\t2014-12-13 14
97 10001082\t262112219\t1\t\t9614\t2014-11-22 12
98 10001082\t275221686\t1\t\t10576\t2014-12-13 14
99 10001082\t275221686\t1\t\t10576\t2014-12-02 22
[100 rows x 1 columns]
(Note that users.csv is tab-delimited: read with the default sep=',', everything lands in a single column, as shown above. Pass sep='\t' to parse it into separate columns.)
If you only want to read a small number of rows (avoiding reading the entire file), specify that with nrows:
In [110]: pd.read_csv('users.csv', nrows=5)
Out[110]:
user_id\titem_id\tbehavior_type\tuser_geohash\titem_category\ttime
0 10001082\t285259775\t1\t97lk14c\t4076\t2014-12...
1 10001082\t4368907\t1\t\t5503\t2014-12-12 12
2 10001082\t4368907\t1\t\t5503\t2014-12-12 12
3 10001082\t53616768\t1\t\t9762\t2014-12-02 15
4 10001082\t151466952\t1\t\t5232\t2014-12-12 11
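A self-contained sketch of nrows, using a synthetic in-memory file in place of users.csv:

```python
import io
import pandas as pd

# a synthetic 1000-row file standing in for users.csv
big = "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(1000)) + "\n"

# parse only the first 5 data rows; the rest of the file is not loaded
head = pd.read_csv(io.StringIO(big), nrows=5)
print(len(head))           # 5
print(head['x'].tolist())  # [0, 1, 2, 3, 4]
```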
First, generate a new dataset:
import numpy as np
import pandas as pd

l = ['L', 'B', 'G', 'R', 'Q', 'L', 'E', 'K', 'G']
l *= 3
n = len(l)
d = np.random.randn(n)
data = pd.DataFrame({'one' : d, 'two' : d, 'three' : d, 'key' : l})
#print(data)
data.to_csv('ex6.csv', index=False)
'''
one,two,three,key
-0.8253385888577351,-0.8253385888577351,-0.8253385888577351,L
-0.3005054628025023,-0.3005054628025023,-0.3005054628025023,B
0.08667646938805751,0.08667646938805751,0.08667646938805751,G
2.4717947134908873,2.4717947134908873,2.4717947134908873,R
-0.315253990143174,-0.315253990143174,-0.315253990143174,Q
-0.4602857943618783,-0.4602857943618783,-0.4602857943618783,L
0.5768993301685353,0.5768993301685353,0.5768993301685353,E
-0.4106573433924394,-0.4106573433924394,-0.4106573433924394,K
-0.7054454240211718,-0.7054454240211718,-0.7054454240211718,G
0.059222319806148255,0.059222319806148255,0.059222319806148255,L
2.1285400157366716,2.1285400157366716,2.1285400157366716,B
0.42874042415653413,0.42874042415653413,0.42874042415653413,G
-0.12655874642828271,-0.12655874642828271,-0.12655874642828271,R
1.4617361857075006,1.4617361857075006,1.4617361857075006,Q
1.4671052202024812,1.4671052202024812,1.4671052202024812,L
1.060765777158635,1.060765777158635,1.060765777158635,E
0.5057505511496003,0.5057505511496003,0.5057505511496003,K
-0.4852338681885889,-0.4852338681885889,-0.4852338681885889,G
0.48837632594424535,0.48837632594424535,0.48837632594424535,L
0.14959791412619264,0.14959791412619264,0.14959791412619264,B
0.08895475974680539,0.08895475974680539,0.08895475974680539,G
-1.3200711041372188,-1.3200711041372188,-1.3200711041372188,R
-0.07161822585532003,-0.07161822585532003,-0.07161822585532003,Q
0.15537064319789615,0.15537064319789615,0.15537064319789615,L
-1.309895454805748,-1.309895454805748,-1.309895454805748,E
0.05251784769130209,0.05251784769130209,0.05251784769130209,K
0.100051443342729,0.100051443342729,0.100051443342729,G
'''
To read this file in pieces, specify a chunksize as a number of rows per chunk:
chunker = pd.read_csv('ex6.csv', chunksize=100)
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot)
'''
L 6.0
G 6.0
Q 3.0
B 3.0
R 3.0
K 3.0
E 3.0
dtype: float64
'''
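A self-contained sketch of the same chunked aggregation, using a synthetic 100-row in-memory file so that chunksize=25 actually produces several pieces:

```python
import io
import pandas as pd

big = "x,y\n" + "\n".join(f"{i},{i % 3}" for i in range(100)) + "\n"

# the TextFileReader yields DataFrames of at most `chunksize` rows each
chunker = pd.read_csv(io.StringIO(big), chunksize=25)

sizes = []
tot = pd.Series([], dtype='float64')
for piece in chunker:
    sizes.append(len(piece))
    tot = tot.add(piece['y'].value_counts(), fill_value=0)

print(sizes)      # [25, 25, 25, 25]
print(tot.sum())  # 100.0
```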
3.1.2 Writing Data to Text Format
Using DataFrame's to_csv method, data can be exported to a comma-separated file; other delimiters work as well:
In [13]: data = pd.DataFrame(np.arange(12).reshape((3,4)),
...: index=['a', 'b', 'c'],
...: columns=['one', 'two', 'three', 'four'])
In [14]: data
Out[14]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
In [16]: import sys
In [17]: data.to_csv(sys.stdout, sep='|')  # write to the console
|one|two|three|four
a|0|1|2|3
b|4|5|6|7
c|8|9|10|11
Missing values appear as empty strings in the output, and you may want to denote them with some other sentinel value. (The frame here contains no missing values, so no NULL actually appears below; cells holding NaN would be written as NULL.)
In [18]: data.to_csv(sys.stdout, na_rep='NULL')  # write missing cells as NULL
,one,two,three,four
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
You can also write only a subset of the columns, in an order of your choosing:
In [19]: data
Out[19]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
In [20]: data.to_csv(sys.stdout, index=False, columns=['one', 'two'])
one,two
0,1
4,5
8,9
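Since the frame above has no missing values, na_rep had nothing to do; here is a minimal sketch where it does, writing to an io.StringIO buffer instead of sys.stdout:

```python
import io
import numpy as np
import pandas as pd

# a frame that actually contains missing values
df = pd.DataFrame({'one': [1.0, np.nan], 'two': [np.nan, 4.0]},
                  index=['a', 'b'])

buf = io.StringIO()
df.to_csv(buf, na_rep='NULL')  # missing cells are written as NULL
print(buf.getvalue())
```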
3.1.3 Working with Delimited Formats
The contents of ex7.csv are:
"a","b","c"
"1","2","3"
"1","2","3"
To use it, pass any open file or file-like object to csv.reader. Iterating through the reader like a file yields lists of strings with any quote characters removed:
import csv
f = open('ex7.csv')
reader = csv.reader(f)
for line in reader:
    print(line)
'''
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
'''
First, read the file into a list of lines:
In [29]: with open('ex7.csv') as f:
...: lines = list(csv.reader(f))
...:
In [33]: lines
Out[33]: [['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3']]
Then we split the lines into the header line and the data lines:
In [30]: header, values = lines[0], lines[1:]
In [34]: lines[0]
Out[34]: ['a', 'b', 'c']
In [35]: lines[1:]
Out[35]: [['1', '2', '3'], ['1', '2', '3']]
Then we create a dictionary of data columns using a dict comprehension and the expression zip(*values), which transposes rows to columns:
In [31]: data_dict = {h : v for h, v in zip(header, zip(*values))}
In [32]: data_dict
Out[32]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
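The csv module's writing counterpart is csv.writer, which likewise takes any open, writable file-like object; a sketch that reproduces ex7.csv's quoted layout in memory:

```python
import csv
import io

# csv.writer mirrors csv.reader; QUOTE_ALL wraps every field in quotes
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(('a', 'b', 'c'))
writer.writerow(('1', '2', '3'))
writer.writerow(('1', '2', '3'))

print(buf.getvalue().splitlines())
# ['"a","b","c"', '"1","2","3"', '"1","2","3"']
```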
3.1.4 JSON Data
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. An example:
In [1]: obj = """{"name" : "Wes", "places_lived" : ["United States", "Spain", "Germany"
...: ],"pet" : null, "siblings" : [{"name" : "Scott", "age" : 30, "pets" : [
...: "Zeus",
...: "Zuko"]}, {"name" : "Katie", "age" : 38, "pets" : ["Sixes", "Stache", "
...: Cisio"]}
...: ]}"""
In [2]: import json
In [3]: result = json.loads(obj)
In [4]: result
Out[4]:
{'name': 'Wes',
'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
{'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisio']}]}
json.dumps, on the other hand, converts a Python object back to JSON:
In [5]: asjson = json.dumps(result)
In [6]: asjson
Out[6]: '{"places_lived": ["United States", "Spain", "Germany"], "pet": null, "name": "Wes", "siblings": [{"age": 30, "pets": ["Zeus", "Zuko"], "name": "Scott"}, {"age": 38, "pets": ["Sixes", "Stache", "Cisio"], "name": "Katie"}]}'
How you convert a JSON object or list of objects to a DataFrame or other data structure is up to you. Conveniently, you can pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:
In [10]: siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
In [11]: siblings
Out[11]:
name age
0 Scott 30
1 Katie 38
pandas.read_json can automatically convert JSON datasets in particular arrangements into a Series or DataFrame. For example:
In [19]: ! type example.json
[{"a" : 1, "b" : 2, "c" : 3},
{"a" : 4, "b" : 5, "c" : 6},
{"a" : 7, "b" : 8, "c" : 9}]
In [20]: data = pd.read_json('example.json')
In [21]: data
Out[21]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
In [23]: print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
In [25]: print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
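A round-trip sketch of the two orientations (wrapping the string in io.StringIO when reading back, since newer pandas versions deprecate passing literal JSON strings to read_json):

```python
import io
import pandas as pd

data = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# orient='records' emits one JSON object per row;
# the default orient='columns' nests values under each column name
records = data.to_json(orient='records')
print(records)  # [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]

round_trip = pd.read_json(io.StringIO(records), orient='records')
print(round_trip.equals(data))  # True
```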