3. Data Loading, Storage, and File Formats
Input and output typically fall into a few categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources such as Web APIs.
3.1 Reading and Writing Data in Text Format
Reading tabular data into a DataFrame object is an important feature of pandas. read_csv and read_table are probably the most heavily used functions.
In [35]: import pandas as pd
In [40]: df = pd.read_csv('ex1.csv')
In [41]: df
Out[41]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
Use `! type ex1.csv` to print the raw file (this is the Windows shell command; on Unix use `cat`). With read_table you can specify the delimiter:
In [45]: ! type ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [46]: pd.read_table('ex1.csv', sep=',')
FutureWarning: read_table is deprecated, use read_csv instead.
Out[46]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Suppose you want the message column to be the index of the returned DataFrame. You can either indicate that the column at position 4 should be the index, or pass 'message' to the index_col argument:
In [57]: pd.read_csv('ex1.csv', index_col='message')  # message as the index
Out[57]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
(Do not pass names= here: ex1.csv already has a header row, and names is meant for headerless files. Combining names with a file that has a header turns the header into a data row.)
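A self-contained sketch of the same equivalence (io.StringIO stands in for ex1.csv, so nothing here depends on the file being present):

```python
import io
import pandas as pd

csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# index_col accepts either a column name or an integer position
by_name = pd.read_csv(io.StringIO(csv_text), index_col='message')
by_pos = pd.read_csv(io.StringIO(csv_text), index_col=4)

print(by_name.index.tolist())  # ['hello', 'world', 'foo']
print(by_name.equals(by_pos))  # True
```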
If you want to form a hierarchical index from multiple columns, pass a list of column numbers or column names:
d = pd.DataFrame({'key1' : ['one', 'one', 'one', 'two', 'two', 'two'],
'key2' : ['a', 'b' ,'c' ,'a' ,'b' ,'c'],
'v1' : [1,3,5,7,9,11],
'v2' : [2,4,6,8,10,12]})
print(d)
'''
key1 key2 v1 v2
0 one a 1 2
1 one b 3 4
2 one c 5 6
3 two a 7 8
4 two b 9 10
5 two c 11 12
'''
d.to_csv('csv_mindex.csv', index=False) # don't write the row index, so it doesn't reappear as a column on re-read
'''
key1,key2,v1,v2
one,a,1,2
one,b,3,4
one,c,5,6
two,a,7,8
two,b,9,10
two,c,11,12
'''
# form a hierarchical index from multiple columns
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1', 'key2'])
print(parsed)
'''
v1 v2
key1 key2
one a 1 2
b 3 4
c 5 6
two a 7 8
b 9 10
c 11 12
'''
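The same hierarchical index can also be requested by column position instead of name; a minimal sketch with the file contents inlined via io.StringIO:

```python
import io
import pandas as pd

# csv_mindex.csv contents inlined so the sketch is self-contained
csv_text = "key1,key2,v1,v2\none,a,1,2\none,b,3,4\ntwo,a,7,8\ntwo,b,9,10\n"

# column positions work the same way as column names
parsed = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

print(list(parsed.index.names))        # ['key1', 'key2']
print(parsed.loc[('one', 'b'), 'v1'])  # 3
```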
In some cases a table may not have a fixed delimiter, and instead uses whitespace or some other pattern to separate fields. Consider a text file ex3.txt with the following data:
A B C
aaa -2 -4 -6
bbb -3 23 21
ccc -1.23 0.2255 9.12
l = list(open('ex3.txt'))
print(l)
'''
['A B C\n', 'aaa -2 -4 -6\n', 'bbb -3 23 21\n', 'ccc -1.23 0.2255 9.12']
'''
When the fields are separated by varying amounts of whitespace, you could handle it manually, but you can also pass a regular expression as the delimiter to read_table. Here the regular expression is \s+, so we get:
In [91]: result = pd.read_table('ex3.txt', sep='\s+')
In [92]: result
Out[92]:
A B C
aaa -2.00 -4.0000 -6.00
bbb -3.00 23.0000 21.00
ccc -1.23 0.2255 9.12
In this case, because there is one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index.
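This inference can be sketched in a self-contained way (io.StringIO stands in for ex3.txt; the raw string r'\s+' avoids an invalid-escape warning in newer Python):

```python
import io
import pandas as pd

txt = "A B C\naaa -2 -4 -6\nbbb -3 23 21\nccc -1.23 0.2255 9.12\n"

# three column names but four data columns: pandas uses the first
# data column as the index
result = pd.read_csv(io.StringIO(txt), sep=r'\s+')
print(result.index.tolist())    # ['aaa', 'bbb', 'ccc']
print(result.columns.tolist())  # ['A', 'B', 'C']
```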
The contents of ex4.txt are:
# hey!
a,b,c,d,message
# just to make things harder for you
# who reads CSV files with a computer, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
You can use skiprows to skip the first, third, and fourth rows of the file:
In [93]: pd.read_csv('ex4.txt', skiprows=[0, 2, 3])
Out[93]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
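When the unwanted lines are all comments with a fixed prefix, read_csv's comment argument is an alternative to counting row numbers by hand. A sketch with ex4.txt's contents inlined via io.StringIO:

```python
import io
import pandas as pd

txt = ("# hey!\n"
       "a,b,c,d,message\n"
       "# just to make things harder for you\n"
       "# who reads CSV files with a computer, anyway?\n"
       "1,2,3,4,hello\n"
       "5,6,7,8,world\n"
       "9,10,11,12,foo\n")

# drop every line starting with '#' instead of listing row numbers
df = pd.read_csv(io.StringIO(txt), comment='#')
print(df.shape)                # (3, 5)
print(df['message'].tolist())  # ['hello', 'world', 'foo']
```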
The contents of ex5.csv are:
something,a,b,c,d,message
one,1,2,3,4
two,5,6,8,world
three,9,10,11,12,foo
Handling missing values is an important and often subtle part of file parsing. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL:
In [97]: result = pd.read_csv('ex5.csv')
In [98]: result
Out[98]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 8 world NaN
2 three 9 10 11 12 foo
In [100]: pd.isnull(result)
Out[100]:
something a b c d message
0 False False False False False True
1 False False False False False True
2 False False False False False False
The na_values option can take a list or set of strings to consider as missing values:
In [102]: pd.read_csv('ex5.csv', na_values=['NULL'])
Out[102]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 8 world NaN
2 three 9 10 11 12 foo
Different NA sentinels can be specified for each column using a dict:
In [103]: sentinels = {'message' : ['foo', 'NA'], 'something' : ['two']}
In [104]: pd.read_csv('ex5.csv', na_values=sentinels)  # the values listed in the dict become NaN
Out[104]:
something a b c d message
0 one 1 2 3 4 NaN
1 NaN 5 6 8 world NaN
2 three 9 10 11 12 foo
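A self-contained sketch of per-column sentinels (ex5.csv's contents inlined via io.StringIO):

```python
import io
import pandas as pd

csv_text = ("something,a,b,c,d,message\n"
            "one,1,2,3,4\n"
            "two,5,6,8,world\n"
            "three,9,10,11,12,foo\n")

# per-column sentinel strings: each value becomes NaN only in its own column
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

print(result['something'].isnull().tolist())  # [False, True, False]
print(result['message'].isnull().tolist())    # [True, True, True]
```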
Commonly used read_csv/read_table arguments include sep, header, index_col, names, skiprows, na_values, comment, nrows, chunksize, and encoding.
3.1.1 Reading Text Files in Pieces
When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of the file or iterate through it in smaller chunks.
In [107]: result = pd.read_csv('users.csv')
In [108]: result
Out[108]:
user_id\titem_id\tbehavior_type\tuser_geohash\titem_category\ttime
0 10001082\t285259775\t1\t97lk14c\t4076\t2014-12...
1 10001082\t4368907\t1\t\t5503\t2014-12-12 12
2 10001082\t4368907\t1\t\t5503\t2014-12-12 12
3 10001082\t53616768\t1\t\t9762\t2014-12-02 15
4 10001082\t151466952\t1\t\t5232\t2014-12-12 11
.. ...
95 10001082\t282816229\t1\t\t9666\t2014-12-14 03
96 10001082\t362845104\t1\t\t5176\t2014-12-13 14
97 10001082\t262112219\t1\t\t9614\t2014-11-22 12
98 10001082\t275221686\t1\t\t10576\t2014-12-13 14
99 10001082\t275221686\t1\t\t10576\t2014-12-02 22
[100 rows x 1 columns]
(Note that users.csv is tab-delimited: read with the default sep=',', everything lands in a single column, as shown above. Pass sep='\t' to parse it into separate columns.)
If you only want to read a small number of rows (avoiding reading the entire file), specify that with nrows:
In [110]: pd.read_csv('users.csv', nrows=5)
Out[110]:
user_id\titem_id\tbehavior_type\tuser_geohash\titem_category\ttime
0 10001082\t285259775\t1\t97lk14c\t4076\t2014-12...
1 10001082\t4368907\t1\t\t5503\t2014-12-12 12
2 10001082\t4368907\t1\t\t5503\t2014-12-12 12
3 10001082\t53616768\t1\t\t9762\t2014-12-02 15
4 10001082\t151466952\t1\t\t5232\t2014-12-12 11
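A self-contained sketch of nrows, using a synthetic in-memory file in place of users.csv:

```python
import io
import pandas as pd

# a synthetic 1000-row file standing in for users.csv
big = "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(1000)) + "\n"

# parse only the first 5 data rows; the rest of the file is not loaded
head = pd.read_csv(io.StringIO(big), nrows=5)
print(len(head))           # 5
print(head['x'].tolist())  # [0, 1, 2, 3, 4]
```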
First, generate a new dataset:
import numpy as np
import pandas as pd

l = ['L', 'B', 'G', 'R', 'Q', 'L', 'E', 'K', 'G']
l *= 3
n = len(l)
d = np.random.randn(n)
data = pd.DataFrame({'one' : d, 'two' : d, 'three' : d, 'key' : l})
#print(data)
data.to_csv('ex6.csv', index=False)
'''
one,two,three,key
-0.8253385888577351,-0.8253385888577351,-0.8253385888577351,L
-0.3005054628025023,-0.3005054628025023,-0.3005054628025023,B
0.08667646938805751,0.08667646938805751,0.08667646938805751,G
2.4717947134908873,2.4717947134908873,2.4717947134908873,R
-0.315253990143174,-0.315253990143174,-0.315253990143174,Q
-0.4602857943618783,-0.4602857943618783,-0.4602857943618783,L
0.5768993301685353,0.5768993301685353,0.5768993301685353,E
-0.4106573433924394,-0.4106573433924394,-0.4106573433924394,K
-0.7054454240211718,-0.7054454240211718,-0.7054454240211718,G
0.059222319806148255,0.059222319806148255,0.059222319806148255,L
2.1285400157366716,2.1285400157366716,2.1285400157366716,B
0.42874042415653413,0.42874042415653413,0.42874042415653413,G
-0.12655874642828271,-0.12655874642828271,-0.12655874642828271,R
1.4617361857075006,1.4617361857075006,1.4617361857075006,Q
1.4671052202024812,1.4671052202024812,1.4671052202024812,L
1.060765777158635,1.060765777158635,1.060765777158635,E
0.5057505511496003,0.5057505511496003,0.5057505511496003,K
-0.4852338681885889,-0.4852338681885889,-0.4852338681885889,G
0.48837632594424535,0.48837632594424535,0.48837632594424535,L
0.14959791412619264,0.14959791412619264,0.14959791412619264,B
0.08895475974680539,0.08895475974680539,0.08895475974680539,G
-1.3200711041372188,-1.3200711041372188,-1.3200711041372188,R
-0.07161822585532003,-0.07161822585532003,-0.07161822585532003,Q
0.15537064319789615,0.15537064319789615,0.15537064319789615,L
-1.309895454805748,-1.309895454805748,-1.309895454805748,E
0.05251784769130209,0.05251784769130209,0.05251784769130209,K
0.100051443342729,0.100051443342729,0.100051443342729,G
'''
To read this file in pieces, specify a chunksize as a number of rows per chunk:
chunker = pd.read_csv('ex6.csv', chunksize=100)
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot)
'''
L 6.0
G 6.0
Q 3.0
B 3.0
R 3.0
K 3.0
E 3.0
dtype: float64
'''
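A self-contained sketch of the same chunked aggregation, using a synthetic 100-row in-memory file so that chunksize=25 actually produces several pieces:

```python
import io
import pandas as pd

big = "x,y\n" + "\n".join(f"{i},{i % 3}" for i in range(100)) + "\n"

# the TextFileReader yields DataFrames of at most `chunksize` rows each
chunker = pd.read_csv(io.StringIO(big), chunksize=25)

sizes = []
tot = pd.Series([], dtype='float64')
for piece in chunker:
    sizes.append(len(piece))
    tot = tot.add(piece['y'].value_counts(), fill_value=0)

print(sizes)      # [25, 25, 25, 25]
print(tot.sum())  # 100.0
```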
3.1.2 Writing Data to Text Format
Using DataFrame's to_csv method, data can be exported to a comma-separated file; other delimiters work as well:
In [13]: data = pd.DataFrame(np.arange(12).reshape((3,4)),
...: index=['a', 'b', 'c'],
...: columns=['one', 'two', 'three', 'four'])
In [14]: data
Out[14]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
In [16]: import sys
In [17]: data.to_csv(sys.stdout, sep='|')  # write to the console
|one|two|three|four
a|0|1|2|3
b|4|5|6|7
c|8|9|10|11
Missing values appear as empty strings in the output, and you may want to denote them with some other sentinel value. (The frame here contains no missing values, so no NULL actually appears below; cells holding NaN would be written as NULL.)
In [18]: data.to_csv(sys.stdout, na_rep='NULL')  # write missing cells as NULL
,one,two,three,four
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
You can also write only a subset of the columns, in an order of your choosing:
In [19]: data
Out[19]:
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
In [20]: data.to_csv(sys.stdout, index=False, columns=['one', 'two'])
one,two
0,1
4,5
8,9
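Since the frame above has no missing values, na_rep had nothing to do; here is a minimal sketch where it does, writing to an io.StringIO buffer instead of sys.stdout:

```python
import io
import numpy as np
import pandas as pd

# a frame that actually contains missing values
df = pd.DataFrame({'one': [1.0, np.nan], 'two': [np.nan, 4.0]},
                  index=['a', 'b'])

buf = io.StringIO()
df.to_csv(buf, na_rep='NULL')  # missing cells are written as NULL
print(buf.getvalue())
```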
3.1.3 Working with Delimited Formats
The contents of ex7.csv are:
"a","b","c"
"1","2","3"
"1","2","3"
To use it, pass any open file or file-like object to csv.reader. Iterating through the reader like a file yields lists of strings with any quote characters removed:
import csv
f = open('ex7.csv')
reader = csv.reader(f)
for line in reader:
    print(line)
'''
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
'''
First, read the file into a list of lines:
In [29]: with open('ex7.csv') as f:
...: lines = list(csv.reader(f))
...:
In [33]: lines
Out[33]: [['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3']]
Then we split the lines into the header line and the data lines:
In [30]: header, values = lines[0], lines[1:]
In [34]: lines[0]
Out[34]: ['a', 'b', 'c']
In [35]: lines[1:]
Out[35]: [['1', '2', '3'], ['1', '2', '3']]
Then we create a dictionary of data columns using a dict comprehension and the expression zip(*values), which transposes rows to columns:
In [31]: data_dict = {h : v for h, v in zip(header, zip(*values))}
In [32]: data_dict
Out[32]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
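The csv module's writing counterpart is csv.writer, which likewise takes any open, writable file-like object; a sketch that reproduces ex7.csv's quoted layout in memory:

```python
import csv
import io

# csv.writer mirrors csv.reader; QUOTE_ALL wraps every field in quotes
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(('a', 'b', 'c'))
writer.writerow(('1', '2', '3'))
writer.writerow(('1', '2', '3'))

print(buf.getvalue().splitlines())
# ['"a","b","c"', '"1","2","3"', '"1","2","3"']
```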
3.1.4 JSON Data
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. An example:
In [1]: obj = """{"name" : "Wes", "places_lived" : ["United States", "Spain", "Germany"
...: ],"pet" : null, "siblings" : [{"name" : "Scott", "age" : 30, "pets" : [
...: "Zeus",
...: "Zuko"]}, {"name" : "Katie", "age" : 38, "pets" : ["Sixes", "Stache", "
...: Cisio"]}
...: ]}"""
In [2]: import json
In [3]: result = json.loads(obj)
In [4]: result
Out[4]:
{'name': 'Wes',
'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
{'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisio']}]}
json.dumps, on the other hand, converts a Python object back to JSON:
In [5]: asjson = json.dumps(result)
In [6]: asjson
Out[6]: '{"places_lived": ["United States", "Spain", "Germany"], "pet": null, "name": "Wes", "siblings": [{"age": 30, "pets": ["Zeus", "Zuko"], "name": "Scott"}, {"age": 38, "pets": ["Sixes", "Stache", "Cisio"], "name": "Katie"}]}'
How you convert a JSON object or list of objects to a DataFrame or other data structure is up to you. Conveniently, you can pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:
In [10]: siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
In [11]: siblings
Out[11]:
name age
0 Scott 30
1 Katie 38
pandas.read_json can automatically convert JSON datasets in particular arrangements into a Series or DataFrame. For example:
In [19]: ! type example.json
[{"a" : 1, "b" : 2, "c" : 3},
{"a" : 4, "b" : 5, "c" : 6},
{"a" : 7, "b" : 8, "c" : 9}]
In [20]: data = pd.read_json('example.json')
In [21]: data
Out[21]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
In [23]: print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
In [25]: print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
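A round-trip sketch of the two orientations (wrapping the string in io.StringIO when reading back, since newer pandas versions deprecate passing literal JSON strings to read_json):

```python
import io
import pandas as pd

data = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# orient='records' emits one JSON object per row;
# the default orient='columns' nests values under each column name
records = data.to_json(orient='records')
print(records)  # [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]

round_trip = pd.read_json(io.StringIO(records), orient='records')
print(round_trip.equals(data))  # True
```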