3. Data Loading, Storage, and File Formats
- Reading text files
- Loading data from databases
- Web APIs
import pandas as pd
import numpy as np
3.1 Reading and Writing Data in Text Format
pandas provides a number of functions for turning tabular data into a DataFrame object; read_csv and read_table are by far the most commonly used.
Function | Description |
---|---|
read_csv | Load delimited data from a file, URL, or file-like object; the default delimiter is a comma |
read_table | Load delimited data from a file, URL, or file-like object; the default delimiter is a tab (**'\t'**) |
read_fwf | Read data in fixed-width column format (i.e., no delimiters) |
read_excel | Read tabular data from an Excel XLS or XLSX file |
read_clipboard | Version of read_table that reads data from the clipboard; useful for converting tables from web pages |
read_hdf | Read HDF5 files written by pandas |
read_html | Read all tables found in the given HTML document |
read_json | Read data from a JSON string representation |
read_msgpack | Read pandas data encoded using the MessagePack binary format |
read_pickle | Read an object stored in Python pickle format |
read_sas | Read a SAS dataset stored in one of SAS's custom storage formats |
read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame |
read_stata | Read a dataset from Stata file format |
read_feather | Read the Feather binary file format |
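The examples that follow use small sample files from the book's ch06 data. As a minimal, hypothetical sketch, example1.csv can be recreated locally (its contents are inferred from the outputs shown below):
sample = 'a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n'
with open('example1.csv', 'w') as f:
    f.write(sample)  # write the sample CSV so the reads below are reproducible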
df = pd.read_csv('example1.csv')
df
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
pd.read_table('example1.csv')
a,b,c,d,message | |
---|---|
0 | 1,2,3,4,hello |
1 | 5,6,7,8,world |
2 | 9,10,11,12,foo |
We forgot to specify the delimiter above, so all the columns were squeezed together into one. Pass read_table a comma as the separator (sep):
pd.read_table('example1.csv', sep=',')
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
If we don't want pandas to use the first row as column names, we can set header to None (pandas assigns default numeric names), or supply the column names ourselves:
pd.read_csv('example1.csv',header=None)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | a | b | c | d | message |
1 | 1 | 2 | 3 | 4 | hello |
2 | 5 | 6 | 7 | 8 | world |
3 | 9 | 10 | 11 | 12 | foo |
pd.read_csv('example1.csv',names=['one','two','three','four','vocabulary'])
one | two | three | four | vocabulary | |
---|---|---|---|---|---|
0 | a | b | c | d | message |
1 | 1 | 2 | 3 | 4 | hello |
2 | 5 | 6 | 7 | 8 | world |
3 | 9 | 10 | 11 | 12 | foo |
To use one of the columns as the index (the row labels), pass index_col:
names=['one','two','three','four','vocabulary']
pd.read_csv('example1.csv',names=names,index_col=4)
one | two | three | four | |
---|---|---|---|---|
vocabulary | ||||
message | a | b | c | d |
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
Passing a list of column numbers or names forms a hierarchical index from multiple columns:
parsed = pd.read_csv('csv_mindex.csv', index_col=['key1','key2'])
parsed
value1 | value2 | ||
---|---|---|---|
key1 | key2 | ||
one | a | 1 | 2 |
b | 3 | 4 | |
c | 5 | 6 | |
d | 7 | 8 | |
two | a | 9 | 10 |
b | 11 | 12 | |
c | 13 | 14 | |
d | 15 | 16 |
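With the hierarchical index in place, rows can be selected by index level; a quick sketch using the parsed frame above:
parsed.loc['one']          # all rows where key1 == 'one'
parsed.loc[('one', 'b')]   # the single row where key1 == 'one' and key2 == 'b'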
Some tables don't separate fields with a fixed delimiter, using whitespace or some other pattern instead.
In such cases, pass a regular expression as the separator:
pd.read_table('example3.txt', sep=r'\s+')  # the regex \s+ matches one or more whitespace characters
Because example3.txt has one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index:
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
d | 10 | 11 | 12 |
The skiprows option lets us skip junk rows:
pd.read_table('example4.txt', skiprows=[0,2,3], sep=',', index_col=4)
a | b | c | d | |
---|---|---|---|---|
message | ||||
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
list(open('example4.txt'))  # in fact, we should inspect the file first to see which rows we don't need
['#hey!\n',
'a,b,c,d,message\n',
'#just wanted to make things more difficult for you\n',
'#who reads CSV files with computers, anyway?\n',
'1,2,3,4,hello\n',
'5,6,7,8,world\n',
'9,10,11,12,foo']
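Since the junk lines here all begin with '#', an alternative sketch is the comment option (described in the parameter table below), which drops such lines without hard-coding row numbers:
pd.read_csv('example4.txt', comment='#')  # ignore any line starting with '#'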
Missing or blank fields are filled in with NA values, which can later be handled with fillna:
result = pd.read_csv('example5.csv',skiprows=[0,2,3])
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2.0 | NaN | 4 | hello |
1 | two | 5 | 6.0 | 7.0 | 8 | world |
2 | three | 9 | NaN | 11.0 | 12 | foo |
pd.isnull(result)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | False | False | False | True | False | False |
1 | False | False | False | False | False | False |
2 | False | False | True | False | False | False |
Sometimes, besides blanks, there are other sentinel values we'd like to mark as NA; the na_values option accepts a list of them:
result=pd.read_csv('example5.csv',na_values=[0,1],skiprows=[0,2,3])
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | NaN | 2.0 | NaN | 4 | hello |
1 | two | 5.0 | 6.0 | 7.0 | 8 | world |
2 | three | 9.0 | NaN | 11.0 | 12 | foo |
Different NA sentinels can be specified for each column by passing a dict:
sentinels = {'message': ['foo', 'hello'], 'something': ['two']}
pd.read_csv('example5.csv',na_values=sentinels,skiprows=[0,2,3])
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2.0 | NaN | 4 | NaN |
1 | NaN | 5 | 6.0 | 7.0 | 8 | world |
2 | three | 9 | NaN | 11.0 | 12 | NaN |
Arguments to the read_csv/read_table functions:
Argument | Description |
---|---|
path | String indicating a filesystem location, URL, or file-like object |
sep or delimiter | Character sequence or regular expression used to split fields in each row |
header | Row number to use as column names; defaults to 0 (the first row). Should be None if the file has no header row |
index_col | Column number(s) or name(s) to use as the row index; can be a single name/number or a list of them for a hierarchical index |
names | List of column names for the result; combine with header=None |
skiprows | Number of rows to skip from the start of the file, or a list of row numbers (counted from 0) to skip |
skipfooter | Number of lines to ignore at the end of the file |
na_values | Sequence of values to replace with NA |
comment | Character(s) used to split comments off the end of lines |
parse_dates | Attempt to parse data to datetime; False by default. If True, attempts to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If an element of the list is itself a list or tuple, the multiple columns are combined and parsed together (e.g., when date and time are split across two columns) |
keep_date_col | If columns were joined to parse dates, keep the joined columns; False by default |
converters | Dict mapping column numbers or names to functions (e.g., {'foo': f} applies the function f to all values in the 'foo' column) |
dayfirst | When parsing potentially ambiguous dates, treat them as international format (e.g., 7/6/2012 -> June 7, 2012); False by default |
date_parser | Function to use for parsing dates |
nrows | Number of rows to read from the beginning of the file |
iterator | Return a TextParser object for reading the file piecemeal |
chunksize | For iteration, the size of file chunks (in rows) |
verbose | Print various parser output information, such as the number of missing values in non-numeric columns |
encoding | Text encoding for Unicode (sometimes you need to switch to 'gbk' to read a file) |
squeeze | If the parsed data contains only one column, return a Series |
thousands | Separator for thousands, e.g., ',' or '.' (for large numbers) |
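A quick sketch combining several of these options on the shop data used in the next example (the file name is taken from below; parse_dates and nrows are the options being demonstrated):
df = pd.read_csv('shop_payNum_new.csv', parse_dates=['time_stamp'], nrows=100)
df.dtypes  # time_stamp should now be datetime64[ns]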
Reading text files in pieces
Before reading a large file, we can make pandas' display settings more compact:
pd.options.display.max_rows = 10
result = pd.read_csv('shop_payNum_new.csv')
result
time_stamp | shop_id | pay_num | cate_2_name | |
---|---|---|---|---|
0 | 2016-01-01 | 14 | 33 | fast food |
1 | 2016-01-02 | 14 | 35 | fast food |
2 | 2016-01-03 | 14 | 20 | fast food |
3 | 2016-01-04 | 14 | 53 | fast food |
4 | 2016-01-05 | 14 | 54 | fast food |
... | ... | ... | ... | ... |
10021 | 2016-10-27 | 1715 | 45 | hot pot |
10022 | 2016-10-28 | 1715 | 60 | hot pot |
10023 | 2016-10-29 | 1715 | 110 | hot pot |
10024 | 2016-10-30 | 1715 | 123 | hot pot |
10025 | 2016-10-31 | 1715 | 46 | hot pot |
10026 rows × 4 columns
nrows
Read only a small number of rows, avoiding reading the entire file:
pd.read_csv('shop_payNum_new.csv', nrows=5)
time_stamp | shop_id | pay_num | cate_2_name | |
---|---|---|---|---|
0 | 2016-01-01 | 14 | 33 | fast food |
1 | 2016-01-02 | 14 | 35 | fast food |
2 | 2016-01-03 | 14 | 20 | fast food |
3 | 2016-01-04 | 14 | 53 | fast food |
4 | 2016-01-05 | 14 | 54 | fast food |
chunksize (the size of each chunk)
Specified as a number of rows:
chunker = pd.read_csv('PythonForDataAnalysis-master/ch06/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x29e098a7070>
tot = pd.Series([], dtype=np.float64)  # specify dtype=np.float64 to avoid a warning about the default dtype of an empty Series
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
...
5 157.0
2 152.0
0 151.0
9 150.0
1 146.0
Length: 36, dtype: float64
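TextParser also has a get_chunk method that lets you read pieces of arbitrary size; a sketch using iterator=True:
reader = pd.read_csv('PythonForDataAnalysis-master/ch06/ex6.csv', iterator=True)
piece = reader.get_chunk(10)  # read the next 10 rows as a DataFrame
len(piece)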
Writing data out to text format
data = pd.read_csv('PythonForDataAnalysis-master/ch06/ex5.csv')
data
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
to_csv
data.to_csv('PythonForDataAnalysis-master/ch06/out.csv')  # by default, writes the data to a comma-separated file
The sep argument (seen earlier):
import sys
data.to_csv(sys.stdout, sep ='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
Here the output goes straight to sys.stdout, i.e., it is printed to the console; passing a file path instead works exactly the same way.
The na_rep argument
Short for "replace NA": missing values appear as empty strings in the output, so this argument denotes them with some other sentinel value:
data.to_csv(sys.stdout, sep = '/', na_rep=0)
/something/a/b/c/d/message
0/one/1/2/3.0/4/0
1/two/5/6/0/8/world
2/three/9/10/11.0/12/foo
The index and header arguments
- index controls whether the row labels are written; True by default
- header controls whether the column labels are written; True by default
data.to_csv(sys.stdout, sep = '|', na_rep = 0, index= False, header=False)
one|1|2|3.0|4|0
two|5|6|0|8|world
three|9|10|11.0|12|foo
The columns argument
Write only a subset of the columns:
data.to_csv(sys.stdout, index=False,columns=list('abc'))
a,b,c
1,2,3.0
5,6,
9,10,11.0
Series also has a to_csv method:
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7),index=dates)
ts.to_csv('PythonForDataAnalysis-master/ch06/tseries.csv')
ts.to_csv(sys.stdout, header=False)
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
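The written file can be read back into a Series; a sketch (recent pandas versions write a header row by default, hence no header=None here):
ts2 = pd.read_csv('PythonForDataAnalysis-master/ch06/tseries.csv', index_col=0, parse_dates=True)
ts2.squeeze('columns')  # collapse the one-column DataFrame back into a Series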
Working with delimited formats: csv.reader
list(open('PythonForDataAnalysis-master/ch06/ex7.csv'))
['"a","b","c"\n', '"1","2","3"\n', '"1","2","3","4"\n']
f = open('PythonForDataAnalysis-master/ch06/ex7.csv')
import csv
reader = csv.reader(f)
for line in reader:
    print(line)  # the quotes and commas are gone; each line is now a list of strings
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
Further processing:
- Read the file into a list of lines
with open('PythonForDataAnalysis-master/ch06/ex7.csv') as f:
    lines = list(csv.reader(f))
- Split the lines into a header line and data lines
header, values = lines[0], lines[1:]
- Build a dict of data columns with a dict comprehension and zip(*values), which transposes rows to columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
* Testing what zip(*values) does:
for i in zip(*values):
    print(i)
('1', '1')
('2', '2')
('3', '3')
for i in values:
    print(i)
['1', '2', '3']
['1', '2', '3', '4']
Defining your own output format: subclass csv.Dialect
class my_dialect(csv.Dialect):
    lineterminator = '\n'        # end each line with a newline
    delimiter = ';'              # field delimiter
    quotechar = '"'              # quote character
    quoting = csv.QUOTE_MINIMAL  # only quote fields containing special characters
f = open('PythonForDataAnalysis-master/ch06/ex7.csv')
reader = csv.reader(f, dialect=my_dialect)
The individual csv dialect options can also be given as keyword arguments to csv.reader, with no need to define a subclass.
Argument | Description |
---|---|
delimiter | One-character string to separate fields; defaults to ',' |
lineterminator | Line terminator for writing; defaults to '\r\n'. The reader ignores this and recognizes cross-platform line terminators |
quotechar | Quote character for fields containing special characters (such as a delimiter); defaults to '"' |
quoting | Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only quote fields containing special characters like the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting); defaults to QUOTE_MINIMAL |
skipinitialspace | Ignore whitespace after each delimiter; defaults to False |
doublequote | How to handle the quoting character inside a field; if True, it is doubled |
escapechar | String used to escape the delimiter |
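For example, the delimiter can be passed directly as a keyword, with no dialect subclass needed:
reader = csv.reader(f, delimiter='|')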
Note: for files with more complicated or multicharacter delimiters, the csv module can't cope; you'll have to do the line splitting and other cleanup using the string split method or the regular expression method re.split.
Writing delimited files manually: csv.writer
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
Verify the result:
with open('mydata.csv') as f:
    lines = list(csv.reader(f, delimiter=';'))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'one': ('1', '4', '7'), 'two': ('2', '5', '8'), 'three': ('3', '6', '9')}
JSON data
obj = """
{"name":"wes",
"places_lived":["United States","Spain","Germany"],
"pet": null,
"siblings": [{"name":"Scott","age":30, "pets":["Zeus","Zuko"]},
{"name":"Katie","age":38,"pets": ["Sixes","Stache","Cisco"]}
]
}
"""
import json
result = json.loads(obj)
result
{'name': 'wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
json.dumps
converts a Python object back to JSON:
asjson = json.dumps(result)
asjson
'{"name": "wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
To convert a JSON object (or a list of objects) to a DataFrame, pass it directly to the DataFrame constructor:
siblings = pd.DataFrame(result['siblings'])
siblings
name | age | pets | |
---|---|---|---|
0 | Scott | 30 | [Zeus, Zuko] |
1 | Katie | 38 | [Sixes, Stache, Cisco] |
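A list of fields can also be passed to keep only selected columns; a sketch:
pd.DataFrame(result['siblings'], columns=['name', 'age'])  # keep only name and age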
pandas.read_json
automatically converts JSON datasets in certain layouts into a Series or DataFrame:
data = pd.read_json('PythonForDataAnalysis-master/ch06/example.json')
data
a | b | c | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
The to_json method
exports data from pandas to JSON:
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
XML and HTML: web scraping
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats, including lxml, Beautiful Soup, and html5lib.
lxml is comparatively much faster, but the other libraries are better at handling malformed HTML or XML files.
pandas.read_html
By default, it searches for and attempts to parse all tabular data contained within <table> tags; the result is a list of DataFrame objects:
tables = pd.read_html('PythonForDataAnalysis-master/ch06/fdic_failed_bank_list.html')  # an HTML file listing failed banks
len(tables)
1
failures = tables[0]
failures.head()
Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date | |
---|---|---|---|---|---|---|---|
0 | Allied Bank | Mulberry | AR | 91 | Today's Bank | September 23, 2016 | November 17, 2016 |
1 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | August 19, 2016 | November 17, 2016 |
2 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | September 6, 2016 |
3 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | September 6, 2016 |
4 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 |
Data cleaning and analysis
For example, computing the number of bank failures by year:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010 157
2009 140
2011 92
2012 51
2008 25
...
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, Length: 15, dtype: int64
Parsing XML with lxml.objectify
from lxml import objectify
path = 'PythonForDataAnalysis-master/ch06/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))  # parse the file into an XML object tree
root = parsed.getroot()
root
<Element PERFORMANCE at 0x29e0967d800>
root.INDICATOR returns a generator yielding each <INDICATOR> XML element:
data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
perf = pd.DataFrame(data)
perf.head()
AGENCY_NAME | INDICATOR_NAME | DESCRIPTION | PERIOD_YEAR | PERIOD_MONTH | CATEGORY | FREQUENCY | INDICATOR_UNIT | YTD_TARGET | YTD_ACTUAL | MONTHLY_TARGET | MONTHLY_ACTUAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 1 | Service Indicators | M | % | 95 | 96.9 | 95 | 96.9 |
1 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 2 | Service Indicators | M | % | 95 | 96 | 95 | 95 |
2 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 3 | Service Indicators | M | % | 95 | 96.3 | 95 | 96.9 |
3 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 4 | Service Indicators | M | % | 95 | 96.8 | 95 | 98.3 |
4 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 5 | Service Indicators | M | % | 95 | 96.6 | 95 | 95.8 |
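As an aside, newer pandas versions (1.3+) also ship pandas.read_xml, which can often parse flat, record-like XML such as this directly; a sketch (the resulting columns may differ slightly from the manual objectify approach above):
perf2 = pd.read_xml(path)  # requires lxml; each repeating child element becomes a row
perf2.head()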
XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
root
<Element a at 0x29e095c6ac0>
root.get('href')
'http://www.google.com'
root.text
'Google'