Jupyter notebook gives you a different reading experience!
Input and output typically fall into a few major categories: reading text files and other more efficient on-disk storage formats, loading data from databases, and interacting with network resources via Web APIs.
6.1 Reading and Writing Data in Text Format
pandas provides a number of functions for reading tabular data as a DataFrame object. Of these, read_csv and read_table are the most frequently used.
import pandas as pd

df = pd.read_csv('data/ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Although ex1.csv is a .csv file, it can also be read with read_table by specifying ',' as the delimiter:
pd.read_table('data/ex1.csv', sep=',')
d:\program filles\python\lib\site-packages\ipykernel_launcher.py:1: ParserWarning: Falling back to the 'python' engine because the separator encoded in utf-8 is > 1 char long, and the 'c' engine does not support such separators; you can avoid this warning by specifying engine='python'.
"""Entry point for launching an IPython kernel.
  a,b,c,d,message
0   1,2,3,4,hello
1   5,6,7,8,world
2  9,10,11,12,foo
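The claim that read_table with sep=',' matches read_csv can be checked without the sample data files by parsing an in-memory buffer; the CSV text below is an inline stand-in for ex1.csv:

```python
import io
import pandas as pd

# Inline copy of the ex1.csv contents used above
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

df1 = pd.read_csv(io.StringIO(csv_text))
# read_table defaults to sep='\t'; passing sep=',' makes it behave like read_csv
df2 = pd.read_table(io.StringIO(csv_text), sep=',')

print(df1.equals(df2))  # True
```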
Not every file has a header row. When reading such a file, you can either let pandas assign default column names or define your own:
pd.read_csv('data/ex2.csv', header=None)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
pd.read_csv('data/ex2.csv', names=['A', 'B', 'C', 'D', 'message'])
   A   B   C   D message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
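A quick runnable sketch of the two options, using an in-memory stand-in for ex2.csv so it does not depend on the data file:

```python
import io
import pandas as pd

# ex2.csv has no header row; this is an inline copy of its contents
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

default = pd.read_csv(io.StringIO(raw), header=None)  # columns become 0..4
named = pd.read_csv(io.StringIO(raw), names=['A', 'B', 'C', 'D', 'message'])

print(list(default.columns))  # [0, 1, 2, 3, 4]
print(list(named.columns))    # ['A', 'B', 'C', 'D', 'message']
```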
If you want the message column to be the index of the returned DataFrame, you can indicate this explicitly with the index_col argument:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('data/ex2.csv', names=names, index_col='message')
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
To form a hierarchical index from multiple columns, pass a list of column numbers or names:
parsed = pd.read_csv('data/csv_mindex.csv', index_col=['key1', 'key2'])
parsed
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
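The resulting MultiIndex can then be used for selection. A runnable sketch with an inline stand-in for the first rows of csv_mindex.csv:

```python
import io
import pandas as pd

# Inline stand-in for csv_mindex.csv; key1/key2 will form the index
raw = ("key1,key2,value1,value2\n"
       "one,a,1,2\n"
       "one,b,3,4\n"
       "two,a,9,10\n"
       "two,b,11,12\n")

parsed = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])
print(parsed.index.nlevels)                # 2 -> a hierarchical (MultiIndex)
print(parsed.loc[('one', 'b'), 'value1'])  # 3
```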
Some files contain rows of unwanted data; these can be skipped with the skiprows argument:
result = pd.read_csv('data/ex4.csv', skiprows=[0, 2, 3])
result
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Missing data is usually either absent (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL:
result = pd.read_csv('data/ex5.csv')
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
pd.isnull(result)
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False
The na_values option accepts a list or set of strings to treat as missing values:
result = pd.read_csv('data/ex5.csv', na_values=['NULL'])
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('data/ex5.csv', na_values=sentinels)
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
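Passing a dict to na_values applies different sentinels per column, as the output above shows. A self-contained sketch with an inline copy of the ex5.csv contents:

```python
import io
import pandas as pd

# Inline copy of ex5.csv as shown above
raw = ("something,a,b,c,d,message\n"
       "one,1,2,3,4,NA\n"
       "two,5,6,,8,world\n"
       "three,9,10,11,12,foo\n")

# Per-column sentinels: 'foo'/'NA' in message and 'two' in something become NaN
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(io.StringIO(raw), na_values=sentinels)
print(result['message'].isnull().tolist())    # [True, False, True]
print(result['something'].isnull().tolist())  # [False, True, False]
```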
Some commonly used arguments of read_csv and read_table
Reading Text Files in Pieces
When processing very large files, or working out the right set of arguments so a large file can be processed later, you may only want to read a small piece of the file or iterate through it in chunks. Before looking at a large file, we first adjust a pandas display setting:
pd.options.display.max_rows = 8
result = pd.read_csv('data/examples/ex6.csv')
result
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
...        ...       ...       ...       ...  ..
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0
10000 rows × 5 columns
If you only want to read a small number of rows, specify that with nrows:
pd.read_csv('data/examples/ex6.csv', nrows=5)
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
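nrows limits how many data rows are parsed, which is cheap even on very large files. A minimal sketch with generated in-memory data (the column name x is made up for illustration):

```python
import io
import pandas as pd

# 100 data rows under a single column named 'x'
raw = "x\n" + "\n".join(str(i) for i in range(100))

head = pd.read_csv(io.StringIO(raw), nrows=5)  # only the first 5 data rows
print(len(head))          # 5
print(head['x'].tolist())  # [0, 1, 2, 3, 4]
```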
To read a file in pieces, specify a chunksize as a number of rows:
chunker = pd.read_csv('data/examples/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0xed5ed0>
The TextFileReader object returned by read_csv lets you iterate over the file in chunks of chunksize rows. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column, like so:
tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E 368.0
X 364.0
L 346.0
O 343.0
...
J 337.0
F 335.0
K 334.0
H 330.0
Length: 10, dtype: float64
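The same aggregation pattern can be run anywhere against a small in-memory stand-in for ex6.csv (the letters below are made up for illustration). Note that in newer pandas, pd.Series([]) without a dtype emits a warning, so an explicit dtype is passed here:

```python
import io
import pandas as pd

# Tiny stand-in for ex6.csv: a 'key' column of letters
raw = "key\n" + "\n".join(list("ABABCA"))

chunker = pd.read_csv(io.StringIO(raw), chunksize=2)  # 2 rows per chunk
tot = pd.Series(dtype='float64')  # explicit dtype avoids a FutureWarning
for piece in chunker:
    # accumulate per-chunk counts; fill_value=0 handles keys unseen so far
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot)
```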
Writing Data to Text Format
data = pd.read_csv('data/examples/ex5.csv')
data
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
data.to_csv('data/out.csv')
Missing values appear as empty strings in the output. You may want them denoted by some other sentinel value:
import sys
data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
With no other options specified, both the row and column labels are written. Either of them can be disabled:
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
data.to_csv(sys.stdout, index=False, columns=['a', 'c', 'b'])
a,c,b
1,3.0,2
5,,6
9,11.0,10
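The to_csv options above can be combined, and writing into an in-memory buffer makes the result easy to inspect. A sketch with a small made-up frame standing in for ex5.csv:

```python
import io
import pandas as pd

# A small frame with a missing value, standing in for ex5.csv
df = pd.DataFrame({'a': [1, 5], 'b': [2.0, None], 'c': ['x', 'y']})

buf = io.StringIO()
# na_rep replaces NaN in the output, index=False drops row labels,
# and columns selects (and orders) a subset of columns
df.to_csv(buf, na_rep='NULL', index=False, columns=['a', 'b'])
print(buf.getvalue())
```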
import numpy as np

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('data/tseries.csv')
pd.read_csv('data/tseries.csv')
   2000-01-01  0
0  2000-01-02  1
1  2000-01-03  2
2  2000-01-04  3
3  2000-01-05  4
4  2000-01-06  5
5  2000-01-07  6
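Note that the naive roundtrip above turns the dates into an ordinary column. One way (among several) to recover the original Series is to write without a header and parse the first column back as a date index; the header behavior of Series.to_csv has changed across pandas versions, so header=False is passed explicitly here:

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)

buf = io.StringIO()
ts.to_csv(buf, header=False)  # write plain "date,value" lines
buf.seek(0)

# Read back: no header row, first column as the index, parsed as dates
back = pd.read_csv(buf, header=None, index_col=0, parse_dates=True).iloc[:, 0]
print(back.index[0])  # 2000-01-01 00:00:00
```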
Working with Delimited Formats
Most forms of tabular data stored on disk can be loaded with pandas.read_csv, but sometimes some manual processing is necessary.
import pandas as pd
import csv

f = open("data/examples/ex7.csv")
reader = csv.reader(f)
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
To get the data into the required form, we do a bit of work. First, read the file into a list of lines:
with open("data/examples/ex7.csv") as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
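The zip(*values) transpose step is the crux: it turns a list of rows into a tuple per column. Here is the same recipe, runnable against an inline copy of the ex7.csv contents:

```python
import csv
import io

# Inline equivalent of ex7.csv
text = 'a,b,c\n1,2,3\n1,2,3\n'

lines = list(csv.reader(io.StringIO(text)))
header, values = lines[0], lines[1:]
# zip(*values) transposes rows into columns, pairing each with its header
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)  # {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
```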
CSV files come in many different flavors. To define a new format with its own delimiter, string quoting convention, or line terminator, define a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = '|'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

with open("data/examples/ex7.csv") as f:
    reader = csv.reader(f, dialect=my_dialect)
    print(reader)
<_csv.reader object at 0x07025DB0>
Individual CSV dialect parameters can also be given as keyword arguments to csv.reader without having to define a subclass:
with open("data/examples/ex7.csv") as f:
    reader = csv.reader(f, delimiter='|')
    print(reader)
<_csv.reader object at 0x07025D30>
with open('data/mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
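The dialect works for writing as well as reading. Writing into an in-memory buffer and reading it back shows the '|'-delimited roundtrip without touching the filesystem:

```python
import csv
import io

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = '|'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# Write with the custom dialect into memory, then read it back
buf = io.StringIO()
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))

buf.seek(0)
rows = list(csv.reader(buf, dialect=my_dialect))
print(rows)  # [['one', 'two', 'three'], ['1', '2', '3']]
```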
JSON Data
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV.
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
JSON is very nearly valid Python code, with the exception of its null value and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data; I'll use json here, as it is built into the Python standard library. json.loads converts a JSON string to Python form:
import json

result = json.loads(obj)
result
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
The simplest way to convert a JSON object (or a collection of them) to a DataFrame is to pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings
pandas.read_json can automatically convert JSON datasets in certain layouts into a Series or DataFrame. The default assumes that each object in the JSON array is a row in the table:
data = pd.read_json('data/examples/example.json')
data
To export data from pandas to JSON, use the to_json method:
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
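Matching the orient between to_json and read_json gives a clean roundtrip. A self-contained sketch (the example.json file is replaced by a frame built inline):

```python
import io
import pandas as pd

data = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3},
                     {'a': 4, 'b': 5, 'c': 6}])

records = data.to_json(orient='records')
print(records)  # [{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6}]

# Round-trip: read_json with the matching orient rebuilds the frame
back = pd.read_json(io.StringIO(records), orient='records')
print(back.equals(data))  # True
```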
XML and HTML: Web Scraping
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats, including lxml, Beautiful Soup, and html5lib. While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML or XML files. pandas has a built-in function, read_html, which uses lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.
import pandas as pd

tables = pd.read_html('data/examples/fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
                      Bank Name             City  ST   CERT                Acquiring Institution        Closing Date       Updated Date
0                   Allied Bank         Mulberry  AR     91                         Today's Bank  September 23, 2016  November 17, 2016
1  The Woodbury Banking Company         Woodbury  GA  11297                          United Bank     August 19, 2016  November 17, 2016
2        First CornerStone Bank  King of Prussia  PA  35312  First-Citizens Bank & Trust Company         May 6, 2016  September 6, 2016
3            Trust Company Bank          Memphis  TN   9956           The Bank of Fayette County      April 29, 2016  September 6, 2016
4    North Milwaukee State Bank        Milwaukee  WI  20364  First-Citizens Bank & Trust Company      March 11, 2016      June 16, 2016
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010 157
2009 140
2011 92
2012 51
2008 25
2013 24
2014 18
2002 11
2015 8
2016 5
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, dtype: int64
Parsing XML with lxml.objectify
XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. We just saw the pandas.read_html function; XML and HTML are structurally similar, but XML is more general.
from lxml import objectify

path = 'data/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
perf = pd.DataFrame(data)
perf.head()
            AGENCY_NAME            CATEGORY                                        DESCRIPTION FREQUENCY                        INDICATOR_NAME INDICATOR_UNIT MONTHLY_ACTUAL MONTHLY_TARGET PERIOD_MONTH PERIOD_YEAR YTD_ACTUAL YTD_TARGET
0  Metro-North Railroad  Service Indicators  Percent of commuter trains that arrive at thei...         M  On-Time Performance (West of Hudson)              %           96.9             95            1        2008       96.9         95
1  Metro-North Railroad  Service Indicators  Percent of commuter trains that arrive at thei...         M  On-Time Performance (West of Hudson)              %             95             95            2        2008         96         95
2  Metro-North Railroad  Service Indicators  Percent of commuter trains that arrive at thei...         M  On-Time Performance (West of Hudson)              %           96.9             95            3        2008       96.3         95
3  Metro-North Railroad  Service Indicators  Percent of commuter trains that arrive at thei...         M  On-Time Performance (West of Hudson)              %           98.3             95            4        2008       96.8         95
4  Metro-North Railroad  Service Indicators  Percent of commuter trains that arrive at thei...         M  On-Time Performance (West of Hudson)              %           95.8             95            5        2008       96.6         95
from io import StringIO

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
root
<Element a at 0xcac8a0>
root.get('href')
'http://www.google.com'
root.text
'Google'
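For readers without the MTA data file, the same records-to-DataFrame recipe can be exercised on a tiny hand-made XML document (the tag names below merely imitate Performance_MNR.xml):

```python
import io

import pandas as pd
from lxml import objectify

# A tiny XML document in the same spirit as Performance_MNR.xml
xml = """<PERFORMANCE>
  <INDICATOR>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2008</PERIOD_YEAR>
    <MONTHLY_ACTUAL>96.9</MONTHLY_ACTUAL>
  </INDICATOR>
  <INDICATOR>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2008</PERIOD_YEAR>
    <MONTHLY_ACTUAL>95.0</MONTHLY_ACTUAL>
  </INDICATOR>
</PERFORMANCE>"""

root = objectify.parse(io.StringIO(xml)).getroot()
rows = []
for elt in root.INDICATOR:
    # child.pyval converts text to the matching Python type (int, float, str)
    rows.append({child.tag: child.pyval for child in elt.getchildren()})
perf = pd.DataFrame(rows)
print(perf.shape)  # (2, 3)
```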