"Python for Data Analysis" Study Notes, ch06 (Part 7)

Chapter 6: Data Loading, Storage, and File Formats

Input and output typically falls into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network resources via web APIs.

Reading and Writing Data in Text Format

import pandas as pd
import numpy as np
from pandas import Series,DataFrame
!type C:\\pytest\\ch06\\ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since the file is comma-delimited, we can use read_csv to read it into a DataFrame:

df=pd.read_csv('C:\\pytest\\ch06\\ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

We could also have used read_table, in which case the delimiter must be specified:

pd.read_table('C:\\pytest\\ch06\\ex1.csv',sep=',')
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
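The Windows paths above are specific to the author's machine; read_csv also accepts any file-like object, so the same example can be made self-contained with io.StringIO (the data below is copied from ex1.csv):

```python
import io
import pandas as pd

# Same contents as ex1.csv, supplied as an in-memory string so the example
# runs without the author's C:\pytest directory.
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (3, 5)
print(list(df.columns))  # ['a', 'b', 'c', 'd', 'message']
```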

A file without a header row:

!type C:\\pytest\\ch06\\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',header=None)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',names=['a','b','c','d','message'])
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
names=['a','b','c','d','message']
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',names=names,index_col='message')
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

To form a hierarchical index from multiple columns, just pass a list of column numbers or names:

!type C:\\pytest\\ch06\\csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed=pd.read_csv('C:\\pytest\\ch06\\csv_mindex.csv',index_col=['key1','key2'])
parsed
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16

Some tables don't use a fixed delimiter to separate fields; in those cases, you can write a regular expression to use as the delimiter for read_table.

list(open('C:\\pytest\\ch06\\ex3.txt'))
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The fields in this file are separated by a variable amount of whitespace, which can be expressed with the regular expression \s+:

result=pd.read_table('C:\\pytest\\ch06\\ex3.txt',sep=r'\s+')    # use a raw string so \s is not treated as an escape
result
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

skiprows can be used to skip the first, third, and fourth rows of the file:

!type C:\\pytest\\ch06\\ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('C:\\pytest\\ch06\\ex4.csv',skiprows=[0,2,3])
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

pandas recognizes a set of commonly occurring sentinel values, such as NA, -1.#IND, and NULL:

!type C:\\pytest\\ch06\\ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result=pd.read_csv('C:\\pytest\\ch06\\ex5.csv')
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
pd.isnull(result)
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False

na_values can take a list of strings to treat as missing values:

result=pd.read_csv('C:\\pytest\\ch06\\ex5.csv',na_values=['NULL'])
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

A dict can be used to specify different NA sentinels for each column:

sentinels={'message':['foo','NA'],'something':['two']}
pd.read_csv('C:\\pytest\\ch06\\ex5.csv',na_values=sentinels)
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN

Reading Text Files in Pieces

Read part of a file, or iterate through it in chunks:

result=pd.read_csv('C:\\pytest\\ch06\\ex6.csv')
result
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0

10000 rows × 5 columns

Read just a few rows with nrows:

pd.read_csv('C:\\pytest\\ch06\\ex6.csv',nrows=5)
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q

To read a file in pieces, specify a chunksize (number of rows):

chunker=pd.read_csv('C:\\pytest\\ch06\\ex6.csv',chunksize=1000)
chunker
tot=Series([],dtype=float)
for piece in chunker:
    tot=tot.add(piece['key'].value_counts(),fill_value=0)
tot=tot.sort_values(ascending=False)    # Series.order is deprecated in favor of sort_values
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
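Since ex6.csv may not exist locally, here is a self-contained sketch of the same chunked-aggregation pattern, with a small in-memory CSV standing in for the real file:

```python
import io
import pandas as pd

# A tiny stand-in for ex6.csv: a single 'key' column with 20 rows.
csv_text = "key\n" + "\n".join("ABAB" * 5) + "\n"

tot = pd.Series(dtype=float)
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Accumulate per-chunk value counts, exactly as in the example above.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot)  # 'A' and 'B' each appear 10 times
```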

Writing Data Out to Text Format

Data can also be exported to a delimited format:

data=pd.read_csv('C:\\pytest\\ch06\\ex5.csv')
data
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

Using DataFrame's to_csv method, we can write the data out to a comma-separated file:

data.to_csv('C:\\pytest\\ch06\\out.csv')
!type C:\\pytest\\ch06\\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output; you might want them denoted by some other sentinel value:

data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

You can also write only a subset of the columns, in an order of your choosing:

data.to_csv(sys.stdout,index=False,columns=['a','b','c'])    # the keyword was cols= in older pandas
a,b,c
1,2,3.0
5,6,
9,10,11.0

Series also has a to_csv method:

dates=pd.date_range('1/1/2000',periods=7)
ts=Series(np.arange(7),index=dates)
ts.to_csv('C:\\pytest\\ch06\\tseries.csv')
!type C:\\pytest\\ch06\\tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
Series.from_csv('C:\\pytest\\ch06\\tseries.csv',parse_dates=True)    # removed in newer pandas; use pd.read_csv with index_col=0, header=None instead
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64

Working with Delimited Formats Manually

Tabular data can usually be loaded with functions like pandas.read_table, but a file with malformed lines sometimes requires manual processing:

!type C:\\pytest\\ch06\\ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"

For any file with a single-character delimiter, you can use Python's built-in csv module directly:

import csv
f=open('C:\\pytest\\ch06\\ex7.csv')
reader=csv.reader(f)
reader
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
lines=list(csv.reader(open('C:\\pytest\\ch06\\ex7.csv')))
header,values=lines[0],lines[1:]
data_dict={h: v for h,v in zip(header,zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many flavors; to define a new format (a particular delimiter, string quoting convention, line terminator, and so on), define a simple subclass of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator='\n'
    delimiter=';'
    quotechar='"'
    quoting=csv.QUOTE_MINIMAL    # csv.Dialect subclasses must also set quoting
reader=csv.reader(f,dialect=my_dialect)
# individual dialect parameters can also be passed as keywords:
reader=csv.reader(f,delimiter='|')
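The same dialect works with csv.writer; a self-contained sketch writing to an in-memory buffer:

```python
import csv
import io

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # required when subclassing csv.Dialect

# Write two rows using the custom dialect; an in-memory buffer stands in
# for a file so the example is self-contained.
buf = io.StringIO()
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))
print(buf.getvalue())
# one;two;three
# 1;2;3
```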

JSON Data

JSON is a much more flexible data format than tabular text formats like CSV:

obj="""{"name":"Wes",
        "place_lived":["United States","Spain","Germany"],
        "pet":null,
        "siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]}"""
json.loads converts a JSON string to Python form:
import json
result=json.loads(obj)
result
{'name': 'Wes',
 'pet': None,
 'place_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
  {'age': 33, 'name': 'Katie', 'pet': 'Cisco'}]}

json.dumps, on the other hand, converts a Python object back to JSON:

asjson=json.dumps(result)
To convert a JSON object (or a list of them) to a DataFrame or other structure for analysis, pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:
siblings=DataFrame(result['siblings'],columns=['name','age'])
siblings
    name  age
0  Scott   25
1  Katie   33

XML and HTML: Web Scraping

Python has many libraries for reading and writing data in HTML and XML formats; lxml is one of them, and it parses large files efficiently and reliably. Find the URL you want, open it with urllib, then parse the resulting stream with lxml:

from lxml.html import parse
from urllib.request import urlopen
parsed=parse(urlopen('https://finance.yahoo.com/q/op?s=AAPL+Options'))
doc=parsed.getroot()
Through this object you can extract all HTML tags of a particular type. Use the document root's findall method along with an XPath:
links=doc.findall('.//a')
links[15:20]
[<Element a at 0x137db5479f8>,
 <Element a at 0x137db547a48>,
 <Element a at 0x137db547a98>,
 <Element a at 0x137db547ae8>,
 <Element a at 0x137db547b38>]

To get the URL and link text, use each element's get method (for the URL) and text_content method (for the display text):

lnk=links[28]
lnk
<Element a at 0x137db547e08>
lnk.get('href')
'https://itunes.apple.com/us/podcast/yahoo-finance-presents/id1241417570?mt=2'
lnk.text_content()
"LISTENMajor League Soccer: The bear and bull caseYahoo Finance's Jen Rogers speaks to Dan Roberts and Kevin Chupka about the business of MLS"

The following list comprehension extracts all the URLs in the document:

urls=[lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]
['https://smallbusiness.yahoo.com',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US',
 'https://yahoo.uservoice.com/forums/382977',
 'http://info.yahoo.com/privacy/us/yahoo/',
 'http://info.yahoo.com/relevantads/',
 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
 'https://twitter.com/YahooFinance',
 'https://facebook.com/yahoofinance',
 'http://yahoofinance.tumblr.com']

Binary Data Formats

One of the easiest ways to store data in binary format is to use Python's built-in pickle serialization. Conveniently, pandas objects have a save method that writes the data to disk as a pickle:

frame=pd.read_csv('C:\\pytest\\ch06\\ex1.csv')
frame
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
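The save method mentioned above was removed in later pandas releases in favor of to_pickle and read_pickle; a minimal round-trip sketch using a temporary file:

```python
import os
import tempfile
import pandas as pd

# The equivalent of ex1.csv, built inline so the sketch is self-contained.
frame = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 10],
                      'message': ['hello', 'world', 'foo']})

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'frame_pickle')
    frame.to_pickle(path)        # replaces the old frame.save(path)
    loaded = pd.read_pickle(path)

print(loaded.equals(frame))      # True
```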

Using HDF5 Format

pandas has a minimal dict-like HDFStore class, which stores pandas objects using PyTables:

store=pd.HDFStore('mydata.h5')
store['obj1']=frame
store['obj1_col']=frame['a']
store
store['obj1']
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

Reading Microsoft Excel Files

Requires the xlrd and openpyxl packages to be installed.
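The original notes stop at listing the required packages; a minimal round-trip sketch (assuming openpyxl is available) writes a DataFrame to an in-memory Excel workbook and reads it back with read_excel:

```python
import io
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_excel/read_excel need an Excel engine such as openpyxl installed.
buf = io.BytesIO()
frame.to_excel(buf, sheet_name='Sheet1', index=False)
buf.seek(0)
loaded = pd.read_excel(buf, sheet_name='Sheet1')
print(loaded.equals(frame))  # True
```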

Using HTML and Web APIs

import requests
url = 'https://api.github.com/repos/pydata/pandas/milestones/28/labels'
resp = requests.get(url)
resp
data = resp.json()    # parse the JSON response into a list of label dicts
data[:5]
issue_labels = DataFrame(data)
issue_labels

Using Databases

Using an embedded SQLite database:

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]
cursor.description
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))
DataFrame(rows, columns=[x[0] for x in cursor.description])    # zip(*cursor.description)[0] raises TypeError in Python 3, since zip returns an iterator
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5

Storing and Loading Data in MongoDB
