输入、输出通常可以划分为几个大类:
- 读取文本文件
- 其他高效磁盘存储格式
- 加载数据库中的数据
- 利用Web API操作网络资源
读取文本格式数据
函数 | 说明 |
---|---|
read_csv | 从文件、url、文件型对象中加载带分隔符的数据,默认分隔符为逗号 |
read_table | 从文件、url、文件型对象中加载带分隔符的数据,默认分隔符为制表符(“\t”) |
read_fwf | 读取定宽列格式数据(也就是说没有分隔符) |
read_clipboard | 读取剪切板数据,网页转化为表格时非常有用 |
下面分析一个csv文件,ex1.csv。将其读入一个DataFrame:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv('ex1.csv')
df
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
我们也可以用read_table,只不过需要指定分隔符:
pd.read_table('ex1.csv', sep=',')
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
然而,并不是所有的文件都有标题行。比如下面这个:
!type ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
读入该文件的方法有两个。你可以让pandas为其分配默认的列名,也可以自己定义:
pd.read_csv('ex2.csv', header=None)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
假设你希望将message列做成DataFrame的索引。你可以通过index_col参数指定该列的位置(4),也可以直接指定列名“message”:
names=['a','b','c','d','message']
pd.read_csv('ex2.csv', names=names, index_col='message')
a | b | c | d | |
---|---|---|---|---|
message | ||||
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
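index_col也可以按位置指定,下面的写法与指定列名“message”等价(仍以ex2.csv和上面的names为例):
pd.read_csv('ex2.csv', names=names, index_col=4)  #与index_col='message'效果相同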
此外,还可以将多个列做成一个层次化索引,只需向index_col传入由列编号或列名组成的列表即可:
!type csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv('csv_mindex.csv',index_col=['key1','key2'])
parsed
value1 | value2 | ||
---|---|---|---|
key1 | key2 | ||
one | a | 1 | 2 |
b | 3 | 4 | |
c | 5 | 6 | |
d | 7 | 8 | |
two | a | 9 | 10 |
b | 11 | 12 | |
c | 13 | 14 | |
d | 15 | 16 |
有些表格可能不是用固定的分隔符分隔字段的。如果是这样,可以编写一个正则表达式来作为read_table的分隔符。下面这个文件各个字段由数量不定的空白符分隔,可以用正则表达式\s+表示:
list(open('ex3.txt'))
[' A B C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb 0.927272 0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382 1.100491\n']
结果如下:
result = pd.read_table('ex3.txt', sep='\s+')
这里由于列名比数据的列数少一个,read_table推断第一列应该是DataFrame的索引。
result
A | B | C | |
---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
下面这个例子用skiprows跳过了文件的第一行、第三行和第四行:
!type ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('ex4.csv', skiprows=[0,2,3])
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
缺失值处理
缺失值处理是文件解析任务中的重要组成部分。缺失数据经常要么没有(空字符串),要么用某个标记值表示。默认情况下,pandas会识别一组经常出现的标记值,如NA、-1.#IND以及NULL等:
!type ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv('ex5.csv')
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
pd.isnull(result)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | True | False | False |
2 | False | False | False | False | False | False |
na_values可以接受一组用于表示缺失值的字符串:
result = pd.read_csv('ex5.csv', na_values=['NULL'])
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
可以用一个字典为各列指定不同的NA标记值:
sentinels = {'message':['foo','NA'],'something':['two']}
pd.read_csv('ex5.csv', na_values=sentinels)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | NaN | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | NaN |
read_csv/read_table还有一些常用参数,列举如下:
参数 | 说明 |
---|---|
parse_dates | 尝试将数据解析为日期,默认为False。如果为True,则尝试解析所有列。此外,还可以指定需要解析的一组列号或列名。如果列表的元素为列表或元组,就会将多个列组合到一起再进行日期解析(例如,日期/时间分别位于两个列中) |
keep_date_col | 如果连接多列解析日期,则保留参与连接的列。默认为False |
converters | 由列号/列名跟函数之间的映射关系组成的字典。例如,{'foo': f}会对foo列的所有值应用函数f |
dayfirst | 当解析有歧义的日期时,将其看做国际格式(例如,7/6/2012解析为2012年6月7日)。默认为False |
date_parser | 用于解析日期的函数 |
nrows | 需要读取的行数(从文件开始处算起) |
iterator | 返回一个TextParser以便逐块读取文件 |
chunksize | 文件块的大小(用于迭代) |
skip_footer | 需要忽略的行数(从文件末尾处算起) |
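下面是把其中几个参数组合使用的一个示意写法。注意ex_dates.csv及其date、time、value三列只是为说明用法假设的示例文件,并非前文用过的数据:
#示意:假设ex_dates.csv含有date、time、value三列(该文件为假设示例)
parsed = pd.read_csv('ex_dates.csv',
                     parse_dates=[['date', 'time']],  #将date、time两列组合后解析为日期
                     keep_date_col=False,             #不保留参与组合的原始列
                     converters={'value': float},     #对value列的所有值应用float函数
                     nrows=100)                       #只读取前100行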
逐块读取文本文件
在处理很大的文件时,你可能只想读取文件的一小部分,或逐块对文件进行迭代。
result = pd.read_csv('ex6.csv')
result
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
5 | 1.817480 | 0.742273 | 0.419395 | -2.251035 | Q |
6 | -0.776764 | 0.935518 | -0.332872 | -1.875641 | U |
7 | -0.913135 | 1.530624 | -0.572657 | 0.477252 | K |
8 | 0.358480 | -0.497572 | -0.367016 | 0.507702 | S |
9 | -1.740877 | -1.160417 | -1.637830 | 2.172201 | G |
10 | 0.240564 | -0.328249 | 1.252155 | 1.072796 | 8 |
11 | 0.764018 | 1.165476 | -0.639544 | 1.495258 | R |
12 | 0.571035 | -0.310537 | 0.582437 | -0.298765 | 1 |
13 | 2.317658 | 0.430710 | -1.334216 | 0.199679 | P |
14 | 1.547771 | -1.119753 | -2.277634 | 0.329586 | J |
15 | -1.310608 | 0.401719 | -1.000987 | 1.156708 | E |
16 | -0.088496 | 0.634712 | 0.153324 | 0.415335 | B |
17 | -0.018663 | -0.247487 | -1.446522 | 0.750938 | A |
18 | -0.070127 | -1.579097 | 0.120892 | 0.671432 | F |
19 | -0.194678 | -0.492039 | 2.359605 | 0.319810 | H |
20 | -0.248618 | 0.868707 | -0.492226 | -0.717959 | W |
21 | -1.091549 | -0.867110 | -0.647760 | -0.832562 | C |
22 | 0.641404 | -0.138822 | -0.621963 | -0.284839 | C |
23 | 1.216408 | 0.992687 | 0.165162 | -0.069619 | V |
24 | -0.564474 | 0.792832 | 0.747053 | 0.571675 | I |
25 | 1.759879 | -0.515666 | -0.230481 | 1.362317 | S |
26 | 0.126266 | 0.309281 | 0.382820 | -0.239199 | L |
27 | 1.334360 | -0.100152 | -0.840731 | -0.643967 | 6 |
28 | -0.737620 | 0.278087 | -0.053235 | -0.950972 | J |
29 | -1.148486 | -0.986292 | -0.144963 | 0.124362 | Y |
... | ... | ... | ... | ... | ... |
9970 | 0.633495 | -0.186524 | 0.927627 | 0.143164 | 4 |
9971 | 0.308636 | -0.112857 | 0.762842 | -1.072977 | 1 |
9972 | -1.627051 | -0.978151 | 0.154745 | -1.229037 | Z |
9973 | 0.314847 | 0.097989 | 0.199608 | 0.955193 | P |
9974 | 1.666907 | 0.992005 | 0.496128 | -0.686391 | S |
9975 | 0.010603 | 0.708540 | -1.258711 | 0.226541 | K |
9976 | 0.118693 | -0.714455 | -0.501342 | -0.254764 | K |
9977 | 0.302616 | -2.011527 | -0.628085 | 0.768827 | H |
9978 | -0.098572 | 1.769086 | -0.215027 | -0.053076 | A |
9979 | -0.019058 | 1.964994 | 0.738538 | -0.883776 | F |
9980 | -0.595349 | 0.001781 | -1.423355 | -1.458477 | M |
9981 | 1.392170 | -1.396560 | -1.425306 | -0.847535 | H |
9982 | -0.896029 | -0.152287 | 1.924483 | 0.365184 | 6 |
9983 | -2.274642 | -0.901874 | 1.500352 | 0.996541 | N |
9984 | -0.301898 | 1.019906 | 1.102160 | 2.624526 | I |
9985 | -2.548389 | -0.585374 | 1.496201 | -0.718815 | D |
9986 | -0.064588 | 0.759292 | -1.568415 | -0.420933 | E |
9987 | -0.143365 | -1.111760 | -1.815581 | 0.435274 | 2 |
9988 | -0.070412 | -1.055921 | 0.338017 | -0.440763 | X |
9989 | 0.649148 | 0.994273 | -1.384227 | 0.485120 | Q |
9990 | -0.370769 | 0.404356 | -1.051628 | -1.050899 | 8 |
9991 | -0.409980 | 0.155627 | -0.818990 | 1.277350 | W |
9992 | 0.301214 | -1.111203 | 0.668258 | 0.671922 | A |
9993 | 1.821117 | 0.416445 | 0.173874 | 0.505118 | X |
9994 | 0.068804 | 1.322759 | 0.802346 | 0.223618 | H |
9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |
10000 rows × 5 columns
如果只想读取几行,可以通过nrows进行指定:
pd.read_csv('ex6.csv', nrows=5)
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
要逐块读取文件,需要设置chunksize:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x1e6ff1e1e48>
read_csv所返回的这个TextParser对象使你可以根据chunksize对文件进行逐块迭代。比如,我们可以对ex6.csv进行迭代,将值的计数聚合到“key”列中:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)  #原文为tot.order(ascending=False),新版pandas已没有order方法
tot[:10]
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
M 338.0
J 337.0
F 335.0
K 334.0
H 330.0
dtype: float64
TextParser还有一个get_chunk方法,它可以使你读取任意大小的块。
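下面是get_chunk的一个简单示意,参数为想读取的行数:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)   #读取接下来的10行,返回一个DataFrame
len(piece)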
将数据写出到文本格式
数据也可以被输出为分隔符格式的文本。我们再来看看之前读过的一个csv文件:
data = pd.read_csv('ex5.csv')
data.to_csv('out.csv')
!type out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
#还可以使用其他分隔符
data.to_csv('out.csv',sep='|')
!type out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
#直接输出结果到屏幕
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
#缺失值在输出结果中会被表示为空字符串。你可能希望将其标记为别的标记值:
data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
#如果没有设置其他选项,则会写出行和列的标签。当然,它们也可以被禁用。
data.to_csv(sys.stdout, index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
#此外,你还可以只写出一部分的列,并以你指定的顺序排列:
data.to_csv(sys.stdout, index=False, columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
#Series也有一个to_csv方法:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7),index=dates)
ts.to_csv('tseries.csv')
!type tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
虽然只需一点整理工作(无header行,第一列做索引)就能用read_csv将csv文件读取为Series,但还有一个更为方便的from_csv方法:
Series.from_csv('tseries.csv',parse_dates=True)
D:\Anaconda3\lib\site-packages\pandas\core\series.py:2890: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
infer_datetime_format=infer_datetime_format)
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 5
2000-01-07 6
dtype: int64
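正如上面的FutureWarning所提示的,新版pandas中也可以直接用read_csv加上少量参数得到同样的Series,下面是一个示意写法:
ser = pd.read_csv('tseries.csv', header=None, index_col=0, parse_dates=True).squeeze()  #squeeze()把单列DataFrame压缩成Series
ser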
手工处理分隔符
大部分存储在磁盘上的表格型数据都能用pandas.read_table进行加载。然而,有时还是需要做一些手工处理。由于接收到含有畸形行的文件而使read_table出毛病的情况并不少见。为了说明这些基本工具,看看下面这个简单的csv文件:
!type ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"
对于任何单字符分隔符文件,可以直接使用Python内置的csv模块。将任意已打开的文件或文件型的对象传给csv.reader:
import csv
f = open('ex7.csv')
reader = csv.reader(f)
对这个reader进行迭代将会为每行产生一个列表(并移除了所有的引号):
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
现在,为了使数据格式合乎要求,你需要对其做一些整理工作:
lines = list(csv.reader(open('ex7.csv')))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
csv文件的形式有很多。只需定义csv.Dialect的一个子类即可定义出新的格式(如专门的分隔符、字符串引用约定、行结束符等):
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL   #即0,只引用含特殊字符的字段
reader = csv.reader(f, dialect=my_dialect)
各个CSV语支的参数也可以以关键字的形式提供给csv.reader,而无需定义子类:
reader = csv.reader(f,delimiter='|')
可用的选项(csv.Dialect)及功能如表6-3所示。
表6-3 CSV语支选择
参数 | 说明 |
---|---|
delimiter | 用于分隔字段的单字符字符串。默认为“,” |
lineterminator | 用于写操作的行结束符,默认为“\r\n”。读操作将忽略此选项,它能认出跨平台的行结束符 |
quotechar | 用于带有特殊字符(如分隔符)的字段的引用符号。默认为“"” |
quoting | 引用约定。可选值包括csv.QUOTE_ALL(引用所有字段)、csv.QUOTE_MINIMAL(只引用带有诸如分隔符之类特殊字符的字段)、csv.QUOTE_NONNUMERIC(只引用非数值字段)以及csv.QUOTE_NONE(不引用) |
skipinitialspace | 忽略分隔符后面的空白符。默认为False |
doublequote | 如何处理字段内的引用符号。如果为True,则双写 |
escapechar | 用于对分隔符进行转义的字符串(如果quoting被设置为csv.QUOTE_NONE的话)。默认禁用 |
要手工输出分隔符文件,你可以使用csv.writer。它接受一个已打开且可写的文件对象以及和csv.reader相同的那些语支和格式化选项:
with open('mydata.csv', 'w', newline='') as f:   #newline=''避免csv模块在Windows下写出多余的空行
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
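写完之后,可以用同一个语支把文件读回来,检查写出的内容是否符合预期:
with open('mydata.csv') as f:
    for row in csv.reader(f, dialect=my_dialect):
        print(row)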
JSON数据
JSON(JavaScript Object Notation)已经成为通过HTTP请求在Web浏览器和其他应用程序之间发送数据的标准格式之一。它是一种比表格型文本格式(如CSV)灵活得多的数据格式。下面是一个例子:
obj = """
{"name":"Wes",
"places_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
{"name":"Katie","age":33,"pet":"Cosco"}]
}
"""
除了空值null和一些其他的细微差别(如列表末尾不允许存在多余的逗号)之外,JSON非常接近于有效的Python代码。基本类型有对象(字典)、数组(列表)、字符串、数值、布尔值以及null。对象中所有的键都必须是字符串。许多Python库都可以读写JSON数据,这里使用json,因为它是构建于Python标准库中的。通过json.loads即可将JSON字符串转换成Python形式:
import json
result = json.loads(obj)
result
{'name': 'Wes',
'pet': None,
'places_lived': ['United States', 'Spain', 'Germany'],
'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
{'age': 33, 'name': 'Katie', 'pet': 'Cosco'}]}
#相反,json.dumps则将Python对象转换成JSON格式:
asjson = json.dumps(result)
如何将(一个或一组)JSON对象转换为DataFrame或其他便于分析的数据结构,就由你自己决定了。最简单方便的方式是:向DataFrame构造器传入一组JSON对象,并选取数据字段的子集。
siblings = DataFrame(result['siblings'],columns=['name','age'])
siblings
name | age | |
---|---|---|
0 | Scott | 25 |
1 | Katie | 33 |
XML和HTML:web信息收集
Python有许多可以读写HTML和XML格式数据的库,lxml就是其中之一,它能够高效且可靠地解析大文件。下面以Yahoo财经的AAPL期权页面(https://finance.yahoo.com/quote/AAPL/options?ltr=1)为例:
from lxml.html import parse
from urllib.request import urlopen
parsed = parse(urlopen('https://finance.yahoo.com/quote/AAPL/options?ltr=1&guccounter=1.html'))
doc = parsed.getroot()
通过上面这个对象doc,我们可以获取特定类型的所有HTML标签(tag),比如含有所需数据的table标签。假设我们想得到该文档中所有的URL链接。HTML中的链接是a标签,使用文档根节点的findall方法以及一个XPath(对文档的“查询”的一种表示手段)即可找到它们:
links = doc.findall('.//a')
print('Find out links :',len(links))
links[15:20]
Find out links : 127
[<Element a at 0x1e680279548>,
<Element a at 0x1e680279598>,
<Element a at 0x1e6802795e8>,
<Element a at 0x1e680279638>,
<Element a at 0x1e680279688>]
但这些link是表示HTML元素的对象。要得到URL和链接文本,你必须使用各个对象的get方法(针对URL)和text_content方法(针对链接文本):
lnk = links[28]
lnk.get('href')
'/quote/AAPL/options?strike=187.5&straddle=false'
lnk.text_content()
'187.50'
因此,编写下面这条列表推导式(list comprehension)即可获取文档中的全部URL:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[:]
['https://finance.yahoo.com/',
'https://yahoo.uservoice.com/forums/439018',
'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance',
'/quote/AAPL?p=AAPL',
'/quote/AAPL/key-statistics?p=AAPL',
'/quote/AAPL/history?p=AAPL',
'/quote/AAPL/profile?p=AAPL',
'/quote/AAPL/financials?p=AAPL',
'/quote/AAPL/analysis?p=AAPL',
'/quote/AAPL/options?p=AAPL',
'/quote/AAPL/holders?p=AAPL',
'/quote/AAPL/sustainability?p=AAPL',
'/quote/AAPL/options?ltr=1&guccounter=1.html&straddle=true',
'/quote/AAPL190524C00167500?p=AAPL190524C00167500',
'/quote/AAPL/options?strike=167.5&straddle=false',
'/quote/AAPL190524C00172500?p=AAPL190524C00172500',
'/quote/AAPL/options?strike=172.5&straddle=false',
'/quote/AAPL190524C00175000?p=AAPL190524C00175000',
'/quote/AAPL/options?strike=175&straddle=false',
'/quote/AAPL190524C00177500?p=AAPL190524C00177500',
'/quote/AAPL/options?strike=177.5&straddle=false',
'/quote/AAPL190524C00180000?p=AAPL190524C00180000',
'/quote/AAPL/options?strike=180&straddle=false',
'/quote/AAPL190524C00182500?p=AAPL190524C00182500',
'/quote/AAPL/options?strike=182.5&straddle=false',
'/quote/AAPL190524C00185000?p=AAPL190524C00185000',
'/quote/AAPL/options?strike=185&straddle=false',
'/quote/AAPL190524C00187500?p=AAPL190524C00187500',
'/quote/AAPL/options?strike=187.5&straddle=false',
'/quote/AAPL190524C00190000?p=AAPL190524C00190000',
'/quote/AAPL/options?strike=190&straddle=false',
'/quote/AAPL190524C00192500?p=AAPL190524C00192500',
'/quote/AAPL/options?strike=192.5&straddle=false',
'/quote/AAPL190524C00195000?p=AAPL190524C00195000',
'/quote/AAPL/options?strike=195&straddle=false',
'/quote/AAPL190524C00197500?p=AAPL190524C00197500',
'/quote/AAPL/options?strike=197.5&straddle=false',
'/quote/AAPL190524C00200000?p=AAPL190524C00200000',
'/quote/AAPL/options?strike=200&straddle=false',
'/quote/AAPL190524C00202500?p=AAPL190524C00202500',
'/quote/AAPL/options?strike=202.5&straddle=false',
'/quote/AAPL190524C00205000?p=AAPL190524C00205000',
'/quote/AAPL/options?strike=205&straddle=false',
'/quote/AAPL190524C00207500?p=AAPL190524C00207500',
'/quote/AAPL/options?strike=207.5&straddle=false',
'/quote/AAPL190524C00210000?p=AAPL190524C00210000',
'/quote/AAPL/options?strike=210&straddle=false',
'/quote/AAPL190524C00212500?p=AAPL190524C00212500',
'/quote/AAPL/options?strike=212.5&straddle=false',
'/quote/AAPL190524C00215000?p=AAPL190524C00215000',
'/quote/AAPL/options?strike=215&straddle=false',
'/quote/AAPL190524C00217500?p=AAPL190524C00217500',
'/quote/AAPL/options?strike=217.5&straddle=false',
'/quote/AAPL190524C00220000?p=AAPL190524C00220000',
'/quote/AAPL/options?strike=220&straddle=false',
'/quote/AAPL190524C00222500?p=AAPL190524C00222500',
'/quote/AAPL/options?strike=222.5&straddle=false',
'/quote/AAPL190524C00225000?p=AAPL190524C00225000',
'/quote/AAPL/options?strike=225&straddle=false',
'/quote/AAPL190524C00230000?p=AAPL190524C00230000',
'/quote/AAPL/options?strike=230&straddle=false',
'/quote/AAPL190524C00232500?p=AAPL190524C00232500',
'/quote/AAPL/options?strike=232.5&straddle=false',
'/quote/AAPL190524C00235000?p=AAPL190524C00235000',
'/quote/AAPL/options?strike=235&straddle=false',
'/quote/AAPL190524C00237500?p=AAPL190524C00237500',
'/quote/AAPL/options?strike=237.5&straddle=false',
'/quote/AAPL190524C00242500?p=AAPL190524C00242500',
'/quote/AAPL/options?strike=242.5&straddle=false',
'/quote/AAPL190524P00150000?p=AAPL190524P00150000',
'/quote/AAPL/options?strike=150&straddle=false',
'/quote/AAPL190524P00160000?p=AAPL190524P00160000',
'/quote/AAPL/options?strike=160&straddle=false',
'/quote/AAPL190524P00165000?p=AAPL190524P00165000',
'/quote/AAPL/options?strike=165&straddle=false',
'/quote/AAPL190524P00172500?p=AAPL190524P00172500',
'/quote/AAPL/options?strike=172.5&straddle=false',
'/quote/AAPL190524P00175000?p=AAPL190524P00175000',
'/quote/AAPL/options?strike=175&straddle=false',
'/quote/AAPL190524P00177500?p=AAPL190524P00177500',
'/quote/AAPL/options?strike=177.5&straddle=false',
'/quote/AAPL190524P00180000?p=AAPL190524P00180000',
'/quote/AAPL/options?strike=180&straddle=false',
'/quote/AAPL190524P00182500?p=AAPL190524P00182500',
'/quote/AAPL/options?strike=182.5&straddle=false',
'/quote/AAPL190524P00185000?p=AAPL190524P00185000',
'/quote/AAPL/options?strike=185&straddle=false',
'/quote/AAPL190524P00187500?p=AAPL190524P00187500',
'/quote/AAPL/options?strike=187.5&straddle=false',
'/quote/AAPL190524P00190000?p=AAPL190524P00190000',
'/quote/AAPL/options?strike=190&straddle=false',
'/quote/AAPL190524P00192500?p=AAPL190524P00192500',
'/quote/AAPL/options?strike=192.5&straddle=false',
'/quote/AAPL190524P00195000?p=AAPL190524P00195000',
'/quote/AAPL/options?strike=195&straddle=false',
'/quote/AAPL190524P00197500?p=AAPL190524P00197500',
'/quote/AAPL/options?strike=197.5&straddle=false',
'/quote/AAPL190524P00200000?p=AAPL190524P00200000',
'/quote/AAPL/options?strike=200&straddle=false',
'/quote/AAPL190524P00202500?p=AAPL190524P00202500',
'/quote/AAPL/options?strike=202.5&straddle=false',
'/quote/AAPL190524P00227500?p=AAPL190524P00227500',
'/quote/AAPL/options?strike=227.5&straddle=false',
'/quote/AAPL190524P00230000?p=AAPL190524P00230000',
'/quote/AAPL/options?strike=230&straddle=false',
'/quote/AAPL190524P00250000?p=AAPL190524P00250000',
'/quote/AAPL/options?strike=250&straddle=false',
'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310',
'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US',
'https://yahoo.uservoice.com/forums/382977',
'http://info.yahoo.com/privacy/us/yahoo/',
'http://info.yahoo.com/relevantads/',
'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
'https://finance.yahoo.com/sitemap/',
'http://twitter.com/YahooFinance',
'http://facebook.com/yahoofinance',
'http://yahoofinance.tumblr.com',
'/',
'/watchlists',
'/portfolios',
'/screener',
'/calendar',
'/industries',
'/videos/',
'/news/',
'/personal-finance',
'/tech']
现在,从文档中找到正确表格的办法就是反复试验了。有些网站会给目标表格加上一个id属性。我确定有两个分别放置看涨数据和看跌数据的表格:
tables = doc.findall('.//table')
tables #只有两个对象
[<Element table at 0x1e68027ae58>, <Element table at 0x1e68027a9a8>]
calls = tables[0] #AAPL calls数据
puts = tables[1] #AAPL puts数据
rows = calls.findall('.//tr') #每个表格都有一个标题行,然后才是数据行。
len(rows)
29
对于标题行和数据行,我们希望获取每个单元格内的文本。对于标题行,就是th单元格,而对于数据行,则是td单元格:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
这样,我们就得到了:
_unpack(rows[0],kind='th')
['Contract Name',
'Last Trade Date',
'Strike',
'Last Price',
'Bid',
'Ask',
'Change',
'% Change',
'Volume',
'Open Interest',
'Implied Volatility']
_unpack(rows[1],kind='td')
['AAPL190524C00167500',
'2019-05-14 3:49PM EDT',
'167.50',
'21.90',
'20.65',
'21.90',
'0.00',
'-',
'5',
'5',
'57.52%']
现在,把所有步骤结合起来,将数据转换为一个DataFrame。由于数值型数据仍然是字符串格式,所以我们希望将部分列(可能不是全部)转换为浮点格式。虽然你可以手动实现该功能,但是pandas恰好就有一个TextParser类可用于自动类型转换:
from pandas.io.parsers import TextParser
def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()
最后,我对那两个lxml表格对象调用该解析函数并得到最终的DataFrame:
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]
Contract Name | Last Trade Date | Strike | Last Price | Bid | Ask | Change | % Change | Volume | Open Interest | Implied Volatility | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAPL190524C00167500 | 2019-05-14 3:49PM EDT | 167.5 | 21.90 | 20.65 | 21.90 | 0.00 | - | 5 | 5 | 57.52% |
1 | AAPL190524C00172500 | 2019-05-17 3:50PM EDT | 172.5 | 16.60 | 16.60 | 17.00 | -2.35 | -12.40% | 204 | 142 | 48.88% |
2 | AAPL190524C00175000 | 2019-05-17 3:44PM EDT | 175.0 | 14.55 | 14.20 | 14.60 | -1.30 | -8.20% | 138 | 99 | 45.17% |
3 | AAPL190524C00177500 | 2019-05-17 3:50PM EDT | 177.5 | 11.90 | 11.85 | 12.50 | -1.80 | -13.14% | 120 | 234 | 46.00% |
4 | AAPL190524C00180000 | 2019-05-17 3:51PM EDT | 180.0 | 9.70 | 9.65 | 10.00 | -0.60 | -5.83% | 1,550 | 411 | 39.09% |
5 | AAPL190524C00182500 | 2019-05-17 3:39PM EDT | 182.5 | 7.68 | 7.50 | 7.85 | -0.47 | -5.77% | 272 | 406 | 36.40% |
6 | AAPL190524C00185000 | 2019-05-17 3:59PM EDT | 185.0 | 5.90 | 5.85 | 6.00 | -0.50 | -7.81% | 3,184 | 1,001 | 35.40% |
7 | AAPL190524C00187500 | 2019-05-17 3:59PM EDT | 187.5 | 4.05 | 4.00 | 4.10 | -0.59 | -12.72% | 6,159 | 1,645 | 31.69% |
8 | AAPL190524C00190000 | 2019-05-17 3:59PM EDT | 190.0 | 2.69 | 2.59 | 2.67 | -0.36 | -11.80% | 19,710 | 4,230 | 30.03% |
9 | AAPL190524C00192500 | 2019-05-17 3:59PM EDT | 192.5 | 1.60 | 1.56 | 1.61 | -0.34 | -17.53% | 11,506 | 4,427 | 28.91% |
XML(Extensible Markup Language)是另一种常见的支持分层、嵌套数据以及元数据的结构化数据格式。本书所使用的这些文件实际上来自于一个很大的XML文档。
纽约大都会运输署(Metropolitan Transportation Authority,MTA)发布了一些有关其公交和列车服务的数据资料(http://www.mta.info/developers/download.html )。这里,我们将看看包含在一组XML文件中的运行情况数据。每项列车或公交服务都各有自己的文件(如Metro-North Railroad的文件是Performance_MNR.xml),其中每条XML记录就是一条月度数据。需要注意,前文所述的网址和文件都已经有了些许变化,下面展示的实际是“./Performance_MNRR.xml”文件的内容,后续学习也将基于该文件。
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>28345</INDICATOR_SEQ>
    <PARENT_SEQ>55526</PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>Hudson Line - OTP</INDICATOR_NAME>
    <DESCRIPTION>Percent of commuter trains that arrive at their destinations within 5 minutes and 59 seconds of the scheduled time.</DESCRIPTION>
    <CATEGORY>Service Indicators</CATEGORY>
    <FREQUENCY>M</FREQUENCY>
    <DESIRED_CHANGE>U</DESIRED_CHANGE>
    <INDICATOR_UNIT>%</INDICATOR_UNIT>
    <DECIMAL_PLACES>1</DECIMAL_PLACES>
    <YEAR>
      <PERIOD_YEAR>2008</PERIOD_YEAR>
      <MONTH>
        <PERIOD_MONTH>1</PERIOD_MONTH>
        <MONTHLYVALUES>
          <YTD_TARGET>98.00</YTD_TARGET>
利用lxml.objectify解析文件,通过getroot得到XML文件的根节点的引用:
from lxml import objectify
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
path="Performance_MNRR.xml"
parsed = objectify.parse(open(path))
root = parsed.getroot()
root.INDICATOR返回一个用于产生各个XML元素的生成器。对于每条记录,我们可以用标记名(如YTD_ACTUAL)和数据值填充一个字典(排除几个标记):
data = []
skip_fields = ['PARENT_SEQ','INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.text
    data.append(el_data)
最后,将这组字典转换为一个DataFrame:
perf = DataFrame(data)
perf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 7 columns):
AGENCY_NAME 13 non-null object
CATEGORY 13 non-null object
DESCRIPTION 13 non-null object
FREQUENCY 13 non-null object
INDICATOR_NAME 13 non-null object
INDICATOR_UNIT 13 non-null object
YEAR 0 non-null object
dtypes: object(7)
memory usage: 808.0+ bytes
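可以看到YEAR列全为空,这是因为YEAR元素本身还有嵌套的子节点(参见前面的XML片段),child.text取不到值。如果还需要嵌套在YEAR/MONTH下的月度数值,可以按照该片段所示的层次再往下遍历。下面是一个基于该片段结构的示意写法(假设每个INDICATOR下都含有YEAR和MONTH节点,MONTHLYVALUES中仅以YTD_TARGET为例):
monthly = []
for elt in root.INDICATOR:
    for year in elt.YEAR:                 #遍历该指标下的所有YEAR节点
        for month in year.MONTH:          #遍历该年份下的所有MONTH节点
            monthly.append({
                'INDICATOR_NAME': elt.INDICATOR_NAME.text,
                'PERIOD_YEAR': year.PERIOD_YEAR.text,
                'PERIOD_MONTH': month.PERIOD_MONTH.text,
                'YTD_TARGET': month.MONTHLYVALUES.YTD_TARGET.text,
            })
monthly_df = DataFrame(monthly)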
XML数据可以比本例复杂得多。每个标记都可以有元数据。看看下面这个HTML的链接标记(它也算是一段有效的XML):
from io import StringIO
tag = '<a href="http://www.baidu.com">BaiDu</a>'
root = objectify.parse(StringIO(tag)).getroot()
现在可以访问链接文本或标记中的任何字段了(如href):
root
<Element a at 0x137f2861e88>
root.get('href')
'http://www.baidu.com'
root.text
'BaiDu'