Python for Data Analysis: Study Notes (Part 2)

Input and output typically fall into a few broad categories:

  • Reading text files
  • Other efficient on-disk storage formats
  • Loading data from databases
  • Interacting with network resources via web APIs

Reading Text Format Data

Function         Description
read_csv         Load delimited data from a file, URL, or file-like object; the default delimiter is a comma
read_table       Load delimited data from a file, URL, or file-like object; the default delimiter is the tab character "\t"
read_fwf         Read data in fixed-width column format (that is, with no delimiters)
read_clipboard   Read data from the clipboard; useful for turning web pages into tables
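
As a quick illustration of the two less common readers, here is a minimal sketch assuming a hypothetical fixed-width file fwf_example.txt (the file name, column positions, and column names are made up for the example):

import pandas as pd

# colspecs gives the (start, end) character positions of each field;
# widths=[4, 9, 9] would express the same layout
fwf = pd.read_fwf('fwf_example.txt', colspecs=[(0, 4), (4, 13), (13, 22)],
                  names=['id', 'value1', 'value2'])

# read_clipboard parses whatever table is currently on the clipboard,
# e.g. after copying an HTML table from a browser:
# clip = pd.read_clipboard()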

Below we work with a small CSV file, ex1.csv, reading it into a DataFrame:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame


df = pd.read_csv('ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

We could also have used read_table, specifying the delimiter:

pd.read_table('ex1.csv', sep=',')
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

A file will not always have a header row, however. Consider this one:

!type ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

There are two ways to read in such a file. You can let pandas assign default column names, or you can specify the names yourself:

pd.read_csv('ex2.csv', header=None)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4, or pass the name 'message' using the index_col argument:

names=['a','b','c','d','message']
pd.read_csv('ex2.csv', names=names, index_col='message')
         a  b   c   d
message
hello    1  2   3   4
world    5  6   7   8
foo      9  10  11  12
!type csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv('csv_mindex.csv',index_col=['key1','key2'])
parsed
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16

In some cases, a table might not use a fixed delimiter to separate its fields. When that happens, you can pass a regular expression as the delimiter for read_table. In the following file, the fields are separated by a variable amount of whitespace, which can be expressed with the regular expression \s+.

list(open('ex3.txt'))
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The result is as follows:

result = pd.read_table('ex3.txt', sep='\s+')

Because there is one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index.

result
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

In the next example, we skip the first, third, and fourth rows of a file:

!type ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('ex4.csv', skiprows=[0,2,3])
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

Handling Missing Values

Handling missing values is an important part of the file parsing process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:

!type ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv('ex5.csv')
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
pd.isnull(result)
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False

The na_values option can take a list or set of strings to treat as missing values:

result = pd.read_csv('ex5.csv', na_values=['NULL'])
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

Different NA sentinels can be specified for each column using a dict:

sentinels = {'message':['foo','NA'],'something':['two']}
pd.read_csv('ex5.csv', na_values=sentinels)
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
Some additional read_csv / read_table arguments related to dates and iteration:

Argument       Description
parse_dates    Attempt to parse data to datetime; False by default. If True, attempt to parse all columns. Otherwise, specify a list of column numbers or names to parse. If an element of the list is itself a list or tuple, the multiple columns are combined before the date is parsed (useful when, for example, the date and time are split across two columns)
keep_date_col  If columns are joined to parse a date, keep the joined columns; False by default
converters     Dict mapping a column number or name to a function. For example, {'foo': f} applies the function f to every value in the 'foo' column
dayfirst       When parsing potentially ambiguous dates, treat them as being in international format (e.g., 7/6/2012 -> June 7, 2012); False by default
date_parser    Function to use to parse dates
nrows          Number of rows to read from the beginning of the file
iterator       Return a TextParser object for reading the file piecemeal
chunksize      For iteration, the size of the file chunks
skip_footer    Number of lines to ignore at the end of the file
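
As a small illustration of a few of these arguments, here is a sketch assuming a hypothetical file ex_dates.csv in which the date and time live in separate columns; parse_dates combines the two columns into one datetime column, and converters applies a function to another column:

# Hypothetical ex_dates.csv:
# date,time,value
# 2012-06-07,09:30:00,1.5
# 2012-06-08,09:30:00,-0.2
parsed = pd.read_csv('ex_dates.csv',
                     parse_dates=[['date', 'time']],                  # combine the two columns into one datetime column
                     converters={'value': lambda x: float(x) * 100})  # scale the 'value' column while parsing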

Reading Text Files in Pieces

When processing very large files, you may only want to read in a small piece of the file or iterate through it in chunks.

result = pd.read_csv('ex6.csv')
result
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
5     1.817480  0.742273  0.419395 -2.251035   Q
6    -0.776764  0.935518 -0.332872 -1.875641   U
7    -0.913135  1.530624 -0.572657  0.477252   K
8     0.358480 -0.497572 -0.367016  0.507702   S
9    -1.740877 -1.160417 -1.637830  2.172201   G
10    0.240564 -0.328249  1.252155  1.072796   8
11    0.764018  1.165476 -0.639544  1.495258   R
12    0.571035 -0.310537  0.582437 -0.298765   1
13    2.317658  0.430710 -1.334216  0.199679   P
14    1.547771 -1.119753 -2.277634  0.329586   J
15   -1.310608  0.401719 -1.000987  1.156708   E
16   -0.088496  0.634712  0.153324  0.415335   B
17   -0.018663 -0.247487 -1.446522  0.750938   A
18   -0.070127 -1.579097  0.120892  0.671432   F
19   -0.194678 -0.492039  2.359605  0.319810   H
20   -0.248618  0.868707 -0.492226 -0.717959   W
21   -1.091549 -0.867110 -0.647760 -0.832562   C
22    0.641404 -0.138822 -0.621963 -0.284839   C
23    1.216408  0.992687  0.165162 -0.069619   V
24   -0.564474  0.792832  0.747053  0.571675   I
25    1.759879 -0.515666 -0.230481  1.362317   S
26    0.126266  0.309281  0.382820 -0.239199   L
27    1.334360 -0.100152 -0.840731 -0.643967   6
28   -0.737620  0.278087 -0.053235 -0.950972   J
29   -1.148486 -0.986292 -0.144963  0.124362   Y
...        ...       ...       ...       ...  ..
9970  0.633495 -0.186524  0.927627  0.143164   4
9971  0.308636 -0.112857  0.762842 -1.072977   1
9972 -1.627051 -0.978151  0.154745 -1.229037   Z
9973  0.314847  0.097989  0.199608  0.955193   P
9974  1.666907  0.992005  0.496128 -0.686391   S
9975  0.010603  0.708540 -1.258711  0.226541   K
9976  0.118693 -0.714455 -0.501342 -0.254764   K
9977  0.302616 -2.011527 -0.628085  0.768827   H
9978 -0.098572  1.769086 -0.215027 -0.053076   A
9979 -0.019058  1.964994  0.738538 -0.883776   F
9980 -0.595349  0.001781 -1.423355 -1.458477   M
9981  1.392170 -1.396560 -1.425306 -0.847535   H
9982 -0.896029 -0.152287  1.924483  0.365184   6
9983 -2.274642 -0.901874  1.500352  0.996541   N
9984 -0.301898  1.019906  1.102160  2.624526   I
9985 -2.548389 -0.585374  1.496201 -0.718815   D
9986 -0.064588  0.759292 -1.568415 -0.420933   E
9987 -0.143365 -1.111760 -1.815581  0.435274   2
9988 -0.070412 -1.055921  0.338017 -0.440763   X
9989  0.649148  0.994273 -1.384227  0.485120   Q
9990 -0.370769  0.404356 -1.051628 -1.050899   8
9991 -0.409980  0.155627 -0.818990  1.277350   W
9992  0.301214 -1.111203  0.668258  0.671922   A
9993  1.821117  0.416445  0.173874  0.505118   X
9994  0.068804  1.322759  0.802346  0.223618   H
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0

10000 rows × 5 columns

If you want to read only a small number of rows, you can specify that with nrows:

pd.read_csv('ex6.csv', nrows=5)
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q

To read a file in pieces, specify a chunksize as a number of rows:

chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x1e6ff1e1e48>

The TextParser object returned by read_csv allows you to iterate over the file in chunks of the given chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column, like so:

chunker = pd.read_csv('ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(),fill_value=0)
tot = tot.sort_values(ascending=False)  # the book uses tot.order(ascending=False); order() no longer exists in newer pandas
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

TextParser also has a get_chunk method that lets you read pieces of an arbitrary size.
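
For example, a minimal sketch of get_chunk, reusing ex6.csv from above (each call continues from where the previous one stopped):

chunker = pd.read_csv('ex6.csv', iterator=True)
piece = chunker.get_chunk(10)    # first 10 rows
piece2 = chunker.get_chunk(5)    # the next 5 rows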

Writing Data Out to Text Format

Data can also be exported to a delimited format. Let's consider one of the CSV files read above:

data = pd.read_csv('ex5.csv')
data.to_csv('out.csv')
!type out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
# other delimiters can be used as well
data.to_csv('out.csv',sep='|')
!type out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
# write directly to the screen (sys.stdout)
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
# missing values appear as empty strings in the output; you might want to denote them by some other sentinel value:
data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
# with no other options specified, both the row and column labels are written; either or both can be disabled:
data.to_csv(sys.stdout, index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
# you can also write only a subset of the columns, and in an order of your choosing:
data.to_csv(sys.stdout, index=False, columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
# Series also has a to_csv method:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7),index=dates)
ts.to_csv('tseries.csv')
!type tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

With a bit of wrangling (no header row, first column as the index), you can read a CSV version of a Series with read_csv, but there is also a from_csv convenience method:

Series.from_csv('tseries.csv',parse_dates=True)
D:\Anaconda3\lib\site-packages\pandas\core\series.py:2890: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
  infer_datetime_format=infer_datetime_format)





2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
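
Since from_csv is deprecated (as the warning above notes), roughly the same result can be obtained with read_csv; in this sketch, squeeze() collapses the resulting one-column DataFrame into a Series:

ts2 = pd.read_csv('tseries.csv', header=None, index_col=0, parse_dates=True).squeeze()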

Working with Delimited Formats Manually

Most forms of tabular data stored on disk can be loaded with functions like pandas.read_table. In some cases, however, some manual processing may still be necessary. It's not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider this small CSV file:

!type ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"

For any file with a single-character delimiter, you can use Python's built-in csv module. Pass any open file or file-like object to csv.reader:

import csv
f = open('ex7.csv')

reader = csv.reader(f)

Iterating through the reader yields a list of values for each line, with any quote characters removed:

for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']

From there, it's up to you to do whatever wrangling is necessary to put the data into the form you need:

lines = list(csv.reader(open('ex7.csv')))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
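
If a DataFrame is the goal, the cleaned-up dict can be passed straight to the constructor; note that the values are still strings at this point and would need to be converted afterwards:

DataFrame(data_dict)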

CSV files come in many different flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, you only need to define a simple subclass of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = 0
    
reader = csv.reader(f, dialect=my_dialect)

Individual CSV dialect parameters can also be given as keyword arguments to csv.reader without having to define a subclass:

reader = csv.reader(f,delimiter='|')

The available options (attributes of csv.Dialect) and what they do are listed in Table 6-3.

Table 6-3. CSV dialect options
Argument          Description
delimiter         One-character string used to separate fields; defaults to ","
lineterminator    Line terminator for writing; defaults to "\r\n". The reader ignores this and recognizes cross-platform line terminators
quotechar         Quote character for fields with special characters (like a delimiter); default is '"'
quoting           Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only fields with special characters like the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting)
skipinitialspace  Ignore whitespace after each delimiter; default is False
doublequote       How to handle the quoting character inside a field; if True, it is doubled
escapechar        String used to escape the delimiter if quoting is set to csv.QUOTE_NONE; disabled by default

To write delimited files manually, you can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:

with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('4','5','6'))
    writer.writerow(('7','8','9'))
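
With my_dialect defined above, mydata.csv comes out ';'-delimited. As a small follow-up sketch, reading it back with the same dialect recovers the rows as lists of strings:

with open('mydata.csv') as f:
    for row in csv.reader(f, dialect=my_dialect):
        print(row)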

JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV. Here is an example:

obj = """
{"name":"Wes",
"places_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
                {"name":"Katie","age":33,"pet":"Cosco"}]
}
"""

JSON is very nearly valid Python code, the main exceptions being its null value and a few other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data; I'll use json here, as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:

import json
result = json.loads(obj)
result
{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
  {'age': 33, 'name': 'Katie', 'pet': 'Cosco'}]}
# json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)

How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis is up to you. Conveniently, you can pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:

siblings = DataFrame(result['siblings'],columns=['name','age'])
siblings
    name  age
0  Scott   25
1  Katie   33
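
pandas can also read and write JSON directly. A minimal sketch using to_json and read_json (with orient='records', the table is written and read back as a list of row objects):

siblings.to_json('siblings.json', orient='records')
pd.read_json('siblings.json', orient='records')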

https://finance.yahoo.com/quote/AAPL/options?ltr=1

XML and HTML: Gathering Data from the Web

Python has many libraries for reading and writing data in HTML and XML formats. lxml is one that parses large files efficiently and reliably. For example, consider the Yahoo Finance page linked above:

from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('https://finance.yahoo.com/quote/AAPL/options?ltr=1&guccounter=1.html'))

doc = parsed.getroot()

Using this document object, doc, we can extract all HTML tags of a particular type, such as the table tags that hold the data we want. Suppose we wanted a list of every URL in the document; links are a tags in HTML. Using the document root's findall method along with an XPath (a means of expressing "queries" on the document):

links = doc.findall('.//a')
print('Find out links :',len(links))
links[15:20]
Find out links : 127





[<Element a at 0x1e680279548>,
 <Element a at 0x1e680279598>,
 <Element a at 0x1e6802795e8>,
 <Element a at 0x1e680279638>,
 <Element a at 0x1e680279688>]

But these are objects representing HTML elements. To get the URL and the link text, you have to use each element's get method (for the URL) and text_content method (for the display text):

lnk = links[28]
lnk.get('href')
'/quote/AAPL/options?strike=187.5&straddle=false'
lnk.text_content()
'187.50'

Getting a list of all the URLs in the document is therefore a matter of writing this list comprehension:

urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[:]
['https://finance.yahoo.com/',
 'https://yahoo.uservoice.com/forums/439018',
 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance',
 '/quote/AAPL?p=AAPL',
 '/quote/AAPL/key-statistics?p=AAPL',
 '/quote/AAPL/history?p=AAPL',
 '/quote/AAPL/profile?p=AAPL',
 '/quote/AAPL/financials?p=AAPL',
 '/quote/AAPL/analysis?p=AAPL',
 '/quote/AAPL/options?p=AAPL',
 '/quote/AAPL/holders?p=AAPL',
 '/quote/AAPL/sustainability?p=AAPL',
 '/quote/AAPL/options?ltr=1&guccounter=1.html&straddle=true',
 '/quote/AAPL190524C00167500?p=AAPL190524C00167500',
 '/quote/AAPL/options?strike=167.5&straddle=false',
 '/quote/AAPL190524C00172500?p=AAPL190524C00172500',
 '/quote/AAPL/options?strike=172.5&straddle=false',
 '/quote/AAPL190524C00175000?p=AAPL190524C00175000',
 '/quote/AAPL/options?strike=175&straddle=false',
 '/quote/AAPL190524C00177500?p=AAPL190524C00177500',
 '/quote/AAPL/options?strike=177.5&straddle=false',
 '/quote/AAPL190524C00180000?p=AAPL190524C00180000',
 '/quote/AAPL/options?strike=180&straddle=false',
 '/quote/AAPL190524C00182500?p=AAPL190524C00182500',
 '/quote/AAPL/options?strike=182.5&straddle=false',
 '/quote/AAPL190524C00185000?p=AAPL190524C00185000',
 '/quote/AAPL/options?strike=185&straddle=false',
 '/quote/AAPL190524C00187500?p=AAPL190524C00187500',
 '/quote/AAPL/options?strike=187.5&straddle=false',
 '/quote/AAPL190524C00190000?p=AAPL190524C00190000',
 '/quote/AAPL/options?strike=190&straddle=false',
 '/quote/AAPL190524C00192500?p=AAPL190524C00192500',
 '/quote/AAPL/options?strike=192.5&straddle=false',
 '/quote/AAPL190524C00195000?p=AAPL190524C00195000',
 '/quote/AAPL/options?strike=195&straddle=false',
 '/quote/AAPL190524C00197500?p=AAPL190524C00197500',
 '/quote/AAPL/options?strike=197.5&straddle=false',
 '/quote/AAPL190524C00200000?p=AAPL190524C00200000',
 '/quote/AAPL/options?strike=200&straddle=false',
 '/quote/AAPL190524C00202500?p=AAPL190524C00202500',
 '/quote/AAPL/options?strike=202.5&straddle=false',
 '/quote/AAPL190524C00205000?p=AAPL190524C00205000',
 '/quote/AAPL/options?strike=205&straddle=false',
 '/quote/AAPL190524C00207500?p=AAPL190524C00207500',
 '/quote/AAPL/options?strike=207.5&straddle=false',
 '/quote/AAPL190524C00210000?p=AAPL190524C00210000',
 '/quote/AAPL/options?strike=210&straddle=false',
 '/quote/AAPL190524C00212500?p=AAPL190524C00212500',
 '/quote/AAPL/options?strike=212.5&straddle=false',
 '/quote/AAPL190524C00215000?p=AAPL190524C00215000',
 '/quote/AAPL/options?strike=215&straddle=false',
 '/quote/AAPL190524C00217500?p=AAPL190524C00217500',
 '/quote/AAPL/options?strike=217.5&straddle=false',
 '/quote/AAPL190524C00220000?p=AAPL190524C00220000',
 '/quote/AAPL/options?strike=220&straddle=false',
 '/quote/AAPL190524C00222500?p=AAPL190524C00222500',
 '/quote/AAPL/options?strike=222.5&straddle=false',
 '/quote/AAPL190524C00225000?p=AAPL190524C00225000',
 '/quote/AAPL/options?strike=225&straddle=false',
 '/quote/AAPL190524C00230000?p=AAPL190524C00230000',
 '/quote/AAPL/options?strike=230&straddle=false',
 '/quote/AAPL190524C00232500?p=AAPL190524C00232500',
 '/quote/AAPL/options?strike=232.5&straddle=false',
 '/quote/AAPL190524C00235000?p=AAPL190524C00235000',
 '/quote/AAPL/options?strike=235&straddle=false',
 '/quote/AAPL190524C00237500?p=AAPL190524C00237500',
 '/quote/AAPL/options?strike=237.5&straddle=false',
 '/quote/AAPL190524C00242500?p=AAPL190524C00242500',
 '/quote/AAPL/options?strike=242.5&straddle=false',
 '/quote/AAPL190524P00150000?p=AAPL190524P00150000',
 '/quote/AAPL/options?strike=150&straddle=false',
 '/quote/AAPL190524P00160000?p=AAPL190524P00160000',
 '/quote/AAPL/options?strike=160&straddle=false',
 '/quote/AAPL190524P00165000?p=AAPL190524P00165000',
 '/quote/AAPL/options?strike=165&straddle=false',
 '/quote/AAPL190524P00172500?p=AAPL190524P00172500',
 '/quote/AAPL/options?strike=172.5&straddle=false',
 '/quote/AAPL190524P00175000?p=AAPL190524P00175000',
 '/quote/AAPL/options?strike=175&straddle=false',
 '/quote/AAPL190524P00177500?p=AAPL190524P00177500',
 '/quote/AAPL/options?strike=177.5&straddle=false',
 '/quote/AAPL190524P00180000?p=AAPL190524P00180000',
 '/quote/AAPL/options?strike=180&straddle=false',
 '/quote/AAPL190524P00182500?p=AAPL190524P00182500',
 '/quote/AAPL/options?strike=182.5&straddle=false',
 '/quote/AAPL190524P00185000?p=AAPL190524P00185000',
 '/quote/AAPL/options?strike=185&straddle=false',
 '/quote/AAPL190524P00187500?p=AAPL190524P00187500',
 '/quote/AAPL/options?strike=187.5&straddle=false',
 '/quote/AAPL190524P00190000?p=AAPL190524P00190000',
 '/quote/AAPL/options?strike=190&straddle=false',
 '/quote/AAPL190524P00192500?p=AAPL190524P00192500',
 '/quote/AAPL/options?strike=192.5&straddle=false',
 '/quote/AAPL190524P00195000?p=AAPL190524P00195000',
 '/quote/AAPL/options?strike=195&straddle=false',
 '/quote/AAPL190524P00197500?p=AAPL190524P00197500',
 '/quote/AAPL/options?strike=197.5&straddle=false',
 '/quote/AAPL190524P00200000?p=AAPL190524P00200000',
 '/quote/AAPL/options?strike=200&straddle=false',
 '/quote/AAPL190524P00202500?p=AAPL190524P00202500',
 '/quote/AAPL/options?strike=202.5&straddle=false',
 '/quote/AAPL190524P00227500?p=AAPL190524P00227500',
 '/quote/AAPL/options?strike=227.5&straddle=false',
 '/quote/AAPL190524P00230000?p=AAPL190524P00230000',
 '/quote/AAPL/options?strike=230&straddle=false',
 '/quote/AAPL190524P00250000?p=AAPL190524P00250000',
 '/quote/AAPL/options?strike=250&straddle=false',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US',
 'https://yahoo.uservoice.com/forums/382977',
 'http://info.yahoo.com/privacy/us/yahoo/',
 'http://info.yahoo.com/relevantads/',
 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
 'https://finance.yahoo.com/sitemap/',
 'http://twitter.com/YahooFinance',
 'http://facebook.com/yahoofinance',
 'http://yahoofinance.tumblr.com',
 '/',
 '/watchlists',
 '/portfolios',
 '/screener',
 '/calendar',
 '/industries',
 '/videos/',
 '/news/',
 '/personal-finance',
 '/tech']

Now, finding the right tables in the document can be a matter of trial and error. Some websites make this easier by giving the table of interest an id attribute. I determined that these were the two tables containing the call data and the put data:

tables = doc.findall('.//table')
tables  # only two table objects on the page
[<Element table at 0x1e68027ae58>, <Element table at 0x1e68027a9a8>]
calls = tables[0]  # AAPL call options data
puts = tables[1]   # AAPL put options data
rows = calls.findall('.//tr')  # each table has a header row followed by the data rows
len(rows)
29

For the header as well as the data rows, we want to extract the text from each cell; in the case of the header these are th cells, while for the data rows they are td cells:

def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

Thus, we obtain:

_unpack(rows[0],kind='th')
['Contract Name',
 'Last Trade Date',
 'Strike',
 'Last Price',
 'Bid',
 'Ask',
 'Change',
 '% Change',
 'Volume',
 'Open Interest',
 'Implied Volatility']
_unpack(rows[1],kind='td')
['AAPL190524C00167500',
 '2019-05-14 3:49PM EDT',
 '167.50',
 '21.90',
 '20.65',
 '21.90',
 '0.00',
 '-',
 '5',
 '5',
 '57.52%']

Now, to combine all of these steps and convert the data into a DataFrame. Since the numerical data is still in string format, we want to convert some, but perhaps not all, of the columns to floating-point format. You could do this by hand, but pandas happens to have a class, TextParser, that can perform the appropriate automatic type conversion:

from pandas.io.parsers import TextParser

def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

Finally, we invoke this parsing function on the two lxml table objects and get the final DataFrames:

call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]
         Contract Name        Last Trade Date  Strike  Last Price    Bid    Ask  Change % Change  Volume Open Interest Implied Volatility
0  AAPL190524C00167500  2019-05-14 3:49PM EDT   167.5       21.90  20.65  21.90    0.00        -       5             5             57.52%
1  AAPL190524C00172500  2019-05-17 3:50PM EDT   172.5       16.60  16.60  17.00   -2.35  -12.40%     204           142             48.88%
2  AAPL190524C00175000  2019-05-17 3:44PM EDT   175.0       14.55  14.20  14.60   -1.30   -8.20%     138            99             45.17%
3  AAPL190524C00177500  2019-05-17 3:50PM EDT   177.5       11.90  11.85  12.50   -1.80  -13.14%     120           234             46.00%
4  AAPL190524C00180000  2019-05-17 3:51PM EDT   180.0        9.70   9.65  10.00   -0.60   -5.83%   1,550           411             39.09%
5  AAPL190524C00182500  2019-05-17 3:39PM EDT   182.5        7.68   7.50   7.85   -0.47   -5.77%     272           406             36.40%
6  AAPL190524C00185000  2019-05-17 3:59PM EDT   185.0        5.90   5.85   6.00   -0.50   -7.81%   3,184         1,001             35.40%
7  AAPL190524C00187500  2019-05-17 3:59PM EDT   187.5        4.05   4.00   4.10   -0.59  -12.72%   6,159         1,645             31.69%
8  AAPL190524C00190000  2019-05-17 3:59PM EDT   190.0        2.69   2.59   2.67   -0.36  -11.80%  19,710         4,230             30.03%
9  AAPL190524C00192500  2019-05-17 3:59PM EDT   192.5        1.60   1.56   1.61   -0.34  -17.53%  11,506         4,427             28.91%

XML (eXtensible Markup Language) is another common structured data format that supports hierarchical, nested data with metadata. The files used in this book actually come from a large XML document.

The New York Metropolitan Transportation Authority (MTA) publishes a number of data series about its bus and train services (http://www.mta.info/developers/download.html ). Here we'll look at the performance data, which is contained in a set of XML files. Each train or bus service has its own file (for example, Performance_MNR.xml for the Metro-North Railroad), and each XML record in the file is one month of data. Note that the URL and the files mentioned above have changed somewhat since the book was written; the listing below is actually the content of "./Performance_MNRR.xml", and the rest of these notes work from that file.

<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>28345</INDICATOR_SEQ>
    <PARENT_SEQ>55526</PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>Hudson Line - OTP</INDICATOR_NAME>
    <DESCRIPTION>Percent of commuter trains that arrive at their destinations within 5 minutes and 59 seconds of the scheduled time.</DESCRIPTION>
    <CATEGORY>Service Indicators</CATEGORY>
    <FREQUENCY>M</FREQUENCY>
    <DESIRED_CHANGE>U</DESIRED_CHANGE>
    <INDICATOR_UNIT>%</INDICATOR_UNIT>
    <DECIMAL_PLACES>1</DECIMAL_PLACES>
    <YEAR>
      <PERIOD_YEAR>2008</PERIOD_YEAR>
      <MONTH>
        <PERIOD_MONTH>1</PERIOD_MONTH>
        <MONTHLYVALUES>
          <YTD_TARGET>98.00</YTD_TARGET>

Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:

from lxml import objectify
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

path="Performance_MNRR.xml"
parsed = objectify.parse(open(path))
root = parsed.getroot()

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a few of the tags):

data = []

skip_fields = ['PARENT_SEQ','INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.text
    data.append(el_data)

Lastly, convert this list of dicts into a DataFrame:

perf = DataFrame(data)
perf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 7 columns):
AGENCY_NAME       13 non-null object
CATEGORY          13 non-null object
DESCRIPTION       13 non-null object
FREQUENCY         13 non-null object
INDICATOR_NAME    13 non-null object
INDICATOR_UNIT    13 non-null object
YEAR              0 non-null object
dtypes: object(7)
memory usage: 808.0+ bytes
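
Note that the YEAR column shows 0 non-null values: in this particular file, YEAR is itself a nested element (it wraps PERIOD_YEAR, MONTH, and so on, as in the listing above), so child.text is None for it. A minimal sketch of one way to pull the year out, assuming every INDICATOR has a YEAR element with a PERIOD_YEAR child as shown:

years = []
for elt in root.INDICATOR:
    # YEAR is only a container element; its PERIOD_YEAR child holds the actual value
    years.append(int(elt.YEAR.PERIOD_YEAR))
perf['YEAR'] = years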

XML data can get much more complicated than this example. Each tag can have metadata too. Consider an HTML link tag, which is also valid XML:

from io import StringIO

tag = '<a href="http://www.baidu.com">BaiDu</a>'

root = objectify.parse(StringIO(tag)).getroot()

You can now access any of the fields in the tag (like href) or the link text:

root
<Element a at 0x137f2861e88>
root.get('href')
'http://www.baidu.com'
root.text
'BaiDu'

Reposted from: https://www.cnblogs.com/sam-xu/p/10885545.html
