Python Data Analysis (Part 4)

小白只对大佬的文章感兴趣

Article link: https://blog.csdn.net/ex_6450/article/details/125862403

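This article covers binary data storage in Python with the pickle and HDF5 formats, and shows how to use pandas with Excel files and Web APIs. It also shows how to read and write data through a SQLite database and how to work with databases through SQLAlchemy.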

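Binary Data Formats

1. pickle serialization

One of the simplest ways to store data efficiently in binary form is Python's built-in pickle serialization. Every pandas object has a to_pickle method that writes the data to disk in pickle format: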

import pandas as pd

frame = pd.read_csv('examples/ex1.csv')
frame
# Output:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
frame.to_pickle('examples/frame_pickle')

# A pickled object can be read back with the built-in pickle module directly,
# or more conveniently with pandas.read_pickle:
pd.read_pickle('examples/frame_pickle')

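pickle is recommended only as a short-term storage format. The format is not guaranteed to stay stable over time, so an object pickled today may fail to unpickle with a later version of a library.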
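
To make the two reading routes concrete, here is a minimal round-trip sketch. It assumes a small made-up frame in place of examples/ex1.csv and a hypothetical file name inside a temporary directory, so it does not touch the example files used in this article.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Made-up values standing in for the ex1.csv example data.
frame = pd.DataFrame({
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "message": ["hello", "world", "foo"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame_pickle"  # hypothetical file name
    frame.to_pickle(path)

    # Route 1: the standard-library pickle module.
    with open(path, "rb") as fh:
        via_pickle = pickle.load(fh)

    # Route 2: the pandas convenience wrapper.
    via_pandas = pd.read_pickle(path)

# Both routes should give back a frame equal to the original.
print(via_pickle.equals(frame), via_pandas.equals(frame))
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.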

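2. Using the HDF5 format

Although HDF5 files can be accessed directly with the PyTables or h5py libraries, pandas provides a higher-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dictionary and handles the low-level details: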

import numpy as np

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

Objects in the HDF5 file can be retrieved with the same dictionary-like API:

store['obj1']
# Output:
	a
0	-1.252299
1	-0.702365
2	0.201182
3	-0.128883
4	-0.279090
...	...
95	-0.828810
96	-0.599029
97	0.215832
98	0.425337
99	1.383613
100 rows × 1 columns

HDFStore supports two storage formats, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:

# put is an explicit version of store['obj2'] = frame that lets us set other options, such as the storage format.
store.put('obj2', frame, format='table')
print(store.select('obj2', where=['index >= 10 and index <= 15']))
# Output:
           a
10  0.257530
11  0.134160
12 -0.702396
13 -1.074102
14  0.047611
15 -1.006746
store.close()

The pandas.read_hdf function gives you a shortcut to these tools:

frame.to_hdf('mydata.h5', key='obj3', format='table')  # key passed by keyword; newer pandas versions deprecate passing it positionally
pd.read_hdf('mydata.h5', key='obj3', where=['index < 5'])
# Output:
	a
0	-1.252299
1	-0.702365
2	0.201182
3	-0.128883
4	-0.279090

Remove the .h5 file:

import os
os.remove('mydata.h5')

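HDF5 is not a database. It is best suited to write-once, read-many datasets. Although data can be added to a file at any time, the file can become corrupted if multiple writers do so simultaneously.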

3. Reading Microsoft Excel files

To use ExcelFile, create an instance by passing the path to an xls or xlsx file:

xlsx = pd.ExcelFile('examples/ex1.xlsx')

Data stored in a sheet can then be read into a DataFrame with read_excel:

pd.read_excel(xlsx, 'Sheet1')
# Output:
    0   a	b	c	d	message
0	0	1	2	3	4	hello
1	1	5	6	7	8	world
2	2	9	10	11	12	foo

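If you are reading several sheets from one file, creating the ExcelFile once avoids reopening the file for each sheet. For a single read, you can also pass the file name directly to pandas.read_excel: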

frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame
# Output:
    0	a	b	c	d	message
0	0	1	2	3	4	hello
1	1	5	6	7	8	world
2	2	9	10	11	12	foo

To write pandas data to Excel format, first create an ExcelWriter, then write the data to it with the to_excel method of a pandas object:

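writer = pd.ExcelWriter('examples/ex2.xlsx')  # first create an ExcelWriter
frame.to_excel(writer, sheet_name='Sheet1')  # write the data to it with the to_excel method
writer.close()  # saves the file; writer.save() was removed in pandas 2.0
pd.read_excel('examples/ex2.xlsx', 'Sheet1')
# Output: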
   0.1	0	a	b	c	d	message
0	0	0	1	2	3	4	hello
1	1	1	5	6	7	8	world
2	2	2	9	10	11	12	foo

You can also skip ExcelWriter and pass a file path to to_excel:

frame.to_excel('examples/ex2.xlsx')

Delete the ex2.xlsx file:

! rm examples/ex2.xlsx
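
The `! rm` line above is IPython/Jupyter shell syntax and relies on a Unix-style rm command. As a hedged alternative for plain Python scripts or Windows, the sketch below uses pathlib from the standard library. The temporary directory and the empty stand-in file are assumptions made only so the example runs without the real examples/ex2.xlsx.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ex2.xlsx"   # stand-in for examples/ex2.xlsx
    target.write_bytes(b"")           # create an empty placeholder file
    existed_before = target.exists()

    # missing_ok=True (Python 3.8+) means no error if the file is already gone.
    target.unlink(missing_ok=True)
    target.unlink(missing_ok=True)    # second call does nothing and raises nothing
    exists_after = target.exists()

print(existed_before, exists_after)
```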

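4. Interacting with Web APIs

Many websites have public APIs that provide data as JSON or in some other format. There are several ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package (http://docs.python-requests.org).
To find the latest 30 GitHub issues for pandas, we can make an HTTP GET request using the add-on requests library: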

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
# Output:
<Response [200]>

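The response object's json method parses the JSON body into native Python objects; for this endpoint the result is a list of dictionaries, one per issue: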

data = resp.json()
data[0]['title']
# Output:
'Period does not round down for frequencies less that 1 hour'

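Each element in data is a dictionary containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract the fields of interest: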

issues = pd.DataFrame(data, columns=['number', 'title',
                                      'labels', 'state'])
print(issues)
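
Because the live request needs network access and its results change over time, the following offline sketch shows the same list-of-dictionaries to DataFrame step on made-up records. The records only imitate a few fields of a GitHub issue and are not real API output; the label_names column is an illustrative addition, not part of the original example.

```python
import pandas as pd

# Made-up records shaped like a few fields of a GitHub issue (not real API output).
data = [
    {"number": 1, "title": "Example issue A", "labels": [],
     "state": "open", "comments": 2},
    {"number": 2, "title": "Example issue B", "labels": [{"name": "Bug"}],
     "state": "open", "comments": 0},
]

# columns= keeps only the listed keys; "comments" is dropped.
issues = pd.DataFrame(data, columns=["number", "title", "labels", "state"])

# Each "labels" cell is a list of dicts; reduce it to a list of label names.
issues["label_names"] = issues["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)

print(issues[["number", "title", "label_names", "state"]])
```

In real use it is also worth calling resp.raise_for_status() before resp.json(), so that an HTTP error is reported instead of being parsed as data.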

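5. Interacting with Databases

In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and several other kinds of database are popular as well. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions that simplify the process. As an example, I will use a SQLite database through Python's built-in sqlite3 driver: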

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
# Output:
<sqlite3.Cursor at 0x2a1226f0f80>
con.commit()

Insert a few rows of data:

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
# Output:
<sqlite3.Cursor at 0x7f6b15c66ce0>
con.commit()  # commit so the inserted rows are visible to other connections

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:

cursor = con.execute('select * from test')  # query the test table
rows = cursor.fetchall()
rows
# Output:
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

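You can pass the list of tuples to the DataFrame constructor, but you also need the column names, which are contained in the cursor's description attribute: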

cursor.description
# Output:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
# Output:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
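
The cursor-to-DataFrame steps above can be wrapped in a small function. This is a sketch under stated assumptions: query_to_frame is a hypothetical helper name, not part of pandas or sqlite3, and an in-memory database is used so that the example leaves mydata.sqlite untouched.

```python
import sqlite3

import pandas as pd


def query_to_frame(con, sql, params=()):
    """Hypothetical helper: run a query and build a DataFrame from the cursor."""
    cursor = con.execute(sql, params)
    # The first item of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)


con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [("Atlanta", "Georgia", 1.25, 6),
     ("Tallahassee", "Florida", 2.6, 3),
     ("Sacramento", "California", 1.7, 5)],
)
con.commit()

# The ? placeholder keeps the value out of the SQL string itself.
result = query_to_frame(con, "SELECT a, d FROM test WHERE d > ?", (4,))
con.close()

print(result)
```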

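The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that lets you read data easily from a SQLAlchemy connection. Here, we connect to the same SQLite database with SQLAlchemy and read data from the table created earlier: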

import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')  # connect to the database
pd.read_sql('select * from test', db)
# Output:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
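
For SQLite specifically, pandas also accepts a plain sqlite3 connection, so a round trip is possible without SQLAlchemy. The sketch below is one way to do it, assuming an in-memory database and the same three rows as above: to_sql writes the frame as a table, and read_sql reads back a filtered subset using a query parameter.

```python
import sqlite3

import pandas as pd

frame = pd.DataFrame({
    "a": ["Atlanta", "Tallahassee", "Sacramento"],
    "b": ["Georgia", "Florida", "California"],
    "c": [1.25, 2.6, 1.7],
    "d": [6, 3, 5],
})

con = sqlite3.connect(":memory:")  # throwaway in-memory database

# Write the frame as table "test"; index=False skips the row labels.
frame.to_sql("test", con, index=False)

# Read a subset back; the ? placeholder is filled from params.
subset = pd.read_sql("SELECT * FROM test WHERE c > ?", con, params=(1.5,))
con.close()

print(subset)
```

For other databases (PostgreSQL, MySQL, and so on), a SQLAlchemy engine as shown above remains the expected connection type for pandas.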