阿里云MaxCompute中pyODPS的使用：多线程上传、下载、分区

最新推荐文章于 2024-08-07 09:02:13 发布

悟乙己

最新推荐文章于 2024-08-07 09:02:13 发布

阅读量6.5k

点赞数 3

分类专栏： Python︱基础与数据处理文章标签： pyodps odps maxcompute 阿里云 mars

本文链接：https://blog.csdn.net/sinat_26917383/article/details/112178220

版权

Python︱基础与数据处理专栏收录该内容

34 篇文章 26 订阅

订阅专栏

安装：

pip install pyodps

1 初始化入口

在使用PyODPS前，您需要用阿里云账号初始化一个MaxCompute的入口，执行命令如下。

from odps import ODPS
odps = ODPS('**your-access-id**', '**your-secret-access-key**', '**your-default-project**',endpoint='**your-end-point**')

参数说明：

your-access-id：账号的AccessKey ID。
your-secret-access-key：账号的AccessKey Secret。
your-default-project：使用的项目空间名称。
your-end-point：MaxCompute服务所在区域的Endpoint。详情请参见配置Endpoint。

2 java可实现：SQLTask配合Tunnel实现大量数据导出

参考：SQLTask配合Tunnel实现大量数据导出

一般情况下，如果数据量非常大，且要做迁移或导出，需要使用该方式：
Tunnel，如果需要导出的查询结果是某张表或具体某个分区的全部内容，可以使用Tunnel命令行完成。
MaxCompute提供了Tunnel命令行工具和Tunnel SDK，详情请参见Tunnel命令参考和批量数据通道概要。
当然，没有Python的SDK，好像比较难受。。

3 pyODPS

3.1 sql读入

读取SQL执行结果。
运行SQL的Instance能够直接执行open_reader操作读取SQL执行结果。读取时会出现以下两种情况：
SQL返回了结构化的数据。

with o.execute_sql('select * from table_name').open_reader() as reader:
    for record in reader:
    # 处理每一个record。

当然，这里的分区很重要，笔者使用的时候，如果不加ds = 20200101设置分区的话，就会默认查全部分区，数据量很大，一般查不出来。

所以如果数据量比较大的话，可以：


with o.execute_sql("select * from table  where ds = '20201225' limit 1;").open_reader() as reader:
    for record in reader:
        print(record)

SQL可能执行了desc命令，这时可以通过reader.raw取到原始的SQL执行结果。

with o.execute_sql('desc table_name').open_reader() as reader:
print(reader.raw)

可以得到表的字段名称:
在这里插入图片描述

3.2 DataFrame

3.2.1 dataframe读入

PyODPS提供了DataFrame API，它提供了类似Pandas的接口，但是能充分利用MaxCompute的计算能力。完整的DataFrame文档请参见DataFrame。

传入Table对象，创建DataFrame对象users。

from odps.df import DataFrame
users = DataFrame(o.get_table('pyodps_ml_100k_users'))

对DataFrame对象可以执行如下操作：
通过dtypes属性可以查看DataFrame的字段和类型，如下所示。
users.dtypes
通过head方法，可以获取前N条数据，方便快速预览数据。
users.head(10)

如果您需要在print时调用立即执行，需要开启options.interactive 。

from odps import options
from odps.df import DataFrame
options.interactive = True  # 在开始处打开开关。
iris = DataFrame(o.get_table('pyodps_iris'))
print(iris.sepal_width.sum())  # print时会立即执行。

总结来看两种方式：

# 2 to_pandas - 1
form_name= form_name'
# 从 ODPS 表创建
data2 = o.get_table(form_name).to_df()  # 使用表的to_df方法


# 2 to_pandas - 2
from odps.df import DataFrame
%time data = DataFrame(o.get_table(form_name))

3.2.2 分区

如何在dataframe中设置分区？
o.get_table('table_name').get_partition('ds=' + args['ds'])

一些官方注意事项 + 使用限制：

PyODPS节点获取本地处理的数据不能超过50MB，节点运行时占用内存不能超过1G，否则节点任务会被系统中止。请避免在PyODPS任务中写额外的Python数据处理代码。

由于兼容性原因，在DataWorks中，options.tunnel.use_instance_tunnel默认设置为False。如果需要全局开启instance tunnel，需要手动将该值设置为True。

由于实现的原因，Python的atexit包不被支持，请使用try-finally结构实现相关功能。

3.3 to_pandas

总结来看两种方式：

# 2 to_pandas - 1
form_name= form_name'
# 从 ODPS 表创建
data2 = o.get_table(form_name).to_df()  # 使用表的to_df方法


# 2 to_pandas - 2
from odps.df import DataFrame
%time data2 = DataFrame(o.get_table(form_name))

data2.to_pandas()

笔者自己测试，1000w*7的数据大概运行25min+

4 与mars融合

要在 MaxCompute 上运行 Mars，需要有相应的运行环境。这可以分为如下几种情况。

开箱即用的环境，如 dataworks，会包含所需要的依赖。
其他环境，需要自己安装相关依赖。

比这还没实操到这，先埋个坑

需要保证环境安装了：

from odps import __version__ as odps_version
from mars import __version__ as mars_version

print(odps_version)
print(mars_version)

输出的第一个为 PyODPS 版本，第二个为 Mars 版本。要求 PyODPS 至少是 0.9.0。

Python 版本推荐使用 3.7 版本，至少需要是 3.5 版本。

通过如下命令安装：

pip install -U pip  # 可选，确保 pip 版本足够新
pip install pyodps[mars]>=0.9.2  # pyodps 需要至少 0.9.2，这个方式也会安装 mars

5 如何上传数据

参考：https://help.aliyun.com/document_detail/34615.html?spm=a2c4g.11186623.6.1088.54b3452afpf7EA

5.1 两步创建表

先要规定数据scheme

from odps.models import Schema, Column, Partition
schema = Schema.from_lists( ['num', 'num2'], ['bigint', 'double'], ['pt'], ['string'])
schema.columns

其中，[‘num’, ‘num2’] 是字段名；数据类型[‘bigint’, ‘double’]，[‘pt’], [‘string’] 分区名字以及分区类型

再创建表
使用表Schema创建表，方法如下：

table = o.create_table('my_new_table', schema)
table = o.create_table('my_new_table', schema, if_not_exists=True)  # 只有不存在表时，才创建表。
table = o.create_table('my_new_table', schema, lifecycle=7)  # 设置生命周期。

5.2 一步到位创建表

# 创建非分区表。
table = o.create_table('my_new_table', 'num bigint, num2 double', if_not_exists=True)
# 创建分区表可传入（表字段列表，分区字段列表）。
table = o.create_table('my_new_table', ('num bigint, num2 double', 'pt string'), if_not_exists=True)

这里('num bigint, num2 double', 'pt string') 就是schema了

5.3 schema的数据类型

BIGINT（超级位数的整数型）、DOUBLE(浮点计算)、DECIMAL、
STRING、DATETIME、BOOLEAN、MAP和ARRAY类型。如果您需要支持TINYINT和STRUCT等新数据类型，可以打开options.sql.use_odps2_extension = True开关，示例如下。

from odps import options
options.sql.use_odps2_extension = True
table = o.create_table('my_new_table', 'cat smallint, content struct<title:varchar(100), body:string>')

其中，

BIGINT（超级位数的整数型），可以精确的表示从-263到263-1(即从-9,223,372,036,854,775,808到 9,223,372,036,854,775,807)之间的整数，它占用了八个字节的存储空间。
decimal(numeric)，用于精确存储数值。decimal 数据类型最多可存储 38 个数字,所有数字都能够放到小数点的右边。Decimal则以字符串的形式保存数值
double，float 表示的小数点位数少,double能表示的小数点位数多，更加精确

在存储同样范围的值时，通常比decimal使用更少的空间，float使用4个字节存储，double使用8个字节 ,
而 decimal依赖于M和D的值,所以decimal使用更少的空间

一些数据详解：MySQL中float、double、decimal三个浮点类型的区别与总结!

5.4 创建分区

 # 初始化
 o = ODPS(access_id,secret_access_key,default_project,end_point          )
 # 创建表格
t = o.create_table('my_new_table', schema)
# 创建分区
t.create_partition('pt=test', if_not_exists=True)  # 不存在的时候才创建

# 删除分区
t.delete_partition('pt=test', if_exists=True)  # 存在的时候才删除
partition.drop()  # Partition对象存在的时候直接drop

# 判断分区是否存在
t.exist_partitions('pt=test')

sql里面删除分区的方式：
参考：https://help.aliyun.com/document_detail/73771.html#section-yxk-fwn-k5c

--一次删除一个分区。
alter table <table_name> drop [if exists] partition <pt_spec>;
--一次删除多个分区。
alter table <table_name> drop [if exists] partition <pt_spec>,partition <pt_spec>[,partition <pt_spec>....];

举例：

alter table  xxx.xxx drop if exists partition (ds='20210701');

5.5 code snippet

联合起来可写成：

def create_table(self,schema_list,database_name):
	# 数据表新建：先构建chema 、 再上传数据	
    #schema = Schema.from_lists(schema_list, ['ds'], ['string'])
    columns = [Column(name=schema_list[0][n], type=schema_list[1][n], comment=schema_list[0][n]) for n in range(len(schema_list[0]))]

    partitions = [Partition(name='ds', type='string', comment='the partition')]
    schema = Schema(columns=columns, partitions=partitions)
    
    table_infos = self.o.create_table(database_name, schema, if_not_exists=True) 
    print('success load the table :',database_name)
    print(table_infos)
    return table_infos

def write_table(self,dataframe,table_name,ds):
	# 数据上传
    # ds = '20210609'
    records = dataframe.to_numpy().tolist()
    self.o.write_table(table_name, records, partition='ds=' + ds, create_partition=True)
    print('success load.')

data = [['Alex',10],['Bob',12],['Clarke',13]]
output = pd.DataFrame(data,columns=['Name','Age'])
schema_list = ['name', 'age'], ['string', 'double']

# create datatable
schema_list = ['name', 'age'], ['string', 'double']
database_name = 'test_2'
pf.create_table(schema_list,database_name)

# upload datatable
dataframe = data_test
pf.write_table(dataframe,'test_2','20210610')

5.6 回传大规模数据

参考文档

之前的：
ODPS 对象的 write_table 方法，例如

>>> records = [[111, 'aaa', True],                 # 这里可以是list
>>>            [222, 'bbb', False],
>>>            [333, 'ccc', True],
>>>            [444, '中文', False]]
>>> o.write_table('test_table', records, partition='pt=test', create_partition=True)

整体来看，用这种方式是比较适合小规模数据的，笔者自己测试一下发现大概20w左右的数据条数还是可以的

但是如果一次性有百万级的数据，那上述的方式就会报错：

ConnectionError: [Errno 104] Connection reset by peer

那么这时候就需要多进程来分批上传了：

import random
from multiprocessing import Pool
from odps.tunnel import TableTunnel

def write_records(session_id, block_id):
    # 使用指定的 id 创建 session
    local_session = tunnel.create_upload_session(table.name, upload_id=session_id)
    # 创建 writer 时指定 block_id
    with local_session.open_record_writer(block_id) as writer:
    # 生成5个sample数据逐一写入
        for i in range(5):
            # 生成数据并写入对应 block
            record = table.new_record([random.randint(1, 100), random.random()])
            writer.write(record)

if __name__ == '__main__':
	# 机器的核心数
    N_WORKERS = 3
    
    # 根据schema初始化表
    table = o.create_table('my_new_table', 'num bigint, num2 double', if_not_exists=True)
    # 启动ODPS tunnel
    tunnel = TableTunnel(o)
    upload_session = tunnel.create_upload_session(table.name)

    # 每个进程使用同一个 session_id
    session_id = upload_session.id
    
    pool = Pool(processes=N_WORKERS)
    futures = []
    block_ids = []
    
    for i in range(N_WORKERS):
        futures.append(pool.apply_async(write_records, (session_id, i)))
        block_ids.append(i)
    [f.get() for f in futures]

    # 最后执行 commit，并指定所有 block
    upload_session.commit(block_ids)

ODPS Tunnel是ODPS的数据通道，用户可以通过Tunnel向ODPS中上传或者下载数据。
官方文档给到的素材是上面这样的，之前没啥注释，一开始还是非常费解的，我自己跑通之后简单注释一下。
其中，write_records(session_id, block_id)就是逐个上传recond的模块，
这里笔者举一个带分区的例子：

import random
from multiprocessing import Pool
from odps.tunnel import TableTunnel

def write_records(session_id, block_id,ds_part = 'ds=20210617'):
	# 这里block_id代表第几个任务,range(N_WORKERS)
    # 使用指定的 id 创建 session
    local_session = tunnel.create_upload_session(table.name, upload_id=session_id,partition_spec=ds_part)
    # 创建 writer 时指定 block_id
    with local_session.open_record_writer(block_id) as writer:
        for _record in records_split_list[block_id]:
        	# 将所有list数据均分，然后逐一給入，并write上传
            record = table.new_record(_record)
            writer.write(record)

def split_list(listTemp, n):
    '''
    n means step
    [[1,2,3],[3,4,6],[5,6],[5,6],[5,6]]
    
    [[[1, 2, 3], [3, 4, 6], [5, 6]], [[5, 6], [5, 6]]]
    '''
    out = []
    n = int(len(listTemp) / n) + 1
    for i in range(0, len(listTemp), n):
        out.append(listTemp[i:i + n]) 
    return out

N_WORKERS = 10  # 10个线程
ds_part = 'ds=20210617'
reconds = [[1,2],[3,6],[5,6],[5,6],[5,6]]

# 创建表
table = o.create_table('my_new_table', 'num bigint, num2 double', if_not_exists=True)
# 创建表的分区
table.create_partition(ds_part, if_not_exists=True)  # 不存在的时候才创建
# 拆分数据集，将数据集拆分成10份
records_split_list = split_list(records,N_WORKERS)
# 初始化tunnel
tunnel = TableTunnel(o)
upload_session = tunnel.create_upload_session(table.name,partition_spec=ds_part)

# 每个进程使用同一个 session_id
session_id = upload_session.id
print('session_id',session_id)

pool = Pool(processes=N_WORKERS)
futures = []
block_ids = []
for i in range(N_WORKERS):
    futures.append(pool.apply_async(write_records, (session_id, i)))
    block_ids.append(i)
[f.get() for f in futures]

# 最后执行 commit，并指定所有 block
upload_session.commit(block_ids)

如果报错：
ODPSError: InvalidResourceSpec
通常为Project/Table/Partition信息与Session不一致，一般是Partition写混了

参考：MaxCompute错误码表：Tunnel常见错误
在这里插入图片描述

6 常规查询与操作

6.1 如何反选某个字段

参考：
https://help.aliyun.com/document_detail/73777.html?spm=a2c4g.11186623.6.734.4ed3399cR3UDc8

有几种情况：
选出sale_detail表中所有列名以sh开头的列。命令示例如下：

select `sh.*` from sale_detail;

选出sale_detail表中列名不为shop_name的所有列。命令示例如下：

select `(shop_name)?+.+` from sale_detail;

选出sale_detail表中排除shop_name和customer_id两列的其它列。命令示例如下：

select `(shop_name|customer_id)?+.+` from sale_detail;

选出sale_detail表中排除列名以t开头的其它列。命令示例如下：

select `(t.*)?+.+` from sale_detail;

在排除多个列时，如果col2是col1的前缀，则需保证col1写在col2的前面（较长的col写在前面）。例如，一个表有2个分区无需被查询，一个分区名为ds，另一个分区名为dshh，由于前者是后者的前缀，正确表达式为select (dshh|ds)?+.+ from t;；错误表达式为select (ds|dshh)?+.+ from t;。

6.2 列出项目空间下的表、视图、分区

参考文档：列出项目空间下的表和视图

--列出项目空间下所有的表和视图。
show tables;
--列出项目空间下表名或视图名与chart匹配的表。
show tables like '<chart>';

列出所有分区

show partitions <table_name>;

查询分区表sale_detail的信息。

desc sale_detail;

在这里插入图片描述

6.3 删除表

参考文档：列出项目空间下的表和视图

–删除表sale_detail。无论sale_detail表是否存在，均返回成功。

drop table if exists sale_detail;

6.4 查询一个表的schema

show create table xxx.table

6.5 新建表

有两种方式，直接用schema创建，或者可以抄已有数据表的表结构

create table if not exists sale_detail(
 shop_name     STRING,
 customer_id   STRING,
 total_price   DOUBLE)
partitioned by (sale_date STRING, region STRING);

如果有的表已经有了也可以通过show create table xxx.table快速拿到创建的schema

快速的方式：

create table  xxx.table1 like  xxx.table2;

6.6 sql数据插入表insert

插入或覆写数据（INSERT INTO | INSERT OVERWRITE）

insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
  select 
  shop_name, 
  customer_id,
  total_price 
  from sale_detail
  zorder by customer_id, total_price;

这里有个坑

就是，create 和 insert

7 ODPS批量下载数据

7.1 常规ODPS下载

常规ODPS下载数据的方式很多

7.2 多线程下载

多线程下载这里有点坑，pyodps文档里面标注的是：

>>> import multiprocessing
>>> n_process = multiprocessing.cpu_count()
>>> with t.open_reader(partition='pt=test') as reader:
>>>     pd_df = reader.to_pandas(n_process=n_process)

这里是跑不出来的，会报错：

odps.errors.ODPSError: InvalidResourceSpec: RequestId: xxxxxxxxxx
The specified resourcespec is not valid.
odps.errors.ODPSError: InvalidResourceSpec: RequestId: xxxxxxxxxx
The specified resourcespec is not valid.
odps.errors.ODPSError: InvalidResourceSpec: RequestId: xxxxxxxxxx
The specified resourcespec is not valid.

但是在github上有一段可以运行的代码：

>>> import multiprocessing
>>> n_process = multiprocessing.cpu_count()
>>> with o.execute_sql('select * from dual').open_reader(tunnel=True) as reader:
>>>     # n_process 指定成机器核数
>>>     pd_df = reader.to_pandas(n_process=n_process)

从两个对比来看，是之前的文档中少加了tunnel，所以不能正常启动session。

8 一些报错

8.1 OverflowError

OverflowError: Date older than 1928/01/01 and may contain errors. Ignore this error by configuring `options.allow_antique_date` to True.

很烦，有报错，但是不会停止

from odps import options
options.allow_antique_date = True

8.2 如果按条件计数：if 的使用

select SUM(IF(x='是:Y' ,1,0)) n0_sum,SUM(IF(y='是:Y' ,1,0)) n1_sum,SUM(IF(z='是:Y' ,1,0)) n2_sum
from xxxxx

8.3 截取字段函数：substr

substr(x,1,4)

截取x变量，第1 - 4个内容，比如19900101，截取的就是：1990

参考：
Python SDK概述

pyodps DataFrame

SQL及函数 > DDL语句

github中SQL

悟乙己

关注

3
点赞
踩
34

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录