Word frequency counting with ODPS

hjimce

Category: Data Mining

Article link: https://blog.csdn.net/hjimce/article/details/78134207

1. Set up MaxCompute Studio

I. Write the UDTF

2. In the project, create a new file under scripts: New -> MaxCompute Python -> Python UDTF, then write the text-splitting function:

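from odps.udf import annotate
from odps.udf import BaseUDTF


# Signature: one string column in, one string column out.
@annotate('string -> string')
class my_first_udtf(BaseUDTF):
    def process(self, arg):
        # Split the input line on single spaces.
        props = arg.split(' ')
        for p in props:
            # Emit each word as its own output row.
            self.forward(p)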

Then run it to debug locally and check the output.
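If MaxCompute Studio is not at hand, the splitting logic can also be checked in plain Python. The following is only a minimal sketch: `FakeBaseUDTF`, `SplitWords`, and `run_udtf` are hypothetical stand-ins written for this illustration, not part of the `odps` package, and the stand-in's `forward` simply records rows instead of emitting them to a table.

```python
# Hypothetical stand-in for odps.udf.BaseUDTF, for local illustration only.
class FakeBaseUDTF:
    def __init__(self):
        self.rows = []

    def forward(self, *args):
        # The real forward() emits one output row; here we just record it.
        self.rows.append(args)


class SplitWords(FakeBaseUDTF):
    # Same body as process() in my_first_udtf above.
    def process(self, arg):
        for p in arg.split(' '):
            self.forward(p)


def run_udtf(lines):
    """Feed each input line to the UDTF and collect the emitted words."""
    udtf = SplitWords()
    for line in lines:
        udtf.process(line)
    return [row[0] for row in udtf.rows]


print(run_udtf(["hello odps", "hello world"]))
# → ['hello', 'odps', 'hello', 'world']
```

Note that `split(' ')` produces empty strings when a line contains consecutive spaces; if that matters for the word counts, `arg.split()` with no argument may be the safer choice.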

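3. Submit the UDTF and register the function: open the Python source file, right-click inside the class my_first_udtf and choose Deploy to Server, then fill in the function name. This uploads the code and registers the function under that name.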

4. Call the function from SQL to test the result:

create table if not exists chenmo_word_split_tabel1(word string ) lifecycle 1;
insert overwrite table  chenmo_word_split_tabel1 select my_first_udtf(word) as word from chenmo_wc_in;

II. Write a UDAF for aggregation