Defining incremental data updates: adding a custom data source to Superset (work in progress)

weixin_39609822

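Background:

The existing business logic is fairly complex, and doing the aggregation directly in Superset with SQL has two problems. First, the results are not accurate enough: the business side does a lot of special-case handling in Go, and reproducing that logic in SQL would require stored procedures. Second, execution takes too long: the business side already runs scheduled statistics jobs and keeps the results in Redis, so summarising the data there is basically memory-to-memory and takes very little time.

Goal:

Fetch the data with a shell command, then find a way to get it into a DataFrame (somewhat like how a "big screen" dashboard is implemented, although those usually fetch their data from a URL).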
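As a rough sketch of that goal, the snippet below runs a command and parses its standard output into rows. It assumes the command prints CSV with a header line, which is only one possible output format; `rows_from_command` is a hypothetical helper, not part of Superset. With pandas available, `pd.DataFrame(rows)` would turn the result into a DataFrame.

```python
import csv
import io
import subprocess
import sys
from typing import Dict, List


def rows_from_command(cmd: List[str], timeout: float = 30.0) -> List[Dict[str, str]]:
    """Run `cmd` and parse its stdout as CSV with a header row (an assumption)."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    reader = csv.DictReader(io.StringIO(completed.stdout))
    return [dict(row) for row in reader]


if __name__ == "__main__":
    # Stand-in for the real command: a child Python process that prints CSV.
    demo_cmd = [sys.executable, "-c", "print('day,orders'); print('2020-08-24,17')"]
    print(rows_from_command(demo_cmd))
```

Passing the command as a list avoids going through a shell, so no quoting of arguments is needed. All values come back as strings; type conversion is left to the caller.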

How the data is fetched:

superset/viz.py

class BaseViz:
    def get_df(self, query_obj: Optional[QueryObjectDict] = None) -> pd.DataFrame:
        ...

-- get_df appears to be the entry point for fetching data.

-- The signature uses type hints, which annotate the arguments and the return type of a function. The annotation syntax itself dates from Python 3.0 (PEP 3107); Python 3.5 added the typing module (PEP 484), which is where Optional comes from. See: https://stackoverflow.com/questions/5336320/how-to-know-function-return-type-and-argument-types
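A small illustration of this annotation style. The function is a toy stand-in with a signature shaped like get_df, not Superset code. It shows that the hints can be inspected but are not enforced when the function runs.

```python
from typing import Dict, Optional, get_type_hints


def get_df(query_obj: Optional[Dict[str, str]] = None) -> list:
    """Toy stand-in annotated in the same style as BaseViz.get_df."""
    return [] if query_obj is None else [query_obj]


hints = get_type_hints(get_df)
print(hints["return"])       # the annotated return type
print(get_df("not a dict"))  # still runs: annotations are not checked at run time
```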

if self.datasource.type == "table":
    granularity_col = self.datasource.get_column(query_obj["granularity"])
    if granularity_col:
        timestamp_format = granularity_col.python_date_format

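It looks like a new datasource.type could be added. While experimenting, the type can be changed by hand in the metadata database.

The slices table in the metadata database stores the slice information.

For more on datasources, see: https://zhuanlan.zhihu.com/p/70810306

The standard approach is to add a directory under superset/connectors/, say shellcmd, and create models.py and views.py in it, modelled on sqla and druid.

For a prototype, though, this framework can be bypassed by writing the code directly in BaseViz. As long as the data is queried automatically, everything else can be done by hand, and the shell commands can be inserted into a table manually (this probably needs a new table, similar to tables, called cmds).

First, verify that BaseViz::get_df is actually called when a slice is queried from the UI.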

def get_df(
    self, query_obj: Optional[Dict[str, Any]] = None
) -> Optional[pd.DataFrame]:
    """Returns a pandas dataframe based on the query object"""
    print("Test: BaseViz::get_df")

The backend log confirms that it is called (there is also a viz_sip38.py in the same directory).

Manually change the data source type:

sqlite> update slices set datasource_type = 'cmd' where id = 1;
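The same change can be scripted so that the previous value is easy to restore. This is a sketch against an in-memory SQLite database with a cut-down slices table. Only the three columns mentioned in this post are modelled, and `set_datasource_type` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real metadata database file
conn.execute(
    "CREATE TABLE slices (id INTEGER PRIMARY KEY, datasource_type TEXT, datasource_id INTEGER)"
)
conn.execute("INSERT INTO slices VALUES (1, 'table', 1)")


def set_datasource_type(conn, slice_id, new_type):
    """Switch one slice's datasource_type and return the previous value."""
    (old_type,) = conn.execute(
        "SELECT datasource_type FROM slices WHERE id = ?", (slice_id,)
    ).fetchone()
    conn.execute(
        "UPDATE slices SET datasource_type = ? WHERE id = ?", (new_type, slice_id)
    )
    conn.commit()
    return old_type


previous = set_datasource_type(conn, 1, "cmd")
print(previous)  # the value to put back once the experiment is over
```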

Querying the slice again raises an error:

  File "G:\superset\incubator-superset\superset\models\slice.py", line 125, in get_datasource
    return db.session.query(self.cls_model).filter_by(id=self.datasource_id).first()
  File "G:\superset\incubator-superset\superset\models\slice.py", line 103, in cls_model
    return ConnectorRegistry.sources[self.datasource_type]
KeyError: 'cmd'

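So a data source of type cmd is not recognised.

It has to be registered the same way as the table data source type. An earlier post, "Where is the datasource" (《datasource在哪里》), noted that registration happens in superset/__init__.py.

A grep shows the register_sources function being called in superset/__init__.py.

In 0.37, however, the registration code has moved to superset/app.py: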

class SupersetAppInitializer:
    ...
    def configure_data_sources(self) -> None:
        # Registering sources
        module_datasource_map = self.config["DEFAULT_MODULE_DS_MAP"]
        module_datasource_map.update(self.config["ADDITIONAL_MODULE_DS_MAP"])
        ConnectorRegistry.register_sources(module_datasource_map)

The configuration, however, still lives in superset/config.py:

# --------------------------------------------------
# Modules, datasources and middleware to be registered
# --------------------------------------------------
DEFAULT_MODULE_DS_MAP = OrderedDict(
    [
        ("superset.connectors.sqla.models", ["SqlaTable"]),
        ("superset.connectors.druid.models", ["DruidDatasource"]),
    ]
)

So, to register a new type, one has to write a models.py modelled on the sqla and druid ones.
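To make the registration step concrete, here is a toy registry that imports each configured module and stores the listed classes. It assumes classes are keyed by a class-level `type` attribute, which would explain the `KeyError: 'cmd'` above. `MiniRegistry` and the `demo_cmd_models` module are made up for the demonstration.

```python
import sys
import types
from collections import OrderedDict
from typing import Dict, List, Type


class MiniRegistry:
    """Toy connector registry keyed by each class's `type` attribute."""

    sources: Dict[str, Type] = {}

    @classmethod
    def register_sources(cls, datasource_config: "OrderedDict[str, List[str]]") -> None:
        for module_name, class_names in datasource_config.items():
            # Import the module, then look up every class named in the config.
            module_obj = __import__(module_name, fromlist=class_names)
            for class_name in class_names:
                source_class = getattr(module_obj, class_name)
                cls.sources[source_class.type] = source_class


# Build a throwaway module so the example runs without Superset installed.
class CmdTable:
    type = "cmd"


demo_module = types.ModuleType("demo_cmd_models")
demo_module.CmdTable = CmdTable
sys.modules["demo_cmd_models"] = demo_module

MiniRegistry.register_sources(OrderedDict([("demo_cmd_models", ["CmdTable"])]))
print(MiniRegistry.sources["cmd"])  # found after registration
try:
    MiniRegistry.sources["table"]   # never registered in this toy setup
except KeyError as ex:
    print("KeyError:", ex)
```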

Update 2020-08-24:

Try writing a models.py.

Update 2020-09-01:

Which interfaces in models.py does the Superset framework actually require?

That should be visible from superset/connectors/base/models.py.

base/models.py contains four classes:

class DatasourceKind

class BaseDatasource

class BaseColumn

class BaseMetric

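So presumably a new datasource needs to implement at least three classes: a Datasource, a Column and a Metric.

How TableColumn is put together:

It inherits from BaseColumn, so the fields defined in BaseColumn also appear in TableColumn.
It sets the table name to "table_columns"; this table presumably stores the column information of sqla-type tables.
Added fields: table_id, is_dttm, expression, python_date_format
Three functions that check the column type: is_numeric, is_string, is_temporal
get_sqla_col: handles user-defined labels, expressions and so on

Update 2020-09-02:

How SqlaTable is put together:

It has to specify which classes are used for metrics and columns: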

metric_class = SqlMetric
column_class = TableColumn

On top of BaseDatasource it adds a number of fields; among them, sql stores the SQL statement saved from SQL Lab.

Aggregation functions:

sqla_aggregations = {
    "COUNT_DISTINCT": lambda column_name: sa.func.COUNT(sa.distinct(column_name)),
    "COUNT": sa.func.COUNT,
    "SUM": sa.func.SUM,
    "AVG": sa.func.AVG,
    "MIN": sa.func.MIN,
    "MAX": sa.func.MAX,
}

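I tried adding a CmdTable class; PyCharm warns that abstract methods need to be implemented.

Looking at the code in base/models.py, these functions all contain raise NotImplementedError().

After adding a CmdColumn class first, PyCharm again warns that abstract methods need to be implemented.

After adding a CmdMetric class, it warns about abstract methods once more.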
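The pattern behind those warnings can be sketched as follows. A base class whose members only raise NotImplementedError does not stop a subclass from being defined or instantiated; the failure shows up when the missing member is used. The class names and the `missing_members` helper are hypothetical. The member names are taken from the CmdTable draft in this post.

```python
class BaseDatasourceSketch:
    """Toy base class: members the framework calls but does not implement."""

    type = None

    @property
    def name(self):
        raise NotImplementedError()

    def query(self, query_obj):
        raise NotImplementedError()

    def values_for_column(self, column_name, limit=10000):
        raise NotImplementedError()


class CmdTableSketch(BaseDatasourceSketch):
    type = "cmd"

    @property
    def name(self):
        return "Cmd.TestDatasource"

    def values_for_column(self, column_name, limit=10000):
        return []
    # query() is deliberately left out to show what happens at call time.


def missing_members(obj, names):
    """Return the names that still raise NotImplementedError when used."""
    missing = []
    for member in names:
        try:
            attr = getattr(obj, member)  # unimplemented properties raise here
            if callable(attr):
                attr("probe")            # unimplemented methods raise here
        except NotImplementedError:
            missing.append(member)
    return missing


print(missing_members(CmdTableSketch(), ["name", "query", "values_for_column"]))
```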

Update 2020-09-03:

Implement these functions one by one.

BaseMetric::perm

What this function does can be worked out from the sqla code.

@property
def perm(self) -> Optional[str]:
    return (
        ("{parent_name}.[{obj.metric_name}](id:{obj.id})").format(
            obj=self, parent_name=self.table.full_name
        )
        if self.table
        else None
    )

It appears to build the name of the object that a permission refers to.
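To see what that string looks like, the same format template can be applied to made-up values. The metric name, id and table name below are invented for the illustration.

```python
from types import SimpleNamespace

PERM_TEMPLATE = "{parent_name}.[{obj.metric_name}](id:{obj.id})"

metric = SimpleNamespace(metric_name="count", id=3)  # made-up metric
perm = PERM_TEMPLATE.format(obj=metric, parent_name="[examples].[orders]")  # made-up table name
print(perm)
```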

expression = Column(Text, nullable=False)

For expression, it is enough to name the corresponding field.

CmdColumn::python_date_format

See how python_date_format is defined on the sqla side (in TableColumn):

expression = Column(Text)
python_date_format = Column(String(255))

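Update 2020-09-07:

The values_for_column function:

"""Runs query against sqla to retrieve some sample values for the given column."""

Judging by the docstring, it fetches sample data.

The query function:

It calls the database's query function, puts the result in a DataFrame, and wraps that in a QueryResult for the caller. (It looks like the code that makes CmdTable fetch data from the command line should go in this function as well.)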
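A rough sketch of what a command-backed query step could look like: run the command, time it, and report failure through the result object instead of raising. `CmdQueryResult` is a hypothetical stand-in for Superset's QueryResult, and the raw output string takes the place of the DataFrame so the example has no third-party dependencies.

```python
import subprocess
import sys
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class CmdQueryResult(NamedTuple):
    """Hypothetical stand-in for Superset's QueryResult."""
    status: str
    output: str  # in Superset this slot would hold the DataFrame
    duration: timedelta
    query: str
    error_message: Optional[str]


def run_cmd_query(cmd: List[str], timeout: float = 30.0) -> CmdQueryResult:
    """Run `cmd`, time it, and turn failures into a 'failed' result."""
    started = datetime.now()
    status, output, error_message = "success", "", None
    try:
        completed = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True, timeout=timeout, check=True,
        )
        output = completed.stdout
    except (OSError, subprocess.SubprocessError) as ex:
        # Covers a missing executable, a non-zero exit code and a timeout.
        status, error_message = "failed", str(ex)
    return CmdQueryResult(
        status, output, datetime.now() - started, " ".join(cmd), error_message
    )


if __name__ == "__main__":
    print(run_cmd_query([sys.executable, "-c", "print('a,b')"]))
```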

Update 2020-09-10:

The external_metadata function appears to fetch the column data types.

def external_metadata(self) -> List[Dict[str, str]]:
    cols = self.database.get_columns(self.table_name, schema=self.schema)
    for col in cols:
        try:
            col["type"] = str(col["type"])
        except CompileError:
            col["type"] = "UNKNOWN"
    return cols
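For a command-backed source there is no database to ask for column types, so one option is to guess them from sample rows. The sketch below assumes the framework expects a list of dicts with "name" and "type" keys, as the `col["type"]` access above suggests. Both helper functions and the NUMBER/STRING labels are hypothetical.

```python
from typing import Dict, List


def guess_type(values: List[str]) -> str:
    """Very rough guess from string samples: NUMBER, STRING or UNKNOWN."""
    if not values:
        return "UNKNOWN"
    try:
        for value in values:
            float(value)
    except ValueError:
        return "STRING"
    return "NUMBER"


def metadata_from_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Build [{"name": ..., "type": ...}] from rows parsed out of command output."""
    if not rows:
        return []
    return [
        {"name": name, "type": guess_type([row[name] for row in rows])}
        for name in rows[0]
    ]


sample = [
    {"day": "2020-08-24", "orders": "17"},
    {"day": "2020-08-25", "orders": "21.5"},
]
print(metadata_from_rows(sample))
```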

The get_query_str function appears to build the SQL statement: it converts the query_obj sent by the front end into SQL.

def get_query_str(self, query_obj: QueryObjectDict) -> str:
    query_str_ext = self.get_query_str_extended(query_obj)
    all_queries = query_str_ext.prequeries + [query_str_ext.sql]
    return ";\n\n".join(all_queries) + ";"
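A minimal sketch of the join step on its own, assuming the separator is a semicolon followed by a blank line. `join_queries` and the two statements are made up for the illustration.

```python
from typing import List


def join_queries(prequeries: List[str], sql: str) -> str:
    """Join pre-queries and the main statement, each terminated by a semicolon."""
    all_queries = prequeries + [sql]
    return ";\n\n".join(all_queries) + ";"


print(join_queries(["SET x = 1"], "SELECT 1"))
```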

Update 2020-09-11:

Time to start writing the code for the Cmd data source.

Get it running first, then revise.

from sqlalchemy import (
    and_,
    asc,
    Boolean,
    Column,
    DateTime,
    desc,
    ForeignKey,
    Integer,
    or_,
    select,
    String,
    Table,
    Text,
)

class CmdColumn(Model, BaseColumn):
    __tablename__ = "cmd_columns"
    expression = Column(Text)
    python_date_format = Column(String(255))

Update 2020-09-14:

First fill in all the abstract functions that PyCharm flags; the details can be added later.

from typing import Optional
import pandas as pd
from datetime import datetime, timedelta
from sqlalchemy import (
    and_,
    asc,
    Boolean,
    Column,
    DateTime,
    desc,
    ForeignKey,
    Integer,
    or_,
    select,
    String,
    Table,
    Text,
)
from flask_appbuilder import Model
from superset.typing import Metric, QueryObjectDict
from superset.models.helpers import AuditMixinNullable, QueryResult
from superset.utils import core as utils, import_datasource
from typing import Any, Dict, Hashable, List, NamedTuple, Optional, Tuple, Union
from superset.connectors.base.models import BaseColumn, BaseDatasource, BaseMetric


class CmdColumn(Model, BaseColumn):
    __tablename__ = "cmd_columns"
    expression = Column(Text)
    python_date_format = Column(String(255))


class CmdMetric(Model, BaseMetric):
    @property
    def perm(self) -> Optional[str]:
        return ""

    expression = Column(Text, nullable=False)

    __tablename__ = "cmd_metrics"


class CmdTable(Model, BaseDatasource):
    type = 'cmd'

    metric_class = CmdMetric
    column_class = CmdColumn

    @property
    def name(self) -> str:
        return "{}.{}".format("Cmd", "TestDatasource")

    @property
    def datasource_name(self) -> str:
        return "TestDatasource"

    def external_metadata(self) -> List[Dict[str, str]]:
        """Return column information."""
        cols = [{"Column_1":"string"}, {"Column_2": "string"}]
        return cols

    def get_query_str(self, query_obj: QueryObjectDict) -> str:
        return ";\n\n;"

    def query(self, query_obj: QueryObjectDict) -> QueryResult:
        error_message = None
        df = pd.DataFrame()
        status = utils.QueryStatus.SUCCESS
        return QueryResult(
            status=status,
            df=df,
            duration=timedelta(0),
            query="",
            error_message=error_message,
        )

    def values_for_column(self, column_name: str, limit: int = 10000) -> List[Any]:
        return []

Update 2020-09-15:

Add the cmd entry to superset/config.py:

DEFAULT_MODULE_DS_MAP = OrderedDict(
    [
        ("superset.connectors.sqla.models", ["SqlaTable"]),
        ("superset.connectors.druid.models", ["DruidDatasource"]),
        ("superset.connectors.cmd.models", ["CmdTable"]),
    ]
)

On reload, the following error appears:

ERROR:superset.app:Failed to create app
Traceback (most recent call last):
  File "G:\superset\incubator-superset\superset\app.py", line 61, in create_app
    app_initializer.init_app()
  File "G:\superset\incubator-superset\superset\app.py", line 498, in init_app
    self.init_app_in_ctx()
  File "G:\superset\incubator-superset\superset\app.py", line 470, in init_app_in_ctx
    self.configure_data_sources()
  File "G:\superset\incubator-superset\superset\app.py", line 511, in configure_data_sources
    ConnectorRegistry.register_sources(module_datasource_map)
  File "G:\superset\incubator-superset\superset\connectors\connector_registry.py", line 41, in register_sources
    module_obj = __import__(module_name, fromlist=class_names)
  File "G:\superset\incubator-superset\superset\connectors\cmd\models.py", line 43, in <module>
    class CmdTable(Model, BaseDatasource):
  File "G:\superset\lib\site-packages\flask_sqlalchemy\model.py", line 67, in __init__
    super(NameMetaMixin, cls).__init__(name, bases, d)
  File "G:\superset\lib\site-packages\flask_sqlalchemy\model.py", line 121, in __init__
    super(BindMetaMixin, cls).__init__(name, bases, d)
  File "G:\superset\lib\site-packages\sqlalchemy\ext\declarative\api.py", line 75, in __init__
    _as_declarative(cls, classname, cls.__dict__)
  File "G:\superset\lib\site-packages\sqlalchemy\ext\declarative\base.py", line 131, in _as_declarative
    _MapperConfig.setup_mapping(cls, classname, dict_)
  File "G:\superset\lib\site-packages\sqlalchemy\ext\declarative\base.py", line 160, in setup_mapping
    cfg_cls(cls_, classname, dict_)
  File "G:\superset\lib\site-packages\sqlalchemy\ext\declarative\base.py", line 192, in __init__
    self._setup_inheritance()
  File "G:\superset\lib\site-packages\sqlalchemy\ext\declarative\base.py", line 589, in _setup_inheritance
    "table-mapped class." % cls
sqlalchemy.exc.InvalidRequestError: Class <class 'superset.connectors.cmd.models.CmdTable'> does not have a __table__ or __tablename__ specified and does not inherit from an existing table-mapped class.
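The message points at a missing `__tablename__` on CmdTable, unlike CmdColumn and CmdMetric, which both set one. The behaviour can be reproduced with plain SQLAlchemy, independent of Superset. In this sketch `DatasourceMixin` is a stand-in that assumes the real base class supplies columns (including a primary key) but no table name, and `cmd_tables` is a hypothetical table name.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.exc import InvalidRequestError

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class DatasourceMixin:
    """Stand-in for a base class that supplies columns but no table name."""
    id = Column(Integer, primary_key=True)
    description = Column(String(255))


def define_without_tablename():
    # Defining the class is what triggers the mapping, and hence the error.
    class BrokenCmdTable(Base, DatasourceMixin):
        type = "cmd"
    return BrokenCmdTable


class CmdTable(Base, DatasourceMixin):
    __tablename__ = "cmd_tables"  # hypothetical table name
    type = "cmd"


try:
    define_without_tablename()
    mapping_error = None
except InvalidRequestError as ex:
    mapping_error = str(ex)

print(mapping_error)
print(CmdTable.__table__.name)
```

If the same holds for the real CmdTable, adding a `__tablename__` should get past this error. The table itself would presumably still have to be created in the metadata database before the class can be queried.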

To be continued...
