Merkle Training 笔记

最新推荐文章于 2023-01-09 19:23:03 发布

indifferent_to

最新推荐文章于 2023-01-09 19:23:03 发布

阅读量535

点赞数

分类专栏： miscellaneous

本文链接：https://blog.csdn.net/sinat_25419171/article/details/50612723

版权

miscellaneous 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Database Marketing

know me -> engage me -> keep me -> empower me

identify object

test case on business logic

run in dev environment

check and record

SQL

NOT null for dimension table

SQL Sever

windows/column index/distribute replay

T-SQL:

sequence/page data/exception handle/execute/lag lead

sqlcmd -S <server> -U <user> -P <password> -Q "select * from table;"

DML/DDL/DCL(do..while)/TCL(transaction)

RANK(134)/DENSE_RANK(1234)/ROW_NUMBER(1234)/NTILE(1122345) OVER([PARTITION BY] ORDER BY)

SSIS（data flow etl / intergration )/SSRS(report)/SSAS(analyst)

Architecture/Designer/Runtime engine/Service/Task/Data FLow Engine/API

Control/Data/Parameter/EventHandler

OELDB/ADO

用？（0,1,2）代表parameter，变量不可以在sql中使用

Package》Task》Dataflow》R/T

Container：For...Loop/For each loop/sequence

@[User::variable_name=initial value

Tips:

少用dynamic sql

少用go

union all 更快

between 比 in 快

union可以替代or

少用select into

set no count on 不返回记录

不在where条件上加函数

不创造太复杂的视图

Shell

C-c stop job

C-z pause job

C-d end input

fg/bg

ln / ln -s

chmod owner+group+others

man/help

>/<

" "解析 ‘ ‘字符

sort

less/more

grep

sed

date +y%m%d%

$?/$#/$0..

sed -n '1p' filename

awk -F'\t' '{print $1}'

awk -F'\T' '{print NF}'

DBMS

RDBMS:

SMP(Symmetric Multi-Processing)

MPP(Massively Parallel Processing) DW

OLTP/OLAP

MySQL Oracle Netezza

batch(go/exec)

identity: server_name.db_name.scheme_name.obj_name

SET ANSI_NULLS(是否查询null)

LIKE ::

+ for SQL Server || for netezza

Control Flow

begin..end

if..else

while..

case

goto

return

break/continue

wait for

join

union/except/intersect

merge

-2^31~2^31-1 integer

-2^63~2^63-1 bigint

create procedure xxx @variable varchar(20) input/output

Index

clustered index(data page)

non-clustered index data page row id

corering index 复杂索引只有第一个索引可以seek， scan开销很大

include index更轻量

SVN(check in/check out/commit/merge)

branchs/tags/trunk

diffing/blame

relational database

query and analyst rather than transaction

historical data

not only one resource

why:seperate workload/Intergrated 特征：主题/集成/时变/稳定

OLTP OLAP

index few many

scheme 3NF 去范式化

join many some

fact-dimension

staging data contains: historical data(3NF)/summary data/nalytical area(multi-dimensional)

Dell SMB

dell sftp server-->sterling->etl(stage/cdi/ost/targe)->compaign/model/report

CDI(customer data intergartion)

CR(Connected Recognition):IntellDRESS+AI(b2b/internastional)

+KnowleageLInk(Merkle CDI process):reference base->dup->dedup->overlap(match之后对原有数据进行覆盖)->fisrt logic（对客户进行确认，例如死亡，犯罪等，标准化 names&address）->data fix->demon join(NCOA)反馈处理数据（添加编号，更新地址）->id changes(xref file/id match)->cluster rule(matching rules)

Post CDI(hygiene/match process):

client input->pre-KL(补全格式）-》hygiene->match(global reference base)->output->postKL

output files: base(可否投递）COA(change of address搬家） Drops(信息不全） IID/HID/AID

file check：file count/filename/exist?/code page/row count/column count/header match/do some cleaning

File Bridge

schedule:tidal(enterprise scheduler) watch jobs processing and report issues(job definition/dependece/Job event/Variable)

feed: freference base feeds/suppression only feeds/tag files/transactional metadata

STG/HIST/ERROR

DTLM(data transformation loading map)STT

BRD/TRD

APJ/EMEA/NA/LATAM

Dell PDW

microsoft MPP dw platform

Advance Process:pick best(person,site)/transaction link/segmetation/model scores/preference management(PMC)/id change

CDI->Order process->suppression->address,email,phone validation->derived attributes->aggregates->snapshots->file outpus

batch id/job id/step id/status/start dt/end dt/last task process/error/reject/delete/update cnt/file id/file group id

Netezza:MPP

TwinFn

Disk--FPGA|CPU|Memory->Linux Host->BI|loda|ETL

S-blades(每个都是独立的server）

SPV(snippet processing unit)

SPA(snippet processing arrary)下面有很多SPU+Disk

Distribution Key(最多四个cols) 数据均匀的分布在SPU上

多column不方便join

原则：unique/join key/int/bigint(字节短)/don't on small table

Zone Map(int/bigint/data..)用于data slice，没有index有primary key无contraints

生效于orderby和organize on

Data Skew Netezza不支持远程语句

netezza sql" odbc/ole db driver

Aginity(Netezza自己的客户端）

WinSQL（can use with any RDB)

nzsql(linux shell)

system table(_v_table/_v_relation_column/_v_procedure)

CREATE TABLE XX( ....) DISTRIBUTE ON cols;

EXTERNAL TABLE (Input/output data with host/client)

sys table host

user table(SPU)basic+temp

Hadoop

distributed software framework processing of large data/across many nodes

Fault tolerant/scalability/Flexibility

your app code +map/reduce on hdfs

ecosystem of main Java programs

reliable share storage(HDFS,default block 64MB,3 copy)

write once and read many times

read large file scream(not optional) and small IO

hadoop fs -ls

analysis system(Map/Reduce)

Job Tracker/Task Tracker/NameNode/DataNode

Map Reduce(java lib)

assign a map task that can process data in parallel

Project

Hive (online OLTP on-line transaction processing)

hive -f sql file

hive -e sql

Sqoop (transfer data between hadoop and RDB)

Oozie(like Tidal)

Pig

Hbase

NoSQL

非结构化/半结构化（机器数据stream）

web log 很难实现结构化

scable(不是线性）

分布式环境（冗余）

key-value

map-combine-reduce

CA Erwin

ODS

Informatica:

Normal Join

Master Outer(Left Join, Detail is left)

Detail Outer

Full Outer

indifferent_to

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Merkle Training 笔记

Database Marketingknow me -> engage me -> keep me -> empower meUTidentify objecttest case on business logicrun in dev environmentcheck and recordSQLNOT null for dimension table
复制链接

扫一扫