Database Marketing
know me -> engage me -> keep me -> empower me
UT
identify object
test case on business logic
run in dev environment
check and record
SQL
NOT null for dimension table
SQL Sever
windows/column index/distribute replay
T-SQL:
sequence/page data/exception handle/execute/lag lead
sqlcmd -S <server> -U <user> -P <password> -Q "select * from table;"
DML/DDL/DCL(do..while)/TCL(transaction)
RANK(134)/DENSE_RANK(1234)/ROW_NUMBER(1234)/NTILE(1122345) OVER([PARTITION BY] ORDER BY)
SSIS(data flow etl / intergration )/SSRS(report)/SSAS(analyst)
Architecture/Designer/Runtime engine/Service/Task/Data FLow Engine/API
Control/Data/Parameter/EventHandler
OELDB/ADO
用?(0,1,2)代表parameter, 变量不可以在sql中使用
Package》Task》Dataflow》R/T
Container:For...Loop/For each loop/sequence
@[User::variable_name=initial value
Tips:
少用dynamic sql
少用go
union all 更快
between 比 in 快
union可以替代or
少用select into
set no count on 不返回记录
不在where条件上加函数
不创造太复杂的视图
Shell
C-c stop job
C-z pause job
C-d end input
fg/bg
ln / ln -s
chmod owner+group+others
man/help
>/<
" "解析 ‘ ‘字符
sort
less/more
grep
sed
date +y%m%d%
$?/$#/$0..
sed -n '1p' filename
awk -F'\t' '{print $1}'
awk -F'\T' '{print NF}'
DBMS
RDBMS:
SMP(Symmetric Multi-Processing)
MPP(Massively Parallel Processing) DW
OLTP/OLAP
MySQL Oracle Netezza
batch(go/exec)
identity: server_name.db_name.scheme_name.obj_name
SET ANSI_NULLS(是否查询null)
LIKE ::
+ for SQL Server || for netezza
Control Flow
begin..end
if..else
while..
case
goto
return
break/continue
wait for
join
union/except/intersect
merge
-2^31~2^31-1 integer
-2^63~2^63-1 bigint
create procedure xxx @variable varchar(20) input/output
Index
clustered index(data page)
non-clustered index data page row id
corering index 复杂索引只有第一个索引可以seek, scan开销很大
include index更轻量
SVN(check in/check out/commit/merge)
branchs/tags/trunk
diffing/blame
DW
relational database
query and analyst rather than transaction
historical data
not only one resource
why:seperate workload/Intergrated 特征:主题/集成/时变/稳定
OLTP OLAP
index few many
scheme 3NF 去范式化
join many some
fact-dimension
staging data contains: historical data(3NF)/summary data/nalytical area(multi-dimensional)
Dell SMB
dell sftp server-->sterling->etl(stage/cdi/ost/targe)->compaign/model/report
CDI(customer data intergartion)
CR(Connected Recognition):IntellDRESS+AI(b2b/internastional)
+KnowleageLInk(Merkle CDI process):reference base->dup->dedup->overlap(match之后对原有数据进行覆盖)->fisrt logic(对客户进行确认,例如死亡,犯罪等,标准化 names&address)->data fix->demon join(NCOA)反馈处理数据(添加编号,更新地址)->id changes(xref file/id match)->cluster rule(matching rules)
Post CDI(hygiene/match process):
client input->pre-KL(补全格式)-》hygiene->match(global reference base)->output->postKL
output files: base(可否投递)COA(change of address搬家) Drops(信息不全) IID/HID/AID
file check:file count/filename/exist?/code page/row count/column count/header match/do some cleaning
File Bridge
schedule:tidal(enterprise scheduler) watch jobs processing and report issues(job definition/dependece/Job event/Variable)
feed: freference base feeds/suppression only feeds/tag files/transactional metadata
STG/HIST/ERROR
DTLM(data transformation loading map)STT
BRD/TRD
APJ/EMEA/NA/LATAM
Dell PDW
microsoft MPP dw platform
Advance Process:pick best(person,site)/transaction link/segmetation/model scores/preference management(PMC)/id change
CDI->Order process->suppression->address,email,phone validation->derived attributes->aggregates->snapshots->file outpus
batch id/job id/step id/status/start dt/end dt/last task process/error/reject/delete/update cnt/file id/file group id
Netezza:MPP
TwinFn
Disk--FPGA|CPU|Memory->Linux Host->BI|loda|ETL
S-blades(每个都是独立的server)
SPV(snippet processing unit)
SPA(snippet processing arrary)下面有很多SPU+Disk
Distribution Key(最多四个cols) 数据均匀的分布在SPU上
多column不方便join
原则:unique/join key/int/bigint(字节短)/don't on small table
Zone Map(int/bigint/data..)用于data slice, 没有index有primary key无contraints
生效于orderby和organize on
Data Skew Netezza不支持远程语句
netezza sql" odbc/ole db driver
Aginity(Netezza自己的客户端)
WinSQL(can use with any RDB)
nzsql(linux shell)
system table(_v_table/_v_relation_column/_v_procedure)
CREATE TABLE XX( ....) DISTRIBUTE ON cols;
EXTERNAL TABLE (Input/output data with host/client)
sys table host
user table(SPU)basic+temp
Hadoop
distributed software framework processing of large data/across many nodes
Fault tolerant/scalability/Flexibility
your app code +map/reduce on hdfs
ecosystem of main Java programs
reliable share storage(HDFS,default block 64MB,3 copy)
write once and read many times
read large file scream(not optional) and small IO
hadoop fs -ls
analysis system(Map/Reduce)
Job Tracker/Task Tracker/NameNode/DataNode
Map Reduce(java lib)
assign a map task that can process data in parallel
Project
Hive (online OLTP on-line transaction processing)
hive -f sql file
hive -e sql
Sqoop (transfer data between hadoop and RDB)
Oozie(like Tidal)
Pig
Hbase
NoSQL
非结构化/半结构化(机器数据stream)
web log 很难实现结构化
scable(不是线性)
分布式环境(冗余)
key-value
map-combine-reduce
CA Erwin
ODS
Informatica:
Normal Join
Master Outer(Left Join, Detail is left)
Detail Outer
Full Outer