SQL Notes


This article is long; use Ctrl+F to jump straight to the content you need.

1. Unexpected blank rows in a select result, e.g. (A union all B) left join C: A and B may not need the union at all; dropping one of them can fix it.

2. Duplicate rows in a select result, e.g. A left join (select * from B left join max()… from B) t: B may be sharded, so max() produced several rows. Replace the aggregate inside the parentheses with a window function, row_number() over(partition by Fbond_id ORDER BY Findex desc, Fmodify_time desc) AS row_num, and keep row_num = 1.

How the window function works:

row_number() over(partition by Fuid ORDER BY substr(Fcreate_time,1,10) asc) AS row_num    
 Explanation: partition the rows by Fuid, then sort each partition by create_time ascending, giving data 'segmented' by Fuid.
 This exposes the earliest Fcreate_time per Fuid, ready for use.

Use a window function to find the maximum num in a table, together with the full row it belongs to:

--find the maximum num and all columns of its row
select
	Fuid
	,fsku_id
	,fspu_name
	,num
from
	(
	select 
		Fuid
		,fsku_id
		,fspu_name
		,num
		,row_number() over(ORDER BY num desc) AS row_num  
	from
		dp_fk_tmp.leka_biz_db_t_user_coupon_tmp_20200918
	)p
	where row_num = 1
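A runnable sketch of the same pattern, using Python's bundled sqlite3 (the table name and data below are made up for the demo; requires SQLite ≥ 3.25 for window functions):

```python
# A hedged sketch: re-creating the "take the row with the max num" pattern
# in SQLite (needs SQLite >= 3.25 for window functions). Table name and
# data are made up for the demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE coupon_tmp(Fuid INT, num INT);
INSERT INTO coupon_tmp VALUES (1, 10), (2, 30), (3, 20);
""")
row = conn.execute("""
    SELECT Fuid, num FROM (
        SELECT Fuid, num,
               row_number() OVER (ORDER BY num DESC) AS row_num
        FROM coupon_tmp
    ) p WHERE row_num = 1
""").fetchone()
print(row)  # (2, 30): the whole row that holds the largest num
```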

Custom ORDER BY used as a tiered filter:

select
	fuid,
	fvip_type
from
	(
	select 
		Fuid
		,fvip_type
		,row_number() over(partition by Fuid ORDER BY 
		(case when fvip_type = 'PLUS_VIP' and datediff(fexpire_time,now()) > 0 then 1  
			  when fvip_type = 'CLASSIC_VIP' and datediff(fexpire_time,now()) > 0 then 2 
			  else 3 end 
		)) AS row_num  
	from
		dp_snap.card_base_db_t_card_profile
	)p
	where row_num = 1

3. For a long SQL with no detailed error message, e.g. A left join B left join C: split it up and run the A, B, C sub-queries separately to locate the error.

4. For a long SQL that returns no rows, e.g. A left join B left join C: split it up and run the A, B, C sub-queries separately to find which sub-query is empty.

5. A full-snapshot table should not be partitioned. If it was partitioned anyway, always filter on the partition when reading it; if the partition is, say, monthly and you omit the f_p_date filter, you get massive duplication, because the 2019-02-28 snapshot contains 2019-01-31, 2018-12-31, and every earlier snapshot pulled in all over again.

6. A union all B requires that A and B project the same number of columns, with matching data types and column order. The real meaning of union is: the columns stay the same, the rows accumulate.
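A quick illustration of the row-stacking semantics with Python's sqlite3 (tables and rows are made up for the demo):

```python
# Made-up tables demonstrating union all semantics: columns stay fixed,
# rows accumulate; plain union additionally deduplicates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a(id INT, name TEXT);
CREATE TABLE b(id INT, name TEXT);
INSERT INTO a VALUES (1, 'x');
INSERT INTO b VALUES (2, 'y'), (1, 'x');
""")
rows = conn.execute(
    "SELECT id, name FROM a UNION ALL SELECT id, name FROM b").fetchall()
dedup = conn.execute(
    "SELECT id, name FROM a UNION SELECT id, name FROM b").fetchall()
print(len(rows))   # 3: union all just stacks the rows
print(len(dedup))  # 2: union also removes the duplicate (1, 'x')
```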

7. The real meaning of join: rows and columns both grow according to the join condition (fundamentally, a Cartesian product).

8. When you wrap a union all result as table p and then run select xxx, zzz, yyy from p with sum() or another aggregate, every other selected column must also reduce to a single row, because the output row count follows the aggregate. If one column violates this, the "1 : many" rows cannot be matched up and the query should error.

9. For a slow SQL full of nested left join statements, when spot-testing it is best to add a fixed value such as AND Fuid = 12345 after the ON condition. Sampling this way greatly shortens test runs. (Also split the query and check the parts.)

10. A join produces null columns: for select a.zzz, b.xxx, c.yyy from A left join B left join C left join D, first find which table the null column comes from. Say B's column is null: inspect B's sub-query; if B is a plain table, check the join key and the data source; if B is your own sub-query logic, check that logic first.

11. DESC = descending / ASC = ascending.

12. When two columns are not clearly distinguished in the schema, e.g. ftotal_interest / fpreety_total_interest, run select * from … on the table and look. Here the two differ only in unit: one stores the value in the larger unit, the other in the smaller unit, but they represent the same value.

13. When you have thought hard, found no logic error, and splitting the query turned up nothing either, suspect the data source. Sample the source tables.

14. count() and sum() cannot be nested directly; compute count() first in a sub-query, then wrap a sum() around it.
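The count-then-sum wrapping can be sketched like this with Python's sqlite3 (table and data are hypothetical):

```python
# Made-up data: per-user counts in a subquery, then sum() outside.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders(Fuid INT);
INSERT INTO orders VALUES (1),(1),(2),(3),(3),(3);
""")
# SUM(COUNT(*)) in one level is a misuse of aggregates; nesting works:
total = conn.execute("""
    SELECT SUM(cnt) FROM (
        SELECT Fuid, COUNT(*) AS cnt FROM orders GROUP BY Fuid
    )
""").fetchone()[0]
print(total)  # 6: the per-user counts 2 + 1 + 3
```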

15. For testing: write the output of differently-structured scripts into a base table and compare them there.

16. CASE WHEN usage example:

    select a.zzz as zzz,
           a.yyy as yyy,
           case
               when substr(a.ffq_order_id, 1, 1) = 'O' then a.ffq_order_id
               when substr(a.ffq_order_id, 1, 1) = 'X' then substr(substr(a.ffq_order_id, 7), 1,
                                                                   length(substr(a.ffq_order_id, 7)) - 4)
               when substr(a.ffq_order_id, 6, 1) = 'O' then substr(a.ffq_order_id, 6)
               when substr(a.ffq_order_id, 1, 1) = 'A' then b.forder_id
               else a.ffq_order_id end as forder_id
    from XXX_table

CASE WHEN example 2:

    select 
    	sum(case when t.num >= 1  and t.num < 5  then 1 else 0 end) as num_1_5
    	,sum(case when t.num >= 5  and t.num < 10 then 1 else 0 end) as num_5_10
    	,sum(case when t.num >= 10 and t.num < 15 then 1 else 0 end) as num_10_15
    	,sum(case when t.num >= 15 and t.num < 20 then 1 else 0 end) as num_15_20
    	,sum(case when t.num >= 20 then 1 else 0 end) as num_20_plus
    	
    from
    (
    	xxx
    )t

CASE WHEN example 3:

case when fsource is null or fsource='' then 'lk' else fsource end as fsource
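The same null/empty-string normalization, checked with Python's sqlite3 (values are made up):

```python
# Made-up fsource values: NULL and '' both normalize to 'lk'.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(fsource TEXT);
INSERT INTO t VALUES ('fql'), (''), (NULL);
""")
normalized = [r[0] for r in conn.execute("""
    SELECT CASE WHEN fsource IS NULL OR fsource = '' THEN 'lk'
                ELSE fsource END
    FROM t
""")]
print(normalized)  # ['fql', 'lk', 'lk']
```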

17. The where condition can go after A left join B left join C left join D, e.g. where b.status <> 80. This seems to save some resource overhead.

18. A left join query is really: match two tables on a condition to form one combined table, then let where filter out the rows you need.

19. group by usage example (whatever you group by, you can only aggregate the other columns):

    SELECT o.Flicai_account_date
    	,SUM(o.Flicai_repay_capital+o.Finvest_interest+o.Foverdue_interest)/100
    FROM finance_db.t_finance_repay_order o 
    	LEFT JOIN finance_db.t_finance_loan l ON o.Forder_id = l.Ffq_order_id 
    WHERE l.Faccount_date >= '2018-04-24' 
	    AND o.Fstatus = 11 
	    AND o.Fguarantee_flag = 1
	    AND o.Flicai_account_date >= '2018-10-01'
	    AND o.Flicai_account_date <= '2018-12-31'
    group by o.Flicai_account_date 

20. group by is normally used with aggregate functions such as count() or sum(). Two rules: ① every column after select must either sit inside an aggregate or appear in the group by; ② to filter the results, use where + group by, or group by + having.

21. sum() cannot be applied directly to a column named via as "alias". Two options: ① wrap another layer and select sum(a+b+c+d) outside; ② skip the extra layer and move the alias's underlying expression into the sum, e.g. sum((ab)+(cd)+…).

22. Fuid in (123,234,345) silently drops the ids that have no rows. When every id must appear in the output: ① build a table from the ids in the parentheses and use it as the driving table to inner join the other table, guaranteeing each Fuid exists; ② use a function to turn the in-list into a table.

23. Batch-inserting rows in a single SQL statement:

    insert into table xxx
	    select 412 as id union all
	    select 1632 as id union all
	    select 2310 as id union all
	    select 6997 as id union all
	    select 54409 as id

24. select * from main_table t1 left join sub_table t2 on t1.xxx = t2.xxx and cond2 and cond3 where xxx: this form keeps every row of the main table visible, because the extra conditions are applied as part of the on matching. Putting them in where does not work, because where filters after the join and would leave t1 incomplete.
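The ON-versus-WHERE difference is easy to see with Python's sqlite3 (tables, columns, and the status value are made up):

```python
# Made-up tables: the same filter in ON vs WHERE of a left join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1(id INT);
CREATE TABLE t2(id INT, status INT);
INSERT INTO t1 VALUES (1), (2);
INSERT INTO t2 VALUES (1, 80);
""")
# Condition in ON: applied while matching, so every t1 row survives.
on_rows = conn.execute("""
    SELECT t1.id, t2.status FROM t1
    LEFT JOIN t2 ON t1.id = t2.id AND t2.status = 80
    ORDER BY t1.id
""").fetchall()
# Condition in WHERE: applied after the join, dropping NULL-extended rows.
where_rows = conn.execute("""
    SELECT t1.id, t2.status FROM t1
    LEFT JOIN t2 ON t1.id = t2.id
    WHERE t2.status = 80
    ORDER BY t1.id
""").fetchall()
print(on_rows)     # [(1, 80), (2, None)]
print(where_rows)  # [(1, 80)]
```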

25. COALESCE(xxx, 0): returns xxx if it is not null; if it is null, returns 0.

26. Hive CTAS does not allow unnamed expression columns; every generated value must get an explicit name, like as xxx on the last line below:

    create table if not exists XXX stored as orc as
    select 
    sum(a.Famount) as Famount_sum
    ,sum(a.Fcapital) as Fcapital_sum
    ,substr(a.newFbond_id,3,8) as xxx

27. Adding several columns. SQL: ALTER TABLE tablename ADD COLUMN col_name data_type, ADD COLUMN col_name data_type;
HQL: alter table tablename add columns (col_name double, col_name double)

28. select t1.* from (select explode(array(1,2,3))) t1 turns an array into table rows.

29. Usage of if():

    select count(t.Fuid) as numbers,
           if(t.Famount < 1000000, '小于1000000', if(t.Famount >= 1000000 and t.Famount < 5000000, '1000000-5000000',
                                         if(t.Famount >= 5000000 and t.Famount < 10000000, '5000000-10000000', '大于10000000'))) as g
    from (
          select s.Fuid as Fuid, sum(s.fbalance + s.fbuy_capital - s.fsell_capital - s.frepay_capital) as Famount
    
          from xx_table s
    
          where s.ftimeframe = '2019-04-02'
            and s.Fplan_status = 110
          group by s.Fuid) t
    group by g

30. Fuzzy matching with wildcards: 4403% matches values starting with 4403 (prefix match); %4403 matches values ending with 4403 (suffix match); %4403% matches values containing 4403 anywhere; _ matches exactly one arbitrary character.

select count(1) from jz_pure_snap.finance_db_t_finance_user  where Fcredit_id like '4403%'
Negated match:
select count(1) from jz_pure_snap.finance_db_t_finance_user  where Fcredit_id not like '4403%'
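The wildcard semantics can be verified with Python's sqlite3, whose LIKE patterns behave the same way for these cases (the ID values are made up):

```python
# Hypothetical ID-card prefixes; LIKE pattern semantics match MySQL/Hive here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE u(Fcredit_id TEXT);
INSERT INTO u VALUES ('440301'), ('110105'), ('514403');
""")

def like(pattern):
    rows = conn.execute(
        "SELECT Fcredit_id FROM u WHERE Fcredit_id LIKE ? ORDER BY Fcredit_id",
        (pattern,))
    return [r[0] for r in rows]

prefix   = like('4403%')    # starts with 4403
suffix   = like('%4403')    # ends with 4403
contains = like('%4403%')   # contains 4403
one_char = like('4403_1')   # _ matches exactly one character
print(prefix, suffix, contains, one_char)
```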

31. inner join versus exists / not exists.
The inner join version:

    SELECT COUNT(DISTINCT t.Fuid),NOW() FROM (
    	SELECT d0.Fuid,MIN(d0.Faccount_date) Fmin_date,MAX(d0.Faccount_date) Fmax_date
    	FROM jz_snap.finance_db_t_finance_detail d0 WHERE d0.Fcreate_time < '2018-01-01'  AND d0.Fdetail_type IN (80,100,161,162)
    	GROUP BY d0.Fuid
    	HAVING  COUNT(d0.Fdetail_id) >= 2
    )t
    inner join jz_snap.finance_db_t_finance_detail d ON t.Fuid = d.Fuid
    and d.Faccount_date >= '2017-01-01' AND d.Faccount_date < '2018-01-01' AND d.Fdetail_type IN (80,100,161,162)

The exists version:

    SELECT COUNT(DISTINCT t.Fuid),NOW() FROM (
    	SELECT d0.Fuid,MIN(d0.Faccount_date) Fmin_date,MAX(d0.Faccount_date) Fmax_date
    	FROM jz_snap.finance_db_t_finance_detail d0 WHERE d0.Fcreate_time < '2019-01-01'  AND d0.Fdetail_type IN (80,100,161,162)
    	GROUP BY d0.Fuid
    	HAVING  COUNT(d0.Fdetail_id) >= 2
    )t
    where 
    1=1
    and EXISTS (
    SELECT 1 FROM jz_snap.finance_db_t_finance_detail d where t.Fuid = d.Fuid
    and d.Faccount_date >= '2018-01-01' AND d.Faccount_date < '2019-01-01' AND d.Fdetail_type IN (80,100,161,162)
    )

The two scripts above are equivalent (apart from the different date constants each one uses).
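The equivalence, and why EXISTS sidesteps row fan-out, can be sketched with Python's sqlite3 (tables and ids are made up):

```python
# Made-up tables: uid 1 matches twice in d, so a bare join would fan out
# to two rows; COUNT(DISTINCT ...) hides that, and EXISTS avoids it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(Fuid INT);
CREATE TABLE d(Fuid INT);
INSERT INTO t VALUES (1), (2);
INSERT INTO d VALUES (1), (1);
""")
joined = conn.execute("""
    SELECT COUNT(DISTINCT t.Fuid) FROM t
    INNER JOIN d ON t.Fuid = d.Fuid
""").fetchone()[0]
existed = conn.execute("""
    SELECT COUNT(DISTINCT t.Fuid) FROM t
    WHERE EXISTS (SELECT 1 FROM d WHERE d.Fuid = t.Fuid)
""").fetchone()[0]
print(joined, existed)  # 1 1: same answer, but EXISTS never fans rows out
```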

32. date_add('2019-04-19', -7) evaluates to 2019-04-12.

33. MySQL: add a column after a specific column: alter table tablename add col_name int(20) default '0' after id;
-- adds the column after the id column
Example:

ALTER TABLE `finance_db.t_informatio_disclosure_summary` 
     ADD COLUMN `Flender_month` BIGINT NOT NULL DEFAULT '0' COMMENT '当月借款人' after fqlzb_zhzb_in_cnt;

34. Update.
Syntax: UPDATE table_name SET field1 = new_value1, field2 = new_value2 [WHERE clause]

35. Count users per province from the first two digits of the ID-card number fcredit_id:

  select
    	sum(case when SUBSTRING(fcredit_id,1,2)='11'  then 1 else 0 end) as beijing,
    	sum(case when SUBSTRING(fcredit_id,1,2)='12'  then 1 else 0 end) as tainjing,
    	sum(case when SUBSTRING(fcredit_id,1,2)='13'  then 1 else 0 end) as heibei,
    	sum(case when SUBSTRING(fcredit_id,1,2)='14'  then 1 else 0 end) as shanxi,
    	sum(case when SUBSTRING(fcredit_id,1,2)='15'  then 1 else 0 end) as neimenggu,
    	sum(case when SUBSTRING(fcredit_id,1,2)='21'  then 1 else 0 end) as liaoning,
    	sum(case when SUBSTRING(fcredit_id,1,2)='22'  then 1 else 0 end) as jiling,
    	sum(case when SUBSTRING(fcredit_id,1,2)='23'  then 1 else 0 end) as heilongjiang,
    	sum(case when SUBSTRING(fcredit_id,1,2)='31'  then 1 else 0 end) as shanghai,
    	sum(case when SUBSTRING(fcredit_id,1,2)='32'  then 1 else 0 end) as jiangsu,
    	sum(case when SUBSTRING(fcredit_id,1,2)='33'  then 1 else 0 end) as zhejiang,
    	sum(case when SUBSTRING(fcredit_id,1,2)='34'  then 1 else 0 end) as anhui,
    	sum(case when SUBSTRING(fcredit_id,1,2)='35'  then 1 else 0 end) as fujian,
    	sum(case when SUBSTRING(fcredit_id,1,2)='36'  then 1 else 0 end) as jiangxi,
    	sum(case when SUBSTRING(fcredit_id,1,2)='37'  then 1 else 0 end) as shangdong,
    	sum(case when SUBSTRING(fcredit_id,1,2)='41'  then 1 else 0 end) as henan,
    	sum(case when SUBSTRING(fcredit_id,1,2)='42'  then 1 else 0 end) as hubei,
    	sum(case when SUBSTRING(fcredit_id,1,2)='43'  then 1 else 0 end) as hunan,
    	sum(case when SUBSTRING(fcredit_id,1,2)='44'  then 1 else 0 end) as guangdong,
    	sum(case when SUBSTRING(fcredit_id,1,2)='45'  then 1 else 0 end) as guangxi,
    	sum(case when SUBSTRING(fcredit_id,1,2)='46'  then 1 else 0 end) as hainan,
    	sum(case when SUBSTRING(fcredit_id,1,2)='50'  then 1 else 0 end) as chongqin,
    	sum(case when SUBSTRING(fcredit_id,1,2)='51'  then 1 else 0 end) as sichuan,
    	sum(case when SUBSTRING(fcredit_id,1,2)='52'  then 1 else 0 end) as guizhou,
    	sum(case when SUBSTRING(fcredit_id,1,2)='53'  then 1 else 0 end) as yunnan,
    	sum(case when SUBSTRING(fcredit_id,1,2)='54'  then 1 else 0 end) as xizang,
    	sum(case when SUBSTRING(fcredit_id,1,2)='61'  then 1 else 0 end) as shanxi,
    	sum(case when SUBSTRING(fcredit_id,1,2)='62'  then 1 else 0 end) as gansu,
    	sum(case when SUBSTRING(fcredit_id,1,2)='63'  then 1 else 0 end) as qinghai,
    	sum(case when SUBSTRING(fcredit_id,1,2)='64'  then 1 else 0 end) as ningxia,
    	sum(case when SUBSTRING(fcredit_id,1,2)='65'  then 1 else 0 end) as xinjiang,
    	sum(case when SUBSTRING(fcredit_id,1,2)='71'  then 1 else 0 end) as taiwan,
    	sum(case when SUBSTRING(fcredit_id,1,2)='81'  then 1 else 0 end) as xianggang,
    	sum(case when SUBSTRING(fcredit_id,1,2)='91'  then 1 else 0 end) as aomen
    from jz_tmp.borrower_fcredit_20190426

36. Much like a loop in Java, sum() iterates over the rows by itself; you only need a case when xxx then xxx1 when yyy then yyy2 … end inside it.
Example:

--#8 (aggregate over statuses 110 and 400)
  select 
 sum(datediff(d.fmodify_time,p.fcreate_time) * d.famount)  /
  sum(case when p.fplan_status = 110 then d.famount when p.fplan_status = 400 then p.famount end)
  from 
 xxx_table p
  left join 
 xxxx_table d
 on p.fplan_id = d.fplan_id
  where p.fplan_status in (110,400)  --the two statuses
  and p.fproduct_type = 1
  and p.ftimeframe = '2019-05-13' --observation date

37. datediff() usage:

--returns -31: Hive datediff(a, b) computes a minus b
select datediff('2016-07-30 08:28:59.0','2016-08-30 08:10:06.0')

38. Show a table's create statement:

show create table xxx

39. Show a table's columns:

desc xxx

40. substr(F_p_date,1,10) takes 10 characters starting at position 1. If the original value is 2019-05-22 13:13:13,
the result is 2019-05-22.

41. Make a web page copyable: enter javascript:void($={}); in the browser address bar.

42. show create table xxx; shows the Hive create-table statement.
desc xxx; shows the table's column structure.

43. Notepad++ replace regexes: ① $ matches end of line; ② ^ matches start of line; ③ --.*$ replaces a -- comment and everything after it with the empty string.
Example: strip the MySQL -- comment marker and everything after it.

44. Creating a bucketed table:

create table  students_tmp (id int, name string)  clustered by (id) into 2 buckets stored as orc

45. Understanding buckets: a bucket corresponds to a MapReduce partition. Physically each bucket is one file in the table's directory, and the number of buckets (output files) a job produces equals the number of reduce tasks.
A partitioned table is a different idea: a partition is a data warehouse directory, i.e. a folder. Each folder can hold different data files, and the folder lets you locate the files inside it, but the folder itself says nothing about the data's content.
Bucketing, by contrast, hashes rows on some column's value, scattering one large file into many small ones. (Paraphrased from the web.)

46. SELECT (pmod(datediff('2019-07-14', '2014-01-06'), 7) + 1); yields 1-7 for Monday through Sunday, i.e. which weekday p0 falls on.

47. When a query fails with an unhelpful error, delete or comment out the suspect part to rule it out and let the real error surface.

48. SUM(xxx) over (partition by yyy order by zzz) as vvv
computes sum() partitioned by yyy and ordered by zzz.

49. select count(1) from xxx t1 join zzz t2 on t1.findex = t2.findex where t1.floan_id <> t2.floan_id
means: join the two tables and keep the rows where the two floan_id values differ.

50. datediff(a.Fvip_end_time, '2019-08-25') > 0: datediff computes the first argument minus the second, so this is true when Fvip_end_time is after 2019-08-25.
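SQLite has no datediff(), but the same first-argument-minus-second semantics can be reproduced with julianday() in a short Python sketch:

```python
# SQLite has no datediff(); julianday() subtraction reproduces the
# first-argument-minus-second semantics of MySQL/Hive datediff.
import sqlite3

conn = sqlite3.connect(":memory:")
diff = conn.execute(
    "SELECT CAST(julianday('2019-08-30') - julianday('2019-08-25') AS INT)"
).fetchone()[0]
print(diff)  # 5: positive, because the first date is the later one
```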

51. With several left joins, when selecting Fuid, Date, sum(Fmount + Fasset), you must add group by Fuid, Date at the end, or the query errors.

52. In A left join B, the left table drives: it supplies the base set of rows to display, and the right table merely matches against it. For a requirement like "forecast for the next seven days", make the monthly payout-date table the secondary table B and the full calendar table the main table A, letting B match onto A. That avoids the awkward situation where B simply has no payout rows for the 30th or 31st.

53. Dimension design: for historical forecasts, say, build the table with these columns:
(Fdate report date) (Ffocus_date forecast focus date) (Fmonth_day product payout day) (amount).
Structured like this, pulling reports later is convenient and unambiguous.

54. A left join B: if rows fail to match, disable the where conditions on the secondary table B. If they still do not match, the secondary table's data is most likely at fault.

55. A join B takes the "intersection" of the two tables; for a column both sides share, selecting A.Fuid or B.Fuid is equally fine.

56.
Some thoughts on wide tables:
1. The business matters most: understand the flow of funds (information flow) from table A to table B: how it is recorded, in which column, under what status; whether steps run in parallel or in sequence; and which table everything finally aggregates into.
2. Wide tables split into detail tables (process tables, recording every line item)
and final tables (end-state tables, recording only the terminal state).
Aggregating the detail table's process rows yields the final table.

57. Low-level constructor error, =SQL=
struct<balance date: string, sum(Balance): double, sum(foff_balance): double, sum(fon_balance): double>

Cause: the underlying constructor cannot handle special characters such as parentheses in column names.
Solution: give sum(balance) a proper column name, e.g. sum(balance) as balance_sum

58. IFNULL(A, B)
returns A if A is not NULL, otherwise B.

59. The difference between the empty string '' and null when querying:
(1) '': an ordinary value; comparison operators such as >, <, = can filter it.
(2) null: can only be queried with is null or is not null.
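The difference shows up directly in a Python sqlite3 session (values are made up):

```python
# Made-up values: one 'a', one empty string, one NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(v TEXT);
INSERT INTO t VALUES ('a'), (''), (NULL);
""")
# '' is an ordinary value: comparison operators see it.
eq_empty = conn.execute("SELECT COUNT(*) FROM t WHERE v = ''").fetchone()[0]
# NULL never satisfies = or <>; only IS NULL finds it.
eq_null  = conn.execute("SELECT COUNT(*) FROM t WHERE v = NULL").fetchone()[0]
is_null  = conn.execute("SELECT COUNT(*) FROM t WHERE v IS NULL").fetchone()[0]
print(eq_empty, eq_null, is_null)  # 1 0 1
```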

60.
Finding rows in one table with no match in another, via left join.
Idea: take the big table's uid, left join the small table on uid, and keep the rows where the small table's uid is null.

select 
    r.flender_uid 
from big_table r
	left join small_table h on r.uid = h.uid 
	where h.uid is null

61. Substring function: substr(f_p_date,0,10) --take the first ten characters (substr positions are 1-based; Hive treats 0 like 1 here)

62. ERROR 1062 (23000) at line 1: Duplicate entry '2019-12-31 00:00:00-0' for key 'uniq_f_p_date'
The error is literally a "Duplicate entry": a unique key must stay unique.
With UNIQUE KEY uniq_f_p_date (f_p_date, Fdelete_status) USING BTREE, the combination f_p_date + Fdelete_status must be unique, otherwise this error is raised.

63. HQL: show the create statement: show create table tablename;
HQL: show the table structure: desc tablename;

64. (Item 64 is reposted from https://zhidao.baidu.com/question/628064534433470564.html)
inner join: an "effective join"; only rows present in both tables appear.
left join: "show what the left has"; with on a.field = b.field, all rows of a appear plus the rows shared by a and b; rows in a with no match in b show null.
right join: "show what the right has"; with on a.field = b.field, all rows of b appear plus the shared rows; rows in b with no match in a show null.
full join: a "full outer join"; all rows of both tables appear, effectively inner + (left - inner) + (right - inner).

65. Ways to combine two tables:

1) A join B, on 1=1 
2) A union B  (relatively efficient)  
3) insert into tableA select * from tableB --then select from tableA

66. Date intervals:
select date_add('2020-02-10',10) from …;
2020-02-20

67. Null handling, method 1: NVL(E1, E2) returns E2 if E1 is NULL, otherwise E1 itself.

68. Null handling, method 2: coalesce(E1, E2, E3, …) returns the first non-null argument (E1 if it is not null, else E2, else E3, and so on); if every argument is null, it returns NULL.
So coalesce(E1, 0) re-assigns 0 whenever E1 is null, removing the null values.

68. Null handling, method 3: ifnull(E1, 0) returns 0 when E1 is null.

69. Keeping None values (so a report still shows its none rows under a given condition): simply force a sum() over the original column fconsumer before doing the division.

70. MySQL create-table example:

create table finance_data.t_lcfx_fact_consume_active_users_v2_01
 (
	Findex BIGINT NOT NULL AUTO_INCREMENT COMMENT '主键ID' , 
	Fdate datetime NOT NULL DEFAULT '1990-01-01' COMMENT '业务日期',
	Fnew_consumer BIGINT NOT NULL DEFAULT '0' COMMENT '新增用户活跃', 
	Fagain_consumer BIGINT NOT NULL DEFAULT '0' COMMENT '新增且再次消费', 
	Findex_day INT NOT NULL DEFAULT '0' COMMENT '指标日期 (次日等)', 
	Fsource varchar(32) NOT NULL DEFAULT '0' COMMENT '用户来源渠道',
	Fetl_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'etl时间',
	F_p_date datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '采集时间',
	Fcreate_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
	Fmodify_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '修改时间',
	Fdelete_status tinyint(3) NOT NULL DEFAULT '0' COMMENT '删除状态 0未删除;1已删除',
	Fversion varchar(32) NOT NULL DEFAULT '1.0' COMMENT '版本号',
	PRIMARY KEY (findex),
	KEY idx_Fcreate_time (Fcreate_time) USING BTREE,  
	KEY idx_Fmodify_time (Fmodify_time) USING BTREE,  
	KEY idx_F_p_date  (F_p_date ) USING BTREE  
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='XX部|XX表|hxh|20200229';

USING BTREE specifies the index type.

71. Dissecting complex logic.
In some MySQL (left/right) join statements, the conditions you would put in where after the join can instead be evaluated inside the select columns. It looks more complex (add comments, or you will forget the rules later), but it saves SQL and shortens the script considerably.

Goal: t1 and t2 both hold Fuid detail rows; find the Fuids present in both t1 and t2, without using where.
Reasoning:
	start from the count of distinct t2.Fuid values (a null counts as one too)
	then subtract a correction:
	if any t2.Fuid is null after the join, one extra null value was counted, so check whether the joined t2.Fuid contains nulls: subtract 1 if so, otherwise subtract 0.
Conclusion: 51 - 1 = 50
Script:
select
 	count(distinct if(t2.fuid is null, 0, t2.fuid)) - if(sum(if(t2.fuid is null, 1, 0))>0, 1, 0) as fagain_consumer --new users who consumed again
 from
 	t1
 	left join 
 	t2

72. distinct applies to all selected columns together, including the ones you did not think of as "the distinct column"; only fully identical rows collapse, much like a group by over every selected column.

73. When t1 and t2 both carry fsource and you need the Fuids matched per channel, the join must be ()t1 join ()t2 on t1.fuid = t2.fuid and t1.fsource = t2.fsource. Without the fsource condition, the join over-matches, the data blows up, and the query fails.

74. Force-building a constant column to join against:

select * from 
(select ' ' as column)t1 --this column exists so the result survives table a being empty; without it the output would have no rows
left join a on 1=1
left join b
left join c

75. Hive create / insert statements.

CREATE syntax:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
		  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
		  [COMMENT table_comment]
		  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
		  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
		  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
			 ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
			 [STORED AS DIRECTORIES]
		  [
		   [ROW FORMAT row_format] 
		   [STORED AS file_format]
			 | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
		  ]
		  [LOCATION hdfs_path]
		  [TBLPROPERTIES (property_name=property_value, ...)]

INSERT syntax:

INSERT INTO|OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1

Create a plain table:

create table if not exists jz_tmp.test_studentinfo (id int,name string,gender string, age int)

Insert into a plain table:

insert into jz_tmp.test_studentinfo values(1,'hxh','A',20)

Batch insert:

insert into jz_tmp.test_studentinfo values(1,'hxh','A',20),(2,'zzz','B',18),(3,'yyy','C',30)

Create a partitioned table:

create table if not exists jz_tmp.test_student_01(
	id int
	,name string
	,gender string
	,age int) 
partitioned by (f_p_date string comment '') stored as orc 

Insert into a partitioned table:

insert into jz_tmp.test_student_01 partition (f_p_date='2020-03-16') values(1,'hxh','S',18)

76. MySQL table-alteration statements …

77. MySQL absolute value function: abs(E1) returns the absolute value of E1.

78. MySQL: extract year-month:

`select date_format(date,'%Y-%m') as 'year_month' from xxx_table` 

79. Hive: extract year-month:

`date_format({p0},'yyyy-MM')`

80. MySQL: substring a column:

`substr(F_p_date,1,10) as fdate`

81. Hive/MySQL: selecting a time range.

fmodify_time between '2020-03-01' and '2020-03-11' includes both endpoints:
it expands to fmodify_time >= '2020-03-01 00:00:00' and fmodify_time <= '2020-03-11 00:00:00'.
If fmodify_time carries a time of day, e.g. '2020-03-11 03:30:15', the range therefore misses it;
add one day to the upper bound with date_add('2020-03-11',INTERVAL 1 DAY),
or select date_add('2020-03-11',1),
to reach 2020-03-12 00:00:00.
Note 1: with fmodify_time values of the form 'yyyy-MM-dd hh:mm:ss', fmodify_time between '2020-03-01' and '2020-03-11' includes the start but effectively excludes the last day (it only reaches 2020-03-11 00:00:00).
Note 2: with the same values, fmodify_time between '2020-03-01' and '2020-03-11 20:00:00' includes both ends (up to 20:00:00 on the last day).
Note 3: with date(fmodify_time) values of the form 'yyyy-MM-dd', date(fmodify_time) between '2020-03-01' and '2020-03-11' includes both ends (the full last day).
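The boundary behavior can be sketched with Python's sqlite3. One caveat: SQLite compares these TEXT values as plain strings, so even the 00:00:00 row exceeds the bare upper bound, whereas MySQL casts the bare date to midnight and keeps that row; either way the rest of the last day is lost, and the half-open range below is correct in both systems:

```python
# Made-up timestamps. SQLite compares TEXT lexically, so even
# '2020-03-11 00:00:00' sorts after the bare bound '2020-03-11'.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(fmodify_time TEXT);
INSERT INTO t VALUES ('2020-03-01 10:00:00'),
                     ('2020-03-11 00:00:00'),
                     ('2020-03-11 03:30:15');
""")
n_between = conn.execute("""
    SELECT COUNT(*) FROM t
    WHERE fmodify_time BETWEEN '2020-03-01' AND '2020-03-11'
""").fetchone()[0]
n_half_open = conn.execute("""
    SELECT COUNT(*) FROM t
    WHERE fmodify_time >= '2020-03-01' AND fmodify_time < '2020-03-12'
""").fetchone()[0]
print(n_between)    # 1: rows on the last day are lost
print(n_half_open)  # 3: the half-open range covers the whole last day
```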

81. For a low-level big-data table, the initial ETL should keep create_time at full precision, i.e. yyyy-MM-dd hh:mm:ss.SSS; downstream reports can round the time when presenting the data.

NOT BETWEEN excludes both boundary values, i.e. values not within the range? (unverified)

 `--both ends included (to be tested): between '2020-03-01' and '2020-03-11 03:20:15'` 

82. Hive: add/subtract days:

select date_add('2020-03-12',2)   --add 2 days
select date_sub('2020-02-20',90)  --subtract 90 days

82. MySQL: add/subtract days:

date_add(OrderDate,INTERVAL 1 DAY)  --add one day
date_sub(OrderDate,INTERVAL 1 DAY)  --subtract one day

Presto: add/subtract time:

select date_add('day', 1, TIMESTAMP '2014-03-08 09:00:00');

83. MySQL / Hive: month start, month end, and similar dates:

Last day of previous month: last_day(DATE_SUB('2020-03-11',INTERVAL 1 MONTH))   --mysql
First day of current month: date_add(curdate(),interval -day(curdate())+1 day)  --mysql
First day of current month: trunc('2020-04-30','MM') --hive 
Last day of current month: last_day(curdate()) --mysql; last_day(now()) also works in hive
Yesterday: date_sub(OrderDate,INTERVAL 1 DAY) --mysql
Tomorrow: date_add(OrderDate,INTERVAL 1 DAY) --mysql
Monday of this week: select date_sub(next_day('2021-05-01','MO'),7) --hive
Sunday of this week: select date_sub(next_day('2021-05-01','MO'),1) --hive
Monday of this week: date_sub('2021-05-18',cast(date_format('2021-05-18','u') as int)-1) --hive; broken in Hive 3.0

84. MySQL amount formatting:

`format(amount, 2)   produces amounts formatted like 23,743,424.00`

84. Error analysis:


		select
            if(fdate = '2020-03-11',Ffrist_hold_nums,0)
            ,sum(if(fdate = date_sub('2020-03-11', interval 1 day),Fhold_point_nums,0)) as Fyesterday_hold_point_nums  
		from 	 
			ptbi_db.t_leka_point_user_active_detail where 1=1

Error: removing the sum() here makes the query go wrong; the data comes back 0.
Reason: this statement really needs a group by, which is missing, so execution groups implicitly and takes the first row, i.e. 0.
Fix: 1. add the group by; 2. keep the sum().

85. Picking a time window inside a time window.
Requirement: [conversion rate of newly holding XX users over the last 7/30/60/90 days, as a trend curve over the last 30 days]
Idea: self-join the same table; the on clause selects t2's trailing 1-7 day window, the where clause selects t1's 30-day range.

--reference code below
--idea: self-join the same table; on picks t2's trailing-window rows, where picks t1's 30-day range
select p1.Fdate as Fdate
     , p1.Ffirst_hold_and_consumer_in_7days
     , p1.Ffirst_hold_and_consumer_in_30days
     , p1.Ffirst_hold_and_consumer_in_60days
     , p1.Ffirst_hold_and_consumer_in_90days
     , p7.Ffrist_hold_nums_sum_seven
     , p30.Ffrist_hold_nums_sum_thirty
     , p60.Ffrist_hold_nums_sum_sixty
     , p90.Ffrist_hold_nums_sum_ninety
     , round(100* p1.Ffirst_hold_and_consumer_in_7days/p7.Ffrist_hold_nums_sum_seven,2) as Fratio_7
     , round(100* p1.Ffirst_hold_and_consumer_in_30days/p30.Ffrist_hold_nums_sum_thirty,2) as Fratio_30
     , round(100* p1.Ffirst_hold_and_consumer_in_60days/p60.Ffrist_hold_nums_sum_sixty,2) as Fratio_60
     , round(100* p1.Ffirst_hold_and_consumer_in_90days/p90.Ffrist_hold_nums_sum_ninety,2) as Fratio_90
from (

	select 
		Fdate
		,Ffirst_hold_and_consumer_in_7days
		,Ffirst_hold_and_consumer_in_30days 
		,Ffirst_hold_and_consumer_in_60days 
		,Ffirst_hold_and_consumer_in_90days  
	from ptbi_db.t_leka_point_user_active_detail
		  where Fdate between date_sub('2020-03-17', interval 30 day)  and date_add('2020-03-17', interval 1 day)
		 
	) p1
	  
	  
left join


    (
	 select t1.Fdate
           , sum(t2.Ffrist_hold_nums) as Ffrist_hold_nums_sum_seven
      from ptbi_db.t_leka_point_user_active_detail t1
               left join ptbi_db.t_leka_point_user_active_detail t2																				  
                         on t2.Fdate between date_sub(t1.Fdate, interval 7 day) and date_add(t1.Fdate, interval 1 day)                           
      where t1.Fdate between date_sub('2020-03-17', interval 30 day)  and date_add('2020-03-17', interval 1 day)                                
      group by t1.Fdate asc                                                                                                                      
	  
	) p7 on p1.Fdate = p7.Fdate
	  
left join

     (select t1.Fdate
           , sum(t2.Ffrist_hold_nums) as Ffrist_hold_nums_sum_thirty
      from ptbi_db.t_leka_point_user_active_detail t1
               left join ptbi_db.t_leka_point_user_active_detail t2
                         on t2.Fdate between date_sub(t1.Fdate, interval 30 day) and date_add(t1.Fdate, interval 1 day)
      where t1.Fdate between date_sub('2020-03-17', interval 30 day)  and date_add('2020-03-17', interval 1 day)
      group by t1.Fdate asc) p30 on p1.Fdate = p30.Fdate

left join

     (select t1.Fdate
           , sum(t2.Ffrist_hold_nums) as Ffrist_hold_nums_sum_sixty
      from ptbi_db.t_leka_point_user_active_detail t1
               left join ptbi_db.t_leka_point_user_active_detail t2
                         on t2.Fdate between date_sub(t1.Fdate, interval 60 day) and date_add(t1.Fdate, interval 1 day)
      where t1.Fdate between date_sub('2020-03-17', interval 30 day)  and date_add('2020-03-17', interval 1 day)
      group by t1.Fdate asc) p60 on p1.Fdate = p60.Fdate
	  
left join

     (select t1.Fdate
           , sum(t2.Ffrist_hold_nums) as Ffrist_hold_nums_sum_ninety
      from ptbi_db.t_leka_point_user_active_detail t1
	  
left join ptbi_db.t_leka_point_user_active_detail t2
                         on t2.Fdate between date_sub(t1.Fdate, interval 90 day) and date_add(t1.Fdate, interval 1 day)
      where t1.Fdate between date_sub('2020-03-17', interval 30 day)  and date_add('2020-03-17', interval 1 day)
      group by t1.Fdate asc) p90 on p1.Fdate = p90.Fdate
order by p1.Fdate

86.
Hive: for users' total invested amounts, count how many users share each amount:

select Finput_sum,count(1) from (
    select Fuid,sum(Finput_star_num) as Finput_sum 
    from dp_snap.leka_star_act_user_db_t_star_act_partake_user 
    where Fact_id = 'SACT202003151626297900091124' 
    group by Fuid 
)t group by Finput_sum

87. Principles for designing a per-user daily fact table:
1) Primary key Fuid.
2) Actions (grant / consume). Analysis: no new action types will appear, so put them on the same row, as columns!
----2.1) Grant actions: first grant, last grant, …
--------2.1.1) First grant: grant time, amount granted, balance after grant, grant channel, …
--------2.1.2) Last grant: grant time, amount granted, balance after grant, grant channel, …
----2.2) Consume actions: first consume, last consume, …
--------2.2.1) First consume: consume time, amount consumed, balance after consume, consume channel, …
--------2.2.2) Last consume: consume time, amount consumed, balance after consume, consume channel, …
3) Channels (lk, fql, wx, etc.). Analysis: new channels will certainly appear, so put them in the same column, one row each!
With this fact table in place, operational metrics fall out easily.
PS: e.g. yesterday's newly holding users = count(distinct Fuid) whose first grant was yesterday; yesterday's active users = count(distinct Fuid) whose last consume was yesterday.

88. The SQL below illustrates several points:
1)
Counting participants: count(distinct …) may contain if logic, but without distinct the condition inside count(…) is ineffective, because count() just counts scanned rows: whatever the data, every row the cursor passes adds 1. The expression if(sum(if(a.ffirst_event_tag != 4, 1, 0))>0, 1, 0) exists to subtract the one extra value that gets counted despite not satisfying the condition. For example, with

fuid | ffirst_event_tag
1    | 4
1    | 4
1    | 4
2    | 3
2    | 3

applying if(ffirst_event_tag = 4, fuid, -1) maps the non-matching rows to -1:

fuid | ffirst_event_tag
1    | 4
1    | 4
1    | 4
-1   | 3
-1   | 3

and the distinct values become

fuid
-1
1

so count(distinct …) yields 2, while the true answer is 1. Hence subtract 1, and whether to subtract is decided by if(sum(if(a.ffirst_event_tag != 4, 1, 0))>0, 1, 0): subtract 1 only when rows with ffirst_event_tag other than 4 exist, otherwise subtract nothing.
(See item 71 of this article.)

2)
group by can group on self-defined computed values; see the code.
3)
Even after the grouping in 2), the per-group consume count of a given type is still obtainable, because of the two fevent_type values in a group only one is a consume, i.e. has ffirst_event_tag = 4.
4)
Cumulative consume count.
Correct: sum(if(ffirst_event_tag = 4,1,0)) as fcc_td
Wrong: count(if(ffirst_event_tag = 4,findex,0)) as fcc_td
Why it is wrong: count() counts the rows scanned, whatever their values; as long as the cursor passes a row it adds 1, and the condition has no effect (unless distinct is used, or the else branch yields null).

	select 
			fdate
			,count(distinct if(ffirst_event_tag = 4,fuid,-1)) - if(sum(if(a.ffirst_event_tag != 4, 1, 0))>0, 1, 0) as fau_td  --participants (consuming users)
			,sum(if(ffirst_event_tag = 4,1,0)) as fcc_td --cumulative consume count
			,sum(if(ffirst_event_tag = 4,fchange_point,0)) as fap_td --cumulative coins consumed
			,sum(if(ffirst_event_tag = 3,fchange_point,0)) as fgp_td --cumulative coins granted
			,case when fevent_type in (1014,2015) then 40 --'tlx' 
				  when fevent_type in (1013,2014) then 50 --'cj'
				  when fevent_type in (1017,2016) then 60 --'ccl'
				  when fevent_type in (1018,2018) then 70 --'bjc'
				  when fevent_type = 2012 then 10 --'db'    
				  else 'unknow' end as fact_type
			,fsource
			from
					(select findex  
						  ,fuid 
						  ,fevent_id  
						  ,fevent_type
						  ,fchange_point
						  ,fpost_point
						  ,fmemo
						  ,fcreate_time
						  ,{p0} as fdate --{p0}
						  ,case when fchange_point>0 then 3  --3 = grant, 4 = consume (skips the legacy 1/2 values)
								when fchange_point<0 then 4 end as ffirst_event_tag
						  ,case when fsource is null or fsource='' then 'lk' 
								when fsource='fql' or fsource='wx_applet' then 'fql_all'
								else fsource end as fsource
					from dp_ods.leka_user_point_db_t_user_point_water  --points ledger table
					where (fuid < 3000000 or fuid > 5000000)
					and fmemo not like '%[sync][sync]%' --drop erroneous ledger rows
					and date(fcreate_time) <= {p0}
					and fevent_type in (2015,1014,2014,1013,2012,1017,2016,1018,2018) 
					)a
		group by 
			case  when fevent_type in (1014,2015) then 40 --'tlx' 
				  when fevent_type in (1013,2014) then 50 --'cj'
				  when fevent_type in (1017,2016) then 60 --'ccl'
				  when fevent_type in (1018,2018) then 70 --'bjc'
				  when fevent_type = 2012 then 10 --'db'  
				  else 'unknow' end
			,fsource
			,fdate	

88. In select * from (A union all B)t, the inner A and B blocks must not contain order by; it errors. Sorting there is meaningless anyway: the union all output stays unordered regardless.

select
		*
	from 
		(
			select
				*
			from
				t1
			group by 
				...
			--order by here would be an error
			--any order by belongs in the outer query; sorting here is meaningless, the union all output is unordered anyway
			
		union all
		
		 	select
				*
			from
				t2
			group by 
				...
			--order by here would be an error
			--any order by belongs in the outer query; sorting here is meaningless, the union all output is unordered anyway
		 )t
	order by 
		fdate desc --numeric, descending
		,fact_type --sorted by the strings' Unicode order
		,fsource   --sorted by the strings' Unicode order

89. Report extraction: A left join B, with A the big table (three channels 1, 2, 3) and B the small one (channel 1 only). When the front end selects channels 2 or 3, the report should show no data. How? After the left join, in the where clause keep the channel filter passed from the front end, A.fsource in (1,2,3), and also hard-code A.fsource = 1. Then a front-end selection of 1, 12, or 123 shows data, while 2, 3, or 23 shows nothing.
Reference code:

select
	*
from
	ptbi_db.t_leka_star_game_report_v2 r
left join	
	ptbi_db.t_leka_star_game_cash_cost c
on substr(r.fdate,1,7) = substr(c.fdate,1,7) and c.Fact_type = r.Fact_type and c.fsource = r.fsource
where 
	r.fsource in {fsource} 	       --※ note this line!
	and r.fsource = 'all'          --※ note this line!
	and r.Fact_type in {fact_type} 
group by 
	substr(r.fdate,1,7)
	,r.Fsource
	,r.Fact_type  
order by substr(r.fdate,1,7) desc  

89. Handling a null or zero divisor in a select expression: wrap the whole calculation in the outermost layer, e.g. ifnull(expr1 / expr2, 0).

90. Month-level report: analyzing a table-matching bug.
--1.) on r.fdate = c.fdate was initially written as substr(r.fdate,1,7) = substr(c.fdate,1,7), inflating the Cartesian product. How to track that down? Method: first query the two tables separately to rule out source-data problems. The sources were fine, so check whether the on condition matches uniquely: it does not, so the join over-matches and the numbers balloon. To confirm, compute the correct monthly cash-cost sum() over ptbi_db.t_leka_star_game_cash_cost alone, then compute it again through the left join, and compare. The joined figure really was inflated, by exactly 17x (today happened to be 2020-04-17), which pins it on over-matching. Inspecting the on clause: with on substr(r.fdate,1,7) = substr(c.fdate,1,7), the main table's 2020-04-01 truncates to '2020-04' and matches every day of the month truncated from the secondary table, hence the 17x factor so far this month. Changing it to on r.fdate = c.fdate made the data correct; fixed.

select * from 
	(
		select
			substr(r.fdate,1,7) as Fdate
			,2 as fdate_type --month granularity
			,case when r.Fact_type = 0 then 'zhengti'
				  when r.Fact_type = 10 then 'duobao' 
				  when r.Fact_type = 40 then 'tuilexing' 
				  when r.Fact_type = 50 then 'choujiang' 
				  when r.Fact_type = 60 then 'caicaile' 
				  when r.Fact_type = 70 then 'baojianchui' 
				  else '未知游戏' end as Fact_type
			,case when r.Fsource = 'lk' then 'zz'
				  when r.Fsource = 'fql_all' then 'xx'
				  else '大盘' end as Fsource  
			,ifnull(sum(c.fcash_cost_d),0) as fcash_cost_d  						--monthly cash cost consumed			
			,( ifnull(abs(sum(r.fgp_d)),0) * 0.004 + ifnull(sum(c.fcash_cost_d),0) ) / ifnull(abs(sum(r.fap_d)),0) as fstar_cost_d  	--monthly game-coin cost consumed
			,'/' as fcash_cost_td   --cumulative cash cost consumed
			,'/' as fstar_cost_td	--cumulative game-coin cost consumed
		from
			ptbi_db.t_leka_star_game_report_v2 r
		left join	
			ptbi_db.t_leka_star_game_cash_cost c
		 on r.fdate = c.fdate and c.Fact_type = r.Fact_type and c.fsource = r.fsource
		where 
		  substr(r.fdate,1,7) between substr({fdate_start},1,7)  and substr({fdate_end},1,7) 
		  --and substr(c.fdate,1,7) between substr({fdate_start},1,7)   and substr({fdate_end},1,7)
		  and r.fsource in {fsource} 
		  and r.fsource = 'all'
		  and r.Fact_type in {fact_type} 
		group by 
		  substr(r.fdate,1,7)
		  ,r.Fsource
		  ,r.Fact_type  
		--order by substr(r.fdate,1,7) desc  
	)t
where 
	fdate_type = {fdate_type}
order by 
	fdate desc
	,Fact_type
	,fsource

90. Understanding group by with the aggregate sum().
Step 1: after group by, the rows look like:

goods          | price  | quantity
5-yuan voucher | 5 yuan | 1
5-yuan voucher | 5 yuan | 1
5-yuan voucher | 5 yuan | 2

Step 2: sum(Fgoods_quantity * fdistribute_price) as zz then computes:

goods          | consumed benefit cost
5-yuan voucher | 5*1 = 5 yuan
5-yuan voucher | 5*1 = 5 yuan
5-yuan voucher | 5*2 = 10 yuan
sum(Fgoods_quantity * fdistribute_price) = 5+5+10 = 20 yuan

For
fdistribute_price * sum(Fgoods_quantity) as zz --consumed benefit cost
(fdistribute_price * sum(Fgoods_quantity) - sum(Forder_amount)) / sum(Forder_stars) --cost per star
fdistribute_price must be moved inside the sum() to avoid an error.

select
		count(distinct fuid), --users who redeemed
		count(1) as fexchange_times, --redemption count (order-level, i.e. number of orders)
		fexchange_times / ?1, --conversion rate
		sum(Forder_stars) as cc, --stars consumed
		sum(Forder_amount) as xx, --cash consumed
		sum(Fgoods_quantity * fdistribute_price) as zz, --consumed benefit cost √
		(sum(Fgoods_quantity * fdistribute_price) - sum(Forder_amount)) / sum(Forder_stars) --cost per star √
	from
		xx_table
	where
	    date(fmodify_time) = '2020-04-26'
	group by
		fgoods_item_id 	

91. A having usage example:

SELECT 
    d.fplan_id
FROM jz_snap.finance_db_t_finance_plan_detail d 
  LEFT JOIN jz_snap.finance_db_t_finance_plan p  on d.Fplan_id = p.Fplan_id   
WHERE d.Fdetail_type IN (61, 71) --30: amount lent out (Fexpense_amount); 50: principal repaid (Fincome_amount); 61: principal repaid via credit transfer (Fincome_amount); 71: principal paid to buy credit transfer (Fexpense_amount)
  AND substr(d.Fcreate_time,1,7) <= '2020-01'
  AND substr(p.Fcreate_time,1,7) <= '2020-03'
  AND p.Fratio = 650
group by d.fplan_id
having sum(d.Fexpense_amount - d.Fincome_amount)/100 < 0

92. Two ways to count
To count active days per user and show 0 (not NULL) when no activity matches,
the left join's secondary-table conditions must go in the ON clause

select
	u.Fuid
	Method 1 ,count(a.fuid) as factive_day
	Method 2 ,sum(if(a.fuid is not null,1,0)) as factive_day
from
	dp_lcfx_mart.lcfx_fact_star_user_snap  u --users ever active in lk
left join 
	dp_lcfx_mart.lcfx_fact_consume_active_users  a --active-users table
	on u.Fuid = a.Fuid and date(a.fdate) between date_sub('2020-06-14',30)  and '2020-06-14' and  a.fsource = 'lk'  
where
	u.fsource = 'lk'  
	and u.ffc_modify_time is not null  --ever active historically
	and u.f_p_date = '2020-06-14' --stat date
group by
	u.fuid
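
Both counting methods rely on COUNT(col) skipping NULLs, which is why unmatched users come out as 0 rather than NULL. A toy check (sqlite3 sketch; schema invented):

```python
import sqlite3

# COUNT(col) skips NULLs, so it agrees with the sum(if(... is not null, 1, 0)) form.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (fuid INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (2,), (None,)])
total, non_null, via_sum = con.execute("""
SELECT COUNT(*),
       COUNT(fuid),
       SUM(CASE WHEN fuid IS NOT NULL THEN 1 ELSE 0 END)
FROM t""").fetchone()
print(total, non_null, via_sum)  # 4 2 2
```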

93. A small difference between inner join and left join:
1) For A left join B, conditions on B need care on two levels. First, to keep every row of the main table A, B's filters should go in the ON clause: if a filter on B is written in WHERE, the A rows that found no match (and therefore carry NULL B columns) get filtered out, and the result is incomplete.
2) A inner join B has no such problem: the result is the same whether the condition sits in ON or WHERE; only the execution path and efficiency differ. If B is large, put its condition in ON so the filter is applied before the join, shrinking the Cartesian product and the work done.
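
Point 1) is easy to verify on two tiny tables (sqlite3 sketch; table names invented): the same filter kept in ON preserves all of A, while moved to WHERE it drops A's unmatched rows:

```python
import sqlite3

# The same filter on secondary table b: in ON it keeps every row of a,
# in WHERE it silently drops a's unmatched rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER, src TEXT);
INSERT INTO a VALUES (1), (2);
INSERT INTO b VALUES (1, 'lk'), (2, 'other');
""")
on_rows = con.execute(
    "SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id AND b.src = 'lk'"
).fetchone()[0]
where_rows = con.execute(
    "SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id WHERE b.src = 'lk'"
).fetchone()[0]
print(on_rows, where_rows)  # 2 1
```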

94. count() + if() and sum() + if() for statistics

select count( if( condition, fuid, null))
select count( distinct if( condition, fuid, null)) 
select sum( if( condition, 1, 0 ) )

95. A dimension base table prevents grades from going missing: levels with no matching rows would otherwise never come out of the group by

select 
	'' as findex
	,{p0} as fdate
	,if(b.num is not null,b.num,0) as Fchange_user_num
	,a.fgrade as fgrade

from
	(
		--the key part!!
	select 
		explode(array('y1','y2','y3','y4','k1','k2','k3','k4','n1','n2','n3','n4')) AS fgrade
	)a
left join 
	(
		select
			g.fgrade as fgrade
			,count(g.fuid) as num
		from 
			dp_lcfx_mart.lcfx_fact_user_unconsume_grade g
		inner join
			dp_lcfx_mart.lcfx_fact_star_user_snap u
			on g.fuid = u.fuid 
		group by 
			g.fgrade
	)b on a.fgrade = b.fgrade

96. Custom order by sort sequence

order by FIELD(t1.fgrade,'y1','y2','y3','y4','k1','k2','k3','k4','n1','n2','n3','n4') 

97. When a report pulls a date range, e.g. '2020-06-01' ~ '2020-06-07'

1) If the SQL contains sum(), group by is required, otherwise only a single row comes out
select
	    fdate as fdate
		,sum(if(ftime_type = 1, fhold_uv, 0)) as Fhu_d 
	from 
		table1
	where 
		Fsource = 'all'
		and fdate between {fdate_start}  and {fdate_end}
	group by fdate
2) If the SQL has no sum(), group by is unnecessary and all rows are output
select
	    fdate as fdate
		,fhold_uv as Fhu_d 
	from 
		table1
	where 
		Fsource = 'all'
		and fdate between {fdate_start}  and {fdate_end}
	--group by fdate
3) When A, B, C, D each return multiple rows, the outer t layer needs group by, otherwise rows are output repeatedly
select fdate, t.* from ( A left join B left join C left join D )t group by fdate

98. left join inflating the output, and the fix
Cause: a single fuid in t1 matches several rows in t2; the Cartesian product fans out and the aggregated values grow.
Fix: wrap t2 together with its ON conditions in parentheses, dedupe t2 by Fuid inside that subquery, and only then left join.

--incorrect code
select
	t1.fuid
	,sum(if(t2.fuid is not null,1,0)) as factive_day 
	,abs(sum(if(t3.fchange_point is null,0,t3.fchange_point))) as fconsume_point --star points spent
from
	t1
left join
	 t2
	on t1.fuid = t2.fuid 
	and date(t1.fdate) between date_sub('2020-07-07',29)  and '2020-07-07' 
	and  t2.fsource = 'lk'   
left join
	t3
	on t1.fuid = t3.fuid
	and t3.fchange_point < 0  --star points spent
	and t3.fevent_type != 2005 --exclude expired points
	and t3.fsource = 'lk'
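
The fan-out and the row_number() dedup fix can be reproduced in miniature (sqlite3 sketch; SQLite >= 3.25 is assumed for window functions, and the tables are invented):

```python
import sqlite3  # window functions need SQLite >= 3.25

# One t1 user with two t2 rows: the naive join fans out to 2 result rows;
# deduping t2 with row_number() before joining restores 1 row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (fuid INTEGER);
CREATE TABLE t2 (fuid INTEGER, fdate TEXT);
INSERT INTO t1 VALUES (100);
INSERT INTO t2 VALUES (100, '2020-07-01'), (100, '2020-07-02');
""")
fanned = con.execute(
    "SELECT COUNT(*) FROM t1 LEFT JOIN t2 ON t1.fuid = t2.fuid").fetchone()[0]
fixed = con.execute("""
SELECT COUNT(*) FROM t1
LEFT JOIN (
    SELECT fuid FROM (
        SELECT fuid,
               ROW_NUMBER() OVER (PARTITION BY fuid ORDER BY fdate DESC) AS rn
        FROM t2) x
    WHERE x.rn = 1) d
ON t1.fuid = d.fuid""").fetchone()[0]
print(fanned, fixed)  # 2 1
```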

99. Two approaches to user segmentation:
1) Normalization + weights (illustrated with a figure in the original post, omitted here)
2) Cartesian-coordinate method
For example, with three metrics, build a 3-D x/y/z coordinate system whose center is the median or mean of each metric; the eight resulting octants form the segments. (metric-design figures omitted)

100. Median function
A quantile splits a random variable's probability distribution into equal-probability parts; common quantiles are the median (the 2-quantile), quartiles, and percentiles.

select 
	percentile_approx(字段名称, 0.5) as fconsume_point_median
from 
	table1

PS: note this is an aggregate function; it cannot be nested inside another aggregate in an outer query
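
For intuition, the exact median of a small sample; Hive's percentile_approx(col, 0.5) approximates this at scale:

```python
from statistics import median

# Exact median of a small sample; percentile_approx(col, 0.5)
# approximates the same quantity over large data in Hive.
values = [3, 1, 7, 5, 9]
print(median(values))  # 5
```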

101. MySQL daily insert-volume limit
A MySQL table cannot keep up with ~1,000,000 inserted rows per day; keep the detail data in Hive and write only the final aggregates back to MySQL.

102. Pulling values out of a JSON column

select count(distinct Fuid)
      from tableA
      where dt >= '2020-05-15' and dt <= '2020-06-30'
      and Fevent_id ='B3610B59-9AB8-451C-B86A-582FB3ACD797'
      --the key part↓
     and get_json_object(fextend_info,'$.source')='SG2005141550290028311552'
     and get_json_object(fextend_info,'$.origin')='30010'
{
    "tag_info":[
        {
            "time_stamp":"",
            "value":"",
            "key":"_OSC"
        },
        {
            "time_stamp":"",
            "value":"",
            "key":"_ISC"
        }
    ],
    "origin":"30010",
    "channel":"",
    "source":"SG2005141550290028311552",
    "pagePath":"https://m.leka.club/v2/star/goods/SG2005141550290028311552",
    "terminal_id":"LEKA"
}
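
SQLite's json_extract uses the same '$.key' paths as Hive's get_json_object, so the extraction can be tried locally (Python sketch; assumes the JSON1 functions are available, as they are in current SQLite builds):

```python
import json
import sqlite3

# json_extract plays the role of get_json_object for simple '$.key' paths.
payload = json.dumps({"origin": "30010", "source": "SG2005141550290028311552"})
con = sqlite3.connect(":memory:")
src, origin = con.execute(
    "SELECT json_extract(?, '$.source'), json_extract(?, '$.origin')",
    (payload, payload)).fetchone()
print(src, origin)  # SG2005141550290028311552 30010
```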

Case 2: nested JSON

select 
		count(if(get_json_object(get_json_object(fextend_info,'$.platform_extend_info'),'$.title') = '餐饮外卖',1,null)) as fpv1
		,count(distinct if(get_json_object(get_json_object(fextend_info,'$.platform_extend_info'),'$.title') = '餐饮外卖',fuid,null)) as fvu1
	from
		dp_lcfx_mart.t_user_behavior_business_log_lecard
	where 
		fevent_pos = 'FFENQILE_APP.QUANYITAB.WDLQYTAB.WDLYFGRM'
	and fdate > '2020-12-01'
{
    "risk_extend_info":{

    },
    "device_system":"iOS",
    "device_system_version":"13.7",
    "ecr_extend_info":{

    },
    "system":{
        "app_version":"H5",
        "os":"H5",
        "new_version":"H5",
        "machine_code":""
    },
    "device_root":"-1",
    "device_model":"iPhone",
    "net_type":"",
    "screen_size":"414*736",
    "platform":"H5",
    "platform_extend_info":{
        "eventId":"E16067173977239785",
        "search":"?vip_tag=FQL.GUANG_GAO.PISIPMRK.HRWRBZTZ&amp;agent_source=2000002",
        "tag_list":[

        ],
        "tag_info":[
            {
                "time_stamp":"",
                "value":"",
                "key":"_OSC"
            },
            {
                "time_stamp":"",
                "value":"",
                "key":"_ISC"
            }
        ],
        "agent_source":"2000002",
        "vip_tag":"FQL.GUANG_GAO.PISIPMRK.HRWRBZTZ",
        "pagePath":"https://m.leka.club/sub_card/activity/seckill_tabs.html",
        "title":"餐饮外卖",
        "terminal_id":"H5",
        "hash":""
    }
}

103. Window function usage

1)row_number()

Numbers every row within its partition according to the ORDER BY;

select userid,username,dept,score,
row_number() over(partition by username order by score)
from ccc;

2)rank()

Rows that tie within a partition get the same rank (the next rank skips ahead by the number of ties)

select userid,username,dept,score,
rank() over(partition by username order by score)
from ccc;

3)dense_rank()

Like rank(), but ties share a rank and the following rank continues without gaps

select userid,username,dept,score,
dense_rank() over(partition by username order by score)
from ccc;

4)percent_rank()

select userid,username,dept,score,
percent_rank() over(partition by username order by score)
from ccc; (the result is a ratio)
Computed as: (rank - 1) / (total rows in the partition - 1)
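
The four ranking functions compared on one tiny score table (sqlite3 sketch; requires SQLite >= 3.25 for window functions):

```python
import sqlite3  # window functions need SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ccc (score INTEGER)")
con.executemany("INSERT INTO ccc VALUES (?)", [(10,), (10,), (20,)])
rows = con.execute("""
SELECT score,
       ROW_NUMBER()   OVER (ORDER BY score) AS rn,   -- always increments
       RANK()         OVER (ORDER BY score) AS rk,   -- ties share, gaps after
       DENSE_RANK()   OVER (ORDER BY score) AS drk,  -- ties share, no gaps
       PERCENT_RANK() OVER (ORDER BY score) AS pr    -- (rank-1)/(rows-1)
FROM ccc ORDER BY rn""").fetchall()
print(rows)  # [(10, 1, 1, 1, 0.0), (10, 2, 1, 1, 0.0), (20, 3, 3, 2, 1.0)]
```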

104. Type-cast function

CONVERT( INT, dimension ) --SQL Server syntax; MySQL takes CONVERT(expr, type), and the portable form is CAST(expr AS type)

105. Filtering out empty values

select * from tableA where  fplan_id <>  '' --adding `or fplan_id is not null` breaks the filter: `<> ''` already excludes NULLs (comparisons with NULL yield NULL), while the OR re-admits every non-NULL row, including empty strings
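
The NULL semantics can be checked directly on a toy table (sqlite3 sketch): `<> ''` alone already drops NULL rows, and OR-ing `is not null` re-admits the empty strings:

```python
import sqlite3

# '' vs NULL: the <> '' predicate is NULL for NULL rows (so they are excluded);
# OR-ing "is not null" then re-admits the empty-string rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ta (fplan_id TEXT)")
con.executemany("INSERT INTO ta VALUES (?)", [("P1",), ("",), (None,)])
neq_only = con.execute(
    "SELECT COUNT(*) FROM ta WHERE fplan_id <> ''").fetchone()[0]
with_or = con.execute(
    "SELECT COUNT(*) FROM ta WHERE fplan_id <> '' OR fplan_id IS NOT NULL"
).fetchone()[0]
print(neq_only, with_or)  # 1 2
```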

106. concat() string concatenation

select concat('a123','b456','c789') from tableA
--result
a123b456c789

107. instr() search function
Syntax:

instr(sourceString,destString,start,appearPosition)
instr(
		'source string' , --i.e. the column name
		'target string' , --the content to look for
		'start position' --defaults to 1, optional,
		'nth occurrence' --defaults to 1, optional
)

108. Using the user-defined variable @row

109. MySQL report tables keyed by month (yyyy-MM)
1) Option 1: define an Fdate column of type date, have the ETL pin {p0} to the first day of the month and refresh daily; the backend truncates to the first seven characters when querying
2) Option 2: define Fdate as int or string (note a string takes more space, so prefer int); taking October 2020 as an example, store '202010'

110. Extracting a JSON value in Hive

JSON example:

{
	"ru": "",
	"nt": "wifi",
	"sest": "",
	"tag_info": "[{\"key\":\"_OSC\",\"value\":\"\",\"time_stamp\":\"\"},{\"key\":\"_ISC\",\"value\":\"LHK.LHK.00001\",\"time_stamp\":1604209128721},{\"key\":\"_CSC\",\"value\":\"\",\"time_stamp\":\"\"}]",
	"dm": "pm.m.fenqile.com",
	"poi": "",
	"ht": "PM_M.REPAY.PAYMENT_POPUP.SHOW_ACTIVITY_COUPON",
	"sestid": "",
	"type": "1"
}
select
    fdate
	,count(distinct if(get_json_object(fextend_info,'$.ht') = 'PM_M.REPAY.PAYMENT_POPUP.SHOW_ACTIVITY_COUPON' ,fuid ,null)) as Fsee_uv  --ad impressions (UV)
	,count(distinct if(get_json_object(fextend_info,'$.ht') = 'PM_M.REPAY.PAYMENT_POPUP.CLICK_COUPON' ,fuid ,null)) as Fclick_uv --ad clicks (UV)
	
from 
	dp_click.t_user_behavior_business_log
where
    fevent_pos = 'F6E03F85-B999-4CEF-B402-114523C3D92C'
    and fdate >= '2020-10-01' and fdate <= '2020-11-19'
group by fdate

111. Hive: timestamp to yyyy-MM-dd

FROM_UNIXTIME(ftime/1000,'yyyy-MM-dd') as fdate

112. Lowercasing a column with lower(col), and using rlike

select 
	fuid   --UID
	,fcreate_date --card-open date
from 
	dp_operation_mart.lkfx_privilege_card_valid_info
where
	fcreate_date >= '2020-09-15' and fcreate_date <= '2020-11-22'
	--and  fad_tag in ('yijiayou_vip','yijiayou_2')
	and lower(fad_tag) rlike 'yijiayou'

113. In Hive, when detail table A is left joined to detail table B and B's fdate must be filtered against a window around A's fdate, write the date comparisons in select with b.fdate as the left-hand side of the expression and a.fdate on the right (keep the main table's column on the right, as the fixed anchor the left side is compared against)

Example:

select 
	a.fdate
	,count(distinct a.fuid) as fopen_card
	,count(distinct if(b.fdate = date_add(a.fdate,1),a.fuid,null)) as fone_day --correct
	,count(distinct if(b.fdate >= date_add(a.fdate,1) and b.fdate <= date_add(a.fdate,7),a.fuid,null)) as fseven_day	--correct
	--wrong: ,count(distinct if(a.fdate = date_add(b.fdate,-1),a.fuid,null)) as fone_day
	--wrong:  ,count(distinct if(a.fdate >= date_add(b.fdate,-1) and a.fdate <= date_add(b.fdate,-7),a.fuid,null)) as fseven_day	
from
	(
	--card-open fuid detail
	select                         
		fuid
		,date(fcreate_time) as fdate
	from  
		dp_operation_mart.lkfx_privilege_card_valid_info
	where                 
		date(fcreate_time)  >= '2020-11-01' 
		and date(fcreate_time) <= '2020-11-22'	
		and forder_type !=30
		and fvip_type in ('CLASSIC_VIP','PLUS_VIP') 
		and (fuid < 3000000 OR fuid > 5000000)
		and ftitle like '%XX%'
	group by 
		fuid,date(fcreate_time)
	)a
	
left join
	(
	--login detail	
	select 
		fdate
		,fuid 
	from 
		dp_click.t_user_behavior_business_log
		where fdate >= '2020-11-01'
			and fdate <= '2020-11-22'
			and fapp_id in ('com.XX.club','com.XX.phone','com.XX.phones')
			and (fuid < 3000000 or fuid > 5000000)
			and FROM_UNIXTIME(ftime/1000,'yyyy-MM-dd')  >= '2020-11-01'
			and FROM_UNIXTIME(ftime/1000,'yyyy-MM-dd')  <= '2020-11-22'
		group by fuid,fdate
	)b
on a.fuid = b.fuid 
group by a.fdate 
order by a.fdate asc

114. Fixing the order by error under union all (order by is only allowed once, after the whole union all completes)
PS: concat(col1, value) --concatenation function

select
	Fdate
	,fvalue_01
from A 
--order by fdate must not go here
--an overall order by is only allowed after the union all completes

union all

select
	concat(substring(fdate,6,2),'月累计') as fdate --date label, e.g. '06月累计' (June cumulative)
	,fvalue_01
from B 
order by 
	fdate

115. join basics (http://www.runoob.com/mysql/mysql-join.html)

left  join: returns all records from the left table plus the right-table records whose join key matches
right join: returns all records from the right table plus the left-table records whose join key matches
inner join: returns only the rows whose join keys match in both tables

A inner join B: intersection.
A left join B: all of A; B values with no match are null.
A right join B: all of B; A values with no match are null.
A full outer join B: union; values with no match on either side are null.

116. Multi-column dedup in Hive
Just write distinct col, col, col (note: the distinct keyword must not be repeated!)

select 
	distinct fuid,fbrand,fpaopao,fcustomized
	--all four columns are deduplicated as a combination!!
from
    dp_fk_tmp.leka_fuid_xioaxihao_detial_tmp
where 
    fuid != 0
order by fuid asc
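
A quick check that DISTINCT applies to the whole column combination, not to any single column (sqlite3 sketch):

```python
import sqlite3

# DISTINCT dedupes the (fuid, fbrand) combination, not each column separately.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d (fuid INTEGER, fbrand TEXT)")
con.executemany("INSERT INTO d VALUES (?, ?)", [(1, "a"), (1, "a"), (1, "b")])
rows = con.execute(
    "SELECT DISTINCT fuid, fbrand FROM d ORDER BY fbrand").fetchall()
print(rows)  # [(1, 'a'), (1, 'b')]
```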

117. Some product-operations metrics
1. Registered users: total users ever registered in the app as of the stat date;
2. Total users in validity / retained users: deduplicated users whose basic or VIP membership is within its validity period as of the stat time;
3. Basic members in validity: users with a valid basic membership only (excluding those upgraded to VIP);
4. VIP members in validity: users with a valid VIP membership as of the stat time;
5. Visiting users: deduplicated users who visited the benefits tab or the app overall during the window;
6. App visiting users: users who visited the app overall during the window;
7. Users reaching the app from 分x乐 (benefits tab): users whose first app visit occurs after their unique machine code is reported in the specified tracking event during the window;
8. Users reaching the app from the benefits tab: users funneled from the benefits tab to a first login in the app during the window;
9. Total card openers: deduplicated basic-plus-VIP card-opening users as of the stat time;
10. Basic-member card openers: total users who opened a basic membership as of the stat time;
11. VIP card openers: total users who opened a VIP membership as of the stat time;
12. Upgrades: basic members who upgraded to VIP on the stat date;
13. Upgrade conversion rate = upgrades / basic members in validity;
14. Benefit-active users: users of card-opening-gift and other benefits during the period, covering custom benefits (counted on successful orders), direct-recharge benefits (successful orders, incl. flash sales), card-code benefits (card code viewed, incl. flash sales), and H5 benefits (clicking the button at the bottom of the product page counts as usage);
15. Benefit activity rate = benefit-active users / retained users;
16. Total revenue: revenue from card openings, inventory-benefit purchases (incl. flash sales), and custom-benefit purchases (Pinduoduo / fuel / self-run e-commerce / KFC / supermarket deals) during the window;
17. Card-opening revenue: revenue from card openings during the window, basic and VIP;
18. Total benefit revenue: inventory-benefit purchases (incl. flash sales) plus custom-benefit purchases during the window;
19. Inventory-benefit revenue: revenue from inventory benefits (those with a benefit ID), incl. flash sales;
20. Custom-benefit revenue: revenue from custom benefits (Pinduoduo / KFC / discounted fuel / 采蛋 / supermarket) during the window;
21. Buyers: users who purchased benefits during the period, incl. in-card benefits, custom benefits, and flash sales;
22. Purchase rate = buyers / retained users;
23. New users: users with no prior transactions who completed their first (and only first) benefit purchase during the window;
24. New-user share = new users / buyers;
25. Gross margin = (selling price - purchase price) / selling price;
26. 分x乐 card openings: users who opened a basic membership through the 分x乐 channel, incl. star-point purchases;
27. Benefit purchase: a user paying for a benefit;
28. 采蛋 purchase: a user paying for 采蛋 goods;
29. Star points issued: total star points issued platform-wide on the stat day;
30. Star-point recipients: number of people who received star points on the stat day;
31. Star points spent: total star points spent platform-wide on the stat day;
32. 7-day new-holder conversion rate: of users who newly held star points in the last 7 days (incl. the stat day), the share who have already spent some;
33. 30-day new-holder conversion rate: the same over the last 30 days (incl. the stat day);
34. New holders: users who had never earned star points before and newly earned some at stat time;
35. Star-point spenders: users who successfully spent star points during the window;
36. Star-point holders: users holding star points during the window;
37. Star-point spending - app: users spending star points inside the app, and the amount spent;
38. Star-point spending - 分x乐: users spending star points inside 分x乐, and the amount spent;
39. Daily average visitors: average daily visitors of the benefits tab plus the app for the month;
40. App daily average visitors: average daily visitors of the app for the month;
41. Usage rate = coupons used / coupons issued;
42. Average order amount = paid order amount / coupons used;
43. Subsidy rate = discount amount / (discount amount + paid order amount)

118. MySQL: add a column, rename a column

--add a column 
--keyword: add column
alter table ptbi_db.t_leka_localization_merchant_data_report add column fcity_name_set varchar(30) not null DEFAULT '0' COMMENT '售卖城市';

--rename a column 
--keyword: change; remember to repeat not null DEFAULT '0' COMMENT '售卖城市', otherwise those attributes get wiped
alter table ptbi_db.t_leka_localization_merchant_data_report change column fcity_name_set Fcity_name_set varchar(30) not null DEFAULT '0' COMMENT '售卖城市';

119. Hive: concatenating values / rows to a delimited column
concat_ws('&',collect_set(fcity_name)) --'&' is the separator; collect_set dedupes
concat_ws('&',collect_list(fcity_name)) --'&' is the separator; collect_list keeps duplicates (the function is collect_list, not collect_array)

--city mapping
            select 
                    t.fcommodity_id
                    --,group_concat('.',fcity_name) as concatcol1
                    --,fql_regex_repeat(fcity_name, "\\d+",".") 
					-- ,concat_ws('.',sort_array(collect_set(fcity_name) over(distribute by fcity_name))) as fcity_name_set 
                    ,concat_ws('&',collect_set(fcity_name) ) as fcity_name_set
            from(
                    select 
                        fcommodity_id as fcommodity_id,
                        case when fcity_code = 440100  then '深圳市' 
                                 when fcity_code = 440300 then '广州市' 
                                 else '全国' end as fcity_name 
                    from 
                            dp_snap.xxx_xxx --table name
                    group by 
							fcommodity_id,
                            case when fcity_code = 440100  then '深圳市' when fcity_code = 440300 then '广州市' else '全国' end
                    )t
            group by t.fcommodity_id

120. Using a window function to get first, second, and third purchases, plus the top-10 goods sold, in one step

--goods purchase TOP 10 
select
    rn
	,fcommodity_id as fcommodity_id --goods ID
	,productName as fproductName  --goods name
	,count(forder_id) as fbuy_cnt --purchase count
from
	(
	select 
			fuid
			,fcreate_time 
			,forder_id
			,fcommodity_id
			,famount
			,productName
			,row_number() over(partition by fuid order by fcreate_time asc) as rn	
	from
		(
			--new-user flash sale
			select                         
				fuid
			   ,fcreate_time 
			   ,forder_id --order id
			   ,get_json_object(fextend,'$.amount')/100 as famount 
			   ,fcommodity_id   --商品ID
			   ,get_json_object(fextend,'$.productName')  as productName --goods name, e.g. a Meituan-delivery 5-yuan voucher
			from                 
				dp_ods.leka_events_db_t_event_order
			where                 
				fevent_id in (
				'E16036960351415771'
				,'E16086189308876539'
				,'E16086188757613042'
				,'E16086188158717114'
				,'E16086189755334675'
				,'E16087116444886382'			
				) --new-user zone
				and date(fcreate_time) = '2020-12-20'
		)t
	)a  
	where rn in (1,2,3)   --these represent the first, second, and third purchase
group by rn,fcommodity_id,productName order by count(forder_id) desc 

121. Replacing not in with a left join
To get the rows of tableA with no match in tableB (the red shaded area in the original diagram, omitted here):

select 
	tableA.fuid
from
	tableA
left join 
	tableB on tableA.fuid = tableB.fuid
where tableB.fuid is null

Using not in

select 
	tableA.fuid
from 
	tableA where fuid not in (select fuid from tableB)

Of the two, the left join is usually faster: the NOT IN form effectively compares each tableA row against the subquery's rows one by one, a Cartesian-style match. Note also that NOT IN returns no rows at all if the subquery yields any NULL.
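
Both forms, plus the NOT IN NULL trap, on toy tables (sqlite3 sketch):

```python
import sqlite3

# Anti-join vs NOT IN; the NULL in tableB makes NOT IN return nothing.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableA (fuid INTEGER);
CREATE TABLE tableB (fuid INTEGER);
INSERT INTO tableA VALUES (1), (2), (3);
INSERT INTO tableB VALUES (2), (NULL);
""")
anti = con.execute("""
SELECT a.fuid FROM tableA a
LEFT JOIN tableB b ON a.fuid = b.fuid
WHERE b.fuid IS NULL ORDER BY a.fuid""").fetchall()
not_in = con.execute(
    "SELECT fuid FROM tableA WHERE fuid NOT IN (SELECT fuid FROM tableB)"
).fetchall()
print(anti, not_in)  # [(1,), (3,)] []
```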

122. Using a regular expression to pull a value out of a malformed JSON string
regexp_extract(f.Frule_params,'"feeRate":(.*?),',1)

Malformed JSON:
{"clearingRuleId":
"[\\"fixedRate\\",\\"afterPayment\\"]", --this part is malformed (the sql=>fdw layer escaped it)
"effectTime":1603296000000,
"feeRate":1.00,
"invalidTime":1632153599000,
"merchantId":"MC2009192051030127401984",
"merchantName":"张店区瑾嘉韵便利店",
"settlementPrice":9.90,"spuId":"SPU2009211103340127401984",
"spuName":"爱家超市店内商品代金券"}

123. Using a regular expression to extract a field

fpage_url: pages/goodsDetail/index?storeId=&ruleId=YHR2021010618042499274647284894&contractId=&yhGoodsId=YHG20210205192845525363119&fromOpenYueHui=1
Code:
select regexp_extract(fpage_url,'ruleId=([A-Z]+[0-9]+)',1) --match every letter and then every digit after ruleId=; + is greedy
Result: YHR2021010618042499274647284894

123. Using a regular expression to extract a field — important!

json1:
{"marketing":{},"task":"dadiYuehui","outerShopId":"540","subMerch":"dadi","ucode":"40134","scene":"540_dadi_40134_dadiYuehui"}
json2:
{"marketing":{},"task":"","outerShopId":"44222301","subMerch":"dadi","ucode":"36893","scene":"44222301_dadi_36893_"}
Requirement: in both json1 and json2, extract the value between dadi_ and the next _ (40134 here)
Code:
select 
	regexp_extract(c.Fbiz_code_info,'(dadi_)(.*?)(_)',2)  --result: 40134  √ 
	,regexp_extract(c.Fbiz_code_info,'(dadi_)(.*?)(_)',1)  --result: dadi_  (the leading group)
	,regexp_extract(c.Fbiz_code_info,'(dadi_)(.*?)(_)',0)  --result: dadi_40134_  (the whole match)
	,regexp_extract(c.Fbiz_code_info,'(dadi_)(.*?)(_)',3)  --result: _  (the trailing group)
from 
	table1
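
The same grouped pattern in Python's re mirrors regexp_extract's group numbering (group 0 is the whole match):

```python
import re

# Same pattern and group numbering as regexp_extract above.
s = '{"task":"dadiYuehui","scene":"540_dadi_40134_dadiYuehui"}'
m = re.search(r'(dadi_)(.*?)(_)', s)
print(m.group(2), m.group(1), m.group(0), m.group(3))
# 40134 dadi_ dadi_40134_ _
```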

124. Latitude/longitude: points within 3 km

select 
	fuid as fuid,
	round(6378.138*2*asin(sqrt(pow(sin( (22.5460535462*pi()/180-c.Flatitude*pi()/180)/2),2)+cos(22.5460535462*pi()/180)*cos(c.Flatitude*pi()/180)* pow(sin( (114.025973657*pi()/180-c.Flongitude*pi()/180)/2),2)))*1000) as flabel_value 
from dp_lcfx_mart.lcfx_fact_app_login_info_data c  
where 
	dt = date_sub(now(),1)
	and round(6378.138*2*asin(sqrt(pow(sin( (22.5460535462*pi()/180-c.Flatitude*pi()/180)/2),2)+cos(22.5460535462*pi()/180)*cos(c.Flatitude*pi()/180)* pow(sin( (114.025973657*pi()/180-c.Flongitude*pi()/180)/2),2)))*1000)<3000
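
The SQL above is the haversine great-circle formula; a direct Python transcription (same Earth-radius constant, result in meters):

```python
from math import asin, cos, pi, sin, sqrt

def distance_m(lat1, lon1, lat2, lon2):
    """Direct transcription of the SQL great-circle formula; result in meters."""
    rad = pi / 180
    a = sin((lat1 - lat2) * rad / 2) ** 2
    b = cos(lat1 * rad) * cos(lat2 * rad) * sin((lon1 - lon2) * rad / 2) ** 2
    return round(6378.138 * 2 * asin(sqrt(a + b)) * 1000)

print(distance_m(22.5460535462, 114.025973657,
                 22.5460535462, 114.025973657))  # 0 for identical points
```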

125. Emulating Hive's rank() with MySQL user-defined variables

score   row_number()   rank()   dense_rank()
10      1              1        1
10      2              1        1
20      3              3        2

The query below reproduces the result of Hive's rank():

select ttt.frankOrder, ttt.Ftrainee_name, ttt.Fstar_value_num_all
from 
(
select 
    if(@lastStarValueNum <> tt0.Fstar_value_num_all, @rankOrder := @rankOrder + 1 + @equalsTimes, @rankOrder) as frankOrder,
    tt0.Ftrainee_name as Ftrainee_name,
    tt0.Fstar_value_num_all as Fstar_value_num_all,
    if(@lastStarValueNum = tt0.Fstar_value_num_all, @equalsTimes := @equalsTimes + 1, @equalsTimes := 0) as equalsTimes,
    @lastStarValueNum := tt0.Fstar_value_num_all as lastStarValueNum
    
from
    (select 
    	t1.Ftrainee_name as Ftrainee_name,
    	coalesce(t2.Fstar_value_num_all,0) as Fstar_value_num_all
    from 
        (
    	select 1 as Ftrainee_id, "璐宝" as Ftrainee_name union all
    	select 2 as Ftrainee_id, "安安" as Ftrainee_name union all
    	select 3 as Ftrainee_id, "阿羊" as Ftrainee_name union all
    	select 4 as Ftrainee_id, "陈冰" as Ftrainee_name union all
    	select 5 as Ftrainee_id, "贝怡" as Ftrainee_name union all
    	select 6 as Ftrainee_id, "贝拉" as Ftrainee_name union all
    	select 7 as Ftrainee_id, "灵儿" as Ftrainee_name union all
    	select 8 as Ftrainee_id, "张美琪" as Ftrainee_name union all
    	select 9 as Ftrainee_id, "王敏慈" as Ftrainee_name union all
    	select 10 as Ftrainee_id, "周周" as Ftrainee_name union all
    	select 11 as Ftrainee_id, "北鼻" as Ftrainee_name union all
    	select 12 as Ftrainee_id, "章于秋晶" as Ftrainee_name union all
    	select 13 as Ftrainee_id, "白冰" as Ftrainee_name union all
    	select 14 as Ftrainee_id, "蓝格" as Ftrainee_name union all
    	select 15 as Ftrainee_id, "徐佳颖" as Ftrainee_name union all
    	select 16 as Ftrainee_id, "何姗姗" as Ftrainee_name union all
    	select 17 as Ftrainee_id, "贾博雅" as Ftrainee_name union all
    	select 18 as Ftrainee_id, "韩大力" as Ftrainee_name union all
    	select 19 as Ftrainee_id, "江智慧" as Ftrainee_name union all
    	select 20 as Ftrainee_id, "辛迪" as Ftrainee_name union all
    	select 21 as Ftrainee_id, "小Q" as Ftrainee_name union all
    	select 22 as Ftrainee_id, "杨洁" as Ftrainee_name union all
    	select 23 as Ftrainee_id, "六六" as Ftrainee_name union all
    	select 24 as Ftrainee_id, "王宇露" as Ftrainee_name union all
    	select 25 as Ftrainee_id, "狒狒" as Ftrainee_name union all
    	select 26 as Ftrainee_id, "牟子" as Ftrainee_name union all
    	select 27 as Ftrainee_id, "JUJU" as Ftrainee_name union all
    	select 28 as Ftrainee_id, "二胖" as Ftrainee_name union all
    	select 29 as Ftrainee_id, "王思凡" as Ftrainee_name union all
    	select 30 as Ftrainee_id, "卢熙" as Ftrainee_name union all
    	select 31 as Ftrainee_id, "张曦月" as Ftrainee_name union all
    	select 32 as Ftrainee_id, "蛋宝宝" as Ftrainee_name
       ) t1 
    left join 
        (select 
          Ftrainee_id as Ftrainee_id,
          sum(Fput_star_num_all)*10 as Fstar_value_num_all
        from 
          ptbi_db.t_leka_star_act_huya_trainee_date_statistics 
        where fdate >= {fdate1}
         and fdate < {fdate2}
         and fdelete_status = 0
        group by 
          Ftrainee_id 
        ) t2 on t1.Ftrainee_id = t2.Ftrainee_id
    order by t2.Fstar_value_num_all desc, t1.Ftrainee_id asc
    ) tt0, 
    (SELECT @rankOrder := 0,  @lastStarValueNum := 0, @equalsTimes := 0) tt1
    ) ttt

126. Reference code for day/week/month and cumulative dimensions in one report

select
    fdate
    ,Fpv
    ,Fvalue_01 
    ,Fvalue_02 
    ,Fvalue_03 
    ,Fvalue_04 
    ,Fvalue_05 
    ,Fvalue_06 
    ,Fvalue_07 
    ,Fvalue_08 
    ,Fvalue_09 
    
from
(
select 
    substr(Fdate,1,10) as fdate
    ,Fvalue_01 
    ,Fvalue_02 
    ,Fvalue_03 
    ,Fvalue_04 
    ,Fvalue_05 
    ,Fvalue_06 
    ,Fvalue_07 
    ,Fvalue_08 
    ,Fvalue_09 
    ,Fpv
    ,fdate_type
from 
    ptbi_db.t_leka_discount_data_report
where
    substr(Fdate,1,10) between {fdate_1} 
      and {fdate_2} 
      and fdate_type = 1 
      
union all

--cumulative (historical) dimension

select 
     substr(Fdate,1,10) as fdate
    ,Fvalue_01 
    ,Fvalue_02 
    ,Fvalue_03 
    ,Fvalue_04 
    ,Fvalue_05 
    ,Fvalue_06 
    ,Fvalue_07 
    ,Fvalue_08 
    ,Fvalue_09 
    ,Fpv
    ,Fdate_type
from 
    ptbi_db.t_leka_discount_data_report
where
     substr(Fdate,1,10) between {fdate_1} 
      and {fdate_2} 
      and fdate_type = 4
      
union all
    -- month dimension

select 
      date_format(a.Fdate, '%Y-%m') as fdate
     ,Fvalue_01 
    ,Fvalue_02 
    ,Fvalue_03 
    ,Fvalue_04 
    ,Fvalue_05 
    ,Fvalue_06 
    ,Fvalue_07 
    ,Fvalue_08 
    ,Fvalue_09 
    ,Fpv
    ,Fdate_type
    from 
      ptbi_db.t_leka_discount_data_report a
    join (
        select 
          max(fdate) as fdate, 
          date_format(Fdate, '%Y-%m') as fmonth 
        from 
          ptbi_db.t_leka_discount_data_report 
        WHERE 
          substr(Fdate,1,10) between {fdate_1} and {fdate_2} 
            and fdate_type = 3
      
      ) b on a.fdate = b.fdate  
    where
     substr(a.Fdate,1,10) between {fdate_1} and {fdate_2} 
      		and fdate_type = 3

union all --week dimension


 
    --week dimension

select 
      concat(
        substr(
        subdate(
          a.fdate, 
          if(
            date_format(a.fdate, '%w')= 0, 
            7, 
            date_format(a.fdate, '%w')
          )-1
        ),1,10), 
        '~', 
        substr(
        subdate(
          a.fdate, 
          if(
            date_format(a.fdate, '%w')= 0, 
            7, 
            date_format(a.fdate, '%w')
          )-7
        ),1,10)
      ) as fdate
     ,Fvalue_01 
    ,Fvalue_02 
    ,Fvalue_03 
    ,Fvalue_04 
    ,Fvalue_05 
    ,Fvalue_06 
    ,Fvalue_07 
    ,Fvalue_08 
    ,Fvalue_09 
    ,Fpv
    ,Fdate_type
    from 
      ptbi_db.t_leka_discount_data_report a 
      join  --PS: the goal here is to take the latest day of each week and use it to represent the whole week!
      (
        select 
          max(fdate) as fdate, 
          concat(
        substr(
        subdate(
          a.fdate, 
          if(
            date_format(a.fdate, '%w')= 0, 
            7, 
            date_format(a.fdate, '%w')
          )-1
        ),1,10), 
        '~', 
        substr(
        subdate(
          a.fdate, 
          if(
            date_format(a.fdate, '%w')= 0, 
            7, 
            date_format(a.fdate, '%w')
          )-7
        ),1,10)
      )  as fweek 
        from 
          ptbi_db.t_leka_discount_data_report a 
        WHERE 
          substr(Fdate,1,10) between {fdate_1} and {fdate_2} 
          and fdate_type = 2
        group by 
        concat(
        substr(
        subdate(
          fdate, 
          if(
            date_format(fdate, '%w')= 0, 
            7, 
            date_format(fdate, '%w')
          )-1
        ),1,10), 
        '~', 
        substr(
        subdate(
          fdate, 
          if(
            date_format(fdate, '%w')= 0, 
            7, 
            date_format(fdate, '%w')
          )-7
        ),1,10)
      ) 
      ) b on substr(a.Fdate,1,10) = substr(b.Fdate,1,10)
    where 
      substr(a.Fdate,1,10) between {fdate_1} and {fdate_2} 
      and fdate_type = 2
      
      
      
      
)t
WHERE 
  fdate_type = {fdate_type} 
ORDER BY 
  fdate desc

127. CONCAT() functions: concatenation and separators
CONCAT() joins multiple strings into one.
For example, user id + age:
1. SELECT CONCAT(id, ',', age) from table1
2. CONCAT_WS(separator, str1, str2, …) //separator is the delimiter
3. GROUP_CONCAT returns a string built by joining the values within each group

128. Data-size units

1 B (byte); 1 KB (Kilobyte) = 2^10 B = 1024 B; 1 MB (Megabyte) = 2^10 KB = 1024 KB = 2^20 B; 1 GB (Gigabyte) = 2^10 MB = 1024 MB = 2^30 B; 1 TB (Terabyte) = 2^10 GB = 1024 GB = 2^40 B;

1 PB (Petabyte) = 1024 TB = 2^50 B;
1 EB (Exabyte) = 1024 PB = 2^60 B;
1 ZB (Zettabyte) = 1024 EB = 2^70 B;

1 YB (Yottabyte) = 1024 ZB = 2^80 B;
1 BB (Brontobyte) = 1024 YB = 2^90 B;
1 NB (Nonabyte) = 1024 BB = 2^100 B;
1 DB (Doggabyte) = 1024 NB = 2^110 B.

129. Creating an array column in Hive

'列名' array<struct<additional_key:string,additional_values:string>> COMMENT 'xxx',
[{additional_key=CONTROL_STRATEGY, additional_values=0}, {additional_key=MIX_FLAG, additional_values=0}, {additional_key=HAS_OWN_WRAPPER_FLAG, additional_values=0}, {additional_key=ORIGINAL_NUMBER, additional_values=9f7152f80ef00db2d0e38f4e310c174d_1}, {additional_key=ORDER_SYS_SOURCE, additional_values=BSP}, {additional_key=ORDER_TYPE, additional_values=91}, {additional_key=WAYBILL_NO_TYPE, additional_values=WAYBILL_01_042}, {additional_key=SEASON_TAG, additional_values=[""]}, {additional_key=CIRCLE_FIX_PRICE_INFO, additional_values=512|3134|2021-09-05 07:30:37}, {additional_key=FIRST_EMP_SCHE_AOI_AREA_CODE, additional_values=010GB123}, {additional_key=ORDER_CHANNEL_CODE, additional_values=wphjttx}, {additional_key=IS_SENSITIVE, additional_values=0}, {additional_key=INPUTER_EMP_CODE, additional_values=inc-sgs}, {additional_key=CUT_OFF_TIME, additional_values=2021-09-07 18:00}, {additional_key=WI_SEND_CHECK_TYPE, additional_values=}, {additional_key=IS_TO_WAREHOUSE, additional_values=}, {additional_key=IS_COMFIRM_AIR_EMBARGO, additional_values=0}, {additional_key=SGS_UPLOAD_TYPE_CODE, additional_values=PICKUP}, {additional_key=BOX_SECURITY_CODE, additional_values=}, {additional_key=CONSIGNOR_ADDR_AOIAREA_CODE, additional_values=010GB123}, {additional_key=WI_SEND_CHECK_RESULT, additional_values=}, {additional_key=SIGN_BACK_REMARK, additional_values=}, {additional_key=CALL_FLAG, additional_values=1}, {additional_key=PICKUP_TYPE, additional_values=6}, {additional_key=PICKUP_LNGLAT_AOIAREA_CODE, additional_values=010GB123}, {additional_key=PICKUP_LNGLAT_AOI_CODE, additional_values=010GB000071}, {additional_key=PICKUP_EMP_SCHE_AOIAREA_CODE, additional_values=010GB123}, {additional_key=IS_NC_FLAG, additional_values=0}, {additional_key=RONG_TYPE, additional_values=}, {additional_key=IS_BILLING, additional_values=Y}]

130. A Hive table can be written to in parallel, but the writers must not share a partition (inc_day); overlapping partitions corrupt the output

131. Hive date conversion: yyyyMMdd to yyyy-MM-dd (note: in the date pattern, MM is month; lowercase mm means minutes)

select from_unixtime(unix_timestamp('20211031','yyyyMMdd'),'yyyy-MM-dd')
select concat(substr('20211031',1,4),'-',substr('20211031',5,2),'-',substr('20211031',7,2))

Hive date conversion: yyyy-MM-dd to yyyyMMdd

select from_unixtime(unix_timestamp('2021-10-31','yyyy-MM-dd'),'yyyyMMdd')
select concat(substr('2021-10-31',1,4),substr('2021-10-31',6,2),substr('2021-10-31',9,2))
select  regexp_replace('2021-10-31','-','') as days
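
The same round-trip in Python, where the month-vs-minute distinction is %m vs %M (analogous to MM vs mm in Hive's Java-style patterns):

```python
from datetime import datetime

# strptime/strftime round-trip; %m is month and %M is minute in Python,
# just as MM vs mm differ in Hive's Java-style date patterns.
d = datetime.strptime("20211031", "%Y%m%d")
print(d.strftime("%Y-%m-%d"))  # 2021-10-31
print(datetime.strptime("2021-10-31", "%Y-%m-%d").strftime("%Y%m%d"))  # 20211031
```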

132. Taking the last 3 characters

substr(arrivedepotbatch,-3,3) --note: a negative start counts from the end; this is not Python-style negative slicing

133. Grouped running totals in SQL

Use sum() over (partition by ... order by ...) for a running total within each group
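
A minimal running-total example (sqlite3 sketch; requires SQLite >= 3.25 for window functions; table and values invented):

```python
import sqlite3  # window functions need SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s (grp TEXT, fdate TEXT, amt INTEGER)")
con.executemany("INSERT INTO s VALUES (?, ?, ?)",
                [("a", "d1", 1), ("a", "d2", 2), ("a", "d3", 3), ("b", "d1", 5)])
rows = con.execute("""
SELECT grp, fdate,
       SUM(amt) OVER (PARTITION BY grp ORDER BY fdate) AS running
FROM s ORDER BY grp, fdate""").fetchall()
print(rows)  # [('a', 'd1', 1), ('a', 'd2', 3), ('a', 'd3', 6), ('b', 'd1', 5)]
```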

134. Getting the hour of the current time

select hour(from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss'))

135. Base-table build strategy

1. Background: forecasting base data
2. Overall approach (a strategy for when the wide table has many columns and a large volume, and layers and dimension tables change constantly):
2.1 From the wide table's ~200 columns, pick out roughly 30 commonly used ones (making sure to include the keys needed for later dimension mappings)
2.2 Aggregate the measures over the selected columns without joining any other table (the volume is large, so this boosts efficiency; the result doubles as a reusable base table, which makes step 2.3 widely applicable and avoids rescanning the 200-column table)
2.3 Map dimensions onto the lightly aggregated table; with no group by this is also efficient, attaching more attributes (product line, city, business area, branch, region, etc.)
2.4 Finally sum the enriched table up into network-wide / business-area / city / branch figures

136. cascade
When adding columns to a Hive partitioned table, use cascade so every existing partition gains the new column. Each partition's metadata is managed separately in the Hive metastore; without cascade, the new column reads back NULL for data in existing partitions.
