Spark Basics

Oracle to Spark SQL

Conversions required when migrating Oracle SQL to Spark SQL:

  1. Rewrite `decode` as `case when condition then result_1 else result_2 end`.

In Hive/Spark SQL, `decode` is a string function with a different meaning:
decode(binary bin, string charset) STRING — decodes `bin` into a string using the `charset` encoding.
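The Oracle-to-CASE-WHEN mapping can be sketched in pure Python (illustration only; the function name `oracle_decode` is mine, not an API):

```python
def oracle_decode(expr, *args):
    """Sketch of Oracle's conditional DECODE(expr, s1, r1, s2, r2, ..., default).

    The Spark SQL rewrite is the chained form:
      case when expr = s1 then r1 when expr = s2 then r2 ... else default end
    (add an explicit `expr is null` branch if you rely on DECODE's
    NULL-matches-NULL behaviour, which the `=` operator does not have).
    """
    pairs, default = args, None
    if len(args) % 2 == 1:  # trailing odd argument is the default
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        # Oracle DECODE treats two NULLs as equal
        if search == expr or (search is None and expr is None):
            return result
    return default
```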

  2. Every subquery must be given an alias.
    The statement below must specify the alias `t1`:
select * from (select distinct branch_code,branch_name from am_cfg_code_organization) t1

Avoid deeply nested subqueries; split them into separate CTEs instead: with tmp1 as (…), tmp2 as (…)

  3. String columns: rewrite nvl(col_1, 'abc')
    as case when coalesce(col_1,'') = '' then 'abc' else col_1 end
    In Oracle, the empty string and NULL both behave as NULL, while in Hadoop they are two
    distinct values. So first coalesce NULL into the empty string, then test for empty and
    substitute the default value 'abc'.
    Note: numeric columns do not have this problem.
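The Oracle/Hadoop mismatch can be checked with a pure-Python model (illustration only; both function names are mine):

```python
def oracle_nvl(col, default):
    """Oracle semantics: '' is stored as NULL, so NVL replaces both."""
    return default if col is None or col == '' else col

def spark_rewrite(col, default):
    """The rewrite from the text:
    case when coalesce(col, '') = '' then default else col end"""
    coalesced = col if col is not None else ''
    return default if coalesced == '' else col
```

For every input (NULL, empty string, or a normal value) the rewrite returns the same result Oracle's nvl would.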

  4. col_1 is not null --> coalesce(col_1,'') <> ''

  5. Convert string-concatenation constructs such as listagg, ||, and wm_concat.
    In Spark SQL, || becomes concat, and listagg/wm_concat become concat_ws(sep, collect_list(col)).

  6. Join tables with join, left join, full join; do not join implicitly through where conditions.

  7. When inserting the contents of a string variable into a table, the single quotes it contains must be handled.

ORA_DESC2 = ORA_DESC.toString().replace("'", "#");
sqlContext.sql(s"insert into table am_dp_in_log select $AL_OP_SEQ,'$P_I_DATE','$AL_SYS_NO','$AL_POPULATION_TIMESTAMP','$PROC_NAME','$START_TIME','$END_TIME','$ORA_CODE2','$ORA_DESC2','$OP_STEP2','$OP_DESC2'") 
  8. Replace exists with left semi join, and not exists with left anti join.
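The semi/anti join semantics can be sketched in pure Python (illustration only; function names are mine):

```python
def left_semi_join(left_rows, right_keys, key):
    """EXISTS rewrite: keep left rows whose key appears on the right.
    Left rows are never duplicated, and no right columns are returned."""
    keys = set(right_keys)
    return [r for r in left_rows if r[key] in keys]

def left_anti_join(left_rows, right_keys, key):
    """NOT EXISTS rewrite: keep left rows whose key does NOT appear on the right."""
    keys = set(right_keys)
    return [r for r in left_rows if r[key] not in keys]
```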

  9. Date handling.
    date_format and unix_timestamp(string date) are used constantly.
    Oracle's to_date, to_char, and date'' literals all need converting.

Function reference:

| Function | Returns | Description | Notes |
| --- | --- | --- | --- |
| to_date(string timestamp) | STRING | Date part of a timestamp, e.g. to_date("1970-01-01 00:00:00") = "1970-01-01" | |
| year(string date) | INT | Year of the given date, in the range 1000 to 9999 | |
| quarter(date/timestamp/string) | INT | Quarter the date/timestamp falls in, e.g. quarter('2015-04-01') = 2 | |
| month(string date) | INT | Month of the given time, 1 to 12 | |
| day(string date), dayofmonth(date) | INT | Day of month of the given time | |
| hour(string date) | INT | Hour of the given time, 0 to 23 | |
| minute(string date) | INT | Minute of the given time, 0 to 59 | |
| second(string date) | INT | Second of the given time, 0 to 59 | |
| weekofyear(string date) | INT | Week number of the given date within the year, 0 to 53 | |
| datediff(string enddate, string startdate) | INT | Difference in days between the two dates | |
| date_add(string startdate, int days) | STRING | The given date plus the specified number of days | |
| date_sub(string startdate, int days) | STRING | The given date minus the specified number of days | |
| current_date | DATE | Current date, e.g. 2016-07-04 | |
| current_timestamp | TIMESTAMP | Current time, e.g. 2016-07-04 11:18:11.685 | |
| add_months(string start_date, int num_months) | STRING | The date num_months months after start_date | |
| last_day(string date) | STRING | Last day of the month containing date, formatted YYYY-MM-DD, e.g. 2015-08-31 | |
| next_day(string start_date, string day_of_week) | STRING | First date after start_date falling on day_of_week (a weekday name such as Monday or FRIDAY), formatted YYYY-MM-DD | |
| trunc(string date, string format) | STRING | Truncates date to the given unit; supported formats are MONTH/MON/MM and YEAR/YYYY/YY, e.g. trunc('2015-03-17', 'MM') = 2015-03-01 | |
| months_between(date1, date2) | DOUBLE | Number of months between date1 and date2 | |
| date_format(date/timestamp/string ts, string fmt) | STRING | Formats the input; the format supports Java SimpleDateFormat patterns, e.g. date_format('2015-04-08', 'y') = '2015' | Java SimpleDateFormat pattern: yyyy-MM-dd HH:mm:ss; convert every Oracle to_char into this |
| unix_timestamp(string date) | BIGINT | Calls UNIX_TIMESTAMP() with the given date; returns the number of seconds from "1970-01-01 00:00:00" to that date | Time subtraction: time_field_A - time_field_B --> (unix_timestamp(time_field_A)-unix_timestamp(time_field_B))/(3600 * 24) --> yields a fractional number of days |
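The time-subtraction conversion in the last row can be verified with a pure-Python equivalent (illustration only; the function name is mine):

```python
from datetime import datetime

def diff_in_days(time_a, time_b, fmt="%Y-%m-%d %H:%M:%S"):
    """Model of (unix_timestamp(a) - unix_timestamp(b)) / (3600 * 24):
    the difference as a fractional number of days."""
    ta = datetime.strptime(time_a, fmt)
    tb = datetime.strptime(time_b, fmt)
    return (ta - tb).total_seconds() / (3600 * 24)
```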

select pmod(floor((unix_timestamp('2022-03-06 00:00:00')-unix_timestamp('1970-01-05 00:00:00'))/(3600*24)),7)+1;
--> Monday through Saturday: 1-6; Sunday: 7

select pmod(datediff('2022-02-27', '2012-01-01'), 7)
--> Monday through Saturday: 1-6; Sunday: 0
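Both weekday formulas rely on a known anchor date (1970-01-05 was a Monday, 2012-01-01 was a Sunday) and can be checked in pure Python (function names are mine, for illustration):

```python
from datetime import date

def weekday_1_to_7(d):
    """Model of pmod(floor((unix_timestamp(d) - unix_timestamp('1970-01-05')) / (3600*24)), 7) + 1.
    1970-01-05 was a Monday, so Monday..Saturday -> 1..6 and Sunday -> 7."""
    return (d - date(1970, 1, 5)).days % 7 + 1

def weekday_0_to_6(d):
    """Model of pmod(datediff(d, '2012-01-01'), 7).
    2012-01-01 was a Sunday, so Sunday -> 0 and Monday..Saturday -> 1..6."""
    return (d - date(2012, 1, 1)).days % 7
```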
  10. Avoid distinct; deduplicate with group by instead.
      distinct funnels all rows through a single reducer, which is slow.

  11. instr accepts only two arguments.
      To find the first match starting from a given position, use locate instead.

| Function | Returns | Description | Notes |
| --- | --- | --- | --- |
| instr(string str, string substr) | INT | Earliest position of substr within str. Returns NULL if either argument is NULL, and 0 if substr does not occur in str. Positions are 1-based. | Always searches from position 1, which limits its uses; use locate when an offset is needed |
| locate(string substr, string str[, int pos]) | INT | Smallest position at or after pos where substr occurs in str | |
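The 1-based indexing and the extra `pos` argument of locate can be modeled in pure Python (illustration only; NULL propagation is omitted):

```python
def instr(s, sub):
    """Model of Hive/Spark instr: 1-based position of the first occurrence,
    0 if sub is absent (str.find returns -1, so -1 + 1 == 0)."""
    return s.find(sub) + 1

def locate(sub, s, pos=1):
    """Model of Hive/Spark locate: first occurrence at or after 1-based
    position pos. Note the argument order is (substr, str), the reverse of instr."""
    return s.find(sub, pos - 1) + 1
```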
  12. The columns in the select of an insert must match the target table's columns in name, type, and order. If they don't, the insert itself will not fail, but when Hive later parses the files the data comes back as NULL.

  13. Use except instead of minus. except requires the compared column names to be listed explicitly; * cannot be used.

select operbiz,branch,src_sys,sequenceno,al_table,pk_name,pk_value,al_day from al_cvt_deleted_list_0920  
except
select operbiz,branch,src_sys,sequenceno,al_table,pk_name,pk_value,al_day from al_cvt_deleted_list 
  14. By default, sorting places nulls in the last position.
-- the following two are equivalent
ROW_NUMBER() OVER(PARTITION BY TOPACTUALID ORDER BY SOLVEDTIME DESC) -- big data (Spark/Hive)
ROW_NUMBER() OVER(PARTITION BY TOPACTUALID ORDER BY SOLVEDTIME DESC NULLS LAST)  -- Oracle
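Spark's DESC ordering defaults to NULLS LAST, which is why the explicit clause is only needed on the Oracle side. A pure-Python model of that ordering (function name is mine):

```python
def desc_nulls_last(values):
    """Model of ORDER BY col DESC with NULLS LAST: sort non-null values
    descending, then append the NULLs at the end."""
    non_null = sorted((v for v in values if v is not None), reverse=True)
    return non_null + [None] * values.count(None)
```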
  15. Regular expressions.

| Function | Returns | Description | Notes |
| --- | --- | --- | --- |
| regexp_extract(string subject, string pattern, int index) | STRING | Returns the capture group selected by index, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar' | |
| regexp_replace(string A, string B, string C) | STRING | Replaces occurrences of pattern B within string A with C | |
| A RLIKE B | BOOLEAN | Java-style regex match: TRUE if A or any substring of A matches B, otherwise FALSE. Returns NULL if A or B is NULL | |

Regular expression syntax note:

((?!{"ckResult").)* : matches a run of characters at no point starting the string {"ckResult"

spark.sql("""with tmp1 as (
select '{"checkNo":"123345666","checkTime":"2031.12.28","companyName":"上海公司","finalSummary":"阳性结果和异常情况\n【1】 您本次体检所检项目未见异常,建议您定期健康体检。\n","gender":"","kindList":[{"checkItemList":[{"ckResult":"157.5","itemName":"身高","unit":"cm"},{"ckResult":"53.0","itemName":"体重","unit":"Kg"},{"ckResult":"21.4","itemName":"体重指数","normalVal":"18.5--23.99"},{"ckResult":"105","itemName":"收缩压","normalVal":"90--139","unit":"mmHg"},{"ckResult":"71","itemName":"舒张压","normalVal":"60--89","unit":"mmHg"}],"doctorName":"刘洋","kindName":"一般项目检查","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"无","itemName":"病史"},{"ckResult":"无特殊","itemName":"家族史"},{"ckResult":"68","itemName":"心率(次/分)","unit":"次/分"},{"ckResult":"齐","itemName":"心律"},{"ckResult":"正常","itemName":"心音"},{"ckResult":"双侧呼吸音未闻及异常","itemName":"肺部听诊"},{"ckResult":"肝脏肋下未触及","itemName":"肝脏触诊"},{"ckResult":"脾脏肋下未触及","itemName":"脾脏触诊"},{"ckResult":"双肾区无叩痛","itemName":"肾脏叩诊"},{"ckResult":"无","itemName":"内科其它"}],"doctorName":"周芷萍","kindName":"内科","summary":"未见明显异常女外科\n"},{"checkItemList":[{"ckResult":"5.6","itemName":"白细胞计数","normalVal":"3.5--9.5","shortCode":"WBC","unit":"10^9/L"},{"ckResult":"4.35","itemName":"红细胞计数","normalVal":"3.8--5.1","shortCode":"RBC","unit":"10^12/L"},{"ckResult":"136.0","itemName":"血红蛋白","normalVal":"115--150","shortCode":"Hb","unit":"g/L"},{"ckResult":"0.41","itemName":"红细胞压积","normalVal":"0.35--0.45","shortCode":"HCT","unit":"L/L"},{"ckResult":"93.6","itemName":"平均红细胞体积","normalVal":"82--100","shortCode":"MCV","unit":"fL"},{"ckResult":"31.3","itemName":"平均红细胞血红蛋白含量","normalVal":"27--34","shortCode":"MCH","unit":"pg"},{"ckResult":"334.0","itemName":"平均红细胞血红蛋白浓度","normalVal":"316--354","shortCode":"MCHC","unit":"g/L"},{"ckResult":"12.0","itemName":"RDW-CV","normalVal":"10.9--15.4"},{"ckResult":"287.0","itemName":"血小板计数","normalVal":"125--350","shortCode":"PLT","unit":"10^9/L"},{"ckResult":"9.7","itemName":"平均血小板体积","normalVal":"7.6--13.6","shortCode":"MPV","unit":"fL"},{"ckResult":"11.6","itemName":"血小板分布宽度","normalVa
l":"9.0--17","shortCode":"PDW"},{"ckResult":"29.1","itemName":"淋巴细胞百分比","normalVal":"20--50%","shortCode":"LYMPH%"},{"ckResult":"5.3","itemName":"中间细胞百分比","normalVal":"3--10%","shortCode":"MON%"},{"ckResult":"65.6","itemName":"中性粒细胞百分比","normalVal":"40--75%","shortCode":"NEUT%"},{"ckResult":"1.6","itemName":"淋巴细胞绝对值","normalVal":"1.1--3.2","shortCode":"LYMPH","unit":"10^9/L"},{"ckResult":"0.3","itemName":"中间细胞绝对值","normalVal":"0.12--1.0","unit":"10^9/L"},{"ckResult":"3.7","itemName":"中性粒细胞绝对值","normalVal":"1.8--6.3","shortCode":"NEUT","unit":"10^9/L"},{"ckResult":"41.0","itemName":"红细胞分布宽度-标准差","normalVal":"39--53.9","shortCode":"RDW-SD","unit":"g/L"},{"ckResult":"0.235","itemName":"血小板压积","normalVal":"0.108--0.282%","shortCode":"PCT"}],"doctorName":"韩秀杰","kindName":"血常规","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"1.020","itemName":"尿比重","normalVal":"1.010--1.025","shortCode":"SG"},{"ckResult":"6.0","itemName":"尿酸碱度","normalVal":"5.4--8.4","shortCode":"PH"},{"ckResult":"阴性","itemName":"尿白细胞","normalVal":"阴性","shortCode":"LEU","unit":"Cell/uL"},{"ckResult":"阴性","itemName":"尿亚硝酸盐","normalVal":"阴性","shortCode":"NIT"},{"ckResult":"阴性","itemName":"尿蛋白质","normalVal":"阴性","shortCode":"PRO","unit":"g/L"},{"ckResult":"阴性","itemName":"尿糖","normalVal":"阴性","shortCode":"GLU","unit":"mmol/L"},{"ckResult":"阴性","itemName":"尿酮体","normalVal":"阴性","shortCode":"KET","unit":"mmol/L"},{"ckResult":"阴性","itemName":"尿胆原","normalVal":"阴性","shortCode":"URO","unit":"umol/L"},{"ckResult":"阴性","itemName":"尿胆红素","normalVal":"阴性","shortCode":"BIL","unit":"umol/L"},{"ckResult":"阴性","itemName":"尿隐血","normalVal":"阴性","shortCode":"BLD","unit":"Cell/uL"}],"doctorName":"韩秀杰","kindName":"尿常规","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"9","itemName":"丙氨酸氨基转移酶","normalVal":"0.00--40","shortCode":"ALT","unit":"U/L"},{"ckResult":"16","itemName":"天门冬氨酸氨基转移酶","normalVal":"0--40","shortCode":"AST","unit":"U/L"},{"ckResult":"4.28","itemName":"尿素","normalVal":"1.7--8.3","shortCode":"UREA","unit":
"mmol/L"},{"ckResult":"4.72","itemName":"空腹血葡萄糖","normalVal":"3.90--6.10","shortCode":"FBG","unit":"mmol/L"},{"ckResult":"5.13","itemName":"总胆固醇","normalVal":"3.10--5.69","shortCode":"TC","unit":"mmol/L"},{"ckResult":"0.59","itemName":"甘油三酯","normalVal":"0.4--1.53","shortCode":"TG","unit":"mmol/L"},{"ckResult":"7.69","itemName":"甲胎蛋白定量","normalVal":"0--20","shortCode":"AFP","unit":"ng/ml"},{"ckResult":"1.10","itemName":"癌胚抗原定量","normalVal":"0--5","shortCode":"CEA","unit":"ng/ml"}],"doctorName":"韩秀杰、赵洁","kindName":"实验室检查","summary":"未见明显异常"},{"checkItemList":[{"ckResult":"未见明显异常","itemName":"肝"},{"ckResult":"未见明显异常","itemName":"胆"},{"ckResult":"未见明显异常","itemName":"胰"},{"ckResult":"未见明显异常","itemName":"脾"},{"ckResult":"未见明显异常","itemName":"双肾"}],"doctorName":"蒋海燕","kindName":"超声检查室","summary":"未见明显异常心电图\n"}],"medicalBranchOrg":"爱康国宾","medicalOrgName":"爱康国宾","name":"徐晓莉"}
'
 text_info
)
select regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"(中性粒细胞数|中性粒细胞值|中性粒细胞绝对值|中性粒细胞计数|嗜中性粒细胞绝对值)([^"]*)"([^}]*)}',1)   na -- absolute neutrophil count
      ,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"红细胞计数"([^}]*)}',1)  rbc -- red blood cell count
      ,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"血红蛋白"([^}]*)}',1)  hb -- hemoglobin
      ,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"血小板计数"([^}]*)}',1)  plat -- platelet count
      ,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"尿隐血"([^}]*)}',1)   ubc -- urine occult blood (microscopic RBC)
      ,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"尿蛋白质"([^}]*)}',1)   tup -- urine protein
      ,regexp_extract(b.text_info,'"ckResult":"([^"]*)"([^}]*)"itemName":"(尿液管型|尿管型)([^"]*)"([^}]*)}',1)   bc -- urinary casts

      ,regexp_extract(b.text_info,'"kindName":"宫颈刮片([^"]*)"([^}]*)"summary":"([^"]*)"([^}]*)}',3)   css -- cervical smear
      ,regexp_extract(b.text_info,'"kindName":"内科([^"]*)"([^}]*)"summary":"([^"]*)"([^}]*)}',3)  iea -- internal medicine abnormal findings
      ,regexp_extract(b.text_info,'"kindName":"外科([^"]*)"([^}]*)"summary":"([^"]*)"([^}]*)}',3) sea -- surgical exam abnormal findings
  from tmp1 b
""").show()
  16. lateral view explode
  • Turning a single row into multiple rows
spark.sql("""
with tmp1 as (
select 1 id,78 score_english, 88 score_math, 97 score_chinese
union all 
select 2 id,48 score_english, 83 score_math, 67 score_chinese
union all
select 3 id,58 score_english, 89 score_math, 85 score_chinese
)
select id,adid,ad from tmp1
lateral view explode(map('english',score_english,'math',score_math,'chinese',score_chinese)) adtable as adid,ad
""").show
  • Columns to rows: implementing unpivot
spark.sql("""
with tmp1 as (
select 1 id
)
select id,t1 from tmp1
lateral view explode(array(1,2,3)) t as t1
""").show
select
    diagnose_code
    ,diagnose
    ,accident_ind
    ,disease_ind
    ,risk_type
    ,adid
    ,ad
from icd10_standard_disease_info_mapping
-- columns to rows
lateral view explode (map(
                         'D001',D001,'D002',D002,'D003',D003,'D004',D004,'D005',D005,'D006',D006,'D007',D007,'D008',D008,'D009',D009,'D010',D010
                        ,'D011',D011,'D012',D012,'D013',D013,'D014',D014,'D015',D015,'D016',D016,'D017',D017,'D018',D018,'D019',D019,'D020',D020
                        ,'D021',D021,'D022',D022,'D023',D023,'D024',D024,'D025',D025,'D026',D026,'D027',D027,'D028',D028,'D029',D029,'D030',D030
                        ,'D031',D031,'D032',D032,'D033',D033,'D034',D034,'D035',D035,'D036',D036,'D037',D037,'D038',D038,'D039',D039,'D040',D040
                        ,'D041',D041,'D042',D042,'D043',D043,'D044',D044,'D045',D045,'D046',D046,'D047',D047,'D048',D048,'D049',D049,'D050',D050
                        ,'D051',D051,'D052',D052,'D053',D053,'D054',D054,'D055',D055,'D056',D056,'D057',D057,'D058',D058,'D059',D059,'D060',D060
                        ,'D061',D061,'D062',D062,'D063',D063,'D064',D064,'D065',D065,'D066',D066,'D067',D067,'D068',D068,'D069',D069,'D070',D070
                        ,'D071',D071,'D072',D072,'D073',D073,'D074',D074,'D075',D075,'D076',D076,'D077',D077,'D078',D078,'D079',D079,'D080',D080
                        ,'D081',D081,'D082',D082,'D083',D083,'D084',D084,'D085',D085,'D086',D086,'D087',D087,'D088',D088,'D089',D089,'D090',D090
                        ,'D091',D091,'D092',D092,'D093',D093,'D094',D094,'D095',D095,'D096',D096,'D097',D097,'D098',D098,'D099',D099,'D100',D100
                        ,'M001',M001,'M002',M002,'M003',M003,'M004',M004,'M005',M005,'M006',M006,'M007',M007,'M008',M008,'M009',M009,'M010',M010
                        ,'M011',M011,'M012',M012,'M013',M013,'M014',M014,'M015',M015,'M016',M016,'M017',M017,'M018',M018,'M019',M019,'M020',M020
                        ,'M021',M021,'M022',M022,'M023',M023,'M024',M024)) adTable AS adid,ad
where pt = '${bizdate}000000'
    and ad=1
  17. "Parquet file not found" errors when querying a Parquet table.
    Check: 1. whether the insert…select statement contains duplicate column names;
    2. print the schema before writing parquet and compare it with the target table structure;
    3. the target table's path may need to be deleted and recreated:
    hadoop dfs -rm -r /hdfs/table
    hadoop dfs -mkdir /hdfs/table

  18. Building an auto-increment sequence.
    row_number() over() with no arguments generates an auto-increment column starting at 1.

insert into table User_Attribute select row_number() over() as id,customid from tbl_custom;

To continue incrementing from the previous maximum on the next run (sequence-like behavior), first compute the existing max(id) and attach it with join on (1=1):

insert into table User_Attribute 
select row_number() over() + id_max as id,customid 
 from tbl_custom a 
 join (select max(id) id_max from User_Attribute) b on (1 = 1);
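The row_number-plus-max pattern can be modeled in pure Python (illustration only; the function name is mine):

```python
def continue_sequence(existing_ids, new_rows):
    """Model of row_number() over() + max(id): number the new rows starting
    just past the current maximum id (0 when the table is empty in this sketch)."""
    base = max(existing_ids) if existing_ids else 0
    return [(base + i, row) for i, row in enumerate(new_rows, start=1)]
```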
  19. lag(col ignore nulls) --> replace with last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.

This is a Spark built-in function; Hive does not support it, but Spark SQL does.
last includes the current row, so combine it with rows between to restrict the frame.

 with tmp1 as (
 select 'hjl' name,'a' level,1 id 
 union all 
 select 'hjl' name,'' level,2 id 
 union all 
 select 'hjl' name,null level,3 id 
 union all 
 select 'hjl' name,'b' level,4 id 
union all 
 select 'hjl' name,'c' level,5 id 
 union all 
 select 'hjl' name,null level,6 id 
 union all 
 select 'hjl' name,'d' level,7 id 
  union all 
 select 'hjl' name,'' level,8 id  
 )
select a.*
      ,last(case when level = '' then null else level end,true) over(partition by name order by id rows between unbounded preceding and 1 preceding) last_level
  from tmp1 a 
;
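The windowed last(expr, true) over "unbounded preceding and 1 preceding" can be modeled in pure Python (illustration only; names are mine):

```python
def last_ignore_nulls_preceding(values, i):
    """Model of last(expr, true) over (... rows between unbounded preceding
    and 1 preceding): the last non-null value strictly before row i (0-based)."""
    for v in reversed(values[:i]):
        if v is not None:
            return v
    return None

# empty strings are first mapped to None, as the CASE WHEN in the query does
levels = [x if x != '' else None
          for x in ['a', '', None, 'b', 'c', None, 'd', '']]
```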


Spark Configuration

spark-submit --queue $1 --num-executors $2 --executor-cores $3 --executor-memory $4  --driver-memory 15G --driver-cores 24 --master yarn --deploy-mode $5  --class $6 $7 $8 $9

If num-executors * executor-memory is too large, the job cannot even be submitted.
