我们先看看生产上的两张表:
1.用户cpu信息表:
create table test.test_user as
select '10001' as uid,'AMD A8-5600K APU with Radeon(tm) HD Graphics _AMD64 Family 21 Model 16 Stepping 1_AuthenticAMD' as cpu
union all
select '10002' as uid,'Intel(R) Core(TM) i3-7100U CPU @ 2.40GHz_Intel64 Family 6 Model 142 Stepping 9_GenuineIntel' as cpu
union all
select '10003' as uid,'Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz_Intel64 Family 6 Model 58 Stepping 9_GenuineIntel' as cpu
union all
select '10004' as uid,'AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G _AMD64 Family 21 Model 48 Stepping 1_AuthenticAMD' as cpu;
2.cpu评分配置表:
create table test.test_cpu_config as
select 'A8-5600K' as cpu,2 as cpu_grade
union all
select 'i3-7100' as cpu,2 as cpu_grade
union all
select 'E3-1230' as cpu,3 as cpu_grade
union all
select 'A10-7800' as cpu,2 as cpu_grade;
注:这两张表都是从生产环境中截取的部分有代表性的数据
问题:
现有一张用户cpu信息表,一张cpu评分配置表,需要通过cpu评分表将用户的cpu打上评分的标签
分析:
用户cpu信息:
AMD A8-5600K APU with Radeon(tm) HD Graphics _AMD64 Family 21 Model 16 Stepping 1_AuthenticAMD
配置cpu信息:
A8-5600K
- 用户表cpu信息比较复杂且没有规律,里面包含部分配置表cpu信息
- 若通过截取cpu信息得到配置表相似的cpu信息比较困难
- 可以通过模糊匹配方式关联去解决,但模糊匹配方式关联是否可行存在疑问
测试代码:
关联条件:
on t1.cpu like concat('%',t2.cpu,'%')
spark-sql> select
> t1.uid,
> t1.cpu,
> t2.cpu,
> t2.cpu_grade
> from
> (select
> uid,
> cpu
> from test.test_user) t1
>
> left join
> (select
> cpu,
> cpu_grade
> from test.test_cpu_config) t2
> on t1.cpu like concat('%',t2.cpu,'%');
10001 AMD A8-5600K APU with Radeon(tm) HD Graphics _AMD64 Family 21 Model 16 Stepping 1_AuthenticAMD A8-5600K 2
10002 Intel(R) Core(TM) i3-7100U CPU @ 2.40GHz_Intel64 Family 6 Model 142 Stepping 9_GenuineIntel i3-7100 2
10003 Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz_Intel64 Family 6 Model 58 Stepping 9_GenuineIntel E3-1230 3
10004 AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G _AMD64 Family 21 Model 48 Stepping 1_AuthenticAMD A10-7800 2
Time taken: 1.795 seconds, Fetched 4 row(s)
注:查看结果,用户cpu信息全部打上了评分的标签,方案可行
总结: 上述问题在实际操作中可能存在很多的坑
例如:A8-5600K
可以模糊匹配到A8-5600
和A8-5600K
,所以我们在关联的时候可以根据实际数据进行调整,例如:数据...AMD A8-5600K APU...
中A8-5600K
两边有空格,我们模糊匹配条件可以改成这样on t1.cpu like concat('%',t2.cpu,'','%')