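Importing data from HDFS into Hive
Work scenario
The company's log records carry 20 to 30 fields each, and the kind of log that gets generated varies with the event type.
Solution design
Split the logs into several kinds by source, for example WEB_EVENT, APP_EVENT and WXAPP_EVENT. Records from the same source are guaranteed to share the same structure.
For example: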
{"account":"","appId":"cn.xxx","appVersion":"2.0","carrier":"小米移动","deviceId":"ZvRWCBGAuSaK","deviceType":"REDMI-6","eventId":"share","ip":"218.23.97.57","latitude":36.17641631906538,"longitude":120.39343589187808,"netType":"WIFI","osName":"android","osVersion":"7.5","properties":{"pageId":"165","productId":"419","shareMethod":"qq空间","title":"lWF eRR jFJ","url":"uvM/JyH"},"releaseChannel":"木蚂蚁安卓应用市场","resolution":"1024*768","sessionId":"bCjgwViU9vd","timeStamp":1598861486317}
{"account":"","appId":"cn.xxx","appVersion":"4.0","carrier":"中国联通","deviceId":"nNge28DXXwNC","deviceType":"LEPHONE-6","eventId":"share","ip":"107.249.206.150","latitude":34.253410875603905,"longitude":119.15852793581637,"netType":"3G","osName":"android","osVersion":"8.5","properties":{"pageId":"513","productId":"238","shareMethod":"微信朋友圈","title":"roY rVW Pur","url":"yMi/gpb"},"releaseChannel":"手机乐园","resolution":"2048*1024","sessionId":"X1gk6w5NHdB","timeStamp":1598861486697}
{"account":"","appId":"cn.xxx","appVersion":"2.0","carrier":"小米移动","deviceId":"szqwdecx78RA","deviceType":"MI-7","eventId":"adClick","ip":"27.214.63.168","latitude":30.343773977338717,"longitude":114.29445473137073,"netType":"3G","osName":"android","osVersion":"7.2","properties":{"adCampain":"18","adId":"16","adLocation":"9","pageId":"479"},"releaseChannel":"奇珀市场","resolution":"1024*768","sessionId":"sqO1197eqlu","timeStamp":1598861486815}
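For JSON data like this, an open-source plugin from GitHub parses each record when it is loaded into the warehouse and maps the JSON onto typed fields, which saves a lot of data-processing work later on.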
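To make the JSON-to-columns mapping concrete, here is a minimal Python sketch of the same idea. It is not the GitHub plugin mentioned above (the post does not name it), only an illustration. The column list is an assumption based on the columns visible in the result table, and `flatten_event` is a hypothetical helper name.

```python
import json

# Assumed subset of top-level columns, taken from the columns visible in the result table.
TOP_LEVEL_COLUMNS = ["account", "carrier", "deviceId", "deviceType", "eventId", "longitude"]


def flatten_event(line):
    """Parse one JSON log line into a flat row (hypothetical helper, not the plugin)."""
    record = json.loads(line)
    # Hive reports column names in lower case (deviceid, devicetype), so mirror that here.
    row = {name.lower(): record.get(name) for name in TOP_LEVEL_COLUMNS}
    # The keys inside "properties" differ per eventId, so keep them together
    # in one map-like column instead of one column per key.
    row["properties"] = {k: str(v) for k, v in record.get("properties", {}).items()}
    return row


sample = (
    '{"account":"","deviceId":"ZvRWCBGAuSaK","eventId":"share",'
    '"properties":{"pageId":"165","productId":"419"}}'
)
row = flatten_event(sample)
```

Fields missing from a record come back as `None`, which roughly corresponds to a NULL cell in the table.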
Result (anonymized)
+--------------------------+--------------------------+---------------------------+-----------------------------+-----------------------------------------+
| event_wxapp_log.account | event_wxapp_log.carrier | event_wxapp_log.deviceid | event_wxapp_log.devicetype | event_wxapp_log.evenxapp_log.longitude |
+--------------------------+--------------------------+---------------------------+-----------------------------+-----------------------------------------+
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | IPHONE-6 | share .53805466900734 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | MI-6 | adShow .86291585848379 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | IPHONE-6 | thumbup .53805466900734 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | MATE-X | submitOrder .43712824596625 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | MEIZU-ML7 | submitOrder .03285465597456 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | IPHONE-6 | adClick .53805466900734 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | MI-6 | share .86291585848379 |
| xxxxxxxxxxxxxxxx | 中国xx | ooooooooooo | MEIZU-ML7 | submitOrder .03285465597456 |
| xxxxxxxxxxxxxxxx | 腾讯xxx | ooooooooooo | IPHONE-9 | login .55698566076829 |
| | 中国xx | ooooooooooo | MATE-X | adClick .43712824596625 |
+--------------------------+--------------------------+---------------------------+-----------------------------+-----------------------------------------+
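The fields include array types, which makes later processing in the DWD layer easier.
To do
Behavior data has to be joined on a unique user identifier. The problem right now is that each source identifies users differently:
The APP uses mac + IMEI + system number + APP code
The web client uses CookieID
The WeChat mini program uses OPENID
Guaranteeing a strong association between logs and users is the part that still needs to be solved.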
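As a starting point, the three identifier schemes listed above could be normalized into one namespaced key per record. The sketch below assumes hypothetical field names (`mac`, `imei`, `system_no`, `app_code`, `cookie_id`, `openid`), since the post does not say how these values appear in the logs. It does not link the same person across sources; that would still need a separate mapping between keys.

```python
def user_key(source, record):
    """Build a source-prefixed user key (hypothetical helper and field names)."""
    if source == "APP_EVENT":
        # APP: mac + IMEI + system number + APP code, joined in a fixed order
        parts = [record["mac"], record["imei"], record["system_no"], record["app_code"]]
        return "app:" + "+".join(parts)
    if source == "WEB_EVENT":
        # Web client: CookieID
        return "web:" + record["cookie_id"]
    if source == "WXAPP_EVENT":
        # WeChat mini program: OPENID
        return "wx:" + record["openid"]
    raise ValueError("unknown source: " + source)


web_key = user_key("WEB_EVENT", {"cookie_id": "c-123"})
app_key = user_key(
    "APP_EVENT",
    {"mac": "aa:bb", "imei": "8600", "system_no": "s1", "app_code": "cn.xxx"},
)
```

The prefix keeps keys from different sources from colliding, so all three log kinds can share one user-key column until a real cross-source mapping exists.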