数据推断
学习目标
- 知道数据推断的使用场景
- 使用SQL完成关键指标推断逻辑
1 业务背景:缺失关键指标
-
疫情期间,一家印度外卖餐厅想通过数据分析,数据挖掘提升销量,但是在历史数据中缺少了很重要的一个维度,用餐人数
- 每单客人的就餐人数这一信息对于我们优化服务流程,设计套餐都有很大的指导意义
- 从现在开始改造出餐系统,添加就餐人数字段,但是涉及工作量较大(前后端开发,数据库调整),并且改造完成后需要几个月之后积累了一定的数据量可以进行进一步分析
-
从现有的数据中可以推理出每单的就餐人数,虽然会存在一定误差,但可以利用全部的历史数据,快速上线
-
下图是外卖软件提供的选项,这里餐具要多少份就是商家依据餐量提供
2 业务数据介绍
- 有两家餐厅的业务数据
- 订单表:
Order_Number
订单编号Order_Date
订单日期Item_Name
商品名称Quantity
商品数量Product_Price
商品价格Total_products
订单商品数量restaurant_no
连锁店编号
Order_Number | Order_Date | Item_Name | Quantity | Product_Price | Total_products | restaurant_no |
---|---|---|---|---|---|---|
16118 | 03/08/2019 20:25 | Plain Papadum | 2 | 0.8 | 6 | 1 |
16118 | 03/08/2019 20:25 | King Prawn Balti | 1 | 12.95 | 6 | 1 |
16118 | 03/08/2019 20:25 | Garlic Naan | 1 | 2.95 | 6 | 1 |
16118 | 03/08/2019 20:25 | Mushroom Rice | 1 | 3.95 | 6 | 1 |
16118 | 03/08/2019 20:25 | Paneer Tikka Masala | 1 | 8.95 | 6 | 1 |
16118 | 03/08/2019 20:25 | Mango Chutney | 1 | 0.5 | 6 | 1 |
16117 | 03/08/2019 20:17 | Plain Naan | 1 | 2.6 | 7 | 1 |
16117 | 03/08/2019 20:17 | Mushroom Rice | 1 | 3.95 | 7 | 1 |
16117 | 03/08/2019 20:17 | Tandoori Chicken (1/4) | 1 | 4.95 | 7 | 1 |
16117 | 03/08/2019 20:17 | Vindaloo - Lamb | 1 | 7.95 | 7 | 1 |
16117 | 03/08/2019 20:17 | Chapati | 1 | 1.95 | 7 | 1 |
16117 | 03/08/2019 20:17 | Lamb Tikka | 1 | 4.95 | 7 | 1 |
16117 | 03/08/2019 20:17 | Saag Paneer | 1 | 5.95 | 7 | 1 |
16116 | 03/08/2019 20:09 | Aloo Chaat | 1 | 4.95 | 5 | 1 |
- 菜单表:
- 商品名称
- 商品价格
item_name | product_price | restaurant_id |
---|---|---|
Achar Chicken | 8.95 | 2 |
Achar Lamb | 8.95 | 2 |
Aloo Brinjal | 5.95 | 2 |
Aloo Chaat | 4.95 | 1 |
Aloo Chaat | 4.95 | 2 |
Aloo Dupiaza | 5.95 | 2 |
Aloo Gobi | 5.95 | 1 |
Aloo Gobi | 5.95 | 2 |
Aloo Methi | 5.95 | 1 |
Aloo Mithy | 5.95 | 2 |
Aloo Peas | 5.95 | 2 |
3 基本思路
-
如何根据上述数据确定就餐人数?
- 我们可以将菜单数据划分为几类,根据不同类别食物的点单量来推断就餐人数
- 主食:Naan(囊), Rice(米饭)
- 饮料:Cola(可乐)COBRA(啤酒)
- 小吃:samosa(咖喱角)Seekh Kebab (烤串)
-
酱料:pickle(腌菜) Chutney
-
主菜: LAMB TIKKA 烤羊肉 Prawn Karahi Masala 咖喱虾
-
确定推断规则:
- 1份主食对应1人 (1:1)
- 1份主菜对应1人 (1:1)
- 1份饮料对应1人(1:1)
- 2份小食对应1人 (2:1)
- 酱料,佐料不做计算(n:0)
-
规则验证:推断就餐人数
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-2lahNEnJ-1631666622971)(img\indian_food1.png)]
- 主食一份:naan(1:1) , 饮料一份:酸奶(1:1), 菜酱料 4
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-WM0svMfK-1631666622973)(img\indian-food2.png)]
-
主食:两份米饭+一份馕饼=3人份
-
饮料:一份酸奶=1人份
-
主菜: 3个 = 3人份
-
小食:1份脆饼+6份菜=0.5*7=3.5人份
-
佐料&配菜:2份酱料=n*0=0人份
-
按照四类中取最大值的原则,我们可得这顿饭的就餐人数为
Max(3人份,3人份,3.5人份,0人份)=3.5人,由于人数不能出现小数,向下取整 = 3人
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-GQRayo7J-1631666622974)(img\bear.png)]
- 上面一餐算几人?
- 主食:没有
- 饮料:40瓶啤酒
- 小食:……
- 酱料:n*0
- 按照上述原则 上面一餐推断出来的人数为40人——如果专门以喝酒为目的的聚餐,上述规则推断出来的出错概率很高
- 这种极端的例子可以验证在数据中的占比,如果占比不大则无太大影响
4 代码实现
-
使用SQL批量处理数据实现数据推断逻辑
-
SQL作业流程
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-QKiQqh8d-1631666622976)(img\sql2.png)]
-
原始数据中缺少菜品类别标签,需要先添加菜品类别标签
-
使用
LOCATE
函数 进行字符串匹配,使用CASE WHEN
语句判断不同条件,创建新字段food_category
LOCATE 函数介绍:
LOCATE(substr,str)
返回 substr 在 str中的位置- substr, 要查询的子串
- str 字段名字 或者 字符串
DROP TABLE IF EXISTS food_category_table; CREATE TABLE food_category_table AS SELECT Item_Name AS item_name, `Product_Price` AS price, Restaurant_id AS restaurant_id, CASE WHEN locate( 'Dahi', Item_Name )> 0 THEN '酸奶' WHEN locate( 'wine', Item_Name )> 0 OR locate( 'COBRA', Item_Name )> 0 THEN '酒' WHEN locate( 'water', Item_Name )> 0 OR locate( 'Coke', Item_Name )> 0 OR locate( 'Lemonade', Item_Name )> 0 THEN '饮料' WHEN locate( 'Rice', Item_Name )> 0 THEN '米饭' WHEN locate( 'Chapati', Item_Name )> 0 OR locate( 'Paratha', Item_Name )> 0 OR locate( 'Naan', Item_Name )> 0 OR locate( 'roti', Item_Name )> 0 OR locate( 'Papadum', Item_Name )> 0 THEN '饼' WHEN locate ( 'Main', Item_Name )> 0 THEN '主菜' WHEN locate ( 'Chaat', Item_Name )> 0 OR locate( 'Muttar', Item_Name )> 0 THEN '小吃' WHEN locate( 'Chicken', Item_Name )> 0 THEN '鸡肉类' WHEN locate( 'Lamb', Item_Name )> 0 THEN '羊肉类' WHEN locate( 'Fish', Item_Name )> 0 THEN '鱼肉类' WHEN locate( 'Prawn', Item_Name )> 0 OR locate( 'Jinga', Item_Name )> 0 THEN '虾类' WHEN locate( 'Pakora', Item_Name )> 0 THEN '炸素丸子' WHEN LOCATE( 'Saag', Item_Name )> 0 THEN '绿叶菜胡胡' WHEN locate( 'Paneer', Item_Name )> 0 THEN '芝士菜' WHEN locate( 'Pickle', Item_Name )> 0 OR locate( 'Chutney', Item_Name )> 0 THEN '腌菜' WHEN locate( 'Aloo', Item_Name )> 0 THEN '土豆类' WHEN locate( 'Salad', Item_Name )> 0 THEN '沙拉' WHEN locate( 'Tikka', Item_Name )> 0 THEN '烤串' WHEN locate( 'Chana', Item_Name )> 0 THEN '豆类' WHEN locate( 'Dall', Item_Name )> 0 OR locate( 'Hari Mirch', Item_Name )> 0 THEN '素菜' WHEN locate( 'Puree', Item_Name )> 0 THEN '胡胡' WHEN locate( 'Raitha', Item_Name )> 0 OR locate( 'Raita', Item_Name )> 0 THEN '酸奶沙拉' WHEN locate( 'French Fries', Item_Name )> 0 THEN '炸薯条' WHEN locate( 'Samosa', Item_Name )> 0 THEN '咖喱角' WHEN locate( 'Kehab', Item_Name )> 0 OR locate( 'Kebab', Item_Name )> 0 THEN '烤串(小食)' WHEN locate( 'Bhajee', Item_Name )> 0 OR locate( 'Bhaji', Item_Name )> 0 THEN '油炸蔬菜团' WHEN locate( 'Mushroom', Item_Name )> 0 OR locate( 'Vegetable', Item_Name )> 0 THEN '蔬菜' WHEN locate( 'Starter', Item_Name )> 0 THEN '开胃小吃' WHEN locate( 'Sauce', Item_Name )> 0 THEN '酱' ELSE '咖喱菜' END AS food_category FROM restaurant_products_price;
-
-
首先看一下产品明细表,知道我们数据库内现有的产品划分,找到与我们推理时用到的分类之间的差异。
SELECT count( DISTINCT item_name ) AS item_num, food_category FROM food_category_table GROUP BY food_category ORDER BY item_num DESC;
-
在现有的产品明细表里重新定义一列字段,作为我们就餐人数判断的产品分类
drop table if exists food_type; create table food_type as SELECT item_name, price, restaurant_id, food_category, CASE WHEN food_category IN ( '鸡肉类', '羊肉类', '虾类', '咖喱菜', '鱼肉类', '主菜', '芝士菜' ) THEN '主菜' WHEN food_category IN ( '饼', '米饭' ) THEN '主食' WHEN food_category IN ( '饮料', '酒', '酸奶' ) THEN '饮料' WHEN food_category IN ( '酱', '腌菜' ) THEN '佐料' ELSE '小食' END AS food_type FROM food_category_table;
-
再来看一下重新定义过后产品的分类情况
SELECT count( DISTINCT item_name ) AS item_num, food_type, food_category FROM food_type GROUP BY food_type, food_category ORDER BY food_type, food_category, item_num DESC;
-
将交易明细表与我们刚定义的产品明细表进行关联,把产品分类和价格加入明细当中
SELECT a.*, b.food_type, b.price FROM restaurant_orders a JOIN food_type b ON a.`Item_Name` = b.item_name AND a.Restaurant_Id = b.restaurant_id;
-
目前一行记录是一笔订单的一种产品的售卖情况,如果一笔订单有若干样产品会分成若干行,我们希望把一笔订单的详情,从多行统一到一行中,同时用我们事先定义的系数计算。
select a.`Order_Number`,a.`Order_Date`,a.restaurant_id,round(sum(a.Quantity*b.price),2) as total_amount, sum(case when food_type='主食' then a.Quantity*1 else 0 end) as staple_food_count, sum(case when food_type='主菜' then a.Quantity*1 else 0 end) as main_course_count, sum(case when food_type='饮料' then a.Quantity*1 else 0 end) as drink_count, sum(case when food_type='小食' then a.Quantity*0.5 else 0 end) as snack_count from restaurant_orders a join food_type b on a.`Item_Name`=b.item_name and a.Restaurant_Id=b.restaurant_id group by a.`Order_Number`,a.`Order_Date`,a.Restaurant_Id;
-
比较主食,主菜,饮料,小食中的最大者
select c.*,GREATEST(c.staple_food_count,c.main_course_count,c.drink_count,c.snack_count) as max_count from (select a.`Order_Number`,a.`Order_Date`,a.restaurant_id,round(sum(a.Quantity*b.price),2) as total_amount, sum(case when food_type='主食' then a.Quantity*1 else 0 end) as staple_food_count, sum(case when food_type='主菜' then a.Quantity*1 else 0 end) as main_course_count, sum(case when food_type='饮料' then a.Quantity*1 else 0 end) as drink_count, sum(case when food_type='小食' then a.Quantity*0.5 else 0 end) as snack_count from restaurant_orders a join food_type b on a.`Item_Name`=b.item_name and a.Restaurant_Id=b.restaurant_id group by a.`Order_Number`,a.`Order_Date`,a.Restaurant_Id) c;
-
最后,增加向下取整的逻辑,并且确保最小就餐人数为1
select c.*, GREATEST(floor(GREATEST(c.staple_food_count,c.main_course_count,c.drink_count,c.snack_count)),1) as customer_count from (select a.`Order_Number`,a.`Order_Date`,a.restaurant_id,round(sum(a.Quantity*b.price),2) as total_amount, sum(case when food_type='主食' then a.Quantity*1 else 0 end) as staple_food_count, sum(case when food_type='主菜' then a.Quantity*1 else 0 end) as main_course_count, sum(case when food_type='饮料' then a.Quantity*1 else 0 end) as drink_count, sum(case when food_type='小食' then a.Quantity*0.5 else 0 end) as snack_count from restaurant_orders a join food_type b on a.`Item_Name`=b.item_name and a.Restaurant_Id=b.restaurant_id group by a.`Order_Number`,a.`Order_Date`,a.Restaurant_Id) c;
-
新建一张表格,将两家店每一笔交易的时间,金额,就餐人数,主食,饮料,小食数存储进去
DROP TABLE IF EXISTS restaurants_orders_customer_count; CREATE TABLE restaurants_orders_customer_count AS SELECT c.*, GREATEST( floor( GREATEST( c.staple_food_count, c.main_course_count, c.drink_count, c.snack_count )), 1 ) AS customer_count FROM (SELECT a.`Order_Number`, a.`Order_Date`, a.restaurant_id, round( sum( a.Quantity * b.price ), 2 ) AS total_amount, sum( CASE WHEN food_type = '主食' THEN a.Quantity * 1 ELSE 0 END ) AS staple_food_count, sum( CASE WHEN food_type = '主菜' THEN a.Quantity * 1 ELSE 0 END ) AS main_course_count, sum( CASE WHEN food_type = '饮料' THEN a.Quantity * 1 ELSE 0 END ) AS drink_count, sum( CASE WHEN food_type = '小食' THEN a.Quantity * 0.5 ELSE 0 END ) AS snack_count FROM restaurant_orders a JOIN food_type b ON a.`Item_Name` = b.item_name AND a.Restaurant_Id = b.restaurant_id GROUP BY a.`Order_Number`, a.`Order_Date`, a.Restaurant_Id ) c;
-
检查一下之前担心的饮料过多导致的推测异常占比
SELECT count( CASE WHEN drink_count >= 5 THEN `Order_Number` ELSE NULL END ) AS outlier_count, count(*) AS total_count, round( count( CASE WHEN drink_count >= 5 THEN `Order_Number` ELSE NULL END )/ count(*), 5 ) AS outlier_rate FROM restaurants_orders_customer_count;
outlier_count total_count outlier_rate 13 33055 0.0004 -
有了就餐人数之后,我们还可以进行进一步分析
SELECT restaurant_id, avg( customer_count ) AS avg_cc, avg( total_amount ) AS ta, avg( total_amount / customer_count ) AS avg_scc, avg( staple_food_count / customer_count ) AS avg_staple, avg( main_course_count / customer_count ) AS avg_main, avg( drink_count / customer_count ) AS avg_drink, avg( snack_count / customer_count ) AS avg_snake FROM restaurants_orders_customer_count group by restaurant_id;
5 结果解读
- 有了就餐人数之后,我们能够增添许多新KPI,例如人均花费,人均产品数等
- 根据新增的就餐人数,我们可以学习每天各个时段的用餐场景。例如什么时候最有可能一人食,什么时候最有可能2人约饭,什么时候最有可能多人聚餐等。
- 对于就餐人数比较固定的用户,我们是否可以考虑推出一些定制化的套餐或者促销产品,促进每个餐期的营业表现?
6 小结
- 现实生活中,我们会遇到一些问题,问题的答案不能直接获取,但可以推理得到,例如:
- 煎饼摊的老板没仔细算过每天能做多少个煎饼,但通过用了多少面粉可以推算出来
- 突然有人问你今天是几号,你一时不记得但转念想到明天预约了植树节的亲子活动。
- 数据推断的价值:
- 数据本身就是资产,而获取新数据的成本不断增加,数据推断的价值愈发突出
- 数据推断是基于已经存在的数据资产,发掘出来的“新”数据,相当于是已较低的成本增加了数据资产