SQL数据分析
介绍
数据分析场景
- 小数据 excel
- 100w-千万级别 sql取数和计算 powerbi tableau画图
- 也可以用pandas处理
- 非常大 hivesql+hadoop分布式存储 spark分布式计算 推荐系统课程中
sql常见操作回顾
-
聚合函数
- avg min max count sum
- distinct 去重
-
groupby having筛选组
-
对于上述的复杂语句,执行顺序:
From->where->group by->having->select->order by- From(只执行一次)
- where
- 对原数据每一行进行筛选
- where中如果有多个条件,自右向左执行,我们为了追求性能,把能筛掉更多数据的条件放右边
- group by:将where筛选后的数据按照字段分成若干组 还没有对每一组进行计算
- having:根据组的条件筛选组
- select: 完成对每一组的计算 count sum avg
- orderby
-
子查询— 首先执行子查询 然后执行外查询
-
Where子查询 子查询放在where子句中 不支持limit
- 标量子查询—子查询返回一个值 > < =
- 多值子查询 --子查询返回多个值 结合 any和all >any >all in ()
-
select子查询–看成是表达式 每一行都会被执行一次
- 计算函数 count max
- 常量 123 “123”
- 子查询—子查询的结果就是值 就是表达式
-
from子查询 子查询放在from子句中 这个时候相当于是将子查询的结果看成是一张表 支持limit
-
关联子查询 --子查询需要用到外查询当前行的值
-
内查询先执行,执行完销毁内查询表之后再执行外查询
-
内查询可以使用外查询的字段,反之不行
-
-
EXISTS和NOT EXISTS
- EXISTS,内查询找到结果,外查询当前行保留,反之不保留
- NOT EXISTS:正好相反 ,内查询没有返回结果,外查询当前行保留,否则不保留
-
-
连接查询
- 内连接、外连接、左链接、右连接、 笛卡尔积—没有链接字段、不等连接
-
合并 union, union all
- union 去重 union all不去重
- 前后查询字段数要一致
- union前面的语句不能排序
先合并,后排序 不是先排序再合并
聚合函数使用
1.计算价格的平均值
select avg(Price) as avg_price from house_price
2.计算数据的总量
select count(*) from house_price #计算的是所有数据,包含空的行
select count(price) from house_price #计算的是price列中非空的行数
3.统计Bedrooms中不同的取值
select DISTINCT Bedrooms from house_price
4.统计Price中不同的取值的个数
select count(DISTINCT Price) from house_price
分组
1.找出house_price中Price不同取值的每个取值的个数
select Price,count(*) from house_price group by Bedrooms
注:MySQL 版本>5.6的问题
--[Err] 1055 - Expression #1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column 'information_schema.PROFILING.SEQ' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
--- 问题原因 sql_mode=only_full_group_by
-- 解决方案
SET sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));
2. 找出house_price中不同取值(例如:取值为2,则为两室,为3则为3室)的Bedrooms的每个取值的房屋总价格
select Bedrooms,sum(price) from house_price group by Bedrooms;
3.取别名后,可以以别名进行分组
select Bedrooms as br,sum(price) from house_price group by br;
4.找出house_price中不同取值的Bedrooms的房屋中每个取值对应的行数大于20个的房屋及其对应的总价格
select Bedrooms,sum(price) from house_price group by Bedrooms having count(*)>20;
5.找出house_price中不同取值的Bedrooms的房屋中每个取值对应的行数大于20个的房屋及其对应的总价格,并将数据按照总价格的升序排序
select Bedrooms,sum(price) from house_price group by Bedrooms having count(*)>20 order by sum(Price);
6.复杂SQL语句的执行顺序
select Bedrooms,count(*) from house_price where Bedrooms=2 group by Bedrooms having count(*)>20 order by Bedrooms;
执行顺序:
From->where->group by->having->select->order by
子查询
1.找出比平均价格高的物品
select Price from house_price where Price>(select AVG(Price) from house_price);
2.找出一些房屋信息和房屋的均价
select Price,(select AVG(Price) from house_price) as avg_price from house_price;
select Price,123 as avg_price from house_price;
3. 以Neighborhood和Bedrooms分组,统计分组后均价大于整体价格均价的Neighborhood、Bedrooms和分组后的均价数据
select Neighborhood,Bedrooms,AVG(Price) from
house_price group by Neighborhood,Bedrooms
having AVG(Price)>(select AVG(Price)
from house_price);
4.找出所有房屋中Bedrooms数量相同的价格大于以Neighborhood为分组的房屋均价的房屋
select Neighborhood,Bedrooms from house_price as a where Price>ALL(select AVG(Price) from house_price as b where a.Bedrooms=b.Bedrooms group by Neighborhood);
#EXISTS和NOT EXISTS,EXISTS关键字值关心子查询中返回数据的真假而不是具体的内容
select * from house_price as a where EXISTS (select * from info as b where b.Bedrooms=a.Bedrooms);
多表联结查询
1. 内连接
合并两张表,在两张表中分别获取部分数据
select shop_id,shop_name,sp.product_id,quantity,product_name,product_type,sale_price from product as p inner join shopproduct as sp on p.product_id=sp.product_id
# 起别名更稳妥
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p
inner join
shopproduct as sp
on p.product_id=sp.product_id
2. 左外连接:保留左边的表中全部信息,右边表中的对应数据为空则不保留。
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p
left outer join
shopproduct as sp
on p.product_id=sp.product_id
3.右外连接:保留右边的表中全部信息,左边边表中的对应数据为空则不保留
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p
right outer join
shopproduct as sp
on p.product_id=sp.product_id
4.交叉连接
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p cross join shopproduct as sp
交叉连接(CROSS JOIN):也称迪卡尔积
概念:不带WHERE条件子句,它将会返回被连接的两个表的笛卡尔积,返回结果的行数等于两个表行数的乘积;如果带where,返回或显示的是匹配的行数。
组合查询
- 多次查询
select product_id,product_name,product_type from product where product_id<3
union
select product_id,product_name,product_type from product where product_type="厨房用具"
- 简写为:
select product_id,product_name,product_type from product where product_type="厨房用具" or product_id<3
- 使用UNION_ALL的效果:
select product_id,product_name,product_type from product where product_id<5
union all
select product_id,product_name,product_type from product where product_type="厨房用具"
order by关键字只能在最后使用,而不能再union或union all之前使用
select product_id,product_name from product1 where product_id<5
union all
select product_id,product_name,product_type from product1 where product_type="厨房用具"
order by product_id
- 使用union和使用union all效果不一致,union all不会将和并后的数据去重
- union或union all前后查询出的字段数量必须一致
- order by关键字只能在最后使用,而不能再union或union all之前使用
CASE表达式
例子
1.简单case表达式写法
select
case Bedrooms
when 2 then 'small'
when 3 then 'middle'
when 4 then 'large'
else 'largest'
end as BedroomsDesc
from house_price
2.搜索case表达式写法
select
case
when Bedrooms=2 then 'small'
when Bedrooms=3 then 'middle'
when Bedrooms=4 then 'large'
else 'largest'
end as BedroomsDesc
from house_price
例子:
1.求出每个公司的男性的总数
select company,sum(num)
from company_profile
where gender="M"
group by company;
2.求出每个公司的女性的总数
select company,sum(num)
from company_profile
where gender="F"
group by company;
3.求出每个公司男性和女性的总数,需要将上述两个表进行合并
select M.company,M.num_M,F.num_F from
(select company,sum(num) as num_M
from company_profile
where gender="M"
group by company) as M -- 筛选出女性列表命名M
inner join
(select company,sum(num) as num_F
from company_profile
where gender="F"
group by company) as F -- 筛选出男性列表命名F
on M.company=F.company -- 连接筛选出男女的连接条件
4. 使用case表达式
select company,
sum(case when gender="M" then num
else 0
end) as num_M
,sum(case when gender="F" then num
else 0
end) as num_F
from company_profile
group by company;
if函数
- if(表达式,真,假)
- 配合使用
- sum(if()) 返回0
- count(if()) 返回null
- avg(if()) 返回null
利用SQL对淘宝用户行为进行数据分析
数据导入与清洗
数据导入与选择子集
列名重命名:对列名进行重命名
删除重复值
SELECT * FROM userbehavior GROUP BY User_ID,Item_ID,Category_ID,Behavior,Timestamp
HAVING COUNT(*)>1;
去重代码
DELETE FROM 表名 WHERE (字段1,字段2,字段3)
IN
(SELECT 字段1,字段2,字段3 FROM (SELECT 字段1,字段2,字段3 FROM 表名 GROUP BY 字段1,字段2,字段3
HAVING COUNT(*)>1) s1)
AND
id NOT IN (SELECT id FROM (SELECT id FROM 表名 GROUP BY 字段1,字段2,字段3
HAVING COUNT(*)>1) s2);
缺失值处理-对所有列进行计数查询,查看是否有缺失值
SELECT count(User_ID),count(Item_ID),count(Category_ID),count(Behavior),count(Timestamp)
FROM userbehavior;
一致化处理Timestamp列无法直接进行分析,需要对其列转化为三列,分别为时间,日期,小时
-- 添加新列Date_time,根据Timestamp返回日期时间
ALTER TABLE userbehavior add COLUMN Date_time TIMESTAMP(0) NULL;
UPDATE userbehavior
set Date_time = FROM_UNIXTIME(`Timestamp`);
-- 添加新列Date,根据Timestamp列返回日期
ALTER TABLE userbehavior ADD COLUMN Date char(10) NULL;
UPDATE userbehavior
SET Date = FROM_UNIXTIME(`Timestamp`,'%y-%m-%d');
-- 添加新列time,根据Timestamp列返回时间
ALTER TABLE userbehavior ADD COLUMN Time char(10) null;
update userbehavior
set Time = SUBSTRING(Date_time FROM 12 FOR 2);
异常值处理
-- 检查是否有不在2017-11-25至2017-12-03这个时间段内的时间值
SELECT MAX(Date_time),MIN(Date_time) FROM userbehavior;
-- 把不在2017-11-25至2017-12-03这个时间段内的异常时间值删除
DELETE FROM userbehavior where Date >'2017-12-03' or Date <'2017-11-25';
-- 再次检查是否存在不在2017-11-25至2017-12-03这个时间段内的异常值
SELECT MAX(Date_time),MIN(Date_time) FROM userbehavior;
导入的数据集大小概览
SELECT count(DISTINCT User_ID) as 用户数,
count(DISTINCT Item_ID) as 商品数量,
count(DISTINCT Category_ID) as 商品类目数量,
count(Behavior) as 行为数量
FROM userbehavior;
用户行为分析
- 总体UV、PV、人均浏览次数、成交量
- 日均UV、PV、人均浏览次数、成交量—看趋势
- 复购率和跳出率
数据总体情况
总体UV、PV、人均浏览次数、成交量
select COUNT(DISTINCT User_ID) AS '独立访客数',
sum( case when Behavior='pv' then 1 else 0 END) as '点击数',
sum( case when Behavior='pv' then 1 else 0 END)/COUNT(DISTINCT User_ID) as '人均浏览次数',
sum(case when Behavior ='buy' then 1 else 0 END)as '成交量'
from userbehavior;
日均UV、PV、人均浏览次数、成交量
select Date,COUNT(DISTINCT User_ID) AS '日访客数',
sum( case when Behavior='pv' then 1 else 0 END) as '日点击数',
sum( case when Behavior='pv' then 1 else 0 END)/COUNT(DISTINCT User_ID) as '日人均浏览次数',
sum(case when Behavior ='buy' then 1 else 0 END)as '日成交量'
from userbehavior
GROUP BY Date
ORDER BY Date;
用户整体行为数据
create view 用户行为数据 AS
select User_ID,COUNT(Behavior) as '用户行为数',
sum(case when Behavior ='pv' then 1 else 0 end ) as '点击',
sum(case when Behavior ='fav' then 1 else 0 end) as '收藏',
sum(case when Behavior ='cart' then 1 else 0 end) as '加购',
sum(case when Behavior ='buy' then 1 else 0 end) as '购买'
from userbehavior
GROUP BY User_ID
ORDER BY 用户行为数 DESC;
用户的复购率和跳出率
-- 复购用户、复购率
SELECT
sum(case when 购买>1 then 1 else 0 end ) as 复购次数,
sum(case when 购买>0 then 1 else 0 end ) as 购买次数,
CONCAT(ROUND(sum(case when 购买>1 then 1 else 0 end)*100/ sum(case when 购买>0 then 1 else 0 end),2),'%') as 复购率
from 用户行为数据;
-- 跳出率=只浏览了一个页面的访客数/总访客数
select COUNT(*) from (
SELECT User_ID from userbehavior
GROUP BY User_ID HAVING COUNT(Behavior)=1)as a;
用户行为转化情况分析
用户行为转化情况分析
- 行为转化 pv/pv
- 用户转化 uv/uv
- 一定要注意行为路径的分析
流量行为转化漏斗分析
select behavior,count(*) as 数量 from UserBehavior
group by behavior order by 数量 desc
用户转化漏斗分析
-- 用户留存分析(漏斗转化)
select sum(case when 点击>0 then 1 else 0 end)as 点击用户数,
sum(case when 收藏>0 then 1 else 0 end) as 收藏用户数,
sum(case when 加购>0 then 1 else 0 end) as 加购用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据;
-- 点击——购买的留存分析
select
sum(case when 点击>0 then 1 else 0 end) as 点击用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 收藏=0 and 加购=0;
-- 点击——加购——购买的留存分析
select
sum(case when 加购>0 then 1 else 0 end) as 加购用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 收藏=0 and 加购<>0;
-- 点击——收藏——购买的留存分析
select
sum(case when 收藏>0 then 1 else 0 end) as 收藏用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 加购=0 and 收藏<>0;
-- 点击——收藏——加购——购买的留存分析
select
sum(case when 加购>0 then 1 else 0 end) as 加购用户数,
sum(case when 收藏>0 then 1 else 0 end) as 收藏用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 加购<>0 and 收藏<>0;
用户行为习惯分析
2017年11月25日至12月3日间的用户行为习惯分布
create view 用户行为习惯分布(天) AS
select Date,COUNT(Behavior) as '用户行为数',
sum(case when Behavior ='pv' then 1 else 0 end ) as '点击',
sum(case when Behavior ='fav' then 1 else 0 end) as '收藏',
sum(case when Behavior ='cart' then 1 else 0 end) as '加购',
sum(case when Behavior ='buy' then 1 else 0 end) as '购买'
from userbehavior
GROUP BY Date
ORDER BY Date;
一天中的用户行为习惯分布
-- 用户行为习惯分布(时)
create view 用户行为习惯分布(时) AS
select Time,COUNT(Behavior) as '用户行为数',
sum(case when Behavior ='pv' then 1 else 0 end ) as '点击',
sum(case when Behavior ='fav' then 1 else 0 end) as '收藏',
sum(case when Behavior ='cart' then 1 else 0 end) as '加购',
sum(case when Behavior ='buy' then 1 else 0 end) as '购买'
from userbehavior
GROUP BY Time
ORDER BY Time;
用户类目偏好分析
商品销售情况分析
-- 商品销售情况
create view 商品销售情况 AS
select a.购买次数,COUNT(a.Item_ID) as 商品数
FROM
(select Item_ID,COUNT(User_ID) as 购买次数
from userbehavior
where Behavior='buy'
GROUP BY Item_ID) as a
GROUP BY 购买次数
ORDER BY 购买次数 DESC;
-- 商品销售整体情况
-- 总商品数
select COUNT(DISTINCT Item_ID)
from userbehavior;
-- 购买的商品数
select COUNT(DISTINCT Item_ID)
from userbehavior
where Behavior='buy';
商品品类浏览top20
-- 商品品类点击数据
select Category_ID,
sum(case when Behavior ='buy' then 1 else 0 end) as '点击'
from userbehavior
GROUP BY Category_ID
ORDER BY 点击 DESC
limit 20;