数据分析-sql基本操作

最新推荐文章于 2024-07-02 20:06:37 发布

White Root

最新推荐文章于 2024-07-02 20:06:37 发布

阅读量586

点赞数

分类专栏：数据分析文章标签：数据库数据分析 sql

本文链接：https://blog.csdn.net/root1228544694/article/details/106897812

版权

数据分析专栏收录该内容

5 篇文章 1 订阅

订阅专栏

SQL数据分析

介绍
sql常见操作回顾
利用SQL对淘宝用户行为进行数据分析
- 数据导入与清洗
用户行为分析

介绍

数据分析场景

小数据 excel
100w-千万级别 sql取数和计算 powerbi tableau画图
- 也可以用pandas处理
非常大 hivesql+hadoop分布式存储 spark分布式计算推荐系统课程中

sql常见操作回顾

聚合函数
- avg min max count sum
- distinct 去重
groupby having筛选组
对于上述的复杂语句，执行顺序：
From->where->group by->having->select->order by
- From(只执行一次)
- where
  - 对原数据每一行进行筛选
  - where中如果有多个条件,自右向左执行,我们为了追求性能,把能筛掉更多数据的条件放右边
- group by:将where筛选后的数据按照字段分成若干组还没有对每一组进行计算
- having:根据组的条件筛选组
- select: 完成对每一组的计算 count sum avg
- orderby
子查询— 首先执行子查询然后执行外查询
- Where子查询子查询放在where子句中不支持limit
  - 标量子查询—子查询返回一个值 > < =
  - 多值子查询 --子查询返回多个值结合 any和all >any >all in ()
- select子查询–看成是表达式每一行都会被执行一次
  - 计算函数 count max
  - 常量 123 “123”
  - 子查询—子查询的结果就是值就是表达式
- from子查询子查询放在from子句中这个时候相当于是将子查询的结果看成是一张表支持limit
- 关联子查询 --子查询需要用到外查询当前行的值
  - 内查询先执行,执行完销毁内查询表之后再执行外查询
  - 内查询可以使用外查询的字段,反之不行
- EXISTS和NOT EXISTS
  - EXISTS,内查询找到结果,外查询当前行保留,反之不保留
  - NOT EXISTS:正好相反 ,内查询没有返回结果,外查询当前行保留,否则不保留
连接查询
- 内连接、外连接、左链接、右连接、笛卡尔积—没有链接字段、不等连接
合并 union, union all
- union 去重 union all不去重
- 前后查询字段数要一致
- union前面的语句不能排序
  先合并,后排序不是先排序再合并

聚合函数使用

1.计算价格的平均值
select avg(Price) as avg_price from house_price

2.计算数据的总量
select count(*) from house_price  #计算的是所有数据，包含空的行
select count(price) from house_price #计算的是price列中非空的行数

3.统计Bedrooms中不同的取值
select DISTINCT Bedrooms from house_price

4.统计Price中不同的取值的个数
select count(DISTINCT Price) from house_price

分组

1.找出house_price中Price不同取值的每个取值的个数
select Price,count(*) from house_price group by Bedrooms

注：MySQL 版本>5.6的问题
--[Err] 1055 - Expression #1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column 'information_schema.PROFILING.SEQ' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
--- 问题原因 sql_mode=only_full_group_by
-- 解决方案
SET sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));

2. 找出house_price中不同取值(例如：取值为2，则为两室，为3则为3室)的Bedrooms的每个取值的房屋总价格
select Bedrooms,sum(price) from house_price group by Bedrooms;

3.取别名后，可以以别名进行分组
select Bedrooms as br,sum(price) from house_price group by br;

4.找出house_price中不同取值的Bedrooms的房屋中每个取值对应的行数大于20个的房屋及其对应的总价格
select Bedrooms,sum(price) from house_price group by Bedrooms having count(*)>20;

5.找出house_price中不同取值的Bedrooms的房屋中每个取值对应的行数大于20个的房屋及其对应的总价格，并将数据按照总价格的升序排序
select Bedrooms,sum(price) from house_price group by Bedrooms having count(*)>20 order by sum(Price);

6.复杂SQL语句的执行顺序
select Bedrooms,count(*) from house_price where Bedrooms=2 group by Bedrooms having count(*)>20 order by Bedrooms;

执行顺序：
From->where->group by->having->select->order by

子查询

1.找出比平均价格高的物品
select Price from house_price where Price>(select AVG(Price) from house_price);

2.找出一些房屋信息和房屋的均价
select Price,(select AVG(Price) from house_price) as avg_price from house_price;
select Price,123 as avg_price from house_price;

3. 以Neighborhood和Bedrooms分组，统计分组后均价大于整体价格均价的Neighborhood、Bedrooms和分组后的均价数据
select Neighborhood,Bedrooms,AVG(Price) from 
house_price group by Neighborhood,Bedrooms 
having AVG(Price)>(select AVG(Price)
                   from house_price);
                   
4.找出所有房屋中Bedrooms数量相同的价格大于以Neighborhood为分组的房屋均价的房屋
select Neighborhood,Bedrooms from house_price as a where Price>ALL(select AVG(Price) from house_price as b where a.Bedrooms=b.Bedrooms group by Neighborhood);

#EXISTS和NOT EXISTS，EXISTS关键字值关心子查询中返回数据的真假而不是具体的内容
select * from house_price as a where EXISTS (select * from info as b where b.Bedrooms=a.Bedrooms);

多表联结查询

1. 内连接
合并两张表，在两张表中分别获取部分数据
select shop_id,shop_name,sp.product_id,quantity,product_name,product_type,sale_price from product as p inner join shopproduct as sp on p.product_id=sp.product_id

# 起别名更稳妥
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p 
inner join 
shopproduct as sp
on p.product_id=sp.product_id

2. 左外连接：保留左边的表中全部信息，右边表中的对应数据为空则不保留。
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p 
left outer join 
shopproduct as sp
on p.product_id=sp.product_id

3.右外连接：保留右边的表中全部信息，左边边表中的对应数据为空则不保留
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p 
right outer join 
shopproduct as sp
on p.product_id=sp.product_id

4.交叉连接
select sp.shop_id,sp.shop_name,sp.product_id,sp.quantity,p.product_name,p.product_type,p.sale_price
from product as p cross join shopproduct as sp

交叉连接（CROSS JOIN）：也称迪卡尔积
概念：不带WHERE条件子句，它将会返回被连接的两个表的笛卡尔积，返回结果的行数等于两个表行数的乘积；如果带where，返回或显示的是匹配的行数。

组合查询

- 多次查询
select product_id,product_name,product_type from product where product_id<3
union
select product_id,product_name,product_type from product where product_type="厨房用具"

- 简写为：
select product_id,product_name,product_type from product where product_type="厨房用具" or product_id<3

- 使用UNION_ALL的效果：
select product_id,product_name,product_type from product where product_id<5
union all
select product_id,product_name,product_type from product where product_type="厨房用具"

order by关键字只能在最后使用，而不能再union或union all之前使用
select product_id,product_name from product1 where product_id<5
union all
select product_id,product_name,product_type from product1 where product_type="厨房用具"
order by product_id

使用union和使用union all效果不一致，union all不会将和并后的数据去重
union或union all前后查询出的字段数量必须一致
order by关键字只能在最后使用，而不能再union或union all之前使用

CASE表达式

在这里插入图片描述
例子

1.简单case表达式写法
select 
case Bedrooms
    when 2 then 'small'
    when 3 then 'middle'
    when 4 then 'large'
    else 'largest'
end as BedroomsDesc
from house_price

2.搜索case表达式写法
select 
case 
    when Bedrooms=2 then 'small'
    when Bedrooms=3 then 'middle'
    when Bedrooms=4 then 'large'
    else 'largest'
end as BedroomsDesc
from house_price

例子：

1.求出每个公司的男性的总数
select company,sum(num)
from company_profile
where gender="M"
group by company;

2.求出每个公司的女性的总数
select company,sum(num)
from company_profile
where gender="F"
group by company;

3.求出每个公司男性和女性的总数，需要将上述两个表进行合并
select M.company,M.num_M,F.num_F from
(select company,sum(num) as num_M
from company_profile
where gender="M"
group by company) as M   -- 筛选出女性列表命名M
inner join
(select company,sum(num) as num_F
from company_profile
where gender="F"
group by company) as F  -- 筛选出男性列表命名F
on M.company=F.company   -- 连接筛选出男女的连接条件

4. 使用case表达式

select company,
            sum(case when gender="M" then num
                                else 0
                    end) as num_M
            ,sum(case when gender="F" then num
                                else 0
                    end) as num_F
        from company_profile
        group by company;

在这里插入图片描述
if函数

if(表达式,真,假)
配合使用
- sum(if()) 返回0
- count(if()) 返回null
- avg(if()) 返回null

利用SQL对淘宝用户行为进行数据分析

在这里插入图片描述

数据导入与清洗

数据导入与选择子集
列名重命名：对列名进行重命名

删除重复值
SELECT * FROM userbehavior GROUP BY User_ID,Item_ID,Category_ID,Behavior,Timestamp 
HAVING COUNT(*)>1;

去重代码
DELETE FROM 表名 WHERE (字段1,字段2,字段3)
IN 
(SELECT 字段1,字段2,字段3 FROM (SELECT 字段1,字段2,字段3 FROM 表名 GROUP BY 字段1,字段2,字段3
 HAVING COUNT(*)>1) s1) 
AND
id NOT IN (SELECT id FROM (SELECT id FROM 表名 GROUP BY 字段1,字段2,字段3 
HAVING COUNT(*)>1) s2);

缺失值处理-对所有列进行计数查询，查看是否有缺失值
SELECT count(User_ID),count(Item_ID),count(Category_ID),count(Behavior),count(Timestamp) 
FROM userbehavior;

一致化处理Timestamp列无法直接进行分析，需要对其列转化为三列，分别为时间，日期，小时


-- 添加新列Date_time，根据Timestamp返回日期时间
ALTER TABLE userbehavior add COLUMN Date_time TIMESTAMP(0)  NULL;
UPDATE userbehavior
set Date_time = FROM_UNIXTIME(`Timestamp`);

-- 添加新列Date，根据Timestamp列返回日期
ALTER TABLE userbehavior ADD COLUMN Date char(10) NULL;
UPDATE userbehavior
SET Date = FROM_UNIXTIME(`Timestamp`,'%y-%m-%d');

-- 添加新列time，根据Timestamp列返回时间
ALTER TABLE userbehavior ADD COLUMN Time char(10) null;
update userbehavior
set Time = SUBSTRING(Date_time FROM 12 FOR 2);

异常值处理

-- 检查是否有不在2017-11-25至2017-12-03这个时间段内的时间值
SELECT MAX(Date_time),MIN(Date_time) FROM userbehavior;
-- 把不在2017-11-25至2017-12-03这个时间段内的异常时间值删除
DELETE FROM userbehavior where Date >'2017-12-03' or Date <'2017-11-25';

-- 再次检查是否存在不在2017-11-25至2017-12-03这个时间段内的异常值
SELECT MAX(Date_time),MIN(Date_time) FROM userbehavior;

导入的数据集大小概览

SELECT count(DISTINCT User_ID) as 用户数,
count(DISTINCT Item_ID) as 商品数量,
count(DISTINCT Category_ID) as 商品类目数量,
count(Behavior) as 行为数量
FROM userbehavior;

用户行为分析

总体UV、PV、人均浏览次数、成交量
日均UV、PV、人均浏览次数、成交量—看趋势
复购率和跳出率

数据总体情况

总体UV、PV、人均浏览次数、成交量
select COUNT(DISTINCT User_ID) AS '独立访客数',
sum( case when Behavior='pv' then 1 else 0 END) as '点击数',
sum( case when Behavior='pv' then 1 else 0 END)/COUNT(DISTINCT User_ID) as '人均浏览次数',
sum(case when Behavior ='buy' then 1 else 0 END)as '成交量'
from userbehavior;

日均UV、PV、人均浏览次数、成交量
select Date,COUNT(DISTINCT User_ID) AS '日访客数',
sum( case when Behavior='pv' then 1 else 0 END) as '日点击数',
sum( case when Behavior='pv' then 1 else 0 END)/COUNT(DISTINCT User_ID) as '日人均浏览次数',
sum(case when Behavior ='buy' then 1 else 0 END)as '日成交量'
from userbehavior
GROUP BY Date
ORDER BY Date;

用户整体行为数据
create view 用户行为数据 AS
select User_ID,COUNT(Behavior) as '用户行为数',
sum(case when Behavior ='pv' then 1 else 0 end ) as '点击',
sum(case when Behavior ='fav' then 1 else 0 end) as '收藏',
sum(case when Behavior ='cart' then 1 else 0 end) as '加购',
sum(case when Behavior ='buy' then 1 else 0 end) as '购买'
from userbehavior
GROUP BY User_ID
ORDER BY 用户行为数 DESC;

用户的复购率和跳出率
-- 复购用户、复购率
SELECT
sum(case when 购买>1 then 1 else 0 end ) as 复购次数,
sum(case when 购买>0 then 1 else 0 end ) as 购买次数,
CONCAT(ROUND(sum(case when 购买>1 then 1 else 0 end)*100/ sum(case when 购买>0 then 1 else 0 end),2),'%') as 复购率
from 用户行为数据;


-- 跳出率=只浏览了一个页面的访客数/总访客数
select COUNT(*) from (
SELECT User_ID from userbehavior 
GROUP BY User_ID HAVING COUNT(Behavior)=1)as a;

用户行为转化情况分析

行为转化 pv/pv
用户转化 uv/uv
- 一定要注意行为路径的分析

流量行为转化漏斗分析
select behavior,count(*) as 数量 from UserBehavior 
group by behavior order by 数量 desc

用户转化漏斗分析
-- 用户留存分析（漏斗转化）
select sum(case when 点击>0 then 1 else 0 end)as 点击用户数,
sum(case when 收藏>0 then 1 else 0 end) as 收藏用户数,
sum(case when 加购>0 then 1 else 0 end) as 加购用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据;

-- 点击——购买的留存分析
select 
sum(case when 点击>0 then 1 else 0 end) as 点击用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 收藏=0 and 加购=0;
-- 点击——加购——购买的留存分析
select 
sum(case when 加购>0 then 1 else 0 end) as 加购用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 收藏=0 and 加购<>0;
-- 点击——收藏——购买的留存分析
select 
sum(case when 收藏>0 then 1 else 0 end) as 收藏用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 加购=0 and 收藏<>0;
-- 点击——收藏——加购——购买的留存分析
select 
sum(case when 加购>0 then 1 else 0 end) as 加购用户数,
sum(case when 收藏>0 then 1 else 0 end) as 收藏用户数,
sum(case when 购买>0 then 1 else 0 end) as 购买用户数
from 用户行为数据
where 加购<>0 and 收藏<>0;

用户行为习惯分析

2017年11月25日至12月3日间的用户行为习惯分布
  create view  用户行为习惯分布（天） AS
  select Date,COUNT(Behavior) as '用户行为数',
  sum(case when Behavior ='pv' then 1 else 0 end ) as '点击',
  sum(case when Behavior ='fav' then 1 else 0 end) as '收藏',
  sum(case when Behavior ='cart' then 1 else 0 end) as '加购',
  sum(case when Behavior ='buy' then 1 else 0 end) as '购买'
  from userbehavior
  GROUP BY Date
  ORDER BY Date;

一天中的用户行为习惯分布
  -- 用户行为习惯分布（时）
  create view  用户行为习惯分布（时） AS
  select Time,COUNT(Behavior) as '用户行为数',
  sum(case when Behavior ='pv' then 1 else 0 end ) as '点击',
  sum(case when Behavior ='fav' then 1 else 0 end) as '收藏',
  sum(case when Behavior ='cart' then 1 else 0 end) as '加购',
  sum(case when Behavior ='buy' then 1 else 0 end) as '购买'
  from userbehavior
  GROUP BY Time
  ORDER BY Time;

用户类目偏好分析

商品销售情况分析
-- 商品销售情况
create view  商品销售情况 AS
select a.购买次数,COUNT(a.Item_ID) as 商品数
FROM
(select  Item_ID,COUNT(User_ID) as 购买次数
from userbehavior 
where Behavior='buy'
GROUP BY Item_ID) as a
GROUP BY 购买次数
ORDER BY 购买次数 DESC;

-- 商品销售整体情况
-- 总商品数
select COUNT(DISTINCT Item_ID)
from userbehavior;

-- 购买的商品数
select  COUNT(DISTINCT Item_ID)
from userbehavior 
where Behavior='buy';

商品品类浏览top20
-- 商品品类点击数据
select Category_ID,
sum(case when Behavior ='buy' then 1 else 0 end) as '点击'
from userbehavior
GROUP BY Category_ID
ORDER BY 点击 DESC
limit 20;