目录
一、软件安装问题和解决
工具:my sql, sequel pro入门使用的问题和解决方法
课程来源:《秦路七周数据分析》、《必知必会》…
解决方法:百度,感谢前人种树
1. csv无法导入
问题原因:mac和excel兼容问题
解决方法:用number打开csv,另存为utf 8
2. sequel pro无法链接数据库
step 1: 终端
mysql -u root -p
step 2:
load data local infile '位置' into table 数据库名字
fields terminated by ',';
step 3:应该就可以了,但是一直报错,应该是环境未配置好,但是至今没有解决
3. 加载sql
问题原因:版本兼容问题
解决方法:下载测试版squel pro
二、基础查询
1. 排序检索
语句:select, order by, limit
SELECT * FROM `test data 01`.dataanalyst
ORDER BY companyId DESC
LIMIT 10;
2. 过滤数据
语句:where
- 简单过滤:
SELECT * FROM `test data 01`.dataanalyst
WHERE city = "成都";
SELECT * FROM `test data 01`.dataanalyst
WHERE city IN ("成都", "杭州"); #杭州或者成都
SELECT * FROM `test data 01`.dataanalyst
WHERE companyId > 20000; #比较运算:>, <, >=, !=, not in
- 操作符:and, or
SELECT * FROM `test data 01`.dataanalyst
WHERE city IN ("成都", "杭州")
AND education = "本科"
SELECT * FROM `test data 01`.dataanalyst
WHERE city IN ("成都", "杭州")
OR city = "上海";
(1). and:具有优先级
(2). 示例:在上海的本科生,或者在上海的工作经验1-3年的
SELECT * FROM `test data 01`.dataanalyst
WHERE city IN ("上海")
AND (education = "本科"
OR workyear="1-3年")
- 模糊查找
SELECT * FROM `test data 01`.dataanalyst
WHERE secondType LIKE "%后端%"; # "%后端", "后段%"
3. 分组数据
语句:group by
(1). 示例:按城市分组,每个城市的职位数量
SELECT city, COUNT(positionId) AS "position num" FROM `test data 01`.dataanalyst
GROUP BY city
(2). 示例:按城市分组,每个城市招聘公司的数量(去重:distinct)
SELECT city, COUNT(distinct companyId) AS "position num" FROM `test data 01`.dataanalyst #count(*):按照默认字段计数
GROUP BY city;
(3). 多条件分组:每个城市不同学历招聘的岗位数
SELECT city, education, COUNT(*) FROM `test data 01`.dataanalyst
GROUP BY city, education;
- 分组过滤
语句:having
(1). 示例:招聘职位数量大于等于100的城市
SELECT city, COUNT(*) FROM `test data 01`.dataanalyst
GROUP BY city
HAVING COUNT(positionId) >= 100
(2). 示例:招聘电子商务岗位的,数量大于等于50的城市
思路:1st. 选出包含电子商务的岗位; 2nd. 按城市进行分组,然后过滤
SELECT city, COUNT(*) FROM `test data 01`.dataanalyst
WHERE industryField LIKE "%电子商务%"
GROUP BY city
HAVING COUNT(positionId) >= 50
(3). 用if解决(示例):招聘电子商务岗位的,数量大于等于50的城市
思路:1st. 用if判断,如果名称包含“电子商务”,就返回1,不然返回null; 2nd. count计数不计null
SELECT city, COUNT(*) FROM `test data 01`.dataanalyst
GROUP BY city
HAVING COUNT(IF(industryfield LIKE "%电子商务%",1,null)) >= 50
(4). 示例:不同城市中,电子商务岗位占该城市所有职位的比重
思路:1st. 先计算城市所有的职位数量; 2nd. 电子商务的职位数量; 3rd. 按照城市分组
SELECT
city,
#每个城市的岗位数
COUNT(*) AS total,
#每个城市电子商务的岗位数
COUNT(IF (industryfield LIKE "%电子商务%", industryfield, NULL)) AS emarket,
COUNT(IF (industryfield LIKE "%电子商务%", industryfield, NULL))/COUNT(*) #select设置别名,但不能直接用于计算等
FROM `test data 01`.dataanalyst
#按城市分组
GROUP BY city
#只要电子商务招聘岗位数大于等于10的城市
HAVING emarket >=10
ORDER BY emarket
4. 截取数据
语句:left, locate
- left:
left (salary, 1) #从salary中,从左边开始,截取1个
- locate:
locate ('k', salary, 4) #在salary中,从第4个位置开始,'k'的位置是第几个
(1). 示例:8k-12k,薪资的下限是多少
思路:salary中,从左开始,截取到第一个k所在的位置
SELECT LEFT(salary,LOCATE('k', salary)-1), salary FROM `test data 01`.dataanalyst;
(2). 示例:8k-12k,薪资的上限是多少
方法一:1st. salary中,从右开始,截取到 ‘-’ 所在的位置; 2nd. 要去掉取出的上限的’k’
SELECT
LEFT(salary,LOCATE("k", salary)-1),
locate("-", salary),
RIGHT(salary,length(salary)-locate("-",salary)), #8k-12k的总长度,减去'-'所在的位置
LEFT(RIGHT(salary,length(salary)-locate("-",salary)),LOCATE('k',salary)-1), #截取出的上限,再从左取到'k'之前
salary FROM `test data 01`.dataanalyst;
- substr:
substr (salary, 2,3) #substr(字符串,从哪个位置开始,截取长度),好处是可以从中间截取
示例:8k-12k,薪资的上限是多少
方法二:
SELECT
SUBSTR(salary,LOCATE('-',salary)+1,length(salary)-locate("-",salary)-1) AS top,
salary FROM `test data 01`.dataanalyst;
三、子查询
注释:select子句可以作为一个临时的库,在其中直接查找
SELECT (bottom+top)/2 FROM(
SELECT
LEFT(salary,LOCATE("k", salary)-1) AS bottom,
SUBSTR(salary,LOCATE('-',salary)+1,length(salary)-locate("-",salary)-1) AS top,
salary
FROM `test data 01`.dataanalyst
)AS t #别名一般用t
1. 条件分组
语句:case when……then……else……end
SELECT
CASE
WHEN (bottom+top)/2 <= 10 THEN '0-10'
WHEN (bottom+top)/2 <= 20 THEN '10-20'
WHEN (bottom+top)/2 <= 30 THEN '20-30'
ELSE '30+'
END,
salary
FROM (
SELECT
LEFT(salary,LOCATE("k", salary)-1) AS bottom,
SUBSTR(salary,LOCATE('-',salary)+1,length(salary)-locate("-",salary)-1) AS top,
salary
FROM `test data 01`.dataanalyst
)AS t
2. 过滤功能
语句:where……in
示例:城市职位数量大于100的城市,职位详情
SELECT * FROM `test data 01`.dataanalyst
WHERE city IN(
SELECT city FROM `test data 01`.dataanalyst
GROUP BY city HAVING COUNT(positionid) >=100
) #where和select中的字段要一致,都是city
四、表的连结
语句:join
示例:查询“唯医网”的招聘岗位,其中公司名称在company表中,职位在dataanalyst中,共同字段是companyid
(1). 方法一:用where子查询
SELECT * FROM `test data 01`.dataanalyst
WHERE companyId = (
SELECT companyId FROM `test data 01`.company
WHERE companyshortname = "唯医网") #选出来的信息只包含datananlyst的部分
(2). 方法二:用join……on
SELECT * FROM `test data 01`.dataanalyst AS d
JOIN `test data 01`.company AS c
ON d.companyId = c.companyid
where companyshortname = '唯医网'
- left join:
语句:left join……on
作用:左链接,包含join左边的全部,没有的部分用null
应用:可以进行排除法统计
示例:人数在150-500人公司的占比
SELECT COUNT(t.companyid)/COUNT(1)
FROM `test data 01`.company AS d
LEFT JOIN (SELECT * FROM `test data 01`.company
WHERE companysize= '150-500人') AS t
ON t.companyId = d.companyid
五、时间函数
语句:date
(1)date():年月日
(2)date_format(列名,’%Y-%m’):把列的时间格式改为Y-m
SELECT paidTime, date(paidTime), date_format(paidTime,'%Y-%m')
FROM `test data 01`.orderInfo
(3)date_sub(列名, interval 1 day/week/month) :减去时间
(4)date_add(列名, interval 1 day):增加时间
(5)datediff:时间差值
示例:不同月份成交的订单笔数
SELECT COUNT(*), date_format(paidTime,'%Y-%m') as monthly
FROM `test data 01`.orderInfo
GROUP BY monthly
六、例题
例题1:统计不同月份的下单人数
SELECT COUNT(distinct userId), #去重
date_format(paidTime,'%Y-%m') as monthly #也可以用month()
FROM `test data 01`.orderInfo
WHERE isPaid = '已支付' #有未支付或者空值的脏数据
GROUP BY monthly
例题2:统计用户三月份的回购率和复购率
#复购率:在三月份购买多次的人,占三月份购买人数的比率
SELECT COUNT( if ( ct >1, 1, null) )/COUNT(ct)
FROM (
SELECT userId,count(userid) as ct
FROM `test data 01`.orderInfo
WHERE month(paidTime)= 3
GROUP BY userId
) as t
#回购率:三月份购买且后一个月又购买的人,占三月份购买人数的比率
SELECT t1.m, COUNT(t2.m)/COUNT(t1.m) FROM(
SELECT userId, date_format(paidtime,'%Y-%m-01') as m FROM `test data 01`.orderInfo
WHERE isPaid='已支付'
GROUP BY userId, date_format(paidtime,'%Y-%m-01') )as t1
LEFT JOIN (
SELECT userId, date_format(paidtime,'%Y-%m-01') as m FROM `test data 01`.orderInfo
WHERE isPaid='已支付'
GROUP BY userId, date_format(paidtime,'%Y-%m-01') )as t2
ON t1.userId=t2.userId and t1.m = date_sub(t2.m, interval 1 month)
GROUP BY t1.m
例题3:统计男女用户的消费频次是否有差异
思路:先求出不同id的消费次数,再求平均数
SELECT sex, AVG(ct) FROM (
SELECT o.userId, sex, COUNT(1) as ct FROM (
SELECT * FROM `test data 01`.userinfo
WHERE sex !='') u
JOIN `test data 01`.orderinfo o
ON u.userId=o.userId
GROUP BY userId, sex) t2
GROUP BY sex
例题4:统计多次消费的用户,第一次消费和最后一次消费的间隔是多少
SELECT userId, max(paidtime), min(paidtime),
datediff(max(paidtime), min(paidtime)), #天数差值
max(paidtime)-min(paidtime) #秒数差值
FROM `test data 01`.orderInfo
where ispaid="已支付"
group by userId having count(1) >1
例题5:统计不同年龄段,用户的消费频次是否有差异
select age, avg(ct) from(
SELECT u.userid,
ceil((date_format(now(),'%Y')-year(birth) )/10 )as age, #/10是指年龄分10段,ceil向上取整,=1就是0-10岁
#或者 year(now())-year(birth),
count(u.userid) as ct
FROM `test data 01`.userinfo u
JOIN orderinfo o
ON u.userid = o.userid
group by userid) t #报错1248,每个派生表都要加一个别名
group by age
例题6:统计消费的二八法则,消费的top 20%用户,贡献了多少额度
SELECT count(userid), sum(total) FROM (
SELECT userid, sum(price) as total FROM `test data 01`.orderInfo
where ispaid = '已支付'
group by userid
order by total desc
limit 1 ) t#count(userid)*0.2,前20%的消费者