mysql 并表去重_MySQL Notes

最新推荐文章于 2025-02-22 15:31:19 发布

原创最新推荐文章于 2025-02-22 15:31:19 发布 · 370 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#mysql 并表去重

本文探讨了SQL在数据提取、分析、对比、日期处理、替换、扩表、排名、比值等方面的应用。介绍了如何利用max、min、greatest、least、distinct、order by、limit、offset、join、group by、having、if、case when等功能解决各种复杂问题，如找出第二高薪资、员工薪资高于经理、连续日期的最高访问量、列转行、行转列、删除重复数据等。此外，还讲解了如何定义函数来计算取消率，并讨论了SQL在删除重复数据时的策略。

数据提取与分析

问题1 前几

【涉及知识点】

最值：单列最大max(column)，单列最小min(column)；多列最大greatest(column1,column2)，多列最小least(column1,column2)，注意greatest和least里不能有值为null否则会返回null

剔重：distinct

排序：按某column列从小到大order by column，从大到小order by column desc

选取：limit m offset n 取第n+1至n+1+(m-1)=n+m名

没有结果时返回null：嵌套一层，内层没有结果什么也不会返回，外层在内层没有结果的情况下会返回null，也可以外层用ifnull更直观1

17Write a SQL query to get the second highest salary from the Employee table.

+----+--------+

| Id | Salary |

+----+--------+

| 1 | 100 |

| 2 | 200 |

| 3 | 300 |

+----+--------+

For example, given the above Employee table, the query should return 200 as the second highest salary.

If there is no second highest salary, then the query should return null.

+---------------------+

| SecondHighestSalary |

+---------------------+

| 200 |

+---------------------+

【解析】

两种思路：一种是从高到低排序，然后取第二个，这种是如果要取的不是第二，而是第五或者第八到十都会比较适用的；另一种是先求最高的，然后求小于最高的那个，这种方法只在求第二的时候好使。

第二种思路的核心函数是max，求次高的时候可以用

3select max(salary) as 'SecondHighestSalary'

from employee

where salary < (select max(salary) from employee)

或者1

3select max(salary) as 'SecondHighestSalary'

from employee

where salary not in (select max(salary) from employee)

备注:

1)其中前者运行时间是473ms，后者是229ms，后者比前者要快一些，事实上，大部分时候用where in都会更快。

2)虽然有函数，但没有group by，所以用where而不是having。

3)不用group by是因为是算的所有salary的top2，而不是按照某标准比如部门分组后计算每一组的salary的top2。

第一种思路的核心语句是order by desc和limit offset，无取值时返回null可以嵌套一层select也可以用ifnull，分别如下：1

8select

(

select distinct salary

from employee

order by salary desc

limit 1 offset 1

)

as 'SecondHighestSalary'

或者1

8select ifnull(

(select distinct salary

from employee

order by salary desc

limit 1 offset 1)

,null

)

as 'SecondHighestSalary'

备注：

1)其中前者运行时间是430ms，后者运行时间是181ms，后者更快。

2)嵌套的外层是select () as XXX, 而不是select xxx as XXX from () as t，因为如果是后者的话，可以看到它的结构和内层一样，在没有值的时候返回的还是[]而不是[null]。1

34The Employee table holds all employees. Every employee has an Id, a salary,

and there is also a column for the department Id.

+----+-------+--------+--------------+

+----+-------+--------+--------------+

| 1 | Joe | 70000 | 1 |

| 2 | Jim | 90000 | 1 |

| 3 | Henry | 80000 | 2 |

| 4 | Sam | 60000 | 2 |

| 5 | Max | 90000 | 1 |

+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+

| Id | Name |

+----+----------+

| 1 | IT |

| 2 | Sales |

+----+----------+

Write a SQL query to find employees who have the highest salary in each of the departments.

For the above tables, your SQL query should return the following rows (order of rows does not matter).

+------------+----------+--------+

| Department | Employee | Salary |

+------------+----------+--------+

| IT | Max | 90000 |

| IT | Jim | 90000 |

| Sales | Henry | 80000 |

+------------+----------+--------+

Explanation:

Max and Jim both have the highest salary in the IT department

and Henry has the highest salary in the Sales department.

【解析】

由于需要把所有并列部门第一的信息全部罗列出来，因此不能简单地group by部门以后倒序排列然后limit 1，而是需要嵌套一层，选出所有数值等于前者的，具体如下：1

8select d.name as 'Department',

e.name as 'Employee',

e.Salary

from employee e

join department d

on d.id=e.departmentid

join (select max(salary) as mxsalary, departmentid from employee group by departmentid) mx

on mx.mxsalary=e.salary and mx.departmentid=e.departmentid1

40The Employee table holds all employees. Every employee has an Id,

and there is also a column for the department Id.

+----+-------+--------+--------------+

+----+-------+--------+--------------+

| 1 | Joe | 85000 | 1 |

| 2 | Henry | 80000 | 2 |

| 3 | Sam | 60000 | 2 |

| 4 | Max | 90000 | 1 |

| 5 | Janet | 69000 | 1 |

| 6 | Randy | 85000 | 1 |

| 7 | Will | 70000 | 1 |

+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+

| Id | Name |

+----+----------+

| 1 | IT |

| 2 | Sales |

+----+----------+

Write a SQL query to find employees who earn the top three salaries in each of the department.

For the above tables, your SQL query should return the following rows (order of rows does not matter).

+------------+----------+--------+

| Department | Employee | Salary |

+------------+----------+--------+

| IT | Max | 90000 |

| IT | Randy | 85000 |

| IT | Joe | 85000 |

| IT | Will | 70000 |

| Sales | Henry | 80000 |

| Sales | Sam | 60000 |

+------------+----------+--------+

Explanation:

In IT department, Max earns the highest salary, both Randy and Joe earn the second highest salary,

and Will earns the third highest salary. There are only two employees in the Sales department,

Henry earns the highest salary while Sam earns the second highest salary.

【解析】

由于需要把满足条件的都列出来，而其中有并列的，因此需要把employee自己并表对比。核心思路是，每类比自己小的最多只有3个。具体如下：1

11select d.name as 'Department',

e1.name as 'Employee',

e1.Salary

from employee e1

join department d

on e1.departmentid=d.id

join employee e2

on e1.departmentid=e2.departmentid

where e1.salary<=e2.salary

group by e1.id

having count(distinct(e2.salary))<=3

注意:

group by的是id，而不是departmentid，这是因为e1和e2已经按照同departmentid来比较了，group by主要是为了后面的having count()，而不是为了区分部门。后面可以用e1.id或者e1.name，但不能用e2的id或name，也不能用departmentid或者salary。

问题2 对比

【涉及知识点】

核心思路：先拆表再并表，把表按照一个参数一个表拆为多个表，再对比或用join连接各表进行数值对比

并表：select xxx from table1 join table2 join table3 on table1.column1=table2.column2 and table1.column3=table3.column4

拆表：select xxx from table as t1, table as t2 where1

19The Employee table holds all employees including their managers. Every employee has an Id,

and there is also a column for the manager Id.

+----+-------+--------+-----------+

+----+-------+--------+-----------+

| 1 | Joe | 70000 | 3 |

| 2 | Henry | 80000 | 4 |

| 3 | Sam | 60000 | NULL |

| 4 | Max | 90000 | NULL |

+----+-------+--------+-----------+

Given the Employee table, write a SQL query that finds out employees who earn more than their managers.

For the above table, Joe is the only employee who earns more than his manager.

+----------+

| Employee |

+----------+

| Joe |

+----------+

【解析】

两种思路，一种是两个表直接对比，一种是并表对比，后者比前者用得更多，因为这个例题是只有两个参数，如果参数多了，并表更直观。

第一种思路直接对比，核心语句是where，具体如下：1

5select e1.name as 'Employee'

from employee as e1,

employee as e2

where e1.managerid = e2.id

and e1.salary > e2.salary

这种方法的运算时间是881ms，但不太常用，更常用的还是第二种思路join。

第二种思路先拆表再并表，核心语句是join，具体如下：1

5select e1.name as 'Employee'

from employee as e1

left join employee as e2

on e1.managerid=e2.id

where e1.salary > e2.salary

这种方法的运算时间是792ms，比较常用，数据分析中这种先拆表再并表是一种基本思路。1

31X city built a new stadium, each day many people visit it and the stats are saved as these columns:

id, visit_date, people

Please write a query to display the records which have 3 or more consecutive rows

and the amount of people more than 100(inclusive).

For example, the table stadium:

+------+------------+-----------+

| id | visit_date | people |

+------+------------+-----------+

| 1 | 2017-01-01 | 10 |

| 2 | 2017-01-02 | 109 |

| 3 | 2017-01-03 | 150 |

| 4 | 2017-01-04 | 99 |

| 5 | 2017-01-05 | 145 |

| 6 | 2017-01-06 | 1455 |

| 7 | 2017-01-07 | 199 |

| 8 | 2017-01-08 | 188 |

+------+------------+-----------+

For the sample data above, the output is:

+------+------------+-----------+

| id | visit_date | people |

+------+------------+-----------+

| 5 | 2017-01-05 | 145 |

| 6 | 2017-01-06 | 1455 |

| 7 | 2017-01-07 | 199 |

| 8 | 2017-01-08 | 188 |

+------+------------+-----------+

Note:

Each day only have one row record, and the dates are increasing with id increasing.

【解析】

拆成3个表来对比即可，注意符合条件的id可能在连续3个100以上访问记录的第1/2/3个中的任何一个位置，需用or来连接条件，并用()来表示优先级，具体如下：1

9select distinct s1.*

from stadium s1, stadium s2, stadium s3

where s1.people>=100 and s2.people>=100 and s3.people>=100

and ((s1.id+1=s2.id and s2.id+1=s3.id)

(s2.id+1=s1.id and s1.id+1=s3.id)

(s3.id+1=s2.id and s2.id+1=s1.id))

order by id

注意：由于很多id同时满足or连接的不同条件，因此distinct和order by是必须的。

问题3 查重

【涉及知识点】

分类：group by

计数：count()

先拆再比1

17Write a SQL query to find all duplicate emails in a table named Person.

+----+---------+

| Id | Email |

+----+---------+

For example, your query should return the following for the above table:

+---------+

| Email |

+---------+

Note: All emails are in lowercase.

【解析】

两种思路，一种思路是转化为对比问题，用上面问题2对比中所用的先拆再比的办法，另一种思路是每个email分个类，选出分类计数大于1的。

第一种思路，拆成两个表，求那些id不同但email值相同的，具体如下：1

5select distinct p1.Email

from person p1,

person p2

where p1.id <> p2.id

and p1.email = p2.email

备注：

1)这种方法的运算速度是513ms

2)注意要用distinct，这样如果重复次数多于2次的时候，只会返回一个值。

另一种思路，按email进行分类，求类别下的id数大于1的，具体如下：1

4select Email

from person

group by email

having count(email) > 1

备注：

1)这种方法的运算速度是436ms

2)注意因为有分类group by，求每个类别的计数条件要用having，而不是求所有的计数所用的where。

3)计数这种方式比起前一种拆表对比的方式而言，适用性更强，在有多个参数有要求时更方便直观。1

30There is a table courses with columns: student and class

Please list out all classes which have more than or equal to 5 students.

For example, the table:

+---------+------------+

| student | class |

+---------+------------+

| A | Math |

| B | English |

| C | Math |

| D | Biology |

| E | Math |

| F | Computer |

| G | Math |

| H | Math |

| I | Math |

+---------+------------+

Should output:

+---------+

| class |

+---------+

| Math |

+---------+

Note:

The students should not be counted duplicate in each course.

【解析】

这样计数超过2的情况用对比就不好使了，用group by再count则非常方便。具体如下：1

4select class

from courses

group by class

having count(distinct student)>=5

备注：

1)同上例，因为有group by，所以不能用where，而应该用having来求每组的数字

2)考虑到表中有可能有不同口径重复写入的情况，因此需要distinct剔除重复的记录，这是一个分析中应该养成的习惯，因为如果没有重复记录，加一个distinct也不会影响结果。

问题4 日期

【涉及知识点】

返回当前日期和时间：

now()会返回当前YYYY-MM-DD HH:MM:SS，

curdate()会返回当前YYYY-MM-DD，

curtime()会返回当前HH:MM:SS

截取时间中的某个值：

date(column)会返回column中的YYYY-MM-DD，

extract(year/quarter/month/week/day/hour/minute/second/microsecond/hour_minute/day_minute/year_month from column)会返回column中的YYYY/Q/MM/WW/DD/HH/MM/SS/MS/HHMM/DDHHMM/YYYYMM 注，Q是从1开始计，WW是在第一个周日前计为0，第一个周日开始到第一个周六计为1，以此类推

增减求日期间隔：

date_add(date, interval 1 day/week/month/quarter/year/hour/minute/second)返回date之后1天/周/月/季/年/小时/分钟/秒钟后的那个日期，

date_sub(date, interval 2 day/...)返回date之前1天/…后的那个日期，

datediff(date1,date2)返回两个日期中间的间隔天数。1

19Given a Weather table, write a SQL query to find all dates' Ids

with higher temperature compared to its previous (yesterday's) dates.

+---------+------------------+------------------+

| Id(INT) | RecordDate(DATE) | Temperature(INT) |

+---------+------------------+------------------+

| 1 | 2015-01-01 | 10 |

| 2 | 2015-01-02 | 25 |

| 3 | 2015-01-03 | 20 |

| 4 | 2015-01-04 | 30 |

+---------+------------------+------------------+

For example, return the following Ids for the above Weather table:

+----+

| Id |

+----+

| 2 |

| 4 |

+----+

【解析】

主体是一个非常简单的先拆表再比较或并表，容易错的点在于：由于记录不一定是连续的，所以不能用id而应该用recorddate来表达previous dates的概念，用到date_add或datediff函数，具体如下：

date_add+拆表比较(运算时间1025ms)1

5select w2.Id

from weather w1,

weather w2

where date_add(w1.recorddate, interval 1 day)=w2.recorddate

and w1.temperature

datediff+拆表比较(运算时间641ms)1

5select w2.Id

from weather w1,

weather w2

where datediff(w1.recorddate, w2.recorddate)=-1

and w1.temperature

date_add+拆表并表(运算时间679ms)1

5select w2.Id

from weather w1

right join weather w2

on date_add(w1.recorddate,interval 1 day)=w2.recorddate

where w1.temperature

datediff+拆表并表(运算时间791ms)1

5select w2.Id

from weather w1

right join weather w2

on datediff(w1.recorddate, w2.recorddate)=-1

where w1.temperature

从运算时间看，其中如果采用并表join，用date_add更快一点；如果采用比较，用datediff要快很多。

考虑到大部分情况下用join会比直接比较更适合多参数拆分为多表时，因此在date_add和datediff都能用的情况下可考虑优先选用date_add。

问题5 数字

【涉及知识点】

一些常用符号：

不等于可以用<>也可以用!=；

余数可以用mod(被除数,除数)=余数也可以用被除数%除数=余数；

截取数字：

round(3.1415, 3)=3.142，

truncate(3.1415, 3)=3.141，

floor(3.1415)=3，

ceil(3.1415)=41

23X city opened a new cinema, many people would like to go to this cinema.

The cinema also gives out a poster indicating the movies’ ratings and descriptions.

Please write a SQL query to output movies with an odd numbered ID and a description

that is not 'boring'. Order the result by rating.

For example, table cinema:

+---------+-----------+--------------+-----------+

+---------+-----------+--------------+-----------+

| 1 | War | great 3D | 8.9 |

| 2 | Science | fiction | 8.5 |

| 3 | irish | boring | 6.2 |

| 4 | Ice song | Fantacy | 8.6 |

| 5 | House card| Interesting| 9.1 |

+---------+-----------+--------------+-----------+

For the example above, the output should be:

+---------+-----------+--------------+-----------+

+---------+-----------+--------------+-----------+

| 5 | House card| Interesting| 9.1 |

| 1 | War | great 3D | 8.9 |

+---------+-----------+--------------+-----------+

【解析】这道题非常简单，仅需知道余数的表达方式即可，余数用%则只有mysql支持，用mod()则其它如oracle等也支持，这道题的description里的boring是精确的，可以直接用=，但考虑到大部分时候描述和评价会是一小段话而非一个词，因此用了字符串模糊匹配的not like '%xxx%'1

5select *

from cinema

where id%2=1

and description not like '%boring%'

order by rating desc

问题6 字符

【涉及知识点】

字符串匹配：

精确匹配用=，模糊匹配用like或not like，

模糊匹配不确定前后有多少字符的时候用'%word%'，模糊匹配确定前后有多少个字符的时候用'w_rd'，

模糊匹配多个可以包含的字符时用'where city like [ANL]%'，即返回所有以A或N或L字母打头的城市名，

模糊匹配多个需要排除的字符时用'where city like [!ANL]'，即返回所有不以A或N或L字母打头的城市名。

字符串长度：

len(column)，返回这列每个值的字符串长度

字符串定位：

locate(abc, column)，返回这列每个值中，字符abc出现的位置，如无返回0

截取字符串：

mid(column, x, n)，返回这列每个值中，从第x个字符开始一直截取n个字符的这一串子字符串

【例题】

问题5的例题262、问题11的例题620解析中有涉及到字符串处理，不赘述。

问题7 替换

【涉及知识点】

替换问题的核心其实是多条件赋值，如果只有较少条件的时候可以用if()，如果条件多时需要用case when

if的基本用法：if(条件, 符合条件时的赋值1, 不符合条件时的赋值2)

case when的基本用法：case when 条件1 then 赋值1 when 条件2 then 赋值2 else 赋值3 end1

23Given a table salary, such as the one below, that has m=male and f=female values.

Swap all f and m values (i.e., change all f values to m and vice versa) with a single update statement

and no intermediate temp table.

Note that you must write a single update statement, DO NOT write any select statement for this problem.

Example:

| id | name | sex | salary |

|----|------|-----|--------|

| 1 | A | m | 2500 |

| 2 | B | f | 1500 |

| 3 | C | m | 5500 |

| 4 | D | f | 500 |

After running your update statement, the above salary table should have the following rows:

| id | name | sex | salary |

|----|------|-----|--------|

| 1 | A | f | 2500 |

| 2 | B | m | 1500 |

| 3 | C | f | 5500 |

| 4 | D | m | 500 |

【解析】

解题思路有两种，一种是用if，因为这里只有两个赋值，一种是用更具普遍性的case when。

需要额外提到的是，这是一个更新数据库而非通常的提取数据用于分析的场景，需要用到update而非日常的select，更新数据update的基本用法是：update 表格名 set 需修改参数列名 = 修改内容

第一种思路，case when，具体如下：1

4update salary

set sex = case sex when 'f' then 'm'

else 'f'

end

需要注意的是，修改的列和条件的列都是sex，如果不是同一列的话比如修改的是name，用于判断的条件是id，那么会写作update salary set name= case id when 1 then 'aaa' when 2 then 'bbb' else 'ccc' end

另，如果不写作case sex when，而是直接用case when，把sex放到后面去也是可以的，具体写作：1

5update salary

set sex = case when sex='f'

then 'm'

else 'f'

end

第二种思路，if，因为这里sex只有两个取值，所以用if是足够的，具体如下：1

2update salary

set sex = if(sex='f','m','f')

需要注意的是，if后面的条件必须完整的写为sex=xxx，而不能因为前面已经有sex=了就省掉，因为这里的条件和要赋值的列都是sex，但可能存在不是同一个列的情况，比如按照上面的栗子会写作update salary set name = if(id=1,'aaa','bbb')

这里用if的运算时间是518ms，用case when的运算时间是337ms，尽管if的代码要短一些，但case when的运算速度更快，考虑到case when的适用场景也更多，因此在替换与赋值问题中尽量优先选择使用case when。1

32Mary is a teacher in a middle school and she has a table seat storing students' names

and their corresponding seat ids.

The column id is continuous increment.

Mary wants to change seats for the adjacent students.

Can you write a SQL query to output the result for Mary?

+---------+---------+

| id | student |

+---------+---------+

| 1 | Abbot |

| 2 | Doris |

| 3 | Emerson |

| 4 | Green |

| 5 | Jeames |

+---------+---------+

For the sample input, the output is:

+---------+---------+

| id | student |

+---------+---------+

| 1 | Doris |

| 2 | Abbot |

| 3 | Green |

| 4 | Emerson |

| 5 | Jeames |

+---------+---------+

Note:

If the number of students is odd, there is no need to change the last one's seat.

【解析】

由于涉及到3个情景：奇数且不为最大值，奇数且为最大值，偶数。因此无法用if，需用case when，具体如下：1

10select

(case when id%2=0 then id-1

when id%2=1 and id<>maxid then id+1

else id

end

) as id,

student

from seat,

(select max(id) as maxid from seat) mx

order by id

注意：

不能在case when的其中一个枝干里直接用case when id%2=1 and id not in (select max(id) from seat) mx then id+1，而须按照上面的写法把mx放到from之后，否则会报错。

问题8 扩表

【涉及知识点】

扩表是把记录不连续的表格扩大成包含一系列参数的大表，通常包含两类操作：列转行或行转列、对无记录的条目进行赋值

列转行通常有三种做法：if；case when；pivot

行转列通常做法为：union/union all

if和case when的用法在问题7替换中已经讲过了，不赘述。

pivot的基本用法：1

3select column2value1,column2value2,column2value3

from (select * from originaltable) as base

pivot(max/min/sum(column1) for column2 in (column2value1,column2value2,column2value3)) as pvt

union /union all基本用法：1

3select column from table

union (all)

select column from table

需注意的是，1)如果用union会自动剃重，union all则不会 2)被连接各表的colomn项目顺序格式必须完全一样 3)union除了用作行转列外，也可用作纵向并表(join为横向并表)1

40Table: Department

+---------------+---------+

| Column Name | Type |

+---------------+---------+

| id | int |

| revenue | int |

| month | varchar |

+---------------+---------+

(id, month) is the primary key of this table.

The table has information about the revenue of each department per month.

The month has values in ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"].

Write an SQL query to reformat the table such that there is a department id column

and a revenue column for each month.

The query result format is in the following example:

Department table:

+------+---------+-------+

| id | revenue | month |

+------+---------+-------+

| 1 | 8000 | Jan |

| 2 | 9000 | Jan |

| 3 | 10000 | Feb |

| 1 | 7000 | Feb |

| 1 | 6000 | Mar |

+------+---------+-------+

Result table:

+------+-------------+-------------+-------------+-----+-------------+

+------+-------------+-------------+-------------+-----+-------------+

| 1 | 8000 | 7000 | 6000 | ... | null |

| 2 | 9000 | null | null | ... | null |

| 3 | null | 10000 | null | ... | null |

+------+-------------+-------------+-------------+-----+-------------+

Note that the result table has 13 columns (1 for the department id + 12 for the months).

【解析】

这是一个典型的扩表问题，需要的工作包含列转行和对没有记录的取值赋值为null，下面分别用case when；if；pivot来实现。

第一种方法，case when，具体如下：1

15select distinct id,

sum(CASE when month='jan' then revenue else null END) as 'Jan_Revenue',

sum(CASE when month='feb' then revenue else null END) as 'Feb_Revenue',

sum(CASE when month='mar' then revenue else null END) as 'Mar_Revenue',

sum(CASE when month='apr' then revenue else null END) as 'Apr_Revenue',

sum(CASE when month='may' then revenue else null END) as 'May_Revenue',

sum(CASE when month='jun' then revenue else null END) as 'Jun_Revenue',

sum(CASE when month='jul' then revenue else null END) as 'Jul_Revenue',

sum(CASE when month='aug' then revenue else null END) as 'Aug_Revenue',

sum(CASE when month='sep' then revenue else null END) as 'Sep_Revenue',

sum(CASE when month='oct' then revenue else null END) as 'Oct_Revenue',

sum(CASE when month='nov' then revenue else null END) as 'Nov_Revenue',

sum(CASE when month='dec' then revenue else null END) as 'Dec_Revenue'

from department

group by id

备注：

1)需要注意的是用到了sum()和group by id，因为如果不用sum的话，就会出现每条记录对应的别的记录其实是有数值的会在这条记录中显示为null，因此需要用sum()或者max()来在多个取值中选择有数值的而非null，这道题目因为每个月每个部门只有一条记录，所以用max或者sum都可以，但如果是有多条记录的话，用sum()会更合适

2) 注意是group by id而不是group by month，因为已经把每个月给单独写出来了，不需要通过group by month来为不同月份分组计算，使用group by的目的是为了sum，因此需用id这种不会因group而汇总的参数。

3)distinct只是写作习惯，不写也可以，但实测发现写distinct的运算时间为690m，不写distinct会更慢运算时间要1116ms

第二种方法，if()函数，具体如下：1

15select id,

sum(if(month='jan', revenue, null)) as 'Jan_Revenue',

sum(if(month='feb', revenue, null)) as 'Feb_Revenue',

sum(if(month='mar', revenue, null)) as 'Mar_Revenue',

sum(if(month='apr', revenue, null)) as 'Apr_Revenue',

sum(if(month='may', revenue, null)) as 'May_Revenue',

sum(if(month='jun', revenue, null)) as 'Jun_Revenue',

sum(if(month='jul', revenue, null)) as 'Jul_Revenue',

sum(if(month='aug', revenue, null)) as 'Aug_Revenue',

sum(if(month='sep', revenue, null)) as 'Sep_Revenue',

sum(if(month='oct', revenue, null)) as 'Oct_Revenue',

sum(if(month='nov', revenue, null)) as 'Nov_Revenue',

sum(if(month='dec', revenue, null)) as 'Dec_Revenue'

from department

group by id

备注：

1)仍是用到了sum()和group by

2)这种方法运算时间为734ms，很有趣的是如果加distinct，运算时间为774ms，相差不大

第三种方法，pivot，具体如下：1

15select id,

jan as Jan_revenue,

feb as Feb_revenue,

mar as Mar_revenue,

apr as Apr_revenue,

may as May_revenue,

jun as Jun_revenue,

jul as Jul_revenue,

aug as Aug_revenue,

sep as Sep_revenue,

oct as Oct_revenue,

nov as Nov_revenue,

dec as Dec_revenue

from (select id,revenue,month from department) as base

pivot (sum(revenue) for month in(jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec)) as pvt

备注：

1)大部分版本的mysql不支持pivot，以上是在ms sql server中的写法

2)运算时间1315ms

3)考虑兼容性，列转行尽量使用case when/if而非pivot，行转列尽量使用union all而非unpivot

问题9 定义函数

【涉及知识点】

有时需要返回的内容并非一个固定参数，而是会需要根据情况调整输入参数的，这时需要定义函数。

定义函数的基本格式为：1

7CREATE FUNCTION 函数名(参数n 参数格式int/varchar) RETURNS 返回值格式int/varchar

BEGIN

DECLARE 参数名m 参数格式int; #与定义函数名和参数格式时一样中间用空格隔开，注意结尾需:

SET m=n+1; #set后面是declare的参数与函数参数n的关系，注意需;结尾

RETURN ( select... ); #中间是正常的查询分析语句，把平时固定的某数值替换为函数参数n即可，

#注意需打括号，括号后有;，注意是return而不是前面定义函数名时用的returns

END; #注意有begin就要有end，end后面有没有;都可以1

17Write a SQL query to get the nth highest salary from the Employee table.

+----+--------+

| Id | Salary |

+----+--------+

| 1 | 100 |

| 2 | 200 |

| 3 | 300 |

+----+--------+

For example, given the above Employee table, the nth highest salary where n = 2 is 200.

If there is no nth highest salary, then the query should return null.

+------------------------+

| getNthHighestSalary(2) |

+------------------------+

| 200 |

+------------------------+

【解析】

本身是一个非常简单的前几问题，但由于需要定义一个函数来方便输入第几的数字，因此需要把简单的前几查询略作修改变成函数定义。具体如下：1

16CREATE FUNCTION getNthHighestSalary(N int) RETURNS int

BEGIN

DECLARE M int;

SET M=N-1;

RETURN

(

select distinct salary

from employee

order by salary desc

limit 1 offset M

);

END;

格式中需要注意的点包括：

1)函数名的名字就是需要返回的列名，下面写函数时就无需as了

2)函数名后面括号内的参数名和参数格式中间的是空格，而不是逗号

3)定义函数名时用的returns有个s，后面接的是返回值的格式

4)create function returns这句后面无需使用符号

5)begin后如果没什么要设定的可以直接接return

6)如果在查询语句中不能直接使用参数而需要参数的计算后变形，那么需要在begin后先declare另一个辅助参数，然后set辅助参数与函数参数之间的关系，注意declare句后和set句后均有;分号

7)return后面的如果是整个查询语句的话，需要把查询语句括号起来，注意括号后有;分号

8)最后别忘了有begin就会有end，而end后面有没有;都可以的

问题10 排名

【涉及知识点】

需要注意的是，这里所说的排名与排序不同，重点是需要为每行返回一个代表其名次的数字，这涉及到的核心功能仍是自定义函数，但不同于create function，这里的自定义函数是在查询语句中间实现的，也并不需要面向不同的输入参数。

通过自定义函数创建一列新的行序数字的基本格式是：

select table.*, @rownum:[email protected]+1 from table, (select @rownum:=0) as r1

27Write a SQL query to rank scores. If there is a tie between two scores, both should have the same ranking.

Note that after a tie, the next ranking number should be the next consecutive integer value.

In other words, there should be no "holes" between ranks.

+----+-------+

| Id | Score |

+----+-------+

| 1 | 3.50 |

| 2 | 3.65 |

| 3 | 4.00 |

| 4 | 3.85 |

| 5 | 4.00 |

| 6 | 3.65 |

+----+-------+

For example, given the above Scores table, your query should generate the following report (order

by highest score):

+-------+------+

| Score | Rank |

+-------+------+

| 4.00 | 1 |

| 3.85 | 2 |

| 3.65 | 3 |

| 3.50 | 4 |

+-------+------+

【解析】

可以想到输出结果是由两个表并成的，一个表是每个分数对应的排名，这需要将分数剔重以后排序再通过自定义函数得出每行的行数数值，第二个表是包含相同分数的原始表格，需要按照分数从高到低排列。具体如下：1

9select s1.Score, s3.Rank

from scores s1

left join

(

select s2.score, @rownum:[email protected]+1 as Rank

from (select distinct score from scores order by score desc) s2, (select @rownum:=0) r

) s3

on s1.score=s3.score

order by rank

需注意的是，对获取排名值的表需要嵌套一次，因为需要剔除重复的和排序，而如果不嵌套而是把distinct，order by，自定义rownum函数全放在同一层的话会报错。

问题11 比值

【涉及知识点】

拆表再并表

用case when将符合条件的赋值1，不符合条件的赋值0，使用sum计算出所有符合条件的，除以count全部。1

48The Trips table holds all taxi trips. Each trip has a unique Id, while Client_Id and Driver_Id

are both foreign keys to the Users_Id at the Users table. Status is an ENUM type of (‘completed’,

‘cancelled_by_driver’, ‘cancelled_by_client’).

+----+-----------+-----------+---------+--------------------+----------+

+----+-----------+-----------+---------+--------------------+----------+

| 1 | 1 | 10 | 1 | completed |2013-10-01|

| 2 | 2 | 11 | 1 | cancelled_by_driver|2013-10-01|

| 3 | 3 | 12 | 6 | completed |2013-10-01|

| 4 | 4 | 13 | 6 | cancelled_by_client|2013-10-01|

| 5 | 1 | 10 | 1 | completed |2013-10-02|

| 6 | 2 | 11 | 6 | completed |2013-10-02|

| 7 | 3 | 12 | 6 | completed |2013-10-02|

| 8 | 2 | 12 | 12 | completed |2013-10-03|

| 9 | 3 | 10 | 12 | completed |2013-10-03|

| 10 | 4 | 13 | 12 | cancelled_by_driver|2013-10-03|

+----+-----------+-----------+---------+--------------------+----------+

The Users table holds all users. Each user has an unique Users_Id, and Role is an ENUM

type of (‘client’, ‘driver’, ‘partner’).

+----------+--------+--------+

| Users_Id | Banned | Role |

+----------+--------+--------+

| 1 | No | client |

| 2 | Yes | client |

| 3 | No | client |

| 4 | No | client |

| 10 | No | driver |

| 11 | No | driver |

| 12 | No | driver |

| 13 | No | driver |

+----------+--------+--------+

Write a SQL query to find the cancellation rate of requests made by unbanned users (both client and

driver must be unbanned) between Oct 1, 2013 and Oct 3, 2013.

The cancellation rate is computed by dividing the number of canceled (by client or driver) requests

made by unbanned users by the total number of requests made by unbanned users.

For the above tables, your SQL query should return the following rows with the cancellation rate

being rounded to two decimal places.

+------------+-------------------+

| Day | Cancellation Rate |

+------------+-------------------+

| 2013-10-01 | 0.33 |

| 2013-10-02 | 0.00 |

| 2013-10-03 | 0.50 |

+------------+-------------------+

【解析】

先并表，在并表过程中通过where把被禁用户和不符合日期的记录给直接剔除掉。

然后在求取消率时，分母可以直接用count()，分子如果用count就不能再设条件，因此用sum()，把符合条件带cancelled的条目赋值1其它赋值0，注意根据要求是精确到小数点后两位，因此需要用round(num,2)。1

10select distinct t.request_at as 'Day',

round(sum(case when t.status like '%cancelled%' then 1 else 0 end)/count(t.id), 2) as 'Cancellation Rate'

from trips t

join users u1

join users u2

on t.client_id=u1.users_id and t.driver_id=u2.users_id

where u1.banned='no'and u2.banned='no'

and t.request_at>='2013-10-01' and t.request_at<='2013-10-03'

group by request_at

order by request_at

注意：

1)like前面没有is，后面的模糊字符串需要用''框起来。

2)三个表可以一起连，直接join join on，而无需join on join on

3)日期可以直接用大于小于这样的符号，但需注意，日期也需用''框起来，否则会报错。

数据库维护相关

问题1 删重

【涉及知识点】

删除某行的格式：delete * from table where，和select的格式是一样的，只是换成delete，且因为是删整行，所以一般是*或者table.*而不像select需要写明选出来的是行里的哪些列参数。

删重还是剔重：其实一般更常用的不是删除重复的，而是选出不重复的，即是用distinct，但不排除有时有用delete维护数据库而不是在分析时剔除重复数据但不修改数据库的需求。1

22Write a SQL query to delete all duplicate email entries in a table named Person,

keeping only unique emails based on its smallest Id.

+----+------------------+

| Id | Email |

+----+------------------+

Id is the primary key column for this table.

For example, after running your query, the above Person table should have the following rows:

+----+------------------+

| Id | Email |

+----+------------------+

Note:

Your output is the whole Person table after executing your sql. Use delete statement.

【解析】

和查重的思路很相似，也是要么先拆表再转化成比较问题，要么计算每个值有多少个。

第一种方法，拆表再对比，把查重时的select改为删除时的delete就行了，注意因为要求保留较小的一个id，因此比较的时候不是不等于而是大于。1

5delete p1.*

from person p1,

person p2

where p1.id > p2.id

and p1.email = p2.email

备注：

1)这种方法的运算时间是1110ms

2)这里的delete后面有*，是因为删除表格的依据是整表，而不是某个指标。

第二种方法，把表格按每个email来group by，然后这里我们不需要计数，只需要求出id最小的需要保留的是哪些，然后删除那些不属于这些id的就可以了。1

7delete

from person

where id not in (

select id from

(select min(id) as 'id',email

from person

group by email) minid

备注：

1)这种方法的运算时间是1846ms

2)这里的delete后面没有*，而是直接接from，因为后面是where所限定的一个指标id

3)id not in的括号内必须嵌套一次，而不能直接一次，否则会报错，原因不明。

问题2 更新数据

【涉及知识点】

更新数据的基本格式：

update table set column-need-to-be-update=xxx where column-condition=mmm

【例题】前面分析互换问题时在例题627中已经讲过，不赘述。