This is my record of working through these problems. If you have better approaches, feel free to leave a comment!
Difficulty: Easy
1. Combine Two Tables (175)
Table 1: Person
Columns: personId (int), firstName (varchar), lastName (varchar).
personId is the primary key of this table.
Table 2: Address
Columns: addressId (int), personId (int), city (varchar), state (varchar).
addressId is the primary key of this table.
Goal: report the first name, last name, city, and state of every person in the Person table. If a person's personId has no matching address in the Address table, report null for city and state.
Pandas:
import pandas as pd
def combine_two_tables(person: pd.DataFrame, address: pd.DataFrame) -> pd.DataFrame:
return pd.merge(person,address,on='personId',how='left')[['firstName','lastName','city','state']]
MySQL:
select p.firstName, p.lastName, a.city, a.state from Person p
left join Address a
on a.PersonID = p.PersonID
2. Employees Earning More Than Their Managers (181)
Table: Employee
Columns: id (int), name (varchar), salary (int), managerId (int). id is the primary key of this table.
Goal: find the employees who earn more than their managers.
Pandas:
import pandas as pd
def find_employees(employee: pd.DataFrame) -> pd.DataFrame:
data = pd.merge(employee,employee,left_on='id',right_on='managerId',how='right',suffixes=('','_y')).drop(['id','name','id_y','managerId','managerId_y'],axis=1)
data = data[data['salary_y'] > data['salary']][['name_y']]
data.columns = ['Employee']
return data
MySQL:
select e1.name as Employee
from Employee e1, Employee e2
where e1.managerId = e2.id and e1.salary > e2.salary
3. Duplicate Emails (182)
Table: Person
Columns: id (int), email (varchar). id is the primary key of this table.
Goal: report all duplicate emails. It is guaranteed that the email field is not NULL.
Pandas:
import pandas as pd
def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
return person[['email']][person.duplicated(subset=['email'])].drop_duplicates()
MySQL:
select email
from person
group by email
having count(email) > 1
4. Customers Who Never Order (183)
Table 1: Customers
Columns: id (int), name (varchar). id is the primary key of this table.
Table 2: Orders
Columns: id (int), customerId (int). id is the primary key of this table.
Goal: find all customers who never placed an order.
Pandas:
def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(customers,orders,left_on='id',right_on='customerId',how='left')
return df[df['customerId'].isna()][['name']].rename(columns={'name':'Customers'})
MySQL:
select c.name as Customers
from Customers c
left join Orders o
on c.id = o.customerId
where o.id is null
5. Delete Duplicate Emails (196)
Table: Person
Columns: id (int), email (varchar). id is the primary key of this table.
Goal: delete all duplicate emails, keeping only one row per unique email, the one with the smallest id.
Pandas:
def delete_duplicate_emails(person: pd.DataFrame) -> None:
person.sort_values(by='id',inplace=True)
person.drop_duplicates(['email'],keep='first',inplace=True)
return person
MySQL:
delete p1 from Person p1, Person p2
where p1.email = p2.email and p1.id > p2.id
6. Rising Temperature (197)
Table: Weather
Columns: id (int), recordDate (date), temperature (int). id is a column with unique values.
Goal: find the ids of all dates whose temperature is higher than the previous day's (yesterday's).
Pandas:
def rising_temperature(weather: pd.DataFrame) -> pd.DataFrame:
weather.sort_values(by='recordDate',inplace=True)
df = weather[((weather['recordDate'] - weather['recordDate'].shift(1)) == pd.Timedelta(days=1)) & (weather['temperature'] > weather['temperature'].shift(1))]
return df[['id']]
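MySQL (not in my original notes; the usual approach is a self-join on DATEDIFF, shown here as a sketch):
select w1.id
from Weather w1
join Weather w2
on datediff(w1.recordDate, w2.recordDate) = 1
where w1.temperature > w2.temperature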
7. Game Play Analysis I (511)
Table: Activity
Columns: player_id (int), device_id (int), event_date (date), games_played (int). In SQL, the primary key is (player_id, event_date).
Goal: find the first login date for each player.
Pandas:
def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
return activity.groupby(by='player_id')['event_date'].min().reset_index().rename(columns={'event_date':'first_login'})[['player_id','first_login']]
MySQL:
select player_id, min(event_date) as first_login
from Activity
group by player_id
8. Game Play Analysis II (512)
Table: Activity
Columns: player_id (int), device_id (int), event_date (date), games_played (int). (player_id, event_date) is the composite primary key (a combination of columns with unique values).
Goal: report the device on which each player first logged in.
Pandas:
def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
data = activity.groupby('player_id')[['event_date']].min().reset_index()
return pd.merge(activity,data,on=['player_id','event_date'],how='inner')[['player_id','device_id']]
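MySQL (not in my original notes; a common approach matches each player's first event_date with a row-constructor IN, shown as a sketch):
select player_id, device_id
from Activity
where (player_id, event_date) in (
select player_id, min(event_date)
from Activity
group by player_id
)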
9. Employee Bonus (577)
Table 1: Employee
Columns: empId (int), name (varchar), supervisor (int), salary (int). empId is a column with unique values.
Table 2: Bonus
Columns: empId (int), bonus (int). empId is a column with unique values.
Goal: report the name and bonus amount of every employee whose bonus is less than 1000.
Pandas:
def employee_bonus(employee: pd.DataFrame, bonus: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(employee,bonus,on='empId',how='left')
return df[(df['bonus'] < 1000) | (df['bonus'].isnull())][['name','bonus']]
MySQL:
select name, bonus from Employee e
left join Bonus b
on e.empId = b.empId
where bonus < 1000 or bonus is null
10. Find Customer Referee (584)
Table: Customer
Columns: id (int), name (varchar), referee_id (int). id is the primary key of this table.
Goal: find the names of customers who were not referred by the customer with id = 2.
Pandas:
def find_customer_referee(customer: pd.DataFrame) -> pd.DataFrame:
return customer[(customer['referee_id'] != 2) | (customer['referee_id'].isna())][['name']]
MySQL:
select name
from customer
where referee_id !=2 or referee_id is null
11. Customer Placing the Largest Number of Orders (586)
Table: Orders
Columns: order_number (int), customer_number (int). order_number is the primary key of this table.
Goal: find the customer_number of the customer who has placed the largest number of orders.
Pandas:
def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
df = orders.groupby(by='customer_number').count().reset_index()
return df[df['order_number'] == df['order_number'].max()][['customer_number']]
MySQL:
select customer_number from orders
group by customer_number
order by
COUNT(customer_number) DESC
LIMIT 1
Follow-up: if several customers tie for the largest number of orders, find all of their customer_number values.
The Pandas solution above already returns every customer with the maximum count, so it covers this case.
12. Big Countries (595)
Table: World
Columns: name (varchar), continent (varchar), area (int), population (int), gdp (bigint). name is the primary key of this table.
Goal: find the name, population, and area of the big countries. A country is big if its area is at least 3,000,000 square kilometers or its population is at least 25,000,000.
Pandas:
def big_countries(world: pd.DataFrame) -> pd.DataFrame:
return world[(world['area'] >= 3000000) | (world['population'] >= 25000000)][['name','population','area']]
MySQL:
select name, population, area
from World
where area >= 3000000 or population >=25000000
13. Classes With at Least 5 Students (596)
Table: Courses
Columns: student (varchar), class (varchar). In SQL, (student, class) is the primary key.
Goal: find all classes that have at least five students.
Pandas:
def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
df = courses.groupby(by='class').count().reset_index()
return df.loc[df['student'] >= 5, ['class']]
MySQL:
select class
from Courses
group by class
having count(distinct student) >= 5
14. Friend Requests I: Overall Acceptance Rate (597)
Table 1: FriendRequest
Columns: sender_id (int), send_to_id (int), request_date (date). This table may contain duplicates (in other words, it has no primary key in SQL).
Table 2: RequestAccepted
Columns: requester_id (int), accepter_id (int), accept_date (date). This table may contain duplicates.
Goal: find the overall acceptance rate of friend requests, rounded to 2 decimal places: the number of accepted requests divided by the total number of requests.
Pandas:
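I never filled this one in; a minimal sketch, assuming the LeetCode template signature and that duplicate requests/acceptances are counted once (hence the drop_duplicates):
def acceptance_rate(friend_request: pd.DataFrame, request_accepted: pd.DataFrame) -> pd.DataFrame:
    # Count each (sender, receiver) pair and each (requester, accepter) pair only once.
    requests = friend_request.drop_duplicates(subset=['sender_id', 'send_to_id'])
    accepted = request_accepted.drop_duplicates(subset=['requester_id', 'accepter_id'])
    # Guard against division by zero when there are no requests at all.
    rate = round(len(accepted) / len(requests), 2) if len(requests) else 0.0
    return pd.DataFrame({'accept_rate': [rate]})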
15. Sales Person (607)
Table 1: SalesPerson
Columns: sales_id (int), name (varchar), salary (int), commission_rate (int), hire_date (date). sales_id is the primary key.
Table 2: Company
Columns: com_id (int), name (varchar), city (varchar). com_id is the primary key.
Table 3: Orders
Columns: order_id (int), order_date (date), com_id (int), sales_id (int), amount (int). order_id is the primary key.
Goal: find the names of all salespersons who have no orders related to the company named "RED".
Pandas:
def sales_person(sales_person: pd.DataFrame, company: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(orders, company, on='com_id')
red_orders = df[df['name'] == 'RED']
invalid_ids = red_orders.sales_id.unique()
valid_sales_person = sales_person[~sales_person['sales_id'].isin(invalid_ids)]
return valid_sales_person[['name']]
16. Triangle Judgement (610)
Table: Triangle
Columns: x (int), y (int), z (int). In SQL, (x, y, z) is the primary key.
Goal: for each row, report whether the three segments can form a triangle.
Pandas:
def triangle_judgement(triangle: pd.DataFrame) -> pd.DataFrame:
triangle['triangle'] = triangle.apply(lambda x: 'Yes' if max(x) < sum(x) - max(x) else 'No', axis=1)
return triangle
MySQL:
select *, if(x+y>z and x+z>y and y+z>x,'Yes','No') as triangle from Triangle
17. Shortest Distance in a Line (613)
Table: Point
Columns: x (int). In SQL, x is the primary key.
Goal: find the shortest distance between any two points in the table.
Pandas:
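No solution was recorded here; a minimal sketch, assuming the LeetCode template signature (after sorting, the closest pair must be adjacent, so diff() is enough):
def shortest_distance(point: pd.DataFrame) -> pd.DataFrame:
    # Sort the values; the minimum gap between neighbours is the answer.
    shortest = point['x'].sort_values().diff().min()
    return pd.DataFrame({'shortest': [shortest]})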
18. Biggest Single Number (619)
Table: MyNumbers
Columns: num (int). This table may contain duplicates (in other words, it has no primary key in SQL).
Goal: find the largest number that appears only once; return null if there is no such number.
Pandas:
def biggest_single_number(my_numbers: pd.DataFrame) -> pd.DataFrame:
    # Keep only the numbers that appear exactly once, then take the largest.
    singles = my_numbers.drop_duplicates(subset='num', keep=False)
    if singles.empty:
        return pd.DataFrame({'num': [None]})
    return pd.DataFrame({'num': [singles['num'].max()]})
19. Not Boring Movies (620)
Table: cinema
Columns: id (int), movie (varchar), description (varchar), rating (float). id is the primary key of this table.
Goal: find all movies with an odd id whose description is not "boring", ordered by rating in descending order.
Pandas:
def not_boring_movies(cinema: pd.DataFrame) -> pd.DataFrame:
return cinema.query('description != "boring" and id % 2 == 1').sort_values(by='rating',ascending=False)
20. Swap Salary (627)
Table: Salary
Columns: id (int), name (varchar), sex (ENUM), salary (int). id is the primary key of this table.
Goal: swap all 'f' and 'm' values using a single UPDATE statement and no intermediate temporary tables.
Pandas:
def swap_salary(salary: pd.DataFrame) -> pd.DataFrame:
salary['sex'] = salary['sex'].map({'f':'m','m':'f'})
return salary
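Since the stated goal is a single UPDATE statement, here is the MySQL version (a one-liner with IF):
update Salary set sex = if(sex = 'm', 'f', 'm')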
21. Actors and Directors Who Cooperated At Least Three Times (1050)
Table: ActorDirector
Columns: actor_id (int), director_id (int), timestamp (int). timestamp is the primary key of this table.
Goal: find all (actor_id, director_id) pairs that have cooperated at least three times.
Pandas:
def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
df = actor_director.groupby(by=['actor_id','director_id']).count().reset_index()
return df[df['timestamp'] >= 3][['actor_id','director_id']]
MySQL:
select actor_id, director_id
from ActorDirector
group by actor_id, director_id
having count(timestamp) >= 3
22. Product Sales Analysis I (1068)
Table 1: Sales
Columns: sale_id (int), product_id (int), year (int), quantity (int), price (int). (sale_id, year) is the primary key of the Sales table.
Table 2: Product
Columns: product_id (int), product_name (varchar). product_id is the primary key.
Goal: for every sale_id in the Sales table, report the product_name together with the year and price of that sale.
Pandas:
def sales_analysis(sales: pd.DataFrame, product: pd.DataFrame) -> pd.DataFrame:
data = pd.merge(sales,product,on='product_id',how='left')[['product_name','year','price']]
return data.sort_values(by='product_name')
MySQL:
select product_name, year, price from Sales s
left join Product p
on s.product_id = p.product_id
order by product_name asc
23. Project Employees I (1075)
Table 1: Project
Columns: project_id (int), employee_id (int). (project_id, employee_id) is the primary key.
Table 2: Employee
Columns: employee_id (int), name (varchar), experience_years (int). employee_id is the primary key.
Goal: report the average experience years of all the employees for each project, rounded to 2 decimal places.
Pandas:
def project_employees_i(project: pd.DataFrame, employee: pd.DataFrame) -> pd.DataFrame:
return pd.merge(project,employee,on='employee_id',how='right').groupby(by='project_id')[['experience_years']].mean().round(2).reset_index().rename(columns={'experience_years':'average_years'})
24. Sales Analysis III (1084)
Table 1: Product
Columns: product_id (int), product_name (varchar), unit_price (int). product_id is the primary key (a column with unique values).
Table 2: Sales
Columns: seller_id (int), product_id (int), buyer_id (int), sale_date (date), quantity (int), price (int). This table may contain duplicate rows.
Goal: report the products that were only sold in the first quarter of 2019, that is, between 2019-01-01 and 2019-03-31 inclusive.
Pandas:
def sales_analysis(product: pd.DataFrame, sales: pd.DataFrame) -> pd.DataFrame:
inValid = sales[(sales['sale_date'] < '2019-01-01') | (sales['sale_date'] > '2019-03-31')]['product_id']
Valid = sales[~sales['product_id'].isin(inValid)]['product_id']
return product[product['product_id'].isin(Valid)][['product_id','product_name']]
MySQL:
select p.product_id, p.product_name
from Product p
right join Sales s
on p.product_id = s.product_id
group by p.product_id
having count(sale_date between '2019-01-01' and '2019-03-31' or null) = count(*)
25. User Activity for the Past 30 Days I (1141)
Table: Activity
Columns: user_id (int), session_id (int), activity_date (date), activity_type (enum). The table does not contain duplicate rows.
Goal: count the daily active users for the 30-day period ending 2019-07-27 inclusive (a user counts as active on a day if they have at least one record on that day).
Pandas:
def user_activity(activity: pd.DataFrame) -> pd.DataFrame:
d = pd.Timestamp('2019-07-27') - activity['activity_date']
df = activity[(d < pd.Timedelta(days=30)) & (d >= pd.Timedelta(days=0))]
df = df.groupby(by='activity_date')[['user_id']].nunique().reset_index().rename(columns={'activity_date':'day','user_id':'active_users'})
return df
MySQL:
select activity_date as day, count(distinct user_id) as active_users
from Activity
where activity_date between subdate('2019-07-27',29) and '2019-07-27'
group by activity_date
26. Article Views I (1148)
Table: Views
Columns: article_id (int), author_id (int), viewer_id (int), view_date (date). This table may have duplicate rows (in other words, there is no primary key in SQL).
Goal: find all authors who viewed at least one of their own articles, sorted by id in ascending order.
MySQL:
select distinct author_id as id
from Views
where author_id = viewer_id
order by author_id;
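Pandas (not in my original notes; a minimal sketch, assuming the LeetCode template signature):
def article_views(views: pd.DataFrame) -> pd.DataFrame:
    # An author viewing their own article means author_id == viewer_id.
    own_views = views[views['author_id'] == views['viewer_id']]
    return own_views[['author_id']].drop_duplicates().rename(columns={'author_id': 'id'}).sort_values('id')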
27. Reformat Department Table (1179)
Table: Department
Columns: id (int), revenue (int), month (varchar). In SQL, (id, month) is the composite primary key.
Goal: reformat the table so that each row has a department id column plus one revenue column per month.
Pandas:
def reformat_table(department: pd.DataFrame) -> pd.DataFrame:
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df = department.melt(id_vars=['id','month']).pivot(values='value',columns='month',index='id').reindex(columns=months)
month = ['{}_Revenue'.format(x) for x in months]
df.columns = month
return df.reset_index()
28. Queries Quality and Percentage (1211)
Table: Queries
Columns: query_name (varchar), result (varchar), position (int), rating (int). This table may contain duplicate rows.
Goal: for each query_name, report its quality and poor_query_percentage, where quality is the average of the ratio between each query's rating and its position, and poor_query_percentage is the percentage of queries whose rating is less than 3. Round both values to 2 decimal places.
Pandas:
import pandas as pd
from decimal import Decimal, ROUND_HALF_UP
def up_round(x):
return Decimal(x).quantize(Decimal('.00'),rounding=ROUND_HALF_UP)
def queries_stats(queries: pd.DataFrame) -> pd.DataFrame:
queries['quality'] = queries['rating']/queries['position']
df = queries.groupby(by='query_name').agg(
quality=('quality',lambda x: up_round(x.mean())),
poor_query_percentage=('rating',lambda x: up_round((x<3).sum()/len(x)*100))
).reset_index()
return df
29. Average Selling Price (1251)
Table 1: Prices
Columns: product_id (int), start_date (date), end_date (date), price (int). (product_id, start_date, end_date) is the primary key.
Table 2: UnitsSold
Columns: product_id (int), purchase_date (date), units (int). This table may contain duplicate rows.
Goal: find the average selling price for each product; average_price must be rounded to 2 decimal places.
MySQL (the date condition has to live in the join, not in a WHERE clause, otherwise products with no sales would be filtered out):
select p.product_id, ifnull(round(sum(price * units)/sum(units),2),0) as average_price
from Prices p
left join UnitsSold u
on p.product_id = u.product_id
and (purchase_date between start_date and end_date)
group by p.product_id
30. Students and Examinations (1280)
Table 1: Students
Columns: student_id (int), student_name (varchar). In SQL, student_id is the primary key.
Table 2: Subjects
Columns: subject_name (varchar). In SQL, subject_name is the primary key.
Table 3: Examinations
Columns: student_id (int), subject_name (varchar). This table may contain duplicate rows.
Goal: report the number of times each student attended each exam, sorted by student_id and subject_name.
Pandas:
def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:
grouped = examinations.groupby(['student_id', 'subject_name']).size().reset_index(name='attended_exams')
all_id_subjects = pd.merge(students, subjects, how='cross')
id_subjects_count = pd.merge(all_id_subjects, grouped, on=['student_id', 'subject_name'], how='left')
id_subjects_count['attended_exams'] = id_subjects_count['attended_exams'].fillna(0).astype(int)
id_subjects_count.sort_values(['student_id', 'subject_name'], inplace=True)
return id_subjects_count[['student_id', 'student_name', 'subject_name', 'attended_exams']]
31. Replace Employee ID With The Unique Identifier (1378)
Table 1: Employees
Columns: id (int), name (varchar). In SQL, id is the primary key of this table.
Table 2: EmployeeUNI
Columns: id (int), unique_id (int). In SQL, (id, unique_id) is the primary key of this table.
Goal: show the unique ID of each employee; if an employee does not have one, show null instead.
Pandas:
def replace_employee_id(employees: pd.DataFrame, employee_uni: pd.DataFrame) -> pd.DataFrame:
return pd.merge(employees,employee_uni,on='id',how='left')[['unique_id','name']]
MySQL:
select u.unique_id, e.name
from Employees e
left join EmployeeUNI u
on e.id = u.id
32. List the Products Ordered in a Period (1327)
Table 1: Products
Columns: product_id (int), product_name (varchar), product_category (varchar). product_id is the primary key of this table.
Table 2: Orders
Columns: product_id (int), order_date (date), unit (int). This table may contain duplicate rows.
Goal: get the names of products that had at least 100 units ordered in February 2020, together with the amount ordered.
Pandas:
def list_products(products: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
orders = orders[orders['order_date'].dt.strftime('%Y-%m') =='2020-02']
df = pd.merge(products,orders,on='product_id',how='left')
data = df.groupby(by='product_name')[['unit']].sum().reset_index()
return data[data['unit'] >= 100][['product_name','unit']]
MySQL:
select p.product_name, sum(o.unit) as unit
from Products p
left join Orders o
on p.product_id = o.product_id
where o.order_date like '2020-02%'
group by p.product_id
having sum(unit) >= 100
33. Top Travellers (1407)
Table 1: Users
Columns: id (int), name (varchar). id is a column with unique values.
Table 2: Rides
Columns: id (int), user_id (int), distance (int). id is a column with unique values.
Goal: report the total travelled distance of each user, ordered by travelled_distance in descending order; if two or more users travelled the same distance, order them by name in ascending order.
Pandas:
def top_travellers(users: pd.DataFrame, rides: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(users,rides,left_on='id',right_on='user_id',how='left',suffixes=('','_y')).groupby(by=['id','name']).sum().reset_index().rename(columns={'distance':'travelled_distance'}).sort_values(by=['travelled_distance','name'],ascending=[False,True])
df.fillna(0,inplace=True)
return df[['name','travelled_distance']]
34. Group Sold Products By The Date (1484)
Table: Activities
Columns: sell_date (date), product (varchar). The table has no primary key and may contain duplicates.
Goal: for each sell_date, report the number of different products sold and their names, with the product names for each date listed in lexicographical order, comma-separated.
Pandas:
def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
group = activities.groupby(by='sell_date')
stats = group.agg(
num_sold=('product','nunique'),
products=('product',lambda x: ','.join(sorted(set(x))))
).reset_index()
stats.sort_values(by='sell_date',inplace=True)
return stats
35. Find Users With Valid E-Mails (1517)
Table: Users
Columns: user_id (int), name (varchar), mail (varchar). user_id is the primary key of this table.
Goal: find the users with valid emails. A valid email has a prefix name and a domain, where the prefix name is a string that may contain letters (upper or lower case), digits, underscores '_', periods '.', and/or dashes '-', and must start with a letter, and the domain is '@leetcode.com'. Return the result table in any order.
Pandas:
def valid_emails(users: pd.DataFrame) -> pd.DataFrame:
r = r'^[a-zA-Z][a-zA-Z0-9_.-]*@leetcode\.com$'
valid_users = users[users['mail'].str.match(r)]
return valid_users
36. Patients With a Condition (1527)
Table: Patients
Columns: patient_id (int), patient_name (varchar), conditions (varchar). patient_id is the primary key of this table.
Goal: report the patient_id, patient_name, and conditions of the patients who have Type I Diabetes. Type I Diabetes always has a condition code starting with the prefix DIAB1.
Pandas:
def find_patients(patients: pd.DataFrame) -> pd.DataFrame:
return patients[patients['conditions'].apply(judge)]
def judge(conditions):
for i in conditions.split():
if i.startswith('DIAB1'):
return True
return False
MySQL:
select patient_id, patient_name, conditions
from Patients
where conditions rlike '^DIAB1|.*\\sDIAB1'
37. Customer Who Visited but Did Not Make Any Transactions (1581)
Table 1: Visits
Columns: visit_id (int), customer_id (int). visit_id is a column with unique values.
Table 2: Transactions
Columns: transaction_id (int), visit_id (int), amount (int). transaction_id is a column with unique values.
Goal: some customers may have visited the mall without making any transactions; find the IDs of those customers and the number of times they visited without transacting.
Pandas:
def find_customers(visits: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(visits,transactions,on='visit_id',how='outer')
df.fillna(0,inplace=True)
df = df.groupby(by='customer_id')[['amount']].value_counts().to_frame(name='count_no_trans').reset_index()
return df[df['amount'] == 0][['customer_id','count_no_trans']]
38. Bank Account Summary II (1587)
Table 1: Users
Columns: account (int), name (varchar). account is the primary key of this table.
Table 2: Transactions
Columns: trans_id (int), account (int), amount (int). trans_id is the primary key of this table.
Goal: report the name and balance of every user whose balance is higher than 10000. The balance of an account is the sum of all transactions involving that account.
Pandas:
def account_summary(users: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(users,transactions,on='account',how='right').groupby(by='name')[['amount']].sum().reset_index().rename(columns={'amount':'balance'})
return df[df['balance'] > 10000]
39. Average Time of Process per Machine (1661)
Table: Activity
Columns: machine_id (int), process_id (int), activity_type (enum), timestamp (float). (machine_id, process_id, activity_type) is the primary key (a combination of columns with unique values). machine_id is the ID of a machine, process_id is the ID of a process running on it, activity_type is one of ('start', 'end'), and timestamp is the time in seconds. 'start' marks when the process started on that machine and 'end' when it finished; each (machine, process) pair has exactly one start and one end, with the start timestamp always before the end timestamp.
Goal: a factory website is run by several machines, each running the same number of processes. Compute the average time each machine takes to complete a process: a process's time is its 'end' timestamp minus its 'start' timestamp, and the average per machine is the total time over all of its processes divided by the number of processes. The result must contain machine_id and the average time aliased as processing_time, rounded to 3 decimal places. Return the table in any order.
Pandas:
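No solution was recorded here; a minimal sketch, assuming the LeetCode template signature (pivot start/end onto one row per process, then average the differences):
def get_average_time(activity: pd.DataFrame) -> pd.DataFrame:
    # One row per (machine_id, process_id) with its start and end timestamps side by side.
    pivoted = activity.pivot_table(index=['machine_id', 'process_id'], columns='activity_type', values='timestamp').reset_index()
    pivoted['duration'] = pivoted['end'] - pivoted['start']
    result = pivoted.groupby('machine_id')['duration'].mean().round(3).reset_index()
    return result.rename(columns={'duration': 'processing_time'})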
40. Fix Names in a Table (1667)
Table: Users
Columns: user_id (int), name (varchar). user_id is the primary key (a column with unique values). The table contains the ID and name of each user; names consist of only lowercase and uppercase characters.
Goal: fix the names so that only the first character is uppercase and the rest are lowercase. Return the result table ordered by user_id.
Pandas:
def fix_names(users: pd.DataFrame) -> pd.DataFrame:
users['name'] = users.apply(lambda x: x['name'][0].upper() + x['name'][1:].lower(),axis=1)
users = users.sort_values('user_id')
return users
MySQL:
select user_id, concat(upper(left(name,1)),lower(substring(name,2))) as name from Users
order by user_id
41. Invalid Tweets (1683)
Table: Tweets
Columns: tweet_id (int), content (varchar). In SQL, tweet_id is the primary key of this table.
Goal: find the IDs of the invalid tweets. A tweet is invalid if the number of characters in its content is strictly greater than 15.
MySQL:
select tweet_id
from Tweets
where char_length(content) > 15
42. Daily Leads and Partners (1693)
Table: DailySales
Columns: date_id (date), make_name (varchar), lead_id (int), partner_id (int). There is no primary key (no column with unique values) and the table may contain duplicates. Each row records the date, the product name, and the IDs of the lead and partner it was sold to; names consist only of lowercase English letters.
Goal: for each date_id and make_name, count the number of distinct lead_id values and distinct partner_id values. Return the result table in any order.
Pandas:
def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
return daily_sales.groupby(['date_id','make_name'])[['lead_id','partner_id']].nunique().reset_index().rename(columns={'lead_id':'unique_leads','partner_id':'unique_partners'})
MySQL:
select date_id, make_name, count(distinct lead_id) as unique_leads, count(distinct partner_id) as unique_partners
from DailySales
group by date_id, make_name
43. Find Followers Count (1729)
Table: Followers
Columns: user_id (int), follower_id (int). (user_id, follower_id) is the primary key of this table.
Goal: for each user, return the number of followers they have. Return the result table ordered by user_id.
Pandas:
def count_followers(followers: pd.DataFrame) -> pd.DataFrame:
return followers.groupby('user_id').size().reset_index(name='followers_count')
MySQL:
select user_id, count(follower_id) as followers_count from Followers
group by user_id
order by user_id
44. Recyclable and Low Fat Products (1757)
Table: Products
Columns: product_id (int), low_fats (enum), recyclable (enum).
product_id is the primary key.
Goal: find the IDs of products that are both low fat and recyclable.
Pandas:
def find_products(products: pd.DataFrame) -> pd.DataFrame:
return products[(products['low_fats'] == 'Y') & (products['recyclable'] == 'Y')][['product_id']]
MySQL:
select product_id
from Products
where low_fats = 'Y' and recyclable = 'Y'
45. Primary Department for Each Employee (1789)
Table: Employee
Columns: employee_id (int), department_id (int), primary_flag (varchar). The primary key is (employee_id, department_id) (a combination of columns with unique values). employee_id is the employee's ID and department_id is a department they belong to. primary_flag is an enum of ('Y', 'N'): 'Y' means that department is the employee's primary department, 'N' means it is not. An employee can belong to several departments and must pick a primary one when they belong to more than one; note that when an employee belongs to only one department, that department counts as their primary department even though the flag is 'N'.
Goal: report the primary department of each employee. Return the result in any order.
Pandas:
import pandas as pd
def find_primary_department(employee: pd.DataFrame) -> pd.DataFrame:
return employee[(employee['primary_flag'] == 'Y') | (~employee.duplicated('employee_id', keep=False))][['employee_id', 'department_id']]
46. Rearrange Products Table (1795)
Table: Products
Columns: product_id (int), store1 (int), store2 (int), store3 (int). In SQL, product_id is the primary key. Each row stores the product's price in store1, store2, and store3; if the product is not sold in a store, the value is null.
Goal: rearrange the table so that each row has the form (product_id, store, price). If a product is not available in a store, do not output a row for it.
Pandas:
def rearrange_products_table(products: pd.DataFrame) -> pd.DataFrame:
df = pd.melt(products,id_vars='product_id',var_name='store')
df.dropna(axis=0,inplace=True)
return df.rename(columns={'value':'price'})
47. Calculate Special Bonus (1873)
Table: Employees
Columns: employee_id (int), name (varchar), salary (int). employee_id is the primary key (a column with unique values). Each row gives an employee's id, name, and salary.
Goal: calculate the bonus of each employee. The bonus is 100% of the salary if the employee's id is odd and their name does not start with 'M'; otherwise the bonus is 0. Return the result ordered by employee_id.
Pandas:
import pandas as pd
def calculate_special_bonus(employees: pd.DataFrame) -> pd.DataFrame:
employees['bonus'] = employees['salary']
employees.loc[(employees['employee_id']%2 == 0) | (employees['name'].str.startswith('M')), 'bonus'] = 0
employees.sort_values('employee_id', inplace=True)
return employees[['employee_id', 'bonus']]
48. Employees With Missing Information (1965)
Table 1: Employees
Columns: employee_id (int), name (varchar). employee_id is a column with unique values.
Table 2: Salaries
Columns: employee_id (int), salary (int). employee_id is a column with unique values.
Goal: find all employees with missing information, i.e. whose name or salary is missing. Return their employee_id values sorted in ascending order.
Pandas:
def find_employees(employees: pd.DataFrame, salaries: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(employees,salaries,on='employee_id',how='outer')
interset = list(set(employees['employee_id']).intersection(set(salaries['employee_id'])))
df['judge'] = df['employee_id'].apply(lambda x: 1 if x in interset else 0)
return df[df['judge'] == 0][['employee_id']].sort_values(by='employee_id')
49. Employees Whose Manager Left the Company (1978)
Table: Employees
Columns: employee_id (int), name (varchar), manager_id (int), salary (int). In SQL, employee_id is the primary key of this table.
Goal: find the IDs of the employees whose salary is strictly less than 30000 and whose manager left the company. When a manager leaves, their row is deleted from the Employees table, but their reports' manager_id still points to the departed manager. Return the result ordered by employee_id in ascending order.
Pandas:
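No solution was recorded here; a minimal sketch, assuming the LeetCode template signature:
def find_employees(employees: pd.DataFrame) -> pd.DataFrame:
    # A manager has left if their id no longer appears as an employee_id.
    manager_gone = ~employees['manager_id'].isin(employees['employee_id'])
    df = employees[(employees['salary'] < 30000) & employees['manager_id'].notna() & manager_gone]
    return df[['employee_id']].sort_values('employee_id')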
50. Number of Unique Subjects Taught by Each Teacher (2356)
Table: Teacher
Columns: teacher_id (int), subject_id (int), dept_id (int). In SQL, (subject_id, dept_id) is the primary key. Each row means that the teacher with teacher_id teaches the subject subject_id in department dept_id.
Goal: report the number of unique subjects each teacher teaches in the university. Return the result table in any order.
Pandas:
def count_unique_subjects(teacher: pd.DataFrame) -> pd.DataFrame:
return teacher.groupby(['teacher_id'])[['subject_id']].nunique().reset_index().rename(columns={'subject_id':'cnt'})
MySQL:
select teacher_id, count(distinct subject_id) as cnt
from Teacher
group by teacher_id
Difficulty: Medium
1. Second Highest Salary (176)
Table: Employee
Columns: id (int), salary (int). id is the primary key of this table.
Goal: report the second highest distinct salary; if there is no second highest salary, report null.
Pandas:
def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:
df = employee.drop_duplicates(subset='salary')
if df.shape[0] >= 2:
df = df.sort_values(by='salary',ascending=False).head(2).tail(1)
return df[['salary']].rename(columns={'salary':'SecondHighestSalary'})
else:
return pd.DataFrame({'SecondHighestSalary':[None]})
MySQL:
select (
select distinct salary
from Employee
order by salary desc
limit 1 offset 1
) as SecondHighestSalary
2. Nth Highest Salary (177)
Table: Employee
Columns: id (int), salary (int). id is the primary key of this table.
Goal: report the nth highest distinct salary in the Employee table; if there is no nth highest salary, report null.
Pandas:
def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:
df = employee.drop_duplicates(subset='salary')
if df.shape[0] >= N:
df = df.sort_values(by='salary',ascending=False).head(N).tail(1)
return df[['salary']].rename(columns={'salary':'getNthHighestSalary({})'.format(N)})
else:
return pd.DataFrame({'getNthHighestSalary({})'.format(N):[None]})
3. Rank Scores (178)
Table: Scores
Columns: id (int), score (decimal). id is the primary key of this table.
Goal: rank the scores by the following rules: scores are ranked from highest to lowest; if two scores are tied, they get the same rank; after a tie, the next rank is the next consecutive integer, so there are no gaps between ranks. Return the result table ordered by score in descending order.
Pandas:
def order_scores(scores: pd.DataFrame) -> pd.DataFrame:
scores['rank'] = scores['score'].rank(ascending=False,method='dense')
return scores.sort_values('score',ascending=False)[['score','rank']]
MySQL:
select score, dense_rank() over(order by score desc) as 'rank'
from Scores
4. Consecutive Numbers (180)
Table: Logs
Columns: id (int), num (varchar). id is the primary key of this table.
Goal: find all numbers that appear at least three times consecutively.
Pandas:
def consecutive_numbers(logs: pd.DataFrame) -> pd.DataFrame:
df = logs[(logs['num'] == logs['num'].shift(1)) & (logs['num'] == logs['num'].shift(2))].rename(columns={'num':'ConsecutiveNums'})
return df.drop_duplicates(subset='ConsecutiveNums')[['ConsecutiveNums']]
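MySQL (not in my original notes; the classic triple self-join sketch, assuming, as the problem guarantees, that the ids are consecutive):
select distinct l1.num as ConsecutiveNums
from Logs l1, Logs l2, Logs l3
where l1.id = l2.id - 1 and l2.id = l3.id - 1
and l1.num = l2.num and l2.num = l3.num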
5. Department Highest Salary (184)
Table 1: Employee
Columns: id (int), name (varchar), salary (int), departmentId (int). id is the primary key of this table.
Table 2: Department
Columns: id (int), name (varchar). id is the primary key of this table.
Goal: find the employees who have the highest salary in each department.
Pandas:
def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(employee,department,left_on='departmentId',right_on='id',how='left',suffixes=('','_y'))
max_salary = df.groupby(by=['name_y'])['salary'].transform('max')
df = df[df['salary'] == max_salary]
df.rename(columns={'name_y':'Department','name':'Employee','salary':'Salary'},inplace=True)
return df[['Department','Employee','Salary']]
6. Game Play Analysis IV (550)
Table: Activity
Columns: player_id (int), device_id (int), event_date (date), games_played (int). (player_id, event_date) is the primary key (a combination of columns with unique values).
Goal: report the fraction of players who logged in again on the day right after their first login, rounded to 2 decimal places. In other words, count the players who logged in on at least two consecutive days starting from their first login date, then divide by the total number of players.
Pandas:
def gameplay_analysis(activity: pd.DataFrame) -> pd.DataFrame:
data = activity.groupby(by='player_id')[['event_date']].min()
data['second_login'] = data['event_date'] + pd.DateOffset(days=1)
result = pd.merge(data.reset_index(),activity,left_on=['player_id','second_login'],right_on=['player_id','event_date'],how='right')
fraction = round(len(result[~result['second_login'].isna()]['player_id'].unique())/len(activity['player_id'].unique()),2)
return pd.DataFrame({'fraction':[fraction]})
7. Investments in 2016 (585)
Table: Insurance
Columns: pid (int), tiv_2015 (float), tiv_2016 (float), lat (float), lon (float). pid is the primary key of this table.
Goal: report the sum of tiv_2016 over all policyholders who have the same tiv_2015 value as at least one other policyholder, and whose city is different from every other policyholder's (that is, their (lat, lon) pair is unique). Round tiv_2016 to 2 decimal places.
Pandas:
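No solution was recorded here; a minimal sketch, assuming the LeetCode template signature:
def find_investments(insurance: pd.DataFrame) -> pd.DataFrame:
    # tiv_2015 shared with at least one other policyholder.
    shared_tiv = insurance['tiv_2015'].duplicated(keep=False)
    # (lat, lon) not shared with anyone else.
    unique_location = ~insurance.duplicated(subset=['lat', 'lon'], keep=False)
    total = insurance.loc[shared_tiv & unique_location, 'tiv_2016'].sum()
    return pd.DataFrame({'tiv_2016': [round(total, 2)]})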
8. Movie Rating (1341)
Table 1: Movies
Columns: movie_id (int), title (varchar). movie_id is the primary key (a column with unique values); title is the name of the movie.
Table 2: Users
Columns: user_id (int), name (varchar). user_id is the primary key (a column with unique values).
Table 3: MovieRating
Columns: movie_id (int), user_id (int), rating (int), created_at (date). (movie_id, user_id) is the primary key (a combination of columns with unique values). Each row is a user's rating of a movie; created_at is the date of the review.
Goal: find the name of the user who has rated the most movies (on a tie, return the lexicographically smaller name), and the title of the movie with the highest average rating in February 2020 (on a tie, return the lexicographically smaller title). Lexicographic order is dictionary order; smaller means it sorts first.
Pandas:
def movie_rating(movies: pd.DataFrame, users: pd.DataFrame, movie_rating: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(movie_rating,users,on='user_id',how='left')
result1 = df.groupby('name')[['created_at']].apply(lambda x: x.count()).idxmax()[0]
df1 = pd.merge(movie_rating,movies,on='movie_id',how='left')
result2 = df1[(df1['created_at'] >= '2020-02-01') & (df1['created_at'] < '2020-03-01')].groupby('title')[['rating']].mean().idxmax()[0]
return pd.DataFrame({'results':[result1,result2]})
Difficulty: Hard
1. Department Top Three Salaries (185)
Table 1: Employee
Columns: id (int), name (varchar), salary (int), departmentId (int). id is the primary key of this table.
Table 2: Department
Columns: id (int), name (varchar). id is the primary key of this table.
Goal: find the employees whose salary is among the top three distinct salaries of their department.
Pandas:
def top_three_salaries(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
df = pd.merge(employee,department,left_on='departmentId',right_on='id',how='left',suffixes=('','_y')).sort_values(by=['departmentId','salary'],ascending=[True,False]).rename(columns={'name_y':'Department','name':'Employee','salary':'Salary'})
df['rank'] = df.groupby(by='departmentId').rank(ascending=False,method='dense')['Salary']
return df[df['rank'] <= 3][['Department','Employee','Salary']]
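MySQL (not in my original notes; a window-function sketch using DENSE_RANK, which requires MySQL 8+):
select Department, Employee, Salary
from (
select d.name as Department, e.name as Employee, e.salary as Salary,
dense_rank() over (partition by e.departmentId order by e.salary desc) as rk
from Employee e
join Department d on e.departmentId = d.id
) t
where rk <= 3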
2. Trips and Users (262)
Table 1: Trips
Columns: id (int), client_id (int), driver_id (int), city_id (int), status (enum), request_at (date). id is the primary key of this table.
Table 2: Users
Columns: users_id (int), banned (enum), role (enum). users_id is the primary key of this table.
Goal: compute the cancellation rate of requests with unbanned users (both the client and the driver must be unbanned) for each day between 2013-10-01 and 2013-10-03, where cancellation rate = (number of cancelled requests made by unbanned users) / (total number of requests made by unbanned users), rounded to 2 decimal places.
Pandas:
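No solution was recorded here; a minimal sketch, assuming the LeetCode template signature and that any status other than 'completed' counts as cancelled:
def trips_and_users(trips: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    banned = users.loc[users['banned'] == 'Yes', 'users_id']
    # Keep trips in the date range whose client and driver are both unbanned.
    df = trips[trips['request_at'].between('2013-10-01', '2013-10-03')
               & ~trips['client_id'].isin(banned)
               & ~trips['driver_id'].isin(banned)].copy()
    df['cancelled'] = (df['status'] != 'completed').astype(int)
    out = df.groupby('request_at')['cancelled'].mean().round(2).reset_index()
    out.columns = ['Day', 'Cancellation Rate']
    return out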