【Leetcode 30天Pandas挑战】学习记录下

cwtnice

已于 2023-08-31 17:30:16 修改

阅读量445

点赞数

分类专栏： # Pandas 文章标签： leetcode pandas 学习

于 2023-08-15 17:19:49 首次发布

本文链接：https://blog.csdn.net/cwtnice/article/details/132302788

版权

Pandas 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

题目列表：

数据统计:
数据分组（重要）
数据合并

数据统计:

2082. The Number of Rich Customers

原题链接：2082. The Number of Rich Customers
考察：去重、计数

Table: Store

+-------------+------+
| Column Name | Type |
+-------------+------+
| bill_id     | int  |
| customer_id | int  |
| amount      | int  |
+-------------+------+
bill_id is the primary key (column with unique values) for this table.
Each row contains information about the amount of one bill and the customer associated with it.

Write a solution to report the number of customers who had at least one bill with an amount strictly greater than 500 .

The result format is in the following example.

Example 1:

Input:

Store table:
+---------+-------------+--------+
| bill_id | customer_id | amount |
+---------+-------------+--------+
| 6       | 1           | 549    |
| 8       | 1           | 834    |
| 4       | 2           | 394    |
| 11      | 3           | 657    |
| 13      | 3           | 257    |
+---------+-------------+--------+

Output:

+------------+
| rich_count |
+------------+
| 2          |
+------------+

Explanation:
Customer 1 has two bills with amounts strictly greater than 500.
Customer 2 does not have any bills with an amount strictly greater than 500.
Customer 3 has one bill with an amount strictly greater than 500.

题目大意：
统计一下amount大于500的customer的数目，一个customer会有多张单据，需要去重后统计

pandas 思路：
nunique() 方法能直接返回去重后的个数

pandas 实现：

import pandas as pd

def count_rich_customers(store: pd.DataFrame) -> pd.DataFrame:
    store = store[store['amount'] > 500]
    return pd.DataFrame({'rich_count': [store['customer_id'].nunique()]})

MySQL 思路：
用一下 count 和 distinct

MySQL 实现：

SELECT
    COUNT(DISTINCT customer_id) AS rich_count
FROM Store
WHERE amount > 500

1173. Immediate Food Delivery I

原题链接：1173. Immediate Food Delivery I
考察：保留n位小数

Table: Delivery

+-----------------------------+---------+
| Column Name                 | Type    |
+-----------------------------+---------+
| delivery_id                 | int     |
| customer_id                 | int     |
| order_date                  | date    |
| customer_pref_delivery_date | date    |
+-----------------------------+---------+
delivery_id is the primary key (column with unique values) of this table.
The table holds information about food delivery to customers that make orders at some date and specify a preferred delivery date (on the same order date or after it).

If the customer’s preferred delivery date is the same as the order date, then the order is called immediate（以为及时的，立即的）; otherwise, it is called scheduled.

Write a solution to find the percentage of immediate orders in the table, rounded to 2 decimal places.

The result format is in the following example.

Example 1:

Input:

Delivery table:
+-------------+-------------+------------+-----------------------------+
| delivery_id | customer_id | order_date | customer_pref_delivery_date |
+-------------+-------------+------------+-----------------------------+
| 1           | 1           | 2019-08-01 | 2019-08-02                  |
| 2           | 5           | 2019-08-02 | 2019-08-02                  |
| 3           | 1           | 2019-08-11 | 2019-08-11                  |
| 4           | 3           | 2019-08-24 | 2019-08-26                  |
| 5           | 4           | 2019-08-21 | 2019-08-22                  |
| 6           | 2           | 2019-08-11 | 2019-08-13                  |
+-------------+-------------+------------+-----------------------------+

Output:

+----------------------+
| immediate_percentage |
+----------------------+
| 33.33                |
+----------------------+

Explanation: The orders with delivery id 2 and 3 are immediate while the others are scheduled.

题目大意：
计算一个两个日期相等的条目在整表中所占的比例，保留两位小数

pandas 思路：
保留两位小数用 round()

pandas 实现：

import pandas as pd

def food_delivery(delivery: pd.DataFrame) -> pd.DataFrame:
    tmp = delivery[delivery['order_date'] == delivery['customer_pref_delivery_date']]
    return pd.DataFrame({'immediate_percentage': [round((tmp.shape[0] / delivery.shape[0] * 100) , 2)]})

MySQL 思路：
用 avg() 计算占比，用round() 保留两位小数

MySQL 实现：

SELECT
	round( 100 * avg( order_date = customer_pref_delivery_date ), 2 ) AS immediate_percentage 
FROM
	delivery

1907. Count Salary Categories（好题）

原题链接：1907. Count Salary Categories
考察：组合查询结果

Table: Accounts

+-------------+------+
| Column Name | Type |
+-------------+------+
| account_id  | int  |
| income      | int  |
+-------------+------+
account_id is the primary key (column with unique values) for this table.
Each row contains information about the monthly income for one bank account.

Write a solution to calculate the number of bank accounts for each salary category. The salary categories are:

“Low Salary”: All the salaries strictly less than $20000.
“Average Salary”: All the salaries in the inclusive range [$20000, $50000].
“High Salary”: All the salaries strictly greater than $50000.
The result table must contain all three categories. If there are no accounts in a category, return 0.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Accounts table:
+------------+--------+
| account_id | income |
+------------+--------+
| 3          | 108939 |
| 2          | 12747  |
| 8          | 87709  |
| 6          | 91796  |
+------------+--------+

Output:

+----------------+----------------+
| category       | accounts_count |
+----------------+----------------+
| Low Salary     | 1              |
| Average Salary | 0              |
| High Salary    | 3              |
+----------------+----------------+

Explanation:
Low Salary: Account 2.
Average Salary: No accounts.
High Salary: Accounts 3, 6, and 8.

题目大意：
按照规则统计低收入、中等收入、高收入三类人的人数

pandas 思路：
统计数目比较容易，按行筛选后的行数即可。怎么组成题目要求返回的形式是本题的重点，用到 pd.DataFrame() 函数

pandas 实现：

import pandas as pd

def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:
    low_salary = accounts[accounts['income'] < 20000].shape[0]
    average_salary = accounts[(accounts['income'] >= 20000)&(accounts['income'] <= 50000)].shape[0]
    high_salary = accounts[accounts['income'] > 50000].shape[0]

    ans = pd.DataFrame({
        'category': ['Low Salary', 'Average Salary', 'High Salary'],
        'accounts_count': [low_salary, average_salary, high_salary]
    })

    return ans

MySQL 思路：
case when 来按条件筛选，三个sql的结果 union 一下

MySQL 实现：

SELECT 
    'Low Salary' AS category,
    SUM(CASE WHEN income < 20000 THEN 1 ELSE 0 END) AS accounts_count
FROM 
    Accounts
    
UNION
SELECT  
    'Average Salary' category,
    SUM(CASE WHEN income >= 20000 AND income <= 50000 THEN 1 ELSE 0 END) 
    AS accounts_count
FROM 
    Accounts

UNION
SELECT 
    'High Salary' category,
    SUM(CASE WHEN income > 50000 THEN 1 ELSE 0 END) AS accounts_count
FROM 
    Accounts

数据分组（重要）

1741. Find Total Time Spent by Each Employee

原题链接：1741. Find Total Time Spent by Each Employee
考察：分组 + 取最大

Table: Employees

+-------------+------+
| Column Name | Type |
+-------------+------+
| emp_id      | int  |
| event_day   | date |
| in_time     | int  |
| out_time    | int  |
+-------------+------+
(emp_id, event_day, in_time) is the primary key (combinations of columns with unique values) of this table.
The table shows the employees' entries and exits in an office.
event_day is the day at which this event happened, in_time is the minute at which the employee entered the office, and out_time is the minute at which they left the office.
in_time and out_time are between 1 and 1440.
It is guaranteed that no two events on the same day intersect in time, and in_time < out_time.

Write a solution to calculate the total time in minutes spent by each employee on each day at the office. Note that within one day, an employee can enter and leave more than once. The time spent in the office for a single entry is out_time - in_time.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Employees table:
+--------+------------+---------+----------+
| emp_id | event_day  | in_time | out_time |
+--------+------------+---------+----------+
| 1      | 2020-11-28 | 4       | 32       |
| 1      | 2020-11-28 | 55      | 200      |
| 1      | 2020-12-03 | 1       | 42       |
| 2      | 2020-11-28 | 3       | 33       |
| 2      | 2020-12-09 | 47      | 74       |
+--------+------------+---------+----------+

Output:

+------------+--------+------------+
| day        | emp_id | total_time |
+------------+--------+------------+
| 2020-11-28 | 1      | 173        |
| 2020-11-28 | 2      | 30         |
| 2020-12-03 | 1      | 41         |
| 2020-12-09 | 2      | 27         |
+------------+--------+------------+

Explanation:
Employee 1 has three events: two on day 2020-11-28 with a total of (32 - 4) + (200 - 55) = 173, and one on day 2020-12-03 with a total of (42 - 1) = 41.
Employee 2 has two events: one on day 2020-11-28 with a total of (33 - 3) = 30, and one on day 2020-12-09 with a total of (74 - 47) = 27.

题目大意：
给出每个员工的进场和离场表，要求统计每个员工在场的总时长

pandas 思路：
一个时间段的值作差就可得到，每个员工的总时长可以通过 groupby 之后 sum 一下得到

pandas 实现：

import pandas as pd

def total_time(employees: pd.DataFrame) -> pd.DataFrame:
    employees['total_time'] = employees['out_time'] - employees['in_time']

    ans = employees.groupby(by=['event_day', 'emp_id'], as_index=False).agg('sum')
    ans.rename(columns={'event_day':'day'}, inplace=True)
    ans = ans[['day', 'emp_id', 'total_time']]

    return ans

MySQL 思路：
groupby 一下，查询 sum 就可以了

MySQL 实现：

select event_day as day, emp_id, sum(out_time - in_time) as total_time
from Employees
group by event_day, emp_id

511. Game Play Analysis I

原题链接：511. Game Play Analysis I
考察：分组 + 取最小、mysql的窗口函数

Table: Activity

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |
+--------------+---------+
(player_id, event_date) is the primary key (combination of columns with unique values) of this table.
This table shows the activity of players of some games.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.

Write a solution to find the first login date for each player.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Activity table:
+-----------+-----------+------------+--------------+
| player_id | device_id | event_date | games_played |
+-----------+-----------+------------+--------------+
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-05-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |
+-----------+-----------+------------+--------------+

Output:

+-----------+-------------+
| player_id | first_login |
+-----------+-------------+
| 1         | 2016-03-01  |
| 2         | 2017-06-25  |
| 3         | 2016-03-02  |
+-----------+-------------+

题目大意：
输出员工的最早打卡时间

pandas 思路1：
排序+去重保留第一条就可以实现题目要求

pandas 实现1：

import pandas as pd

def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    activity.sort_values(by='event_date', inplace=True) # 按照登录时间排序
    activity.drop_duplicates(subset='player_id', keep='first', inplace=True) # 去重
    activity.rename(columns={'event_date':'first_login'}, inplace=True)

    return activity[['player_id', 'first_login']]

pandas 思路2：
对每个 player_id 进行分组，随后用 min 得到每组里面的最小时间

pandas 实现2：

import pandas as pd

def game_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    df = activity.groupby('player_id')['event_date'].min().reset_index()

    return df.rename(columns = {'event_date':'first_login'})

MySQL 思路1：
也是采用分组+min的方法

MySQL 实现1：

SELECT
  A.player_id,
  MIN(A.event_date) AS first_login
FROM
  Activity A
GROUP BY
  A.player_id

MySQL 思路2：复杂，但突出了许多替代解决方案
采用窗口函数

MySQL 实现2：

SELECT DISTINCT
  A.player_id,
  LAST_VALUE(A.event_date) OVER (
    PARTITION BY
      A.player_id
    ORDER BY
      A.event_date DESC RANGE BETWEEN UNBOUNDED PRECEDING -- 这里是窗口函数框架规范
      AND UNBOUNDED FOLLOWING
  ) AS first_login
FROM
  Activity A;

2356. Number of Unique Subjects Taught by Each Teacher

原题链接：2356. Number of Unique Subjects Taught by Each Teacher
考察：分组 + 去重计数

Table: Teacher

+-------------+------+
| Column Name | Type |
+-------------+------+
| teacher_id  | int  |
| subject_id  | int  |
| dept_id     | int  |
+-------------+------+
(subject_id, dept_id) is the primary key (combinations of columns with unique values) of this table.
Each row in this table indicates that the teacher with teacher_id teaches the subject subject_id in the department dept_id.

Write a solution to calculate the number of unique subjects each teacher teaches in the university.

Return the result table in any order.

The result format is shown in the following example.

Example 1:

Input:

Teacher table:
+------------+------------+---------+
| teacher_id | subject_id | dept_id |
+------------+------------+---------+
| 1          | 2          | 3       |
| 1          | 2          | 4       |
| 1          | 3          | 3       |
| 2          | 1          | 1       |
| 2          | 2          | 1       |
| 2          | 3          | 1       |
| 2          | 4          | 1       |
+------------+------------+---------+

Output:

+------------+-----+
| teacher_id | cnt |
+------------+-----+
| 1          | 2   |
| 2          | 4   |
+------------+-----+

Explanation:
Teacher 1:

They teach subject 2 in departments 3 and 4.
They teach subject 3 in department 3.

Teacher 2:

They teach subject 1 in department 1.
They teach subject 2 in department 1.
They teach subject 3 in department 1.
They teach subject 4 in department 1.

题目大意：
统计一下每个老师所交的课程的数目，同一个课程不同教室只算一门

pandas 思路：
用教师id进行分组，nunique() 用于统计去重后的科目数

pandas 实现：

import pandas as pd

def count_unique_subjects(teacher: pd.DataFrame) -> pd.DataFrame:
    ans = teacher.groupby(by='teacher_id')['subject_id'].nunique().reset_index()
    ans.rename(columns={'subject_id':'cnt'}, inplace=True)

    return ans

MySQL 思路：
group by ，然后用 count() 计数

MySQL 实现：

select 
	teacher_id,
	count(distinct subject_id) as cnt
from 
	Teacher
group by 
	teacher_id

596. Classes More Than 5 Students

原题链接：596. Classes More Than 5 Students
考察：分组 + 计数、having

Table: Courses

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| student     | varchar |
| class       | varchar |
+-------------+---------+
(student, class) is the primary key (combination of columns with unique values) for this table.
Each row of this table indicates the name of a student and the class in which they are enrolled.

Write a solution to find all the classes that have at least five students.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Courses table:
+---------+----------+
| student | class    |
+---------+----------+
| A       | Math     |
| B       | English  |
| C       | Math     |
| D       | Biology  |
| E       | Math     |
| F       | Computer |
| G       | Math     |
| H       | Math     |
| I       | Math     |
+---------+----------+

Output:

+---------+
| class   |
+---------+
| Math    |
+---------+

Explanation:

Math has 6 students, so we include it.
English has 1 student, so we do not include it.
Biology has 1 student, so we do not include it.
Computer has 1 student, so we do not include it.

题目大意：
返回学生数大于等于5的课程名

pandas 思路：
使用 groupby() 按班级分组， size() 计算每个班级的出现次数（也就是学生数），然后进行行筛选即可

pandas 实现：

import pandas as pd

def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
  df = courses.groupby('class').size().reset_index(name='count') # size()计算每个值出现的次数
  df = df[df['count'] >= 5]

  return df[['class']]

MySQL 思路：
where关键字无法和聚合函数一起使用， having 子句可以筛选分组后的各组数据

MySQL 实现：

select 
	class
from 
	courses
group by 
	class
having 
	count(class) >= 5

586. Customer Placing the Largest Number of Orders

原题链接：586. Customer Placing the Largest Number of Orders
考察：分组 + 计数

Table: Orders

+-----------------+----------+
| Column Name     | Type     |
+-----------------+----------+
| order_number    | int      |
| customer_number | int      |
+-----------------+----------+
order_number is the primary key (column with unique values) for this table.
This table contains information about the order ID and the customer ID.

Write a solution to find the customer_number for the customer who has placed the largest number of orders.

The test cases are generated so that exactly one customer will have placed more orders than any other customer.

The result format is in the following example.

Example 1:

Input:

Orders table:
+--------------+-----------------+
| order_number | customer_number |
+--------------+-----------------+
| 1            | 1               |
| 2            | 2               |
| 3            | 3               |
| 4            | 3               |
+--------------+-----------------+

Output:

+-----------------+
| customer_number |
+-----------------+
| 3               |
+-----------------+

Explanation:
The customer with number 3 has two orders, which is greater than either customer 1 or 2 because each of them only has one order.
So the result is customer_number 3.

Follow up:
What if more than one customer has the largest number of orders, can you find all the customer_number in this case?

题目大意：
返回点单数最多的顾客的编号

pandas 思路：
通过 groupby() 进行分组，通过 size() 来统计出现次数，对出现次数进行降序排列，最后输出第一行即可

pandas 实现：

import pandas as pd

def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
	# 如果 orders 为空，提前返回一个空的 DataFrame
    if orders.empty:
        return pd.DataFrame({'customer_number': []})

    df = orders.groupby('customer_number').size().reset_index(name='count') # 统计次数
    df.sort_values(by='count', ascending=False, inplace=True) # 根据次数降序排列

    return df[['customer_number']].head(1)

MySQL 思路：
一样也是 group by 聚合，然后 order by 排序，用 limit 1 来返回第一条

MySQL 实现：

select 
	customer_number
from 
	orders
group by
	 customer_number
order by
	 count(*) desc
limit 1

1484. Group Sold Products By The Date（好题）

原题链接：1484. Group Sold Products By The Date
考察：groupby

Table Activities :

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| sell_date   | date    |
| product     | varchar |
+-------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
Each row of this table contains the product name and the date it was sold in a market.

Write a solution to find for each date the number of different products sold and their names.

The sold products names for each date should be sorted lexicographically（字典序）.

Return the result table ordered by sell_date .

The result format is in the following example.

Example 1:

Input:

Activities table:
+------------+------------+
| sell_date  | product     |
+------------+------------+
| 2020-05-30 | Headphone  |
| 2020-06-01 | Pencil     |
| 2020-06-02 | Mask       |
| 2020-05-30 | Basketball |
| 2020-06-01 | Bible      |
| 2020-06-02 | Mask       |
| 2020-05-30 | T-Shirt    |
+------------+------------+

Output:

+------------+----------+------------------------------+
| sell_date  | num_sold | products                     |
+------------+----------+------------------------------+
| 2020-05-30 | 3        | Basketball,Headphone,T-shirt |
| 2020-06-01 | 2        | Bible,Pencil                 |
| 2020-06-02 | 1        | Mask                         |
+------------+----------+------------------------------+

Explanation:
For 2020-05-30, Sold items were (Headphone, Basketball, T-shirt), we sort them lexicographically and separate them by a comma.
For 2020-06-01, Sold items were (Pencil, Bible), we sort them lexicographically and separate them by a comma.
For 2020-06-02, the Sold item is (Mask), we just return it.

题目大意：
按照日期将多行合并为一行，并统计数目

pandas 思路：
使用 groupby() 进行分组，然后 agg() 进行聚合操作，重点在于将多行合并到一行，这个时不时会遇到，我觉得一定要掌握！！！

pandas 实现：

import pandas as pd

def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    groups = activities.groupby('sell_date') # groupby的结果是一个DataFrameGroupBy对象

    ans = groups.agg(
        num_sold = ('product', 'nunique'),
        products = ('product', lambda x : ','.join(sorted(set(x)))) # 去重后按字典序排列, 用逗号分隔组成字符串
    ).reset_index()

    ans.sort_values('sell_date', inplace=True) # 按照日期排序

    return ans

MySQL 思路：
重点也是多行合并到一行

MySQL 实现：

select
    sell_date,
    count(distinct product) as num_sold,
    group_concat(distinct product order by product separator ',') as products -- 重点
from
    activities
group by
    sell_date
order by
    sell_date asc

1693. Daily Leads and Partners

原题链接：1693. Daily Leads and Partners

Table: DailySales

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| date_id     | date    |
| make_name   | varchar |
| lead_id     | int     |
| partner_id  | int     |
+-------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
This table contains the date and the name of the product sold and the IDs of the lead and partner it was sold to.
The name consists of only lowercase English letters.

For each date_id and make_name, find the number of distinct lead_id’s and distinct partner_id’s.

Return the result table in any order .

The result format is in the following example.

Example 1:

Input:

DailySales table:
+-----------+-----------+---------+------------+
| date_id   | make_name | lead_id | partner_id |
+-----------+-----------+---------+------------+
| 2020-12-8 | toyota    | 0       | 1          |
| 2020-12-8 | toyota    | 1       | 0          |
| 2020-12-8 | toyota    | 1       | 2          |
| 2020-12-7 | toyota    | 0       | 2          |
| 2020-12-7 | toyota    | 0       | 1          |
| 2020-12-8 | honda     | 1       | 2          |
| 2020-12-8 | honda     | 2       | 1          |
| 2020-12-7 | honda     | 0       | 1          |
| 2020-12-7 | honda     | 1       | 2          |
| 2020-12-7 | honda     | 2       | 1          |
+-----------+-----------+---------+------------+

Output:

+-----------+-----------+--------------+-----------------+
| date_id   | make_name | unique_leads | unique_partners |
+-----------+-----------+--------------+-----------------+
| 2020-12-8 | toyota    | 2            | 3               |
| 2020-12-7 | toyota    | 1            | 2               |
| 2020-12-8 | honda     | 2            | 2               |
| 2020-12-7 | honda     | 3            | 2               |
+-----------+-----------+--------------+-----------------+

Explanation:
For 2020-12-8, toyota gets leads = [0, 1] and partners = [0, 1, 2] while honda gets leads = [1, 2] and partners = [1, 2].
For 2020-12-7, toyota gets leads = [0] and partners = [1, 2] while honda gets leads = [0, 1, 2] and partners = [1, 2].

题目大意：
按照 date_id 和 make_name ，统计一下去重后的 lead_id 和 partner_id 的数目

pandas 实现：

import pandas as pd

def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
    tmp = daily_sales.groupby(['date_id', 'make_name'])
    ans = tmp.agg(
        unique_leads = ('lead_id', 'nunique'),
        unique_partners =  ('partner_id', 'nunique')
    ).reset_index()

    return ans

MySQL 实现：

select
    date_id,
    make_name,
    count(distinct lead_id) as unique_leads,
    count(distinct partner_id) as unique_partners
from
    DailySales
group by 
    date_id, make_name
order by
    date_id, make_name

数据合并

1050. Actors and Directors Who Cooperated At Least Three Times

原题链接：1050. Actors and Directors Who Cooperated At Least Three Times
考察：groupby

Table: ActorDirector

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| actor_id    | int     |
| director_id | int     |
| timestamp   | int     |
+-------------+---------+
timestamp is the primary key (column with unique values) for this table.

Write a solution to find all the pairs (actor_id, director_id) where the actor has cooperated with the director at least three times.

Return the result table in any order .

The result format is in the following example.

Example 1:

Input:

ActorDirector table:
+-------------+-------------+-------------+
| actor_id    | director_id | timestamp   |
+-------------+-------------+-------------+
| 1           | 1           | 0           |
| 1           | 1           | 1           |
| 1           | 1           | 2           |
| 1           | 2           | 3           |
| 1           | 2           | 4           |
| 2           | 1           | 5           |
| 2           | 1           | 6           |
+-------------+-------------+-------------+

Output:

+-------------+-------------+
| actor_id    | director_id |
+-------------+-------------+
| 1           | 1           |
+-------------+-------------+

Explanation:
The only pair is (1, 1) where they cooperated exactly 3 times.

题目大意：
统计合作次数大于等于3次的演员和导演的id

pandas思路：
timestamp 没有用不要管它，按照两项 groupby() ， size() 统计次数，最后按行选择输出即可

pandas实现：

import pandas as pd

def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
    tmp = actor_director.groupby(by=['actor_id', 'director_id']).size().reset_index(name='count')
    tmp = tmp[tmp['count'] >= 3]

    return tmp[['actor_id', 'director_id']]

MySQL思路：
groupby 的情况下进行筛选，用 having

MySQL实现：

select 
    actor_id,
    director_id
from
    ActorDirector
group by 
    actor_id, 
    director_id
having
    count(*) >= 3

1378. Replace Employee ID With The Unique Identifier

原题链接：1378. Replace Employee ID With The Unique Identifier
知识点：左连接

Table: Employees

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| name          | varchar |
+---------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains the id and the name of an employee in a company.

Table: EmployeeUNI

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| unique_id     | int     |
+---------------+---------+
(id, unique_id) is the primary key (combination of columns with unique values) for this table.
Each row of this table contains the id and the corresponding unique id of an employee in the company.

Write a solution to show the unique ID of each user, If a user does not have a unique ID replace just show null.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Employees table:
+----+----------+
| id | name     |
+----+----------+
| 1  | Alice    |
| 7  | Bob      |
| 11 | Meir     |
| 90 | Winston  |
| 3  | Jonathan |
+----+----------+

EmployeeUNI table:
+----+-----------+
| id | unique_id |
+----+-----------+
| 3  | 1         |
| 11 | 2         |
| 90 | 3         |
+----+-----------+

Output:

+-----------+----------+
| unique_id | name     |
+-----------+----------+
| null      | Alice    |
| null      | Bob      |
| 2         | Meir     |
| 3         | Winston  |
| 1         | Jonathan |
+-----------+----------+

Explanation:
Alice and Bob do not have a unique ID, We will show null instead.
The unique ID of Meir is 2.
The unique ID of Winston is 3.
The unique ID of Jonathan is 1.

题目大意：
有两个表，需要返回两个表匹配之后的结果

pandas 思路：
merge，没啥好说的

pandas 实现：

import pandas as pd

def replace_employee_id(employees: pd.DataFrame, employee_uni: pd.DataFrame) -> pd.DataFrame:
    tmp = pd.merge(employees, employee_uni, how='left', on='id')

    return tmp[['unique_id', 'name']]

MySQL 思路：
left join，没啥好说的

MySQL 实现：

select
    b.unique_id, a.name
from 
    Employees a
left join
    EmployeeUNI b
on
    a.id = b.id

1280. Students and Examinations（好题）

原题链接：1280. Students and Examinations
考察：交差链接

Table: Students

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| student_id    | int     |
| student_name  | varchar |
+---------------+---------+
student_id is the primary key (column with unique values) for this table.
Each row of this table contains the ID and the name of one student in the school.

Table: Subjects

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| subject_name | varchar |
+--------------+---------+
subject_name is the primary key (column with unique values) for this table.
Each row of this table contains the name of one subject in the school.

Table: Examinations

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| student_id   | int     |
| subject_name | varchar |
+--------------+---------+
There is no primary key (column with unique values) for this table. It may contain duplicates.
Each student from the Students table takes every course from the Subjects table.
Each row of this table indicates that a student with ID student_id attended the exam of subject_name.

Write a solution to find the number of times each student attended each exam.

Return the result table ordered by student_id and subject_name.

The result format is in the following example.

Example 1:

Input:

Students table:
+------------+--------------+
| student_id | student_name |
+------------+--------------+
| 1          | Alice        |
| 2          | Bob          |
| 13         | John         |
| 6          | Alex         |
+------------+--------------+

Subjects table:
+--------------+
| subject_name |
+--------------+
| Math         |
| Physics      |
| Programming  |
+--------------+

Examinations table:
+------------+--------------+
| student_id | subject_name |
+------------+--------------+
| 1          | Math         |
| 1          | Physics      |
| 1          | Programming  |
| 2          | Programming  |
| 1          | Physics      |
| 1          | Math         |
| 13         | Math         |
| 13         | Programming  |
| 13         | Physics      |
| 2          | Math         |
| 1          | Math         |
+------------+--------------+

Output:

+------------+--------------+--------------+----------------+
| student_id | student_name | subject_name | attended_exams |
+------------+--------------+--------------+----------------+
| 1          | Alice        | Math         | 3              |
| 1          | Alice        | Physics      | 2              |
| 1          | Alice        | Programming  | 1              |
| 2          | Bob          | Math         | 1              |
| 2          | Bob          | Physics      | 0              |
| 2          | Bob          | Programming  | 1              |
| 6          | Alex         | Math         | 0              |
| 6          | Alex         | Physics      | 0              |
| 6          | Alex         | Programming  | 0              |
| 13         | John         | Math         | 1              |
| 13         | John         | Physics      | 1              |
| 13         | John         | Programming  | 1              |
+------------+--------------+--------------+----------------+

Explanation:
The result table should contain all students and all subjects.
Alice attended the Math exam 3 times, the Physics exam 2 times, and the Programming exam 1 time.
Bob attended the Math exam 1 time, the Programming exam 1 time, and did not attend the Physics exam.
Alex did not attend any exams.
John attended the Math exam 1 time, the Physics exam 1 time, and the Programming exam 1 time.

题目大意：
需要将三个表合并为一个表，注意不存在的组合也要在表里，数目为0

pandas 思路：
merge不难，重点在于，获取所有的组合，要用到交差连接

pandas 实现：

import pandas as pd

def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:
    # 按照student_id和subject_name分组 对参与考试次数进行计数
    tmp = examinations.groupby(by=['student_id', 'subject_name']).size().reset_index(name='attended_exams')

     # 交差连接 获取所有组合
    base = pd.merge(students, subjects, how='cross')

    # 左连接以保留所有组合
    ans = pd.merge(base, tmp, on=['student_id', 'subject_name'], how='left')

    # 将NAN填为0
    ans['attended_exams'] = ans['attended_exams'].fillna(0).astype(int)

    # 根据student_id Subject_name 升序排序
    ans.sort_values(by=['student_id', 'subject_name'], inplace=True)

    return ans[['student_id', 'student_name', 'subject_name', 'attended_exams']]

MySQL思路：
当一个问题比较复杂的时候，将它拆成若干个子问题：

统计所有的 student_id 和 subject_name 的所有可能组合
统计每个学生参加的考试次数

对于1，通过 cross join 交差连接得到所有的组合
对于2，要用到 ifnull 将为空的项设置为0
最后再将两个子查询左连接即可

MySQL实现：

SELECT
	a.student_id,
	a.student_name,
	b.subject_name,
	ifnull( c.attended_exams, 0 ) AS attended_exams 
FROM
	Students a
	CROSS JOIN Subjects b
	LEFT JOIN ( SELECT student_id, subject_name, COUNT( * ) AS attended_exams FROM Examinations GROUP BY student_id, subject_name ) c ON a.student_id = c.student_id 
	AND b.subject_name = c.subject_name 
ORDER BY
	a.student_id,
	b.subject_name

570. Managers with at Least 5 Direct Reports

原题链接：570. Managers with at Least 5 Direct Reports
考察：内连接

Table: Employee

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| name        | varchar |
| department  | varchar |
| managerId   | int     |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table indicates the name of an employee, their department, and the id of their manager.
If managerId is null, then the employee does not have a manager.
No employee will be the manager of themself.

Write a solution to find managers with at least five direct reports(下属).

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | None      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
+-----+-------+------------+-----------+

Output:

+------+
| name |
+------+
| John |
+------+

题目大意：
找到至少有五个直接下属的经理

pandas 思路：
统计 managerId 列下每个经理的ID的出现次数，并筛选大于等于5的行，随后和原表进行内连接（左右两边都有才纳入结果）

pandas 实现：

import pandas as pd

def find_managers(employee: pd.DataFrame) -> pd.DataFrame:
    tmp = employee.groupby('managerId').size().reset_index(name='count') # 统计次数
    tmp = tmp[tmp['count'] >= 5]
    
    ans = pd.merge(tmp[['managerId']], employee[['id', 'name']], how='inner', left_on='managerId', right_on='id') # 内连接合并
    
    return ans[['name']]

MySQL 思路1：
首先用子查询A得到至少有五个下属的managerId，随后再查询限制条件 where in (A)

MySQL 实现1：

SELECT
	name 
FROM
	employee 
WHERE
	id IN ( SELECT managerId FROM employee GROUP BY managerId HAVING COUNT( * ) >= 5 );

607. Sales Person

原题链接：607. Sales Person
考察：去重、排除

Table: SalesPerson

+-----------------+---------+
| Column Name     | Type    |
+-----------------+---------+
| sales_id        | int     |
| name            | varchar |
| salary          | int     |
| commission_rate | int     |
| hire_date       | date    |
+-----------------+---------+
sales_id is the primary key (column with unique values) for this table.
Each row of this table indicates the name and the ID of a salesperson alongside their salary, commission rate, and hire date.

Table: Company

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| com_id      | int     |
| name        | varchar |
| city        | varchar |
+-------------+---------+
com_id is the primary key (column with unique values) for this table.
Each row of this table indicates the name and the ID of a company and the city in which the company is located.

Table: Orders

+-------------+------+
| Column Name | Type |
+-------------+------+
| order_id    | int  |
| order_date  | date |
| com_id      | int  |
| sales_id    | int  |
| amount      | int  |
+-------------+------+
order_id is the primary key (column with unique values) for this table.
com_id is a foreign key (reference column) to com_id from the Company table.
sales_id is a foreign key (reference column) to sales_id from the SalesPerson table.
Each row of this table contains information about one order. This includes the ID of the company, the ID of the salesperson, the date of the order, and the amount paid.

Write a solution to find the names of all the salespersons who did not have any orders related to the company with the name “RED”.

Return the result table in any order.

The result format is in the following example.

Example 1:

Input:

SalesPerson table:
+----------+------+--------+-----------------+------------+
| sales_id | name | salary | commission_rate | hire_date  |
+----------+------+--------+-----------------+------------+
| 1        | John | 100000 | 6               | 4/1/2006   |
| 2        | Amy  | 12000  | 5               | 5/1/2010   |
| 3        | Mark | 65000  | 12              | 12/25/2008 |
| 4        | Pam  | 25000  | 25              | 1/1/2005   |
| 5        | Alex | 5000   | 10              | 2/3/2007   |
+----------+------+--------+-----------------+------------+

Company table:
+--------+--------+----------+
| com_id | name   | city     |
+--------+--------+----------+
| 1      | RED    | Boston   |
| 2      | ORANGE | New York |
| 3      | YELLOW | Boston   |
| 4      | GREEN  | Austin   |
+--------+--------+----------+

Orders table:
+----------+------------+--------+----------+--------+
| order_id | order_date | com_id | sales_id | amount |
+----------+------------+--------+----------+--------+
| 1        | 1/1/2014   | 3      | 4        | 10000  |
| 2        | 2/1/2014   | 4      | 5        | 5000   |
| 3        | 3/1/2014   | 1      | 1        | 50000  |
| 4        | 4/1/2014   | 1      | 4        | 25000  |
+----------+------------+--------+----------+--------+

Output:

+------+
| name |
+------+
| Amy  |
| Mark |
| Alex |
+------+

Explanation:
According to orders 3 and 4 in the Orders table, it is easy to tell that only salesperson John and Pam have sales to company RED, so we report all the other names in the table salesperson.

题目大意：
返回所有的，没有和RED公司有过交易的销售的名字

pandas 思路：
首先找到所有和RED有过订单的销售id，然后用排除法，返回剩下的数据集中的销售名称

pandas 实现：

import pandas as pd

def sales_person(sales_person: pd.DataFrame, company: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    df = pd.merge(orders, company, on='com_id') # 合并订单表和公司表
    red_orders = df[df['name'] == 'RED'] # RED的订单
    invalid_ids = red_orders.sales_id.unique() # RED的订单中的销售id

    ans = sales_person[~sales_person['sales_id'].isin(invalid_ids)]  # 用排除法

    return ans[['name']]

MySQL 思路：
一个子查询得到和RED有关的销售id，然后也走not in的路子

MySQL 实现：

SELECT
	name 
FROM
	SalesPerson 
WHERE
	sales_id NOT IN ( SELECT sales_id FROM orders a LEFT JOIN company b ON a.com_id = b.com_id WHERE b.name = 'RED' )