30 Days of Pandas Challenge

ciky2011

Last modified: 2024-03-06 14:38:22

Categories: Python, Pandas. Tags: python, pandas

First published: 2023-09-09 12:19:24

Article link: https://blog.csdn.net/ciky2011/article/details/132775935

Day01: Big Countries -> df[condition]

import pandas as pd

def big_countries(world: pd.DataFrame) -> pd.DataFrame:
    condition = (world.area >= 3000000) | (world['population'] >= 25000000)
    return world[condition][['name','population','area']]

Day02: Customers Who Never Order -> rename/~condition

import pandas as pd

def find_customers(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # orders.customerId
    return customers[~customers.id.isin(orders.customerId)][['name']].rename(columns={"name":"Customers"})

Day03: Article Views I -> drop_duplicates()/sort_values

import pandas as pd

def article_views(views: pd.DataFrame) -> pd.DataFrame:
    df =views[(views.author_id==views.viewer_id)][['author_id']].drop_duplicates()
    return df.sort_values(by='author_id', ascending=True).rename(columns={'author_id': 'id'})

Day04: Invalid Tweets -> condition/pd.Series().str.len() > 15

import pandas as pd

def invalid_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    condition = tweets.content.str.len() > 15
    return tweets[condition][['tweet_id']]

Day05: Calculate Special Bonus -> df.apply(func, axis=1) calls the function once per row; the function picks the value based on a condition

import pandas as pd

def calculate_special_bonus(employees: pd.DataFrame) -> pd.DataFrame:
    employees['bonus'] = employees.apply(
        lambda x: x['salary'] if x['employee_id'] % 2 and not x['name'].startswith('M') else 0,
        axis=1)
    return employees[['employee_id','bonus']].sort_values(by='employee_id', ascending=True)
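
apply with axis=1 runs the lambda once per row in Python, so it is not a vectorized operation. A possible vectorized variant is sketched below: it builds a boolean mask and uses Series.where. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
employees = pd.DataFrame({
    'employee_id': [2, 3, 7, 8, 9],
    'name': ['Meir', 'Michael', 'Addilyn', 'Juan', 'Kannon'],
    'salary': [3000, 3800, 7400, 6100, 7700],
})

# The bonus applies when the id is odd and the name does not start with 'M'.
eligible = (employees['employee_id'] % 2 == 1) & ~employees['name'].str.startswith('M')

# Keep the salary where eligible, otherwise use 0.
employees['bonus'] = employees['salary'].where(eligible, 0)
result = employees[['employee_id', 'bonus']].sort_values(by='employee_id')
```

The mask is computed for the whole column at once, which usually scales better than a row-wise apply on large tables.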

Day06: Fix Names in a Table -> apply/Series.str/lower()/title()/capitalize()

Method 1: apply(func, axis=1) with str.lower(), str.title(), and str.strip()

import pandas as pd
def title_name(x):
    names = x['name'].split(' ')
    titled_name = names[0].lower().title()
    for name in names[1:]:
        titled_name = titled_name + ' ' + name.lower()
    return titled_name.strip()

def fix_names(users: pd.DataFrame) -> pd.DataFrame:
    users['name']=users.apply(title_name, axis=1)
    return users.sort_values('user_id')

Method 2: users['name'] = users['name'].str.capitalize()

import pandas as pd

def fix_names(users: pd.DataFrame) -> pd.DataFrame:
    users['name'] = users['name'].str.capitalize()
    return users.sort_values('user_id')

Day07: Find Users With Valid E-Mails -> condition/str.match with a regular expression

import pandas as pd

def valid_emails(users: pd.DataFrame) -> pd.DataFrame:
    return users[users["mail"].str.match(r"^[a-zA-Z][a-zA-Z0-9_.-]*\@leetcode\.com$")]

Day08: Patients With a Condition -> str.match/str.contains

'''Input'''
| patient_id | patient_name | conditions   |
| ---------- | ------------ | ------------ |
| 1          | Daniel       | YFEV COUGH   |
| 2          | Alice        |              |
| 3          | Bob          | DIAB100 MYOP |
| 4          | George       | ACNE DIAB100 |
| 5          | Alain        | DIAB201      |

str.match

'''Code'''
import pandas as pd

def find_patients(patients: pd.DataFrame) -> pd.DataFrame:
    return patients[patients.conditions.str.match(r"\bDIAB1\w*\b")]

'''Output'''
| patient_id | patient_name | conditions   |
| ---------- | ------------ | ------------ |
| 3          | Bob          | DIAB100 MYOP |

str.contains

'''Code'''
import pandas as pd

def find_patients(patients: pd.DataFrame) -> pd.DataFrame:
    return patients[patients.conditions.str.contains(r"\bDIAB1\w*\b")]
  
'''Output'''
| patient_id | patient_name | conditions   |
| ---------- | ------------ | ------------ |
| 3          | Bob          | DIAB100 MYOP |
| 4          | George       | ACNE DIAB100 |

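match, fullmatch, and contains differ in how strict they are:
- fullmatch tests whether the entire string matches the regular expression;
- match tests whether the regular expression matches starting at the first character of the string;
- contains tests whether the regular expression matches anywhere in the string.
- The corresponding functions in the re module are re.fullmatch, re.match, and re.search.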
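
A small sketch of the three methods on the same pattern may make the difference concrete. The Series below is made up from the condition strings in the example table.

```python
import pandas as pd

# Sample strings modelled on the conditions column above.
s = pd.Series(['DIAB100', 'DIAB100 MYOP', 'ACNE DIAB100', 'DIAB201'])
pattern = r'\bDIAB1\w*\b'

whole_string = s.str.fullmatch(pattern)    # like re.fullmatch
from_start = s.str.match(pattern)          # like re.match
anywhere = s.str.contains(pattern)         # like re.search
```

Only contains finds 'ACNE DIAB100', because there the code does not sit at the start of the string.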

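Day09: Nth Highest Salary -> rename/iloc[]/sort_values with lists

rename(columns={old_name: new_name}) # takes a dict
iloc[] # an indexer used with square brackets [], not a function called with ()
df.sort_values(by=['col1','col2'], ascending=[False,True]) # by accepts a single column name or a list of names; ascending accepts a single bool, or a list with one bool per column in by.
iloc[[N-1]] raises an out-of-bounds IndexError when the position does not exist, so the fallback builds the result by hand from a dict: DataFrame({key: [value]}) # the value must be wrapped in a list [], even when there is only one.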

import pandas as pd

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:
    col = f'getNthHighestSalary({N})'
    salary_df = employee[['salary']].drop_duplicates().rename(columns={'salary': col})
    try:
        if N < 1:
            # iloc[[N-1]] would silently count from the end for N <= 0.
            raise IndexError(N)
        salary_df = salary_df.sort_values(by=col, ascending=False).iloc[[N - 1]]
    except IndexError:
        # Fewer than N distinct salaries (or an invalid N): return a single null.
        salary_df = pd.DataFrame({col: [None]})
    return salary_df

Day10: Second Highest Salary -> drop_duplicates(subset=['col1', 'col2'])

Input

Employee =
| id | salary |
| -- | ------ |
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

Code

import pandas as pd


def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:
    try:
        employee = employee.drop_duplicates(subset=['salary']).sort_values(by=['salary'], ascending=False).iloc[[1]]
    except Exception as _:
        employee = pd.DataFrame({'SecondHighestSalary': [None]})
    else:
        employee = employee[['salary']].rename(columns={'salary': 'SecondHighestSalary'})
    return employee

| SecondHighestSalary |
| ------------------- |
| 200                 |

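Key points
- To drop duplicates based on specific columns, pass them to drop_duplicates() via subset=['col1', 'col2'].
- iloc[1] returns a single row as a Series; to get a DataFrame back, wrap the position in another pair of brackets, as in iloc[[1]].
- When building a DataFrame from a dict, DataFrame({key: [value]}), the value must be a list [], even when it holds a single value.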
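
The iloc behaviour described above can be checked with a short sketch on the sample Employee table. The out-of-range position 5 is an arbitrary value chosen for illustration.

```python
import pandas as pd

employee = pd.DataFrame({'id': [1, 2, 3], 'salary': [100, 200, 300]})
ranked = employee.drop_duplicates(subset=['salary']).sort_values(by='salary', ascending=False)

as_series = ranked.iloc[1]     # one row, returned as a Series
as_frame = ranked.iloc[[1]]    # the same row, kept as a one-row DataFrame

# Asking for a position that does not exist raises IndexError,
# which is what the try/except in the solution relies on.
try:
    ranked.iloc[[5]]
    out_of_range = False
except IndexError:
    out_of_range = True
```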

Day11: Department Highest Salary -> merge/duplicate column names/groupby/transform/apply

Input

Employee =
| id | name  | salary | departmentId |
| -- | ----- | ------ | ------------ |
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |

Department =
| id | name  |
| -- | ----- |
| 1  | IT    |
| 2  | Sales |

Expected result

| Department | Employee | Salary |
| ---------- | -------- | ------ |
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
| IT         | Max      | 90000  |

Code

import pandas as pd


def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = pd.merge(employee, department, left_on='departmentId', right_on='id')
    df.rename(columns={'name_x': 'Employee', 'salary': 'Salary', 'name_y': 'Department'}, inplace=True)
    # Pick up the max_salary series
    max_salary = df.groupby('Department')['Salary'].transform('max')

    # Use condition df['Salary']=max_salary to filter expected df
    df = df[df['Salary'] == max_salary]
    # return the df with expected columns.
    return df[['Department', 'Employee', 'Salary']]

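Key points:
- merge is available both as the function pd.merge and as the method employee.merge.
- Differences between apply and transform on a groupby:
  - Shape of the result:
    - an aggregation (agg, or apply with a function that returns a scalar) returns one row per group;
    - transform returns a result with the same number of rows as the original data.
  - Use cases and performance:
    - transform: the function is applied to each group and the result is broadcast back to the shape of the original data, so the output lines up row by row with the input. The function must return either a scalar or an array of the same length as the group. Passing a built-in aggregation by name, such as 'max', lets pandas use its optimized implementation.
    - apply: more general. The function receives each group as a whole and may return a scalar, a Series, or a DataFrame, so the result can have a different shape from the input. This generality usually makes it slower than transform or agg with a named built-in aggregation.
  - Flexibility:
    - apply: works with custom functions, from a simple sum to a function that combines several columns of the group (for example the difference between two columns).
    - transform: processes each column separately, so the function cannot combine several columns. Select the column to operate on after groupby(), as in df.groupby('Department')['Salary'].
- When two DataFrames are merged and both contain a column with the same name that is not used as a join key, pandas appends suffixes to the column names to keep them distinct.
  For example, if both DataFrames have columns "A" and "B" and they are merged on "A", the key column "A" appears once, while the two "B" columns become "B_x" (from the first DataFrame) and "B_y" (from the second).
  Example code: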
- ```
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 6], 'B': [7, 8]})
merged_df = pd.merge(df1, df2, on='A')
print(merged_df)
```
- Output:
- ```
   A  B_x  B_y
0  1    3    7
```
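  In the merged DataFrame the key column "A" appears once, while "B" from each input is kept under the automatically generated names "B_x" and "B_y". Only the row with A = 1 is present because merge performs an inner join by default.

  pd.merge joins two DataFrames at a time, and "_x" and "_y" are only the default suffixes. Chaining several merges does not generate numbered names such as "A_0", "A_1", and "A_2".

  To control the names, pass the suffixes parameter, for example pd.merge(df1, df2, on='column_name', suffixes=('suffix1', 'suffix2')): overlapping columns from the first DataFrame get "suffix1" appended and those from the second get "suffix2".
- After groupby, [] selects the column to operate on, as in df.groupby('Department')['Salary'].transform('max'); without it, the operation applies to all remaining columns.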
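
To illustrate the shape difference between an aggregation and transform, here is a minimal sketch that uses the Department and Salary values from the example tables, already merged and renamed by hand.

```python
import pandas as pd

# The merged and renamed table from the example, typed in directly.
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Sales', 'Sales', 'IT'],
    'Employee': ['Joe', 'Jim', 'Henry', 'Sam', 'Max'],
    'Salary': [70000, 90000, 80000, 60000, 90000],
})

grouped = df.groupby('Department')['Salary']
per_group = grouped.max()             # one value per department
per_row = grouped.transform('max')    # one value per original row

# Because per_row shares the index of df, it can be compared row by row.
top_earners = df[df['Salary'] == per_row]
```

The aggregated result has two entries, one per department, while the transform result has five, which is what allows it to be used directly as a filter.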

Day12: Rank Scores -> rank(); add a new column with df['new_col'] = values (a Series)

Input

Scores =
| id | score |
| -- | ----- |
| 1  | 3.5   |
| 2  | 3.65  |
| 3  | 4     |
| 4  | 3.85  |
| 5  | 4     |
| 6  | 3.65  |

Expected result

| score | rank |
| ----- | ---- |
| 4     | 1    |
| 4     | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.5   | 4    |

Code

import pandas as pd


def order_scores(scores: pd.DataFrame) -> pd.DataFrame:
    scores['rank'] = scores['score'].rank(method='dense', ascending=False)
    return scores[['score', 'rank']].sort_values(by='score', ascending=False)

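Key points
- rank() computes the rank of numeric values along an axis. Here ascending=False ranks in descending order of score, so the highest score gets the smallest rank; method='dense' gives tied scores the same rank without leaving gaps.
- Sort by score in descending order before returning: rank() only labels each row with its rank and leaves the rows in their original order, as shown below.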
```
   id  score  rank
0   1    3.5   4.0
1   2   3.65   3.0
2   3    4.0   1.0
3   4   3.85   2.0
4   5    4.0   1.0
5   6   3.65   3.0
```
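
The effect of method='dense' is easier to see next to another method. The sketch below compares it with method='min' on the same scores; the comparison itself is an addition for illustration, not part of the solution.

```python
import pandas as pd

scores = pd.Series([3.5, 3.65, 4.0, 3.85, 4.0, 3.65])

# 'dense': tied values share a rank and the next rank follows without a gap.
dense = scores.rank(method='dense', ascending=False)

# 'min': tied values share the lowest rank of the tie, leaving gaps afterwards.
minimum = scores.rank(method='min', ascending=False)
```

rank() returns floats; if integer ranks are needed, astype(int) can be applied to the result.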

Day13: Delete Duplicate Emails -> sort_values()/drop_duplicates()

Input

Person =
| id | email            |
| -- | ---------------- |
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |

Expected result

| id | email            |
| -- | ---------------- |
| 1  | john@example.com |
| 2  | bob@example.com  |

Code

import pandas as pd

# Modify Person in place
def delete_duplicate_emails(person: pd.DataFrame) -> None:
    person.sort_values(by=['id'], inplace=True, ascending=True)
    person.drop_duplicates(subset=['email'], inplace=True)

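Key points
- sort_values: by=['id'] sorts by id first, so that the row with the smallest id comes first within each group of duplicate emails.
- drop_duplicates: subset=['email'] compares only the email column and, by default (keep='first'), keeps the first occurrence.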

Day14: Rearrange Products Table (price of each product in each store) -> loc[row condition, columns]/concat

Input

Products =
| product_id | store1 | store2 | store3 |
| ---------- | ------ | ------ | ------ |
| 0          | 95     | 100    | 105    |
| 1          | 70     | null   | 80     |

Expected result

| product_id | store  | price |
| ---------- | ------ | ----- |
| 0          | store1 | 95    |
| 1          | store1 | 70    |
| 0          | store2 | 100   |
| 0          | store3 | 105   |
| 1          | store3 | 80    |

Code

import pandas as pd


def rearrange_products_table(products: pd.DataFrame) -> pd.DataFrame:
    # Step1: Pick up "product_id" and "store1" from DataFrame products
    store1_df: pd.DataFrame = products.loc[products['store1'].notnull(), ['product_id', 'store1']]

    # Step2: Rename "store1" to "price"
    store1_df.rename(columns={'store1': 'price'}, inplace=True)

    # Step3: Add store column with value store1
    store1_df['store'] = 'store1'

    # Step4: Adjust the column order to meet the output format requirement
    store1_df = store1_df[['product_id', 'store', 'price']]

    # Repeat Step1-4 for store2 and store3
    store2_df: pd.DataFrame = products.loc[products['store2'].notnull(), ['product_id', 'store2']]
    store2_df.rename(columns={'store2': 'price'}, inplace=True)
    store2_df['store'] = 'store2'
    store2_df = store2_df[['product_id', 'store', 'price']]

    store3_df: pd.DataFrame = products.loc[products['store3'].notnull(), ['product_id', 'store3']]
    store3_df.rename(columns={'store3': 'price'}, inplace=True)
    store3_df['store'] = 'store3'
    store3_df = store3_df[['product_id', 'store', 'price']]

    # Step5: concat the 3 DataFrame follow axis=0
    return pd.concat([store1_df, store2_df, store3_df], axis=0)
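
The per-store steps above reshape the table from wide to long by hand. As an alternative sketch (not the approach used in the solution), DataFrame.melt followed by dropna appears to give the same rows for the sample input.

```python
import numpy as np
import pandas as pd

products = pd.DataFrame({
    'product_id': [0, 1],
    'store1': [95, 70],
    'store2': [100, np.nan],
    'store3': [105, 80],
})

# Turn the store columns into (store, price) pairs, one row per pair,
# then drop the pairs where the product is not sold in that store.
long_df = products.melt(
    id_vars=['product_id'], var_name='store', value_name='price'
).dropna(subset=['price'])
```

With melt the number of store columns no longer matters, at the cost of the price column becoming float when any value is missing.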

Day15: Count Salary Categories -> filter with conditions to get a boolean pd.Series, count with pd.Series.sum(), then build a list, a dict, and a pd.DataFrame

Input

Accounts =
| account_id | income |
| ---------- | ------ |
| 3          | 108939 |
| 2          | 12747  |
| 8          | 87709  |
| 6          | 91796  |

Expected result

| category       | accounts_count |
| -------------- | -------------- |
| High Salary    | 3              |
| Low Salary     | 1              |
| Average Salary | 0              |

Code

import pandas as pd


def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:
    low_salary_filter: pd.Series = accounts['income'] < 20000
    low_salary_count = low_salary_filter.sum()

    average_salary_filter: pd.Series = (accounts['income'] >= 20000) & (accounts['income'] <= 50000)
    average_salary_count = average_salary_filter.sum()

    high_salary_filter: pd.Series = accounts['income'] > 50000
    high_salary_count = high_salary_filter.sum()

    salary_levels_dict = {
        'category': ['Low Salary', 'Average Salary', 'High Salary'],
        'accounts_count': [low_salary_count, average_salary_count, high_salary_count]
    }
    salary_levels = pd.DataFrame(data=salary_levels_dict)

    return salary_levels

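Key points
- Comparing each value in the income column with 20000 produces a Boolean Series: True where the income is below 20000, False otherwise.
- ```
print(low_salary_filter)
0    False
1     True
2    False
3    False
Name: income, dtype: bool
```
- Next, sum() counts the True values, because it treats True as 1 and False as 0. The result is therefore the number of True entries in the Series, which is the number of low-salary accounts.

  Author: LeetCode official solution
  Link: https://leetcode.cn/problems/count-salary-categories/
  Source: LeetCode
  Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.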

Day16: 1741. Find Total Time Spent by Each Employee

Input

Employees table:
+--------+------------+---------+----------+
| emp_id | event_day  | in_time | out_time |
+--------+------------+---------+----------+
| 1      | 2020-11-28 | 4       | 32       |
| 1      | 2020-11-28 | 55      | 200      |
| 1      | 2020-12-03 | 1       | 42       |
| 2      | 2020-11-28 | 3       | 33       |
| 2      | 2020-12-09 | 47      | 74       |
+--------+------------+---------+----------+

Expected result

+------------+--------+------------+
| day        | emp_id | total_time |
+------------+--------+------------+
| 2020-11-28 | 1      | 173        |
| 2020-11-28 | 2      | 30         |
| 2020-12-03 | 1      | 41         |
| 2020-12-09 | 2      | 27         |
+------------+--------+------------+

Code 1: without reset_index()

import pandas as pd


def total_time(employees: pd.DataFrame) -> pd.DataFrame:
    employees['inter_time'] = employees['out_time'] - employees['in_time']
    employees = employees.groupby(['emp_id', 'event_day'])['inter_time'].sum()
    print(employees)
    print(type(employees))

Output

emp_id  event_day 
1       2020-11-28    173
        2020-12-03     41
2       2020-11-28     30
        2020-12-09     27
Name: inter_time, dtype: Int64
<class 'pandas.core.series.Series'>

Code 2: with reset_index()

import pandas as pd


def total_time(employees: pd.DataFrame) -> pd.DataFrame:
    employees['inter_time'] = employees['out_time'] - employees['in_time']
    employees = employees.groupby(['emp_id', 'event_day'])['inter_time'].sum().reset_index()
    print(employees)
    print(type(employees))

Output

   emp_id  event_day  inter_time
0       1 2020-11-28         173
1       1 2020-12-03          41
2       2 2020-11-28          30
3       2 2020-12-09          27
<class 'pandas.core.frame.DataFrame'>

Code

import pandas as pd


def total_time(employees: pd.DataFrame) -> pd.DataFrame:
    employees['inter_time'] = employees['out_time'] - employees['in_time']
    employees: pd.DataFrame = employees.groupby(['emp_id', 'event_day'])['inter_time'].sum().reset_index()
    employees.rename(columns={'event_day': 'day', 'inter_time': 'total_time'}, inplace=True)
    return employees[['day', 'emp_id', 'total_time']]

Key points
- What reset_index() does: aggregating after a groupby on several columns returns a Series whose index is built from the grouping columns (here a MultiIndex of 'emp_id' and 'event_day'). reset_index() moves those index levels back into ordinary columns and replaces the index with the default integer index. The aggregated values stay as a column, which is why reset_index() turns the Series into a DataFrame that is easier to filter, rename, and return.
- When renaming columns with rename, do not omit columns=: a dict passed positionally is applied to the row index. Pass it by keyword (or add axis=1).
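
A compact sketch of the reset_index() step follows, using a shortened version of the sample table. The as_index=False variant at the end is an alternative not used in the solution.

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 1, 2],
    'event_day': ['2020-11-28', '2020-11-28', '2020-11-28'],
    'in_time': [4, 55, 3],
    'out_time': [32, 200, 33],
})
employees['inter_time'] = employees['out_time'] - employees['in_time']

# Series with a MultiIndex built from the two grouping columns.
totals = employees.groupby(['emp_id', 'event_day'])['inter_time'].sum()

# The index levels become ordinary columns again.
flat = totals.reset_index()

# Alternative: keep the grouping columns as columns from the start.
direct = employees.groupby(['emp_id', 'event_day'], as_index=False)['inter_time'].sum()
```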

Day17: 2356. Number of Unique Subjects Taught by Each Teacher

Input:

Teacher =
| teacher_id | subject_id | dept_id |
| ---------- | ---------- | ------- |
| 1          | 2          | 3       |
| 1          | 2          | 4       |
| 1          | 3          | 3       |
| 2          | 1          | 1       |
| 2          | 2          | 1       |
| 2          | 3          | 1       |
| 2          | 4          | 1       |

Expected result

| teacher_id | cnt |
| ---------- | --- |
| 1          | 2   |
| 2          | 4   |

My solution

import pandas as pd


def count_unique_subjects(teacher: pd.DataFrame) -> pd.DataFrame:
    df = teacher[['teacher_id', 'subject_id']].drop_duplicates().groupby(
        by=['teacher_id']).count().reset_index().rename({'subject_id': 'cnt'}, axis=1)
    return df

Official solution

import pandas as pd

def count_unique_subjects(teacher: pd.DataFrame) -> pd.DataFrame:
    df = teacher.groupby(["teacher_id"])["subject_id"].nunique().reset_index()
    df = df.rename({'subject_id': "cnt"}, axis=1)
    return df

Key points
- groupby() returns a GroupBy object, which provides methods such as count(), nunique(), and max().
- groupby() uses the by=['teacher_id'] column as the index of the result, so reset_index() is needed to create a new default index and turn 'teacher_id' back into a column.
- nunique() on a grouped column returns the number of distinct values in each group.
- With rename(), pass axis=1 (or use columns=) because the columns, not the rows, are being renamed.

Day18: 596. Classes More Than 5 Students

Input

Courses table:
+---------+----------+
| student | class    |
+---------+----------+
| A       | Math     |
| B       | English  |
| C       | Math     |
| D       | Biology  |
| E       | Math     |
| F       | Computer |
| G       | Math     |
| H       | Math     |
| I       | Math     |
+---------+----------+

Output

+---------+ 
| class   | 
+---------+ 
| Math    | 
+---------+

My solution

import pandas as pd

def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
    df = courses.groupby(by=['class']).count().reset_index()
    return df[df['student'] > 4][['class']]

Official solution

import pandas as pd

def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
    df = courses.groupby('class').size().reset_index(name='count')

    df = df[df['count'] >= 5]

    return df[['class']]

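Author: LeetCode official solution
Link: https://leetcode.cn/problems/classes-more-than-5-students/solutions/2366294/chao-guo-5ming-xue-sheng-de-ke-by-leetco-l4es/
Source: LeetCode
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.

Key points
- Difference between groupby().count() and groupby().size():
  - groupby().count(): counts the non-null values of each column within each group; NaN values are excluded from the count.
  - groupby().size(): counts the rows in each group regardless of their values (NaN included). It returns the number of rows per group rather than a per-column count of non-null values.
- When returning, note that df['class'] is a Series while df[['class']] is a DataFrame.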
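
The count() versus size() difference only shows up when a group contains missing values. The sketch below uses a small made-up table with one missing student name.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Math row has no student name.
courses = pd.DataFrame({
    'class': ['Math', 'Math', 'Math', 'Art'],
    'student': ['A', np.nan, 'C', 'D'],
})

by_count = courses.groupby('class')['student'].count()   # non-null students per class
by_size = courses.groupby('class').size()                # rows per class
```

For the LeetCode input, which has no nulls, both approaches return the same numbers.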

Day19: 586. Customer Placing the Largest Number of Orders

Input

Orders table:
+--------------+-----------------+
| order_number | customer_number |
+--------------+-----------------+
| 1            | 1               |
| 2            | 2               |
| 3            | 3               |
| 4            | 3               |
+--------------+-----------------+

Output

+-----------------+
| customer_number |
+-----------------+
| 3               |
+-----------------+

My solution

import pandas as pd


def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
    return orders.groupby(by=['customer_number']).nunique().reset_index().sort_values(by='order_number', ascending=False)[['customer_number']][0:1]

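Key points
- nunique() returns the number of distinct values in each group.
- reset_index() turns the group key customer_number from the index back into a column.
- sort_values() sorts by the 'order_number' column, which now holds the count.
- With descending order the largest count is in the first row, so the slice [0:1] returns it.

Official solution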

import pandas as pd

def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # If orders is empty, return an empty DataFrame.
    if orders.empty:
        return pd.DataFrame({'customer_number': []})

    df = orders.groupby('customer_number').size().reset_index(name='count')
    df.sort_values(by='count', ascending = False, inplace=True)
    return df[['customer_number']][0:1]

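Author: LeetCode official solution
Link: https://leetcode.cn/problems/customer-placing-the-largest-number-of-orders/solutions/2366301/ding-dan-zui-duo-de-ke-hu-by-leetcode-so-bywe/
Source: LeetCode
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.

Key points
- reset_index(name='count') names the value column count, so the resulting DataFrame df has two columns: customer_number and count.
- sort_values(by='count') sorts by that count.
- df[['customer_number']][0:1] returns the first row of the customer_number column.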

Day20: 1484. Group Sold Products By The Date

Input

+------------+-------------+
| sell_date  | product     |
+------------+-------------+
| 2020-05-30 | Headphone   |
| 2020-06-01 | Pencil      |
| 2020-06-02 | Mask        |
| 2020-05-30 | Basketball  |
| 2020-06-01 | Bible       |
| 2020-06-02 | Mask        |
| 2020-05-30 | T-Shirt     |
+------------+-------------+

Expected output

+------------+----------+------------------------------+
| sell_date  | num_sold | products                     |
+------------+----------+------------------------------+
| 2020-05-30 | 3        | Basketball,Headphone,T-Shirt |
| 2020-06-01 | 2        | Bible,Pencil                 |
| 2020-06-02 | 1        | Mask                         |
+------------+----------+------------------------------+

Code

import pandas as pd


def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    df = activities.groupby(['sell_date']).agg(num_sold=('product', 'nunique'),
                                               products=('product', lambda x: ','.join(sorted(set(x))))).reset_index()

    return df

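Key points
- agg() aggregates each group of the DataFrameGroupBy object. Named aggregation specifies two tasks here:
- num_sold=('product', 'nunique'): creates a new column num_sold in the output, holding the number of distinct products sold on each date. 'nunique' counts the distinct values of the product column in each group.
- products=('product', lambda x: ','.join(sorted(set(x)))): this one is a little more involved. The task is to sort and join all distinct product names in each group. No built-in aggregation does this, so the custom function lambda x: ','.join(sorted(set(x))) is used instead. Here x is the Series holding the product column of one group: it is converted to a set to remove duplicates, the distinct names are sorted, and they are joined into a single comma-separated string.

Day21: 1693. Daily Leads and Partners

Input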

+-----------+-----------+---------+------------+
| date_id   | make_name | lead_id | partner_id |
+-----------+-----------+---------+------------+
| 2020-12-8 | toyota    | 0       | 1          |
| 2020-12-8 | toyota    | 1       | 0          |
| 2020-12-8 | toyota    | 1       | 2          |
| 2020-12-7 | toyota    | 0       | 2          |
| 2020-12-7 | toyota    | 0       | 1          |
| 2020-12-8 | honda     | 1       | 2          |
| 2020-12-8 | honda     | 2       | 1          |
| 2020-12-7 | honda     | 0       | 1          |
| 2020-12-7 | honda     | 1       | 2          |
| 2020-12-7 | honda     | 2       | 1          |
+-----------+-----------+---------+------------+

Expected output

+-----------+-----------+--------------+-----------------+
| date_id   | make_name | unique_leads | unique_partners |
+-----------+-----------+--------------+-----------------+
| 2020-12-8 | toyota    | 2            | 3               |
| 2020-12-7 | toyota    | 1            | 2               |
| 2020-12-8 | honda     | 2            | 2               |
| 2020-12-7 | honda     | 3            | 2               |
+-----------+-----------+--------------+-----------------+

My solution

import pandas as pd

def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
    df = daily_sales.groupby(['date_id', 'make_name']).nunique().reset_index().rename({
        "lead_id": "unique_leads", "partner_id": "unique_partners"}, axis=1)
    return df

Official solution

import pandas as pd

def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
    # Approach: group by and aggregate.
    # Use .groupby() with 'date_id' and 'make_name' as the grouping keys,
    # and aggregate 'lead_id' and 'partner_id' with 'nunique',
    # which returns the number of distinct values in each group.
    df = daily_sales.groupby(['date_id', 'make_name']).agg({
        'lead_id': 'nunique',
        'partner_id': 'nunique'
    }).reset_index()

    # Rename the columns of the resulting DataFrame.
    df = df.rename(columns={
        'lead_id': 'unique_leads',
        'partner_id': 'unique_partners'
    })

    # Return the DataFrame.
    return df

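Author: LeetCode official solution
Link: https://leetcode.cn/problems/daily-leads-and-partners/solutions/2366306/mei-tian-de-ling-dao-he-he-huo-ren-by-le-jet2/
Source: LeetCode
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.

Key points
- The lead_id and partner_id columns are aggregated with nunique, which can be given as the string 'nunique' or as the function pd.Series.nunique.

Day21: Actors and Directors Who Cooperated At Least Three Times

Input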

+-------------+-------------+-------------+
| actor_id    | director_id | timestamp   |
+-------------+-------------+-------------+
| 1           | 1           | 0           |
| 1           | 1           | 1           |
| 1           | 1           | 2           |
| 1           | 2           | 3           |
| 1           | 2           | 4           |
| 2           | 1           | 5           |
| 2           | 1           | 6           |
+-------------+-------------+-------------+

Expected output

+-------------+-------------+
| actor_id    | director_id |
+-------------+-------------+
| 1           | 1           |
+-------------+-------------+

Code

import pandas as pd


def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
    df = actor_director.groupby(['actor_id', 'director_id'])['timestamp'].size().reset_index()
    return df[df['timestamp'] >= 3][['actor_id', 'director_id']]

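Key points
- reset_index(name='counts') can also rename the count column to 'counts'.
- The filter that follows must then use the new column name: df['counts'] >= 3.

Day22: 1378. Replace Employee ID With The Unique Identifier

Input

Employees table: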
+----+----------+
| id | name     |
+----+----------+
| 1  | Alice    |
| 7  | Bob      |
| 11 | Meir     |
| 90 | Winston  |
| 3  | Jonathan |
+----+----------+
EmployeeUNI table:
+----+-----------+
| id | unique_id |
+----+-----------+
| 3  | 1         |
| 11 | 2         |
| 90 | 3         |
+----+-----------+

Expected output

+-----------+----------+
| unique_id | name     |
+-----------+----------+
| null      | Alice    |
| null      | Bob      |
| 2         | Meir     |
| 3         | Winston  |
| 1         | Jonathan |
+-----------+----------+

Code

import pandas as pd

def replace_employee_id(employees: pd.DataFrame, employee_uni: pd.DataFrame) -> pd.DataFrame:
    name_ui = pd.merge(left=employees, right=employee_uni, on='id', how='left')
    return name_ui[['unique_id', 'name']]

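Key points
- pd.merge() joins the two tables on the id column, keeping every row of the left table (how='left'): rows that exist only on the left get null in the right-hand columns, and rows that exist only on the right are dropped. The unique_id and name columns are then selected.

Day23: 1280. Students and Examinations

Input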

Students table:
+------------+--------------+
| student_id | student_name |
+------------+--------------+
| 1          | Alice        |
| 2          | Bob          |
| 13         | John         |
| 6          | Alex         |
+------------+--------------+
Subjects table:
+--------------+
| subject_name |
+--------------+
| Math         |
| Physics      |
| Programming  |
+--------------+
Examinations table:
+------------+--------------+
| student_id | subject_name |
+------------+--------------+
| 1          | Math         |
| 1          | Physics      |
| 1          | Programming  |
| 2          | Programming  |
| 1          | Physics      |
| 1          | Math         |
| 13         | Math         |
| 13         | Programming  |
| 13         | Physics      |
| 2          | Math         |
| 1          | Math         |
+------------+--------------+

Output

+------------+--------------+--------------+----------------+
| student_id | student_name | subject_name | attended_exams |
+------------+--------------+--------------+----------------+
| 1          | Alice        | Math         | 3              |
| 1          | Alice        | Physics      | 2              |
| 1          | Alice        | Programming  | 1              |
| 2          | Bob          | Math         | 1              |
| 2          | Bob          | Physics      | 0              |
| 2          | Bob          | Programming  | 1              |
| 6          | Alex         | Math         | 0              |
| 6          | Alex         | Physics      | 0              |
| 6          | Alex         | Programming  | 0              |
| 13         | John         | Math         | 1              |
| 13         | John         | Physics      | 1              |
| 13         | John         | Programming  | 1              |
+------------+--------------+--------------+----------------+

Code

import pandas as pd

def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:
    # Group by student id and subject, and count the exams taken.
    grouped = examinations.groupby(['student_id', 'subject_name']).size().reset_index(name='attended_exams')

    # Build every (student, subject) combination.
    all_id_subjects = pd.merge(students, subjects, how='cross')

    # Left join to keep every combination.
    id_subjects_count = pd.merge(all_id_subjects, grouped, on=['student_id', 'subject_name'], how='left')

    # Clean up: combinations without any exam get 0 instead of NaN.
    id_subjects_count['attended_exams'] = id_subjects_count['attended_exams'].fillna(0).astype(int)

    # Sort the DataFrame by 'student_id' and 'subject_name' in ascending order.
    id_subjects_count.sort_values(['student_id', 'subject_name'], inplace=True)

    return id_subjects_count[['student_id', 'student_name', 'subject_name', 'attended_exams']]

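Author: LeetCode official solution
Link: https://leetcode.cn/problems/students-and-examinations/solutions/2366340/students-and-examinations-by-leetcode-so-3oup/
Source: LeetCode
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.

Key points
- fillna(0) replaces the missing counts produced by the left join, and astype(int) then converts the whole column to integers.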
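
The cross join and the fill step can be traced on a smaller, made-up input. Note that how='cross' is only available in pandas 1.2 and later, which is assumed here.

```python
import pandas as pd

# Reduced, hypothetical versions of the three tables.
students = pd.DataFrame({'student_id': [1, 2], 'student_name': ['Alice', 'Bob']})
subjects = pd.DataFrame({'subject_name': ['Math', 'Physics']})
examinations = pd.DataFrame({
    'student_id': [1, 1, 2],
    'subject_name': ['Math', 'Math', 'Physics'],
})

# Every student paired with every subject: 2 x 2 = 4 rows.
pairs = students.merge(subjects, how='cross')

counts = (examinations.groupby(['student_id', 'subject_name'])
          .size().reset_index(name='attended_exams'))

# Pairs without any exam get NaN after the left join, so the column is float.
merged = pairs.merge(counts, on=['student_id', 'subject_name'], how='left')
filled = merged['attended_exams'].fillna(0).astype(int)
```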

Day24: 570. Managers with at Least 5 Direct Reports

Input

Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | Null      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
+-----+-------+------------+-----------+

Output

+------+
| name |
+------+
| John |
+------+

My solution

import pandas as pd


def find_managers(employee: pd.DataFrame) -> pd.DataFrame:
    df1 = employee.groupby(['managerId']).size().reset_index(name='num')
    df1 = df1[df1['num'] >= 5]
    return employee[['name']][employee['id'].isin(df1['managerId'])]

Official solution


import pandas as pd

def find_managers(employee: pd.DataFrame) -> pd.DataFrame:
    # Count the direct reports of each manager.
    subordinate_count = employee.groupby('managerId')['id'].count()
    print(subordinate_count)
    '''
    managerId
    101    5
    Name: id, dtype: int64
    '''
    print(type(subordinate_count))
    '''
    <class 'pandas.core.series.Series'>
    '''

    # Find the managers with at least 5 direct reports.
    managers_with_5_subordinates = subordinate_count[subordinate_count >= 5].index
    print(managers_with_5_subordinates)
    '''
    Index([101], dtype='Int64', name='managerId')
    '''

    # Use these manager ids to look up the manager names.
    result = employee[employee['id'].isin(managers_with_5_subordinates)]['name']

    # Convert the Series to a DataFrame.
    result_df = result.to_frame(name='name')

    return result_df

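Key points
- groupby(): a common pandas operation that groups a DataFrame by the values of a column. Here employee.groupby('managerId')['id'].count() counts the direct reports of each manager: it returns a Series indexed by managerId whose values are the number of non-null id values in each group. Rows with a null managerId are left out of the grouping. size() differs from count() in that it counts all rows of each group, nulls included, and is not tied to a particular column.
- A boolean condition selects the rows of a DataFrame that satisfy it. Here employee[employee['id'].isin(managers_with_5_subordinates)]['name'] fetches the manager names: employee['id'].isin(managers_with_5_subordinates) returns a boolean Series marking the rows whose id is among the given values. That Series has two parts, an index and True/False values: the index is used to align it with the DataFrame being filtered, and each True/False value decides whether the row at that index is kept.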
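
The boolean mask described above can be inspected on a reduced, made-up Employee table. The threshold of 2 replaces the 5 of the problem only to keep the sample small.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table; John has no manager.
employee = pd.DataFrame({
    'id': [101, 102, 103],
    'name': ['John', 'Dan', 'James'],
    'managerId': [np.nan, 101, 101],
})

# Rows with a missing managerId are excluded from the groups.
counts = employee.groupby('managerId')['id'].count()
managers = counts[counts >= 2].index

# Boolean Series with the same index as employee.
mask = employee['id'].isin(managers)
names = employee[mask][['name']]
```

The mask carries the index of employee, so filtering keeps exactly the rows whose mask value is True.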
