Practice Log, 2023/8/28

花花橙子

Last modified: 2023-09-04 15:40:02

Category: Algorithm Practice Log. Tags: pandas

First published: 2023-09-01 00:09:05

Article link: https://blog.csdn.net/qq_43783669/article/details/132537136

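1. pandas: grouping data

1484. Group Sold Products By The Date [Easy]

- groupby: splits the rows into groups according to the given column(s).
- agg: applies aggregation functions to each group.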

import pandas as pd

def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    # Group by sale date
    groups = activities.groupby('sell_date')
    # Count the distinct products and join their sorted names, then reset the index
    stats = groups.agg(
        num_sold=('product', 'nunique'),
        products=('product', lambda x: ','.join(sorted(set(x))))
    ).reset_index()
    # Sort by date
    stats.sort_values('sell_date', inplace=True)
    return stats
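
To see what the grouping and named aggregation produce, here is a minimal sketch on a small table. The sample rows are made up for illustration and are not taken from the problem statement.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
activities = pd.DataFrame({
    'sell_date': ['2020-05-30', '2020-06-01', '2020-05-30', '2020-05-30'],
    'product': ['Headphone', 'Pencil', 'Basketball', 'Headphone'],
})

stats = (
    activities.groupby('sell_date')
    .agg(
        # number of distinct products sold on that date
        num_sold=('product', 'nunique'),
        # distinct product names, sorted and joined with commas
        products=('product', lambda x: ','.join(sorted(set(x)))),
    )
    .reset_index()
)
print(stats)
```

Each keyword argument of agg becomes one output column, written as (source column, aggregation).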

1693. Daily Leads and Partners [Easy]

def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
    # Group by date and make
    groups = daily_sales.groupby(['date_id','make_name'])
    df = groups.agg(
        unique_leads = ('lead_id','nunique'),
        unique_partners = ('partner_id','nunique')
    ).reset_index()
    return df

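2. pandas: data manipulation

1795. Rearrange Products Table [Easy]

The key to this problem is converting columns into rows (wide to long).

Method 1: concatenate tables

.loc takes two indexers: a row selector and a column selector.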
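
As a quick illustration of the two indexers, the sketch below selects rows with a boolean mask and columns with a list of names. The table contents are made up for illustration.

```python
import pandas as pd

# Hypothetical table, invented for illustration
products = pd.DataFrame({
    'product_id': [0, 1],
    'store1': [95, None],
    'store2': [100, 70],
})

# Row selector: boolean mask; column selector: list of column names
subset = products.loc[products['store1'].notna(), ['product_id', 'store1']]
print(subset)
```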

def rearrange_products_table(products: pd.DataFrame) -> pd.DataFrame:
    # .loc selects the rows of products that satisfy the condition and keeps only the product_id and store1 columns
    a = products.loc[products['store1'].notna(), ['product_id', 'store1']]
    # Add a store column and rename the original store1 column to price
    a['store'] = 'store1'
    a.rename(columns={'store1': 'price'}, inplace=True)
    # Reorder the columns
    a = a[['product_id', 'store', 'price']]

    # Repeat the same steps for store2 and store3
    b = products.loc[products['store2'].notna(), ['product_id', 'store2']]
    b['store'] = 'store2'
    b.rename(columns={'store2': 'price'}, inplace=True)
    b = b[['product_id', 'store', 'price']]

    c = products.loc[products['store3'].notna(), ['product_id', 'store3']]
    c['store'] = 'store3'
    c.rename(columns={'store3': 'price'}, inplace=True)
    c = c[['product_id', 'store', 'price']]
    # Concatenate the three separate tables instead of working inside one table (I had missed this)
    ans = pd.concat([a, b, c])
    return ans

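Method 2: unpivot with melt

.melt() converts a DataFrame from wide format to long format, optionally keeping some columns as identifiers. You specify the columns to stack and the names of the resulting columns, which makes it easy to handle many columns in a single operation. In wide format, each row is one observation unit and each column is a feature or variable. In long format, each row holds the value of one variable for one observation unit.
frame: the DataFrame to reshape.
id_vars: the column(s) to keep as identifiers.
value_vars: the column(s) to unpivot into values. If omitted, all columns other than id_vars are used.
var_name: the name of the new column that holds the original column names (the variable column).
value_name: the name of the new column that holds the values.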

def rearrange_products_table(products: pd.DataFrame) -> pd.DataFrame:
    df = products.melt(
        id_vars = 'product_id',
        value_vars = ['store1','store2','store3'],
        var_name = 'store',
        value_name = 'price'
    )
    df = df.dropna(axis=0)
    return df
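
The wide-to-long reshaping is easier to see on a tiny table. This is a minimal sketch with made-up values; the name long_df is only for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per product, one column per store
wide = pd.DataFrame({
    'product_id': [0, 1],
    'store1': [95, 70],
    'store2': [100, None],
})

# value_vars is omitted, so every column except product_id is unpivoted
long_df = wide.melt(id_vars='product_id', var_name='store', value_name='price')
# Drop the (product, store) pairs that have no price
long_df = long_df.dropna(subset=['price'])
print(long_df)
```

Each row of the result is one (product_id, store, price) triple, which is the layout the problem asks for.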

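177. Nth Highest Salary [Medium]
This one takes an integer argument, and at first I did not know how to write the function.
The steps are: select, deduplicate, sort, select.

def nth_highest_salary(employee: pd.DataFrame, N: int) -> pd.DataFrame:
    # Result column name built from N, e.g. getNthHighestSalary(2)
    col = f'getNthHighestSalary({N})'
    # Keep only the salary column and drop duplicates
    df = employee[['salary']].drop_duplicates()
    # No Nth distinct salary (this also covers N < 1): return a single null row
    if N < 1 or len(df) < N:
        return pd.DataFrame({col: [None]})  # note this construction
    # Sort in descending order and take the Nth row
    df = df.sort_values('salary', ascending=False).head(N).tail(1)
    return df.rename(columns={'salary': col})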

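176. Second Highest Salary [Medium]

df.sort_values(by, axis=0, ascending=True, inplace=False, ignore_index=False)
To deduplicate rows by salary while keeping every column, and rebind the name to the result, use employee = employee.drop_duplicates(['salary']). To deduplicate only the selected column and leave the original DataFrame unchanged, use df = employee[['salary']].drop_duplicates().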
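
The difference between the two drop_duplicates calls can be checked on a small table. This is a sketch with made-up data; the names by_subset and salary_only are only for illustration.

```python
import pandas as pd

# Hypothetical data: two employees share the same salary
employee = pd.DataFrame({'id': [1, 2, 3], 'salary': [100, 200, 200]})

# Deduplicate rows by salary, keeping every column (first occurrence wins)
by_subset = employee.drop_duplicates(['salary'])
# Select the salary column first, then deduplicate
salary_only = employee[['salary']].drop_duplicates()

print(by_subset)
print(salary_only)
```

Neither call modifies employee itself, because drop_duplicates returns a new DataFrame unless inplace=True is passed.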

import numpy as np

def second_highest_salary(employee: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate salaries
    df = employee[['salary']].drop_duplicates()
    # Fewer than two distinct salaries: return a single null row
    if df['salary'].nunique() < 2:
        return pd.DataFrame({'SecondHighestSalary': [np.nan]})
    # Sort salaries in descending order
    df = df.sort_values('salary', ascending=False)
    # # Drop the id column
    # df.drop(columns='id', inplace=True)
    # Rename the column
    df.rename(columns={'salary': 'SecondHighestSalary'}, inplace=True)
    # Return the second row
    return df.head(2).tail(1)

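184. Department Highest Salary [Easy]

transform takes a function (or a function name) and applies it to each group. It returns a result with the same length as the original data, containing the transformed values.
apply is similar, but the two differ: the result of transform keeps the shape and length of the original data, whereas apply can return data of any length and structure.
Note that transform is not limited to groupby: DataFrame.transform and Series.transform also exist. In every case the result lines up with the rows of the input.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False). left_on and right_on name the columns to join on in the left and right DataFrame, respectively.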
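
To make the "same length as the original" point concrete, the sketch below compares a plain group aggregation with transform on made-up data. The names per_group and per_row are only for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Sales'],
    'Salary': [90, 70, 80],
})

grouped = df.groupby('Department')['Salary']
per_group = grouped.max()           # one value per group
per_row = grouped.transform('max')  # one value per original row

# Because per_row is aligned with df, it can be used directly as a filter
top_paid = df[df['Salary'] == per_row]
print(top_paid)
```

This alignment is what lets the department maximum be compared row by row without a second merge.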

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    # Join the two tables horizontally (column-wise)
    df = employee.merge(department, left_on='departmentId', right_on='id', how='left')
    # Both tables have a name column, so rename the suffixed columns after the merge
    df.rename({'name_x': 'Employee', 'name_y': 'Department', 'salary': 'Salary'}, axis=1, inplace=True)
    # Highest salary of each department, aligned to every row
    max_salary = df.groupby('Department')['Salary'].transform('max')
    # Keep the rows whose salary equals the department maximum
    df = df[df['Salary'] == max_salary]
    return df[['Department', 'Employee', 'Salary']]

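178. Rank Scores [Easy]

Usage of rank() (the method parameter)
min: tied values all get the lowest rank of the group; the next rank skips ahead by the number of ties.
max: tied values all get the highest rank of the group; the next rank skips ahead by the number of ties.
first: tied values get distinct ranks, assigned in the order they appear in the data.
average: tied values all get the average of their ranks; the next rank skips ahead by the number of ties.
dense: like min, but the rank increases by exactly 1 from one group of ties to the next, so no ranks are skipped.
These names are passed as strings, for example rank(method='min').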
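
A side-by-side comparison makes the tie-breaking rules easier to follow. This is a minimal sketch on made-up scores, ranking in descending order.

```python
import pandas as pd

# Hypothetical scores with two pairs of ties
scores = pd.Series([3.5, 3.65, 4.0, 4.0, 3.65])

# One column per tie-breaking method
ranks = pd.DataFrame({
    'score': scores,
    'min': scores.rank(method='min', ascending=False),
    'max': scores.rank(method='max', ascending=False),
    'first': scores.rank(method='first', ascending=False),
    'average': scores.rank(method='average', ascending=False),
    'dense': scores.rank(method='dense', ascending=False),
})
print(ranks.sort_values('score', ascending=False))
```

Only dense produces consecutive ranks with no gaps, which is why the solution uses it.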

def order_scores(scores: pd.DataFrame) -> pd.DataFrame:
    # Dense-rank the score column in descending order; rank() supports this directly
    scores['rank'] = scores['score'].rank(method='dense', ascending=False)
    # Sort by score, highest first
    return scores[['score', 'rank']].sort_values('score', ascending=False)

3. pandas: data statistics

2082. The Number of Rich Customers [Easy]

This uses filtering and counting unique values; set() would also work.

def count_rich_customers(store: pd.DataFrame) -> pd.DataFrame:
    # Filter condition: amount greater than 500
    g_amount = store['amount'] > 500
    
    df = pd.DataFrame(
        {
            'rich_count' : [store[g_amount]['customer_id'].nunique()]
        }
    )
    return df

1173. Immediate Food Delivery I [Easy]
Note how sum() is used to count matches.

def food_delivery(delivery: pd.DataFrame) -> pd.DataFrame:
    # len() gives the number of rows
    delivery_num = len(delivery['delivery_id'].dropna())
    # sum() counts the True values
    immediate_num = (delivery['order_date'] == delivery['customer_pref_delivery_date']).sum()
    percentage = round(100 *immediate_num/delivery_num ,2)
    df = pd.DataFrame (
        {
            'immediate_percentage': [percentage]
        }
    )
    return df
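
The counting trick relies on True being treated as 1 when a boolean Series is summed. Here is a minimal sketch with made-up dates; using mean() for the share is an alternative shown for illustration, not what the solution above does.

```python
import pandas as pd

# Hypothetical orders, invented for illustration
delivery = pd.DataFrame({
    'order_date': ['2019-08-01', '2019-08-02', '2019-08-11'],
    'customer_pref_delivery_date': ['2019-08-02', '2019-08-02', '2019-08-11'],
})

# Boolean Series: True where the order is immediate
is_immediate = delivery['order_date'] == delivery['customer_pref_delivery_date']

immediate_num = is_immediate.sum()                # number of True values
percentage = round(100 * is_immediate.mean(), 2)  # share of True values, in percent
print(immediate_num, percentage)
```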

1907. Count Salary Categories [Medium]

def count_salary_categories(accounts: pd.DataFrame) -> pd.DataFrame:
    # Summing a boolean mask counts the matching rows
    salary_low = (accounts['income'] < 20000).sum()
    salary_avg = accounts['income'].between(20000, 50000, inclusive='both').sum()
    salary_high = (accounts['income'] > 50000).sum()
    df = pd.DataFrame(
        {'category':['High Salary','Low Salary','Average Salary'],
         'accounts_count': [salary_high,salary_low,salary_avg]
        }
    )
    return df

5. pandas: merging data

1050. Actors and Directors Who Cooperated At Least Three Times [Easy]

def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
    # The problem is about (actor, director) pairs, so group by both columns
    df = actor_director.groupby(['actor_id', 'director_id'])
    # if actor_director['timestamp'] >= 3: wrong, timestamp is not the number of cooperations
    df = df.agg(
        cnt=('timestamp', 'nunique')
    ).reset_index()
    return df[df['cnt'] >= 3][['actor_id', 'director_id']]

1378. Replace Employee ID With The Unique Identifier [Easy]
Note on .fillna(): with a left join, employees without a match already get NaN in unique_id, so no fillna() call is needed.

def replace_employee_id(employees: pd.DataFrame, employee_uni: pd.DataFrame) -> pd.DataFrame:
    # Left join keeps every employee
    df = employees.merge(employee_uni, on='id', how='left')
    return df[['unique_id', 'name']]

1280. Students and Examinations [Easy]

Incorrect attempt:

def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:
    # Group and count the exams taken, then add a new column
    # (this fails: a GroupBy object supports neither column assignment nor merge)
    examinations = examinations.groupby(['student_id'])
    print(examinations)
    examinations['attended_exams'] = examinations['subject_name'].unique()
    # Merge
    df = examinations.merge(students, on = 'student_id')
    return df[['student_id','student_name','subject_name','attended_exams']]

Correct solution (official LeetCode solution):

import pandas as pd

def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:
    # Group by student id and subject, and count the exams taken.
    grouped = examinations.groupby(['student_id', 'subject_name']).size().reset_index(name='attended_exams')

    # Build every (student, subject) combination.
    all_id_subjects = pd.merge(students, subjects, how='cross')

    # Left join so that every combination is kept.
    id_subjects_count = pd.merge(all_id_subjects, grouped, on=['student_id', 'subject_name'], how='left')

    # Clean up: combinations with no exams get 0.
    id_subjects_count['attended_exams'] = id_subjects_count['attended_exams'].fillna(0).astype(int)

    # Sort the DataFrame by student_id and subject_name in ascending order.
    id_subjects_count.sort_values(['student_id', 'subject_name'], inplace=True)

    return id_subjects_count[['student_id', 'student_name', 'subject_name', 'attended_exams']]

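Author: LeetCode Official Solution
Link: https://leetcode.cn/problems/students-and-examinations/solutions/2366340/students-and-examinations-by-leetcode-so-3oup/
Source: LeetCode
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, cite the source.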

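570. Managers with at Least 5 Direct Reports [Medium]

In short, an inner join returns only the rows that satisfy the join condition, while a left or right join keeps every row of the left or right table and attaches the matching rows from the other table. Where a row has no match, the missing columns are filled with NULL (NaN in pandas).

Note how the new column is created with reset_index(name=...).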
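
The three join types can be compared on two small tables. This is a sketch with made-up rows; the table and column names are only for illustration.

```python
import pandas as pd

# Hypothetical tables that share the ids 2 and 3
left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cid']})
right = pd.DataFrame({'id': [2, 3, 4], 'dept': ['IT', 'HR', 'Ops']})

inner = left.merge(right, on='id', how='inner')       # only ids present in both
left_join = left.merge(right, on='id', how='left')    # every row of left
right_join = left.merge(right, on='id', how='right')  # every row of right

print(inner)
print(left_join)
print(right_join)
```

In left_join the row for id 1 has NaN in dept, and in right_join the row for id 4 has NaN in name.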

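def find_managers(employee: pd.DataFrame) -> pd.DataFrame:
    # Count the rows per managerId and store the count in a new column
    df = employee.groupby('managerId').size().reset_index(name='cnt')
    # Keep managers with at least 5 direct reports
    df = df[df['cnt'] >= 5]
    # Join back to get the names; mind the difference between left/right and inner joins
    df = df.merge(employee, left_on='managerId', right_on='id', how='inner')
    # The result must be a DataFrame, so use df[['name']] rather than df['name']
    return df[['name']]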

607. Sales Person [Easy]
Note the use of the outer join.

def sales_person(sales_person: pd.DataFrame, company: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # Select the company named RED
    red_com = company['name'] == 'RED'
    id_with_red = company[red_com]
    # Merge orders with the RED company rows
    order_info = pd.merge(orders, id_with_red, on='com_id', how='right')
    # Merge sales_person with order_info
    sales_info = pd.merge(sales_person, order_info, on='sales_id', how='outer', indicator=True)
    # Keep rows found only in the left table: salespeople with no order for RED
    merged_sales_info = sales_info[sales_info['_merge'].isin(['left_only'])]
    # Rename the suffixed name columns
    merged_sales_info = merged_sales_info.rename(
        {'name_x': 'name',
         'name_y': 'com_name'}, axis=1
    )
    return merged_sales_info[['name']]
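
The outer join with indicator=True is a common way to express "rows in the left table with no match in the right table". Below is a minimal sketch on made-up data; people and red_orders are hypothetical stand-ins for the tables above.

```python
import pandas as pd

# Hypothetical data: only salesperson 1 has orders for RED
people = pd.DataFrame({'sales_id': [1, 2, 3], 'name': ['John', 'Amy', 'Mark']})
red_orders = pd.DataFrame({'sales_id': [1, 1]})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
merged = people.merge(red_orders, on='sales_id', how='outer', indicator=True)

# Keep the salespeople that appear only on the left side
no_red = merged[merged['_merge'] == 'left_only'][['name']]
print(no_red)
```

A left join with indicator=True would serve equally well here, since only the left_only rows are kept.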