pandas versus excel 学习总结2

多表联合

  • how=‘left’ 不论右边存在与否左边的都保存下来
  • fillna(0)
  • astype(int)
  • 使用join时要设置索引,根据索引联合
import pandas as pd
students = pd.read_excel("Student_Score.xlsx",sheet_name='Students')
scores = pd.read_excel("Student_Score.xlsx",sheet_name='Scores')
table = students.merge(scores,how='left',on='ID').fillna(0)
#table = students.merge(scores,how='left',left_on='ID',right_on='ID').fillna(0)
table.Score = table.Score.astype(int)
print(table)


#students = pd.read_excel("Student_Score.xlsx",sheet_name='Students',index_col='ID')
#scores = pd.read_excel("Student_Score.xlsx",sheet_name='Scores',index_col='ID')
#table = students.join(scores, how='left').fillna(0)
#table.Score = table.Score.astype(int)
#print(table)

数据校验,轴

  • 最好不要设置index,这样每列都可以被校验
  • assert 断定一下某个条件是否成立,不符合抛出异常
  • axis=1是从左到右一行行的校验
import pandas as pd


def score_valication(row):
    try:
        assert 0 <= row.Score <= 100
    except:
        print(f'#{row.ID}\tstudent {row.Name} has an invalid score {row.Score}')


students = pd.read_excel('Students.xlsx')
# print(students)
students.apply(score_valication, axis=1)

列分割

  • expend=False 会把分割好的当成一列,True把分割好的分成多列
import pandas as pd

employees = pd.read_excel('Employees.xlsx', index_col='ID')
df = employees['Full Name'].str.split(expand=True) 
employees['First Name'] = df[0]
employees['Last Name'] = df[1]
print(employees)

求和、平均

import pandas as pd

students = pd.read_excel('Students.xlsx',index_col='ID')
print(students.index)
row_sum = students[['Test_1', 'Test_2', 'Test_3']].sum(axis=1)
row_mean = students[['Test_1', 'Test_2', 'Test_3']].mean(axis=1)

students['Total'] = row_sum
students['Average'] = row_mean

col_mean = students[['Test_1', 'Test_2', 'Test_3', 'Total', 'Average']].mean()
col_mean['Name'] = 'Summary'
# print(col_mean)
students = students.append(col_mean,ignore_index=True)
print(students)

定位、去重

  • duplicated定位重复,重复显示True
  • drop_duplicated 去重, keep='last’保留后面的
  • iloc:根据index定位到数据
import pandas as pd

students = pd.read_excel('Students_Duplicates.xlsx')
dupe = students.duplicated(subset='Name')
dupe = dupe[dupe == True]  # dupe = dupe[dupe]
print(students.iloc[dupe.index])
print("=========")
students.drop_duplicates(subset='Name', inplace=True, keep='last')
print(students)

旋转

import pandas as pd
- pd.options.display.max_columns 最大展示列数
pd.options.display.max_columns = 999
videos = pd.read_excel('Videos.xlsx', index_col='Month')
# table = videos.transpose()
table = videos.T
print(table)

透视表,分组,聚合

  • DatetimeIndex(orders.Date).year 取出年份
import pandas as pd
from datetime import date

orders = pd.read_excel('Orders.xlsx', dtype={'Date': date})
orders['Year'] = pd.DatetimeIndex(orders.Date).year
groups = orders.groupby(['Category', 'Year'])
s = groups['Total'].sum()
c = groups['ID'].count()
pt1 = pd.DataFrame({'Sum': s, 'Count': c})
pt2 = orders.pivot_table(index='Category', columns='Year', values='Total', aggfunc=np.sum)

print(pt1)
print(pt2)

线性回归、数据预测

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
# import numpy as  np

sales = pd.read_excel('Sales.xlsx', dtype={'Month': str, 'Revenue': float})
print(sales.index)

slope, intercept, r_value, p_value, std_err = linregress(sales.index, sales.Revenue)
exp = sales.index * slope + intercept

plt.scatter(sales.index, sales.Revenue)
plt.plot(sales.index, exp, color='red')
plt.xticks(sales.index, sales.Month, rotation=90)
plt.show()

条件格式化

import pandas as pd

def low_score_red(s):
    color = 'red' if s<60 else 'green'
    return f'color:{color}'

students = pd.read_excel('Students.xlsx')
students = students.style.applymap(low_score_red,subset=['Test_1', 'Test_2', 'Test_3'])
print(students)
students.to_excel('out.xlsx')
#========================

# import pandas as pd
#
# def highest_score_green2(col):
#     return ['background-color:lime' if v==col.max() else 'background-color:red' for v in col]
#
# students = pd.read_excel('c:/Temp/Students.xlsx')
# students.style.apply(highest_score_green2, subset=['Test_1', 'Test_2', 'Test_3'])
#
# #========================
#
# import pandas as pd
# import seaborn as sns
#
# color_map = sns.light_palette('green', as_cmap=True)
#
# students = pd.read_excel('c:/Temp/Students.xlsx')
# students.style.background_gradient(cmap=color_map, subset=['Test_1','Test_2','Test_3'])
#
# #========================
#
# import pandas as pd
#
# students = pd.read_excel('c:/Temp/Students.xlsx')
# students.style.bar(color='orange', subset=['Test_1','Test_2','Test_3'])
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值