pandas操作Excel-数据部分

16-31部分 

  B站链接【Python自动化办公--Pandas玩转Excel(全30集)-哔哩哔哩】

16 多表联立
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('E:/PandasVersusExcel-Code/016/student_score.xlsx', sheet_name='Students', index_col='ID')
ds = pd.read_excel('E:/PandasVersusExcel-Code/016/student_score.xlsx', sheet_name='Scores', index_col='ID')

# table = df.merge(ds, how='left', on='ID').fillna(0)
# table = df.merge(ds, how='left', left_on=df.index, right_on=ds.index).fillna(0)
table = df.join(ds, how='left').fillna(0)
table.Score = table.Score.astype(int)
print(table)

如果指定表的索引为id,id列就从普通列消失了,merge就必须指定join的联立参数

join默认拿index进行联立,但是已经没有left_on和right_on了

fillna(0),如果为空,用零填充

table.Score.astype(int),指定score列的类型为数值型

17 数据校验

数据校验时不要设置index,这样所有的数据都能以普通列的形式存入dataframe里,都可以被校验到 

import pandas as pd
import matplotlib.pyplot as plt

def score_validation(row):
    try:
        assert 0<=row.Score<=100
    except:
        print(f"#{row.ID}student {row.Name} has an invaild score {row.Score}")

def score_validation2(row):
    if not 0<=row.Score<=100:
        print(f"#{row.ID}student {row.Name} has an invaild score {row.Score}")

df = pd.read_excel('E:/PandasVersusExcel-Code/017/students.xlsx')
df.apply(score_validation2,axis=1)

dataframe有两个轴,axis=0表示从上到下的方向,=1表示从左到右的方向

apply函数用法暂时不明白

18 数据分列 
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('E:/PandasVersusExcel-Code/018/employees.xlsx')
fn = df['Full Name'].str.split(' ', expand=True, n=1)
df['first name'] = fn[0]
df['last name'] = fn[1].str.upper()
print(df)

 通过Series.str.split()方法分割读取的dataframe,调用参数expand展开,生成两个Series存到fn里,n表示保留的Series个数,默认为-1全部保留

也可以通过Series.str.upper()转换大写

 19 求和求平均统计导引

temp = df[['Test_1', 'Test_2', 'Test_3']],用列表做索引值,可以截取原dataframe形成新的dataframe

pandas里求平均用mean()

向dataframe里追加series,用_append()才行,append()报错'DataFrame' object has no attribute 'append'

ignore_index=True忽视索引,不然报错Can only append a Series if ignore_index=True or if the Series has a name

import pandas as pd

students = pd.read_excel('./Students.xlsx', index_col='ID')
temp = students[['Test_1', 'Test_2', 'Test_3']]
row_sum = temp.sum(axis=1)
row_mean = temp.mean(axis=1)
students['Total'] = row_sum
students['Average'] = row_mean
col_mean = students[['Test_1', 'Test_2', 'Test_3', 'Total', 'Average']].mean()
print(col_mean)
col_mean['Name'] = 'Summary'  # 进行命名
students = students._append(col_mean, ignore_index=True)
print(students)

 20 定位 消除重复数据
import pandas as pd

df = pd.read_excel('E:/PandasVersusExcel-Code/020/students_duplicates.xlsx', index_col='ID')
df.drop_duplicates(subset='Name', inplace=True, keep='last')
print(df)

subset 选择含重复值的列名 keep默认值first,表示保留前面的重复值

import pandas as pd

df = pd.read_excel('E:/PandasVersusExcel-Code/020/students_duplicates.xlsx')
dupe = df.duplicated(subset='Name')
print(dupe.any())
dupe = dupe[dupe]
print(df.iloc[dupe.index])

dupe是一个序列,值为bool类型,与前面的值没有重复就为false,否则为true

dupe.any()查询是否有重复值

dupe = dupe[dupe] 筛选dupe值为true的数据

df.iloc[x]输出索引为x的一行数据

21 旋转数据表 行列转换 

import pandas as pd

pd.options.display.max_columns = 20
df = pd.read_excel('E:/PandasVersusExcel-Code/021/videos.xlsx', index_col='Month')
df = df.transpose()
print(df)

pd.options.display.max_columns = 20 设置显示的最大列数

df.transpose() 进行行列转换,注意要设置index

22 读取CSV,tsv,TXT中数据 
import pandas as pd

read_csv = pd.read_csv('E:/PandasVersusExcel-Code/022/students.csv', index_col='ID')
print(read_csv)
read_txt = pd.read_csv('E:/PandasVersusExcel-Code/022/students.txt', sep='|', index_col='ID')
print(read_txt)
read_tsv = pd.read_csv('E:/PandasVersusExcel-Code/022/students.tsv', sep='\t', index_col='ID')
print(read_tsv)

都是用的pd.read_csv()方法,读取TXT要设置分隔符sep=‘|’,读取tsv要设置分隔符sep='\t'

23 透视表 分组聚合 
import pandas as pd
import numpy as np

pd.options.display.max_columns=20
df = pd.read_excel('E:/PandasVersusExcel-Code/023/orders.xlsx')
df['Year'] = pd.DatetimeIndex(df['Date']).year
# pt1 = df.pivot_table(df.Total, df.Category, df.Year, np.sum)
pt1 = df.pivot_table('Total', 'Category', 'Year', np.sum)
# 关键字参数
pt1 = df.pivot_table(values='Total', index='Category', columns='Year', aggfunc=np.sum)

print(pt1)
print(df.Date.dtype)

 加一列年份作为columns

使用透视表pivot_table制表,可以通过位置参数或关键字参数传参。传参的时候用的'Total',而不是df['Total'],不知道为什么

df['Date'].dtype 查看某一列的数据类型

import pandas as pd
import numpy as np

# pd.options.display.max_columns=20
df = pd.read_excel('E:/PandasVersusExcel-Code/023/orders.xlsx')
df['Year'] = pd.DatetimeIndex(df['Date']).year
groups = df.groupby(['Category', 'Year'])
s = groups['Total'].sum()
c = groups['ID'].count()
pt2 = pd.DataFrame({'sum':s, 'count':c})
print(pt2)
print(groups.head(1))

 groupby函数进行分组,参数传的类型依旧是列表,不知道为什么

用sum和count方法进行统计,形式为groups['Total'].sum(),不知道为什么

创建一个新的dataframe进行聚合,传入参数为字典,不知道为什么

打印groups的第一行,会打印出16行,分年份和类型,不知道为什么

24 线性回归 数据预测 
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

df = pd.read_excel('E:/PandasVersusExcel-Code/024/sales.xlsx', dtype={'Month':str})
print(df)
# plt.bar(df.Month, df.Revenue)
# plt.title('Sales')
# plt.xticks(rotation=90)
# plt.tight_layout()
# plt.show()
# slope:斜率 intercept:截距
slope, intercept, r, p, str_err = linregress(df.index, df.Revenue)
exp = df.index * slope + intercept
print("2023年12月销售额:" + str(83 * slope + intercept))
plt.scatter(df.index, df.Revenue)
plt.plot(df.index, exp, color='green')
plt.title(f"y={slope}*x+{intercept}")
plt.xticks(df.index, df.Month, rotation=90)
plt.tight_layout()
plt.show()

 dtype={'Month':str} 把日期转换成字符型,因为2018.08更像数值型

调用scipy库,中的线性回归函数,参数返回给slope,intercept

25 条件格式上
26 条件格式下 
27 行操作
import pandas as pd

df1 = pd.read_excel('E:/PandasVersusExcel-Code/027/students.xlsx', sheet_name='Page_001')
df2 = pd.read_excel('E:/PandasVersusExcel-Code/027/students.xlsx', sheet_name='Page_002')
# 行追加内容
df = df1._append(df2).reset_index(drop=True)

# 追加新内容
stu_ser = pd.Series({'ID':41, 'Name':'Bob', 'Score':99})
df = df._append(stu_ser, ignore_index=True)

# 行修改内容
# df.at[39,'name'] = 'Ana' # 错误示范,column大小写敏感,会生成新的列
df.at[39,'Name'] = 'Ana'
df.at[39,'Score'] = 150
# 直接替换Series
stu39 = pd.Series({'ID':39, 'Name':'Kate', 'Score':120})
df.iloc[38] = stu39

print(df)

 reset_index(drop=True) 删除原来的索引,重设索引

现在追加用的_append()

ignore_index=True 忽视索引,不然报错Can only append a Series if ignore_index=True or if the Series has a name

# 插入索引为30学生数据
stu30 = pd.Series({'ID':110, 'Name':'Lily', 'Score':110})
part1 = df[:30]
part2 = df[30:]
df = part1._append(stu30, ignore_index=True)._append(part2).reset_index(drop=True)

 

# 删除内容
df.drop(index=[0,1,2], inplace=True)
df.drop(index=range(3,5), inplace=True)
df.drop(index=df[5:10].index, inplace=True)

 切片删除,会切当前序列的5到10,而不是切索引的5到10

for i in range(5,15):
    df['Name'].at[i] = ''
missing = df.loc[df['Name'] == '']
df.drop(index=missing.index, inplace=True)
df = df.reset_index(drop=True)

通过筛选获得所需数据 

28 列操作
import pandas as pd

df1 = pd.read_excel('E:/PandasVersusExcel-Code/027/students.xlsx', sheet_name='Page_001')
df2 = pd.read_excel('E:/PandasVersusExcel-Code/027/students.xlsx', sheet_name='Page_002')

df = pd.concat([df1, df2], axis=1).reset_index(drop=True)
print(df)

 pd.concat()连接两个表,效果等同于_append

axis=1 轴为1表示横向连接

# 追加列
df['age'] = 18
df['age'] = np.repeat(18, len(df))  # 同样的效果
df['age'] = np.arange(0, len(df))
# 删除列
df.drop(columns=['Score', 'age'], inplace=True)

df['ID'] = df['ID'].astype(float)
for i in range(5,15):
    df['ID'].at[i] = np.nan
# df.dropna(inplace=True)
print(df)

 29 链接数据库

我链接的是oracle数据库,具体教程可看我博文主页 oracle教程

import pandas as pd
import oracledb

conn = oracledb.connect(
    host='172.27.21.30',
    user='wateruser',
    password='123456',
    service_name='orclpdb1'
)

query = 'select * from t_owners order by "ID"'
df = pd.read_sql_query(query, conn)
print(df)

30 复杂计算列

import pandas as pd
import numpy as np


def get_circumcirale_area(l, h):
    r = np.sqrt(l ** 2 + h ** 2) / 2
    return r ** 2 * np.pi


def wrapper(row):
    return get_circumcirale_area(row['Length'], row['Height'])


df = pd.read_excel('E:/PandasVersusExcel-Code/030/rectangles.xlsx', index_col='ID')
# df['ca'] = df.apply(wrapper, axis=1)
df['ca'] = df.apply(lambda row:get_circumcirale_area(row['Length'], row['Height']), axis=1)
print(df)

df.apply怎么用的,没搞懂,不是应该给wrapper一个参数吗 

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值