1. Creating a DataFrame
A DataFrame can be created from a list, a CSV file, Series objects, an empty DataFrame, and more.
DataFrame constructor syntax:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
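As a minimal sketch of how these constructor parameters combine (the data values here are illustrative):

```python
import pandas as pd

# index, columns, and dtype can all be set in one constructor call
data = [[20000, 1000], [25000, 2300]]
df = pd.DataFrame(data, index=["r1", "r2"],
                  columns=["Fee", "Discount"], dtype=float)
print(df.dtypes)  # both columns coerced to float64
```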
1.1 From a list
import pandas as pd
technologies = [ ["Spark",20000, "30days"],
["Pandas",25000, "40days"],
]
df = pd.DataFrame(technologies)
df
If index and columns are omitted, pandas generates default auto-incrementing labels 0/1/2... Below they are specified explicitly; they can also be set on an existing df:
# Custom column names and index
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df = pd.DataFrame(technologies, index=row_label, columns=column_names)
df
Check the column data types with df.dtypes:
df.dtypes
Courses object
Fee int64
Duration object
dtype: object
# Cast to custom data types
types={'Courses': str,'Fee':float,'Duration':str}
df = df.astype(types)
df.dtypes
1.2 From a list of dicts (JSON-like records)
technologies = [{'Courses':'Spark', 'Fee': 20000, 'Duration':'30days'},
{'Courses':'Pandas', 'Fee': 25000, 'Duration': '40days'}]
df = pd.DataFrame(technologies)
df
1.3 From Series objects
courses = pd.Series(['Spark', 'Pandas'])
fees = pd.Series([20000, 25000])
duration = pd.Series(['30days', '40days'])
df = pd.concat([courses, fees, duration], axis=1)
df
#Outputs
# 0 1 2
#0 Spark 20000 30days
#1 Pandas 25000 40days
concat keeps the default auto-incrementing index; custom labels can be assigned:
index_labels=['r1','r2']
courses.index = index_labels
fees.index = index_labels
duration.index = index_labels
df = pd.concat({"courses": courses, "fees": fees, "duration": duration}, axis=1)
df
# Outputs
#    courses   fees duration
#r1    Spark  20000   30days
#r2   Pandas  25000   40days
1.4 Unpacking lists with zip
Courses = ['Spark', 'Pandas']
Fee = [20000,25000]
Duration = ['30days','40days']
df = pd.DataFrame(list(zip(Courses, Fee, Duration)), columns=['courses', 'fees', 'duration'])
df
1.5 Empty DataFrame
df = pd.DataFrame()
print(df)
# Create an empty DataFrame with column labels
df = pd.DataFrame(columns = ["Courses","Fee","Duration"])
print(df)
# Copy another df
df2 = df.copy()
print(df2)
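A quick sanity check on the empty frames above, using the standard df.empty and df.shape attributes:

```python
import pandas as pd

# A frame created with only column labels has the columns but zero rows
df = pd.DataFrame(columns=["Courses", "Fee", "Duration"])
print(df.empty)  # True
print(df.shape)  # (0, 3)
```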
1.6 From a CSV file
df = pd.read_csv('data_file.csv')
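read_csv works on any file-like object; a self-contained sketch using StringIO in place of an on-disk file (the file contents here are illustrative):

```python
import pandas as pd
from io import StringIO

# StringIO stands in for a real file such as 'data_file.csv'
csv_text = "Courses,Fee,Duration\nSpark,20000,30days\nPandas,25000,40days\n"
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (2, 3)
```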
2. Row selection
iloc[start:stop:step]: takes positional indexes, either a single index or a list of indexes.
loc[start:stop:step]: takes index labels, either a single label or a list of labels.
2.1 iloc: select by position
Example:
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
'Discount':[1000,2300,1500,1200,2500,2100,2000]
}
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
print(df)
Courses Fee Duration Discount
r1 Spark 20000 30days 1000
r2 PySpark 25000 40days 2300
r3 Hadoop 26000 35days 1500
r4 Python 22000 40days 1200
r5 pandas 24000 NaN 2500
r6 Oracle 21000 None 2100
r7 Java 22000 55days 2000
1. A single index:
# Positions start at 0, so this selects the 3rd row
df2 = df.iloc[2]
df2
Courses Hadoop
Fee 26000
Duration 35days
Discount 1500
Name: r3, dtype: object
2. A list of indexes:
# Select the 3rd, 4th, and 7th rows
df2 = df.iloc[[2, 3, 6]]
Courses Fee Duration Discount
r3 Hadoop 26000 35days 1500
r4 Python 22000 40days 1200
r7 Java 22000 55days 2000
3. A range of indexes:
# Rows 2 through 5; the stop position (row 6) is excluded, i.e. left-closed, right-open
df2 = df.iloc[1:5]
Courses Fee Duration Discount
r2 PySpark 25000 40days 2300
r3 Hadoop 26000 35days 1500
r4 Python 22000 40days 1200
r5 pandas 24000 NaN 2500
4. Others:
df2 = df.iloc[:1]   # select the first row
df2 = df.iloc[:3]   # select the first 3 rows
df2 = df.iloc[-1:]  # select the last row
df2 = df.iloc[-3:]  # select the last 3 rows
df2 = df.iloc[::2]  # every other row (step of 2)
Courses Fee Duration Discount
r1 Spark 20000 30days 1000
r3 Hadoop 26000 35days 1500
r5 pandas 24000 NaN 2500
r7 Java 22000 55days 2000
2.2 loc: select by label
Usage mirrors iloc:
df2 = df.loc['r2']
df2 = df.loc[['r2', 'r3', 'r6']]
df2 = df.loc['r1': 'r5']     # label range; unlike iloc, the stop label 'r5' is included
df2 = df.loc['r1': 'r5': 2]  # 2 is the step
df2
3. Column selection
3.1 Selecting multiple columns with []
df2 = df[['Courses', 'Fee', 'Duration']]
df2
3.2 iloc: select columns by position
df.iloc[:, start:stop:step]
Columns are selected by their integer positions:
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
'Discount':[1000,2300,1500,1200,2500,2100,2000]
}
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
# iloc
df2 = df.iloc[:, [1, 3]]  # select multiple columns by position
df2 = df.iloc[:, 1:4]     # position range [1, 4)
df2 = df.iloc[:, 2:]      # slice from column 2 to the end
df2 = df.iloc[:, :2]      # slice from the start up to column 2
df2
3.3 loc: select columns by name
df.loc[:, start:stop:step]
Columns are selected by their labels:
df2 = df.loc[:, ['Courses', 'Fee']]  # select multiple columns
df2 = df.loc[:, "Fee": "Discount"]   # columns from Fee through Discount, inclusive
df2 = df.loc[:, 'Fee':]              # slice from Fee to the end
df2 = df.loc[:, : 'Fee']             # slice from the start through Fee
df2 = df.loc[:, ::2]                 # every other column, starting from the first
df2
4. The query method
df.query() filters rows with a boolean expression (over one or more columns) and returns a new df; to update an existing df instead, pass inplace=True. Syntax:
DataFrame.query(expr, inplace=False, **kwargs)
# expr: the condition expression
# kwargs: keyword arguments passed through to eval
Example:
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
'Discount':[1000,2300,1500,1200,2500,2100,2000]}
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
# Select the rows where Courses equals Spark
df2 = df.query('Courses == "Spark"')
df2
# Result
Courses Fee Duration Discount
r1 Spark 20000 30days 1000
Updating the existing df (the original df is replaced):
# using a variable
value = 'Spark'
df2 = df.query('Courses == @value')
# Update the existing df; with inplace=True the method returns None
df.query('Courses == @value', inplace=True)
df
Not-equals, in, and multiple conditions:
# not equals, in & multiple conditions
df2 = df.query('Courses != "Spark"')
df2 = df.query('Courses in ("Spark", "Hadoop", "pandas")')
# using variables
values=['Spark','PySpark']
df2 = df.query('Courses in @values')
df2 = df.query('Courses not in @values')
df2 = df.query("Fee >= 23000")
df2 = df.query("Fee >= 23000 and Discount == 2300")
df2 = df.query("Fee >= 23000 & Discount == 2300")
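Column labels that are not valid Python identifiers (for example, names containing spaces) can still be used in query() by wrapping them in backticks; a small sketch on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"Course Name": ["Spark", "Pandas"],
                   "Fee": [20000, 25000]})
# Backticks let query() reference a label containing a space
df2 = df.query("`Course Name` == 'Spark'")
print(df2)
```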
The apply method
apply() can also be used to filter rows; note that with the default axis=0 the lambda receives each column, which is then masked with a boolean Series:
# Return the rows where Courses is 'Spark' or 'PySpark'
df2 = df.apply(lambda col: col[df['Courses'].isin(['Spark', 'PySpark'])])
Other examples
df2 = df[df['Courses'] == 'Spark']
df2 = df.loc[df['Courses'] != 'Spark']
df2 = df.loc[df['Courses'].isin(['Spark', 'PySpark'])]
df2 = df.loc[~df['Courses'].isin(['Spark', 'PySpark'])]
df2 = df.loc[(df['Discount'] >= 1500) & (df['Discount'] <= 2500)]
# contains
df2 = df[df['Courses'].str.contains('Spark')]
# lower-case the values before matching
df2 = df[df['Courses'].str.lower().str.contains('spark')]
# startswith
df2 = df[df['Courses'].str.startswith('Spa')]
df2
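When a string column contains missing values, str.contains returns NaN for those rows and boolean indexing then fails; the na parameter fills them. A small sketch on an illustrative column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Duration": ["30days", "40days", np.nan, None, "40days"]})
# na=False treats missing values as non-matches; without it the mask
# contains NaN and cannot be used for boolean indexing
df2 = df[df["Duration"].str.contains("40", na=False)]
print(df2)
```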
5. Getting a cell value
A single cell can be read with loc[], iloc[], at[], iat[], and a few other approaches.
5.1 Quick start
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
'Fee' :[24000,25000,25000,24000,24000],
'Duration':['30day','50days','55days', '40days','60days'],
'Discount':[1000,2300,1000,1200,2500]
}
index_labels=['r1','r2','r3','r4','r5']
df = pd.DataFrame(technologies, index=index_labels)
df
Courses Fee Duration Discount
r1 Spark 24000 30day 1000
r2 PySpark 25000 50days 2300
r3 Hadoop 25000 55days 1000
r4 Python 24000 40days 1200
r5 pandas 24000 60days 2500
# using loc[]: index by label; select the row labeled r4, then the Duration column
df.loc['r4']['Duration'] # '40days'
df.loc['r4'].iloc[2] # '40days'; position 2 (zero-based) is the Duration column
# using iloc[]
df.iloc[3]['Duration'] # '40days'
df.iloc[3, 2] # '40days'
# using at
df.at['r4', 'Duration']
df.at[df.index[3], 'Duration']
# using iat[]
df.iat[3, 2] # 40days
# get a cell value: pick the column first, then index by position within it
df['Duration'].values[3] # 40days
# get cell value from last row
df.iloc[-1, 2] # '60days'
df.iloc[-1]['Duration'] # '60days'
df.at[df.index[-1], 'Duration'] # '60days'
Summary:
loc[]: get a cell by row label and column name
iloc[]: get a cell by row and column position; positions run from 0 to length-1, with -1 for the last
at[]: get a cell by row and column label; faster than loc[], but does not support negative indexing from the end
iat[]: get a cell by row and column position; does not support negative positions
6. Adding columns
pandas can add new columns to an existing df; the main methods are:
DataFrame.insert(): inserts a column into the existing df at a given position
DataFrame.assign(): returns a new df with the columns added
Example:
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
Courses Fee Discount
0 Spark 22000 1000
1 PySpark 25000 2300
2 Hadoop 23000 1000
3 Python 24000 1200
4 Pandas 26000 2500
6.1 assign
DataFrame.assign() returns a new df; it can add/append one or more new columns to an existing df.
# Syntax
# DataFrame.assign(**kwargs)
# Add a new column TutorsAssigned
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)
df2
Courses Fee Discount TutorsAssigned
0 Spark 22000 1000 William
1 PySpark 25000 2300 Henry
2 Hadoop 23000 1000 Michael
3 Python 24000 1200 John
4 Pandas 26000 2500 Messi
# Add multiple columns
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 = df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors)
df2["MNCCompanies"] = MNCCompanies # direct assignment also works
6.2 Deriving a new column from existing columns
Most of the time a new column is computed from columns that already exist:
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)
df2
# output
Courses Fee Discount Discount_Percent
0 Spark 22000 1000 220000.0
1 PySpark 25000 2300 575000.0
2 Hadoop 23000 1000 230000.0
3 Python 24000 1200 288000.0
4 Pandas 26000 2500 650000.0
# add multiple columns
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100, New_Discount=lambda x: x.Discount * 1.3)
df['Discount_Percent'] = df['Fee'] * df['Discount'] / 100 # alternative: direct assignment
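Beyond arithmetic, a derived column is often conditional; np.where is a common pattern for that (the threshold and labels here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Fee": [22000, 25000, 23000, 24000, 26000],
                   "Discount": [1000, 2300, 1000, 1200, 2500]})
# np.where evaluates a vectorized condition and picks one of two values per row
df["Fee_Level"] = np.where(df["Fee"] >= 24000, "high", "low")
print(df)
```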
Adding constant or empty values
# Add constant or empty columns
df2 = df.assign(A=None, B=0, C="")
df2
Courses Fee Discount A B C
0 Spark 22000 1000 None 0
1 PySpark 25000 2300 None 0
2 Hadoop 23000 1000 None 0
3 Python 24000 1200 None 0
4 Pandas 26000 2500 None 0
6.3 Inserting a new column at a specific position
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df.insert(0, 'Tutors', tutors)
df
Adding a dict
If the data to add is a dict, map it through an existing column:
# Add from a dict keyed by course name
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael", "Python":"John", "Pandas":"Messi"}
df["Tutors"] = df['Courses'].map(tutors)
# the same mapping works through loc[]
df.loc[:, 'loc_tutors_column'] = df['Courses'].map(tutors)
7. Renaming columns
7.1 The rename method
Below is the syntax of pandas.DataFrame.rename(). It returns a DataFrame or None: by default it returns a new DataFrame with the columns updated; with inplace=True it updates the existing DataFrame in place (self) and returns None.
# DataFrame.rename() Syntax
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
mapper: a dict or function used to rename columns or the index
index: a dict or function for renaming the index; used with the axis parameter, (mapper, axis=0) is equivalent to index=mapper
columns: a dict or function for renaming columns; used with the axis parameter, (mapper, axis=1) is equivalent to columns=mapper
axis: 0 or 'index' | 1 or 'columns', default 0
copy: also copy the underlying data, default True
inplace: whether to update the df in place, default False; when True, the copy parameter is ignored
level: used with a MultiIndex; takes an int or level name, default None
errors: accepts 'raise' or 'ignore'. With 'raise', a KeyError is raised if a dict-like mapper, index, or columns contains labels that do not exist in the index being renamed. With 'ignore', existing keys are renamed and extra keys are ignored. The default is 'ignore'.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
})
df = pd.DataFrame(technologies)
print(df.columns) # Index(['Courses', 'Fee', 'Duration'], dtype='object')
# When the columns parameter is used, axis must not also be passed
df2 = df.rename(columns={"Courses": "New_Courses"})
df2 = df.rename({"Courses": "New_Courses"}, axis=1) # equivalent
df2 = df.rename({"Courses": "New_Courses"}, axis='columns') # equivalent
print(df2.columns) # Index(['New_Courses', 'Fee', 'Duration'], dtype='object')
Renaming multiple columns
# Rename multiple columns
df3 = df.rename(columns={'Courses':'Courses_List','Fee':'Courses_Fee', 'Duration':'Courses_Duration'})
print(df3.columns) # Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')
print(df.columns) # Index(['Courses', 'Fee', 'Duration'], dtype='object')
df.rename(columns={'Courses':'Courses_List','Fee':'Courses_Fee', 'Duration':'Courses_Duration'}, inplace=True)
print(df.columns) # Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')
Note: df.rename() returns a new df; with inplace=True it updates the current df in place and returns None.
7.2 Other approaches
# Update a column name directly by position (this mutates the underlying array; prefer rename)
df.columns.values[2] = 'Courses_Duration'
# Replace columns with a list; the lengths must match or an error is raised
columns_names = ['Courses', 'Fee', 'Duration']
df.columns = columns_names
# Add a prefix or suffix to column names
# df.columns = ['col_' + str(col) for col in df.columns]  # option 1: updates df directly
df2 = df.add_prefix('col_') # option 2: returns a new df
df2 = df.add_suffix('_col')
df2
# Rename columns with a lambda expression, updating df in place
df.rename(columns=lambda x: 'col_' + x, inplace=True)
df
# Convert column names to lower or upper case; returns a new df
df2 = df.rename(str.lower, axis='columns')
df2 = df.rename(str.upper, axis='columns')
df2
df2.columns # Index(['COURSES', 'FEE', 'DURATION'], dtype='object')
# Use set_axis to replace all column labels (its inplace parameter was removed in pandas 2.0, so reassign the result)
df = df.set_axis(['Courses_List', 'Course_Fee', 'Course_Duration'], axis=1)
The replace method
Index.str.replace substitutes a substring (or regex match) in the column labels:
df.columns = df.columns.str.replace('Fee', 'FeeCost') # replace in a single label
print(df.columns) # Index(['Courses', 'FeeCost', 'Duration'], dtype='object')
df.columns = df.columns.str.replace('_', " ") # replace across all labels
df
7.3 Raising errors
By default rename() does not raise when a label is missing; it can be made to raise:
# Throw an error when renaming a column that does not exist
df.rename(columns = {'Cour': 'Courses_List'}, errors = "raise")
Renaming only columns that exist
# Rename only the columns that already exist
d={'Courses':'Courses_List','Fees':'Courses_fees'}
df.rename(columns={k: v for k, v in d.items() if k in df.columns}, inplace=True,errors = "raise")
print(df.columns)
8. Dropping rows or columns
df.drop() can remove rows or columns: axis=0 (the default) drops rows, while axis=1 or the columns parameter drops columns. The method returns a new df; to drop from the existing df, use inplace=True.
Syntax:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
# labels: the labels to drop, used together with axis
# axis: the axis parameter, default 0; 1 drops columns
# index: row labels, a single label or a list
# columns: column labels, a single label or a list
# level: optional, int or level name, for a MultiIndex
# inplace: default False, returns a copy (a new df); when True, drops in place and returns None
# errors: controls whether missing labels raise
Note: drop() returns a new df; specify inplace=True to drop in place and return None.
Example:
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[20000,25000,26000,22000],
'Duration':['30day','40days',np.nan, None],
'Discount':[1000,2300,1500,1200]
}
indexes=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=indexes)
df
# Result
Courses Fee Duration Discount
r1 Spark 20000 30day 1000
r2 PySpark 25000 40days 2300
r3 Hadoop 26000 NaN 1500
r4 Python 22000 None 1200
8.1 Dropping rows
Drop by row label
# The following three are equivalent
# df1 = df.drop(index=['r1'])
# df1 = df.drop(labels=['r1'])
df1 = df.drop(labels=['r1'], axis=0)
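The intro above also mentions dropping columns with axis=1 or the columns parameter; a small sketch on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "PySpark"],
                   "Fee": [20000, 25000],
                   "Discount": [1000, 2300]})
# Three equivalent ways to drop the Discount column
df1 = df.drop(labels=["Discount"], axis=1)
df2 = df.drop(columns=["Discount"])
df3 = df.drop(["Discount"], axis="columns")
print(df1.columns)  # Index(['Courses', 'Fee'], dtype='object')
```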